Class PDFParser

All Implemented Interfaces:
Watchable, Runnable

public class PDFParser extends BaseWatchable
PDFParser is the class that parses a PDF content stream and produces PDFCmds for a PDFPage. You should never ever see it run: it gets created by a PDFPage only if needed, and may even run in its own thread.
  • Field Details

    • DEBUG_DCTDECODE_DATA

      public static final String DEBUG_DCTDECODE_DATA
      emit a file of DCT stream data.
    • stack

      private Stack<Object> stack
    • parserStates

      private Stack<PDFParser.ParserState> parserStates
    • state

      private PDFParser.ParserState state
    • path

      private GeneralPath path
    • clip

      private int clip
    • loc

      private int loc
    • resend

      private boolean resend
    • tok

      private PDFParser.Tok tok
    • catchexceptions

      private boolean catchexceptions
    • pageRef

      private WeakReference pageRef
      a weak reference to the page we render into. For the page to remain available, some other code must retain a strong reference to it.
    • cmds

      private PDFPage cmds
      the actual command, for use within a single iteration. Note that this must be released at the end of each iteration to ensure that the page can be collected if it is not in use.
    • stream

      byte[] stream
    • resources

    • debuglevel

      public static int debuglevel
    • errorwritten

      boolean errorwritten
  • Constructor Details

    • PDFParser

      public PDFParser(PDFPage cmds, byte[] stream, HashMap<String,PDFObject> resources)
      Don't call this constructor directly. Instead, use PDFFile.getPage(int pagenum) to get a PDFPage. There should never be any reason for a user to create, access, or hold on to a PDFParser.
  • Method Details

    • debug

      public static void debug(String msg, int level)
    • escape

      public static String escape(String msg)
    • setDebugLevel

      public static void setDebugLevel(int level)
    • throwback

      private void throwback()
      put the current token back so that it is returned again by nextToken().
    • nextToken

      private PDFParser.Tok nextToken()
      get the next token. TODO: this creates a new token each time. Is this strictly necessary?
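
The interplay between throwback() and nextToken() can be sketched as a one-token pushback. This is only an illustration, not the PDFParser source: the class name PushbackTokenSketch and the use of a pre-split list of strings as the token source are assumptions, while the field names tok and resend mirror the fields listed above.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a one-token pushback; not the PDFParser source. */
final class PushbackTokenSketch {
    private final Iterator<String> source; // assumed token source for the sketch
    private String tok;                    // most recently returned token
    private boolean resend;                // true when tok must be returned again

    PushbackTokenSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Put the current token back so that nextToken() returns it again. */
    void throwback() {
        resend = true;
    }

    /** Return the pushed-back token if there is one, else the next token (null at end). */
    String nextToken() {
        if (resend) {
            resend = false;
            return tok;
        }
        tok = source.hasNext() ? source.next() : null;
        return tok;
    }
}
```

Because the sketch keeps a single flag rather than a queue, calling throwback() twice in a row still re-sends only one token.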
    • readName

      private String readName()
      read a name (sequence of non-PDF-delimiting characters) from the stream.
    • readNum

      private double readNum()
      read a floating point number from the stream
    • readString

      private String readString()

      read a String from the stream. Strings begin with a '(' character, which has already been read, and end with a balanced ')' character. A '\' character starts an escape sequence of up to three octal digits.

      Parentheses inside a string must be balanced, so a string may enclose balanced pairs of parentheses.

      Returns:
      the string with escape sequences replaced with their values
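
The rules described for readString() can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the class name LiteralStringSketch and the (byte[] data, int pos) signature are hypothetical, and the single-letter escapes (\n, \r, \t, \b, \f) come from the PDF specification rather than from the description above. Backslash line continuations are not handled.

```java
/** Hypothetical sketch of PDF literal-string reading; not the PDFParser source. */
final class LiteralStringSketch {
    /**
     * Decodes a literal string from data. pos must point just past the
     * opening '(' (which has already been read).
     */
    static String readString(byte[] data, int pos) {
        StringBuilder sb = new StringBuilder();
        int depth = 0; // nesting depth of unescaped inner parentheses
        while (pos < data.length) {
            int c = data[pos++] & 0xff;
            if (c == ')') {
                if (depth == 0) {
                    break; // the balanced closing parenthesis ends the string
                }
                depth--;
            } else if (c == '(') {
                depth++;
            } else if (c == '\\' && pos < data.length) {
                c = data[pos++] & 0xff;
                if (c >= '0' && c <= '7') {
                    // escape sequence of up to three octal digits
                    int value = c - '0';
                    int digits = 1;
                    while (digits < 3 && pos < data.length
                            && data[pos] >= '0' && data[pos] <= '7') {
                        value = value * 8 + (data[pos++] - '0');
                        digits++;
                    }
                    c = value & 0xff;
                } else {
                    switch (c) {
                        case 'n': c = '\n'; break;
                        case 'r': c = '\r'; break;
                        case 't': c = '\t'; break;
                        case 'b': c = '\b'; break;
                        case 'f': c = '\f'; break;
                        default: break; // \( \) \\ keep the escaped character
                    }
                }
            }
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```

Escaped parentheses are appended without touching the nesting depth, which is why only unescaped parentheses need to balance.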
    • readByteArray

      private String readByteArray()
      read a byte array from the stream. Byte arrays begin with a '<' character, which has already been read, and end with a '>' character. Each byte in the array is made up of two hex characters, the first being the high-order four bits. We translate the byte arrays into char arrays by combining two bytes into a character, and then translate the character array into a string. [JK FIXME this is probably a really bad idea!]
      Returns:
      the byte array
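
The hex decoding step behind readByteArray() might look like the sketch below. It is an assumption-laden illustration: the class name HexStringSketch and the method readHexBytes are hypothetical, and the sketch returns a plain byte[] instead of the String conversion described above. Skipping whitespace and padding an odd final digit with 0 follow the PDF specification, not the description on this page.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of PDF hex-string decoding; not the PDFParser source. */
final class HexStringSketch {
    /**
     * Decodes hex digits up to the closing '>'. pos must point just past
     * the opening '<' (which has already been read).
     */
    static byte[] readHexBytes(byte[] data, int pos) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high-order digit, or -1 when none is pending
        while (pos < data.length && data[pos] != '>') {
            int digit = Character.digit((char) (data[pos++] & 0xff), 16);
            if (digit < 0) {
                continue; // skip whitespace and anything that is not a hex digit
            }
            if (high < 0) {
                high = digit; // first character of the pair: high-order four bits
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: missing low digit counts as 0
        }
        return out.toByteArray();
    }
}
```

Keeping the result as bytes avoids the char-packing step that the FIXME note above questions.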
    • setup

      public void setup()
      Called to prepare for some iterations
      Overrides:
      setup in class BaseWatchable
    • iterate

      public int iterate() throws Exception
      parse the stream. Commands are added to the PDFPage initialized in the constructor as they are encountered.

      Page numbers in comments refer to the Adobe PDF specification.
      Commands are listed in Table A.1 of PDF specification 32000-1:2008.

      Specified by:
      iterate in class BaseWatchable
      Returns:
      • Watchable.RUNNING when there are commands to be processed
      • Watchable.COMPLETED when the page is done and all the commands have been processed
      • Watchable.STOPPED if the page we are rendering into is no longer available
      Throws:
      Exception
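
The three return values of iterate(), together with the weak page reference described under pageRef and cmds, can be sketched as below. Everything here is a stand-in: the class name, the Status enum (used instead of the real Watchable constants), and the use of a list of strings in place of a PDFPage are assumptions made for illustration.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of iterating against a weakly referenced target. */
final class WeakTargetIterationSketch {
    /** Stand-ins for Watchable.RUNNING, Watchable.COMPLETED and Watchable.STOPPED. */
    enum Status { RUNNING, COMPLETED, STOPPED }

    private final WeakReference<List<String>> pageRef; // the "page" we render into
    private final Deque<String> pending;               // commands not yet processed

    WeakTargetIterationSketch(WeakReference<List<String>> pageRef, List<String> commands) {
        this.pageRef = pageRef;
        this.pending = new ArrayDeque<>(commands);
    }

    /** Process at most one command and report the resulting status. */
    Status iterate() {
        // Strong reference held for this iteration only; it goes out of scope
        // on return, so the page can be collected when nobody else uses it.
        List<String> cmds = pageRef.get();
        if (cmds == null) {
            return Status.STOPPED; // the page is no longer available
        }
        String next = pending.poll();
        if (next == null) {
            return Status.COMPLETED; // all commands have been processed
        }
        cmds.add(next);
        return Status.RUNNING;
    }
}
```

A caller would keep invoking iterate() while it returns RUNNING, and must hold its own strong reference to the page for as long as the result is wanted.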
    • processQCmd

      private void processQCmd()
      abstracted command processing for Q command. Used directly and as part of processing of mushed QBT command.
    • processBTCmd

      private void processBTCmd()
      abstracted command processing for BT command. Used directly and as part of processing of mushed QBT command.
    • cleanup

      public void cleanup()
      Cleanup when iteration is done
      Overrides:
      cleanup in class BaseWatchable
    • dumpStreamToError

      public void dumpStreamToError()
    • dumpStream

      public String dumpStream()
    • emitDataFile

      public static void emitDataFile(byte[] ary, String name)
      take a byte array and write a temporary file with its data. This is intended to capture data for analysis, for example the output of decoders.
      Parameters:
      ary -
      name -
    • findResource

      private PDFObject findResource(String name, String inDict) throws IOException
      get a property from a named dictionary in the resources of this content stream.
      Parameters:
      name - the name of the property in the dictionary
      inDict - the name of the dictionary in the resources
      Returns:
      the value of the property in the dictionary
      Throws:
      IOException
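
The two-level lookup that findResource() describes can be illustrated with nested maps. This is a sketch only: the class name ResourceLookupSketch and the use of plain Map values in place of PDFObject dictionaries are assumptions, and throwing IOException when the named dictionary is missing is a guess suggested by the method signature, not confirmed behavior.

```java
import java.io.IOException;
import java.util.Map;

/** Hypothetical sketch of a lookup in a named resource dictionary. */
final class ResourceLookupSketch {
    /**
     * Returns the value of property name inside the dictionary inDict,
     * or null when the dictionary exists but has no such property.
     */
    static Object findResource(Map<String, Map<String, Object>> resources,
                               String name, String inDict) throws IOException {
        Map<String, Object> dict = resources.get(inDict);
        if (dict == null) {
            // Assumption: a missing dictionary is reported as an error.
            throw new IOException("No dictionary called " + inDict + " in the resources");
        }
        return dict.get(name);
    }
}
```

For example, a font reference such as F1 would be resolved by passing "F1" as name and "Font" as inDict.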
    • doXObject

      private void doXObject(PDFObject obj) throws IOException
      Insert a PDF object into the command stream. The object must either be an Image or a Form, which is a set of PDF commands in a stream.
      Parameters:
      obj - the object to insert, an Image or a Form.
      Throws:
      IOException
    • doImage

      private void doImage(PDFObject obj) throws IOException
      Parse image data into a Java BufferedImage and add the image command to the page.
      Parameters:
      obj - contains the image data, and a dictionary describing the width, height and color space of the image.
      Throws:
      IOException
    • doForm

      private void doForm(PDFObject obj) throws IOException
      Inject a stream of PDF commands onto the page. Optimized to cache a parsed stream of commands, so that each Form object only needs to be parsed once.
      Parameters:
      obj - a stream containing the PDF commands, a transformation matrix, bounding box, and resources.
      Throws:
      IOException
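
The caching idea behind doForm(), where each Form object is parsed only once, can be sketched as follows. The class FormCacheSketch, its type parameters, and the identity-keyed map are assumptions for illustration; this page does not say where the real cache is stored.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a parse-once cache for Form objects. */
final class FormCacheSketch<F, C> {
    private final Map<F, List<C>> cache = new IdentityHashMap<>(); // keyed by object identity
    private final Function<F, List<C>> parser;                     // the expensive parse step
    private int parseCount;                                        // how often parser ran

    FormCacheSketch(Function<F, List<C>> parser) {
        this.parser = parser;
    }

    /** Returns the parsed commands for form, parsing only on the first request. */
    List<C> commandsFor(F form) {
        List<C> cmds = cache.get(form);
        if (cmds == null) {
            cmds = parser.apply(form);
            parseCount++;
            cache.put(form, cmds);
        }
        return cmds;
    }

    int parseCount() {
        return parseCount;
    }
}
```

A form that is drawn many times on one page then costs one parse plus cheap lookups.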
    • doPattern

      private PDFPaint doPattern(PatternSpace patternSpace) throws IOException
      Set the values into a PatternSpace
      Throws:
      IOException
    • parseObject

      private Object parseObject() throws PDFParseException
      Parse the next object out of the PDF stream. This could be a Double, a String, a HashMap (dictionary), Object[] array, or a Tok containing a PDF command.
      Throws:
      PDFParseException
    • parseInlineImage

      private void parseInlineImage() throws IOException
      Parse an inline image. An inline image starts with BI (already read), contains a dictionary until ID, and then image data until EI.
      Throws:
      IOException
    • doShader

      private void doShader(PDFObject shaderObj) throws IOException
      build a shader from a dictionary.
      Throws:
      IOException
    • getFontFrom

      private PDFFont getFontFrom(String fontref) throws IOException
      get a PDFFont from the resources, given the resource name of the font.
      Parameters:
      fontref - the resource key for the font
      Throws:
      IOException
    • setGSState

      private void setGSState(String name) throws IOException
      add graphics state commands contained within a dictionary.
      Parameters:
      name - the resource name of the graphics state dictionary
      Throws:
      IOException
    • parseColorSpace

      private PDFColorSpace parseColorSpace(PDFObject csobj) throws IOException
      generate a PDFColorSpace description based on a PDFObject. The object could be a standard name, or the name of a resource in the ColorSpace dictionary, or a color space name with a defining dictionary or stream.
      Throws:
      IOException
    • popFloat

      private float popFloat() throws PDFParseException
      pop a single float value off the stack.
      Returns:
      the float value of the top of the stack
      Throws:
      PDFParseException - if the value on the top of the stack isn't a number
    • popFloat

      private float[] popFloat(int count) throws PDFParseException
      pop an array of float values off the stack. This is equivalent to filling an array from end to front by popping values off the stack.
      Parameters:
      count - the number of numbers to pop off the stack
      Returns:
      an array of length count
      Throws:
      PDFParseException - if any of the values popped off the stack are not numbers.
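
Filling an array from end to front keeps the operands in the order in which they were pushed. The sketch below shows the idea under stated assumptions: OperandStackSketch is a hypothetical class, and IllegalStateException stands in for PDFParseException.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of popping numeric operands; not the PDFParser source. */
final class OperandStackSketch {
    private final Deque<Object> stack = new ArrayDeque<>();

    void push(Object value) {
        stack.push(value);
    }

    /** Pops a single float; fails if the top of the stack is not a number. */
    float popFloat() {
        Object obj = stack.pop();
        if (!(obj instanceof Number)) {
            // Stand-in for PDFParseException in this sketch.
            throw new IllegalStateException("Expected a number here.");
        }
        return ((Number) obj).floatValue();
    }

    /** Pops count floats, filling the array from its last index to its first. */
    float[] popFloat(int count) {
        float[] ary = new float[count];
        for (int i = count - 1; i >= 0; i--) {
            ary[i] = popFloat();
        }
        return ary;
    }
}
```

With operands pushed as 1, 2, 3, the last value pushed lands in the last array slot, so the array reads in the same order as the content stream.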
    • popInt

      private int popInt() throws PDFParseException
      pop a single integer value off the stack.
      Returns:
      the integer value of the top of the stack
      Throws:
      PDFParseException - if the top of the stack isn't a number.
    • popFloatArray

      private float[] popFloatArray() throws PDFParseException
      pop an array of float values off the stack.
      Returns:
      the array of float values
      Throws:
      PDFParseException - if any of the values popped off the stack are not numbers.
    • popString

      private String popString() throws PDFParseException
      pop a String off the stack.
      Returns:
      the String from the top of the stack
      Throws:
      PDFParseException - if the top of the stack is not a NAME or STR.
    • popObject

      private PDFObject popObject() throws PDFParseException
      pop a PDFObject off the stack.
      Returns:
      the PDFObject from the top of the stack
      Throws:
      PDFParseException - if the top of the stack does not contain a PDFObject.
    • popArray

      private Object[] popArray() throws PDFParseException
      pop an array off the stack
      Returns:
      the array of objects that is the top element of the stack
      Throws:
      PDFParseException - if the top element of the stack does not contain an array.