Class PDFFile


  • public class PDFFile
    extends java.lang.Object
    An encapsulation of a .pdf file. The methods of this class can parse the contents of a PDF file, but those methods are hidden. Instead, the public methods of this class allow access to the pages in the PDF file. Typically, you create a new PDFFile, ask it for the number of pages, and then request one or more PDFPages.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) java.nio.ByteBuffer buf
      A ByteBuffer containing the file data
      (package private) Cache cache
      a mapping of page numbers to parsed PDF commands
      private PDFDecrypter defaultDecrypter
      The default decrypter for streams and strings.
      (package private) PDFObject encrypt
      the Encrypt PDFObject, from the trailer
      static int FF_CHAR  
      (package private) PDFObject info
      The Info PDFObject, from the trailer, for simple metadata
      private int majorVersion  
      private int minorVersion  
      static int NUL_CHAR  
      (package private) PDFXref[] objIdx
      the cross reference table mapping object numbers to locations in the PDF file
      private boolean printable
      whether the file is printable or not (trailer -> Encrypt -> P & 0x4)
      (package private) PDFObject root
      the root PDFObject, as specified in the PDF file
      private boolean saveable
      whether the file is saveable or not (trailer -> Encrypt -> P & 0x10)
      private static java.lang.String VERSION_COMMENT
      the comment text at the beginning of the file that identifies its version
      private java.lang.String versionString  
    • Constructor Summary

      Constructors 
      Constructor Description
      PDFFile​(java.nio.ByteBuffer buf)
      get a PDFFile from a .pdf file.
      PDFFile​(java.nio.ByteBuffer buf, PDFPassword password)
      get a PDFFile from a .pdf file.
    • Method Summary

      Modifier and Type Method Description
      private PDFPage createPage​(int pagenum, PDFObject pageObj)
      Create a PDF Page object by finding the relevant inherited properties
      PDFObject dereference​(PDFXref ref, PDFDecrypter decrypter)
      Used internally to track down PDFObject references.
      private PDFObject findPage​(PDFObject pagedict, int start, int getPage, java.util.Map<java.lang.String,​PDFObject> resources)
      Get the PDFObject representing the content of a particular page.
      private byte[] getContents​(PDFObject pageObj)
      get the stream representing the content of a particular page.
      PDFDecrypter getDefaultDecrypter()
      Get the default decrypter for the document
      private PDFObject getInheritedValue​(PDFObject pageObj, java.lang.String propName)
      Find a property value in a page that may be inherited.
      int getMajorVersion()
      return the major version of the PDF header.
      java.util.Iterator<java.lang.String> getMetadataKeys()
      Get the keys into the Info metadata, for use with getStringMetadata(String)
      int getMinorVersion()
      return the minor version of the PDF header.
      int getNumPages()
      return the number of pages in this PDFFile.
      OutlineNode getOutline()
      Gets the outline tree as a tree of OutlineNode, which is a subclass of DefaultMutableTreeNode.
      PDFPage getPage​(int pagenum)
      Get the page commands for a given page in a separate thread.
      PDFPage getPage​(int pagenum, boolean wait)
      Get the page commands for a given page.
      int getPageNumber​(PDFObject page)
      Gets the page number (starting from 1) of the page represented by a particular PDFObject.
      PDFObject getRoot()
      get the root PDFObject of this PDFFile.
      java.lang.String getStringMetadata​(java.lang.String name)
      Get metadata (e.g., Author, Title, Creator) from the Info dictionary as a string.
      java.lang.String getVersionString()
      return the version string from the PDF header.
      static boolean isDelimiter​(int c)
      Is the argument a delimiter according to the PDF spec?
      boolean isPrintable()
      Gets whether the owner of the file has given permission to print the file.
      static boolean isRegularCharacter​(int c)
      return true if the character is neither whitespace nor a delimiter.
      boolean isSaveable()
      Gets whether the owner of the file has given permission to save a copy of the file.
      static boolean isWhiteSpace​(int c)
      Is the argument a white space character according to the PDF spec?
      private boolean nextItemIs​(java.lang.String match)
      requires the next few characters (after whitespace) to match the argument.
      private void parseFile​(PDFPassword password)
      build the PDFFile reference table.
      java.awt.geom.Rectangle2D.Float parseRect​(PDFObject obj)
      get a Rectangle2D.Float representation for a PDFObject that is an array of four Numbers.
      private void processVersion​(java.lang.String versionString)
      process a version string, to determine the major and minor versions of the file.
      private PDFObject readArray​(int objNum, int objGen, PDFDecrypter decrypter)
      read an [ array ].
      private PDFObject readDictionary​(int objNum, int objGen, PDFDecrypter decrypter)
      read an entire << dictionary >>.
      private int readHexDigit()
      read a character, and return its value as if it were a hexadecimal digit.
      private int readHexPair()
      return the 8-bit value represented by the next two hex characters.
      private PDFObject readHexString​(int objNum, int objGen, PDFDecrypter decrypter)
      read a < hex string >.
      private PDFObject readKeyword​(char start)
      read a bare keyword.
      private java.lang.String readLine()
      Read a line of text.
      private PDFObject readLiteralString​(int objNum, int objGen, PDFDecrypter decrypter)
      read a ( character string ).
      private PDFObject readName()
      read a /name.
      private PDFObject readNumber​(char start)
      read a number.
      private PDFObject readObject​(int objNum, int objGen, boolean numscan, PDFDecrypter decrypter)
      read the next object with a special catch for numbers
      private PDFObject readObject​(int objNum, int objGen, PDFDecrypter decrypter)
      read the next object from the file
      private PDFObject readObjectDescription​(int objNum, int objGen, PDFDecrypter decrypter)
      read an entire PDFObject.
      private java.nio.ByteBuffer readStream​(PDFObject dict)
      read the stream portion of a PDFObject.
      private void readTrailer​(PDFPassword password)
      read the cross reference table from a PDF file.
      void stop​(int pageNum)
      Stop the rendering of a particular image on this page
      private java.lang.String unicode​(java.lang.String input)
      take a string and determine whether it is unicode by checking its lead characters and that its length is a multiple of 2 chars.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • versionString

        private java.lang.String versionString
      • majorVersion

        private int majorVersion
      • minorVersion

        private int minorVersion
      • VERSION_COMMENT

        private static final java.lang.String VERSION_COMMENT
        the comment text at the beginning of the file that identifies its version
        See Also:
        Constant Field Values
      • buf

        java.nio.ByteBuffer buf
        A ByteBuffer containing the file data
      • objIdx

        PDFXref[] objIdx
        the cross reference table mapping object numbers to locations in the PDF file
      • root

        PDFObject root
        the root PDFObject, as specified in the PDF file
      • encrypt

        PDFObject encrypt
        the Encrypt PDFObject, from the trailer
      • info

        PDFObject info
        The Info PDFObject, from the trailer, for simple metadata
      • cache

        Cache cache
        a mapping of page numbers to parsed PDF commands
      • printable

        private boolean printable
        whether the file is printable or not (trailer -> Encrypt -> P & 0x4)
      • saveable

        private boolean saveable
        whether the file is saveable or not (trailer -> Encrypt -> P & 0x10)
      • defaultDecrypter

        private PDFDecrypter defaultDecrypter
        The default decrypter for streams and strings. By default, no encryption is expected, and thus the IdentityDecrypter is used.
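The printable and saveable field descriptions above quote the masks 0x4 and 0x10 applied to the P entry of the Encrypt dictionary. A minimal sketch of that bit test follows; the class and method names are hypothetical, and how the real class behaves when there is no Encrypt dictionary is not covered here.

```java
// Hypothetical helper, not part of PDFFile: tests the two permission bits
// quoted in the field descriptions (P & 0x4 and P & 0x10).
final class PermissionBitsSketch {
    static final int PRINT_BIT = 0x4;  // mask quoted for the printable field
    static final int SAVE_BIT = 0x10;  // mask quoted for the saveable field

    /** True if the print bit is set in the permissions value p. */
    static boolean isPrintable(int p) {
        return (p & PRINT_BIT) != 0;
    }

    /** True if the save bit is set in the permissions value p. */
    static boolean isSaveable(int p) {
        return (p & SAVE_BIT) != 0;
    }
}
```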
    • Constructor Detail

      • PDFFile

        public PDFFile​(java.nio.ByteBuffer buf)
                throws java.io.IOException
        get a PDFFile from a .pdf file. The file must be a random access file at the moment. It should really be a file mapping from the nio package.

        Use the getPage(...) methods to get a page from the PDF file.

        Parameters:
        buf - the ByteBuffer containing the PDF.
        Throws:
        java.io.IOException - if there's a problem reading from the buffer
        PDFParseException - if the document appears to be malformed, or its features are unsupported. If the file is encrypted in a manner that the product or platform does not support then the exception's cause will be an instance of UnsupportedEncryptionException.
        PDFAuthenticationFailureException - if the file is password protected and requires a password
      • PDFFile

        public PDFFile​(java.nio.ByteBuffer buf,
                       PDFPassword password)
                throws java.io.IOException
        get a PDFFile from a .pdf file. The file must be a random access file at the moment. It should really be a file mapping from the nio package.

        Use the getPage(...) methods to get a page from the PDF file.

        Parameters:
        buf - the ByteBuffer containing the PDF.
        password - the user or owner password
        Throws:
        java.io.IOException - if there's a problem reading from the buffer
        PDFParseException - if the document appears to be malformed, or its features are unsupported. If the file is encrypted in a manner that the product or platform does not support then the exception's cause will be an instance of UnsupportedEncryptionException.
        PDFAuthenticationFailureException - if the file is password protected and the supplied password does not decrypt the document
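Both constructors take a java.nio.ByteBuffer, and the description mentions a file mapping from the nio package. One way to obtain such a buffer with the standard library alone is sketched below; MappedPdfLoader and map are hypothetical names, and this is only one possible way to produce the buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps a whole file read-only into memory.
final class MappedPdfLoader {
    static ByteBuffer map(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

The returned buffer could then be passed to PDFFile(java.nio.ByteBuffer) or PDFFile(java.nio.ByteBuffer, PDFPassword).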
    • Method Detail

      • isPrintable

        public boolean isPrintable()
        Gets whether the owner of the file has given permission to print the file.
        Returns:
        true if it is okay to print the file
      • isSaveable

        public boolean isSaveable()
        Gets whether the owner of the file has given permission to save a copy of the file.
        Returns:
        true if it is okay to save the file
      • getRoot

        public PDFObject getRoot()
        get the root PDFObject of this PDFFile. You generally shouldn't need this, but we've left it open in case you want to go spelunking.
      • getNumPages

        public int getNumPages()
        return the number of pages in this PDFFile. The pages will be numbered from 1 to getNumPages(), inclusive.
      • getStringMetadata

        public java.lang.String getStringMetadata​(java.lang.String name)
                                           throws java.io.IOException
        Get metadata (e.g., Author, Title, Creator) from the Info dictionary as a string.
        Parameters:
        name - the name of the metadata key (e.g., Author)
        Returns:
        the value of the named metadata entry, as a string
        Throws:
        java.io.IOException - if the metadata cannot be read
      • getMetadataKeys

        public java.util.Iterator<java.lang.String> getMetadataKeys()
                                                             throws java.io.IOException
        Get the keys into the Info metadata, for use with getStringMetadata(String)
        Returns:
        the keys present in the Info dictionary
        Throws:
        java.io.IOException - if the keys cannot be read
      • dereference

        public PDFObject dereference​(PDFXref ref,
                                     PDFDecrypter decrypter)
                              throws java.io.IOException
        Used internally to track down PDFObject references. You should never need to call this.

        Since this is the only public method for tracking down PDF objects, it is synchronized. This means that the PDFFile can only hunt down one object at a time, preventing the file's location from getting messed around.

        This call stores the current buffer position before any changes are made and restores it afterwards, so callers need not know that the position has changed.

        Throws:
        java.io.IOException
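The dereference description above combines two ideas: only one lookup runs at a time, and the buffer position is saved before the lookup and restored afterwards. A minimal sketch of that pattern on a plain ByteBuffer follows; PositionRestoreSketch and byteAt are hypothetical names, and a single-byte read stands in for the real object lookup.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of "save position, seek, read, restore".
final class PositionRestoreSketch {
    private final ByteBuffer buf;

    PositionRestoreSketch(ByteBuffer buf) {
        this.buf = buf;
    }

    /** Reads one byte at an absolute offset; callers never see the position move. */
    synchronized int byteAt(int offset) {
        int saved = buf.position();      // remember where the caller was
        try {
            buf.position(offset);        // jump to the referenced location
            return buf.get() & 0xFF;
        } finally {
            buf.position(saved);         // restore even if the read throws
        }
    }

    int position() {
        return buf.position();
    }
}
```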
      • isWhiteSpace

        public static boolean isWhiteSpace​(int c)
        Is the argument a white space character according to the PDF spec? ISO 32000-1:2008 - Table 1
      • isDelimiter

        public static boolean isDelimiter​(int c)
        Is the argument a delimiter according to the PDF spec?

        ISO 32000-1:2008 - Table 2

        Parameters:
        c - the character to test
      • isRegularCharacter

        public static boolean isRegularCharacter​(int c)
        return true if the character is neither whitespace nor a delimiter.
        Parameters:
        c - the character to test
        Returns:
        true if the character is a regular character; false otherwise
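The three character tests above can be sketched directly from ISO 32000-1:2008: Table 1 lists NUL, tab, line feed, form feed, carriage return and space as white space, and Table 2 lists ( ) < > [ ] { } / % as delimiters. The class below is a standalone illustration with a hypothetical name, not the actual implementation.

```java
// Hypothetical standalone version of the three character-class tests.
final class PdfCharClassSketch {
    /** White-space characters per ISO 32000-1:2008, Table 1. */
    static boolean isWhiteSpace(int c) {
        switch (c) {
            case 0x00: // NUL
            case 0x09: // horizontal tab
            case 0x0A: // line feed
            case 0x0C: // form feed
            case 0x0D: // carriage return
            case 0x20: // space
                return true;
            default:
                return false;
        }
    }

    /** Delimiter characters per ISO 32000-1:2008, Table 2. */
    static boolean isDelimiter(int c) {
        return c >= 0 && "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** Everything that is neither white space nor a delimiter. */
    static boolean isRegularCharacter(int c) {
        return !isWhiteSpace(c) && !isDelimiter(c);
    }
}
```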
      • readObject

        private PDFObject readObject​(int objNum,
                                     int objGen,
                                     PDFDecrypter decrypter)
                              throws java.io.IOException
        read the next object from the file
        Parameters:
        objNum - the object number of the object containing the object being read; negative only if the object number is unavailable (e.g., if reading from the trailer, or reading at the top level, in which case we can expect to be reading an object description)
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
      • readObject

        private PDFObject readObject​(int objNum,
                                     int objGen,
                                     boolean numscan,
                                     PDFDecrypter decrypter)
                              throws java.io.IOException
        read the next object with a special catch for numbers
        Parameters:
        numscan - if true, don't bother trying to see if a number is an object reference (used when already in the middle of testing for an object reference, and not otherwise)
        objNum - the object number of the object containing the object being read; negative only if the object number is unavailable (e.g., if reading from the trailer, or reading at the top level, in which case we can expect to be reading an object description)
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
      • nextItemIs

        private boolean nextItemIs​(java.lang.String match)
                            throws java.io.IOException
        requires the next few characters (after whitespace) to match the argument.
        Parameters:
        match - the next few characters after any whitespace that must be in the file
        Returns:
        true if the next characters match; false otherwise.
        Throws:
        java.io.IOException
      • processVersion

        private void processVersion​(java.lang.String versionString)
        process a version string, to determine the major and minor versions of the file.
        Parameters:
        versionString - the version string to process
      • getMajorVersion

        public int getMajorVersion()
        return the major version of the PDF header.
        Returns:
        the major version number
      • getMinorVersion

        public int getMinorVersion()
        return the minor version of the PDF header.
        Returns:
        the minor version number
      • getVersionString

        public java.lang.String getVersionString()
        return the version string from the PDF header.
        Returns:
        the version string
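Splitting a version string into major and minor parts, as processVersion is described as doing, can be sketched as follows. This assumes the version string has the form "1.4"; the exact format accepted by the real method is not stated in this page, and VersionSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for version strings of the assumed form "major.minor".
final class VersionSketch {
    private static final Pattern VERSION = Pattern.compile("(\\d{1,3})\\.(\\d{1,3})");

    /** Returns {major, minor}, or null if the string does not look like "1.4". */
    static int[] parse(String versionString) {
        Matcher m = VERSION.matcher(versionString.trim());
        if (!m.matches()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }
}
```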
      • readDictionary

        private PDFObject readDictionary​(int objNum,
                                         int objGen,
                                         PDFDecrypter decrypter)
                                  throws java.io.IOException
        read an entire << dictionary >>. The initial << has already been read.
        Parameters:
        objNum - the object number of the object containing the dictionary being read; negative only if the object number is unavailable, which should only happen if we're reading a dictionary placed directly in the trailer
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Returns:
        the Dictionary as a PDFObject.
        Throws:
        java.io.IOException
      • readHexDigit

        private int readHexDigit()
                          throws java.io.IOException
        read a character, and return its value as if it were a hexadecimal digit.
        Returns:
        a number between 0 and 15 whose value matches the next hexadecimal character. Returns -1 if the next character isn't in [0-9a-fA-F]
        Throws:
        java.io.IOException
      • readHexPair

        private int readHexPair()
                         throws java.io.IOException
        return the 8-bit value represented by the next two hex characters. If the next two characters don't represent a hex value, return -1 and reset the read head. If there is only one hex character, return its value as if there were an implicit 0 after it.
        Throws:
        java.io.IOException
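The readHexDigit and readHexPair contracts above can be illustrated on a plain ByteBuffer. The sketch below follows the documented return values (-1 with the read head reset when no hex digit is present, and an implicit trailing 0 when only one digit is present); the class name is hypothetical, and details such as whitespace handling in the real methods are not modelled.

```java
import java.nio.ByteBuffer;

// Hypothetical standalone version of the two hex-reading helpers.
final class HexReadSketch {
    /** Consumes one byte; returns 0..15 for a hex digit, -1 otherwise. */
    static int readHexDigit(ByteBuffer buf) {
        if (!buf.hasRemaining()) {
            return -1;
        }
        int c = buf.get() & 0xFF;
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Returns the 8-bit value of the next two hex digits, or -1 if there are none. */
    static int readHexPair(ByteBuffer buf) {
        int start = buf.position();
        int hi = readHexDigit(buf);
        if (hi < 0) {
            buf.position(start);      // no hex digit at all: reset the read head
            return -1;
        }
        int afterHi = buf.position();
        int lo = readHexDigit(buf);
        if (lo < 0) {
            buf.position(afterHi);    // put back the non-hex character
            return hi << 4;           // single digit: implicit 0 after it
        }
        return (hi << 4) | lo;
    }
}
```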
      • readHexString

        private PDFObject readHexString​(int objNum,
                                        int objGen,
                                        PDFDecrypter decrypter)
                                 throws java.io.IOException
        read a < hex string >. The initial < has already been read.
        Parameters:
        objNum - the object number of the object containing the hex string being read; negative only if the object number is unavailable, which should only happen if we're reading a string placed directly in the trailer
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
      • unicode

        private java.lang.String unicode​(java.lang.String input)
        take a string and determine whether it is unicode by checking its lead characters and that its length is a multiple of 2 chars. If it is, convert the string's characters into true unicode.
        Parameters:
        input - the string to examine
        Returns:
        the converted string
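A minimal sketch of the check described for unicode(String) follows. It assumes that the "lead characters" are the UTF-16BE byte-order marker FE FF and that each char of the input holds one byte value; both are assumptions, as are the class and method names.

```java
// Hypothetical decoder: treats each input char as one byte and, if the
// string starts with FE FF and has even length, joins byte pairs into chars.
final class UnicodeSketch {
    static String decode(String input) {
        if (input.length() < 2 || input.length() % 2 != 0) {
            return input;                       // cannot be 16-bit text
        }
        if (input.charAt(0) != 0xFE || input.charAt(1) != 0xFF) {
            return input;                       // no byte-order marker (assumed test)
        }
        StringBuilder out = new StringBuilder(input.length() / 2 - 1);
        for (int i = 2; i < input.length(); i += 2) {
            int hi = input.charAt(i) & 0xFF;
            int lo = input.charAt(i + 1) & 0xFF;
            out.append((char) ((hi << 8) | lo)); // big-endian pair to one char
        }
        return out.toString();
    }
}
```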
      • readLiteralString

        private PDFObject readLiteralString​(int objNum,
                                            int objGen,
                                            PDFDecrypter decrypter)
                                     throws java.io.IOException

        read a ( character string ). The initial ( has already been read. Read until a *balanced* ) appears.

        PDF Reference Section 3.8.1, Table 3.31 "PDF Data Types" defines String data as:

         "text string     Bytes that represent characters encoded
                          using either PDFDocEncoding or UTF-16BE with a
                          leading byte-order marker (as defined in
                          "Text String Type" on page 158.)"
         

        Section 5.3.2 defines character sequences and escapes.
        "The strings must conform to the syntax for string objects. When a string is written by enclosing the data in parentheses, bytes whose values are the same as those of the ASCII characters left parenthesis (40), right parenthesis (41), and backslash (92) must be preceded by a backslash character. All other byte values between 0 and 255 may be used in a string object.
        These rules apply to each individual byte in a string object, whether the string is interpreted by the text-showing operators as single-byte or multiple-byte character codes."

        This only reads 8 bit basic 'strings' so as to avoid a text string interpretation when one is not desired (e.g., for byte strings). For a text string interpretation of a string, use PDFStringUtil.asTextString(java.lang.String) or PDFObject.getTextStringValue()

        Parameters:
        objNum - the object number of the object containing the string being read; negative only if the object number is unavailable, which should only happen if we're reading a string placed directly in the trailer
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
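Reading until a balanced closing parenthesis, as described for readLiteralString, can be sketched with a nesting counter. This is a simplified illustration under stated assumptions: the initial ( has already been consumed, a backslash simply makes the next byte literal (the full escape set is not handled), and no decryption takes place. The class name is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical, simplified reader for a ( literal string ).
final class LiteralStringSketch {
    /** Reads up to the matching ')' and returns the raw 8-bit characters. */
    static String readLiteral(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        int depth = 1;                              // the opening '(' was already read
        while (buf.hasRemaining()) {
            char c = (char) (buf.get() & 0xFF);
            if (c == '\\' && buf.hasRemaining()) {
                sb.append((char) (buf.get() & 0xFF)); // escaped byte taken literally
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return sb.toString();               // balanced: done
            }
            sb.append(c);
        }
        throw new IllegalStateException("unterminated literal string");
    }
}
```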
      • readLine

        private java.lang.String readLine()
        Read a line of text. This follows the semantics of readLine() in DataInput -- it reads character by character until a '\n' is encountered. If a '\r' is encountered, it is discarded.
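The line-reading rule above (stop at a line feed, drop carriage returns) can be sketched on a ByteBuffer as follows. The class name is hypothetical, and treating the end of the buffer as the end of the line is an assumption of this sketch.

```java
import java.nio.ByteBuffer;

// Hypothetical standalone line reader following the rule described above.
final class ReadLineSketch {
    static String readLine(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        while (buf.hasRemaining()) {
            char c = (char) (buf.get() & 0xFF);
            if (c == '\n') {
                break;              // line feed ends the line and is consumed
            }
            if (c != '\r') {
                sb.append(c);       // carriage returns are discarded
            }
        }
        return sb.toString();
    }
}
```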
      • readArray

        private PDFObject readArray​(int objNum,
                                    int objGen,
                                    PDFDecrypter decrypter)
                             throws java.io.IOException
        read an [ array ]. The initial [ has already been read. PDFObjects are read until ].
        Parameters:
        objNum - the object number of the object containing the array being read; negative only if the object number is unavailable, which should only happen if we're reading an array placed directly in the trailer
        objGen - the object generation of the object containing the object being read; negative only if the objNum is unavailable
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
      • readName

        private PDFObject readName()
                            throws java.io.IOException
        read a /name. The / has already been read.
        Throws:
        java.io.IOException
      • readNumber

        private PDFObject readNumber​(char start)
                              throws java.io.IOException
        read a number. The initial digit or . or - is passed in as the argument.
        Throws:
        java.io.IOException
      • readKeyword

        private PDFObject readKeyword​(char start)
                               throws java.io.IOException
        read a bare keyword. The initial character is passed in as the argument.
        Throws:
        java.io.IOException
      • readObjectDescription

        private PDFObject readObjectDescription​(int objNum,
                                                int objGen,
                                                PDFDecrypter decrypter)
                                         throws java.io.IOException
        read an entire PDFObject. The intro line, which looks something like "4 0 obj" has already been read.
        Parameters:
        objNum - the object number of the object being read, being the first number in the intro line (4 in "4 0 obj")
        objGen - the object generation of the object being read, being the second number in the intro line (0 in "4 0 obj").
        decrypter - the decrypter to use
        Throws:
        java.io.IOException
      • readStream

        private java.nio.ByteBuffer readStream​(PDFObject dict)
                                        throws java.io.IOException
        read the stream portion of a PDFObject. Calls decodeStream to un-filter the stream as necessary.
        Parameters:
        dict - the dictionary associated with this stream.
        Returns:
        a ByteBuffer with the encoded stream data
        Throws:
        java.io.IOException
      • parseFile

        private void parseFile​(PDFPassword password)
                        throws java.io.IOException
        build the PDFFile reference table. Nothing in the PDFFile actually gets parsed, despite the name of this function. Things only get read and parsed when they're needed.
        Parameters:
        password - the user or owner password
        Throws:
        java.io.IOException
      • getOutline

        public OutlineNode getOutline()
                               throws java.io.IOException
        Gets the outline tree as a tree of OutlineNode, which is a subclass of DefaultMutableTreeNode. If there is no outline tree, this method returns null.
        Throws:
        java.io.IOException
      • getPageNumber

        public int getPageNumber​(PDFObject page)
                          throws java.io.IOException
        Gets the page number (starting from 1) of the page represented by a particular PDFObject. The PDFObject must be a Page dictionary or a destination description (or an action).
        Returns:
        a number between 1 and the number of pages indicating the page number, or 0 if the PDFObject is not in the page tree.
        Throws:
        java.io.IOException
      • getPage

        public PDFPage getPage​(int pagenum)
        Get the page commands for a given page in a separate thread.
        Parameters:
        pagenum - the number of the page to get commands for
      • getPage

        public PDFPage getPage​(int pagenum,
                               boolean wait)
        Get the page commands for a given page.
        Parameters:
        pagenum - the number of the page to get commands for
        wait - if true, do not exit until the page is complete.
      • stop

        public void stop​(int pageNum)
        Stop the rendering of a particular image on this page
      • getContents

        private byte[] getContents​(PDFObject pageObj)
                            throws java.io.IOException
        get the stream representing the content of a particular page.
        Parameters:
        pageObj - the page object to get the contents of
        Returns:
        a concatenation of any content streams for the requested page.
        Throws:
        java.io.IOException
      • createPage

        private PDFPage createPage​(int pagenum,
                                   PDFObject pageObj)
                            throws java.io.IOException
        Create a PDF Page object by finding the relevant inherited properties
        Parameters:
        pageObj - the PDF object for the page to be created
        Throws:
        java.io.IOException
      • findPage

        private PDFObject findPage​(PDFObject pagedict,
                                   int start,
                                   int getPage,
                                   java.util.Map<java.lang.String,​PDFObject> resources)
                            throws java.io.IOException
        Get the PDFObject representing the content of a particular page. Note that the number of the page need not have anything to do with the label on that page. If there are two blank pages, and then roman numerals for the page number, then passing in 6 will get page (iv).
        Parameters:
        pagedict - the top of the pages tree
        start - the page number of the first page in this dictionary
        getPage - the number of the page to find; NOT the page's label.
        resources - a HashMap that will be filled with any resource definitions encountered on the search for the page
        Throws:
        java.io.IOException
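The search described for findPage walks the page tree by position, not by label. A toy sketch of that walk follows, using a hypothetical Node type in place of PDFObject; it computes page counts recursively and omits the resources map that the real method fills in.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical page-tree model: a node with no kids is a single page.
final class PageTreeSketch {
    static final class Node {
        final String name;
        final List<Node> kids;

        Node(String name, Node... kids) {
            this.name = name;
            this.kids = Arrays.asList(kids);
        }

        /** Number of pages at or below this node. */
        int count() {
            if (kids.isEmpty()) {
                return 1;
            }
            int total = 0;
            for (Node kid : kids) {
                total += kid.count();
            }
            return total;
        }
    }

    /** start is the number of the first page under node; getPage is the page wanted. */
    static Node findPage(Node node, int start, int getPage) {
        if (node.kids.isEmpty()) {
            return start == getPage ? node : null;
        }
        int first = start;
        for (Node kid : node.kids) {
            int n = kid.count();
            if (getPage < first + n) {
                return findPage(kid, first, getPage); // the page lies in this subtree
            }
            first += n;                               // skip the whole subtree
        }
        return null;                                  // past the last page
    }
}
```

With two blank pages followed by pages labelled i to iv, findPage(root, 1, 6) returns the page labelled iv, matching the example in the description.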
      • getInheritedValue

        private PDFObject getInheritedValue​(PDFObject pageObj,
                                            java.lang.String propName)
                                     throws java.io.IOException
        Find a property value in a page that may be inherited. If the value is not defined in the page itself, follow the page's "parent" links until the value is found or the top of the tree is reached.
        Parameters:
        pageObj - the object representing the page
        propName - the name of the property we are looking for
        Throws:
        java.io.IOException
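The inherited lookup described above is a walk up the parent chain. A minimal sketch with a hypothetical Node type standing in for PDFObject:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: each node has optional properties and a parent link.
final class InheritedLookupSketch {
    static final class Node {
        final Node parent;                        // null at the top of the tree
        final Map<String, Object> props = new HashMap<>();

        Node(Node parent) {
            this.parent = parent;
        }
    }

    /** Returns the nearest definition of propName, or null if none exists. */
    static Object getInheritedValue(Node page, String propName) {
        for (Node n = page; n != null; n = n.parent) {
            Object value = n.props.get(propName);
            if (value != null) {
                return value;                     // defined here: stop climbing
            }
        }
        return null;                              // reached the top without a match
    }
}
```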
      • parseRect

        public java.awt.geom.Rectangle2D.Float parseRect​(PDFObject obj)
                                                  throws java.io.IOException
        get a Rectangle2D.Float representation for a PDFObject that is an array of four Numbers.
        Parameters:
        obj - a PDFObject that represents an Array of exactly four Numbers.
        Throws:
        java.io.IOException
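Converting four numbers to a Rectangle2D.Float can be sketched as below. The sketch assumes the array holds two opposite corners in the order [x0 y0 x1 y1], with the first corner at the lower left; how the real parseRect handles other orderings is not stated in this page, and the class name is hypothetical.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical conversion from a four-number array to a rectangle.
final class RectSketch {
    static Rectangle2D.Float toRect(float[] a) {
        if (a.length != 4) {
            throw new IllegalArgumentException("expected exactly four numbers");
        }
        // Assumed layout: a[0], a[1] is one corner; a[2], a[3] is the opposite corner.
        return new Rectangle2D.Float(a[0], a[1], a[2] - a[0], a[3] - a[1]);
    }
}
```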
      • getDefaultDecrypter

        public PDFDecrypter getDefaultDecrypter()
        Get the default decrypter for the document
        Returns:
        the default decrypter; never null, even for documents that aren't encrypted