Class PDFFile
-
Field Summary
Fields (Modifier and Type, Field - Description)
(package private) ByteBuffer buf - A ByteBuffer containing the file data
(package private) Cache cache - a mapping of page numbers to parsed PDF commands
private PDFDecrypter defaultDecrypter - The default decrypter for streams and strings.
(package private) PDFObject encrypt - the Encrypt PDFObject, from the trailer
static final int FF_CHAR
(package private) PDFObject info - The Info PDFObject, from the trailer, for simple metadata
private int majorVersion
private int minorVersion
static final int NUL_CHAR
(package private) PDFXref[] objIdx - the cross reference table mapping object numbers to locations in the PDF file
private boolean printable - whether the file is printable or not (trailer -> Encrypt -> P & 0x4)
(package private) PDFObject root - the root PDFObject, as specified in the PDF file
private boolean saveable - whether the file is saveable or not (trailer -> Encrypt -> P & 0x10)
private static final String VERSION_COMMENT - the comment text that begins the file, used to determine its version
private String versionString
-
Constructor Summary
Constructors (Constructor - Description)
PDFFile(ByteBuffer buf) - get a PDFFile from a .pdf file.
PDFFile(ByteBuffer buf, PDFPassword password) - get a PDFFile from a .pdf file.
-
Method Summary
Methods (Modifier and Type, Method - Description)
private PDFPage createPage(int pagenum, PDFObject pageObj) - Create a PDF Page object by finding the relevant inherited properties
dereference(PDFXref ref, PDFDecrypter decrypter) - Used internally to track down PDFObject references.
private PDFObject findPage(PDFObject pagedict, int start, int getPage, Map<String, PDFObject> resources) - Get the PDFObject representing the content of a particular page.
private byte[] getContents(PDFObject pageObj) - get the stream representing the content of a particular page.
getDefaultDecrypter() - Get the default decrypter for the document
private PDFObject getInheritedValue(PDFObject pageObj, String propName) - Find a property value in a page that may be inherited.
int getMajorVersion() - return the major version of the PDF header.
getMetadataKeys() - Get the keys into the Info metadata, for use with getStringMetadata(String)
int getMinorVersion() - return the minor version of the PDF header.
int getNumPages() - return the number of pages in this PDFFile.
getOutline() - Gets the outline tree as a tree of OutlineNode, which is a subclass of DefaultMutableTreeNode.
getPage(int pagenum) - Get the page commands for a given page in a separate thread.
getPage(int pagenum, boolean wait) - Get the page commands for a given page.
int getPageNumber(PDFObject page) - Gets the page number (starting from 1) of the page represented by a particular PDFObject.
getRoot() - get the root PDFObject of this PDFFile.
getStringMetadata(String name) - Get metadata (e.g., Author, Title, Creator) from the Info dictionary as a string.
getVersionString() - return the version string from the PDF header.
static boolean isDelimiter(int c) - Is the argument a delimiter according to the PDF spec?
boolean isPrintable() - Gets whether the owner of the file has given permission to print the file.
static boolean isRegularCharacter(int c) - return true if the character is neither whitespace nor a delimiter.
boolean isSaveable() - Gets whether the owner of the file has given permission to save a copy of the file.
static boolean isWhiteSpace(int c) - Is the argument a white space character according to the PDF spec?
private boolean nextItemIs(String match) - requires the next few characters (after whitespace) to match the argument.
private void parseFile(PDFPassword password) - build the PDFFile reference table.
parseRect(obj) - get a Rectangle2D.Float representation for a PDFObject that is an array of four Numbers.
private void processVersion(String versionString) - process a version string, to determine the major and minor versions of the file.
private PDFObject readArray(int objNum, int objGen, PDFDecrypter decrypter) - read an [ array ].
private PDFObject readDictionary(int objNum, int objGen, PDFDecrypter decrypter) - read an entire << dictionary >>.
private int readHexDigit() - read a character, and return its value as if it were a hexadecimal digit.
private int readHexPair() - return the 8-bit value represented by the next two hex characters.
private PDFObject readHexString(int objNum, int objGen, PDFDecrypter decrypter) - read a < hex string >.
private PDFObject readKeyword(char start) - read a bare keyword.
private String readLine() - Read a line of text.
private PDFObject readLiteralString(int objNum, int objGen, PDFDecrypter decrypter) - read a ( character string ).
private PDFObject readName() - read a /name.
private PDFObject readNumber(char start) - read a number.
private PDFObject readObject(int objNum, int objGen, boolean numscan, PDFDecrypter decrypter) - read the next object with a special catch for numbers
private PDFObject readObject(int objNum, int objGen, PDFDecrypter decrypter) - read the next object from the file
private PDFObject readObjectDescription(int objNum, int objGen, PDFDecrypter decrypter) - read an entire PDFObject.
private ByteBuffer readStream(PDFObject dict) - read the stream portion of a PDFObject.
private void readTrailer(PDFPassword password) - read the cross reference table from a PDF file.
void stop(int pageNum) - Stop the rendering of a particular image on this page
private String unicode(input) - take a string and determine whether it is unicode by looking at the lead characters; the string must also be a multiple of 2 chars long.
-
Field Details
-
NUL_CHAR
public static final int NUL_CHAR
-
FF_CHAR
public static final int FF_CHAR
-
versionString
private String versionString
-
majorVersion
private int majorVersion
-
minorVersion
private int minorVersion
-
VERSION_COMMENT
private static final String VERSION_COMMENT
the comment text that begins the file, used to determine its version
-
buf
ByteBuffer buf
A ByteBuffer containing the file data
-
objIdx
PDFXref[] objIdx
the cross reference table mapping object numbers to locations in the PDF file
-
root
PDFObject root
the root PDFObject, as specified in the PDF file
-
encrypt
PDFObject encrypt
the Encrypt PDFObject, from the trailer
-
info
PDFObject info
The Info PDFObject, from the trailer, for simple metadata
-
cache
Cache cache
a mapping of page numbers to parsed PDF commands
-
printable
private boolean printable
whether the file is printable or not (trailer -> Encrypt -> P & 0x4)
-
saveable
private boolean saveable
whether the file is saveable or not (trailer -> Encrypt -> P & 0x10)
-
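The two masks quoted for printable and saveable can be illustrated with a small standalone sketch. The class and method names here are hypothetical and not part of PDFFile; the sketch assumes the P value has already been read from the Encrypt dictionary as an int, and it says nothing about how PDFFile treats documents without an Encrypt entry.

```java
/** Hypothetical helper, not part of PDFFile: tests permission bits of the Encrypt dictionary's P value. */
final class PermissionBits {
    static final int PRINT_MASK = 0x4;  // mask quoted in the "printable" field description
    static final int SAVE_MASK = 0x10;  // mask quoted in the "saveable" field description

    /** Returns true if every bit of the mask is set in p (both masks above are single-bit). */
    static boolean isSet(int p, int mask) {
        return (p & mask) == mask;
    }
}
```

For example, PermissionBits.isSet(p, PermissionBits.PRINT_MASK) mirrors the "P & 0x4" test named in the field description.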
defaultDecrypter
The default decrypter for streams and strings. By default, no encryption is expected, and thus the IdentityDecrypter is used.
-
-
Constructor Details
-
PDFFile
get a PDFFile from a .pdf file. The file must be a random access file at the moment. It should really be a file mapping from the nio package.
Use the getPage(...) methods to get a page from the PDF file.
- Parameters:
buf
- the ByteBuffer containing the PDF.
- Throws:
IOException
- if there's a problem reading from the buffer
PDFParseException
- if the document appears to be malformed, or its features are unsupported. If the file is encrypted in a manner that the product or platform does not support, then the exception's cause will be an instance of UnsupportedEncryptionException.
PDFAuthenticationFailureException
- if the file is password protected and requires a password
-
PDFFile
get a PDFFile from a .pdf file. The file must be a random access file at the moment. It should really be a file mapping from the nio package.
Use the getPage(...) methods to get a page from the PDF file.
- Parameters:
buf
- the ByteBuffer containing the PDF.
password
- the user or owner password
- Throws:
IOException
- if there's a problem reading from the buffer
PDFParseException
- if the document appears to be malformed, or its features are unsupported. If the file is encrypted in a manner that the product or platform does not support, then the exception's cause will be an instance of UnsupportedEncryptionException.
PDFAuthenticationFailureException
- if the file is password protected and the supplied password does not decrypt the document
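Since both constructors take a ByteBuffer and the description mentions a file mapping from the nio package, one way to obtain such a buffer is sketched below. PdfBufferLoader and mapReadOnly are hypothetical names, not part of this API; the sketch uses only standard java.nio calls and stops short of constructing a PDFFile.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: maps a file into memory so it can be handed to a ByteBuffer-based parser. */
final class PdfBufferLoader {
    /** Maps the whole file read-only; the mapping remains valid after the channel is closed. */
    static ByteBuffer mapReadOnly(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

The returned buffer could then presumably be passed as the buf argument of either constructor, with the password variant used for protected documents.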
-
-
Method Details
-
isPrintable
public boolean isPrintable()
Gets whether the owner of the file has given permission to print the file.
- Returns:
- true if it is okay to print the file
-
isSaveable
public boolean isSaveable()
Gets whether the owner of the file has given permission to save a copy of the file.
- Returns:
- true if it is okay to save the file
-
getRoot
get the root PDFObject of this PDFFile. You generally shouldn't need this, but we've left it open in case you want to go spelunking.
-
getNumPages
public int getNumPages()
return the number of pages in this PDFFile. The pages will be numbered from 1 to getNumPages(), inclusive.
-
getStringMetadata
Get metadata (e.g., Author, Title, Creator) from the Info dictionary as a string.
- Parameters:
name
- the name of the metadata key (e.g., Author)
- Returns:
- the info
- Throws:
IOException
- if the metadata cannot be read
-
getMetadataKeys
Get the keys into the Info metadata, for use with getStringMetadata(String)
- Returns:
- the keys present in the Info dictionary
- Throws:
IOException
- if the keys cannot be read
-
dereference
Used internally to track down PDFObject references. You should never need to call this.
Since this is the only public method for tracking down PDF objects, it is synchronized. This means that the PDFFile can only hunt down one object at a time, which keeps the file's read position from being disturbed by concurrent lookups.
This call stores the current buffer position before any changes are made and restores it afterwards, so callers need not know that the position has changed.
- Throws:
IOException
-
isWhiteSpace
public static boolean isWhiteSpace(int c)
Is the argument a white space character according to the PDF spec? ISO 32000-1:2008 - Table 1
-
isDelimiter
public static boolean isDelimiter(int c)
Is the argument a delimiter according to the PDF spec? ISO 32000-1:2008 - Table 2
- Parameters:
c
- the character to test
-
isRegularCharacter
public static boolean isRegularCharacter(int c)
return true if the character is neither whitespace nor a delimiter.
- Parameters:
c
- the character to test
- Returns:
- true if the character is neither whitespace nor a delimiter
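The three character classes can be sketched from the two tables of ISO 32000-1:2008 that the descriptions cite: Table 1 lists NUL, TAB, LF, FF, CR and SPACE as white space, and Table 2 lists ( ) < > [ ] { } / % as delimiters. The class below is a standalone illustration with a hypothetical name, not the PDFFile source; in particular, treating an end-of-input value such as -1 as "regular" is an artifact of this sketch.

```java
/** Hypothetical standalone version of the three character tests, derived from ISO 32000-1 Tables 1 and 2. */
final class PdfCharClasses {
    /** White space per Table 1: NUL (0), TAB (9), LF (10), FF (12), CR (13), SPACE (32). */
    static boolean isWhiteSpace(int c) {
        switch (c) {
            case 0: case 9: case 10: case 12: case 13: case 32:
                return true;
            default:
                return false;
        }
    }

    /** Delimiters per Table 2. */
    static boolean isDelimiter(int c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** Everything that is neither white space nor a delimiter. */
    static boolean isRegularCharacter(int c) {
        return !isWhiteSpace(c) && !isDelimiter(c);
    }
}
```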
-
readObject
read the next object from the file
- Parameters:
objNum
- the object number of the object containing the object being read; negative only if the object number is unavailable (e.g., if reading from the trailer, or reading at the top level, in which case we can expect to be reading an object description)
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
decrypter
- the decrypter to use
- Throws:
IOException
-
readObject
private PDFObject readObject(int objNum, int objGen, boolean numscan, PDFDecrypter decrypter) throws IOException
read the next object with a special catch for numbers
- Parameters:
objNum
- the object number of the object containing the object being read; negative only if the object number is unavailable (e.g., if reading from the trailer, or reading at the top level, in which case we can expect to be reading an object description)
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
numscan
- if true, don't bother trying to see if a number is an object reference (used when already in the middle of testing for an object reference, and not otherwise)
decrypter
- the decrypter to use
- Throws:
IOException
-
nextItemIs
requires the next few characters (after whitespace) to match the argument.
- Parameters:
match
- the next few characters after any whitespace that must be in the file
- Returns:
- true if the next characters match; false otherwise.
- Throws:
IOException
-
processVersion
process a version string, to determine the major and minor versions of the file.
- Parameters:
versionString
- the version string to process
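Splitting a version string into its major and minor parts can be sketched as follows. This is a hypothetical standalone class, not the processVersion implementation: it assumes the string has the form "major.minor" (for example "1.4", with any header prefix already removed) and it chooses to throw on anything else, whereas the documented method's handling of malformed input is not stated.

```java
/** Hypothetical value class: a parsed "major.minor" PDF version. */
final class PdfVersion {
    final int major;
    final int minor;

    PdfVersion(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    /** Parses strings such as "1.4"; throws IllegalArgumentException for any other shape. */
    static PdfVersion parse(String versionString) {
        String[] parts = versionString.trim().split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected major.minor but got: " + versionString);
        }
        try {
            return new PdfVersion(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("non-numeric version: " + versionString, e);
        }
    }
}
```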
-
getMajorVersion
public int getMajorVersion()
return the major version of the PDF header.
- Returns:
- the major version number
-
getMinorVersion
public int getMinorVersion()
return the minor version of the PDF header.
- Returns:
- the minor version number
-
getVersionString
return the version string from the PDF header.
- Returns:
- the version string
-
readDictionary
read an entire << dictionary >>. The initial << has already been read.
- Parameters:
objNum
- the object number of the object containing the dictionary being read; negative only if the object number is unavailable, which should only happen if we're reading a dictionary placed directly in the trailer
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
decrypter
- the decrypter to use
- Returns:
- the Dictionary as a PDFObject.
- Throws:
IOException
-
readHexDigit
read a character, and return its value as if it were a hexadecimal digit.
- Returns:
- a number between 0 and 15 whose value matches the next hexadecimal character. Returns -1 if the next character isn't in [0-9a-fA-F]
- Throws:
IOException
-
readHexPair
return the 8-bit value represented by the next two hex characters. If the next two characters don't represent a hex value, return -1 and reset the read head. If there is only one hex character, return its value as if there were an implicit 0 after it.
- Throws:
IOException
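The behavior described for readHexDigit and readHexPair can be sketched over a plain ByteBuffer. HexReaderSketch and its methods are hypothetical names, not the PDFFile source. Two points are assumptions of this sketch: when only one hex character is present, the following non-hex character is left unread, and no white space is skipped between the two digits.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of hex-digit and hex-pair reading over a ByteBuffer. */
final class HexReaderSketch {
    /** Returns 0-15 for a character in [0-9a-fA-F], or -1 otherwise. */
    static int hexValue(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Returns the byte encoded by the next two hex characters, or -1 with the position restored. */
    static int readHexPair(ByteBuffer buf) {
        int start = buf.position();
        int first = buf.hasRemaining() ? hexValue(buf.get() & 0xFF) : -1;
        if (first < 0) {
            buf.position(start); // not a hex value: reset the read head
            return -1;
        }
        int afterFirst = buf.position();
        int second = buf.hasRemaining() ? hexValue(buf.get() & 0xFF) : -1;
        if (second < 0) {
            buf.position(afterFirst); // assumption: leave the non-hex character unread
            return first << 4;        // a lone digit is followed by an implicit 0
        }
        return (first << 4) | second;
    }
}
```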
-
readHexString
read a < hex string >. The initial < has already been read.
- Parameters:
objNum
- the object number of the object containing the string being read; negative only if the object number is unavailable, which should only happen if we're reading a string placed directly in the trailer
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
decrypter
- the decrypter to use
- Throws:
IOException
-
unicode
take a string and determine whether it is unicode by looking at the lead characters; the string must also be a multiple of 2 chars long. Convert a unicoded string's characters into the true unicode.
- Parameters:
input
- the string to examine
- Returns:
- the converted string
-
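The detection and conversion described for unicode can be sketched as below. This is a hypothetical reconstruction, not the actual method: it assumes the "lead characters" are the UTF-16BE byte-order marker (0xFE 0xFF), that each char of the input holds one byte value, that the marker is dropped from the result, and that input failing either test is returned unchanged.

```java
/** Hypothetical sketch: converts a byte-per-char string carrying a UTF-16BE marker into real characters. */
final class UnicodeSketch {
    static String unicode(String input) {
        boolean hasMarker = input.length() >= 2
                && input.charAt(0) == 0xFE
                && input.charAt(1) == 0xFF;
        if (!hasMarker || input.length() % 2 != 0) {
            return input; // assumption: anything else is passed through untouched
        }
        StringBuilder out = new StringBuilder(input.length() / 2);
        for (int i = 2; i < input.length(); i += 2) {
            int high = input.charAt(i) & 0xFF;
            int low = input.charAt(i + 1) & 0xFF;
            out.append((char) ((high << 8) | low)); // combine two byte values into one UTF-16 code unit
        }
        return out.toString();
    }
}
```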
readLiteralString
private PDFObject readLiteralString(int objNum, int objGen, PDFDecrypter decrypter) throws IOException
read a ( character string ). The initial ( has already been read. Read until a *balanced* ) appears.
PDF Reference Section 3.8.1, Table 3.31 "PDF Data Types" defines String data as:
"text string Bytes that represent characters encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker (as defined in "Text String Type" on page 158.)"
Section 5.3.2 defines character sequences and escapes.
"The strings must conform to the syntax for string objects. When a string is written by enclosing the data in parentheses, bytes whose values are the same as those of the ASCII characters left parenthesis (40), right parenthesis (41), and backslash (92) must be preceded by a backslash character. All other byte values between 0 and 255 may be used in a string object.
These rules apply to each individual byte in a string object, whether the string is interpreted by the text-showing operators as single-byte or multiple-byte character codes."
This only reads 8 bit basic 'strings' so as to avoid a text string interpretation when one is not desired (e.g., for byte strings). For a text string interpretation of a string, use PDFStringUtil.asTextString(java.lang.String) or PDFObject.getTextStringValue()
- Parameters:
objNum
- the object number of the object containing the string being read; negative only if the object number is unavailable, which should only happen if we're reading a string placed directly in the trailer
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
decrypter
- the decrypter to use
- Throws:
IOException
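The "read until a balanced )" rule, together with the backslash rule quoted from Section 5.3.2, can be sketched as a depth-counting scan. LiteralStringSketch is a hypothetical class, not the readLiteralString implementation. It handles only backslash-escaped parentheses and backslashes; named escapes such as \n, octal escapes, and decryption are deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical sketch: scans a literal string body up to its balanced closing parenthesis. */
final class LiteralStringSketch {
    /** The opening '(' must already be consumed; returns the raw bytes before the matching ')'. */
    static byte[] readLiteralString(ByteBuffer buf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int depth = 1; // one '(' is already open
        while (buf.hasRemaining()) {
            int c = buf.get() & 0xFF;
            if (c == '\\' && buf.hasRemaining()) {
                c = buf.get() & 0xFF; // escaped byte: copied as-is, never counted for nesting
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return out.toByteArray();
            }
            out.write(c);
        }
        throw new IllegalStateException("unbalanced literal string");
    }
}
```

Returning raw bytes rather than a decoded String keeps the sketch in line with the note that a text string interpretation is applied separately.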
-
readLine
Read a line of text. This follows the semantics of readLine() in DataInput -- it reads character by character until a '\n' is encountered. If a '\r' is encountered, it is discarded.
-
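A line reader with these semantics can be sketched over a ByteBuffer. LineReaderSketch is a hypothetical name; the sketch assumes that a line feed ends the line, carriage returns are dropped, each byte maps to one character, and reaching the end of the buffer simply ends the line.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: reads one line, stopping at '\n' and discarding any '\r'. */
final class LineReaderSketch {
    static String readLine(ByteBuffer buf) {
        StringBuilder line = new StringBuilder();
        while (buf.hasRemaining()) {
            char c = (char) (buf.get() & 0xFF);
            if (c == '\n') {
                break; // the terminator is consumed but not returned
            }
            if (c != '\r') {
                line.append(c);
            }
        }
        return line.toString();
    }
}
```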
readArray
read an [ array ]. The initial [ has already been read. PDFObjects are read until ].
- Parameters:
objNum
- the object number of the object containing the array being read; negative only if the object number is unavailable, which should only happen if we're reading an array placed directly in the trailer
objGen
- the object generation of the object containing the object being read; negative only if the objNum is unavailable
decrypter
- the decrypter to use
- Throws:
IOException
-
readName
read a /name. The / has already been read.
- Throws:
IOException
-
readNumber
read a number. The initial digit or . or - is passed in as the argument.
- Throws:
IOException
-
readKeyword
read a bare keyword. The initial character is passed in as the argument.
- Throws:
IOException
-
readObjectDescription
private PDFObject readObjectDescription(int objNum, int objGen, PDFDecrypter decrypter) throws IOException
read an entire PDFObject. The intro line, which looks something like "4 0 obj", has already been read.
- Parameters:
objNum
- the object number of the object being read, being the first number in the intro line (4 in "4 0 obj")
objGen
- the object generation of the object being read, being the second number in the intro line (0 in "4 0 obj").
decrypter
- the decrypter to use
- Throws:
IOException
-
readStream
read the stream portion of a PDFObject. Calls decodeStream to un-filter the stream as necessary.
- Parameters:
dict
- the dictionary associated with this stream.
- Returns:
- a ByteBuffer with the encoded stream data
- Throws:
IOException
-
readTrailer
private void readTrailer(PDFPassword password) throws IOException, PDFAuthenticationFailureException, EncryptionUnsupportedByProductException, EncryptionUnsupportedByPlatformException
read the cross reference table from a PDF file. When this method is called, the file pointer must point to the start of the word "xref" in the file. Reads the xref table and the trailer dictionary. If the dictionary has a /Prev entry, move the file pointer and read the new trailer.
- Parameters:
password
- the user or owner password
- Throws:
IOException
PDFAuthenticationFailureException
EncryptionUnsupportedByProductException
EncryptionUnsupportedByPlatformException
-
parseFile
build the PDFFile reference table. Nothing in the PDFFile actually gets parsed, despite the name of this function. Things only get read and parsed when they're needed.
- Parameters:
password
- the user or owner password
- Throws:
IOException
-
getOutline
Gets the outline tree as a tree of OutlineNode, which is a subclass of DefaultMutableTreeNode. If there is no outline tree, this method returns null.
- Throws:
IOException
-
getPageNumber
Gets the page number (starting from 1) of the page represented by a particular PDFObject. The PDFObject must be a Page dictionary or a destination description (or an action).
- Returns:
- a number between 1 and the number of pages indicating the page number, or 0 if the PDFObject is not in the page tree.
- Throws:
IOException
-
getPage
Get the page commands for a given page in a separate thread.
- Parameters:
pagenum
- the number of the page to get commands for
-
getPage
Get the page commands for a given page.
- Parameters:
pagenum
- the number of the page to get commands for
wait
- if true, do not exit until the page is complete.
-
stop
public void stop(int pageNum)
Stop the rendering of a particular image on this page
-
getContents
get the stream representing the content of a particular page.
- Parameters:
pageObj
- the page object to get the contents of
- Returns:
- a concatenation of any content streams for the requested page.
- Throws:
IOException
-
createPage
Create a PDF Page object by finding the relevant inherited properties
- Parameters:
pageObj
- the PDF object for the page to be created
- Throws:
IOException
-
findPage
private PDFObject findPage(PDFObject pagedict, int start, int getPage, Map<String, PDFObject> resources) throws IOException
Get the PDFObject representing the content of a particular page. Note that the number of the page need not have anything to do with the label on that page. If there are two blank pages, and then roman numerals for the page number, then passing in 6 will get page (iv).
- Parameters:
pagedict
- the top of the pages tree
start
- the page number of the first page in this dictionary
getPage
- the number of the page to find; NOT the page's label.
resources
- a HashMap that will be filled with any resource definitions encountered on the search for the page
- Throws:
IOException
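The roles of pagedict, start and getPage can be illustrated with a recursive walk over a simplified page tree. Everything here is a hypothetical model rather than the findPage implementation: nodes are plain maps, intermediate nodes are assumed to carry a "Kids" list and a "Count" of descendant pages as in the PDF page tree, and the resources map is omitted.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: locates the n-th page in a tree of map-based nodes. */
final class PageTreeSketch {
    @SuppressWarnings("unchecked")
    static Map<String, Object> findPage(Map<String, Object> node, int start, int getPage) {
        List<Map<String, Object>> kids = (List<Map<String, Object>>) node.get("Kids");
        if (kids == null) {
            return start == getPage ? node : null; // leaf: a single page numbered 'start'
        }
        for (Map<String, Object> kid : kids) {
            // A subtree covers "Count" pages; a leaf page covers exactly one.
            int count = kid.containsKey("Kids") ? (Integer) kid.get("Count") : 1;
            if (getPage < start + count) {
                return findPage(kid, start, getPage);
            }
            start += count; // skip this kid's pages and keep numbering
        }
        return null; // page number is beyond this subtree
    }
}
```

Skipping whole subtrees by their page count avoids visiting every page dictionary before the requested one.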
-
getInheritedValue
Find a property value in a page that may be inherited. If the value is not defined in the page itself, follow the page's "parent" links until the value is found or the top of the tree is reached.
- Parameters:
pageObj
- the object representing the page
propName
- the name of the property we are looking for
- Throws:
IOException
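The parent-following lookup can be sketched with map-based nodes. InheritanceSketch is a hypothetical stand-in for the PDFObject-based method: it assumes the parent link is stored under the key "Parent", returns null when no ancestor defines the property, and does not guard against cyclic parent links.

```java
import java.util.Map;

/** Hypothetical sketch: resolves a property by walking up "Parent" links. */
final class InheritanceSketch {
    @SuppressWarnings("unchecked")
    static Object getInheritedValue(Map<String, Object> pageObj, String propName) {
        Map<String, Object> node = pageObj;
        while (node != null) {
            Object value = node.get(propName);
            if (value != null) {
                return value; // nearest definition wins
            }
            node = (Map<String, Object>) node.get("Parent"); // null at the top of the tree
        }
        return null;
    }
}
```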
-
parseRect
get a Rectangle2D.Float representation for a PDFObject that is an array of four Numbers.
- Parameters:
obj
- a PDFObject that represents an Array of exactly four Numbers.
- Throws:
IOException
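Turning four numbers into a Rectangle2D.Float can be sketched as below. RectSketch is a hypothetical class working on a float array instead of a PDFObject. It assumes the four numbers are two opposite corners, as in PDF rectangle arrays, and it normalizes them so that width and height are non-negative; whether parseRect performs the same normalization is not stated here.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical sketch: builds a Rectangle2D.Float from two opposite corners. */
final class RectSketch {
    static Rectangle2D.Float parseRect(float[] values) {
        if (values.length != 4) {
            throw new IllegalArgumentException("expected exactly four numbers");
        }
        float x = Math.min(values[0], values[2]);
        float y = Math.min(values[1], values[3]);
        float width = Math.abs(values[2] - values[0]);
        float height = Math.abs(values[3] - values[1]);
        return new Rectangle2D.Float(x, y, width, height);
    }
}
```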
-
getDefaultDecrypter
Get the default decrypter for the document
- Returns:
- the default decrypter; never null, even for documents that aren't encrypted
-