Class PDFStringUtil
java.lang.Object
    com.sun.pdfview.PDFStringUtil
public class PDFStringUtil extends java.lang.Object
Utility methods for dealing with PDF Strings, such as:
- converting to text strings
- converting to PDFDocEncoded strings
- converting to UTF-16BE strings
- converting basic strings between byte and string representations
We refer to basic strings as those corresponding to the PDF 'string' type. PDFRenderer represents these as Strings, though this is somewhat deceptive: they are effectively just sequences of bytes, although byte values <= 127 do correspond to the ASCII character set. Beyond that, the 'string' type, as represented by basic strings, has no character set or encoding, and byte values >= 128 are entirely acceptable. In a basic string represented by a String, each character has a value less than 256, as if the underlying bytes had been decoded as ISO-8859-1. This, however, is merely for convenience.
For strings that are user visible, and that don't merely represent some identifying token, the PDF standard employs a 'text string' type, in which the basic string holds an encoding of the text in either UTF-16BE (with a byte order mark) or a specific 8-bit encoding, PDFDocEncoding. Using a basic string without conversion when the actual type is a 'text string' is erroneous (though without consequence if the string consists only of ASCII alphanumeric values). Care must be taken to convert basic strings to text strings (also expressed as a String) when appropriate, using either the methods in this class or PDFObject.getTextStringValue(). For strings that are 'byte strings', asBytes(String) or PDFObject.getStream() should be used.
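The "as if decoded as ISO-8859-1" convention described above can be illustrated with the JDK alone. The following is a minimal sketch, not the PDFRenderer implementation: the class and method names are hypothetical, and it assumes only that a basic string holds one char (0 to 255) per original byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of com.sun.pdfview.
final class BasicStringRoundTrip {

    /** Views raw PDF string bytes as a basic string: one char (0..255) per byte. */
    static String toBasicString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes from a basic string. */
    static byte[] toBytes(String basicString) {
        return basicString.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Bytes >= 128 are legal in a PDF 'string' and must survive unchanged.
        byte[] raw = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        String basic = toBasicString(raw);
        System.out.println(Arrays.equals(raw, toBytes(basic)));
    }
}
```

Because ISO-8859-1 maps every byte value to the code point of the same number, the round trip is lossless, which is what makes a String usable as a byte container here.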
Field Summary
Modifier and Type | Field | Description
(package private) static char[] | PDF_DOC_ENCODING_MAP | Maps from PDFDocEncoding bytes to Unicode characters.
-
Constructor Summary
Constructor
PDFStringUtil()
-
Method Summary
Modifier and Type | Method | Description
static java.lang.String | asBasicString(byte[] bytes) | Create a basic string from bytes.
static java.lang.String | asBasicString(byte[] bytes, int offset, int length) | Create a basic string from bytes.
static byte[] | asBytes(java.lang.String basicString) | Get the corresponding byte array for a basic string.
static java.lang.String | asPDFDocEncoded(java.lang.String basicString) | Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding.
static java.lang.String | asTextString(java.lang.String basicString) | Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM).
static java.lang.String | asUTF16BEEncoded(java.lang.String basicString) | Take a basic PDF string and produce a string from its bytes as an UTF16-BE encoding.
byte[] | toPDFDocEncoded(java.lang.String string) |
Method Detail
-
asTextString
public static java.lang.String asTextString(java.lang.String basicString)
Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). If it appears to be UTF-16BE, we return the string representation of the UTF-16BE encoding of those bytes. If the BOM is not present, the bytes from the input string are decoded using the PDFDocEncoding charset.
From the PDF Reference 1.7, p158:
The text string type is used for character strings that are encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Appendix D. UTF-16BE can encode all Unicode characters. UTF-16BE and Unicode character encoding are described in the Unicode Standard by the Unicode Consortium (see the Bibliography). Note that PDFDocEncoding does not support all Unicode characters whereas UTF-16BE does.
- Parameters:
basicString - the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the input decoded as UTF-16BE if the BOM is present; otherwise the input decoded as PDFDocEncoding
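The BOM check described for asTextString can be sketched with the JDK alone. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, the UTF-16BE branch assumes the two BOM bytes are dropped before decoding, and the non-BOM branch is a placeholder that returns the input unchanged because the PDFDocEncoding table is not reproduced here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of com.sun.pdfview.
final class TextStringSketch {

    /** True if the basic string starts with the UTF-16BE byte order mark 0xFE 0xFF. */
    static boolean hasUtf16BeBom(String basicString) {
        return basicString.length() >= 2
                && basicString.charAt(0) == 0xFE
                && basicString.charAt(1) == 0xFF;
    }

    /** Dispatches on the BOM, as the asTextString description outlines. */
    static String asTextStringSketch(String basicString) {
        if (hasUtf16BeBom(basicString)) {
            // Recover the raw bytes, then decode everything after the BOM.
            byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Placeholder: a real implementation would decode via PDFDocEncoding here.
        return basicString;
    }
}
```

For example, the basic string holding the bytes FE FF 00 48 00 69 decodes to "Hi", while a string with no BOM passes through the placeholder branch untouched.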
-
asPDFDocEncoded
public static java.lang.String asPDFDocEncoded(java.lang.String basicString)
Take a basic PDF string and produce a string by decoding its bytes as PDFDocEncoding. The PDFDocEncoding is described in the PDF Reference.
- Parameters:
basicString - the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the decoding of the string's bytes in PDFDocEncoding
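A table-driven decode of this kind might look like the sketch below. It is not the library's code: the class name and the PARTIAL_MAP table are hypothetical stand-ins for PDF_DOC_ENCODING_MAP, and only two entries (0x80 as the bullet, 0xA0 as the euro sign) are filled in from the PDFDocEncoding table. Every other byte is mapped to the code point of the same value as a placeholder, which is not correct for all of PDFDocEncoding.

```java
// Hypothetical illustration class; not part of com.sun.pdfview.
final class PdfDocEncodingSketch {

    // Deliberately incomplete stand-in for a full 256-entry PDFDocEncoding table.
    private static final char[] PARTIAL_MAP = buildPartialMap();

    private static char[] buildPartialMap() {
        char[] map = new char[256];
        for (int i = 0; i < map.length; i++) {
            map[i] = (char) i;       // placeholder: same-valued code point
        }
        map[0x80] = '\u2022';        // bullet
        map[0xA0] = '\u20AC';        // euro sign
        return map;
    }

    /** Maps each byte-valued char of a basic string through the lookup table. */
    static String decode(String basicString) {
        StringBuilder out = new StringBuilder(basicString.length());
        for (int i = 0; i < basicString.length(); i++) {
            out.append(PARTIAL_MAP[basicString.charAt(i) & 0xFF]);
        }
        return out.toString();
    }
}
```

A lookup array indexed by byte value keeps the decode to one array read per character, which is presumably why the class exposes its mapping as a char[].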
-
toPDFDocEncoded
public byte[] toPDFDocEncoded(java.lang.String string) throws java.nio.charset.CharacterCodingException
- Throws:
java.nio.charset.CharacterCodingException
-
asUTF16BEEncoded
public static java.lang.String asUTF16BEEncoded(java.lang.String basicString)
Take a basic PDF string and produce a string by decoding its bytes as UTF-16BE. The first 2 bytes are presumed to be the big-endian byte order mark, 0xFE and 0xFF; this method does not check for them.
- Parameters:
basicString - the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the decoding of the string's bytes in UTF-16BE
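The consequence of not checking the leading bytes is easiest to see in code. The sketch below uses only the JDK and a hypothetical class name; it assumes the first two bytes are skipped rather than decoded, which the description implies but does not state outright.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of com.sun.pdfview.
final class Utf16BeSketch {

    /**
     * Decodes a basic string as UTF-16BE, unconditionally dropping the first
     * two bytes. Inputs shorter than two bytes throw an index exception.
     */
    static String decodeSkippingBom(String basicString) {
        byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }
}
```

With a BOM present, the bytes FE FF D8 3D DE 00 decode to the single supplementary character U+1F600, expressed in Java as a surrogate pair. Without a BOM, the first two bytes are lost all the same: 00 41 00 42 yields "B" rather than "AB", so callers should check for the BOM first or go through asTextString.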
-
asBytes
public static byte[] asBytes(java.lang.String basicString)
Get the corresponding byte array for a basic string. This is effectively the char[] array cast to byte[], as chars in basic strings only use the least significant byte.
- Parameters:
basicString - the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the bytes corresponding to its characters
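The "char array cast to bytes" description can be sketched as a per-character narrowing cast. The class and method names below are hypothetical and the code is an illustration rather than the library's implementation.

```java
// Hypothetical illustration class; not part of com.sun.pdfview.
final class AsBytesSketch {

    /** Keeps only the least significant byte of each char. */
    static byte[] toBytes(String basicString) {
        byte[] out = new byte[basicString.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) basicString.charAt(i); // narrowing cast drops the high byte
        }
        return out;
    }
}
```

The cast is lossless only for genuine basic strings. Passing an already decoded text string silently truncates it: U+0141 becomes the byte 0x41.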
-
asBasicString
public static java.lang.String asBasicString(byte[] bytes, int offset, int length)
Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
- Parameters:
bytes - the source of the bytes for the basic string
offset - the offset into bytes where the string starts
length - the number of bytes to turn into a string
- Returns:
- the corresponding string
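The reverse direction, from a slice of bytes to a basic string, can be sketched as follows. The names are hypothetical and this is not the library's code. One detail is worth spelling out: Java bytes are signed, so the sketch masks each byte with 0xFF before widening. A bare cast of a byte such as 0xE9 would sign-extend to U+FFE9 instead of U+00E9.

```java
// Hypothetical illustration class; not part of com.sun.pdfview.
final class AsBasicStringSketch {

    /** Builds a basic string from bytes[offset .. offset + length - 1]. */
    static String fromBytes(byte[] bytes, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) (bytes[offset + i] & 0xFF); // mask to 0..255 before widening
        }
        return new String(chars);
    }
}
```

Calling fromBytes(bytes, 0, bytes.length) corresponds to the single-argument overload, which uses every byte.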
-
asBasicString
public static java.lang.String asBasicString(byte[] bytes)
Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
- Parameters:
bytes - the bytes, all of which are used
- Returns:
- the corresponding string