Package com.lowagie.text.pdf.parser
Class ParsedText
java.lang.Object
com.lowagie.text.pdf.parser.ParsedTextImpl
com.lowagie.text.pdf.parser.ParsedText
- All Implemented Interfaces:
TextAssemblyBuffer
-
Field Summary
Fields
Modifier and Type | Field | Description
private final GraphicsState | graphicsState |
private PdfString | pdfText | Retains the original PdfString, because we need to distinguish between the code points it contains and the standard Java (Unicode) strings that actually represent the content of this text.
private final Matrix | textToUserSpaceTransformMatrix |
-
Constructor Summary
Constructors
Modifier | Constructor | Description
(package private) | ParsedText(PdfString text, GraphicsState graphicsState, Matrix textMatrix) | This constructor should only be called when the origin for text display is at (0,0) and the graphical state reflects all transformations of the baseline.
private | ParsedText(PdfString text, GraphicsState graphicsState, Matrix textMatrix, float unscaledWidth) | Internal constructor for a parsed text item.
-
Method Summary
Modifier and Type | Method | Description
void | accumulate(TextAssembler textAssembler, String contextName) | We pass ourselves to the assembler, which is a visitor, so that it can accumulate information on this text depending on its type.
void | assemble(TextAssembler textAssembler) |
boolean | breakBefore() |
private static float | convertHeightToUser(float height, Matrix textToUserSpaceTransformMatrix) |
private static float | convertWidthToUser(float width, Matrix textToUserSpaceTransformMatrix) |
private Word | createWord(StringBuffer wordAccum, float wordStartOffset, float wordEndOffset, Vector baseline, boolean wordsAreComplete, boolean currentBreakBefore) | Create a word to represent a broken substring at a space.
protected String | decode(PdfString pdfString) | Decodes a PdfString (which will contain glyph ids encoded in the font's encoding) based on the active font, and determines the Unicode equivalent.
protected String | decode(String in) | Decodes a Java String containing glyph ids encoded in the font's encoding, and determines the Unicode equivalent.
private static float | distance(Vector startPos, Vector endPos) |
List<Word> | getAsPartialWords() | Break this string if there are spaces within it.
FinalText | getFinalText(PdfReader reader, int page, TextAssembler assembler, boolean useMarkup) |
String | getFontCodes() |
private static float | getStringWidth(String string, GraphicsState graphicsState) | Gets the width of a String in text space units.
String | getText() | When returning the text from this item, we need to decode the code points we have.
private static float | getUnscaledFontSpaceWidth(GraphicsState graphicsState) | Calculates the width of a space character.
float | getUnscaledTextWidth(GraphicsState gs) |
private static Vector | pointToUserSpace(float xOffset, float yOffset, Matrix textToUserSpaceTransformMatrix) |
private boolean | preprocessString(char[] chars, boolean[] hasSpace) | Calculate whether individual character positions (after font decoding from code to a character) contain spaces and break words, and whether the resulting words should be treated as complete (i.e. if any spaces were found).
boolean | shouldNotSplit() |
String | toString() |
Methods inherited from class com.lowagie.text.pdf.parser.ParsedTextImpl
getAscent, getBaseline, getDescent, getEndPoint, getSingleSpaceWidth, getStartPoint, getWidth
-
Field Details
-
textToUserSpaceTransformMatrix
-
graphicsState
-
pdfText
Retains the original PdfString, because we need to distinguish between the code points it contains and the standard Java (Unicode) strings that actually represent the content of this text.
-
-
Constructor Details
-
ParsedText
ParsedText(PdfString text, GraphicsState graphicsState, Matrix textMatrix)
This constructor should only be called when the origin for text display is at (0,0) and the graphical state reflects all transformations of the baseline. This is in text space units.
- Parameters:
text
- string
graphicsState
- graphical state
textMatrix
- transform from text space to graphics (drawing space)
-
ParsedText
private ParsedText(PdfString text, GraphicsState graphicsState, Matrix textMatrix, float unscaledWidth)
Internal constructor for a parsed text item. The constructors that call it gather some information from the graphical state first.
- Parameters:
text
- a PdfString containing code points for the current font, not actual characters. If the font has multi-byte glyphs (Identity-H encoding), we reparse the string so that the code points don't get split into multiple characters.
graphicsState
- graphical state
textMatrix
- transform from text space to graphics (drawing space)
unscaledWidth
- width of the space character in the font.
-
-
Method Details
-
pointToUserSpace
private static Vector pointToUserSpace(float xOffset, float yOffset, Matrix textToUserSpaceTransformMatrix)
- Parameters:
xOffset
- offset in x direction
yOffset
- offset in y direction
textToUserSpaceTransformMatrix
- transform from text space to graphics (drawing space)
- Returns:
- the cross product of the offset and the textToUserSpaceTransformMatrix
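As a rough illustration of what transforming an offset into user space involves, the sketch below multiplies the homogeneous row vector (xOffset, yOffset, 1) by a 3x3 affine matrix laid out in the PDF convention. The `AffineSketch` class and its `float[]` storage are assumptions made for this example; they are not the library's `Matrix` or `Vector` types.

```java
/** Hypothetical stand-in for a text-space-to-user-space transform (not the library's Matrix). */
final class AffineSketch {
    // Row-major 3x3 matrix in the PDF convention: [a b 0; c d 0; e f 1].
    private final float[] m;

    AffineSketch(float a, float b, float c, float d, float e, float f) {
        this.m = new float[] {a, b, 0f, c, d, 0f, e, f, 1f};
    }

    /** Multiplies the row vector (x, y, 1) by the matrix and returns {x', y'}. */
    float[] transform(float x, float y) {
        float outX = x * m[0] + y * m[3] + m[6];
        float outY = x * m[1] + y * m[4] + m[7];
        return new float[] {outX, outY};
    }
}
```

With a matrix that scales by 2 and translates by (10, 20), the offset (3, 4) maps to (16, 28).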
-
getUnscaledFontSpaceWidth
private static float getUnscaledFontSpaceWidth(GraphicsState graphicsState)
Calculates the width of a space character. If the font does not define a width for a standard space character (U+0020), we also attempt to use the width of U+00A0 (a non-breaking space in many fonts).
- Parameters:
graphicsState
- graphic state including current transformation to page coordinates from text measurement
- Returns:
- the width of a single space character in text space units
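The fallback from the ordinary space to the non-breaking space can be pictured with a small lookup. This is only a sketch: the `SpaceWidthSketch` class and the `Map` of glyph widths are hypothetical, and treating a zero width as "not defined" is an assumption of the example; the real method reads its metrics from the font in the graphics state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical width lookup; real font metrics come from the active font. */
final class SpaceWidthSketch {
    /** Returns the width of U+0020 if present and non-zero, otherwise the width of U+00A0, else 0. */
    static float spaceWidth(Map<Character, Float> widths) {
        float w = widths.getOrDefault(' ', 0f);
        if (w == 0f) {
            // Assumption: a missing or zero width means the font has no usable space glyph.
            w = widths.getOrDefault('\u00A0', 0f);
        }
        return w;
    }
}
```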
-
getStringWidth
private static float getStringWidth(String string, GraphicsState graphicsState)
Gets the width of a String in text space units.
- Parameters:
string
- the string that needs measuring
graphicsState
- graphic state including current transformation to page coordinates from text measurement
- Returns:
- the width of a String in text space units
-
convertWidthToUser
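private static float convertWidthToUser(float width, Matrix textToUserSpaceTransformMatrix)
- Parameters:
width
- the width that should be converted to user space
textToUserSpaceTransformMatrix
- transform from text space to graphics (drawing space)
- Returns:
- distance between start and end position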
-
distance
- Parameters:
startPos
- start position of the vector
endPos
- end position of the vector
- Returns:
- (endPos - startPos).length
-
convertHeightToUser
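private static float convertHeightToUser(float height, Matrix textToUserSpaceTransformMatrix)
- Parameters:
height
- the height that should be converted to user space
textToUserSpaceTransformMatrix
- transform from text space to graphics (drawing space)
- Returns:
- distance between start and end position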
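Both conversions return a "distance between start and end position", which suggests that a segment is transformed at both ends and then measured. The sketch below shows that idea under stated assumptions: the `WidthToUserSketch` class, the six-element `abcdef` array (the PDF matrix [a b 0; c d 0; e f 1]), and the choice of (0,0) as the segment start are all hypothetical and may differ from the library's private helpers.

```java
/** Hypothetical sketch of measuring a text-space length in user space (not the library's code). */
final class WidthToUserSketch {
    /** Applies the PDF-style matrix [a b 0; c d 0; e f 1] to the point (x, y). */
    static float[] toUser(float x, float y, float[] abcdef) {
        return new float[] {
            x * abcdef[0] + y * abcdef[2] + abcdef[4],
            x * abcdef[1] + y * abcdef[3] + abcdef[5]
        };
    }

    /** Length of (end - start). */
    static float distance(float[] start, float[] end) {
        return (float) Math.hypot(end[0] - start[0], end[1] - start[1]);
    }

    /** Transforms both ends of a horizontal segment of the given width and measures it. */
    static float convertWidthToUser(float width, float[] abcdef) {
        return distance(toUser(0f, 0f, abcdef), toUser(width, 0f, abcdef));
    }

    /** Same idea for a vertical segment of the given height. */
    static float convertHeightToUser(float height, float[] abcdef) {
        return distance(toUser(0f, 0f, abcdef), toUser(0f, height, abcdef));
    }
}
```

Measuring the transformed segment rather than multiplying by a single scale factor keeps the result correct when the matrix also rotates or skews the text.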
-
decode
Decodes a Java String containing glyph ids encoded in the font's encoding, and determines the Unicode equivalent.
- Parameters:
in
- the String that needs to be decoded
- Returns:
- the decoded String
-
decode
Decodes a PdfString (which will contain glyph ids encoded in the font's encoding) based on the active font, and determines the Unicode equivalent.
- Parameters:
pdfString
- the PdfString that needs to be decoded
- Returns:
- the decoded String
- Since:
- 2.1.7
-
getAsPartialWords
Break this string if there are spaces within it. If so, we mark the new Words appropriately for later assembly.
We are guaranteed that every space (internal word break) in this parsed text object will create a new word in the result of this method. We are not guaranteed that these Word objects are actually words until they have been assembled.
The word following any space preserves that space in its string value, so that the assembler will not erroneously merge words that should be separate, regardless of the spacing.
- Returns:
- list of Word objects.
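The splitting rule described above (every space starts a new fragment, and that fragment keeps the space) can be sketched on plain strings. This is an illustration only: the hypothetical `PartialWordsSketch` returns `String` fragments, whereas the real method produces Word objects that also carry offsets and a baseline.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical string-only illustration of the splitting rule. */
final class PartialWordsSketch {
    /** Splits text so that each space begins a new fragment and stays attached to it. */
    static List<String> split(String text) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' && current.length() > 0) {
                // A space closes the fragment collected so far.
                fragments.add(current.toString());
                current.setLength(0);
            }
            current.append(c); // the space itself leads the next fragment
        }
        if (current.length() > 0) {
            fragments.add(current.toString());
        }
        return fragments;
    }
}
```

For the input "ab cd" this yields the fragments "ab" and " cd", so a later assembly step can tell from the leading space that the two must not be merged.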
-
preprocessString
private boolean preprocessString(char[] chars, boolean[] hasSpace)
Calculate whether individual character positions (after font decoding from code to a character) contain spaces and break words, and whether the resulting words should be treated as complete (i.e. if any spaces were found).
- Parameters:
chars
- to check
hasSpace
- array same length as chars, each position representing whether it breaks a word
- Returns:
- true if any spaces were found.
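A minimal sketch of the flagging step follows. It assumes the characters have already been decoded to Unicode and uses `Character.isSpaceChar` as the space test; both are assumptions of this example, and the hypothetical `SpaceFlagsSketch` omits the font decoding that the real method performs per character.

```java
/** Hypothetical illustration of marking word-breaking positions. */
final class SpaceFlagsSketch {
    /** Sets hasSpace[i] where chars[i] is a space character; returns true if any space was found. */
    static boolean markSpaces(char[] chars, boolean[] hasSpace) {
        boolean anySpace = false;
        for (int i = 0; i < chars.length; i++) {
            hasSpace[i] = Character.isSpaceChar(chars[i]);
            anySpace |= hasSpace[i];
        }
        return anySpace;
    }
}
```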
-
createWord
private Word createWord(StringBuffer wordAccum, float wordStartOffset, float wordEndOffset, Vector baseline, boolean wordsAreComplete, boolean currentBreakBefore)
Create a word to represent a broken substring at a space. As spaces have zero "word length", make sure that they also have a baseline to check.
- Parameters:
wordAccum
- buffer of characters
wordStartOffset
- initial x-offset
wordEndOffset
- ending x-offset
baseline
- baseline of this word, so direction of progress can be measured in line ending determination.
wordsAreComplete
- true means characters in this word won't be split apart graphically
currentBreakBefore
- true if this word fragment represents a word boundary, and any preceding fragment is complete.
- Returns:
- the new word
-
getUnscaledTextWidth
- Parameters:
gs
- graphic state including current transformation to page coordinates from text measurement
- Returns:
- the unscaled (i.e. in text space) width of our text
-
accumulate
We pass ourselves to the assembler, which is a visitor, so that it can accumulate information on this text depending on its type. The result is calculated by a final "assembly" phase, after accumulation is done. This is because we may have non-contiguous items in a PDF text stream.
- Parameters:
textAssembler
- the assembler that is visiting us.
contextName
- name of the surrounding markup element/"context" if we're generating tagged output.
-
assemble
- Parameters:
textAssembler
- we may pass ourselves to this assembler again during the final assembly process.
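The two-phase visitor arrangement behind accumulate and assemble can be sketched as follows. Every type here (`AssemblerSketch`, `TextItemSketch`, `OrderingAssemblerSketch`) is hypothetical, and ordering items by x position is merely a stand-in for whatever the real TextAssembler does during final assembly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical visitor interface; the library's TextAssembler has a richer API. */
interface AssemblerSketch {
    void process(TextItemSketch item, String contextName);
    String finish();
}

/** Hypothetical text item that hands itself to the visiting assembler. */
final class TextItemSketch {
    final String text;
    final float x;

    TextItemSketch(String text, float x) {
        this.text = text;
        this.x = x;
    }

    void accumulate(AssemblerSketch assembler, String contextName) {
        assembler.process(this, contextName);
    }
}

/** Collects items first, then orders and joins them in a final assembly phase. */
final class OrderingAssemblerSketch implements AssemblerSketch {
    private final List<TextItemSketch> items = new ArrayList<>();

    @Override
    public void process(TextItemSketch item, String contextName) {
        items.add(item); // accumulation only; contextName is ignored in this sketch
    }

    @Override
    public String finish() {
        // Items may arrive out of reading order, so sort before joining.
        items.sort((a, b) -> Float.compare(a.x, b.x));
        StringBuilder out = new StringBuilder();
        for (TextItemSketch item : items) {
            out.append(item.text);
        }
        return out.toString();
    }
}
```

Deferring output until `finish()` is what lets the assembler cope with items that are not contiguous in the content stream.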
-
getText
When returning the text from this item, we need to decode the code points we have.
- Specified by:
getText
in interface TextAssemblyBuffer
- Overrides:
getText
in class ParsedTextImpl
- Returns:
- the text to render
-
getFontCodes
- Returns:
- a string whose characters represent code points in a possibly two-byte font
-
getFinalText
public FinalText getFinalText(PdfReader reader, int page, TextAssembler assembler, boolean useMarkup)
- Parameters:
reader
- PdfReader that knows about our document (size, etc. available here).
page
- the page we are extracting text from.
assembler
- builds the result by accepting content from text components of various sorts.
useMarkup
- whether to generate tagged text or just plain text.
- Returns:
- the final text ready to concatenate into the result string.
-
toString
-
shouldNotSplit
public boolean shouldNotSplit()
- Specified by:
shouldNotSplit
in class ParsedTextImpl
- Returns:
- true if this was extracted from a string containing spaces, in which case we assume further splitting is not needed.
-
breakBefore
public boolean breakBefore()
- Specified by:
breakBefore
in class ParsedTextImpl
- Returns:
- a boolean value
-