Package com.itextpdf.text.pdf.parser
Class LocationTextExtractionStrategy
java.lang.Object
com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy
- All Implemented Interfaces:
RenderListener, TextExtractionStrategy
Development preview - this class (and all of the parser classes) is still under heavy development,
and both its behavior and its interface are subject to change.
A text extraction renderer that keeps track of the relative position of text on the page. The resulting text will be relatively consistent with the physical layout that most PDF files have on screen.
This renderer keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation. Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance, but different parallel distance is treated as being on the same line.
This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
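The ordering rule described above can be illustrated with a minimal, self-contained sketch. It does not use any iText types: `ChunkSketch`, `LayoutOrderSketch`, and the field names `orientation`, `perpendicular`, and `parallelStart` are hypothetical names chosen for this illustration, and the quantization of positions into integers is an assumption rather than something this page specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a positioned piece of text; not an iText class. */
final class ChunkSketch {
    final String text;
    final int orientation;     // assumed: quantized direction of the text baseline
    final int perpendicular;   // assumed: quantized distance across the baseline (identifies the line)
    final float parallelStart; // distance along the baseline at which the chunk starts

    ChunkSketch(String text, int orientation, int perpendicular, float parallelStart) {
        this.text = text;
        this.orientation = orientation;
        this.perpendicular = perpendicular;
        this.parallelStart = parallelStart;
    }
}

final class LayoutOrderSketch {
    /** Orders by orientation, then perpendicular distance, then parallel distance. */
    static final Comparator<ChunkSketch> LAYOUT_ORDER =
            Comparator.<ChunkSketch>comparingInt(c -> c.orientation)
                    .thenComparingInt(c -> c.perpendicular)
                    .thenComparingDouble(c -> c.parallelStart);

    /** Chunks with the same orientation and perpendicular distance share a line. */
    static boolean sameLine(ChunkSketch a, ChunkSketch b) {
        return a.orientation == b.orientation && a.perpendicular == b.perpendicular;
    }

    /** Sorts a copy of the chunks and joins them, breaking the line whenever the line changes. */
    static String layout(List<ChunkSketch> chunks) {
        List<ChunkSketch> sorted = new ArrayList<>(chunks);
        sorted.sort(LAYOUT_ORDER);
        StringBuilder out = new StringBuilder();
        ChunkSketch previous = null;
        for (ChunkSketch c : sorted) {
            if (previous != null && !sameLine(previous, c)) {
                out.append('\n');
            }
            out.append(c.text);
            previous = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunks arrive in content-stream order, which need not match reading order.
        List<ChunkSketch> chunks = Arrays.asList(
                new ChunkSketch("world", 0, 10, 50f),
                new ChunkSketch("Second", 0, 20, 0f),
                new ChunkSketch("Hello ", 0, 10, 0f));
        System.out.println(layout(chunks));
    }
}
```

The sketch leaves out space insertion between chunks on the same line; that decision is a separate step based on font metrics.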
- Since:
- 5.0.2
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
static class
- Represents a chunk of text, its orientation, and its location relative to the orientation vector
static interface
- Specifies a filter for filtering LocationTextExtractionStrategy.TextChunk objects during text extraction
static interface
static class
static interface
-
Field Summary
Fields
Modifier and Type, Field, Description
(package private) static boolean DUMP_STATE
- set to true for debugging
private final List<LocationTextExtractionStrategy.TextChunk> locationalResult
- a summary of all found text
private final LocationTextExtractionStrategy.TextChunkLocationStrategy tclStrat
-
Constructor Summary
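Constructors
Constructor, Description
LocationTextExtractionStrategy()
- Creates a new text extraction renderer.
LocationTextExtractionStrategy(LocationTextExtractionStrategy.TextChunkLocationStrategy strat)
- Creates a new text extraction renderer, with a custom strategy for creating new TextChunkLocation objects based on the input of the TextRenderInfo.
-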
Method Summary
Modifier and Type, Method, Description
void beginTextBlock()
- Called when a new text block is beginning (i.e. BT)
private static int compareInts(int int1, int int2)
private void dumpState()
- Used for debugging only
private boolean endsWithSpace(String str)
void endTextBlock()
- Called when a text block has ended (i.e. ET)
private List<LocationTextExtractionStrategy.TextChunk> filterTextChunks(List<LocationTextExtractionStrategy.TextChunk> textChunks, LocationTextExtractionStrategy.TextChunkFilter filter)
- Filters the provided list with the provided filter
String getResultantText()
- Returns the result so far.
String getResultantText(LocationTextExtractionStrategy.TextChunkFilter chunkFilter)
- Gets text that meets the specified filter.
protected boolean isChunkAtWordBoundary(LocationTextExtractionStrategy.TextChunk chunk, LocationTextExtractionStrategy.TextChunk previousChunk)
- Determines if a space character should be inserted between a previous chunk and the current chunk.
void renderImage(ImageRenderInfo renderInfo)
- no-op method - this renderer isn't interested in image events
void renderText(TextRenderInfo renderInfo)
- Called when text should be rendered
private boolean startsWithSpace(String str)
-
Field Details
-
DUMP_STATE
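static boolean DUMP_STATE
set to true for debugging
-
locationalResult
private final List<LocationTextExtractionStrategy.TextChunk> locationalResult
a summary of all found text
-
tclStrat
private final LocationTextExtractionStrategy.TextChunkLocationStrategy tclStrat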
-
-
Constructor Details
-
LocationTextExtractionStrategy
public LocationTextExtractionStrategy()
Creates a new text extraction renderer.
-
LocationTextExtractionStrategy
public LocationTextExtractionStrategy(LocationTextExtractionStrategy.TextChunkLocationStrategy strat)
Creates a new text extraction renderer, with a custom strategy for creating new TextChunkLocation objects based on the input of the TextRenderInfo.
- Parameters:
strat
- the custom strategy
-
-
Method Details
-
beginTextBlock
public void beginTextBlock()
Description copied from interface: RenderListener
Called when a new text block is beginning (i.e. BT)
- Specified by:
beginTextBlock in interface RenderListener
-
endTextBlock
public void endTextBlock()
Description copied from interface: RenderListener
Called when a text block has ended (i.e. ET)
- Specified by:
endTextBlock in interface RenderListener
-
startsWithSpace
private boolean startsWithSpace(String str)
- Parameters:
str
- Returns:
- true if the string starts with a space character, false if the string is empty or starts with a non-space character
-
endsWithSpace
private boolean endsWithSpace(String str)
- Parameters:
str
- Returns:
- true if the string ends with a space character, false if the string is empty or ends with a non-space character
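The documented contract of the two space-checking helpers can be restated as a small sketch. Both methods are private in the class itself, so the code below is an independent re-implementation of the described behavior under the hypothetical class name `SpaceCheckSketch`, not a copy of the library source.

```java
/** Hypothetical re-implementation of the documented contract; not the library source. */
final class SpaceCheckSketch {
    /** True only if the string is non-empty and its first character is a space. */
    static boolean startsWithSpace(String str) {
        return !str.isEmpty() && str.charAt(0) == ' ';
    }

    /** True only if the string is non-empty and its last character is a space. */
    static boolean endsWithSpace(String str) {
        return !str.isEmpty() && str.charAt(str.length() - 1) == ' ';
    }

    public static void main(String[] args) {
        System.out.println(startsWithSpace(" lead"));
        System.out.println(endsWithSpace("trail "));
        System.out.println(startsWithSpace(""));
    }
}
```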
-
filterTextChunks
private List<LocationTextExtractionStrategy.TextChunk> filterTextChunks(List<LocationTextExtractionStrategy.TextChunk> textChunks, LocationTextExtractionStrategy.TextChunkFilter filter)
Filters the provided list with the provided filter
- Parameters:
textChunks
- a list of all TextChunks that this strategy found during processing
filter
- the filter to apply. If null, filtering will be skipped.
- Returns:
- the filtered list
- Since:
- 5.3.3
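The null-tolerant filtering described for filterTextChunks might look like the following sketch. It is written generically so that it runs without iText on the classpath. The `Filter` interface and its `accept` method are hypothetical stand-ins for the chunk filter, and returning the input list unchanged for a null filter is one reading of "filtering will be skipped", not a confirmed detail of the implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkFilterSketch {
    /** Hypothetical callback standing in for a chunk filter; the method name is assumed. */
    interface Filter<T> {
        boolean accept(T item);
    }

    /** Returns only the accepted items; a null filter skips filtering entirely. */
    static <T> List<T> filter(List<T> items, Filter<T> filter) {
        if (filter == null) {
            return items; // assumption: the unfiltered list is handed back as-is
        }
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (filter.accept(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("a", "bb", "ccc");
        System.out.println(ChunkFilterSketch.<String>filter(chunks, s -> s.length() > 1));
        System.out.println(ChunkFilterSketch.<String>filter(chunks, null));
    }
}
```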
-
isChunkAtWordBoundary
protected boolean isChunkAtWordBoundary(LocationTextExtractionStrategy.TextChunk chunk, LocationTextExtractionStrategy.TextChunk previousChunk)
Determines if a space character should be inserted between a previous chunk and the current chunk. This method is exposed as a callback so subclasses can fine-tune the algorithm for determining whether a space should be inserted or not. By default, this method will insert a space if there is a gap of more than half the font space character width between the end of the previous chunk and the beginning of the current chunk. It will also indicate that a space is needed if the starting point of the new chunk appears *before* the end of the previous chunk (i.e. overlapping text).
- Parameters:
chunk
- the new chunk being evaluated
previousChunk
- the chunk that appeared immediately before the current chunk
- Returns:
- true if the two chunks represent different words (i.e. should have a space between them). False otherwise.
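The default rule described for isChunkAtWordBoundary can be sketched with plain numbers. The class name `WordBoundarySketch` and the parameters `previousEnd`, `currentStart`, and `spaceWidth` are hypothetical; the sketch follows the prose description only, and the library's actual implementation may apply tolerances that this page does not describe.

```java
/** Hypothetical illustration of the documented default rule; not the library source. */
final class WordBoundarySketch {
    /**
     * @param previousEnd  position along the baseline where the previous chunk ends
     * @param currentStart position along the baseline where the current chunk starts
     * @param spaceWidth   width of the font's space character, in the same units
     * @return true if a space should be inserted between the two chunks
     */
    static boolean isAtWordBoundary(float previousEnd, float currentStart, float spaceWidth) {
        float gap = currentStart - previousEnd;
        // A negative gap means the new chunk starts before the previous one ended (overlap).
        // A gap wider than half a space character is treated as a word break.
        return gap < 0 || gap > spaceWidth / 2.0f;
    }

    public static void main(String[] args) {
        System.out.println(isAtWordBoundary(10f, 10.1f, 1f)); // small gap
        System.out.println(isAtWordBoundary(10f, 11f, 1f));   // wide gap
        System.out.println(isAtWordBoundary(10f, 9f, 1f));    // overlapping text
    }
}
```

A subclass that overrides the real method could apply a different threshold, for example a full space width for fonts with tight tracking.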
-
getResultantText
public String getResultantText(LocationTextExtractionStrategy.TextChunkFilter chunkFilter)
Gets text that meets the specified filter. If multiple text extractions will be performed for the same page (i.e. for different physical regions of the page), filtering at this level is more efficient than filtering using FilteredRenderListener, but not nearly as powerful, because most of the RenderInfo state is not captured in LocationTextExtractionStrategy.TextChunk.
- Parameters:
chunkFilter
- the filter to apply
- Returns:
- the text results so far, filtered using the specified filter
-
getResultantText
public String getResultantText()
Returns the result so far.
- Specified by:
getResultantText in interface TextExtractionStrategy
- Returns:
- a String with the resulting text.
-
dumpState
private void dumpState()
Used for debugging only
-
renderText
public void renderText(TextRenderInfo renderInfo)
Description copied from interface: RenderListener
Called when text should be rendered
- Specified by:
renderText in interface RenderListener
- Parameters:
renderInfo
- information specifying what to render
-
compareInts
private static int compareInts(int int1, int int2)
- Parameters:
int1
int2
- Returns:
- comparison of the two integers
-
renderImage
public void renderImage(ImageRenderInfo renderInfo)
no-op method - this renderer isn't interested in image events
- Specified by:
renderImage in interface RenderListener
- Parameters:
renderInfo
- information specifying what to render
- Since:
- 5.0.1
-