Class RegexBasedLocationExtractionStrategy

  • All Implemented Interfaces:
    IEventListener, ILocationExtractionStrategy

    public class RegexBasedLocationExtractionStrategy
    extends java.lang.Object
    implements ILocationExtractionStrategy
    This class searches for occurrences of a regular expression and returns the resulting rectangles. Note that an instance holds all text locations it has encountered, so it cannot be reused to process multiple pages. To extract text from several pages of a PDF document, create a new instance of RegexBasedLocationExtractionStrategy for each page.

    Here is an example of usage with a new instance for each page:

        PdfDocument document = new PdfDocument(new PdfReader("..."));
        for (int i = 1; i <= document.getNumberOfPages(); ++i) {
            RegexBasedLocationExtractionStrategy extractionStrategy = new RegexBasedLocationExtractionStrategy("");
            PdfCanvasProcessor processor = new PdfCanvasProcessor(extractionStrategy);
            processor.processPageContent(document.getPage(i));
            for (IPdfTextLocation location : extractionStrategy.getResultantLocations()) {
                // process locations ...
            }
        }

    • Constructor Detail

      • RegexBasedLocationExtractionStrategy

        public RegexBasedLocationExtractionStrategy​(java.lang.String regex)
      • RegexBasedLocationExtractionStrategy

        public RegexBasedLocationExtractionStrategy​(java.util.regex.Pattern pattern)
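
        The two constructors differ only in how the expression is supplied. As a rough illustration of why the Pattern overload can be useful, the sketch below uses only java.util.regex (no iText classes) to show that flags set at compile time travel with the Pattern object. The class name PatternFlagsSketch, the helper findAll and the sample text are made up for this example; that the String overload compiles the expression with default flags is an assumption, not something this page states.

        ```java
        import java.util.ArrayList;
        import java.util.List;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Hypothetical helper, not part of iText: shows what a precompiled Pattern carries.
        class PatternFlagsSketch {

            // Collects every match of the pattern in the given text.
            static List<String> findAll(Pattern pattern, String text) {
                List<String> hits = new ArrayList<>();
                Matcher matcher = pattern.matcher(text);
                while (matcher.find()) {
                    hits.add(matcher.group());
                }
                return hits;
            }

            public static void main(String[] args) {
                String text = "Invoice INVOICE invoice";

                // Default flags: matching is case-sensitive.
                Pattern plain = Pattern.compile("invoice");
                // Flags chosen at compile time stay attached to the Pattern.
                Pattern relaxed = Pattern.compile("invoice", Pattern.CASE_INSENSITIVE);

                System.out.println(findAll(plain, text).size());
                System.out.println(findAll(relaxed, text).size());

                // With iText on the classpath, such a Pattern could be handed over directly:
                // new RegexBasedLocationExtractionStrategy(relaxed);
            }
        }
        ```

        When no special flags are needed, passing the expression as a String is the shorter route.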
    • Method Detail

      • eventOccurred

        public void eventOccurred​(IEventData data,
                                  EventType type)
        Called when some event occurs during parsing a content stream.
        Specified by:
        eventOccurred in interface IEventListener
        Parameters:
        data - Combines the data required for processing corresponding event type.
        type - Event type.
      • getSupportedEvents

        public java.util.Set<EventType> getSupportedEvents()
        Provides the set of event types this listener supports. Returns null if all possible event types are supported.
        Specified by:
        getSupportedEvents in interface IEventListener
        Returns:
        Set of event types supported by this listener or null if all possible event types are supported.
      • toRectangles

        protected java.util.List<Rectangle> toRectangles​(java.util.List<CharacterRenderInfo> cris)
        Converts CharacterRenderInfo objects to Rectangles. This method is protected and not final so that custom implementations can override it; for example, an implementation may add padding or margin to the Rectangles. It also offers a convenient access point to the mapping from CharacterRenderInfo to Rectangle. Because that mapping gives access to the CharacterRenderInfo objects that generated each Rectangle, custom implementations can match the text color or the background color in redacted Rectangles.
        Parameters:
        cris - list of CharacterRenderInfo objects
        Returns:
        a list of Rectangles derived from the given CharacterRenderInfo objects
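
        The padding idea mentioned above can be sketched without the library. The example below is only a model of the arithmetic: Box is a hypothetical stand-in for a rectangle (lower-left corner plus width and height) and PaddingSketch.pad is an invented helper; neither is an iText class. In a real subclass, a similar adjustment would presumably be applied to the list returned by super.toRectangles(cris).

        ```java
        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical stand-in for a rectangle; not an iText class.
        final class Box {
            final float x;       // left edge
            final float y;       // bottom edge
            final float width;
            final float height;

            Box(float x, float y, float width, float height) {
                this.x = x;
                this.y = y;
                this.width = width;
                this.height = height;
            }
        }

        class PaddingSketch {

            // Returns new boxes grown by `padding` on every side; the input list is left untouched.
            static List<Box> pad(List<Box> boxes, float padding) {
                List<Box> padded = new ArrayList<>();
                for (Box b : boxes) {
                    padded.add(new Box(
                            b.x - padding,
                            b.y - padding,
                            b.width + 2 * padding,
                            b.height + 2 * padding));
                }
                return padded;
            }

            public static void main(String[] args) {
                List<Box> boxes = new ArrayList<>();
                boxes.add(new Box(10f, 20f, 30f, 12f));

                Box p = pad(boxes, 2f).get(0);
                System.out.println(p.x + " " + p.y + " " + p.width + " " + p.height);
            }
        }
        ```

        Building new objects rather than mutating the originals keeps the unpadded rectangles available, which matters if they are also needed for other purposes such as color matching.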
      • removeDuplicates

        private void removeDuplicates​(java.util.List<IPdfTextLocation> sortedList)
      • getStartIndex

        private static java.lang.Integer getStartIndex​(java.util.Map<java.lang.Integer,​java.lang.Integer> indexMap,
                                                       int index,
                                                       java.lang.String txt)
      • getEndIndex

        private static java.lang.Integer getEndIndex​(java.util.Map<java.lang.Integer,​java.lang.Integer> indexMap,
                                                     int index)