Class PDFTextStripperByArea


  • public class PDFTextStripperByArea
    extends PDFTextStripper
    This will extract text from a specified region in the PDF.
    Author:
    Ben Litchfield
    • Constructor Detail

      • PDFTextStripperByArea

        public PDFTextStripperByArea()
                              throws java.io.IOException
        Constructor.
        Throws:
        java.io.IOException - If there is an error loading properties.
    • Method Detail

      • setShouldSeparateByBeads

        public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
        This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
        Overrides:
        setShouldSeparateByBeads in class PDFTextStripper
        Parameters:
        aShouldSeparateByBeads - The new grouping of beads.
      • addRegion

        public void addRegion(java.lang.String regionName,
                              java.awt.geom.Rectangle2D rect)
        Add a new region to group text by.
        Parameters:
        regionName - The name of the region.
        rect - The rectangular area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is the top), not PDF coordinates (y == 0 is the bottom).
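        Because rect uses a top-left origin, a rectangle measured in PDF coordinates has to be flipped vertically before it is passed to addRegion. The following is a minimal sketch of that flip using only java.awt.geom. The class RegionCoordinates and the method toTopLeftOrigin are hypothetical names, not part of this API, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0) and whose height is known in the same units.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFTextStripperByArea.
final class RegionCoordinates {

    private RegionCoordinates() {
    }

    /**
     * Converts a rectangle given with a bottom-left origin (y grows upward)
     * into one with a top-left origin (y grows downward).
     *
     * Assumption: the page is unrotated and its lower-left corner is (0, 0).
     */
    static Rectangle2D toTopLeftOrigin(Rectangle2D pdfRect, double pageHeight) {
        // The top edge of the rectangle in PDF space is y + height;
        // its distance from the top of the page is the new y value.
        double top = pageHeight - (pdfRect.getY() + pdfRect.getHeight());
        return new Rectangle2D.Double(
                pdfRect.getX(), top, pdfRect.getWidth(), pdfRect.getHeight());
    }
}
```

        For example, on a page 792 units high, a rectangle whose lower-left corner is at (50, 700) with a height of 42 ends up with y == 50 after conversion; x, width and height are unchanged.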
      • removeRegion

        public void removeRegion(java.lang.String regionName)
        Delete a region to group text by. If the region does not exist, this method does nothing.
        Parameters:
        regionName - The name of the region to delete.
      • getRegions

        public java.util.List<java.lang.String> getRegions()
        Get the list of regions that have been set up.
        Returns:
        A list of java.lang.String objects to identify the region names.
      • getTextForRegion

        public java.lang.String getTextForRegion(java.lang.String regionName)
        Get the text for the region. This should be called after extractRegions().
        Parameters:
        regionName - The name of the region to get the text from.
        Returns:
        The text that was identified in that region.
      • extractRegions

        public void extractRegions(PDPage page)
                            throws java.io.IOException
        Process the page to extract the region text.
        Parameters:
        page - The page to extract the regions from.
        Throws:
        java.io.IOException - If there is an error while extracting text.
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Overrides:
        processTextPosition in class PDFTextStripper
        Parameters:
        text - The text to process.
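        The idea of grouping text by region can be illustrated with a small stand-alone model: each positioned piece of text is appended to every named rectangle that contains its position. This is a simplified sketch under stated assumptions, not the library's implementation. The class RegionGroupingSketch and its accept method are hypothetical, and plain (x, y) values stand in for a TextPosition.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone model of grouping text by named region.
final class RegionGroupingSketch {

    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> textByRegion = new LinkedHashMap<>();

    /** Registers a named rectangle (top-left origin) to collect text for. */
    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        textByRegion.put(name, new StringBuilder());
    }

    /** Appends the text to every region whose rectangle contains (x, y). */
    void accept(String text, double x, double y) {
        for (Map.Entry<String, Rectangle2D> entry : regions.entrySet()) {
            if (entry.getValue().contains(x, y)) {
                textByRegion.get(entry.getKey()).append(text);
            }
        }
    }

    /** Returns the collected text, or null if the region was never added. */
    String getTextForRegion(String name) {
        StringBuilder collected = textByRegion.get(name);
        return collected == null ? null : collected.toString();
    }
}
```

        In this model, overlapping regions each receive the text that falls inside them, and text outside every region is dropped.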
      • writePage

        protected void writePage()
                          throws java.io.IOException
        This will print the processed page text to the output stream.
        Overrides:
        writePage in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing the text.
      • showGlyph

        protected void showGlyph(Matrix textRenderingMatrix,
                                 PDFont font,
                                 int code,
                                 java.lang.String unicode,
                                 Vector displacement)
                          throws java.io.IOException
        Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
        Overrides:
        showGlyph in class PDFStreamEngine
        Parameters:
        textRenderingMatrix - the current text rendering matrix, Trm
        font - the current font
        code - internal PDF character code for the glyph
        unicode - the Unicode text for this glyph, or null if the PDF does not provide it
        displacement - the displacement (i.e. advance) of the glyph in text space
        Throws:
        java.io.IOException - if the glyph cannot be processed
      • computeFontHeight

        protected float computeFontHeight(PDFont font)
                                   throws java.io.IOException
        Compute the font height. Override this if you want to use your own calculations.
        Parameters:
        font - the font.
        Returns:
        the font height.
        Throws:
        java.io.IOException - if there is an error while getting the font bounding box.