Class PDFMarkedContentExtractor


  • public class PDFMarkedContentExtractor
    extends PDFStreamEngine
    This is a stream engine that extracts the marked content of a PDF.
    Author:
    Johannes Koch
    • Constructor Detail

      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor()
                                  throws java.io.IOException
        Instantiate a new PDFMarkedContentExtractor object.
        Throws:
        java.io.IOException
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.lang.String encoding)
                                  throws java.io.IOException
        Constructor. Will apply encoding-specific conversions to the output text.
        Parameters:
        encoding - The encoding that the output will be written in.
        Throws:
        java.io.IOException
    • Method Detail

      • isSuppressDuplicateOverlappingText

        public boolean isSuppressDuplicateOverlappingText()
        Returns:
        the suppressDuplicateOverlappingText setting.
      • setSuppressDuplicateOverlappingText

        public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
        By default, the class attempts to remove duplicate text that overlaps other text. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, so certain sections will be duplicated, but extraction will be faster.
        Parameters:
        suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
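The idea behind suppressing duplicate overlapping text can be illustrated with a small, self-contained sketch. This is not the PDFBox implementation: the `Glyph` class, the `suppressDuplicates` method, and the tolerance of one third of the glyph width are all assumptions made for illustration. A glyph is dropped when a glyph with the same text has already been kept at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OverlapSuppressionSketch {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Returns the glyphs in their original order, skipping any glyph whose
     * text matches an already kept glyph lying within the tolerance in both
     * x and y. The tolerance (width / 3) is an assumed value.
     */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs) {
        Map<String, List<Glyph>> keptByText = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> sameText =
                    keptByText.computeIfAbsent(g.text, k -> new ArrayList<>());
            float tolerance = g.width / 3.0f;
            boolean duplicate = false;
            for (Glyph other : sameText) {
                if (Math.abs(other.x - g.x) < tolerance
                        && Math.abs(other.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                sameText.add(g);
                kept.add(g);
            }
        }
        return kept;
    }
}
```

Each glyph is compared against earlier glyphs with the same text, which suggests why turning the setting off trades duplicated output for speed.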
      • xobject

        public void xobject(PDXObject xobject)
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Parameters:
        text - The text to process.
      • getMarkedContents

        public java.util.List<PDMarkedContent> getMarkedContents()
      • processPage

        public void processPage(PDPage page)
                         throws java.io.IOException
        This will initialize and process the contents of the stream.
        Overrides:
        processPage in class PDFStreamEngine
        Parameters:
        page - the page to process
        Throws:
        java.io.IOException - if there is an error accessing the stream.
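A typical use of `processPage` together with `getMarkedContents` might look like the sketch below. It assumes the PDFBox library is on the classpath and that the class lives in `org.apache.pdfbox.text`. The `MarkedContentDemo` class and its two helper methods are hypothetical names introduced for this example. Nested marked content is walked recursively and the Unicode text of each `TextPosition` is collected.

```java
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;
import org.apache.pdfbox.text.TextPosition;

class MarkedContentDemo {

    /** Runs the extractor over one page and returns its top-level marked content. */
    static List<PDMarkedContent> extract(PDPage page) throws IOException {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);
        return extractor.getMarkedContents();
    }

    /** Collects the text of one marked-content sequence, including nested sequences. */
    static String textOf(PDMarkedContent content) {
        StringBuilder sb = new StringBuilder();
        for (Object item : content.getContents()) {
            if (item instanceof TextPosition) {
                sb.append(((TextPosition) item).getUnicode());
            } else if (item instanceof PDMarkedContent) {
                sb.append(textOf((PDMarkedContent) item));
            }
            // Other items, such as XObjects, carry no text and are skipped here.
        }
        return sb.toString();
    }
}
```

For each returned `PDMarkedContent`, `getTag()` gives the tag of the sequence and `textOf` gives its text. A page without marked content yields an empty list.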
      • showGlyph

        protected void showGlyph(Matrix textRenderingMatrix,
                                 PDFont font,
                                 int code,
                                 java.lang.String unicode,
                                 Vector displacement)
                          throws java.io.IOException
        Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
        Overrides:
        showGlyph in class PDFStreamEngine
        Parameters:
        textRenderingMatrix - the current text rendering matrix, Trm
        font - the current font
        code - internal PDF character code for the glyph
        unicode - the Unicode text for this glyph, or null if the PDF does not provide it
        displacement - the displacement (i.e. advance) of the glyph in text space
        Throws:
        java.io.IOException - if the glyph cannot be processed
      • computeFontHeight

        protected float computeFontHeight(PDFont font)
                                   throws java.io.IOException
        Compute the font height. Override this if you want to use your own calculations.
        Parameters:
        font - the font.
        Returns:
        the font height.
        Throws:
        java.io.IOException - if there is an error while getting the font bounding box.