Class PDFMarkedContentExtractor

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFMarkedContentExtractor

public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
Author:
Johannes Koch
  • Constructor Details

    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor() throws IOException
      Instantiates a new PDFMarkedContentExtractor object.
      Throws:
      IOException
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(String encoding) throws IOException
      Creates a new extractor that applies encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException
  • Method Details

    • isSuppressDuplicateOverlappingText

      public boolean isSuppressDuplicateOverlappingText()
      Returns:
      the suppressDuplicateOverlappingText setting.
    • setSuppressDuplicateOverlappingText

      public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
      By default the class attempts to remove text that overlaps other text. For example, Word paints the same character several times to make it look bold. Setting this to false extracts all text, so certain sections will be duplicated, but extraction will be faster.
      Parameters:
      suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
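
      The effect of this setting can be illustrated with a minimal, self-contained sketch. It does not use PDFBox and is not the library's implementation: the Glyph and OverlapSuppressionSketch classes and the tolerance of one third of the glyph width are assumptions chosen for illustration. A glyph is dropped when a glyph with the same text was already accepted at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a positioned glyph; this is not a PDFBox type.
final class Glyph {
    final String text;
    final float x;
    final float y;
    final float width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

// Illustrative sketch only; names and the tolerance rule are assumptions.
final class OverlapSuppressionSketch {
    // Accepted glyphs grouped by text, so only same-text glyphs are compared.
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    private final List<Glyph> kept = new ArrayList<>();
    private final boolean suppressDuplicates;

    OverlapSuppressionSketch(boolean suppressDuplicates) {
        this.suppressDuplicates = suppressDuplicates;
    }

    void add(Glyph g) {
        if (suppressDuplicates) {
            List<Glyph> sameText = seen.computeIfAbsent(g.text, k -> new ArrayList<>());
            float tolerance = g.width / 3.0f; // assumed tolerance
            for (Glyph other : sameText) {
                boolean closeX = Math.abs(other.x - g.x) < tolerance;
                boolean closeY = Math.abs(other.y - g.y) < tolerance;
                if (closeX && closeY) {
                    return; // same text at nearly the same spot: treat as a duplicate
                }
            }
            sameText.add(g);
        }
        kept.add(g);
    }

    // Concatenates the text of all glyphs that were kept, in insertion order.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : kept) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OverlapSuppressionSketch sketch = new OverlapSuppressionSketch(true);
        sketch.add(new Glyph("A", 10f, 20f, 6f));
        sketch.add(new Glyph("A", 10.5f, 20f, 6f)); // "fake bold" repaint of the same character
        sketch.add(new Glyph("B", 16f, 20f, 6f));
        System.out.println(sketch.text());
    }
}
```

      With suppression disabled, the comparison loop is skipped entirely, which is where the time saving comes from, and the repainted character appears twice in the output.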
    • beginMarkedContentSequence

      public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
      Description copied from class: PDFStreamEngine
      Called when a marked content group begins
      Overrides:
      beginMarkedContentSequence in class PDFStreamEngine
      Parameters:
      tag - indicates the role or significance of the sequence
      properties - optional properties
    • endMarkedContentSequence

      public void endMarkedContentSequence()
      Description copied from class: PDFStreamEngine
      Called when a marked content group ends
      Overrides:
      endMarkedContentSequence in class PDFStreamEngine
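
      Because marked content groups can nest, the begin and end callbacks are naturally paired with a stack. The following self-contained sketch shows one way such bookkeeping could work. It does not use PDFBox: the Group and MarkedContentNestingSketch classes are hypothetical stand-ins, tags are plain strings rather than COSName objects, and the approach is an assumption for illustration rather than a description of the library's internals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a marked content group; this is not PDMarkedContent.
final class Group {
    final String tag;
    final List<Group> children = new ArrayList<>();

    Group(String tag) {
        this.tag = tag;
    }
}

// Illustrative sketch only; all names here are assumptions.
final class MarkedContentNestingSketch {
    private final List<Group> topLevel = new ArrayList<>();
    private final Deque<Group> open = new ArrayDeque<>();

    // Called when a group begins: attach it to its parent, then make it current.
    void begin(String tag) {
        Group group = new Group(tag);
        if (open.isEmpty()) {
            topLevel.add(group);
        } else {
            open.peek().children.add(group);
        }
        open.push(group);
    }

    // Called when a group ends: the innermost open group is closed.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Group> getTopLevel() {
        return topLevel;
    }

    public static void main(String[] args) {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.begin("P");
        sketch.begin("Span");
        sketch.end();
        sketch.end();
        sketch.begin("Artifact");
        sketch.end();
        for (Group g : sketch.getTopLevel()) {
            System.out.println(g.tag + " with " + g.children.size() + " nested group(s)");
        }
    }
}
```

      Guarding end() against an empty stack keeps the sketch tolerant of content streams with unbalanced end operators.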
    • xobject

      public void xobject(PDXObject xobject)
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Parameters:
      text - The text to process.
    • getMarkedContents

      public List<PDMarkedContent> getMarkedContents()
    • processPage

      public void processPage(PDPage page) throws IOException
      This will initialize and process the contents of the stream.
      Overrides:
      processPage in class PDFStreamEngine
      Parameters:
      page - the page to process
      Throws:
      IOException - if there is an error accessing the stream.
    • showGlyph

      protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
      Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
      Overrides:
      showGlyph in class PDFStreamEngine
      Parameters:
      textRenderingMatrix - the current text rendering matrix, Trm
      font - the current font
      code - internal PDF character code for the glyph
      unicode - the Unicode text for this glyph, or null if the PDF does not provide it
      displacement - the displacement (i.e. advance) of the glyph in text space
      Throws:
      IOException - if the glyph cannot be processed
    • computeFontHeight

      protected float computeFontHeight(PDFont font) throws IOException
      Computes the font height. Override this if you want to use your own calculations.
      Parameters:
      font - the font.
      Returns:
      the font height.
      Throws:
      IOException - if there is an error while getting the font bounding box.