Class PDFMarkedContentExtractor

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFMarkedContentExtractor

public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
Author:
Johannes Koch
  • Constructor Details

    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor() throws IOException
      Instantiates a new PDFMarkedContentExtractor object.
      Throws:
      IOException
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(String encoding) throws IOException
      Constructor. Will apply encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException
  • Method Details

    • beginMarkedContentSequence

      public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
      Description copied from class: PDFStreamEngine
      Called when a marked content group begins.
      Overrides:
      beginMarkedContentSequence in class PDFStreamEngine
      Parameters:
      tag - indicates the role or significance of the sequence
      properties - optional properties
    • endMarkedContentSequence

      public void endMarkedContentSequence()
      Description copied from class: PDFStreamEngine
      Called when a marked content group ends.
      Overrides:
      endMarkedContentSequence in class PDFStreamEngine
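
The begin/end callbacks above arrive in matched pairs and may nest. A minimal sketch of observing that nesting, assuming a subclass is acceptable for the use case: the class name TracingMarkedContentExtractor, the depth counter, and the trace list are hypothetical and not part of PDFBox. The overrides delegate to the superclass so the extractor's own bookkeeping is left untouched.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

// Hypothetical subclass: records each begin/end callback with its nesting depth.
public class TracingMarkedContentExtractor extends PDFMarkedContentExtractor {

    private int depth = 0;
    private final List<String> trace = new ArrayList<>();

    public TracingMarkedContentExtractor() throws IOException {
        super();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
        // Record the tag at the depth it was opened, then go one level deeper.
        String name = (tag == null) ? "?" : tag.getName();
        trace.add("begin " + name + " at depth " + depth);
        depth++;
        super.beginMarkedContentSequence(tag, properties);
    }

    @Override
    public void endMarkedContentSequence() {
        super.endMarkedContentSequence();
        // Guard against an unbalanced end operator in a malformed stream.
        if (depth > 0) {
            depth--;
        }
        trace.add("end at depth " + depth);
    }

    // Returns the recorded callbacks in the order they were received.
    public List<String> getTrace() {
        return trace;
    }
}
```

In normal use the callbacks are driven by processPage rather than called by hand; the trace is only a debugging aid.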
    • xobject

      public void xobject(PDXObject xobject)
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Parameters:
      text - The text to process.
    • getMarkedContents

      public List<PDMarkedContent> getMarkedContents()
    • processPage

      public void processPage(PDPage page) throws IOException
      This will initialize and process the contents of the stream.
      Overrides:
      processPage in class PDFStreamEngine
      Parameters:
      page - the page to process
      Throws:
      IOException - if there is an error accessing the stream.
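
A typical call sequence is to construct the extractor, pass a page to processPage, and then read getMarkedContents. The sketch below is one way to do that, not the only one. The class MarkedContentDump and its helper methods are hypothetical. It creates a fresh extractor per page, because this page does not state whether results accumulate across calls. To stay independent of the PDFBox version, main processes an empty in-memory page; a real document would be loaded first (PDDocument.load in 2.x, Loader.loadPDF in 3.x).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper: lists the top-level marked content of a page with its text.
public class MarkedContentDump {

    // Collects the text of a marked content item, descending into nested items.
    static void appendText(PDMarkedContent mc, StringBuilder out) {
        for (Object item : mc.getContents()) {
            if (item instanceof TextPosition) {
                out.append(((TextPosition) item).getUnicode());
            } else if (item instanceof PDMarkedContent) {
                appendText((PDMarkedContent) item, out);
            }
        }
    }

    // Returns one "tag: text" line per top-level marked content sequence.
    static List<String> describe(PDPage page) throws IOException {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);
        List<String> lines = new ArrayList<>();
        for (PDMarkedContent mc : extractor.getMarkedContents()) {
            StringBuilder text = new StringBuilder();
            appendText(mc, text);
            lines.add(mc.getTag() + ": " + text);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage(); // empty page, so nothing is listed
            doc.addPage(page);
            for (String line : describe(page)) {
                System.out.println(line);
            }
        }
    }
}
```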
    • showGlyph

      protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
      Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
      Overrides:
      showGlyph in class PDFStreamEngine
      Parameters:
      textRenderingMatrix - the current text rendering matrix, Trm
      font - the current font
      code - internal PDF character code for the glyph
      unicode - the Unicode text for this glyph, or null if the PDF does not provide it
      displacement - the displacement (i.e. advance) of the glyph in text space
      Throws:
      IOException - if the glyph cannot be processed
    • computeFontHeight

      protected float computeFontHeight(PDFont font) throws IOException
      Computes the font height. Override this method to use your own calculation.
      Parameters:
      font - the font.
      Returns:
      the font height.
      Throws:
      IOException - if there is an error while getting the font bounding box.
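
As a sketch of such an override, the hypothetical subclass below derives the height from half of the font bounding box. Two things here are assumptions rather than documented behavior: that the returned value is expected in text space units, and that the font uses 1000 glyph units per text space unit. The class name BBoxHeightExtractor and the helper halfBBoxHeight are illustrative only.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

// Hypothetical subclass that replaces the default font height calculation.
public class BBoxHeightExtractor extends PDFMarkedContentExtractor {

    public BBoxHeightExtractor() throws IOException {
        super();
    }

    // Half the bounding box height, scaled from glyph units to text space.
    // Assumption: 1000 glyph units per text space unit.
    static float halfBBoxHeight(float bboxHeightGlyphUnits) {
        return bboxHeightGlyphUnits / 2f / 1000f;
    }

    @Override
    protected float computeFontHeight(PDFont font) throws IOException {
        return halfBBoxHeight(font.getBoundingBox().getHeight());
    }
}
```

Fonts with a different glyph space scale, such as Type 3 fonts with their own font matrix, would need different handling than this sketch provides.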