Class PdfTextExtractor

java.lang.Object
com.lowagie.text.pdf.parser.PdfTextExtractor

public class PdfTextExtractor extends Object
Extracts text from a PDF file.
Since:
2.1.4
  • Field Details

    • reader

      private final PdfReader reader
      The PdfReader that holds the PDF file.
    • renderListener

      private final TextAssembler renderListener
      The TextAssembler that receives render notifications and provides the resulting text.
  • Constructor Details

    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader)
      Creates a new Text Extractor object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
      Creates a new Text Extractor object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
      usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
      Creates a new Text Extractor object.
      Parameters:
      reader - the reader with the PDF
      renderListener - the render listener that will be used to analyze renderText operations and provide resultant text
  • Method Details

    • getContentBytesForPage

      private byte[] getContentBytesForPage(int pageNum) throws IOException
      Gets the content bytes of a page.
      Parameters:
      pageNum - the 1-based number of the page whose content stream you want
      Returns:
      a byte array with the effective content stream of a page
      Throws:
      IOException
    • getContentBytesFromContentObject

      private byte[] getContentBytesFromContentObject(PdfObject contentObject) throws IOException
      Gets the content bytes from a content object, which may be a reference, a stream, or an array.
      Parameters:
      contentObject - the object to read bytes from
      Returns:
      the content bytes
      Throws:
      IOException
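The three cases named above (reference, stream, array) suggest a recursive resolution. The sketch below is a toy model of that idea, not the library's code: `Node`, `StreamNode`, `RefNode`, `ArrayNode`, and `contentBytes` are hypothetical stand-ins for the PDF object types, and the space written after each array element is an assumption made so that tokens from adjacent streams cannot run together.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

public class ContentBytesSketch {

    /** Hypothetical stand-in for a PDF object; not part of the library. */
    public interface Node {}

    /** A stream that directly carries content bytes. */
    public static final class StreamNode implements Node {
        final byte[] bytes;
        public StreamNode(byte[] bytes) { this.bytes = bytes; }
    }

    /** An indirect reference that points at another node. */
    public static final class RefNode implements Node {
        final Node target;
        public RefNode(Node target) { this.target = target; }
    }

    /** An array whose elements are resolved in order. */
    public static final class ArrayNode implements Node {
        final List<Node> items;
        public ArrayNode(Node... items) { this.items = Arrays.asList(items); }
    }

    /** Resolves a node to its bytes, following references and flattening arrays. */
    public static byte[] contentBytes(Node node) {
        if (node instanceof RefNode) {
            // A reference is resolved by following it to its target.
            return contentBytes(((RefNode) node).target);
        }
        if (node instanceof StreamNode) {
            // A stream already holds the bytes.
            return ((StreamNode) node).bytes;
        }
        if (node instanceof ArrayNode) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Node item : ((ArrayNode) node).items) {
                byte[] part = contentBytes(item);
                out.write(part, 0, part.length);
                out.write(' '); // assumed separator between concatenated parts
            }
            return out.toByteArray();
        }
        throw new IllegalArgumentException("Unsupported content object: " + node);
    }
}
```

In this model an array of a stream and a reference to a second stream resolves to the two byte sequences in order, each followed by a single space.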
    • getTextFromPage

      public String getTextFromPage(int page) throws IOException
      Gets the text from a page.
      Parameters:
      page - the 1-based page number
      Returns:
      a String with the content as plain text (without PDF syntax)
      Throws:
      IOException - on error
    • getTextFromPage

      public String getTextFromPage(int page, boolean useContainerMarkup) throws IOException
      Gets the text from a page, optionally with container markup.
      Parameters:
      page - the 1-based number of the page to extract text from
      useContainerMarkup - whether to emit tags for PDF markup container elements (not really HTML at the moment)
      Returns:
      result of extracting the text, with tags as requested.
      Throws:
      IOException - on error
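Because getTextFromPage takes a 1-based page number, extracting a whole document means looping from 1 to the page count inclusive. The sketch below shows one way to do that. `PageTextCollector` and `PageTextSource` are hypothetical names introduced only for illustration; `PageTextSource` mirrors the shape of `getTextFromPage(int)`, so the loop can be compiled and tested without the library on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PageTextCollector {

    /** Hypothetical adapter with the same shape as getTextFromPage(int). */
    @FunctionalInterface
    public interface PageTextSource {
        String textOf(int page) throws IOException;
    }

    /** Collects the text of pages 1..pageCount, in order. */
    public static List<String> collect(PageTextSource source, int pageCount) {
        List<String> pages = new ArrayList<>();
        // Page numbers are 1-based, so the loop runs from 1 to pageCount inclusive.
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(source.textOf(page));
            } catch (IOException e) {
                // Rethrow unchecked and record which page failed.
                throw new UncheckedIOException("Failed to extract page " + page, e);
            }
        }
        return pages;
    }
}
```

With the library available, this could presumably be called as `PageTextCollector.collect(extractor::getTextFromPage, reader.getNumberOfPages())`, where `extractor` is a PdfTextExtractor built from the PdfReader `reader`. Wrapping the IOException is a design choice of this sketch, not something the class requires.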
    • processContent

      public void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
      Processes PDF syntax.
      Parameters:
      contentBytes - the bytes of a content stream
      resources - the resources that come with the content stream
      handler - interprets events caused by recognition of operations in a content stream.