Class PdfTextExtractor


  • public class PdfTextExtractor
    extends java.lang.Object
    Extracts text from a PDF file.
    Since:
    2.1.4
    • Field Detail

      • reader

        private final PdfReader reader
        The PdfReader that holds the PDF file.
      • renderListener

        private final TextAssembler renderListener
        The TextAssembler that receives render notifications and provides the resulting text.
    • Constructor Detail

      • PdfTextExtractor

        public PdfTextExtractor(PdfReader reader)
        Creates a new text extractor that uses a TextAssembler as the render listener.
        Parameters:
        reader - the reader with the PDF
      • PdfTextExtractor

        public PdfTextExtractor(PdfReader reader,
                                boolean usePdfMarkupElements)
        Creates a new text extractor that uses a TextAssembler as the render listener.
        Parameters:
        reader - the reader with the PDF
        usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
      • PdfTextExtractor

        public PdfTextExtractor(PdfReader reader,
                                TextAssembler renderListener)
        Creates a new text extractor that uses the given render listener.
        Parameters:
        reader - the reader with the PDF
        renderListener - the render listener that analyzes renderText operations and provides the resulting text
    • Method Detail

      • getContentBytesForPage

        private byte[] getContentBytesForPage(int pageNum)
                                       throws java.io.IOException
        Gets the content bytes of a page.
        Parameters:
        pageNum - the 1-based number of the page whose content stream you want to read
        Returns:
        a byte array with the effective content stream of a page
        Throws:
        java.io.IOException
      • getContentBytesFromContentObject

        private byte[] getContentBytesFromContentObject(PdfObject contentObject)
                                                 throws java.io.IOException
        Gets the content bytes from a content object, which may be a reference, a stream, or an array.
        Parameters:
        contentObject - the object to read bytes from
        Returns:
        the content bytes
        Throws:
        java.io.IOException
      • getTextFromPage

        public java.lang.String getTextFromPage(int page)
                                         throws java.io.IOException
        Gets the text from a page.
        Parameters:
        page - the 1-based page number
        Returns:
        a String with the content as plain text (without PDF syntax)
        Throws:
        java.io.IOException - on error
      • getTextFromPage

        public java.lang.String getTextFromPage(int page,
                                                boolean useContainerMarkup)
                                         throws java.io.IOException
        Gets the text from a page, optionally adding tags for container elements.
        Parameters:
        page - the number of the page to extract text from
        useContainerMarkup - whether to insert tags for PDF markup container elements (the tags are not really HTML at the moment)
        Returns:
        the extracted text, with tags if requested
        Throws:
        java.io.IOException - on error
      • processContent

        public void processContent(byte[] contentBytes,
                                   PdfDictionary resources,
                                   PdfContentStreamHandler handler)
        Processes the PDF syntax of a content stream.
        Parameters:
        contentBytes - the bytes of a content stream
        resources - the resources that come with the content stream
        handler - interprets events caused by recognition of operations in a content stream.
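
Because getTextFromPage takes a 1-based page number and works on one page at a time, extracting a whole document means looping from page 1 to the page count. The sketch below illustrates that loop under stated assumptions: PageTextJoiner and PageTextSource are hypothetical names, not part of the library, and PageTextSource only stands in for PdfTextExtractor.getTextFromPage(int) so that the example compiles without a PDF library on the classpath. Wrapping IOException in UncheckedIOException is a design choice of this sketch, not library behavior.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Joins the text of every page. A sketch only, not part of the library. */
final class PageTextJoiner {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int). */
    @FunctionalInterface
    interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /**
     * Extracts pages 1..pageCount (page numbers are 1-based) and joins
     * them with a single newline between consecutive pages.
     */
    static String extractAll(PageTextSource source, int pageCount) {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            if (page > 1) {
                text.append('\n');
            }
            try {
                text.append(source.getTextFromPage(page));
            } catch (IOException e) {
                // Keep the failing page number in the error message.
                throw new UncheckedIOException("Failed to extract page " + page, e);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stub source so the sketch runs without any PDF file.
        PageTextSource stub = page -> "text of page " + page;
        System.out.println(extractAll(stub, 2));
    }
}
```

With a real extractor, the method reference extractor::getTextFromPage should fit the PageTextSource shape, since that method takes an int, returns a String, and throws java.io.IOException. The page count would have to come from the reader; a getNumberOfPages() method on PdfReader is assumed here and is not documented on this page.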