Package com.lowagie.text.pdf.parser
Class PdfTextExtractor
java.lang.Object
com.lowagie.text.pdf.parser.PdfTextExtractor

public class PdfTextExtractor extends java.lang.Object
Extracts text from a PDF file.
Since: 2.1.4

Field Summary
Fields

private PdfReader reader
The PdfReader that holds the PDF file.

private TextAssembler renderListener
The TextAssembler that will receive render notifications and provide the resultant text.

Constructor Summary
Constructors

PdfTextExtractor(PdfReader reader)
Creates a new Text Extractor object, using a TextAssembler as the render listener.

PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
Creates a new Text Extractor object, using a TextAssembler as the render listener.

PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
Creates a new Text Extractor object.

Method Summary
All Methods, Instance Methods, Concrete Methods

private byte[] getContentBytesForPage(int pageNum)
Gets the content bytes of a page.

private byte[] getContentBytesFromContentObject(PdfObject contentObject)
Gets the content bytes from a content object, which may be a reference, a stream, or an array.

java.lang.String getTextFromPage(int page)
Gets the text from a page.

java.lang.String getTextFromPage(int page, boolean useContainerMarkup)
Gets the text from a page, optionally with tags for PDF markup container elements.

void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
Processes PDF syntax.

Field Detail

reader
private final PdfReader reader
The PdfReader that holds the PDF file.

renderListener
private final TextAssembler renderListener
The TextAssembler that will receive render notifications and provide the resultant text.

Constructor Detail

PdfTextExtractor
public PdfTextExtractor(PdfReader reader)
Creates a new Text Extractor object, using a TextAssembler as the render listener.
Parameters:
reader - the reader with the PDF

PdfTextExtractor
public PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
Creates a new Text Extractor object, using a TextAssembler as the render listener.
Parameters:
reader - the reader with the PDF
usePdfMarkupElements - whether to use higher-level tags for PDF markup entities

PdfTextExtractor
public PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
Creates a new Text Extractor object.
Parameters:
reader - the reader with the PDF
renderListener - the render listener that will be used to analyze renderText operations and provide the resultant text

Method Detail

getContentBytesForPage
private byte[] getContentBytesForPage(int pageNum) throws java.io.IOException
Gets the content bytes of a page.
Parameters:
pageNum - the 1-based number of the page whose content stream you want
Returns:
a byte array with the effective content stream of the page
Throws:
java.io.IOException

getContentBytesFromContentObject
private byte[] getContentBytesFromContentObject(PdfObject contentObject) throws java.io.IOException
Gets the content bytes from a content object, which may be a reference, a stream, or an array.
Parameters:
contentObject - the object to read bytes from
Returns:
the content bytes
Throws:
java.io.IOException

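The three shapes a content object can take (reference, stream, or array) suggest a recursive traversal. The following is a minimal sketch of that idea only, not the library's implementation: the types ContentObject, Stream, Reference and ObjectArray are hypothetical stand-ins for PdfObject and its subclasses, and the single space written after each array element is a choice made for this sketch so that tokens from adjacent parts cannot fuse.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ContentObjectSketch {

    /** Hypothetical model of a content object; NOT the library's PdfObject. */
    public interface ContentObject {}

    /** A stream that directly carries content bytes. */
    public static final class Stream implements ContentObject {
        final byte[] bytes;
        public Stream(byte[] bytes) { this.bytes = bytes; }
    }

    /** A reference that points at another content object. */
    public static final class Reference implements ContentObject {
        final ContentObject target;
        public Reference(ContentObject target) { this.target = target; }
    }

    /** An array whose elements are themselves content objects. */
    public static final class ObjectArray implements ContentObject {
        final List<ContentObject> elements;
        public ObjectArray(ContentObject... elements) {
            this.elements = Arrays.asList(elements);
        }
    }

    /** Resolves references, reads streams, and concatenates array elements in order. */
    public static byte[] contentBytes(ContentObject obj) {
        if (obj instanceof Stream) {
            return ((Stream) obj).bytes;
        }
        if (obj instanceof Reference) {
            // Follow the reference and process whatever it points at.
            return contentBytes(((Reference) obj).target);
        }
        if (obj instanceof ObjectArray) {
            ByteArrayOutputStream all = new ByteArrayOutputStream();
            for (ContentObject element : ((ObjectArray) obj).elements) {
                byte[] part = contentBytes(element);
                all.write(part, 0, part.length);
                all.write(' '); // sketch-specific separator between parts
            }
            return all.toByteArray();
        }
        throw new IllegalArgumentException("Unsupported content object: " + obj);
    }

    public static void main(String[] args) {
        ContentObject page = new ObjectArray(
                new Stream("BT".getBytes(StandardCharsets.US_ASCII)),
                new Reference(new Stream("ET".getBytes(StandardCharsets.US_ASCII))));
        System.out.println(new String(contentBytes(page), StandardCharsets.US_ASCII));
    }
}
```

Modelling the three cases as separate types keeps each branch trivial; the array case simply delegates back to the same method for every element, so nested references inside arrays need no special handling.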
getTextFromPage
public java.lang.String getTextFromPage(int page) throws java.io.IOException
Gets the text from a page.
Parameters:
page - the 1-based page number
Returns:
a String with the content as plain text (without PDF syntax)
Throws:
java.io.IOException - on error

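Because the page parameter is 1-based, a loop over a whole document runs from 1 to the page count inclusive. The sketch below illustrates that pattern without depending on the library: PageTextSource is a hypothetical interface that mirrors the signature of getTextFromPage(int), and extractAll is an illustrative helper, not part of this class. With the real classes one would presumably pass extractor::getTextFromPage as the source and reader.getNumberOfPages() as the count.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class PageTextLoop {

    /** Hypothetical stand-in that mirrors PdfTextExtractor#getTextFromPage(int). */
    public interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /**
     * Collects the text of pages 1..pageCount, joined by newlines.
     * An IOException from the source is rethrown unchecked.
     */
    public static String extractAll(PageTextSource source, int pageCount) {
        StringBuilder out = new StringBuilder();
        // Page numbers start at 1, so the upper bound is inclusive.
        for (int page = 1; page <= pageCount; page++) {
            if (page > 1) {
                out.append('\n');
            }
            try {
                out.append(source.getTextFromPage(page));
            } catch (IOException e) {
                throw new UncheckedIOException("Failed on page " + page, e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fake source for demonstration only; no PDF is read here.
        PageTextSource fake = page -> "text of page " + page;
        System.out.println(extractAll(fake, 3));
    }
}
```

Wrapping the checked exception keeps the helper usable from lambdas and streams; callers that prefer the checked form can declare throws IOException on the helper instead.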
getTextFromPage
public java.lang.String getTextFromPage(int page, boolean useContainerMarkup) throws java.io.IOException
Gets the text from a page.
Parameters:
page - the number of the page to extract text from
useContainerMarkup - whether to insert tags for PDF markup container elements (not really HTML at the moment)
Returns:
the result of extracting the text, with tags as requested
Throws:
java.io.IOException - on error

processContent
public void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
Processes PDF syntax.
Parameters:
contentBytes - the bytes of a content stream
resources - the resources that come with the content stream
handler - interprets events caused by recognition of operations in a content stream