Class MarkedUpTextAssembler

  • All Implemented Interfaces:
    TextAssembler

    public class MarkedUpTextAssembler
    extends java.lang.Object
    implements TextAssembler
    Receives a variety of marked-section content (possibly including the results of nested sections) and assembles it in order as far as it can.
    • Field Detail

      • result

        java.util.List<FinalText> result
        The result may already be partially processed, in which case new content is simply added to it once ready.
      • page

        private int page
      • wordIdCounter

        private int wordIdCounter
      • usePdfMarkupElements

        private boolean usePdfMarkupElements
      • partialWords

        private java.util.List<TextAssemblyBuffer> partialWords
        As new content arrives (final or not), it is accumulated until the end of a parsing unit is reached.

        Each parsing unit may have a tag name that should wrap its content.

    • Constructor Detail

      • MarkedUpTextAssembler

        MarkedUpTextAssembler​(PdfReader reader)
      • MarkedUpTextAssembler

        MarkedUpTextAssembler​(PdfReader reader,
                              boolean usePdfMarkupElements)
    • Method Detail

      • process

        public void process​(ParsedText unassembled,
                            java.lang.String contextName)
        Remembers an unassembled chunk until the end of this element is reached or an assembled chunk arrives, at which point the pending chunks are pulled together.
        Specified by:
        process in interface TextAssembler
        Parameters:
        unassembled - chunk from a text rendering instruction that contributes to the final text
        contextName - Name of the element context we are in. Null value if it's an Artifact.
      • process

        public void process​(FinalText completed,
                            java.lang.String contextName)
        Slots a fully assembled chunk into the result at the current location. If unassembled chunks are waiting, they are assembled first.
        Specified by:
        process in interface TextAssembler
        Parameters:
        completed - This is a chunk from a nested element
        contextName - Name of the element context we are in. Null value if it's an Artifact.
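The interplay between the two overloads above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchAssembler`, its method names, and the use of plain strings are hypothetical stand-ins for `ParsedText`, `FinalText`, `partialWords`, and `result`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the assembler; plain strings replace ParsedText and FinalText.
class SketchAssembler {
    private final List<String> pending = new ArrayList<>(); // plays the role of partialWords
    private final List<String> result = new ArrayList<>();  // plays the role of result

    // Unassembled chunk: only remember it for now.
    void processUnassembled(String chunk) {
        pending.add(chunk);
    }

    // Fully assembled chunk: flush anything pending first so document order is preserved.
    void processCompleted(String chunk) {
        flushPending();
        result.add(chunk);
    }

    // Joins the pending fragments into one piece of text and appends it to the result.
    void flushPending() {
        if (!pending.isEmpty()) {
            result.add(String.join("", pending));
            pending.clear();
        }
    }

    List<String> getResult() {
        return result;
    }
}
```

Flushing before appending is what keeps a nested element's text from jumping ahead of fragments that were seen earlier but not yet assembled.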
      • process

        public void process​(Word completed,
                            java.lang.String contextName)
        Specified by:
        process in interface TextAssembler
        Parameters:
        completed - a complete chunk; this subsection is simply added in the proper place.
        contextName - Name of the element context we are in. Null value if it's an Artifact.
        See Also:
        TextAssembler.process(Word, String)
      • clearAccumulator

        private void clearAccumulator()
      • concatenateResult

        private FinalText concatenateResult​(java.lang.String containingElementName)
      • endParsingContext

        public FinalText endParsingContext​(java.lang.String containingElementName)
        Specified by:
        endParsingContext in interface TextAssembler
        Parameters:
        containingElementName - the element name that should surround the extracted text
        Returns:
        the final text for the set of fragments and fully parsed items we were passed during processing.
        See Also:
        TextAssembler.endParsingContext(String)
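How a parsing context might be closed can be sketched as below. This is an illustration under stated assumptions rather than the actual implementation: the class `ContextSketch` is hypothetical, strings stand in for `FinalText`, and the sketch assumes that a null element name (an Artifact) yields no text and that wrapping happens only when the markup flag is set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of closing a parsing context; strings stand in for FinalText.
class ContextSketch {
    private final List<String> result = new ArrayList<>();
    private final boolean useMarkup; // plays the role of usePdfMarkupElements

    ContextSketch(boolean useMarkup) {
        this.useMarkup = useMarkup;
    }

    void add(String text) {
        result.add(text);
    }

    // Concatenates everything collected so far and empties the buffer.
    String endParsingContext(String elementName) {
        if (elementName == null) {
            // Assumption: a null name marks an Artifact, whose text is discarded.
            result.clear();
            return null;
        }
        boolean wrap = useMarkup && !elementName.isEmpty();
        StringBuilder sb = new StringBuilder();
        if (wrap) {
            sb.append('<').append(elementName).append('>');
        }
        for (String piece : result) {
            sb.append(piece);
        }
        if (wrap) {
            sb.append("</").append(elementName).append('>');
        }
        result.clear(); // the buffered pieces are now used up
        return sb.toString();
    }
}
```

Clearing the buffer on return means each context's text is emitted once, so an enclosing context can take the returned value as a single completed chunk.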
      • renderText

        public void renderText​(FinalText finalText)
        Specified by:
        renderText in interface TextAssembler
        Parameters:
        finalText - a complete chunk; this subsection is simply added in the proper place.
      • renderText

        public void renderText​(ParsedTextImpl partialWord)
        Captures text using a simplified algorithm for inserting hard returns and spaces.
        Specified by:
        renderText in interface TextAssembler
        Parameters:
        partialWord - one of a number of raw PDF text chunks, with placement, font, etc.
        See Also:
        GraphicsState, Matrix
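One simple way to decide where hard returns and spaces go is to compare each chunk's position with the previous one. The sketch below is an assumption about what such a simplified algorithm could look like, not the method's actual code: `SpacingSketch`, `Chunk`, the baseline tolerance, and the half-space-width gap threshold are all hypothetical.

```java
import java.util.List;

// Hypothetical sketch: position-based insertion of hard returns and spaces.
class SpacingSketch {
    // A raw text chunk with its horizontal extent and baseline (illustrative fields).
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float baselineY;

        Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    static String assemble(List<Chunk> chunks, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                boolean sameLine = Math.abs(c.baselineY - prev.baselineY) < 0.01f;
                if (!sameLine) {
                    sb.append('\n'); // baseline changed: hard return
                } else if (c.startX - prev.endX > spaceWidth / 2f
                        && sb.charAt(sb.length() - 1) != ' '
                        && !c.text.startsWith(" ")) {
                    sb.append(' '); // visible gap and no space already present
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }
}
```

The thresholds are deliberately crude; real PDF content may need font-aware space widths and tolerance for rotated or superscript text.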
      • getReader

        protected PdfReader getReader()
        Getter.
        Returns:
        reader
      • getWordId

        public java.lang.String getWordId()
        The assembler can calculate an identifier for each word on a page, for use in markup.
        Specified by:
        getWordId in interface TextAssembler
        Returns:
        the new unique id.
        See Also:
        TextAssembler.getWordId()
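A counter-based identifier scheme of the kind described for `getWordId` might look like the sketch below. The `WordIdSketch` class and the `p<page>w<n>` format are assumptions made for illustration; the documentation above does not specify the real format.

```java
// Hypothetical sketch of per-page word identifiers driven by a counter.
class WordIdSketch {
    private final int page;        // plays the role of the page field
    private int wordIdCounter = 1; // plays the role of the wordIdCounter field

    WordIdSketch(int page) {
        this.page = page;
    }

    // Returns a new id on every call; the format is illustrative only.
    String getWordId() {
        return "p" + page + "w" + (wordIdCounter++);
    }
}
```

Because the counter advances on each call, two calls never return the same id for a given page.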