Class TextExtractor

  • All Implemented Interfaces:
    Callback

    @Deprecated
    public class TextExtractor
    extends DefaultCallback
    Deprecated.
    This class is obsolete and kept around for backward compatibility only.
    A callback extracting text and titles.

    This callback extracts all the text in the page, together with its title. The extracted text is available through the text field, and the title through the title field.

    Note that text and title are never trimmed.

    • Field Detail

      • text

        public final MutableString text
        Deprecated.
        The text resulting from the parsing process.
      • title

        public final MutableString title
        Deprecated.
        The title resulting from the parsing process.
    • Constructor Detail

      • TextExtractor

        public TextExtractor()
        Deprecated.
    • Method Detail

      • startDocument

        public void startDocument()
        Deprecated.
        Description copied from interface: Callback
        Receive notification of the beginning of the document.

        The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.

        Specified by:
        startDocument in interface Callback
        Overrides:
        startDocument in class DefaultCallback
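
        The reset contract described for startDocument can be sketched as follows. This is a minimal illustration, not the actual TextExtractor source: the class ResettableTextCollector and its StringBuilder fields are hypothetical stand-ins for the callback and its MutableString fields.

```java
// Hypothetical stand-in, NOT the library's TextExtractor: it only mirrors
// the reset contract described for startDocument().
final class ResettableTextCollector {
    // Allocated once and reused across documents, like final mutable fields.
    final StringBuilder text = new StringBuilder();
    final StringBuilder title = new StringBuilder();

    /** Clears all per-document state; calling it repeatedly is harmless. */
    void startDocument() {
        text.setLength(0);
        title.setLength(0);
    }
}

final class ResetContractDemo {
    public static void main(String[] args) {
        ResettableTextCollector collector = new ResettableTextCollector();
        collector.text.append("left over from a previous document");
        collector.startDocument();
        collector.startDocument(); // a second call must be safe
        System.out.println("length after reset: " + collector.text.length());
    }
}
```

        Emptying the existing buffers rather than allocating new ones keeps the fields final, so code holding a reference to them keeps seeing the current document's content.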
      • characters

        public boolean characters(char[] characters,
                                  int offset,
                                  int length,
                                  boolean flowBroken)
        Deprecated.
        Description copied from interface: Callback
        Receive notification of character data inside an element.

        You must not write into the characters array, as it could be passed around to many callbacks.

        flowBroken will be true if and only if the flow was broken before this block of characters. This feature makes it possible to quickly extract the text in a document without looking at the elements.

        Specified by:
        characters in interface Callback
        Overrides:
        characters in class DefaultCallback
        Parameters:
        characters - an array containing the character data.
        offset - the start position in the array.
        length - the number of characters to read from the array.
        flowBroken - whether the flow is broken at the start of this block of characters.
        Returns:
        true to keep the parser parsing, false to stop it.
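
        A callback honouring the contract above could look like the following sketch. It is not the library's implementation: AppendingCollector is a hypothetical class, and rendering a broken flow as a single space is an assumption made for illustration only.

```java
// Hypothetical stand-in that mirrors the characters() contract described above.
final class AppendingCollector {
    final StringBuilder text = new StringBuilder();

    boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        // Assumption: a broken flow is rendered as one separating space.
        if (flowBroken && text.length() > 0 && text.charAt(text.length() - 1) != ' ') {
            text.append(' ');
        }
        // Copy only the slice [offset, offset + length); the shared array is
        // read but never modified.
        text.append(characters, offset, length);
        return true; // keep the parser parsing
    }
}

final class CharactersContractDemo {
    public static void main(String[] args) {
        AppendingCollector collector = new AppendingCollector();
        char[] buffer = "xxHelloWorldxx".toCharArray();
        collector.characters(buffer, 2, 5, false);
        collector.characters(buffer, 7, 5, true);
        System.out.println(collector.text);
    }
}
```

        Because the slice is copied into the collector's own buffer, the parser remains free to reuse or overwrite the array after the call returns.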
      • endElement

        public boolean endElement(Element element)
        Deprecated.
        Description copied from interface: Callback
        Receive notification of the end of an element. Warning: unless specific decorators are used, a callback will in general only receive notifications for elements whose closing tag appears explicitly in the document.

        This method is never called for elements that do not have closing tags, even if such a tag is found in the document.

        Specified by:
        endElement in interface Callback
        Overrides:
        endElement in class DefaultCallback
        Parameters:
        element - the element whose closing tag was found.
        Returns:
        true to keep the parser parsing, false to stop it.
      • startElement

        public boolean startElement(Element element,
                                    java.util.Map<Attribute,MutableString> attrMapUnused)
        Deprecated.
        Description copied from interface: Callback
        Receive notification of the start of an element.

        For simple elements, this is the only notification that the callback will ever receive.

        Specified by:
        startElement in interface Callback
        Overrides:
        startElement in class DefaultCallback
        Parameters:
        element - the element whose opening tag was found.
        attrMapUnused - a map from Attributes to MutableStrings.
        Returns:
        true to keep the parser parsing, false to stop it.
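
        The interplay of startElement, characters and endElement when collecting a title can be sketched as follows. This is an illustration under stated assumptions, not the actual TextExtractor logic: TitleTracker is hypothetical, and elements are represented by plain lowercase names instead of the library's Element type.

```java
// Hypothetical stand-in: tracks whether the parser is inside a title element
// and collects only the characters seen there.
final class TitleTracker {
    final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    void startDocument() {
        title.setLength(0);
        inTitle = false;
    }

    boolean startElement(String elementName) {
        if ("title".equals(elementName)) inTitle = true;
        return true; // keep the parser parsing
    }

    boolean characters(char[] characters, int offset, int length) {
        if (inTitle) title.append(characters, offset, length);
        return true;
    }

    boolean endElement(String elementName) {
        // Only reached when the closing tag appears explicitly in the document.
        if ("title".equals(elementName)) inTitle = false;
        return true;
    }
}

final class TitleTrackerDemo {
    public static void main(String[] args) {
        TitleTracker tracker = new TitleTracker();
        tracker.startDocument();
        tracker.startElement("title");
        char[] chunk = " A page ".toCharArray();
        tracker.characters(chunk, 0, chunk.length);
        tracker.endElement("title");
        System.out.println("[" + tracker.title + "]");
    }
}
```

        Note that the collected title keeps its surrounding whitespace, matching the statement that text and title are never trimmed; callers wanting a clean value would trim it themselves.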