Class HTMLScanner

  • All Implemented Interfaces:
    HTMLComponent, XMLComponent, XMLDocumentScanner, XMLDocumentSource, XMLLocator

    public class HTMLScanner
    extends java.lang.Object
    implements XMLDocumentScanner, XMLLocator, HTMLComponent
    A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it scans what it can and generates XNI document "events", ignoring errors of all kinds.

    This component recognizes the following features:

    • http://cyberneko.org/html/features/augmentations
    • http://cyberneko.org/html/features/report-errors
    • http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/script/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/style/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/ignore-specified-charset
    • http://cyberneko.org/html/features/scanner/cdata-sections
    • http://cyberneko.org/html/features/override-doctype
    • http://cyberneko.org/html/features/insert-doctype
    • http://cyberneko.org/html/features/parse-noscript-content
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
    • http://cyberneko.org/html/features/scanner/normalize-attrs
    • http://cyberneko.org/html/features/scanner/plain-attr-values

    This component recognizes the following properties:

    • http://cyberneko.org/html/properties/names/elems
    • http://cyberneko.org/html/properties/names/attrs
    • http://cyberneko.org/html/properties/default-encoding
    • http://cyberneko.org/html/properties/error-reporter
    • http://cyberneko.org/html/properties/encoding-translator
    • http://cyberneko.org/html/properties/doctype/pubid
    • http://cyberneko.org/html/properties/doctype/sysid
    See Also:
    HTMLElements
    • Field Detail

      • HTML_4_01_STRICT_PUBID

        public static final java.lang.String HTML_4_01_STRICT_PUBID
        HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_STRICT_SYSID

        public static final java.lang.String HTML_4_01_STRICT_SYSID
        HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_PUBID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
        HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_SYSID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
        HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_PUBID

        public static final java.lang.String HTML_4_01_FRAMESET_PUBID
        HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_SYSID

        public static final java.lang.String HTML_4_01_FRAMESET_SYSID
        HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
        See Also:
        Constant Field Values
      • AUGMENTATIONS

        public static final java.lang.String AUGMENTATIONS
        Include infoset augmentations.
        See Also:
        Constant Field Values
      • REPORT_ERRORS

        public static final java.lang.String REPORT_ERRORS
        Report errors.
        See Also:
        Constant Field Values
      • SCRIPT_STRIP_COMMENT_DELIMS

        public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
        See Also:
        Constant Field Values
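As a rough illustration of what stripping comment delimiters means for SCRIPT content, the sketch below removes one leading "<!--" and one trailing "-->" from a string. It is not the scanner's implementation: the class and method names are hypothetical, and trimming the surrounding whitespace first is a simplification made for this example.

```java
// Hypothetical helper, not part of NekoHTML: shows the visible effect of
// stripping comment delimiters from the text inside a SCRIPT element.
final class CommentDelimSketch {
    private static final String OPEN = "<!--";
    private static final String CLOSE = "-->";

    static String stripCommentDelims(String content) {
        // Simplification: ignore whitespace around the delimiters.
        String text = content.trim();
        if (text.startsWith(OPEN)) {
            text = text.substring(OPEN.length());
        }
        if (text.endsWith(CLOSE)) {
            text = text.substring(0, text.length() - CLOSE.length());
        }
        return text;
    }
}
```

Calling `CommentDelimSketch.stripCommentDelims("<!-- alert(1); -->")` leaves only the script text between the two delimiters.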
      • SCRIPT_STRIP_CDATA_DELIMS

        public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_COMMENT_DELIMS

        public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_CDATA_DELIMS

        public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
        See Also:
        Constant Field Values
      • IGNORE_SPECIFIED_CHARSET

        public static final java.lang.String IGNORE_SPECIFIED_CHARSET
        Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
        See Also:
        Constant Field Values
      • CDATA_SECTIONS

        public static final java.lang.String CDATA_SECTIONS
        Scan CDATA sections.
        See Also:
        Constant Field Values
      • OVERRIDE_DOCTYPE

        public static final java.lang.String OVERRIDE_DOCTYPE
        Override doctype declaration public and system identifiers.
        See Also:
        Constant Field Values
      • INSERT_DOCTYPE

        public static final java.lang.String INSERT_DOCTYPE
        Insert document type declaration.
        See Also:
        Constant Field Values
      • PARSE_NOSCRIPT_CONTENT

        public static final java.lang.String PARSE_NOSCRIPT_CONTENT
        Parse <noscript>...</noscript> content.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_IFRAME

        public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
        Allows self-closing <iframe/> tags.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_TAGS

        public static final java.lang.String ALLOW_SELFCLOSING_TAGS
        Allows self-closing tags, e.g. <div/> (XHTML).
        See Also:
        Constant Field Values
      • NORMALIZE_ATTRIBUTES

        public static final java.lang.String NORMALIZE_ATTRIBUTES
        Normalize attribute values.
        See Also:
        Constant Field Values
      • PLAIN_ATTRIBUTE_VALUES

        public static final java.lang.String PLAIN_ATTRIBUTE_VALUES
        Store the plain attribute values also.
        See Also:
        Constant Field Values
      • RECOGNIZED_FEATURES

        private static final java.lang.String[] RECOGNIZED_FEATURES
        Recognized features.
      • RECOGNIZED_FEATURES_DEFAULTS

        private static final java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS
        Recognized features defaults.
      • NAMES_ELEMS

        public static final java.lang.String NAMES_ELEMS
        Modify HTML element names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • NAMES_ATTRS

        public static final java.lang.String NAMES_ATTRS
        Modify HTML attribute names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
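The three accepted values for the name properties can be read as a small mode switch. The sketch below assumes that "upper" and "lower" select case conversion and that any other value, including "default", leaves names untouched. The class name, method names, and numeric constants are hypothetical; the real NAMES_* constant values are not shown in this document.

```java
import java.util.Locale;

// Hypothetical illustration of mapping a names property value to a mode
// and applying that mode to an element or attribute name.
final class NameModeSketch {
    // Assumed values for illustration only.
    static final short NO_CHANGE = 0;
    static final short UPPERCASE = 1;
    static final short LOWERCASE = 2;

    static short parseMode(String value) {
        if ("upper".equals(value)) {
            return UPPERCASE;
        }
        if ("lower".equals(value)) {
            return LOWERCASE;
        }
        return NO_CHANGE; // "default" and anything unrecognized
    }

    static String apply(String name, short mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```

Using `Locale.ROOT` keeps the conversion independent of the JVM's default locale, which matters for names containing characters such as "i" under Turkish locale rules.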
      • DEFAULT_ENCODING

        public static final java.lang.String DEFAULT_ENCODING
        Default encoding.
        See Also:
        Constant Field Values
      • ERROR_REPORTER

        public static final java.lang.String ERROR_REPORTER
        Error reporter.
        See Also:
        Constant Field Values
      • ENCODING_TRANSLATOR

        public static final java.lang.String ENCODING_TRANSLATOR
        Encoding translator.
        See Also:
        Constant Field Values
      • DOCTYPE_PUBID

        public static final java.lang.String DOCTYPE_PUBID
        Doctype declaration public identifier.
        See Also:
        Constant Field Values
      • DOCTYPE_SYSID

        public static final java.lang.String DOCTYPE_SYSID
        Doctype declaration system identifier.
        See Also:
        Constant Field Values
      • RECOGNIZED_PROPERTIES

        private static final java.lang.String[] RECOGNIZED_PROPERTIES
        Recognized properties.
      • RECOGNIZED_PROPERTIES_DEFAULTS

        private static final java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS
        Recognized properties defaults.
      • STATE_CONTENT

        protected static final short STATE_CONTENT
        State: content.
        See Also:
        Constant Field Values
      • STATE_MARKUP_BRACKET

        protected static final short STATE_MARKUP_BRACKET
        State: markup bracket.
        See Also:
        Constant Field Values
      • STATE_START_DOCUMENT

        protected static final short STATE_START_DOCUMENT
        State: start document.
        See Also:
        Constant Field Values
      • STATE_END_DOCUMENT

        protected static final short STATE_END_DOCUMENT
        State: end document.
        See Also:
        Constant Field Values
      • NAMES_NO_CHANGE

        protected static final short NAMES_NO_CHANGE
        Don't modify HTML names.
        See Also:
        Constant Field Values
      • NAMES_UPPERCASE

        protected static final short NAMES_UPPERCASE
        Uppercase HTML names.
        See Also:
        Constant Field Values
      • NAMES_LOWERCASE

        protected static final short NAMES_LOWERCASE
        Lowercase HTML names.
        See Also:
        Constant Field Values
      • DEBUG_SCANNER

        private static final boolean DEBUG_SCANNER
        Set to true to debug changes in the scanner.
        See Also:
        Constant Field Values
      • DEBUG_SCANNER_STATE

        private static final boolean DEBUG_SCANNER_STATE
        Set to true to debug changes in the scanner state.
        See Also:
        Constant Field Values
      • DEBUG_BUFFER

        private static final boolean DEBUG_BUFFER
        Set to true to debug the buffer.
        See Also:
        Constant Field Values
      • DEBUG_CHARSET

        private static final boolean DEBUG_CHARSET
        Set to true to debug character encoding handling.
        See Also:
        Constant Field Values
      • DEBUG_CALLBACKS

        protected static final boolean DEBUG_CALLBACKS
        Set to true to debug callbacks.
        See Also:
        Constant Field Values
      • SYNTHESIZED_ITEM

        protected static final HTMLEventInfo SYNTHESIZED_ITEM
        Synthesized event info item.
      • fAugmentations_

        private boolean fAugmentations_
        Augmentations.
      • fReportErrors_

        boolean fReportErrors_
        Report errors.
      • fScriptStripCDATADelims_

        boolean fScriptStripCDATADelims_
        Strip CDATA delimiters from SCRIPT tags.
      • fScriptStripCommentDelims_

        boolean fScriptStripCommentDelims_
        Strip comment delimiters from SCRIPT tags.
      • fStyleStripCDATADelims_

        boolean fStyleStripCDATADelims_
        Strip CDATA delimiters from STYLE tags.
      • fStyleStripCommentDelims_

        boolean fStyleStripCommentDelims_
        Strip comment delimiters from STYLE tags.
      • fIgnoreSpecifiedCharset_

        boolean fIgnoreSpecifiedCharset_
        Ignore specified character set.
      • fCDATASections_

        boolean fCDATASections_
        CDATA sections.
      • fOverrideDoctype_

        private boolean fOverrideDoctype_
        Override doctype declaration public and system identifiers.
      • fInsertDoctype_

        boolean fInsertDoctype_
        Insert document type declaration.
      • fNormalizeAttributes_

        boolean fNormalizeAttributes_
        Normalize attribute values.
      • fPlainAttributeValues_

        boolean fPlainAttributeValues_
        Store the plain attribute values also.
      • fParseNoScriptContent_

        boolean fParseNoScriptContent_
        Parse noscript content.
      • fAllowSelfclosingIframe_

        boolean fAllowSelfclosingIframe_
        Allows self closing iframe tags.
      • fAllowSelfclosingTags_

        boolean fAllowSelfclosingTags_
        Allows self closing tags.
      • fNamesElems

        protected short fNamesElems
        Modify HTML element names.
      • fNamesAttrs

        protected short fNamesAttrs
        Modify HTML attribute names.
      • fDefaultIANAEncoding

        protected java.lang.String fDefaultIANAEncoding
        Default encoding.
      • fDoctypePubid

        protected java.lang.String fDoctypePubid
        Doctype declaration public identifier.
      • fDoctypeSysid

        protected java.lang.String fDoctypeSysid
        Doctype declaration system identifier.
      • fBeginLineNumber

        protected int fBeginLineNumber
        Beginning line number.
      • fBeginColumnNumber

        protected int fBeginColumnNumber
        Beginning column number.
      • fBeginCharacterOffset

        protected int fBeginCharacterOffset
        Beginning character offset in the file.
      • fEndLineNumber

        protected int fEndLineNumber
        Ending line number.
      • fEndColumnNumber

        protected int fEndColumnNumber
        Ending column number.
      • fEndCharacterOffset

        protected int fEndCharacterOffset
        Ending character offset in the file.
      • fScannerState

        protected short fScannerState
        The current scanner state.
      • fIANAEncoding

        protected java.lang.String fIANAEncoding
        Auto-detected IANA encoding.
      • fJavaEncoding

        protected java.lang.String fJavaEncoding
        Auto-detected Java encoding.
      • fElementCount

        protected int fElementCount
        Element count.
      • fElementDepth

        protected int fElementDepth
        Element depth.
      • fSpecialScanner

        protected final HTMLScanner.SpecialScanner fSpecialScanner
        Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
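To make the idea of "scanned as plain text" concrete, the sketch below collects everything up to the matching end tag without interpreting markup or entity references along the way. It is only an approximation under stated assumptions: the class and method are hypothetical, it works on a complete string rather than a streaming buffer, and it matches the end tag case-insensitively.

```java
// Hypothetical illustration: raw content of an element such as SCRIPT is
// everything up to the next "</name", with no markup interpretation.
final class SpecialContentSketch {
    static String rawContent(String html, int from, String elementName) {
        String endTag = "</" + elementName;
        for (int i = from; i + endTag.length() <= html.length(); i++) {
            // regionMatches with ignoreCase=true accepts </script and </SCRIPT
            if (html.regionMatches(true, i, endTag, 0, endTag.length())) {
                return html.substring(from, i);
            }
        }
        return html.substring(from); // no end tag found: take the rest
    }
}
```

For the input `<script>if (a<b) x='&amp;';</SCRIPT>done` with `from` pointing just past the start tag, the "<" and "&amp;" inside the script are returned verbatim.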
      • fStringBuffer

        protected final XMLString fStringBuffer
        String buffer.
      • fStringBufferEntiyRef

        final XMLString fStringBufferEntiyRef
        String buffer used when resolving entity refs.
      • fStringBufferPlainAttribValue

        final XMLString fStringBufferPlainAttribValue
      • fScanScriptContent

        final XMLString fScanScriptContent
        String buffer; larger because script areas are larger.
      • fScanUntilEndTag

        final XMLString fScanUntilEndTag
      • fScanLiteral

        private final XMLString fScanLiteral
      • fSingleBoolean

        final boolean[] fSingleBoolean
        Single boolean array.
    • Constructor Detail

      • HTMLScanner

        HTMLScanner​(HTMLConfiguration htmlConfiguration)
        Creates a new HTMLScanner with the given configuration.
        Parameters:
        htmlConfiguration - the configuration to use
    • Method Detail

      • pushInputSource

        public void pushInputSource​(XMLInputSource inputSource)
        Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns where it left off at the time this entity source was pushed.

        Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

        Parameters:
        inputSource - The new input source to start scanning.
        See Also:
        evaluateInputSource(XMLInputSource)
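The push-and-resume behaviour described for pushInputSource can be pictured as a stack of readers. The sketch below is a simplified model rather than the scanner's code: the class is hypothetical and uses plain java.io.Reader objects in place of XMLInputSource and the scanner's internal entity type.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of an entity stack: pushing a source suspends the
// current one, and reading resumes it once the pushed source is exhausted.
final class EntityStackSketch {
    private final Deque<Reader> suspended = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader initial) {
        current = initial;
    }

    void push(Reader source) {
        suspended.push(current);
        current = source;
    }

    /** Returns the next character, or -1 when every source is exhausted. */
    int read() {
        try {
            while (true) {
                int c = current.read();
                if (c != -1 || suspended.isEmpty()) {
                    return c;
                }
                current.close();           // finished with the pushed source
                current = suspended.pop(); // resume where we left off
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading one character from "ab", pushing "XY", and then reading to the end yields the pushed content between the two original characters.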
      • getReader

        private java.io.Reader getReader​(XMLInputSource inputSource)
      • evaluateInputSource

        public void evaluateInputSource​(XMLInputSource inputSource)
        Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
        Parameters:
        inputSource - The new input source to start evaluating.
        See Also:
        pushInputSource(XMLInputSource)
      • cleanup

        public void cleanup​(boolean closeall)
        Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
        Parameters:
        closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
      • getEncoding

        public java.lang.String getEncoding()
        Returns the encoding.
        Specified by:
        getEncoding in interface XMLLocator
        Returns:
        the encoding of the current entity. Note that, for a given entity, this value can only be considered final once the encoding declaration has been read (or once it has been determined that there is no such declaration) since, no encoding having been specified on the XMLInputSource, the parser will make an initial "guess" which could be in error.
      • getPublicId

        public java.lang.String getPublicId()
        Returns the public identifier.
        Specified by:
        getPublicId in interface XMLLocator
        Returns:
        the public identifier.
      • getBaseSystemId

        public java.lang.String getBaseSystemId()
        Returns the base system identifier.
        Specified by:
        getBaseSystemId in interface XMLLocator
        Returns:
        the base system identifier.
      • getLiteralSystemId

        public java.lang.String getLiteralSystemId()
        Returns the literal system identifier.
        Specified by:
        getLiteralSystemId in interface XMLLocator
        Returns:
        the literal system identifier.
      • getExpandedSystemId

        public java.lang.String getExpandedSystemId()
        Returns the expanded system identifier.
        Specified by:
        getExpandedSystemId in interface XMLLocator
        Returns:
        the expanded system identifier.
      • getLineNumber

        public int getLineNumber()
        Returns the current line number.
        Specified by:
        getLineNumber in interface XMLLocator
        Returns:
        the line number, or -1 if no line number is available.
      • getColumnNumber

        public int getColumnNumber()
        Returns the current column number.
        Specified by:
        getColumnNumber in interface XMLLocator
        Returns:
        the column number, or -1 if no column number is available.
      • getXMLVersion

        public java.lang.String getXMLVersion()
        Returns the XML version.
        Specified by:
        getXMLVersion in interface XMLLocator
        Returns:
        the XML version of the current entity. This will normally be the value from the XML or text declaration or defaulted by the parser. Note that this value may be different than the version of the processing rules applied to the current entity. For instance, an XML 1.1 document may refer to XML 1.0 entities. In such a case the rules of XML 1.1 are applied to the entire document. Also note that, for a given entity, this value can only be considered final once the XML or text declaration has been read or once it has been determined that there is no such declaration.
      • getCharacterOffset

        public int getCharacterOffset()
        Returns the character offset.
        Specified by:
        getCharacterOffset in interface XMLLocator
        Returns:
        the character offset, or -1 if no character offset is available.
      • getFeatureDefault

        public java.lang.Boolean getFeatureDefault​(java.lang.String featureId)
        Returns the default state for a feature.
        Specified by:
        getFeatureDefault in interface HTMLComponent
        Specified by:
        getFeatureDefault in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        Returns:
        the default state for a feature, or null if this component does not want to report a default value for this feature.
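The private RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS fields suggest a pair of parallel arrays. The sketch below shows one way such a lookup could work, assuming a linear search and a null result for unknown or unreported features. The class name, feature identifiers, and default values here are placeholders invented for the example, not the component's actual data.

```java
// Hypothetical illustration of a default lookup over parallel arrays.
final class FeatureDefaultsSketch {
    // Placeholder identifiers and defaults, for illustration only.
    private static final String[] FEATURES = {
        "urn:example:features/a",
        "urn:example:features/b",
    };
    private static final Boolean[] DEFAULTS = {
        null,         // recognized, but no default is reported
        Boolean.TRUE,
    };

    static Boolean featureDefault(String featureId) {
        for (int i = 0; i < FEATURES.length; i++) {
            if (FEATURES[i].equals(featureId)) {
                return DEFAULTS[i];
            }
        }
        return null; // not recognized by this component
    }
}
```

Returning a `Boolean` rather than a `boolean` is what makes the three-way answer (true, false, no default) possible.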
      • getPropertyDefault

        public java.lang.Object getPropertyDefault​(java.lang.String propertyId)
        Returns the default state for a property.
        Specified by:
        getPropertyDefault in interface HTMLComponent
        Specified by:
        getPropertyDefault in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        Returns:
        the default state for a property, or null if this component does not want to report a default value for this property.
      • getRecognizedFeatures

        public java.lang.String[] getRecognizedFeatures()
        Returns recognized features.
        Specified by:
        getRecognizedFeatures in interface XMLComponent
        Returns:
        an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
      • getRecognizedProperties

        public java.lang.String[] getRecognizedProperties()
        Returns recognized properties.
        Specified by:
        getRecognizedProperties in interface XMLComponent
        Returns:
        an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
      • setFeature

        public void setFeature​(java.lang.String featureId,
                               boolean state)
        Sets a feature.
        Specified by:
        setFeature in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        state - The state of the feature.
      • setProperty

        public void setProperty​(java.lang.String propertyId,
                                java.lang.Object value)
                         throws XMLConfigurationException
        Sets a property.
        Specified by:
        setProperty in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        value - The value of the property.
        Throws:
        XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
      • setInputSource

        public void setInputSource​(XMLInputSource source)
                            throws java.io.IOException
        Sets the input source.
        Specified by:
        setInputSource in interface XMLDocumentScanner
        Parameters:
        source - The input source.
        Throws:
        java.io.IOException - Thrown on i/o error.
      • scanDocument

        public boolean scanDocument​(boolean complete)
                             throws XNIException,
                                    java.io.IOException
        Scans the document.
        Specified by:
        scanDocument in interface XMLDocumentScanner
        Parameters:
        complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
        Returns:
        True if there is more to scan, false otherwise.
        Throws:
        XNIException - on error.
        java.io.IOException - Thrown on i/o error.
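The difference between complete and incremental ("pull") scanning can be sketched with a toy scanner. Everything below is a stand-in written for this example: `PullScanner` only mirrors the shape of scanDocument(boolean), and `TokenScanner` treats each array element as one "portion" of a document. The exceptions declared by the real method are omitted.

```java
import java.util.List;

// Hypothetical illustration of the pull-scanning contract.
final class PullLoopSketch {
    /** Stand-in for the scanDocument(boolean) contract. */
    interface PullScanner {
        boolean scanDocument(boolean complete);
    }

    /** Toy scanner that reports one token per incremental call. */
    static final class TokenScanner implements PullScanner {
        private final String[] tokens;
        private final List<String> events;
        private int next;

        TokenScanner(String[] tokens, List<String> events) {
            this.tokens = tokens;
            this.events = events;
        }

        @Override
        public boolean scanDocument(boolean complete) {
            do {
                if (next < tokens.length) {
                    events.add(tokens[next++]);
                }
            } while (complete && next < tokens.length);
            return next < tokens.length; // true while there is more to scan
        }
    }

    /** Drives a scanner incrementally and returns the number of calls made. */
    static int drive(PullScanner scanner) {
        int calls = 1;
        while (scanner.scanDocument(false)) {
            calls++;
        }
        return calls;
    }
}
```

With three tokens, `drive` makes three calls, whereas a single `scanDocument(true)` call delivers all events at once and returns false.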
      • getValue

        protected static java.lang.String getValue​(XMLAttributes attrs,
                                                   java.lang.String aname)
      • expandSystemId

        public static java.lang.String expandSystemId​(java.lang.String systemId,
                                                      java.lang.String baseSystemId)
        Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
        Parameters:
        systemId - The systemId to be expanded.
        baseSystemId - The base system identifier against which systemId is expanded.
        Returns:
        Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
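Expanding a relative system identifier against a base is close in spirit to URI resolution. The sketch below uses java.net.URI for that purpose. It is an approximation, not the method's implementation: the class and method names are hypothetical, and unlike the documented contract it returns the identifier itself rather than null when the identifier is already absolute.

```java
import java.net.URI;

// Hypothetical illustration of resolving a system id against a base id.
final class SystemIdSketch {
    static String resolve(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);
        if (id.isAbsolute() || baseSystemId == null) {
            return id.toString(); // nothing to expand
        }
        return URI.create(baseSystemId).resolve(id).toString();
    }
}
```

For example, "page.html" resolved against "http://example.com/dir/index.html" replaces the last path segment of the base.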
      • fixURI

        protected static java.lang.String fixURI​(java.lang.String str)
        Fixes a platform dependent filename to standard URI form.
        Parameters:
        str - The string to fix.
        Returns:
        Returns the fixed URI string.
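What "standard URI form" involves is not spelled out here. One plausible reading, shown in the sketch below, is replacing the platform's separator with "/" and prefixing a slash to drive-letter paths. The class and method are hypothetical, and the separator is passed in explicitly so the example behaves the same on every platform.

```java
// Hypothetical illustration of normalizing a platform-dependent filename.
final class FixUriSketch {
    static String fix(String path, char separator) {
        String s = path.replace(separator, '/');
        // A drive prefix such as "C:/docs" gets a leading slash.
        if (s.length() >= 2 && s.charAt(1) == ':' && Character.isLetter(s.charAt(0))) {
            s = "/" + s;
        }
        return s;
    }
}
```

In real code the separator argument would typically be `java.io.File.separatorChar`.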
      • modifyName

        protected static java.lang.String modifyName​(java.lang.String name,
                                                     short mode)
      • getNamesValue

        protected static short getNamesValue​(java.lang.String value)
      • setScannerState

        protected void setScannerState​(short state)
      • scanDoctype

        protected void scanDoctype()
                            throws java.io.IOException
        Throws:
        java.io.IOException
      • scanLiteral

        protected java.lang.String scanLiteral()
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • scanName

        protected java.lang.String scanName​(boolean strict)
                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • scanTagName

        protected java.lang.String scanTagName()
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • scanEntityRef

        protected int scanEntityRef​(XMLString str,
                                    XMLString plainValue,
                                    boolean content)
                             throws java.io.IOException
        Throws:
        java.io.IOException
      • returnEntityRefString

        private int returnEntityRefString​(XMLString str,
                                          boolean content)
      • skip

        protected boolean skip​(java.lang.String s,
                               boolean caseSensitive)
                        throws java.io.IOException
        Throws:
        java.io.IOException
      • skipMarkup

        protected boolean skipMarkup​(boolean balance)
                              throws java.io.IOException
        Throws:
        java.io.IOException
      • skipSpaces

        protected boolean skipSpaces()
                              throws java.io.IOException
        Throws:
        java.io.IOException
      • skipNewlines

        protected int skipNewlines()
                            throws java.io.IOException
        Throws:
        java.io.IOException
      • synthesizedAugs

        protected final Augmentations synthesizedAugs()
      • builtinXmlRef

        protected static boolean builtinXmlRef​(java.lang.String name)
      • isEncodingCompatible

        static boolean isEncodingCompatible​(java.lang.String encoding1,
                                            java.lang.String encoding2)
        Detects whether two encodings are compatible. To be compatible, both must be able to read the meta tag specifying the new encoding. This means that the byte representation of some minimal HTML markup must be the same in both encodings.
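The byte-level comparison described above can be sketched with java.nio.charset. This is an illustration under assumptions, not the method's code: the class name is hypothetical, and the probe markup is a made-up example, since the markup the scanner actually compares is not given in this document.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration: two encodings are treated as compatible when
// a small piece of ASCII-only markup encodes to identical bytes in both.
final class EncodingCompatSketch {
    // Assumed probe markup, chosen for this example.
    private static final String PROBE =
        "<meta http-equiv='Content-Type' content='text/html; charset=x'>";

    static boolean sameBytesForProbe(String encoding1, String encoding2) {
        if (!Charset.isSupported(encoding1) || !Charset.isSupported(encoding2)) {
            return false;
        }
        byte[] first = PROBE.getBytes(Charset.forName(encoding1));
        byte[] second = PROBE.getBytes(Charset.forName(encoding2));
        return Arrays.equals(first, second);
    }
}
```

UTF-8 and ISO-8859-1 agree on ASCII characters, so they pass this check; UTF-8 and UTF-16 do not, because UTF-16 uses two bytes per character.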
      • canRoundtrip

        private static boolean canRoundtrip​(java.lang.String encodeCharset,
                                            java.lang.String decodeCharset)
                                     throws java.io.UnsupportedEncodingException
        Throws:
        java.io.UnsupportedEncodingException
      • readPreservingBufferContent

        protected int readPreservingBufferContent()
                                           throws java.io.IOException
        Throws:
        java.io.IOException