Class HTMLScanner

  • All Implemented Interfaces:
    org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLLocator, HTMLComponent

    public class HTMLScanner
    extends java.lang.Object
    implements org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.XMLLocator, HTMLComponent
    A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.

    This component recognizes the following features:

    • http://cyberneko.org/html/features/augmentations
    • http://cyberneko.org/html/features/report-errors
    • http://apache.org/xml/features/scanner/notify-char-refs
    • http://apache.org/xml/features/scanner/notify-builtin-refs
    • http://cyberneko.org/html/features/scanner/notify-builtin-refs
    • http://cyberneko.org/html/features/scanner/fix-mswindows-refs
    • http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/script/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
    • http://cyberneko.org/html/features/scanner/style/strip-comment-delims
    • http://cyberneko.org/html/features/scanner/ignore-specified-charset
    • http://cyberneko.org/html/features/scanner/cdata-sections
    • http://cyberneko.org/html/features/override-doctype
    • http://cyberneko.org/html/features/insert-doctype
    • http://cyberneko.org/html/features/parse-noscript-content
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
    • http://cyberneko.org/html/features/scanner/allow-selfclosing-tags

    This component recognizes the following properties:

    • http://cyberneko.org/html/properties/names/elems
    • http://cyberneko.org/html/properties/names/attrs
    • http://cyberneko.org/html/properties/default-encoding
    • http://cyberneko.org/html/properties/error-reporter
    • http://cyberneko.org/html/properties/doctype/pubid
    • http://cyberneko.org/html/properties/doctype/sysid
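The feature and property identifiers listed above are plain URI strings. The following is a minimal sketch of setting one of each, assuming the scanner is reached through a SAX `org.xml.sax.XMLReader` front end (an assumption; this page does not describe how the scanner is wired into a parser). The class `ScannerSettingsSketch` and the method `tryApply` are hypothetical names.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper for illustration; not part of NekoHTML.
final class ScannerSettingsSketch {
    // Identifiers copied from the feature and property lists above.
    static final String NOTIFY_HTML_BUILTIN_REFS =
            "http://cyberneko.org/html/features/scanner/notify-builtin-refs";
    static final String NAMES_ELEMS =
            "http://cyberneko.org/html/properties/names/elems";

    /** Returns false if the reader rejects either identifier or value. */
    static boolean tryApply(XMLReader reader) {
        try {
            reader.setFeature(NOTIFY_HTML_BUILTIN_REFS, true);
            // "lower" is one of the documented values: "upper", "lower", "default".
            reader.setProperty(NAMES_ELEMS, "lower");
            return true;
        } catch (SAXException e) {
            // SAXNotRecognizedException and SAXNotSupportedException extend SAXException.
            return false;
        }
    }
}
```

Returning a boolean instead of propagating the exception is only a convenience of this sketch; callers that need to know which identifier was rejected would let the exception through.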
    Version:
    $Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
    Author:
    Andy Clark, Marc Guillemot, Ahmed Ashour
    See Also:
    HTMLElements, HTMLEntities
    • Field Detail

      • HTML_4_01_STRICT_PUBID

        public static final java.lang.String HTML_4_01_STRICT_PUBID
        HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_STRICT_SYSID

        public static final java.lang.String HTML_4_01_STRICT_SYSID
        HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_PUBID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
        HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_TRANSITIONAL_SYSID

        public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
        HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_PUBID

        public static final java.lang.String HTML_4_01_FRAMESET_PUBID
        HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
        See Also:
        Constant Field Values
      • HTML_4_01_FRAMESET_SYSID

        public static final java.lang.String HTML_4_01_FRAMESET_SYSID
        HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
        See Also:
        Constant Field Values
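Each public/system identifier pair above corresponds to one of the standard HTML 4.01 document type declarations. The sketch below only formats such a declaration as text to show how the two identifiers fit together; the scanner itself reports doctypes as XNI events, and `DoctypeSketch` is a hypothetical helper.

```java
// Hypothetical helper for illustration; not part of NekoHTML.
final class DoctypeSketch {
    // Values quoted in the field descriptions above.
    static final String HTML_4_01_STRICT_PUBID = "-//W3C//DTD HTML 4.01//EN";
    static final String HTML_4_01_STRICT_SYSID = "http://www.w3.org/TR/html4/strict.dtd";

    /** Formats a DOCTYPE declaration for the given root name and identifiers. */
    static String declaration(String root, String pubid, String sysid) {
        return "<!DOCTYPE " + root + " PUBLIC \"" + pubid + "\" \"" + sysid + "\">";
    }

    public static void main(String[] args) {
        System.out.println(declaration("HTML", HTML_4_01_STRICT_PUBID, HTML_4_01_STRICT_SYSID));
    }
}
```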
      • AUGMENTATIONS

        protected static final java.lang.String AUGMENTATIONS
        Include infoset augmentations.
        See Also:
        Constant Field Values
      • REPORT_ERRORS

        protected static final java.lang.String REPORT_ERRORS
        Report errors.
        See Also:
        Constant Field Values
      • NOTIFY_CHAR_REFS

        public static final java.lang.String NOTIFY_CHAR_REFS
        Notify character entity references (e.g. &#32;, &#x20;, etc.).
        See Also:
        Constant Field Values
      • NOTIFY_XML_BUILTIN_REFS

        public static final java.lang.String NOTIFY_XML_BUILTIN_REFS
        Notify handler of built-in entity references (e.g. &amp;, &lt;, etc.).

        Note: This only applies to the five pre-defined XML general entities. Specifically, "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature.

        To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.

        See Also:
        Constant Field Values
      • NOTIFY_HTML_BUILTIN_REFS

        public static final java.lang.String NOTIFY_HTML_BUILTIN_REFS
        Notify handler of built-in entity references (e.g. &nobr;, &copy;, etc.).

        Note: This includes the five pre-defined XML general entities.

        See Also:
        Constant Field Values
      • FIX_MSWINDOWS_REFS

        public static final java.lang.String FIX_MSWINDOWS_REFS
        Fix Microsoft Windows® character entity references.
        See Also:
        Constant Field Values
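Numeric character references in the range 128 to 159 are often written with windows-1252 code points in mind (for example &#146; for a right single quotation mark). The sketch below shows one way such values could be remapped to their Unicode equivalents. It is an assumption about what "fixing" means here, it covers only a handful of code points, and `WindowsRefSketch` is a hypothetical class, not the scanner's implementation.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of NekoHTML.
final class WindowsRefSketch {
    // A few windows-1252 code points and their Unicode equivalents.
    private static final Map<Integer, Integer> CP1252 = Map.of(
            0x80, 0x20AC, // euro sign
            0x85, 0x2026, // horizontal ellipsis
            0x91, 0x2018, // left single quotation mark
            0x92, 0x2019, // right single quotation mark
            0x93, 0x201C, // left double quotation mark
            0x94, 0x201D, // right double quotation mark
            0x96, 0x2013, // en dash
            0x97, 0x2014, // em dash
            0x99, 0x2122  // trade mark sign
    );

    /** Returns the remapped code point, or the input unchanged if it is not in the table. */
    static int fix(int codePoint) {
        return CP1252.getOrDefault(codePoint, codePoint);
    }
}
```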
      • SCRIPT_STRIP_COMMENT_DELIMS

        public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • SCRIPT_STRIP_CDATA_DELIMS

        public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_COMMENT_DELIMS

        public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
        Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
        See Also:
        Constant Field Values
      • STYLE_STRIP_CDATA_DELIMS

        public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
        Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
        See Also:
        Constant Field Values
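The four strip features above remove a wrapping comment or CDATA delimiter pair from SCRIPT or STYLE content. A simplified sketch of the idea on an in-memory string follows; the scanner works on a character stream and may treat whitespace and partial delimiters differently, and `StripDelimsSketch` is a hypothetical class.

```java
// Hypothetical helper for illustration; not part of NekoHTML.
final class StripDelimsSketch {
    /**
     * Removes a leading open delimiter and a trailing close delimiter if present.
     * Surrounding whitespace is trimmed first, a side effect of this sketch.
     */
    static String strip(String content, String open, String close) {
        String s = content.trim();
        if (s.startsWith(open)) {
            s = s.substring(open.length());
        }
        if (s.endsWith(close)) {
            s = s.substring(0, s.length() - close.length());
        }
        return s;
    }

    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    static String stripCdataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }
}
```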
      • IGNORE_SPECIFIED_CHARSET

        public static final java.lang.String IGNORE_SPECIFIED_CHARSET
        Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'?> processing instruction.
        See Also:
        Constant Field Values
      • CDATA_SECTIONS

        public static final java.lang.String CDATA_SECTIONS
        Scan CDATA sections.
        See Also:
        Constant Field Values
      • OVERRIDE_DOCTYPE

        public static final java.lang.String OVERRIDE_DOCTYPE
        Override doctype declaration public and system identifiers.
        See Also:
        Constant Field Values
      • INSERT_DOCTYPE

        public static final java.lang.String INSERT_DOCTYPE
        Insert document type declaration.
        See Also:
        Constant Field Values
      • PARSE_NOSCRIPT_CONTENT

        public static final java.lang.String PARSE_NOSCRIPT_CONTENT
        Parse <noscript>...</noscript> content.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_IFRAME

        public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
        Allows a self-closing <iframe/> tag.
        See Also:
        Constant Field Values
      • ALLOW_SELFCLOSING_TAGS

        public static final java.lang.String ALLOW_SELFCLOSING_TAGS
        Allows self-closing tags, e.g. <div/> (XHTML).
        See Also:
        Constant Field Values
      • NORMALIZE_ATTRIBUTES

        protected static final java.lang.String NORMALIZE_ATTRIBUTES
        Normalize attribute values.
        See Also:
        Constant Field Values
      • NAMES_ELEMS

        protected static final java.lang.String NAMES_ELEMS
        Modify HTML element names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • NAMES_ATTRS

        protected static final java.lang.String NAMES_ATTRS
        Modify HTML attribute names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • DEFAULT_ENCODING

        protected static final java.lang.String DEFAULT_ENCODING
        Default encoding.
        See Also:
        Constant Field Values
      • ERROR_REPORTER

        protected static final java.lang.String ERROR_REPORTER
        Error reporter.
        See Also:
        Constant Field Values
      • DOCTYPE_PUBID

        protected static final java.lang.String DOCTYPE_PUBID
        Doctype declaration public identifier.
        See Also:
        Constant Field Values
      • DOCTYPE_SYSID

        protected static final java.lang.String DOCTYPE_SYSID
        Doctype declaration system identifier.
        See Also:
        Constant Field Values
      • STATE_CONTENT

        protected static final short STATE_CONTENT
        State: content.
        See Also:
        Constant Field Values
      • STATE_MARKUP_BRACKET

        protected static final short STATE_MARKUP_BRACKET
        State: markup bracket.
        See Also:
        Constant Field Values
      • STATE_START_DOCUMENT

        protected static final short STATE_START_DOCUMENT
        State: start document.
        See Also:
        Constant Field Values
      • STATE_END_DOCUMENT

        protected static final short STATE_END_DOCUMENT
        State: end document.
        See Also:
        Constant Field Values
      • NAMES_NO_CHANGE

        protected static final short NAMES_NO_CHANGE
        Don't modify HTML names.
        See Also:
        Constant Field Values
      • NAMES_UPPERCASE

        protected static final short NAMES_UPPERCASE
        Uppercase HTML names.
        See Also:
        Constant Field Values
      • NAMES_LOWERCASE

        protected static final short NAMES_LOWERCASE
        Lowercase HTML names.
        See Also:
        Constant Field Values
      • DEFAULT_BUFFER_SIZE

        protected static final int DEFAULT_BUFFER_SIZE
        Default buffer size.
        See Also:
        Constant Field Values
      • DEBUG_CALLBACKS

        protected static final boolean DEBUG_CALLBACKS
        Set to true to debug callbacks.
        See Also:
        Constant Field Values
      • SYNTHESIZED_ITEM

        protected static final HTMLEventInfo SYNTHESIZED_ITEM
        Synthesized event info item.
      • fAugmentations

        protected boolean fAugmentations
        Augmentations.
      • fReportErrors

        protected boolean fReportErrors
        Report errors.
      • fNotifyCharRefs

        protected boolean fNotifyCharRefs
        Notify character entity references.
      • fNotifyXmlBuiltinRefs

        protected boolean fNotifyXmlBuiltinRefs
        Notify XML built-in general entity references.
      • fNotifyHtmlBuiltinRefs

        protected boolean fNotifyHtmlBuiltinRefs
        Notify HTML built-in general entity references.
      • fFixWindowsCharRefs

        protected boolean fFixWindowsCharRefs
        Fix Microsoft Windows® character entity references.
      • fScriptStripCDATADelims

        protected boolean fScriptStripCDATADelims
        Strip CDATA delimiters from SCRIPT tags.
      • fScriptStripCommentDelims

        protected boolean fScriptStripCommentDelims
        Strip comment delimiters from SCRIPT tags.
      • fStyleStripCDATADelims

        protected boolean fStyleStripCDATADelims
        Strip CDATA delimiters from STYLE tags.
      • fStyleStripCommentDelims

        protected boolean fStyleStripCommentDelims
        Strip comment delimiters from STYLE tags.
      • fIgnoreSpecifiedCharset

        protected boolean fIgnoreSpecifiedCharset
        Ignore specified character set.
      • fCDATASections

        protected boolean fCDATASections
        CDATA sections.
      • fOverrideDoctype

        protected boolean fOverrideDoctype
        Override doctype declaration public and system identifiers.
      • fInsertDoctype

        protected boolean fInsertDoctype
        Insert document type declaration.
      • fNormalizeAttributes

        protected boolean fNormalizeAttributes
        Normalize attribute values.
      • fParseNoScriptContent

        protected boolean fParseNoScriptContent
        Parse noscript content.
      • fParseNoFramesContent

        protected boolean fParseNoFramesContent
        Parse noframes content.
      • fAllowSelfclosingIframe

        protected boolean fAllowSelfclosingIframe
        Allows self-closing iframe tags.
      • fAllowSelfclosingTags

        protected boolean fAllowSelfclosingTags
        Allows self-closing tags.
      • fNamesElems

        protected short fNamesElems
        Modify HTML element names.
      • fNamesAttrs

        protected short fNamesAttrs
        Modify HTML attribute names.
      • fDefaultIANAEncoding

        protected java.lang.String fDefaultIANAEncoding
        Default encoding.
      • fDoctypePubid

        protected java.lang.String fDoctypePubid
        Doctype declaration public identifier.
      • fDoctypeSysid

        protected java.lang.String fDoctypeSysid
        Doctype declaration system identifier.
      • fBeginLineNumber

        protected int fBeginLineNumber
        Beginning line number.
      • fBeginColumnNumber

        protected int fBeginColumnNumber
        Beginning column number.
      • fBeginCharacterOffset

        protected int fBeginCharacterOffset
        Beginning character offset in the file.
      • fEndLineNumber

        protected int fEndLineNumber
        Ending line number.
      • fEndColumnNumber

        protected int fEndColumnNumber
        Ending column number.
      • fEndCharacterOffset

        protected int fEndCharacterOffset
        Ending character offset in the file.
      • fCurrentEntityStack

        protected final java.util.Stack fCurrentEntityStack
        The current entity stack.
      • fScannerState

        protected short fScannerState
        The current scanner state.
      • fDocumentHandler

        protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
        The document handler.
      • fIANAEncoding

        protected java.lang.String fIANAEncoding
        Auto-detected IANA encoding.
      • fJavaEncoding

        protected java.lang.String fJavaEncoding
        Auto-detected Java encoding.
      • fIso8859Encoding

        protected boolean fIso8859Encoding
        True if the encoding matches "ISO-8859-*".
      • fElementCount

        protected int fElementCount
        Element count.
      • fElementDepth

        protected int fElementDepth
        Element depth.
      • fSpecialScanner

        protected HTMLScanner.SpecialScanner fSpecialScanner
        Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
      • fStringBuffer

        protected final org.apache.xerces.util.XMLStringBuffer fStringBuffer
        String buffer.
    • Constructor Detail

      • HTMLScanner

        public HTMLScanner()
    • Method Detail

      • pushInputSource

        public void pushInputSource​(org.apache.xerces.xni.parser.XMLInputSource inputSource)
        Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the pushed entity, the scanner resumes where it left off when that source was pushed.

        Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

        Parameters:
        inputSource - The new input source to start scanning.
        See Also:
        evaluateInputSource(XMLInputSource)
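The push-and-resume behavior described for pushInputSource can be pictured as a stack of readers: characters come from the top reader until it is exhausted, then reading continues with the one beneath it. The sketch below illustrates that idea with plain `java.io.Reader` objects; it is not the scanner's implementation, and `EntityStackSketch` is a hypothetical class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; not part of NekoHTML.
final class EntityStackSketch {
    private final Deque<Reader> stack = new ArrayDeque<>();

    EntityStackSketch(Reader document) {
        stack.push(document);
    }

    /** Pushes a new source; subsequent reads come from it first. */
    void push(Reader source) {
        stack.push(source);
    }

    /** Reads one character, popping exhausted sources; returns -1 when all are done. */
    int read() {
        try {
            while (!stack.isEmpty()) {
                int c = stack.peek().read();
                if (c != -1) {
                    return c;
                }
                stack.pop().close(); // finished: resume the source underneath
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pushing "XY" after reading the first character of "ab" yields the sequence a, X, Y, b.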
      • evaluateInputSource

        public void evaluateInputSource​(org.apache.xerces.xni.parser.XMLInputSource inputSource)
        Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
        Parameters:
        inputSource - The new input source to start evaluating.
        See Also:
        pushInputSource(XMLInputSource)
      • cleanup

        public void cleanup​(boolean closeall)
        Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
        Parameters:
        closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
      • getEncoding

        public java.lang.String getEncoding()
        Returns the encoding.
        Specified by:
        getEncoding in interface org.apache.xerces.xni.XMLLocator
      • getPublicId

        public java.lang.String getPublicId()
        Returns the public identifier.
        Specified by:
        getPublicId in interface org.apache.xerces.xni.XMLLocator
      • getBaseSystemId

        public java.lang.String getBaseSystemId()
        Returns the base system identifier.
        Specified by:
        getBaseSystemId in interface org.apache.xerces.xni.XMLLocator
      • getLiteralSystemId

        public java.lang.String getLiteralSystemId()
        Returns the literal system identifier.
        Specified by:
        getLiteralSystemId in interface org.apache.xerces.xni.XMLLocator
      • getExpandedSystemId

        public java.lang.String getExpandedSystemId()
        Returns the expanded system identifier.
        Specified by:
        getExpandedSystemId in interface org.apache.xerces.xni.XMLLocator
      • getLineNumber

        public int getLineNumber()
        Returns the current line number.
        Specified by:
        getLineNumber in interface org.apache.xerces.xni.XMLLocator
      • getColumnNumber

        public int getColumnNumber()
        Returns the current column number.
        Specified by:
        getColumnNumber in interface org.apache.xerces.xni.XMLLocator
      • getXMLVersion

        public java.lang.String getXMLVersion()
        Returns the XML version.
        Specified by:
        getXMLVersion in interface org.apache.xerces.xni.XMLLocator
      • getCharacterOffset

        public int getCharacterOffset()
        Returns the character offset.
        Specified by:
        getCharacterOffset in interface org.apache.xerces.xni.XMLLocator
      • getFeatureDefault

        public java.lang.Boolean getFeatureDefault​(java.lang.String featureId)
        Returns the default state for a feature.
        Specified by:
        getFeatureDefault in interface HTMLComponent
        Specified by:
        getFeatureDefault in interface org.apache.xerces.xni.parser.XMLComponent
      • getPropertyDefault

        public java.lang.Object getPropertyDefault​(java.lang.String propertyId)
        Returns the default state for a property.
        Specified by:
        getPropertyDefault in interface HTMLComponent
        Specified by:
        getPropertyDefault in interface org.apache.xerces.xni.parser.XMLComponent
      • getRecognizedFeatures

        public java.lang.String[] getRecognizedFeatures()
        Returns recognized features.
        Specified by:
        getRecognizedFeatures in interface org.apache.xerces.xni.parser.XMLComponent
      • getRecognizedProperties

        public java.lang.String[] getRecognizedProperties()
        Returns recognized properties.
        Specified by:
        getRecognizedProperties in interface org.apache.xerces.xni.parser.XMLComponent
      • reset

        public void reset​(org.apache.xerces.xni.parser.XMLComponentManager manager)
                   throws org.apache.xerces.xni.parser.XMLConfigurationException
        Resets the component.
        Specified by:
        reset in interface org.apache.xerces.xni.parser.XMLComponent
        Throws:
        org.apache.xerces.xni.parser.XMLConfigurationException
      • setFeature

        public void setFeature​(java.lang.String featureId,
                               boolean state)
        Sets a feature.
        Specified by:
        setFeature in interface org.apache.xerces.xni.parser.XMLComponent
      • setProperty

        public void setProperty​(java.lang.String propertyId,
                                java.lang.Object value)
                         throws org.apache.xerces.xni.parser.XMLConfigurationException
        Sets a property.
        Specified by:
        setProperty in interface org.apache.xerces.xni.parser.XMLComponent
        Throws:
        org.apache.xerces.xni.parser.XMLConfigurationException
      • setInputSource

        public void setInputSource​(org.apache.xerces.xni.parser.XMLInputSource source)
                            throws java.io.IOException
        Sets the input source.
        Specified by:
        setInputSource in interface org.apache.xerces.xni.parser.XMLDocumentScanner
        Throws:
        java.io.IOException
      • scanDocument

        public boolean scanDocument​(boolean complete)
                             throws org.apache.xerces.xni.XNIException,
                                    java.io.IOException
        Scans the document.
        Specified by:
        scanDocument in interface org.apache.xerces.xni.parser.XMLDocumentScanner
        Throws:
        org.apache.xerces.xni.XNIException
        java.io.IOException
      • setDocumentHandler

        public void setDocumentHandler​(org.apache.xerces.xni.XMLDocumentHandler handler)
        Sets the document handler.
        Specified by:
        setDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
      • getDocumentHandler

        public org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
        Returns the document handler.
        Specified by:
        getDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
      • getValue

        protected static java.lang.String getValue​(org.apache.xerces.xni.XMLAttributes attrs,
                                                   java.lang.String aname)
        Returns the value of the specified attribute, ignoring case.
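A case-insensitive attribute lookup of this kind can be sketched as a linear scan over the attribute names. The example below uses the JDK's `org.xml.sax.Attributes` as a stand-in for the XNI `XMLAttributes` type so that it compiles without Xerces; that substitution and the class name `AttrLookupSketch` are assumptions.

```java
import org.xml.sax.Attributes;

// Hypothetical helper for illustration; not part of NekoHTML.
final class AttrLookupSketch {
    /** Returns the value of the first attribute whose qualified name matches, ignoring case. */
    static String getValue(Attributes attrs, String name) {
        for (int i = 0; i < attrs.getLength(); i++) {
            if (attrs.getQName(i).equalsIgnoreCase(name)) {
                return attrs.getValue(i);
            }
        }
        return null; // no attribute with that name
    }
}
```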
      • expandSystemId

        public static java.lang.String expandSystemId​(java.lang.String systemId,
                                                      java.lang.String baseSystemId)
        Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
        Parameters:
        systemId - The systemId to be expanded.
        Returns:
        Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
      • fixURI

        protected static java.lang.String fixURI​(java.lang.String str)
        Fixes a platform dependent filename to standard URI form.
        Parameters:
        str - The string to fix.
        Returns:
        Returns the fixed URI string.
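As an illustration of what converting a platform-dependent filename to URI form can involve, the sketch below normalizes separators and handles drive letters and UNC-style paths. The exact rules fixURI applies are not given on this page, so these rules are assumptions, as are the class name `FixUriSketch` and the explicit `separator` parameter (added so the behavior does not depend on the host platform).

```java
// Hypothetical helper for illustration; not part of NekoHTML.
final class FixUriSketch {
    static String fixUri(String path, char separator) {
        // Use forward slashes regardless of the platform separator.
        String s = path.replace(separator, '/');
        if (s.length() >= 2) {
            char c0 = s.charAt(0);
            char c1 = s.charAt(1);
            boolean driveLetter = (c0 >= 'A' && c0 <= 'Z') || (c0 >= 'a' && c0 <= 'z');
            if (c1 == ':' && driveLetter) {
                s = "/" + s;      // "C:/docs/a.html" becomes "/C:/docs/a.html"
            } else if (c0 == '/' && c1 == '/') {
                s = "file:" + s;  // "//host/share" becomes "file://host/share"
            }
        }
        return s;
    }
}
```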
      • modifyName

        protected static final java.lang.String modifyName​(java.lang.String name,
                                                           short mode)
        Modifies the given name based on the specified mode.
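The three name modes (no change, uppercase, lowercase) suggest a simple dispatch on the mode value. The sketch below shows that shape; the numeric values of the constants are hypothetical placeholders (the real ones are on the Constant Field Values page), and the use of `Locale.ROOT` is a choice made for this sketch.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of NekoHTML.
final class NameModeSketch {
    // Placeholder values, assumed for this sketch only.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Returns the name converted according to the mode; unknown modes leave it unchanged. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```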
      • read

        protected int read()
                    throws java.io.IOException
        Reads a single character.
        Throws:
        java.io.IOException
      • setScannerState

        protected void setScannerState​(short state)
        Sets the scanner state.
      • scanDoctype

        protected void scanDoctype()
                            throws java.io.IOException
        Scans a DOCTYPE line.
        Throws:
        java.io.IOException
      • scanLiteral

        protected java.lang.String scanLiteral()
                                        throws java.io.IOException
        Scans a quoted literal.
        Throws:
        java.io.IOException
      • scanName

        protected java.lang.String scanName​(boolean strict)
                                     throws java.io.IOException
        Scans a name.
        Throws:
        java.io.IOException
      • scanEntityRef

        protected int scanEntityRef​(org.apache.xerces.util.XMLStringBuffer str,
                                    boolean content)
                             throws java.io.IOException
        Scans an entity reference.
        Throws:
        java.io.IOException
      • skip

        protected boolean skip​(java.lang.String s,
                               boolean caseSensitive)
                        throws java.io.IOException
        Returns true if the specified text is present and is skipped.
        Throws:
        java.io.IOException
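The contract of skip (consume the text only if it is present at the current position) can be sketched over an in-memory string with a cursor. The real method reads from the scanner's buffer and can throw `IOException`; the class `SkipSketch` and its `offset()` accessor are hypothetical.

```java
// Hypothetical helper for illustration; not part of NekoHTML.
final class SkipSketch {
    private final String input;
    private int offset;

    SkipSketch(String input) {
        this.input = input;
    }

    /** Advances past s and returns true if s occurs at the current offset. */
    boolean skip(String s, boolean caseSensitive) {
        // regionMatches takes an ignoreCase flag, hence the negation.
        if (input.regionMatches(!caseSensitive, offset, s, 0, s.length())) {
            offset += s.length();
            return true;
        }
        return false; // nothing consumed
    }

    int offset() {
        return offset;
    }
}
```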
      • skipMarkup

        protected boolean skipMarkup​(boolean balance)
                              throws java.io.IOException
        Skips markup.
        Throws:
        java.io.IOException
      • skipSpaces

        protected boolean skipSpaces()
                              throws java.io.IOException
        Skips whitespace.
        Throws:
        java.io.IOException
      • skipNewlines

        protected int skipNewlines()
                            throws java.io.IOException
        Skips newlines and returns the number of newlines skipped.
        Throws:
        java.io.IOException
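Counting skipped newlines requires deciding how a carriage return followed by a line feed is counted. The sketch below counts CR LF as a single newline, which is an assumption about the scanner's behavior; `NewlineSketch` is a hypothetical class that works on a string rather than the scanner's buffer.

```java
// Hypothetical helper for illustration; not part of NekoHTML.
final class NewlineSketch {
    /**
     * Skips consecutive newlines starting at offset.
     * Returns a two-element array: {newlines skipped, offset after them}.
     */
    static int[] skipNewlines(String s, int offset) {
        int count = 0;
        int i = offset;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\r') {
                i++;
                if (i < s.length() && s.charAt(i) == '\n') {
                    i++; // CR LF counts as one newline
                }
                count++;
            } else if (c == '\n') {
                i++;
                count++;
            } else {
                break; // first non-newline character
            }
        }
        return new int[] { count, i };
    }
}
```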
      • locationAugs

        protected final org.apache.xerces.xni.Augmentations locationAugs()
        Returns an augmentations object with a location item added.
      • synthesizedAugs

        protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
        Returns an augmentations object with a synthesized item added.
      • resourceId

        protected final org.apache.xerces.xni.XMLResourceIdentifier resourceId()
        Returns an empty resource identifier.
      • builtinXmlRef

        protected static boolean builtinXmlRef​(java.lang.String name)
        Returns true if the name is a built-in XML general entity reference.
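The five predefined XML general entities named in the NOTIFY_XML_BUILTIN_REFS description ("amp", "lt", "gt", "quot", "apos") make this check a set membership test. A minimal sketch, with the hypothetical class name `BuiltinXmlRefSketch`:

```java
import java.util.Set;

// Hypothetical helper for illustration; not part of NekoHTML.
final class BuiltinXmlRefSketch {
    private static final Set<String> XML_BUILTINS =
            Set.of("amp", "lt", "gt", "quot", "apos");

    /** Returns true for the five predefined XML entity names; the match is case-sensitive. */
    static boolean builtinXmlRef(String name) {
        return XML_BUILTINS.contains(name); // name must be non-null for Set.of sets
    }
}
```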
      • readPreservingBufferContent

        protected int readPreservingBufferContent()
                                           throws java.io.IOException
        Reads a single character, preserving the old buffer content.
        Throws:
        java.io.IOException