Class HTMLScanner

java.lang.Object
org.htmlunit.cyberneko.HTMLScanner
All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentScanner, XMLDocumentSource, XMLLocator

public class HTMLScanner extends Object implements XMLDocumentScanner, XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document — it just scans what it can and generates XNI document "events", ignoring errors of all kinds.

This component recognizes the following features:

  • http://cyberneko.org/html/features/augmentations
  • http://cyberneko.org/html/features/report-errors
  • http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
  • http://cyberneko.org/html/features/scanner/script/strip-comment-delims
  • http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
  • http://cyberneko.org/html/features/scanner/style/strip-comment-delims
  • http://cyberneko.org/html/features/scanner/ignore-specified-charset
  • http://cyberneko.org/html/features/scanner/cdata-sections
  • http://cyberneko.org/html/features/override-doctype
  • http://cyberneko.org/html/features/insert-doctype
  • http://cyberneko.org/html/features/parse-noscript-content
  • http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
  • http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
  • http://cyberneko.org/html/features/scanner/normalize-attrs
  • http://cyberneko.org/html/features/scanner/plain-attr-values

This component recognizes the following properties:

  • http://cyberneko.org/html/properties/names/elems
  • http://cyberneko.org/html/properties/names/attrs
  • http://cyberneko.org/html/properties/default-encoding
  • http://cyberneko.org/html/properties/error-reporter
  • http://cyberneko.org/html/properties/encoding-translator
  • http://cyberneko.org/html/properties/doctype/pubid
  • http://cyberneko.org/html/properties/doctype/sysid
  • Field Details

    • HTML_4_01_STRICT_PUBID

      public static final String HTML_4_01_STRICT_PUBID
      HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
    • HTML_4_01_STRICT_SYSID

      public static final String HTML_4_01_STRICT_SYSID
      HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
    • HTML_4_01_TRANSITIONAL_PUBID

      public static final String HTML_4_01_TRANSITIONAL_PUBID
      HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
    • HTML_4_01_TRANSITIONAL_SYSID

      public static final String HTML_4_01_TRANSITIONAL_SYSID
      HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
    • HTML_4_01_FRAMESET_PUBID

      public static final String HTML_4_01_FRAMESET_PUBID
      HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
    • HTML_4_01_FRAMESET_SYSID

      public static final String HTML_4_01_FRAMESET_SYSID
      HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
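The public and system identifiers above are the two quoted parts of an HTML 4.01 document type declaration. A minimal sketch of how they combine, assuming a hypothetical helper named doctype (it is not part of HTMLScanner) and copying the identifier values from the field descriptions:

```java
final class DoctypeSketch {
    // Values copied from the field descriptions above.
    static final String HTML_4_01_STRICT_PUBID = "-//W3C//DTD HTML 4.01//EN";
    static final String HTML_4_01_STRICT_SYSID = "http://www.w3.org/TR/html4/strict.dtd";

    /** Hypothetical helper: formats a doctype declaration from the two identifiers. */
    static String doctype(String root, String pubid, String sysid) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(root);
        if (pubid != null) {
            sb.append(" PUBLIC \"").append(pubid).append('"');
            if (sysid != null) {
                sb.append(" \"").append(sysid).append('"');
            }
        } else if (sysid != null) {
            sb.append(" SYSTEM \"").append(sysid).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(doctype("HTML", HTML_4_01_STRICT_PUBID, HTML_4_01_STRICT_SYSID));
    }
}
```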
    • AUGMENTATIONS

      public static final String AUGMENTATIONS
      Include infoset augmentations.
    • REPORT_ERRORS

      public static final String REPORT_ERRORS
      Report errors.
    • SCRIPT_STRIP_COMMENT_DELIMS

      public static final String SCRIPT_STRIP_COMMENT_DELIMS
      Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
    • SCRIPT_STRIP_CDATA_DELIMS

      public static final String SCRIPT_STRIP_CDATA_DELIMS
      Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
    • STYLE_STRIP_COMMENT_DELIMS

      public static final String STYLE_STRIP_COMMENT_DELIMS
      Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
    • STYLE_STRIP_CDATA_DELIMS

      public static final String STYLE_STRIP_CDATA_DELIMS
      Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
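As an illustration of what the four strip-delimiter features describe, the sketch below removes one leading and one trailing delimiter from SCRIPT or STYLE text. The class and method names are hypothetical, and the real scanner may treat whitespace or delimiter placement differently; this only shows the general idea.

```java
final class StripDelimsSketch {
    /** Strips "<!--" and "-->" around the content, if present. */
    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    /** Strips "<![CDATA[" and "]]>" around the content, if present. */
    static String stripCdataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    // Assumption of this sketch: surrounding whitespace is trimmed before matching.
    private static String strip(String content, String open, String close) {
        String s = content.trim();
        if (s.startsWith(open)) {
            s = s.substring(open.length());
        }
        if (s.endsWith(close)) {
            s = s.substring(0, s.length() - close.length());
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!-- alert(1); -->"));
        System.out.println(stripCdataDelims("<![CDATA[ if (a < b) run(); ]]>"));
    }
}
```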
    • IGNORE_SPECIFIED_CHARSET

      public static final String IGNORE_SPECIFIED_CHARSET
      Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
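To make the effect of this feature concrete, here is a small sketch that pulls the charset out of a Content-Type value and then either honors or ignores it. The class, the regular expression, and both method names are assumptions made for illustration; the scanner's actual detection logic is not shown in this section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Matches charset=NAME with optional whitespace and optional quotes.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is given. */
    static String charsetOf(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    /** Keeps the detected encoding when the declared charset is to be ignored. */
    static String chooseEncoding(String detected, String metaContent, boolean ignoreSpecifiedCharset) {
        if (ignoreSpecifiedCharset) {
            return detected;
        }
        String declared = charsetOf(metaContent);
        return declared != null ? declared : detected;
    }

    public static void main(String[] args) {
        String content = "text/html; charset=windows-1252";
        System.out.println(chooseEncoding("UTF-8", content, false));
        System.out.println(chooseEncoding("UTF-8", content, true));
    }
}
```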
    • CDATA_SECTIONS

      public static final String CDATA_SECTIONS
      Scan CDATA sections.
    • OVERRIDE_DOCTYPE

      public static final String OVERRIDE_DOCTYPE
      Override doctype declaration public and system identifiers.
    • INSERT_DOCTYPE

      public static final String INSERT_DOCTYPE
      Insert document type declaration.
    • PARSE_NOSCRIPT_CONTENT

      public static final String PARSE_NOSCRIPT_CONTENT
      Parse <noscript>...</noscript> content.
    • ALLOW_SELFCLOSING_IFRAME

      public static final String ALLOW_SELFCLOSING_IFRAME
      Allow a self-closing <iframe/> tag.
    • ALLOW_SELFCLOSING_TAGS

      public static final String ALLOW_SELFCLOSING_TAGS
      Allow self-closing tags, e.g. <div/> (XHTML).
    • NORMALIZE_ATTRIBUTES

      public static final String NORMALIZE_ATTRIBUTES
      Normalize attribute values.
    • PLAIN_ATTRIBUTE_VALUES

      public static final String PLAIN_ATTRIBUTE_VALUES
      Store the plain attribute values also.
    • RECOGNIZED_FEATURES

      private static final String[] RECOGNIZED_FEATURES
      Recognized features.
    • RECOGNIZED_FEATURES_DEFAULTS

      private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS
      Recognized features defaults.
    • NAMES_ELEMS

      public static final String NAMES_ELEMS
      Modify HTML element names: { "upper", "lower", "default" }.
    • NAMES_ATTRS

      public static final String NAMES_ATTRS
      Modify HTML attribute names: { "upper", "lower", "default" }.
    • DEFAULT_ENCODING

      public static final String DEFAULT_ENCODING
      Default encoding.
    • ERROR_REPORTER

      public static final String ERROR_REPORTER
      Error reporter.
    • ENCODING_TRANSLATOR

      public static final String ENCODING_TRANSLATOR
      Encoding translator.
    • DOCTYPE_PUBID

      public static final String DOCTYPE_PUBID
      Doctype declaration public identifier.
    • DOCTYPE_SYSID

      public static final String DOCTYPE_SYSID
      Doctype declaration system identifier.
    • RECOGNIZED_PROPERTIES

      private static final String[] RECOGNIZED_PROPERTIES
      Recognized properties.
    • RECOGNIZED_PROPERTIES_DEFAULTS

      private static final Object[] RECOGNIZED_PROPERTIES_DEFAULTS
      Recognized properties defaults.
    • STATE_CONTENT

      protected static final short STATE_CONTENT
      State: content.
    • STATE_MARKUP_BRACKET

      protected static final short STATE_MARKUP_BRACKET
      State: markup bracket.
    • STATE_START_DOCUMENT

      protected static final short STATE_START_DOCUMENT
      State: start document.
    • STATE_END_DOCUMENT

      protected static final short STATE_END_DOCUMENT
      State: end document.
    • NAMES_NO_CHANGE

      protected static final short NAMES_NO_CHANGE
      Don't modify HTML names.
    • NAMES_UPPERCASE

      protected static final short NAMES_UPPERCASE
      Uppercase HTML names.
    • NAMES_LOWERCASE

      protected static final short NAMES_LOWERCASE
      Lowercase HTML names.
    • DEFAULT_BUFFER_SIZE

      protected static final int DEFAULT_BUFFER_SIZE
    • DEBUG_SCANNER

      private static final boolean DEBUG_SCANNER
      Set to true to debug changes in the scanner.
    • DEBUG_SCANNER_STATE

      private static final boolean DEBUG_SCANNER_STATE
      Set to true to debug changes in the scanner state.
    • DEBUG_BUFFER

      private static final boolean DEBUG_BUFFER
      Set to true to debug the buffer.
    • DEBUG_CHARSET

      private static final boolean DEBUG_CHARSET
      Set to true to debug character encoding handling.
    • DEBUG_CALLBACKS

      protected static final boolean DEBUG_CALLBACKS
      Set to true to debug callbacks.
    • SYNTHESIZED_ITEM

      protected static final HTMLEventInfo SYNTHESIZED_ITEM
      Synthesized event info item.
    • fAugmentations_

      private boolean fAugmentations_
      Augmentations.
    • fReportErrors_

      boolean fReportErrors_
      Report errors.
    • fScriptStripCDATADelims_

      boolean fScriptStripCDATADelims_
      Strip CDATA delimiters from SCRIPT tags.
    • fScriptStripCommentDelims_

      boolean fScriptStripCommentDelims_
      Strip comment delimiters from SCRIPT tags.
    • fStyleStripCDATADelims_

      boolean fStyleStripCDATADelims_
      Strip CDATA delimiters from STYLE tags.
    • fStyleStripCommentDelims_

      boolean fStyleStripCommentDelims_
      Strip comment delimiters from STYLE tags.
    • fIgnoreSpecifiedCharset_

      boolean fIgnoreSpecifiedCharset_
      Ignore specified character set.
    • fCDATASections_

      boolean fCDATASections_
      CDATA sections.
    • fOverrideDoctype_

      private boolean fOverrideDoctype_
      Override doctype declaration public and system identifiers.
    • fInsertDoctype_

      boolean fInsertDoctype_
      Insert document type declaration.
    • fNormalizeAttributes_

      boolean fNormalizeAttributes_
      Normalize attribute values.
    • fPlainAttributeValues_

      boolean fPlainAttributeValues_
      Store the plain attribute values also.
    • fParseNoScriptContent_

      boolean fParseNoScriptContent_
      Parse noscript content.
    • fAllowSelfclosingIframe_

      boolean fAllowSelfclosingIframe_
      Allows self closing iframe tags.
    • fAllowSelfclosingTags_

      boolean fAllowSelfclosingTags_
      Allows self closing tags.
    • fNamesElems

      protected short fNamesElems
      Modify HTML element names.
    • fNamesAttrs

      protected short fNamesAttrs
      Modify HTML attribute names.
    • fDefaultIANAEncoding

      protected String fDefaultIANAEncoding
      Default encoding.
    • fErrorReporter

      protected HTMLErrorReporter fErrorReporter
      Error reporter.
    • fEncodingTranslator

      protected EncodingTranslator fEncodingTranslator
      Encoding translator.
    • fDoctypePubid

      protected String fDoctypePubid
      Doctype declaration public identifier.
    • fDoctypeSysid

      protected String fDoctypeSysid
      Doctype declaration system identifier.
    • fBeginLineNumber

      protected int fBeginLineNumber
      Beginning line number.
    • fBeginColumnNumber

      protected int fBeginColumnNumber
      Beginning column number.
    • fBeginCharacterOffset

      protected int fBeginCharacterOffset
      Beginning character offset in the file.
    • fEndLineNumber

      protected int fEndLineNumber
      Ending line number.
    • fEndColumnNumber

      protected int fEndColumnNumber
      Ending column number.
    • fEndCharacterOffset

      protected int fEndCharacterOffset
      Ending character offset in the file.
    • fByteStream

      protected PlaybackInputStream fByteStream
      The playback byte stream.
    • fCurrentEntity

      Current entity.
    • fCurrentEntityStack

      protected final MiniStack<HTMLScanner.CurrentEntity> fCurrentEntityStack
      The current entity stack.
    • fScanner

      protected HTMLScanner.Scanner fScanner
      The current scanner.
    • fScannerState

      protected short fScannerState
      The current scanner state.
    • fDocumentHandler

      protected XMLDocumentHandler fDocumentHandler
      The document handler.
    • fIANAEncoding

      protected String fIANAEncoding
      Auto-detected IANA encoding.
    • fJavaEncoding

      protected String fJavaEncoding
      Auto-detected Java encoding.
    • fElementCount

      protected int fElementCount
      Element count.
    • fElementDepth

      protected int fElementDepth
      Element depth.
    • fContentScanner

      protected HTMLScanner.Scanner fContentScanner
      Content scanner.
    • fSpecialScanner

      protected final HTMLScanner.SpecialScanner fSpecialScanner
      Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
    • fStringBuffer

      protected final XMLString fStringBuffer
      String buffer.
    • fStringBufferEntiyRef

      final XMLString fStringBufferEntiyRef
      String buffer used when resolving entity refs.
    • fStringBufferPlainAttribValue

      final XMLString fStringBufferPlainAttribValue
    • fScanScriptContent

      final XMLString fScanScriptContent
      String buffer for script content; larger because script areas tend to be larger.
    • fScanUntilEndTag

      final XMLString fScanUntilEndTag
    • fScanComment

      final XMLString fScanComment
    • fScanLiteral

      private final XMLString fScanLiteral
    • fSingleBoolean

      final boolean[] fSingleBoolean
      Single boolean array.
    • htmlConfiguration_

      final HTMLConfiguration htmlConfiguration_
    • fLocationItem

      private final HTMLScanner.LocationItem fLocationItem
      Our location item, reused for every event (the Augmentations contract allows this) to save memory.
  • Constructor Details

    • HTMLScanner

      HTMLScanner(HTMLConfiguration htmlConfiguration)
      Creates a new HTMLScanner with the given configuration.
      Parameters:
      htmlConfiguration - the configuration to use
  • Method Details

    • pushInputSource

      public void pushInputSource(XMLInputSource inputSource)
      Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the pushed entity, the scanner returns to where it left off at the time this entity source was pushed.

      Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

      Parameters:
      inputSource - The new input source to start scanning.
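The push-and-resume behavior described for pushInputSource can be pictured as a stack of readers. The sketch below is a simplified stand-in, not the scanner's implementation: EntityStackSketch and its methods are hypothetical names, and plain java.io.Reader objects take the place of XMLInputSource and CurrentEntity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

final class EntityStackSketch {
    private final Deque<Reader> stack = new ArrayDeque<>();

    /** Pushes a new source; reading continues from it until it is exhausted. */
    void push(Reader source) {
        stack.push(source);
    }

    /** Reads from the innermost source and falls back to the outer one at its end. */
    int read() {
        try {
            while (!stack.isEmpty()) {
                int c = stack.peek().read();
                if (c != -1) {
                    return c;
                }
                stack.pop().close(); // finished entity: resume the previous one
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads part of a document, pushes script output, then reads to the end. */
    static String demo() {
        EntityStackSketch s = new EntityStackSketch();
        s.push(new StringReader("<p>ab</p>"));
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            out.append((char) s.read());
        }
        s.push(new StringReader("<b>X</b>"));
        for (int c; (c = s.read()) != -1; ) {
            out.append((char) c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo the pushed text appears exactly where the outer document was interrupted, and the remainder of the outer document follows it.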
    • getReader

      private Reader getReader(XMLInputSource inputSource)
    • evaluateInputSource

      public void evaluateInputSource(XMLInputSource inputSource)
      Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
      Parameters:
      inputSource - The new input source to start evaluating.
    • cleanup

      public void cleanup(boolean closeall)
      Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
      Parameters:
      closeall - If true, close all streams, including the original. Pass false when the application has opened the original document stream and should be responsible for closing it.
    • getEncoding

      public String getEncoding()
      Returns the encoding.
      Specified by:
      getEncoding in interface XMLLocator
      Returns:
      the encoding of the current entity. Note that, for a given entity, this value can only be considered final once the encoding declaration has been read (or once it has been determined that there is no such declaration) since, no encoding having been specified on the XMLInputSource, the parser will make an initial "guess" which could be in error.
    • getPublicId

      public String getPublicId()
      Returns the public identifier.
      Specified by:
      getPublicId in interface XMLLocator
      Returns:
      the public identifier.
    • getBaseSystemId

      public String getBaseSystemId()
      Returns the base system identifier.
      Specified by:
      getBaseSystemId in interface XMLLocator
      Returns:
      the base system identifier.
    • getLiteralSystemId

      public String getLiteralSystemId()
      Returns the literal system identifier.
      Specified by:
      getLiteralSystemId in interface XMLLocator
      Returns:
      the literal system identifier.
    • getExpandedSystemId

      public String getExpandedSystemId()
      Returns the expanded system identifier.
      Specified by:
      getExpandedSystemId in interface XMLLocator
      Returns:
      the expanded system identifier.
    • getLineNumber

      public int getLineNumber()
      Returns the current line number.
      Specified by:
      getLineNumber in interface XMLLocator
      Returns:
      the line number, or -1 if no line number is available.
    • getColumnNumber

      public int getColumnNumber()
      Returns the current column number.
      Specified by:
      getColumnNumber in interface XMLLocator
      Returns:
      the column number, or -1 if no column number is available.
    • getXMLVersion

      public String getXMLVersion()
      Returns the XML version.
      Specified by:
      getXMLVersion in interface XMLLocator
      Returns:
      the XML version of the current entity. This will normally be the value from the XML or text declaration or defaulted by the parser. Note that this value may be different from the version of the processing rules applied to the current entity. For instance, an XML 1.1 document may refer to XML 1.0 entities. In such a case the rules of XML 1.1 are applied to the entire document. Also note that, for a given entity, this value can only be considered final once the XML or text declaration has been read or once it has been determined that there is no such declaration.
    • getCharacterOffset

      public int getCharacterOffset()
      Returns the character offset.
      Specified by:
      getCharacterOffset in interface XMLLocator
      Returns:
      the character offset, or -1 if no character offset is available.
    • getFeatureDefault

      public Boolean getFeatureDefault(String featureId)
      Returns the default state for a feature.
      Specified by:
      getFeatureDefault in interface HTMLComponent
      Specified by:
      getFeatureDefault in interface XMLComponent
      Parameters:
      featureId - The feature identifier.
      Returns:
      the default state for a feature, or null if this component does not want to report a default value for this feature.
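The private fields RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS suggest a lookup over two parallel arrays. A minimal sketch under that assumption follows; the two feature identifiers come from the list at the top of this page, but the default values are placeholders chosen for illustration, not the scanner's actual defaults.

```java
final class FeatureDefaultsSketch {
    private static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/augmentations",
        "http://cyberneko.org/html/features/report-errors",
    };

    // Placeholder values for illustration only; null means "no default reported".
    private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        Boolean.FALSE,
        null,
    };

    /** Returns the default for a recognized feature, or null otherwise. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault("http://cyberneko.org/html/features/augmentations"));
        System.out.println(getFeatureDefault("http://example.org/unknown-feature"));
    }
}
```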
    • getPropertyDefault

      public Object getPropertyDefault(String propertyId)
      Returns the default state for a property.
      Specified by:
      getPropertyDefault in interface HTMLComponent
      Specified by:
      getPropertyDefault in interface XMLComponent
      Parameters:
      propertyId - The property identifier.
      Returns:
      the default state for a property, or null if this component does not want to report a default value for this property.
    • getRecognizedFeatures

      public String[] getRecognizedFeatures()
      Returns recognized features.
      Specified by:
      getRecognizedFeatures in interface XMLComponent
      Returns:
      an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
    • getRecognizedProperties

      public String[] getRecognizedProperties()
      Returns recognized properties.
      Specified by:
      getRecognizedProperties in interface XMLComponent
      Returns:
      an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
    • reset

      public void reset(XMLComponentManager manager) throws XMLConfigurationException
      Resets the component.
      Specified by:
      reset in interface XMLComponent
      Parameters:
      manager - The component manager.
      Throws:
      XMLConfigurationException
    • setFeature

      public void setFeature(String featureId, boolean state)
      Sets a feature.
      Specified by:
      setFeature in interface XMLComponent
      Parameters:
      featureId - The feature identifier.
      state - The state of the feature.
    • setProperty

      public void setProperty(String propertyId, Object value) throws XMLConfigurationException
      Sets a property.
      Specified by:
      setProperty in interface XMLComponent
      Parameters:
      propertyId - The property identifier.
      value - The value of the property.
      Throws:
      XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
    • setInputSource

      public void setInputSource(XMLInputSource source) throws IOException
      Sets the input source.
      Specified by:
      setInputSource in interface XMLDocumentScanner
      Parameters:
      source - The input source.
      Throws:
      IOException - Thrown on i/o error.
    • scanDocument

      public boolean scanDocument(boolean complete) throws XNIException, IOException
      Scans the document.
      Specified by:
      scanDocument in interface XMLDocumentScanner
      Parameters:
      complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
      Returns:
      True if there is more to scan, false otherwise.
      Throws:
      XNIException - on error.
      IOException - Thrown on i/o error.
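The difference between complete and "pull" scanning can be shown with a toy stand-in. ToyScanner below is hypothetical and only mirrors the documented contract of the complete parameter and the boolean return value; it emits one string per step instead of XNI events.

```java
import java.util.ArrayList;
import java.util.List;

final class PullScanSketch {
    /** Hypothetical stand-in for a scanner; records one event per token. */
    static final class ToyScanner {
        private final String[] tokens;
        private int next;
        final List<String> events = new ArrayList<>();

        ToyScanner(String... tokens) {
            this.tokens = tokens;
        }

        /** Returns true if there is more to scan, false otherwise. */
        boolean scanDocument(boolean complete) {
            do {
                if (next >= tokens.length) {
                    return false;
                }
                events.add(tokens[next++]);
            } while (complete);
            return next < tokens.length;
        }
    }

    /** Pull loop: scans portion by portion and returns the number of calls made. */
    static int drain(ToyScanner scanner) {
        int calls = 1;
        while (scanner.scanDocument(false)) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        ToyScanner scanner = new ToyScanner("<p>", "text", "</p>");
        System.out.println(drain(scanner) + " calls, events: " + scanner.events);
    }
}
```

With complete set to true a single call would consume every token and return false, which is the behavior callers can rely on when they do not need incremental control.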
    • setDocumentHandler

      public void setDocumentHandler(XMLDocumentHandler handler)
      Sets the document handler.
      Specified by:
      setDocumentHandler in interface XMLDocumentSource
      Parameters:
      handler - the new handler
    • getDocumentHandler

      public XMLDocumentHandler getDocumentHandler()
      Returns the document handler.
      Specified by:
      getDocumentHandler in interface XMLDocumentSource
      Returns:
      the document handler
    • getValue

      protected static String getValue(XMLAttributes attrs, String aname)
    • expandSystemId

      public static String expandSystemId(String systemId, String baseSystemId)
      Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
      Parameters:
      systemId - The systemId to be expanded.
      baseSystemId - The base system identifier against which systemId is expanded.
      Returns:
      Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
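The contract described for expandSystemId can be approximated with java.net.URI. This is a sketch under stated assumptions rather than the method's implementation: the class name is hypothetical, baseSystemId is assumed to be a non-null absolute URI, and a malformed identifier surfaces as the IllegalArgumentException thrown by URI.create.

```java
import java.net.URI;

final class ExpandSystemIdSketch {
    /**
     * Returns the expanded identifier, or null if systemId is already
     * absolute (mirroring the return value documented above).
     */
    static String expand(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);   // throws IllegalArgumentException if malformed
        if (id.isAbsolute()) {
            return null;                 // already expanded
        }
        return URI.create(baseSystemId).resolve(id).toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("strict.dtd", "http://www.w3.org/TR/html4/index.html"));
        System.out.println(expand("http://www.w3.org/TR/html4/loose.dtd", "http://example.org/docs/"));
    }
}
```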
    • fixURI

      protected static String fixURI(String str)
      Fixes a platform dependent filename to standard URI form.
      Parameters:
      str - The string to fix.
      Returns:
      Returns the fixed URI string.
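One plausible reading of "platform dependent filename" is a Windows path with backslashes and a drive letter. The sketch below handles only those two cases; both rules are assumptions made for illustration, and the actual fixURI may apply different or additional ones.

```java
final class FixUriSketch {
    /** Converts backslashes to slashes and prefixes drive-letter paths with "/". */
    static String fixURI(String str) {
        String s = str.replace('\\', '/');
        if (s.length() >= 2 && s.charAt(1) == ':') {
            char drive = s.charAt(0);
            boolean isDriveLetter = (drive >= 'A' && drive <= 'Z') || (drive >= 'a' && drive <= 'z');
            if (isDriveLetter) {
                s = "/" + s;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fixURI("C:\\docs\\index.html"));
        System.out.println(fixURI("docs/index.html"));
    }
}
```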
    • modifyName

      protected static String modifyName(String name, short mode)
    • getNamesValue

      protected static short getNamesValue(String value)
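modifyName and getNamesValue carry no description here, but the NAMES_ELEMS and NAMES_ATTRS properties ("upper", "lower", "default") and the NAMES_* constants hint at their roles. A minimal sketch of that reading follows; the numeric constant values are assumed, since this section does not list them.

```java
import java.util.Locale;

final class NameCaseSketch {
    // Assumed numeric values; the real constants are not shown in this section.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps a property value ("upper", "lower", anything else) to a mode. */
    static short getNamesValue(String value) {
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        return NAMES_NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", getNamesValue("upper")));
        System.out.println(modifyName("Div", getNamesValue("lower")));
        System.out.println(modifyName("Div", getNamesValue("default")));
    }
}
```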
    • setScanner

      protected void setScanner(HTMLScanner.Scanner scanner)
    • setScannerState

      protected void setScannerState(short state)
    • scanDoctype

      protected void scanDoctype() throws IOException
      Throws:
      IOException
    • scanLiteral

      protected String scanLiteral() throws IOException
      Throws:
      IOException
    • scanName

      protected String scanName(boolean strict) throws IOException
      Throws:
      IOException
    • scanTagName

      protected String scanTagName() throws IOException
      Throws:
      IOException
    • scanEntityRef

      protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws IOException
      Throws:
      IOException
    • returnEntityRefString

      private int returnEntityRefString(XMLString str, boolean content)
    • skip

      protected boolean skip(String s, boolean caseSensitive) throws IOException
      Throws:
      IOException
    • skipMarkup

      protected boolean skipMarkup(boolean balance) throws IOException
      Throws:
      IOException
    • skipSpaces

      protected boolean skipSpaces() throws IOException
      Throws:
      IOException
    • skipNewlines

      protected int skipNewlines() throws IOException
      Throws:
      IOException
    • locationAugs

      protected final Augmentations locationAugs()
    • synthesizedAugs

      protected final Augmentations synthesizedAugs()
    • builtinXmlRef

      protected static boolean builtinXmlRef(String name)
    • isEncodingCompatible

      static boolean isEncodingCompatible(String encoding1, String encoding2)
      Detects whether two encodings are compatible. To be compatible, both must be able to read the meta tag specifying the new encoding, which means that the byte representation of some minimal HTML markup must be the same in both encodings.
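The byte-representation criterion can be checked directly with java.nio.charset. The sketch below compares the encoded bytes of a sample meta tag prefix; the sample markup and the class name are assumptions, and unlike the method above it takes Charset objects rather than encoding names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingCompatSketch {
    // Assumed sample markup; the reference string used by the scanner is not shown here.
    private static final String REFERENCE =
            "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=";

    /** True if the sample markup encodes to identical bytes in both charsets. */
    static boolean isCompatible(Charset a, Charset b) {
        return Arrays.equals(REFERENCE.getBytes(a), REFERENCE.getBytes(b));
    }

    public static void main(String[] args) {
        System.out.println(isCompatible(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        System.out.println(isCompatible(StandardCharsets.UTF_8, StandardCharsets.UTF_16BE));
    }
}
```

ASCII-compatible encodings such as UTF-8 and ISO-8859-1 agree on these bytes, whereas UTF-16 variants use two bytes per character and therefore do not.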
    • canRoundtrip

      private static boolean canRoundtrip(String encodeCharset, String decodeCharset) throws UnsupportedEncodingException
      Throws:
      UnsupportedEncodingException
    • readPreservingBufferContent

      protected int readPreservingBufferContent() throws IOException
      Throws:
      IOException