Package org.htmlunit.cyberneko
Class HTMLScanner
- java.lang.Object
  - org.htmlunit.cyberneko.HTMLScanner
- All Implemented Interfaces: HTMLComponent, XMLComponent, XMLDocumentScanner, XMLDocumentSource, XMLLocator
public class HTMLScanner extends java.lang.Object implements XMLDocumentScanner, XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
- http://cyberneko.org/html/features/scanner/normalize-attrs
- http://cyberneko.org/html/features/scanner/plain-attr-values
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/encoding-translator
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
- See Also: HTMLElements
Nested Class Summary
Nested Classes (modifier and type, class: description)
- class HTMLScanner.ContentScanner: The primary HTML document scanner.
- private static class HTMLScanner.CurrentEntity: Current entity.
- (package private) static class HTMLScanner.LocationItem: Location infoset item.
- class HTMLScanner.PlainTextScanner: Special scanner used for PLAINTEXT.
- static interface HTMLScanner.Scanner: Basic scanner interface.
- private static class HTMLScanner.ScanScriptState
- class HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields (modifier and type, field: description)
- static java.lang.String ALLOW_SELFCLOSING_IFRAME: Allows self closing <iframe/> tag.
- static java.lang.String ALLOW_SELFCLOSING_TAGS: Allows self closing tags e.g. <div/> (XHTML).
- static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- static java.lang.String CDATA_SECTIONS: Scan CDATA sections.
- private static boolean DEBUG_BUFFER: Set to true to debug the buffer.
- protected static boolean DEBUG_CALLBACKS: Set to true to debug callbacks.
- private static boolean DEBUG_CHARSET: Set to true to debug character encoding handling.
- private static boolean DEBUG_SCANNER: Set to true to debug changes in the scanner.
- private static boolean DEBUG_SCANNER_STATE: Set to true to debug changes in the scanner state.
- protected static int DEFAULT_BUFFER_SIZE
- static java.lang.String DEFAULT_ENCODING: Default encoding.
- static java.lang.String DOCTYPE_PUBID: Doctype declaration public identifier.
- static java.lang.String DOCTYPE_SYSID: Doctype declaration system identifier.
- static java.lang.String ENCODING_TRANSLATOR: Encoding translator.
- static java.lang.String ERROR_REPORTER: Error reporter.
- (package private) boolean fAllowSelfclosingIframe_: Allows self closing iframe tags.
- (package private) boolean fAllowSelfclosingTags_: Allows self closing tags.
- private boolean fAugmentations_: Augmentations.
- protected int fBeginCharacterOffset: Beginning character offset in the file.
- protected int fBeginColumnNumber: Beginning column number.
- protected int fBeginLineNumber: Beginning line number.
- protected PlaybackInputStream fByteStream: The playback byte stream.
- (package private) boolean fCDATASections_: CDATA sections.
- protected HTMLScanner.Scanner fContentScanner: Content scanner.
- (package private) HTMLScanner.CurrentEntity fCurrentEntity: Current entity.
- protected MiniStack<HTMLScanner.CurrentEntity> fCurrentEntityStack: The current entity stack.
- protected java.lang.String fDefaultIANAEncoding: Default encoding.
- protected java.lang.String fDoctypePubid: Doctype declaration public identifier.
- protected java.lang.String fDoctypeSysid: Doctype declaration system identifier.
- protected XMLDocumentHandler fDocumentHandler: The document handler.
- protected int fElementCount: Element count.
- protected int fElementDepth: Element depth.
- protected EncodingTranslator fEncodingTranslator: Encoding translator.
- protected int fEndCharacterOffset: Ending character offset in the file.
- protected int fEndColumnNumber: Ending column number.
- protected int fEndLineNumber: Ending line number.
- protected HTMLErrorReporter fErrorReporter: Error reporter.
- protected java.lang.String fIANAEncoding: Auto-detected IANA encoding.
- (package private) boolean fIgnoreSpecifiedCharset_: Ignore specified character set.
- (package private) boolean fInsertDoctype_: Insert document type declaration.
- protected java.lang.String fJavaEncoding: Auto-detected Java encoding.
- private HTMLScanner.LocationItem fLocationItem: Our location item, to be reused because Augmentations says so, so let's save on memory.
- protected short fNamesAttrs: Modify HTML attribute names.
- protected short fNamesElems: Modify HTML element names.
- (package private) boolean fNormalizeAttributes_: Normalize attribute values.
- private boolean fOverrideDoctype_: Override doctype declaration public and system identifiers.
- (package private) boolean fParseNoScriptContent_: Parse noscript content.
- (package private) boolean fPlainAttributeValues_: Store the plain attribute values also.
- (package private) boolean fReportErrors_: Report errors.
- (package private) XMLString fScanComment
- private XMLString fScanLiteral
- protected HTMLScanner.Scanner fScanner: The current scanner.
- protected short fScannerState: The current scanner state.
- (package private) XMLString fScanScriptContent: String buffer, larger because script areas are larger.
- (package private) XMLString fScanUntilEndTag
- (package private) boolean fScriptStripCDATADelims_: Strip CDATA delimiters from SCRIPT tags.
- (package private) boolean fScriptStripCommentDelims_: Strip comment delimiters from SCRIPT tags.
- (package private) boolean[] fSingleBoolean: Single boolean array.
- protected HTMLScanner.SpecialScanner fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
- protected XMLString fStringBuffer: String buffer.
- (package private) XMLString fStringBufferEntiyRef: String buffer used when resolving entity refs.
- (package private) XMLString fStringBufferPlainAttribValue
- (package private) boolean fStyleStripCDATADelims_: Strip CDATA delimiters from STYLE tags.
- (package private) boolean fStyleStripCommentDelims_: Strip comment delimiters from STYLE tags.
- static java.lang.String HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- static java.lang.String HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- static java.lang.String HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- static java.lang.String HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- static java.lang.String HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- static java.lang.String HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- (package private) HTMLConfiguration htmlConfiguration_
- static java.lang.String IGNORE_SPECIFIED_CHARSET: Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- static java.lang.String INSERT_DOCTYPE: Insert document type declaration.
- static java.lang.String NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }.
- static java.lang.String NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }.
- protected static short NAMES_LOWERCASE: Lowercase HTML names.
- protected static short NAMES_NO_CHANGE: Don't modify HTML names.
- protected static short NAMES_UPPERCASE: Uppercase HTML names.
- static java.lang.String NORMALIZE_ATTRIBUTES: Normalize attribute values.
- static java.lang.String OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers.
- static java.lang.String PARSE_NOSCRIPT_CONTENT: Parse <noscript>...</noscript> content.
- static java.lang.String PLAIN_ATTRIBUTE_VALUES: Store the plain attribute values also.
- private static java.lang.String[] RECOGNIZED_FEATURES: Recognized features.
- private static java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS: Recognized features defaults.
- private static java.lang.String[] RECOGNIZED_PROPERTIES: Recognized properties.
- private static java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS: Recognized properties defaults.
- static java.lang.String REPORT_ERRORS: Report errors.
- static java.lang.String SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- static java.lang.String SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- protected static short STATE_CONTENT: State: content.
- protected static short STATE_END_DOCUMENT: State: end document.
- protected static short STATE_MARKUP_BRACKET: State: markup bracket.
- protected static short STATE_START_DOCUMENT: State: start document.
- static java.lang.String STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- static java.lang.String STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- protected static HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
-
Constructor Summary
Constructors (constructor: description)
- HTMLScanner(HTMLConfiguration htmlConfiguration): Creates a new HTMLScanner with the given configuration.
-
Method Summary
Methods (modifier and type, method: description)
- protected static boolean builtinXmlRef(java.lang.String name)
- private static boolean canRoundtrip(java.lang.String encodeCharset, java.lang.String decodeCharset)
- void cleanup(boolean closeall): Cleans up used resources.
- void evaluateInputSource(XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded.
- protected static java.lang.String fixURI(java.lang.String str): Fixes a platform dependent filename to standard URI form.
- java.lang.String getBaseSystemId(): Returns the base system identifier.
- int getCharacterOffset(): Returns the character offset.
- int getColumnNumber(): Returns the current column number.
- XMLDocumentHandler getDocumentHandler(): Returns the document handler.
- java.lang.String getEncoding(): Returns the encoding.
- java.lang.String getExpandedSystemId(): Returns the expanded system identifier.
- java.lang.Boolean getFeatureDefault(java.lang.String featureId): Returns the default state for a feature.
- int getLineNumber(): Returns the current line number.
- java.lang.String getLiteralSystemId(): Returns the literal system identifier.
- protected static short getNamesValue(java.lang.String value)
- java.lang.Object getPropertyDefault(java.lang.String propertyId): Returns the default state for a property.
- java.lang.String getPublicId(): Returns the public identifier.
- private java.io.Reader getReader(XMLInputSource inputSource)
- java.lang.String[] getRecognizedFeatures(): Returns recognized features.
- java.lang.String[] getRecognizedProperties(): Returns recognized properties.
- protected static java.lang.String getValue(XMLAttributes attrs, java.lang.String aname)
- java.lang.String getXMLVersion(): Returns the XML version.
- (package private) static boolean isEncodingCompatible(java.lang.String encoding1, java.lang.String encoding2): To detect if two encodings are compatible, both must be able to read the meta tag specifying the new encoding.
- protected Augmentations locationAugs()
- protected static java.lang.String modifyName(java.lang.String name, short mode)
- void pushInputSource(XMLInputSource inputSource): Pushes an input source onto the current entity stack.
- protected int readPreservingBufferContent()
- void reset(XMLComponentManager manager): Resets the component.
- private int returnEntityRefString(XMLString str, boolean content)
- protected void scanDoctype()
- boolean scanDocument(boolean complete): Scans the document.
- protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content)
- protected java.lang.String scanLiteral()
- protected java.lang.String scanName(boolean strict)
- protected java.lang.String scanTagName()
- void setDocumentHandler(XMLDocumentHandler handler): Sets the document handler.
- void setFeature(java.lang.String featureId, boolean state): Sets a feature.
- void setInputSource(XMLInputSource source): Sets the input source.
- void setProperty(java.lang.String propertyId, java.lang.Object value): Sets a property.
- protected void setScanner(HTMLScanner.Scanner scanner)
- protected void setScannerState(short state)
- protected boolean skip(java.lang.String s, boolean caseSensitive)
- protected boolean skipMarkup(boolean balance)
- protected int skipNewlines()
- protected boolean skipSpaces()
- protected Augmentations synthesizedAugs()
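The summary entry for isEncodingCompatible says that both encodings must be able to read the meta tag that announces the new encoding. One way to picture such a check is a round trip of an ASCII meta tag through both charsets. The sketch below is an assumption about how that could work, not the scanner's actual code: the class name EncodingCompatSketch, the probe string and both method bodies are hypothetical, and only the method names echo the summary.

```java
import java.nio.charset.Charset;

/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class EncodingCompatSketch {

    // Hypothetical probe: a meta tag made of plain ASCII characters.
    private static final String PROBE =
            "<meta http-equiv='Content-Type' content='text/html;charset=x'>";

    /** Encodes the probe with one charset and checks that the other decodes it unchanged. */
    static boolean canRoundtrip(String encodeCharset, String decodeCharset) {
        byte[] bytes = PROBE.getBytes(Charset.forName(encodeCharset));
        String decoded = new String(bytes, Charset.forName(decodeCharset));
        return PROBE.equals(decoded);
    }

    /** Assumed rule: compatible only if the round trip works in both directions. */
    static boolean isEncodingCompatible(String encoding1, String encoding2) {
        return canRoundtrip(encoding1, encoding2) && canRoundtrip(encoding2, encoding1);
    }

    public static void main(String[] args) {
        System.out.println(isEncodingCompatible("UTF-8", "ISO-8859-1"));
        System.out.println(isEncodingCompatible("UTF-8", "UTF-16"));
    }
}
```

Under this assumed rule, two ASCII-compatible charsets pass, while a pairing such as UTF-8 and UTF-16 fails because the byte layouts differ.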
-
-
-
Field Detail
-
HTML_4_01_STRICT_PUBID
public static final java.lang.String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- See Also: Constant Field Values
-
HTML_4_01_STRICT_SYSID
public static final java.lang.String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- See Also: Constant Field Values
-
HTML_4_01_TRANSITIONAL_PUBID
public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- See Also: Constant Field Values
-
HTML_4_01_TRANSITIONAL_SYSID
public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- See Also: Constant Field Values
-
HTML_4_01_FRAMESET_PUBID
public static final java.lang.String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- See Also: Constant Field Values
-
HTML_4_01_FRAMESET_SYSID
public static final java.lang.String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- See Also: Constant Field Values
-
AUGMENTATIONS
public static final java.lang.String AUGMENTATIONS
Include infoset augmentations.
- See Also: Constant Field Values
-
REPORT_ERRORS
public static final java.lang.String REPORT_ERRORS
Report errors.
- See Also: Constant Field Values
-
SCRIPT_STRIP_COMMENT_DELIMS
public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- See Also: Constant Field Values
-
SCRIPT_STRIP_CDATA_DELIMS
public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- See Also: Constant Field Values
-
STYLE_STRIP_COMMENT_DELIMS
public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- See Also: Constant Field Values
-
STYLE_STRIP_CDATA_DELIMS
public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- See Also: Constant Field Values
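The four strip features above remove a wrapping pair of delimiters from SCRIPT or STYLE content. A minimal sketch of that idea follows. The class DelimiterStripSketch, its method names and the exact matching rule (content, after trimming whitespace, must both start and end with the delimiters) are assumptions for illustration; the real scanner may treat whitespace and partial matches differently.

```java
/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class DelimiterStripSketch {

    /** Removes open/close if the trimmed content is wrapped in them; otherwise returns content unchanged. */
    static String strip(String content, String open, String close) {
        String trimmed = content.trim();
        boolean wrapped = trimmed.length() >= open.length() + close.length()
                && trimmed.startsWith(open)
                && trimmed.endsWith(close);
        if (!wrapped) {
            return content;
        }
        return trimmed.substring(open.length(), trimmed.length() - close.length());
    }

    /** HTML comment delimiters, as named in the feature descriptions. */
    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    /** XHTML CDATA delimiters, as named in the feature descriptions. */
    static String stripCDataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!-- alert(1); -->"));
        System.out.println(stripCDataDelims("<![CDATA[body { margin: 0 }]]>"));
    }
}
```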
-
IGNORE_SPECIFIED_CHARSET
public static final java.lang.String IGNORE_SPECIFIED_CHARSET
Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- See Also: Constant Field Values
-
CDATA_SECTIONS
public static final java.lang.String CDATA_SECTIONS
Scan CDATA sections.
- See Also: Constant Field Values
-
OVERRIDE_DOCTYPE
public static final java.lang.String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.
- See Also: Constant Field Values
-
INSERT_DOCTYPE
public static final java.lang.String INSERT_DOCTYPE
Insert document type declaration.
- See Also: Constant Field Values
-
PARSE_NOSCRIPT_CONTENT
public static final java.lang.String PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.
- See Also: Constant Field Values
-
ALLOW_SELFCLOSING_IFRAME
public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
Allows self closing <iframe/> tag.
- See Also: Constant Field Values
-
ALLOW_SELFCLOSING_TAGS
public static final java.lang.String ALLOW_SELFCLOSING_TAGS
Allows self closing tags e.g. <div/> (XHTML).
- See Also: Constant Field Values
-
NORMALIZE_ATTRIBUTES
public static final java.lang.String NORMALIZE_ATTRIBUTES
Normalize attribute values.
- See Also: Constant Field Values
-
PLAIN_ATTRIBUTE_VALUES
public static final java.lang.String PLAIN_ATTRIBUTE_VALUES
Store the plain attribute values also.
- See Also: Constant Field Values
-
RECOGNIZED_FEATURES
private static final java.lang.String[] RECOGNIZED_FEATURES
Recognized features.
-
RECOGNIZED_FEATURES_DEFAULTS
private static final java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS
Recognized features defaults.
-
NAMES_ELEMS
public static final java.lang.String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
- See Also: Constant Field Values
-
NAMES_ATTRS
public static final java.lang.String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
- See Also: Constant Field Values
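The NAMES_ELEMS and NAMES_ATTRS properties accept "upper", "lower" or "default", and the class lists matching NAMES_UPPERCASE, NAMES_LOWERCASE and NAMES_NO_CHANGE constants together with getNamesValue and modifyName. The sketch below shows one plausible way these pieces fit together. The class NameCaseSketch, the numeric constant values and the method bodies are assumptions for illustration only; they are not copied from the HTMLScanner source.

```java
import java.util.Locale;

/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class NameCaseSketch {

    // Hypothetical numeric values; the real ones are on the "Constant Field Values" page.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps a property value ("upper", "lower", anything else) to a mode constant. */
    static short namesValue(String value) {
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        return NAMES_NO_CHANGE; // assumed fallback, covers "default"
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", namesValue("lower")));
        System.out.println(modifyName("onClick", namesValue("default")));
    }
}
```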
-
DEFAULT_ENCODING
public static final java.lang.String DEFAULT_ENCODING
Default encoding.
- See Also: Constant Field Values
-
ERROR_REPORTER
public static final java.lang.String ERROR_REPORTER
Error reporter.
- See Also: Constant Field Values
-
ENCODING_TRANSLATOR
public static final java.lang.String ENCODING_TRANSLATOR
Encoding translator.
- See Also: Constant Field Values
-
DOCTYPE_PUBID
public static final java.lang.String DOCTYPE_PUBID
Doctype declaration public identifier.
- See Also: Constant Field Values
-
DOCTYPE_SYSID
public static final java.lang.String DOCTYPE_SYSID
Doctype declaration system identifier.
- See Also: Constant Field Values
-
RECOGNIZED_PROPERTIES
private static final java.lang.String[] RECOGNIZED_PROPERTIES
Recognized properties.
-
RECOGNIZED_PROPERTIES_DEFAULTS
private static final java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS
Recognized properties defaults.
-
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.
- See Also: Constant Field Values
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.
- See Also: Constant Field Values
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.
- See Also: Constant Field Values
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.
- See Also: Constant Field Values
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.
- See Also: Constant Field Values
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.
- See Also: Constant Field Values
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.
- See Also: Constant Field Values
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
- See Also: Constant Field Values
-
DEBUG_SCANNER
private static final boolean DEBUG_SCANNER
Set to true to debug changes in the scanner.
- See Also: Constant Field Values
-
DEBUG_SCANNER_STATE
private static final boolean DEBUG_SCANNER_STATE
Set to true to debug changes in the scanner state.
- See Also: Constant Field Values
-
DEBUG_BUFFER
private static final boolean DEBUG_BUFFER
Set to true to debug the buffer.
- See Also: Constant Field Values
-
DEBUG_CHARSET
private static final boolean DEBUG_CHARSET
Set to true to debug character encoding handling.
- See Also: Constant Field Values
-
DEBUG_CALLBACKS
protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.
- See Also: Constant Field Values
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fAugmentations_
private boolean fAugmentations_
Augmentations.
-
fReportErrors_
boolean fReportErrors_
Report errors.
-
fScriptStripCDATADelims_
boolean fScriptStripCDATADelims_
Strip CDATA delimiters from SCRIPT tags.
-
fScriptStripCommentDelims_
boolean fScriptStripCommentDelims_
Strip comment delimiters from SCRIPT tags.
-
fStyleStripCDATADelims_
boolean fStyleStripCDATADelims_
Strip CDATA delimiters from STYLE tags.
-
fStyleStripCommentDelims_
boolean fStyleStripCommentDelims_
Strip comment delimiters from STYLE tags.
-
fIgnoreSpecifiedCharset_
boolean fIgnoreSpecifiedCharset_
Ignore specified character set.
-
fCDATASections_
boolean fCDATASections_
CDATA sections.
-
fOverrideDoctype_
private boolean fOverrideDoctype_
Override doctype declaration public and system identifiers.
-
fInsertDoctype_
boolean fInsertDoctype_
Insert document type declaration.
-
fNormalizeAttributes_
boolean fNormalizeAttributes_
Normalize attribute values.
-
fPlainAttributeValues_
boolean fPlainAttributeValues_
Store the plain attribute values also.
-
fParseNoScriptContent_
boolean fParseNoScriptContent_
Parse noscript content.
-
fAllowSelfclosingIframe_
boolean fAllowSelfclosingIframe_
Allows self closing iframe tags.
-
fAllowSelfclosingTags_
boolean fAllowSelfclosingTags_
Allows self closing tags.
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names.
-
fDefaultIANAEncoding
protected java.lang.String fDefaultIANAEncoding
Default encoding.
-
fErrorReporter
protected HTMLErrorReporter fErrorReporter
Error reporter.
-
fEncodingTranslator
protected EncodingTranslator fEncodingTranslator
Encoding translator.
-
fDoctypePubid
protected java.lang.String fDoctypePubid
Doctype declaration public identifier.
-
fDoctypeSysid
protected java.lang.String fDoctypeSysid
Doctype declaration system identifier.
-
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number.
-
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number.
-
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file.
-
fEndLineNumber
protected int fEndLineNumber
Ending line number.
-
fEndColumnNumber
protected int fEndColumnNumber
Ending column number.
-
fEndCharacterOffset
protected int fEndCharacterOffset
Ending character offset in the file.
-
fByteStream
protected PlaybackInputStream fByteStream
The playback byte stream.
-
fCurrentEntity
HTMLScanner.CurrentEntity fCurrentEntity
Current entity.
-
fCurrentEntityStack
protected final MiniStack<HTMLScanner.CurrentEntity> fCurrentEntityStack
The current entity stack.
-
fScanner
protected HTMLScanner.Scanner fScanner
The current scanner.
-
fScannerState
protected short fScannerState
The current scanner state.
-
fDocumentHandler
protected XMLDocumentHandler fDocumentHandler
The document handler.
-
fIANAEncoding
protected java.lang.String fIANAEncoding
Auto-detected IANA encoding.
-
fJavaEncoding
protected java.lang.String fJavaEncoding
Auto-detected Java encoding.
-
fElementCount
protected int fElementCount
Element count.
-
fElementDepth
protected int fElementDepth
Element depth.
-
fContentScanner
protected HTMLScanner.Scanner fContentScanner
Content scanner.
-
fSpecialScanner
protected final HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
-
fStringBuffer
protected final XMLString fStringBuffer
String buffer.
-
fStringBufferEntiyRef
final XMLString fStringBufferEntiyRef
String buffer used when resolving entity refs.
-
fStringBufferPlainAttribValue
final XMLString fStringBufferPlainAttribValue
-
fScanScriptContent
final XMLString fScanScriptContent
String buffer, larger because script areas are larger.
-
fScanUntilEndTag
final XMLString fScanUntilEndTag
-
fScanComment
final XMLString fScanComment
-
fScanLiteral
private final XMLString fScanLiteral
-
fSingleBoolean
final boolean[] fSingleBoolean
Single boolean array.
-
htmlConfiguration_
final HTMLConfiguration htmlConfiguration_
-
fLocationItem
private final HTMLScanner.LocationItem fLocationItem
Our location item, to be reused because Augmentations says so, so let's save on memory.
-
-
Constructor Detail
-
HTMLScanner
HTMLScanner(HTMLConfiguration htmlConfiguration)
Creates a new HTMLScanner with the given configuration.
- Parameters: htmlConfiguration - the configuration to use
-
-
Method Detail
-
pushInputSource
public void pushInputSource(XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters: inputSource - The new input source to start scanning.
- See Also: evaluateInputSource(XMLInputSource)
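The push-and-resume behavior described for pushInputSource can be illustrated with a stack of readers. This is a simplified sketch, not the scanner's implementation: the class EntityStackSketch and its methods are hypothetical, plain java.io.Reader objects stand in for XMLInputSource, and java.util.ArrayDeque stands in for the MiniStack listed among the fields.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class EntityStackSketch {
    private final Deque<Reader> suspended = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader document) {
        this.current = document;
    }

    /** Suspends the current entity and starts reading from the new source. */
    void pushInputSource(Reader source) {
        suspended.push(current);
        current = source;
    }

    /** Reads one character; when a pushed entity ends, resumes the suspended one. */
    int read() throws IOException {
        while (true) {
            int c = current.read();
            if (c != -1) {
                return c;
            }
            if (suspended.isEmpty()) {
                return -1; // end of the original document
            }
            current.close();
            current = suspended.pop();
        }
    }

    /** Reads "ab", pushing "XY" after the first character has been consumed. */
    static String demo() {
        try {
            EntityStackSketch in = new EntityStackSketch(new StringReader("ab"));
            StringBuilder out = new StringBuilder();
            out.append((char) in.read());
            in.pushInputSource(new StringReader("XY"));
            int c;
            while ((c = in.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, the pushed content is consumed in full before the remainder of the original document, which mirrors the "returns to where it left off" wording above.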
-
getReader
private java.io.Reader getReader(XMLInputSource inputSource)
-
evaluateInputSource
public void evaluateInputSource(XMLInputSource inputSource)
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters: inputSource - The new input source to start evaluating.
- See Also: pushInputSource(XMLInputSource)
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters: closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
-
getEncoding
public java.lang.String getEncoding()
Returns the encoding.
- Specified by: getEncoding in interface XMLLocator
- Returns: the encoding of the current entity. Note that, for a given entity, this value can only be considered final once the encoding declaration has been read (or once it has been determined that there is no such declaration) since, no encoding having been specified on the XMLInputSource, the parser will make an initial "guess" which could be in error.
-
getPublicId
public java.lang.String getPublicId()
Returns the public identifier.
- Specified by: getPublicId in interface XMLLocator
- Returns: the public identifier.
-
getBaseSystemId
public java.lang.String getBaseSystemId()
Returns the base system identifier.
- Specified by: getBaseSystemId in interface XMLLocator
- Returns: the base system identifier.
-
getLiteralSystemId
public java.lang.String getLiteralSystemId()
Returns the literal system identifier.
- Specified by: getLiteralSystemId in interface XMLLocator
- Returns: the literal system identifier.
-
getExpandedSystemId
public java.lang.String getExpandedSystemId()
Returns the expanded system identifier.
- Specified by: getExpandedSystemId in interface XMLLocator
- Returns: the expanded system identifier.
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by: getLineNumber in interface XMLLocator
- Returns: the line number, or -1 if no line number is available.
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by: getColumnNumber in interface XMLLocator
- Returns: the column number, or -1 if no column number is available.
-
getXMLVersion
public java.lang.String getXMLVersion()
Returns the XML version.
- Specified by: getXMLVersion in interface XMLLocator
- Returns: the XML version of the current entity. This will normally be the value from the XML or text declaration or defaulted by the parser. Note that this value may be different than the version of the processing rules applied to the current entity. For instance, an XML 1.1 document may refer to XML 1.0 entities. In such a case the rules of XML 1.1 are applied to the entire document. Also note that, for a given entity, this value can only be considered final once the XML or text declaration has been read or once it has been determined that there is no such declaration.
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by: getCharacterOffset in interface XMLLocator
- Returns: the character offset, or -1 if no character offset is available.
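The three locator getters (getLineNumber, getColumnNumber, getCharacterOffset) report a position that advances as characters are consumed. A minimal sketch of such bookkeeping follows. The class LocationSketch is hypothetical, and the starting values (line and column counted from 1, offset from 0) as well as treating only '\n' as a line break are assumptions; the documentation above does not state them.

```java
/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class LocationSketch {
    // Assumed conventions: 1-based line and column, 0-based character offset.
    private int line = 1;
    private int column = 1;
    private int offset = 0;

    /** Advances the position past one character. */
    void consume(char c) {
        offset++;
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
    }

    /** Advances the position past every character of the given text. */
    void consume(String text) {
        for (int i = 0; i < text.length(); i++) {
            consume(text.charAt(i));
        }
    }

    int getLineNumber() {
        return line;
    }

    int getColumnNumber() {
        return column;
    }

    int getCharacterOffset() {
        return offset;
    }

    public static void main(String[] args) {
        LocationSketch loc = new LocationSketch();
        loc.consume("<p>\nhi");
        System.out.println(loc.getLineNumber() + ":" + loc.getColumnNumber()
                + " @" + loc.getCharacterOffset());
    }
}
```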
-
getFeatureDefault
public java.lang.Boolean getFeatureDefault(java.lang.String featureId)
Returns the default state for a feature.
- Specified by: getFeatureDefault in interface HTMLComponent
- Specified by: getFeatureDefault in interface XMLComponent
- Parameters: featureId - The feature identifier.
- Returns: the default state for a feature, or null if this component does not want to report a default value for this feature.
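The class keeps RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS as two arrays, which suggests a lookup by matching index. The sketch below shows that pattern under stated assumptions: the class FeatureDefaultsSketch is hypothetical, only two of the feature identifiers are listed, and the default values shown are placeholders chosen for the example, not the scanner's real defaults.

```java
/** Illustrative sketch only; not taken from the HTMLScanner source. */
final class FeatureDefaultsSketch {

    // Two of the feature identifiers listed in the class description.
    static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/augmentations",
        "http://cyberneko.org/html/features/report-errors",
    };

    // Placeholder defaults for the example only; null means "no default reported".
    static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        null,
        Boolean.FALSE,
    };

    /** Returns the default at the matching index, or null for unknown features. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault("http://cyberneko.org/html/features/report-errors"));
        System.out.println(getFeatureDefault("http://example.org/unknown-feature"));
    }
}
```

Callers need to handle the null case, since both "unknown feature" and "no default reported" produce it in this sketch.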
-
getPropertyDefault
public java.lang.Object getPropertyDefault(java.lang.String propertyId)
Returns the default state for a property.
- Specified by: getPropertyDefault in interface HTMLComponent
- Specified by: getPropertyDefault in interface XMLComponent
- Parameters: propertyId - The property identifier.
- Returns: the default state for a property, or null if this component does not want to report a default value for this property.
-
getRecognizedFeatures
public java.lang.String[] getRecognizedFeatures()
Returns recognized features.
- Specified by: getRecognizedFeatures in interface XMLComponent
- Returns: an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
-
getRecognizedProperties
public java.lang.String[] getRecognizedProperties()
Returns recognized properties.
- Specified by: getRecognizedProperties in interface XMLComponent
- Returns: an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
-
reset
public void reset(XMLComponentManager manager) throws XMLConfigurationException
Resets the component.
- Specified by: reset in interface XMLComponent
- Parameters: manager - The component manager.
- Throws: XMLConfigurationException
-
setFeature
public void setFeature(java.lang.String featureId, boolean state)
Sets a feature.
- Specified by: setFeature in interface XMLComponent
- Parameters: featureId - The feature identifier.
- Parameters: state - The state of the feature.
-
setProperty
public void setProperty(java.lang.String propertyId, java.lang.Object value) throws XMLConfigurationException
Sets a property.
- Specified by: setProperty in interface XMLComponent
- Parameters: propertyId - The property identifier.
- Parameters: value - The value of the property.
- Throws: XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setInputSource
public void setInputSource(XMLInputSource source) throws java.io.IOException
Sets the input source.
- Specified by: setInputSource in interface XMLDocumentScanner
- Parameters: source - The input source.
- Throws: java.io.IOException - Thrown on i/o error.
-
scanDocument
public boolean scanDocument(boolean complete) throws XNIException, java.io.IOException
Scans the document.
- Specified by:
scanDocument
in interface XMLDocumentScanner
- Parameters:
complete
- True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
- Returns:
- True if there is more to scan, false otherwise.
- Throws:
XNIException
- on error.
java.io.IOException
- Thrown on i/o error.
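The difference between complete and incremental ("pull") scanning described above can be illustrated with a small sketch. ToyScanner below is a hypothetical stand-in that only mirrors the documented contract of scanDocument(boolean); it is not HTMLScanner, and its notion of a "portion" (one string per call) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PullScanSketch {

    /** Hypothetical stand-in for a scanner; NOT the HTMLScanner implementation. */
    static final class ToyScanner {
        private final String[] chunks;
        private final List<String> events = new ArrayList<>();
        private int next;

        ToyScanner(String... chunks) {
            this.chunks = chunks;
        }

        // complete == true: push every remaining portion before returning.
        // complete == false: push only the next portion, then return.
        // The return value is true while there is more to scan.
        boolean scanDocument(boolean complete) {
            do {
                if (next < chunks.length) {
                    events.add(chunks[next++]);
                }
            } while (complete && next < chunks.length);
            return next < chunks.length;
        }

        List<String> events() {
            return events;
        }
    }

    // Drives the scanner in pull mode and counts how many calls were needed.
    static int countPullCalls(ToyScanner scanner) {
        int calls = 0;
        boolean more;
        do {
            more = scanner.scanDocument(false);
            calls++;
        } while (more);
        return calls;
    }

    public static void main(String[] args) {
        ToyScanner pull = new ToyScanner("<p>", "text", "</p>");
        System.out.println(countPullCalls(pull));     // one call per portion

        ToyScanner push = new ToyScanner("<p>", "text", "</p>");
        System.out.println(push.scanDocument(true));  // nothing left after a complete scan
        System.out.println(push.events());
    }
}
```

With a real scanner, the same loop shape applies: call scanDocument(false) repeatedly until it returns false, or call scanDocument(true) once. Note that a scanner is allowed to scan the whole document even when false is passed.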
-
setDocumentHandler
public void setDocumentHandler(XMLDocumentHandler handler)
Sets the document handler.
- Specified by:
setDocumentHandler
in interface XMLDocumentSource
- Parameters:
handler
- the new handler
-
getDocumentHandler
public XMLDocumentHandler getDocumentHandler()
Returns the document handler.
- Specified by:
getDocumentHandler
in interface XMLDocumentSource
- Returns:
- the document handler
-
getValue
protected static java.lang.String getValue(XMLAttributes attrs, java.lang.String aname)
-
expandSystemId
public static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. A thrown exception indicates a failure to expand the id.
- Parameters:
systemId
- The system identifier to be expanded.
baseSystemId
- The base system identifier.
- Returns:
- Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
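The documented contract (resolve a relative identifier against a base, return null when the identifier is already expanded, throw on failure) can be sketched with java.net.URI. This is an illustration of the contract only, not the code used by HTMLScanner.expandSystemId; ExpandSystemIdSketch and expand are hypothetical names, and treating "absolute URI" as "already expanded" is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class ExpandSystemIdSketch {

    // Returns the expanded identifier, or null if systemId is already absolute.
    // A URISyntaxException signals that the identifier could not be expanded.
    static String expand(String systemId, String baseSystemId) throws URISyntaxException {
        URI id = new URI(systemId);
        if (id.isAbsolute()) {
            return null; // assumption: an absolute URI counts as "already expanded"
        }
        URI base = new URI(baseSystemId);
        return base.resolve(id).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Relative identifier resolved against the base document.
        System.out.println(expand("img/logo.png", "http://example.com/docs/index.html"));
        // prints http://example.com/docs/img/logo.png

        // Already absolute, so the sketch reports null.
        System.out.println(expand("http://example.com/a.html", "http://example.com/docs/index.html"));
        // prints null
    }
}
```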
-
fixURI
protected static java.lang.String fixURI(java.lang.String str)
Converts a platform-dependent filename to standard URI form.
- Parameters:
str
- The string to fix.
- Returns:
- Returns the fixed URI string.
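As an illustration of what "fixing" a platform-dependent filename can involve, the sketch below normalizes separators and handles Windows drive letters and UNC-style paths. The exact rules applied by HTMLScanner.fixURI are not given in this reference, so the rules here are assumptions. FixUriSketch and toUriForm are hypothetical names, and the separator is passed in explicitly (instead of reading java.io.File.separatorChar) so that the example behaves the same on every platform.

```java
final class FixUriSketch {

    // Converts a filename using the given separator into a URI-style path.
    static String toUriForm(String path, char separator) {
        // Step 1: use '/' as the only separator.
        String str = path.replace(separator, '/');
        if (str.length() >= 2) {
            char ch0 = str.charAt(0);
            char ch1 = str.charAt(1);
            if (ch1 == ':' && Character.isLetter(ch0)) {
                // Drive letter: "C:/dir/file" becomes "/C:/dir/file".
                str = "/" + str;
            } else if (ch0 == '/' && ch1 == '/') {
                // UNC-style path: "//host/share" becomes "file://host/share".
                str = "file:" + str;
            }
        }
        return str;
    }

    public static void main(String[] args) {
        System.out.println(toUriForm("C:\\docs\\page.html", '\\'));       // /C:/docs/page.html
        System.out.println(toUriForm("\\\\server\\share\\a.html", '\\')); // file://server/share/a.html
        System.out.println(toUriForm("/usr/share/a.html", '/'));          // /usr/share/a.html
    }
}
```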
-
modifyName
protected static java.lang.String modifyName(java.lang.String name, short mode)
-
getNamesValue
protected static short getNamesValue(java.lang.String value)
-
setScanner
protected void setScanner(HTMLScanner.Scanner scanner)
-
setScannerState
protected void setScannerState(short state)
-
scanDoctype
protected void scanDoctype() throws java.io.IOException
- Throws:
java.io.IOException
-
scanLiteral
protected java.lang.String scanLiteral() throws java.io.IOException
- Throws:
java.io.IOException
-
scanName
protected java.lang.String scanName(boolean strict) throws java.io.IOException
- Throws:
java.io.IOException
-
scanTagName
protected java.lang.String scanTagName() throws java.io.IOException
- Throws:
java.io.IOException
-
scanEntityRef
protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws java.io.IOException
- Throws:
java.io.IOException
-
returnEntityRefString
private int returnEntityRefString(XMLString str, boolean content)
-
skip
protected boolean skip(java.lang.String s, boolean caseSensitive) throws java.io.IOException
- Throws:
java.io.IOException
-
skipMarkup
protected boolean skipMarkup(boolean balance) throws java.io.IOException
- Throws:
java.io.IOException
-
skipSpaces
protected boolean skipSpaces() throws java.io.IOException
- Throws:
java.io.IOException
-
skipNewlines
protected int skipNewlines() throws java.io.IOException
- Throws:
java.io.IOException
-
locationAugs
protected final Augmentations locationAugs()
-
synthesizedAugs
protected final Augmentations synthesizedAugs()
-
builtinXmlRef
protected static boolean builtinXmlRef(java.lang.String name)
-
isEncodingCompatible
static boolean isEncodingCompatible(java.lang.String encoding1, java.lang.String encoding2)
Determines whether two encodings are compatible. To be compatible, both must be able to read the meta tag that specifies the new encoding, which means that the byte representation of some minimal HTML markup must be the same in both encodings.
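The compatibility idea described above (identical bytes for some minimal markup) can be sketched with java.nio.charset.Charset. This is only an illustration of the stated criterion, not the implementation of isEncodingCompatible. EncodingCompatSketch, sameBytesForProbe, and the PROBE string are hypothetical; the markup that HTMLScanner actually compares is not shown in this reference.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class EncodingCompatSketch {

    // Hypothetical probe: minimal markup that must read the same in both encodings.
    private static final String PROBE =
        "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=x\"></head>";

    // True if both encodings produce exactly the same bytes for the probe markup.
    // Charset.forName throws an unchecked exception for unknown encoding names.
    static boolean sameBytesForProbe(String encoding1, String encoding2) {
        Charset c1 = Charset.forName(encoding1);
        Charset c2 = Charset.forName(encoding2);
        return Arrays.equals(PROBE.getBytes(c1), PROBE.getBytes(c2));
    }

    public static void main(String[] args) {
        // ASCII-only markup has the same bytes in UTF-8 and ISO-8859-1.
        System.out.println(sameBytesForProbe("UTF-8", "ISO-8859-1")); // true
        // UTF-16 uses two bytes per character, so the bytes differ.
        System.out.println(sameBytesForProbe("UTF-8", "UTF-16"));     // false
    }
}
```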
-
canRoundtrip
private static boolean canRoundtrip(java.lang.String encodeCharset, java.lang.String decodeCharset) throws java.io.UnsupportedEncodingException
- Throws:
java.io.UnsupportedEncodingException
-
readPreservingBufferContent
protected int readPreservingBufferContent() throws java.io.IOException
- Throws:
java.io.IOException
-
-