Package org.cyberneko.html
Class HTMLScanner
- java.lang.Object
-
- org.cyberneko.html.HTMLScanner
-
- All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLLocator, HTMLComponent
public class HTMLScanner extends java.lang.Object implements org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://apache.org/xml/features/scanner/notify-char-refs
- http://apache.org/xml/features/scanner/notify-builtin-refs
- http://cyberneko.org/html/features/scanner/notify-builtin-refs
- http://cyberneko.org/html/features/scanner/fix-mswindows-refs
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
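As a rough illustration of how the feature and property identifiers listed above might be supplied, the sketch below collects a few of them in maps and hands each pair to a setter. The `ScannerSettings` class, its method names, and the chosen states and values are assumptions made for this example. The setters are passed in as callbacks so the block runs without NekoHTML on the classpath; in real use one would presumably pass method references to the `setFeature(String, boolean)` and `setProperty(String, Object)` methods documented on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper for this example; not part of NekoHTML. */
final class ScannerSettings {

    /** A few feature identifiers from the list above, with example states. */
    static Map<String, Boolean> exampleFeatures() {
        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put("http://cyberneko.org/html/features/augmentations", true);
        features.put("http://cyberneko.org/html/features/scanner/script/strip-comment-delims", true);
        features.put("http://cyberneko.org/html/features/scanner/ignore-specified-charset", false);
        return features;
    }

    /** Two property identifiers from the list above, with example values. */
    static Map<String, Object> exampleProperties() {
        Map<String, Object> properties = new LinkedHashMap<>();
        // "upper", "lower" and "default" are the values named for NAMES_ELEMS.
        properties.put("http://cyberneko.org/html/properties/names/elems", "lower");
        // "UTF-8" is only an example value chosen for this sketch.
        properties.put("http://cyberneko.org/html/properties/default-encoding", "UTF-8");
        return properties;
    }

    /** Hands every identifier/value pair to the supplied setters, in insertion order. */
    static void apply(BiConsumer<String, Boolean> featureSetter,
                      BiConsumer<String, Object> propertySetter) {
        exampleFeatures().forEach(featureSetter);
        exampleProperties().forEach(propertySetter);
    }

    public static void main(String[] args) {
        // Stand-in setters that only print what they receive.
        apply((id, state) -> System.out.println("feature  " + id + " = " + state),
              (id, value) -> System.out.println("property " + id + " = " + value));
    }
}
```

Passing the setters as callbacks keeps the identifiers in one place and lets the same settings be applied to whichever component exposes them.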
- Version:
- $Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
- Author:
- Andy Clark, Marc Guillemot, Ahmed Ashour
- See Also:
HTMLElements, HTMLEntities
-
-
Nested Class Summary
Nested classes (modifier and type, class, description):
- class HTMLScanner.ContentScanner: The primary HTML document scanner.
- static class HTMLScanner.CurrentEntity: Current entity.
- protected static class HTMLScanner.LocationItem: Location infoset item.
- static class HTMLScanner.PlaybackInputStream: A playback input stream.
- static interface HTMLScanner.Scanner: Basic scanner interface.
- class HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields (modifier and type, field, description):
- static java.lang.String ALLOW_SELFCLOSING_IFRAME: Allows the self-closing <iframe/> tag.
- static java.lang.String ALLOW_SELFCLOSING_TAGS: Allows self-closing tags, e.g. <div/> (XHTML).
- protected static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- static java.lang.String CDATA_SECTIONS: Scan CDATA sections.
- protected static boolean DEBUG_CALLBACKS: Set to true to debug callbacks.
- protected static int DEFAULT_BUFFER_SIZE: Default buffer size.
- protected static java.lang.String DEFAULT_ENCODING: Default encoding.
- protected static java.lang.String DOCTYPE_PUBID: Doctype declaration public identifier.
- protected static java.lang.String DOCTYPE_SYSID: Doctype declaration system identifier.
- protected static java.lang.String ERROR_REPORTER: Error reporter.
- protected boolean fAllowSelfclosingIframe: Allows self-closing iframe tags.
- protected boolean fAllowSelfclosingTags: Allows self-closing tags.
- protected boolean fAugmentations: Augmentations.
- protected int fBeginCharacterOffset: Beginning character offset in the file.
- protected int fBeginColumnNumber: Beginning column number.
- protected int fBeginLineNumber: Beginning line number.
- protected HTMLScanner.PlaybackInputStream fByteStream: The playback byte stream.
- protected boolean fCDATASections: CDATA sections.
- protected HTMLScanner.Scanner fContentScanner: Content scanner.
- protected HTMLScanner.CurrentEntity fCurrentEntity: Current entity.
- protected java.util.Stack fCurrentEntityStack: The current entity stack.
- protected java.lang.String fDefaultIANAEncoding: Default encoding.
- protected java.lang.String fDoctypePubid: Doctype declaration public identifier.
- protected java.lang.String fDoctypeSysid: Doctype declaration system identifier.
- protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler: The document handler.
- protected int fElementCount: Element count.
- protected int fElementDepth: Element depth.
- protected int fEndCharacterOffset: Ending character offset in the file.
- protected int fEndColumnNumber: Ending column number.
- protected int fEndLineNumber: Ending line number.
- protected HTMLErrorReporter fErrorReporter: Error reporter.
- protected boolean fFixWindowsCharRefs: Fix Microsoft Windows® character entity references.
- protected java.lang.String fIANAEncoding: Auto-detected IANA encoding.
- protected boolean fIgnoreSpecifiedCharset: Ignore specified character set.
- protected boolean fInsertDoctype: Insert document type declaration.
- protected boolean fIso8859Encoding: True if the encoding matches "ISO-8859-*".
- static java.lang.String FIX_MSWINDOWS_REFS: Fix Microsoft Windows® character entity references.
- protected java.lang.String fJavaEncoding: Auto-detected Java encoding.
- protected short fNamesAttrs: Modify HTML attribute names.
- protected short fNamesElems: Modify HTML element names.
- protected boolean fNormalizeAttributes: Normalize attribute values.
- protected boolean fNotifyCharRefs: Notify character entity references.
- protected boolean fNotifyHtmlBuiltinRefs: Notify HTML built-in general entity references.
- protected boolean fNotifyXmlBuiltinRefs: Notify XML built-in general entity references.
- protected boolean fOverrideDoctype: Override doctype declaration public and system identifiers.
- protected boolean fParseNoFramesContent: Parse noframes content.
- protected boolean fParseNoScriptContent: Parse noscript content.
- protected boolean fReportErrors: Report errors.
- protected HTMLScanner.Scanner fScanner: The current scanner.
- protected short fScannerState: The current scanner state.
- protected boolean fScriptStripCDATADelims: Strip CDATA delimiters from SCRIPT tags.
- protected boolean fScriptStripCommentDelims: Strip comment delimiters from SCRIPT tags.
- protected HTMLScanner.SpecialScanner fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
- protected org.apache.xerces.util.XMLStringBuffer fStringBuffer: String buffer.
- protected boolean fStyleStripCDATADelims: Strip CDATA delimiters from STYLE tags.
- protected boolean fStyleStripCommentDelims: Strip comment delimiters from STYLE tags.
- static java.lang.String HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- static java.lang.String HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- static java.lang.String HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- static java.lang.String HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- static java.lang.String HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- static java.lang.String HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- static java.lang.String IGNORE_SPECIFIED_CHARSET: Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- static java.lang.String INSERT_DOCTYPE: Insert document type declaration.
- protected static java.lang.String NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }.
- protected static java.lang.String NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }.
- protected static short NAMES_LOWERCASE: Lowercase HTML names.
- protected static short NAMES_NO_CHANGE: Don't modify HTML names.
- protected static short NAMES_UPPERCASE: Uppercase HTML names.
- protected static java.lang.String NORMALIZE_ATTRIBUTES: Normalize attribute values.
- static java.lang.String NOTIFY_CHAR_REFS: Notify character entity references (e.g. &#32;, &#x20;, etc).
- static java.lang.String NOTIFY_HTML_BUILTIN_REFS: Notify handler of built-in entity references (e.g. &nobr;, &copy;, etc).
- static java.lang.String NOTIFY_XML_BUILTIN_REFS: Notify handler of built-in entity references (e.g. &amp;, &lt;, etc).
- static java.lang.String OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers.
- static java.lang.String PARSE_NOSCRIPT_CONTENT: Parse <noscript>...</noscript> content.
- protected static java.lang.String REPORT_ERRORS: Report errors.
- static java.lang.String SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- static java.lang.String SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- protected static short STATE_CONTENT: State: content.
- protected static short STATE_END_DOCUMENT: State: end document.
- protected static short STATE_MARKUP_BRACKET: State: markup bracket.
- protected static short STATE_START_DOCUMENT: State: start document.
- static java.lang.String STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- static java.lang.String STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- protected static HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
-
Constructor Summary
Constructors (constructor, description):
- HTMLScanner()
-
Method Summary
Methods (modifier and type, method, description):
- protected static boolean builtinXmlRef(java.lang.String name): Returns true if the name is a built-in XML general entity reference.
- void cleanup(boolean closeall): Cleans up used resources.
- void evaluateInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded.
- protected static java.lang.String fixURI(java.lang.String str): Fixes a platform dependent filename to standard URI form.
- protected int fixWindowsCharacter(int origChar): Fixes Microsoft Windows® specific characters.
- java.lang.String getBaseSystemId(): Returns the base system identifier.
- int getCharacterOffset(): Returns the character offset.
- int getColumnNumber(): Returns the current column number.
- org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler(): Returns the document handler.
- java.lang.String getEncoding(): Returns the encoding.
- java.lang.String getExpandedSystemId(): Returns the expanded system identifier.
- java.lang.Boolean getFeatureDefault(java.lang.String featureId): Returns the default state for a feature.
- int getLineNumber(): Returns the current line number.
- java.lang.String getLiteralSystemId(): Returns the literal system identifier.
- protected static short getNamesValue(java.lang.String value): Converts HTML names string value to constant value.
- java.lang.Object getPropertyDefault(java.lang.String propertyId): Returns the default state for a property.
- java.lang.String getPublicId(): Returns the public identifier.
- java.lang.String[] getRecognizedFeatures(): Returns recognized features.
- java.lang.String[] getRecognizedProperties(): Returns recognized properties.
- protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String aname): Returns the value of the specified attribute, ignoring case.
- java.lang.String getXMLVersion(): Returns the XML version.
- protected org.apache.xerces.xni.Augmentations locationAugs(): Returns an augmentations object with a location item added.
- protected static java.lang.String modifyName(java.lang.String name, short mode): Modifies the given name based on the specified mode.
- void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource): Pushes an input source onto the current entity stack.
- protected int read(): Reads a single character.
- protected int readPreservingBufferContent(): Reads a single character, preserving the old buffer content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- protected org.apache.xerces.xni.XMLResourceIdentifier resourceId(): Returns an empty resource identifier.
- protected void scanDoctype(): Scans a DOCTYPE line.
- boolean scanDocument(boolean complete): Scans the document.
- protected int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str, boolean content): Scans an entity reference.
- protected java.lang.String scanLiteral(): Scans a quoted literal.
- protected java.lang.String scanName(boolean strict): Scans a name.
- void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler): Sets the document handler.
- void setFeature(java.lang.String featureId, boolean state): Sets a feature.
- void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source): Sets the input source.
- void setProperty(java.lang.String propertyId, java.lang.Object value): Sets a property.
- protected void setScanner(HTMLScanner.Scanner scanner): Sets the scanner.
- protected void setScannerState(short state): Sets the scanner state.
- protected boolean skip(java.lang.String s, boolean caseSensitive): Returns true if the specified text is present and is skipped.
- protected boolean skipMarkup(boolean balance): Skips markup.
- protected int skipNewlines(): Skips newlines and returns the number of newlines skipped.
- protected boolean skipSpaces(): Skips whitespace.
- protected org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
-
-
-
Field Detail
-
HTML_4_01_STRICT_PUBID
public static final java.lang.String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_STRICT_SYSID
public static final java.lang.String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_PUBID
public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_SYSID
public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_PUBID
public static final java.lang.String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_SYSID
public static final java.lang.String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- See Also:
- Constant Field Values
-
AUGMENTATIONS
protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.
- See Also:
- Constant Field Values
-
REPORT_ERRORS
protected static final java.lang.String REPORT_ERRORS
Report errors.
- See Also:
- Constant Field Values
-
NOTIFY_CHAR_REFS
public static final java.lang.String NOTIFY_CHAR_REFS
Notify character entity references (e.g. &#32;, &#x20;, etc).
- See Also:
- Constant Field Values
-
NOTIFY_XML_BUILTIN_REFS
public static final java.lang.String NOTIFY_XML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &amp;, &lt;, etc).
Note: This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature.
To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.
- See Also:
- Constant Field Values
-
NOTIFY_HTML_BUILTIN_REFS
public static final java.lang.String NOTIFY_HTML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &nobr;, &copy;, etc).
Note: This includes the five pre-defined XML general entities.
- See Also:
- Constant Field Values
-
FIX_MSWINDOWS_REFS
public static final java.lang.String FIX_MSWINDOWS_REFS
Fix Microsoft Windows® character entity references.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_COMMENT_DELIMS
public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- See Also:
- Constant Field Values
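To make the intended effect of stripping comment delimiters concrete, here is a minimal sketch that removes a leading "<!--" and a trailing "-->" from a block of script text. The `CommentDelims` class and the whitespace trimming are assumptions made for this example; the scanner's own handling inside SCRIPT content may differ in detail.

```java
/** Hypothetical helper for this example; not part of NekoHTML. */
final class CommentDelims {

    /** Removes one leading "<!--" and one trailing "-->" if present. */
    static String strip(String scriptContent) {
        String s = scriptContent.trim();
        if (s.startsWith("<!--")) {
            s = s.substring("<!--".length());
        }
        if (s.endsWith("-->")) {
            s = s.substring(0, s.length() - "-->".length());
        }
        // Trimming the result is a choice made for this sketch only.
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<!-- var x = 1; -->"));
        System.out.println(strip("var y = 2;"));
    }
}
```

Content without the delimiters passes through unchanged apart from the trimming.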
-
SCRIPT_STRIP_CDATA_DELIMS
public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_COMMENT_DELIMS
public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_CDATA_DELIMS
public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- See Also:
- Constant Field Values
-
IGNORE_SPECIFIED_CHARSET
public static final java.lang.String IGNORE_SPECIFIED_CHARSET
Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
- See Also:
- Constant Field Values
-
CDATA_SECTIONS
public static final java.lang.String CDATA_SECTIONS
Scan CDATA sections.
- See Also:
- Constant Field Values
-
OVERRIDE_DOCTYPE
public static final java.lang.String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.
- See Also:
- Constant Field Values
-
INSERT_DOCTYPE
public static final java.lang.String INSERT_DOCTYPE
Insert document type declaration.
- See Also:
- Constant Field Values
-
PARSE_NOSCRIPT_CONTENT
public static final java.lang.String PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_IFRAME
public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
Allows the self-closing <iframe/> tag.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_TAGS
public static final java.lang.String ALLOW_SELFCLOSING_TAGS
Allows self-closing tags, e.g. <div/> (XHTML).
- See Also:
- Constant Field Values
-
NORMALIZE_ATTRIBUTES
protected static final java.lang.String NORMALIZE_ATTRIBUTES
Normalize attribute values.
- See Also:
- Constant Field Values
-
NAMES_ELEMS
protected static final java.lang.String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
-
NAMES_ATTRS
protected static final java.lang.String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
-
DEFAULT_ENCODING
protected static final java.lang.String DEFAULT_ENCODING
Default encoding.
- See Also:
- Constant Field Values
-
ERROR_REPORTER
protected static final java.lang.String ERROR_REPORTER
Error reporter.
- See Also:
- Constant Field Values
-
DOCTYPE_PUBID
protected static final java.lang.String DOCTYPE_PUBID
Doctype declaration public identifier.
- See Also:
- Constant Field Values
-
DOCTYPE_SYSID
protected static final java.lang.String DOCTYPE_SYSID
Doctype declaration system identifier.
- See Also:
- Constant Field Values
-
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.
- See Also:
- Constant Field Values
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.
- See Also:
- Constant Field Values
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.
- See Also:
- Constant Field Values
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.
- See Also:
- Constant Field Values
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.
- See Also:
- Constant Field Values
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.
- See Also:
- Constant Field Values
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.
- See Also:
- Constant Field Values
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
Default buffer size.
- See Also:
- Constant Field Values
-
DEBUG_CALLBACKS
protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.
- See Also:
- Constant Field Values
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fReportErrors
protected boolean fReportErrors
Report errors.
-
fNotifyCharRefs
protected boolean fNotifyCharRefs
Notify character entity references.
-
fNotifyXmlBuiltinRefs
protected boolean fNotifyXmlBuiltinRefs
Notify XML built-in general entity references.
-
fNotifyHtmlBuiltinRefs
protected boolean fNotifyHtmlBuiltinRefs
Notify HTML built-in general entity references.
-
fFixWindowsCharRefs
protected boolean fFixWindowsCharRefs
Fix Microsoft Windows® character entity references.
-
fScriptStripCDATADelims
protected boolean fScriptStripCDATADelims
Strip CDATA delimiters from SCRIPT tags.
-
fScriptStripCommentDelims
protected boolean fScriptStripCommentDelims
Strip comment delimiters from SCRIPT tags.
-
fStyleStripCDATADelims
protected boolean fStyleStripCDATADelims
Strip CDATA delimiters from STYLE tags.
-
fStyleStripCommentDelims
protected boolean fStyleStripCommentDelims
Strip comment delimiters from STYLE tags.
-
fIgnoreSpecifiedCharset
protected boolean fIgnoreSpecifiedCharset
Ignore specified character set.
-
fCDATASections
protected boolean fCDATASections
CDATA sections.
-
fOverrideDoctype
protected boolean fOverrideDoctype
Override doctype declaration public and system identifiers.
-
fInsertDoctype
protected boolean fInsertDoctype
Insert document type declaration.
-
fNormalizeAttributes
protected boolean fNormalizeAttributes
Normalize attribute values.
-
fParseNoScriptContent
protected boolean fParseNoScriptContent
Parse noscript content.
-
fParseNoFramesContent
protected boolean fParseNoFramesContent
Parse noframes content.
-
fAllowSelfclosingIframe
protected boolean fAllowSelfclosingIframe
Allows self closing iframe tags.
-
fAllowSelfclosingTags
protected boolean fAllowSelfclosingTags
Allows self closing tags.
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names.
-
fDefaultIANAEncoding
protected java.lang.String fDefaultIANAEncoding
Default encoding.
-
fErrorReporter
protected HTMLErrorReporter fErrorReporter
Error reporter.
-
fDoctypePubid
protected java.lang.String fDoctypePubid
Doctype declaration public identifier.
-
fDoctypeSysid
protected java.lang.String fDoctypeSysid
Doctype declaration system identifier.
-
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number.
-
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number.
-
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file.
-
fEndLineNumber
protected int fEndLineNumber
Ending line number.
-
fEndColumnNumber
protected int fEndColumnNumber
Ending column number.
-
fEndCharacterOffset
protected int fEndCharacterOffset
Ending character offset in the file.
-
fByteStream
protected HTMLScanner.PlaybackInputStream fByteStream
The playback byte stream.
-
fCurrentEntity
protected HTMLScanner.CurrentEntity fCurrentEntity
Current entity.
-
fCurrentEntityStack
protected final java.util.Stack fCurrentEntityStack
The current entity stack.
-
fScanner
protected HTMLScanner.Scanner fScanner
The current scanner.
-
fScannerState
protected short fScannerState
The current scanner state.
-
fDocumentHandler
protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
The document handler.
-
fIANAEncoding
protected java.lang.String fIANAEncoding
Auto-detected IANA encoding.
-
fJavaEncoding
protected java.lang.String fJavaEncoding
Auto-detected Java encoding.
-
fIso8859Encoding
protected boolean fIso8859Encoding
True if the encoding matches "ISO-8859-*".
-
fElementCount
protected int fElementCount
Element count.
-
fElementDepth
protected int fElementDepth
Element depth.
-
fContentScanner
protected HTMLScanner.Scanner fContentScanner
Content scanner.
-
fSpecialScanner
protected HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
-
fStringBuffer
protected final org.apache.xerces.util.XMLStringBuffer fStringBuffer
String buffer.
-
-
Method Detail
-
pushInputSource
public void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters:
inputSource - The new input source to start scanning.
- See Also:
evaluateInputSource(XMLInputSource)
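The push-and-resume behavior described for pushInputSource can be pictured with a small stack of character sources. The sketch below is only a model: `StackedInput` and its string-based entities are assumptions made for this example and stand in for the scanner's entity stack and `XMLInputSource` objects.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a stack of input entities; not part of NekoHTML. */
final class StackedInput {

    /** One entity: its text and the position of the next unread character. */
    private static final class Entity {
        final String text;
        int pos;

        Entity(String text) {
            this.text = text;
        }
    }

    private final Deque<Entity> stack = new ArrayDeque<>();

    StackedInput(String document) {
        stack.push(new Entity(document));
    }

    /** Pushes new content; it is read before the rest of the current entity. */
    void push(String content) {
        stack.push(new Entity(content));
    }

    /** Returns the next character, or -1 once every entity is exhausted. */
    int read() {
        while (!stack.isEmpty()) {
            Entity top = stack.peek();
            if (top.pos < top.text.length()) {
                return top.text.charAt(top.pos++);
            }
            stack.pop(); // entity finished: resume the one beneath it
        }
        return -1;
    }

    public static void main(String[] args) {
        StackedInput input = new StackedInput("ab");
        System.out.print((char) input.read()); // reads from the document
        input.push("XY");                      // e.g. output written by a script
        for (int c = input.read(); c != -1; c = input.read()) {
            System.out.print((char) c);
        }
        System.out.println();
    }
}
```

Because each entity keeps its own read position, finishing the pushed content resumes the outer one exactly where it stopped.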
-
evaluateInputSource
public void evaluateInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters:
inputSource - The new input source to start evaluating.
- See Also:
pushInputSource(XMLInputSource)
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters:
closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
-
getEncoding
public java.lang.String getEncoding()
Returns the encoding.
- Specified by:
getEncoding in interface org.apache.xerces.xni.XMLLocator
-
getPublicId
public java.lang.String getPublicId()
Returns the public identifier.
- Specified by:
getPublicId in interface org.apache.xerces.xni.XMLLocator
-
getBaseSystemId
public java.lang.String getBaseSystemId()
Returns the base system identifier.
- Specified by:
getBaseSystemId in interface org.apache.xerces.xni.XMLLocator
-
getLiteralSystemId
public java.lang.String getLiteralSystemId()
Returns the literal system identifier.
- Specified by:
getLiteralSystemId in interface org.apache.xerces.xni.XMLLocator
-
getExpandedSystemId
public java.lang.String getExpandedSystemId()
Returns the expanded system identifier.
- Specified by:
getExpandedSystemId in interface org.apache.xerces.xni.XMLLocator
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by:
getLineNumber in interface org.apache.xerces.xni.XMLLocator
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by:
getColumnNumber in interface org.apache.xerces.xni.XMLLocator
-
getXMLVersion
public java.lang.String getXMLVersion()
Returns the XML version.
- Specified by:
getXMLVersion in interface org.apache.xerces.xni.XMLLocator
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by:
getCharacterOffset in interface org.apache.xerces.xni.XMLLocator
-
getFeatureDefault
public java.lang.Boolean getFeatureDefault(java.lang.String featureId)
Returns the default state for a feature.
- Specified by:
getFeatureDefault in interface HTMLComponent
- Specified by:
getFeatureDefault in interface org.apache.xerces.xni.parser.XMLComponent
-
getPropertyDefault
public java.lang.Object getPropertyDefault(java.lang.String propertyId)
Returns the default state for a property.
- Specified by:
getPropertyDefault in interface HTMLComponent
- Specified by:
getPropertyDefault in interface org.apache.xerces.xni.parser.XMLComponent
-
getRecognizedFeatures
public java.lang.String[] getRecognizedFeatures()
Returns recognized features.
- Specified by:
getRecognizedFeatures in interface org.apache.xerces.xni.parser.XMLComponent
-
getRecognizedProperties
public java.lang.String[] getRecognizedProperties()
Returns recognized properties.
- Specified by:
getRecognizedProperties in interface org.apache.xerces.xni.parser.XMLComponent
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
Resets the component.
- Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
- Throws:
org.apache.xerces.xni.parser.XMLConfigurationException
-
setFeature
public void setFeature(java.lang.String featureId, boolean state)
Sets a feature.
- Specified by:
setFeature in interface org.apache.xerces.xni.parser.XMLComponent
-
setProperty
public void setProperty(java.lang.String propertyId, java.lang.Object value) throws org.apache.xerces.xni.parser.XMLConfigurationException
Sets a property.
- Specified by:
setProperty in interface org.apache.xerces.xni.parser.XMLComponent
- Throws:
org.apache.xerces.xni.parser.XMLConfigurationException
-
setInputSource
public void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source) throws java.io.IOException
Sets the input source.
- Specified by:
setInputSource in interface org.apache.xerces.xni.parser.XMLDocumentScanner
- Throws:
java.io.IOException
-
scanDocument
public boolean scanDocument(boolean complete) throws org.apache.xerces.xni.XNIException, java.io.IOException
Scans the document.
- Specified by:
scanDocument in interface org.apache.xerces.xni.parser.XMLDocumentScanner
- Throws:
org.apache.xerces.xni.XNIException
java.io.IOException
-
setDocumentHandler
public void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler)
Sets the document handler.
- Specified by:
setDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
-
getDocumentHandler
public org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
Returns the document handler.
- Specified by:
getDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
-
getValue
protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String aname)
Returns the value of the specified attribute, ignoring case.
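A case-insensitive attribute lookup of the kind getValue describes might look like the sketch below. It is an illustration only: the `AttributeLookup` class, the use of name/value string pairs instead of `XMLAttributes`, and the `null` result for a missing attribute are all assumptions made for this example.

```java
/** Hypothetical helper for this example; not part of NekoHTML. */
final class AttributeLookup {

    /**
     * Returns the value of the first attribute whose name matches aname,
     * ignoring case. Each row of attrs is a {name, value} pair.
     * Returning null for "not found" is an assumption of this sketch.
     */
    static String getValue(String[][] attrs, String aname) {
        for (String[] attr : attrs) {
            if (attr[0].equalsIgnoreCase(aname)) {
                return attr[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[][] attrs = {
            {"HTTP-EQUIV", "Content-Type"},
            {"content", "text/html"},
        };
        System.out.println(getValue(attrs, "http-equiv"));
        System.out.println(getValue(attrs, "CONTENT"));
    }
}
```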
-
expandSystemId
public static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
- Parameters:
systemId - The systemId to be expanded.
- Returns:
Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
-
fixURI
protected static java.lang.String fixURI(java.lang.String str)
Fixes a platform dependent filename to standard URI form.
- Parameters:
str - The string to fix.
- Returns:
Returns the fixed URI string.
-
modifyName
protected static final java.lang.String modifyName(java.lang.String name, short mode)
Modifies the given name based on the specified mode.
-
getNamesValue
protected static final short getNamesValue(java.lang.String value)
Converts HTML names string value to constant value.
- See Also:
NAMES_NO_CHANGE, NAMES_LOWERCASE, NAMES_UPPERCASE
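Taken together, getNamesValue and modifyName suggest a two-step flow: turn a property string ("upper", "lower", "default") into a mode, then apply that mode to each name. The sketch below models this with an enum. The `NameCase` class, the enum in place of the `short` constants, and the treatment of unrecognized strings as "no change" are assumptions made for this example.

```java
import java.util.Locale;

/** Hypothetical model of name-case handling; not part of NekoHTML. */
final class NameCase {

    /** Stands in for NAMES_NO_CHANGE, NAMES_UPPERCASE and NAMES_LOWERCASE. */
    enum Mode { NO_CHANGE, UPPERCASE, LOWERCASE }

    /** Maps a property string to a mode. */
    static Mode fromValue(String value) {
        if ("upper".equals(value)) {
            return Mode.UPPERCASE;
        }
        if ("lower".equals(value)) {
            return Mode.LOWERCASE;
        }
        // "default" and, by assumption, anything unrecognized.
        return Mode.NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String modify(String name, Mode mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modify("Div", fromValue("upper")));
        System.out.println(modify("Div", fromValue("lower")));
        System.out.println(modify("Div", fromValue("default")));
    }
}
```

`Locale.ROOT` is used so the conversion does not depend on the platform's default locale; which locale the library itself uses is not stated here.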
-
fixWindowsCharacter
protected int fixWindowsCharacter(int origChar)
Fixes Microsoft Windows® specific characters.
Details about this common problem can be found at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
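The problem arises because Windows-1252 assigns printable characters to the range 0x80 to 0x9F, which Unicode reserves for control codes, so a reference such as &#147; is usually meant as a curly quotation mark. The sketch below remaps a handful of these code points using the standard Windows-1252 assignments. The `WindowsChars` class is an assumption made for this example, and the table is deliberately partial; which code points the scanner itself remaps is not stated here.

```java
/** Hypothetical helper for this example; not part of NekoHTML. */
final class WindowsChars {

    /** Maps selected Windows-1252 code points to their Unicode equivalents. */
    static int fix(int origChar) {
        switch (origChar) {
            case 0x80: return 0x20AC; // euro sign
            case 0x85: return 0x2026; // horizontal ellipsis
            case 0x91: return 0x2018; // left single quotation mark
            case 0x92: return 0x2019; // right single quotation mark
            case 0x93: return 0x201C; // left double quotation mark
            case 0x94: return 0x201D; // right double quotation mark
            case 0x96: return 0x2013; // en dash
            case 0x97: return 0x2014; // em dash
            case 0x99: return 0x2122; // trade mark sign
            default:   return origChar; // everything else is left unchanged
        }
    }

    public static void main(String[] args) {
        System.out.printf("0x93 -> U+%04X%n", fix(0x93));
        System.out.printf("0x41 -> U+%04X%n", fix(0x41));
    }
}
```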
-
read
protected int read() throws java.io.IOException
Reads a single character.
- Throws:
java.io.IOException
-
setScanner
protected void setScanner(HTMLScanner.Scanner scanner)
Sets the scanner.
-
setScannerState
protected void setScannerState(short state)
Sets the scanner state.
-
scanDoctype
protected void scanDoctype() throws java.io.IOException
Scans a DOCTYPE line.
- Throws:
java.io.IOException
-
scanLiteral
protected java.lang.String scanLiteral() throws java.io.IOException
Scans a quoted literal.
- Throws:
java.io.IOException
-
scanName
protected java.lang.String scanName(boolean strict) throws java.io.IOException
Scans a name.
- Throws:
java.io.IOException
-
scanEntityRef
protected int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str, boolean content) throws java.io.IOException
Scans an entity reference.
- Throws:
java.io.IOException
-
skip
protected boolean skip(java.lang.String s, boolean caseSensitive) throws java.io.IOException
Returns true if the specified text is present and is skipped.
- Throws:
java.io.IOException
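The contract of skip, consuming the text only when it is present at the current position, can be sketched over an in-memory string. `TextCursor` and its `position()` accessor are assumptions made for this example; the real method reads from the scanner's input buffer and can throw `IOException`, which this string-based model does not.

```java
/** Hypothetical string-based cursor for this example; not part of NekoHTML. */
final class TextCursor {
    private final String text;
    private int pos;

    TextCursor(String text) {
        this.text = text;
    }

    /**
     * If s occurs at the current position, advances past it and returns true;
     * otherwise leaves the position untouched and returns false.
     */
    boolean skip(String s, boolean caseSensitive) {
        // regionMatches takes an ignoreCase flag, hence the negation.
        if (text.regionMatches(!caseSensitive, pos, s, 0, s.length())) {
            pos += s.length();
            return true;
        }
        return false;
    }

    int position() {
        return pos;
    }

    public static void main(String[] args) {
        TextCursor cursor = new TextCursor("<!DOCTYPE html>");
        System.out.println(cursor.skip("<!doctype", true));
        System.out.println(cursor.skip("<!doctype", false));
        System.out.println(cursor.position());
    }
}
```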
-
skipMarkup
protected boolean skipMarkup(boolean balance) throws java.io.IOException
Skips markup.
- Throws:
java.io.IOException
-
skipSpaces
protected boolean skipSpaces() throws java.io.IOException
Skips whitespace.
- Throws:
java.io.IOException
-
skipNewlines
protected int skipNewlines() throws java.io.IOException
Skips newlines and returns the number of newlines skipped.
- Throws:
java.io.IOException
-
locationAugs
protected final org.apache.xerces.xni.Augmentations locationAugs()
Returns an augmentations object with a location item added.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
resourceId
protected final org.apache.xerces.xni.XMLResourceIdentifier resourceId()
Returns an empty resource identifier.
-
builtinXmlRef
protected static boolean builtinXmlRef(java.lang.String name)
Returns true if the name is a built-in XML general entity reference.
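The NOTIFY_XML_BUILTIN_REFS description names the five pre-defined XML general entities as "amp", "lt", "gt", "quot", and "apos". A check along those lines could be as small as the sketch below; the `XmlBuiltins` class and the case-sensitive comparison are assumptions made for this example.

```java
/** Hypothetical helper for this example; not part of NekoHTML. */
final class XmlBuiltins {

    /** Returns true for the five pre-defined XML general entity names. */
    static boolean isBuiltinXmlRef(String name) {
        // XML names are case-sensitive, so the comparison here is too.
        return "amp".equals(name)
            || "lt".equals(name)
            || "gt".equals(name)
            || "quot".equals(name)
            || "apos".equals(name);
    }

    public static void main(String[] args) {
        System.out.println(isBuiltinXmlRef("amp"));
        System.out.println(isBuiltinXmlRef("nbsp"));
    }
}
```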
-
readPreservingBufferContent
protected int readPreservingBufferContent() throws java.io.IOException
Reads a single character, preserving the old buffer content.
- Throws:
java.io.IOException
-
-