Package org.htmlunit.cyberneko
Class HTMLScanner
java.lang.Object
org.htmlunit.cyberneko.HTMLScanner
- All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentScanner, XMLDocumentSource, XMLLocator
A simple HTML scanner. This scanner makes no attempt to balance tags or fix
other problems in the source document — it just scans what it can and
generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
- http://cyberneko.org/html/features/scanner/normalize-attrs
- http://cyberneko.org/html/features/scanner/plain-attr-values
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/encoding-translator
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
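The identifiers listed above are plain strings that a caller passes to setFeature and setProperty. The following is a minimal sketch of that pattern, not the library's implementation: FeatureSketch is a hypothetical stand-in class, its initial states are placeholders rather than the documented defaults, and rejecting unknown identifiers with IllegalArgumentException is an assumption made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a component configured through feature and property identifiers. */
final class FeatureSketch {
    // Identifiers copied from the lists above.
    static final String REPORT_ERRORS = "http://cyberneko.org/html/features/report-errors";
    static final String CDATA_SECTIONS = "http://cyberneko.org/html/features/scanner/cdata-sections";
    static final String NAMES_ELEMS = "http://cyberneko.org/html/properties/names/elems";

    private final Map<String, Boolean> features = new LinkedHashMap<>();
    private final Map<String, Object> properties = new LinkedHashMap<>();

    FeatureSketch() {
        // Placeholder initial states; these are NOT the library's documented defaults.
        features.put(REPORT_ERRORS, Boolean.FALSE);
        features.put(CDATA_SECTIONS, Boolean.FALSE);
        properties.put(NAMES_ELEMS, "default");
    }

    /** Sets a recognized feature; unknown identifiers are rejected (an assumption of this sketch). */
    void setFeature(String featureId, boolean state) {
        if (!features.containsKey(featureId)) {
            throw new IllegalArgumentException("Unrecognized feature: " + featureId);
        }
        features.put(featureId, state);
    }

    boolean getFeature(String featureId) {
        Boolean state = features.get(featureId);
        if (state == null) {
            throw new IllegalArgumentException("Unrecognized feature: " + featureId);
        }
        return state;
    }

    /** Sets a recognized property; unknown identifiers are rejected (an assumption of this sketch). */
    void setProperty(String propertyId, Object value) {
        if (!properties.containsKey(propertyId)) {
            throw new IllegalArgumentException("Unrecognized property: " + propertyId);
        }
        properties.put(propertyId, value);
    }

    Object getProperty(String propertyId) {
        return properties.get(propertyId);
    }

    public static void main(String[] args) {
        FeatureSketch component = new FeatureSketch();
        component.setFeature(REPORT_ERRORS, true);
        component.setProperty(NAMES_ELEMS, "lower");
        System.out.println(component.getFeature(REPORT_ERRORS) + " " + component.getProperty(NAMES_ELEMS));
    }
}
```

With the real scanner, the same identifier strings would be passed to the setFeature and setProperty methods documented further down this page.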
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
class | | The primary HTML document scanner.
private static final class | HTMLScanner.CurrentEntity | Current entity.
(package private) static final class | HTMLScanner.LocationItem | Location infoset item.
class | | Special scanner used for PLAINTEXT.
static interface | HTMLScanner.Scanner | Basic scanner interface.
private static enum | |
class | HTMLScanner.SpecialScanner | Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields
Modifier and Type | Field | Description
static final String | ALLOW_SELFCLOSING_IFRAME | Allows self-closing <iframe/> tag.
static final String | ALLOW_SELFCLOSING_TAGS | Allows self-closing tags, e.g. <div/> (XHTML).
static final String | AUGMENTATIONS | Include infoset augmentations.
static final String | CDATA_SECTIONS | Scan CDATA sections.
private static final boolean | DEBUG_BUFFER | Set to true to debug the buffer.
protected static final boolean | DEBUG_CALLBACKS | Set to true to debug callbacks.
private static final boolean | DEBUG_CHARSET | Set to true to debug character encoding handling.
private static final boolean | DEBUG_SCANNER | Set to true to debug changes in the scanner.
private static final boolean | DEBUG_SCANNER_STATE | Set to true to debug changes in the scanner state.
protected static final int | DEFAULT_BUFFER_SIZE |
static final String | DEFAULT_ENCODING | Default encoding.
static final String | DOCTYPE_PUBID | Doctype declaration public identifier.
static final String | DOCTYPE_SYSID | Doctype declaration system identifier.
static final String | ENCODING_TRANSLATOR | Encoding translator.
static final String | ERROR_REPORTER | Error reporter.
(package private) boolean | fAllowSelfclosingIframe_ | Allows self-closing iframe tags.
(package private) boolean | fAllowSelfclosingTags_ | Allows self-closing tags.
private boolean | fAugmentations_ | Augmentations.
protected int | fBeginCharacterOffset | Beginning character offset in the file.
protected int | fBeginColumnNumber | Beginning column number.
protected int | fBeginLineNumber | Beginning line number.
protected PlaybackInputStream | fByteStream | The playback byte stream.
(package private) boolean | fCDATASections_ | CDATA sections.
protected HTMLScanner.Scanner | fContentScanner | Content scanner.
(package private) HTMLScanner.CurrentEntity | fCurrentEntity | Current entity.
protected final MiniStack<HTMLScanner.CurrentEntity> | fCurrentEntityStack | The current entity stack.
protected String | fDefaultIANAEncoding | Default encoding.
protected String | fDoctypePubid | Doctype declaration public identifier.
protected String | fDoctypeSysid | Doctype declaration system identifier.
protected XMLDocumentHandler | fDocumentHandler | The document handler.
protected int | fElementCount | Element count.
protected int | fElementDepth | Element depth.
protected EncodingTranslator | fEncodingTranslator | Encoding translator.
protected int | fEndCharacterOffset | Ending character offset in the file.
protected int | fEndColumnNumber | Ending column number.
protected int | fEndLineNumber | Ending line number.
protected HTMLErrorReporter | fErrorReporter | Error reporter.
protected String | fIANAEncoding | Auto-detected IANA encoding.
(package private) boolean | fIgnoreSpecifiedCharset_ | Ignore specified character set.
(package private) boolean | fInsertDoctype_ | Insert document type declaration.
protected String | fJavaEncoding | Auto-detected Java encoding.
private final HTMLScanner.LocationItem | fLocationItem | Our location item, reused because Augmentations says so, which saves memory.
protected short | fNamesAttrs | Modify HTML attribute names.
protected short | fNamesElems | Modify HTML element names.
(package private) boolean | fNormalizeAttributes_ | Normalize attribute values.
private boolean | fOverrideDoctype_ | Override doctype declaration public and system identifiers.
(package private) boolean | fParseNoScriptContent_ | Parse noscript content.
(package private) boolean | fPlainAttributeValues_ | Store the plain attribute values also.
(package private) boolean | fReportErrors_ | Report errors.
(package private) final XMLString | fScanComment |
private final XMLString | fScanLiteral |
protected HTMLScanner.Scanner | fScanner | The current scanner.
protected short | fScannerState | The current scanner state.
(package private) final XMLString | fScanScriptContent | String buffer, larger because script areas are larger.
(package private) final XMLString | fScanUntilEndTag |
(package private) boolean | fScriptStripCDATADelims_ | Strip CDATA delimiters from SCRIPT tags.
(package private) boolean | fScriptStripCommentDelims_ | Strip comment delimiters from SCRIPT tags.
(package private) final boolean[] | fSingleBoolean | Single boolean array.
protected final HTMLScanner.SpecialScanner | fSpecialScanner | Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
protected final XMLString | fStringBuffer | String buffer.
(package private) final XMLString | fStringBufferEntiyRef | String buffer used when resolving entity refs.
(package private) final XMLString | fStringBufferPlainAttribValue |
(package private) boolean | fStyleStripCDATADelims_ | Strip CDATA delimiters from STYLE tags.
(package private) boolean | fStyleStripCommentDelims_ | Strip comment delimiters from STYLE tags.
static final String | HTML_4_01_FRAMESET_PUBID | HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
static final String | HTML_4_01_FRAMESET_SYSID | HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
static final String | HTML_4_01_STRICT_PUBID | HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
static final String | HTML_4_01_STRICT_SYSID | HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
static final String | HTML_4_01_TRANSITIONAL_PUBID | HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
static final String | HTML_4_01_TRANSITIONAL_SYSID | HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
(package private) final HTMLConfiguration | htmlConfiguration_ |
static final String | IGNORE_SPECIFIED_CHARSET | Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
static final String | INSERT_DOCTYPE | Insert document type declaration.
static final String | NAMES_ATTRS | Modify HTML attribute names: { "upper", "lower", "default" }.
static final String | NAMES_ELEMS | Modify HTML element names: { "upper", "lower", "default" }.
protected static final short | NAMES_LOWERCASE | Lowercase HTML names.
protected static final short | NAMES_NO_CHANGE | Don't modify HTML names.
protected static final short | NAMES_UPPERCASE | Uppercase HTML names.
static final String | NORMALIZE_ATTRIBUTES | Normalize attribute values.
static final String | OVERRIDE_DOCTYPE | Override doctype declaration public and system identifiers.
static final String | PARSE_NOSCRIPT_CONTENT | Parse <noscript>...</noscript> content.
static final String | PLAIN_ATTRIBUTE_VALUES | Store the plain attribute values also.
private static final String[] | RECOGNIZED_FEATURES | Recognized features.
private static final Boolean[] | RECOGNIZED_FEATURES_DEFAULTS | Recognized features defaults.
private static final String[] | RECOGNIZED_PROPERTIES | Recognized properties.
private static final Object[] | RECOGNIZED_PROPERTIES_DEFAULTS | Recognized properties defaults.
static final String | REPORT_ERRORS | Report errors.
static final String | SCRIPT_STRIP_CDATA_DELIMS | Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
static final String | SCRIPT_STRIP_COMMENT_DELIMS | Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
protected static final short | STATE_CONTENT | State: content.
protected static final short | STATE_END_DOCUMENT | State: end document.
protected static final short | STATE_MARKUP_BRACKET | State: markup bracket.
protected static final short | STATE_START_DOCUMENT | State: start document.
static final String | STYLE_STRIP_CDATA_DELIMS | Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
static final String | STYLE_STRIP_COMMENT_DELIMS | Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
protected static final HTMLEventInfo | SYNTHESIZED_ITEM | Synthesized event info item.
-
Constructor Summary
Constructors
Constructor | Description
HTMLScanner(HTMLConfiguration htmlConfiguration) | Creates a new HTMLScanner with the given configuration.
-
Method Summary
Modifier and Type | Method | Description
protected static boolean | builtinXmlRef(String name) |
private static boolean | canRoundtrip(String encodeCharset, String decodeCharset) |
void | cleanup(boolean closeall) | Cleans up used resources.
void | evaluateInputSource(XMLInputSource inputSource) | Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
static String | expandSystemId(String systemId, String baseSystemId) | Expands a system id and returns the system id as a URI, if it can be expanded.
protected static String | fixURI | Fixes a platform dependent filename to standard URI form.
 | getBaseSystemId | Returns the base system identifier.
int | getCharacterOffset() | Returns the character offset.
int | getColumnNumber() | Returns the current column number.
 | getDocumentHandler | Returns the document handler.
 | getEncoding | Returns the encoding.
 | getExpandedSystemId | Returns the expanded system identifier.
 | getFeatureDefault(String featureId) | Returns the default state for a feature.
int | getLineNumber() | Returns the current line number.
 | getLiteralSystemId | Returns the literal system identifier.
protected static short | getNamesValue(String value) |
 | getPropertyDefault(String propertyId) | Returns the default state for a property.
 | getPublicId | Returns the public identifier.
private Reader | getReader(XMLInputSource inputSource) |
String[] | getRecognizedFeatures | Returns recognized features.
String[] | getRecognizedProperties | Returns recognized properties.
protected static String | getValue(XMLAttributes attrs, String aname) |
 | getXMLVersion | Returns the XML version.
(package private) static boolean | isEncodingCompatible(String encoding1, String encoding2) | To detect whether two encodings are compatible, both must be able to read the meta tag specifying the new encoding.
protected final Augmentations | locationAugs |
protected static String | modifyName(String name, short mode) |
void | pushInputSource(XMLInputSource inputSource) | Pushes an input source onto the current entity stack.
protected int | readPreservingBufferContent |
void | reset(XMLComponentManager manager) | Resets the component.
private int | returnEntityRefString(XMLString str, boolean content) |
protected void | scanDoctype |
boolean | scanDocument(boolean complete) | Scans the document.
protected int | scanEntityRef(XMLString str, XMLString plainValue, boolean content) |
protected String | scanLiteral |
protected String | scanName(boolean strict) |
protected String | scanTagName |
void | setDocumentHandler(XMLDocumentHandler handler) | Sets the document handler.
void | setFeature(String featureId, boolean state) | Sets a feature.
void | setInputSource(XMLInputSource source) | Sets the input source.
void | setProperty(String propertyId, Object value) | Sets a property.
protected void | setScanner(HTMLScanner.Scanner scanner) |
protected void | setScannerState(short state) |
protected boolean | skip |
protected boolean | skipMarkup(boolean balance) |
protected int | skipNewlines |
protected boolean | skipSpaces |
protected final Augmentations | synthesizedAugs |
-
Field Details
-
HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
-
HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
-
HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
-
HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
-
HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
-
HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
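Each public identifier above pairs with a system identifier to form a document type declaration. As a rough illustration of how such a pair is written out in markup, here is a small sketch. DoctypeSketch and its doctype helper are hypothetical names introduced for this example and are not part of HTMLScanner; only the identifier strings are taken from this page.

```java
/** Hypothetical helper that formats a doctype declaration from a public/system identifier pair. */
final class DoctypeSketch {
    // Values copied from the constants documented above.
    static final String HTML_4_01_STRICT_PUBID = "-//W3C//DTD HTML 4.01//EN";
    static final String HTML_4_01_STRICT_SYSID = "http://www.w3.org/TR/html4/strict.dtd";

    /** Builds "<!DOCTYPE root PUBLIC "pubid" "sysid">", or the SYSTEM form when pubid is null. */
    static String doctype(String root, String pubid, String sysid) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(root);
        if (pubid != null) {
            sb.append(" PUBLIC \"").append(pubid).append('"');
            if (sysid != null) {
                sb.append(" \"").append(sysid).append('"');
            }
        } else if (sysid != null) {
            sb.append(" SYSTEM \"").append(sysid).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(doctype("HTML", HTML_4_01_STRICT_PUBID, HTML_4_01_STRICT_SYSID));
    }
}
```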
-
AUGMENTATIONS
Include infoset augmentations.
-
REPORT_ERRORS
Report errors.
-
SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
-
SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
-
STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
-
STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
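To make the effect of the four strip-delimiter features concrete, the sketch below removes a wrapping comment or CDATA delimiter pair from a string. It is an illustration only: StripDelimsSketch is a hypothetical class, and the real scanner works on its input buffers while scanning, so details such as whitespace handling may differ from what is assumed here.

```java
/** Hypothetical illustration of stripping wrapping delimiters from SCRIPT or STYLE content. */
final class StripDelimsSketch {

    /** Removes a leading "<!--" and trailing "-->" if both wrap the trimmed content. */
    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    /** Removes a leading "<![CDATA[" and trailing "]]>" if both wrap the trimmed content. */
    static String stripCdataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    private static String strip(String content, String open, String close) {
        String trimmed = content.trim();
        // The length check prevents the open and close delimiters from overlapping.
        if (trimmed.length() >= open.length() + close.length()
                && trimmed.startsWith(open) && trimmed.endsWith(close)) {
            return trimmed.substring(open.length(), trimmed.length() - close.length());
        }
        return content; // not wrapped: leave the content untouched
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!-- alert(1); -->"));
        System.out.println(stripCdataDelims("<![CDATA[a < b]]>"));
    }
}
```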
-
IGNORE_SPECIFIED_CHARSET
Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'> processing instruction.
-
CDATA_SECTIONS
Scan CDATA sections.
-
OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.
-
INSERT_DOCTYPE
Insert document type declaration.
-
PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.
-
ALLOW_SELFCLOSING_IFRAME
Allows self-closing <iframe/> tag.
-
ALLOW_SELFCLOSING_TAGS
Allows self-closing tags, e.g. <div/> (XHTML).
-
NORMALIZE_ATTRIBUTES
Normalize attribute values.
-
PLAIN_ATTRIBUTE_VALUES
Store the plain attribute values also.
-
RECOGNIZED_FEATURES
Recognized features. -
RECOGNIZED_FEATURES_DEFAULTS
Recognized features defaults. -
NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
-
NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
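The property values "upper", "lower" and "default" correspond to the NAMES_UPPERCASE, NAMES_LOWERCASE and NAMES_NO_CHANGE modes documented below. A minimal sketch of how such a mapping could work follows. NameModeSketch is a hypothetical class; the method names mirror getNamesValue and modifyName from this page, but the bodies and the numeric constant values are assumptions, not the library's code.

```java
import java.util.Locale;

/** Hypothetical sketch of mapping a names property value to a mode and applying it. */
final class NameModeSketch {
    // Numeric values are placeholders; the real constant values are not given on this page.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps "upper" / "lower" to a mode; anything else (including "default") means no change. */
    static short getNamesValue(String value) {
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        return NAMES_NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", getNamesValue("upper")));
        System.out.println(modifyName("Div", getNamesValue("lower")));
        System.out.println(modifyName("Div", getNamesValue("default")));
    }
}
```

Locale.ROOT is used so that case conversion does not depend on the platform locale; whether the library does the same is not stated here.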
-
DEFAULT_ENCODING
Default encoding.
-
ERROR_REPORTER
Error reporter.
-
ENCODING_TRANSLATOR
Encoding translator.
-
DOCTYPE_PUBID
Doctype declaration public identifier.
-
DOCTYPE_SYSID
Doctype declaration system identifier.
-
RECOGNIZED_PROPERTIES
Recognized properties. -
RECOGNIZED_PROPERTIES_DEFAULTS
Recognized properties defaults. -
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
-
DEBUG_SCANNER
private static final boolean DEBUG_SCANNER
Set to true to debug changes in the scanner.
-
DEBUG_SCANNER_STATE
private static final boolean DEBUG_SCANNER_STATE
Set to true to debug changes in the scanner state.
-
DEBUG_BUFFER
private static final boolean DEBUG_BUFFER
Set to true to debug the buffer.
-
DEBUG_CHARSET
private static final boolean DEBUG_CHARSET
Set to true to debug character encoding handling.
-
DEBUG_CALLBACKS
protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.
-
SYNTHESIZED_ITEM
Synthesized event info item. -
fAugmentations_
private boolean fAugmentations_
Augmentations. -
fReportErrors_
boolean fReportErrors_
Report errors. -
fScriptStripCDATADelims_
boolean fScriptStripCDATADelims_
Strip CDATA delimiters from SCRIPT tags. -
fScriptStripCommentDelims_
boolean fScriptStripCommentDelims_
Strip comment delimiters from SCRIPT tags. -
fStyleStripCDATADelims_
boolean fStyleStripCDATADelims_
Strip CDATA delimiters from STYLE tags. -
fStyleStripCommentDelims_
boolean fStyleStripCommentDelims_
Strip comment delimiters from STYLE tags. -
fIgnoreSpecifiedCharset_
boolean fIgnoreSpecifiedCharset_
Ignore specified character set. -
fCDATASections_
boolean fCDATASections_
CDATA sections. -
fOverrideDoctype_
private boolean fOverrideDoctype_
Override doctype declaration public and system identifiers. -
fInsertDoctype_
boolean fInsertDoctype_
Insert document type declaration. -
fNormalizeAttributes_
boolean fNormalizeAttributes_
Normalize attribute values. -
fPlainAttributeValues_
boolean fPlainAttributeValues_
Store the plain attribute values also. -
fParseNoScriptContent_
boolean fParseNoScriptContent_
Parse noscript content. -
fAllowSelfclosingIframe_
boolean fAllowSelfclosingIframe_
Allows self-closing iframe tags. -
fAllowSelfclosingTags_
boolean fAllowSelfclosingTags_
Allows self-closing tags. -
fNamesElems
protected short fNamesElems
Modify HTML element names. -
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names. -
fDefaultIANAEncoding
Default encoding. -
fErrorReporter
Error reporter. -
fEncodingTranslator
Encoding translator. -
fDoctypePubid
Doctype declaration public identifier. -
fDoctypeSysid
Doctype declaration system identifier. -
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number. -
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number. -
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file. -
fEndLineNumber
protected int fEndLineNumber
Ending line number. -
fEndColumnNumber
protected int fEndColumnNumber
Ending column number. -
fEndCharacterOffset
protected int fEndCharacterOffset
Ending character offset in the file. -
fByteStream
The playback byte stream. -
fCurrentEntity
HTMLScanner.CurrentEntity fCurrentEntity
Current entity. -
fCurrentEntityStack
The current entity stack. -
fScanner
The current scanner. -
fScannerState
protected short fScannerState
The current scanner state. -
fDocumentHandler
The document handler. -
fIANAEncoding
Auto-detected IANA encoding. -
fJavaEncoding
Auto-detected Java encoding. -
fElementCount
protected int fElementCount
Element count. -
fElementDepth
protected int fElementDepth
Element depth. -
fContentScanner
Content scanner. -
fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>. -
fStringBuffer
String buffer. -
fStringBufferEntiyRef
String buffer used when resolving entity refs. -
fStringBufferPlainAttribValue
-
fScanScriptContent
String buffer, larger because script areas are larger. -
fScanUntilEndTag
-
fScanComment
-
fScanLiteral
-
fSingleBoolean
final boolean[] fSingleBoolean
Single boolean array. -
htmlConfiguration_
-
fLocationItem
Our location item, reused because Augmentations says so, which saves memory.
-
-
Constructor Details
-
HTMLScanner
HTMLScanner(HTMLConfiguration htmlConfiguration)
Creates a new HTMLScanner with the given configuration.
- Parameters: htmlConfiguration - the configuration to use
-
-
Method Details
-
pushInputSource
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the pushed entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters: inputSource - The new input source to start scanning.
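The push-and-resume behavior described for pushInputSource can be pictured as a stack of readers. The sketch below shows that idea under stated assumptions: EntityStackSketch is a hypothetical class that uses java.io.Reader in place of XMLInputSource and HTMLScanner.CurrentEntity, and it is not the scanner's actual implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an entity stack: pushed input is read first, then the outer input resumes. */
final class EntityStackSketch {
    private final Deque<Reader> suspended = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader initial) {
        this.current = initial;
    }

    /** Suspends the current source and starts reading from the new one. */
    void pushInputSource(Reader source) {
        suspended.push(current);
        current = source;
    }

    /** Returns the next character, or -1 once every source is exhausted. */
    int read() {
        try {
            while (true) {
                int c = current.read();
                if (c != -1) {
                    return c;
                }
                if (suspended.isEmpty()) {
                    return -1;
                }
                current.close();           // the pushed source is finished
                current = suspended.pop(); // resume where the outer source left off
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EntityStackSketch scanner = new EntityStackSketch(new StringReader("ab"));
        StringBuilder out = new StringBuilder();
        out.append((char) scanner.read());                 // reads 'a' from the outer source
        scanner.pushInputSource(new StringReader("XY"));   // e.g. output written by a script
        int c;
        while ((c = scanner.read()) != -1) {
            out.append((char) c);
        }
        System.out.println(out);
    }
}
```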
-
getReader
-
evaluateInputSource
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters: inputSource - The new input source to start evaluating.
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters: closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.
-
getEncoding
Returns the encoding.
- Specified by: getEncoding in interface XMLLocator
- Returns: the encoding of the current entity. Note that, for a given entity, this value can only be considered final once the encoding declaration has been read (or once it has been determined that there is no such declaration), because when no encoding has been specified on the XMLInputSource the parser makes an initial "guess" that could be wrong.
-
getPublicId
Returns the public identifier.
- Specified by: getPublicId in interface XMLLocator
- Returns: the public identifier.
-
getBaseSystemId
Returns the base system identifier.
- Specified by: getBaseSystemId in interface XMLLocator
- Returns: the base system identifier.
-
getLiteralSystemId
Returns the literal system identifier.
- Specified by: getLiteralSystemId in interface XMLLocator
- Returns: the literal system identifier.
-
getExpandedSystemId
Returns the expanded system identifier.
- Specified by: getExpandedSystemId in interface XMLLocator
- Returns: the expanded system identifier.
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by: getLineNumber in interface XMLLocator
- Returns: the line number, or -1 if no line number is available.
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by: getColumnNumber in interface XMLLocator
- Returns: the column number, or -1 if no column number is available.
-
getXMLVersion
Returns the XML version.
- Specified by: getXMLVersion in interface XMLLocator
- Returns: the XML version of the current entity. This will normally be the value from the XML or text declaration or defaulted by the parser. Note that this value may differ from the version of the processing rules applied to the current entity. For instance, an XML 1.1 document may refer to XML 1.0 entities. In such a case the rules of XML 1.1 are applied to the entire document. Also note that, for a given entity, this value can only be considered final once the XML or text declaration has been read or once it has been determined that there is no such declaration.
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by: getCharacterOffset in interface XMLLocator
- Returns: the character offset, or -1 if no character offset is available.
-
getFeatureDefault
Returns the default state for a feature.
- Specified by: getFeatureDefault in interface HTMLComponent
- Specified by: getFeatureDefault in interface XMLComponent
- Parameters: featureId - The feature identifier.
- Returns: the default state for a feature, or null if this component does not want to report a default value for this feature.
-
getPropertyDefault
Returns the default state for a property.
- Specified by: getPropertyDefault in interface HTMLComponent
- Specified by: getPropertyDefault in interface XMLComponent
- Parameters: propertyId - The property identifier.
- Returns: the default state for a property, or null if this component does not want to report a default value for this property.
-
getRecognizedFeatures
Returns recognized features.
- Specified by: getRecognizedFeatures in interface XMLComponent
- Returns: an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
-
getRecognizedProperties
Returns recognized properties.
- Specified by: getRecognizedProperties in interface XMLComponent
- Returns: an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
-
reset
Resets the component.
- Specified by: reset in interface XMLComponent
- Parameters: manager - The component manager.
- Throws: XMLConfigurationException
-
setFeature
Sets a feature.
- Specified by: setFeature in interface XMLComponent
- Parameters: featureId - The feature identifier.
- Parameters: state - The state of the feature.
-
setProperty
Sets a property.
- Specified by: setProperty in interface XMLComponent
- Parameters: propertyId - The property identifier.
- Parameters: value - The value of the property.
- Throws: XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setInputSource
Sets the input source.
- Specified by: setInputSource in interface XMLDocumentScanner
- Parameters: source - The input source.
- Throws: IOException - Thrown on I/O error.
-
scanDocument
Scans the document.
- Specified by: scanDocument in interface XMLDocumentScanner
- Parameters: complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
- Returns: True if there is more to scan, false otherwise.
- Throws: XNIException - on error.
- Throws: IOException - Thrown on I/O error.
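The "pull" model described for scanDocument amounts to calling the method with false until it reports that nothing is left. The sketch below demonstrates that calling pattern against a toy scanner. PullScanner and ToyScanner are hypothetical types introduced for this example; they model only the boolean contract stated above and emit no XNI events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the pull-scanning contract of scanDocument(boolean). */
final class PullScanSketch {

    /** Stand-in exposing only the documented contract; not the real XMLDocumentScanner. */
    interface PullScanner {
        boolean scanDocument(boolean complete);
    }

    /** Toy scanner that "scans" one pre-split portion per call. */
    static final class ToyScanner implements PullScanner {
        private final Iterator<String> pending;
        final List<String> events = new ArrayList<>();

        ToyScanner(String... portions) {
            this.pending = Arrays.asList(portions).iterator();
        }

        @Override
        public boolean scanDocument(boolean complete) {
            do {
                if (!pending.hasNext()) {
                    return false;           // nothing left to scan
                }
                events.add(pending.next()); // "push" one event to the handler
            } while (complete);             // complete == true: keep going to the end
            return pending.hasNext();       // true if there is more to scan
        }
    }

    /** Drives a scanner in pull mode and returns how many calls were needed. */
    static int drain(PullScanner scanner) {
        int calls = 0;
        boolean more;
        do {
            more = scanner.scanDocument(false);
            calls++;
        } while (more);
        return calls;
    }

    public static void main(String[] args) {
        ToyScanner scanner = new ToyScanner("<p>", "text", "</p>");
        System.out.println(drain(scanner) + " calls, events: " + scanner.events);
    }
}
```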
-
setDocumentHandler
Sets the document handler.
- Specified by: setDocumentHandler in interface XMLDocumentSource
- Parameters: handler - the new handler
-
getDocumentHandler
Returns the document handler.
- Specified by: getDocumentHandler in interface XMLDocumentSource
- Returns: the document handler
-
getValue
-
expandSystemId
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
- Parameters: systemId - The systemId to be expanded.
- Parameters: baseSystemId - The base system identifier.
- Returns: the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
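The contract above (expanded URI on success, null when already expanded, exception on failure) can be illustrated with java.net.URI. This is a sketch under assumptions, not the library's algorithm: SystemIdSketch is a hypothetical class, treating "absolute URI" as "already expanded" is this example's interpretation, and IllegalArgumentException is chosen here only because the page does not name the exception type.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch of expanding a relative system id against a base system id. */
final class SystemIdSketch {

    /**
     * Returns the expanded system id, or null if systemId is already an absolute URI.
     * Throws IllegalArgumentException (an assumption of this sketch) when expansion fails.
     */
    static String expandSystemId(String systemId, String baseSystemId) {
        if (systemId == null || systemId.isEmpty()) {
            return null; // nothing to expand
        }
        try {
            URI uri = new URI(systemId);
            if (uri.isAbsolute()) {
                return null; // already expanded
            }
            if (baseSystemId == null) {
                throw new IllegalArgumentException("No base system id for: " + systemId);
            }
            return new URI(baseSystemId).resolve(uri).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot expand: " + systemId, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandSystemId("page.html", "http://example.org/dir/index.html"));
        System.out.println(expandSystemId("http://example.org/a.html", null));
    }
}
```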
-
fixURI
Fixes a platform dependent filename to standard URI form.
- Parameters: str - The string to fix.
- Returns: the fixed URI string.
-
modifyName
-
getNamesValue
-
setScanner
-
setScannerState
protected void setScannerState(short state) -
scanDoctype
- Throws:
IOException
-
scanLiteral
- Throws:
IOException
-
scanName
- Throws:
IOException
-
scanTagName
- Throws:
IOException
-
scanEntityRef
protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws IOException
- Throws: IOException
-
returnEntityRefString
-
skip
- Throws:
IOException
-
skipMarkup
- Throws:
IOException
-
skipSpaces
- Throws:
IOException
-
skipNewlines
- Throws:
IOException
-
locationAugs
-
synthesizedAugs
-
builtinXmlRef
-
isEncodingCompatible
To detect whether two encodings are compatible, both must be able to read the meta tag specifying the new encoding. This means that the byte representation of some minimal HTML markup must be the same in both encodings. -
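The compatibility rule stated for isEncodingCompatible can be checked by encoding a short piece of markup with both charsets and comparing the bytes. The sketch below does exactly that and nothing more. EncodingCompatSketch and its PROBE string are hypothetical; the markup the library actually compares is not given on this page.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

/** Hypothetical sketch: two encodings are "compatible" if minimal markup encodes to the same bytes. */
final class EncodingCompatSketch {
    // Assumed probe text; the markup used by the real method is not documented here.
    private static final String PROBE = "<meta http-equiv='Content-Type' content='text/html;charset=";

    static boolean isEncodingCompatible(String encoding1, String encoding2) {
        byte[] first = PROBE.getBytes(Charset.forName(encoding1));
        byte[] second = PROBE.getBytes(Charset.forName(encoding2));
        return Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        System.out.println(isEncodingCompatible("UTF-8", "ISO-8859-1"));
        System.out.println(isEncodingCompatible("UTF-8", "UTF-16"));
    }
}
```

ASCII-compatible charsets such as UTF-8 and ISO-8859-1 agree on the probe, while UTF-16 uses two bytes per character and therefore does not.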
canRoundtrip
private static boolean canRoundtrip(String encodeCharset, String decodeCharset) throws UnsupportedEncodingException
- Throws: UnsupportedEncodingException
-
readPreservingBufferContent
- Throws:
IOException
-