Class HtmlDocumentBuilder
- java.lang.Object
  - javax.xml.parsers.DocumentBuilder
    - nu.validator.htmlparser.dom.HtmlDocumentBuilder
public class HtmlDocumentBuilder extends javax.xml.parsers.DocumentBuilder
This class implements an HTML5 parser that exposes data through the DOM interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec while on the other hand potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. This policy does not work with a standard DOM implementation. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
The doctype is not represented in the tree.
The document mode is represented as user data: a DocumentMode object is stored with the key nu.validator.document-mode on the document node.
The form pointer is also stored as user data with the key nu.validator.form-pointer.
-
-
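The user data keys described above can be read back with the standard DOM Level 3 getUserData method. The following is a minimal sketch, not the library's own code: the class and helper names are hypothetical, the document is created with the JDK's default JAXP builder, and a plain string stands in for the DocumentMode object that the parser would store.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical helper class; only the two key strings come from the documentation above.
final class UserDataSketch {
    static final String DOCUMENT_MODE_KEY = "nu.validator.document-mode";
    static final String FORM_POINTER_KEY = "nu.validator.form-pointer";

    // Returns the string form of the stored document mode, or "unknown" if none is set.
    static String documentModeOf(Document doc) {
        Object mode = doc.getUserData(DOCUMENT_MODE_KEY);
        return mode == null ? "unknown" : mode.toString();
    }

    // Creates an empty document with the JDK's default builder (a stand-in for a parsed document).
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = emptyDocument();
        // Assumption: a string stands in for the DocumentMode value set during a real parse.
        doc.setUserData(DOCUMENT_MODE_KEY, "STANDARDS_MODE", null);
        System.out.println(documentModeOf(doc));
    }
}
```

With a document returned by this class's parse method, the same getUserData call should return the DocumentMode object rather than a string.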
Field Summary
-
Constructor Summary
Constructors
HtmlDocumentBuilder()
Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.
HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation)
Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation, XmlViolationPolicy xmlPolicy)
Instantiates the document builder with a specific DOM implementation and XML violation policy.
-
Method Summary
Modifier and Type, Method, Description
void addCharacterHandler(CharacterHandler characterHandler)
XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated.
XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.
XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.
XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.
DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.
org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.
DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.
org.w3c.dom.DOMImplementation getDOMImplementation()
Returns the DOM implementation.
Heuristics getHeuristics()
XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.
XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.
XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.
boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
boolean isNamespaceAware()
Returns true.
boolean isReportingDoctype()
Returns the reportingDoctype.
boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
boolean isValidating()
Returns false.
private static org.w3c.dom.DOMImplementation jaxpDOMImplementation()
Returns the JAXP DOM implementation.
private void lazyInit()
This class wraps different tree builders depending on configuration.
org.w3c.dom.Document newDocument()
For API compatibility.
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
org.w3c.dom.Document parse(org.xml.sax.InputSource is)
Parses a document from a SAX InputSource.
org.w3c.dom.DocumentFragment parseFragment(org.xml.sax.InputSource is, java.lang.String context)
Parses a document fragment from a SAX InputSource.
void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated.
void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.
void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.
void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.
void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.
void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.
void setEntityResolver(org.xml.sax.EntityResolver resolver)
Sets the entity resolver for URI-only inputs.
void setErrorHandler(org.xml.sax.ErrorHandler errorHandler)
Sets the error handler.
void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.
void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.
void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
void setNamePolicy(XmlViolationPolicy namePolicy)
The policy for non-NCName element and attribute names.
void setReportingDoctype(boolean reportingDoctype)
void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.
void setTransitionHander(TransitionHandler handler)
void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through.
void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
private void tokenize(org.xml.sax.InputSource is)
Tokenizes the input source.
-
-
-
Field Detail
-
driver
private Driver driver
The tokenizer.
-
treeBuilder
private final DOMTreeBuilder treeBuilder
The tree builder.
-
implementation
private final org.w3c.dom.DOMImplementation implementation
The DOM implementation.
-
entityResolver
private org.xml.sax.EntityResolver entityResolver
The entity resolver.
-
errorHandler
private org.xml.sax.ErrorHandler errorHandler
-
documentModeHandler
private DocumentModeHandler documentModeHandler
-
doctypeExpectation
private DoctypeExpectation doctypeExpectation
-
checkingNormalization
private boolean checkingNormalization
-
scriptingEnabled
private boolean scriptingEnabled
-
characterHandlers
private final java.util.List<CharacterHandler> characterHandlers
-
contentSpacePolicy
private XmlViolationPolicy contentSpacePolicy
-
contentNonXmlCharPolicy
private XmlViolationPolicy contentNonXmlCharPolicy
-
commentPolicy
private XmlViolationPolicy commentPolicy
-
namePolicy
private XmlViolationPolicy namePolicy
-
streamabilityViolationPolicy
private XmlViolationPolicy streamabilityViolationPolicy
-
html4ModeCompatibleWithXhtml1Schemata
private boolean html4ModeCompatibleWithXhtml1Schemata
-
mappingLangToXmlLang
private boolean mappingLangToXmlLang
-
xmlnsPolicy
private XmlViolationPolicy xmlnsPolicy
-
reportingDoctype
private boolean reportingDoctype
-
treeBuilderErrorHandler
private org.xml.sax.ErrorHandler treeBuilderErrorHandler
-
heuristics
private Heuristics heuristics
-
transitionHandler
private TransitionHandler transitionHandler
-
-
Constructor Detail
-
HtmlDocumentBuilder
public HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation, XmlViolationPolicy xmlPolicy)
Instantiates the document builder with a specific DOM implementation and XML violation policy.
- Parameters:
implementation - the DOM implementation
xmlPolicy - the policy
-
HtmlDocumentBuilder
public HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation)
Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.
- Parameters:
implementation - the DOM implementation
-
HtmlDocumentBuilder
public HtmlDocumentBuilder()
Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.
-
HtmlDocumentBuilder
public HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.
- Parameters:
xmlPolicy - the policy
-
-
Method Detail
-
jaxpDOMImplementation
private static org.w3c.dom.DOMImplementation jaxpDOMImplementation()
Returns the JAXP DOM implementation.
- Returns:
- the JAXP DOM implementation
-
newTokenizer
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
-
lazyInit
private void lazyInit()
This class wraps different tree builders depending on configuration. This method does the work of hiding this from the user of the class.
-
tokenize
private void tokenize(org.xml.sax.InputSource is) throws org.xml.sax.SAXException, java.io.IOException, java.net.MalformedURLException
Tokenizes the input source.
- Parameters:
is - the source
- Throws:
org.xml.sax.SAXException - if parsing fails
java.io.IOException - if an I/O error occurs
java.net.MalformedURLException - if the system ID is malformed and the entity resolver is null
-
getDOMImplementation
public org.w3c.dom.DOMImplementation getDOMImplementation()
Returns the DOM implementation.
- Specified by:
getDOMImplementation in class javax.xml.parsers.DocumentBuilder
- Returns:
- the DOM implementation
- See Also:
DocumentBuilder.getDOMImplementation()
-
isNamespaceAware
public boolean isNamespaceAware()
Returns true.
- Specified by:
isNamespaceAware in class javax.xml.parsers.DocumentBuilder
- Returns:
true
- See Also:
DocumentBuilder.isNamespaceAware()
-
isValidating
public boolean isValidating()
Returns false.
- Specified by:
isValidating in class javax.xml.parsers.DocumentBuilder
- Returns:
false
- See Also:
DocumentBuilder.isValidating()
-
newDocument
public org.w3c.dom.Document newDocument()
For API compatibility.
- Specified by:
newDocument in class javax.xml.parsers.DocumentBuilder
- See Also:
DocumentBuilder.newDocument()
-
parse
public org.w3c.dom.Document parse(org.xml.sax.InputSource is) throws org.xml.sax.SAXException, java.io.IOException
Parses a document from a SAX InputSource.
- Specified by:
parse in class javax.xml.parsers.DocumentBuilder
- Parameters:
is - the source
- Returns:
- the document
- Throws:
org.xml.sax.SAXException - if parsing fails
java.io.IOException - if an I/O error occurs
- See Also:
DocumentBuilder.parse(org.xml.sax.InputSource)
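Because this class extends javax.xml.parsers.DocumentBuilder, code that parses from an InputSource can be written against the base type. The following is a minimal sketch under that assumption: ParseSketch and its methods are hypothetical names, and the JDK's default XML builder is used as a stand-in so the example runs without this library on the classpath. In real use an HtmlDocumentBuilder instance would be passed to parseString instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, written against the JAXP base type only.
final class ParseSketch {
    // Wraps in-memory markup in a SAX InputSource and hands it to the given builder.
    static Document parseString(DocumentBuilder builder, String markup)
            throws SAXException, IOException {
        InputSource source = new InputSource(new StringReader(markup));
        return builder.parse(source);
    }

    // Demo using the JDK's XML builder as a stand-in, so the input must be well-formed XML.
    static String rootElementName(String markup) {
        try {
            DocumentBuilder standIn = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return parseString(standIn, markup).getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The stand-in builder rejects markup that is not well-formed XML, whereas an HTML5 parser is expected to build a tree from non-conforming input as well.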
-
parseFragment
public org.w3c.dom.DocumentFragment parseFragment(org.xml.sax.InputSource is, java.lang.String context) throws java.io.IOException, org.xml.sax.SAXException
Parses a document fragment from a SAX InputSource.
- Parameters:
is - the source
context - the context element name
- Returns:
- the document fragment
- Throws:
org.xml.sax.SAXException - if parsing fails
java.io.IOException - if an I/O error occurs
-
setEntityResolver
public void setEntityResolver(org.xml.sax.EntityResolver resolver)
Sets the entity resolver for URI-only inputs.
- Specified by:
setEntityResolver in class javax.xml.parsers.DocumentBuilder
- Parameters:
resolver - the resolver
- See Also:
DocumentBuilder.setEntityResolver(org.xml.sax.EntityResolver)
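An entity resolver lets the caller decide how a system ID is turned into a stream when the InputSource carries only a URI. The following is a minimal sketch of one possible resolver that serves registered markup from memory; the class name and the register method are hypothetical, and how exactly this parser consults the resolver is an assumption based on the description above.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that maps known system IDs to in-memory markup.
final class InMemoryEntityResolver implements EntityResolver {
    private final Map<String, String> pages = new HashMap<>();

    // Registers markup to be served for the given system ID.
    void register(String systemId, String markup) {
        pages.put(systemId, markup);
    }

    // Returns a character-stream InputSource for known IDs, or null to request default handling.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String markup = pages.get(systemId);
        if (markup == null) {
            return null;
        }
        InputSource source = new InputSource(new StringReader(markup));
        source.setSystemId(systemId);
        return source;
    }
}
```

Such a resolver would be installed with setEntityResolver before calling parse with an InputSource that has only a system ID.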
-
setErrorHandler
public void setErrorHandler(org.xml.sax.ErrorHandler errorHandler)
Sets the error handler.
- Specified by:
setErrorHandler in class javax.xml.parsers.DocumentBuilder
- Parameters:
errorHandler - the handler
- See Also:
DocumentBuilder.setErrorHandler(org.xml.sax.ErrorHandler)
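The handler passed to setErrorHandler is a standard SAX ErrorHandler. The following is a minimal sketch of a handler that collects messages with their line and column instead of printing them; the class name and message format are hypothetical, and which conditions this parser reports as warnings, errors, or fatal errors is not specified here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler that records every reported problem for later inspection.
final class CollectingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    // Records the problem, then rethrows so that parsing stops on fatal errors.
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e;
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
                + ": " + e.getMessage());
    }

    // Read-only view of the collected messages, in reporting order.
    List<String> messages() {
        return Collections.unmodifiableList(messages);
    }
}
```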
-
setTransitionHander
public void setTransitionHander(TransitionHandler handler)
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
- See Also:
nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable - true to check normalization
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)
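For readers unfamiliar with NFC, the following minimal sketch shows what it means for text to be NFC-normalized, using the JDK's java.text.Normalizer. It illustrates the concept only; the class name is hypothetical and this is not a claim about how the parser performs its check.

```java
import java.text.Normalizer;

// Hypothetical helper illustrating NFC normalization with the JDK only.
final class NfcSketch {
    // True if the text is already in Unicode Normalization Form C.
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // single code point: e with acute accent
        String decomposed = "e\u0301";  // letter e followed by a combining acute accent
        System.out.println(isNfc(precomposed));
        System.out.println(isNfc(decomposed));
    }
}
```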
-
setCommentPolicy
public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy - the policy
- See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setContentNonXmlCharPolicy
public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy - the policy
- See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setContentSpacePolicy
public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy - the policy
- See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
- See Also:
TreeBuilder.isScriptingEnabled()
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled - true to enable
- See Also:
TreeBuilder.setScriptingEnabled(boolean)
-
getDoctypeExpectation
public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.
- Returns:
- the doctypeExpectation
-
setDoctypeExpectation
public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.
- Parameters:
doctypeExpectation - the doctypeExpectation to set
- See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)
-
getDocumentModeHandler
public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.
- Returns:
- the documentModeHandler
-
setDocumentModeHandler
public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.
- Parameters:
documentModeHandler - the documentModeHandler to set
- See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)
-
getStreamabilityViolationPolicy
public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.
- Returns:
- the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value
-
getDocumentLocator
public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.
- Returns:
- the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
- the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Sets whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
- See Also:
Tokenizer.setMappingLangToXmlLang(boolean)
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
- the mappingLangToXmlLang
-
setXmlnsPolicy
public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Sets the policy for whether the xmlns attribute on the root element is passed through. (FATAL is not allowed.)
- Parameters:
xmlnsPolicy - the policy
- See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
getXmlnsPolicy
public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.
- Returns:
- the xmlnsPolicy
-
getCommentPolicy
public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.
- Returns:
- the commentPolicy
-
getContentNonXmlCharPolicy
public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.
- Returns:
- the contentNonXmlCharPolicy
-
getContentSpacePolicy
public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.
- Returns:
- the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype - the reportingDoctype to set
- See Also:
TreeBuilder.setReportingDoctype(boolean)
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
- the reportingDoctype
-
setNamePolicy
public void setNamePolicy(XmlViolationPolicy namePolicy)
Sets the policy for non-NCName element and attribute names.
- Parameters:
namePolicy - the policy
- See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setHeuristics
public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.
- Parameters:
heuristics - the heuristics to set
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)
-
getHeuristics
public Heuristics getHeuristics()
-
setXmlPolicy
public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting the name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy - the policy
-
getNamePolicy
public XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.
- Returns:
- the namePolicy
-
setBogusXmlnsPolicy
public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. Does nothing.
-
getBogusXmlnsPolicy
public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated. Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
public void addCharacterHandler(CharacterHandler characterHandler)
-
setIgnoringComments
public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.
- Parameters:
ignoreComments - true to ignore comments
- See Also:
TreeBuilder.setIgnoringComments(boolean)
-
-