Class HtmlBuilder
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
The doctype is not represented in the tree.
The document mode is represented via the Mode interface on the Document node if the node implements that interface (depends on the used node factory).
The form pointer is stored if the node factory supports storing it.
This package has its own node factory class because the official XOM node factory may return multiple nodes instead of one, which conflicts with the assumptions of the DOM-oriented HTML5 parsing algorithm.
-
Field Summary
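Fields
private final List<CharacterHandler> characterHandlers
private boolean checkingNormalization
private XmlViolationPolicy commentPolicy
private XmlViolationPolicy contentNonXmlCharPolicy
private XmlViolationPolicy contentSpacePolicy
private DoctypeExpectation doctypeExpectation
private DocumentModeHandler documentModeHandler
private Driver driver
private EntityResolver entityResolver
private ErrorHandler errorHandler
private Heuristics heuristics
private boolean html4ModeCompatibleWithXhtml1Schemata
private boolean mappingLangToXmlLang
private XmlViolationPolicy namePolicy
private boolean reportingDoctype
private boolean scriptingEnabled
private final SimpleNodeFactory simpleNodeFactory
private XmlViolationPolicy streamabilityViolationPolicy
private TransitionHandler transitionHandler
private final XOMTreeBuilder treeBuilder
private ErrorHandler treeBuilderErrorHandler
private XmlViolationPolicy xmlnsPolicy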
-
Constructor Summary
Constructors
HtmlBuilder() - Constructor with default node factory and fatal XML violation policy.
HtmlBuilder(XmlViolationPolicy xmlPolicy) - Constructor with default node factory and given XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory) - Constructor with given node factory and fatal XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy) - Constructor with given node factory and given XML violation policy.
-
Method Summary
void addCharacterHandler(CharacterHandler characterHandler)
nu.xom.Document build(File file) - Parse from File.
nu.xom.Document build(InputStream stream) - Parse from InputStream.
nu.xom.Document build(InputStream stream, String uri) - Parse from InputStream.
nu.xom.Document build(Reader stream) - Parse from Reader.
nu.xom.Document build(Reader stream, String uri) - Parse from Reader.
nu.xom.Document build(String uri) - Parse from URI.
nu.xom.Document build(String content, String uri) - Parse from String.
nu.xom.Document build(InputSource is) - Parse from SAX InputSource.
nu.xom.Nodes buildFragment(InputSource is, String context) - Parse a fragment from SAX InputSource.
XmlViolationPolicy getBogusXmlnsPolicy() - Deprecated.
XmlViolationPolicy getCommentPolicy() - Returns the commentPolicy.
XmlViolationPolicy getContentNonXmlCharPolicy() - Returns the contentNonXmlCharPolicy.
XmlViolationPolicy getContentSpacePolicy() - Returns the contentSpacePolicy.
DoctypeExpectation getDoctypeExpectation() - Returns the doctype expectation.
Locator getDocumentLocator() - Returns the Locator during parse.
DocumentModeHandler getDocumentModeHandler() - Returns the document mode handler.
Heuristics getHeuristics()
XmlViolationPolicy getNamePolicy() - The policy for non-NCName element and attribute names.
SimpleNodeFactory getSimpleNodeFactory() - Gets the node factory.
XmlViolationPolicy getStreamabilityViolationPolicy() - Returns the streamabilityViolationPolicy.
XmlViolationPolicy getXmlnsPolicy() - Returns the xmlnsPolicy.
boolean isCheckingNormalization() - Indicates whether NFC normalization of source is being checked.
boolean isHtml4ModeCompatibleWithXhtml1Schemata() - Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
boolean isMappingLangToXmlLang() - Whether lang is mapped to xml:lang.
boolean isReportingDoctype() - Returns the reportingDoctype.
boolean isScriptingEnabled() - Whether the parser considers scripting to be enabled for noscript treatment.
private void lazyInit() - This class wraps different tree builders depending on configuration.
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy) - Deprecated.
void setCheckingNormalization(boolean enable) - Toggles the checking of the NFC normalization of source.
void setCommentPolicy(XmlViolationPolicy commentPolicy) - Sets the policy for consecutive hyphens in comments.
void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy) - Sets the policy for non-XML characters except white space.
void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy) - Sets the policy for non-XML white space.
void setDoctypeExpectation(DoctypeExpectation doctypeExpectation) - Sets the doctype expectation.
void setDocumentModeHandler(DocumentModeHandler documentModeHandler) - Sets the document mode handler.
void setEntityResolver(EntityResolver resolver)
void setErrorHandler(ErrorHandler handler)
void setHeuristics(Heuristics heuristics) - Sets the encoding sniffing heuristics.
void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata) - Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
void setIgnoringComments(boolean ignoreComments) - Sets whether comment nodes appear in the tree.
void setMappingLangToXmlLang(boolean mappingLangToXmlLang) - Whether lang is mapped to xml:lang.
void setNamePolicy(XmlViolationPolicy namePolicy) - The policy for non-NCName element and attribute names.
void setReportingDoctype(boolean reportingDoctype)
void setScriptingEnabled(boolean scriptingEnabled) - Sets whether the parser considers scripting to be enabled for noscript treatment.
void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy) - Sets the streamabilityViolationPolicy.
void setTransitionHander(TransitionHandler handler)
void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy) - Whether the xmlns attribute on the root element is passed through.
void setXmlPolicy(XmlViolationPolicy xmlPolicy) - This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
private void tokenize(InputSource is)
Methods inherited from class nu.xom.Builder
getNodeFactory
-
Field Details
-
driver
-
treeBuilder
-
simpleNodeFactory
-
entityResolver
-
errorHandler
-
documentModeHandler
-
doctypeExpectation
-
checkingNormalization
private boolean checkingNormalization -
scriptingEnabled
private boolean scriptingEnabled -
characterHandlers
-
contentSpacePolicy
-
contentNonXmlCharPolicy
-
commentPolicy
-
namePolicy
-
streamabilityViolationPolicy
-
html4ModeCompatibleWithXhtml1Schemata
private boolean html4ModeCompatibleWithXhtml1Schemata -
mappingLangToXmlLang
private boolean mappingLangToXmlLang -
xmlnsPolicy
-
reportingDoctype
private boolean reportingDoctype -
treeBuilderErrorHandler
-
heuristics
-
transitionHandler
-
-
Constructor Details
-
HtmlBuilder
public HtmlBuilder()
Constructor with default node factory and fatal XML violation policy.
-
HtmlBuilder
public HtmlBuilder(SimpleNodeFactory nodeFactory)
Constructor with given node factory and fatal XML violation policy.
- Parameters:
nodeFactory - the factory
-
HtmlBuilder
public HtmlBuilder(XmlViolationPolicy xmlPolicy)
Constructor with default node factory and given XML violation policy.
- Parameters:
xmlPolicy - the policy
-
HtmlBuilder
public HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy)
Constructor with given node factory and given XML violation policy.
- Parameters:
nodeFactory - the factory
xmlPolicy - the policy
-
-
Method Details
-
newTokenizer
-
lazyInit
private void lazyInit()
This class wraps different tree builders depending on configuration. This method does the work of hiding this from the user of the class.
-
tokenize
private void tokenize(InputSource is) throws nu.xom.ParsingException, IOException, MalformedURLException
- Throws:
nu.xom.ParsingException
IOException
MalformedURLException
-
build
Parse from SAX InputSource.
- Parameters:
is - the InputSource
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
-
buildFragment
public nu.xom.Nodes buildFragment(InputSource is, String context) throws IOException, nu.xom.ParsingException
Parse a fragment from SAX InputSource.
- Parameters:
is - the InputSource
context - the name of the context element
- Returns:
the fragment
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
-
build
public nu.xom.Document build(File file) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from File.
- Overrides:
build in class nu.xom.Builder
- Parameters:
file - the file
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(InputStream stream, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from InputStream.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the stream
uri - the base URI
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(InputStream stream) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from InputStream.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the stream
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(Reader stream, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from Reader.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the reader
uri - the base URI
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(Reader stream) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from Reader.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the reader
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(String content, String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from String.
- Overrides:
build in class nu.xom.Builder
- Parameters:
content - the HTML source as string
uri - the base URI
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
build
public nu.xom.Document build(String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, IOException
Parse from URI.
- Overrides:
build in class nu.xom.Builder
- Parameters:
uri - the URI of the document
- Returns:
the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
IOException - if IO goes wrong
nu.xom.ValidityException
-
getSimpleNodeFactory
Gets the node factory.
-
setEntityResolver
-
setErrorHandler
-
setTransitionHander
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable - true to check normalization
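To make concrete what "NFC normalization of source" refers to, the following minimal sketch tests strings with the JDK's java.text.Normalizer. It only illustrates the property being checked; how HtmlBuilder performs and reports the check is not shown on this page, and the class name NfcCheckSketch and the method isNfc are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration of the NFC property; not part of HtmlBuilder.
final class NfcCheckSketch {

    // Returns true if the text is already in Unicode Normalization Form C.
    static boolean isNfc(CharSequence source) {
        return Normalizer.isNormalized(source, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Precomposed U+00E9 is in NFC.
        System.out.println(isNfc("\u00E9"));
        // "e" followed by combining acute accent U+0301 is not in NFC.
        System.out.println(isNfc("e\u0301"));
    }
}
```

Source that fails such a test is the kind of input the normalization check is concerned with when it is enabled.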
-
setCommentPolicy
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy - the policy
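As a rough illustration of the difference between ALTER_INFOSET and FATAL for comments, the sketch below models one possible handling of consecutive hyphens, which XML 1.0 does not permit inside a comment. Everything here is an assumption made for illustration: the class CommentPolicySketch, its local Policy enum, and the repair strategy of separating hyphens with a space are stand-ins and are not taken from the parser's implementation.

```java
// Hypothetical stand-in that models the documented policy names; not library code.
final class CommentPolicySketch {

    // Local enum mirroring the two policy names used on this page.
    enum Policy { ALTER_INFOSET, FATAL }

    // Returns comment text without "--", or throws under FATAL.
    static String applyCommentPolicy(String comment, Policy policy) {
        StringBuilder out = new StringBuilder(comment.length());
        for (int i = 0; i < comment.length(); i++) {
            char c = comment.charAt(i);
            boolean secondHyphen = c == '-' && out.length() > 0
                    && out.charAt(out.length() - 1) == '-';
            if (secondHyphen) {
                if (policy == Policy.FATAL) {
                    throw new IllegalArgumentException(
                            "Consecutive hyphens in comment at index " + i);
                }
                // Assumed repair: keep both hyphens but separate them with a space.
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(applyCommentPolicy("a--b", Policy.ALTER_INFOSET));
    }
}
```

Under these assumptions, ALTER_INFOSET changes the comment so that it fits XML 1.0, while FATAL rejects the input instead of altering it.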
-
setContentNonXmlCharPolicy
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy - the policy
-
setContentSpacePolicy
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy - the policy
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled - true to enable
-
getDoctypeExpectation
Returns the doctype expectation.
- Returns:
the doctypeExpectation
-
setDoctypeExpectation
Sets the doctype expectation.
- Parameters:
doctypeExpectation - the doctypeExpectation to set
-
getDocumentModeHandler
Returns the document mode handler.
- Returns:
the documentModeHandler
-
setDocumentModeHandler
Sets the document mode handler.
- Parameters:
documentModeHandler - the documentModeHandler to set
-
getStreamabilityViolationPolicy
Returns the streamabilityViolationPolicy.
- Returns:
the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata
-
getDocumentLocator
Returns the Locator during parse.
- Returns:
the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
the mappingLangToXmlLang
-
setXmlnsPolicy
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
- Parameters:
xmlnsPolicy
-
getXmlnsPolicy
Returns the xmlnsPolicy.
- Returns:
the xmlnsPolicy
-
getCommentPolicy
Returns the commentPolicy.
- Returns:
the commentPolicy
-
getContentNonXmlCharPolicy
Returns the contentNonXmlCharPolicy.
- Returns:
the contentNonXmlCharPolicy
-
getContentSpacePolicy
Returns the contentSpacePolicy.
- Returns:
the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
the reportingDoctype
-
setNamePolicy
The policy for non-NCName element and attribute names.
- Parameters:
namePolicy
-
setHeuristics
Sets the encoding sniffing heuristics.
- Parameters:
heuristics - the heuristics to set
-
getHeuristics
-
setXmlPolicy
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy
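The fan-out described for setXmlPolicy can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the HtmlBuilder source: XmlPolicyFanOutSketch and its local Policy enum are hypothetical, the initial field values are arbitrary, and both the rejection of FATAL by the xmlns setter and the downgrade of FATAL to ALTER_INFOSET inside the catch-all are assumptions derived from the "FATAL not allowed" note on setXmlnsPolicy.

```java
// Hypothetical stand-in for the policy fields of HtmlBuilder; not library code.
final class XmlPolicyFanOutSketch {

    // Local enum mirroring the two policy names used on this page.
    enum Policy { ALTER_INFOSET, FATAL }

    // Field names follow the field list on this page; initial values are arbitrary.
    Policy namePolicy = Policy.ALTER_INFOSET;
    Policy xmlnsPolicy = Policy.ALTER_INFOSET;
    Policy contentSpacePolicy = Policy.ALTER_INFOSET;
    Policy contentNonXmlCharPolicy = Policy.ALTER_INFOSET;
    Policy commentPolicy = Policy.ALTER_INFOSET;
    Policy streamabilityViolationPolicy = Policy.ALTER_INFOSET;

    // Assumption: the xmlns setter rejects FATAL outright.
    void setXmlnsPolicy(Policy policy) {
        if (policy == Policy.FATAL) {
            throw new IllegalArgumentException("FATAL is not allowed for the xmlns policy");
        }
        xmlnsPolicy = policy;
    }

    // Catch-all: sets the five listed policies in one call.
    void setXmlPolicy(Policy policy) {
        namePolicy = policy;
        // Assumption: FATAL is downgraded here because the xmlns setter rejects it.
        setXmlnsPolicy(policy == Policy.FATAL ? Policy.ALTER_INFOSET : policy);
        contentSpacePolicy = policy;
        contentNonXmlCharPolicy = policy;
        commentPolicy = policy;
        // streamabilityViolationPolicy is deliberately left untouched.
    }

    public static void main(String[] args) {
        XmlPolicyFanOutSketch settings = new XmlPolicyFanOutSketch();
        settings.setXmlPolicy(Policy.FATAL);
        System.out.println("name=" + settings.namePolicy
                + ", xmlns=" + settings.xmlnsPolicy
                + ", streamability=" + settings.streamabilityViolationPolicy);
    }
}
```

The point of the sketch is the separation of concerns: one call updates the five XML-mapping policies, while the streamability policy has to be set through its own setter.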
-
getNamePolicy
The policy for non-NCName element and attribute names.
- Returns:
the namePolicy
-
setBogusXmlnsPolicy
Deprecated. Does nothing.
-
getBogusXmlnsPolicy
Deprecated. Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
-
setIgnoringComments
public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.
- Parameters:
ignoreComments - true to ignore comments
-