Package org.cyberneko.html.filters
Class Purifier
- java.lang.Object
  - org.cyberneko.html.filters.DefaultFilter
    - org.cyberneko.html.filters.Purifier
- All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent
public class Purifier extends DefaultFilter
This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:
- fixing illegal characters in the document, including
  - element and attribute names,
  - processing instruction target and data,
  - document text;
- ensuring the string "--" does not appear in the content of a comment;
- ensuring the string "]]>" does not appear in the content of a CDATA section;
- ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
- synthesizing missing namespace bindings.
Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
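The escaping described above can be illustrated with a small stand-alone sketch. It is not the Purifier implementation: the class EscapeSketch and its methods are hypothetical, the name check is a simplified ASCII-only stand-in for the XML Name rules, the lowercase hex digits are an assumption, and the sketch assumes that the doubled backslash shown above is Java source escaping for a single backslash in the output.

```java
final class EscapeSketch {

    // Simplified stand-in for the XML Name rules (assumption: ASCII only).
    static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // XML 1.0 Char production applied to single UTF-16 units. Surrogate
    // halves are passed through so supplementary characters survive.
    static boolean isContentChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    // Four hex digits, zero padded (lowercase is an assumption).
    static String hex4(char c) {
        return String.format("%04x", (int) c);
    }

    // Replaces each illegal name character with "_u", four hex digits, "_".
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean legal = (i == 0) ? isNameStart(c) : isNameChar(c);
            if (legal) {
                out.append(c);
            } else {
                out.append("_u").append(hex4(c)).append('_');
            }
        }
        return out.toString();
    }

    // Replaces each illegal content character with a backslash, the
    // letter u, and four hex digits.
    static String escapeContent(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isContentChar(c)) {
                out.append(c);
            } else {
                out.append('\\').append('u').append(hex4(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeName("my tag"));                   // my_u0020_tag
        System.out.println(escapeContent("a" + (char) 0x1 + "b"));
    }
}
```

The real filter decides legality with the full XML character tables rather than the ASCII subset used here, so results for non-ASCII names may differ.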
In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
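As a rough illustration of the two replacements, the following sketch applies them to plain strings. The class DelimiterSketch and its method names are hypothetical and operate on java.lang.String rather than on XMLString; only the replacement rule itself is taken from the description above.

```java
final class DelimiterSketch {

    // Appends a space after every '-' so that "--" cannot occur.
    static String purifyComment(String text) {
        return text.replace("-", "- ");
    }

    // Appends a space after every ']' so that "]]" cannot occur.
    static String purifyCData(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println(purifyComment("a--b"));   // a- - b
        System.out.println(purifyCData("x]]>y"));    // x] ] >y
    }
}
```

Because every occurrence is padded, not only the doubled ones, the output can never contain the forbidden sequence, at the cost of changing content that was already legal.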
The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
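One way to generate such URIs is a per-instance counter, sketched below. The class NamespaceSketch and the method nextUri are hypothetical, and starting the count at 1 is an assumption; the documentation only states that the number is generated to ensure uniqueness.

```java
final class NamespaceSketch {

    // Base URI taken from the class description.
    static final String BASE = "http://cyberneko.org/html/ns/synthesized/";

    // Number of URIs handed out so far (plays the role of a binding count).
    private int count;

    // Returns a URI that this instance has not returned before.
    String nextUri() {
        count++;
        return BASE + count;
    }

    public static void main(String[] args) {
        NamespaceSketch ns = new NamespaceSketch();
        System.out.println(ns.nextUri());
        System.out.println(ns.nextUri());
    }
}
```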
- Version:
- $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
- Author:
- Andy Clark
Field Summary
Fields (modifier and type, field, description):
- protected static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- protected boolean fAugmentations: Augmentations.
- protected boolean fInCDATASection: True if inside a CDATA section.
- protected org.apache.xerces.xni.NamespaceContext fNamespaceContext: Namespace information.
- protected boolean fNamespaces: Namespaces.
- protected java.lang.String fPublicId: Public identifier of doctype declaration.
- protected boolean fSeenDoctype: True if the doctype declaration was seen.
- protected boolean fSeenRootElement: True if root element was seen.
- protected int fSynthesizedNamespaceCount: Synthesized namespace binding count.
- protected java.lang.String fSystemId: System identifier of doctype declaration.
- protected static java.lang.String NAMESPACES: Namespaces.
- protected static HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
- static java.lang.String SYNTHESIZED_NAMESPACE_PREFX: Synthesized namespace binding prefix.
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
Constructor Summary
Constructors:
- Purifier()
Method Summary
Methods (modifier and type, method, description):
- void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Characters.
- void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Comment.
- void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs): Doctype declaration.
- void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Empty element.
- void endCDATA(org.apache.xerces.xni.Augmentations augs): End CDATA section.
- void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs): End element.
- protected void handleStartDocument(): Handle start document.
- protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs): Handle start element.
- void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs): Processing instruction.
- protected java.lang.String purifyName(java.lang.String name, boolean localpart): Purify name.
- protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname): Purify qualified name.
- protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text): Purify content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- void startCDATA(org.apache.xerces.xni.Augmentations augs): Start CDATA section.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs): Start document.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs): Start document.
- void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Start element.
- protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns): Synthesize namespace binding.
- protected org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
- protected static java.lang.String toHexString(int c, int padlen): Returns a padded hexadecimal string for the given value.
- void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs): XML declaration.
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
Field Detail
-
SYNTHESIZED_NAMESPACE_PREFX
public static final java.lang.String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.
- See Also:
- Constant Field Values
-
NAMESPACES
protected static final java.lang.String NAMESPACES
Namespaces.
- See Also:
- Constant Field Values
-
AUGMENTATIONS
protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.
- See Also:
- Constant Field Values
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fNamespaces
protected boolean fNamespaces
Namespaces.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fSeenDoctype
protected boolean fSeenDoctype
True if the doctype declaration was seen.
-
fSeenRootElement
protected boolean fSeenRootElement
True if root element was seen.
-
fInCDATASection
protected boolean fInCDATASection
True if inside a CDATA section.
-
fPublicId
protected java.lang.String fPublicId
Public identifier of doctype declaration.
-
fSystemId
protected java.lang.String fSystemId
System identifier of doctype declaration.
-
fNamespaceContext
protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
Namespace information.
-
fSynthesizedNamespaceCount
protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.
Method Detail
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
- Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
- Overrides:
reset in class DefaultFilter
- Parameters:
manager - The component manager.
- Throws:
org.apache.xerces.xni.parser.XMLConfigurationException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Overrides:
startDocument in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Specified by:
startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startDocument in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
xmlDecl
public void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
XML declaration.
- Specified by:
xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
xmlDecl in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
comment
public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Comment.
- Specified by:
comment in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
comment in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
processingInstruction
public void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Processing instruction.
- Specified by:
processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
processingInstruction in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
doctypeDecl
public void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Doctype declaration.
- Specified by:
doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
doctypeDecl in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startElement
public void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start element.
- Specified by:
startElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
emptyElement
public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Empty element.
- Specified by:
emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
emptyElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
startCDATA
public void startCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start CDATA section.
- Specified by:
startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
startCDATA in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
endCDATA
public void endCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End CDATA section.
- Specified by:
endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
endCDATA in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
characters
public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Characters.
- Specified by:
characters in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
characters in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
endElement
public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End element.
- Specified by:
endElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
endElement in class DefaultFilter
- Throws:
org.apache.xerces.xni.XNIException
-
handleStartDocument
protected void handleStartDocument()
Handle start document.
-
handleStartElement
protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
Handle start element.
-
synthesizeBinding
protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns)
Synthesize namespace binding.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
purifyQName
protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
Purify qualified name.
-
purifyName
protected java.lang.String purifyName(java.lang.String name, boolean localpart)
Purify name.
-
purifyText
protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
Purify content.
-
toHexString
protected static java.lang.String toHexString(int c, int padlen)
Returns a padded hexadecimal string for the given value.
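A padded hexadecimal conversion of this kind might look like the sketch below. This is an assumption about the behaviour, not the actual method body: the class HexPadSketch is hypothetical, and both the zero padding on the left and the lowercase digits are guesses based on the "####" placeholder in the class description.

```java
final class HexPadSketch {

    // Converts c to hexadecimal and left-pads with '0' up to padlen digits.
    // Values that need more than padlen digits are returned unpadded.
    static String toHexString(int c, int padlen) {
        StringBuilder sb = new StringBuilder(Integer.toHexString(c));
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(0x20, 4));     // 0020
        System.out.println(toHexString(0x12345, 4));  // 12345
    }
}
```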