Class Purifier

java.lang.Object
org.cyberneko.html.filters.DefaultFilter
org.cyberneko.html.filters.Purifier
All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent

public class Purifier extends DefaultFilter
This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:
  • fixing illegal characters in the document, including
    • element and attribute names,
    • processing instruction target and data,
    • document text;
  • ensuring the string "--" does not appear in the content of a comment;
  • ensuring the string "]]>" does not appear in the content of a CDATA section;
  • ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
  • synthesizing missing namespace bindings.

Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
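The name escaping described above can be illustrated with a small standalone sketch. It is not the Purifier source: the class NameEscapeSketch, its methods, the simplified legality check (letters and '_' anywhere; digits, '-' and '.' after the first character) and the four-digit zero padding are all assumptions made for illustration.

```java
// Hypothetical illustration of the "_u####_" escaping; not the Purifier implementation.
final class NameEscapeSketch {

    // Simplified legality check (assumption): real XML name rules are broader.
    static boolean isLegal(char c, boolean first) {
        if (Character.isLetter(c) || c == '_') {
            return true;
        }
        return !first && (Character.isDigit(c) || c == '-' || c == '.');
    }

    // Replaces each illegal character with "_u####_", where #### is its
    // code unit in hexadecimal, zero-padded to four digits (assumed width).
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isLegal(c, i == 0)) {
                out.append(c);
            } else {
                out.append("_u").append(String.format("%04x", (int) c)).append('_');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeName("a$b")); // prints a_u0024_b
    }
}
```

Wrapping the hexadecimal value in "_u" and "_" keeps the result a legal XML name while still leaving the original character recoverable.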

In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
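Read literally, the two replacements above amount to simple string substitutions. The sketch below follows that literal reading; the class and method names are hypothetical, and the actual filter works on XMLString buffers rather than on String.

```java
// Hypothetical illustration of the comment and CDATA replacements described above.
final class DelimiterEscapeSketch {

    // Every '-' becomes "- ", so "--" can never occur in the result.
    static String purifyComment(String text) {
        return text.replace("-", "- ");
    }

    // Every ']' becomes "] ", so "]]" (and therefore "]]>") can never occur.
    static String purifyCData(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println(purifyComment("a--b")); // prints a- - b
        System.out.println(purifyCData("x]]>y"));  // prints x] ] >y
    }
}
```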

The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
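One way to picture how such URIs stay unique is a per-instance counter appended to a fixed prefix, which is what the fSynthesizedNamespaceCount field suggests. The sketch below is an assumption-laden illustration: the class name, the nextUri method and the starting value of the counter are hypothetical.

```java
// Hypothetical illustration of generating unique synthesized namespace URIs.
final class SynthesizedNamespaceSketch {

    // Prefix taken from the URI pattern quoted above.
    static final String PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    // Assumption: the counter starts at 0; the real starting value is not documented here.
    private int count = 0;

    // Returns a new URI on every call by appending the current counter value.
    String nextUri() {
        return PREFIX + count++;
    }

    public static void main(String[] args) {
        SynthesizedNamespaceSketch sketch = new SynthesizedNamespaceSketch();
        System.out.println(sketch.nextUri());
        System.out.println(sketch.nextUri());
    }
}
```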

Version:
$Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
Author:
Andy Clark
  • Field Details

    • SYNTHESIZED_NAMESPACE_PREFX

      public static final String SYNTHESIZED_NAMESPACE_PREFX
      Synthesized namespace binding prefix.
    • NAMESPACES

      protected static final String NAMESPACES
      Namespaces.
    • AUGMENTATIONS

      protected static final String AUGMENTATIONS
      Include infoset augmentations.
    • SYNTHESIZED_ITEM

      protected static final HTMLEventInfo SYNTHESIZED_ITEM
      Synthesized event info item.
    • fNamespaces

      protected boolean fNamespaces
      Namespaces.
    • fAugmentations

      protected boolean fAugmentations
      Augmentations.
    • fSeenDoctype

      protected boolean fSeenDoctype
      True if the doctype declaration was seen.
    • fSeenRootElement

      protected boolean fSeenRootElement
      True if root element was seen.
    • fInCDATASection

      protected boolean fInCDATASection
      True if inside a CDATA section.
    • fPublicId

      protected String fPublicId
      Public identifier of doctype declaration.
    • fSystemId

      protected String fSystemId
      System identifier of doctype declaration.
    • fNamespaceContext

      protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
      Namespace information.
    • fSynthesizedNamespaceCount

      protected int fSynthesizedNamespaceCount
      Synthesized namespace binding count.
  • Constructor Details

    • Purifier

      public Purifier()
  • Method Details

    • reset

      public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
      Description copied from class: DefaultFilter
      Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
      Specified by:
      reset in interface org.apache.xerces.xni.parser.XMLComponent
      Overrides:
      reset in class DefaultFilter
      Parameters:
      manager - The component manager.
      Throws:
      org.apache.xerces.xni.parser.XMLConfigurationException
    • startDocument

      public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start document.
      Overrides:
      startDocument in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • startDocument

      public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start document.
      Specified by:
      startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      startDocument in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • xmlDecl

      public void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      XML declaration.
      Specified by:
      xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      xmlDecl in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • comment

      public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Comment.
      Specified by:
      comment in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      comment in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • processingInstruction

      public void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Processing instruction.
      Specified by:
      processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      processingInstruction in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • doctypeDecl

      public void doctypeDecl(String root, String pubid, String sysid, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Doctype declaration.
      Specified by:
      doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      doctypeDecl in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • startElement

      public void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start element.
      Specified by:
      startElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      startElement in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • emptyElement

      public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Empty element.
      Specified by:
      emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      emptyElement in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • startCDATA

      public void startCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start CDATA section.
      Specified by:
      startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      startCDATA in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • endCDATA

      public void endCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End CDATA section.
      Specified by:
      endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      endCDATA in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • characters

      public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Characters.
      Specified by:
      characters in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      characters in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • endElement

      public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End element.
      Specified by:
      endElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Overrides:
      endElement in class DefaultFilter
      Throws:
      org.apache.xerces.xni.XNIException
    • handleStartDocument

      protected void handleStartDocument()
      Handle start document.
    • handleStartElement

      protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
      Handle start element.
    • synthesizeBinding

      protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, String ns)
      Synthesize namespace binding.
    • synthesizedAugs

      protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
      Returns an augmentations object with a synthesized item added.
    • purifyQName

      protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
      Purify qualified name.
    • purifyName

      protected String purifyName(String name, boolean localpart)
      Purify name.
    • purifyText

      protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
      Purify content.
    • toHexString

      protected static String toHexString(int c, int padlen)
      Returns a padded hexadecimal string for the given value.
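
A padded hexadecimal helper of this kind might look like the following sketch. It is not the method's actual body: the class HexPadSketch, the method toPaddedHex, the zero padding character, the lowercase digits and the choice not to truncate longer values are all assumptions.

```java
// Hypothetical illustration of a padded hexadecimal helper; not the Purifier source.
final class HexPadSketch {

    // Left-pads the hexadecimal form of value with '0' up to padLen characters.
    // Values whose hexadecimal form is already longer are returned unchanged.
    static String toPaddedHex(int value, int padLen) {
        String hex = Integer.toHexString(value);
        StringBuilder out = new StringBuilder();
        for (int i = hex.length(); i < padLen; i++) {
            out.append('0');
        }
        return out.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(toPaddedHex(0x24, 4)); // prints 0024
    }
}
```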