Class Purifier

  • All Implemented Interfaces:
    org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent

    public class Purifier
    extends DefaultFilter
    This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:
    • fixing illegal characters in the document, including
      • element and attribute names,
      • processing instruction target and data,
      • document text;
    • ensuring the string "--" does not appear in the content of a comment;
    • ensuring the string "]]>" does not appear in the content of a CDATA section;
    • ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
    • synthesizing missing namespace bindings.
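
    The XML declaration step in the list above can be pictured roughly as follows. This is a minimal sketch that follows the XML 1.0 grammar (required version, optional encoding, standalone restricted to "yes" or "no"); it is not taken from the Purifier source, and the class XmlDeclSketch and its normalize method are hypothetical names.

```java
import java.util.Locale;

/** Illustrative sketch only; not the Purifier implementation. */
final class XmlDeclSketch {

    /**
     * Returns {version, encoding, standalone}. A null entry means the
     * pseudo-attribute should be omitted from the declaration.
     */
    static String[] normalize(String version, String encoding, String standalone) {
        // version is required by the XML 1.0 grammar; fall back to "1.0"
        // when it is missing or carries any other value (assumption).
        String v = "1.0".equals(version) ? version : "1.0";

        // encoding is optional; an empty name is treated as absent.
        String e = (encoding == null || encoding.isEmpty()) ? null : encoding;

        // standalone is optional and may only be "yes" or "no".
        String s = null;
        if (standalone != null) {
            String lower = standalone.toLowerCase(Locale.ROOT);
            if (lower.equals("yes") || lower.equals("no")) {
                s = lower;
            }
        }
        return new String[] { v, e, s };
    }
}
```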

    Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
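
    The two escape forms described above might look like the sketch below. Several things here are assumptions rather than facts about Purifier: the names EscapeSketch, isNameChar, escapeName and escapeText are hypothetical; the name check is a simplified ASCII-only stand-in for the XML name rules and treats the input as a local name (so ':' is escaped); uppercase hex digits are an arbitrary choice; and the doubled backslash in the description is read as source-level escaping, so the sketch emits a single backslash.

```java
/** Illustrative sketch only; not the Purifier implementation. */
final class EscapeSketch {

    /** Simplified, ASCII-only stand-in for the XML name-character rules. */
    static boolean isNameChar(char c, boolean first) {
        boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
        if (first) {
            return letter;
        }
        return letter || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    /** Replaces each character rejected by isNameChar with "_u####_". */
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isNameChar(c, i == 0)) {
                out.append(c);
            } else {
                out.append(String.format("_u%04X_", (int) c));
            }
        }
        return out.toString();
    }

    /** True for code points allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each illegal code point with a backslash, 'u', and hex digits. */
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

    Under these assumptions, escapeName("1st name") yields "_u0031_st_u0020_name", because a digit cannot start a name and a space cannot appear in one.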

    In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]", and therefore "]]>", from appearing.
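
    The replacement rule for comments and CDATA sections can be sketched with plain strings. The class DelimiterSketch and its methods are hypothetical and operate on java.lang.String rather than on the XNI XMLString type used by the filter.

```java
/** Illustrative sketch only; not the Purifier implementation. */
final class DelimiterSketch {

    /** Appends a space after every occurrence of the target character. */
    static String padAfter(String s, char target) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c);
            if (c == target) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    /** Comment content: every '-' becomes "- ". */
    static String purifyComment(String content) {
        return padAfter(content, '-');
    }

    /** CDATA content: every ']' becomes "] ". */
    static String purifyCData(String content) {
        return padAfter(content, ']');
    }
}
```

    With this rule, the comment content "a--b" becomes "a- - b" and the CDATA content "x]]>y" becomes "x] ] >y". A trailing '-' also gains a space, so the content cannot run into the closing "-->".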

    The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
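
    One way to generate such URIs is a per-document counter, sketched below. The class SynthesizedNamespaceSketch, its members, the starting value of 0, and the xmlns attribute formatting are all assumptions made for illustration; only the URI prefix comes from the description above.

```java
/** Illustrative sketch only; not the Purifier implementation. */
final class SynthesizedNamespaceSketch {

    /** URI prefix taken from the class description. */
    static final String URI_PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    /** Number of URIs handed out so far (assumed to start at 0). */
    private int count = 0;

    /** Returns a URI that this instance has not returned before. */
    String nextUri() {
        return URI_PREFIX + count++;
    }

    /** Formats a namespace declaration for an unbound prefix. */
    String bindingFor(String prefix) {
        return "xmlns:" + prefix + "=\"" + nextUri() + "\"";
    }
}
```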

    Version:
    $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
    Author:
    Andy Clark
    • Constructor Summary

      Constructor Description
      Purifier()  
    • Method Summary

      Modifier and Type Method Description
      void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
      Characters.
      void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
      Comment.
      void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs)
      Doctype declaration.
      void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
      Empty element.
      void endCDATA(org.apache.xerces.xni.Augmentations augs)
      End CDATA section.
      void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs)
      End element.
      protected void handleStartDocument()
      Handle start document.
      protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
      Handle start element.
      void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs)
      Processing instruction.
      protected java.lang.String purifyName(java.lang.String name, boolean localpart)
      Purify name.
      protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
      Purify qualified name.
      protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
      Purify content.
      void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
      Resets the component.
      void startCDATA(org.apache.xerces.xni.Augmentations augs)
      Start CDATA section.
      void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
      Start document.
      void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs)
      Start document.
      void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
      Start element.
      protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns)
      Synthesize namespace binding.
      protected org.apache.xerces.xni.Augmentations synthesizedAugs()
      Returns an augmentations object with a synthesized item added.
      protected static java.lang.String toHexString(int c, int padlen)
      Returns a padded hexadecimal string for the given value.
      void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs)
      XML declaration.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SYNTHESIZED_NAMESPACE_PREFX

        public static final java.lang.String SYNTHESIZED_NAMESPACE_PREFX
        Synthesized namespace binding prefix.
        See Also:
        Constant Field Values
      • NAMESPACES

        protected static final java.lang.String NAMESPACES
        Namespaces.
        See Also:
        Constant Field Values
      • AUGMENTATIONS

        protected static final java.lang.String AUGMENTATIONS
        Include infoset augmentations.
        See Also:
        Constant Field Values
      • SYNTHESIZED_ITEM

        protected static final HTMLEventInfo SYNTHESIZED_ITEM
        Synthesized event info item.
      • fNamespaces

        protected boolean fNamespaces
        Namespaces.
      • fAugmentations

        protected boolean fAugmentations
        Augmentations.
      • fSeenDoctype

        protected boolean fSeenDoctype
        True if the doctype declaration was seen.
      • fSeenRootElement

        protected boolean fSeenRootElement
        True if root element was seen.
      • fInCDATASection

        protected boolean fInCDATASection
        True if inside a CDATA section.
      • fPublicId

        protected java.lang.String fPublicId
        Public identifier of doctype declaration.
      • fSystemId

        protected java.lang.String fSystemId
        System identifier of doctype declaration.
      • fNamespaceContext

        protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
        Namespace information.
      • fSynthesizedNamespaceCount

        protected int fSynthesizedNamespaceCount
        Synthesized namespace binding count.
    • Constructor Detail

      • Purifier

        public Purifier()
    • Method Detail

      • reset

        public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
                   throws org.apache.xerces.xni.parser.XMLConfigurationException
        Description copied from class: DefaultFilter
        Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
        Specified by:
        reset in interface org.apache.xerces.xni.parser.XMLComponent
        Overrides:
        reset in class DefaultFilter
        Parameters:
        manager - The component manager.
        Throws:
        org.apache.xerces.xni.parser.XMLConfigurationException
      • startDocument

        public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                                  java.lang.String encoding,
                                  org.apache.xerces.xni.Augmentations augs)
                           throws org.apache.xerces.xni.XNIException
        Start document.
        Overrides:
        startDocument in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • startDocument

        public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                                  java.lang.String encoding,
                                  org.apache.xerces.xni.NamespaceContext nscontext,
                                  org.apache.xerces.xni.Augmentations augs)
                           throws org.apache.xerces.xni.XNIException
        Start document.
        Specified by:
        startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        startDocument in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • xmlDecl

        public void xmlDecl(java.lang.String version,
                            java.lang.String encoding,
                            java.lang.String standalone,
                            org.apache.xerces.xni.Augmentations augs)
                     throws org.apache.xerces.xni.XNIException
        XML declaration. Ensures that the declaration has the required pseudo-attributes and that their values are correct.
        Specified by:
        xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        xmlDecl in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • comment

        public void comment(org.apache.xerces.xni.XMLString text,
                            org.apache.xerces.xni.Augmentations augs)
                     throws org.apache.xerces.xni.XNIException
        Comment. Each '-' in the content is replaced by "- " so that "--" never appears.
        Specified by:
        comment in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        comment in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • processingInstruction

        public void processingInstruction(java.lang.String target,
                                          org.apache.xerces.xni.XMLString data,
                                          org.apache.xerces.xni.Augmentations augs)
                                   throws org.apache.xerces.xni.XNIException
        Processing instruction. Illegal characters in the target and data are fixed.
        Specified by:
        processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        processingInstruction in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • doctypeDecl

        public void doctypeDecl(java.lang.String root,
                                java.lang.String pubid,
                                java.lang.String sysid,
                                org.apache.xerces.xni.Augmentations augs)
                         throws org.apache.xerces.xni.XNIException
        Doctype declaration.
        Specified by:
        doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        doctypeDecl in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • startElement

        public void startElement(org.apache.xerces.xni.QName element,
                                 org.apache.xerces.xni.XMLAttributes attrs,
                                 org.apache.xerces.xni.Augmentations augs)
                          throws org.apache.xerces.xni.XNIException
        Start element.
        Specified by:
        startElement in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        startElement in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • emptyElement

        public void emptyElement(org.apache.xerces.xni.QName element,
                                 org.apache.xerces.xni.XMLAttributes attrs,
                                 org.apache.xerces.xni.Augmentations augs)
                          throws org.apache.xerces.xni.XNIException
        Empty element.
        Specified by:
        emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        emptyElement in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • startCDATA

        public void startCDATA(org.apache.xerces.xni.Augmentations augs)
                        throws org.apache.xerces.xni.XNIException
        Start CDATA section.
        Specified by:
        startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        startCDATA in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • endCDATA

        public void endCDATA(org.apache.xerces.xni.Augmentations augs)
                      throws org.apache.xerces.xni.XNIException
        End CDATA section.
        Specified by:
        endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        endCDATA in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • characters

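        public void characters(org.apache.xerces.xni.XMLString text,
                               org.apache.xerces.xni.Augmentations augs)
                        throws org.apache.xerces.xni.XNIException
        Characters. Illegal characters in the text are fixed; inside a CDATA section, each ']' is also replaced by "] ".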
        Specified by:
        characters in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        characters in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • endElement

        public void endElement(org.apache.xerces.xni.QName element,
                               org.apache.xerces.xni.Augmentations augs)
                        throws org.apache.xerces.xni.XNIException
        End element.
        Specified by:
        endElement in interface org.apache.xerces.xni.XMLDocumentHandler
        Overrides:
        endElement in class DefaultFilter
        Throws:
        org.apache.xerces.xni.XNIException
      • handleStartDocument

        protected void handleStartDocument()
        Handle start document.
      • handleStartElement

        protected void handleStartElement(org.apache.xerces.xni.QName element,
                                          org.apache.xerces.xni.XMLAttributes attrs)
        Handle start element.
      • synthesizeBinding

        protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs,
                                         java.lang.String ns)
        Synthesize namespace binding.
      • synthesizedAugs

        protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
        Returns an augmentations object with a synthesized item added.
      • purifyQName

        protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
        Purify qualified name.
      • purifyName

        protected java.lang.String purifyName(java.lang.String name,
                                              boolean localpart)
        Purify name. Illegal characters are converted to the character sequence "_u####_".
      • purifyText

        protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
        Purify content. Illegal characters are converted to the character sequence "\\u####".
      • toHexString

        protected static java.lang.String toHexString(int c,
                                                      int padlen)
        Returns a padded hexadecimal string for the given value.
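
        A padded hexadecimal conversion of this kind could be written as in the sketch below. The class HexSketch is hypothetical, and two details are assumptions: padding is done with leading zeros, and the digits are uppercased.

```java
import java.util.Locale;

/** Illustrative sketch only; not the Purifier implementation. */
final class HexSketch {

    /**
     * Returns the hexadecimal form of c, left-padded with '0' until it is
     * at least padlen characters long. Longer values are not truncated.
     */
    static String toHexString(int c, int padlen) {
        StringBuilder sb = new StringBuilder(
            Integer.toHexString(c).toUpperCase(Locale.ROOT));
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }
}
```

        For example, HexSketch.toHexString(0x20, 4) returns "0020", while a value that needs more than padlen digits, such as 0x1F600, is returned in full as "1F600".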