Class HTMLTagBalancer

  • All Implemented Interfaces:
    HTMLComponent, XMLComponent, XMLDocumentFilter, XMLDocumentSource, XMLDocumentHandler

    public class HTMLTagBalancer
    extends java.lang.Object
    implements XMLDocumentFilter, HTMLComponent
    Balances tags in an HTML document. This component receives document events and tries to correct many common mistakes that human (and computer) HTML document authors make. This tag balancer can:
    • add missing parent elements;
    • automatically close elements with optional end tags; and
    • handle mismatched inline element tags.
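
    The last two behaviours listed above can be illustrated with a small, self-contained sketch. It is not the HTMLTagBalancer implementation: the class name BalanceSketch, the token format ("name" for a start tag, "/name" for an end tag), and the tiny rule table that treats only p and li as having optional end tags are all assumptions made for illustration. Adding missing parent elements is not covered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class BalanceSketch {
    // Hypothetical rule table: elements whose end tag is optional and which
    // are implicitly closed when the same element is opened again.
    private static final List<String> OPTIONAL_END = Arrays.asList("p", "li");

    /** Tokens are "name" for a start tag and "/name" for an end tag. */
    static List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // innermost open element first
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                // Close an open element with an optional end tag before reopening it.
                if (OPTIONAL_END.contains(token) && token.equals(open.peek())) {
                    out.add("/" + open.pop());
                }
                open.push(token);
                out.add(token);
            } else {
                String name = token.substring(1);
                if (!open.contains(name)) {
                    continue; // stray end tag with no matching open element: drop it
                }
                // Close everything that was opened inside the matching element first.
                while (!open.peek().equals(name)) {
                    out.add("/" + open.pop());
                }
                out.add("/" + open.pop());
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.add("/" + open.pop());
        }
        return out;
    }
}
```

    Under these assumptions, the input p, b, /p, p is rewritten to p, b, /b, /p, p, /p: the unclosed b is closed before its parent, and the trailing p is closed at the end of the input.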

    This component recognizes the following features:

    • http://cyberneko.org/html/features/augmentations
    • http://cyberneko.org/html/features/report-errors
    • http://cyberneko.org/html/features/balance-tags/document-fragment
    • http://cyberneko.org/html/features/balance-tags/ignore-outside-content

    This component recognizes the following properties:

    • http://cyberneko.org/html/properties/names/elems
    • http://cyberneko.org/html/properties/names/attrs
    • http://cyberneko.org/html/properties/error-reporter
    • http://cyberneko.org/html/properties/balance-tags/current-stack
    See Also:
    HTMLElements
    • Field Detail

      • NAMESPACES

        protected static final java.lang.String NAMESPACES
        Namespaces.
        See Also:
        Constant Field Values
      • AUGMENTATIONS

        protected static final java.lang.String AUGMENTATIONS
        Include infoset augmentations.
        See Also:
        Constant Field Values
      • REPORT_ERRORS

        protected static final java.lang.String REPORT_ERRORS
        Report errors.
        See Also:
        Constant Field Values
      • DOCUMENT_FRAGMENT

        protected static final java.lang.String DOCUMENT_FRAGMENT
        Document fragment balancing only.
        See Also:
        Constant Field Values
      • IGNORE_OUTSIDE_CONTENT

        protected static final java.lang.String IGNORE_OUTSIDE_CONTENT
        Ignore outside content.
        See Also:
        Constant Field Values
      • RECOGNIZED_FEATURES

        private static final java.lang.String[] RECOGNIZED_FEATURES
        Recognized features.
      • RECOGNIZED_FEATURES_DEFAULTS

        private static final java.lang.Boolean[] RECOGNIZED_FEATURES_DEFAULTS
        Recognized features defaults.
      • NAMES_ELEMS

        protected static final java.lang.String NAMES_ELEMS
        Modify HTML element names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • NAMES_ATTRS

        protected static final java.lang.String NAMES_ATTRS
        Modify HTML attribute names: { "upper", "lower", "default" }.
        See Also:
        Constant Field Values
      • ERROR_REPORTER

        protected static final java.lang.String ERROR_REPORTER
        Error reporter.
        See Also:
        Constant Field Values
      • FRAGMENT_CONTEXT_STACK

        public static final java.lang.String FRAGMENT_CONTEXT_STACK
        EXPERIMENTAL: may change in next release. Name of the property holding the stack of elements in which context a document fragment should be parsed.
        See Also:
        Constant Field Values
      • RECOGNIZED_PROPERTIES

        private static final java.lang.String[] RECOGNIZED_PROPERTIES
        Recognized properties.
      • RECOGNIZED_PROPERTIES_DEFAULTS

        private static final java.lang.Object[] RECOGNIZED_PROPERTIES_DEFAULTS
        Recognized properties defaults.
      • NAMES_NO_CHANGE

        private static final short NAMES_NO_CHANGE
        Don't modify HTML names.
        See Also:
        Constant Field Values
      • NAMES_UPPERCASE

        private static final short NAMES_UPPERCASE
        Uppercase HTML names.
        See Also:
        Constant Field Values
      • NAMES_LOWERCASE

        private static final short NAMES_LOWERCASE
        Lowercase HTML names.
        See Also:
        Constant Field Values
      • SYNTHESIZED_ITEM

        private static final HTMLEventInfo SYNTHESIZED_ITEM
        Synthesized event info item.
      • fNamespaces

        protected boolean fNamespaces
        Namespaces.
      • fAugmentations

        protected boolean fAugmentations
        Include infoset augmentations.
      • fReportErrors

        protected boolean fReportErrors
        Report errors.
      • fDocumentFragment

        protected boolean fDocumentFragment
        Document fragment balancing only.
      • fTemplateFragment

        protected boolean fTemplateFragment
        Template document fragment balancing only.
      • fIgnoreOutsideContent

        protected boolean fIgnoreOutsideContent
        Ignore outside content.
      • fAllowSelfclosingIframe

        protected boolean fAllowSelfclosingIframe
        Allows self closing iframe tags.
      • fAllowSelfclosingTags

        protected boolean fAllowSelfclosingTags
        Allows self closing tags.
      • fNamesElems

        protected short fNamesElems
        Modify HTML element names.
      • fSeenAnything

        protected boolean fSeenAnything
        True if anything has been seen. Important for the XML declaration.
      • fSeenDoctype

        protected boolean fSeenDoctype
        True if the doctype declaration has been seen.
      • fSeenRootElement

        protected boolean fSeenRootElement
        True if root element has been seen.
      • fSeenRootElementEnd

        protected boolean fSeenRootElementEnd
        True if the end of the document element has been seen. In other words, this variable remains false until the </HTML> end tag is seen (or synthesized). This is used to ensure that extraneous events after the end of the document element do not make the document stream ill-formed.
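
        As a rough illustration of how such a flag can guard the event stream, the sketch below simply drops element events that arrive after the root end tag. This is only one possible strategy and an assumption on our part: the class RootEndSketch and its methods are hypothetical, and the actual component may handle late events differently (for example by buffering or relocating them).

```java
import java.util.ArrayList;
import java.util.List;

final class RootEndSketch {
    private boolean seenRootElementEnd; // false until the root end tag is seen
    private final List<String> forwarded = new ArrayList<>();

    void startElement(String name) {
        if (seenRootElementEnd) {
            return; // extraneous content after the document element: drop it
        }
        forwarded.add("<" + name + ">");
    }

    void endElement(String name) {
        if (seenRootElementEnd) {
            return; // a second root end tag would make the stream ill-formed
        }
        forwarded.add("</" + name + ">");
        if ("html".equalsIgnoreCase(name)) {
            seenRootElementEnd = true;
        }
    }

    /** Events that were passed on, in order. */
    List<String> forwarded() {
        return forwarded;
    }
}
```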
      • fSeenHeadElement

        protected boolean fSeenHeadElement
        True if the head element has been seen.
      • fSeenBodyElement

        protected boolean fSeenBodyElement
        True if the body element has been seen.
      • fSeenBodyElementEnd

        private boolean fSeenBodyElementEnd
      • fSeenFramesetElement

        private boolean fSeenFramesetElement
        True if the frameset element has been seen.
      • fSeenCharacters

        private boolean fSeenCharacters
        True if non-whitespace characters have been seen.
      • fOpenedForm

        protected boolean fOpenedForm
        True if a form is in the stack (allows the opening of nested forms to be discarded).
      • fOpenedSvg

        protected boolean fOpenedSvg
        True if an svg element is in the stack (no parent checking takes place).
      • fOpenedSelect

        protected boolean fOpenedSelect
        True if a select element is in the stack.
      • fQName

        private final QName fQName
        A qualified name.
      • lostText_

        private final LostText lostText_
      • forcedStartElement_

        private boolean forcedStartElement_
      • forcedEndElement_

        private boolean forcedEndElement_
      • fragmentContextStack_

        private QName[] fragmentContextStack_
        Stack of elements determining the context in which a document fragment should be parsed
      • fragmentContextStackSize_

        private int fragmentContextStackSize_
      • discardedStartElements

        private final java.util.List<java.lang.String> discardedStartElements
    • Method Detail

      • getFeatureDefault

        public java.lang.Boolean getFeatureDefault​(java.lang.String featureId)
        Returns the default state for a feature.
        Specified by:
        getFeatureDefault in interface HTMLComponent
        Specified by:
        getFeatureDefault in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        Returns:
        the default state for a feature, or null if this component does not want to report a default value for this feature.
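
        The contract above (a default value, or null when none is reported) can be sketched as a lookup over two parallel arrays, mirroring the RECOGNIZED_FEATURES and RECOGNIZED_FEATURES_DEFAULTS fields. Whether the component pairs the arrays this way is an assumption. The feature identifiers come from the class description, but the default values below are placeholders and the class name FeatureDefaultsSketch is hypothetical.

```java
final class FeatureDefaultsSketch {
    // Feature identifiers as listed in the class description.
    private static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/augmentations",
        "http://cyberneko.org/html/features/report-errors",
        "http://cyberneko.org/html/features/balance-tags/document-fragment",
        "http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
    };

    // Placeholder defaults, assumed for illustration only.
    // A null entry means "no default value reported for this feature".
    private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        null, null, Boolean.FALSE, Boolean.FALSE,
    };

    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null; // unrecognized feature: nothing to report
    }
}
```

        Callers therefore need to treat null as "no opinion" rather than as false.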
      • getPropertyDefault

        public java.lang.Object getPropertyDefault​(java.lang.String propertyId)
        Returns the default state for a property.
        Specified by:
        getPropertyDefault in interface HTMLComponent
        Specified by:
        getPropertyDefault in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        Returns:
        the default state for a property, or null if this component does not want to report a default value for this property
      • getRecognizedFeatures

        public java.lang.String[] getRecognizedFeatures()
        Returns recognized features.
        Specified by:
        getRecognizedFeatures in interface XMLComponent
        Returns:
        an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
      • getRecognizedProperties

        public java.lang.String[] getRecognizedProperties()
        Returns recognized properties.
        Specified by:
        getRecognizedProperties in interface XMLComponent
        Returns:
        an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
      • setFeature

        public void setFeature​(java.lang.String featureId,
                               boolean state)
                        throws XMLConfigurationException
        Sets a feature.
        Specified by:
        setFeature in interface XMLComponent
        Parameters:
        featureId - The feature identifier.
        state - The state of the feature.
        Throws:
        XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
      • setProperty

        public void setProperty​(java.lang.String propertyId,
                                java.lang.Object value)
                         throws XMLConfigurationException
        Sets a property.
        Specified by:
        setProperty in interface XMLComponent
        Parameters:
        propertyId - The property identifier.
        value - The value of the property.
        Throws:
        XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
      • startDocument

        public void startDocument​(XMLLocator locator,
                                  java.lang.String encoding,
                                  NamespaceContext nscontext,
                                  Augmentations augs)
                           throws XNIException
        Start document.
        Specified by:
        startDocument in interface XMLDocumentHandler
        Parameters:
        locator - The document locator, or null if the document location cannot be reported during the parsing of this document. However, it is strongly recommended that a locator be supplied that can at least report the system identifier of the document.
        encoding - The auto-detected IANA encoding name of the entity stream. This value will be null in those situations where the entity encoding is not auto-detected (e.g. internal entities or a document entity that is parsed from a java.io.Reader).
        nscontext - The namespace context in effect at the start of this document. This object represents the current context. Implementors of this class are responsible for copying the namespace bindings from the current context (and its parent contexts) if that information is important.
        augs - Additional information that may include infoset augmentations
        Throws:
        XNIException - Thrown by handler to signal an error.
      • xmlDecl

        public void xmlDecl​(java.lang.String version,
                            java.lang.String encoding,
                            java.lang.String standalone,
                            Augmentations augs)
                     throws XNIException
        XML declaration.
        Specified by:
        xmlDecl in interface XMLDocumentHandler
        Parameters:
        version - The XML version.
        encoding - The IANA encoding name of the document, or null if not specified.
        standalone - The standalone value, or null if not specified.
        augs - Additional information that may include infoset augmentations
        Throws:
        XNIException - Thrown by handler to signal an error.
      • doctypeDecl

        public void doctypeDecl​(java.lang.String rootElementName,
                                java.lang.String publicId,
                                java.lang.String systemId,
                                Augmentations augs)
                         throws XNIException
        Doctype declaration.
        Specified by:
        doctypeDecl in interface XMLDocumentHandler
        Parameters:
        rootElementName - The name of the root element.
        publicId - The public identifier if an external DTD or null if the external DTD is specified using SYSTEM.
        systemId - The system identifier if an external DTD, null otherwise.
        augs - Additional information that may include infoset augmentations
        Throws:
        XNIException - Thrown by handler to signal an error.
      • consumeBufferedEndElements

        private void consumeBufferedEndElements()
        Consumes elements that have been buffered, such as </body> and </html>, which are consumed only at the end of the document.
      • consumeEarlyTextIfNeeded

        private void consumeEarlyTextIfNeeded()
      • processingInstruction

        public void processingInstruction​(java.lang.String target,
                                          XMLString data,
                                          Augmentations augs)
                                   throws XNIException
        Processing instruction.
        Specified by:
        processingInstruction in interface XMLDocumentHandler
        Parameters:
        target - The target.
        data - The data or null if none specified.
        augs - Additional information that may include infoset augmentations
        Throws:
        XNIException - Thrown by handler to signal an error.
      • forceStartElement

        private boolean forceStartElement​(QName elem,
                                          XMLAttributes attrs,
                                          Augmentations augs)
                                   throws XNIException
        Forces an element start, taking care to set the information that allows startElement to "see" that the element has been forced.
        Returns:
        true if creation could be done (TABLE's creation for instance can't be forced)
        Throws:
        XNIException
      • createQName

        private QName createQName​(java.lang.String tagName)
      • forceStartBody

        private void forceStartBody()
        Generates a missing <body> (which creates a missing <head> when needed).
      • addBodyIfNeeded

        private void addBodyIfNeeded​(short element)
      • getElementDepth

        protected final int getElementDepth​(HTMLElements.Element element)
        Parameters:
        element - The element.
        Returns:
        the depth of the open tag associated with the specified element name or -1 if no matching element is found.
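
        A minimal sketch of this kind of lookup is shown below. It assumes that depth is counted from the innermost open element (which has depth 1) and that element names are compared case-insensitively. Neither assumption is stated in this page, and the class DepthSketch with its list-based stack is hypothetical.

```java
import java.util.List;

final class DepthSketch {
    /**
     * openElements is ordered from outermost to innermost.
     * Assumption: the innermost open element has depth 1.
     */
    static int getElementDepth(List<String> openElements, String name) {
        // Search from the innermost element outwards so the nearest match wins.
        for (int i = openElements.size() - 1; i >= 0; i--) {
            if (openElements.get(i).equalsIgnoreCase(name)) {
                return openElements.size() - i;
            }
        }
        return -1; // no matching open element
    }
}
```

        With the open elements html, body, p, b, this sketch reports depth 1 for b, depth 2 for p, and -1 for table.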
      • getParentDepth

        protected int getParentDepth​(HTMLElements.Element[] parents,
                                     short bounds)
        Parameters:
        parents - The parent elements.
        bounds - bounds
        Returns:
        the depth of the open tag associated with the specified element parent names or -1 if no matching element is found.
      • synthesizedAugs

        protected final Augmentations synthesizedAugs()
      • modifyName

        protected static java.lang.String modifyName​(java.lang.String name,
                                                     short mode)
      • getNamesValue

        protected static short getNamesValue​(java.lang.String value)
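
        The two helpers above carry no description. Based on the NAMES_ELEMS and NAMES_ATTRS property values ("upper", "lower", "default") and the NAMES_* constants, their roles can plausibly be sketched as follows. The numeric constant values, the handling of unrecognized strings, and the class name NamesSketch are assumptions made for illustration.

```java
import java.util.Locale;

final class NamesSketch {
    // Hypothetical constant values; the real values are not shown on this page.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps a property value such as "upper" to one of the mode constants. */
    static short getNamesValue(String value) {
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        return NAMES_NO_CHANGE; // "default" and anything unrecognized
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```

        For example, NamesSketch.modifyName("Body", NamesSketch.getNamesValue("upper")) yields "BODY", while the "default" mode leaves the name unchanged.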
      • notifyDiscardedStartElement

        private void notifyDiscardedStartElement​(QName elem,
                                                 XMLAttributes attrs,
                                                 Augmentations augs)
        Notifies the tagBalancingListener (if any) of an ignored start element
      • notifyDiscardedEndElement

        private void notifyDiscardedEndElement​(QName element,
                                               Augmentations augs)
        Notifies the tagBalancingListener (if any) of an ignored end element