Class HTMLTagBalancer

java.lang.Object
org.htmlunit.cyberneko.HTMLTagBalancer
All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentFilter, XMLDocumentSource, XMLDocumentHandler

public class HTMLTagBalancer extends Object implements XMLDocumentFilter, HTMLComponent
Balances tags in an HTML document. This component receives document events and tries to correct many common mistakes that human (and computer) HTML document authors make. This tag balancer can:
  • add missing parent elements;
  • automatically close elements with optional end tags; and
  • handle mis-matched inline element tags.
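The general idea behind the corrections listed above can be pictured with a small stack of open elements. The following is only a toy sketch, not the algorithm used by HTMLTagBalancer: the class name `TagBalancingSketch`, the token format, and the rule that only `li` has an optional end tag are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy illustration of stack-based tag balancing (hypothetical, not the library code). */
class TagBalancingSketch {

    /**
     * Takes tokens such as "<ul>", "</ul>" or plain text and returns a balanced
     * token list. Assumption: only "li" is treated as having an optional end tag.
     */
    static List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue; // stray end tag with no matching start: drop it
                }
                // Close everything that was opened after the matching start tag.
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");
                }
                open.pop();
                out.add(token);
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1);
                // A new <li> implicitly closes a directly enclosing open <li>.
                if (name.equals("li") && "li".equals(open.peek())) {
                    out.add("</" + open.pop() + ">");
                }
                open.push(name);
                out.add(token);
            } else {
                out.add(token); // character data passes through unchanged
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [<ul>, <li>, a, </li>, <li>, b, </li>, </ul>]
        System.out.println(balance(List.of("<ul>", "<li>", "a", "<li>", "b", "</ul>")));
    }
}
```

The real component works on XNI document events rather than strings and consults per-element metadata, so this sketch only conveys the stack discipline.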

This component recognizes the following features:

  • http://cyberneko.org/html/features/augmentations
  • http://cyberneko.org/html/features/report-errors
  • http://cyberneko.org/html/features/balance-tags/document-fragment
  • http://cyberneko.org/html/features/balance-tags/ignore-outside-content

This component recognizes the following properties:

  • http://cyberneko.org/html/properties/names/elems
  • http://cyberneko.org/html/properties/names/attrs
  • http://cyberneko.org/html/properties/error-reporter
  • http://cyberneko.org/html/properties/balance-tags/current-stack
  • Field Details

    • NAMESPACES

      protected static final String NAMESPACES
      Namespaces.
    • AUGMENTATIONS

      protected static final String AUGMENTATIONS
      Include infoset augmentations.
    • REPORT_ERRORS

      protected static final String REPORT_ERRORS
      Report errors.
    • DOCUMENT_FRAGMENT

      protected static final String DOCUMENT_FRAGMENT
      Document fragment balancing only.
    • IGNORE_OUTSIDE_CONTENT

      protected static final String IGNORE_OUTSIDE_CONTENT
      Ignore outside content.
    • RECOGNIZED_FEATURES

      private static final String[] RECOGNIZED_FEATURES
      Recognized features.
    • RECOGNIZED_FEATURES_DEFAULTS

      private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS
      Recognized features defaults.
    • NAMES_ELEMS

      protected static final String NAMES_ELEMS
      Modify HTML element names: { "upper", "lower", "default" }.
    • NAMES_ATTRS

      protected static final String NAMES_ATTRS
      Modify HTML attribute names: { "upper", "lower", "default" }.
    • ERROR_REPORTER

      protected static final String ERROR_REPORTER
      Error reporter.
    • FRAGMENT_CONTEXT_STACK

      public static final String FRAGMENT_CONTEXT_STACK
      EXPERIMENTAL: may change in next release. Name of the property holding the stack of elements in whose context a document fragment should be parsed.
    • RECOGNIZED_PROPERTIES

      private static final String[] RECOGNIZED_PROPERTIES
      Recognized properties.
    • RECOGNIZED_PROPERTIES_DEFAULTS

      private static final Object[] RECOGNIZED_PROPERTIES_DEFAULTS
      Recognized properties defaults.
    • NAMES_NO_CHANGE

      private static final short NAMES_NO_CHANGE
      Don't modify HTML names.
    • NAMES_UPPERCASE

      private static final short NAMES_UPPERCASE
      Uppercase HTML names.
    • NAMES_LOWERCASE

      private static final short NAMES_LOWERCASE
      Lowercase HTML names.
    • SYNTHESIZED_ITEM

      private static final HTMLEventInfo SYNTHESIZED_ITEM
      Synthesized event info item.
    • fNamespaces

      protected boolean fNamespaces
      Namespaces.
    • fAugmentations

      protected boolean fAugmentations
      Include infoset augmentations.
    • fReportErrors

      protected boolean fReportErrors
      Report errors.
    • fDocumentFragment

      protected boolean fDocumentFragment
      Document fragment balancing only.
    • fTemplateFragment

      protected boolean fTemplateFragment
      Template document fragment balancing only.
    • fIgnoreOutsideContent

      protected boolean fIgnoreOutsideContent
      Ignore outside content.
    • fAllowSelfclosingIframe

      protected boolean fAllowSelfclosingIframe
      Allows self-closing iframe tags.
    • fAllowSelfclosingTags

      protected boolean fAllowSelfclosingTags
      Allows self-closing tags.
    • fNamesElems

      protected short fNamesElems
      Modify HTML element names.
    • fErrorReporter

      protected HTMLErrorReporter fErrorReporter
      Error reporter.
    • documentSource_

      private XMLDocumentSource documentSource_
    • documentHandler_

      private XMLDocumentHandler documentHandler_
      The document handler.
    • fElementStack

      protected final HTMLTagBalancer.InfoStack fElementStack
      The element stack.
    • fInlineStack

      protected final HTMLTagBalancer.InfoStack fInlineStack
      The inline stack.
    • fSeenAnything

      protected boolean fSeenAnything
      True if seen anything. Important for xml declaration.
    • fSeenDoctype

      protected boolean fSeenDoctype
      True if a doctype declaration has been seen.
    • fSeenRootElement

      protected boolean fSeenRootElement
      True if root element has been seen.
    • fSeenRootElementEnd

      protected boolean fSeenRootElementEnd
      True if seen the end of the document element. In other words, this variable is set to false until the end </HTML> tag is seen (or synthesized). This is used to ensure that extraneous events after the end of the document element do not make the document stream ill-formed.
    • fSeenHeadElement

      protected boolean fSeenHeadElement
      True if seen head element.
    • fSeenBodyElement

      protected boolean fSeenBodyElement
      True if seen body element.
    • fSeenBodyElementEnd

      private boolean fSeenBodyElementEnd
    • fSeenFramesetElement

      private boolean fSeenFramesetElement
      True if seen frameset element.
    • fSeenCharacters

      private boolean fSeenCharacters
      True if seen non-whitespace characters.
    • fOpenedForm

      protected boolean fOpenedForm
      True if a form is in the stack (allows the opening of nested forms to be discarded).
    • fOpenedSvg

      protected boolean fOpenedSvg
      True if an svg element is in the stack (no parent checking takes place).
    • fOpenedSelect

      protected boolean fOpenedSelect
      True if a select element is in the stack.
    • fQName

      private final QName fQName
      A qualified name.
    • tagBalancingListener

      protected HTMLTagBalancingListener tagBalancingListener
    • lostText_

      private final LostText lostText_
    • forcedStartElement_

      private boolean forcedStartElement_
    • forcedEndElement_

      private boolean forcedEndElement_
    • fragmentContextStack_

      private QName[] fragmentContextStack_
      Stack of elements determining the context in which a document fragment should be parsed.
    • fragmentContextStackSize_

      private int fragmentContextStackSize_
    • endElementsBuffer_

      private final List<HTMLTagBalancer.ElementEntry> endElementsBuffer_
    • discardedStartElements

      private final List<String> discardedStartElements
    • htmlConfiguration_

      private final HTMLConfiguration htmlConfiguration_
  • Constructor Details

  • Method Details

    • getFeatureDefault

      public Boolean getFeatureDefault(String featureId)
      Returns the default state for a feature.
      Specified by:
      getFeatureDefault in interface HTMLComponent
      Specified by:
      getFeatureDefault in interface XMLComponent
      Parameters:
      featureId - The feature identifier.
      Returns:
      the default state for a feature, or null if this component does not want to report a default value for this feature.
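The field list above pairs RECOGNIZED_FEATURES with RECOGNIZED_FEATURES_DEFAULTS, which suggests a parallel-array lookup. The sketch below shows that pattern under stated assumptions: the class name `FeatureDefaultsSketch` is hypothetical, only two of the documented feature identifiers are included, and the default values are placeholders rather than the component's actual defaults.

```java
/** Sketch of a parallel-array default lookup (hypothetical, values are illustrative). */
class FeatureDefaultsSketch {

    // Two feature identifiers taken from the list this page documents.
    private static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/report-errors",
        "http://cyberneko.org/html/features/balance-tags/document-fragment",
    };

    // Assumed defaults for this sketch only; null means "no default reported".
    private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        null,
        Boolean.FALSE,
    };

    /** Returns the default for a recognized feature, or null otherwise. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null; // unrecognized feature: nothing to report
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault(RECOGNIZED_FEATURES[1]));
        System.out.println(getFeatureDefault("http://example.org/unknown"));
    }
}
```

Returning a `Boolean` rather than a `boolean` is what lets the method distinguish "default is false" from "no default".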
    • getPropertyDefault

      public Object getPropertyDefault(String propertyId)
      Returns the default state for a property.
      Specified by:
      getPropertyDefault in interface HTMLComponent
      Specified by:
      getPropertyDefault in interface XMLComponent
      Parameters:
      propertyId - The property identifier.
      Returns:
      the default state for a property, or null if this component does not want to report a default value for this property
    • getRecognizedFeatures

      public String[] getRecognizedFeatures()
      Returns recognized features.
      Specified by:
      getRecognizedFeatures in interface XMLComponent
      Returns:
      an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
    • getRecognizedProperties

      public String[] getRecognizedProperties()
      Returns recognized properties.
      Specified by:
      getRecognizedProperties in interface XMLComponent
      Returns:
      an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
    • reset

      public void reset(XMLComponentManager manager) throws XMLConfigurationException
      Resets the component.
      Specified by:
      reset in interface XMLComponent
      Parameters:
      manager - The component manager.
      Throws:
      XMLConfigurationException
    • setFeature

      public void setFeature(String featureId, boolean state) throws XMLConfigurationException
      Sets a feature.
      Specified by:
      setFeature in interface XMLComponent
      Parameters:
      featureId - The feature identifier.
      state - The state of the feature.
      Throws:
      XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
    • setProperty

      public void setProperty(String propertyId, Object value) throws XMLConfigurationException
      Sets a property.
      Specified by:
      setProperty in interface XMLComponent
      Parameters:
      propertyId - The property identifier.
      value - The value of the property.
      Throws:
      XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
    • setDocumentHandler

      public void setDocumentHandler(XMLDocumentHandler handler)
      Sets the document handler.
      Specified by:
      setDocumentHandler in interface XMLDocumentSource
      Parameters:
      handler - the new handler
    • getDocumentHandler

      public XMLDocumentHandler getDocumentHandler()
      Returns the document handler.
      Specified by:
      getDocumentHandler in interface XMLDocumentSource
      Returns:
      the document handler
    • setDocumentSource

      public void setDocumentSource(XMLDocumentSource source)
      Sets the document source.
      Specified by:
      setDocumentSource in interface XMLDocumentHandler
      Parameters:
      source - the new source
    • getDocumentSource

      public XMLDocumentSource getDocumentSource()
      Specified by:
      getDocumentSource in interface XMLDocumentHandler
      Returns:
      the document source.
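Because the class is both an XMLDocumentHandler and an XMLDocumentSource, it sits in the middle of a pipeline: it receives events from its source and forwards (possibly corrected) events to its handler. The sketch below illustrates that wiring with a much-reduced, hypothetical `Handler` interface; none of these types are part of the library, and the null check on the downstream handler is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy pipeline showing how a filter forwards events (hypothetical types). */
class FilterChainSketch {

    /** Hypothetical, much-reduced stand-in for a document handler. */
    interface Handler {
        void startElement(String name);
        void endElement(String name);
    }

    /** A filter receives events like a handler and forwards them like a source. */
    static class PassThroughFilter implements Handler {
        private Handler documentHandler; // downstream receiver, may be null

        void setDocumentHandler(Handler handler) {
            this.documentHandler = handler;
        }

        Handler getDocumentHandler() {
            return documentHandler;
        }

        @Override
        public void startElement(String name) {
            if (documentHandler != null) {
                documentHandler.startElement(name);
            }
        }

        @Override
        public void endElement(String name) {
            if (documentHandler != null) {
                documentHandler.endElement(name);
            }
        }
    }

    /** Terminal handler that records the events it receives. */
    static class Recorder implements Handler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String name) {
            events.add("start:" + name);
        }

        @Override
        public void endElement(String name) {
            events.add("end:" + name);
        }
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        PassThroughFilter filter = new PassThroughFilter();
        filter.setDocumentHandler(recorder);
        filter.startElement("p");
        filter.endElement("p");
        System.out.println(recorder.events); // prints [start:p, end:p]
    }
}
```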
    • startDocument

      public void startDocument(XMLLocator locator, String encoding, NamespaceContext nscontext, Augmentations augs) throws XNIException
      Start document.
      Specified by:
      startDocument in interface XMLDocumentHandler
      Parameters:
      locator - The document locator, or null if the document location cannot be reported during the parsing of this document. However, it is strongly recommended that a locator be supplied that can at least report the system identifier of the document.
      encoding - The auto-detected IANA encoding name of the entity stream. This value will be null in those situations where the entity encoding is not auto-detected (e.g. internal entities or a document entity that is parsed from a java.io.Reader).
      nscontext - The namespace context in effect at the start of this document. This object represents the current context. Implementors of this class are responsible for copying the namespace bindings from the current context (and its parent contexts) if that information is important.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • xmlDecl

      public void xmlDecl(String version, String encoding, String standalone, Augmentations augs) throws XNIException
      XML declaration.
      Specified by:
      xmlDecl in interface XMLDocumentHandler
      Parameters:
      version - The XML version.
      encoding - The IANA encoding name of the document, or null if not specified.
      standalone - The standalone value, or null if not specified.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • doctypeDecl

      public void doctypeDecl(String rootElementName, String publicId, String systemId, Augmentations augs) throws XNIException
      Doctype declaration.
      Specified by:
      doctypeDecl in interface XMLDocumentHandler
      Parameters:
      rootElementName - The name of the root element.
      publicId - The public identifier if an external DTD or null if the external DTD is specified using SYSTEM.
      systemId - The system identifier if an external DTD, null otherwise.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • endDocument

      public void endDocument(Augmentations augs) throws XNIException
      End document.
      Specified by:
      endDocument in interface XMLDocumentHandler
      Parameters:
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • consumeBufferedEndElements

      private void consumeBufferedEndElements()
      Consumes buffered end elements, such as </body> and </html>, which are only emitted at the end of the document.
    • comment

      public void comment(XMLString text, Augmentations augs) throws XNIException
      Comment.
      Specified by:
      comment in interface XMLDocumentHandler
      Parameters:
      text - The text in the comment.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by application to signal an error.
    • consumeEarlyTextIfNeeded

      private void consumeEarlyTextIfNeeded()
    • processingInstruction

      public void processingInstruction(String target, XMLString data, Augmentations augs) throws XNIException
      Processing instruction.
      Specified by:
      processingInstruction in interface XMLDocumentHandler
      Parameters:
      target - The target.
      data - The data or null if none specified.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • startElement

      public void startElement(QName elem, XMLAttributes attrs, Augmentations augs) throws XNIException
      Start element.
      Specified by:
      startElement in interface XMLDocumentHandler
      Parameters:
      elem - The name of the element.
      attrs - The element attributes.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • forceStartElement

      private boolean forceStartElement(QName elem, XMLAttributes attrs, Augmentations augs) throws XNIException
      Forces an element start, taking care to set the information that allows startElement to "see" that the element has been forced.
      Returns:
      true if the element could be created (for instance, creation of a TABLE cannot be forced)
      Throws:
      XNIException
    • createQName

      private QName createQName(String tagName)
    • emptyElement

      public void emptyElement(QName element, XMLAttributes attrs, Augmentations augs) throws XNIException
      Empty element.
      Specified by:
      emptyElement in interface XMLDocumentHandler
      Parameters:
      element - The name of the element.
      attrs - The element attributes.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • forceStartBody

      private void forceStartBody()
      Generates a missing <body> element (which in turn creates a missing <head> when needed).
    • startCDATA

      public void startCDATA(Augmentations augs) throws XNIException
      Start CDATA section.
      Specified by:
      startCDATA in interface XMLDocumentHandler
      Parameters:
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • endCDATA

      public void endCDATA(Augmentations augs) throws XNIException
      End CDATA section.
      Specified by:
      endCDATA in interface XMLDocumentHandler
      Parameters:
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • characters

      public void characters(XMLString text, Augmentations augs) throws XNIException
      Characters.
      Specified by:
      characters in interface XMLDocumentHandler
      Parameters:
      text - The content.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • endElement

      public void endElement(QName element, Augmentations augs) throws XNIException
      End element.
      Specified by:
      endElement in interface XMLDocumentHandler
      Parameters:
      element - The name of the element.
      augs - Additional information that may include infoset augmentations
      Throws:
      XNIException - Thrown by handler to signal an error.
    • getElement

      protected HTMLElements.Element getElement(QName elementName)
    • callStartElement

      protected final void callStartElement(QName element, XMLAttributes attrs, Augmentations augs) throws XNIException
      Throws:
      XNIException
    • addBodyIfNeeded

      private void addBodyIfNeeded(short element)
    • callEndElement

      protected final void callEndElement(QName element, Augmentations augs) throws XNIException
      Throws:
      XNIException
    • getElementDepth

      protected final int getElementDepth(HTMLElements.Element element)
      Parameters:
      element - The element.
      Returns:
      the depth of the open tag associated with the specified element name or -1 if no matching element is found.
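A depth lookup of this kind can be pictured as a search from the innermost open element outward. The following sketch is an assumption-laden illustration: it uses plain strings instead of HTMLElements.Element, the class name `ElementDepthSketch` is hypothetical, and counting the innermost element as depth 1 is a convention chosen for the example, not something this page states.

```java
import java.util.List;

/** Sketch of an open-element depth lookup (hypothetical, string-based). */
class ElementDepthSketch {

    /**
     * Searches the open-element list from the innermost entry (last) outward.
     * Assumption: the innermost open element has depth 1.
     * Returns -1 if no open element has the given name.
     */
    static int getElementDepth(List<String> openElements, String name) {
        for (int i = openElements.size() - 1; i >= 0; i--) {
            if (openElements.get(i).equals(name)) {
                return openElements.size() - i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> open = List.of("html", "body", "p", "b");
        System.out.println(getElementDepth(open, "p"));     // prints 2
        System.out.println(getElementDepth(open, "table")); // prints -1
    }
}
```

A caller can use the returned depth to decide how many open elements must be closed before the matching one is reached.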
    • getParentDepth

      protected int getParentDepth(HTMLElements.Element[] parents, short bounds)
      Parameters:
      parents - The parent elements.
      bounds - bounds
      Returns:
      the depth of the open tag associated with the specified element parent names or -1 if no matching element is found.
    • synthesizedAugs

      protected final Augmentations synthesizedAugs()
    • modifyName

      protected static String modifyName(String name, short mode)
    • getNamesValue

      protected static short getNamesValue(String value)
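Neither modifyName nor getNamesValue carries a description here, so the following sketch shows one plausible reading based on the NAMES_ELEMS values ("upper", "lower", "default") and the NAMES_* constants documented above. The numeric constant values, the class name `NameModeSketch`, and the treatment of any unrecognized value as "no change" are assumptions of the sketch.

```java
import java.util.Locale;

/** Sketch of name-case handling (hypothetical constant values). */
class NameModeSketch {

    // Assumed values; the page does not list the real ones.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps a property value such as "upper" or "lower" to a mode constant. */
    static short getNamesValue(String value) {
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        return NAMES_NO_CHANGE; // "default" and anything else
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", getNamesValue("upper"))); // prints DIV
    }
}
```

`Locale.ROOT` keeps the case conversion independent of the default locale, which matters for names containing letters such as "i".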
    • setTagBalancingListener

      void setTagBalancingListener(HTMLTagBalancingListener tagBalancingListener)
    • notifyDiscardedStartElement

      private void notifyDiscardedStartElement(QName elem, XMLAttributes attrs, Augmentations augs)
      Notifies the tagBalancingListener (if any) of an ignored start element
    • notifyDiscardedEndElement

      private void notifyDiscardedEndElement(QName element, Augmentations augs)
      Notifies the tagBalancingListener (if any) of an ignored end element
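The "if any" wording implies an optional listener that is only called when one has been registered. The sketch below combines that with the nested-form rule mentioned for fOpenedForm. Everything in it is hypothetical: the `Listener` interface, the class name `DiscardNotificationSketch`, and the string-based events are stand-ins, and end tags (which would reset the form flag) are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of notifying an optional listener about discarded start elements. */
class DiscardNotificationSketch {

    /** Hypothetical single-method listener. */
    interface Listener {
        void ignoredStartElement(String name);
    }

    private Listener listener;      // optional, may stay null
    private boolean openedForm;     // true once a form start has been forwarded
    final List<String> forwarded = new ArrayList<>();

    void setListener(Listener listener) {
        this.listener = listener;
    }

    /** Forwards a start element unless it is a form nested inside an open form. */
    void startElement(String name) {
        if (name.equals("form")) {
            if (openedForm) {
                if (listener != null) {
                    listener.ignoredStartElement(name); // report the discard
                }
                return; // nested form start is dropped
            }
            openedForm = true;
        }
        forwarded.add(name);
    }

    public static void main(String[] args) {
        DiscardNotificationSketch sketch = new DiscardNotificationSketch();
        List<String> ignored = new ArrayList<>();
        sketch.setListener(ignored::add);
        sketch.startElement("form");
        sketch.startElement("div");
        sketch.startElement("form");
        System.out.println(sketch.forwarded); // prints [form, div]
        System.out.println(ignored);          // prints [form]
    }
}
```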