Class HtmlCleaner

java.lang.Object
org.htmlcleaner.HtmlCleaner

public class HtmlCleaner extends Object
Main HtmlCleaner class.

It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.

Typical usage is the following:

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();

    // customize cleaner's behavior with property setters
    props.setXXX(...);

    // Clean HTML taken from simple string, file, URL, input stream,
    // input source or reader. Result is root node of created
    // tree-like structure. Single cleaner instance may be safely used
    // multiple times.
    TagNode node = cleaner.clean(...);

    // optionally find parts of the DOM or modify some nodes
    TagNode[] myNodes = node.getElementsByXXX(...);
    // and/or
    Object[] myNodes = node.evaluateXPath(xPathExpression);
    // and/or
    aNode.removeFromTree();
    // and/or
    aNode.addAttribute(attName, attValue);
    // and/or (removal takes only the attribute name)
    aNode.removeAttribute(attName);
    // and/or
    cleaner.setInnerHtml(aNode, htmlContent);
    // and/or do some other tree manipulation/traversal

    // serialize a node to a file, output stream, DOM, JDom...
    new XXXSerializer(props).writeXmlXXX(aNode, ...);
    myJDom = new JDomSerializer(props, true).createJDom(aNode);
    myDom = new DomSerializer(props, true).createDOM(aNode);
  • Field Details

    • MARKER_ATTRIBUTE

      private static final String MARKER_ATTRIBUTE
      Marker attribute added to aid with part of the cleaning process. TODO: a non-intrusive way of doing this that does not involve modifying the source html
    • HTML_4

      public static int HTML_4
    • HTML_5

      public static int HTML_5
    • properties

      private CleanerProperties properties
    • transformations

      private CleanerTransformations transformations
  • Constructor Details

    • HtmlCleaner

      public HtmlCleaner()
      Constructor - creates a cleaner instance with the default tag info provider, default version and default properties.
    • HtmlCleaner

      public HtmlCleaner(ITagInfoProvider tagInfoProvider)
      Constructor - creates the instance with specified tag info provider and default properties
      Parameters:
      tagInfoProvider - Provider for tag filtering and balancing
    • HtmlCleaner

      public HtmlCleaner(CleanerProperties properties)
      Constructor - creates the instance with default tag info provider and specified properties
      Parameters:
      properties - Properties used during parsing and serializing
    • HtmlCleaner

      public HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
      Constructor - creates the instance with specified tag info provider and specified properties
      Parameters:
      tagInfoProvider - Provider for tag filtering and balancing
      properties - Properties used during parsing and serializing
  • Method Details

    • clean

      public TagNode clean(String htmlContent)
    • clean

      public TagNode clean(File file, String charset) throws IOException
      Throws:
      IOException
    • clean

      public TagNode clean(File file) throws IOException
      Throws:
      IOException
    • clean

      @Deprecated public TagNode clean(URL url, String charset) throws IOException
      Deprecated.
      Deprecated because unmanaged network IO does not handle proxies, slow servers or broken connections well. The HtmlCleaner caller should manage the connections itself and just provide the HtmlCleaner library with a stream.
      Parameters:
      url -
      charset -
      Returns:
      Throws:
      IOException
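
The deprecation note asks the caller to own the connection and hand a stream to the cleaner. A minimal sketch of that idea, using only the JDK: the class name ManagedFetchSketch, the prepare helper and the timeout values are assumptions made for illustration, not part of HtmlCleaner. The commented line marks where clean(InputStream, String) from this class would receive the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of HtmlCleaner.
class ManagedFetchSketch {

    // Prepares a connection with explicit timeouts. Nothing is sent over the
    // network until getInputStream() or connect() is called on the result.
    static URLConnection prepare(String address, int connectTimeoutMs, int readTimeoutMs) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setConnectTimeout(connectTimeoutMs);
            connection.setReadTimeout(readTimeoutMs);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://example.com/";
        URLConnection connection = prepare(address, 5_000, 10_000);
        try (InputStream in = connection.getInputStream()) {
            // With HtmlCleaner on the classpath, the stream would be passed on here:
            // TagNode root = new HtmlCleaner().clean(in, "UTF-8");
            System.out.println("Read " + in.readAllBytes().length + " bytes");
        }
    }
}
```

Proxy selection, retries and error handling stay with the caller in this arrangement, which is the point of the deprecation.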
    • clean

      @Deprecated public TagNode clean(URL url) throws IOException
      Deprecated.
      Creates an instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying, in order: 1. the Content-Type response header, 2. META tags at the beginning of the HTML, 3. the platform's default charset.
      Parameters:
      url - the url to retrieve content from
      Returns:
      the cleaned content
      Throws:
      IOException
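
The three-step encoding fallback described for this method can be sketched as follows. This is an illustration under stated assumptions, not HtmlCleaner's actual implementation: the class CharsetFallbackSketch, its method names and the regular expressions are all hypothetical.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented lookup order; not HtmlCleaner code.
class CharsetFallbackSketch {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 1: charset parameter of a Content-Type value, or null if absent.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: first charset declared by a META tag in the given HTML prefix.
    static String fromMetaTags(String htmlStart) {
        if (htmlStart == null) {
            return null;
        }
        Matcher meta = META.matcher(htmlStart);
        while (meta.find()) {
            String found = fromContentType(meta.group());
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Tries the header, then META tags, then the platform default (step 3).
    static String resolve(String contentTypeHeader, String htmlStart) {
        String charset = fromContentType(contentTypeHeader);
        if (charset == null) {
            charset = fromMetaTags(htmlStart);
        }
        if (charset == null) {
            charset = Charset.defaultCharset().name();
        }
        return charset;
    }
}
```

Because step 3 depends on the platform, callers that need reproducible results may prefer an overload that takes an explicit charset.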
    • clean

      public TagNode clean(InputStream in, String charset) throws IOException
      Throws:
      IOException
    • clean

      public TagNode clean(InputStream in) throws IOException
      Throws:
      IOException
    • clean

      public TagNode clean(Reader reader) throws IOException
      Throws:
      IOException
    • clean

      protected TagNode clean(Reader reader, CleanTimeValues cleanTimeValues) throws IOException
      Basic version of the cleaning call.
      Parameters:
      reader - (not closed)
      Returns:
      An instance of TagNode object which is the root of the XML tree.
      Throws:
      IOException
    • markNodesToPrune

      private boolean markNodesToPrune(List nodeList, CleanTimeValues cleanTimeValues, int depth)
    • calculateRootNode

      private void calculateRootNode(CleanTimeValues cleanTimeValues, Set<String> namespacePrefixes)
      Assigns the root node to an internal variable and adds the necessary xmlns attributes if the cleaner is namespace-aware. The root node of the result depends on the parameter "omitHtmlEnvelope": if it is set, the first child of the body becomes the root node; otherwise html is the root node.
      Parameters:
      namespacePrefixes -
    • addAttributesToTag

      private void addAttributesToTag(TagNode tag, Map<String,String> attributes)
      Adds attributes from the specified map to the specified tag. If an attribute already exists, it is preserved.
      Parameters:
      tag -
      attributes -
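
The "existing attributes are preserved" rule can be illustrated with plain maps. This is a sketch only: AttributeMergeSketch and addMissing are hypothetical names, and a Map stands in for the attributes of a TagNode.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the merge rule; a Map plays the role of a tag's attributes.
class AttributeMergeSketch {

    // Copies entries from 'extra' into 'existing', keeping any value already present.
    static void addMissing(Map<String, String> existing, Map<String, String> extra) {
        for (Map.Entry<String, String> entry : extra.entrySet()) {
            existing.putIfAbsent(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> tagAttributes = new LinkedHashMap<>();
        tagAttributes.put("lang", "en");
        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("lang", "fr");
        extra.put("dir", "ltr");
        addMissing(tagAttributes, extra);
        System.out.println(tagAttributes);
    }
}
```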
    • isFatalTagSatisfied

      private boolean isFatalTagSatisfied(TagInfo tag, CleanTimeValues cleanTimeValues)
      If a fatal tag is defined for the specified tag, checks whether the open fatal tag is missing.
      Parameters:
      tag -
    • mustAddRequiredParent

      private boolean mustAddRequiredParent(TagInfo tag, CleanTimeValues cleanTimeValues)
      Checks whether the specified tag requires a parent tag that is missing in the appropriate context.
      Parameters:
      tag -
    • newTagNode

      private TagNode newTagNode(String tagName)
    • createTagNode

      private TagNode createTagNode(TagNode startTagToken)
    • isAllowedInLastOpenTag

      private boolean isAllowedInLastOpenTag(BaseToken token, CleanTimeValues cleanTimeValues)
    • saveToLastOpenTag

      private void saveToLastOpenTag(List nodeList, Object tokenToAdd, CleanTimeValues cleanTimeValues)
    • isStartToken

      private boolean isStartToken(Object o)
    • isAllowedAsForeignMarkup

      private boolean isAllowedAsForeignMarkup(String tagname, CleanTimeValues cleanTimeValues)
      Checks whether we can allow a tag as "foreign markup". This means namespace awareness must be set to true, and either a current xmlns declaration that isn't for HTML must be in scope, or the tag must have a namespace prefix.
      Parameters:
      cleanTimeValues -
      Returns:
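
The condition described above reduces to a small boolean expression. The sketch below is hypothetical: ForeignMarkupSketch and its parameters are invented for illustration, and treating the XHTML namespace URI as the "for HTML" declaration is an assumption.

```java
// Hypothetical restatement of the documented rule; not HtmlCleaner code.
class ForeignMarkupSketch {
    // Assumption: this URI is what counts as an xmlns declaration "for HTML".
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // xmlnsInScope is the namespace URI currently in scope, or null if none.
    static boolean isAllowed(String tagName, boolean namespacesAware, String xmlnsInScope) {
        if (!namespacesAware) {
            return false;
        }
        boolean hasPrefix = tagName.indexOf(':') > 0;
        boolean nonHtmlScope = xmlnsInScope != null && !XHTML_NS.equals(xmlnsInScope);
        return nonHtmlScope || hasPrefix;
    }
}
```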
    • handleEndTagToken

      private void handleEndTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
      Process rules for a new end tag token in the HTML tree.
      Parameters:
      token -
      nodeIterator -
      nodeList -
      cleanTimeValues -
    • handleStartTagToken

      private void handleStartTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
      Processes all the rules associated with a new opening tag in the HTML tree
      Parameters:
      token -
      nodeIterator -
      nodeList -
      cleanTimeValues -
    • makeTree

      void makeTree(List nodeList, ListIterator<BaseToken> nodeIterator, CleanTimeValues cleanTimeValues)
      This method mutates the flattened list of tokens into a tree structure.
      Parameters:
      nodeList -
      nodeIterator -
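
To make "flat token list into tree" concrete, here is a heavily simplified stack-based sketch. It is not HtmlCleaner's algorithm, which also applies tag-balancing rules from the tag info provider. TreeBuildSketch, its Node type and the string token format are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified tree builder; tokens are plain strings here.
class TreeBuildSketch {
    static final class Node {
        final String name;
        final List<Object> children = new ArrayList<>(); // Node or String (text)
        Node(String name) { this.name = name; }
    }

    // "<name>" opens an element, "</name>" closes it, anything else is text.
    static Node build(List<String> tokens) {
        Node root = new Node("#root");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String token : tokens) {
            if (token.startsWith("</") && token.endsWith(">")) {
                String name = token.substring(2, token.length() - 1);
                boolean matches = false;
                for (Node candidate : open) {
                    if (candidate != root && candidate.name.equals(name)) {
                        matches = true;
                        break;
                    }
                }
                if (matches) {
                    // Close any elements left open inside the one being closed.
                    while (!open.peek().name.equals(name)) {
                        open.pop();
                    }
                    open.pop();
                } // a stray end tag with no open partner is ignored
            } else if (token.startsWith("<") && token.endsWith(">")) {
                Node node = new Node(token.substring(1, token.length() - 1));
                open.peek().children.add(node);
                open.push(node);
            } else {
                open.peek().children.add(token);
            }
        }
        return root; // elements still open at the end are implicitly closed
    }
}
```

The stack of open elements plays the role that the open-tag bookkeeping plays in the real cleaner: unclosed inner elements are closed when an outer end tag arrives.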
    • isCopiedTokenEqualToNextThreeCopiedTokens

      private static boolean isCopiedTokenEqualToNextThreeCopiedTokens(TagNode copiedStartToken, ListIterator<BaseToken> nodeIterator)
      Determines if a copied token is equal to the next 3 tokens in the iterator.
    • flattenNestedList

      private List<TagNode> flattenNestedList(List list)
      Flattens a list of tagnodes
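
Flattening a nested list can be done recursively. The sketch below shows the general technique with a type filter. FlattenSketch and its generic signature are assumptions, and the real method works on TagNode instances rather than arbitrary types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generic version of the flattening idea; not HtmlCleaner code.
class FlattenSketch {

    // Walks 'nested' depth-first and collects every element of the given type,
    // descending into any element that is itself a List.
    static <T> List<T> flatten(List<?> nested, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Object item : nested) {
            if (item instanceof List<?>) {
                out.addAll(flatten((List<?>) item, type));
            } else if (type.isInstance(item)) {
                out.add(type.cast(item));
            }
        }
        return out;
    }
}
```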
    • areCopiedTokensEqual

      private static boolean areCopiedTokensEqual(TagNode token1, TagNode token2)
      Determines if two copied tokens are equal.
    • reopenBrokenNode

      private void reopenBrokenNode(ListIterator<BaseToken> nodeIterator, TagNode toReopen, CleanTimeValues cleanTimeValues)
    • isRemovingNodeReasonablySafe

      protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
      Parameters:
      startTagToken -
      Returns:
      true if no id attribute or class attribute
    • createDocumentNodes

      private void createDocumentNodes(List listNodes, CleanTimeValues cleanTimeValues)
    • closeSnippet

      private List<TagNode> closeSnippet(List nodeList, TagPos tagPos, Object toNode, CleanTimeValues cleanTimeValues)
      Forced closing
      Parameters:
      nodeList -
      tagPos -
      toNode -
      Returns:
    • closeAll

      private void closeAll(List nodeList, CleanTimeValues cleanTimeValues)
      Close all unclosed tags if there are any.
    • addPossibleHeadCandidate

      private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, CleanTimeValues cleanTimeValues)
      Checks whether the specified tag, with the specified info, is a candidate for moving to the head section.
      Parameters:
      tagInfo -
      tagNode -
    • getProperties

      public CleanerProperties getProperties()
    • getPruneTagSet

      protected Set<ITagNodeCondition> getPruneTagSet(CleanTimeValues cleanTimeValues)
    • getAllowTagSet

      protected Set<ITagNodeCondition> getAllowTagSet(CleanTimeValues cleanTimeValues)
    • addPruneNode

      protected void addPruneNode(TagNode node, CleanTimeValues cleanTimeValues)
    • getTagInfo

      public TagInfo getTagInfo(String tagName, CleanTimeValues cleanTimeValues)
      Returns a TagInfo object for the specified tag name. If the tag is foreign markup, we leave it as null, because we may get name clashes, e.g. svg:title. However, we do handle the tag if it's embedded content within the correct NS (e.g. SVG, MathML).
      Parameters:
      tagName -
      cleanTimeValues -
      Returns:
      a TagInfo object, or null if no matching TagInfo is found
    • addIfNeededToPruneSet

      private boolean addIfNeededToPruneSet(TagNode tagNode, CleanTimeValues cleanTimeValues)
    • getAllTags

      protected Set<String> getAllTags(CleanTimeValues cleanTimeValues)
    • getTagInfoProvider

      public ITagInfoProvider getTagInfoProvider()
      Returns:
      ITagInfoProvider instance for this HtmlCleaner
    • getTransformations

      public CleanerTransformations getTransformations()
      Returns:
      Transformations defined for this instance of cleaner
    • getInnerHtml

      public String getInnerHtml(TagNode node)
      For the specified node, returns its content as a string.
      Parameters:
      node -
      Returns:
      node's content as string
    • setInnerHtml

      public void setInnerHtml(TagNode node, String content)
      For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it inside the node in place of the previous content.
      Parameters:
      node -
      content -
    • initCleanerTransformations

      public void initCleanerTransformations(Map transInfos)
      Parameters:
      transInfos -
    • getOpenTags

      private OpenTags getOpenTags(CleanTimeValues cleanTimeValues)
    • getChildBreaks

      private ChildBreaks getChildBreaks(CleanTimeValues cleanTimeValues)
    • pushNesting

      private NestingState pushNesting(CleanTimeValues cleanTimeValues)
    • popNesting

      private NestingState popNesting(CleanTimeValues cleanTimeValues)
    • handleInterruption

      protected void handleInterruption()
      Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
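
A hook like this is typically reached from a loop that polls the thread's interrupt flag. The sketch below shows that general pattern with JDK calls only. InterruptionSketch, its process method and the handled counter are hypothetical, and how HtmlCleaner itself invokes the hook is not specified here.

```java
import java.util.List;

// Hypothetical illustration of an interruption hook; not HtmlCleaner code.
class InterruptionSketch {
    int handled = 0; // counts hook invocations, for demonstration only

    // Placeholder hook; a subclass could release resources or notify the user.
    protected void handleInterruption() {
        handled++;
    }

    // Processes items until done or until the current thread is interrupted.
    // Returns the number of items processed. The interrupt flag is left set
    // so that callers further up can still observe it.
    int process(List<String> items) {
        int processed = 0;
        for (String item : items) {
            if (Thread.currentThread().isInterrupted()) {
                handleInterruption();
                break;
            }
            processed++;
        }
        return processed;
    }
}
```

Overriding the hook in a subclass is possible because the method is protected.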