Package org.htmlcleaner
Class HtmlCleaner
- java.lang.Object
-
- org.htmlcleaner.HtmlCleaner
-
public class HtmlCleaner extends java.lang.Object
Main HtmlCleaner class. It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.
Typical usage is the following:
    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();

    // customize cleaner's behavior with property setters
    props.setXXX(...);

    // Clean HTML taken from simple string, file, URL, input stream,
    // input source or reader. Result is root node of created
    // tree-like structure. Single cleaner instance may be safely used
    // multiple times.
    TagNode node = cleaner.clean(...);

    // optionally find parts of the DOM or modify some nodes
    TagNode[] myNodes = node.getElementsByXXX(...);
    // and/or
    Object[] myNodes = node.evaluateXPath(xPathExpression);
    // and/or
    aNode.removeFromTree();
    // and/or
    aNode.addAttribute(attName, attValue);
    // and/or
    aNode.removeAttribute(attName);
    // and/or
    cleaner.setInnerHtml(aNode, htmlContent);
    // and/or do some other tree manipulation/traversal

    // serialize a node to a file, output stream, DOM, JDom...
    new XXXSerializer(props).writeXmlXXX(aNode, ...);
    myJDom = new JDomSerializer(props, true).createJDom(aNode);
    myDom = new DomSerializer(props, true).createDOM(aNode);
-
-
Field Summary
Fields
static int HTML_4
static int HTML_5
private static java.lang.String MARKER_ATTRIBUTE
    Marker attribute added to aid with part of the cleaning process.
private CleanerProperties properties
private CleanerTransformations transformations
-
Constructor Summary
Constructors
HtmlCleaner()
    Creates a cleaner instance with the default tag info provider, default version, and default properties.
HtmlCleaner(CleanerProperties properties)
    Creates the instance with the default tag info provider and the specified properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider)
    Creates the instance with the specified tag info provider and default properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
    Creates the instance with the specified tag info provider and the specified properties.
-
Method Summary
Methods
private void addAttributesToTag(TagNode tag, java.util.Map<java.lang.String,java.lang.String> attributes)
    Adds attributes from the specified map to the specified tag.
private boolean addIfNeededToPruneSet(TagNode tagNode, CleanTimeValues cleanTimeValues)
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, CleanTimeValues cleanTimeValues)
    Checks if the specified tag with the specified info is a candidate for moving to the head section.
protected void addPruneNode(TagNode node, CleanTimeValues cleanTimeValues)
private static boolean areCopiedTokensEqual(TagNode token1, TagNode token2)
    Determines if two copied tokens are equal.
private void calculateRootNode(CleanTimeValues cleanTimeValues, java.util.Set<java.lang.String> namespacePrefixes)
    Assigns root node to internal variable and adds necessary xmlns attributes if cleaner is namespace-aware.
TagNode clean(java.io.File file)
TagNode clean(java.io.File file, java.lang.String charset)
TagNode clean(java.io.InputStream in)
TagNode clean(java.io.InputStream in, java.lang.String charset)
TagNode clean(java.io.Reader reader)
protected TagNode clean(java.io.Reader reader, CleanTimeValues cleanTimeValues)
    Basic version of the cleaning call.
TagNode clean(java.lang.String htmlContent)
TagNode clean(java.net.URL url)
    Deprecated.
TagNode clean(java.net.URL url, java.lang.String charset)
    Deprecated.
private void closeAll(java.util.List nodeList, CleanTimeValues cleanTimeValues)
    Closes all unclosed tags if there are any.
private java.util.List<TagNode> closeSnippet(java.util.List nodeList, TagPos tagPos, java.lang.Object toNode, CleanTimeValues cleanTimeValues)
    Forced closing.
private void createDocumentNodes(java.util.List listNodes, CleanTimeValues cleanTimeValues)
private TagNode createTagNode(TagNode startTagToken)
private java.util.List<TagNode> flattenNestedList(java.util.List list)
    Flattens a list of tag nodes.
protected java.util.Set<ITagNodeCondition> getAllowTagSet(CleanTimeValues cleanTimeValues)
protected java.util.Set<java.lang.String> getAllTags(CleanTimeValues cleanTimeValues)
private ChildBreaks getChildBreaks(CleanTimeValues cleanTimeValues)
java.lang.String getInnerHtml(TagNode node)
    For the specified node, returns its content as a string.
private OpenTags getOpenTags(CleanTimeValues cleanTimeValues)
CleanerProperties getProperties()
protected java.util.Set<ITagNodeCondition> getPruneTagSet(CleanTimeValues cleanTimeValues)
TagInfo getTagInfo(java.lang.String tagName, CleanTimeValues cleanTimeValues)
    Returns a TagInfo object for the specified tag name.
ITagInfoProvider getTagInfoProvider()
CleanerTransformations getTransformations()
private void handleEndTagToken(BaseToken token, java.util.ListIterator<BaseToken> nodeIterator, java.util.List nodeList, CleanTimeValues cleanTimeValues)
    Processes the rules for a new end tag token in the HTML tree.
protected void handleInterruption()
    Called whenever the thread is interrupted.
private void handleStartTagToken(BaseToken token, java.util.ListIterator<BaseToken> nodeIterator, java.util.List nodeList, CleanTimeValues cleanTimeValues)
    Processes all the rules associated with a new opening tag in the HTML tree.
void initCleanerTransformations(java.util.Map transInfos)
private boolean isAllowedAsForeignMarkup(java.lang.String tagname, CleanTimeValues cleanTimeValues)
    Checks whether we can allow a tag as "foreign markup".
private boolean isAllowedInLastOpenTag(BaseToken token, CleanTimeValues cleanTimeValues)
private static boolean isCopiedTokenEqualToNextThreeCopiedTokens(TagNode copiedStartToken, java.util.ListIterator<BaseToken> nodeIterator)
    Determines if a copied token is equal to the next 3 tokens in the iterator.
private boolean isFatalTagSatisfied(TagInfo tag, CleanTimeValues cleanTimeValues)
    Checks if open fatal tag is missing if there is a fatal tag for the specified tag.
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
private boolean isStartToken(java.lang.Object o)
(package private) void makeTree(java.util.List nodeList, java.util.ListIterator<BaseToken> nodeIterator, CleanTimeValues cleanTimeValues)
    This method generally mutates the flattened list of tokens into a tree structure.
private boolean markNodesToPrune(java.util.List nodeList, CleanTimeValues cleanTimeValues, int depth)
private boolean mustAddRequiredParent(TagInfo tag, CleanTimeValues cleanTimeValues)
    Checks if the specified tag requires a parent tag that is missing in the appropriate context.
private TagNode newTagNode(java.lang.String tagName)
private NestingState popNesting(CleanTimeValues cleanTimeValues)
private NestingState pushNesting(CleanTimeValues cleanTimeValues)
private void reopenBrokenNode(java.util.ListIterator<BaseToken> nodeIterator, TagNode toReopen, CleanTimeValues cleanTimeValues)
private void saveToLastOpenTag(java.util.List nodeList, java.lang.Object tokenToAdd, CleanTimeValues cleanTimeValues)
void setInnerHtml(TagNode node, java.lang.String content)
    For the specified tag node, defines its HTML content.
-
-
-
Field Detail
-
MARKER_ATTRIBUTE
private static final java.lang.String MARKER_ATTRIBUTE
Marker attribute added to aid with part of the cleaning process. TODO: a non-intrusive way of doing this that does not involve modifying the source html.
- See Also:
- Constant Field Values
-
HTML_4
public static int HTML_4
-
HTML_5
public static int HTML_5
-
properties
private CleanerProperties properties
-
transformations
private CleanerTransformations transformations
-
-
Constructor Detail
-
HtmlCleaner
public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider, default version and default properties.
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider)
Constructor - creates the instance with specified tag info provider and default properties.
- Parameters:
tagInfoProvider
- Provider for tag filtering and balancing
-
HtmlCleaner
public HtmlCleaner(CleanerProperties properties)
Constructor - creates the instance with default tag info provider and specified properties.
- Parameters:
properties
- Properties used during parsing and serializing
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
Constructor - creates the instance with specified tag info provider and specified properties.
- Parameters:
tagInfoProvider
- Provider for tag filtering and balancing
properties
- Properties used during parsing and serializing
-
-
Method Detail
-
clean
public TagNode clean(java.lang.String htmlContent)
-
clean
public TagNode clean(java.io.File file, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.File file) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
@Deprecated public TagNode clean(java.net.URL url, java.lang.String charset) throws java.io.IOException
Deprecated. Unmanaged network I/O does not handle proxies, slow servers, or broken connections well. The HtmlCleaner caller should manage connections itself and just provide the library with a stream.
- Parameters:
url
charset
- Throws:
java.io.IOException
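A minimal sketch of the caller-managed alternative recommended by the deprecation note, using only the JDK. The class name ManagedFetch, its method names, and the timeout values are assumptions made for illustration; they are not part of HtmlCleaner.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper: the caller owns the connection settings and the stream.
final class ManagedFetch {

    // Prepares a connection with explicit timeouts. Nothing is sent over the network here.
    static URLConnection configure(String url, int connectMillis, int readMillis) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectMillis); // fail fast on unreachable hosts
            conn.setReadTimeout(readMillis);       // fail fast on stalled servers
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Connects and returns the response body. The caller is responsible for closing it.
    static InputStream open(String url) throws IOException {
        return configure(url, 5_000, 10_000).getInputStream();
    }
}
```

The stream returned by open(...) could then be handed to clean(java.io.InputStream in, java.lang.String charset) inside a try-with-resources block.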
-
clean
@Deprecated public TagNode clean(java.net.URL url) throws java.io.IOException
Deprecated. Creates instance from the content downloaded from the specified URL. HTML encoding is resolved by trying the following in sequence: 1. reading the Content-Type response header, 2. analyzing META tags at the beginning of the HTML, 3. using the platform's default charset.
- Parameters:
url
- the url to retrieve content from
- Returns:
- the cleaned content
- Throws:
java.io.IOException
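The three-step encoding fallback described above can be sketched as follows. This is an illustration only, not HtmlCleaner's actual implementation; the class name CharsetFallback, the method resolve, and both regular expressions are assumptions.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the fallback order: header, then META tag, then platform default.
final class CharsetFallback {
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // contentTypeHeader may be null; htmlPrefix is the beginning of the document.
    static String resolve(String contentTypeHeader, String htmlPrefix) {
        String name = firstGroup(HEADER_CHARSET, contentTypeHeader); // step 1
        if (name == null) {
            name = firstGroup(META_CHARSET, htmlPrefix);             // step 2
        }
        return name != null ? name : Charset.defaultCharset().name(); // step 3
    }

    private static String firstGroup(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```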
-
clean
public TagNode clean(java.io.InputStream in, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.InputStream in) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.Reader reader) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
protected TagNode clean(java.io.Reader reader, CleanTimeValues cleanTimeValues) throws java.io.IOException
Basic version of the cleaning call.
- Parameters:
reader
- (not closed)
- Returns:
- An instance of TagNode object which is the root of the XML tree.
- Throws:
java.io.IOException
-
markNodesToPrune
private boolean markNodesToPrune(java.util.List nodeList, CleanTimeValues cleanTimeValues, int depth)
-
calculateRootNode
private void calculateRootNode(CleanTimeValues cleanTimeValues, java.util.Set<java.lang.String> namespacePrefixes)
Assigns root node to internal variable and adds necessary xmlns attributes if cleaner is namespace-aware. The root node of the result depends on the "omitHtmlEnvelope" parameter: if it is set, the first child of the body becomes the root node; otherwise html is the root node.
- Parameters:
namespacePrefixes
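A minimal sketch of the root selection rule stated above, using a tiny stand-in node type rather than TagNode. The names RootChoice, Node, and pickRoot are hypothetical, and falling back to html when the body is empty is an assumption that the description does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the "omitHtmlEnvelope" rule; not HtmlCleaner code.
final class RootChoice {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    static Node pickRoot(Node html, Node body, boolean omitHtmlEnvelope) {
        // Assumption: with an empty body there is no first child, so html is kept.
        if (omitHtmlEnvelope && !body.children.isEmpty()) {
            return body.children.get(0);
        }
        return html;
    }
}
```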
-
addAttributesToTag
private void addAttributesToTag(TagNode tag, java.util.Map<java.lang.String,java.lang.String> attributes)
Adds attributes from the specified map to the specified tag. If an attribute already exists on the tag, it is preserved.
- Parameters:
tag
attributes
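The "existing attributes are preserved" behaviour can be illustrated with plain maps. This is a sketch under the assumption that attributes behave like a String-to-String map; AttributeMerge and addMissing are hypothetical names.

```java
import java.util.Map;

// Hypothetical illustration: copy attributes without overwriting existing ones.
final class AttributeMerge {
    static void addMissing(Map<String, String> tagAttributes, Map<String, String> toAdd) {
        for (Map.Entry<String, String> entry : toAdd.entrySet()) {
            // putIfAbsent keeps the value already stored under the same name
            tagAttributes.putIfAbsent(entry.getKey(), entry.getValue());
        }
    }
}
```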
-
isFatalTagSatisfied
private boolean isFatalTagSatisfied(TagInfo tag, CleanTimeValues cleanTimeValues)
Checks if open fatal tag is missing if there is a fatal tag for the specified tag.
- Parameters:
tag
-
mustAddRequiredParent
private boolean mustAddRequiredParent(TagInfo tag, CleanTimeValues cleanTimeValues)
Checks if the specified tag requires a parent tag that is missing in the appropriate context.
- Parameters:
tag
-
newTagNode
private TagNode newTagNode(java.lang.String tagName)
-
isAllowedInLastOpenTag
private boolean isAllowedInLastOpenTag(BaseToken token, CleanTimeValues cleanTimeValues)
-
saveToLastOpenTag
private void saveToLastOpenTag(java.util.List nodeList, java.lang.Object tokenToAdd, CleanTimeValues cleanTimeValues)
-
isStartToken
private boolean isStartToken(java.lang.Object o)
-
isAllowedAsForeignMarkup
private boolean isAllowedAsForeignMarkup(java.lang.String tagname, CleanTimeValues cleanTimeValues)
Checks whether we can allow a tag as "foreign markup". This requires that namespace awareness is set to true, and that either an xmlns declaration that isn't for HTML is currently in scope, or the tag has a namespace prefix.
- Parameters:
cleanTimeValues
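The condition above reduces to a small boolean expression. The sketch below is an assumption-laden illustration: the class ForeignMarkupCheck, its parameters, and the use of a colon to detect a prefix are hypothetical and do not reflect how HtmlCleaner tracks namespaces internally.

```java
// Hypothetical restatement of the "foreign markup" rule; not HtmlCleaner code.
final class ForeignMarkupCheck {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // xmlnsInScope is the namespace URI currently in scope, or null if none is declared.
    static boolean allowed(boolean namespaceAware, String xmlnsInScope, String tagName) {
        if (!namespaceAware) {
            return false;
        }
        boolean nonHtmlNamespaceInScope = xmlnsInScope != null && !XHTML_NS.equals(xmlnsInScope);
        boolean hasPrefix = tagName.indexOf(':') > 0; // e.g. "svg:title"
        return nonHtmlNamespaceInScope || hasPrefix;
    }
}
```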
-
handleEndTagToken
private void handleEndTagToken(BaseToken token, java.util.ListIterator<BaseToken> nodeIterator, java.util.List nodeList, CleanTimeValues cleanTimeValues)
Processes the rules for a new end tag token in the HTML tree.
- Parameters:
token
nodeIterator
nodeList
cleanTimeValues
-
handleStartTagToken
private void handleStartTagToken(BaseToken token, java.util.ListIterator<BaseToken> nodeIterator, java.util.List nodeList, CleanTimeValues cleanTimeValues)
Processes all the rules associated with a new opening tag in the HTML tree.
- Parameters:
token
nodeIterator
nodeList
cleanTimeValues
-
makeTree
void makeTree(java.util.List nodeList, java.util.ListIterator<BaseToken> nodeIterator, CleanTimeValues cleanTimeValues)
This method generally mutates the flattened list of tokens into a tree structure.
- Parameters:
nodeList
nodeIterator
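To illustrate the general idea of turning a flat token list into a tree, here is a stack-based sketch over plain strings. It is not the algorithm HtmlCleaner uses: the token syntax, the class TokenTree, and the policy of ignoring stray end tags and implicitly closing open tags are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: "<x>" opens a tag, "</x>" closes it, anything else is text.
final class TokenTree {
    static final class Node {
        final String name;
        final List<Object> children = new ArrayList<>(); // Node or String (text)
        Node(String name) { this.name = name; }
    }

    static Node build(List<String> tokens) {
        Node root = new Node("#root");
        Deque<Node> open = new ArrayDeque<>(); // stack of currently open tags
        open.push(root);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                boolean isOpen = false;
                for (Node n : open) {
                    if (n != root && n.name.equals(name)) { isOpen = true; }
                }
                if (isOpen) { // stray end tags are ignored
                    while (!open.peek().name.equals(name)) { open.pop(); } // close inner tags
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                Node node = new Node(token.substring(1, token.length() - 1));
                open.peek().children.add(node);
                open.push(node);
            } else {
                open.peek().children.add(token);
            }
        }
        return root; // tags still open at the end are treated as closed
    }

    // Serializes the children of a node back to balanced markup.
    static String render(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Object child : node.children) {
            if (child instanceof Node) {
                Node c = (Node) child;
                sb.append('<').append(c.name).append('>').append(render(c))
                  .append("</").append(c.name).append('>');
            } else {
                sb.append(child);
            }
        }
        return sb.toString();
    }
}
```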
-
isCopiedTokenEqualToNextThreeCopiedTokens
private static boolean isCopiedTokenEqualToNextThreeCopiedTokens(TagNode copiedStartToken, java.util.ListIterator<BaseToken> nodeIterator)
Determines if a copied token is equal to the next 3 tokens in the iterator.
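A lookahead of this kind can be written against java.util.ListIterator as sketched below. The generic helper Lookahead.nextThreeEqual is hypothetical, compares with equals, and rewinds the iterator afterwards; whether the real method restores the iterator position is an assumption.

```java
import java.util.ListIterator;

// Hypothetical sketch: peek at the next three elements without consuming them.
final class Lookahead {
    static <T> boolean nextThreeEqual(T candidate, ListIterator<T> iterator) {
        int advanced = 0;
        boolean allEqual = true;
        while (advanced < 3) {
            if (!iterator.hasNext()) { // fewer than three elements remain
                allEqual = false;
                break;
            }
            T next = iterator.next();
            advanced++;
            if (!candidate.equals(next)) {
                allEqual = false;
                break;
            }
        }
        for (int i = 0; i < advanced; i++) {
            iterator.previous(); // rewind to the starting position
        }
        return allEqual;
    }
}
```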
-
flattenNestedList
private java.util.List<TagNode> flattenNestedList(java.util.List list)
Flattens a list of tag nodes.
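Flattening a nested list is typically done recursively. The sketch below works on arbitrary objects rather than TagNode instances; the class Flatten and its method are hypothetical and only illustrate the technique.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: depth-first flattening of lists nested to any depth.
final class Flatten {
    static List<Object> flatten(List<?> nested) {
        List<Object> out = new ArrayList<>();
        for (Object item : nested) {
            if (item instanceof List) {
                out.addAll(flatten((List<?>) item)); // recurse into sublists
            } else {
                out.add(item);
            }
        }
        return out;
    }
}
```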
-
areCopiedTokensEqual
private static boolean areCopiedTokensEqual(TagNode token1, TagNode token2)
Determines if two copied tokens are equal.
-
reopenBrokenNode
private void reopenBrokenNode(java.util.ListIterator<BaseToken> nodeIterator, TagNode toReopen, CleanTimeValues cleanTimeValues)
-
isRemovingNodeReasonablySafe
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
- Parameters:
startTagToken
- Returns:
- true if the tag has neither an id attribute nor a class attribute
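The documented return condition can be expressed over a plain attribute map. This is a sketch only; SafeRemoval and reasonablySafe are hypothetical names, and the real method works on a TagNode.

```java
import java.util.Map;

// Hypothetical restatement of the documented rule; not HtmlCleaner code.
final class SafeRemoval {
    static boolean reasonablySafe(Map<String, String> attributes) {
        // A node carrying id or class may be targeted by scripts or styles, so keep it.
        return !attributes.containsKey("id") && !attributes.containsKey("class");
    }
}
```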
-
createDocumentNodes
private void createDocumentNodes(java.util.List listNodes, CleanTimeValues cleanTimeValues)
-
closeSnippet
private java.util.List<TagNode> closeSnippet(java.util.List nodeList, TagPos tagPos, java.lang.Object toNode, CleanTimeValues cleanTimeValues)
Forced closing.
- Parameters:
nodeList
tagPos
toNode
-
closeAll
private void closeAll(java.util.List nodeList, CleanTimeValues cleanTimeValues)
Close all unclosed tags if there are any.
-
addPossibleHeadCandidate
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, CleanTimeValues cleanTimeValues)
Checks if the specified tag with the specified info is a candidate for moving to the head section.
- Parameters:
tagInfo
tagNode
-
getProperties
public CleanerProperties getProperties()
-
getPruneTagSet
protected java.util.Set<ITagNodeCondition> getPruneTagSet(CleanTimeValues cleanTimeValues)
-
getAllowTagSet
protected java.util.Set<ITagNodeCondition> getAllowTagSet(CleanTimeValues cleanTimeValues)
-
addPruneNode
protected void addPruneNode(TagNode node, CleanTimeValues cleanTimeValues)
-
getTagInfo
public TagInfo getTagInfo(java.lang.String tagName, CleanTimeValues cleanTimeValues)
Returns a TagInfo object for the specified tag name. If the tag is foreign markup, the result is left as null, because names may clash (e.g. svg:title). However, the tag is handled if it is embedded content within the correct namespace (e.g. SVG, MathML).
- Parameters:
tagName
cleanTimeValues
- Returns:
- a TagInfo object, or null if no matching TagInfo is found
-
addIfNeededToPruneSet
private boolean addIfNeededToPruneSet(TagNode tagNode, CleanTimeValues cleanTimeValues)
-
getAllTags
protected java.util.Set<java.lang.String> getAllTags(CleanTimeValues cleanTimeValues)
-
getTagInfoProvider
public ITagInfoProvider getTagInfoProvider()
- Returns:
- ITagInfoProvider instance for this HtmlCleaner
-
getTransformations
public CleanerTransformations getTransformations()
- Returns:
- Transformations defined for this instance of cleaner
-
getInnerHtml
public java.lang.String getInnerHtml(TagNode node)
For the specified node, returns its content as a string.
- Parameters:
node
- Returns:
- node's content as string
-
setInnerHtml
public void setInnerHtml(TagNode node, java.lang.String content)
For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it inside the node in place of the previous content.
- Parameters:
node
content
-
initCleanerTransformations
public void initCleanerTransformations(java.util.Map transInfos)
- Parameters:
transInfos
-
-
getOpenTags
private OpenTags getOpenTags(CleanTimeValues cleanTimeValues)
-
getChildBreaks
private ChildBreaks getChildBreaks(CleanTimeValues cleanTimeValues)
-
pushNesting
private NestingState pushNesting(CleanTimeValues cleanTimeValues)
-
popNesting
private NestingState popNesting(CleanTimeValues cleanTimeValues)
-
handleInterruption
protected void handleInterruption()
Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
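A cooperative interruption hook of this kind is usually driven by a check inside the processing loop. The sketch below is an assumption about the general pattern, not HtmlCleaner's code: the class InterruptibleLoop, its process method, and the counter are hypothetical.

```java
// Hypothetical sketch: poll the interrupt flag and call an overridable hook.
class InterruptibleLoop {
    int cleanupCalls = 0;

    // Placeholder hook; subclasses could release resources or notify the user here.
    protected void handleInterruption() {
        cleanupCalls++;
    }

    // Processes up to 'items' units of work and returns how many were completed.
    int process(int items) {
        int done = 0;
        while (done < items) {
            if (Thread.currentThread().isInterrupted()) { // does not clear the flag
                handleInterruption();
                break;
            }
            done++;
        }
        return done;
    }
}
```

Because isInterrupted() leaves the flag set, code further up the call stack can still observe the interruption.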
-
-