Package org.htmlcleaner
Class HtmlTokenizer
java.lang.Object
org.htmlcleaner.HtmlTokenizer
Main HTML tokenizer.
Date: November, 2006
Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, content (text), and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at the end.
Created by: Vladimir Nikic.
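The four token kinds named above can be illustrated with a small stand-alone sketch. This is not the HtmlCleaner source: the class MiniTokenizerSketch, its Kind enum, and the tokenize method are hypothetical names, and the rule used to recognize a tag (a "<" followed by a letter) is a simplifying assumption. Attributes, doctype and CDATA handling are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of org.htmlcleaner. */
final class MiniTokenizerSketch {
    enum Kind { OPEN_TAG, END_TAG, CONTENT, COMMENT }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    /** Splits the input into open tag, end tag, content and comment tokens. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder saved = new StringBuilder(); // pending text content
        int pos = 0;
        while (pos < html.length()) {
            char ch = html.charAt(pos);
            int close = (ch == '<') ? html.indexOf('>', pos) : -1;
            if (html.startsWith("<!--", pos)) {
                flush(saved, tokens);
                int end = html.indexOf("-->", pos + 4);
                if (end < 0) end = html.length(); // unterminated comment: take the rest
                tokens.add(new Token(Kind.COMMENT, html.substring(pos + 4, end)));
                pos = Math.min(end + 3, html.length());
            } else if (close > pos && html.startsWith("</", pos)) {
                flush(saved, tokens);
                tokens.add(new Token(Kind.END_TAG, html.substring(pos + 2, close).trim()));
                pos = close + 1;
            } else if (close > pos && Character.isLetter(html.charAt(pos + 1))) {
                flush(saved, tokens);
                tokens.add(new Token(Kind.OPEN_TAG, html.substring(pos + 1, close).trim()));
                pos = close + 1;
            } else {
                // Plain text, including a '<' that does not start a tag.
                saved.append(ch);
                pos++;
            }
        }
        flush(saved, tokens);
        return tokens;
    }

    /** Emits the saved text as a CONTENT token, if there is any. */
    private static void flush(StringBuilder saved, List<Token> tokens) {
        if (saved.length() > 0) {
            tokens.add(new Token(Kind.CONTENT, saved.toString()));
            saved.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>1 < 2<!-- c --></p>"));
    }
}
```

Text is buffered and only turned into a content token when a tag or comment begins, so consecutive characters end up in a single token.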
-
Field Summary
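Fields
Modifier and Type    Field    Description
private boolean    _asExpected
private int    _col
private TagToken    _currentTagToken
private DoctypeToken    _docType
private boolean    _isLateForDoctype
private boolean    _isSpecialContext
private String    _isSpecialContextName
private int    _len
private int    _pos
private BufferedReader    _reader
private int    _row
private StringBuffer    _saved
private char[]    _working
private HtmlCleaner    cleaner
private CleanTimeValues    cleanTimeValues
private CleanerProperties    props
private CleanerTransformations    transformations
private static final int    WORKING_BUFFER_SIZE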
-
Constructor Summary
Constructors
Constructor    Description
HtmlTokenizer(HtmlCleaner cleaner, Reader reader, CleanTimeValues cleanTimeValues)    Constructor - creates an instance of the parser with the specified content.
-
Method Summary
Modifier and Type    Method    Description
private boolean    addSavedAsContent()
private void    addToken
private String    attributeValue    Parses a single tag attribute - it is expected to be in one of the forms: name=value, name="value", name='value', name.
private void    cdata()
private void    comment()
private boolean    containsEndCData
private boolean    content()
private void    doctype()
private void    go()
private void    go(int step)
private void    handleInterruption()    Called whenever the thread is interrupted.
private String    identifier(boolean attribute)    Parses an identifier from the current position.
private void    ignoreUntil(char ch)
private boolean    isAllRead()    Checks if the end of the content is reached.
private boolean    isChar(char ch)    Checks if the character at the current runtime position is equal to the specified char.
private boolean    isChar(int position, char ch)    Checks if the character at the specified position is equal to the specified char.
private boolean    isElementIdentifierStartChar(int position)    Checks if the character at the specified position can be an identifier start.
private boolean    isHtmlAttributeIdentifierChar()
private boolean    isHtmlAttributeIdentifierChar(int position)    Checks whether the character at the specified position in the stream is a valid character for part of an attribute identifier in HTML.
private boolean    isHtmlAttributeIdentifierStartChar()    Checks if the character at the current runtime position can be an identifier start.
private boolean    isHtmlElementIdentifier()
private boolean    isHtmlElementIdentifier(int position)
private boolean    isReservedTag(String tagName)    Checks if the specified tag name is one of the reserved tags: HTML, HEAD or BODY.
private boolean    isTagStartOrEnd    Not all '<' (lt) symbols mean tag start or end.
private boolean    isWhitespace()    Checks if the character at the current runtime position is whitespace.
private boolean    isWhitespace(int position)    Checks if the character at the specified position is whitespace.
private void    readIfNeeded(int neededChars)
private void    save(char ch)    Saves the specified character to the temporary buffer.
private void    saveCurrent()    Saves the character at the current runtime position to the temporary buffer.
private void    saveCurrent(int size)    Saves the specified number of characters at the current runtime position to the temporary buffer.
private void    skipWhitespaces    Skips whitespace at the current position and moves forward until a non-whitespace character is found or the end of the content is reached.
(package private) void    start()    Starts parsing HTML.
private boolean    startsWith(String value)    Checks if the content starts with the specified value at the current position.
private void    tagAttributes    Parses the list of tag attributes from the current position.
private void    tagEnd()    Parses end of the tag.
private void    tagStart()    Parses start of the tag.
private void    updateCoordinates(char ch)    Looks at the char passed and updates the current position coordinates.
-
Field Details
-
WORKING_BUFFER_SIZE
private static final int WORKING_BUFFER_SIZE
-
_reader
-
_working
private char[] _working
-
_pos
private transient int _pos
-
_len
private transient int _len
-
_row
private transient int _row
-
_col
private transient int _col
-
_saved
-
_isLateForDoctype
private transient boolean _isLateForDoctype
-
_docType
-
_currentTagToken
-
_tokenList
-
_namespacePrefixes
-
_asExpected
private boolean _asExpected
-
_isSpecialContext
private boolean _isSpecialContext
-
_isSpecialContextName
-
cleaner
-
props
-
transformations
-
cleanTimeValues
-
-
Constructor Details
-
HtmlTokenizer
Constructor - creates an instance of the parser with the specified content.
- Parameters:
cleaner -
reader -
-
-
Method Details
-
addToken
-
readIfNeeded
- Throws:
IOException
-
getTokenList
-
getNamespacePrefixes
-
go
- Throws:
IOException
-
go
- Throws:
IOException
-
startsWith
Checks if the content starts with the specified value at the current position.
- Parameters:
value -
- Returns:
true if the content starts with the specified value, false otherwise.
- Throws:
IOException
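A look-ahead check of this kind needs enough characters buffered before it can compare them, which is presumably the role of readIfNeeded. The sketch below shows one way to combine the two over a Reader. It is an assumption-based illustration, not the HtmlCleaner implementation: the class LookaheadSketch, the advance method, the small growable buffer, the case-sensitive comparison, and the wrapping of IOException in UncheckedIOException are all choices made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical illustration only; not part of org.htmlcleaner. */
final class LookaheadSketch {
    private final Reader reader;
    private char[] buf = new char[8]; // working buffer, grows on demand
    private int pos = 0;              // current position inside buf
    private int len = 0;              // number of valid chars in buf
    private boolean eof = false;

    LookaheadSketch(Reader reader) { this.reader = reader; }

    /** Buffers at least `needed` chars after pos, unless the input ends first. */
    private void readIfNeeded(int needed) {
        try {
            while (!eof && len - pos < needed) {
                if (pos > 0) { // drop already consumed chars
                    System.arraycopy(buf, pos, buf, 0, len - pos);
                    len -= pos;
                    pos = 0;
                }
                if (len == buf.length) {
                    buf = Arrays.copyOf(buf, buf.length * 2);
                }
                int n = reader.read(buf, len, buf.length - len);
                if (n < 0) eof = true; else len += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the unread input begins with `value` (case-sensitive here). */
    boolean startsWith(String value) {
        readIfNeeded(value.length());
        if (len - pos < value.length()) return false;
        for (int i = 0; i < value.length(); i++) {
            if (buf[pos + i] != value.charAt(i)) return false;
        }
        return true;
    }

    /** Moves the current position forward by `step` chars, stopping at end of input. */
    void advance(int step) {
        readIfNeeded(step);
        pos = Math.min(pos + step, len);
    }

    public static void main(String[] args) {
        LookaheadSketch la = new LookaheadSketch(new StringReader("<!--comment-->"));
        System.out.println(la.startsWith("<!--"));
    }
}
```

Because the check never moves the position, a caller can test several prefixes in turn (comment, doctype, end tag) and only advance once one of them matches.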
-
isWhitespace
private boolean isWhitespace(int position)
Checks if the character at the specified position is whitespace.
- Parameters:
position -
- Returns:
true if whitespace, false otherwise.
-
isWhitespace
private boolean isWhitespace()
Checks if the character at the current runtime position is whitespace.
- Returns:
true if whitespace, false otherwise.
-
isChar
private boolean isChar(int position, char ch)
Checks if the character at the specified position is equal to the specified char.
- Parameters:
position -
ch -
- Returns:
true if equal, false otherwise.
-
isChar
private boolean isChar(char ch)
Checks if the character at the current runtime position is equal to the specified char.
- Parameters:
ch -
- Returns:
true if equal, false otherwise.
-
isElementIdentifierStartChar
private boolean isElementIdentifierStartChar(int position)
Checks if the character at the specified position can be an identifier start.
- Parameters:
position -
- Returns:
true if it may be an identifier start, false otherwise.
-
isHtmlAttributeIdentifierStartChar
private boolean isHtmlAttributeIdentifierStartChar()
Checks if the character at the current runtime position can be an identifier start.
- Returns:
true if it may be an identifier start, false otherwise.
-
isHtmlAttributeIdentifierChar
private boolean isHtmlAttributeIdentifierChar()
-
isHtmlElementIdentifier
private boolean isHtmlElementIdentifier()
-
isHtmlElementIdentifier
private boolean isHtmlElementIdentifier(int position)
-
isHtmlAttributeIdentifierChar
private boolean isHtmlAttributeIdentifierChar(int position)
Checks whether the character at the specified position in the stream is a valid character for part of an attribute identifier in HTML.
- Parameters:
position -
- Returns:
-
isAllRead
private boolean isAllRead()
Checks if the end of the content is reached.
-
save
private void save(char ch)
Saves the specified character to the temporary buffer.
- Parameters:
ch -
-
updateCoordinates
private void updateCoordinates(char ch)
Looks at the char passed and updates the current position coordinates. If the char is a line break, increments the row coordinate; otherwise, increments the col coordinate.
- Parameters:
ch - char to analyze.
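The row/col bookkeeping described here can be sketched as follows. The class CoordinateTrackerSketch and its accessors are hypothetical names, and two details are assumptions of this sketch rather than documented behavior: coordinates start at 1, and the column is reset to 1 after a line break.

```java
/** Hypothetical illustration only; not part of org.htmlcleaner. */
final class CoordinateTrackerSketch {
    private int row = 1; // assumed to be 1-based
    private int col = 1; // assumed to be 1-based

    /** Advances the coordinates past the given character. */
    void update(char ch) {
        if (ch == '\n') {
            row++;
            col = 1; // assumption: a new line restarts the column count
        } else {
            col++;
        }
    }

    int row() { return row; }
    int col() { return col; }

    public static void main(String[] args) {
        CoordinateTrackerSketch tracker = new CoordinateTrackerSketch();
        for (char c : "ab\ncd".toCharArray()) {
            tracker.update(c);
        }
        System.out.println(tracker.row() + ":" + tracker.col());
    }
}
```

Tracking coordinates per character like this lets each token record where it started, which is useful for error reporting.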
-
saveCurrent
private void saveCurrent()
Saves the character at the current runtime position to the temporary buffer.
-
saveCurrent
Saves the specified number of characters at the current runtime position to the temporary buffer.
- Throws:
IOException
-
skipWhitespaces
Skips whitespace at the current position and moves forward until a non-whitespace character is found or the end of the content is reached.
- Throws:
IOException
-
addSavedAsContent
private boolean addSavedAsContent()
-
start
Starts parsing HTML.
- Throws:
IOException
-
isReservedTag
Checks if the specified tag name is one of the reserved tags: HTML, HEAD or BODY.
- Parameters:
tagName -
- Returns:
-
tagStart
Parses start of the tag. It expects that the current position is at the "<" after which the tag's name follows.
- Throws:
IOException
-
tagEnd
Parses end of the tag. It expects that the current position is at the "<" after which "/" and the tag's name follow.
- Throws:
IOException
-
identifier
Parses an identifier from the current position.
- Throws:
IOException
-
tagAttributes
Parses the list of tag attributes from the current position.
- Throws:
IOException
-
attributeValue
Parses a single tag attribute - it is expected to be in one of the forms:
name=value
name="value"
name='value'
name
- Throws:
IOException
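The four attribute forms can be handled by a short routine like the one below. It is a sketch under stated assumptions, not the HtmlCleaner code: the class AttributeSketch and its parse method are hypothetical, the input is a single attribute given as a String, and a bare name is mapped to an empty value (how the library treats that case is not stated in this page).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical illustration only; not part of org.htmlcleaner. */
final class AttributeSketch {
    /** Parses one attribute written as name=value, name="value", name='value' or name. */
    static Map.Entry<String, String> parse(String src) {
        int n = src.length();
        int i = 0;
        // The name runs up to '=' or whitespace.
        while (i < n && src.charAt(i) != '=' && !Character.isWhitespace(src.charAt(i))) {
            i++;
        }
        String name = src.substring(0, i);
        if (i >= n || src.charAt(i) != '=') {
            return new SimpleEntry<>(name, ""); // bare name: empty value is an assumption
        }
        i++; // skip '='
        if (i < n && (src.charAt(i) == '"' || src.charAt(i) == '\'')) {
            char quote = src.charAt(i++);
            int end = src.indexOf(quote, i);
            if (end < 0) end = n; // missing closing quote: take the rest
            return new SimpleEntry<>(name, src.substring(i, end));
        }
        // Unquoted value runs up to the next whitespace.
        int start = i;
        while (i < n && !Character.isWhitespace(src.charAt(i))) {
            i++;
        }
        return new SimpleEntry<>(name, src.substring(start, i));
    }

    public static void main(String[] args) {
        System.out.println(parse("title=\"b c\""));
    }
}
```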
-
content
- Throws:
IOException
-
isTagStartOrEnd
Not all '<' (lt) symbols mean tag start or end. For example, '<' can be part of a mathematical expression. To avoid false breaks of content tags, use this method to determine content tag end.
- Returns:
true if the current position is a tag start or end.
- Throws:
IOException
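One possible heuristic for telling a real tag from a stray "<" is to look at the character that follows it. The sketch below is an assumption, not the rule HtmlCleaner applies: the class TagStartSketch and the method looksLikeTag are hypothetical, and the accepted follow-up characters (a letter, "/" plus a letter, "!" or "?") were chosen for this example.

```java
/** Hypothetical illustration only; not part of org.htmlcleaner. */
final class TagStartSketch {
    /** Returns true if the '<' at pos plausibly begins a tag, comment or declaration. */
    static boolean looksLikeTag(String s, int pos) {
        if (pos < 0 || pos + 1 >= s.length() || s.charAt(pos) != '<') {
            return false;
        }
        char next = s.charAt(pos + 1);
        if (next == '/') {
            // End tag: "</" must be followed by a letter.
            return pos + 2 < s.length() && Character.isLetter(s.charAt(pos + 2));
        }
        // Start tag, comment/doctype ("<!") or processing instruction ("<?").
        return Character.isLetter(next) || next == '!' || next == '?';
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTag("a < b", 2));
        System.out.println(looksLikeTag("<p>", 0));
    }
}
```

With a check like this, text such as "1 < 2" stays inside the surrounding content token instead of being split at the "<".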
-
ignoreUntil
- Throws:
IOException
-
comment
- Throws:
IOException
-
cdata
- Throws:
IOException
-
doctype
- Throws:
IOException
-
getDocType
-
handleInterruption
private void handleInterruption()
Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
-
containsEndCData
- Throws:
IOException
-