Class HtmlTokenizer

java.lang.Object
org.htmlcleaner.HtmlTokenizer

public class HtmlTokenizer extends Object
Main HTML tokenizer.

Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, content (text) and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at the end.
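
The token kinds listed above can be illustrated with a minimal, self-contained sketch. It is not the HtmlCleaner implementation: the class `MiniTokenizerSketch`, its `Token` type and the rule used to recognise tags are all hypothetical, it works on a whole `String` rather than a `Reader`, and it does not invoke any cleaner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified tokenizer; it is NOT the HtmlCleaner implementation.
final class MiniTokenizerSketch {
    enum Kind { OPEN_TAG, END_TAG, TEXT, COMMENT }

    static final class Token {
        final Kind kind;
        final String value; // raw tag body, text, or comment body
        Token(Kind kind, String value) { this.kind = kind; this.value = value; }
        @Override public String toString() { return kind + "(" + value + ")"; }
    }

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder(); // pending text content
        int pos = 0;
        while (pos < html.length()) {
            if (html.startsWith("<!--", pos)) {
                flushText(text, tokens);
                int end = html.indexOf("-->", pos + 4);
                int stop = (end < 0) ? html.length() : end; // unterminated comment runs to the end
                tokens.add(new Token(Kind.COMMENT, html.substring(pos + 4, stop)));
                pos = (end < 0) ? html.length() : end + 3;
            } else if (isTagAt(html, pos)) {
                flushText(text, tokens);
                int close = html.indexOf('>', pos);
                boolean isEnd = html.charAt(pos + 1) == '/';
                String body = html.substring(pos + (isEnd ? 2 : 1), close).trim();
                tokens.add(new Token(isEnd ? Kind.END_TAG : Kind.OPEN_TAG, body));
                pos = close + 1;
            } else {
                text.append(html.charAt(pos++)); // plain content, including a stray '<'
            }
        }
        flushText(text, tokens);
        return tokens;
    }

    // Assumption: a tag is '<' or '</' followed by a letter, with a closing '>' somewhere after it.
    private static boolean isTagAt(String s, int pos) {
        if (s.charAt(pos) != '<' || s.indexOf('>', pos) < 0) return false;
        int nameStart = pos + 1;
        if (nameStart < s.length() && s.charAt(nameStart) == '/') nameStart++;
        return nameStart < s.length() && Character.isLetter(s.charAt(nameStart));
    }

    private static void flushText(StringBuilder text, List<Token> tokens) {
        if (text.length() > 0) {
            tokens.add(new Token(Kind.TEXT, text.toString()));
            text.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>a < b</p><!-- c -->"));
    }
}
```

For the input in `main`, the sketch yields four tokens: an open tag `p`, the text `a < b`, an end tag `p` and a comment.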

Created by: Vladimir Nikic.
Date: November, 2006
  • Field Details

    • WORKING_BUFFER_SIZE

      private static final int WORKING_BUFFER_SIZE
    • _reader

      private BufferedReader _reader
    • _working

      private char[] _working
    • _pos

      private transient int _pos
    • _len

      private transient int _len
    • _row

      private transient int _row
    • _col

      private transient int _col
    • _saved

      private transient StringBuffer _saved
    • _isLateForDoctype

      private transient boolean _isLateForDoctype
    • _docType

      private transient DoctypeToken _docType
    • _currentTagToken

      private transient TagToken _currentTagToken
    • _tokenList

      private transient List<BaseToken> _tokenList
    • _namespacePrefixes

      private transient Set<String> _namespacePrefixes
    • _asExpected

      private boolean _asExpected
    • _isSpecialContext

      private boolean _isSpecialContext
    • _isSpecialContextName

      private String _isSpecialContextName
    • cleaner

      private HtmlCleaner cleaner
    • props

      private CleanerProperties props
    • transformations

      private CleanerTransformations transformations
    • cleanTimeValues

      private CleanTimeValues cleanTimeValues
  • Constructor Details

    • HtmlTokenizer

      public HtmlTokenizer(HtmlCleaner cleaner, Reader reader, CleanTimeValues cleanTimeValues)
      Constructor - creates an instance of the parser with the specified content.
      Parameters:
      cleaner -
      reader -
  • Method Details

    • addToken

      private void addToken(BaseToken token)
    • readIfNeeded

      private void readIfNeeded(int neededChars) throws IOException
      Throws:
      IOException
    • getTokenList

      List<BaseToken> getTokenList()
    • getNamespacePrefixes

      Set<String> getNamespacePrefixes()
    • go

      private void go() throws IOException
      Throws:
      IOException
    • go

      private void go(int step) throws IOException
      Throws:
      IOException
    • startsWith

      private boolean startsWith(String value) throws IOException
      Checks if content starts with specified value at the current position.
      Parameters:
      value -
      Returns:
      true if starts with specified value, false otherwise.
      Throws:
      IOException
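
A minimal sketch of a prefix check at a given buffer position. It assumes the whole input is already held in a `char[]`, so it needs no I/O; the declared `IOException` suggests the real method may read more input first, which is not modelled here. The class name and parameters are hypothetical, and the comparison is case-sensitive by assumption.

```java
// Hypothetical helper; not the HtmlCleaner implementation.
final class StartsWithSketch {
    /**
     * Returns true if the first len chars of buffer contain value starting at pos.
     * Assumption: the input is fully buffered and the comparison is case-sensitive.
     */
    static boolean startsWith(char[] buffer, int len, int pos, String value) {
        if (pos < 0 || pos + value.length() > len) {
            return false; // not enough characters left to match
        }
        for (int i = 0; i < value.length(); i++) {
            if (buffer[pos + i] != value.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```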
    • isWhitespace

      private boolean isWhitespace(int position)
      Checks if character at specified position is whitespace.
      Parameters:
      position -
      Returns:
      true if whitespace, false otherwise.
    • isWhitespace

      private boolean isWhitespace()
      Checks if character at current runtime position is whitespace.
      Returns:
      true if whitespace, false otherwise.
    • isChar

      private boolean isChar(int position, char ch)
      Checks if character at specified position is equal to specified char.
      Parameters:
      position -
      ch -
      Returns:
      true if equal, false otherwise.
    • isChar

      private boolean isChar(char ch)
      Checks if character at current runtime position is equal to specified char.
      Parameters:
      ch -
      Returns:
      true if equal, false otherwise.
    • isElementIdentifierStartChar

      private boolean isElementIdentifierStartChar(int position)
      Checks if character at specified position can be identifier start.
      Parameters:
      position -
      Returns:
      true if the character may be an identifier start, false otherwise.
    • isHtmlAttributeIdentifierStartChar

      private boolean isHtmlAttributeIdentifierStartChar()
      Checks if character at current runtime position can be identifier start.
      Returns:
      true if the character may be an identifier start, false otherwise.
    • isHtmlAttributeIdentifierChar

      private boolean isHtmlAttributeIdentifierChar()
    • isHtmlElementIdentifier

      private boolean isHtmlElementIdentifier()
    • isHtmlElementIdentifier

      private boolean isHtmlElementIdentifier(int position)
    • isHtmlAttributeIdentifierChar

      private boolean isHtmlAttributeIdentifierChar(int position)
      Checks whether the character at the specified position in the stream is a valid character for part of an attribute identifier in HTML.
      Parameters:
      position -
      Returns:
      true if the character is valid as part of an attribute identifier, false otherwise.
    • isAllRead

      private boolean isAllRead()
      Checks if end of the content is reached.
    • save

      private void save(char ch)
      Saves specified character to the temporary buffer.
      Parameters:
      ch -
    • updateCoordinates

      private void updateCoordinates(char ch)
      Examines the given char and updates the current position coordinates. If the char is a line break, increments the row coordinate; otherwise, increments the col coordinate.
      Parameters:
      ch - char to analyze.
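
A minimal sketch of this row/col bookkeeping. The class `CoordinatesSketch` is hypothetical. Two details are assumptions not stated above: coordinates start at 1, and the column is reset to 1 after a line break.

```java
// Hypothetical position tracker; not the HtmlCleaner implementation.
final class CoordinatesSketch {
    int row = 1; // assumption: 1-based rows
    int col = 1; // assumption: 1-based columns

    void update(char ch) {
        if (ch == '\n') {
            row++;
            col = 1; // assumption: a new line restarts the column count
        } else {
            col++;
        }
    }
}
```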
    • saveCurrent

      private void saveCurrent()
      Saves character at current runtime position to the temporary buffer.
    • saveCurrent

      private void saveCurrent(int size) throws IOException
      Saves specified number of characters at current runtime position to the temporary buffer.
      Throws:
      IOException
    • skipWhitespaces

      private void skipWhitespaces() throws IOException
      Skips whitespace at the current position and moves forward until a non-whitespace character is found or the end of content is reached.
      Throws:
      IOException
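
A minimal sketch of the skipping loop over an in-memory string. The helper is hypothetical: it returns the new position instead of updating internal state, and it uses `Character.isWhitespace` as the whitespace test, which is an assumption.

```java
// Hypothetical helper; not the HtmlCleaner implementation.
final class SkipWhitespaceSketch {
    /** Returns the index of the first non-whitespace char at or after pos, or s.length() if none. */
    static int skipWhitespaces(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }
}
```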
    • addSavedAsContent

      private boolean addSavedAsContent()
    • start

      void start() throws IOException
      Starts parsing HTML.
      Throws:
      IOException
    • isReservedTag

      private boolean isReservedTag(String tagName)
      Checks if the specified tag name is one of the reserved tags: HTML, HEAD or BODY.
      Parameters:
      tagName -
      Returns:
      true if the tag name is HTML, HEAD or BODY, false otherwise.
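
A minimal sketch of such a check. The helper class is hypothetical, and the case-insensitive comparison and the handling of `null` are assumptions.

```java
// Hypothetical helper; not the HtmlCleaner implementation.
final class ReservedTagSketch {
    static boolean isReservedTag(String tagName) {
        if (tagName == null) {
            return false; // assumption: null is simply not reserved
        }
        // Assumption: tag names are compared case-insensitively.
        return tagName.equalsIgnoreCase("html")
                || tagName.equalsIgnoreCase("head")
                || tagName.equalsIgnoreCase("body");
    }
}
```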
    • tagStart

      private void tagStart() throws IOException
      Parses start of the tag. It expects that current position is at the "<" after which the tag's name follows.
      Throws:
      IOException
    • tagEnd

      private void tagEnd() throws IOException
      Parses end of the tag. It expects that current position is at the "<" after which "/" and the tag's name follows.
      Throws:
      IOException
    • identifier

      private String identifier(boolean attribute) throws IOException
      Parses an identifier from the current position.
      Throws:
      IOException
    • tagAttributes

      private void tagAttributes() throws IOException
      Parses the list of tag attributes from the current position.
      Throws:
      IOException
    • attributeValue

      private String attributeValue() throws IOException
      Parses a single tag attribute - it is expected to be in one of the forms: name=value, name="value", name='value' or name.
      Throws:
      IOException
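
The four attribute forms can be illustrated with a minimal sketch that splits one attribute string into name and value. The helper is hypothetical and works on an isolated string instead of the tokenizer's buffer. Returning an empty value for a bare name is an assumption.

```java
// Hypothetical helper; not the HtmlCleaner implementation.
final class AttributeSketch {
    /** Returns {name, value}; value is "" for a bare name (assumption). */
    static String[] parse(String attribute) {
        int eq = attribute.indexOf('=');
        if (eq < 0) {
            return new String[] { attribute.trim(), "" }; // form: name
        }
        String name = attribute.substring(0, eq).trim();
        String raw = attribute.substring(eq + 1).trim();
        if (raw.length() >= 2) {
            char quote = raw.charAt(0);
            boolean quoted = (quote == '"' || quote == '\'')
                    && raw.charAt(raw.length() - 1) == quote;
            if (quoted) {
                raw = raw.substring(1, raw.length() - 1); // forms: name="value", name='value'
            }
        }
        return new String[] { name, raw }; // unquoted form: name=value
    }
}
```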
    • content

      private boolean content() throws IOException
      Throws:
      IOException
    • isTagStartOrEnd

      private boolean isTagStartOrEnd() throws IOException
      Not all '<' (lt) symbols mean tag start or end. For example, '<' can be part of a mathematical expression. To avoid false breaks of content tags, use this method to determine content tag end.
      Returns:
      true if current position is tag start or end.
      Throws:
      IOException
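
One possible heuristic for telling a tag from a stray '<' is sketched below. The rule used here ('<' or '</' immediately followed by a letter) is an assumption and may differ from the actual check; the helper class and its parameters are hypothetical.

```java
// Hypothetical helper; not the HtmlCleaner implementation.
final class TagStartSketch {
    static boolean isTagStartOrEnd(String s, int pos) {
        if (pos < 0 || pos + 1 >= s.length() || s.charAt(pos) != '<') {
            return false;
        }
        char next = s.charAt(pos + 1);
        if (next == '/') {
            // Possible end tag: require a letter after "</".
            return pos + 2 < s.length() && Character.isLetter(s.charAt(pos + 2));
        }
        // Possible start tag: require a letter right after '<'.
        return Character.isLetter(next);
    }
}
```

With this rule, the '<' in `a < b` or `1<2` is treated as content, while `<p>` and `</p>` are treated as tags.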
    • ignoreUntil

      private void ignoreUntil(char ch) throws IOException
      Throws:
      IOException
    • comment

      private void comment() throws IOException
      Throws:
      IOException
    • cdata

      private void cdata() throws IOException
      Throws:
      IOException
    • doctype

      private void doctype() throws IOException
      Throws:
      IOException
    • getDocType

      public DoctypeToken getDocType()
    • handleInterruption

      private void handleInterruption()
      Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
    • containsEndCData

      private boolean containsEndCData() throws IOException
      Throws:
      IOException