Class HtmlTokenizer


  • public class HtmlTokenizer
    extends java.lang.Object
    Main HTML tokenizer.

    Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, content (text), and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at its end.

    Created by: Vladimir Nikic.
    Date: November, 2006
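
    The four token kinds named above can be illustrated with a toy, string-based sketch. This is not the HtmlTokenizer implementation: the ToyTokenizer class, its Kind enum and its Token type are hypothetical names introduced only for illustration, it does not parse attributes, and it does not invoke a cleaner as tokens are added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of HtmlCleaner.
final class ToyTokenizer {
    enum Kind { OPEN_TAG, END_TAG, CONTENT, COMMENT }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    // Splits the input into comment, end tag, open tag and content tokens.
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.startsWith("<!--", pos)) {
                // Comment: everything up to "-->" or the end of input.
                int end = html.indexOf("-->", pos + 4);
                int stop = end < 0 ? html.length() : end;
                tokens.add(new Token(Kind.COMMENT, html.substring(pos + 4, stop)));
                pos = end < 0 ? html.length() : end + 3;
            } else if (html.startsWith("</", pos) && html.indexOf('>', pos) >= 0) {
                int end = html.indexOf('>', pos);
                tokens.add(new Token(Kind.END_TAG, html.substring(pos + 2, end).trim()));
                pos = end + 1;
            } else if (html.charAt(pos) == '<' && pos + 1 < html.length()
                    && Character.isLetter(html.charAt(pos + 1))
                    && html.indexOf('>', pos) >= 0) {
                int end = html.indexOf('>', pos);
                tokens.add(new Token(Kind.OPEN_TAG, html.substring(pos + 1, end).trim()));
                pos = end + 1;
            } else {
                // Content: runs until the next '<' after the current character.
                int next = html.indexOf('<', pos + 1);
                int stop = next < 0 ? html.length() : next;
                tokens.add(new Token(Kind.CONTENT, html.substring(pos, stop)));
                pos = stop;
            }
        }
        return tokens;
    }
}
```

    For example, ToyTokenizer.tokenize("<p>Hi<!-- c --></p>") yields an open tag, a content token, a comment and an end tag, in that order.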
    • Field Detail

      • _reader

        private java.io.BufferedReader _reader
      • _working

        private char[] _working
      • _pos

        private transient int _pos
      • _len

        private transient int _len
      • _row

        private transient int _row
      • _col

        private transient int _col
      • _saved

        private transient java.lang.StringBuffer _saved
      • _isLateForDoctype

        private transient boolean _isLateForDoctype
      • _currentTagToken

        private transient TagToken _currentTagToken
      • _tokenList

        private transient java.util.List<BaseToken> _tokenList
      • _namespacePrefixes

        private transient java.util.Set<java.lang.String> _namespacePrefixes
      • _asExpected

        private boolean _asExpected
      • _isSpecialContext

        private boolean _isSpecialContext
      • _isSpecialContextName

        private java.lang.String _isSpecialContextName
    • Constructor Detail

      • HtmlTokenizer

        public HtmlTokenizer​(HtmlCleaner cleaner,
                             java.io.Reader reader,
                             CleanTimeValues cleanTimeValues)
        Constructor. Creates an instance of the parser for the specified content.
        Parameters:
        cleaner - the cleaner invoked as tokens are added
        reader - the reader supplying the HTML content to parse
    • Method Detail

      • addToken

        private void addToken​(BaseToken token)
      • readIfNeeded

        private void readIfNeeded​(int neededChars)
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • getTokenList

        java.util.List<BaseToken> getTokenList()
      • getNamespacePrefixes

        java.util.Set<java.lang.String> getNamespacePrefixes()
      • go

        private void go()
                 throws java.io.IOException
        Throws:
        java.io.IOException
      • go

        private void go​(int step)
                 throws java.io.IOException
        Throws:
        java.io.IOException
      • startsWith

        private boolean startsWith​(java.lang.String value)
                            throws java.io.IOException
        Checks if the content starts with the specified value at the current position.
        Parameters:
        value - the string to look for
        Returns:
        true if the content starts with the specified value, false otherwise.
        Throws:
        java.io.IOException
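
        The interplay between a working buffer, a refill step and a lookahead check can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the LookaheadBuffer class is hypothetical, its working, pos and len fields only mirror the field names listed above, and the case-sensitive comparison is an assumption.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical illustration only; not part of HtmlCleaner.
final class LookaheadBuffer {
    private final Reader reader;
    private final char[] working; // fixed-size working buffer
    private int pos = 0;          // index of the current character
    private int len = 0;          // number of valid characters in working
    private boolean eof = false;

    LookaheadBuffer(Reader reader, int capacity) {
        this.reader = reader;
        this.working = new char[capacity];
    }

    // Tries to make neededChars available starting at pos by compacting
    // the unread tail to the front of the buffer and refilling the rest.
    private void readIfNeeded(int neededChars) throws IOException {
        if (eof || pos + neededChars <= len) {
            return;
        }
        int unread = len - pos;
        System.arraycopy(working, pos, working, 0, unread);
        pos = 0;
        len = unread;
        while (len < working.length) {
            int n = reader.read(working, len, working.length - len);
            if (n < 0) {
                eof = true;
                break;
            }
            len += n;
        }
    }

    // Case-sensitive lookahead (an assumption). Values longer than the
    // buffer capacity can never match in this sketch.
    boolean startsWith(String value) throws IOException {
        readIfNeeded(value.length());
        if (pos + value.length() > len) {
            return false;
        }
        for (int i = 0; i < value.length(); i++) {
            if (working[pos + i] != value.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Advances one character at a time so that refills happen as needed.
    void go(int step) throws IOException {
        for (int i = 0; i < step; i++) {
            readIfNeeded(1);
            if (pos < len) {
                pos++;
            }
        }
    }
}
```

        The lookahead never consumes input; only go moves the current position.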
      • isWhitespace

        private boolean isWhitespace​(int position)
        Checks if character at specified position is whitespace.
        Parameters:
        position -
        Returns:
        true if whitespace, false otherwise.
      • isWhitespace

        private boolean isWhitespace()
        Checks if character at current runtime position is whitespace.
        Returns:
        true if whitespace, false otherwise.
      • isChar

        private boolean isChar​(int position,
                               char ch)
        Checks if character at specified position is equal to specified char.
        Parameters:
        position -
        ch -
        Returns:
        true if equal, false otherwise.
      • isChar

        private boolean isChar​(char ch)
        Checks if character at current runtime position is equal to specified char.
        Parameters:
        ch -
        Returns:
        true if equal, false otherwise.
      • isElementIdentifierStartChar

        private boolean isElementIdentifierStartChar​(int position)
        Checks if character at specified position can be identifier start.
        Parameters:
        position -
        Returns:
        true if the character may be an identifier start, false otherwise.
      • isHtmlAttributeIdentifierStartChar

        private boolean isHtmlAttributeIdentifierStartChar()
        Checks if character at current runtime position can be identifier start.
        Returns:
        true if the character may be an identifier start, false otherwise.
      • isHtmlAttributeIdentifierChar

        private boolean isHtmlAttributeIdentifierChar()
      • isHtmlElementIdentifier

        private boolean isHtmlElementIdentifier()
      • isHtmlElementIdentifier

        private boolean isHtmlElementIdentifier​(int position)
      • isHtmlAttributeIdentifierChar

        private boolean isHtmlAttributeIdentifierChar​(int position)
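        Checks whether the character at the specified position in the stream is valid as part of an attribute identifier in HTML.
        Parameters:
        position - the position of the character to check
        Returns:
        true if the character is valid in an attribute identifier, false otherwise.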
      • isAllRead

        private boolean isAllRead()
        Checks if end of the content is reached.
      • save

        private void save​(char ch)
        Saves specified character to the temporary buffer.
        Parameters:
        ch -
      • updateCoordinates

        private void updateCoordinates​(char ch)
        Examines the given char and updates the current position coordinates. If the char is a line break, the row coordinate is incremented; otherwise the column coordinate is incremented.
        Parameters:
        ch - char to analyze.
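
        The row and column bookkeeping described above can be sketched as follows. The PositionTracker class is hypothetical, and two details are assumptions not stated in this documentation: coordinates start at 1, and the column is reset when a line break is seen.

```java
// Hypothetical illustration only; not part of HtmlCleaner.
final class PositionTracker {
    private int row = 1; // assumption: rows are 1-based
    private int col = 1; // assumption: columns are 1-based

    // A line break advances the row; any other char advances the column.
    void update(char ch) {
        if (ch == '\n') {
            row++;
            col = 1; // assumption: the column restarts on each new line
        } else {
            col++;
        }
    }

    int row() { return row; }
    int col() { return col; }
}
```

        Feeding every consumed character through update keeps the coordinates in step with the current position, which is useful for reporting where a token starts.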
      • saveCurrent

        private void saveCurrent()
        Saves character at current runtime position to the temporary buffer.
      • saveCurrent

        private void saveCurrent​(int size)
                          throws java.io.IOException
        Saves specified number of characters at current runtime position to the temporary buffer.
        Throws:
        java.io.IOException
      • skipWhitespaces

        private void skipWhitespaces()
                              throws java.io.IOException
        Skips whitespace at the current position and moves forward until a non-whitespace character is found or the end of the content is reached.
        Throws:
        java.io.IOException
      • addSavedAsContent

        private boolean addSavedAsContent()
      • start

        void start()
            throws java.io.IOException
        Starts parsing HTML.
        Throws:
        java.io.IOException
      • isReservedTag

        private boolean isReservedTag​(java.lang.String tagName)
        Checks if the specified tag name is one of the reserved tags: HTML, HEAD or BODY.
        Parameters:
        tagName - the tag name to check
        Returns:
        true if the tag name is reserved, false otherwise.
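
        A check of this kind might look like the sketch below. The ReservedTags class is hypothetical, and both the case-insensitive comparison and the handling of null are assumptions rather than documented behavior.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of HtmlCleaner.
final class ReservedTags {
    // Returns true for html, head and body, ignoring case (an assumption).
    static boolean isReservedTag(String tagName) {
        if (tagName == null) {
            return false;
        }
        String lower = tagName.toLowerCase(Locale.ROOT);
        return lower.equals("html") || lower.equals("head") || lower.equals("body");
    }
}
```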
      • tagStart

        private void tagStart()
                       throws java.io.IOException
        Parses start of the tag. It expects that current position is at the "<" after which the tag's name follows.
        Throws:
        java.io.IOException
      • tagEnd

        private void tagEnd()
                     throws java.io.IOException
        Parses the end of the tag. It expects that the current position is at the "<", after which "/" and the tag's name follow.
        Throws:
        java.io.IOException
      • identifier

        private java.lang.String identifier​(boolean attribute)
                                     throws java.io.IOException
        Parses an identifier from the current position.
        Throws:
        java.io.IOException
      • tagAttributes

        private void tagAttributes()
                            throws java.io.IOException
        Parses the list of tag attributes from the current position.
        Throws:
        java.io.IOException
      • attributeValue

        private java.lang.String attributeValue()
                                         throws java.io.IOException
        Parses a single tag attribute, which is expected to be in one of the forms: name=value, name="value", name='value', or name.
        Throws:
        java.io.IOException
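
        The four attribute forms listed above can be recognized with a small string-based parser such as the sketch below. It is not the actual implementation: the AttributeSketch class is hypothetical, storing an empty string for a bare name is an assumption, and whitespace around "=" is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of HtmlCleaner.
final class AttributeSketch {
    // Parses name=value, name="value", name='value' and bare name forms.
    static Map<String, String> parse(String s) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = s.length();
        int pos = 0;
        while (pos < n) {
            // Skip whitespace between attributes.
            while (pos < n && Character.isWhitespace(s.charAt(pos))) {
                pos++;
            }
            int nameStart = pos;
            while (pos < n && !Character.isWhitespace(s.charAt(pos)) && s.charAt(pos) != '=') {
                pos++;
            }
            if (pos == nameStart) {
                if (pos < n) {
                    pos++; // skip a stray '=' that has no name before it
                }
                continue;
            }
            String name = s.substring(nameStart, pos);
            String value = ""; // assumption: a bare name maps to an empty value
            if (pos < n && s.charAt(pos) == '=') {
                pos++;
                if (pos < n && (s.charAt(pos) == '"' || s.charAt(pos) == '\'')) {
                    // Quoted value: read up to the matching quote or end of input.
                    char quote = s.charAt(pos++);
                    int end = s.indexOf(quote, pos);
                    if (end < 0) {
                        end = n;
                    }
                    value = s.substring(pos, end);
                    pos = Math.min(end + 1, n);
                } else {
                    // Unquoted value: read up to the next whitespace.
                    int start = pos;
                    while (pos < n && !Character.isWhitespace(s.charAt(pos))) {
                        pos++;
                    }
                    value = s.substring(start, pos);
                }
            }
            attrs.put(name, value);
        }
        return attrs;
    }
}
```

        A LinkedHashMap is used so that attributes come back in the order they appear in the tag.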
      • content

        private boolean content()
                         throws java.io.IOException
        Throws:
        java.io.IOException
      • isTagStartOrEnd

        private boolean isTagStartOrEnd()
                                 throws java.io.IOException
        Not every '<' (lt) symbol marks a tag start or end; for example, '<' can be part of a mathematical expression. To avoid breaking content at a false tag boundary, this method is used to determine where a content token ends.
        Returns:
        true if current position is tag start or end.
        Throws:
        java.io.IOException
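
        One possible heuristic for this decision is sketched below. The TagStartHeuristic class is hypothetical, and the rule it applies (a '<' counts only when followed by '/', '!', '?' or a letter) is an assumption about what a reasonable check looks like, not a statement of the rule this class uses.

```java
// Hypothetical illustration only; not part of HtmlCleaner.
final class TagStartHeuristic {
    // Treats '<' as markup only when the next character could begin a
    // closing tag, a declaration or comment, a processing instruction,
    // or an element name.
    static boolean isTagStartOrEnd(String s, int pos) {
        if (pos < 0 || pos + 1 >= s.length() || s.charAt(pos) != '<') {
            return false;
        }
        char next = s.charAt(pos + 1);
        return next == '/' || next == '!' || next == '?' || Character.isLetter(next);
    }
}
```

        Under this rule the '<' in "a < b" and "x<3" stays part of the text, while "<b>", "</b>" and "<!-- c -->" are treated as markup.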
      • ignoreUntil

        private void ignoreUntil​(char ch)
                          throws java.io.IOException
        Throws:
        java.io.IOException
      • comment

        private void comment()
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • cdata

        private void cdata()
                    throws java.io.IOException
        Throws:
        java.io.IOException
      • doctype

        private void doctype()
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • handleInterruption

        private void handleInterruption()
        Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
      • containsEndCData

        private boolean containsEndCData()
                                  throws java.io.IOException
        Throws:
        java.io.IOException