Class Tokenizer


  • public final class Tokenizer
    extends Object
    Tokenizer for expressions and inputs.

    This code was originally derived from James Clark's xt, though it has been greatly modified since. See copyright notice at end of file.

    • Field Detail

      • DEFAULT_STATE

        public static final int DEFAULT_STATE
        Initial default state of the Tokenizer
        See Also:
        Constant Field Values
      • BARE_NAME_STATE

        public static final int BARE_NAME_STATE
        State in which a name is NOT to be merged with what comes next, for example a following "("
        See Also:
        Constant Field Values
      • SEQUENCE_TYPE_STATE

        public static final int SEQUENCE_TYPE_STATE
        State in which the next thing to be read is a SequenceType
        See Also:
        Constant Field Values
      • OPERATOR_STATE

        public static final int OPERATOR_STATE
        State in which the next thing to be read is an operator
        See Also:
        Constant Field Values
      • currentToken

        public int currentToken
        The number identifying the most recently read token
      • currentTokenValue

        public String currentTokenValue
        The string value of the most recently read token
      • currentTokenStartOffset

        public int currentTokenStartOffset
        The position in the input expression where the current token starts
      • input

        public String input
        The string being parsed
      • inputOffset

        public int inputOffset
        The current position within the input string
      • disallowUnionKeyword

        public boolean disallowUnionKeyword
        Flag to disallow "union" as a synonym for "|" when parsing XSLT 2.0 patterns
      • isXQuery

        public boolean isXQuery
        Flag to indicate that this is XQuery as distinct from XPath
      • languageLevel

        public int languageLevel
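        XPath language level. The field is an int, so the level is presumably held in integer form (for example 20, 30, or 31 for XPath 2.0, 3.0, or 3.1)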
      • allowSaxonExtensions

        public boolean allowSaxonExtensions
        Flag to allow Saxon extensions
    • Constructor Detail

      • Tokenizer

        public Tokenizer()
    • Method Detail

      • getState

        public int getState()
        Get the current tokenizer state
        Returns:
        the current state
      • setState

        public void setState(int state)
        Set the tokenizer into a special state
        Parameters:
        state - the new state
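
The effect of a tokenizer state can be illustrated with a small, self-contained sketch. In XPath a word such as div may be either a name or an operator, depending on where it appears, so a tokenizer needs to know what kind of token it expects next. The class StateSketch, its constant values, the token kinds NAME and OPERATOR, and the classify method are all hypothetical and are not taken from this class; they only show how a state set through setState could change the classification of the same word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names or values come from the real Tokenizer.
class StateSketch {
    static final int DEFAULT_STATE = 0;   // assumed value, for illustration only
    static final int OPERATOR_STATE = 1;  // assumed value, for illustration only

    static final int NAME = 100;          // invented token kind
    static final int OPERATOR = 101;      // invented token kind

    // Words that are read as operators only when an operator is expected.
    private static final Set<String> OPERATOR_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "div", "mod", "union"));

    private int state = DEFAULT_STATE;

    int getState() {
        return state;
    }

    void setState(int state) {
        this.state = state;
    }

    // Classify a word according to the current state.
    int classify(String word) {
        if (state == OPERATOR_STATE && OPERATOR_WORDS.contains(word)) {
            return OPERATOR;
        }
        return NAME;
    }
}
```

In this sketch the word div is classified as NAME in the default state and as OPERATOR once the state has been set to OPERATOR_STATE.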
      • tokenize

        public void tokenize(String input,
                             int start,
                             int end)
                      throws XPathException
        Prepare a string for tokenization. The actual tokens are obtained by calling next().
        Parameters:
        input - the string to be tokenized
        start - start point within the string
        end - end point within the string (exclusive: the character at this position is not read); -1 means end of string
        Throws:
        XPathException - if a lexical error occurs, e.g. unmatched string quotes
      • next

        public void next()
                  throws XPathException
        Get the next token from the input expression. The token type is returned in the currentToken field, and the string value of the token in currentTokenValue.
        Throws:
        XPathException - if a lexical error is detected
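
The calling pattern described for tokenize and next (prepare the input once, then pull tokens one at a time and read the result from public fields) can be sketched as follows. ToyTokenizer is a hypothetical stand-in, not the real class: its token codes (EOF, NAME, NUMBER, SYMBOL) and its lexical rules are invented for illustration, and unlike the real methods it never throws XPathException.

```java
// Hypothetical sketch of the tokenize()/next() pattern; not the real implementation.
class ToyTokenizer {
    static final int EOF = 0, NAME = 1, NUMBER = 2, SYMBOL = 3; // invented token codes

    int currentToken = EOF;
    String currentTokenValue = null;
    int currentTokenStartOffset = 0;
    String input;
    int inputOffset;
    private int inputLength;

    // Prepare the input; an end of -1 means "read to the end of the string".
    void tokenize(String input, int start, int end) {
        this.input = input;
        this.inputOffset = start;
        this.inputLength = (end == -1) ? input.length() : end;
    }

    // Read one token and publish it through the current* fields.
    void next() {
        while (inputOffset < inputLength
                && Character.isWhitespace(input.charAt(inputOffset))) {
            inputOffset++;
        }
        currentTokenStartOffset = inputOffset;
        if (inputOffset >= inputLength) {
            currentToken = EOF;
            currentTokenValue = null;
            return;
        }
        char c = input.charAt(inputOffset);
        if (Character.isDigit(c)) {
            while (inputOffset < inputLength
                    && Character.isDigit(input.charAt(inputOffset))) {
                inputOffset++;
            }
            currentToken = NUMBER;
        } else if (Character.isLetter(c)) {
            while (inputOffset < inputLength
                    && Character.isLetterOrDigit(input.charAt(inputOffset))) {
                inputOffset++;
            }
            currentToken = NAME;
        } else {
            inputOffset++; // any other character is a one-character symbol
            currentToken = SYMBOL;
        }
        currentTokenValue = input.substring(currentTokenStartOffset, inputOffset);
    }
}
```

A caller of this sketch would invoke tokenize once, then call next in a loop until currentToken equals EOF, inspecting currentTokenValue and currentTokenStartOffset after each call.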
      • peekAhead

        int peekAhead()
        Peek ahead at the next token
      • treatCurrentAsOperator

        public void treatCurrentAsOperator()
        Force the current token to be treated as an operator if possible
      • lookAhead

        public void lookAhead()
                       throws XPathException
        Look ahead by one token. This method does the real tokenization work. The method is normally called internally, but the XQuery parser also calls it to resume normal tokenization after dealing with pseudo-XML syntax.
        Throws:
        XPathException - if a lexical error occurs
      • getBinaryOp

        int getBinaryOp(String s)
        Identify a binary operator
        Parameters:
        s - String representation of the operator - must be interned
        Returns:
        the token number of the operator, or UNKNOWN if it is not a known operator
      • nextChar

        public char nextChar()
        Read next character directly. Used by the XQuery parser when parsing pseudo-XML syntax
        Returns:
        the next character from the input, or NUL at the end of the input
      • incrementLineNumber

        public void incrementLineNumber(int offset)
        Increment the line number, making a record of where in the input string the newline character occurred.
        Parameters:
        offset - the place in the input string where the newline occurred
      • unreadChar

        public void unreadChar()
        Step back one character. If this steps back to a previous line, adjust the line number.
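
How nextChar, incrementLineNumber and unreadChar could cooperate is sketched below. CharCursor is a hypothetical class, not part of this API. It assumes that a line ends at '\n', that NUL is the character with code 0, and that reading past the end still advances the offset so that a following unreadChar is symmetric; the real class may behave differently on each of these points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of character-level reading with line tracking.
class CharCursor {
    private final String input;
    private int inputOffset = 0;
    private int lineNumber = 0; // zero-based, as in the documentation above
    private final List<Integer> newlineOffsets = new ArrayList<>();

    CharCursor(String input) {
        this.input = input;
    }

    // Return the next character, or NUL (code 0) at the end of the input.
    char nextChar() {
        int offset = inputOffset++;
        if (offset >= input.length()) {
            return (char) 0;
        }
        char c = input.charAt(offset);
        if (c == '\n') {
            incrementLineNumber(offset);
        }
        return c;
    }

    // Record where a newline occurred and move to the next line.
    void incrementLineNumber(int offset) {
        lineNumber++;
        newlineOffsets.add(offset);
    }

    // Step back one character, undoing the line increment if a newline is crossed.
    void unreadChar() {
        if (inputOffset == 0) {
            return;
        }
        inputOffset--;
        if (inputOffset < input.length() && input.charAt(inputOffset) == '\n') {
            lineNumber--;
            newlineOffsets.remove(newlineOffsets.size() - 1);
        }
    }

    int getLineNumber() {
        return lineNumber;
    }
}
```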
      • recentText

        String recentText(int offset)
        Get the most recently read text (for use in an error message)
        Parameters:
        offset - the offset of the offending token, if known, or -1 to use the current offset
        Returns:
        a chunk of text leading up to the error
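
One way to produce "a chunk of text leading up to the error" is sketched below. RecentTextSketch is hypothetical: the 20-character window, the whitespace collapsing and the "..." prefix are illustrative choices, not the behaviour of the real method. Because the sketch is static, the current offset is passed in explicitly as an extra parameter.

```java
// Hypothetical sketch of extracting recent text for an error message.
class RecentTextSketch {
    static final int MAX_CHARS = 20; // arbitrary window size, chosen for illustration

    // offset: position of the offending token, or -1 to use currentOffset instead.
    static String recentText(String input, int currentOffset, int offset) {
        int end = (offset == -1) ? currentOffset : offset;
        end = Math.min(Math.max(end, 0), input.length()); // clamp to the input
        int start = Math.max(0, end - MAX_CHARS);
        // Collapse runs of whitespace so the message stays on one line.
        String chunk = input.substring(start, end).replaceAll("\\s+", " ");
        return (start > 0 ? "..." : "") + chunk;
    }
}
```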
      • getLineNumber

        public int getLineNumber()
        Get the line number of the current token
        Returns:
        the line number. Line numbers reported by the tokenizer start at zero.
      • getColumnNumber

        public int getColumnNumber()
        Get the column number of the current token
        Returns:
        the column number. Column numbers reported by the tokenizer start at zero.
      • getLineNumber

        public int getLineNumber(int offset)
        Return the line number corresponding to a given offset in the expression
        Parameters:
        offset - the character offset in the expression (an index into the input String, not a byte count)
        Returns:
        the line number. Line and column numbers reported by the tokenizer start at zero.
      • getColumnNumber

        public int getColumnNumber(int offset)
        Return the column number corresponding to a given offset in the expression
        Parameters:
        offset - the character offset in the expression (an index into the input String, not a byte count)
        Returns:
        the column number. Line and column numbers reported by the tokenizer start at zero.
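
The mapping from an offset to zero-based line and column numbers can be sketched with a sorted array of newline offsets and a binary search. LineColumnIndex is a hypothetical helper, not part of this API. It assumes that lines are separated by '\n' and that a newline character belongs to the line it terminates; the real class may treat these cases differently.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: map a character offset to zero-based line and column numbers.
class LineColumnIndex {
    private final int[] newlineOffsets; // sorted offsets of every '\n' in the input

    LineColumnIndex(String input) {
        newlineOffsets = IntStream.range(0, input.length())
                .filter(i -> input.charAt(i) == '\n')
                .toArray();
    }

    // The line number is the count of newlines strictly before the offset.
    int getLineNumber(int offset) {
        int pos = Arrays.binarySearch(newlineOffsets, offset);
        return pos >= 0 ? pos : -(pos + 1);
    }

    // The column number is the distance from the start of the line.
    int getColumnNumber(int offset) {
        int line = getLineNumber(offset);
        int lineStart = (line == 0) ? 0 : newlineOffsets[line - 1] + 1;
        return offset - lineStart;
    }
}
```

For the input "ab\ncd", this sketch places offset 3 (the character c) at line 1, column 0, consistent with numbering that starts at zero.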