Class TextTokenizer


  • public class TextTokenizer
    extends java.lang.Object
    An implementation of a text tokenizer for whitespace-separated natural language text.

    The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of

    • sequences of word characters and sentence delimiters where the last character is a word character,
    • sentence delimiter characters (if they do not precede a word character),
    • sequences of whitespace characters,
    • and individual separator characters.
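The four token rules above can be sketched as a small scanner. This is a hypothetical stand-in written for illustration, not the TextTokenizer implementation: the class name `TokenRulesSketch`, the `kind` helper and its one-letter codes are invented here, and it assumes that a sentence delimiter with no following word character is emitted as a token of its own, one character at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that illustrates the token rules; not the real TextTokenizer.
class TokenRulesSketch {

    // Classify a character: 'E' sentence delimiter, 'S' separator,
    // 'W' whitespace, 'C' word character (everything else).
    static char kind(char c, String eos, String sep) {
        if (eos.indexOf(c) >= 0) return 'E';
        if (sep.indexOf(c) >= 0) return 'S';
        if (Character.isWhitespace(c)) return 'W';
        return 'C';
    }

    static List<String> tokenize(String text, String eos, String sep) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char k = kind(text.charAt(i), eos, sep);
            int end;
            if (k == 'S') {
                // Separators are always single-character tokens.
                end = i + 1;
            } else if (k == 'W') {
                // Whitespace tokens are maximal runs of whitespace.
                end = i + 1;
                while (end < text.length() && kind(text.charAt(end), eos, sep) == 'W') {
                    end++;
                }
            } else {
                // Scan a run of word characters and sentence delimiters,
                // remembering where the last word character ends.
                int j = i;
                int lastWordEnd = -1;
                while (j < text.length()) {
                    char kj = kind(text.charAt(j), eos, sep);
                    if (kj == 'C') {
                        lastWordEnd = j + 1;
                    } else if (kj != 'E') {
                        break;
                    }
                    j++;
                }
                // Assumption: with no word character in the run, emit one delimiter alone.
                end = (lastWordEnd < 0) ? i + 1 : lastWordEnd;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "." is a sentence delimiter and "," a separator in this example.
        for (String token : tokenize("Hi, e.g. this.", ".", ",")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

In this sketch the delimiter inside "e.g" stays part of the word because a word character follows it, while the delimiter after "g" becomes a token of its own.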

    The character classes are completely user-definable. By default, whitespace characters are the Unicode whitespace characters, and all other characters are word characters. The two separator classes (sentence delimiters and separators) are empty by default. The classes may overlap. When determining the class of a character, the user-defined classes are considered in the following order: end-of-sentence delimiters, then other separators, then whitespace, then word characters. For example, a character defined as both a separator and a whitespace character is treated as a separator.
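The lookup order described above can be sketched as a chain of membership tests. The class `CharClassSketch`, its `CharType` enum and its constructor are hypothetical names used only for illustration (the documented class reports types as int constants such as EOS). The sketch also assumes that the default Unicode whitespace test is applied only after all user-defined classes have been checked; the text above does not spell that out.

```java
import java.util.Arrays;

// Hypothetical illustration of the class precedence; not the real getCharType().
class CharClassSketch {

    // Hypothetical labels; the documented class uses int constants instead.
    enum CharType { END_OF_SENTENCE, SEPARATOR, WHITESPACE, WORD }

    private final char[] eosChars;
    private final char[] separators;
    private final char[] whitespace;
    private final char[] wordChars;

    CharClassSketch(String eos, String sep, String ws, String word) {
        this.eosChars = sorted(eos);
        this.separators = sorted(sep);
        this.whitespace = sorted(ws);
        this.wordChars = sorted(word);
    }

    private static char[] sorted(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars); // binarySearch below requires sorted input
        return chars;
    }

    private static boolean contains(char[] sortedChars, char c) {
        return Arrays.binarySearch(sortedChars, c) >= 0;
    }

    CharType classify(char c) {
        // User-defined classes, highest priority first.
        if (contains(eosChars, c)) return CharType.END_OF_SENTENCE;
        if (contains(separators, c)) return CharType.SEPARATOR;
        if (contains(whitespace, c)) return CharType.WHITESPACE;
        if (contains(wordChars, c)) return CharType.WORD;
        // Defaults (assumed to come last): Unicode whitespace, otherwise a word character.
        return Character.isWhitespace(c) ? CharType.WHITESPACE : CharType.WORD;
    }

    public static void main(String[] args) {
        // ',' is defined as both a separator and whitespace; the separator class wins.
        CharClassSketch classes = new CharClassSketch(".!?", ",;", ", ", "");
        System.out.println(classes.classify(','));
    }
}
```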

    By default, the tokenizer returns all tokens, including whitespace. That is, concatenating the tokens in order recovers the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
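The round-trip property can be observed by analogy with the JDK's java.util.StringTokenizer, which also returns delimiter tokens when its returnDelims flag is true. The sketch below uses only StringTokenizer, not TextTokenizer; the class name `RoundTripSketch` and the `rejoin` helper are invented for the example.

```java
import java.util.StringTokenizer;

// Analogy using the JDK StringTokenizer; TextTokenizer itself is not used here.
class RoundTripSketch {

    // Tokenize on common whitespace characters and glue the tokens back together.
    static String rejoin(String text, boolean keepWhitespace) {
        StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f", keepWhitespace);
        StringBuilder joined = new StringBuilder();
        while (tokens.hasMoreTokens()) {
            joined.append(tokens.nextToken());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String text = "a  b\tc";
        // With whitespace tokens kept, concatenation reproduces the input.
        System.out.println(rejoin(text, true).equals(text));
        // With whitespace tokens skipped, the whitespace is lost.
        System.out.println(rejoin(text, false));
    }
}
```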

    A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be retrieved with nextToken(). In addition, getNextTokenType() returns the type of the next token as an integer. Note that you must call getNextTokenType() before calling nextToken(), because nextToken() advances the iterator.
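Why the type has to be queried first can be seen in a minimal lookahead iterator. This is a hypothetical sketch, not the TextTokenizer source: it borrows the method names and the idea of a cached next token from this page, but the class `LookaheadSketch`, its two type codes, and the choice to return null when no token is left are assumptions made for illustration. It only distinguishes whitespace from non-whitespace.

```java
// Hypothetical lookahead iterator; only the method names mirror this page.
class LookaheadSketch {
    static final int WORD = 0;       // hypothetical type codes, not the real constants
    static final int WHITESPACE = 1;

    private final String text;
    private int pos = 0;
    private int nextStart;
    private int nextEnd;
    private int nextType;
    private boolean nextComputed = false;

    LookaheadSketch(String text) {
        this.text = text;
    }

    // Find the next token once and cache its bounds and type.
    private boolean computeNext() {
        if (nextComputed) return true;
        if (pos >= text.length()) return false;
        boolean ws = Character.isWhitespace(text.charAt(pos));
        int end = pos + 1;
        while (end < text.length() && Character.isWhitespace(text.charAt(end)) == ws) {
            end++;
        }
        nextStart = pos;
        nextEnd = end;
        nextType = ws ? WHITESPACE : WORD;
        pos = end;
        nextComputed = true;
        return true;
    }

    boolean hasNext() {
        return computeNext();
    }

    int getNextTokenType() {
        return computeNext() ? nextType : -1;
    }

    String nextToken() {
        if (!computeNext()) return null; // assumption: behavior at end of input is not documented
        nextComputed = false;            // consume the cached token
        return text.substring(nextStart, nextEnd);
    }

    public static void main(String[] args) {
        LookaheadSketch tokens = new LookaheadSketch("ab cd");
        while (tokens.hasNext()) {
            int type = tokens.getNextTokenType(); // query the type first
            String token = tokens.nextToken();    // then consume the token
            System.out.println(type + " [" + token + "]");
        }
    }
}
```

Once nextToken() has consumed the cached token, a later type query in this sketch describes the following token, which is why the order of the two calls matters.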

    Version:
    $Id: TextTokenizer.java,v 1.1 2002/09/30 19:09:09 goetz Exp $
    • Constructor Summary

      Constructors 
      Constructor Description
      TextTokenizer​(java.lang.String string)
      Construct a tokenizer from a Java string.
      TextTokenizer​(CharArrayString string)
      Construct a tokenizer from a CharArrayString.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void addSeparators​(java.lang.String chars)
      Add to the set of separator characters.
      void addToEndOfSentenceChars​(java.lang.String chars)
      Add to the set of sentence delimiters.
      private char[] addToSortedList​(java.lang.String s, char[] list)
      Add the characters in s to the sorted array of characters in list, returning a new, sorted array.
      void addWhitespaceChars​(java.lang.String chars)
      Add to the set of whitespace characters.
      void addWordChars​(java.lang.String chars)
      Add to the set of word characters.
      private boolean computeNextToken()
      Compute the next token.
      int getCharType​(char c)
      Get the type of an individual character.
      int getNextTokenType()
      Get the type of the token returned by the next call to nextToken().
      boolean hasNext()  
      private char[] makeSortedList​(java.lang.String s)  
      java.lang.String nextToken()  
      void setEndOfSentenceChars​(java.lang.String chars)
      Set the set of sentence delimiters.
      void setSeparators​(java.lang.String chars)
      Set the set of separator characters.
      void setShowSeparators​(boolean b)
      Set the flag for showing separator tokens.
      void setShowWhitespace​(boolean b)
      Set the flag for showing whitespace tokens.
      void setWhitespaceChars​(java.lang.String chars)
      Set the set of whitespace characters (in addition to the Unicode whitespace chars).
      void setWordChars​(java.lang.String chars)
      Set the set of word characters.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • EOS

        public static final int EOS
        Sentence delimiter character/word type.
        See Also:
        Constant Field Values
      • text

        private final char[] text
      • end

        private final int end
      • pos

        private int pos
      • eosDels

        private char[] eosDels
      • separators

        private char[] separators
      • whitespace

        private char[] whitespace
      • wordChars

        private char[] wordChars
      • nextTokenStart

        private int nextTokenStart
      • nextTokenEnd

        private int nextTokenEnd
      • nextTokenType

        private int nextTokenType
      • nextComputed

        private boolean nextComputed
      • showWhitespace

        private boolean showWhitespace
      • showSeparators

        private boolean showSeparators
    • Constructor Detail

      • TextTokenizer

        public TextTokenizer​(CharArrayString string)
        Construct a tokenizer from a CharArrayString.
        Parameters:
        string - The string to tokenize.
      • TextTokenizer

        public TextTokenizer​(java.lang.String string)
        Construct a tokenizer from a Java string.
        Parameters:
        string - The string to tokenize.
    • Method Detail

      • setShowWhitespace

        public void setShowWhitespace​(boolean b)
        Set the flag for showing whitespace tokens.
        Parameters:
        b - true to return whitespace tokens, false to skip them.
      • setShowSeparators

        public void setShowSeparators​(boolean b)
        Set the flag for showing separator tokens.
        Parameters:
        b - true to return separator tokens, false to skip them.
      • setEndOfSentenceChars

        public void setEndOfSentenceChars​(java.lang.String chars)
        Set the set of sentence delimiters.
        Parameters:
        chars - The characters that make up the set of sentence delimiters.
      • addToEndOfSentenceChars

        public void addToEndOfSentenceChars​(java.lang.String chars)
        Add to the set of sentence delimiters.
        Parameters:
        chars - The characters to add to the set of sentence delimiters.
      • setSeparators

        public void setSeparators​(java.lang.String chars)
        Set the set of separator characters.
        Parameters:
        chars - The characters that make up the set of separator characters.
      • addSeparators

        public void addSeparators​(java.lang.String chars)
        Add to the set of separator characters.
        Parameters:
        chars - The characters to add to the set of separator characters.
      • setWhitespaceChars

        public void setWhitespaceChars​(java.lang.String chars)
        Set the set of whitespace characters (in addition to the Unicode whitespace chars).
        Parameters:
        chars - The characters to treat as whitespace, in addition to the Unicode whitespace characters.
      • addWhitespaceChars

        public void addWhitespaceChars​(java.lang.String chars)
        Add to the set of whitespace characters.
        Parameters:
        chars - The characters to add to the set of whitespace characters.
      • setWordChars

        public void setWordChars​(java.lang.String chars)
        Set the set of word characters.
        Parameters:
        chars - The characters that make up the set of word characters.
      • addWordChars

        public void addWordChars​(java.lang.String chars)
        Add to the set of word characters.
        Parameters:
        chars - The characters to add to the set of word characters.
      • getNextTokenType

        public int getNextTokenType()
        Get the type of the token returned by the next call to nextToken().
        Returns:
        The token type, or -1 if there is no next token.
      • hasNext

        public boolean hasNext()
        Returns:
        true iff there is a next token.
      • nextToken

        public java.lang.String nextToken()
        Returns:
        the next token.
      • computeNextToken

        private boolean computeNextToken()
        Compute the next token.
      • getCharType

        public int getCharType​(char c)
        Get the type of an individual character.
        Parameters:
        c - The character to classify.
        Returns:
        The type of the character, for example EOS for a sentence delimiter.
      • addToSortedList

        private char[] addToSortedList​(java.lang.String s,
                                       char[] list)
        Add the characters in s to the sorted array of characters in list, returning a new, sorted array.
      • makeSortedList

        private char[] makeSortedList​(java.lang.String s)
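The two private helpers above are documented only by their signatures and a one-line description. The following is a sketch of one plausible way such helpers could work, not the actual source: it assumes that the input array is left unmodified, that duplicates are kept (a binary search still finds them), and that java.util.Arrays does the sorting. The class name `SortedCharsSketch` is invented for the example.

```java
import java.util.Arrays;

// Hypothetical versions of the sorted-array helpers; not the real private methods.
class SortedCharsSketch {

    // Turn the characters of s into a sorted array.
    static char[] makeSortedList(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return chars;
    }

    // Return a new sorted array holding the characters of list and of s.
    static char[] addToSortedList(String s, char[] list) {
        // Copy the old entries, leaving room for the new characters at the end.
        char[] merged = Arrays.copyOf(list, list.length + s.length());
        s.getChars(0, s.length(), merged, list.length);
        Arrays.sort(merged);
        return merged;
    }

    public static void main(String[] args) {
        char[] delimiters = makeSortedList("?.");
        delimiters = addToSortedList("!", delimiters);
        // Membership test on the sorted array.
        System.out.println(Arrays.binarySearch(delimiters, '!') >= 0);
    }
}
```

Keeping each class as a sorted char array makes a membership test a binary search, at the cost of re-sorting whenever characters are added.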