Class TextTokenizer
- java.lang.Object
  - org.apache.uima.internal.util.TextTokenizer
public class TextTokenizer extends java.lang.Object
An implementation of a text tokenizer for whitespace-separated natural language text. The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of:
- sequences of word characters and sentence delimiters where the last character is a word character,
- sentence delimiter characters (if they do not precede a word character),
- sequences of whitespace characters,
- and individual separator characters.
The character classes are completely user-definable. By default, whitespace characters are the Unicode whitespace characters, and all other characters are word characters. The sentence delimiter and separator classes are empty by default. The classes may overlap. When determining the class of a character, the user-defined classes are considered in the following order: end-of-sentence delimiters, then other separators, then whitespace, then word characters. That is, if a character is defined to be both a separator and a whitespace character, it will be considered a separator.
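The priority order can be illustrated with a small standalone sketch. It is not the UIMA implementation: the class `CharClassPriority`, its enum, and its constructor are hypothetical names introduced here, and user-defined word characters are left out for brevity. It only shows how a first-match-wins lookup resolves a character that belongs to more than one class.

```java
import java.util.Arrays;

/** Hypothetical stand-in (not the UIMA class): resolves overlapping character classes by priority. */
public class CharClassPriority {

    /** Illustrative class labels; the real tokenizer exposes int constants instead. */
    public enum CharClass { EOS, SEP, WSP, WCH }

    // Sorted arrays, so membership can be tested with a binary search.
    private final char[] eosDels;
    private final char[] separators;
    private final char[] whitespace;

    public CharClassPriority(String eos, String sep, String wsp) {
        this.eosDels = sorted(eos);
        this.separators = sorted(sep);
        this.whitespace = sorted(wsp);
    }

    private static char[] sorted(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return chars;
    }

    private static boolean contains(char[] sortedChars, char c) {
        return Arrays.binarySearch(sortedChars, c) >= 0;
    }

    /** Checks the classes in the documented order; the first match wins. */
    public CharClass classify(char c) {
        if (contains(eosDels, c)) return CharClass.EOS;
        if (contains(separators, c)) return CharClass.SEP;
        // Assumption: Character.isWhitespace stands in for "Unicode whitespace".
        if (contains(whitespace, c) || Character.isWhitespace(c)) return CharClass.WSP;
        return CharClass.WCH;
    }

    public static void main(String[] args) {
        // '-' is registered both as a separator and as a whitespace character.
        CharClassPriority classes = new CharClassPriority(".!?", "-,", "-_");
        System.out.println(classes.classify('-')); // SEP
        System.out.println(classes.classify('_')); // WSP
        System.out.println(classes.classify('.')); // EOS
        System.out.println(classes.classify('a')); // WCH
    }
}
```

Because the separator check runs before the whitespace check, `'-'` is reported as a separator even though it also appears in the whitespace set.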
By default, the tokenizer returns all tokens, including whitespace. That is, concatenating the tokens in order recovers the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
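The round-trip property can be sketched for the default configuration, where only whitespace and word characters exist. The class `RoundTripSketch` and its `tokenize` method are hypothetical and written for illustration only; they use `Character.isWhitespace` as an assumed stand-in for the tokenizer's notion of Unicode whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the default behavior only: runs of whitespace and runs of word characters. */
public class RoundTripSketch {

    /** Splits text into maximal runs of whitespace or non-whitespace characters. */
    public static List<String> tokenize(String text, boolean showWhitespace) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int start = pos;
            boolean isWhitespace = Character.isWhitespace(text.charAt(pos));
            // Extend the token while the character class stays the same.
            while (pos < text.length()
                    && Character.isWhitespace(text.charAt(pos)) == isWhitespace) {
                pos++;
            }
            // Whitespace tokens are dropped only when the caller asks for it.
            if (!isWhitespace || showWhitespace) {
                tokens.add(text.substring(start, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Hello  tokenizer world";
        List<String> all = tokenize(input, true);
        System.out.println(all.size());                          // 5
        System.out.println(String.join("", all).equals(input));  // true
        System.out.println(tokenize(input, false));              // [Hello, tokenizer, world]
    }
}
```

With whitespace tokens shown, joining the tokens reproduces the input exactly; with them hidden, only the word tokens remain and the original spacing can no longer be recovered.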
A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be queried with nextToken(). In addition, getNextTokenType() returns the type of the token as an integer. Note that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() will advance the iterator.
- Version:
- $Id: TextTokenizer.java,v 1.1 2002/09/30 19:09:09 goetz Exp $
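The call order described for the iterator interface (query the type first, then fetch the token) can be sketched with a tiny stand-in. `PeekBeforeNext` is a hypothetical class that only mirrors the three documented method names; its constant values and its two-class tokenization are assumptions made for illustration, not the behavior of the real tokenizer.

```java
/** Hypothetical stand-in that mirrors the documented call order; it is not the UIMA class. */
public class PeekBeforeNext {
    // Hypothetical type codes; the real constants may have different values.
    public static final int WSP = 0;
    public static final int WCH = 1;

    private final String text;
    private int pos = 0;

    public PeekBeforeNext(String text) {
        this.text = text;
    }

    public boolean hasNext() {
        return pos < text.length();
    }

    /** Type of the token the next call to nextToken() returns, or -1 if none. Does not advance. */
    public int getNextTokenType() {
        if (!hasNext()) {
            return -1;
        }
        return Character.isWhitespace(text.charAt(pos)) ? WSP : WCH;
    }

    /** Returns the next token and advances past it. */
    public String nextToken() {
        int type = getNextTokenType();
        int start = pos;
        while (hasNext() && getNextTokenType() == type) {
            pos++;
        }
        return text.substring(start, pos);
    }

    /** Walks the text and labels every token with its type. */
    public static String describe(String text) {
        PeekBeforeNext tokens = new PeekBeforeNext(text);
        StringBuilder out = new StringBuilder();
        while (tokens.hasNext()) {
            int type = tokens.getNextTokenType(); // query the type first...
            String token = tokens.nextToken();    // ...because this call advances the iterator
            out.append(type == WCH ? "WCH" : "WSP").append('[').append(token).append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("Hi there")); // WCH[Hi]WSP[ ]WCH[there]
    }
}
```

Swapping the two calls inside the loop would report the type of the following token, or -1 at the end of the input, rather than the type of the token just returned.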
-
-
Field Summary
Fields
Modifier and Type    Field             Description
private int          end
static int           EOS               Sentence delimiter character/word type.
private char[]       eosDels
private boolean      nextComputed
private int          nextTokenEnd
private int          nextTokenStart
private int          nextTokenType
private int          pos
static int           SEP               Separator character/word type.
private char[]       separators
private boolean      showSeparators
private boolean      showWhitespace
private char[]       text
static int           WCH               Word character/word type.
private char[]       whitespace
private char[]       wordChars
static int           WSP               Whitespace character/word type.
-
Constructor Summary
Constructors
Constructor                               Description
TextTokenizer(java.lang.String string)    Construct a tokenizer from a Java string.
TextTokenizer(CharArrayString string)     Construct a tokenizer from a CharArrayString.
-
Method Summary
Modifier and Type  Method                                            Description
void               addSeparators(java.lang.String chars)             Add to the set of separator characters.
void               addToEndOfSentenceChars(java.lang.String chars)   Add to the set of sentence delimiters.
private char[]     addToSortedList(java.lang.String s, char[] list)  Add the characters in s to the sorted array of characters in list, returning a new, sorted array.
void               addWhitespaceChars(java.lang.String chars)        Add to the set of whitespace characters.
void               addWordChars(java.lang.String chars)              Add to the set of word characters.
private boolean    computeNextToken()                                Compute the next token.
int                getCharType(char c)                               Get the type of an individual character.
int                getNextTokenType()                                Get the type of the token returned by the next call to nextToken().
boolean            hasNext()
private char[]     makeSortedList(java.lang.String s)
java.lang.String   nextToken()
void               setEndOfSentenceChars(java.lang.String chars)     Set the set of sentence delimiters.
void               setSeparators(java.lang.String chars)             Set the set of separator characters.
void               setShowSeparators(boolean b)                      Set the flag for showing separator tokens.
void               setShowWhitespace(boolean b)                      Set the flag for showing whitespace tokens.
void               setWhitespaceChars(java.lang.String chars)        Set the set of whitespace characters (in addition to the Unicode whitespace chars).
void               setWordChars(java.lang.String chars)              Set the set of word characters.
-
-
-
Field Detail
-
EOS
public static final int EOS
Sentence delimiter character/word type.
- See Also:
- Constant Field Values
-
SEP
public static final int SEP
Separator character/word type.
- See Also:
- Constant Field Values
-
WSP
public static final int WSP
Whitespace character/word type.
- See Also:
- Constant Field Values
-
WCH
public static final int WCH
Word character/word type.
- See Also:
- Constant Field Values
-
text
private final char[] text
-
end
private final int end
-
pos
private int pos
-
eosDels
private char[] eosDels
-
separators
private char[] separators
-
whitespace
private char[] whitespace
-
wordChars
private char[] wordChars
-
nextTokenStart
private int nextTokenStart
-
nextTokenEnd
private int nextTokenEnd
-
nextTokenType
private int nextTokenType
-
nextComputed
private boolean nextComputed
-
showWhitespace
private boolean showWhitespace
-
showSeparators
private boolean showSeparators
-
-
Constructor Detail
-
TextTokenizer
public TextTokenizer(CharArrayString string)
Construct a tokenizer from a CharArrayString.
- Parameters:
string - The string to tokenize.
-
TextTokenizer
public TextTokenizer(java.lang.String string)
Construct a tokenizer from a Java string.
- Parameters:
string - The string to tokenize.
-
-
Method Detail
-
setShowWhitespace
public void setShowWhitespace(boolean b)
Set the flag for showing whitespace tokens.
- Parameters:
b - true to return whitespace tokens, false to skip them.
-
setShowSeparators
public void setShowSeparators(boolean b)
Set the flag for showing separator tokens.
- Parameters:
b - true to return separator tokens, false to skip them.
-
setEndOfSentenceChars
public void setEndOfSentenceChars(java.lang.String chars)
Set the set of sentence delimiters.
- Parameters:
chars
-
addToEndOfSentenceChars
public void addToEndOfSentenceChars(java.lang.String chars)
Add to the set of sentence delimiters.
- Parameters:
chars
-
setSeparators
public void setSeparators(java.lang.String chars)
Set the set of separator characters.
- Parameters:
chars
-
addSeparators
public void addSeparators(java.lang.String chars)
Add to the set of separator characters.
- Parameters:
chars
-
setWhitespaceChars
public void setWhitespaceChars(java.lang.String chars)
Set the set of whitespace characters (in addition to the Unicode whitespace chars).
- Parameters:
chars
-
addWhitespaceChars
public void addWhitespaceChars(java.lang.String chars)
Add to the set of whitespace characters.
- Parameters:
chars
-
setWordChars
public void setWordChars(java.lang.String chars)
Set the set of word characters.
- Parameters:
chars
-
addWordChars
public void addWordChars(java.lang.String chars)
Add to the set of word characters.
- Parameters:
chars
-
getNextTokenType
public int getNextTokenType()
Get the type of the token returned by the next call to nextToken().
- Returns:
- The token type, or -1 if there is no next token.
-
hasNext
public boolean hasNext()
- Returns:
- true iff there is a next token.
-
nextToken
public java.lang.String nextToken()
- Returns:
- the next token.
-
computeNextToken
private boolean computeNextToken()
Compute the next token.
-
getCharType
public int getCharType(char c)
Get the type of an individual character.
- Parameters:
c - The character to classify.
- Returns:
- The character type: one of EOS, SEP, WSP or WCH.
-
addToSortedList
private char[] addToSortedList(java.lang.String s, char[] list)
Add the characters in s to the sorted array of characters in list, returning a new, sorted array.
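One way such a merge could work is sketched below. This is not the private method's actual code: `SortedCharsSketch` is a hypothetical class, and dropping duplicate characters is an assumption that the description above does not confirm.

```java
import java.util.TreeSet;

/** Hypothetical sketch of merging characters into a sorted array; not the UIMA implementation. */
public class SortedCharsSketch {

    /**
     * Returns a new sorted array holding the characters of list plus those of s.
     * Assumption: duplicates are dropped. The input array is left untouched.
     */
    public static char[] addToSortedList(String s, char[] list) {
        TreeSet<Character> merged = new TreeSet<>(); // keeps elements sorted and unique
        for (char c : list) {
            merged.add(c);
        }
        for (char c : s.toCharArray()) {
            merged.add(c);
        }
        char[] result = new char[merged.size()];
        int i = 0;
        for (char c : merged) {
            result[i++] = c;
        }
        return result;
    }

    public static void main(String[] args) {
        char[] delimiters = {'!', '.'};
        char[] extended = addToSortedList("?.;", delimiters);
        System.out.println(new String(extended));  // !.;?
        System.out.println(delimiters.length);     // 2
    }
}
```

Keeping the array sorted is what allows a binary search to test whether a character belongs to a class.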
-
makeSortedList
private char[] makeSortedList(java.lang.String s)
-
-