Package edu.berkeley.nlp.lm
Interface WordIndexer<W>
-
- Type Parameters:
W
- A type representing words in the language. Can be a String, or something more complex if needed.
- All Superinterfaces:
java.io.Serializable
- All Known Implementing Classes:
StringWordIndexer
public interface WordIndexer<W> extends java.io.Serializable
Enumerates words in the vocabulary of a language model. Stores a two-way mapping between integers and words.
- Author:
adampauls
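The two-way mapping described above can be sketched roughly as follows. This is only an illustration, not the library's implementation: the class name TwoWayIndexSketch is hypothetical, and the choice of a HashMap plus an ArrayList, with indices assigned from 0 in insertion order, is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a two-way word/index mapping; not the library's code.
final class TwoWayIndexSketch<W> {
    private final Map<W, Integer> wordToIndex = new HashMap<>();
    private final List<W> indexToWord = new ArrayList<>();

    // Mirrors getOrAddIndex: returns the existing index or assigns the next free one.
    int getOrAddIndex(W word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = indexToWord.size(); // assumption: indices start at 0
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    // Mirrors getWord: maps an index back to its word.
    W getWord(int index) {
        return indexToWord.get(index);
    }

    // Mirrors numWords: number of distinct words added so far.
    int numWords() {
        return indexToWord.size();
    }

    public static void main(String[] args) {
        TwoWayIndexSketch<String> indexer = new TwoWayIndexSketch<>();
        int the = indexer.getOrAddIndex("the");
        int cat = indexer.getOrAddIndex("cat");
        int theAgain = indexer.getOrAddIndex("the");
        System.out.println(the + " " + cat + " " + theAgain); // prints "0 1 0"
        System.out.println(indexer.getWord(cat));             // prints "cat"
        System.out.println(indexer.numWords());               // prints "2"
    }
}
```

Keeping both directions in sync inside getOrAddIndex is what makes getWord(getOrAddIndex(w)) return w again.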
-
Nested Class Summary
Nested Classes
Modifier and Type   Interface   Description
static class   WordIndexer.StaticMethods
-
Method Summary
Modifier and Type   Method   Description
W getEndSymbol()
Returns the end symbol (usually something like </s>).
int getIndexPossiblyUnk(W word)
Should never add to the vocabulary, and should return the index of the unk symbol (see getUnkSymbol()) if the word is not in the vocabulary.
int getOrAddIndex(W word)
Gets the index for a word, adding the word if necessary.
int getOrAddIndexFromString(java.lang.String word)
W getStartSymbol()
Returns the start symbol (usually something like <s>).
W getUnkSymbol()
Returns the unk symbol (usually something like <unk>).
W getWord(int index)
Gets the word object for an index.
int numWords()
Number of words that have been added so far.
void setEndSymbol(W sym)
void setStartSymbol(W sym)
void setUnkSymbol(W sym)
void trimAndLock()
Informs the implementation that no more words can be added to the vocabulary.
-
Method Detail
-
getOrAddIndex
int getOrAddIndex(W word)
Gets the index for a word, adding the word to the vocabulary if necessary.
- Parameters:
word - the word to look up or add
- Returns:
the index of the word
-
getOrAddIndexFromString
int getOrAddIndexFromString(java.lang.String word)
-
getIndexPossiblyUnk
int getIndexPossiblyUnk(W word)
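Should never add to the vocabulary, and should return the index of the unk symbol (see getUnkSymbol()) if the word is not in the vocabulary.
- Parameters:
word - the word to look up
- Returns:
the index of the word, or the index of the unk symbol if the word is not in the vocabulary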
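The lookup-only contract of getIndexPossiblyUnk can be illustrated with a small sketch. The class UnkLookupSketch is hypothetical and fixes W to String. It assumes the unk symbol is itself stored in the vocabulary, so that its index can be returned for unseen words. The real implementation may handle this differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a read-only lookup with an unknown-word fallback.
final class UnkLookupSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    private final int unkIndex;

    UnkLookupSketch(String unkSymbol) {
        // Assumption: the unk symbol is added to the vocabulary up front.
        this.unkIndex = getOrAddIndex(unkSymbol);
    }

    // Adds the word if it is new and returns its index.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = wordToIndex.size();
        wordToIndex.put(word, index);
        return index;
    }

    // Never modifies the vocabulary; unseen words map to the unk symbol's index.
    int getIndexPossiblyUnk(String word) {
        Integer existing = wordToIndex.get(word);
        return existing != null ? existing : unkIndex;
    }

    int numWords() {
        return wordToIndex.size();
    }

    public static void main(String[] args) {
        UnkLookupSketch indexer = new UnkLookupSketch("<unk>");
        indexer.getOrAddIndex("dog");
        System.out.println(indexer.getIndexPossiblyUnk("dog"));   // prints "1"
        System.out.println(indexer.getIndexPossiblyUnk("zebra")); // prints "0"
        System.out.println(indexer.numWords());                   // prints "2"
    }
}
```

Looking up "zebra" leaves numWords() unchanged, which is the property that separates this method from getOrAddIndex.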
-
getWord
W getWord(int index)
Gets the word object for an index.
- Parameters:
index - the index of the word
- Returns:
the word stored at that index
-
numWords
int numWords()
Number of words that have been added so far.
-
getStartSymbol
W getStartSymbol()
Returns the start symbol (usually something like <s>).
-
setStartSymbol
void setStartSymbol(W sym)
-
getEndSymbol
W getEndSymbol()
Returns the end symbol (usually something like </s>).
-
setEndSymbol
void setEndSymbol(W sym)
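One common use of the start and end symbols in n-gram language modeling is to wrap a tokenized sentence before mapping it to indices. The sketch below shows that idea only. SentencePaddingSketch and its toIndices helper are hypothetical names and are not part of this interface, and W is fixed to String for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: wrapping a sentence in start/end symbols before indexing.
final class SentencePaddingSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    private String startSymbol;
    private String endSymbol;

    void setStartSymbol(String sym) { startSymbol = sym; }
    void setEndSymbol(String sym) { endSymbol = sym; }
    String getStartSymbol() { return startSymbol; }
    String getEndSymbol() { return endSymbol; }

    // Adds the word if it is new and returns its index.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = wordToIndex.size();
        wordToIndex.put(word, index);
        return index;
    }

    // Hypothetical helper: returns indices for <start> tokens... <end>.
    int[] toIndices(List<String> tokens) {
        int[] indices = new int[tokens.size() + 2];
        indices[0] = getOrAddIndex(getStartSymbol());
        for (int i = 0; i < tokens.size(); i++) {
            indices[i + 1] = getOrAddIndex(tokens.get(i));
        }
        indices[indices.length - 1] = getOrAddIndex(getEndSymbol());
        return indices;
    }

    public static void main(String[] args) {
        SentencePaddingSketch indexer = new SentencePaddingSketch();
        indexer.setStartSymbol("<s>");
        indexer.setEndSymbol("</s>");
        int[] first = indexer.toIndices(Arrays.asList("the", "cat", "sat"));
        int[] second = indexer.toIndices(Arrays.asList("the", "cat"));
        System.out.println(Arrays.toString(first));  // prints "[0, 1, 2, 3, 4]"
        System.out.println(Arrays.toString(second)); // prints "[0, 1, 2, 4]"
    }
}
```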
-
getUnkSymbol
W getUnkSymbol()
Returns the unk symbol (usually something like <unk>).
-
setUnkSymbol
void setUnkSymbol(W sym)
-
trimAndLock
void trimAndLock()
Informs the implementation that no more words can be added to the vocabulary. Implementations may perform some space optimization, and should trigger an error if an attempt is made to add a word after this point.
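A minimal sketch of the locking behavior described above, under stated assumptions: LockableIndexerSketch is a hypothetical class, the error is signalled with IllegalStateException (the interface does not say which error to raise), and looking up words that already exist is assumed to remain legal after locking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of trimAndLock semantics; not the library's implementation.
final class LockableIndexerSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    private final ArrayList<String> indexToWord = new ArrayList<>();
    private boolean locked = false;

    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing; // assumption: existing words can still be looked up
        }
        if (locked) {
            // Assumption: an unchecked exception is how the error is reported.
            throw new IllegalStateException("Vocabulary is locked; cannot add: " + word);
        }
        int index = indexToWord.size();
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    void trimAndLock() {
        indexToWord.trimToSize(); // one possible space optimization
        locked = true;
    }

    public static void main(String[] args) {
        LockableIndexerSketch indexer = new LockableIndexerSketch();
        indexer.getOrAddIndex("a");
        indexer.trimAndLock();
        System.out.println(indexer.getOrAddIndex("a")); // prints "0"
        try {
            indexer.getOrAddIndex("b");
        } catch (IllegalStateException e) {
            System.out.println("rejected"); // prints "rejected"
        }
    }
}
```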