Interface WordIndexer<W>

  • Type Parameters:
    W - A type representing words in the language. Can be a String, or something more complex if needed.
    All Superinterfaces:
    java.io.Serializable
    All Known Implementing Classes:
    StringWordIndexer

    public interface WordIndexer<W>
    extends java.io.Serializable
    Enumerates words in the vocabulary of a language model. Stores a two-way mapping between integers and words.
    Author:
    adampauls
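As a rough illustration of the two-way mapping described above, the sketch below pairs a list (index to word) with a hash map (word to index). TwoWayIndexSketch is a hypothetical, standalone class that only borrows the method names documented on this page. It does not implement WordIndexer, it fixes W to String, and assigning dense indices starting at 0 is an assumption of the sketch rather than something this page guarantees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical standalone sketch; only the method names come from WordIndexer.
class TwoWayIndexSketch {
    // index -> word
    private final List<String> indexToWord = new ArrayList<>();
    // word -> index
    private final Map<String, Integer> wordToIndex = new HashMap<>();

    // Returns the existing index for the word, or assigns the next free one.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = indexToWord.size();
        indexToWord.add(word);
        wordToIndex.put(word, index);
        return index;
    }

    // Reverse direction of the mapping.
    String getWord(int index) {
        return indexToWord.get(index);
    }

    int numWords() {
        return indexToWord.size();
    }

    public static void main(String[] args) {
        TwoWayIndexSketch indexer = new TwoWayIndexSketch();
        int the = indexer.getOrAddIndex("the");
        int cat = indexer.getOrAddIndex("cat");
        int theAgain = indexer.getOrAddIndex("the"); // already present, nothing is added
        System.out.println(the + " " + cat + " " + theAgain); // prints "0 1 0"
        System.out.println(indexer.getWord(cat));             // prints "cat"
        System.out.println(indexer.numWords());               // prints "2"
    }
}
```

Keeping both directions in memory costs extra space but makes each lookup cheap, which is one plausible reason for an interface to expose both getOrAddIndex and getWord.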
    • Method Detail

      • getOrAddIndex

        int getOrAddIndex(W word)
        Gets the index for a word, adding the word to the vocabulary if it is not already present.
        Parameters:
        word - the word to look up or add
        Returns:
        the index of the word
      • getOrAddIndexFromString

        int getOrAddIndexFromString(java.lang.String word)
      • getIndexPossiblyUnk

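        int getIndexPossiblyUnk(W word)
        Looks up the index for a word without modifying the vocabulary. Implementations should never add the word, and should return the index of the unknown-word symbol (see getUnkSymbol()) if the word is not in the vocabulary.
        Parameters:
        word - the word to look up
        Returns:
        the index of the word, or the index of the unknown-word symbol if the word is not in the vocabulary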
      • getWord

        W getWord(int index)
        Gets the word object for an index.
        Parameters:
        index - the index to look up
        Returns:
        the word associated with the index
      • numWords

        int numWords()
        Gets the number of words that have been added so far.
        Returns:
        the number of words currently in the vocabulary
      • getStartSymbol

        W getStartSymbol()
        Returns the start symbol (usually something like <s>).
        Returns:
        the start symbol
      • setStartSymbol

        void setStartSymbol(W sym)
      • getEndSymbol

        W getEndSymbol()
        Returns the end symbol (usually something like </s>).
        Returns:
        the end symbol
      • setEndSymbol

        void setEndSymbol(W sym)
      • getUnkSymbol

        W getUnkSymbol()
        Returns the unk symbol (usually something like <unk>).
        Returns:
        the unk symbol
      • setUnkSymbol

        void setUnkSymbol(W sym)
      • trimAndLock

        void trimAndLock()
        Informs the implementation that no more words can be added to the vocabulary. Implementations may perform some space optimization, and should trigger an error if an attempt is made to add a word after this point.
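
The contract for getIndexPossiblyUnk and trimAndLock can be sketched as follows. LockableIndexSketch is a hypothetical, standalone class that reuses the method names from this page but does not implement WordIndexer. Throwing IllegalStateException after locking, registering the unk symbol in the vocabulary inside setUnkSymbol, and trimming the backing list are all assumptions of the sketch; the page only says that an implementation should trigger an error and may optimize space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone sketch; only the method names come from WordIndexer.
class LockableIndexSketch {
    private final ArrayList<String> words = new ArrayList<>();
    private final Map<String, Integer> indices = new HashMap<>();
    private String unkSymbol;
    private boolean locked = false;

    int getOrAddIndex(String word) {
        Integer existing = indices.get(word);
        if (existing != null) {
            return existing; // known words stay retrievable after locking
        }
        if (locked) {
            // Assumption: the "error" is reported as an unchecked exception.
            throw new IllegalStateException("Vocabulary is locked, cannot add: " + word);
        }
        int index = words.size();
        words.add(word);
        indices.put(word, index);
        return index;
    }

    // Assumption: setting the unk symbol also gives it an index.
    void setUnkSymbol(String sym) {
        unkSymbol = sym;
        getOrAddIndex(sym);
    }

    String getUnkSymbol() {
        return unkSymbol;
    }

    // Never adds; falls back to the unk symbol's index.
    // The sketch assumes setUnkSymbol has been called beforehand.
    int getIndexPossiblyUnk(String word) {
        Integer existing = indices.get(word);
        return existing != null ? existing : indices.get(unkSymbol);
    }

    int numWords() {
        return words.size();
    }

    void trimAndLock() {
        words.trimToSize(); // one possible space optimization
        locked = true;
    }

    public static void main(String[] args) {
        LockableIndexSketch indexer = new LockableIndexSketch();
        indexer.setUnkSymbol("<unk>");
        indexer.getOrAddIndex("dog");
        indexer.trimAndLock();

        System.out.println(indexer.getIndexPossiblyUnk("dog"));   // index of "dog"
        System.out.println(indexer.getIndexPossiblyUnk("zebra")); // index of "<unk>"
        try {
            indexer.getOrAddIndex("zebra");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

After locking, callers that may encounter unseen words would use getIndexPossiblyUnk rather than getOrAddIndex, so that out-of-vocabulary input maps to the unk symbol instead of raising an error.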