Class DictionaryLookup

  • All Implemented Interfaces:
    java.lang.Iterable<WordData>, IStemmer

    public final class DictionaryLookup
    extends java.lang.Object
    implements IStemmer, java.lang.Iterable<WordData>
    This class implements a dictionary lookup of an inflected word over a dictionary previously compiled using the dict_compile tool.
    • Field Detail

      • matcher

        private final FSATraversal matcher
        An FSA used for lookups.
      • finalStatesIterator

        private final ByteSequenceIterator finalStatesIterator
        An iterator for walking along the final states of fsa.
      • rootNode

        private final int rootNode
        FSA's root node.
      • EXPAND_SIZE

        private static final int EXPAND_SIZE
        Expand buffers and arrays by this constant.
        See Also:
        Constant Field Values
      • forms

        private WordData[] forms
        Private internal array of reusable word data objects.
      • encoder

        private final java.nio.charset.CharsetEncoder encoder
        Charset encoder for the FSA.
      • decoder

        private final java.nio.charset.CharsetDecoder decoder
        Charset decoder for the FSA.
      • fsa

        private final FSA fsa
        The FSA we are using.
      • byteBuffer

        private java.nio.ByteBuffer byteBuffer
        Internal reusable buffer for encoding words into byte arrays using encoder.
      • charBuffer

        private java.nio.CharBuffer charBuffer
        Internal reusable buffer holding the characters of a word before they are encoded into bytes using encoder.
      • matchResult

        private final MatchResult matchResult
        Reusable match result.
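        The encoder, byteBuffer and charBuffer fields suggest a pattern in which a word is copied into a reusable character buffer and then encoded into a reusable byte buffer. The following is a minimal sketch of that pattern using only java.nio. The class name WordEncoderSketch and its growth policy are assumptions made for illustration; this is not the library's own code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: encodes words into bytes while reusing its buffers.
final class WordEncoderSketch {
    private final CharsetEncoder encoder;
    private ByteBuffer byteBuffer = ByteBuffer.allocate(0);
    private CharBuffer charBuffer = CharBuffer.allocate(0);

    WordEncoderSketch(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    // Returns the internal byte buffer, flipped and ready for reading.
    // The returned buffer is overwritten by the next call.
    ByteBuffer encode(CharSequence word) {
        // Grow the character buffer only when the word does not fit.
        if (charBuffer.capacity() < word.length()) {
            charBuffer = CharBuffer.allocate(word.length());
        }
        charBuffer.clear();
        charBuffer.append(word);
        charBuffer.flip();

        // Worst-case number of bytes for this word in the target charset.
        int maxBytes = (int) Math.ceil(word.length() * (double) encoder.maxBytesPerChar());
        if (byteBuffer.capacity() < maxBytes) {
            byteBuffer = ByteBuffer.allocate(maxBytes);
        }
        byteBuffer.clear();

        try {
            encoder.reset();
            CoderResult result = encoder.encode(charBuffer, byteBuffer, true);
            if (result.isError()) {
                result.throwException();
            }
            result = encoder.flush(byteBuffer);
            if (result.isError()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Word cannot be encoded: " + word, e);
        }
        byteBuffer.flip();
        return byteBuffer;
    }
}
```

        Because the buffers are shared between calls, an object following this pattern should not be used from several threads at once, and callers should copy the bytes if they need them after the next call.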
    • Constructor Detail

      • DictionaryLookup

        public DictionaryLookup(Dictionary dictionary)
                         throws java.lang.IllegalArgumentException
        Creates a new object of this class using the given dictionary, which supplies the FSA for word lookups and the encoding for converting characters to bytes.
        Parameters:
        dictionary - The dictionary to use for lookups.
        Throws:
        java.lang.IllegalArgumentException - if FSA's root node cannot be acquired (dictionary is empty).
    • Method Detail

      • lookup

        public java.util.List<WordData> lookup(java.lang.CharSequence word)
        Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.
        Specified by:
        lookup in interface IStemmer
        Parameters:
        word - The word (typically inflected) to look up base forms for.
        Returns:
        A list of WordData entries (possibly empty).
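        The lookup principle described above (match the word followed by a separator, then read the stem and optional tag that follow) can be illustrated with a toy model. In this sketch a sorted set of strings stands in for the automaton, stems are stored in plain form rather than in the compressed encoding a real dictionary uses, and the names LookupSketch, Entry and SEPARATOR are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy stand-in for an FSA-backed dictionary: each entry is stored as
// "inflected+stem" or "inflected+stem+tag" in a sorted set.
final class LookupSketch {
    static final char SEPARATOR = '+';

    // Hypothetical result holder; tag is null when the entry has none.
    static final class Entry {
        final String stem;
        final String tag;

        Entry(String stem, String tag) {
            this.stem = stem;
            this.tag = tag;
        }
    }

    private final TreeSet<String> entries = new TreeSet<>();

    void add(String inflected, String stem, String tag) {
        String line = inflected + SEPARATOR + stem;
        if (tag != null) {
            line += SEPARATOR + tag;
        }
        entries.add(line);
    }

    // Finds every entry whose inflected form equals word exactly.
    List<Entry> lookup(CharSequence word) {
        // Appending the separator prevents "cat" from matching "cats".
        String prefix = word.toString() + SEPARATOR;
        List<Entry> result = new ArrayList<>();
        // All strings sharing the prefix are contiguous in sorted order.
        for (String line : entries.tailSet(prefix)) {
            if (!line.startsWith(prefix)) {
                break;
            }
            String rest = line.substring(prefix.length());
            int sep = rest.indexOf(SEPARATOR);
            if (sep < 0) {
                result.add(new Entry(rest, null));
            } else {
                result.add(new Entry(rest.substring(0, sep), rest.substring(sep + 1)));
            }
        }
        return result;
    }
}
```

        As with the documented method, a word with several readings yields several entries, and an unknown word yields an empty list rather than null.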
      • applyReplacements

        public static java.lang.String applyReplacements(java.lang.CharSequence word,
                                                         java.util.LinkedHashMap<java.lang.String,java.lang.String> replacements)
        Applies partial string replacements from a given map. Useful if the word needs to be normalized in some way (e.g., ligatures, apostrophes and such).
        Parameters:
        word - The word to apply replacements to.
        replacements - A map of replacements (from->to).
        Returns:
        A new string with all replacements applied.
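        To make the normalization use case concrete, here is a minimal sketch of a method with the same signature, built on String.replace. It applies each map entry in insertion order to the result of the previous replacement. This page does not specify how the real method treats overlapping keys, so that ordering behaviour and the class name ReplacementsSketch are assumptions; this is not the library's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical re-creation of a replacement-based normalizer.
final class ReplacementsSketch {
    static String applyReplacements(CharSequence word,
                                    LinkedHashMap<String, String> replacements) {
        String result = word.toString();
        // LinkedHashMap iterates in insertion order, so earlier entries
        // are applied before later ones.
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

        A typical map would send the ligature U+0153 to "oe" and the typographic apostrophe U+2019 to "'", so that a word is normalized before it is passed to lookup.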
      • iterator

        public java.util.Iterator<WordData> iterator()
        Return an iterator over all WordData entries available in the embedded Dictionary.
        Specified by:
        iterator in interface java.lang.Iterable<WordData>
      • getDictionary

        public Dictionary getDictionary()
        Returns:
        The Dictionary used by this object.
      • getSeparatorChar

        public char getSeparatorChar()
        Returns:
        The logical separator character splitting the inflected form, lemma correction token and tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
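        The caveat about the byte-to-character conversion can be seen with a small, self-contained sketch. It decodes a single byte with a strict CharsetDecoder and rejects bytes that do not form a complete character in the given charset. The helper name separatorAsChar and the choice to throw IllegalArgumentException are assumptions made for illustration; the library's own conversion may behave differently.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper showing why a single separator byte may not
// always map to a valid character in the dictionary's encoding.
final class SeparatorSketch {
    static char separatorAsChar(byte separator, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(new byte[] { separator }));
            if (chars.remaining() != 1) {
                throw new IllegalArgumentException(
                    "Separator byte does not decode to exactly one char");
            }
            return chars.get();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                "Separator byte is not a valid character in " + charset, e);
        }
    }
}
```

        An ASCII separator such as '+' decodes cleanly in common encodings, which is consistent with the documentation calling a failure highly unlikely. A byte such as 0xE9 is a valid character in ISO-8859-1 but is only the start of a multi-byte sequence in UTF-8, so the sketch rejects it there.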