Package morfologik.stemming
Class DictionaryLookup
java.lang.Object
    morfologik.stemming.DictionaryLookup
-
Field Summary
Fields (Modifier and Type, Field, Description):
private java.nio.ByteBuffer byteBuffer
    Internal reusable buffer for encoding words into byte arrays using encoder.
private java.nio.CharBuffer charBuffer
    Internal reusable buffer for encoding words into byte arrays using encoder.
private java.nio.charset.CharsetDecoder decoder
    Charset decoder for the FSA.
private Dictionary dictionary
    The Dictionary this lookup is using.
private DictionaryMetadata dictionaryMetadata
    Features of the compiled dictionary.
private java.nio.charset.CharsetEncoder encoder
    Charset encoder for the FSA.
private static int EXPAND_SIZE
    Expand buffers and arrays by this constant.
private ByteSequenceIterator finalStatesIterator
    An iterator for walking along the final states of fsa.
private WordData[] forms
    Private internal array of reusable word data objects.
private ArrayViewList<WordData> formsList
    A "view" over an array implementing
private FSA fsa
    The FSA we are using.
private FSATraversal matcher
    An FSA used for lookups.
private MatchResult matchResult
    Reusable match result.
private int rootNode
    FSA's root node.
private char separatorChar
private ISequenceEncoder sequenceEncoder
-
Constructor Summary
Constructors (Constructor, Description):
DictionaryLookup(Dictionary dictionary)
    Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
-
Method Summary
Methods (Modifier and Type, Method, Description):
static java.lang.String applyReplacements(java.lang.CharSequence word, java.util.LinkedHashMap<java.lang.String,java.lang.String> replacements)
    Apply partial string replacements from a given map.
Dictionary getDictionary()
char getSeparatorChar()
java.util.Iterator<WordData> iterator()
    Return an iterator over all WordData entries available in the embedded Dictionary.
java.util.List<WordData> lookup(java.lang.CharSequence word)
    Searches the automaton for a symbol sequence equal to word, followed by a separator.
-
-
-
Field Detail
-
matcher
private final FSATraversal matcher
An FSA used for lookups.
-
finalStatesIterator
private final ByteSequenceIterator finalStatesIterator
An iterator for walking along the final states of fsa.
-
rootNode
private final int rootNode
FSA's root node.
-
EXPAND_SIZE
private static final int EXPAND_SIZE
Expand buffers and arrays by this constant.
See Also: Constant Field Values
-
forms
private WordData[] forms
Private internal array of reusable word data objects.
-
formsList
private final ArrayViewList<WordData> formsList
A "view" over an array implementing
-
dictionaryMetadata
private final DictionaryMetadata dictionaryMetadata
Features of the compiled dictionary.
See Also: DictionaryMetadata
-
encoder
private final java.nio.charset.CharsetEncoder encoder
Charset encoder for the FSA.
-
decoder
private final java.nio.charset.CharsetDecoder decoder
Charset decoder for the FSA.
-
fsa
private final FSA fsa
The FSA we are using.
-
separatorChar
private final char separatorChar
See Also: getSeparatorChar()
-
byteBuffer
private java.nio.ByteBuffer byteBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
charBuffer
private java.nio.CharBuffer charBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
matchResult
private final MatchResult matchResult
Reusable match result.
-
dictionary
private final Dictionary dictionary
The Dictionary this lookup is using.
-
sequenceEncoder
private final ISequenceEncoder sequenceEncoder
-
-
Constructor Detail
-
DictionaryLookup
public DictionaryLookup(Dictionary dictionary) throws java.lang.IllegalArgumentException
Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
Parameters:
dictionary - The dictionary to use for lookups.
Throws:
java.lang.IllegalArgumentException - if FSA's root node cannot be acquired (dictionary is empty).
-
-
Method Detail
-
lookup
public java.util.List<WordData> lookup(java.lang.CharSequence word)
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.
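The "word followed by a separator" search can be pictured with a toy model. The sketch below is not the library's implementation: it assumes a hypothetical '+' separator, stores entries as plain strings in a list instead of an FSA, and keeps stems uncompressed, whereas the real dictionary decodes them. The class name SeparatorLookupSketch and the sample entries are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeparatorLookupSketch {
    /** Hypothetical separator; a real dictionary defines its own in its metadata. */
    static final char SEPARATOR = '+';

    /**
     * Returns one {stem, tag} pair for every entry that starts with
     * word + SEPARATOR. The tag is null when the entry carries none.
     */
    static List<String[]> lookup(List<String> entries, CharSequence word) {
        String prefix = word.toString() + SEPARATOR;
        List<String[]> forms = new ArrayList<>();
        for (String entry : entries) {
            if (!entry.startsWith(prefix)) {
                continue; // not "word followed by a separator"
            }
            String rest = entry.substring(prefix.length());
            int split = rest.indexOf(SEPARATOR);
            String stem = split < 0 ? rest : rest.substring(0, split);
            String tag = split < 0 ? null : rest.substring(split + 1);
            forms.add(new String[] { stem, tag });
        }
        return forms;
    }

    public static void main(String[] args) {
        // Made-up entries in the form inflected+stem+tag.
        List<String> entries = Arrays.asList("cats+cat+NNS", "ran+run+VBD", "run+run");
        for (String[] form : lookup(entries, "ran")) {
            System.out.println("stem=" + form[0] + ", tag=" + form[1]);
        }
    }
}
```

Appending the separator to the query is what keeps a shorter word such as "ra" from matching the entry for "ran".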
-
applyReplacements
public static java.lang.String applyReplacements(java.lang.CharSequence word, java.util.LinkedHashMap<java.lang.String,java.lang.String> replacements)
Apply partial string replacements from a given map. Useful if the word needs to be normalized somehow (e.g., ligatures, apostrophes and such).
Parameters:
word - The word to apply replacements to.
replacements - A map of replacements (from->to).
Returns:
A new string with all replacements applied.
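To illustrate what applying partial replacements from an ordered map can look like, here is a minimal stand-alone sketch. It is an assumption about the semantics, not the library's code: it applies each from->to pair in the map's insertion order, replaces every occurrence, and resumes searching after the inserted text. The class name ReplacementsSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplacementsSketch {
    /** Applies every from->to pair in insertion order to a copy of the word. */
    static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements) {
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            String from = e.getKey();
            String to = e.getValue();
            if (from.isEmpty()) {
                continue; // an empty key would match everywhere and never terminate
            }
            int index = sb.indexOf(from);
            while (index != -1) {
                sb.replace(index, index + from.length(), to);
                // Continue after the inserted text so it is not rescanned for this pair.
                index = sb.indexOf(from, index + to.length());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> replacements = new LinkedHashMap<>();
        replacements.put("\u2019", "'");  // typographic apostrophe -> ASCII apostrophe
        replacements.put("\ufb01", "fi"); // "fi" ligature -> two letters
        System.out.println(applyReplacements("\ufb01sh\u2019s", replacements));
    }
}
```

Because a LinkedHashMap preserves insertion order, an earlier replacement can produce text that a later pair rewrites again, so the order of the entries matters.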
-
iterator
public java.util.Iterator<WordData> iterator()
Return an iterator over all WordData entries available in the embedded Dictionary.
Specified by:
iterator in interface java.lang.Iterable<WordData>
-
getDictionary
public Dictionary getDictionary()
Returns:
The Dictionary used by this object.
-
getSeparatorChar
public char getSeparatorChar()
Returns:
The logical separator character splitting the inflected form, the lemma correction token and the tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
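Why a single separator byte may fail to map to a valid character can be shown with the standard java.nio.charset API. The sketch below is an illustration only and does not reproduce how the library performs its conversion: it decodes one byte strictly and reports failure, where the description above speaks of a best-effort conversion. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SeparatorDecodeSketch {
    /** Decodes a single byte into a char, failing if the byte is not a complete character. */
    static char decodeSeparator(byte separator, Charset charset) {
        try {
            CharBuffer decoded = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(new byte[] { separator }));
            if (decoded.remaining() != 1) {
                throw new IllegalArgumentException("Separator byte does not decode to exactly one char.");
            }
            return decoded.get();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Separator byte is not valid in " + charset, e);
        }
    }

    public static void main(String[] args) {
        // An ASCII byte decodes identically in most common encodings.
        System.out.println(decodeSeparator((byte) '+', StandardCharsets.UTF_8));
        try {
            // A lone 0xE9 byte is an incomplete multi-byte sequence in UTF-8.
            decodeSeparator((byte) 0xE9, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException expected) {
            System.out.println("not decodable: " + expected.getMessage());
        }
    }
}
```

Separators chosen from the ASCII range decode to the same character in most common encodings, which is consistent with the note that a mismatch is highly unlikely.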
-
-