Package morfologik.speller
Class Speller
- java.lang.Object
-
- morfologik.speller.Speller
-
public class Speller extends java.lang.Object
Finds spelling suggestions. Implements K. Oflazer's algorithm as described in: Oflazer, Kemal. 1996. "Error-Tolerant Finite-State Recognition with Applications to Morphological Analysis and Spelling Correction." Computational Linguistics 22 (1): 73–89.See Jan Daciuk's
s_fsa
package.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description class
Speller.CandidateData
Used to sort candidates according to edit distance, and possibly according to their frequency in the future.
-
Field Summary
Fields Modifier and Type Field Description private java.nio.ByteBuffer
byteBuffer
Internal reusable buffer for encoding words into byte arrays usingencoder
.private char[]
candidate
private int
candLen
private java.nio.CharBuffer
charBuffer
Internal reusable buffer for encoding words into byte arrays usingencoder
.private boolean
containsSeparators
private java.nio.charset.CharsetDecoder
decoder
Charset decoder for the FSA.private DictionaryMetadata
dictionaryMetadata
Features of the compiled dictionary.private int
editDistance
private int
effectEditDistance
private java.nio.charset.CharsetEncoder
encoder
Charset encoder for the FSA.private ByteSequenceIterator
finalStatesIterator
An iterator for walking along the final states offsa
.(package private) static int
FIRST_RANGE_CODE
(package private) static int
FREQ_RANGES
private FSA
fsa
The FSA we are using.private HMatrix
hMatrix
private FSATraversal
matcher
An FSA used for lookups.private MatchResult
matchResult
Reusable match result.private static int
MAX_RECURSION_LEVEL
static int
MAX_WORD_LENGTH
Maximum length of the word to be checked.private static int
MIN_WORD_LENGTH
private java.util.Map<java.lang.Character,java.util.List<char[]>>
replacementsAnyToOne
private java.util.Map<java.lang.String,java.util.List<char[]>>
replacementsAnyToTwo
private java.util.Map<java.lang.String,java.util.List<java.lang.String>>
replacementsTheRest
private int
rootNode
FSA's root node.(package private) static int
UPPER_SEARCH_LIMIT
private int
wordLen
private char[]
wordProcessed
-
Constructor Summary
Constructors Constructor Description Speller(Dictionary dictionary)
Speller(Dictionary dictionary, int editDistance)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description private void
addReplacement(java.util.List<Speller.CandidateData> candidates, java.lang.String replacement)
private boolean
areEqual(char x, char y)
private java.nio.ByteBuffer
charSequenceToBytes(java.lang.CharSequence word)
(package private) static boolean
containsNoDigit(java.lang.String s)
Checks whether a string contains a digit.boolean
convertsCase()
Used to determine whether the dictionary supports case conversions.private void
createReplacementsMaps()
int
cuted(int depth, int wordIndex, int candIndex)
Calculates cut-off edit distance.int
ed(int i, int j, int wordIndex, int candIndex)
Calculates edit distance.private void
findRepl(java.util.List<Speller.CandidateData> candidates, int depth, int node, byte[] prevBytes, int wordIndex, int candIndex)
java.util.ArrayList<Speller.CandidateData>
findReplacementCandidates(java.lang.String word)
Find and return suggestions by using K.private java.util.ArrayList<Speller.CandidateData>
findReplacementCandidates(java.lang.String word, boolean evenIfWordInDictionary)
java.util.ArrayList<java.lang.String>
findReplacements(java.lang.String word)
Find suggestions by using K.java.util.ArrayList<Speller.CandidateData>
findSimilarWordCandidates(java.lang.String word)
Find similar words even if the original word is a correct word that exists in the dictionaryjava.util.ArrayList<java.lang.String>
findSimilarWords(java.lang.String word)
java.util.List<java.lang.String>
getAllReplacements(java.lang.String str, int fromIndex, int level)
int
getCandLen()
int
getEffectiveED()
int
getFrequency(java.lang.CharSequence word)
Get the frequency value for a word form.int
getWordLen()
private java.lang.CharSequence
initialUppercase(java.lang.String wordToCheck)
(package private) boolean
isAllUppercase(java.lang.String str)
Returns true ifstr
is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).(package private) static boolean
isAlphabetic(int codePoint)
Copy-paste of Character.isAlphabetic() (needed as we require only 1.6)private boolean
isArcNotTerminal(int arc, int candIndex)
private boolean
isBeforeSeparator(int arc)
boolean
isCamelCase(java.lang.String str)
private boolean
isEndOfCandidate(int arc, int wordIndex)
boolean
isInDictionary(java.lang.CharSequence word)
Test whether the word is found in the dictionary.boolean
isMisspelled(java.lang.String word)
Checks whether the word is misspelled, by performing a series of checks according to properties of the dictionary.(package private) boolean
isMixedCase(java.lang.String str)
(package private) boolean
isNotAllLowercase(java.lang.String str)
Returns true ifstr
is made up of all-lowercase characters (ignoring characters for which no upper-/lowercase distinction exists).(package private) boolean
isNotCapitalizedWord(java.lang.String str)
(package private) static boolean
isNotEmpty(java.lang.String str)
Helper method to replace calls to "".equals().private int
matchAnyToOne(int wordIndex, int candIndex)
private int
matchAnyToTwo(int wordIndex, int candIndex)
private static int
min(int a, int b, int c)
java.util.List<Speller.CandidateData>
replaceRunOnWordCandidates(java.lang.String original)
Propose suggestions for misspelled run-on words.java.util.List<java.lang.String>
replaceRunOnWords(java.lang.String original)
Propose suggestions for misspelled run-on words.(package private) void
setWordAndCandidate(java.lang.String word, java.lang.String candidate)
Sets up the word and candidate.
-
-
-
Field Detail
-
MAX_WORD_LENGTH
public static final int MAX_WORD_LENGTH
Maximum length of the word to be checked.- See Also:
- Constant Field Values
-
FREQ_RANGES
static final int FREQ_RANGES
- See Also:
- Constant Field Values
-
FIRST_RANGE_CODE
static final int FIRST_RANGE_CODE
- See Also:
- Constant Field Values
-
UPPER_SEARCH_LIMIT
static final int UPPER_SEARCH_LIMIT
- See Also:
- Constant Field Values
-
MIN_WORD_LENGTH
private static final int MIN_WORD_LENGTH
- See Also:
- Constant Field Values
-
MAX_RECURSION_LEVEL
private static final int MAX_RECURSION_LEVEL
- See Also:
- Constant Field Values
-
editDistance
private final int editDistance
-
effectEditDistance
private int effectEditDistance
-
hMatrix
private final HMatrix hMatrix
-
candidate
private char[] candidate
-
candLen
private int candLen
-
wordLen
private int wordLen
-
wordProcessed
private char[] wordProcessed
-
replacementsAnyToOne
private java.util.Map<java.lang.Character,java.util.List<char[]>> replacementsAnyToOne
-
replacementsAnyToTwo
private java.util.Map<java.lang.String,java.util.List<char[]>> replacementsAnyToTwo
-
replacementsTheRest
private java.util.Map<java.lang.String,java.util.List<java.lang.String>> replacementsTheRest
-
containsSeparators
private boolean containsSeparators
-
byteBuffer
private java.nio.ByteBuffer byteBuffer
Internal reusable buffer for encoding words into byte arrays usingencoder
.
-
charBuffer
private java.nio.CharBuffer charBuffer
Internal reusable buffer for encoding words into byte arrays usingencoder
.
-
matchResult
private final MatchResult matchResult
Reusable match result.
-
dictionaryMetadata
private final DictionaryMetadata dictionaryMetadata
Features of the compiled dictionary.- See Also:
DictionaryMetadata
-
encoder
private final java.nio.charset.CharsetEncoder encoder
Charset encoder for the FSA.
-
decoder
private final java.nio.charset.CharsetDecoder decoder
Charset decoder for the FSA.
-
matcher
private final FSATraversal matcher
An FSA used for lookups.
-
rootNode
private final int rootNode
FSA's root node.
-
fsa
private final FSA fsa
The FSA we are using.
-
finalStatesIterator
private final ByteSequenceIterator finalStatesIterator
An iterator for walking along the final states offsa
.
-
-
Constructor Detail
-
Speller
public Speller(Dictionary dictionary)
-
Speller
public Speller(Dictionary dictionary, int editDistance)
-
-
Method Detail
-
createReplacementsMaps
private void createReplacementsMaps()
-
charSequenceToBytes
private java.nio.ByteBuffer charSequenceToBytes(java.lang.CharSequence word) throws UnmappableInputException
- Throws:
UnmappableInputException
-
isMisspelled
public boolean isMisspelled(java.lang.String word)
Checks whether the word is misspelled, by performing a series of checks according to properties of the dictionary. If the flagfsa.dict.speller.ignore-punctuation
is set, then all non-alphabetic characters are considered to be correctly spelled. If the flagfsa.dict.speller.ignore-numbers
is set, then all words containing decimal digits are considered to be correctly spelled. If the flagfsa.dict.speller.ignore-camel-case
is set, then all CamelCase words are considered to be correctly spelled. If the flagfsa.dict.speller.ignore-all-uppercase
is set, then all alphabetic words composed of only uppercase characters are considered to be correctly spelled. Otherwise, the word is checked in the dictionary. If the test fails, and the dictionary does not perform any case conversions (as set byfsa.dict.speller.convert-case
flag), then the method returns false. In case of case conversions, it is checked whether a non-mixed case word is found in its lowercase version in the dictionary, and for all-uppercase words, whether the word is found in the dictionary with the initial uppercase letter.- Parameters:
word
- - the word to be checked- Returns:
- true if the word is misspelled
-
initialUppercase
private java.lang.CharSequence initialUppercase(java.lang.String wordToCheck)
-
isInDictionary
public boolean isInDictionary(java.lang.CharSequence word)
Test whether the word is found in the dictionary.- Parameters:
word
- the word to be tested- Returns:
- True if it is found.
-
getFrequency
public int getFrequency(java.lang.CharSequence word)
Get the frequency value for a word form. It is taken from the first entry with this word form.- Parameters:
word
- the word to be tested- Returns:
- frequency value in range: 0..FREQ_RANGE-1 (0: less frequent).
-
replaceRunOnWordCandidates
public java.util.List<Speller.CandidateData> replaceRunOnWordCandidates(java.lang.String original)
Propose suggestions for misspelled run-on words. This algorithm is inspired by spell.cc in s_fsa package by Jan Daciuk.- Parameters:
original
- The original misspelled word.- Returns:
- The list of suggested pairs, as CandidateData with space-concatenated strings.
-
replaceRunOnWords
public java.util.List<java.lang.String> replaceRunOnWords(java.lang.String original)
Propose suggestions for misspelled run-on words. This algorithm is inspired by spell.cc in s_fsa package by Jan Daciuk.- Parameters:
original
- The original misspelled word.- Returns:
- The list of suggested pairs, as space-concatenated strings.
-
addReplacement
private void addReplacement(java.util.List<Speller.CandidateData> candidates, java.lang.String replacement)
-
findSimilarWordCandidates
public java.util.ArrayList<Speller.CandidateData> findSimilarWordCandidates(java.lang.String word)
Find similar words even if the original word is a correct word that exists in the dictionary- Parameters:
word
- The original word.- Returns:
- A list of suggested candidate replacements.
-
findSimilarWords
public java.util.ArrayList<java.lang.String> findSimilarWords(java.lang.String word)
-
findReplacements
public java.util.ArrayList<java.lang.String> findReplacements(java.lang.String word)
Find suggestions by using K. Oflazer's algorithm. See Jan Daciuk's s_fsa package, spell.cc for further explanation.- Parameters:
word
- The original misspelled word.- Returns:
- A list of suggested replacements.
-
findReplacementCandidates
public java.util.ArrayList<Speller.CandidateData> findReplacementCandidates(java.lang.String word)
Find and return suggestions by using K. Oflazer's algorithm. See Jan Daciuk's s_fsa package, spell.cc for further explanation. This method is identical tofindReplacements(java.lang.String)
, but returns candidate terms with their edit distance scores.- Parameters:
word
- The original misspelled word.- Returns:
- A list of suggested candidate replacements.
-
findReplacementCandidates
private java.util.ArrayList<Speller.CandidateData> findReplacementCandidates(java.lang.String word, boolean evenIfWordInDictionary)
-
findRepl
private void findRepl(java.util.List<Speller.CandidateData> candidates, int depth, int node, byte[] prevBytes, int wordIndex, int candIndex)
-
isArcNotTerminal
private boolean isArcNotTerminal(int arc, int candIndex)
-
isEndOfCandidate
private boolean isEndOfCandidate(int arc, int wordIndex)
-
isBeforeSeparator
private boolean isBeforeSeparator(int arc)
-
ed
public int ed(int i, int j, int wordIndex, int candIndex)
Calculates edit distance.- Parameters:
i
- length of first word (here: misspelled) - 1;j
- length of second word (here: candidate) - 1.wordIndex
- (TODO: javadoc?)candIndex
- (TODO: javadoc?)- Returns:
- Edit distance between the two words. Remarks: See Oflazer.
-
areEqual
private boolean areEqual(char x, char y)
-
cuted
public int cuted(int depth, int wordIndex, int candIndex)
Calculates cut-off edit distance.- Parameters:
depth
- current length of candidates.wordIndex
- (TODO: javadoc?)candIndex
- (TODO: javadoc?)- Returns:
- Cut-off edit distance. Remarks: See Oflazer.
-
matchAnyToOne
private int matchAnyToOne(int wordIndex, int candIndex)
-
matchAnyToTwo
private int matchAnyToTwo(int wordIndex, int candIndex)
-
min
private static int min(int a, int b, int c)
-
isAlphabetic
static boolean isAlphabetic(int codePoint)
Copy-paste of Character.isAlphabetic() (needed as we require only 1.6)- Parameters:
codePoint
- The input character.- Returns:
- True if the character is a Unicode alphabetic character.
-
containsNoDigit
static boolean containsNoDigit(java.lang.String s)
Checks whether a string contains a digit. Used for ignoring words with numbers- Parameters:
s
- Word to be checked.- Returns:
- True if there is a digit inside the word.
-
isAllUppercase
boolean isAllUppercase(java.lang.String str)
Returns true ifstr
is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).
-
isNotAllLowercase
boolean isNotAllLowercase(java.lang.String str)
Returns true ifstr
is made up of all-lowercase characters (ignoring characters for which no upper-/lowercase distinction exists).
-
isNotCapitalizedWord
boolean isNotCapitalizedWord(java.lang.String str)
- Parameters:
str
- input string
-
isNotEmpty
static boolean isNotEmpty(java.lang.String str)
Helper method to replace calls to "".equals().- Parameters:
str
- String to check- Returns:
- true if string is empty OR null
-
isMixedCase
boolean isMixedCase(java.lang.String str)
- Parameters:
str
- input str- Returns:
- Returns true if str is MixedCase.
-
isCamelCase
public boolean isCamelCase(java.lang.String str)
- Parameters:
str
- The string to check.- Returns:
- Returns true if str is CamelCase. Note that German compounds with a dash (like "Waschmaschinen-Test") are also considered camel case by this method.
-
convertsCase
public boolean convertsCase()
Used to determine whether the dictionary supports case conversions.- Returns:
- boolean value that answers this question in a deep and meaningful way.
- Since:
- 1.9
-
getAllReplacements
public java.util.List<java.lang.String> getAllReplacements(java.lang.String str, int fromIndex, int level)
- Parameters:
str
- The string to find the replacements for.fromIndex
- The index from which replacements are found.level
- The recursion level. The search stops if level is > MAX_RECURSION_LEVEL.- Returns:
- A list of all possible replacements of a {#link str} given string
-
setWordAndCandidate
void setWordAndCandidate(java.lang.String word, java.lang.String candidate)
Sets up the word and candidate. Used only to test the edit distance in JUnit tests.- Parameters:
word
- the first wordcandidate
- the second word used for edit distance calculation
-
getWordLen
public final int getWordLen()
-
getCandLen
public final int getCandLen()
-
getEffectiveED
public final int getEffectiveED()
-
-