Class Speller


  • public class Speller
    extends java.lang.Object
    Finds spelling suggestions. Implements K. Oflazer's algorithm as described in: Oflazer, Kemal. 1996. "Error-Tolerant Finite-State Recognition with Applications to Morphological Analysis and Spelling Correction." Computational Linguistics 22 (1): 73–89.

    See Jan Daciuk's s_fsa package.

    • Field Detail

      • MAX_WORD_LENGTH

        public static final int MAX_WORD_LENGTH
        Maximum length of the word to be checked.
        See Also:
        Constant Field Values
      • editDistance

        private final int editDistance
      • effectEditDistance

        private int effectEditDistance
      • hMatrix

        private final HMatrix hMatrix
      • candidate

        private char[] candidate
      • candLen

        private int candLen
      • wordLen

        private int wordLen
      • wordProcessed

        private char[] wordProcessed
      • replacementsAnyToOne

        private java.util.Map<java.lang.Character,​java.util.List<char[]>> replacementsAnyToOne
      • replacementsAnyToTwo

        private java.util.Map<java.lang.String,​java.util.List<char[]>> replacementsAnyToTwo
      • replacementsTheRest

        private java.util.Map<java.lang.String,​java.util.List<java.lang.String>> replacementsTheRest
      • containsSeparators

        private boolean containsSeparators
      • byteBuffer

        private java.nio.ByteBuffer byteBuffer
        Internal reusable buffer for encoding words into byte arrays using encoder.
      • charBuffer

        private java.nio.CharBuffer charBuffer
        Internal reusable buffer for encoding words into byte arrays using encoder.
      • matchResult

        private final MatchResult matchResult
        Reusable match result.
      • encoder

        private final java.nio.charset.CharsetEncoder encoder
        Charset encoder for the FSA.
      • decoder

        private final java.nio.charset.CharsetDecoder decoder
        Charset decoder for the FSA.
      • matcher

        private final FSATraversal matcher
        An FSA used for lookups.
      • rootNode

        private final int rootNode
        FSA's root node.
      • fsa

        private final FSA fsa
        The FSA we are using.
      • finalStatesIterator

        private final ByteSequenceIterator finalStatesIterator
        An iterator for walking along the final states of fsa.
    • Constructor Detail

      • Speller

        public Speller​(Dictionary dictionary)
      • Speller

        public Speller​(Dictionary dictionary,
                       int editDistance)
    • Method Detail

      • createReplacementsMaps

        private void createReplacementsMaps()
      • isMisspelled

        public boolean isMisspelled​(java.lang.String word)
        Checks whether the word is misspelled, by performing a series of checks according to properties of the dictionary. If the flag fsa.dict.speller.ignore-punctuation is set, then all non-alphabetic characters are considered to be correctly spelled. If the flag fsa.dict.speller.ignore-numbers is set, then all words containing decimal digits are considered to be correctly spelled. If the flag fsa.dict.speller.ignore-camel-case is set, then all CamelCase words are considered to be correctly spelled. If the flag fsa.dict.speller.ignore-all-uppercase is set, then all alphabetic words composed of only uppercase characters are considered to be correctly spelled. Otherwise, the word is checked in the dictionary. If the test fails, and the dictionary does not perform any case conversions (as set by fsa.dict.speller.convert-case flag), then the method returns false. In case of case conversions, it is checked whether a non-mixed case word is found in its lowercase version in the dictionary, and for all-uppercase words, whether the word is found in the dictionary with the initial uppercase letter.
        Parameters:
        word - - the word to be checked
        Returns:
        true if the word is misspelled
      • initialUppercase

        private java.lang.CharSequence initialUppercase​(java.lang.String wordToCheck)
      • isInDictionary

        public boolean isInDictionary​(java.lang.CharSequence word)
        Test whether the word is found in the dictionary.
        Parameters:
        word - the word to be tested
        Returns:
        True if it is found.
      • getFrequency

        public int getFrequency​(java.lang.CharSequence word)
        Get the frequency value for a word form. It is taken from the first entry with this word form.
        Parameters:
        word - the word to be tested
        Returns:
        frequency value in range: 0..FREQ_RANGE-1 (0: less frequent).
      • replaceRunOnWordCandidates

        public java.util.List<Speller.CandidateData> replaceRunOnWordCandidates​(java.lang.String original)
        Propose suggestions for misspelled run-on words. This algorithm is inspired by spell.cc in s_fsa package by Jan Daciuk.
        Parameters:
        original - The original misspelled word.
        Returns:
        The list of suggested pairs, as CandidateData with space-concatenated strings.
      • replaceRunOnWords

        public java.util.List<java.lang.String> replaceRunOnWords​(java.lang.String original)
        Propose suggestions for misspelled run-on words. This algorithm is inspired by spell.cc in s_fsa package by Jan Daciuk.
        Parameters:
        original - The original misspelled word.
        Returns:
        The list of suggested pairs, as space-concatenated strings.
      • addReplacement

        private void addReplacement​(java.util.List<Speller.CandidateData> candidates,
                                    java.lang.String replacement)
      • findSimilarWordCandidates

        public java.util.ArrayList<Speller.CandidateData> findSimilarWordCandidates​(java.lang.String word)
        Find similar words even if the original word is a correct word that exists in the dictionary
        Parameters:
        word - The original word.
        Returns:
        A list of suggested candidate replacements.
      • findSimilarWords

        public java.util.ArrayList<java.lang.String> findSimilarWords​(java.lang.String word)
      • findReplacements

        public java.util.ArrayList<java.lang.String> findReplacements​(java.lang.String word)
        Find suggestions by using K. Oflazer's algorithm. See Jan Daciuk's s_fsa package, spell.cc for further explanation.
        Parameters:
        word - The original misspelled word.
        Returns:
        A list of suggested replacements.
      • findReplacementCandidates

        public java.util.ArrayList<Speller.CandidateData> findReplacementCandidates​(java.lang.String word)
        Find and return suggestions by using K. Oflazer's algorithm. See Jan Daciuk's s_fsa package, spell.cc for further explanation. This method is identical to findReplacements(java.lang.String), but returns candidate terms with their edit distance scores.
        Parameters:
        word - The original misspelled word.
        Returns:
        A list of suggested candidate replacements.
      • findReplacementCandidates

        private java.util.ArrayList<Speller.CandidateData> findReplacementCandidates​(java.lang.String word,
                                                                                     boolean evenIfWordInDictionary)
      • findRepl

        private void findRepl​(java.util.List<Speller.CandidateData> candidates,
                              int depth,
                              int node,
                              byte[] prevBytes,
                              int wordIndex,
                              int candIndex)
      • isArcNotTerminal

        private boolean isArcNotTerminal​(int arc,
                                         int candIndex)
      • isEndOfCandidate

        private boolean isEndOfCandidate​(int arc,
                                         int wordIndex)
      • isBeforeSeparator

        private boolean isBeforeSeparator​(int arc)
      • ed

        public int ed​(int i,
                      int j,
                      int wordIndex,
                      int candIndex)
        Calculates edit distance.
        Parameters:
        i - length of first word (here: misspelled) - 1;
        j - length of second word (here: candidate) - 1.
        wordIndex - (TODO: javadoc?)
        candIndex - (TODO: javadoc?)
        Returns:
        Edit distance between the two words. Remarks: See Oflazer.
      • areEqual

        private boolean areEqual​(char x,
                                 char y)
      • cuted

        public int cuted​(int depth,
                         int wordIndex,
                         int candIndex)
        Calculates cut-off edit distance.
        Parameters:
        depth - current length of candidates.
        wordIndex - (TODO: javadoc?)
        candIndex - (TODO: javadoc?)
        Returns:
        Cut-off edit distance. Remarks: See Oflazer.
      • matchAnyToOne

        private int matchAnyToOne​(int wordIndex,
                                  int candIndex)
      • matchAnyToTwo

        private int matchAnyToTwo​(int wordIndex,
                                  int candIndex)
      • min

        private static int min​(int a,
                               int b,
                               int c)
      • isAlphabetic

        static boolean isAlphabetic​(int codePoint)
        Copy-paste of Character.isAlphabetic() (needed as we require only 1.6)
        Parameters:
        codePoint - The input character.
        Returns:
        True if the character is a Unicode alphabetic character.
      • containsNoDigit

        static boolean containsNoDigit​(java.lang.String s)
        Checks whether a string contains a digit. Used for ignoring words with numbers
        Parameters:
        s - Word to be checked.
        Returns:
        True if there is a digit inside the word.
      • isAllUppercase

        boolean isAllUppercase​(java.lang.String str)
        Returns true if str is made up of all-uppercase characters (ignoring characters for which no upper-/lowercase distinction exists).
      • isNotAllLowercase

        boolean isNotAllLowercase​(java.lang.String str)
        Returns true if str is made up of all-lowercase characters (ignoring characters for which no upper-/lowercase distinction exists).
      • isNotCapitalizedWord

        boolean isNotCapitalizedWord​(java.lang.String str)
        Parameters:
        str - input string
      • isNotEmpty

        static boolean isNotEmpty​(java.lang.String str)
        Helper method to replace calls to "".equals().
        Parameters:
        str - String to check
        Returns:
        true if string is empty OR null
      • isMixedCase

        boolean isMixedCase​(java.lang.String str)
        Parameters:
        str - input str
        Returns:
        Returns true if str is MixedCase.
      • isCamelCase

        public boolean isCamelCase​(java.lang.String str)
        Parameters:
        str - The string to check.
        Returns:
        Returns true if str is CamelCase. Note that German compounds with a dash (like "Waschmaschinen-Test") are also considered camel case by this method.
      • convertsCase

        public boolean convertsCase()
        Used to determine whether the dictionary supports case conversions.
        Returns:
        boolean value that answers this question in a deep and meaningful way.
        Since:
        1.9
      • getAllReplacements

        public java.util.List<java.lang.String> getAllReplacements​(java.lang.String str,
                                                                   int fromIndex,
                                                                   int level)
        Parameters:
        str - The string to find the replacements for.
        fromIndex - The index from which replacements are found.
        level - The recursion level. The search stops if level is > MAX_RECURSION_LEVEL.
        Returns:
        A list of all possible replacements of a {#link str} given string
      • setWordAndCandidate

        void setWordAndCandidate​(java.lang.String word,
                                 java.lang.String candidate)
        Sets up the word and candidate. Used only to test the edit distance in JUnit tests.
        Parameters:
        word - the first word
        candidate - the second word used for edit distance calculation
      • getWordLen

        public final int getWordLen()
      • getCandLen

        public final int getCandLen()
      • getEffectiveED

        public final int getEffectiveED()