Class AbstractWordSplitter

  • Direct Known Subclasses:
    GermanWordSplitter

    public abstract class AbstractWordSplitter
    extends java.lang.Object
    This class splits compound words into their smallest parts (atoms). For example, "Erhebungsfehler" is split into "erhebung" and "fehler" if "erhebung" and "fehler" are in the dictionary and "erhebungsfehler" is not. How a word is split therefore depends only on the contents of the dictionary. A dictionary for German is included.

    This is especially useful for German words, but it will work with all languages. The parts are returned in the same order in which they appear in the compound word. Providing a large dictionary is recommended.

    Please note: The input is expected to be a single word made up of letters only, without special characters (!":;,.-_ etc.).
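The dictionary-driven behavior described above can be illustrated with a small stand-alone sketch. This is not the library's implementation: the class name CompoundSplitSketch, the greedy longest-prefix strategy, and the single hard-coded interfix "s" are assumptions made for illustration. It only mirrors the documented contract (a known whole word is not split, an unsplittable word is returned as a single element, null yields an empty list).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the library's algorithm.
class CompoundSplitSketch {

    // Assumed interfix for this sketch; the real class asks getInterfixCharacters().
    private static final String INTERFIX = "s";

    /** Returns the dictionary parts of word, or the word itself if no split is found. */
    static List<String> split(String word, Set<String> dictionary) {
        if (word == null) {
            return List.of();                 // documented: empty list for null input
        }
        String lower = word.toLowerCase(Locale.ROOT);
        if (dictionary.contains(lower)) {
            return List.of(lower);            // the whole word is known, so it is not split
        }
        List<String> parts = splitRecursive(lower, dictionary);
        return parts != null ? parts : List.of(lower);
    }

    // Returns parts that cover the whole string, or null if there is no such cover.
    private static List<String> splitRecursive(String rest, Set<String> dictionary) {
        if (rest.isEmpty()) {
            return new ArrayList<>();
        }
        for (int end = rest.length(); end > 0; end--) {   // try the longest prefix first
            String head = rest.substring(0, end);
            if (!dictionary.contains(head)) {
                continue;
            }
            String tail = rest.substring(end);
            List<String> tailParts = splitRecursive(tail, dictionary);
            // If the tail cannot be covered directly, retry after dropping a leading interfix.
            if (tailParts == null && tail.startsWith(INTERFIX) && tail.length() > INTERFIX.length()) {
                tailParts = splitRecursive(tail.substring(INTERFIX.length()), dictionary);
            }
            if (tailParts != null) {
                tailParts.add(0, head);
                return tailParts;
            }
        }
        return null;
    }
}
```

With a dictionary containing only "erhebung" and "fehler", CompoundSplitSketch.split("Erhebungsfehler", dictionary) yields the two parts in their original order; adding "erhebungsfehler" to the dictionary makes the same call return the word unsplit.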

    • Constructor Summary

      Constructors 
      Constructor Description
      AbstractWordSplitter​(boolean hideInterfixCharacters)
      Create a word splitter that uses the embedded dictionary.
      AbstractWordSplitter​(boolean hideInterfixCharacters, java.io.File plainTextDict)  
      AbstractWordSplitter​(boolean hideInterfixCharacters, java.io.InputStream plainTextDict)  
      AbstractWordSplitter​(boolean hideInterfixCharacters, java.util.Set<java.lang.String> words)  
    • Method Summary

      Modifier and Type Method Description
      void addException​(java.lang.String completeWord, java.util.List<java.lang.String> wordParts)  
      private void cleanLeadingAndTrailingHyphens​(java.util.List<java.lang.String> disambiguatedParts)  
      private boolean endsWithInterfix​(java.lang.String word)  
      private java.lang.String findInterfixOrNull​(java.lang.String word)  
      java.util.List<java.util.List<java.lang.String>> getAllSplits​(java.lang.String word)
      Experimental: Split a word with unknown parts, typically because one part has a typo.
      (package private) java.util.List<java.util.List<java.lang.String>> getAllSplits​(java.lang.String word, boolean fromLeft)  
      protected abstract int getDefaultMinimumWordLength()  
      protected abstract GermanInterfixDisambiguator getDisambiguator()  
      private java.util.List<java.lang.String> getExceptionSplitOrNull​(java.lang.String rightPart, java.lang.String leftPart)  
      protected abstract java.util.Collection<java.lang.String> getInterfixCharacters()
      Interfix elements in lowercase, e.g. at least "s" for German.
      java.util.List<java.lang.String> getSubWords​(java.lang.String word)  
      protected abstract java.util.Set<java.lang.String> getWordList()  
      private java.util.Set<java.lang.String> getWordList​(java.io.File file)  
      protected abstract java.util.Set<java.lang.String> getWordList​(java.io.InputStream stream)  
      private boolean isLoopEnd​(boolean fromLeft, int i, java.lang.String word)  
      private boolean isSimpleWord​(java.lang.String part)  
      private java.lang.String removeInterfix​(java.lang.String word, java.lang.String interfixOrNull)  
      void setExceptionFile​(java.lang.String filename)  
      void setMaximumWordLength​(int len)
      Splitting a word longer than this throws an IllegalArgumentException, which avoids extremely long processing times.
      void setMinimumWordLength​(int len)  
      void setStrictMode​(boolean strictMode)
      When set to true, words will only be split if all parts are words.
      private java.util.List<java.lang.String> split​(java.lang.String word, boolean allowInterfixRemoval, boolean collectSubwords)  
      private java.util.List<java.lang.String> splitFromRight​(java.lang.String word, boolean collectSubwords)  
      java.util.List<java.lang.String> splitWord​(java.lang.String word)  
      java.util.List<java.lang.String> splitWord​(java.lang.String word, boolean collectSubwords)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • words

        protected java.util.Set<java.lang.String> words
      • hideInterfixCharacters

        private final boolean hideInterfixCharacters
      • strictMode

        private boolean strictMode
      • minimumWordLength

        private int minimumWordLength
      • maximumWordLength

        private int maximumWordLength
    • Constructor Detail

      • AbstractWordSplitter

        public AbstractWordSplitter​(boolean hideInterfixCharacters)
                             throws java.io.IOException
        Create a word splitter that uses the embedded dictionary.
        Parameters:
        hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
        Throws:
        java.io.IOException
      • AbstractWordSplitter

        public AbstractWordSplitter​(boolean hideInterfixCharacters,
                                    java.io.InputStream plainTextDict)
                             throws java.io.IOException
        Parameters:
        hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
        plainTextDict - a stream of a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
        Throws:
        java.io.IOException
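The plainTextDict format (UTF-8, one word per line) can be read as in the following minimal sketch. The class WordListSketch and its readWords method are hypothetical names; trimming, skipping blank lines, and lowercasing are assumptions of this sketch and are not confirmed behavior of getWordList(InputStream).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; shows one way to read a one-word-per-line UTF-8 dictionary.
class WordListSketch {

    static Set<String> readWords(InputStream stream) throws IOException {
        Set<String> words = new HashSet<>();
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {                          // assumption: blank lines are ignored
                    words.add(word.toLowerCase(Locale.ROOT));   // assumption: entries are stored in lowercase
                }
            }
        }
        return words;
    }
}
```

A set built this way could then be passed to a constructor that accepts java.util.Set<java.lang.String> words.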
      • AbstractWordSplitter

        public AbstractWordSplitter​(boolean hideInterfixCharacters,
                                    java.io.File plainTextDict)
                             throws java.io.IOException
        Parameters:
        hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
        plainTextDict - a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
        Throws:
        java.io.IOException
      • AbstractWordSplitter

        public AbstractWordSplitter​(boolean hideInterfixCharacters,
                                    java.util.Set<java.lang.String> words)
                             throws java.io.IOException
        Parameters:
        hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
        words - the compound part words
        Throws:
        java.io.IOException
        Since:
        4.1
    • Method Detail

      • getWordList

        protected abstract java.util.Set<java.lang.String> getWordList​(java.io.InputStream stream)
                                                                throws java.io.IOException
        Throws:
        java.io.IOException
      • getWordList

        protected abstract java.util.Set<java.lang.String> getWordList()
                                                                throws java.io.IOException
        Throws:
        java.io.IOException
      • getDefaultMinimumWordLength

        protected abstract int getDefaultMinimumWordLength()
      • getInterfixCharacters

        protected abstract java.util.Collection<java.lang.String> getInterfixCharacters()
        Interfix elements in lowercase, e.g. at least "s" for German.
      • getWordList

        private java.util.Set<java.lang.String> getWordList​(java.io.File file)
                                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • setMinimumWordLength

        public void setMinimumWordLength​(int len)
      • setMaximumWordLength

        public void setMaximumWordLength​(int len)
        Splitting a word longer than this throws an IllegalArgumentException, which avoids extremely long processing times. The default is 70.
        Since:
        4.2
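The length limit can be pictured as a simple guard that runs before any splitting work. The sketch below assumes a hypothetical class LengthGuardSketch with a check method; only the default of 70 and the exception type are taken from the description above.

```java
// Hypothetical guard; illustrates the documented limit, not the library's code.
class LengthGuardSketch {

    private int maximumWordLength = 70;   // default stated in the documentation

    void setMaximumWordLength(int len) {
        this.maximumWordLength = len;
    }

    /** Throws IllegalArgumentException if the word exceeds the configured limit. */
    void check(String word) {
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException(
                "Word is longer than " + maximumWordLength + " characters: " + word.length());
        }
    }
}
```

Callers that process untrusted input may want to catch IllegalArgumentException and treat the over-long token as unsplittable.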
      • setExceptionFile

        public void setExceptionFile​(java.lang.String filename)
                              throws java.io.IOException
        Parameters:
        filename - name of a UTF-8 encoded file in the classpath that lists exceptions, one exception per line, using pipe as delimiter between the parts. Example: Pilot|sendung
        Throws:
        java.io.IOException
      • addException

        public void addException​(java.lang.String completeWord,
                                 java.util.List<java.lang.String> wordParts)
        Parameters:
        completeWord - the word for which an exception is to be defined (will be considered case-insensitive)
        wordParts - the parts in which the word is to be split (use a list with a single element if the word should not be split)
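The pipe-delimited exception format and the case-insensitive lookup described for completeWord can be sketched together. The class SplitExceptionsSketch and its methods addExceptionLine and lookup are hypothetical; storing the key in lowercase is one assumed way to obtain case-insensitive matching.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of an exception table; not the library's data structure.
class SplitExceptionsSketch {

    private final Map<String, List<String>> exceptions = new HashMap<>();

    /** Registers the parts for a complete word; the key is lowercased for case-insensitive lookup. */
    void addException(String completeWord, List<String> wordParts) {
        exceptions.put(completeWord.toLowerCase(Locale.ROOT), List.copyOf(wordParts));
    }

    /** Parses one line such as "Pilot|sendung" and registers it. */
    void addExceptionLine(String line) {
        // The pipe is a regex metacharacter, so it must be escaped for String.split.
        List<String> parts = Arrays.asList(line.trim().split("\\|"));
        addException(String.join("", parts), parts);
    }

    /** Returns the registered parts, or null if there is no exception for this word. */
    List<String> lookup(String word) {
        return exceptions.get(word.toLowerCase(Locale.ROOT));
    }
}
```

A single-element list, as in addException("Autobahn", List.of("Autobahn")), marks a word that should not be split.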
      • setStrictMode

        public void setStrictMode​(boolean strictMode)
        When set to true, words will only be split if all parts are words. Otherwise the splitting result might contain parts that are not words.
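The effect of strict mode on a candidate split can be sketched as a filter over the proposed parts. StrictModeSketch and its accept method are hypothetical names, and the sketch assumes that a rejected split falls back to the unsplit word, in line with the return contract of splitWord.

```java
import java.util.List;
import java.util.Set;

// Hypothetical filter; illustrates the documented strict-mode contract only.
class StrictModeSketch {

    /**
     * In strict mode the candidate parts are accepted only if every part is a dictionary word;
     * otherwise the word is returned unsplit. In non-strict mode the candidate is returned as-is.
     */
    static List<String> accept(String word, List<String> candidateParts,
                               Set<String> dictionary, boolean strictMode) {
        if (!strictMode) {
            return candidateParts;            // may contain parts that are not words
        }
        for (String part : candidateParts) {
            if (!dictionary.contains(part)) {
                return List.of(word);         // one unknown part rejects the whole split
            }
        }
        return candidateParts;
    }
}
```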
      • getAllSplits

        public java.util.List<java.util.List<java.lang.String>> getAllSplits​(java.lang.String word)
        Experimental: Split a word with unknown parts, typically because one part has a typo. This could be used to split three-part compounds where one part has a typo (the caller is then responsible for making useful corrections out of these parts). Results are returned in no specific order.
        Since:
        4.0
      • getAllSplits

        java.util.List<java.util.List<java.lang.String>> getAllSplits​(java.lang.String word,
                                                                      boolean fromLeft)
                                                               throws java.lang.InterruptedException
        Throws:
        java.lang.InterruptedException
      • isLoopEnd

        private boolean isLoopEnd​(boolean fromLeft,
                                  int i,
                                  java.lang.String word)
      • getSubWords

        public java.util.List<java.lang.String> getSubWords​(java.lang.String word)
        Since:
        4.2
      • splitWord

        public java.util.List<java.lang.String> splitWord​(java.lang.String word)
      • splitWord

        public java.util.List<java.lang.String> splitWord​(java.lang.String word,
                                                          boolean collectSubwords)
        Returns:
        a list of compound parts, with one element (the input word itself) if the input could not be split; returns an empty list if the input is null
        Since:
        4.2
      • cleanLeadingAndTrailingHyphens

        private void cleanLeadingAndTrailingHyphens​(java.util.List<java.lang.String> disambiguatedParts)
      • split

        private java.util.List<java.lang.String> split​(java.lang.String word,
                                                       boolean allowInterfixRemoval,
                                                       boolean collectSubwords)
      • splitFromRight

        private java.util.List<java.lang.String> splitFromRight​(java.lang.String word,
                                                                boolean collectSubwords)
      • getExceptionSplitOrNull

        private java.util.List<java.lang.String> getExceptionSplitOrNull​(java.lang.String rightPart,
                                                                         java.lang.String leftPart)
      • findInterfixOrNull

        private java.lang.String findInterfixOrNull​(java.lang.String word)
      • endsWithInterfix

        private boolean endsWithInterfix​(java.lang.String word)
      • removeInterfix

        private java.lang.String removeInterfix​(java.lang.String word,
                                                java.lang.String interfixOrNull)
      • isSimpleWord

        private boolean isSimpleWord​(java.lang.String part)