Package de.danielnaber.jwordsplitter
Class AbstractWordSplitter
- java.lang.Object
- de.danielnaber.jwordsplitter.AbstractWordSplitter
- Direct Known Subclasses: GermanWordSplitter
public abstract class AbstractWordSplitter extends java.lang.Object
This class splits compound words into their smallest parts (atoms). For example, "Erhebungsfehler" is split into "erhebung" and "fehler" if "erhebung" and "fehler" are in the dictionary and "erhebungsfehler" is not. How a word is split therefore depends only on the contents of the dictionary. A dictionary for German is included. Splitting is especially useful for German words, but it works with all languages. The parts are returned in the same order in which they appear in the compound. A large dictionary is recommended.
Please note: the input is expected to be a single word made up of plain characters, without special characters (!":;,.-_, etc.).
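The dictionary-driven behaviour described above can be illustrated with a toy sketch. It is not the library's algorithm: the class `ToySplitter`, its `split` method, the restriction to two-part compounds, and the single hard-coded interfix "s" are all assumptions made for illustration. The sketch tries every split point and accepts the first one where the right part and the left part (optionally without a trailing interfix) are both in the dictionary.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only; class and method names are hypothetical, not part of jWordSplitter.
public class ToySplitter {

    // Splits a two-part compound using nothing but the given dictionary of lowercase words.
    static List<String> split(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (dictionary.contains(lower)) {
            return List.of(word); // a word that is itself in the dictionary is not split
        }
        for (int i = 1; i < lower.length(); i++) {
            String left = lower.substring(0, i);
            String right = lower.substring(i);
            if (!dictionary.contains(right)) {
                continue;
            }
            if (dictionary.contains(left)) {
                return List.of(left, right);
            }
            // Assumed interfix "s": "erhebungs" + "fehler" becomes "erhebung" + "fehler"
            if (left.endsWith("s")) {
                String withoutInterfix = left.substring(0, left.length() - 1);
                if (dictionary.contains(withoutInterfix)) {
                    return List.of(withoutInterfix, right);
                }
            }
        }
        return List.of(word); // no split found: return the input word itself
    }

    public static void main(String[] args) {
        System.out.println(split("Erhebungsfehler", Set.of("erhebung", "fehler")));
        System.out.println(split("Erhebungsfehler", Set.of("erhebung", "fehler", "erhebungsfehler")));
    }
}
```

Adding "erhebungsfehler" to the dictionary, as in the second call, suppresses the split, which is the sense in which the result depends only on the dictionary contents.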
-
-
Field Summary
Fields: Modifier and Type, Field, Description
private ExceptionSplits
exceptionSplits
private boolean
hideInterfixCharacters
private int
maximumWordLength
private int
minimumWordLength
private boolean
strictMode
protected java.util.Set<java.lang.String>
words
-
Constructor Summary
Constructors: Constructor, Description
AbstractWordSplitter(boolean hideInterfixCharacters)
Create a word splitter that uses the embedded dictionary.
AbstractWordSplitter(boolean hideInterfixCharacters, java.io.File plainTextDict)
AbstractWordSplitter(boolean hideInterfixCharacters, java.io.InputStream plainTextDict)
AbstractWordSplitter(boolean hideInterfixCharacters, java.util.Set<java.lang.String> words)
-
Method Summary
Methods: Modifier and Type, Method, Description
void
addException(java.lang.String completeWord, java.util.List<java.lang.String> wordParts)
private void
cleanLeadingAndTrailingHyphens(java.util.List<java.lang.String> disambiguatedParts)
private boolean
endsWithInterfix(java.lang.String word)
private java.lang.String
findInterfixOrNull(java.lang.String word)
java.util.List<java.util.List<java.lang.String>>
getAllSplits(java.lang.String word)
Experimental: Split a word with unknown parts, typically because one part has a typo.
(package private) java.util.List<java.util.List<java.lang.String>>
getAllSplits(java.lang.String word, boolean fromLeft)
protected abstract int
getDefaultMinimumWordLength()
protected abstract GermanInterfixDisambiguator
getDisambiguator()
private java.util.List<java.lang.String>
getExceptionSplitOrNull(java.lang.String rightPart, java.lang.String leftPart)
protected abstract java.util.Collection<java.lang.String>
getInterfixCharacters()
Interfix elements in lowercase, e.g. at least "s" for German.
java.util.List<java.lang.String>
getSubWords(java.lang.String word)
protected abstract java.util.Set<java.lang.String>
getWordList()
private java.util.Set<java.lang.String>
getWordList(java.io.File file)
protected abstract java.util.Set<java.lang.String>
getWordList(java.io.InputStream stream)
private boolean
isLoopEnd(boolean fromLeft, int i, java.lang.String word)
private boolean
isSimpleWord(java.lang.String part)
private java.lang.String
removeInterfix(java.lang.String word, java.lang.String interfixOrNull)
void
setExceptionFile(java.lang.String filename)
void
setMaximumWordLength(int len)
Words longer than this cause an IllegalArgumentException to be thrown, to avoid extremely long processing times.
void
setMinimumWordLength(int len)
void
setStrictMode(boolean strictMode)
When set to true, words will only be split if all parts are words.
private java.util.List<java.lang.String>
split(java.lang.String word, boolean allowInterfixRemoval, boolean collectSubwords)
private java.util.List<java.lang.String>
splitFromRight(java.lang.String word, boolean collectSubwords)
java.util.List<java.lang.String>
splitWord(java.lang.String word)
java.util.List<java.lang.String>
splitWord(java.lang.String word, boolean collectSubwords)
-
-
-
Field Detail
-
words
protected java.util.Set<java.lang.String> words
-
hideInterfixCharacters
private final boolean hideInterfixCharacters
-
exceptionSplits
private ExceptionSplits exceptionSplits
-
strictMode
private boolean strictMode
-
minimumWordLength
private int minimumWordLength
-
maximumWordLength
private int maximumWordLength
-
-
Constructor Detail
-
AbstractWordSplitter
public AbstractWordSplitter(boolean hideInterfixCharacters) throws java.io.IOException
Create a word splitter that uses the embedded dictionary.
- Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
- Throws:
java.io.IOException
-
AbstractWordSplitter
public AbstractWordSplitter(boolean hideInterfixCharacters, java.io.InputStream plainTextDict) throws java.io.IOException
- Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict - a stream of a text file with one word per line, to be used instead of the embedded dictionary; must be UTF-8 encoded
- Throws:
java.io.IOException
-
AbstractWordSplitter
public AbstractWordSplitter(boolean hideInterfixCharacters, java.io.File plainTextDict) throws java.io.IOException
- Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict - a text file with one word per line, to be used instead of the embedded dictionary; must be UTF-8 encoded
- Throws:
java.io.IOException
-
AbstractWordSplitter
public AbstractWordSplitter(boolean hideInterfixCharacters, java.util.Set<java.lang.String> words) throws java.io.IOException
- Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
words - the words that can occur as parts of compounds
- Throws:
java.io.IOException
- Since:
- 4.1
-
-
Method Detail
-
getWordList
protected abstract java.util.Set<java.lang.String> getWordList(java.io.InputStream stream) throws java.io.IOException
- Throws:
java.io.IOException
-
getWordList
protected abstract java.util.Set<java.lang.String> getWordList() throws java.io.IOException
- Throws:
java.io.IOException
-
getDisambiguator
protected abstract GermanInterfixDisambiguator getDisambiguator()
-
getDefaultMinimumWordLength
protected abstract int getDefaultMinimumWordLength()
-
getInterfixCharacters
protected abstract java.util.Collection<java.lang.String> getInterfixCharacters()
Interfix elements in lowercase, e.g. at least "s" for German.
-
getWordList
private java.util.Set<java.lang.String> getWordList(java.io.File file) throws java.io.IOException
- Throws:
java.io.IOException
-
setMinimumWordLength
public void setMinimumWordLength(int len)
-
setMaximumWordLength
public void setMaximumWordLength(int len)
Words longer than this cause an IllegalArgumentException to be thrown, to avoid extremely long processing times. The default is 70.
- Since:
- 4.2
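As a rough sketch of the guard described above: the class `LengthGuard` and its `check` method are hypothetical names, and the exception message is invented. Only the default of 70 and the exception type come from the documentation.

```java
// Hypothetical sketch of a maximum-length guard; not the library's code.
public class LengthGuard {

    private int maximumWordLength = 70; // default stated in the documentation

    void setMaximumWordLength(int len) {
        this.maximumWordLength = len;
    }

    // Rejects overly long input before any expensive splitting work would start.
    void check(String word) {
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException(
                "Word has " + word.length() + " characters, maximum is " + maximumWordLength);
        }
    }

    public static void main(String[] args) {
        LengthGuard guard = new LengthGuard();
        guard.check("Erhebungsfehler"); // 15 characters, accepted
        try {
            guard.check("a".repeat(71)); // one character over the default limit
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```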
-
setExceptionFile
public void setExceptionFile(java.lang.String filename) throws java.io.IOException
- Parameters:
filename - UTF-8 encoded file with exceptions in the classpath, one exception per line, using pipe as delimiter. Example: Pilot|sendung
- Throws:
java.io.IOException
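The line format can be pictured with a small parsing sketch. It rests on assumptions not confirmed by this page: that the complete word is the concatenation of the pipe-separated parts, that blank lines are skipped, and that lookups use a lowercase key. The names `ExceptionLineParser` and `parse` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of reading exception lines; not the library's code.
public class ExceptionLineParser {

    // Turns lines such as "Pilot|sendung" into a map from complete word to its parts.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> exceptions = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines carry no exception
            }
            List<String> parts = Arrays.asList(trimmed.split("\\|"));
            // Assumption: the complete word is the parts joined together, keyed in lowercase
            String completeWord = String.join("", parts).toLowerCase(Locale.ROOT);
            exceptions.put(completeWord, parts);
        }
        return exceptions;
    }

    public static void main(String[] args) {
        Map<String, List<String>> exceptions = parse(List.of("Pilot|sendung"));
        System.out.println(exceptions.get("pilotsendung"));
    }
}
```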
-
addException
public void addException(java.lang.String completeWord, java.util.List<java.lang.String> wordParts)
- Parameters:
completeWord - the word for which an exception is to be defined (will be considered case-insensitive)
wordParts - the parts in which the word is to be split (use a list with a single element if the word should not be split)
-
setStrictMode
public void setStrictMode(boolean strictMode)
When set to true, words will only be split if all parts are words. Otherwise the splitting result might contain parts that are not words.
-
getAllSplits
public java.util.List<java.util.List<java.lang.String>> getAllSplits(java.lang.String word)
Experimental: Split a word with unknown parts, typically because one part has a typo. This could be used to split three-part compounds where one part has a typo (the caller is then responsible for making useful corrections out of these parts). Results are returned in no specific order.
- Since:
- 4.0
-
getAllSplits
java.util.List<java.util.List<java.lang.String>> getAllSplits(java.lang.String word, boolean fromLeft) throws java.lang.InterruptedException
- Throws:
java.lang.InterruptedException
-
isLoopEnd
private boolean isLoopEnd(boolean fromLeft, int i, java.lang.String word)
-
getSubWords
public java.util.List<java.lang.String> getSubWords(java.lang.String word)
- Since:
- 4.2
-
splitWord
public java.util.List<java.lang.String> splitWord(java.lang.String word)
-
splitWord
public java.util.List<java.lang.String> splitWord(java.lang.String word, boolean collectSubwords)
- Returns:
- a list of compound parts, with one element (the input word itself) if the input could not be split; returns an empty list if the input is null
- Since:
- 4.2
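A caller can distinguish the three cases of this return contract by list size alone. The helper below is a hypothetical caller-side sketch (`SplitResultCheck` and `describe` are not part of the library) and works on plain lists, so it does not depend on any splitter instance.

```java
import java.util.List;

// Hypothetical caller-side helper that interprets a splitWord-style result list.
public class SplitResultCheck {

    static String describe(List<String> parts) {
        if (parts.isEmpty()) {
            return "input was null";
        }
        if (parts.size() == 1) {
            return "not split: " + parts.get(0);
        }
        return "split into " + parts.size() + " parts: " + String.join(" + ", parts);
    }

    public static void main(String[] args) {
        System.out.println(describe(List.of()));
        System.out.println(describe(List.of("Haus")));
        System.out.println(describe(List.of("erhebung", "fehler")));
    }
}
```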
-
cleanLeadingAndTrailingHyphens
private void cleanLeadingAndTrailingHyphens(java.util.List<java.lang.String> disambiguatedParts)
-
split
private java.util.List<java.lang.String> split(java.lang.String word, boolean allowInterfixRemoval, boolean collectSubwords)
-
splitFromRight
private java.util.List<java.lang.String> splitFromRight(java.lang.String word, boolean collectSubwords)
-
getExceptionSplitOrNull
private java.util.List<java.lang.String> getExceptionSplitOrNull(java.lang.String rightPart, java.lang.String leftPart)
-
findInterfixOrNull
private java.lang.String findInterfixOrNull(java.lang.String word)
-
endsWithInterfix
private boolean endsWithInterfix(java.lang.String word)
-
removeInterfix
private java.lang.String removeInterfix(java.lang.String word, java.lang.String interfixOrNull)
-
isSimpleWord
private boolean isSimpleWord(java.lang.String part)
-
-