Interface LanguageProfile
-
- All Known Implementing Classes:
LanguageProfileImpl
public interface LanguageProfile
A language profile knows the locale (language) and contains the n-grams and some statistics. It is built from a training text that should be fairly large and clean.
It contains the n-grams from the training text in the desired gram sizes (e.g. 2-grams and 3-grams), with possible text filters applied for cleaning. Rarely occurring n-grams may also have been cut to reduce noise and index size. Use a LanguageProfileBuilder to build one.
The profile may be created on the fly at runtime, or it may be loaded from a previously generated text file (see OldLangProfileConverter).
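To make the description above concrete, here is a minimal sketch of how n-gram counts for chosen gram sizes could be collected from a training text, and how rarely occurring n-grams could then be cut. The class NGramCountSketch and its methods countGrams and cutRare are hypothetical names invented for this illustration. They are not part of the library, and the real builder may treat word boundaries and text filters differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the library's implementation.
class NGramCountSketch {

    /** Counts every n-gram of the given lengths in the text (no text filters applied). */
    static Map<String, Integer> countGrams(String text, int... gramLengths) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n : gramLengths) {
            // Slide a window of n characters over the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps only the n-grams that occurred at least minCount times. */
    static Map<String, Integer> cutRare(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> all = countGrams("banana", 2, 3);
        System.out.println("all grams:      " + all);
        System.out.println("after cut at 2: " + cutRare(all, 2));
    }
}
```

A real training text would be far larger than a single word; the tiny input only keeps the counts easy to check by hand.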
-
-
Method Summary
Modifier and Type, Method, Description
int getFrequency(java.lang.String gram)
@NotNull java.util.List<java.lang.Integer> getGramLengths()
Tells which values of n are used for the n-grams in this profile.
@NotNull LdLocale getLocale()
long getMaxGramCount(int gramLength)
Tells how often the n-gram with the highest number of occurrences used in this profile occurred.
long getMinGramCount(int gramLength)
Tells how often the n-gram with the lowest number of occurrences used in this profile occurred.
long getNumGramOccurrences(int gramLength)
Tells how often all n-grams of a certain length occurred, combined.
int getNumGrams()
Tells how many n-grams there are for all n-gram sizes combined.
int getNumGrams(int gramLength)
Tells how many different n-grams there are for a certain n-gram size.
@NotNull java.lang.Iterable<java.util.Map.Entry<java.lang.String,java.lang.Integer>> iterateGrams()
Iterates over all n-gram strings with their frequencies.
@NotNull java.lang.Iterable<java.util.Map.Entry<java.lang.String,java.lang.Integer>> iterateGrams(int gramLength)
Iterates over all gramLength-gram strings with their frequencies.
-
-
-
Method Detail
-
getLocale
@NotNull LdLocale getLocale()
-
getGramLengths
@NotNull java.util.List<java.lang.Integer> getGramLengths()
Tells which values of n are used for the n-grams in this profile. Example: [1,2,3]
- Returns:
- Sorted from smaller to larger.
-
getFrequency
int getFrequency(java.lang.String gram)
- Parameters:
gram
- for example "a" or "foo".
- Returns:
- 0-n; also zero if this profile does not use n-grams of that length (for example if no 4-grams are made).
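The lookup contract above (zero for an unknown gram, and zero for a gram length the profile does not use) can be sketched with a plain map. FrequencyLookupSketch is a hypothetical stand-in, not the library's LanguageProfileImpl. As a simplifying assumption it derives the gram lengths from the stored keys and measures length with String.length(), which may not match how the real implementation tracks them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for a profile; illustrates the documented contract only.
class FrequencyLookupSketch {
    private final Map<String, Integer> counts;

    FrequencyLookupSketch(Map<String, Integer> counts) {
        this.counts = counts;
    }

    /** Distinct gram lengths present in the map, sorted from smaller to larger. */
    List<Integer> getGramLengths() {
        TreeSet<Integer> lengths = new TreeSet<>();
        for (String gram : counts.keySet()) {
            lengths.add(gram.length());
        }
        return new ArrayList<>(lengths);
    }

    /** Returns the stored count, or 0 when the gram (or its length) is not in the profile. */
    int getFrequency(String gram) {
        return counts.getOrDefault(gram, 0);
    }

    public static void main(String[] args) {
        FrequencyLookupSketch profile =
                new FrequencyLookupSketch(Map.of("a", 5, "ab", 3, "foo", 1));
        System.out.println(profile.getGramLengths());
        System.out.println(profile.getFrequency("foo"));
        System.out.println(profile.getFrequency("abcd")); // no 4-grams in this toy profile
    }
}
```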
-
getNumGrams
int getNumGrams(int gramLength)
Tells how many different n-grams there are for a certain n-gram size. For example, the English language has about 57 different 1-grams, whereas Chinese in Hani has thousands.
- Parameters:
gramLength
- 1-n
- Returns:
- 0-n; returns zero if no such n-grams were made (for example if no 4-grams were made), or if the training text did not contain any words that long.
-
getNumGrams
int getNumGrams()
Tells how many n-grams there are for all n-gram sizes combined.
- Returns:
- 0-n (0 only on an empty profile...)
-
getNumGramOccurrences
long getNumGramOccurrences(int gramLength)
Tells how often all n-grams of a certain length occurred, combined. This is usually a much larger number than getNumGrams(int), because every occurrence is counted rather than every distinct n-gram.
- Parameters:
gramLength
- 1-n
- Returns:
- 0-n; returns zero if no such n-grams were made (for example if no 4-grams were made), or if the training text did not contain any words that long.
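The difference between the number of distinct n-grams and the number of n-gram occurrences is easy to mix up, so here is a small sketch over a plain count map. GramTotalsSketch and its two static methods are hypothetical helpers written for this illustration. They mirror the documented meaning of getNumGrams(int) and getNumGramOccurrences(int) but are not the library's code.

```java
import java.util.Map;

// Hypothetical helpers; they mirror the documented semantics on a plain map.
class GramTotalsSketch {

    /** Number of distinct n-grams of the given length. */
    static int numGrams(Map<String, Integer> counts, int gramLength) {
        int distinct = 0;
        for (String gram : counts.keySet()) {
            if (gram.length() == gramLength) {
                distinct++;
            }
        }
        return distinct;
    }

    /** Sum of the occurrence counts of all n-grams of the given length. */
    static long numGramOccurrences(Map<String, Integer> counts, int gramLength) {
        long total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() == gramLength) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("an", 2, "na", 2, "ba", 1, "ana", 2);
        System.out.println("distinct 2-grams:    " + numGrams(counts, 2));
        System.out.println("2-gram occurrences:  " + numGramOccurrences(counts, 2));
        System.out.println("distinct 4-grams:    " + numGrams(counts, 4));
    }
}
```

In this toy map there are three distinct 2-grams but five 2-gram occurrences, and both methods report zero for 4-grams because none were made.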
-
getMinGramCount
long getMinGramCount(int gramLength)
Tells how often the n-gram with the lowest number of occurrences used in this profile occurred. Most likely there were n-grams with fewer occurrences (unless the returned number is 1), but they were eliminated in order to keep the profile reasonably small. This is the opposite of getMaxGramCount(int).
- Parameters:
gramLength
- 1-n
- Returns:
- 0-n; returns zero if no such n-grams were made or existed.
-
getMaxGramCount
long getMaxGramCount(int gramLength)
Tells how often the n-gram with the highest number of occurrences used in this profile occurred. This is the opposite of getMinGramCount(int).
- Parameters:
gramLength
- 1-n
- Returns:
- 0-n; returns zero if no such n-grams were made or existed.
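As a rough illustration of the minimum and maximum counts described above, the following sketch scans a plain count map for one gram length and returns zero when no n-gram of that length exists. GramCountRangeSketch is a hypothetical helper class. A real implementation would more likely precompute these values than rescan on every call.

```java
import java.util.Map;

// Hypothetical helpers; they follow the documented "zero if none" rule.
class GramCountRangeSketch {

    /** Lowest occurrence count among n-grams of the given length, or 0 if there are none. */
    static long minGramCount(Map<String, Integer> counts, int gramLength) {
        long min = 0;
        boolean found = false;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() != gramLength) {
                continue;
            }
            int count = e.getValue();
            if (!found || count < min) {
                min = count;
                found = true;
            }
        }
        return min;
    }

    /** Highest occurrence count among n-grams of the given length, or 0 if there are none. */
    static long maxGramCount(Map<String, Integer> counts, int gramLength) {
        long max = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() == gramLength) {
                int count = e.getValue();
                max = Math.max(max, count);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("an", 2, "na", 7, "ba", 1, "ana", 4);
        System.out.println("2-grams: min=" + minGramCount(counts, 2)
                + " max=" + maxGramCount(counts, 2));
        System.out.println("4-grams: min=" + minGramCount(counts, 4)
                + " max=" + maxGramCount(counts, 4));
    }
}
```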
-
iterateGrams
@NotNull java.lang.Iterable<java.util.Map.Entry<java.lang.String,java.lang.Integer>> iterateGrams()
Iterates over all n-gram strings with their frequencies.
-
iterateGrams
@NotNull java.lang.Iterable<java.util.Map.Entry<java.lang.String,java.lang.Integer>> iterateGrams(int gramLength)
Iterates over all gramLength-gram strings with their frequencies.
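Because both iterateGrams variants return an Iterable of map entries, callers can consume them with a for-each loop. The sketch below shows that pattern on a plain map. IterateGramsSketch and its static iterateGrams method are hypothetical and only imitate the shape of the interface method; the iteration order of the real profile is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of consuming an Iterable of (gram, frequency) entries.
class IterateGramsSketch {

    /** Returns the entries whose gram has exactly the given length. */
    static Iterable<Map.Entry<String, Integer>> iterateGrams(
            Map<String, Integer> counts, int gramLength) {
        List<Map.Entry<String, Integer>> matching = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().length() == gramLength) {
                matching.add(e);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("an", 2, "na", 2, "ana", 2);
        // Each entry pairs the gram string with its frequency.
        for (Map.Entry<String, Integer> e : iterateGrams(counts, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```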
-
-