Package com.optimaize.langdetect
Class LanguageDetectorImpl
- java.lang.Object
  - com.optimaize.langdetect.LanguageDetectorImpl
- All Implemented Interfaces: LanguageDetector
public final class LanguageDetectorImpl extends java.lang.Object implements LanguageDetector
This class is immutable and thus thread-safe.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
private double
alpha
private static double
ALPHA_WIDTH
TODO document what this is for, and why that value is chosen.
private static int
BASE_FREQ
TODO document what this is for, and why that value is chosen.
private static double
CONV_THRESHOLD
TODO document what this is for, and why that value is chosen.
private static long
DEFAULT_SEED
This is used when no custom seed was passed in.
private static int
ITERATION_LIMIT
TODO document what this is for, and why that value is chosen.
private static org.slf4j.Logger
logger
private double
minimalConfidence
private static int
N_TRIAL
TODO document what this is for, and why that value is chosen.
private NgramExtractor
ngramExtractor
private @NotNull NgramFrequencyData
ngramFrequencyData
private double
prefixFactor
private @org.jetbrains.annotations.Nullable double[]
priorMap
User-defined language priorities, in the same order as langlist.
private static java.util.Comparator<DetectedLanguage>
PROBABILITY_SORTING_COMPARATOR
private double
probabilityThreshold
private com.google.common.base.Optional<java.lang.Long>
seed
private int
shortTextAlgorithm
private double
suffixFactor
-
Constructor Summary
Constructors
Constructor | Description
LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<java.lang.Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable java.util.Map<LdLocale,java.lang.Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor)
Use the LanguageDetectorBuilder.
-
Method Summary
Methods
Modifier and Type | Method | Description
com.google.common.base.Optional<LdLocale>
detect(java.lang.CharSequence text)
Returns the best detected language if the algorithm is very confident.
private @org.jetbrains.annotations.Nullable double[]
detectBlock(java.lang.CharSequence text)
private double[]
detectBlockLongText(java.util.List<java.lang.String> ngrams)
This is the original algorithm, which was used for text of all lengths.
private double[]
detectBlockShortText(java.util.Map<java.lang.String,java.lang.Integer> ngrams)
java.util.List<DetectedLanguage>
getProbabilities(java.lang.CharSequence text)
Returns all languages with at least some likelihood.
private double[]
initProbability()
Initializes the map of language probabilities.
private @NotNull java.util.List<DetectedLanguage>
sortProbability(double[] prob)
Returns the detected languages sorted by descending probability.
private boolean
updateLangProb(@org.jetbrains.annotations.NotNull double[] prob, @NotNull java.lang.String ngram, int count, double alpha)
Updates language probabilities with an n-gram string (N = 1, 2, 3).
-
-
-
Field Detail
-
logger
private static final org.slf4j.Logger logger
-
ALPHA_WIDTH
private static final double ALPHA_WIDTH
TODO document what this is for, and why that value is chosen.
- See Also: Constant Field Values
-
ITERATION_LIMIT
private static final int ITERATION_LIMIT
TODO document what this is for, and why that value is chosen.
- See Also: Constant Field Values
-
CONV_THRESHOLD
private static final double CONV_THRESHOLD
TODO document what this is for, and why that value is chosen.
- See Also: Constant Field Values
-
BASE_FREQ
private static final int BASE_FREQ
TODO document what this is for, and why that value is chosen.
- See Also: Constant Field Values
-
N_TRIAL
private static final int N_TRIAL
TODO document what this is for, and why that value is chosen.
- See Also: Constant Field Values
-
DEFAULT_SEED
private static final long DEFAULT_SEED
This is used when no custom seed was passed in. Using the same seed for different calls keeps the results consistent. Changing this number means that users of the library might suddenly see different results after updating, so don't change it hastily. The value is a prime number, chosen without a specific rationale. See https://github.com/optimaize/language-detector/issues/14
- See Also: Constant Field Values
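The consistency argument above can be illustrated with a minimal sketch. It assumes the seed ends up in a java.util.Random, which this page does not confirm; SeedSketch, EXAMPLE_SEED and sample are hypothetical names, and the seed value is a placeholder rather than the real DEFAULT_SEED.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedSketch {
    // Placeholder value for illustration only; not the library's DEFAULT_SEED.
    static final long EXAMPLE_SEED = 41L;

    /** Draws n Gaussian samples from a generator created with the given seed. */
    static double[] sample(long seed, int n) {
        Random rnd = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextGaussian();
        }
        return out;
    }

    public static void main(String[] args) {
        // Two independent generators with the same seed yield the same sequence,
        // which is why a fixed default seed makes repeated calls reproducible.
        double[] first = sample(EXAMPLE_SEED, 3);
        double[] second = sample(EXAMPLE_SEED, 3);
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

Under this assumption, changing the seed changes every random draw, which is consistent with the warning that users could see different results after an update.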
-
PROBABILITY_SORTING_COMPARATOR
private static final java.util.Comparator<DetectedLanguage> PROBABILITY_SORTING_COMPARATOR
-
ngramFrequencyData
@NotNull private final NgramFrequencyData ngramFrequencyData
-
priorMap
@Nullable private final double[] priorMap
User-defined language priorities, in the same order as langlist.
-
alpha
private final double alpha
-
seed
private final com.google.common.base.Optional<java.lang.Long> seed
-
shortTextAlgorithm
private final int shortTextAlgorithm
-
prefixFactor
private final double prefixFactor
-
suffixFactor
private final double suffixFactor
-
probabilityThreshold
private final double probabilityThreshold
-
minimalConfidence
private final double minimalConfidence
-
ngramExtractor
private final NgramExtractor ngramExtractor
-
-
Constructor Detail
-
LanguageDetectorImpl
LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<java.lang.Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable java.util.Map<LdLocale,java.lang.Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor)
Use the LanguageDetectorBuilder.
-
-
Method Detail
-
detect
public com.google.common.base.Optional<LdLocale> detect(java.lang.CharSequence text)
Description copied from interface: LanguageDetector
Returns the best detected language if the algorithm is very confident.
Note: you may want to use getProbabilities() instead. This method is very strict and sometimes returns absent even though the first choice in getProbabilities() is correct.
- Specified by: detect in interface LanguageDetector
- Parameters: text - You probably want a TextObject.
- Returns: The language if confident, absent if unknown or not confident enough.
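The difference between the strict detect() and the more permissive getProbabilities() can be sketched as follows. This is an illustration, not the library's code: it assumes detect() accepts the top candidate only when its probability reaches minimalConfidence. It uses java.util.Optional instead of the Guava Optional in the real signature, and StrictDetectSketch and Candidate are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class StrictDetectSketch {
    /** Hypothetical stand-in for DetectedLanguage: a language code and its probability. */
    static final class Candidate {
        final String language;
        final double probability;

        Candidate(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /**
     * Returns the best candidate only if it reaches minimalConfidence.
     * Assumes the list is already sorted from best to worst.
     */
    static Optional<String> detect(List<Candidate> sorted, double minimalConfidence) {
        if (sorted.isEmpty()) {
            return Optional.empty();
        }
        Candidate best = sorted.get(0);
        return best.probability >= minimalConfidence
                ? Optional.of(best.language)
                : Optional.empty();
    }

    public static void main(String[] args) {
        List<Candidate> probabilities = Arrays.asList(
                new Candidate("da", 0.85), new Candidate("no", 0.15));
        // The first choice may well be right, yet a strict threshold rejects it.
        System.out.println(detect(probabilities, 0.999).isPresent()); // prints "false"
        System.out.println(detect(probabilities, 0.5).get());         // prints "da"
        System.out.println(detect(Collections.<Candidate>emptyList(), 0.5).isPresent()); // prints "false"
    }
}
```

A caller that can tolerate occasional wrong answers would read the first element of the probability list directly instead of relying on the strict variant.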
-
getProbabilities
public java.util.List<DetectedLanguage> getProbabilities(java.lang.CharSequence text)
Description copied from interface: LanguageDetector
Returns all languages with at least some likelihood. A configurable cutoff is applied for languages with very low probability.
With the current algorithm, this method may return, for example, 0.99 for Danish and less than 0.01 for Norwegian even though both are almost equally likely. It would be nice if this could be improved in future versions.
- Specified by: getProbabilities in interface LanguageDetector
- Parameters: text - You probably want a TextObject.
- Returns: Sorted from better to worse. May be empty. It's empty if the program failed to detect any language, or if the input text did not contain any usable text (just noise).
-
detectBlock
@Nullable private double[] detectBlock(java.lang.CharSequence text)
- Returns: null if there are no "features" in the text (just noise).
-
detectBlockShortText
private double[] detectBlockShortText(java.util.Map<java.lang.String,java.lang.Integer> ngrams)
-
detectBlockLongText
private double[] detectBlockLongText(java.util.List<java.lang.String> ngrams)
This is the original algorithm, which was used for text of all lengths. It is inappropriate for short text.
-
initProbability
private double[] initProbability()
Initializes the map of language probabilities. If a prior map is specified, it is used as the initial map.
- Returns: initialized map of language probabilities
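A minimal sketch of the initialization described above. The copy of the prior map follows the description. The uniform fallback when no prior map is given is an assumption, since the page does not say what happens in that case. InitProbabilitySketch and languageCount are hypothetical names.

```java
import java.util.Arrays;

final class InitProbabilitySketch {
    /**
     * Builds the starting probability array for languageCount languages.
     * priorMap may be null. If present, it must have languageCount entries.
     */
    static double[] initProbability(int languageCount, double[] priorMap) {
        double[] prob = new double[languageCount];
        if (priorMap != null) {
            // Copy so that later updates never modify the caller's priors.
            System.arraycopy(priorMap, 0, prob, 0, languageCount);
        } else {
            // Assumed fallback: every language starts equally likely.
            Arrays.fill(prob, 1.0 / languageCount);
        }
        return prob;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(initProbability(4, null)));
        System.out.println(Arrays.toString(initProbability(2, new double[] {0.7, 0.3})));
    }
}
```

Copying rather than sharing the prior array fits a class that documents itself as immutable, because the per-call probability array is mutated during detection.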
-
updateLangProb
private boolean updateLangProb(@NotNull double[] prob, @NotNull java.lang.String ngram, int count, double alpha)
Updates language probabilities with an n-gram string (N = 1, 2, 3).
- Parameters: count - how often the n-gram occurred (1 to n).
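The page gives only the signature, so the following is a sketch of one plausible update rule rather than the documented behavior. It assumes each language's probability is multiplied by its smoothed n-gram frequency raised to the occurrence count, and that the method returns false for an unknown n-gram. The frequency map parameter, the smoothing formula alpha / BASE_FREQ, and the BASE_FREQ value are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class UpdateLangProbSketch {
    // Assumed value for illustration; this page does not show the real BASE_FREQ.
    static final int BASE_FREQ = 10000;

    /**
     * Multiplies each language's probability by its smoothed frequency for the
     * given n-gram, once per occurrence. Returns false if the n-gram is unknown.
     */
    static boolean updateLangProb(double[] prob, Map<String, double[]> ngramFrequencies,
                                  String ngram, int count, double alpha) {
        double[] langFreq = ngramFrequencies.get(ngram);
        if (langFreq == null) {
            return false; // unknown n-gram: probabilities stay untouched
        }
        // Smoothing term so a zero frequency does not eliminate a language outright.
        double weight = alpha / BASE_FREQ;
        for (int i = 0; i < prob.length; i++) {
            prob[i] *= Math.pow(weight + langFreq[i], count);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, double[]> freq = new HashMap<>();
        freq.put("th", new double[] {0.02, 0.001}); // made-up frequencies for two languages
        double[] prob = {0.5, 0.5};
        boolean updated = updateLangProb(prob, freq, "th", 2, 0.5);
        System.out.println(updated + " " + (prob[0] > prob[1]));
    }
}
```

In this sketch the array is not normalized after the update, so a caller would still need to rescale the values to sum to one before interpreting them as probabilities.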
-
sortProbability
@NotNull private java.util.List<DetectedLanguage> sortProbability(double[] prob)
Returns the detected languages sorted by descending probability. Languages with a probability below PROB_THRESHOLD are ignored.
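The filter-then-sort behavior described above can be sketched like this. It assumes PROB_THRESHOLD corresponds to the probabilityThreshold field and that a probability exactly at the threshold is kept. The page confirms neither. SortProbabilitySketch, Candidate and the languages parameter are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortProbabilitySketch {
    /** Hypothetical stand-in for DetectedLanguage. */
    static final class Candidate {
        final String language;
        final double probability;

        Candidate(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /**
     * Pairs each probability with its language (same index order), drops entries
     * below the threshold, and sorts the rest from most to least probable.
     */
    static List<Candidate> sortProbability(String[] languages, double[] prob, double threshold) {
        List<Candidate> result = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) {
            if (prob[i] >= threshold) { // assumed boundary: equal to the threshold is kept
                result.add(new Candidate(languages[i], prob[i]));
            }
        }
        result.sort(Comparator.comparingDouble((Candidate c) -> c.probability).reversed());
        return result;
    }

    public static void main(String[] args) {
        String[] languages = {"de", "en", "fr"};
        double[] prob = {0.25, 0.70, 0.05};
        for (Candidate c : sortProbability(languages, prob, 0.1)) {
            System.out.println(c.language + " " + c.probability);
        }
    }
}
```

With the sample input, "fr" falls below the 0.1 threshold and is dropped, and "en" is listed before "de".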
-
-