Class LanguageDetectorImpl

  • All Implemented Interfaces:
    LanguageDetector

    public final class LanguageDetectorImpl
    extends java.lang.Object
    implements LanguageDetector

    This class is immutable and thus thread-safe.

    • Constructor Summary

      Constructors 
      Constructor Description
      LanguageDetectorImpl​(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<java.lang.Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable java.util.Map<LdLocale,​java.lang.Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor)
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      com.google.common.base.Optional<LdLocale> detect​(java.lang.CharSequence text)
      Returns the best detected language if the algorithm is very confident.
      private @org.jetbrains.annotations.Nullable double[] detectBlock​(java.lang.CharSequence text)  
      private double[] detectBlockLongText​(java.util.List<java.lang.String> ngrams)
      This is the original algorithm, which was applied to text of all lengths.
      private double[] detectBlockShortText​(java.util.Map<java.lang.String,​java.lang.Integer> ngrams)  
      java.util.List<DetectedLanguage> getProbabilities​(java.lang.CharSequence text)
      Returns all languages with at least some likelihood.
      private double[] initProbability()
      Initializes the map of language probabilities.
      private @NotNull java.util.List<DetectedLanguage> sortProbability​(double[] prob)
      Returns the detected languages sorted by probabilities descending.
      private boolean updateLangProb​(@org.jetbrains.annotations.NotNull double[] prob, @NotNull java.lang.String ngram, int count, double alpha)
      Updates the language probabilities with an n-gram string (n = 1, 2, 3).
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        private static final org.slf4j.Logger logger
      • ALPHA_WIDTH

        private static final double ALPHA_WIDTH
        TODO document what this is for, and why that value is chosen.
        See Also:
        Constant Field Values
      • ITERATION_LIMIT

        private static final int ITERATION_LIMIT
        TODO document what this is for, and why that value is chosen.
        See Also:
        Constant Field Values
      • CONV_THRESHOLD

        private static final double CONV_THRESHOLD
        TODO document what this is for, and why that value is chosen.
        See Also:
        Constant Field Values
      • BASE_FREQ

        private static final int BASE_FREQ
        TODO document what this is for, and why that value is chosen.
        See Also:
        Constant Field Values
      • N_TRIAL

        private static final int N_TRIAL
        TODO document what this is for, and why that value is chosen.
        See Also:
        Constant Field Values
      • DEFAULT_SEED

        private static final long DEFAULT_SEED
        Used when no custom seed is passed in. Using the same seed for every call keeps the results consistent across calls. Changing this number means that users of the library might suddenly see different results after updating, so don't change it hastily. The value is a prime number, chosen without a specific rationale. See https://github.com/optimaize/language-detector/issues/14
        See Also:
        Constant Field Values
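The reproducibility argument behind a fixed default seed can be illustrated with plain java.util.Random. This is only a sketch: the class name SeedSketch, the method sample, and the value of EXAMPLE_SEED are hypothetical and are not taken from the library.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedSketch {
    // Hypothetical stand-in for a fixed default seed (not the library's actual value).
    static final long EXAMPLE_SEED = 41L;

    /** Draws n Gaussian samples from a generator created with the given seed. */
    static double[] sample(long seed, int n) {
        Random rnd = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextGaussian();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] first = sample(EXAMPLE_SEED, 5);
        double[] second = sample(EXAMPLE_SEED, 5);
        // Two generators built from the same seed yield the same sequence.
        System.out.println(Arrays.equals(first, second));
    }
}
```

Any randomized step driven by such a generator repeats exactly as long as the seed stays the same, which is why changing a default seed can change results for existing users.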
      • PROBABILITY_SORTING_COMPARATOR

        private static final java.util.Comparator<DetectedLanguage> PROBABILITY_SORTING_COMPARATOR
      • ngramFrequencyData

        @NotNull
        private final @NotNull NgramFrequencyData ngramFrequencyData
      • priorMap

        @Nullable
        private final @org.jetbrains.annotations.Nullable double[] priorMap
        User-defined language priorities, in the same order as langlist.
      • alpha

        private final double alpha
      • seed

        private final com.google.common.base.Optional<java.lang.Long> seed
      • shortTextAlgorithm

        private final int shortTextAlgorithm
      • prefixFactor

        private final double prefixFactor
      • suffixFactor

        private final double suffixFactor
      • probabilityThreshold

        private final double probabilityThreshold
      • minimalConfidence

        private final double minimalConfidence
    • Constructor Detail

      • LanguageDetectorImpl

        LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData,
                             double alpha,
                             com.google.common.base.Optional<java.lang.Long> seed,
                             int shortTextAlgorithm,
                             double prefixFactor,
                             double suffixFactor,
                             double probabilityThreshold,
                             double minimalConfidence,
                             @Nullable java.util.Map<LdLocale,java.lang.Double> langWeightingMap,
                             @NotNull NgramExtractor ngramExtractor)
    • Method Detail

      • detect

        public com.google.common.base.Optional<LdLocale> detect​(java.lang.CharSequence text)
        Description copied from interface: LanguageDetector
        Returns the best detected language if the algorithm is very confident.

        Note: you may want to use getProbabilities() instead. This method is very strict and sometimes returns absent even though the first choice in getProbabilities() is correct.

        Specified by:
        detect in interface LanguageDetector
        Parameters:
        text - You probably want a TextObject.
        Returns:
        The language if confident, absent if unknown or not confident enough.
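The relationship between the strict detect() and the more lenient getProbabilities() can be sketched as follows. This is an assumption about how such a check might look, not the library's confirmed implementation: the class StrictDetectSketch and its Candidate type are hypothetical, language codes are plain strings, and java.util.Optional stands in for the Guava Optional so the example has no dependencies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class StrictDetectSketch {
    /** Hypothetical stand-in for a detected language with its probability. */
    static final class Candidate {
        final String language;
        final double probability;

        Candidate(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /**
     * Returns the top candidate only if its probability reaches minimalConfidence.
     * The list is assumed to be sorted from better to worse.
     */
    static Optional<String> detect(List<Candidate> sorted, double minimalConfidence) {
        if (sorted.isEmpty()) {
            return Optional.empty();
        }
        Candidate best = sorted.get(0);
        return best.probability >= minimalConfidence
                ? Optional.of(best.language)
                : Optional.empty();
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("da", 0.85), new Candidate("no", 0.15));
        // With a very strict threshold the correct first choice is still rejected.
        System.out.println(detect(candidates, 0.999).isPresent());
        // With a looser threshold the first choice is returned.
        System.out.println(detect(candidates, 0.8).get());
    }
}
```

Callers who would rather get a best guess than an absent value can inspect the first element of the sorted list themselves.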
      • getProbabilities

        public java.util.List<DetectedLanguage> getProbabilities​(java.lang.CharSequence text)
        Description copied from interface: LanguageDetector
        Returns all languages with at least some likelihood.

        A configurable cutoff is applied to languages with very low probability.

        With the way the algorithm currently works, this method may return, for example, 0.99 for Danish and less than 0.01 for Norwegian even though both are almost equally likely. It would be nice if this could be improved in future versions.

        Specified by:
        getProbabilities in interface LanguageDetector
        Parameters:
        text - You probably want a TextObject.
        Returns:
        Sorted from better to worse. May be empty. It's empty if the program failed to detect any language, or if the input text did not contain any usable text (just noise).
      • detectBlock

        @Nullable
        private @org.jetbrains.annotations.Nullable double[] detectBlock​(java.lang.CharSequence text)
        Returns:
        null if there are no "features" in the text (just noise).
      • detectBlockShortText

        private double[] detectBlockShortText​(java.util.Map<java.lang.String,​java.lang.Integer> ngrams)
      • detectBlockLongText

        private double[] detectBlockLongText​(java.util.List<java.lang.String> ngrams)
        This is the original algorithm, which was applied to text of all lengths. It is unsuitable for short text.
      • initProbability

        private double[] initProbability()
        Initializes the map of language probabilities. If a prior map was specified, it is used as the initial map.
        Returns:
        initialized map of language probabilities
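A minimal sketch of this initialization, under stated assumptions: the prior is copied when present, and a uniform distribution is used otherwise. The uniform fallback, the class name InitProbabilitySketch, and the languageCount parameter are assumptions made for illustration and are not confirmed by this page.

```java
import java.util.Arrays;

final class InitProbabilitySketch {
    /**
     * Builds the starting probability array for languageCount languages.
     * priorMap may be null; if present it is assumed to hold one weight per language.
     */
    static double[] initProbability(double[] priorMap, int languageCount) {
        if (priorMap != null) {
            // Copy so that later updates never mutate the shared prior.
            return Arrays.copyOf(priorMap, languageCount);
        }
        double[] prob = new double[languageCount];
        // Assumed fallback: every language starts out equally likely.
        Arrays.fill(prob, 1.0 / languageCount);
        return prob;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(initProbability(null, 4)));
        System.out.println(Arrays.toString(initProbability(new double[] {0.7, 0.2, 0.1}, 3)));
    }
}
```

Copying the prior instead of returning it directly keeps the stored array untouched, which fits a class documented as immutable.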
      • updateLangProb

        private boolean updateLangProb(@NotNull double[] prob,
                                       @NotNull java.lang.String ngram,
                                       int count,
                                       double alpha)
        Updates the language probabilities with an n-gram string (n = 1, 2, 3).
        Parameters:
        count - how often the n-gram occurred (at least 1).
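To make the idea of a per-n-gram update concrete, here is a sketch of a smoothed multiplicative update in the style of Naive Bayes. It is not taken from the library: the class UpdateLangProbSketch, the ngramFreq lookup table, the smoothing parameter, and the exact formula are all assumptions. The only parts grounded in this page are the shape of the call (a probability array, an n-gram, a count) and the boolean result.

```java
import java.util.HashMap;
import java.util.Map;

final class UpdateLangProbSketch {
    /**
     * Multiplies each language's probability by (smoothing + frequency)^count.
     * Returns false and leaves prob untouched if the n-gram is unknown.
     * ngramFreq maps an n-gram to its assumed per-language relative frequencies.
     */
    static boolean updateLangProb(double[] prob, Map<String, double[]> ngramFreq,
                                  String ngram, int count, double smoothing) {
        double[] freq = ngramFreq.get(ngram);
        if (freq == null) {
            return false;
        }
        for (int i = 0; i < prob.length; i++) {
            // Smoothing keeps a language alive even if it never produced this n-gram.
            prob[i] *= Math.pow(smoothing + freq[i], count);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, double[]> ngramFreq = new HashMap<>();
        ngramFreq.put("th", new double[] {0.2, 0.0});
        double[] prob = {0.5, 0.5};
        System.out.println(updateLangProb(prob, ngramFreq, "th", 2, 0.1));
        System.out.println(updateLangProb(prob, ngramFreq, "zz", 1, 0.1));
    }
}
```

In this sketch, raising the factor to the power of count has the same effect as applying the update once per occurrence of the n-gram.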
      • sortProbability

        @NotNull
        private @NotNull java.util.List<DetectedLanguage> sortProbability​(double[] prob)
        Returns the detected languages sorted by probability in descending order. Languages with a probability below probabilityThreshold are ignored.
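The filter-then-sort behavior described here can be sketched in a few lines. The class SortProbabilitySketch, its Scored type, and the parallel languages array are hypothetical stand-ins for DetectedLanguage and the detector's internal language list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortProbabilitySketch {
    /** Hypothetical stand-in for a detected language with its probability. */
    static final class Scored {
        final String language;
        final double probability;

        Scored(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /** Drops entries below the threshold, then sorts the rest by probability, highest first. */
    static List<Scored> sortProbability(String[] languages, double[] prob, double threshold) {
        List<Scored> result = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) {
            if (prob[i] >= threshold) {
                result.add(new Scored(languages[i], prob[i]));
            }
        }
        result.sort(Comparator.comparingDouble((Scored s) -> s.probability).reversed());
        return result;
    }

    public static void main(String[] args) {
        String[] languages = {"da", "no", "sv"};
        double[] prob = {0.05, 0.7, 0.25};
        for (Scored s : sortProbability(languages, prob, 0.1)) {
            System.out.println(s.language + " " + s.probability);
        }
    }
}
```

With a threshold of 0.1, the entry for "da" is dropped and the remaining two come back with "no" ahead of "sv".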