Class LanguageDetectorImpl

java.lang.Object
com.optimaize.langdetect.LanguageDetectorImpl
All Implemented Interfaces:
LanguageDetector

public final class LanguageDetectorImpl extends Object implements LanguageDetector

This class is immutable and thus thread-safe.

  • Field Details

    • logger

      private static final org.slf4j.Logger logger
    • ALPHA_WIDTH

      private static final double ALPHA_WIDTH
      TODO document what this is for, and why that value is chosen.
    • ITERATION_LIMIT

      private static final int ITERATION_LIMIT
      TODO document what this is for, and why that value is chosen.
    • CONV_THRESHOLD

      private static final double CONV_THRESHOLD
      TODO document what this is for, and why that value is chosen.
    • BASE_FREQ

      private static final int BASE_FREQ
      TODO document what this is for, and why that value is chosen.
    • N_TRIAL

      private static final int N_TRIAL
      TODO document what this is for, and why that value is chosen.
    • DEFAULT_SEED

      private static final long DEFAULT_SEED
      Used when no custom seed is passed in. Using the same seed for every call keeps the results consistent across calls. Changing this number means that users of the library might suddenly see different results after updating, so do not change it hastily. The value is a prime number, chosen without a specific rationale. See https://github.com/optimaize/language-detector/issues/14
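      The effect of a fixed fallback seed can be illustrated with a minimal sketch. It assumes the seed feeds a java.util.Random; the class name SeedDemo, the method sample and the value of FALLBACK_SEED are hypothetical and are not taken from the library.

```java
import java.util.Random;

class SeedDemo {
    // Hypothetical fallback seed for illustration only; not the library's DEFAULT_SEED value.
    static final long FALLBACK_SEED = 41L;

    /** Draws n Gaussian samples from a generator seeded with the given seed, or with the fallback if seed is null. */
    static double[] sample(Long seed, int n) {
        Random rand = new Random(seed != null ? seed : FALLBACK_SEED);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rand.nextGaussian();
        }
        return out;
    }
}
```

      Two calls without a custom seed draw the same sequence, which is why changing the fallback value would change results for every caller that relies on it.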
    • PROBABILITY_SORTING_COMPARATOR

      private static final Comparator<DetectedLanguage> PROBABILITY_SORTING_COMPARATOR
    • ngramFrequencyData

      @NotNull private final NgramFrequencyData ngramFrequencyData
    • priorMap

      @Nullable private final double[] priorMap
      User-defined language priorities, in the same order as langlist.
    • alpha

      private final double alpha
    • seed

      private final com.google.common.base.Optional<Long> seed
    • shortTextAlgorithm

      private final int shortTextAlgorithm
    • prefixFactor

      private final double prefixFactor
    • suffixFactor

      private final double suffixFactor
    • probabilityThreshold

      private final double probabilityThreshold
    • minimalConfidence

      private final double minimalConfidence
    • ngramExtractor

      private final NgramExtractor ngramExtractor
  • Constructor Details

    • LanguageDetectorImpl

      LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable Map<LdLocale,Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor)
  • Method Details

    • detect

      public com.google.common.base.Optional<LdLocale> detect(CharSequence text)
      Description copied from interface: LanguageDetector
      Returns the best detected language if the algorithm is very confident.

      Note: you may want to use getProbabilities() instead. This method is very strict and sometimes returns absent even though the first choice in getProbabilities() is correct.

      Specified by:
      detect in interface LanguageDetector
      Parameters:
      text - You probably want a TextObject.
      Returns:
      The language if confident, absent if unknown or not confident enough.
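      The relationship between this strict method and getProbabilities() might look like the following sketch. It assumes the best candidate is accepted only when its probability reaches minimalConfidence; the class StrictDetectSketch, the Candidate type and the use of java.util.Optional (rather than Guava's Optional) are hypothetical stand-ins, not the library's code.

```java
import java.util.List;
import java.util.Optional;

class StrictDetectSketch {
    /** Hypothetical stand-in for DetectedLanguage: a language code plus its probability. */
    static final class Candidate {
        final String language;
        final double probability;

        Candidate(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /** Returns the first candidate's language if it is confident enough, otherwise an empty Optional. */
    static Optional<String> detectStrict(List<Candidate> sortedCandidates, double minimalConfidence) {
        if (sortedCandidates.isEmpty()) {
            return Optional.empty(); // nothing detected at all
        }
        Candidate best = sortedCandidates.get(0); // list is assumed sorted best-first
        return best.probability >= minimalConfidence
                ? Optional.of(best.language)
                : Optional.empty();
    }
}
```

      Under this assumption, a caller who finds the strict result absent can still inspect the first entry of the probability list and apply a looser rule.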
    • getProbabilities

      public List<DetectedLanguage> getProbabilities(CharSequence text)
      Description copied from interface: LanguageDetector
      Returns all languages with at least some likelihood.

      There is a configurable cutoff applied for languages with very low probability.

      As the algorithm currently works, this method may return, for example, 0.99 for Danish and less than 0.01 for Norwegian even though both languages are almost equally likely. It would be nice if this could be improved in future versions.

      Specified by:
      getProbabilities in interface LanguageDetector
      Parameters:
      text - You probably want a TextObject.
      Returns:
      Sorted from most to least probable. May be empty: the list is empty if the program failed to detect any language, or if the input did not contain any usable text (just noise).
    • detectBlock

      @Nullable private double[] detectBlock(CharSequence text)
      Returns:
      null if there are no "features" in the text (just noise).
    • detectBlockShortText

      private double[] detectBlockShortText(Map<String,Integer> ngrams)
    • detectBlockLongText

      private double[] detectBlockLongText(List<String> ngrams)
      This is the original algorithm, which was used for text of any length. It is inappropriate for short text.
    • initProbability

      private double[] initProbability()
      Initializes the map of language probabilities. If a prior map was specified, it is used as the initial map.
      Returns:
      initialized map of language probabilities
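      A minimal sketch of this initialization follows. It assumes that, when no prior map is given, every language starts with the same probability; the source does not state the fallback, and the class name InitProbabilitySketch and the languageCount parameter are hypothetical.

```java
import java.util.Arrays;

class InitProbabilitySketch {
    /**
     * Copies the prior map if one is given; otherwise assigns each language
     * the same starting probability (an assumption made for this sketch).
     */
    static double[] initProbability(double[] priorMap, int languageCount) {
        double[] prob = new double[languageCount];
        if (priorMap != null) {
            // Copy so that later updates do not modify the caller's prior map.
            System.arraycopy(priorMap, 0, prob, 0, languageCount);
        } else {
            Arrays.fill(prob, 1.0 / languageCount);
        }
        return prob;
    }
}
```

      Copying instead of reusing the prior array keeps the stored priors unchanged between detection runs, which fits a class documented as immutable.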
    • updateLangProb

      private boolean updateLangProb(@NotNull double[] prob, @NotNull String ngram, int count, double alpha)
      Updates the language probabilities with an n-gram string (n = 1, 2, 3).
      Parameters:
      count - How often the n-gram occurred (1 or more).
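      One way such an update could work is sketched below. It is an illustration under stated assumptions, not the library's implementation: each language's probability is multiplied by a smoothed n-gram frequency raised to the power of count, the smoothing weight is taken to be alpha divided by a base frequency, and the boolean result is assumed to report whether the n-gram was known. The class name and the ngramProbPerLanguage and baseFreq parameters are hypothetical.

```java
class UpdateLangProbSketch {
    /**
     * Multiplies each language probability by (weight + frequency)^count.
     * Returns false and leaves prob untouched when the n-gram is unknown,
     * which is represented here by a null frequency array.
     */
    static boolean update(double[] prob, double[] ngramProbPerLanguage,
                          int count, double alpha, int baseFreq) {
        if (ngramProbPerLanguage == null) {
            return false; // unknown n-gram: nothing to update
        }
        // Smoothing term so that a zero frequency does not wipe out a language entirely.
        double weight = alpha / baseFreq;
        for (int i = 0; i < prob.length; i++) {
            prob[i] *= Math.pow(weight + ngramProbPerLanguage[i], count);
        }
        return true;
    }
}
```

      In this sketch a language in which the n-gram never occurs is penalized heavily but keeps a small non-zero probability, so later n-grams can still shift the result.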
    • sortProbability

      @NotNull private List<DetectedLanguage> sortProbability(double[] prob)
      Returns the detected languages sorted by descending probability. Languages with a probability below PROB_THRESHOLD are ignored.
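      The filter-then-sort step described here can be sketched as follows. The Scored type stands in for DetectedLanguage, and the class name, the languages array and the explicit threshold parameter are hypothetical; whether the real cutoff is strict or inclusive is not stated in the source, so a strict comparison is assumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortProbabilitySketch {
    /** Hypothetical stand-in for DetectedLanguage. */
    static final class Scored {
        final String language;
        final double probability;

        Scored(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /** Keeps languages whose probability exceeds the threshold and sorts them best-first. */
    static List<Scored> sortProbability(String[] languages, double[] prob, double threshold) {
        List<Scored> result = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) {
            if (prob[i] > threshold) { // strict cutoff is an assumption of this sketch
                result.add(new Scored(languages[i], prob[i]));
            }
        }
        result.sort(Comparator.comparingDouble((Scored s) -> s.probability).reversed());
        return result;
    }
}
```

      If every probability falls at or below the threshold, the result is an empty list, which matches the documented possibility that getProbabilities returns no entries.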