Class LanguageIdentifier

java.lang.Object
org.languagetool.language.LanguageIdentifier

public class LanguageIdentifier extends Object
Identifies the language of a text. Some languages might never be detected because they are too close to another language. Language variants such as en-US or en-GB are not distinguished; the result for those is en. By default, only the first 1000 characters of a text are considered. Email signatures that use \n-- \n as a delimiter are ignored.
Since:
2.9
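The preprocessing mentioned in the class description (ignoring an email signature and looking only at the first 1000 characters) can be sketched with the standard library alone. The class name, the regular expression, and the order of the two steps are assumptions made for illustration; this is not the library's actual code.

```java
import java.util.regex.Pattern;

final class TextPreparationSketch {

  // Assumed pattern: everything from the first "\n-- \n" delimiter to the end of the text.
  private static final Pattern SIGNATURE = Pattern.compile("\n-- \n.*", Pattern.DOTALL);

  // Default limit named in the class description.
  static final int DEFAULT_MAX_LENGTH = 1000;

  /** Drops a trailing email signature, then keeps at most maxLength characters. */
  static String prepare(String text, int maxLength) {
    String withoutSignature = SIGNATURE.matcher(text).replaceFirst("");
    return withoutSignature.length() > maxLength
        ? withoutSignature.substring(0, maxLength)
        : withoutSignature;
  }
}
```

For example, `TextPreparationSketch.prepare(mail, TextPreparationSketch.DEFAULT_MAX_LENGTH)` would hand only the body of a short email to the detector.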
  • Field Details

    • logger

      private static final org.slf4j.Logger logger
    • MINIMAL_CONFIDENCE

      private static final double MINIMAL_CONFIDENCE
    • K_HIGHEST_SCORES

      private static final int K_HIGHEST_SCORES
    • SHORT_ALGO_THRESHOLD

      private static final int SHORT_ALGO_THRESHOLD
    • CONSIDER_ONLY_PREFERRED_THRESHOLD

      private static final int CONSIDER_ONLY_PREFERRED_THRESHOLD
    • SIGNATURE

      private static final Pattern SIGNATURE
    • ignoreLangCodes

      private static final List<String> ignoreLangCodes
    • externalLangCodes

      private static final List<String> externalLangCodes
    • THRESHOLD

      private static final float THRESHOLD
    • languageDetector

      private final com.optimaize.langdetect.LanguageDetector languageDetector
    • textObjectFactory

      private final com.optimaize.langdetect.text.TextObjectFactory textObjectFactory
    • maxLength

      private final int maxLength
    • fasttextEnabled

      private boolean fasttextEnabled
    • fasttextProcess

      private Process fasttextProcess
    • fasttextIn

      private BufferedReader fasttextIn
    • fasttextOut

      private BufferedWriter fasttextOut
  • Constructor Details

    • LanguageIdentifier

      public LanguageIdentifier()
    • LanguageIdentifier

      public LanguageIdentifier(int maxLength)
      Parameters:
      maxLength - the maximum number of characters that will be considered; a lower value can help with performance. Values below 100 are not recommended, as they decrease accuracy.
      Throws:
      IllegalArgumentException - if maxLength is less than 10
      Since:
      4.2
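The constructor documentation distinguishes a hard minimum (10, enforced with an exception) from a recommended minimum (100, for accuracy). A minimal sketch of such an argument check follows; the class name, method name, and message text are hypothetical.

```java
final class MaxLengthCheckSketch {

  /** Returns maxLength unchanged, or throws if it is below the documented minimum of 10. */
  static int requireValidMaxLength(int maxLength) {
    if (maxLength < 10) {
      // Hypothetical message; only the exception type and the limit come from the documentation.
      throw new IllegalArgumentException(
          "maxLength must be >= 10 (values >= 100 are recommended): " + maxLength);
    }
    return maxLength;
  }
}
```

Values from 10 to 99 pass the check, but according to the parameter description they are likely to reduce detection accuracy.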
  • Method Details

    • enableFasttext

      public void enableFasttext(File fasttextBinary, File fasttextModel)
    • getLanguageCodes

      private static List<String> getLanguageCodes()
    • loadProfiles

      private List<com.optimaize.langdetect.profiles.LanguageProfile> loadProfiles(List<String> langCodes) throws IOException
      Throws:
      IOException
    • detectLanguage

      @Nullable public Language detectLanguage(String text)
      Returns:
      language or null if language could not be identified
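Because the result may be null, callers need a fallback path. The sketch below shows one way to handle that; the `detector` function is a hypothetical stand-in for a call such as detectLanguage(text) that yields a language code, so the example runs without the library on the classpath.

```java
import java.util.Optional;
import java.util.function.Function;

final class NullableResultSketch {

  /**
   * Applies a detector that may return null and substitutes a fallback code.
   * The detector parameter is a hypothetical stand-in, not part of the documented API.
   */
  static String detectOrFallback(Function<String, String> detector, String text, String fallback) {
    return Optional.ofNullable(detector.apply(text)).orElse(fallback);
  }
}
```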
    • detectLanguageWithDetails

      @Nullable @Experimental DetectedLanguage detectLanguageWithDetails(String text)
      Returns:
      language or null if language could not be identified
    • detectLanguage

      @Nullable public DetectedLanguage detectLanguage(String text, List<String> noopLangsTmp, List<String> preferredLangsTmp)
      Parameters:
      noopLangsTmp - list of language codes that can be detected but map to the NoopLanguage, which has no rules
      Returns:
      language or null if language could not be identified
      Since:
      4.4 (new parameter noopLangs, changed return type to DetectedLanguage)
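The role of the noop list can be illustrated as follows: a detected code that appears in the list is mapped to a placeholder for "a language without rules". This is a sketch under stated assumptions. The class name, the `NOOP_MARKER` constant, and the use of plain strings instead of DetectedLanguage are all hypothetical.

```java
import java.util.List;

final class NoopMappingSketch {

  // Hypothetical marker for "a language without rules"; not the library's actual value.
  static final String NOOP_MARKER = "noop";

  /** Maps a detected code to the marker if it is listed in noopLangs; passes null through. */
  static String resolve(String detectedCode, List<String> noopLangs) {
    if (detectedCode == null) {
      return null;
    }
    return noopLangs.contains(detectedCode) ? NOOP_MARKER : detectedCode;
  }
}
```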
    • canLanguageBeDetected

      private boolean canLanguageBeDetected(String langCode, List<String> additionalLanguageCodes)
    • startFasttext

      private void startFasttext(File modelPath, File binaryPath) throws IOException
      Throws:
      IOException
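Starting an external fastText process from a binary and a model file might look like the sketch below. It assumes the fastText command-line form `predict-prob <model> - <k>`, where `-` makes the tool read text from standard input; whether this class uses exactly these arguments is an assumption, and the class and method names are hypothetical.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

final class FasttextCommandSketch {

  /** Builds the command line for a fastText process that reads text from stdin. */
  static List<String> buildCommand(File binary, File model, int k) {
    return Arrays.asList(
        binary.getPath(), "predict-prob", model.getPath(), "-", Integer.toString(k));
  }

  /** Prepares (but does not start) the process; call start() on the result to launch it. */
  static ProcessBuilder prepare(File binary, File model, int k) {
    return new ProcessBuilder(buildCommand(binary, model, k)).redirectErrorStream(true);
  }
}
```

After `start()`, the process's output and input streams can be wrapped in a BufferedReader and a BufferedWriter, which matches the types of the fasttextIn and fasttextOut fields listed above.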
    • getHighestScoringResult

      private Map.Entry<String,Double> getHighestScoringResult(Map<String,Double> probs)
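Selecting the best entry from a map of language codes to scores can be sketched with the stream API. This illustrates the general technique only; the class name is hypothetical, and returning null for an empty map is an assumption.

```java
import java.util.Map;

final class HighestScoreSketch {

  /** Returns the entry with the largest value, or null if the map is empty. */
  static Map.Entry<String, Double> highest(Map<String, Double> probs) {
    return probs.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .orElse(null);
  }
}
```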
    • runFasttext

      private Map<String,Double> runFasttext(String text, List<String> additionalLanguageCodes) throws IOException
      Throws:
      IOException
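The fastText command-line tool prints predictions as alternating label and probability tokens, for example `__label__en 0.9 __label__de 0.05`. A parser that turns one such line into a Map<String,Double> could look like this; it is a sketch that assumes this output format, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FasttextOutputSketch {

  private static final String LABEL_PREFIX = "__label__";

  /** Parses "__label__xx <prob> __label__yy <prob> ..." into code -> probability, in input order. */
  static Map<String, Double> parse(String line) {
    Map<String, Double> probs = new LinkedHashMap<>();
    String[] tokens = line.trim().split("\\s+");
    // Tokens come in (label, probability) pairs; a dangling last token is skipped.
    for (int i = 0; i + 1 < tokens.length; i += 2) {
      if (tokens[i].startsWith(LABEL_PREFIX)) {
        String code = tokens[i].substring(LABEL_PREFIX.length());
        probs.put(code, Double.parseDouble(tokens[i + 1]));
      }
    }
    return probs;
  }
}
```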
    • detectLanguageCode

      @Nullable private Map.Entry<String,Double> detectLanguageCode(String text)
      Returns:
      map entry of language code and score, or null if the language could not be identified