Class LanguageIdentifier


  • public class LanguageIdentifier
    extends java.lang.Object
    Identifies the language of a text. Some languages might never be detected because they are too similar to another language. Language variants such as en-US or en-GB are not distinguished; the result for those is en. By default, only the first 1000 characters of a text are considered. Email signatures that use \n-- \n as a delimiter are ignored.
    Since:
    2.9
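The preprocessing described above (signature removal, length limit, variant codes reduced to the base language) can be illustrated with a small, self-contained sketch. It is not the class's actual implementation: the class name TextPreparationSketch, the method names, the exact regular expression, and the order of the two steps are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
final class TextPreparationSketch {

  // Assumption: the delimiter "\n-- \n" and everything after it counts as the signature.
  private static final Pattern SIGNATURE = Pattern.compile("\n-- \n.*", Pattern.DOTALL);

  /** Drops an email signature, then keeps at most maxLength characters. */
  static String prepare(String text, int maxLength) {
    String withoutSignature = SIGNATURE.matcher(text).replaceFirst("");
    return withoutSignature.length() > maxLength
        ? withoutSignature.substring(0, maxLength)
        : withoutSignature;
  }

  /** Reduces a variant code such as "en-US" to its base language code "en". */
  static String baseLanguage(String code) {
    int dash = code.indexOf('-');
    return dash < 0 ? code : code.substring(0, dash);
  }
}
```

With this sketch, prepare("Hello there\n-- \nJohn Doe", 1000) keeps only "Hello there", and baseLanguage("en-GB") yields "en".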
    • Field Detail

      • logger

        private static final org.slf4j.Logger logger
      • CONSIDER_ONLY_PREFERRED_THRESHOLD

        private static final int CONSIDER_ONLY_PREFERRED_THRESHOLD
        See Also:
        Constant Field Values
      • SIGNATURE

        private static final java.util.regex.Pattern SIGNATURE
      • ignoreLangCodes

        private static final java.util.List<java.lang.String> ignoreLangCodes
      • externalLangCodes

        private static final java.util.List<java.lang.String> externalLangCodes
      • languageDetector

        private final com.optimaize.langdetect.LanguageDetector languageDetector
      • textObjectFactory

        private final com.optimaize.langdetect.text.TextObjectFactory textObjectFactory
      • maxLength

        private final int maxLength
      • fasttextEnabled

        private boolean fasttextEnabled
      • fasttextProcess

        private java.lang.Process fasttextProcess
      • fasttextIn

        private java.io.BufferedReader fasttextIn
      • fasttextOut

        private java.io.BufferedWriter fasttextOut
    • Constructor Detail

      • LanguageIdentifier

        public LanguageIdentifier()
      • LanguageIdentifier

        public LanguageIdentifier(int maxLength)
        Parameters:
        maxLength - the maximum number of characters that will be considered - can help with performance. Don't use values below 100, as this would decrease accuracy.
        Throws:
        java.lang.IllegalArgumentException - if maxLength is less than 10
        Since:
        4.2
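The documented contract for maxLength (values below 10 are rejected, values of at least 100 are recommended) might be enforced along the following lines. This is a minimal sketch; the class name MaxLengthCheckSketch, the getter, and the message text are hypothetical and not taken from the library.

```java
// Hypothetical stand-in that only models the documented argument check.
final class MaxLengthCheckSketch {

  private final int maxLength;

  MaxLengthCheckSketch(int maxLength) {
    // Documented behavior: IllegalArgumentException if maxLength is less than 10.
    if (maxLength < 10) {
      throw new IllegalArgumentException(
          "maxLength must be >= 10 (values of 100 or more are recommended): " + maxLength);
    }
    this.maxLength = maxLength;
  }

  int getMaxLength() {
    return maxLength;
  }
}
```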
    • Method Detail

      • enableFasttext

        public void enableFasttext(java.io.File fasttextBinary,
                                   java.io.File fasttextModel)
      • getLanguageCodes

        private static java.util.List<java.lang.String> getLanguageCodes()
      • loadProfiles

        private java.util.List<com.optimaize.langdetect.profiles.LanguageProfile> loadProfiles(java.util.List<java.lang.String> langCodes)
                                                                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • detectLanguage

        @Nullable
        public Language detectLanguage(java.lang.String text)
        Returns:
        language or null if language could not be identified
      • detectLanguageWithDetails

        @Nullable
        @Experimental
        DetectedLanguage detectLanguageWithDetails(java.lang.String text)
        Returns:
        language or null if language could not be identified
      • detectLanguage

        @Nullable
        public DetectedLanguage detectLanguage(java.lang.String text,
                                               java.util.List<java.lang.String> noopLangsTmp,
                                               java.util.List<java.lang.String> preferredLangsTmp)
        Parameters:
        noopLangsTmp - list of language codes that can be detected but that map to the NoopLanguage, which has no rules
        Returns:
        language or null if language could not be identified
        Since:
        4.4 (new parameter noopLangs, changed return type to DetectedLanguage)
      • canLanguageBeDetected

        private boolean canLanguageBeDetected(java.lang.String langCode,
                                              java.util.List<java.lang.String> additionalLanguageCodes)
      • startFasttext

        private void startFasttext(java.io.File modelPath,
                                   java.io.File binaryPath)
                            throws java.io.IOException
        Throws:
        java.io.IOException
      • getHighestScoringResult

        private java.util.Map.Entry<java.lang.String,java.lang.Double> getHighestScoringResult(java.util.Map<java.lang.String,java.lang.Double> probs)
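Going only by the signature, getHighestScoringResult selects one entry from a map of language codes to scores. A minimal sketch of such a selection is shown below; the class name HighestScoreSketch and the choice to return null for an empty map are assumptions, and the private method's real behavior may differ.

```java
import java.util.Map;

// Hypothetical helper illustrating "pick the entry with the highest score".
final class HighestScoreSketch {

  /** Returns the entry with the largest value, or null if the map is empty (sketch-only choice). */
  static Map.Entry<String, Double> highestScoring(Map<String, Double> probs) {
    return probs.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .orElse(null);
  }
}
```

For a map such as {en=0.7, de=0.2, fr=0.1}, this sketch returns the entry for en.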
      • runFasttext

        private java.util.Map<java.lang.String,java.lang.Double> runFasttext(java.lang.String text,
                                                                                   java.util.List<java.lang.String> additionalLanguageCodes)
                                                                            throws java.io.IOException
        Throws:
        java.io.IOException
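runFasttext returns a map from language code to score. As an illustration of how one line of fastText output could be turned into such a map, the sketch below assumes the output consists of whitespace-separated pairs of the form "__label__en 0.9". That format, the class name FasttextOutputSketch, and the method parseLine are assumptions; the page itself does not specify how the process output is read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a single line of label/probability pairs.
final class FasttextOutputSketch {

  // Assumption: labels carry the conventional "__label__" prefix.
  private static final String LABEL_PREFIX = "__label__";

  /** Parses "label prob label prob ..." into an insertion-ordered map of code to probability. */
  static Map<String, Double> parseLine(String line) {
    Map<String, Double> probs = new LinkedHashMap<>();
    String[] tokens = line.trim().split("\\s+");
    // Consume tokens in pairs; a trailing unpaired token is ignored.
    for (int i = 0; i + 1 < tokens.length; i += 2) {
      String label = tokens[i];
      String code = label.startsWith(LABEL_PREFIX)
          ? label.substring(LABEL_PREFIX.length())
          : label;
      probs.put(code, Double.parseDouble(tokens[i + 1]));
    }
    return probs;
  }
}
```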
      • detectLanguageCode

        @Nullable
        private java.util.Map.Entry<java.lang.String,java.lang.Double> detectLanguageCode(java.lang.String text)
        Returns:
        language code with its score, or null if the language could not be identified