Class LangProfile

  • All Implemented Interfaces:
    java.io.Serializable

    @Deprecated
    public class LangProfile
    extends java.lang.Object
    implements java.io.Serializable
    Deprecated.
    Replaced by LanguageProfile.
    LangProfile is a language profile class. Users are not expected to use this class directly.
    TODO: split into a builder and an immutable class.
    TODO: currently this only makes n-grams that include the space before a word, not n-grams that include the space after the word. Example: "foo" creates " fo" as a 3-gram, but not "oo ". Either this is a bug or, if intended, it needs documentation.
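The leading-space behaviour noted in the TODO can be illustrated with a minimal sketch. The class LeadingSpaceTrigrams and its trigrams helper are hypothetical names invented for this illustration; they are not part of LangProfile, and the library's actual n-gram extraction may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

final class LeadingSpaceTrigrams {

    // Hypothetical helper, not part of LangProfile: pads the word with one
    // leading space (and no trailing space), then slides a 3-character window.
    static List<String> trigrams(String word) {
        String padded = " " + word;
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints "[ fo, foo]": " fo" is present, "oo " is not.
        System.out.println(trigrams("foo"));
    }
}
```

Padding the word on both sides would additionally yield "oo ", which is the case the TODO leaves open.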
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private java.util.Map<java.lang.String,​java.lang.Integer> freq
      Deprecated.
      Key = ngram, value = count.
      private static int LESS_FREQ_RATIO
      Deprecated.
      Explanation by example: If the most frequent n-gram occurs 1 million times, then 1,000,000 / this (100,000) = 10.
      private static int MINIMUM_FREQ
      Deprecated.
      n-grams that occur fewer times than this can be removed using omitLessFreq().
      private java.lang.String name
      Deprecated.
      The language name (identifier).
      private int[] nWords
      Deprecated.
      Tells how many occurrences of n-grams exist per gram length.
      private static long serialVersionUID
      Deprecated.
       
    • Constructor Summary

      Constructors 
      Constructor Description
      LangProfile()
      Deprecated.
      Constructor for JSONIC
      LangProfile​(java.lang.String name)
      Deprecated.
      Normal Constructor
    • Method Summary

      All Methods Instance Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      void add​(@NotNull java.lang.String gram)
      Deprecated.
      Add n-gram to profile
      java.util.Map<java.lang.String,​java.lang.Integer> getFreq()
      Deprecated.
       
      java.lang.String getName()
      Deprecated.
       
      int[] getNWords()
      Deprecated.
       
      void omitLessFreq()
      Deprecated.
      Removes ngrams that occur fewer times than MINIMUM_FREQ to get rid of rare ngrams.
      void setFreq​(java.util.Map<java.lang.String,​java.lang.Integer> freq)
      Deprecated.
       
      void setName​(java.lang.String name)
      Deprecated.
       
      void setNWords​(int[] nWords)
      Deprecated.
       
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • serialVersionUID

        private static final long serialVersionUID
        Deprecated.
        See Also:
        Constant Field Values
      • MINIMUM_FREQ

        private static final int MINIMUM_FREQ
        Deprecated.
        n-grams that occur fewer times than this can be removed using omitLessFreq(). This number can change; see LESS_FREQ_RATIO.
        See Also:
        Constant Field Values
      • LESS_FREQ_RATIO

        private static final int LESS_FREQ_RATIO
        Deprecated.
        Explanation by example:
        If the most frequent n-gram occurs 1 million times, then 1,000,000 / this (100,000) = 10. 10 is larger than MINIMUM_FREQ (2), thus MINIMUM_FREQ remains at 2. All n-grams that occur fewer than 2 times can be removed as noise using omitLessFreq().
        If the most frequent n-gram occurs 5,000 times, then 5,000 / this (100,000) = 0.05. 0.05 is smaller than MINIMUM_FREQ (2), thus MINIMUM_FREQ becomes 0. No n-grams are removed as insignificant when calling omitLessFreq().
        See Also:
        Constant Field Values
      • name

        private java.lang.String name
        Deprecated.
        The language name (identifier).
      • freq

        private java.util.Map<java.lang.String,​java.lang.Integer> freq
        Deprecated.
        Key = ngram, value = count. All n-grams are in here (1-gram, 2-gram, 3-gram).
      • nWords

        private int[] nWords
        Deprecated.
        Tells how many occurrences of n-grams exist per gram length. When making 1-grams, 2-grams and 3-grams (as is currently done), this contains 3 entries, where
        element 0 = number of occurrences of 1-grams
        element 1 = number of occurrences of 2-grams
        element 2 = number of occurrences of 3-grams
        Example: if there are 57 distinct 1-grams (the English language has about that many) and the training text is fairly long, then element 0 is in the millions.
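As a rough illustration of how freq and nWords relate, the sketch below counts each added gram in a map and increments the per-length total at index (length - 1). NWordsSketch is a hypothetical stand-in written for this page; how LangProfile.add treats invalid input or other gram lengths is an assumption here, not something the documentation states.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class NWordsSketch {
    // Key = n-gram, value = count (mirrors the documented freq field).
    final Map<String, Integer> freq = new HashMap<>();
    // Index 0: 1-grams, index 1: 2-grams, index 2: 3-grams.
    final int[] nWords = new int[3];

    // Hypothetical stand-in for add(gram): counts the gram and bumps the
    // total for its length. Ignoring other lengths is an assumption.
    void add(String gram) {
        int len = gram.length();
        if (len < 1 || len > nWords.length) {
            return;
        }
        nWords[len - 1]++;
        freq.merge(gram, 1, Integer::sum);
    }

    public static void main(String[] args) {
        NWordsSketch profile = new NWordsSketch();
        profile.add("a");
        profile.add("a");
        profile.add("ab");
        profile.add("abc");
        // Two 1-gram occurrences, one 2-gram, one 3-gram: prints "[2, 1, 1]".
        System.out.println(Arrays.toString(profile.nWords));
    }
}
```

Note that nWords counts occurrences, not distinct grams: adding "a" twice leaves one key in freq but raises element 0 to 2.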
    • Constructor Detail

      • LangProfile

        public LangProfile()
        Deprecated.
        Constructor for JSONIC
      • LangProfile

        public LangProfile​(java.lang.String name)
        Deprecated.
        Normal Constructor
        Parameters:
        name - language name
    • Method Detail

      • add

        public void add(@NotNull java.lang.String gram)
        Deprecated.
        Add n-gram to profile
        Parameters:
        gram - the n-gram to add
      • omitLessFreq

        public void omitLessFreq()
        Deprecated.
        Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams. Also removes ASCII n-grams if the total number of ASCII n-grams is less than one third of the total. This is done because non-Latin text (such as Chinese) often has some Latin noise in between.
        TODO: split the 2 cleaning steps into separate methods.
        TODO: distinguish ASCII from Latin. Currently it looks for Latin only; it should include characters with diacritics, e.g. Vietnamese.
        TODO: the current code counts ASCII, but removes any Latin. Is that desired? If so, this needs documentation.
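The two cleaning steps described above could look roughly like the following when written as separate methods. This is a sketch based only on the wording of this description, not on the library's source: ProfileCleaningSketch, removeRare, removeAsciiNoise and containsAsciiLetter are hypothetical names, the minimum count is passed in as a parameter, the same ASCII-letter test is used for both counting and removing, and the sketch does not update nWords.

```java
import java.util.HashMap;
import java.util.Map;

final class ProfileCleaningSketch {

    // Step 1: drop n-grams whose count is below the given minimum.
    static void removeRare(Map<String, Integer> freq, int minimumFreq) {
        freq.values().removeIf(count -> count < minimumFreq);
    }

    // Step 2: if n-grams containing ASCII letters account for less than one
    // third of all occurrences, treat them as noise and drop them.
    static void removeAsciiNoise(Map<String, Integer> freq) {
        long total = 0;
        long ascii = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            total += e.getValue();
            if (containsAsciiLetter(e.getKey())) {
                ascii += e.getValue();
            }
        }
        if (ascii * 3 < total) {
            freq.keySet().removeIf(ProfileCleaningSketch::containsAsciiLetter);
        }
    }

    // True if the gram contains at least one of a-z or A-Z.
    static boolean containsAsciiLetter(String gram) {
        for (int i = 0; i < gram.length(); i++) {
            char c = gram.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("\u4e2d", 50); // Chinese character, frequent
        freq.put("\u6587", 40); // Chinese character, frequent
        freq.put("a", 5);       // Latin noise
        freq.put("\u5b57", 1);  // Chinese character, rare
        removeRare(freq, 2);    // drops the gram seen only once
        removeAsciiNoise(freq); // 5 of 95 occurrences are ASCII, so "a" is dropped
        System.out.println(freq.size()); // prints 2
    }
}
```

Keeping the steps separate makes it possible to apply the rare-gram filter without the ASCII filter, which matters for languages written in Latin script with diacritics.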
      • getName

        public java.lang.String getName()
        Deprecated.
      • setName

        public void setName​(java.lang.String name)
        Deprecated.
      • getFreq

        public java.util.Map<java.lang.String,​java.lang.Integer> getFreq()
        Deprecated.
      • setFreq

        public void setFreq​(java.util.Map<java.lang.String,​java.lang.Integer> freq)
        Deprecated.
      • getNWords

        public int[] getNWords()
        Deprecated.
      • setNWords

        public void setNWords​(int[] nWords)
        Deprecated.