Class LangProfile
- java.lang.Object
  - com.optimaize.langdetect.cybozu.util.LangProfile
- All Implemented Interfaces:
java.io.Serializable
@Deprecated public class LangProfile extends java.lang.Object implements java.io.Serializable
Deprecated. Replaced by LanguageProfile.
LangProfile is a Language Profile Class. Users don't use this class directly.
TODO split into builder and immutable class.
TODO currently this only makes n-grams with the space before a word included, not n-grams with the space after the word. Example: "foo" creates " fo" as a 3-gram, but not "oo ". Either this is a bug, or, if intended, it needs documentation.
- See Also:
- Serialized Form
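The leading-space behaviour noted in the TODO above can be illustrated with a minimal sketch. The class `LeadingSpaceNGrams` and its method `grams` are hypothetical names introduced only for this illustration; they are not part of the library, and the library's actual extractor may differ in other respects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the library's extractor: it only mimics the
// "space before the word, no space after it" behaviour described above.
class LeadingSpaceNGrams {

    /** Returns all n-grams of length n taken from " " + word. */
    static List<String> grams(String word, int n) {
        String padded = " " + word; // pad on the left only
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }
}
```

With this sketch, `LeadingSpaceNGrams.grams("foo", 3)` yields `" fo"` and `"foo"`, but never `"oo "`, because no trailing space is appended.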
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
private java.util.Map<java.lang.String,java.lang.Integer> freq
Deprecated. Key = ngram, value = count.
private static int LESS_FREQ_RATIO
Deprecated. Explanation by example: If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10.
private static int MINIMUM_FREQ
Deprecated. n-grams that occur less than this often can be removed using omitLessFreq().
private java.lang.String name
Deprecated. The language name (identifier).
private int[] nWords
Deprecated. Tells how many occurrences of n-grams exist per gram length.
private static long serialVersionUID
Deprecated.
-
Constructor Summary
Constructors (Constructor, Description):
LangProfile()
Deprecated. Constructor for JSONIC
LangProfile(java.lang.String name)
Deprecated. Normal Constructor
-
Method Summary
Methods (Modifier and Type, Method, Description):
void add(@NotNull java.lang.String gram)
Deprecated. Add n-gram to profile
java.util.Map<java.lang.String,java.lang.Integer> getFreq()
Deprecated.
java.lang.String getName()
Deprecated.
int[] getNWords()
Deprecated.
void omitLessFreq()
Deprecated. Removes ngrams that occur fewer times than MINIMUM_FREQ to get rid of rare ngrams.
void setFreq(java.util.Map<java.lang.String,java.lang.Integer> freq)
Deprecated.
void setName(java.lang.String name)
Deprecated.
void setNWords(int[] nWords)
Deprecated.
-
-
-
Field Detail
-
serialVersionUID
private static final long serialVersionUID
Deprecated.
- See Also:
- Constant Field Values
-
MINIMUM_FREQ
private static final int MINIMUM_FREQ
Deprecated. n-grams that occur less than this often can be removed using omitLessFreq(). This number can change, see LESS_FREQ_RATIO.
- See Also:
- Constant Field Values
-
LESS_FREQ_RATIO
private static final int LESS_FREQ_RATIO
Deprecated. Explanation by example:
If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10. 10 is larger than MINIMUM_FREQ (2), thus MINIMUM_FREQ remains at 2. All n-grams that occur fewer than 2 times can be removed as noise using omitLessFreq().
If the most frequent n-gram occurs 5000 times, then 5'000 / this (100'000) = 0.05. 0.05 is smaller than MINIMUM_FREQ (2), thus MINIMUM_FREQ becomes 0. No n-grams are removed because of insignificance when calling omitLessFreq().
- See Also:
- Constant Field Values
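The two worked examples in this description can be replayed as a small calculation. The sketch below follows only the wording of the description (the smaller of MINIMUM_FREQ and the integer quotient is used); it has not been checked against the implementation of omitLessFreq(), so treat the rule itself as an assumption. The class and method names are hypothetical, and the constant values 2 and 100'000 are taken from the text above.

```java
// Hypothetical sketch that reproduces the arithmetic of the two examples
// in the description; it is not the library's code.
class ThresholdExample {
    static final int MINIMUM_FREQ = 2;          // value quoted in the description
    static final int LESS_FREQ_RATIO = 100_000; // value quoted in the description

    /**
     * Assumed rule, read off the description: divide the count of the most
     * frequent n-gram by LESS_FREQ_RATIO (integer division) and use the
     * result only if it is smaller than MINIMUM_FREQ.
     */
    static int effectiveMinimum(int mostFrequentCount) {
        int quotient = mostFrequentCount / LESS_FREQ_RATIO;
        return Math.min(MINIMUM_FREQ, quotient);
    }
}
```

Under this reading, `effectiveMinimum(1_000_000)` is 2 and `effectiveMinimum(5_000)` is 0, matching the two cases described.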
-
name
private java.lang.String name
Deprecated. The language name (identifier).
-
freq
private java.util.Map<java.lang.String,java.lang.Integer> freq
Deprecated. Key = ngram, value = count. All n-grams are in here (1-gram, 2-gram, 3-gram).
-
nWords
private int[] nWords
Deprecated. Tells how many occurrences of n-grams exist per gram length. When making 1-grams, 2-grams and 3-grams (as is currently the case), this contains 3 entries, where
element 0 = number of occurrences of 1-grams
element 1 = number of occurrences of 2-grams
element 2 = number of occurrences of 3-grams
Example: if there are 57 distinct 1-grams (the English language has about that many) and the training text is fairly long, then element 0 is in the millions.
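The relationship between freq and nWords can be sketched as a per-length tally. The helper below is hypothetical and only illustrates the layout described above (index = gram length minus one); how the class actually maintains the array is not shown in this documentation.

```java
import java.util.Map;

// Hypothetical helper: derives a per-length occurrence array from an
// n-gram -> count map, using the index layout described above.
class NWordsTally {

    /** Sums counts per gram length; index 0 holds 1-grams, index 2 holds 3-grams. */
    static int[] tally(Map<String, Integer> freq) {
        int[] nWords = new int[3];
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int len = e.getKey().length();
            if (len >= 1 && len <= 3) { // ignore anything that is not a 1-, 2- or 3-gram
                nWords[len - 1] += e.getValue();
            }
        }
        return nWords;
    }
}
```

For a map containing "a" = 3, "b" = 2, "ab" = 4 and "abc" = 1, the tally is {5, 4, 1}: two distinct 1-grams, but five 1-gram occurrences.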
-
-
Method Detail
-
add
public void add(@NotNull java.lang.String gram)
Deprecated. Add n-gram to profile
- Parameters:
gram
-
-
omitLessFreq
public void omitLessFreq()
Deprecated. Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams. Also removes ASCII n-grams if the total number of ASCII n-grams is less than one third of the total. This is done because non-Latin text (such as Chinese) often has some Latin noise in between.
TODO split the 2 cleaning steps into separate methods.
TODO distinguish ASCII/Latin: currently it looks for Latin only; it should include characters with diacritics, e.g. Vietnamese.
TODO current code counts ASCII, but removes any Latin. Is that desired? If so, then this needs documentation.
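The two cleaning steps described above can be sketched as two passes over a frequency map. This is an illustration under stated assumptions, not the library's implementation: the class name is hypothetical, "the total" is assumed to mean the summed occurrences of all n-grams left after the first pass, and an n-gram is treated as ASCII if it contains any letter in A-Z or a-z.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-pass cleaner illustrating the description of omitLessFreq().
class RareAndLatinNoiseFilter {

    static Map<String, Integer> clean(Map<String, Integer> freq, int minimumFreq) {
        Map<String, Integer> kept = new HashMap<>(freq);

        // Pass 1: drop n-grams that occur fewer than minimumFreq times.
        kept.values().removeIf(count -> count < minimumFreq);

        // Pass 2: compare ASCII-letter occurrences with all remaining occurrences.
        long total = 0;
        long ascii = 0;
        for (Map.Entry<String, Integer> e : kept.entrySet()) {
            total += e.getValue();
            if (containsAsciiLetter(e.getKey())) {
                ascii += e.getValue();
            }
        }
        // Assumption: "less than one third of the total" refers to these sums.
        if (ascii * 3 < total) {
            kept.keySet().removeIf(RareAndLatinNoiseFilter::containsAsciiLetter);
        }
        return kept;
    }

    static boolean containsAsciiLetter(String gram) {
        for (int i = 0; i < gram.length(); i++) {
            char c = gram.charAt(i);
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the passes as separate blocks mirrors the first TODO above: each could be moved into its own method without changing the result.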
-
getName
public java.lang.String getName()
Deprecated.
-
setName
public void setName(java.lang.String name)
Deprecated.
-
getFreq
public java.util.Map<java.lang.String,java.lang.Integer> getFreq()
Deprecated.
-
setFreq
public void setFreq(java.util.Map<java.lang.String,java.lang.Integer> freq)
Deprecated.
-
getNWords
public int[] getNWords()
Deprecated.
-
setNWords
public void setNWords(int[] nWords)
Deprecated.
-
-