Package com.optimaize.langdetect.ngram
Class NgramExtractor
- java.lang.Object
-
- com.optimaize.langdetect.ngram.NgramExtractor
-
public class NgramExtractor extends java.lang.Object
Class for extracting n-grams out of a text.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
private @Nullable NgramFilter
filter
private @NotNull java.util.List<java.lang.Integer>
gramLengths
private @Nullable java.lang.Character
textPadding
-
Constructor Summary
Constructors
Modifier | Constructor | Description
private
NgramExtractor(@NotNull java.util.List<java.lang.Integer> gramLengths, @Nullable NgramFilter filter, @Nullable java.lang.Character textPadding)
-
Method Summary
Modifier and Type | Method | Description
private void
_extractCounted(java.lang.CharSequence text, int gramLength, int len, java.util.Map<java.lang.String,java.lang.Integer> grams)
private java.lang.CharSequence
applyPadding(java.lang.CharSequence text)
@NotNull java.util.Map<java.lang.String,java.lang.Integer>
extractCountedGrams(@NotNull java.lang.CharSequence text)
@NotNull java.util.List<java.lang.String>
extractGrams(@NotNull java.lang.CharSequence text)
Creates the n-grams for a given text in the order they occur.
NgramExtractor
filter(NgramFilter filter)
java.util.List<java.lang.Integer>
getGramLengths()
static NgramExtractor
gramLength(int gramLength)
static NgramExtractor
gramLengths(java.lang.Integer... gramLength)
private static int
guessNumDistinctiveGrams(int textLength, int gramLength)
This is trying to be smart.
NgramExtractor
textPadding(char textPadding)
To ensure that border grams are created, this character is added to the left and right of the text.
-
-
-
Field Detail
-
gramLengths
@NotNull private final java.util.List<java.lang.Integer> gramLengths
-
filter
@Nullable private final NgramFilter filter
-
textPadding
@Nullable private final java.lang.Character textPadding
-
-
Constructor Detail
-
NgramExtractor
private NgramExtractor(@NotNull java.util.List<java.lang.Integer> gramLengths, @Nullable NgramFilter filter, @Nullable java.lang.Character textPadding)
-
-
Method Detail
-
gramLength
public static NgramExtractor gramLength(int gramLength)
-
gramLengths
public static NgramExtractor gramLengths(java.lang.Integer... gramLength)
-
filter
public NgramExtractor filter(NgramFilter filter)
-
textPadding
public NgramExtractor textPadding(char textPadding)
To ensure that border grams are created, this character is added to the left and right of the text. Example: when textPadding is a space ' ', the text input "foo" becomes " foo ", ensuring that n-grams like " f" are created.
If the text already has this character in that position (e.g. it already starts or ends with it), it is not added there.
- Parameters:
textPadding
- for example a space ' '.
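The padding rule described above can be sketched as follows. This is a minimal illustration and not the library's private applyPadding implementation: the class name PaddingSketch and the method pad are hypothetical, and returning empty input unchanged is an assumption, since the documentation does not say what happens in that case.

```java
final class PaddingSketch {

    // Hypothetical helper: adds `padding` to the left and right of `text`,
    // skipping a side that already starts or ends with that character.
    static String pad(CharSequence text, char padding) {
        if (text.length() == 0) {
            return ""; // assumption: empty input is left unchanged
        }
        StringBuilder sb = new StringBuilder(text.length() + 2);
        if (text.charAt(0) != padding) {
            sb.append(padding);
        }
        sb.append(text);
        if (text.charAt(text.length() - 1) != padding) {
            sb.append(padding);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the added spaces visible.
        System.out.println("[" + pad("foo", ' ') + "]");
        System.out.println("[" + pad(" foo", ' ') + "]");
    }
}
```

Both calls in main produce the same padded text, because the second input already starts with the padding character.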
-
getGramLengths
public java.util.List<java.lang.Integer> getGramLengths()
-
extractGrams
@NotNull public java.util.List<java.lang.String> extractGrams(@NotNull java.lang.CharSequence text)
Creates the n-grams for a given text in the order they occur. Example: with a gram length of 2 and no filter or padding, the text "Foo bar" yields [Fo,oo,o , b,ba,ar]
- Parameters:
text
- Returns:
- The grams; empty if the input was empty or if it is too short to fit any gram of the configured length.
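The behaviour described for extractGrams corresponds to a sliding window over the text. The following is a minimal sketch of that technique for a single gram length, not the library's code: the class GramSketch and the method grams are hypothetical names, and filtering and padding are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class GramSketch {

    // Hypothetical helper: slides a window of `gramLength` characters over
    // `text` and collects every window in the order it occurs.
    static List<String> grams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        List<String> result = new ArrayList<>();
        // The loop body never runs when the text is shorter than gramLength,
        // so the result is empty in that case.
        for (int start = 0; start + gramLength <= text.length(); start++) {
            result.add(text.subSequence(start, start + gramLength).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Six bigrams: Fo, oo, "o ", " b", ba, ar
        System.out.println(grams("Foo bar", 2));
    }
}
```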
-
extractCountedGrams
@NotNull public java.util.Map<java.lang.String,java.lang.Integer> extractCountedGrams(@NotNull java.lang.CharSequence text)
- Returns:
- Key = ngram, value = count. The order is that in which the n-grams first appeared in the string.
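A map that counts grams while preserving first-appearance order can be sketched with java.util.LinkedHashMap, which iterates in insertion order. This is only an illustration under assumptions: the class CountedGramSketch and the method countedGrams are hypothetical, a single gram length is used, and whether the library itself relies on LinkedHashMap is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CountedGramSketch {

    // Hypothetical helper: counts each n-gram of one length. LinkedHashMap
    // keeps keys in the order in which they were first inserted.
    static Map<String, Integer> countedGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start + gramLength <= text.length(); start++) {
            String gram = text.subSequence(start, start + gramLength).toString();
            // Insert 1 for a new gram, otherwise add 1 to the existing count.
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" contains the bigrams ba, an, na, an, na.
        System.out.println(countedGrams("banana", 2));
    }
}
```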
-
_extractCounted
private void _extractCounted(java.lang.CharSequence text, int gramLength, int len, java.util.Map<java.lang.String,java.lang.Integer> grams)
-
guessNumDistinctiveGrams
private static int guessNumDistinctiveGrams(int textLength, int gramLength)
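This is trying to be smart by guessing the number of distinct n-grams. That number also depends on the script (an alphabetic script yields fewer distinct grams than an ideographic one), so it is unclear how good the guess really is. The aim is only to prevent array copies, and for Latin text it seems to work fine.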
-
applyPadding
private java.lang.CharSequence applyPadding(java.lang.CharSequence text)
-
-