Class NgramExtractor


  • public class NgramExtractor
    extends java.lang.Object
    Class for extracting n-grams out of a text.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private @Nullable NgramFilter filter  
      private @NotNull java.util.List<java.lang.Integer> gramLengths  
      private @Nullable java.lang.Character textPadding  
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private NgramExtractor​(@NotNull java.util.List<java.lang.Integer> gramLengths, @Nullable NgramFilter filter, @Nullable java.lang.Character textPadding)  
    • Method Summary

      Modifier and Type Method Description
      private void _extractCounted​(java.lang.CharSequence text, int gramLength, int len, java.util.Map<java.lang.String,​java.lang.Integer> grams)  
      private java.lang.CharSequence applyPadding​(java.lang.CharSequence text)  
      @NotNull java.util.Map<java.lang.String,​java.lang.Integer> extractCountedGrams​(@NotNull java.lang.CharSequence text)  
      @NotNull java.util.List<java.lang.String> extractGrams​(@NotNull java.lang.CharSequence text)
      Creates the n-grams for a given text in the order they occur.
      NgramExtractor filter​(NgramFilter filter)  
      java.util.List<java.lang.Integer> getGramLengths()  
      static NgramExtractor gramLength​(int gramLength)  
      static NgramExtractor gramLengths​(java.lang.Integer... gramLength)  
      private static int guessNumDistinctiveGrams​(int textLength, int gramLength)
      Heuristically estimates how many distinct n-grams a text will yield.
      NgramExtractor textPadding​(char textPadding)
      To ensure that border grams are produced, this character is added to the left and right of the text.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • gramLengths

        private final @NotNull java.util.List<java.lang.Integer> gramLengths
      • filter

        private final @Nullable NgramFilter filter
      • textPadding

        private final @Nullable java.lang.Character textPadding
    • Constructor Detail

      • NgramExtractor

        private NgramExtractor(@NotNull java.util.List<java.lang.Integer> gramLengths,
                               @Nullable NgramFilter filter,
                               @Nullable java.lang.Character textPadding)
    • Method Detail

      • gramLength

        public static NgramExtractor gramLength​(int gramLength)
      • gramLengths

        public static NgramExtractor gramLengths​(java.lang.Integer... gramLength)
      • textPadding

        public NgramExtractor textPadding​(char textPadding)
        To ensure that border grams are produced, this character is added to the left and right of the text.

        Example: when textPadding is a space ' ' then a text input "foo" becomes " foo ", ensuring that n-grams like " f" are created.

        If the text already starts or ends with that character, it is not added again on that side.

        Parameters:
        textPadding - for example a space ' '.
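The padding rule described above can be sketched as a small stand-alone helper. This is only an illustration of the documented behaviour, not the library's private applyPadding method; the class name PaddingSketch, the method pad, and the choice to return an empty input unchanged are all assumptions.

```java
// Hypothetical stand-alone sketch; not the library's applyPadding implementation.
final class PaddingSketch {
    private PaddingSketch() {}

    /** Adds pad on each side of text unless that side already carries it. */
    static String pad(CharSequence text, char pad) {
        int len = text.length();
        // Assumption: an empty input is returned unchanged.
        if (len == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(len + 2);
        if (text.charAt(0) != pad) {
            sb.append(pad); // left border is missing, add it
        }
        sb.append(text);
        if (text.charAt(len - 1) != pad) {
            sb.append(pad); // right border is missing, add it
        }
        return sb.toString();
    }
}
```

With a space as padding, "foo" becomes " foo ", while " foo" only gains the trailing space.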
      • getGramLengths

        public java.util.List<java.lang.Integer> getGramLengths()
      • extractGrams

        public @NotNull java.util.List<java.lang.String> extractGrams(@NotNull java.lang.CharSequence text)
        Creates the n-grams for a given text in the order they occur.

        Example: NgramExtractor.gramLength(2).extractGrams("Foo bar") => ["Fo", "oo", "o ", " b", "ba", "ar"]

        Parameters:
        text - the text to extract n-grams from.
        Returns:
        The grams; empty if the input is empty or too short to contain a gram of any configured length.
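For a single gram length, extraction in order of occurrence amounts to a sliding window over the text. The following is a minimal sketch of that idea, not the library's code; GramSketch and grams are hypothetical names, and filtering and padding are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sliding-window sketch; filter and padding are not applied here.
final class GramSketch {
    private GramSketch() {}

    /** Returns every substring of length gramLength, in order of occurrence. */
    static List<String> grams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        // Number of windows that fit; zero or negative when the text is too short.
        int count = text.length() - gramLength + 1;
        List<String> result = new ArrayList<>(Math.max(count, 0));
        for (int i = 0; i < count; i++) {
            result.add(text.subSequence(i, i + gramLength).toString());
        }
        return result;
    }
}
```

Calling GramSketch.grams("Foo bar", 2) yields the six grams "Fo", "oo", "o ", " b", "ba", "ar"; a text shorter than the gram length yields an empty list.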
      • extractCountedGrams

        public @NotNull java.util.Map<java.lang.String,java.lang.Integer> extractCountedGrams(@NotNull java.lang.CharSequence text)
        Returns:
        Key = n-gram, value = count. Entries are ordered by each n-gram's first appearance in the text.
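A map that counts grams while keeping first-appearance order can be built on java.util.LinkedHashMap, which iterates in insertion order. This is a sketch under that assumption; the source does not say which map type the library uses, and CountedGramSketch and countedGrams are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the map type used by the library is not stated in the source.
final class CountedGramSketch {
    private CountedGramSketch() {}

    /** Counts each gram of the given length, keeping first-appearance order. */
    static Map<String, Integer> countedGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        // LinkedHashMap iterates in insertion order, i.e. order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + gramLength <= text.length(); i++) {
            String gram = text.subSequence(i, i + gramLength).toString();
            counts.merge(gram, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }
}
```

For "banana" with gram length 2 the result is {ba=1, an=2, na=2}: a repeated gram raises its count without changing its position.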
      • _extractCounted

        private void _extractCounted​(java.lang.CharSequence text,
                                     int gramLength,
                                     int len,
                                     java.util.Map<java.lang.String,​java.lang.Integer> grams)
      • guessNumDistinctiveGrams

        private static int guessNumDistinctiveGrams​(int textLength,
                                                    int gramLength)
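        Heuristically estimates how many distinct n-grams a text of the given length will yield, in order to prevent array copies. The right estimate depends on the script (an alphabetic script yields fewer distinct grams than an ideographic one), so it is unclear how good the guess really is; for Latin text it seems to work fine.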
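The heuristic itself is not shown in this documentation. As a hedged illustration of why such a guess helps, the sketch below computes only the hard upper bound (the number of windows that fit in the text) and turns an expected entry count into a java.util.HashMap initial capacity that avoids resizing at the default load factor of 0.75. PresizeSketch, maxGrams and initialCapacity are hypothetical names, and the real method presumably returns something smaller than this bound.

```java
// Hypothetical sketch; the library's actual heuristic is not documented here.
final class PresizeSketch {
    private PresizeSketch() {}

    /** Upper bound on distinct grams: the number of windows that fit in the text. */
    static int maxGrams(int textLength, int gramLength) {
        return Math.max(textLength - gramLength + 1, 0);
    }

    /**
     * Initial capacity so that a HashMap with the default load factor (0.75)
     * can hold expectedEntries without growing its internal table.
     */
    static int initialCapacity(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }
}
```

A map created with new java.util.LinkedHashMap<>(PresizeSketch.initialCapacity(n)) can then accept n entries without rehashing. A tighter, script-aware estimate would save memory on long texts where many grams repeat.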
      • applyPadding

        private java.lang.CharSequence applyPadding​(java.lang.CharSequence text)