Class ShingleBased

  • Direct Known Subclasses:
    Cosine, Jaccard, QGram, SorensenDice

    @Immutable
    public abstract class ShingleBased
    extends java.lang.Object
    Abstract class for string similarities that rely on set operations (like cosine similarity or jaccard index). k-shingling is the operation of transforming a string (or text document) into a set of n-grams, which can be used to measure the similarity between two strings or documents. Generally speaking, a k-gram is any sequence of k tokens. We use here the definition from Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", Cambridge University Press: Multiple subsequent spaces are replaced by a single space, and a k-gram is a sequence of k characters. Default value of k is 3. A good rule of thumb is to imagine that there are only 20 characters and estimate the number of k-shingles as 20^k. For small documents like e-mails, k = 5 is a recommended value. For large documents, such as research articles, k = 9 is considered a safe choice.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static int DEFAULT_K  
      private int k  
      private static java.util.regex.Pattern SPACE_REG
      Pattern for finding multiple following spaces.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      int getK()
      Return k, the length of k-shingles (aka n-grams).
      java.util.Map<java.lang.String,​java.lang.Integer> getProfile​(java.lang.String string)
      Compute and return the profile of s, as defined by Ukkonen "Approximate string-matching with q-grams and maximal matches".
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • k

        private final int k
      • SPACE_REG

        private static final java.util.regex.Pattern SPACE_REG
        Pattern for finding multiple following spaces.
    • Constructor Detail

      • ShingleBased

        public ShingleBased​(int k)
        Parameters:
        k -
        Throws:
        java.lang.IllegalArgumentException - if k is <= 0
      • ShingleBased

        ShingleBased()
    • Method Detail

      • getK

        public final int getK()
        Return k, the length of k-shingles (aka n-grams).
        Returns:
        The length of k-shingles.
      • getProfile

        public final java.util.Map<java.lang.String,​java.lang.Integer> getProfile​(java.lang.String string)
        Compute and return the profile of s, as defined by Ukkonen "Approximate string-matching with q-grams and maximal matches". https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf The profile is the number of occurrences of k-shingles, and is used to compute q-gram similarity, Jaccard index, etc. Pay attention: the memory requirement of the profile can be up to k * size of the string
        Parameters:
        string -
        Returns:
        the profile of this string, as an unmodifiable Map