Class QGram

  • All Implemented Interfaces:
    StringDistance, java.io.Serializable

    @Immutable
    public class QGram
    extends ShingleBased
    implements StringDistance
    Q-gram distance, as defined by Ukkonen in "Approximate string-matching with q-grams and maximal matches". The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurences of each n-gram): SUM( |V1_i - V2_i| ). Q-gram distance is a lower bound on Levenshtein distance, but can be computed in O(m + n), where Levenshtein requires O(m.n).
    See Also:
    Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      QGram()
      Q-gram similarity and distance.
      QGram​(int k)
      Q-gram similarity and distance.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      double distance​(java.lang.String s1, java.lang.String s2)
      The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurence of each k-shingle).
      double distance​(java.util.Map<java.lang.String,​java.lang.Integer> profile1, java.util.Map<java.lang.String,​java.lang.Integer> profile2)
      Compute QGram distance using precomputed profiles.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • QGram

        public QGram​(int k)
        Q-gram similarity and distance. Defined by Ukkonen in "Approximate string-matching with q-grams and maximal matches", http://www.sciencedirect.com/science/article/pii/0304397592901434 The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurences of each k-shingle). Q-gram distance is a lower bound on Levenshtein distance, but can be computed in O(|A| + |B|), where Levenshtein requires O(|A|.|B|)
        Parameters:
        k -
      • QGram

        public QGram()
        Q-gram similarity and distance. Defined by Ukkonen in "Approximate string-matching with q-grams and maximal matches", http://www.sciencedirect.com/science/article/pii/0304397592901434 The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurence of each k-shingle). Q-gram distance is a lower bound on Levenshtein distance, but can be computed in O(|A| + |B|), where Levenshtein requires O(|A|.|B|) Default k is 3.
    • Method Detail

      • distance

        public final double distance​(java.lang.String s1,
                                     java.lang.String s2)
        The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurence of each k-shingle).
        Specified by:
        distance in interface StringDistance
        Parameters:
        s1 - The first string to compare.
        s2 - The second string to compare.
        Returns:
        The computed Q-gram distance.
        Throws:
        java.lang.NullPointerException - if s1 or s2 is null.
      • distance

        public final double distance​(java.util.Map<java.lang.String,​java.lang.Integer> profile1,
                                     java.util.Map<java.lang.String,​java.lang.Integer> profile2)
        Compute QGram distance using precomputed profiles.
        Parameters:
        profile1 -
        profile2 -
        Returns: