Class Jaccard

java.lang.Object
info.debatty.java.stringsimilarity.ShingleBased
info.debatty.java.stringsimilarity.Jaccard
All Implemented Interfaces:
MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity, StringDistance, StringSimilarity, Serializable

@Immutable public class Jaccard extends ShingleBased implements MetricStringDistance, NormalizedStringDistance, NormalizedStringSimilarity
Each input string is converted into a set of n-grams, the Jaccard index is then computed as |V1 inter V2| / |V1 union V2|. Like Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the cardinality of each n-gram is not taken into account. Distance is computed as 1 - cosine similarity. Jaccard index is a metric distance.
See Also:
  • Constructor Summary

    Constructors
    Constructor
    Description
    The strings are first transformed into sets of k-shingles (sequences of k characters), then Jaccard index is computed as |A inter B| / |A union B|.
    Jaccard(int k)
    The strings are first transformed into sets of k-shingles (sequences of k characters), then Jaccard index is computed as |A inter B| / |A union B|.
  • Method Summary

    Modifier and Type
    Method
    Description
    final double
    Distance is computed as 1 - similarity.
    final double
    Compute Jaccard index: |A inter B| / |A union B|.

    Methods inherited from class info.debatty.java.stringsimilarity.ShingleBased

    getK, getProfile

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • Jaccard

      public Jaccard(int k)
      The strings are first transformed into sets of k-shingles (sequences of k characters), then Jaccard index is computed as |A inter B| / |A union B|. The default value of k is 3.
      Parameters:
      k -
    • Jaccard

      public Jaccard()
      The strings are first transformed into sets of k-shingles (sequences of k characters), then Jaccard index is computed as |A inter B| / |A union B|. The default value of k is 3.
  • Method Details

    • similarity

      public final double similarity(String s1, String s2)
      Compute Jaccard index: |A inter B| / |A union B|.
      Specified by:
      similarity in interface StringSimilarity
      Parameters:
      s1 - The first string to compare.
      s2 - The second string to compare.
      Returns:
      The Jaccard index in the range [0, 1]
      Throws:
      NullPointerException - if s1 or s2 is null.
    • distance

      public final double distance(String s1, String s2)
      Distance is computed as 1 - similarity.
      Specified by:
      distance in interface MetricStringDistance
      Specified by:
      distance in interface StringDistance
      Parameters:
      s1 - The first string to compare.
      s2 - The second string to compare.
      Returns:
      1 - the Jaccard similarity.
      Throws:
      NullPointerException - if s1 or s2 is null.