Class ShingleBased
java.lang.Object
info.debatty.java.stringsimilarity.ShingleBased
- Direct Known Subclasses:
Cosine
,Jaccard
,QGram
,SorensenDice
Abstract class for string similarities that rely on set operations (like
cosine similarity or jaccard index).
k-shingling is the operation of transforming a string (or text document) into
a set of n-grams, which can be used to measure the similarity between two
strings or documents.
Generally speaking, a k-gram is any sequence of k tokens. We use here the
definition from Leskovec, Rajaraman & Ullman (2014), "Mining of Massive
Datasets", Cambridge University Press: Multiple subsequent spaces are
replaced by a single space, and a k-gram is a sequence of k characters.
Default value of k is 3. A good rule of thumb is to imagine that there are
only 20 characters and estimate the number of k-shingles as 20^k. For small
documents like e-mails, k = 5 is a recommended value. For large documents,
such as research articles, k = 9 is considered a safe choice.
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
-
Field Details
-
DEFAULT_K
private static final int DEFAULT_K- See Also:
-
k
private final int k -
SPACE_REG
Pattern for finding multiple following spaces.
-
-
Constructor Details
-
ShingleBased
public ShingleBased(int k) - Parameters:
k
-- Throws:
IllegalArgumentException
- if k is <= 0
-
ShingleBased
ShingleBased()
-
-
Method Details
-
getK
public final int getK()Return k, the length of k-shingles (aka n-grams).- Returns:
- The length of k-shingles.
-
getProfile
Compute and return the profile of s, as defined by Ukkonen "Approximate string-matching with q-grams and maximal matches". https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf The profile is the number of occurrences of k-shingles, and is used to compute q-gram similarity, Jaccard index, etc. Pay attention: the memory requirement of the profile can be up to k * size of the string- Parameters:
string
-- Returns:
- the profile of this string, as an unmodifiable Map
-