All Classes and Interfaces
Class
Description
As an adjunct to CharacterSubstitutionInterface, this interface
allows you to specify the cost of deletion or insertion of a
character.
Used to indicate the cost of character substitution.
The similarity between the two strings is the cosine of the angle between
their two vector representations.
Implementation of Damerau-Levenshtein distance with transposition (also
sometimes called unrestricted Damerau-Levenshtein distance).
Each input string is converted into a set of n-grams; the Jaccard index is
then computed as |V1 inter V2| / |V1 union V2|.
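The formula above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the class name `JaccardSketch`, the helper `ngrams`, and the choice to return 1.0 when both n-gram sets are empty are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {
    // Hypothetical helper: collect the distinct character n-grams of s.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard index: |V1 inter V2| / |V1 union V2|.
    static double jaccard(String s1, String s2, int n) {
        Set<String> v1 = ngrams(s1, n);
        Set<String> v2 = ngrams(s2, n);

        Set<String> union = new HashSet<>(v1);
        union.addAll(v2);
        if (union.isEmpty()) {
            return 1.0; // assumption: two strings with no n-grams count as identical
        }

        Set<String> inter = new HashSet<>(v1);
        inter.retainAll(v2);
        return (double) inter.size() / union.size();
    }
}
```

For example, with n = 2 the strings "ABCDE" and "ABCDF" share three of five distinct bigrams, so the index is 3 / 5.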
The Jaro–Winkler distance metric is designed and best suited for short
strings such as person names, and to detect typos; it is (roughly) a
variation of Damerau-Levenshtein, where the substitution of 2 close
characters is considered less important than the substitution of 2 characters
that are far from each other.
The Levenshtein distance between two words is the minimum number of
single-character edits (insertions, deletions or substitutions) required to
change one string into the other.
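The definition above corresponds to the classic dynamic-programming recurrence. The sketch below is one common way to compute it with two rolling rows; the class name `LevenshteinSketch` is hypothetical and the code is not taken from the library.

```java
class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn s1 into s2.
    static int distance(String s1, String s2) {
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];

        // Turning the empty prefix of s1 into the first j characters of s2
        // takes j insertions.
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of s2
            for (int j = 1; j <= s2.length(); j++) {
                int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }
}
```

For example, "kitten" and "sitting" are 3 edits apart: two substitutions and one insertion.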
The longest common subsequence (LCS) problem consists in finding the longest
subsequence common to two (or more) sequences.
Distance metric based on Longest Common Subsequence, from the notes "An
LCS-based string metric" by Daniel Bakkelund.
String distances that implement this interface are metrics.
N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance",
String Processing and Information Retrieval, Lecture Notes in Computer
Science Volume 3772, 2005, pp 115-126.
This distance is computed as Levenshtein distance divided by the length of
the longest string.
Normalized string similarities return a similarity between 0.0 and 1.0.
Implementation of the Optimal String Alignment (sometimes called the
restricted edit distance) variant of the Damerau-Levenshtein distance.
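To show how the restricted variant differs from plain Levenshtein, here is a minimal sketch of the usual Optimal String Alignment recurrence, which adds one case for swapping two adjacent characters. The class name `OsaSketch` is hypothetical and the code is an illustration rather than the library's source.

```java
class OsaSketch {
    // Optimal String Alignment distance: Levenshtein plus transposition of
    // two adjacent characters, where no substring is edited more than once.
    static int distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int[][] d = new int[n + 1][m + 1];

        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // i deletions
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // j insertions
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);

                // Transposition case: the last two characters are swapped.
                if (i > 1 && j > 1
                        && s1.charAt(i - 1) == s2.charAt(j - 2)
                        && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }
}
```

The pair "CA" and "ABC" is the usual example of the restriction: this recurrence gives 3, whereas the unrestricted Damerau-Levenshtein distance is 2.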
Example of computing cosine similarity with pre-computed profiles.
Q-gram distance, as defined by Ukkonen in "Approximate string-matching with
q-grams and maximal matches".
Ratcliff/Obershelp pattern recognition
The Ratcliff/Obershelp algorithm computes the similarity of two strings as
the doubled number of matching characters divided by the total number of
characters in the two strings.
Abstract class for string similarities that rely on set operations (like
cosine similarity or Jaccard index).
Sift4 - a general purpose string distance algorithm inspired by JaroWinkler
and Longest Common Subsequence.
Similar to Jaccard index, but this time the similarity is computed as 2 * |V1
inter V2| / (|V1| + |V2|).
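A small sketch of this coefficient over sets of character n-grams follows. As before, the class name `DiceSketch`, the helper `ngrams`, and the value returned for two empty n-gram sets are assumptions for illustration, not the library's API.

```java
import java.util.HashSet;
import java.util.Set;

class DiceSketch {
    // Hypothetical helper: collect the distinct character n-grams of s.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Similarity computed as 2 * |V1 inter V2| / (|V1| + |V2|).
    static double dice(String s1, String s2, int n) {
        Set<String> v1 = ngrams(s1, n);
        Set<String> v2 = ngrams(s2, n);
        if (v1.isEmpty() && v2.isEmpty()) {
            return 1.0; // assumption: avoid dividing by zero for very short inputs
        }

        Set<String> inter = new HashSet<>(v1);
        inter.retainAll(v2);
        return 2.0 * inter.size() / (v1.size() + v2.size());
    }
}
```

With n = 2, "ABCDE" and "ABCDF" each have four bigrams and share three, giving 2 * 3 / (4 + 4) = 0.75.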
Implementation of Levenshtein that allows defining different weights for
different character substitutions.