YASMEEN lexical measures
Revision as of 09:10, 28 October 2013
"Yet Another Species Matching Execution ENgine" - common lexical measures
Here follows a list of common lexical measures used by lexical matchlets in the YASMEEN matching process.
Each measure compares two strings and returns either their 'distance' (how different the two strings are) or, conversely, their similarity. Together they form the basis of the score computation performed by the lexical matchlets.
Levenshtein / Edit distance
It is defined as the minimum number of single-character changes (deletions, insertions, substitutions) that need to be performed on the first string to obtain the second.
As such, it returns an integer value that cannot be greater than the length of the longer of the two strings being compared.
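The definition above can be sketched with the standard dynamic-programming formulation, in which a single-character substitution counts as one change. This is a minimal illustration, not the YASMEEN implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (illustrative sketch)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("Thunnus", "Thunus")` is 1 (one deletion), and the result never exceeds the length of the longer input.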
Relative Levenshtein / Edit distance
It is simply defined as the ratio between the Levenshtein / Edit distance and the length of the longer of the two strings being compared.
As such, it returns a decimal value in the range [0.0 .. 1.0].
Levenshtein / Edit similarity
It is defined as 1.0 minus the Relative Levenshtein / Edit distance, and measures how similar (rather than how different) the two strings being compared are.
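The relative distance and the similarity derived from it can be sketched as follows. The function names are hypothetical, the edit distance assumes the standard formulation in which substitutions count as one change, and returning 0.0 as the relative distance of two empty strings is an assumption made here to avoid a division by zero; the source does not specify that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return levenshtein(a, b) / longest


def edit_similarity(a: str, b: str) -> float:
    """1.0 minus the relative edit distance."""
    return 1.0 - relative_distance(a, b)
```

With these definitions, "kitten" and "sitting" have an edit distance of 3 and a longer length of 7, so the relative distance is 3/7 and the similarity is 4/7.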
Relative soundex similarity
It is defined as the Levenshtein / Edit similarity calculated on the Soundex codes of the two strings being compared.
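As an illustration, the sketch below encodes both strings and then applies the edit similarity to the codes. It assumes the common four-character American Soundex; the source does not state which Soundex variant YASMEEN uses, and all function names are hypothetical.

```python
# Letter-to-digit table of American Soundex (assumed variant).
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Four-character American Soundex code of word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        last = digit  # a vowel resets the previous code
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def soundex_similarity(a: str, b: str) -> float:
    """Edit similarity computed on the Soundex codes of a and b."""
    sa, sb = soundex(a), soundex(b)
    longest = max(len(sa), len(sb))
    if longest == 0:
        return 1.0  # assumption: two inputs without letters count as identical
    return 1.0 - levenshtein(sa, sb) / longest
```

Under this variant, "Robert" and "Rupert" both encode to R163 and therefore score 1.0, while "Robert" (R163) and "Rubin" (R150) differ in two of four positions and score 0.5.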
Trigrams similarity
It is defined as the number of trigrams (sequences of three consecutive characters) that the two strings being compared have in common.
Relative trigrams similarity
It is simply defined as the ratio between the Trigrams similarity and the cardinality of the larger of the two trigram sets extracted from the two strings.
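Both trigram measures can be sketched as below. The sketch assumes that trigrams are the distinct three-character substrings of each string, taken without padding or case folding; the source does not say how YASMEEN extracts them. The function names are hypothetical, and returning 0.0 when neither string yields a trigram is likewise an assumption.

```python
def trigrams(text: str) -> set:
    """Set of distinct three-character substrings of text (no padding)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a: str, b: str) -> int:
    """Number of trigrams the two strings have in common."""
    return len(trigrams(a) & trigrams(b))


def relative_trigram_similarity(a: str, b: str) -> float:
    """Common trigrams divided by the size of the larger trigram set."""
    ta, tb = trigrams(a), trigrams(b)
    largest = max(len(ta), len(tb))
    if largest == 0:
        return 0.0  # assumption: strings shorter than three characters score 0
    return len(ta & tb) / largest
```

For instance, "thunnus" yields five trigrams (thu, hun, unn, nnu, nus) and "thunus" yields four (thu, hun, unu, nus). They share three, so the trigrams similarity is 3 and the relative trigrams similarity is 3/5.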