YASMEEN lexical measures

From D4Science Wiki

"Yet Another Species Matching Execution ENvironment" - common lexical measures

This page lists the common lexical measures used by the lexical matchlets in the YASMEEN matching process.

Each measure compares two strings and returns either their 'distance' (how different the two strings are) or, conversely, their similarity. Together they form the basis on which the lexical matchlets compute their scores.

Levenshtein / Edit distance

It is defined as the minimum number of single-character changes (insertions, deletions or substitutions) that need to be performed on the first string to get the second.

As such, it returns a non-negative integer value that cannot be greater than the length of the longer of the two strings being compared.
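
The definition above can be illustrated with a minimal Python sketch of the classic dynamic-programming computation. The function name `levenshtein` is chosen here for illustration only and is not taken from the YASMEEN code base; the sketch assumes the standard unit cost for every insertion, deletion and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (unit costs assumed)."""
    # previous[j] is the distance between the already processed prefix
    # of `a` and the prefix b[:j]; initially that prefix of `a` is empty.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Gadus morhua", "Gadus morua"))  # 1 (one deleted 'h')
print(levenshtein("kitten", "sitting"))            # 3
```

Keeping only two rows of the table is enough because each cell depends solely on the current and the previous row.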

Relative Levenshtein / Edit distance

It is simply defined as the ratio between the Levenshtein / Edit distance and the length of the longer of the two strings being compared.

As such, it returns a decimal value in the range [0.0 .. 1.0], where 0.0 means that the two strings are identical.

Levenshtein / Edit similarity

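It is defined as 1.0 minus the Relative Levenshtein / Edit distance, and measures 'how similar' (rather than 'how different') the two strings being compared are. It therefore also lies in the range [0.0 .. 1.0], with 1.0 meaning that the two strings are identical.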
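
As a rough sketch, the relative distance and the similarity derived from it could be computed as follows. The names `relative_distance` and `edit_similarity` are illustrative, not YASMEEN identifiers, and the handling of two empty strings (distance 0.0) is an assumption, since the definitions above leave that case open.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(a, b) / longest


def edit_similarity(a: str, b: str) -> float:
    """1.0 minus the relative edit distance."""
    return 1.0 - relative_distance(a, b)


# "Gadus morhua" has 12 characters and is one edit away from "Gadus morua".
print(relative_distance("Gadus morhua", "Gadus morua"))  # 1/12, about 0.083
print(edit_similarity("Gadus morhua", "Gadus morua"))    # 11/12, about 0.917
```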

Relative soundex similarity

It is defined as the Levenshtein / Edit similarity (as defined above) calculated on the Soundex codes of the two strings being compared.
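
The sketch below illustrates the idea under explicit assumptions: this page does not say which Soundex variant YASMEEN uses, so the code implements the common four-character American Soundex, and it treats the whole input as a single word after dropping non-letters. The names `soundex` and `soundex_similarity` are illustrative only.

```python
# Letter-to-digit table of American Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for letter in letters
}


def soundex(word: str) -> str:
    """Four-character American Soundex code (assumed variant)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate equal codes
            previous = digit
    return (code + "000")[:4]  # pad with zeros, then truncate


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def soundex_similarity(a: str, b: str) -> float:
    """Edit similarity of the Soundex codes of `a` and `b`."""
    code_a, code_b = soundex(a), soundex(b)
    longest = max(len(code_a), len(code_b))
    if longest == 0:
        return 1.0  # assumption: two inputs without letters are identical
    return 1.0 - levenshtein(code_a, code_b) / longest


print(soundex("Salmo"), soundex("Solea"))    # S450 S400
print(soundex_similarity("Salmo", "Solea"))  # 0.75
```

Because the codes are short, a single differing digit already lowers the score by 0.25, so this measure is much coarser than the edit similarity computed on the original strings.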

Trigrams similarity

It is defined as the number of trigrams (sequences of three consecutive characters) that the two strings being compared have in common.

Relative trigrams similarity

It is simply defined as the ratio between the Trigrams similarity and the cardinality of the larger of the two trigram sets extracted from the strings being compared.
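
Both trigram measures can be sketched in a few lines of Python. The sketch assumes that trigrams are plain, case-sensitive, unpadded three-character substrings collected into a set; whether YASMEEN pads or normalises the strings first is not stated on this page. The function names are illustrative, and returning 0.0 when neither string yields a trigram is likewise an assumption.

```python
def trigrams(text: str) -> set[str]:
    """Set of all three-character substrings of `text` (no padding)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a: str, b: str) -> int:
    """Number of distinct trigrams shared by `a` and `b`."""
    return len(trigrams(a) & trigrams(b))


def relative_trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the size of the larger trigram set."""
    largest = max(len(trigrams(a)), len(trigrams(b)))
    if largest == 0:
        return 0.0  # assumption: strings shorter than 3 chars score 0.0
    return trigram_similarity(a, b) / largest


# "Gadus" -> {Gad, adu, dus}; "Gadis" -> {Gad, adi, dis}; one in common.
print(trigram_similarity("Gadus", "Gadis"))           # 1
print(relative_trigram_similarity("Gadus", "Gadis"))  # 1/3, about 0.333
```

Since the number of shared trigrams can never exceed the size of either set, the relative value always stays in the range [0.0 .. 1.0].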