Difference between revisions of "YASMEEN string transformations"
Revision as of 17:25, 26 October 2013
"Yet Another Species Matching Execution ENgine" - common string transformations
The following is a list of common string transformations used in the YASMEEN data conversion and matching processes.
Simplification
This is the process of removing all non-letter and unnecessary characters (symbols, digits, multiple spaces, leading / trailing spaces) from a string, converting the result to the ASCII character set and returning the uppercase version of that conversion.
Unnecessary characters are removed by means of simple RegEx replacements, whilst the ASCII character set conversion is delegated to the ICU Java libraries. In particular, the transliterator ID used during the simplification process is:
Any-Latin; NFD; [:nonspacing mark:] remove; NFC; Latin-ASCII;
Examples of string simplification according to these rules are:
Glaucosoma hebraïcum -> GLAUCOSOMA HEBRAICUM
Cælorhynchis melanosagmatus -> CAELORHYNCHIS MELANOSAGMATUS
One "TwO" three!? :) -> ONE TWO THREE
dOuBlE Sp4C3S -> DOUBLE SPCS
成开冷渔 -> CHENG KAI LENG YU
حاج عبد الله عابدين -> ALHAJ BD ALLH ABDYN
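The RegEx part of the simplification can be sketched with the Java standard library alone. This is only an approximation under stated assumptions: the class and method names (SimplifySketch, simplify) are hypothetical, and java.text.Normalizer is used here as a stand-in for the ICU transliterator, so only accented Latin letters are reduced to ASCII. Ligatures such as æ and non-Latin scripts (the Chinese and Arabic examples above) are not converted by this sketch and would still need the ICU transliterator ID given earlier.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SimplifySketch {

    // Hypothetical helper, not YASMEEN's actual code.
    static String simplify(String input) {
        String cleaned = input
                .replaceAll("[^\\p{L}\\s]", "") // drop symbols, digits, punctuation
                .replaceAll("\\s+", " ")        // collapse runs of whitespace
                .trim();                        // drop leading / trailing spaces

        // Stand-in for the ICU step: decompose accented letters (NFD)
        // and remove the combining marks that result.
        String unaccented = Normalizer.normalize(cleaned, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");

        return unaccented.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00EF is the letter i with diaeresis
        System.out.println(simplify("Glaucosoma hebra\u00EFcum")); // GLAUCOSOMA HEBRAICUM
        System.out.println(simplify("One \"TwO\" three!? :)"));    // ONE TWO THREE
        System.out.println(simplify("dOuBlE Sp4C3S"));             // DOUBLE SPCS
    }
}
```

The cleanup runs before the ASCII conversion, following the order in which the steps are described above.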
Stemming
This is the process of removing common sequences of characters appearing at the end of genus and species names. Its purpose is to 'equalize' genus / species names that differ only in these suffixes (e.g. different Latin genders, singular vs. plural Latin forms etc.).
It is achieved by means of the following RegEx substitutions:
1. (.*)(IG)(ER|RA|ROS|RUM|RUS)$ -> $1$2
2. (.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX|UM|US|YS|YX)$ -> $1
3. (.*)(A|E|I|O|U|Y)$ -> $1
These substitutions are tried in sequence, from #1 to #3. The first substitution that actually changes the data stops the process: the remaining substitutions are not applied.
An example set of stemmed versions for dummy input data (including the applied transformations) is:
NIGER -> NIG (#1)
NIGRA -> NIG (#1)
NIGRES -> NIGR (#2)
NIGREA -> NIGR (#2)
NIGRIGERES -> NIGRIGER (#2)
NIGRICA -> NIGRIC (#3)
NIGRICI -> NIGRIC (#3)
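The three substitutions and the "first change wins" rule can be sketched in Java as follows. The patterns and replacements are copied from the list above; the class and method names (StemSketch, stem) are hypothetical, and the input is assumed to be an already simplified (uppercase ASCII) single name.

```java
import java.util.regex.Pattern;

final class StemSketch {

    // Patterns #1 to #3, in the order given in the text.
    private static final Pattern[] RULES = {
        Pattern.compile("(.*)(IG)(ER|RA|ROS|RUM|RUS)$"),
        Pattern.compile("(.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX|UM|US|YS|YX)$"),
        Pattern.compile("(.*)(A|E|I|O|U|Y)$")
    };

    // Replacement for each rule: #1 keeps the "IG" group, #2 and #3 keep only the prefix.
    private static final String[] REPLACEMENTS = {"$1$2", "$1", "$1"};

    // Hypothetical helper, not YASMEEN's actual code.
    static String stem(String name) {
        for (int i = 0; i < RULES.length; i++) {
            String result = RULES[i].matcher(name).replaceFirst(REPLACEMENTS[i]);
            if (!result.equals(name)) {
                return result; // first rule that changes the input halts the chain
            }
        }
        return name; // no rule applied
    }

    public static void main(String[] args) {
        System.out.println(stem("NIGER"));      // NIG
        System.out.println(stem("NIGRIGERES")); // NIGRIGER
        System.out.println(stem("NIGRICA"));    // NIGRIC
    }
}
```

Returning as soon as one rule changes the input is what keeps NIGRES at NIGR: rule #2 strips ES, and rule #3 is never tried on the result.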
Soundex
The soundex is calculated for simplified string data only, thus ensuring that the soundex will always work on ASCII letter characters.
The algorithm is basically the original Soundex,