YASMEEN string transformations

From D4Science Wiki

"Yet Another Species Matching Execution ENvironment" - common string transformations

Here follows a list of common string transformations involved in the YASMEEN data conversion and matching processes.

Simplification

This is the process of removing all non-letter and unnecessary characters (symbols, digits, multiple spaces, leading / trailing spaces) from a string, converting the result to the ASCII character set, removing all accents, diacritics and other letter-related meta-symbols, and returning the uppercase version of the result.

The substitution of unnecessary characters is achieved by means of simple RegEx replacements, whilst the ASCII character set conversion is delegated to the ICU Java libraries. In particular, the transliterator ID used during the simplification process is:

Any-Latin; NFD; [:nonspacing mark:] remove; NFC; Latin-ASCII;

Examples of string simplification according to these rules are:

Glaucosoma hebraïcum -> GLAUCOSOMA HEBRAICUM
Cælorhynchis melanosagmatus -> CAELORHYNCHIS MELANOSAGMATUS
One "TwO" three!? :) -> ONE TWO THREE
dOuBlE  Sp4C3S -> DOUBLE SPCS
成开冷渔 -> CHENG KAI LENG YU
حاج عبد الله عابدين -> ALHAJ BD ALLH ABDYN
Linné -> LINNE
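
The Latin-script part of these rules can be approximated with the Python standard library alone. The sketch below is not the YASMEEN implementation: the function name `simplify` and the `LIGATURES` table are hypothetical, and it assumes that non-letter characters are deleted rather than replaced by a space (which is what the "Sp4C3S" example suggests). It reproduces only the accent-stripping portion of the ICU transliterator and performs no Any-Latin conversion, so non-Latin inputs such as the Chinese and Arabic examples above would need ICU itself.

```python
import re
import unicodedata

# Hypothetical, incomplete stand-in for ICU's Latin-ASCII step:
# these letters are not split by NFD decomposition.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"}


def simplify(text: str) -> str:
    """Approximate YASMEEN simplification for Latin-script input."""
    for src, dst in LIGATURES.items():
        text = text.replace(src, dst)
    # NFD separates base letters from their combining marks;
    # marks have Unicode category "Mn" (nonspacing mark).
    decomposed = unicodedata.normalize("NFD", text)
    unaccented = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Assumption: symbols and digits are deleted, not replaced by spaces.
    letters_only = re.sub(r"[^A-Za-z\s]", "", unaccented)
    # Collapse multiple spaces, trim, and uppercase.
    return re.sub(r"\s+", " ", letters_only).strip().upper()


print(simplify("Glaucosoma hebraïcum"))  # GLAUCOSOMA HEBRAICUM
print(simplify("dOuBlE  Sp4C3S"))        # DOUBLE SPCS
```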

Stemming

This is the process of removing common sequences of characters appearing at the end of genus and species names. Its purpose is to 'equalize' genus / species names that differ only in these suffixes (e.g. different Latin genders, singular vs. plural Latin forms, etc.).

It is achieved by means of the following RegEx substitutions:

1. (.*)(IG)(ER|RA|ROS|RUM|RUS)$ -> $1$2
2. (.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX|UM|US|YS|YX)$ -> $1
3. (.*)(A|E|I|O|U|Y)$ -> $1 

These substitutions are attempted in sequence, from #1 to #3. The first substitution that actually changes the data is the only one applied; the remaining ones are skipped.

An example set of stemmed versions for dummy input data (including the applied transformations) is:

NIGER -> NIG (#1)
NIGRA -> NIG (#1)

NIGRES -> NIGR (#2)
NIGREA -> NIGR (#2)
NIGRIGERES -> NIGRIGER (#2)

NIGRICA -> NIGRIC (#3)
NIGRICI -> NIGRIC (#3)
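
The three substitutions and the "first change wins" rule can be sketched in Python as follows. The names `STEM_RULES` and `stem` are hypothetical; the patterns are copied from the list above, with the Java-style `$1` / `$2` references written as `\1` / `\2`. The input is assumed to be an already simplified (uppercase ASCII) string.

```python
import re

# Rules #1 to #3, in the order in which they are attempted.
STEM_RULES = [
    (re.compile(r"(.*)(IG)(ER|RA|ROS|RUM|RUS)$"), r"\1\2"),
    (re.compile(
        r"(.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX"
        r"|UM|US|YS|YX)$"
    ), r"\1"),
    (re.compile(r"(.*)(A|E|I|O|U|Y)$"), r"\1"),
]


def stem(name: str) -> str:
    """Apply the first rule that changes the input, then stop."""
    for pattern, replacement in STEM_RULES:
        result = pattern.sub(replacement, name)
        if result != name:
            return result
    return name  # no rule matched: the name is left untouched


for example in ["NIGER", "NIGRES", "NIGRIGERES", "NIGRICA"]:
    print(example, "->", stem(example))
```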

Stemming normalization

This is the process of removing spaces from a stemmed string and collapsing each run of repeated characters into a single character.

An example set of normalized stemmed versions for dummy input data is:

INPUT -> STEMMED INPUT -> NORMALIZED STEMMED INPUT
pulcherrima -> PULCHERRIM -> PULCHERIM
macrolepis mahableshwarensis -> MACROLEPIS MAHABLESHWARENS -> MACROLEPISMAHABLESHWARENS
acroporidae -> ACROPORID -> ACROPORID
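
A minimal Python sketch of this normalization is given below. The function name `normalize_stem` is hypothetical, and the order of the two steps (spaces removed first, repeated characters collapsed second) is an assumption: the source does not state it, and the examples above give the same result either way.

```python
import re


def normalize_stem(stemmed: str) -> str:
    """Remove spaces, then collapse runs of a repeated character."""
    # Assumption: spaces are removed before repeats are collapsed.
    without_spaces = stemmed.replace(" ", "")
    # (.)\1+ matches a character followed by one or more copies of itself.
    return re.sub(r"(.)\1+", r"\1", without_spaces)


print(normalize_stem("PULCHERRIM"))  # PULCHERIM
print(normalize_stem("MACROLEPIS MAHABLESHWARENS"))  # MACROLEPISMAHABLESHWARENS
```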

Soundex

Soundexes in YASMEEN are calculated for simplified string data, thus ensuring that they will always be the result of processing ASCII letter characters only.

The soundex is computed for every word in the simplified string and for the whole simplified string (spaces removed) according to the original soundex algorithm specification, that is, by discarding vowels first and duplicates second. Unlike the original algorithm, which truncates codes to four characters, no upper limit is set on the length of a soundex.

Thus, the soundex of a multiple-word string will be a sequence of soundexes: one per distinct word, followed by the soundex of the overall string. If a word appears multiple times in the simplified string, its soundex is calculated just once.

Examples of this soundex calculation are:

RHOPALAEA NEAPOLITANA -> R140 N1435 R1451435 (R140 for RHOPALAEA, N1435 for NEAPOLITANA and R1451435 for RHOPALAEANEAPOLITANA)
NIGER NIGER -> N260 N26526 (N260 for NIGER and N26526 for NIGERNIGER)
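
The calculation described above can be sketched in Python as follows. The names `CODES`, `soundex` and `soundex_of_string` are hypothetical. Two details are assumptions inferred from the examples rather than stated in the text: short codes are padded with zeros to three digits (R140, N260), and H, W and Y are discarded together with the vowels, as in the classic letter table. The source also does not say what is emitted for a single-word string; this sketch simply outputs the word's soundex followed by the (identical) whole-string soundex.

```python
# Classic soundex letter-to-digit table; vowels, H, W and Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Unbounded-length soundex of one simplified (A-Z only) word."""
    if not word:
        return ""
    # Step 1: discard uncoded letters (vowels, H, W, Y).
    codes = [CODES[ch] for ch in word if ch in CODES]
    # Step 2: collapse adjacent duplicate codes.
    collapsed = [c for i, c in enumerate(codes) if i == 0 or c != codes[i - 1]]
    # The first letter is kept as a letter, so drop its own code.
    if word[0] in CODES:
        collapsed = collapsed[1:]
    # Assumption: pad to at least three digits; never truncate.
    return word[0] + "".join(collapsed).ljust(3, "0")


def soundex_of_string(simplified: str) -> str:
    """One soundex per distinct word, plus one for the joined string."""
    words = simplified.split()
    distinct = list(dict.fromkeys(words))  # keeps first-seen order
    parts = [soundex(w) for w in distinct]
    parts.append(soundex("".join(words)))
    return " ".join(parts)


print(soundex_of_string("RHOPALAEA NEAPOLITANA"))  # R140 N1435 R1451435
print(soundex_of_string("NIGER NIGER"))            # N260 N26526
```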

Trigrams

Trigrams in YASMEEN are calculated for simplified string data only and are built separately for each word appearing in the original string data. The final result is the union (with repetitions included) of the trigrams extracted from each word.

Examples of trigrams calculation are:

Text -> Simplified text -> Simplified text trigrams
camtschaticum -> CAMTSCHATICUM -> CA CAM AMT MTS TSC SCH CHA HAT ATI TIC ICU CUM UM
Lethenteron camtschaticum -> LETHENTERON CAMTSCHATICUM -> LE LET ETH THE HEN ENT NTE TER ERO RON ON CA CAM AMT MTS TSC SCH CHA HAT ATI TIC ICU CUM UM
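
The trigram lists above can be reproduced with the following Python sketch. The names `word_trigrams` and `trigrams` are hypothetical. The examples show that each word also contributes its leading and trailing two-letter sequences (e.g. CA and UM); the sketch assumes this comes from padding each word with one space on either side and trimming the resulting trigrams, which is an inference from the examples and not something the text states.

```python
def word_trigrams(word: str) -> list[str]:
    """Trigrams of a single simplified word, including edge bigrams."""
    # Assumption: pad with one space per side, then trim each trigram,
    # so "CAM..." yields "CA" first and "...UM" yields "UM" last.
    padded = f" {word} "
    return [padded[i:i + 3].strip() for i in range(len(padded) - 2)]


def trigrams(simplified: str) -> list[str]:
    """Concatenate per-word trigrams, keeping repetitions."""
    result: list[str] = []
    for word in simplified.split():
        result.extend(word_trigrams(word))
    return result


print(" ".join(trigrams("CAMTSCHATICUM")))
# CA CAM AMT MTS TSC SCH CHA HAT ATI TIC ICU CUM UM
```

Because trigrams never span a word boundary, the result for a multi-word string is simply the per-word lists placed one after another.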