YASMEEN string transformations

"Yet Another Species Matching Execution ENgine" - common string transformations

The following string transformations are commonly used in the YASMEEN data conversion and matching processes.

Simplification

This is the process of removing all non-letter and unnecessary characters (symbols, digits, multiple spaces, leading / trailing spaces) from a string, converting the result to the ASCII character set, and returning the uppercase version of that conversion.

Unnecessary characters are removed by means of simple RegEx replacements, whilst the conversion to the ASCII character set is delegated to the ICU Java libraries. In particular, the transliterator ID used during the process is:

Any-Latin; NFD; [:nonspacing mark:] remove; NFC; Latin-ASCII;
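
As a rough illustration of the overall effect, here is a minimal Java sketch of simplification. It is not the YASMEEN code: the class and method names are hypothetical, the exact RegEx replacements and their order are assumptions, and the ICU transliterator is approximated with java.text.Normalizer from the JDK so that the example has no external dependency. That stand-in only strips diacritics from Latin-script input. Letters that have no canonical decomposition (such as a ligature) are simply dropped here, whereas the ICU transliterator ID above would convert them.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class SimplificationSketch {

    static String simplify(String input) {
        // Stand-in for the ICU step: decompose accented letters (NFD),
        // then remove the nonspacing marks left behind.
        String unaccented = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");

        // Assumption: every run of non-letter characters (symbols, digits,
        // whitespace) collapses into a single space.
        String lettersOnly = unaccented.replaceAll("[^A-Za-z]+", " ");

        // Drop leading / trailing spaces and return the uppercase version.
        return lettersOnly.trim().toUpperCase(Locale.ROOT);
    }
}
```

With these assumptions, an input such as "  Thunnus   albacares (Bonnaterre, 1788) " is reduced to uppercase words separated by single spaces, with the punctuation and the year removed.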

Stemming

This is the process of removing common sequences of characters appearing at the end of genus and species names. Its purpose is to 'equalize' genus / species names that differ only in these suffixes (e.g. different genera).

It is achieved by means of the following RegEx substitutions:

1. (.*)(IG)(ER|RA|ROS|RUM|RUS)$ -> $1$2
2. (.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX|UM|US|YS|YX)$ -> $1
3. (.*)(A|E|I|O|U|Y)$ -> $1

These are applied in sequence, from 1 to 3. The first transformation that actually changes the input halts the sequence, so the remaining rules are not applied.

Examples of stemmed versions (with the applied transformation in brackets) are:

NIGER -> NIG (#1)
NIGRA -> NIG (#1)

NIGRES -> NIGR (#2)
NIGREA -> NIGR (#2)

NIGRICA -> NIGRIC (#3)
NIGRICI -> NIGRIC (#3)
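
The rule cascade described above can be sketched in Java with java.util.regex. This is a minimal sketch rather than the YASMEEN implementation: the class and method names are hypothetical, and the input is assumed to be an already simplified (uppercase ASCII) single name.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class StemmingSketch {

    // The three substitutions listed above, in application order.
    private static final Pattern[] RULES = {
        Pattern.compile("(.*)(IG)(ER|RA|ROS|RUM|RUS)$"),
        Pattern.compile("(.*)(AE|AK|AM|AR|AS|AX|EA|ES|EX|II|IS|IX|NS|OK|ON|OR|OS|OX|UM|US|YS|YX)$"),
        Pattern.compile("(.*)(A|E|I|O|U|Y)$")
    };

    private static final String[] REPLACEMENTS = { "$1$2", "$1", "$1" };

    static String stem(String name) {
        for (int i = 0; i < RULES.length; i++) {
            String result = RULES[i].matcher(name).replaceFirst(REPLACEMENTS[i]);
            // The first rule that changes the input ends the cascade.
            if (!result.equals(name)) {
                return result;
            }
        }
        // No rule applied: the name is returned unchanged.
        return name;
    }
}
```

Calling StemmingSketch.stem on each of the names listed above reproduces the stemmed forms shown, and a name without any of the listed suffixes (NIGR, for instance) is returned as is.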
