Difference between revisions of "YASMEEN string transformations"

From D4Science Wiki

Revision as of 16:56, 26 October 2013

"Yet Another Species Matching Execution ENgine" - common string transformations

The following is a list of the common string transformations used in the YASMEEN data conversion and matching processes.

Simplification

Simplification removes all non-letter and otherwise unnecessary characters (symbols, digits, multiple spaces, leading and trailing spaces) from a string, converts the result to the ASCII character set, and returns the uppercase version of that conversion.

The unnecessary characters are removed by means of simple RegEx replacements, whilst the conversion to the ASCII character set is delegated to the ICU Java libraries (http://icu-project.org/). In particular, the transliterator ID used during the process is:

Any-Latin; NFD; [:nonspacing mark:] remove; NFC; Latin-ASCII;
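
The simplification steps can be illustrated with a minimal sketch. It is not the YASMEEN implementation: it is written in Python and uses only the standard library instead of ICU, so the Any-Latin step (transliteration of non-Latin scripts) is not reproduced, and Latin-ASCII is only approximated by discarding any character that is still outside ASCII (ICU would map, for example, "ß" to "ss"). The function name simplify, the exact regular expression, the choice of replacing each run of non-letters with a single space, and the order of the steps are all assumptions made for illustration.

```python
import re
import unicodedata


def simplify(text: str) -> str:
    """Rough approximation of the simplification described above (hypothetical helper)."""
    # NFD: split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Remove nonspacing marks (Unicode category Mn), then recompose with NFC.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    recomposed = unicodedata.normalize("NFC", stripped)
    # Assumption: stand-in for Latin-ASCII, dropping whatever is still non-ASCII.
    ascii_only = recomposed.encode("ascii", "ignore").decode("ascii")
    # Assumption: each run of non-letters (symbols, digits, whitespace)
    # becomes one space, which also collapses multiple spaces.
    letters_only = re.sub(r"[^A-Za-z]+", " ", ascii_only)
    # Trim leading / trailing spaces and return the uppercase version.
    return letters_only.strip().upper()
```

Under these assumptions, simplify("  Thunnus  alalunga (Bonnaterre, 1788) ") returns "THUNNUS ALALUNGA BONNATERRE" and simplify("Lacépède") returns "LACEPEDE".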

Stemming

Soundex

Trigrams