Difference between revisions of "Taxamatch Algorithm"

From D4Science Wiki

Revision as of 06:15, 14 November 2012

Definition of Terms

Taxamatch is

Algorithm

Here are the step-by-step procedures and cases used in running Taxamatch:

  1. Get the species name entered by the user (e.g. Genus, Species, or Genus+Species).
  2. Search for an exact match in the database.
    1. If there is an exact match, print that species and stop.
  3. Normalize the user's input:
    1. Transform any accented character to its non-accented equivalent.
    2. Strip out any HTML characters and drop any character other than A-Z, a-z, and space.
    3. Collapse repeated (doubled) letters and runs of multiple spaces.
  4. Search for the normalized input in the database.
    1. If there is an exact match, print that species and stop.
  5. Get the root of the normalized input and search for it in the database.
  6. Run the last search again without any genus or species filter.
  7. Filter the output using the functions phonetic, mdld, and similarity.
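
The normalization and exact-match steps (2 to 4) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used by Taxamatch: the function names `normalize` and `find_exact` are hypothetical, the database is replaced by an in-memory collection of names, HTML is stripped before accents so that entities such as `&eacute;` are decoded first, and database names are normalized in the same way as the input before comparison. Getting the root of the name (step 5) is not covered.

```python
import html
import re
import unicodedata


def normalize(name: str) -> str:
    """Normalize a name following step 3 (one possible interpretation)."""
    # Step 3.2, first half: remove HTML tags and decode entities.
    text = re.sub(r"<[^>]+>", " ", name)
    text = html.unescape(text)
    # Step 3.1: decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Step 3.2, second half: keep only A-Z, a-z and space.
    text = re.sub(r"[^A-Za-z ]", "", text)
    # Step 3.3: collapse runs of the same letter, then runs of spaces.
    text = re.sub(r"([A-Za-z])\1+", r"\1", text)
    return re.sub(r" +", " ", text).strip()


def find_exact(name, known_names):
    """Steps 2 and 4: try the raw input, then the normalized input."""
    if name in known_names:
        return name
    normalized = normalize(name)
    # Assumption: stored names are normalized the same way before comparing.
    for candidate in known_names:
        if normalize(candidate) == normalized:
            return candidate
    return None


names = ["Gadus morhua", "Mullus barbatus"]
print(find_exact("Gádus  <i>morhua</i>", names))
print(find_exact("Mulus barbatus", names))
```

Because doubled letters are collapsed, an input such as "Mulus barbatus" can only match "Mullus barbatus" if the stored name goes through the same normalization, which is why the sketch normalizes both sides.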
Phonetic
compares the pronunciation similarity of the input and the data, with 1 being the highest score
MDLD
the Modified Damerau-Levenshtein Distance test
the minimal number of characters you have to replace, insert, or delete to make two strings identical. An MDLD of 0 means the two strings are the same
Similarity
returns the percentage similarity of the input and the data
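
To make the distance and similarity measures more concrete, here is a minimal Python sketch. It does not reproduce the actual MDLD function: it uses the plain Damerau-Levenshtein distance in its "optimal string alignment" form, which counts an adjacent transposition as one edit in addition to replace, insert, and delete. The similarity formula (100% minus the distance relative to the longer string) is an assumption, since this page does not define how the percentage is computed. The names `osa_distance` and `similarity_percent` are hypothetical, and the phonetic function is not covered.

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # Adjacent transposition, e.g. "ud" <-> "du".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed formula: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - osa_distance(a, b) / longest)


print(osa_distance("gadus", "gauds"))
print(similarity_percent("abcd", "abcf"))
```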

A species is printed if it satisfies all of the conditions below:

  • the phonetic match score between the input and the data is greater than or equal to 0.4
  • the mdld of the input and the data is less than or equal to 4
  • the similarity of the input and the data is one of:
    • equal to 100%. This is an exact match.
    • between 50% and 100%. This is a near match.
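
The acceptance conditions above can be expressed as a small Python function. This is a sketch only: the name `classify` and its return values are hypothetical, the three scores are assumed to have been computed already, and the lower bound of the near-match range is assumed to include 50% because the text does not say whether the bound is inclusive.

```python
from typing import Optional


def classify(phonetic: float, mdld: int, similarity: float) -> Optional[str]:
    """Return "exact", "near", or None if the candidate is rejected."""
    # The phonetic score must be at least 0.4 and the distance at most 4.
    if phonetic < 0.4 or mdld > 4:
        return None
    if similarity == 100:
        return "exact"
    # Assumption: 50% itself counts as a near match.
    if 50 <= similarity < 100:
        return "near"
    return None


print(classify(1.0, 0, 100.0))
print(classify(0.5, 2, 80.0))
print(classify(0.3, 2, 80.0))
```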