Taxamatch Algorithm

From D4Science Wiki
Latest revision as of 09:11, 25 March 2013

Definition of Terms and Functions

Taxamatch

Taxamatch is a library that searches for exact or near matches of scientific names. It has one main function that implements the Taxamatch algorithm. The function's parameters are:

  • the user's input Genus,
  • the user's input Species,
  • Genus operator,
  • Species operator,
  • IP address,
  • database username,
  • database password,
  • and name of the database.
The user's input Genus and Species
It can be any string (e.g. Rhincodon typus, Acipenser schypa, or foa fo).
Genus and Species operator
The operator is used only when searching for an exact match. It can be EQUAL, NOT_EQUAL, CONTAINS, BEGINS_WITH, or ENDS_WITH.
  • EQUAL -> search for the Genus or Species that matches the user's input string
  • NOT_EQUAL -> exclude the user's input string from the search
  • CONTAINS -> search for the Genus or Species that contains the string in the name
  • BEGINS_WITH -> search for the Genus or Species that starts with the user's input string
  • ENDS_WITH -> search for the Genus or Species that ends with the user's input string
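
The five operators can be read as simple string predicates. The following is a minimal sketch of that reading, not the library's actual code: the `Operator` enum and the `matches` function are hypothetical names, the case-insensitive comparison is an assumption, and NOT_EQUAL is interpreted here as "the name differs from the input".

```python
from enum import Enum


class Operator(Enum):
    """Hypothetical enumeration of the five operators named above."""
    EQUAL = "EQUAL"
    NOT_EQUAL = "NOT_EQUAL"
    CONTAINS = "CONTAINS"
    BEGINS_WITH = "BEGINS_WITH"
    ENDS_WITH = "ENDS_WITH"


def matches(name: str, user_input: str, op: Operator) -> bool:
    """Return True if `name` satisfies `op` with respect to `user_input`."""
    # Assumption: comparison ignores case; the source does not say.
    name, user_input = name.lower(), user_input.lower()
    if op is Operator.EQUAL:
        return name == user_input
    if op is Operator.NOT_EQUAL:
        return name != user_input
    if op is Operator.CONTAINS:
        return user_input in name
    if op is Operator.BEGINS_WITH:
        return name.startswith(user_input)
    if op is Operator.ENDS_WITH:
        return name.endswith(user_input)
    raise ValueError(f"unknown operator: {op}")
```

For example, `matches("Rhincodon", "rhin", Operator.BEGINS_WITH)` is true under these assumptions, while `matches("Rhincodon", "typus", Operator.CONTAINS)` is false.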

The output of the main function is an array of length two. The first value is the count of results; the second is a string made of the concatenated scientific names that are believed to be near matches of the user's input string.

Output.jpg

Normalize

It accepts a string. It simplifies the string by reducing double spaces, removing symbols and numbers, and transforming accented characters into their unaccented equivalents. It returns the normalized string.
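
A minimal sketch of such a function, using only the Python standard library, might look like the following. It is an illustration of the description above rather than the library's own code: the name `normalize`, the trimming of leading and trailing spaces, and the use of Unicode decomposition to remove accents are all assumptions, and HTML markup is not handled.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalizer following the description above."""
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining
    # accent), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove symbols and digits: keep only A-Z, a-z and spaces.
    letters_only = re.sub(r"[^A-Za-z ]", "", unaccented)
    # Reduce runs of spaces to one space; trimming the ends is an assumption.
    return re.sub(r" +", " ", letters_only).strip()
```

With this sketch, an input such as `"Acipénser  schypa1!"` would be reduced to `"Acipenser schypa"`.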

Root

It accepts a string. It simplifies the string by removing double characters and extracting the root word. It returns the root of the string.

Phonetic

It accepts two strings, compares them, and returns their pronunciation similarity. The highest value is 1, which means the two strings have the same pronunciation.

MDLD

MDLD stands for the Modified Damerau-Levenshtein Distance test. It is the minimal number of characters that must be replaced, inserted, or deleted to make two strings identical. An MDLD of 0 means the two strings are the same. This is the letter difference.
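
To make the idea of an edit distance concrete, here is a sketch of the plain restricted Damerau-Levenshtein distance (also called optimal string alignment). This is not the modified variant used by Taxamatch, whose modifications are not described here; the function name `osa_distance` is hypothetical, and the adjacent-transposition step comes from the standard Damerau-Levenshtein definition rather than from the text above.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

Identical strings give a distance of 0, and a single inserted letter, as in `osa_distance("schypa", "schypha")`, gives 1.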

Similarity

It returns the percentage similarity between the input and the data. This is the letter similarity.

Algorithm

Here are the step-by-step procedures and cases used in running Taxamatch:

1. Get the user's species input (e.g. Genus+Species).

Search1.jpg

2. Search for an exact match in the database. The dataset consists of the joined Species, Synonyms and Families tables. The match is evaluated according to the selected genus and species operators.

2.1 If there is a match, print that species and end the connection.

3. Normalize the user's input:

3.1 Transform any accented character to its unaccented equivalent.
3.2 Strip out any HTML characters and drop any character other than A-Z, a-z and space.
3.3 Remove repeated double letters and multiple spaces.

Normalize.jpg

4. Get the normalized input and search it in the database. The dataset is the same as in the previous query.

4.1 If there is a match, print that species and end the connection.

5. Get the root of the normalized input and search it in the database. The dataset to use is the Taxamatch table that contains the Normalize Genus and Normalize Species.

5.1 If there is a match, print that species and end the connection.

6. Run a last search without any genus or species filter. The dataset is all the rows in the Taxamatch table.

7. Filter the output using the functions: phonetic, mdld, and similarity.
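
The steps above form a cascade in which each search runs only if the previous one found nothing. The sketch below shows that control flow only. Every name in it is hypothetical: the database queries, `normalize`, `root` and the final filter are passed in as plain functions, the returned stage labels are invented for illustration, and passing the normalized input to the final filter is an assumption.

```python
from typing import Callable, List, Tuple


def cascade(
    user_input: str,
    search_names: Callable[[str], List[str]],      # steps 2 and 4: joined tables
    search_taxamatch: Callable[[str], List[str]],  # step 5: Taxamatch table
    all_taxamatch_rows: Callable[[], List[str]],   # step 6: every row
    normalize: Callable[[str], str],               # step 3
    root: Callable[[str], str],                    # step 5
    passes_filter: Callable[[str, str], bool],     # step 7
) -> Tuple[str, List[str]]:
    """Return the stage that produced results and the matching names."""
    hits = search_names(user_input)
    if hits:
        return "exact", hits
    normalized = normalize(user_input)
    hits = search_names(normalized)
    if hits:
        return "normalized", hits
    hits = search_taxamatch(root(normalized))
    if hits:
        return "root", hits
    # Last resort: scan every row and keep those that pass the filter.
    candidates = all_taxamatch_rows()
    return "filtered", [c for c in candidates if passes_filter(normalized, c)]
```

Passing the searches in as functions keeps the sketch independent of any particular database layer, which is not described here.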

Taxamatch prints a species only if it satisfies all the conditions below:

  • the sound (pronunciation) similarity is greater than or equal to 0.4
  • the MDLD, or letter difference, between the input and the data is less than or equal to 4
  • the letter similarity between the input and the data is:
    • equal to 100%. This is an exact match.
    • between 50% and 100%. This is a near match.
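
The conditions above can be expressed as a small decision function. This is a sketch under stated assumptions: the function name `classify` and its return labels are hypothetical, the three scores are assumed to have been computed already, similarity is expressed as a percentage from 0 to 100, and the lower bound of 50% is treated as inclusive, which the text does not specify.

```python
from typing import Optional


def classify(phonetic: float, mdld: int, similarity: float) -> Optional[str]:
    """Return "exact", "near", or None for a candidate name.

    `similarity` is a percentage (0-100).
    """
    # Both thresholds must hold before similarity is considered.
    if phonetic < 0.4 or mdld > 4:
        return None
    if similarity == 100:
        return "exact"
    # Assumption: 50% itself counts as a near match.
    if 50 <= similarity < 100:
        return "near"
    return None
```

For instance, `classify(0.5, 2, 75.0)` yields `"near"`, while a candidate with a phonetic score of 0.3 is rejected regardless of its other scores.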