Difference between revisions of "10.09.2013 BiolDiv"
Revision as of 11:50, 11 September 2013
Meeting notes: 10.09.2013 Afternoon
We had two meetings in the course of the afternoon: one in the early afternoon between Nicolas and Edward, and one in the late afternoon, which GP and Lino also joined.
Participants: N.Bailly, E.Vanden Berghe
Notes
Workflow for matching names and taxa
The idea is to construct a workflow in which the user controls the selection and the order of the 'building blocks', or 'switches', of a taxon matching system. Each switch of the workflow should consist of, first, a transformation, then a match, then an annotation/separation of the input list into a part that is considered 'done' and another part that is sent on to the next switch (or to the end point of the matching workflow in the case of the last switch). In principle, the transformations should be 'graded': we start with the least radical transformations, while transformations in later switches alter the strings more and more radically.
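The switch idea can be sketched roughly as follows. This is only an illustration of the transformation / match / separation cycle, not an agreed design: the names Switch and run_workflow, the three example transformations and the tiny reference list are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Switch:
    """One building block: a transformation followed by an exact match."""
    label: str
    transform: Callable[[str], str]

    def run(self, names: List[str], reference: List[str]):
        # Apply the same transformation to the reference names and index them.
        index: Dict[str, List[str]] = {}
        for ref in reference:
            index.setdefault(self.transform(ref), []).append(ref)
        done: Dict[str, Tuple[str, List[str]]] = {}
        remaining: List[str] = []
        for name in names:
            hits = index.get(self.transform(name))
            if hits:
                done[name] = (self.label, hits)  # annotated as 'done'
            else:
                remaining.append(name)           # sent on to the next switch
        return done, remaining


def run_workflow(names, reference, switches):
    """Run the switches in the order chosen by the user."""
    matched, pending = {}, list(names)
    for switch in switches:
        done, pending = switch.run(pending, reference)
        matched.update(done)
    return matched, pending


def squash(s: str) -> str:
    """Lower-case and collapse whitespace."""
    return " ".join(s.lower().split())


# Graded transformations: least radical first (hypothetical selection).
switches = [
    Switch("exact", lambda s: s),
    Switch("case/space", squash),
    Switch("y->i", lambda s: squash(s).replace("y", "i")),
]
reference = ["Gadus morhua", "Thunnus thynnus"]
submitted = ["Gadus morhua", "gadus  MORHUA", "Thunnus thinnus", "Salmo salar"]
matched, unmatched = run_workflow(submitted, reference, switches)
```

Each matched name carries the label of the switch that caught it, so the user can see how radical a transformation was needed; the names still unmatched after the last switch reach the end point of the workflow.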
Also part of the workflow should be a preprocessing step - consisting e.g. only of Dima's or Fabio's parser; and a final processing step - e.g. defining what should be done with the results of the matching.
Also part of the workflow should be a 'test facility': running a workflow on 'known' misspellings (i.e. names for which we know what the correct spelling is), and using the results to calculate the performance of that particular workflow against that particular test data set.
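A minimal sketch of such a test facility is given below. The notes do not say which performance measure is meant; precision and recall are used here as an assumption, and the function evaluate, the toy workflow and the test names are all hypothetical.

```python
from typing import Callable, Dict, Optional


def evaluate(workflow: Callable[[str], Optional[str]],
             test_set: Dict[str, str]) -> Dict[str, float]:
    """Score a workflow against known misspellings.

    test_set maps each misspelled name to its known correct spelling;
    workflow returns the proposed correct name, or None if unmatched.
    """
    correct = wrong = missed = 0
    for misspelled, expected in test_set.items():
        proposed = workflow(misspelled)
        if proposed is None:
            missed += 1
        elif proposed == expected:
            correct += 1
        else:
            wrong += 1
    attempted = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "missed": missed,
        # Share of proposed matches that are right.
        "precision": correct / attempted if attempted else 0.0,
        # Share of all test names that were matched correctly.
        "recall": correct / len(test_set) if test_set else 0.0,
    }


# Toy workflow standing in for a real one: lower-case, then replace y by i.
reference = ["Gadus morhua", "Thunnus thynnus"]
index = {ref.lower().replace("y", "i"): ref for ref in reference}


def toy_workflow(name: str) -> Optional[str]:
    return index.get(name.lower().replace("y", "i"))


test_set = {
    "Thunnus thinnus": "Thunnus thynnus",
    "gadus morhua": "Gadus morhua",
    "Gadus morrhua": "Gadus morhua",  # not caught by this toy workflow
}
scores = evaluate(toy_workflow, test_set)
```

Running the same test set through differently ordered workflows would give comparable scores for each configuration.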
Elements and workflow:
1. Submitted file type: Unstructured text, Semi-structured name list, Structured name list.
2. Dima’s Parser: If the file type is unstructured text or semi-structured name list, apply Dima’s parser.
3. Reference files: CoL, WoRMS, CofF, FB, SLB. Possibility to select one or several or all.
4. Classification tree: Specify the taxon (taxa) in the reference files.
5. GSAy: Step by step approach: default selection of steps + possibility to remove/add some.
6. TaxaMatch fuzzy matching (Tony Rees): Removing or replacing letters. Selection of letters. Possibility to select genus and/or species.
7. Lexical distances (Tony, Fabio, Casey): Levenshtein, other? Selection of distances. Selection of threshold. Possibility to select genus and/or species.
8. Soundex (Fabio, Casey): Java function. Selection of threshold. Possibility to select genus and/or species.
9. Taxonomic disambiguation (Casey): Check if the matched name corresponds to one or several taxa.
10. Taxonomic resolution (4D4Life classification comparison): Check if the family possibly provided by the user corresponds to the family in the reference file. If not, use the 4D4Life tool.
11. Final unmatched names: visual check against the genera in the lowest taxon provided (usually the family), list of species names by genus, list of species by family, list of species by other taxon if provided.
If the taxon was restricted (step 4), propose extension to other taxa for unmatched names, and restart the process.
From Step 5 to 9, and within each of them, every matching event gives a set of matched names and a set of unmatched names. The matched names are proposed for display for visual control. The unmatched names are sent to the next step.
Step description
File type
Users can submit:
- unstructured texts where scientific names are cited inside the text (Unstructured type)
- semi-structured name lists, one name per line, but not atomized in Classification/Genus/Species/Author/Year
- structured list of names. Then it has to follow the proposed structure
Dima's Parser
In the case of unstructured or semi-structured input, the names in the file need to be parsed to get a structured file.
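To illustrate what 'atomizing' one line of a semi-structured list means, here is a deliberately naive stand-in. It is not Dima's parser and makes strong assumptions: the line is a simple binomial, optionally followed by an authorship string ending in a four-digit year. ParsedName and parse_name_line are hypothetical names, and the Classification field is left out.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    genus: str
    species: str
    author: Optional[str]
    year: Optional[str]


def parse_name_line(line: str) -> Optional[ParsedName]:
    """Split 'Genus species Author, Year' into its parts (naive sketch)."""
    tokens = line.strip().split()
    if len(tokens) < 2:
        return None  # not a binomial; a real parser would handle this
    genus, species = tokens[0], tokens[1]
    # Everything after the binomial is treated as authorship;
    # enclosing parentheses are dropped.
    rest = " ".join(tokens[2:]).strip("()")
    year_match = re.search(r"(\d{4})$", rest)
    if year_match:
        year = year_match.group(1)
        author = rest[:year_match.start()].rstrip(" ,")
    else:
        year = None
        author = rest
    return ParsedName(genus, species, author or None, year)


parsed = [parse_name_line(line) for line in (
    "Gadus morhua Linnaeus, 1758",
    "Thunnus thynnus (Linnaeus, 1758)",
    "Salmo trutta",
)]
```

Real name strings (subgenera, infraspecific ranks, hybrids, multiple authors) need a dedicated parser; this only shows the target structure.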
Classification tree
The user restricts the taxonomic framework in which names are matched, e.g., only fish names. This is done through a taxonomic tree. Several taxa may be selected (second priority option).
Nomenclatural and taxonomic authority files
The user selects which authority file the names are to be checked against: CoL, WoRMS, CofF, FB, possibly SLB (some names there may not yet be in WoRMS). Several authority files may be selected (second priority option).
GSAy
Progressively but smartly degraded genus and species strings are checked, e.g., after removing the gender agreement termination of specific epithets.
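As an illustration of one such degradation step, the sketch below strips a gender ending from the specific epithet before comparing. The list of endings and the minimum stem length are assumptions chosen for the example, not the actual GSAy rules, and the misspelled name is made up.

```python
# Assumed set of Latin gender-agreement endings (illustrative, not exhaustive).
GENDER_ENDINGS = ("us", "um", "is", "a", "e")


def strip_gender_ending(epithet: str) -> str:
    """Remove a trailing gender ending, keeping a stem of at least 3 letters."""
    for ending in sorted(GENDER_ENDINGS, key=len, reverse=True):
        if epithet.endswith(ending) and len(epithet) > len(ending) + 2:
            return epithet[: -len(ending)]
    return epithet


def degraded_key(name: str) -> str:
    """Genus kept as is, epithet reduced to its stem."""
    genus, epithet = name.lower().split()[:2]
    return f"{genus} {strip_gender_ending(epithet)}"


# The same degradation is applied to the submitted and the reference name.
same = degraded_key("Epinephelus marginata") == degraded_key("Epinephelus marginatus")
```

Because the ending is removed on both sides, a wrong gender agreement no longer prevents the match, while the genus still has to agree exactly.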
TaxaMatch fuzzy matching
The principle of fuzzy matching is that the same transformation is applied to the name that has to be checked, and the names of the reference list; the check for a match is done on the transformed names. As an example: a possible transformation is to replace all occurrences of 'y' with 'i' - that way a match would be found, even if the name is accidentally misspelled with an 'i' instead of a 'y'.
The set of substitutions should be under the control of the user; a sensible default set should be provided, probably based on Tony's experience, or from VLIZ (as fuzzy matching in WoRMS uses a similar procedure).
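The principle can be sketched as below. Only the 'y' to 'i' substitution comes from these notes; the second default ('ae' to 'e') and the function names are assumptions added for illustration, and the real default set would come from TaxaMatch or VLIZ.

```python
from typing import List, Sequence, Tuple

# Default substitutions: ("y", "i") is the example from the notes,
# ("ae", "e") is a hypothetical addition.
DEFAULT_SUBSTITUTIONS: Tuple[Tuple[str, str], ...] = (("y", "i"), ("ae", "e"))


def normalise(name: str,
              substitutions: Sequence[Tuple[str, str]] = DEFAULT_SUBSTITUTIONS) -> str:
    """Apply the user-selected substitutions, in order, to a lower-cased name."""
    key = name.lower()
    for old, new in substitutions:
        key = key.replace(old, new)
    return key


def fuzzy_lookup(name: str, reference: Sequence[str],
                 substitutions: Sequence[Tuple[str, str]] = DEFAULT_SUBSTITUTIONS) -> List[str]:
    """Return the reference names whose transformed form equals that of name."""
    key = normalise(name, substitutions)
    return [ref for ref in reference if normalise(ref, substitutions) == key]


reference = ["Myripristis murdjan", "Gadus morhua"]
hits = fuzzy_lookup("Miripristis murdjan", reference)
no_hits = fuzzy_lookup("Miripristis murdjan", reference, substitutions=())
```

Passing a different substitutions argument is how the user's own selection would replace the defaults; with an empty selection the lookup degrades to a case-insensitive exact match.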
Lexical distances
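No details were noted for this step yet. Purely as an illustration of the element listed above (Levenshtein distance, a threshold, and a choice of genus and/or species), a sketch could look like this. The function names, the default threshold of 2 and the rule that an unselected part must match exactly are assumptions.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def distance_match(name: str, reference: Sequence[str], threshold: int = 2,
                   parts: Sequence[str] = ("genus", "species")) -> List[Tuple[str, int]]:
    """Reference binomials within threshold edits on the selected parts."""
    genus, species = name.lower().split()[:2]
    hits = []
    for ref in reference:
        ref_genus, ref_species = ref.lower().split()[:2]
        total, ok = 0, True
        for part, mine, theirs in (("genus", genus, ref_genus),
                                   ("species", species, ref_species)):
            if part in parts:
                total += levenshtein(mine, theirs)
            elif mine != theirs:
                ok = False  # unselected parts must match exactly (assumption)
        if ok and total <= threshold:
            hits.append((ref, total))
    return sorted(hits, key=lambda hit: hit[1])


reference = ["Gadus morhua", "Gadus macrocephalus"]
both = distance_match("Gadus morrhua", reference, threshold=1)
genus_only = distance_match("Gadus morrhua", reference, threshold=1, parts=("genus",))
```

Other distances could be plugged in by replacing levenshtein with another function of the same shape.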
Soundex
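No details were noted for this step yet either. The element list mentions a Java function; for illustration only, here is a sketch of the standard American Soundex code in Python. How the 'threshold' mentioned in the list would apply to Soundex codes is not specified, so the sketch simply compares codes for equality.

```python
# Letter groups of American Soundex; vowels, h, w and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the four-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        digit = CODES.get(c, "")
        if digit and digit != last:
            code += digit
        last = digit  # a vowel resets last, so a repeated code counts again
    return (code + "000")[:4]


def sounds_alike(a: str, b: str) -> bool:
    """Compare two genus names or epithets by their Soundex codes."""
    return soundex(a) == soundex(b)


same_epithet = sounds_alike("thynnus", "thinnus")
```

As with the other steps, the comparison could be applied to the genus, the species epithet, or both.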
Taxonomic disambiguation
Taxonomic resolution
Unmatched names
Participants: N.Bailly, E.Vanden Berghe, GP Coro, P.Pagano