10.09.2013 BiolDiv

Meeting notes: 10.09.2013 Afternoon

We had two meetings in the course of the afternoon: an early-afternoon meeting between Nicolas and Edward, and a late-afternoon meeting that GP and Lino also joined.


Workflow for matching names and taxa

Participants: N. Bailly, E. Vanden Berghe

Notes

The idea is to construct a workflow in which the user controls the selection and the order of the 'building blocks', or 'switches', of a taxon matching system. Each switch should consist of three steps: first a transformation, then a match, then an annotation that separates the input list into a part that is considered 'done' and a part that is sent on to the next switch (or, for the last switch, to the end point of the matching workflow). In principle, the transformations should be 'graded': the workflow starts with the least radical transformations, and later switches alter the strings more and more radically.
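The chain of switches could be sketched roughly as follows. This is only an illustration of the transform / match / separate idea under simple assumptions: the names `Switch` and `run_workflow` are hypothetical, matching is reduced to exact equality of transformed strings, and when two reference names collapse to the same transformed form the first one is kept (resolving such ties is left to taxonomic disambiguation).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Switch:
    """One building block: a transformation applied before an exact match."""
    label: str
    transform: Callable[[str], str]


def run_workflow(names: List[str], reference: List[str],
                 switches: List[Switch]) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Return (done, remaining).

    done maps each matched input name to (reference name, label of the switch
    that matched it); remaining lists the names no switch could match.
    """
    done: Dict[str, Tuple[str, str]] = {}
    remaining = list(names)
    for switch in switches:
        # Apply the same transformation to the reference list and index it.
        index: Dict[str, str] = {}
        for ref in reference:
            index.setdefault(switch.transform(ref), ref)
        still_unmatched = []
        for name in remaining:
            hit = index.get(switch.transform(name))
            if hit is None:
                still_unmatched.append(name)  # sent on to the next switch
            else:
                done[name] = (hit, switch.label)  # considered 'done'
        remaining = still_unmatched
    return done, remaining


def squash(s: str) -> str:
    """Lower-case and collapse runs of whitespace."""
    return " ".join(s.lower().split())


# Graded switches: least radical transformation first.
switches = [
    Switch("exact", lambda s: s),
    Switch("case and whitespace", squash),
    Switch("y to i", lambda s: squash(s).replace("y", "i")),
]
reference = ["Gadus morhua", "Hydrolagus affinis"]
submitted = ["Gadus morhua", "gadus  Morhua", "Hidrolagus affinis", "Solea solea"]
done, remaining = run_workflow(submitted, reference, switches)
```

Recording the label of the switch that produced each match keeps the information needed for the visual control of matched names, since a match found by a late, radical switch deserves more scrutiny than an exact one.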

Also part of the workflow should be a preprocessing step - consisting e.g. only of Dima's or Fabio's parser; and a final processing step - e.g. defining what should be done with the results of the matching.

Also part of the workflow should be a 'test facility': run a workflow on a test data set of 'known' misspellings (i.e. names for which we know the correct spelling), and use the results to calculate the performance of that particular workflow against that particular test data set.
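A minimal sketch of such a test facility is given below. The notes do not say which performance measures to use; precision and recall are an assumption here, and `score_workflow` and `toy_matcher` are hypothetical names standing in for a real workflow.

```python
def score_workflow(matcher, test_set):
    """Score a matcher against known misspellings.

    test_set maps each known misspelling to its correct name;
    matcher(name) returns a reference name, or None when nothing matches.
    """
    correct = wrong = unmatched = 0
    for misspelled, expected in test_set.items():
        result = matcher(misspelled)
        if result is None:
            unmatched += 1
        elif result == expected:
            correct += 1
        else:
            wrong += 1
    attempted = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "unmatched": unmatched,
        # Share of proposed matches that are right.
        "precision": correct / attempted if attempted else 0.0,
        # Share of all test names that were matched correctly.
        "recall": correct / len(test_set) if test_set else 0.0,
    }


REFERENCE = ["Gadus morhua", "Hydrolagus affinis"]


def toy_matcher(name):
    """Stand-in workflow: ignore case and treat 'y' and 'i' as the same letter."""
    key = name.lower().replace("y", "i")
    for ref in REFERENCE:
        if ref.lower().replace("y", "i") == key:
            return ref
    return None


test_set = {
    "gadus morhua": "Gadus morhua",
    "Hidrolagus affinis": "Hydrolagus affinis",
    "Gadus morrhua": "Gadus morhua",  # a doubled letter, not handled by toy_matcher
}
stats = score_workflow(toy_matcher, test_set)
```

Keeping wrong matches and unmatched names as separate counts matters: a more radical workflow will usually trade unmatched names for wrong ones, and only the first of these is caught by the later steps.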

Elements and workflow:

  1. Submitted file type: Unstructured text, Unstructured name list, Structured name list.
  2. Dima’s Parser: If file type is Text or Unstructured name, apply Dima’s parser.
  3. Reference files: CoL, WoRMS, CofF, FB, SLB. Possibility to select one or several or all.
  4. Classification tree: Specify the taxon (taxa) in the reference files.
  5. GSAy: Step by step approach: default selection of steps + possibility to remove/add some.
  6. TaxaMatch fuzzy matching (Tony Rees): Removing or replacing letters. Selection of letters. Possibility to select genus and/or species.
  7. Lexical distances (Tony, Fabio, Casey): Levenshtein, other? Selection of distances. Selection of threshold. Possibility to select genus and/or species.
  8. Soundex (Fabio, Casey): Java function. Selection of threshold. Possibility to select genus and/or species.
  9. Taxonomic disambiguation (Casey): Check if the matched name corresponds to one or several taxa.
  10. Taxonomic resolution (4D4Life classification comparison): Check if the family possibly provided by the user corresponds to the family in the reference file. If not, use the 4D4Life tool.
  11. For final unmatched names, visual check of genera in the lowest taxon provided (usually the family), list of species names by genus, list of species by family, list of species by other taxon if provided.

If the taxon was restricted (step 4), propose extension to other taxa for unmatched names, and restart the process.

From step 5 to step 9, and within each of these steps, each matching event gives a set of matched names and a set of unmatched names. The matched names are proposed for display for visual control. The unmatched names are sent to the next step.

File type

Users can submit:

  - unstructured texts in which scientific names are cited (Unstructured type);
  - semi-structured name lists, with one name per line, but not atomized into Classification/Genus/Species/Author/Year;
  - structured lists of names, which have to follow the proposed structure.

Dima's Parser

For unstructured or semi-structured input, the names in the file need to be parsed to obtain a structured file.

Classification tree

The user restricts the taxonomic framework within which the names are matched, e.g., only fish names. This is done through a taxonomic tree. Several taxa may be selected (second priority option).

Nomenclatural and taxonomic authority files

The user selects the authority file against which the names are to be checked: CoL, WoRMS, CofF, FB, possibly SLB (which may contain some names that are not yet in WoRMS). Several authority files may be selected (second priority option).

GSAy

Genus and species strings are degraded progressively but smartly, and the degraded strings are checked against the reference; one example of such a degradation is removing the gender agreement termination of specific epithets.
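As an illustration of one such degradation, the sketch below strips a gender ending from the specific epithet. It assumes a plain binomial without authorship, and the list of endings is a deliberately short assumption (first and second declension only); pairs such as -is/-e are not covered, and the function names are hypothetical.

```python
# Assumed, incomplete list of gender agreement endings.
GENDER_ENDINGS = ("us", "um", "a")


def strip_gender_ending(epithet):
    """Remove a trailing gender ending, keeping at least a three-letter stem."""
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending) and len(epithet) > len(ending) + 2:
            return epithet[: -len(ending)]
    return epithet


def degrade(name):
    """Degrade 'Genus epithet' by stripping the gender ending of the epithet."""
    genus, _, epithet = name.partition(" ")
    if not epithet:
        return name  # a bare genus name is left untouched
    return f"{genus} {strip_gender_ending(epithet)}"


# Two spellings that differ only in gender agreement degrade to the same key.
key_a = degrade("Pomatoschistus minutus")
key_b = degrade("Pomatoschistus minuta")
```

The degraded key is only used for comparison; the name reported back to the user should remain the spelling found in the reference file.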

TaxaMatch fuzzy matching

The principle of fuzzy matching is that the same transformation is applied to the name that has to be checked, and the names of the reference list; the check for a match is done on the transformed names. As an example: a possible transformation is to replace all occurrences of 'y' with 'i' - that way a match would be found, even if the name is accidentally misspelled with an 'i' instead of a 'y'.

The set of substitutions should be under the control of the user; a sensible default set should be provided, probably based on Tony's experience, or obtained from VLIZ (as fuzzy matching in WoRMS uses a similar procedure).
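A minimal sketch of substitution-based matching with a user-controlled substitution set follows. The default substitutions here are invented for illustration; they are not the rules used by TaxaMatch or WoRMS, and `normalise` and `fuzzy_match` are hypothetical names.

```python
# Hypothetical default set; a real default would come from Tony Rees or VLIZ.
DEFAULT_SUBSTITUTIONS = (("y", "i"), ("ae", "e"), ("ph", "f"))


def normalise(name, substitutions=DEFAULT_SUBSTITUTIONS):
    """Lower-case the name and apply each (old, new) substitution in order."""
    key = name.lower()
    for old, new in substitutions:
        key = key.replace(old, new)
    return key


def fuzzy_match(name, reference, substitutions=DEFAULT_SUBSTITUTIONS):
    """Return every reference name whose transformed form equals that of name."""
    wanted = normalise(name, substitutions)
    return [ref for ref in reference if normalise(ref, substitutions) == wanted]


reference = ["Hydrolagus affinis", "Phycis blennoides"]

with_defaults = fuzzy_match("Ficis blennoides", reference)
# The user switches off everything except the y/i substitution.
y_only = fuzzy_match("Ficis blennoides", reference, substitutions=(("y", "i"),))
```

Returning a list rather than a single name is deliberate: the more substitutions are switched on, the more likely it is that several reference names collapse to the same transformed string.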

Lexical distances

Levenshtein distance seems to be the most commonly used measure of lexical distance. It is also built into Fabio's SpeciMEn.jar. Any thresholds built into the matching scheme will have to take into account the length of the names being compared; Levenshtein distance is only of limited utility for short names.
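The length issue can be made concrete with a small sketch. The distance function is the standard dynamic-programming Levenshtein distance; dividing by the length of the longer string and the cut-off of 0.2 are assumptions chosen for illustration, not values agreed in the meeting.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def within_threshold(a, b, max_ratio=0.2):
    """Accept a pair when the distance is a small fraction of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= max_ratio


# One edit in a seven-letter epithet passes, one edit in a four-letter one does not.
long_pair = within_threshold("morhua", "morrhua")
short_pair = within_threshold("alba", "alta")
```

With a fixed absolute threshold of one edit, 'alba' and 'alta' would be treated as the same epithet, which is the limited utility for short names noted above.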

Soundex

Soundex is also built into Fabio's jar. Soundex and Levenshtein distance are widespread methodologies, and exist in several implementations. They should be easy to deploy within the iMarine infrastructure, with or without making use of Fabio's jar.
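For reference, a sketch of the classic American Soundex coding is given below, written independently of Fabio's jar, whose exact variant is not described in these notes. Letters outside the six coded groups are treated like vowels.

```python
# Consonant groups of American Soundex, mapped to the digits 1 to 6.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the code, so a repeat is coded again
    return (result + "000")[:4]


same_code = soundex("Gadus") == soundex("Gaddus")
```

Because the first letter is kept as-is, Soundex alone will not catch a misspelled initial letter; applying it separately to genus and species, as proposed in the list of elements, at least limits the damage to one part of the name.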

Taxonomic disambiguation

A name that matches exactly with two different names from the reference list poses a problem, as it is no longer trivial to assign it automatically to a single name in the reference list; the same holds for a name that has exactly the same non-zero Levenshtein (or other) distance to two or more different names from the reference list. In such cases, we have to look at the 'taxonomic' part of the comparison: is one of the spelling variations a known synonym of the other?
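The check described here could look roughly like the sketch below. It assumes the reference file can be reduced to a table mapping each synonym to its accepted name; the function name and the placeholder names ('Aus bus' and so on) are purely illustrative.

```python
def disambiguate(candidates, accepted_name_of):
    """Return the accepted name if all candidates denote one taxon, else None.

    accepted_name_of maps a synonym to its accepted name; names that are
    missing from the table are taken to be accepted names themselves.
    """
    taxa = {accepted_name_of.get(name, name) for name in candidates}
    return taxa.pop() if len(taxa) == 1 else None


# Placeholder names: 'Aus buus' is recorded as a synonym of 'Aus bus'.
accepted_name_of = {"Aus buus": "Aus bus"}

# Two tied candidates that are spelling variants of one taxon.
resolved = disambiguate(["Aus bus", "Aus buus"], accepted_name_of)
# Two tied candidates that are different taxa: left for visual control.
unresolved = disambiguate(["Aus bus", "Aus cus"], accepted_name_of)
```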

Taxonomic resolution

Taxon name matching only looks at strings, not at synonymy (with a possible exception as described in taxonomic disambiguation). After the lexical matching, there should be a step of taxonomic resolution: where is the name in the classification? Is it a valid name, and if not, what is the valid name?
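A sketch of this resolution step, under the assumption that each reference record carries its family and, for invalid names, a pointer to the valid name. The record layout, the function name and the placeholder names are hypothetical and do not reflect the structure of any of the authority files.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    name: str
    family: str
    valid_name: Optional[str] = None  # None means the name itself is valid


def resolve(name: str, records: Dict[str, Record]) -> Optional[Record]:
    """Follow valid-name pointers from name; return None if name is unknown."""
    record = records.get(name)
    seen = set()
    while record is not None and record.valid_name is not None:
        if record.name in seen:
            break  # guard against circular synonymy in the reference data
        seen.add(record.name)
        target = records.get(record.valid_name)
        if target is None:
            break  # the pointer leads outside the reference file
        record = target
    return record


records = {
    "Aus bus": Record("Aus bus", "Aidae"),
    "Aus buus": Record("Aus buus", "Aidae", valid_name="Aus bus"),
}
valid = resolve("Aus buus", records)
```

The family on the returned record is what would be compared with a family supplied by the user, as in step 10 of the list of elements.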

Unmatched names

At the end of the process, there may still be a series of unmatched names. There should be a post-processing step to define what has to be done with this remaining list. Also part of the post-processing could be the calculation of performance statistics in the case of a matching run on test data.

Late afternoon meeting

Participants: N. Bailly, E. Vanden Berghe, GP Coro, P. Pagano

Notes

Nicolas and Edward briefed the two others about the meeting earlier that afternoon. One of the urgent issues to be discussed is an abstract for a presentation at the TDWG meeting in Firenze.

GP stresses the importance of turning some of the activities with Edward into concrete results. It was argued that Edward should make sure there is ample documentation of the activities he undertook, so that the results of these activities can be translated into concrete functionality for the iMarine infrastructure.

For the TDWG meeting in Firenze, we have to produce a 500-word abstract before 25 September. Edward will start working on this, and describe the concept of the iMarine taxon name matching system, which was christened 'BiOnym' in a meeting also involving Anton and Nicolas some time ago. Before the TDWG meeting actually takes place, we should have a pretty detailed view of what BiOnym should look like. Concretely, this means that we want to have a detailed description of what a 'switch' of BiOnym should look like, so that CNR can implement it, and of what the decision criteria built into the switches should be.

A possible exercise in Species Distribution Mapping was discussed. This would involve generating zero-records (or absences) from specific categories of presence-only datasets. There seems to be quite a bit of discussion in the literature about the role of absence points, and about the best way to deal with presence-only data. Most of the elements we need for doing some 'experiments' are available in the infrastructure. This is possibly an activity with a fast return on investment. Concretely: Edward will provide queries to generate absence data from the OBIS data now available on the infrastructure; GP will look into implementing Maxent, and into what other steps are needed to conduct this exercise.
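The notes do not specify how the absences are to be derived, so the following is only one possible reading, sketched under a strong assumption: within the selected category of datasets, every sampling event recorded all species of the group that were present, so a species missing from an event can be taken as absent there. The record layout and the function name are hypothetical.

```python
def infer_absences(records, species):
    """Return the ids of sampling events in which species was not recorded.

    records is an iterable of (event_id, species_name) presence records.
    Assumption: each event lists every species of the group that was present.
    """
    species_by_event = {}
    for event_id, name in records:
        species_by_event.setdefault(event_id, set()).add(name)
    return sorted(event_id for event_id, names in species_by_event.items()
                  if species not in names)


presences = [
    ("e1", "Gadus morhua"),
    ("e1", "Solea solea"),
    ("e2", "Solea solea"),
    ("e3", "Gadus morhua"),
]
absences = infer_absences(presences, "Gadus morhua")
```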

The interpolation for the environmental layers was discussed once more. Edward would like to have a conference call with NODC/WDC people as soon as possible, so that he can leave this activity to others within iMarine.

Edward had to run for the train while discussions were in full swing. More later?