25.06.2013 Taxa Matching and Reconciliation

From D4Science Wiki

Agenda

Time: Tue, June 25, 2013, 14:00 - 16:00 CEST

  • Taxonomic Name Reconciliation
  • Environmental data

Participants

  • Anton Ellenbroek (FAO)
  • Fabio Fiorellato (FAO)
  • Pasquale Pagano (CNR)
  • Gianpaolo Coro (CNR)
  • Edward Vanden Berghe


Discussion

1. Taxonomic Name Reconciliation

Gianpaolo presented his proposal for an artificial-intelligence approach to name reconciliation, and Fabio presented his lexical matcher. We discussed how best to use the name parser developed by Dmitry Mozzherin (Dima). Dima’s parser could usefully be deployed as an input filter to Fabio’s application. Dima’s parser and other applications need further investigation, so that common ground between the efforts can be identified. Gianpaolo’s approach would be very new and should be compared, in terms of performance, with the more classical rule-based approaches. Gianpaolo has most of the elements of a system ready and would need two weeks to put it together; he hopes to do so in the near future. His approach also involves the Viterbi algorithm, which would be useful in the context for which Fabio is building his lexical matcher. There is plenty of opportunity for synergies.
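The "parser as input filter" idea can be illustrated with a minimal Python sketch. It is not Dima's parser or Fabio's matcher: `canonical_form` is a hypothetical, regex-based stand-in that only strips authorship and year, and `match_name` is a hypothetical matcher built on the standard-library `difflib` module. The 0.8 cutoff is an arbitrary assumption.

```python
import difflib
import re

# Hypothetical stand-in for a real name parser (NOT Dima's parser):
# keep the capitalised genus and any lowercase epithets that follow,
# and drop everything after them (authorship, year, brackets).
# Limitation: lowercase author particles such as "de" would be kept.
CANONICAL = re.compile(r"^([A-Z][a-z]+)((?:\s+[a-z]+)*)")


def canonical_form(raw_name):
    """Return the name without authorship, or the input if it does not parse."""
    cleaned = raw_name.strip()
    match = CANONICAL.match(cleaned)
    if match is None:
        return cleaned
    return (match.group(1) + match.group(2)).strip()


def match_name(raw_name, reference_names, cutoff=0.8):
    """Filter the input through the parser, then look for the closest reference name."""
    cleaned = canonical_form(raw_name)
    hits = difflib.get_close_matches(cleaned, reference_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

With a reference list containing "Gadus morhua", the call `match_name("Gadus morhau Linnaeus, 1758", reference)` first reduces the input to "Gadus morhau", so the authorship string no longer dilutes the similarity score computed by the matcher.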

Edward explained the plans for taxonomic name reconciliation within the contract. The idea is to identify a set of specific rules for calculating distances between names. For example, variation in the gender-specific ending of a species name is a common problem and should not contribute much to such a distance score. He will extract some of the logic behind the name-cleaning process in OBIS and translate it into natural-language rules, which will be shared on the wiki. Edward will also take a fresh look at the taxon name reconciliation page on the biodiversity cluster’s page and update it if necessary. The biology-specific rules might also contribute to the general lexical rules used by Fabio.
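As an illustration of such a rule, the following Python sketch discounts differences in the gender ending when scoring two binomial names. It is not the OBIS logic: the list of endings, the penalty of 0.1, and the function names are all assumptions made for the example, and the names used to exercise it are invented.

```python
# Illustrative, incomplete list of Latin gender endings (an assumption).
GENDER_ENDINGS = ("us", "um", "a", "is", "e")


def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def split_gender_ending(epithet):
    """Split an epithet into (stem, ending); the stem keeps at least 3 letters."""
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending) and len(epithet) >= len(ending) + 3:
            return epithet[: -len(ending)], ending
    return epithet, ""


def name_distance(name_a, name_b, gender_penalty=0.1):
    """Edit distance on genus and epithet stem, plus a small penalty
    when only the gender ending of the epithet differs."""
    genus_a, _, epithet_a = name_a.strip().partition(" ")
    genus_b, _, epithet_b = name_b.strip().partition(" ")
    stem_a, ending_a = split_gender_ending(epithet_a)
    stem_b, ending_b = split_gender_ending(epithet_b)
    distance = levenshtein(genus_a, genus_b) + levenshtein(stem_a, stem_b)
    if ending_a != ending_b:
        distance += gender_penalty
    return float(distance)
```

Under these assumptions the invented pair "Solea alba" / "Solea albus" scores 0.1, whereas "Solea alba" / "Solea alda" scores 1.0, even though a plain edit distance would rank the gender variant as the more distant of the two.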

To allow for testing and comparing different systems, Edward will compile a list of common misspellings. It is also possible to extract any number of misspellings ‘from the wild’, that is, from lists received from data contributors to OBIS, GBIF, FishBase…
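A comparison of this kind could be scripted along the following lines. This is only a sketch: the two matchers are hypothetical baselines (exact lookup and a `difflib`-based fuzzy lookup), not any of the systems discussed, and the test list is assumed to be a sequence of (misspelling, expected name) pairs.

```python
import difflib


def exact_matcher(name, reference_names):
    """Baseline: accept a name only if it appears verbatim in the reference list."""
    return name if name in reference_names else None


def fuzzy_matcher(name, reference_names, cutoff=0.8):
    """Baseline: closest reference name by difflib similarity ratio (cutoff is an assumption)."""
    hits = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def accuracy(matcher, test_pairs, reference_names):
    """Fraction of (misspelling, expected) pairs that the matcher resolves correctly."""
    if not test_pairs:
        return 0.0
    correct = sum(
        1
        for misspelling, expected in test_pairs
        if matcher(misspelling, reference_names) == expected
    )
    return correct / len(test_pairs)
```

Any system exposed as a function of the form `matcher(name, reference_names)` can be passed to `accuracy`, so the rule-based and the AI-based approaches could be scored on the same list of misspellings.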

The second part of taxonomic name reconciliation is the definition of a workflow that responds to new data becoming available in the ecosystem of interlinked information systems. It was agreed that it would be premature to look for large-scale collaboration on this. It is better first to produce a solid draft of proposed best practices, and only then to look for wide distribution and possible implementation. For now, the discussion will be between Nicolas, Anton and Edward, and possibly Yde de Jong if he is interested.

Pending issues: explore the BioVeL site further; register iMarine services on https://www.biodiversitycatalogue.org/; check with Lino on the status of IRMNG.

2. Environmental data

Gianpaolo and Edward will discuss this with Terradue the next morning.

Originally, CRIA was to play a role in investigating how tighter coupling between environmental and distribution data affects the quality of the predictions from environmental envelope modeling. It is unclear whether this will happen now that OBIS has withdrawn from this part of the project. Edward suggests sharing the terms of reference of his contract with Vanderlei, in the hope of identifying areas of common interest. Before he does so, this should be discussed with Nicolas.

Proposals

  • Proposed challenges - What do we expect to be useful? What are the benefits of exploiting a species name parser?
  • Available datasets - Which datasets are available for exploitation?

References

The following links give access to the sample apps shown by Fabio during the discussion: