Ecosystem Approach Community of Practice: TaxonReconciliation

From D4Science Wiki

Taxon Reconciliation

The taxon matching facility will allow users in the biodiversity community to load their datasets describing species, and to reconcile the names and other descriptive features, using matching algorithms, against a pre-defined range of structured taxonomic name repositories. The output of the process will be identifiers of "mismatch" between the private dataset entries and entries in the remote repositories. The environment will be equipped with facilities where users can modify their taxonomic entries, persist the changes they have made, and publish their datasets to other e-infrastructure users. Subscribed e-mail addresses will be notified of when and which data have been reconciled.

BiOnym is the proposed VRE (Virtual Research Environment) that will deliver the name reconciliation, combining several community-developed reconciliation procedures such as the FAO Name tool, the FIN-developed TaxaMatch, and the OBIS R and plpgsql functions.

Progress

The progress can be followed here

Priority to CoP (Community of Practice)

List proposed solution priority following the iMarine Board priority setting criteria:
  • The taxon matching is aimed at the biodiversity community involved in reconciling differences in taxonomic description between datasets. However, several use cases have been identified beyond this target community. FAO already uses a very similar approach to entity mapping for its vessel registry, and is willing to contribute to, and exploit, functionality emerging in this product;
  • First level users (data owners) will be from FIN, IOC Unesco and FAO. Second level users (data managers) will be partners of FIN and IOC Unesco involved in maintaining and managing taxonomies. Third level users (consumers) will be potentially any biologist in need of access to detailed taxonomic information;
  • Potential for co-funding. The potential for co-funding can only be assessed after a prototype has been released (Q3 2013);
  • Structural allocation of resources. To be discussed;
  • The harmonization services are referred to in the DoW in T3.3, WP6 and WP9 and are thus highly relevant to the project objective;
  • Business Case 2, for support to the biodiversity community, needs reconciliation services in order to enable the EA-CoP (Community of Practice) to interpret observational data across reporting and monitoring systems;
  • The proposed product aims to re-use gCube components for data access and management. In addition, it will provide specific high-value computational services to the biodiversity data providers, enabling them to reduce a specific and demanding workload. These two aspects support the sustainability of the e-infrastructure by reducing development and operational costs to the EA-CoP;
  • The consistency and compatibility of the product with EA-CoP regulations and strategies (e.g. INSPIRE) has yet to be investigated;
  • The services re-use the components developed in the context of the Biodiversity Research Environment. They focus on increasing the usability of those components by adding matching algorithms between selected taxonomic entities, a means of persisting the results in a re-usable and re-accessible format, and sharing and notification services. The benefits to taxonomists are found in single-point data harmonization (eliminating double work and multiple error-prone data entry steps), data processing (e.g. matching large datasets is very demanding), and flexibility. The aim is to offer all users a biodiversity 'compatibility' tool that allows them to use other organizations' taxonomic data with trust: trust in the structure, the quality, and the completeness of imported data;

Parentage

The relation to existing CoP (Community of Practice) software and manual activity is evident:
  1. OBIS
  2. Tony Rees' TaxonMatcher.
  3. FAO Vessel registry tool.
Relation to D4S technologies; CNR to advise.

Productivity

Are the proposed measures effective?
Does it reduce a known workload?

Presentation

How must the component be delivered to users? (UI Design / on-line help / training material / support)

The community requires that several services are available for data source registration and communication:

  1. Reference data providers; taxonomic data repositories such as OBIS, GBIF, CoL, ....
  2. User management
  3. Data dissemination / feed-back services

For the actual reconciliation work, they expect to be able to step through the process in 5 steps.

  1. Data load; a set of Darwin Core (Archive) is loaded to a user account;
  2. Data reconciliation; select the target data-provider, the taxonomic rank from where to match down, boost / remove matching levels (e.g. boost Family level matching, ignore subspecies information)
  3. Data processing information
  4. Data review option with manual confirm / edit options
  5. Data sharing and feed back as a report to data owners
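
The options listed in step 2 (target data provider, starting rank, boosted and ignored matching levels) could be captured in a small configuration object. The following is only a minimal sketch under stated assumptions: the names `ReconciliationConfig`, `RANKS` and `match_score`, the rank order, and the weighting scheme are all hypothetical and are not part of any existing iMarine or gCube service.

```python
from dataclasses import dataclass, field

# Hypothetical rank order, highest to lowest (an assumption for illustration).
RANKS = ["kingdom", "phylum", "class", "order",
         "family", "genus", "species", "subspecies"]


@dataclass
class ReconciliationConfig:
    target_provider: str                          # e.g. "OBIS"
    start_rank: str                               # rank from where to match down
    boosts: dict = field(default_factory=dict)    # rank -> weight multiplier
    ignored: set = field(default_factory=set)     # ranks left out of the match

    def rank_weights(self):
        """Return the weight of every rank that takes part in the match."""
        start = RANKS.index(self.start_rank)
        return {
            rank: self.boosts.get(rank, 1.0)
            for rank in RANKS[start:]
            if rank not in self.ignored
        }


def match_score(record, candidate, config):
    """Weighted share of ranks on which two Darwin Core-like records agree."""
    weights = config.rank_weights()
    total = sum(weights.values())
    if total == 0:
        return 0.0
    agreed = sum(
        weight for rank, weight in weights.items()
        if record.get(rank) and record.get(rank) == candidate.get(rank)
    )
    return agreed / total


# Boost family-level matching and ignore subspecies information.
config = ReconciliationConfig("OBIS", "family",
                              boosts={"family": 2.0},
                              ignored={"subspecies"})
own = {"family": "Gadidae", "genus": "Gadus", "species": "Gadus morhua"}
remote = {"family": "Gadidae", "genus": "Gadus", "species": "Gadus macrocephalus"}
print(match_score(own, remote, config))  # 0.75
```

In this sketch the two records agree on family (weight 2) and genus (weight 1) but not on species (weight 1), so three of the four weight units agree. A real service would probably combine such rank weights with fuzzy name matching rather than exact string equality.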

Policy

Are there any policies available that describe data access and sharing?

Add link

Have the Copyright / attribution / metadata / legal aspects been addressed from a user and technology perspective?

Add link

Detailed description

Description: DRAFT. iMarine is in a unique position to improve the access to and maintenance of taxonomic data used in the biological sciences. It can offer access to large data repositories, and exploit the services of a large infrastructure to validate and analyze taxonomic data against these authoritative data sources. Taxonomic authority files are ..... They are living entities; their content is updated and expanded very regularly. They do not have a predefined or static structure. These authority files are complete only for trivially small groups. They are maintained by a taxon authority, who can delegate responsibility for inserting and updating data to data managers. Data management includes the upload of data and the validation of that data against a set of rules and existing data. Data may exist as single records, or be contained in larger Darwin Core files that are either uploaded manually or retrieved through a web service (WS).

In response to any completed update, a cascade of synchronization and update events should take place, first between the taxonomic authority files, and second between these files and other biodiversity datasets such as occurrence datasets. In addition, it may be required that reference data owners are notified of changes.
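
The ordering of this cascade (authority files first, then dependent datasets, then optional notification of owners) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the class `SyncCascade`, its attributes, and the event tuples are hypothetical and do not describe an existing iMarine component.

```python
from collections import defaultdict


class SyncCascade:
    """Two-stage propagation: authority files first, then dependent datasets."""

    def __init__(self):
        self.authority_files = []              # e.g. ["WoRMS", "CoL"]
        self.dependents = defaultdict(list)    # authority file -> datasets
        self.owners = {}                       # dataset or file -> contact

    def propagate(self, source, change):
        """Return the ordered list of (stage, target, change) events."""
        events = []
        # Stage 1: synchronize the other authority files with the source.
        peers = [a for a in self.authority_files if a != source]
        for authority in peers:
            events.append(("sync", authority, change))
        # Stage 2: update every dataset that depends on a touched authority file.
        seen = set()
        for authority in [source] + peers:
            for dataset in self.dependents[authority]:
                if dataset not in seen:
                    seen.add(dataset)
                    events.append(("update", dataset, change))
        # Stage 3 (optional): notify the registered owners of touched targets.
        for _stage, target, _change in list(events):
            if target in self.owners:
                events.append(("notify", self.owners[target], change))
        return events
```

The method only returns the planned events in order; executing them, handling failures, and deciding whether an update is "completed" are left open, as they are in the text above.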

Taxonomic names are the ‘common currency’ between data systems: they facilitate the exchange of data between systems and link information pertaining to the same entities. Up-to-date names allow scientists to extract the most recent and highest-quality data from global repositories such as OBIS and GBIF, which is crucial in the context of the Ecosystem Approach to Fisheries. Other biologists (i.e. those not working as taxonomists) can also benefit if the datasets they analyze are automatically maintained and updated, or can be validated against taxonomic authority files to eliminate ambiguity. They can thus be certain that they obtain the most precise information without additional effort on their part (which does not prevent them from using the data in a critical way).

The iMarine infrastructure offers for the first time the possibility to:

  1. synchronize the content of various marine taxonomic authority files, and
  2. reconcile various occurrence datasets commonly used in fisheries and biodiversity studies with those files.

The first task is to define, within the iMarine project, what the relevant datasets are, what the update workflow is, how to annotate and track updates, and how and whom to inform of any changes. In defining these workflows, it is important to make a distinction between ‘authoritative’ sources of taxonomic names (such as WoRMS or the Catalogue of Life), and ‘consumers’ of taxonomic names, such as FAO Fisheries, BOLD, GBIF or OBIS. The involvement of these authoritative data providers in defining these workflows is an especially strong point of iMarine. iMarine must ensure that providers of data receive feedback on the quality of their data, and suggestions for corrections. For example, if a new version of ‘Catalog of Fishes’ becomes available, it should be checked against other sources of fish names (FishBase directly, FishBase within the Catalogue of Life and within WoRMS) and a list of updates to these should be suggested; once this synchronization is finished, reconciliation should percolate to FAO, OBIS, GBIF, VertNet, FishNet2…

The second task is to develop, together with the technical teams within iMarine, the rules to be used in the reconciliation of names from different sources. These rules will be, in the first phase, specified in natural language, but examples of several rules using regular expressions are already available. Building the services to apply these rules will be the task of one of the technical teams; as soon as this engine is available, the system will be piloted on two groups: pelagic ostracods (approx. 200 species) and fish (approx. 20,000 species). For fish, Nicolas will act as expert to validate the work; for pelagic ostracods, collaboration will be sought with scientists at IOPAN.
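
To make the idea of reconciliation rules expressed as regular expressions more concrete, the sketch below keeps the rules as plain data and applies them with a small engine. The rules shown are invented for illustration only; they are not the rules already available to the project, and the names `RULES`, `normalize` and `reconcile` are hypothetical.

```python
import re

# Hypothetical normalization rules, kept as data so that they stay independent
# of the engine applying them: (description, pattern, replacement).
RULES = [
    ("collapse whitespace", r"\s+", " "),
    ("drop open-nomenclature qualifiers", r"\b(cf|aff)\.\s*", ""),
    ("drop trailing authorship in parentheses", r"\s*\([^)]*\d{4}\)\s*$", ""),
    ("drop trailing 'Author, year'", r"\s+[A-Z][\w.\-' ]*,\s*\d{4}\s*$", ""),
]


def normalize(name, rules=RULES):
    """Apply each rule in order and return the normalized name."""
    result = name.strip()
    for _description, pattern, replacement in rules:
        result = re.sub(pattern, replacement, result)
    return result.strip()


def reconcile(names_a, names_b, rules=RULES):
    """Split names_a into names matched in names_b and names left unmatched."""
    index = {normalize(n, rules): n for n in names_b}
    matched, mismatched = {}, []
    for name in names_a:
        key = normalize(name, rules)
        if key in index:
            matched[name] = index[key]
        else:
            mismatched.append(name)
    return matched, mismatched


matched, mismatched = reconcile(
    ["Gadus  morhua Linnaeus, 1758", "Gadus cf. morhua", "Salmo salar"],
    ["Gadus morhua", "Oncorhynchus mykiss (Walbaum, 1792)"],
)
print(mismatched)  # ['Salmo salar']
```

Because the rules are an ordinary list, a taxonomist could review or extend them without touching the engine. Whether a given rule is taxonomically appropriate (for example, discarding "cf." before matching) is exactly the kind of decision the experts would need to validate.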

The central services will be:

  1. Access to matching algorithms to reconcile taxon names from different sources. iMarine has had discussions with Tony Rees about his TaxaMatch algorithm, and with Alex Hardisty and Alec Gray from i4Life (Lino will circulate documentation on the latter, including a description of the implementation of their algorithm in iMarine). Also, Nicolas and Edward both have ample experience in this kind of matching, doing it in a semi-automated way. The original plan for the biodiversity cluster was to implement a rule-based system, where the rules are independent from the engine applying them; this could be done in collaboration with FAO, where expertise along these lines already exists. Clearly, many of the pieces of the puzzle already exist; it is a matter now of integrating those pieces.
  2. A central register of all names. Such a register has been created already by the Global Names Architecture (GNA). Permission will be sought to use this register, instead of re-creating one. Within the GNA list, ‘reconciliation groups’ could be defined: groups of names that our algorithms indicate are probably synonymous. This was always the intention with GNA, but due to lack of resources, progress with this activity has been slow. Both Edward and Nicolas are involved with the GNA, and will look for suitable contacts to define joint activities.
  3. A notification and subscription system to share reconciliation;
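
As an illustration of the ‘reconciliation groups’ mentioned in item 2, the sketch below clusters names that a similarity test links together. It is a sketch under loud assumptions: the similarity measure is Python's generic `difflib.SequenceMatcher`, used purely as a stand-in and not the TaxaMatch algorithm, and the function names and the default threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """Stand-in similarity test; NOT the TaxaMatch algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def reconciliation_groups(names, threshold=0.9):
    """Group names that pairwise similarity links together (single linkage)."""
    parent = {n: n for n in names}   # union-find forest; also de-duplicates

    def find(n):
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    unique = list(parent)
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if similar(a, b, threshold):
                parent[find(a)] = find(b)

    groups = {}
    for n in unique:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

The pairwise loop is quadratic in the number of names, so for a group the size of the fish pilot some form of pre-partitioning (for instance by genus) would probably be needed. Note also that string similarity finds likely spelling variants; true synonyms with unrelated spellings would have to come from the authority files themselves.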