28.02.2013 Biodiversity cluster

MEETING NOTES; CALL ON ACTIVITIES PLANNING OF THE BIODIVERSITY CLUSTER

Date: 20 & 28 Feb. 2013 13.45 – 16.30 CET

Topics

Occurrence data cleaning
Name reconciliation

Participants: CRIA: D.A.L. Canhos, A. Marino; UNESCO: W. Appeltans; FIN: N. Bailly; FAO: A. Ellenbroek (part).

NOTES

DATA CLEANING TOOLS FOR SPECIES OCCURRENCE DATA

The discussion focused first on whether iMarine should provide a data cleaning service, or whether it should focus on developing filters that let users of the Virtual Research Environments (end of the pipeline) select data that meet their needs. Ward informed the group that, in the past, a lot of data quality control (QC) work was done at the OBIS project office (OBIS tier 1 node). At the last OBIS Steering Group meeting, it was decided that procedures and tools will be developed to do the QC at the level of the OBIS nodes and data providers (OBIS tier 2 and 3 nodes). Setting up data cleaning tools for OBIS in the iMarine infrastructure has not yet been discussed within the OBIS Steering Group. Some OBIS nodes have their own procedures (manual or semi-automatic) and some already have advanced online QC reporting tools. VLIZ is currently building data validation tools for EurOBIS, and because OBIS and EurOBIS run on the same servers in Oostende, implementing the EurOBIS data validation tools for OBIS makes more sense than developing new tools in a different environment (and VLIZ has substantial permanent funding for EurOBIS). Data validation tools will therefore be available to these providers and OBIS nodes. The following issue was then debated: is it realistic to implement these tools in the iMarine infrastructure?

If tools are developed in the infrastructure, what will happen if the infrastructure stops running, e.g., after the end of the funded project?
Do partners and users of iMarine intend to use other species occurrence data than those from OBIS and GBIF, which would require cleaning? EurOBIS is currently harvesting all marine data in GBIF that is not yet in OBIS.
It was acknowledged that some tools will require high-performance computing, such as searching for duplicates and outliers in several tens or hundreds of millions of records. For such applications, a service provided by iMarine may be of interest.
What is the project timeline (iMarine only runs for one more year)?
The OBIS project will have a programmer from May 2013, but he has little experience in Java.
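
The duplicate and outlier searches mentioned in the list above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not a tool discussed in the call: the record layout (id, scientific name, latitude, longitude, date), the rounding precision, and the median-based outlier rule are all hypothetical choices, and a real run over hundreds of millions of records would need a distributed implementation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical record layout (an assumption for this sketch):
# (record_id, scientific_name, latitude, longitude, date)


def find_duplicates(records):
    """Group record ids that share name, rounded position and date."""
    groups = defaultdict(list)
    for rec_id, name, lat, lon, date in records:
        # Normalise the name and round coordinates to 4 decimals (assumed precision).
        key = (name.strip().lower(), round(lat, 4), round(lon, 4), date)
        groups[key].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def find_latitude_outliers(records, threshold=5.0):
    """Flag records whose latitude lies far from the median for the same name.

    Distance is measured in units of the median absolute deviation (MAD);
    the threshold of 5 is an arbitrary illustrative value.
    """
    by_name = defaultdict(list)
    for rec_id, name, lat, lon, date in records:
        by_name[name.strip().lower()].append((rec_id, lat))

    outliers = []
    for items in by_name.values():
        lats = [lat for _, lat in items]
        med = median(lats)
        mad = median(abs(lat - med) for lat in lats)
        if mad == 0:
            continue  # no spread to compare against
        for rec_id, lat in items:
            if abs(lat - med) / mad > threshold:
                outliers.append(rec_id)
    return outliers
```

Both functions hold all records in memory, which is exactly the part that would have to be replaced by a heavier computing service at OBIS scale.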

In summary, the question remains open whether the iMarine infrastructure is a place for data management, or only a place where scientists select data for their analyses and models. In the latter case, what needs to be developed is a powerful filtering tool rather than a data cleaning/quality reporting tool for data managers. The current Biodiversity Research VRE should then be better linked with the OBIS interface and/or strongly improved (e.g., search on higher taxa, datasets, geographic regions).
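
As a rough illustration of the kind of filtering meant here (search on higher taxa, datasets, geographic regions), the sketch below selects occurrence records by family, dataset and bounding box. The `Occurrence` fields and the `filter_occurrences` function are hypothetical names introduced for this example; they do not describe the existing VRE or the OBIS interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    # Hypothetical fields, chosen only for this sketch.
    scientific_name: str
    family: str
    dataset: str
    latitude: float
    longitude: float


def filter_occurrences(records, family=None, dataset=None, bbox=None):
    """Return the records that satisfy every criterion that is not None.

    bbox is (min_lat, min_lon, max_lat, max_lon). Boxes crossing the
    antimeridian are not handled in this simple version.
    """
    selected = []
    for rec in records:
        if family is not None and rec.family.lower() != family.lower():
            continue
        if dataset is not None and rec.dataset != dataset:
            continue
        if bbox is not None:
            min_lat, min_lon, max_lat, max_lon = bbox
            inside = (min_lat <= rec.latitude <= max_lat
                      and min_lon <= rec.longitude <= max_lon)
            if not inside:
                continue
        selected.append(rec)
    return selected
```

The criteria are combined with a logical AND, so a call such as `filter_occurrences(records, family="Gadidae", bbox=(50, -10, 70, 15))` keeps only records of that family inside the box.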

See below for further thinking and decisions.

Ward and Dora then presented, in turn, the quality control requirements that OBIS is planning for the future and the data cleaning tools developed by CRIA.

1. OBIS (link to powerpoint)

OBIS will put more emphasis on quality control and on standardizing data before it is uploaded to the database. Species names that are not recognized by WoRMS/ITIS/IRMNG/COL will not be deleted but will not be displayed on the portal. Problematic records will be sent back to providers for correction. This change of policy is recent.
Checking names: The process is semi-automatic, and eventually, human scrutiny is needed (which is the case in all biodiversity information systems).
Geographic Quality Control: specifications are still being worked on, and some of the tools have not yet been developed.
Authority files: Some quality controls could possibly use species environmental parameters (e.g., those from FishBase for fishes), checked against the World Ocean Atlas.

CRIA (http://splink.cria.org.br/dc/index?&system=&colecao=OBIS_BR&setlang=en)

CRIA developed a set of applications that together produce a data cleaning report, published online for both curators (to correct possible errors and to standardize data) and users (to assess the quality of the data of each provider).
The report presents the analysis in 5 groups: (i) a short summary giving the total number of records, the number of georeferenced and repeated records, and the date of last update; (ii) taxonomic data; (iii) date; (iv) locality data; and (v) suggestions for blank fields.
Tools intensively use visualization through mapping.
Most of the procedures used or to be used by OBIS are already implemented, with the notable exception of those requiring heavy computing.
The tools are up and running in the Brazilian context, particularly for the Brazilian OBIS node.
All data cleaning tools were developed within the speciesLink environment, and there is no web service interface for external access.
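
To make the first and last groups of such a report concrete, here is a minimal sketch that counts total, georeferenced and repeated records and the blank values per field. It is not CRIA's implementation: the dictionary-based record layout, the field names, and the definition of "repeated" (an exact match on all listed fields) are assumptions made for illustration.

```python
from collections import Counter


def summarize(records, fields):
    """Build a small summary for a list of record dicts.

    fields lists the (hypothetical) field names to compare and to
    check for blanks; latitude and longitude are assumed to be among
    the keys used for georeferencing.
    """
    total = len(records)
    georeferenced = sum(
        1 for r in records
        if r.get("latitude") is not None and r.get("longitude") is not None
    )
    # A record counts as repeated when an identical record already exists.
    seen = Counter(tuple(r.get(f) for f in fields) for r in records)
    repeated = sum(n - 1 for n in seen.values() if n > 1)
    # Blank means missing, None, or an empty string.
    blank = {
        f: sum(1 for r in records if r.get(f) in (None, ""))
        for f in fields
    }
    return {
        "total": total,
        "georeferenced": georeferenced,
        "repeated": repeated,
        "blank_fields": blank,
    }
```

A curator could read the `blank_fields` counts to see which fields most need completing, while a user could use the other counts to judge a provider's data.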

The discussion that followed brought out something we had not realized: CRIA performs the data management for the Brazilian OBIS node, which is managed by another institution in São Paulo. Hence, there is clear potential for a stronger collaboration between the OBIS project office and CRIA, because the tools developed by CRIA for the OBIS node in Brazil could also serve other OBIS nodes.

Since January 2013, the OBIS servers in Oostende have been a node of the gCUBE network. If developments could be done on the servers in Oostende, the question of sustainability becomes less pertinent. The tools can run in a VRE disconnected from the rest of the infrastructure, and if the code is open source, the risk of losing all the work is minimal (as we did for the AquaMaps VRE at GEOMAR).

It is still under discussion whether this will conflict with the developments of EurOBIS. A possible scenario is that OBIS, EurOBIS (VLIZ) and CRIA work together on developing this. An advantage of this scenario is that gCUBE could provide the computing resources for the heavier algorithms.

Ward will discuss this with VLIZ and the OBIS technical support team so that a decision can be taken, if possible, next week or the week after.

Tasks

In the meantime, to avoid further delays, two tasks are to be performed as soon as possible, deadlines to be decided next week:

CRIA will provide the specifications of the speciesLink cleaning tools as planned (a description of the functionalities, without detailed technical specifications), but now clearly in the context of OBIS-BR. The cluster will check these specifications for possible missing functionalities and complete them with the OBIS plans and Gianpaolo's work presented in November.
In particular, CRIA will flag the ones that are potentially highly demanding in terms of computing capacity.
OBIS will provide the specifications of the filtering tool if we decide to move in that direction.

Whatever is decided will be implemented by CNR, with the help of FIN programming time. Further discussion is needed with UNESCO on the programming time they could allocate.

2. NAME RECONCILIATION

Discussed briefly, mainly between Ward and Nicolas after FAO and CRIA had left the discussion. Nicolas reported on Casey's progress so far. The full algorithm used by WoRMS, based on Tony Rees's Taxamatch tool (CSIRO, Australia), is available online:

TAXAMATCH fuzzy matching algorithm by Tony Rees: http://www.cmar.csiro.au/datacentre/taxamatch.htm

PHP/MySql port of TAXAMATCH by Michael Giddens:

Scientific Names Parser by Dmitry Mozzherin:

https://github.com/GlobalNamesArchitecture/biodiversity
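
For readers unfamiliar with fuzzy name matching, the general idea can be sketched with Python's standard `difflib` module. This is explicitly not the TAXAMATCH algorithm, which applies taxonomy-specific rules; it is a generic string-similarity stand-in. The `reconcile` function name and the 0.85 cutoff are assumptions made for the example.

```python
import difflib


def reconcile(name, reference_names, cutoff=0.85):
    """Return (best_reference_name, similarity) or (None, 0.0).

    Generic string similarity only; this does not reproduce TAXAMATCH.
    """
    # Map lower-cased reference names back to their original spelling.
    normalized = {ref.lower(): ref for ref in reference_names}
    # Lower-case the query and collapse repeated whitespace.
    query = " ".join(name.lower().split())
    candidates = difflib.get_close_matches(
        query, list(normalized), n=1, cutoff=cutoff
    )
    if not candidates:
        return None, 0.0
    best = candidates[0]
    score = difflib.SequenceMatcher(None, query, best).ratio()
    return normalized[best], score
```

With a reference list containing "Gadus morhua", the misspelling "Gadus morhau" is matched to it, while an unrelated name returns no match. As the notes say for name checking in general, near matches still need human scrutiny.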

Task:

Casey and Nicolas to study it asap next week for implementation in the infrastructure.
