9.12.2013 BiolDiv
Meeting 9 December 2013, 11:00 am
Google Hangout
Present: Nicolas, Casey, GP, Fabio, Anton, Edward
Notes
Training sessions in Paris
GP briefed us on his activities in Paris. He held two training sessions on the use of iMarine infrastructure.
The first session was attended mainly by taxonomists, including scientists from the museum. The BiOnym system was presented. Participants clearly preferred the 'simple' interface; they were able to use it and to work with names from their favourite groups. An official report on their work will follow later.
The second session was attended exclusively by students of the University of Paris. Their interest was mainly in oceanography, and BiOnym was not really presented.
In general the sessions were very successful; the iMarine infrastructure also behaved very well, and held up under the very intense use during the sessions.
Experiences should go into the validation report; to be discussed separately between GP and Anton.
Development of BiOnym
Edward developed scripts to generate test data sets based on OBIS and on WoRMS. The OBIS test data are the misspellings as originally received from the OBIS data providers, but now linked to the correctly spelled names in WoRMS rather than to the taxonomic table of OBIS. This should enable later testing of OBIS names against WoRMS, using the web service available from VLIZ's servers. The second type of test data takes WoRMS names and introduces misspellings by random character substitutions. A further expansion of this type of test data would be to introduce random insertions and/or deletions. For the scripts to be used, a version of the OBIS database should be available from the iMarine infrastructure and accessible from the machine running RStudio; Edward made available a back-up file of an earlier version of OBIS that contains all relevant schemata/tables. GP will look into having this restored on the infrastructure.
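To illustrate the second type of test data, here is a minimal Python sketch of generating misspellings by random single-character edits. It is not Edward's script; the function names (misspell, make_test_pairs), the restriction to lowercase ASCII letters and the example names are assumptions made for illustration only.

```python
import random
import string

# Hypothetical operation names; the notes only mention substitutions so far,
# with insertions and deletions as a possible extension.
VALID_OPS = ("substitute", "insert", "delete")


def misspell(name, n_edits=1, ops=("substitute",), rng=None):
    """Return a copy of `name` with `n_edits` random single-character edits."""
    unknown = set(ops) - set(VALID_OPS)
    if unknown:
        raise ValueError(f"unknown edit operations: {sorted(unknown)}")
    if rng is None:
        rng = random.Random()
    chars = list(name)
    for _ in range(n_edits):
        op = rng.choice(ops)
        if op == "insert":
            # An insertion may land before, between or after existing characters.
            pos = rng.randrange(len(chars) + 1)
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        elif chars:  # substitution and deletion need at least one character
            pos = rng.randrange(len(chars))
            if op == "delete":
                del chars[pos]
            else:
                # Pick a letter that differs from the current character,
                # so that every substitution really changes the name.
                alternatives = [c for c in string.ascii_lowercase
                                if c != chars[pos].lower()]
                chars[pos] = rng.choice(alternatives)
    return "".join(chars)


def make_test_pairs(names, **kwargs):
    """Build (misspelled name, correct name) pairs for a list of names."""
    return [(misspell(name, **kwargs), name) for name in names]


if __name__ == "__main__":
    rng = random.Random(2013)  # fixed seed so a test set can be regenerated
    for wrong, right in make_test_pairs(["Gadus morhua", "Solea solea"],
                                        ops=VALID_OPS, rng=rng):
        print(f"{wrong!r} <- {right!r}")
```

Keeping the correct name next to each generated misspelling means the matcher output can later be scored against a known answer.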
Nicolas will make available the misspellings of fish names from GBIF. [from last meeting: Nicolas will contact Aaike De Wever to check whether the taxonomy of FADA can be made available as well. Ideally the format would be DwCA.]
Fabio briefed us on progress with YASMEEN. There were some issues with generating the Taxonomic Authority Files; these have been resolved. All TAFs have now been made available on the infrastructure.
An extra parameter was introduced ("partitioning accuracy", a non-negative integer value), which the user can set for each run of the matching process; it is defined as the maximum absolute difference in length between name strings and is applied to input / reference genus, species and scientific names. If a pair of input / reference data exceeds the partitioning accuracy threshold, no further computation is performed and the reference data will not appear as a possible matching candidate for the corresponding input data in the output results. Some experiments were run; as expected, there is a trade-off between accuracy and the time it takes to complete the calculations. Fabio shared the results of his experiments, available from http://goo.gl/hmzC46
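The length-based pre-filter described above can be sketched in a few lines of Python. This is only an illustration of the definition given in these notes, not YASMEEN code; the function names are hypothetical, and the sketch applies the check to whole name strings only, whereas the notes say it is also applied to genus and species parts.

```python
def within_partition(input_name, reference_name, partitioning_accuracy):
    """True if the two names differ in length by at most the threshold."""
    if partitioning_accuracy < 0:
        raise ValueError("partitioning accuracy must be a non-negative integer")
    return abs(len(input_name) - len(reference_name)) <= partitioning_accuracy


def candidate_references(input_name, reference_names, partitioning_accuracy):
    """Keep only the reference names that pass the length check.

    Pairs that exceed the threshold are dropped before any (more expensive)
    string comparison would be run on them.
    """
    return [ref for ref in reference_names
            if within_partition(input_name, ref, partitioning_accuracy)]


if __name__ == "__main__":
    references = ["Gadus morhua", "Gadus morhuas", "Gadus", "Gadus macrocephalus"]
    for threshold in (0, 1, 7):
        kept = candidate_references("Gadus morhua", references, threshold)
        print(threshold, kept)
```

A small threshold discards more reference names up front and so saves time, at the risk of dropping a correct candidate whose length differs by more than the threshold; that is the trade-off noted above.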
Further work on YASMEEN involved improving the way work is divided over parallel processes: the overall memory footprint has been greatly reduced, and improvements have also been made in the way reference data are streamed from a local or remote URL. The TaxaMatch matchlet has been extended and now returns fuzzy scores (falling in the [0.0...1.0] range) instead of boolean { 0.0, 1.0 } scores. The next step is to check whether Casey's implementation of the GSAy algorithm as a YASMEEN matchlet gives the same result as the YASMEEN default GSAy implementation (which maps 1:1 the GSAy scores included by Nicolas in the TDWG presentation).
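A check of whether two implementations return the same scores could be set up along the following lines. This Python sketch is an assumption-laden illustration: the two scorers are stand-ins (an exact-match boolean score and the similarity ratio from Python's difflib), not the TaxaMatch or GSAy algorithms, and the function names and tolerance are hypothetical.

```python
from difflib import SequenceMatcher


def boolean_score(input_name, reference_name):
    """Stand-in boolean scorer: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if input_name == reference_name else 0.0


def fuzzy_score(input_name, reference_name):
    """Stand-in fuzzy scorer in the [0.0, 1.0] range (difflib ratio)."""
    return SequenceMatcher(None, input_name, reference_name).ratio()


def disagreements(score_a, score_b, pairs, tolerance=1e-9):
    """List the (input, reference) pairs on which two scorers differ.

    Each entry is (input, reference, score_a, score_b); an empty list means
    the two scorers agree on every pair within the tolerance.
    """
    differing = []
    for input_name, reference_name in pairs:
        a = score_a(input_name, reference_name)
        b = score_b(input_name, reference_name)
        if abs(a - b) > tolerance:
            differing.append((input_name, reference_name, a, b))
    return differing


if __name__ == "__main__":
    pairs = [("Gadus morhua", "Gadus morhua"),
             ("Gadus morhau", "Gadus morhua")]
    for row in disagreements(boolean_score, fuzzy_score, pairs):
        print(row)
```

Running the same comparison with two real implementations in place of the stand-ins, over one of the test data sets, would give a list of the name pairs that need a closer look.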
Casey and Nicolas have studied other interfaces for taxon matching. The elements of the interface are listed in this document: http://goo.gl/Li2Ika. Work on a user interface for iMarine's taxon matcher will start in the coming days, and a first version should be ready by Thursday this week. GP passed on the link to the tutorial on building interfaces for the statistical manager: https://gcube.wiki.gcube-system.org/gcube/index.php/Statistical_Manager_Tutorial.
Next meeting
Thursday 9 December 2013, 11 am. Unfortunately, GP will not be available.