Difference between revisions of "19.09.2013 BiolDiv"

Revision as of 09:35, 26 September 2013

Meeting 19 September 2013, 12 noon

Participants: Anton, Lino, GP, Fabio, Edward

Notes

The main topic of the meeting was BiOnym, and what should be done at short notice to have a working demonstration version of the system, which is still at the concept stage. In BiOnym as originally envisaged, the user can choose between different reference lists against which to resolve their names. Some of these are reference lists that have already been uploaded to the iMarine infrastructure (CoL, WoRMS, ITIS, FishBase…); in the original concept the user would also be able to upload their own reference list. It was decided to postpone the uploading of user-defined reference lists.

Reference lists can be accessed either on the fly through web services or from a cache stored in the iMarine infrastructure. Using a reference list in BiOnym would probably involve some preparation, such as making sure its format is understood by our matching process and calculating extra fields such as Soundex codes. For technical reasons, caching the reference lists and creating these extra fields in a ‘pre-processing’ step might be the best solution. This pre-processing could be defined as a separate activity (service? tool?) in the infrastructure.
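As an illustration of the kind of extra field meant here, the following is a minimal sketch of a pre-processing step that adds Soundex codes to each entry of a reference list. It uses the standard American Soundex rules; the function names, the `genus`/`epithet` keys and the output field names are assumptions made for this example and are not taken from BiOnym or from Fabio's tool.

```python
# Letter groups of the standard American Soundex code.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the run, so a repeated code counts again
    return (code + "000")[:4]


def preprocess(reference_rows):
    """Return copies of the rows with hypothetical extra Soundex fields added."""
    prepared = []
    for row in reference_rows:
        extended = dict(row)
        extended["genus_soundex"] = soundex(row["genus"])
        extended["epithet_soundex"] = soundex(row["epithet"])
        prepared.append(extended)
    return prepared
```

Because the codes are computed once and stored with the cached list, a matcher can compare the code of an input name against the stored codes instead of recomputing them for every query; for example, the misspelling 'morrhua' receives the same code as 'morhua'.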

Darwin Core Archive (DwCA) was discussed as a possible format for the pre-processed reference lists. The core of a DwCA is a CSV text file with the data; other files document the exact nature of the columns in the data file. Fabio’s SpeciMEn.jar works with reference lists that are bundled in the .jar file, using a CSV format to store the data and the extra fields.
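To make the CSV core concrete, here is a small sketch that reads such a data file into rows keyed by column name. The column headers are Darwin Core terms, but the choice of columns, the two sample rows and the helper names are assumptions for illustration only; in a real archive the accompanying descriptor files determine which file is the core and what each column means.

```python
import csv
import io

# Illustrative core data file; a real archive would supply this as a file.
SAMPLE_CORE = """taxonID,scientificName,scientificNameAuthorship,taxonRank
1,Gadus morhua,"Linnaeus, 1758",species
2,Solea solea,"(Linnaeus, 1758)",species
"""


def read_core(text):
    """Parse the CSV core into a list of dicts, one per taxon."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by_name(rows):
    """Build a lower-cased lookup from scientific name to its row."""
    return {row["scientificName"].lower(): row for row in rows}
```

Extra pre-computed fields, such as phonetic codes, could then be appended as further columns of the same file, which is close to how the CSV inside SpeciMEn.jar is described above.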

We need a technical discussion about exactly how to implement Fabio’s matching tool in the iMarine infrastructure. Fabio is happy to assist with whichever solution is most convenient for the infrastructure.

The overall structure of the application was discussed. The whole process depends on the availability of the reference lists (see above). The process of ‘matching’ would start with the definition of the workflow. In the initialization step, the end user defines which ‘switches’ (Edward’s term) or ‘matchers’ (Fabio’s term) are applied, in which order and with what settings. The user chooses a reference list (or, in future, uploads their own reference list and applies the reference-list pre-processing to it) and uploads the data to be matched or tested. The test data go to a pre-processing step, consisting of a ‘cleaner’ (stripping extraneous information such as ‘cfr.’, ‘aff.’; harmonizing ‘var’, ‘v.’… to ‘var.’…) and a ‘parser’ (atomizing the complete string into its constituents, such as genus name, specific epithet, authority, author year…). From there the data go to the series of switches as defined by the user while initializing the process. In each switch, some names are sent to the post-processing, while the others are sent on to the next switch. For the post-processing, several options should be defined, such as calculating performance statistics in an experimental run of the matching system; presenting the end user with a list of choices to be made where a best match could not be made automatically; adding classification, rank and synonymy information to the matched names…
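The workflow described above can be sketched roughly as follows. This is only an outline under stated assumptions: the regular expressions, the function names, the use of Python's `difflib` for the fuzzy switch and the threshold value are all invented for the example and do not represent the actual BiOnym or SpeciMEn matchers. The parser is deliberately naive and treats everything after the epithet as authority, so infraspecific names are not handled.

```python
import re
from difflib import SequenceMatcher

# Cleaner rules (assumed): drop 'cf.', 'cfr.', 'aff.'; harmonize 'var'/'v.' to 'var.'.
QUALIFIERS = re.compile(r"\b(?:cfr|cf|aff)\.?(?=\s|$)", re.IGNORECASE)
VARIETY = re.compile(r"\b(?:var|v)\.?(?=\s)")
YEAR = re.compile(r"\b(?:1[7-9]|20)\d{2}\b")


def clean(raw):
    """Strip qualifiers, harmonize the variety marker, collapse whitespace."""
    text = QUALIFIERS.sub(" ", raw)
    text = VARIETY.sub("var.", text)
    return " ".join(text.split())


def parse(cleaned):
    """Atomize a cleaned name into genus, epithet, authority and year."""
    tokens = cleaned.split()
    authority = " ".join(tokens[2:])
    year = YEAR.search(authority)
    return {
        "genus": tokens[0] if tokens else "",
        "epithet": tokens[1] if len(tokens) > 1 else "",
        "authority": authority,
        "year": year.group(0) if year else "",
    }


def exact_matcher(parsed, reference):
    """Switch 1: accept only an identical genus + epithet."""
    candidate = f"{parsed['genus']} {parsed['epithet']}"
    return candidate if candidate in reference else None


def fuzzy_matcher(threshold):
    """Switch 2 (factory): accept the most similar name if it reaches the threshold."""
    def match(parsed, reference):
        candidate = f"{parsed['genus']} {parsed['epithet']}"
        best, best_ratio = None, 0.0
        for name in reference:
            ratio = SequenceMatcher(None, candidate, name).ratio()
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        return best if best_ratio >= threshold else None
    return match


def run_workflow(raw_names, reference, matchers):
    """Send each name through the switches in order; return matches and leftovers."""
    resolved, unresolved = {}, []
    for raw in raw_names:
        parsed = parse(clean(raw))
        for label, matcher in matchers:
            hit = matcher(parsed, reference)
            if hit is not None:
                resolved[raw] = (hit, label)  # goes to post-processing
                break
        else:
            unresolved.append(raw)  # no switch produced a match
    return resolved, unresolved
```

The list of `(label, matcher)` pairs passed to `run_workflow` plays the role of the user-defined workflow: its order and the settings given to each factory correspond to the order and settings of the switches. The two return values are what the post-processing options would operate on, for example statistics over `resolved` or a list of choices for `unresolved`.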

As there still seems to be some confusion over the concept, and as some of the documentation on the iMarine Wiki and Workspace seems to have been overlooked, Edward will produce an overview document with links to the relevant documentation on the iMarine web sites.

Integrating the different lines of work within iMarine that touch on BiOnym/name matching is long overdue. We urgently need to resolve this, and should have a discussion that also involves Casey and Nicolas as soon as possible.