10.10.2013 BiolDiv
Meeting 10 October 2013, 11:00 am
Google Hangout
Present: Nicolas, Casey, Anton, Lino, GP, Fabio, Edward
Notes
BiOnym
GP has implemented a first draft of the BiOnym workflow on the iMarine infrastructure, available at https://dev.d4science.org/group/devvre/sm (SM > Taxa > Bionym). The interface calls SpeciMEn.jar and allows the user to choose some settings. GP used this implementation to run some experiments distributing a workload over different workers; statistics are available from http://goo.gl/uDRei2. We should investigate the best way of distributing the load over the workers. The alternatives are to split up the reference list or to split up the input list. The former gave the best results in preliminary testing by Fabio, but this might be a side effect of the fact that those tests were run on a single machine. Splitting up the reference list might also cause complications when pulling together the results for the different test species, since each worker then sees only part of the reference list.
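To make the two alternatives concrete, here is a minimal sketch in Python. It is an illustration only, not the SpeciMEn.jar or iMarine implementation: the function names, the toy similarity measure (Python's difflib) and the example names are all assumptions. It shows why splitting the input list allows results to be simply collected, while splitting the reference list requires a merge step that keeps the best candidate per input name.

```python
from difflib import SequenceMatcher


def best_match(name, reference):
    """Return (best reference name, similarity score) for one input name.

    Toy matcher, assumed for illustration only.
    """
    scored = [
        (SequenceMatcher(None, name.lower(), ref.lower()).ratio(), ref)
        for ref in reference
    ]
    score, ref = max(scored)
    return ref, score


def chunks(items, n):
    """Split a list into at most n consecutive chunks of similar size."""
    size = max(1, -(-len(items) // n))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_input(names, reference, n_workers):
    """Each worker gets part of the input list and the full reference list."""
    results = {}
    for part in chunks(names, n_workers):  # one part per worker
        for name in part:
            # Results are final as soon as a worker returns them.
            results[name] = best_match(name, reference)
    return results


def split_reference(names, reference, n_workers):
    """Each worker gets the full input list and part of the reference list."""
    results = {}
    for part in chunks(reference, n_workers):  # one part per worker
        for name in names:
            candidate = best_match(name, part)
            # Merge step: keep only the best candidate seen across workers.
            if name not in results or candidate[1] > results[name][1]:
                results[name] = candidate
    return results
```

In this sketch the workers run one after the other in a loop; on the infrastructure each chunk would go to a separate worker, and the merge step in split_reference would have to run after all workers have returned.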
The interface should allow sub-selecting within the reference list (e.g. extract a single taxon).
Should the interface be complex or simple? For our own work, and for sophisticated users, the interface should provide maximal control over the process and allow setting all parameters, both for the process as a whole and for individual matchers. For the occasional user, we should have a simple interface with reasonable defaults for all parameters. We can pre-define some standard workflows (e.g. GSAy, Taxamatch) and make those available to these users via a very simple interface.
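One way to serve both kinds of user from the same settings model is sketched below in Python. Everything in it is hypothetical: the parameter names (threshold, max_candidates, reference_list), the default values and the matcher identifiers are placeholders, and the preset contents do not describe what GSAy or Taxamatch actually do. The point is only that a simple interface can pick a named preset, while an expert interface starts from the same preset and overrides individual settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MatcherSettings:
    """Settings for one matcher (all fields are hypothetical)."""
    name: str
    threshold: float = 0.8
    max_candidates: int = 10


@dataclass(frozen=True)
class WorkflowSettings:
    """Settings for the process as a whole (all fields are hypothetical)."""
    reference_list: str = "default"
    matchers: tuple = (MatcherSettings("exact"), MatcherSettings("fuzzy"))


# Pre-defined workflows; the contents are placeholders, not real definitions.
PRESETS = {
    "default": WorkflowSettings(),
    "GSAy": WorkflowSettings(matchers=(MatcherSettings("gsay"),)),
    "Taxamatch": WorkflowSettings(matchers=(MatcherSettings("taxamatch"),)),
}


def simple_settings(preset="default"):
    """Simple interface: the user only chooses a named workflow."""
    return PRESETS[preset]


def expert_settings(preset="default", **overrides):
    """Expert interface: start from a preset and override any field."""
    return replace(PRESETS[preset], **overrides)
```

For example, expert_settings("default", reference_list="my_list") returns a copy of the default workflow with only the reference list changed; the preset itself is left untouched.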
Fabio reported on recent developments within SpeciMEn.jar. It will soon be possible to use an external file or URL as the reference list, instead of having to use an internal list embedded in SpeciMEn.jar. There is a tool to ‘prepare’ data for use by the .jar, taking DwCA as input. It is now possible to use the .jar to simulate the GSAy approach. This would be an elegant way to make GSAy (or other approaches) available as a separate functionality within iMarine.
We need to have a discussion on the level at which matchers not developed by Fabio should be integrated. A first possibility is to implement each one as a separate tool in the Statistical Manager. Since we try to advocate a community approach to building tools for data matching, this is not ideal; the advantage is that we have a clear definition of the wrapper needed to build tools in the Statistical Manager, and that Casey and others are familiar with these definitions. A second possibility is to ask external contributors to work with Fabio and contribute directly to the development of SpeciMEn.jar; the drawback here is that this relies on the personal involvement of Fabio, and excludes any contributors not working with Java. A third option is to connect external tools through web services; this has already been done successfully by Fabio, in his connection with the GNI parser developed by Dima. A fourth option is to define a new type of wrapper that specifies how an individual matcher should connect to the BiOnym workflow, taking input from upstream matchers or the pre-processor, and delivering output to the post-processor and downstream matchers. In this case we would also call SpeciMEn.jar from within such a wrapper, with command lines such that the .jar mimics a single matcher (or preprocessing step), instead of acting as the whole series of matchers that are now implemented within the jar. The advantage is that external contributors only need to understand the wrapper, not the internals of SpeciMEn.jar.
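The wrapper option can be sketched as follows, in Python. This is an assumed contract for discussion, not an agreed definition: the signature, the two toy matchers and the run_chain function are all hypothetical. Each matcher receives the names still unmatched by upstream steps plus the reference list, and returns the matches it found together with the names it could not resolve, which go on to the next matcher.

```python
from typing import Callable, Dict, List, Tuple

# Assumed wrapper contract: (unmatched names, reference list) ->
# (matches found by this matcher, names still unmatched).
Matcher = Callable[[List[str], List[str]], Tuple[Dict[str, str], List[str]]]


def exact_matcher(names, reference):
    """Toy matcher: accept names that occur verbatim in the reference list."""
    ref = set(reference)
    matched = {n: n for n in names if n in ref}
    return matched, [n for n in names if n not in ref]


def case_insensitive_matcher(names, reference):
    """Toy matcher: accept names that match when case is ignored."""
    lookup = {r.lower(): r for r in reference}
    matched = {n: lookup[n.lower()] for n in names if n.lower() in lookup}
    return matched, [n for n in names if n.lower() not in lookup]


def run_chain(names, reference, matchers):
    """Pass unmatched names from each matcher to the next one downstream."""
    all_matches = {}
    remaining = list(names)
    for matcher in matchers:
        found, remaining = matcher(remaining, reference)
        all_matches.update(found)
    # all_matches would go to the post-processor; remaining stays unresolved.
    return all_matches, remaining
```

Under this contract, a contributor working in another language would only need to provide a function (or a command-line tool called from one) with this input and output; a call to SpeciMEn.jar configured to act as a single matcher would be wrapped in the same way.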
Edward is more or less ready with the R script illustrating his vision of BiOnym, except for the post-processor part. He will introduce the script to GP, Casey and Fabio during a conference call scheduled for Friday 11 October, 10 am, Google Hangout.
Next meeting
Wednesday 16 October 2013, 11:30 am. Most participants in the call will be in Rome for a meeting, so the call will be initiated from Rome.