11.07.2013 Biodiv

From D4Science Wiki

Meeting Notes: call on Biodiversity; Name reconciliation

Date: 11 July 2013; 12:00 – 13:05

Topics:

  1. Parallel R
  2. Taxon matching
  3. Update on biodiversity activities

Participants

  • FAO: A. Ellenbroek, F. Fiorellato
  • CNR: P. Pagano, GP. Coro
  • FIN: N. Bailly
  • E. vanden Berghe

Notes:

  1. CNR has established a network of nodes running R that, together with the Statistical Manager, can now process parallel jobs. The first experiment was performed on 21 nodes with a Bayesian model. GP also experimented with parallelizing in R, but that did not deliver the expected gains.

    a. CNR will prepare a message to the iMarine Board on this result.

  1. Taxon matching:
    1. GP: Has an experiment been done with FF's new code?
    2. EB: Not yet; we have learned how to use it. The new datasets have a problem in matching because it is currently connected to AFSIS. FF's code enables a check against FishBase. EB is working on a list of common spelling mistakes specifically for fish; the two earlier datasets remain available.
    3. FF: The system can be fed with any source that exposes scientific or common-language names that can be transformed into a reference set. An example is FishBase, from which I extracted the names and added them as a reference dataset.
    4. EB: In the ideal use case, the user selects the reference list to compare against.
    5. FF: The jar is now a command-line tool. Once brought into the infrastructure, it can also run there as a real service. A new jar will be released today.
    6. GC: With lexical comparisons it is not easy to select the correct parameter settings. Regarding the dataset, we can soon add the jar with the dataset inside. However, making it more flexible will take more time. We can also deploy the .WAR file, and the Statistical Service could interact with it through a wrapper.
    7. PP: The wrapper we aim to build will be more than a UI: it will run the app in a web service, and the UI will be built on top of that. We aim to re-use the Statistical Service.
    8. FF: Please keep in mind that the current command-line tool cannot report its progress. FF asks for the OBIS dataset to perform testing before the holidays.
    9. PP: We have several instances of OBIS available; GP can provide access to FF. We cannot give the entire database to FF unless FF complies with the data sharing policies.
    10. EB: Beware that OBIS is not a clean reference list. The reference should be WoRMS or CoL later.
    11. PP: Can data in DwC-A be used? We can set up such a service for data input.
    12. FF: I can add such a facility later.
    13. NB: Casey's involvement is better seen as running in parallel rather than as a joint effort, to ensure that the entire work-flow is covered later.
    14. FF: For the time being, the source code is not being distributed. It would be nice to compare results between the two systems. Later we can compose a complete work-flow.
    15. GC: It is difficult to see how differences can be resolved where no entries can be matched on a lexical basis. To support this, we have to offer manual mapping facilities, to be discussed later.
    16. EB: Will work with Casey and Anton to build a formal set of rules, including rules derived from matches that were created manually. We need software that is adaptable to different levels of quality in datasets. We aim to experiment with different subsets. There are 150,000 clean names in OBIS and 400,000 rough names. This can be an excellent source for ‘fake’ reference lists (from the clean names) and misspelled names (from the rough names). We need to run experiments with different groups: not all groups are in the same state of ‘cleanliness’.
  2. EB to work with AE and Casey on a rule frame.
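The matching flow discussed above (a user-selected reference list, a lexical comparison with tunable parameters, and manual mappings for names that cannot be matched lexically) can be sketched roughly as follows. This is only an illustration and not FF's jar: the function names, the threshold value of 0.85 and the use of Python's difflib similarity ratio are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def match_name(name, reference, manual_map=None, threshold=0.85):
    """Match one name against a reference list.

    Returns (matched reference name or None, score, method).
    The threshold is a hypothetical parameter, not a recommended value.
    """
    key = normalize(name)

    # 1. Manually curated mappings take precedence over lexical matching.
    manual = {normalize(k): v for k, v in (manual_map or {}).items()}
    if key in manual:
        return manual[key], 1.0, "manual"

    # 2. Exact match after normalization.
    by_key = {normalize(r): r for r in reference}
    if key in by_key:
        return by_key[key], 1.0, "exact"

    # 3. Best lexical candidate, accepted only at or above the threshold.
    best, best_score = None, 0.0
    for ref_key, ref in by_key.items():
        score = SequenceMatcher(None, key, ref_key).ratio()
        if score > best_score:
            best, best_score = ref, score
    if best_score >= threshold:
        return best, best_score, "lexical"
    return None, best_score, "unmatched"


if __name__ == "__main__":
    # Tiny stand-in for a reference list such as one extracted from FishBase.
    reference = ["Thunnus albacares", "Gadus morhua", "Katsuwonus pelamis"]
    manual_map = {"Yellowfin tuna": "Thunnus albacares"}
    for query in ["Thunus albacares", "gadus  morhua", "Yellowfin tuna"]:
        print(query, "->", match_name(query, reference, manual_map))
```

Raising the threshold trades missed matches for fewer false matches, which is one reason the parameter settings are hard to choose without experiments on subsets of differing quality.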

Follow-up actions

  • FF on holiday for 2 weeks
  • FF to provide a new jar for testing
  • Next call: Thursday 18 July, 12:00; main topic: rule frame (requirements)
  • CNR closes 1 – 20 Aug
  • GP to provide access to a test OBIS snapshot instance for FF to extract test list of species
  • FF to confirm compliance with the iMarine data policy when working with OBIS data
  • EB, AE, Casey to work on a rule frame
  • AE to ask Terradue about availability over the summer