21.11.2013 BiolDiv

From D4Science Wiki

Meeting 21 November 2013, 11:00 am

Google Hangout

Present: GP, Fabio, Lino, Edward

Notes

This is not only a summary of the meeting we had on 21 November, but also an attempt at pulling together some pending issues raised in email conversations.

Validation

Anton sent a mail on 19 November, suggesting we start thinking about validation of our BiOnym activities; he identifies three issues:

  1. when we can invite users to perform the activity
  2. whom we can invite (e.g. Edward, Nicolas and Yde each to propose 2 names)
  3. what we want to do with the results

Nicolas, in his email of 21 November, suggests making a timetable to organise testing activities:

  1. deciding on test and authority files
  2. metrics to use to measure performance/effectiveness
  3. comparison with other systems

Neither Anton nor Nicolas participated, but the issue was discussed at some length. Coming up with a firm timetable was felt to be premature, as some benchmarking should happen before users can be invited to the infrastructure. Everyone on the call was very much aware of time pressures and the approaching end of the project, and hence of the need for a clear plan to generate outcomes.

Benchmarking

Initial testing produced some unexpected results, e.g. the large distance between 'Palinurus' and 'Panulirus', and between the species in those two genera. Nicolas and Edward will investigate the behaviour of the present system further. To check: exactly which distance measure was used.
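
The notes leave open which distance measure produced this result. For reference only, the sketch below computes a plain Levenshtein edit distance. This is an assumption for illustration, not necessarily the measure BiOnym used. The function names are hypothetical. Under this measure, the rearranged middle letters of the two genus names count as several independent edits, which may be one reason such pairs score as far apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions and substitutions
    each cost 1. Transpositions are NOT treated as a single edit."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to a 0..1 score (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Palinurus", "Panulirus"))
    print(round(similarity("Palinurus", "Panulirus"), 3))
```

With this measure the two genus names are 4 edits apart out of 9 characters. A measure that counts a swap of letters as a single edit would be expected to rate them as closer.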

On 18 November, GP suggested the following for benchmarking:

  1. set the authority table (reference data) to WoRMS-Animalia
  2. select a list of "real" raw species names (misspellings "from the wild") with associated correct transcription and authorship. Nicolas suggested starting with fin-fishes
  3. generate a list of misspelling errors for the same species (simulated misspellings)
  4. submit both the real and simulated misspellings to WoRMS
  5. submit the same to BiOnym
  6. for real and simulated misspellings separately calculate Precision, Recall and F-measure
  7. draw conclusions by understanding the complementary errors between the two systems on the two data sets
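
Step 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not an agreed scoring protocol: precision is taken as correct matches over all matches returned, recall as correct matches over all raw names that have a correct entry in the authority table, and the F-measure as their harmonic mean. The function name, the data layout and the misspelled example names are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def precision_recall_f(
    gold: Dict[str, Optional[str]],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float, float]:
    """Score one matcher's output against a gold standard.

    gold maps each raw name to its correct authority name, or None when
    the authority table holds no correct match. predicted maps each raw
    name to the name the matcher returned, or None for "no match".
    """
    # Raw names for which the matcher returned anything at all.
    returned = sum(1 for raw in gold if predicted.get(raw) is not None)
    # Raw names that do have a correct answer in the authority table.
    matchable = sum(1 for name in gold.values() if name is not None)
    # Raw names for which the matcher returned exactly the correct answer.
    correct = sum(
        1 for raw, name in gold.items()
        if name is not None and predicted.get(raw) == name
    )
    precision = correct / returned if returned else 0.0
    recall = correct / matchable if matchable else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative data only: the misspellings are invented, not test results.
gold = {
    "Gadus morhau": "Gadus morhua",
    "Solea soela": "Solea solea",
    "Thunnus thynus": "Thunnus thynnus",
    "Xyz abc": None,
}
predicted = {
    "Gadus morhau": "Gadus morhua",       # correct match
    "Solea soela": "Solea senegalensis",  # wrong match
    "Thunnus thynus": None,               # missed
    "Xyz abc": None,                      # correctly left unmatched
}
precision, recall, f_measure = precision_recall_f(gold, predicted)
```

Running the same function once per system (WoRMS, BiOnym) and once per data set (real, simulated) would give the figures needed for the comparison in step 7.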

Edward will construct the test data sets. Fabio will continue working on YASMEEN: reconciling the fact that fuzzy matching returns a boolean while the other matchers return a continuous score between 0 and 1, and putting YASMEEN on a diet. Neither Nicolas nor Casey was online, so it is not clear what progress has been made with GSAy at this point.

Next steps

A list of steps was briefly discussed:

  • First benchmarked version of BiOnym
    • Complete system to set up workflows
      • Needs work on YASMEEN
      • Define standard workflows to use as defaults
    • Construct benchmark data sets (Edward to pull together)
    • Compare performance of default workflows with other systems such as WoRMS taxon Matcher or IRMNG/Rees taxamatch
      • GP to analyse efficiency/performance in computational terms
      • Nicolas and Edward to evaluate effectiveness; compare matches/non-matches between different systems, and with different workflows of BiOnym
  • Ask 'friendly' users to test-drive
  • Work on facilities to allow customisations; can go in parallel with previous step
    • user can upload their own authority file
    • user can customise other aspects of matching, such as character substitutions in fuzzy matching...
  • Ask 'naive' users to test-drive system

This list should be amended, expanded and finalised during the next meeting (Thursday 28 November). At the meeting after that, we should agree on timing.


Next meetings

Thursday 28 November, 11:00 am. Topics for this meeting:

  • Hear from Nicolas about the TDWG meeting (as Nicolas missed this meeting)
  • Further discuss development
    • discuss and set a series of milestones, based on the steps defined during the discussions of 21 November as reported above
    • start thinking about deadlines to go with the milestones, with a view to finalising those in the meeting planned for Wednesday 4 December
  • Long-term strategy

Wednesday 4 December, 11 am. Topics for this meeting (tentative, conditional on discussions of 28 November):

  • deadlines for the milestones decided on in the meeting of 28 November
  • details of the validation process as suggested by Anton in his mail of 19 November