27.01.2014 BiolDiv

BiOnym meeting

face-to-face meeting, Brussels (RBINS), 27 Jan 2014

Present: Nicolas, Edward

Notes

Nicolas and Edward took the opportunity of Nicolas being in Brussels to meet and discuss open BiOnym issues.


Algorithm

The present tool is a chain of matchers run in series, as originally envisaged. GP has also done some testing of individual matchers, to investigate the 'complementarity of the matchers': which matches are found with one matcher, and not with another?
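The complementarity question can be expressed as set differences over the names each matcher resolved. The sketch below is illustrative only: the two result dictionaries are made-up data, and `complementarity` is a hypothetical helper, not part of BiOnym.

```python
def complementarity(results_a, results_b):
    """Compare two matchers by the input names each one managed to match.

    Each argument maps an input name to the reference name it was matched to.
    """
    a, b = set(results_a), set(results_b)
    return {"only_a": a - b, "only_b": b - a, "both": a & b}


# Made-up results for two hypothetical matchers.
exact = {"Gadus morhua": "Gadus morhua"}
fuzzy = {"Gadus morhua": "Gadus morhua", "Gadus morua": "Gadus morhua"}

report = complementarity(exact, fuzzy)
print(report)
```

Names in `only_a` or `only_b` are the ones where one matcher adds something the other does not.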

We should test the performance of individual matchers. Can an individual matcher perform better than the whole workflow? Yes: a false-positive match upstream ends the chain for that name, and so can prevent a correct match further downstream.
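A toy illustration of how an upstream false positive can block a correct downstream match. Everything here is an assumption made for the example: the reference list, the crude `genus_matcher`, and the use of Python's `difflib` as a stand-in fuzzy matcher. It does not reproduce the actual BiOnym matchers.

```python
from difflib import get_close_matches

REFERENCE = ["Gadus macrocephalus", "Gadus morhua"]  # made-up reference list


def genus_matcher(name):
    """Crude upstream matcher: first reference name sharing the genus."""
    genus = name.split()[0]
    for ref in REFERENCE:
        if ref.split()[0] == genus:
            return ref
    return None


def fuzzy_matcher(name):
    """Downstream matcher: closest reference name above a similarity cutoff."""
    hits = get_close_matches(name, REFERENCE, n=1, cutoff=0.8)
    return hits[0] if hits else None


def run_chain(name, matchers):
    """Serial chain: the first matcher that returns a match wins."""
    for matcher in matchers:
        match = matcher(name)
        if match is not None:
            return match
    return None


chain_result = run_chain("Gadus morua", [genus_matcher, fuzzy_matcher])
single_result = fuzzy_matcher("Gadus morua")
print("chain: ", chain_result)
print("single:", single_result)
```

In this toy case the chain stops at the genus-level hit, while the fuzzy matcher run on its own finds the intended species.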

We should check what is happening with the GNI parser: it performs surprisingly badly, also on the misspellings from OBIS. What happens with the authorities? Also check with more complicated cases, e.g. names including subgenus, subspecies...

Final testing

Should be done by biologists! We need some limited packaging/documentation to communicate with people from the biodiversity community. Contact Yde, Dima, Tony, Markus et al at GBIF...

Create trigrams - based on the R version?
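Assuming this refers to character trigrams (three-character substrings) of the names, a minimal sketch of a trigram comparison could look as follows. The padding scheme and the Jaccard score are choices made for the example, not necessarily what the R version does.

```python
def trigrams(name):
    """Set of character trigrams of a lower-cased, space-padded name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


print(trigram_similarity("Gadus morua", "Gadus morhua"))
print(trigram_similarity("Gadus morua", "Solea solea"))
```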

How does the R version compare to BiOnym on the infrastructure?

Packaging

Documentation should include examples (also showing similarities against thresholds...); so not only a technical explanation on how things work, but also explain how different settings relate to the scientific names, how one can expect settings to affect the outcomes...
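One possible kind of documentation example: the same input name scored against a small reference list, with the candidate set shown at several thresholds. The reference names are illustrative, and the similarity measure (`difflib.SequenceMatcher.ratio`) is only a stand-in for whatever measures BiOnym actually exposes.

```python
from difflib import SequenceMatcher

REFERENCE = ["Gadus morhua", "Gadus macrocephalus", "Solea solea"]  # illustrative


def similarity(a, b):
    """Stand-in similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(name, threshold):
    """Reference names scoring at or above the threshold, best first."""
    scored = [(ref, round(similarity(name, ref), 2)) for ref in REFERENCE]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


for threshold in (0.5, 0.7, 0.9):
    print(threshold, candidates("Gadus morua", threshold))
```

Lowering the threshold can only add candidates, which is the trade-off between missed matches and false positives that the documentation would need to explain.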

In the current interface some labels are jargon that would not be meaningful to the normal user; these have to be revised.

How to deliver the results? Should the results be written to a database? Should we work together with GNI (write to Paddy Patterson, Dima, Rich Pyle, jointly with Nicolas)? We could contribute to the reconciliation groups; check what has been done under GNITE. GNITE has stalled since its developer left - Nicolas will look for more information. Also look at what has been done within SpeciesFile (David Eades).

Possible role for a high-throughput form of BiOnym: creating the reconciliation groups in the GNITE database (which now holds roughly 16 million names).

Availability on the web: there should be a separate Biodiversity VRE (Virtual Research Environment), with BiOnym as one of the things on offer; this wouldn't stop us from offering BiOnym in a different guise through the statistical modeller.

Provide BiOnym as a service!

Distribute the Java and R versions separately for people to run locally? Potential problem: IP issues related to distributing Taxonomic Authority Files

Next steps

Testing on two different classes of test data (actual and simulated misspellings); third class, expert-simulated misspellings, to be generated.
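For the simulated class, one simple approach is to apply a random single-character edit to a correct name. The sketch below is only an assumption about how such test data could be produced; the edit operations and their equal weighting are arbitrary, and the function names are hypothetical.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def _one_edit(name, rng):
    """Apply one random edit at a letter position; may return name unchanged."""
    positions = [i for i, ch in enumerate(name) if ch.isalpha()]
    i = rng.choice(positions)
    op = rng.choice(["delete", "substitute", "insert", "transpose"])
    if op == "delete":
        return name[:i] + name[i + 1:]
    if op == "substitute":
        return name[:i] + rng.choice(LETTERS) + name[i + 1:]
    if op == "insert":
        return name[:i] + rng.choice(LETTERS) + name[i:]
    # Transpose with the next character, but never across a space.
    if i + 1 < len(name) and name[i + 1].isalpha():
        return name[:i] + name[i + 1] + name[i] + name[i + 2:]
    return name  # no valid transposition here; the caller retries


def simulate_misspelling(name, rng):
    """Return a variant of name that differs by one single-character edit."""
    while True:
        candidate = _one_edit(name, rng)
        if candidate != name:
            return candidate


rng = random.Random(42)  # fixed seed so the test set is reproducible
simulated = [simulate_misspelling("Gadus morhua", rng) for _ in range(5)]
print(simulated)
```

Real misspellings tend to follow patterns (phonetic confusions, Latin endings) that uniform random edits do not capture, which is one reason to keep the actual and expert-simulated classes alongside this one.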

Other potential sources for misspellings: GBIF (Checking fresh-water fish species?); WoRMS (with a refreshed version of the fish; leave the rest there for the false positives); IRMNG (2000 misspelled names from Tony, referenced to IRMNG)

For the paper:

  • Create trigrams
  • Does the paper need a material and methods section? - on the level of infrastructure; reference lists...