23.01.2014 BiolDiv
Latest revision as of 09:19, 24 January 2014
Google Hangout
Present: Nicolas, Anton, Fabio, GP, Edward
Notes
Anton attended only the first few minutes of the meeting; Fabio joined late; Edward experienced connectivity issues (again!!). Casey did not attend, possibly because of health issues affecting her or her family.
BiOnym Performance
We discussed BiOnym performance further, after GP expanded his analysis following last week's meeting. The performance gap between TaxaMatch and BiOnym for 'artificial' (simulated) misspellings is confirmed. Apparently, the results of our performance tests depend critically on the type of misspellings used. We decided to introduce a new category of misspellings: simulated, but informed by the kinds of errors we normally encounter in 'real' data (as found, for example, in the raw names contributed to OBIS). The present method of generating misspellings is simply to apply a set of random character substitutions; this might give an undue advantage to both the simple parser (as compared to the GNI parser) and to the generic methods like Levenshtein and trigram (to a much lesser extent), as compared to GSAy and TaxaMatch. Edward will work on the system to generate the new set of misspellings; GP will re-run the scripts for the performance analysis as soon as the new test data sets become available.
The document containing the performance evaluation can be found here: http://goo.gl/bPxUfl
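To make the two categories of simulated misspellings concrete, here is a minimal Python sketch. It is not the project's generator: the function names and the `CONFUSIONS` table are hypothetical, and the confusion pairs are illustrative guesses rather than patterns measured from OBIS data.

```python
import random
import string


def random_substitutions(name, n_errors=1, rng=None):
    """Current approach: replace n_errors randomly chosen letters with random letters."""
    rng = rng or random.Random()
    chars = list(name)
    positions = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in rng.sample(positions, min(n_errors, len(positions))):
        # Always pick a different letter so every substitution is a real error.
        alternatives = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        new = rng.choice(alternatives)
        chars[i] = new.upper() if chars[i].isupper() else new
    return "".join(chars)


# Hypothetical confusion pairs (assumption, not derived from real data).
CONFUSIONS = [("ae", "e"), ("ph", "f"), ("y", "i"), ("ii", "i"), ("th", "t")]


def informed_misspelling(name, rng=None):
    """Proposed approach: apply one edit of a kind that plausibly occurs in real data."""
    rng = rng or random.Random()
    lower = name.lower()
    edits = []
    for old, new in CONFUSIONS:
        # Start at index 1 so the capitalised first letter is left alone.
        start = lower.find(old, 1)
        while start != -1:
            edits.append(("sub", start, old, new))
            start = lower.find(old, start + 1)
    for i in range(1, len(name) - 1):
        if name[i].isalpha() and name[i + 1].isalpha():
            if name[i] == name[i + 1]:
                edits.append(("drop", i))   # collapse a doubled letter
            else:
                edits.append(("swap", i))   # transpose adjacent letters
    if not edits:
        return name
    edit = rng.choice(edits)
    if edit[0] == "sub":
        _, start, old, new = edit
        return name[:start] + new + name[start + len(old):]
    i = edit[1]
    if edit[0] == "drop":
        return name[:i] + name[i + 1:]
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]
```

Passing a seeded `random.Random` to either function makes a generated test set reproducible, which matters if the performance scripts are to be re-run on the same data.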
Next Steps
Extra testing, with a different class of misspelled names, as described above.
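As an illustration only, the extra testing could score the generic matchers along the following lines. This sketch covers just the two generic methods named above (Levenshtein and trigram); it does not reproduce GP's scripts, BiOnym, GSAy or TaxaMatch, and all function names, as well as the padding scheme used for trigrams, are assumptions.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def trigrams(s):
    """Set of character trigrams of the lower-cased, space-padded string."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (1.0 for identical strings)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(query, reference, score):
    """Reference name with the highest score for this query."""
    return max(reference, key=lambda name: score(query, name))


def accuracy(test_pairs, reference, score):
    """Fraction of (misspelling, correct name) pairs whose best match is correct."""
    hits = sum(best_match(q, reference, score) == truth for q, truth in test_pairs)
    return hits / len(test_pairs)


def levenshtein_score(query, name):
    """Negated distance, so that a higher score means a closer match."""
    return -levenshtein(query.lower(), name.lower())
```

Running `accuracy` once per class of misspellings (random substitutions, informed simulations, real misspellings) with the same reference list would show how strongly each method depends on the type of error.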
It is important at this point for the biologists to take over, in several steps:
- Internally: Nicolas and Edward, to take over exhaustive testing from GP and Fabio
- Limited circle of outsiders (such as Yde, Tony, Dima, and some people at GBIF)
- Wider biodiversity community
We need documentation of the system before we can communicate with outsiders; GP's description of the statistical manager, distributed by email in another context (collaboration with LifeWatch Greece), will serve as a starting point.
We'll need an interface to the cloud-based BiOnym, and a web-based interface to a simple version of BiOnym; FIN has to play its role here.
We have to write a paper; GP will produce an outline.
Nicolas will be in Brussels on 27 January and will arrange to meet with Edward to discuss a proposed plan for BiOnym activities between now and the end of the project.
Next meeting
Thursday 30 January 2014, 11 am.