Difference between revisions of "19.09.2013 BiolDiv"
m |
|||
Line 1: | Line 1: | ||
'''Meeting 19 September 2013, 12 noon''' | '''Meeting 19 September 2013, 12 noon''' | ||
− | '''Participants''': Anton, Lino, GP, Fabio, Edward | + | '''Participants''': Anton, Lino, GP, Fabio, Edward; continued in the afternoon with only GP and Edward present |
'''Notes''' | '''Notes''' | ||
− | + | We discussed the taxon name matcher BiOnym a bit further, following up on issues that were raised in the noon meeting. We agree that it is important that the different activity streams come together. We don’t know exactly how; Edward’s problem is that he doesn’t know much about Casey’s work so it’s difficult for him to suggest a way forward. GP will pass on some instruction on how to explore what Casey has done on gCube; hopefully this will assist in finding a way forward. We will explore whether Casey’s work can be transformed into a plug-in in Fabio’s .jar, or, alternatively, the output of Casey’s procedure can be merged with Fabio’s. To be continued. | |
− | + | ||
− | + | GP will study Edward’s R script to investigate calling Fabio’s jar from R. The .jar, being java, is OS independent. The jar calls an OS-native helper program (compiled from C) to calculate Levenshtein distance. Depending on the OS on which this will run, we’ll have to install one of the libraries available on http://goo.gl/smYDh0; the native library is not essential, only speeds up calculations. | |
− | + | The R scripts to analyse performance of matching procedures now lives on the iMarine/d4Science SVN (https://svn.d4science.research-infrastructures.eu/gcube/trunk/data-analysis/onym/imarine). The scripts will be duplicated to the workspace. | |
− | + | The abstract for the presentation of the TDWG meeting is considered final. Edward will contact Yde to see what the next steps are – do we need to register; which session will this go in… | |
− | + | Edward will try and create a kind of a ‘prototype’ in R illustrating the whole BiOnym approach; where possible, he’ll put the work on postgres (for many things I have a choice of having things done by R or by PostgreSQL; in fact much of the calculations of the performance statistics are implemented in both, to cross-validate results while developing). The PostgreSQL can be re-used directly by calling the statements from Java. | |
+ | Other topics: | ||
− | + | Edward extracted inferred absences for cod (Gadus morhua) from the OBIS database; this is done on the basis of datasets which have survey information, looking for a well-defined and limited number of species. Datasets of this type where cod has been recorded clearly have cod on their monitoring list; if there are records in this dataset for other species, but not for cod, this can be inferred to be an absence point for cod. A script selecting what are most probably relevant datasets, and then selecting cod presence and inferred absence points was developed by Edward and shared with GP. Apart from selecting relevant points, the script also plots the data. The R script is now also placed on the iMarine workspace (http://goo.gl/A3UUDX). | |
− | + | After seeing the dataset on cod presences and calculated/inferred absences, GP is sceptical on whether this will be analysable with Species Distribution Modelling algorithms. There doesn’t seem to be much separating the absence from the presence points (see map; red=presence, green=absence. Many points obscure each other). We will run a number of tests, using bio-oracle combined with ETOPO (or GEBCO) as environmental information. We’ll combine the predicted range map from presence-only methods (including AquaMaps and Mahalanobis distance) with methods requiring absence data (such as logistic regression, Maxent and random forests). The algorithms are available in the R Package Dismo; we shouldn’t wait till openModeller becomes available in iMarine as this might be too late for us. | |
+ | |||
+ | [[File:map: presence and inferred absence points for cod]] | ||
+ | |||
+ | A possible solution to the problem of the ‘closeness’ of presence and inferred absence points is to summarise the data in square corresponding to the resolution of Bio-Oracle, and take n(presence)/(n(presence)+n(inferred absence)) as a proxy for abundance (or more formally, for the probability to encounter cod in a sample taken in that square). Then we can use this statistic with SDM methods using abundance. | ||
+ | |||
+ | The person creating bio-oracle now works at VLIZ. Edward will contact him and ask him whether he can make bio-oracle data available as geo-located layers (e.g. NetCDF-CF GRID), as GP has had some trouble with the ASCII format in which the data is now available. | ||
+ | |||
+ | GP will look into LOF as an alternative to DBScan (as requested by Nicolas in his task list for GP); if it’s implemented in the iMarine infrastructure, it will be via a Java library; GP will send information to Edward on how to call this library from R, as the R implementations of LOF are in rather obscure packages. | ||
+ | |||
+ | There are still several articles Edward should comment on, including the paper led by Leonardo. For this paper, we will look for another potential journal, as PLoS One doesn’t seem to be best suited. Ecological Informatics (which I don’t know) and Ecography (Edward will look for the instructions to authors) are the two we’ll explore now. |
Revision as of 09:42, 26 September 2013
Meeting 19 September 2013, 12 noon
Participants: Anton, Lino, GP, Fabio, Edward; continued in the afternoon with only GP and Edward present
Notes
We discussed the taxon name matcher BiOnym a bit further, following up on issues that were raised in the noon meeting. We agree that it is important that the different activity streams come together. We don’t know exactly how; Edward’s problem is that he doesn’t know much about Casey’s work so it’s difficult for him to suggest a way forward. GP will pass on some instruction on how to explore what Casey has done on gCube; hopefully this will assist in finding a way forward. We will explore whether Casey’s work can be transformed into a plug-in in Fabio’s .jar, or, alternatively, the output of Casey’s procedure can be merged with Fabio’s. To be continued.
GP will study Edward’s R script to investigate calling Fabio’s jar from R. The .jar, being java, is OS independent. The jar calls an OS-native helper program (compiled from C) to calculate Levenshtein distance. Depending on the OS on which this will run, we’ll have to install one of the libraries available on http://goo.gl/smYDh0; the native library is not essential, only speeds up calculations.
The R scripts to analyse performance of matching procedures now lives on the iMarine/d4Science SVN (https://svn.d4science.research-infrastructures.eu/gcube/trunk/data-analysis/onym/imarine). The scripts will be duplicated to the workspace.
The abstract for the presentation of the TDWG meeting is considered final. Edward will contact Yde to see what the next steps are – do we need to register; which session will this go in…
Edward will try and create a kind of a ‘prototype’ in R illustrating the whole BiOnym approach; where possible, he’ll put the work on postgres (for many things I have a choice of having things done by R or by PostgreSQL; in fact much of the calculations of the performance statistics are implemented in both, to cross-validate results while developing). The PostgreSQL can be re-used directly by calling the statements from Java. Other topics:
Edward extracted inferred absences for cod (Gadus morhua) from the OBIS database; this is done on the basis of datasets which have survey information, looking for a well-defined and limited number of species. Datasets of this type where cod has been recorded clearly have cod on their monitoring list; if there are records in this dataset for other species, but not for cod, this can be inferred to be an absence point for cod. A script selecting what are most probably relevant datasets, and then selecting cod presence and inferred absence points was developed by Edward and shared with GP. Apart from selecting relevant points, the script also plots the data. The R script is now also placed on the iMarine workspace (http://goo.gl/A3UUDX).
After seeing the dataset on cod presences and calculated/inferred absences, GP is sceptical on whether this will be analysable with Species Distribution Modelling algorithms. There doesn’t seem to be much separating the absence from the presence points (see map; red=presence, green=absence. Many points obscure each other). We will run a number of tests, using bio-oracle combined with ETOPO (or GEBCO) as environmental information. We’ll combine the predicted range map from presence-only methods (including AquaMaps and Mahalanobis distance) with methods requiring absence data (such as logistic regression, Maxent and random forests). The algorithms are available in the R Package Dismo; we shouldn’t wait till openModeller becomes available in iMarine as this might be too late for us.
File:Map: presence and inferred absence points for cod
A possible solution to the problem of the ‘closeness’ of presence and inferred absence points is to summarise the data in square corresponding to the resolution of Bio-Oracle, and take n(presence)/(n(presence)+n(inferred absence)) as a proxy for abundance (or more formally, for the probability to encounter cod in a sample taken in that square). Then we can use this statistic with SDM methods using abundance.
The person creating bio-oracle now works at VLIZ. Edward will contact him and ask him whether he can make bio-oracle data available as geo-located layers (e.g. NetCDF-CF GRID), as GP has had some trouble with the ASCII format in which the data is now available.
GP will look into LOF as an alternative to DBScan (as requested by Nicolas in his task list for GP); if it’s implemented in the iMarine infrastructure, it will be via a Java library; GP will send information to Edward on how to call this library from R, as the R implementations of LOF are in rather obscure packages.
There are still several articles Edward should comment on, including the paper led by Leonardo. For this paper, we will look for another potential journal, as PLoS One doesn’t seem to be best suited. Ecological Informatics (which I don’t know) and Ecography (Edward will look for the instructions to authors) are the two we’ll explore now.