Biodiversity cluster

From D4Science Wiki
Revision as of 12:25, 8 February 2012 by Anton.ellenbroek



iMarine Partners Biodiversity Position

The management of biodiversity data covers observations of species occurrences, their distribution mapping, and the visual and statistical analysis of both observations (point data) and distributions (areas). It requires the import of structured data in various formats, in particular data compliant with Darwin Core as XML or CSV datasets. The purpose of the management is to produce reliable datasets and maps on species and their potential current, historic and future distribution and habitats.

Compared to other initiatives, iMarine already offers the basic components to load, share, publish and analyze data. This makes the iMarine infrastructure an attractive option for the further development of Biodiversity components. Many data owners in the marine biodiversity domain have difficulty gaining access to biodiversity and environmental data in enough detail and with quality metadata. There are concerns about the sheer number of datasets that have to be maintained, with multiple data streams and formats putting pressure on software developers. Concerns about the interoperability of software and the related risk of exploding support costs for software maintenance make open source (OS) development in a Community of Practice (CoP) a potentially attractive proposition.

In D4Science-II, considerable experience has been gained in the acquisition and management of species occurrence data. In addition, many services marshalling the data from “Sea to Shelf” are available: curation, metadata collection, transformation, mapping and repository services, to name a few.

The aim of the iMarine Biodiversity partners is a stronger, more resilient and flexible framework for Biodiversity Data Management. Collaboration in an Ecosystem Approach Community of Practice (EA-CoP) can help to achieve that. An effective Biodiversity data policy is an iMarine EA-CoP policy; it needs to be defined and approved by the iMarine Board. After all, making clear, fair agreements on the component development of OS software and effectively enforcing these rules will increase political and CoP support for iMarine. This proposal lists some of the components that can help achieve this.

We are aware that Biodiversity is just one facet in the iMarine decision-making processes. To facilitate these, the component sections not only propose implementation actions, but also describe the background and the anticipated impact. We are keen to enter into discussion with other iMarine partners and iMarine supporting institutions. We invite other parties (Board partners, iMarine institutions and the CoP) to have a say.

The Biodiversity proposal: objective and means

The main objective in the management of Biodiversity Data is to check the data flow between primary data providers and a varying number of aggregators such as FishBase, OBIS (possibly FishNet) and GBIF, with the general rules:

  1. detailed specific data flows to the more general aggregator,
  2. data is best managed at the most detailed level,
  3. data must be shared using well-known schemas,
  4. data is never static; it can evolve over time and space, and should be easily updatable; version control is needed to be able to repeat an analysis later on exactly the same datasets,
  5. reference data, such as codes for species, locations, and countries, are often resource specific (=stored with the data),
  6. mapping between reference data is most important for biological taxonomies, and should be contextual,
  7. data is never public; it has to be published by an authorizer.
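Rule 4 can be illustrated with a content fingerprint: if every analysis records the fingerprint of the dataset it used, a later run can verify that it works on exactly the same data. The following is a minimal sketch only; the function name, the use of SHA-256 over canonical JSON, and the sample records are assumptions for illustration, not an existing iMarine or D4Science service.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest identifying the exact content of a dataset.

    Each record is serialized with sorted keys and the rows are sorted, so the
    fingerprint does not depend on column order or row order.
    """
    canonical_rows = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256()
    for row in canonical_rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Illustrative records only (field names follow Darwin Core terms).
version_1 = [
    {"scientificName": "Gadus morhua", "decimalLatitude": 60.5, "decimalLongitude": 5.0},
    {"scientificName": "Salmo salar", "decimalLatitude": 58.1, "decimalLongitude": 6.3},
]
# Same content, different row order: the fingerprint should not change.
version_1_reordered = list(reversed(version_1))
# One record added: this is a new version with a new fingerprint.
version_2 = version_1 + [
    {"scientificName": "Anguilla anguilla", "decimalLatitude": 51.2, "decimalLongitude": 3.1}
]

print(dataset_fingerprint(version_1) == dataset_fingerprint(version_1_reordered))
print(dataset_fingerprint(version_1) == dataset_fingerprint(version_2))
```

Storing such a fingerprint next to each published map or statistic would let a reviewer check which dataset version produced it.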

The data exchange with GBIF is still not an easy process, and needs considerable iMarine effort to achieve a level of acceptable stability. Initial activities should discuss, define and develop these data-flows.

The OBIS and AquaMaps systems are growing closer, which opens interesting perspectives on mutual data services such as import, validation checks, mapping and sharing of data and reference data. Specifically, OBIS can offer species name services and distribution validation to AquaMaps, while the GIS and R integration in AquaMaps/D4Science is of interest to OBIS. A first objective in iMarine is an analysis of the mutual services that can be provided.

The validation of data may require advanced statistical, geospatial and environmental data processing. These processes cannot be defined yet, but in the second semester of the project effort will be needed to describe them.

Participation and integration are essential for successful software development and for coherent usage and sharing policies. Every component should be expected to function in a software ecosystem and to be maintained by a community. This implies a careful approach to software architecture and development, aiming at sustainability after the project.

The Biodiversity community proposals for achieving a stronger, safer and more prosperous EA-CoP relate to the following iMarine components:

  • Data import and selection related to Darwin Core
  • Catalogue of Life
  • Freshwater VRE (Virtual Research Environment)
  • Species Distribution related components

Depending on relevant developments, other software components could also play a role in the future. These include components for statistical analysis, GIS tools, geospatial tools, and semantic technology support.

Each section looks at the following:

  • Background to the proposal
  • What is proposed
  • Anticipated impact


Darwin Core related Data Management Tools

Background

As described above, the production of Darwin Core data is the main activity expected to be supported by the D4Science infrastructure. The CoP already provides software that can be evaluated (list ...), although it has been used with varying results in the past. In iMarine, the Community seeks the development and release of an environment for the management of marine observational data that offers:

  • Selection of data from a source (browse repository, filter).
  • More flexibility in occurrence datasets that can be used today (OBIS, ...),
  • Interactive occurrence selection in a GIS (e.g. sliders to-from dates),
  • Occurrence data cleaning: Darwin Core (DwC) is already the standard for exchange, but there is no standard to exchange corrections (e.g., if we find an error, how do we send feedback to the provider in a form they can easily use: obviously the data are structured by DwC, but what about the indications of what was changed and why?).
  • This also includes:
    • private data: they will be integrated through DwC, but what about conveying assessments of quality, possible traps and known issues?
    • data editing incl. through interactive maps.
    • Environmental data: I am not sure if we have a good generic standard with indication of scale, time, etc.
    • Statistics on probabilities, e.g., a summary on a given area, bootstrap or any other robustness measures.
    • Change of scale (e.g. size of grid) up to changing the grid system (for an equal area grid system: the current cells have 4 times more surface near equator than near the poles) and provide transformation standards.
    • Managing shape files and related statistics, graphs, etc: can be a transect, along a coast line, a geographical area. This includes sharing maps, and analyses.
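The remark on grid systems can be checked with a short calculation. On a regular latitude/longitude grid, a cell spanning latitudes φ1 to φ2 and a longitude width Δλ has area R²·Δλ·(sin φ2 − sin φ1) on a sphere, so cells shrink towards the poles. The sketch below assumes a spherical Earth and a half-degree cell size; neither assumption is stated in the text above, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def cell_area_km2(lat_south, cell_size_deg):
    """Area of a lat/lon grid cell whose southern edge lies at lat_south (degrees)."""
    lat_north = lat_south + cell_size_deg
    d_lon = math.radians(cell_size_deg)
    return (EARTH_RADIUS_KM ** 2) * d_lon * (
        math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    )


def area_ratio_to_equator(lat_south, cell_size_deg=0.5):
    """How many times larger an equatorial cell is than a cell starting at lat_south."""
    return cell_area_km2(0.0, cell_size_deg) / cell_area_km2(lat_south, cell_size_deg)


for lat in (0.0, 30.0, 60.0, 75.0):
    print(lat, round(cell_area_km2(lat, 0.5), 1), round(area_ratio_to_equator(lat), 2))
```

Under these assumptions the factor of about four is reached near 75° latitude, and it keeps growing closer to the pole. Any statistic computed per cell (counts, densities, summed probabilities) therefore needs an area weight, or the data need to be transformed to an equal-area grid.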


Proposed measure

As of January 2012, no measures had been proposed.

Anticipated impact

If the project manages to provide a solid environment for the management of biodiversity data, interest from an even wider community will certainly be raised. However, to make an impact, several conditions must be met: open source, use of well-known technologies, etc.

Catalogue of Life related Data Management Tools

Background

Most species information, including occurrence data, is attached to a scientific species name that is coined to designate a taxon at a specific level. Data from various sources can be integrated for a species only because they are attached to the same name. An alternative would be to attach information to specimens, but for various reasons not detailed here this is not realistic. However, both taxonomy (the way to split the diversity of living organisms into well-defined and identifiable species in a hierarchical classification) and nomenclature (the proper way to assign unique names to the different taxa) are constantly work in progress. This leads to the difficult situation where a species may be designated by several names (synonymy), or a name may designate several species (homonymy), over time or simultaneously according to different authors. The Catalogue of Life (CoL) aims at gathering all species names and their synonymy and homonymy relationships so that interoperability between various information datasets can be automated. As of November 2011, CoL gathers the names for almost 1.4 million of the 1.9 million species estimated to be known to science, from more than 110 separate Global Species Databases (GSDs). It is most important to have easy access to CoL within the iMarine ecosystem, to facilitate the compilation of huge datasets from various sources (e.g., sources not using the same name for the same species) or to link them to other sources. In the marine domain, the main provider of GSDs to CoL is WoRMS; it is WoRMS that is used as the primary (but not exclusive) name reference by OBIS.


Proposed measures

  • A separate VRE or App together with WoRMS; whether to do this can be decided after examining whether or not freshwater data are to be integrated.
  • Tools to check names from OBIS and GBIF;
  • Potentially, this could be extended to a name-checking VRE for primary data providers, so that they can check their names themselves (since the primary providers are the ones who know their own data best) before they provide data to aggregators.
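A name-checking tool of the kind listed above essentially has to tell four cases apart: the name is accepted, it is a synonym of an accepted name, it is ambiguous (a homonym pointing to several taxa), or it is unknown. The following is a minimal sketch of that logic. The reference table is invented for illustration (placeholder names, not taken from CoL or WoRMS), and the function names are assumptions rather than an existing service API.

```python
from collections import defaultdict

# Hypothetical reference rows: (name as recorded, accepted name).
# All names are invented placeholders, not real taxa.
REFERENCE = [
    ("Aus bus", "Aus bus"),      # accepted name
    ("Aus vetus", "Aus bus"),    # synonym of the accepted name
    ("Cus dus", "Cus dus"),      # accepted name
    ("Eus fus", "Eus fus"),      # homonym: the same name used for two taxa
    ("Eus fus", "Gus hus"),
]


def normalize(name):
    """Trim whitespace and lower-case a name so that lookups are tolerant."""
    return " ".join(name.split()).lower()


def build_index(reference):
    """Map each normalized recorded name to the set of accepted names it may denote."""
    index = defaultdict(set)
    for recorded, accepted in reference:
        index[normalize(recorded)].add(accepted)
    return index


def check_name(name, index):
    """Classify a name as 'accepted', 'synonym', 'ambiguous' or 'unknown'."""
    candidates = index.get(normalize(name), set())
    if not candidates:
        return ("unknown", [])
    if len(candidates) > 1:
        return ("ambiguous", sorted(candidates))
    accepted = next(iter(candidates))
    status = "accepted" if normalize(accepted) == normalize(name) else "synonym"
    return (status, [accepted])


index = build_index(REFERENCE)
for name in ("Aus bus", "aus  vetus", "Eus fus", "Xus yus"):
    print(name, check_name(name, index))
```

Returning the "ambiguous" and "unknown" cases explicitly, rather than guessing, is what allows feedback to be sent to the primary data provider.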


Anticipated impact

  • Significantly decrease the human resources spent to cross-check names over multiple datasets;
  • Increase the quality of the primary data sources by sending them feedback on issues (up to providing them with the service to check their data themselves);
  • Integrate the developed tools in the efforts of the Biodiversity Informatics community to establish a Global Name Architecture, in order to implement in the VRE the technological solutions proposed by that project.

Freshwater VRE Data Management Tools

The freshwater VRE would require co-funding, which has not been sourced yet. It is included here to provide scope for future extensions, and as a reference for developers where requirements are likely to emerge.

Background

The existing infrastructure manages data at a rather low spatial and temporal resolution. However, there are no technical limits to increasing the resolution. In the past, data and processing power limited the use of predictive modeling to larger habitats, and few algorithms were developed to describe species distributions over space and time. With the rapid advance in technology, both data and processing can now cope much better with high resolutions, and this has opened the possibility to include freshwater models in the infrastructure. The proposed measures here serve not only the freshwater community, but also aim to bring much higher resolutions and integrated management of metadata on reliability, completeness and ownership to the data products.


Proposed measures

The extension into freshwater is a very welcome addition, although the project is about marine data. The main requirements for freshwater models are also very useful to boost marine niche modeling. In addition, the contribution of inland fisheries and estuaries to capture fisheries and aquaculture is large, and cannot be neglected. A partnership with the FP7-funded project BioFresh will permit iMarine to focus on the technical integration in a VRE, while the content could be managed by BioFresh. The proposed measures aim to boost the precision over temporal and spatial scales, yet build on the existing infrastructure for taxonomic, spatial and environmental data acquisition and management. The need to use higher resolution data calls for access to data providers in the following domains:

  • Species occurrence and distribution; very precise locations (meters!)
  • Environmental data; with temporal resolutions from days to years, averages over time and space, and seasonal products. This requires that the calculations are brought to the data, as it is unlikely that D4Science can contain all relevant data.

The range of algorithms has to be extended, and the proposed contributions of the openModeller community and software will be analyzed for suitability and correctness before they are linked to the infrastructure. Not only the algorithms themselves, but also the management of reliability and ... Finally, the generated products require a stable publication environment. Advanced calculations are of little use if they lose their context or can only be repeated with difficulty. This implies that ALL steps are brought under control of a workflow, and that ALL products contain provenance and quality metadata. Also, in order to be able to analyze the results, the infrastructure must be able to generate statistics with each product, ranging from the number of data points used to correlations between spatial products. Only then will there be a truly scientific contribution from the project, and a sustained interest from a scientific community.
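The requirement that every step runs under workflow control and that every product carries provenance can be sketched as follows. Each step goes through one wrapper, which records the step name, its parameters and the number of data points going in and out. The class, function and field names are assumptions chosen for illustration, not part of the gCube workflow engine.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Product:
    """A data product together with the provenance of every step that produced it."""
    data: list
    provenance: list = field(default_factory=list)


def run_step(product, name, func, **params):
    """Apply func to product.data and return a new Product with one more provenance entry."""
    result = func(product.data, **params)
    entry = {
        "step": name,
        "params": params,
        "n_in": len(product.data),    # number of data points before the step
        "n_out": len(result),         # number of data points after the step
        "ran_at": datetime.now(timezone.utc).isoformat(),
    }
    return Product(data=result, provenance=product.provenance + [entry])


def drop_missing_coordinates(rows):
    """Keep only rows that have both a latitude and a longitude."""
    return [r for r in rows if r.get("lat") is not None and r.get("lon") is not None]


def within_latitudes(rows, south, north):
    """Keep only rows whose latitude lies between south and north (inclusive)."""
    return [r for r in rows if south <= r["lat"] <= north]


# Illustrative input records only.
raw = Product(data=[
    {"lat": 45.0, "lon": 10.0},
    {"lat": None, "lon": 12.0},
    {"lat": 70.0, "lon": 20.0},
])
cleaned = run_step(raw, "drop_missing_coordinates", drop_missing_coordinates)
subset = run_step(cleaned, "within_latitudes", within_latitudes, south=40.0, north=60.0)

for entry in subset.provenance:
    print(entry["step"], entry["params"], entry["n_in"], "->", entry["n_out"])
```

Because the provenance list travels with the product, a published map can always be traced back to the steps, parameters and data-point counts behind it.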


Anticipated impact

  • Many diadromous species are economically important (sturgeons, eels, salmons, some gobies, etc.), and it is important to model and manage their life cycle in both environments, marine and freshwater, if one wants to achieve the Ecosystem Approach to Fisheries;
  • Many primary datasets contain both marine and freshwater information. The VRE will help to clean marine datasets of freshwater data and vice versa.

Species Distribution Data Management Tools

Background

The existing data infrastructure already contains and maintains a species distribution modeling environment: AquaMaps. This meets many of the WFC requirements, but may need extensions to better serve the existing and new CoP members. The VRE and gCubeApp are based on a PostgreSQL database with geospatial extensions, and GeoServer. This architecture will be the reference model for other geospatial data management facilities. In addition, access to environmental datasets is provided by GENESI-DEC, and this can be extended to providers disseminating datasets that are more geographically restricted. The AquaMaps VRE can also be used to compare map products. In D4Science-II, a new model was introduced for the spatial re-allocation of captures. One of the information sources used was species distribution maps, as it is obvious that species captures are defined by species distributions. Bringing these resources together is no easy feat, yet D4Science-II has proven the feasibility of defining VREs that contain data from different domains.


Proposed measures

The component defined here would offer scientists a range of analytical tools to generate maps using:

  1. different information sources,
  2. different formats (maps, KML),
  3. different distribution analysis methods (visual, statistical),
  4. predefined re-allocation mechanisms to change the spatial resolution of a reported occurrence in an area.
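One possible re-allocation mechanism (item 4) is to split a value reported for a large area over the finer grid cells inside it, in proportion to the probability of occurrence that a species distribution map gives for each cell. The sketch below assumes this proportional rule; the actual D4Science-II re-allocation model is not described here, and the cell identifiers and probabilities are invented for illustration.

```python
def reallocate(total, cell_probabilities):
    """Split `total` over cells in proportion to their probability of occurrence.

    cell_probabilities maps a cell identifier to a non-negative weight.
    """
    if any(p < 0 for p in cell_probabilities.values()):
        raise ValueError("probabilities must be non-negative")
    weight_sum = sum(cell_probabilities.values())
    if weight_sum <= 0:
        raise ValueError("at least one cell needs a positive probability")
    return {cell: total * p / weight_sum for cell, p in cell_probabilities.items()}


# Hypothetical cells inside one reporting area, with invented probabilities.
probabilities = {"cell_1": 0.5, "cell_2": 0.25, "cell_3": 0.25, "cell_4": 0.0}
allocated = reallocate(200.0, probabilities)
print(allocated)
```

The rule preserves the reported total and assigns nothing to cells where the species is not expected, which is the property that makes distribution maps useful as an information source for re-allocation.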

The many different reference data used, and their varying quality, require a harmonization module that converts data from one definition scheme to another. Here, the use of a semantic knowledge base (KB) may be required for translations, mapping across coding schemas, disambiguation, etc.

The definition of codelists and their use may require the services of a CodelistManager VRE.

The maintenance of mappings between elements in codelists across spatio-temporal dimensions may require the services of a CodelistMapper VRE.
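A mapping between codelists that changes over time can be sketched as a table of rows with a validity period, queried with the date of the record being converted. The example covers the temporal dimension only; the codes, dates and function name are hypothetical and do not describe the CodelistMapper VRE itself.

```python
from datetime import date

# Hypothetical mapping rows: (source code, target code, valid_from, valid_to).
# valid_to=None means the mapping is still valid.
MAPPINGS = [
    ("A1", "X10", date(1990, 1, 1), date(1999, 12, 31)),
    ("A1", "X11", date(2000, 1, 1), None),
    ("B2", "X20", date(1990, 1, 1), None),
]


def map_code(code, on_date, mappings=MAPPINGS):
    """Return the target code valid for `code` on `on_date`, or None if unmapped."""
    matches = [
        target
        for source, target, start, end in mappings
        if source == code and start <= on_date and (end is None or on_date <= end)
    ]
    if len(matches) > 1:
        # Overlapping validity periods are a data error and should not be guessed away.
        raise ValueError(f"overlapping mappings for {code} on {on_date}")
    return matches[0] if matches else None


print(map_code("A1", date(1995, 6, 1)))
print(map_code("A1", date(2005, 6, 1)))
print(map_code("Z9", date(2005, 6, 1)))
```

A spatial dimension could be added in the same way, by giving each row an area of validity alongside its period.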

Species Distribution related components

  • Management of “shape files” of the distribution with various probabilities, from numeric values to ordinal qualifications (e.g., present, doubtful; similar to the recent FAO maps);
  • Standards to create and exchange distribution statements, e.g., Western central Pacific from the Philippines to Marquesas Islands, from Ryukyu Is. to New Caledonia. In other words, textual descriptions of the shape files. This may require support from a semantic KB.


Anticipated impact

  • Amplify the capacity for data analysis;
  • Move from global perspectives and trends to more restricted areas, e.g., at regional and possibly country level;
  • Propose better prediction tools with respect to climate change.