Ede issues


iMarine Environmental Data Enrichment

Background

Many organisations collect, manage and redistribute marine environmental data of all kinds:

  • ‘Static’ data that doesn’t change or changes very slowly (such as bathymetry), versus ‘dynamic’ data, that changes all the time (most other oceanographic variables, such as temperature or salinity)
  • Spatially 2D or 3D data. 2D data are, for example, sea surface measurements (SST, SSS) or bottom characteristics (e.g. bathymetry, sediment composition). 3D data are measurements taken in the water column that vary with depth (e.g. temperature and salinity). 2D data deserve special attention:
    • Surface data, because there are so many of them (partly through remote sensing, but also through ‘ships of opportunity’ that take surface-only measurements)
    • Bottom data, because a very large component of the biodiversity is benthic or demersal
  • Remotely-sensed data, versus in-situ measurements. Remote sensing is largely restricted to sea surface measurements (e.g. wave energy, ‘colour’). In situ measurements can be anything, from automated data from Argo floats, to species counts of phytoplankton, and can be made anywhere in the water column

Many of these data sets are publicly available; unfortunately, the large number of different formats makes accessing the data non-trivial. A few unifying tools and standards exist to facilitate access, but their use is not trivial either, which limits what many potential end users can do with the vast amount of freely available data.

A specific problem for biologists/species distribution modellers is that the available biogeographical data are not necessarily co-located with the environmental data. Even in cases where salinity, temperature… were measured when the biological sample was taken, often the connection between the two streams of data was lost, and the biologist is left to reconstruct the environmental conditions from the public archives discussed above.

As a short-cut, many (including OBIS) reconstruct the environmental conditions on the basis of summarised data such as gridded data; in the case of OBIS this is the World Ocean Atlas (WOA). Clearly this is less than optimal, as the horizontal gridding of WOA has a resolution of 1 degree and depth is reduced to 33 standard depths. Specifically for OBIS, the data were extracted from the climatological means, so the temporal aspect (including seasonality) is lost. So for many species distribution models, the resolution with which the (physical) environment is known limits the type of models we can sensibly run.
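
To make the loss of resolution concrete, the sketch below snaps an observation to the centre of a 1-degree cell and to the nearest standard depth. It is an illustration only: the function names are hypothetical, the cells are assumed to be aligned on whole degrees, and the list of depth levels is a made-up subset rather than the actual 33 standard depths.

```python
from math import floor
from typing import Sequence, Tuple


def snap_to_one_degree(lat: float, lon: float) -> Tuple[float, float]:
    """Return the centre of the 1-degree cell containing (lat, lon).

    Assumes cells are aligned on whole degrees, so centres fall on x.5.
    Positions exactly on the northern or eastern edge are not handled specially.
    """
    return floor(lat) + 0.5, floor(lon) + 0.5


def nearest_standard_depth(depth_m: float, levels_m: Sequence[float]) -> float:
    """Return the depth level closest to depth_m."""
    return min(levels_m, key=lambda level: abs(level - depth_m))


# Hypothetical subset of depth levels in metres, for illustration only.
EXAMPLE_LEVELS_M = [0, 10, 20, 30, 50, 75, 100]

cell = snap_to_one_degree(51.23, 2.91)
level = nearest_standard_depth(37.0, EXAMPLE_LEVELS_M)
print(cell, level)
```

Two samples taken at latitudes 51.01 and 51.99 fall in the same cell and would therefore receive the same gridded value, however different the local conditions were.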

The problem

Two aspects limit the use of environmental data:

  • The heterogeneity and complexity of the sources of environmental data
  • The lack of environmental data at the exact position of interest (in time and 3D space, referred to below as 4D), and the coarse resolution of data summaries such as the World Ocean Atlas.

A service that could help end users of environmental data

Specialist users will obviously find all relevant data (at least, if they’re worth their salary). But for the non-specialist user, for example the biogeographer trying to feed his/her species distribution models with environmental data, it would be useful to have two tools:

  • A tool to extract data within a 4D area of interest; this area of interest will probably be centred on the user’s biological observations
  • A tool to interpolate environmental conditions from the data extracted above, to the precise 4D position (or 4D range) of the biological observation.
    • If the 4D position is a single 4D point, the service should return a value for the environmental parameter, plus metrics on the reliability of the interpolation (e.g. number of actual observations of the environmental parameter that the estimate was based on, confidence interval…)
    • Often the 4D position will be a ‘range’, that is, a 4D volume: the biological sample was the result of a tow over a non-trivial distance (horizontal and/or vertical), the precision with which the coordinates are known is not perfect, we only know the month and not the precise date/time of the observation… In these cases we need to know at least the range of the environmental parameter in this 4D volume, plus the reliability of the minimum and maximum of the interpolation.
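
The two tools above could be sketched as follows. This is a minimal illustration, not a proposed implementation: the record layout, the names `Obs`, `Box`, `extract` and `estimate_at_point`, and the scale factors that make the four axes comparable are all assumptions. Inverse-distance weighting is used only as a stand-in for whatever interpolation method the service would adopt, and longitude wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Obs:
    """One environmental measurement at a 4D position (hypothetical layout)."""
    lon: float    # degrees east
    lat: float    # degrees north
    depth: float  # metres, positive downwards
    time: float   # days since an arbitrary epoch
    value: float  # measured parameter, e.g. temperature


@dataclass(frozen=True)
class Box:
    """Axis-aligned 4D area of interest; each field is a (min, max) pair."""
    lon: Tuple[float, float]
    lat: Tuple[float, float]
    depth: Tuple[float, float]
    time: Tuple[float, float]

    def contains(self, o: Obs) -> bool:
        return (self.lon[0] <= o.lon <= self.lon[1]
                and self.lat[0] <= o.lat <= self.lat[1]
                and self.depth[0] <= o.depth <= self.depth[1]
                and self.time[0] <= o.time <= self.time[1])


def extract(observations: List[Obs], box: Box) -> List[Obs]:
    """Tool 1: keep only the observations inside the 4D area of interest."""
    return [o for o in observations if box.contains(o)]


# Assumed scale factors: 1 degree, 1 degree, 100 m and 30 days count as
# "equally far". A real service would need a defensible choice here.
SCALE = (1.0, 1.0, 100.0, 30.0)


def estimate_at_point(observations: List[Obs], lon: float, lat: float,
                      depth: float, time: float) -> Optional[dict]:
    """Tool 2: inverse-distance-weighted estimate plus simple reliability metrics."""
    if not observations:
        return None
    weighted_sum = weight_total = 0.0
    for o in observations:
        d2 = (((o.lon - lon) / SCALE[0]) ** 2 + ((o.lat - lat) / SCALE[1]) ** 2
              + ((o.depth - depth) / SCALE[2]) ** 2
              + ((o.time - time) / SCALE[3]) ** 2)
        if d2 == 0.0:
            # An observation at the exact position: use its value directly.
            weighted_sum, weight_total = o.value, 1.0
            break
        weighted_sum += o.value / d2
        weight_total += 1.0 / d2
    values = [o.value for o in observations]
    return {"estimate": weighted_sum / weight_total, "n_obs": len(values),
            "min": min(values), "max": max(values)}


all_obs = [Obs(0, 0, 0, 0, 10.0), Obs(2, 0, 0, 0, 20.0), Obs(50, 0, 0, 0, 99.0)]
area = Box(lon=(-1, 3), lat=(-1, 1), depth=(0, 10), time=(-5, 5))
nearby = extract(all_obs, area)
result = estimate_at_point(nearby, lon=1.0, lat=0.0, depth=0.0, time=0.0)
print(result)
```

For a 4D range rather than a point, the `min` and `max` of the extracted observations give a first, crude answer; note that they describe the observed range only and say nothing yet about the reliability of those extremes.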

Ideally, the end user would send to a service:

  • A list of 4D positions (or 4D ranges), and a list of environmental parameters of interest. As an alternative to ‘depth’, the user can request ‘value at surface’ or ‘value at bottom’

and get back

  • The list of 4D positions (or ranges) expanded with, for each of the environmental parameters: an estimate of that parameter (or minimum and maximum), and metrics of the reliability
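
As a rough sketch of this exchange, the request and response could be plain lists of records. Everything named here is hypothetical (the field names, the `enrich` function, the `"surface"` and `"bottom"` keywords, and the placeholder estimator that returns fixed numbers); the point is only to show the input list coming back expanded with one result per environmental parameter.

```python
from typing import Callable, List

# Assumed keywords standing in for 'value at surface' and 'value at bottom'.
ALLOWED_DEPTH_KEYWORDS = ("surface", "bottom")


def enrich(positions: List[dict], parameters: List[str],
           estimator: Callable[[dict, str], dict]) -> List[dict]:
    """Return the input positions, each expanded with per-parameter results."""
    enriched = []
    for pos in positions:
        depth = pos["depth"]
        if isinstance(depth, str) and depth not in ALLOWED_DEPTH_KEYWORDS:
            raise ValueError(f"unknown depth keyword: {depth!r}")
        row = dict(pos)  # copy, so the caller's record is left unchanged
        row["environment"] = {p: estimator(pos, p) for p in parameters}
        enriched.append(row)
    return enriched


def dummy_estimator(position: dict, parameter: str) -> dict:
    """Placeholder returning fixed numbers; a real service would interpolate."""
    return {"estimate": 12.5, "n_obs": 4, "source": "example_archive"}


request_positions = [
    {"lon": 2.9, "lat": 51.2, "depth": "surface", "time": "2010-06"},
    {"lon": 3.1, "lat": 51.4, "depth": 25.0, "time": "2010-06-14"},
]
response = enrich(request_positions, ["temperature", "salinity"], dummy_estimator)
print(response[0])
```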

In addition, if there are several alternative sources for the same environmental parameter, the end user could optionally limit or select the source of the returned data. The source should also be returned together with the data, to leave an ‘audit trail’ and to facilitate data citation.
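
One way the optional source selection might look is sketched below. The function `pick_source`, the source names and the rule of preferring the candidate based on the most observations are assumptions made for illustration; the text above does not prescribe how to choose between sources.

```python
from typing import List, Optional


def pick_source(candidates: List[dict],
                allowed: Optional[List[str]] = None) -> Optional[dict]:
    """Choose one estimate among alternative sources.

    candidates: dicts with 'source', 'estimate' and 'n_obs' keys.
    allowed: optional list of source names the user accepts; None means any.
    The returned dict keeps its 'source' key, so the origin travels with the value.
    """
    pool = [c for c in candidates if allowed is None or c["source"] in allowed]
    if not pool:
        return None
    # Assumed tie-breaking rule: prefer the estimate built on most observations.
    return max(pool, key=lambda c: c["n_obs"])


candidates = [
    {"source": "gridded_atlas", "estimate": 11.8, "n_obs": 3},
    {"source": "in_situ_archive", "estimate": 12.4, "n_obs": 12},
]
print(pick_source(candidates))
print(pick_source(candidates, allowed=["gridded_atlas"]))
```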