Difference between revisions of "Statistical cluster"

From D4Science Wiki
(OpenSDMX CodelistManager)
(Appendices (Budget, Resources, Documents, Schedule and Others))
 
(26 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 +
{| align="right"
 +
||__TOC__
 +
|}
  
== Components ==
+
== Position ==
 +
The management of statistical data is a large domain that ranges from the collection of observations on species occurrences or capture, through the curation and aggregation of data and their visualization on maps, to the visual and statistical analysis of both observations and time-series. It requires the import of structured data in various formats, with an emphasis on SDMX datasets. The purpose of the cluster is to produce a low-cost, versatile and reliable data suite that covers the workflow of data from collection to publication and manages an appropriate set of metadata on each dataset, describing e.g. its provenance, ownership, and quality.
  
 +
Compared to other initiatives, iMarine already offers the basic components to load, share, publish and analyze data. This makes the iMarine infrastructure an attractive option for the further development of statistical data components. In addition, the powerful services for data-processing are expected to offer substantial benefits to statistical data managers.
  
== OpenSDMX CodelistManager ==
+
Many data owners in the marine statistical domain have difficulty in gaining consistent access to capture data in enough detail and with relevant metadata. There are concerns about the sheer number of datasets that have to be maintained, with multiple data streams and formats putting pressure on software developers. Concerns about the interoperability of software and the related risk of exploding support costs for software maintenance make OS development in a CoP a potentially attractive proposition.
  
The further development of OpenSDMX in the iMarine project context aims to position OpenSDMX, and thus the iMarine project, as
+
This cluster can build on the considerable experience that has been gained in the acquisition and management of data in previous projects. In addition, many services marshalling the data from “Sea to Shelf” are available: curation, metadata collection, transformation, mapping and repository services, to name a few. The iMarine Biodiversity partners aim to provide a stronger, more resilient and flexible framework for Statistical Data Management. The EA-CoP expects that services that are difficult to maintain in a single organization, such as data mining, time-series analysis, and modeling, can be offered in a very cost-effective manner through iMarine. Collaboration in an Ecosystem Approach Community of Practice can help to achieve that.
# a supplier of services to other SDMX infrastructures, or
+
# a range of services that can interpret SDMX, e.g. by offering SDMX data access and processing services.  
+
  
Where using the word D4Science, the D4Science Infrastructure is meant, which is used in the iMarine project.  
+
This collaboration needs to be based on reliable and free resources. Access to and maintenance of these resources is the responsibility of all partners of the EA-CoP, and users of the supporting eInfrastructure will have to develop and commit to an open data policy.
  
This document starts with describing premises and the SDMX Scoping, followed by the proposed functions to implement:
+
An effective statistical data policy is an iMarine EA-CoP policy; it needs to be defined and approved by the iMarine Board. After all, making clear, fair agreements on the component development of OS software and effectively enforcing these rules will increase political and CoP support for iMarine. This proposal lists some of the components that can help achieve this.
* CodelistManager
+
* Validation/Curation
+
* Artifact Selector
+
* Data Visualization
+
* SDMX2RDF
+
In a related context, CNR has identified the SDMX processing as an opportunity to pursue. The use cases for data-mining and transformation that could benefit from the processing services are not described here.
+
  
'''Premises'''
+
We are aware that statistics are just one, albeit important, facet in the iMarine decision-making processes. To facilitate this, the components not only propose implementation actions but also describe the background and the anticipated impact. We are keen to enter into discussion with other iMarine partners and iMarine supporting institutions. We invite other parties (Board partners, iMarine institutions and CoP) to express their needs.
  
Adopting OpenSDMX is lightweight. Clients may be reluctant to adopt D4Science, but they may be exposed to D4Science capabilities through OpenSDMX and then consider migrating services from OpenSDMX to D4Science. In this way, OpenSDMX can be a cost-effective enabler for D4Science by familiarizing its clients with D4Science. Therefore these premises are defined:
+
== The Statistical Cluster Work Plan ==
* OpenSDMX does not have a dependency with the D4Science infrastructure.
+
* All OpenSDMX artefacts are portable into the D4Science infrastructure.
+
* Developments are done in the context of the OpenSDMX community, directly on the OpenSDMX codebase and follow the OpenSDMX release lifecycle.
+
  
'''Scoping'''
+
=== Goals and Objectives (The Outputs) ===
  
The SDMX specification defines these artefacts: datastructure (DSD), metadatastructure, categoryscheme, conceptscheme, codelist, hierarchicalcodelist, organisationscheme, agencyscheme, dataproviderscheme, dataconsumerscheme, organisationunitscheme, dataflow, metadataflow, reportingtaxonomy, provisionagreement, structureset, process, categorisation, contentconstraint, attachmentconstraint, structure, metadata, schema, data
+
The ensemble of components constituting the statistical cluster can be summarized as:
  
The artefacts written in bold are selected to be part of the iMarine project at this stage (datastructure, conceptscheme, organisationscheme,  codelist, dataflow and data).  
+
* ICIS - The Data Suite offering data management, analysis and production facilities;
 +
* SPREAD - Right in the middle between the statistical and [[Geospatial_cluster#Goals_and_Objectives_.28The_Outputs.29|geospatial]] services, SPREAD will manage the spatial re-allocation of capture data following political and environmental boundaries;
 +
* [[CodelistManager]] - The shared development of Cotrix, acting as a persistent storage 'agnostic' facility;
 +
* Statistical service - The iMarine Specific container and work flow organizer to bring the power of the infra to scientific users;
 +
* R - The interface to the tool of preference of the EA-CoP; either as an integrated tool in ICIS, or as a service in a WPS Hadoop process;
 +
* SDMX registry - The persistence and user orchestrator of the statistical products;
 +
* FLUX - TBD.
  
Whether to take the process artefact on board can be discussed further. In the long term it definitely needs to be taken into account, because it can reflect the processing of data and metadata in the system. The Bank of Italy uses its proprietary Expression Language for this purpose. Adopting either in D4Science will require at least an MoU with a large ‘SDMX’ partner. It will not be discussed here.
+
==== Considerations ====
  
OpenSDMX is divided in 2 parts, core and plus. OpenSDMX-Core is the implementation of the SDMX REST specification with the concept of adapters. OpenSDMX-Plus contains all the functions which are additions to core, like CodelistManager, Validation, Artefact Selector, SDMX2RDF and DataVisualization.  
+
The statistical service aims to leverage the power of the e-infrastructure in a comprehensible data-management environment. Data would 'live' in this infrastructure from their collection through their publication in a variety of formats to external systems. At all stages, these data would be enriched with a metadata set that provides information on the life-cycle and quality of the data.
  
See the diagram below for the dependencies of the different software components and how the components relate together.  
+
At a high level, this vision takes inspiration from the GSBPM, where data are also approached from an integrated perspective.  
  
[[File:FishFrame2Sdmx.PNG]]
+
However, the approach in iMarine goes further, in that the direct availability of, for instance, geospatial and biodiversity data enables rich products that are difficult to find in other e-infrastructures.
  
'''CodelistManager'''
+
=== Resources and Constraints (The Inputs) ===
  
Functions distinguished for a CodelistManager are:
+
The '''Business Cases''' requirements are inputs for the cluster; they come from three Business Cases:
* Maintenance (adding, changing or deleting codes and/or descriptions)
+
* Importing from CSV/SDMX / RDF / FishFrame
+
* Versioning
+
* Publishing (of a new version)
+
* Validity (Where when for who is it authoritative / reference / candidate)
+
Possible contexts in which these functions need to be performed are
+
* The codelist is already stored in an existing datastore (a datastore can be a database or a data access layer):
+
** All functions are performed on this datastore (A).
+
** An initial codelist is loaded from the datastore and will be copied in the CodelistManager.  The subsequent lifecycle will happen in the CodelistManager (B).
+
* The codelist is a file. The file is loaded in the CodelistManager. Most of functions are performed in the CodelistManager. Additions of codes may happen by uploading new codes for the Codelist (B).
+
  
 +
* the EU Common Fishery Policy;
 +
* the FAO deep seas fisheries programme;
 +
* and the UN EAF Ecosystem Approach to fisheries. 
  
Impact of Option A: the OpenSDMX instance does not have its own database.
+
'''Use cases''' are often not specific to one of the above, but are either a generic statistical data function or, at the other extreme, a very generic data storage environment. Some examples, from the framework level down to detailed requirements, are:
+
* Generic storage and distribution solution that can be consumed by external parties; this is where GeoNetwork and the SDMX registry are positioned;
Impact of option B, the OpenSDMX instance does have its own database.  
+
* Data mining and pattern recognition;
 +
* A generic tool for data processing, such as R;
 +
* Validation, QA and QC functionality.  
  
'''Validation/Curation'''
+
'''Other inputs'''
  
Vision on this has been worked out already here:
+
In this cluster, the expected datasets to be managed are contained in:
http://opensdmxdevelopers.wikispaces.com/Curation
+
The discussion on the level of validation in the context of SDMX is currently led by Eurostat. Involved parties are the Bank of Italy, Metadata Technology, Agillis and FAO. Apart from the precise outcome of these discussions, it is clear that there is a need for an infrastructure which can load/cure/validate SDMX datasets.
+
  
'''SDMX Artefact Selector'''
+
* FAO Global and Regional capture datasets;
 +
* FAO reference data exposed through e.g. the SDMX registry;
 +
* Community data sources: FAO Tuna Atlas (tropical tuna data), IRD Tuna Atlas (tropical tuna data);
 +
* Other fisheries data (catches of fisheries targeting tuna, bycatch of tuna fisheries, scientific tagging data);
 +
* Vessel position data;
 +
* Species occurrence data.
  
This scenario is inspired by my interpretation of the data.fao.org principle:
+
Species distributions, occurrences data of other fisheries databases, statistical data by geospatial area, biological parameters.  
* Guide the user to the data or metadata in a highly user friendly and pleasant way
+
* Give the data to the user
+
* So the user can go away to do whatever he wants to do with the data.
+
The data.fao.org offers a simple way to find SDMX data, using the SDMX REST API. There is a need for a user interface which leads the user in a simple way to the SDMX data and metadata.
+
The SDMX Artefact Selector could also be called a SDMX Registry and Repository Browser.  
+
  
'''SDMX2RDF'''
+
'''Constraints'''  
  
http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/index.html
+
Very often, the data that feed this cluster are of 'poor' quality. That does not mean they are unreliable, but their history, precision and accuracy cannot be deduced from the datasets themselves. In addition, data providers are often bound by contractual obligations not to disclose data or their metadata (if these are produced).
  
There is an interesting group working on the transformation of the SDMX model into RDF. This work can be adopted in order to publish SDMX datasets also in RDF.
+
=== Strategy and Actions (from Inputs to Outputs) ===
  
'''SDMX Data Visualization'''
+
The statistical cluster can build on several long-term residents in the D4Science infrastructure:
 +
* ICIS will be the tool for ingestion and curation. It will be the base on which additional functionality will be developed.
 +
* CLM will remain the tool of choice for the identification of reference data in datasets. However, it will have to be enriched with capacities to interoperate with community software for code list management.
 +
* The statistical service will be the tool where data analysis will be performed. This may require the incorporation of a data warehouse in the infrastructure, e.g. to provide trend analysis and frequency analysis on time-series
 +
* R, the EA-CoP tool of choice, will have to be made available in interoperable scenarios, i.e. not as in the current implementation where data flow in one direction only.
  
In order to make data visible and findable for search engines, a user interface is needed to visualize the SDMX artefacts. The first artefact to visualize is the SDMX dataset. The DSD can be used to express the data in the different languages.
+
The partners in WP3 carry a responsibility not only to offer requirements and use cases, but also to contribute tools that may have to be adjusted for inclusion in the wider e-infrastructure.
  
'''FishFrame2SDMX'''
+
=== Appendices (Resources, Documents, Schedules and Others) ===
  
FishFrame is an upcoming standard for data collection and dissemination in the Fisheries domain, read more here: http://km.fao.org/FIGISwiki/index.php/FishFrame
+
==== Documents ====
  
In addition, a dedicated section in this Cluster page outlines the FishFrame plans in iMarine.
+
[[User Interface Harmonization]]
 
+
IRD is using FishFrame for data dissemination. As a standard, FishFrame is similar to SDMX in intention, but only covers the Fisheries domain. IRD advised that a conversion from FishFrame to SDMX makes more sense than the other way around. The rationale is that it is important to publish FishFrame data according to standards like SDMX that are accepted outside the Fisheries community. Conversion from FishFrame to SDMX will also result in a profound understanding of how the two standards relate to each other. This is highly valuable knowledge for the iMarine project. Conversion from SDMX to FishFrame is not planned yet, but is not excluded in the long term.
+
 
+
The picture below shows the position of the FishFrame2SDMX converter:
+
[[File:FishFrame2Sdmx.PNG]]
+
 
+
The converter will generate SDMX codelists, datastructures and datasets.
+
 
+
FishFrame does not have a dissemination protocol like the SDMX REST API.  OpenSDMX implements this protocol and the converter can be packaged as an adapter in order to publish FishFrame as SDMX artefacts through the SDMX REST API.
+
[[File:OpenSdmxArtifactFishframe.PNG]]
+
+
The above pattern can be applied one to one in D4Science.
+

Latest revision as of 13:39, 12 June 2013

Position

The management of statistical data is a large domain that ranges from the collection of observations on species occurrences or capture, through the curation and aggregation of data and their visualization on maps, to the visual and statistical analysis of both observations and time-series. It requires the import of structured data in various formats, with an emphasis on SDMX datasets. The purpose of the cluster is to produce a low-cost, versatile and reliable data suite that covers the workflow of data from collection to publication and manages an appropriate set of metadata on each dataset, describing e.g. its provenance, ownership, and quality.

Compared to other initiatives, iMarine already offers the basic components to load, share, publish and analyze data. This makes the iMarine infrastructure an attractive option for the further development of statistical data components. In addition, the powerful services for data-processing are expected to offer substantial benefits to statistical data managers.

Many data owners in the marine statistical domain have difficulty in gaining consistent access to capture data in enough detail and with relevant metadata. There are concerns about the sheer number of datasets that have to be maintained, with multiple data streams and formats putting pressure on software developers. Concerns about the interoperability of software and the related risk of exploding support costs for software maintenance make OS development in a CoP (Community of Practice) a potentially attractive proposition.

This cluster can build on the considerable experience that has been gained in the acquisition and management of data in previous projects. In addition, many services marshalling the data from “Sea to Shelf” are available: curation, metadata collection, transformation, mapping and repository services, to name a few. The iMarine Biodiversity partners aim to provide a stronger, more resilient and flexible framework for Statistical Data Management. The EA-CoP expects that services that are difficult to maintain in a single organization, such as data mining, time-series analysis, and modeling, can be offered in a very cost-effective manner through iMarine. Collaboration in an Ecosystem Approach Community of Practice can help to achieve that.

This collaboration needs to be based on reliable and free resources. Access to and maintenance of these resources is the responsibility of all partners of the EA-CoP, and users of the supporting eInfrastructure will have to develop and commit to an open data policy.

An effective statistical data policy is an iMarine EA-CoP policy; it needs to be defined and approved by the iMarine Board. After all, making clear, fair agreements on the component development of OS software and effectively enforcing these rules will increase political and CoP support for iMarine. This proposal lists some of the components that can help achieve this.

We are aware that statistics are just one, albeit important, facet in the iMarine decision-making processes. To facilitate this, the components not only propose implementation actions but also describe the background and the anticipated impact. We are keen to enter into discussion with other iMarine partners and iMarine supporting institutions. We invite other parties (Board partners, iMarine institutions and CoP) to express their needs.

The Statistical Cluster Work Plan

Goals and Objectives (The Outputs)

The ensemble of components constituting the statistical cluster can be summarized as:

  • ICIS - The Data Suite offering data management, analysis and production facilities;
  • SPREAD - Right in the middle between the statistical and geospatial services, SPREAD will manage the spatial re-allocation of capture data following political and environmental boundaries;
  • CodelistManager - The shared development of Cotrix, acting as a persistent storage 'agnostic' facility;
  • Statistical service - The iMarine Specific container and work flow organizer to bring the power of the infra to scientific users;
  • R - The interface to the tool of preference of the EA-CoP; either as an integrated tool in ICIS, or as a service in a WPS Hadoop process;
  • SDMX registry - The persistence and user orchestrator of the statistical products;
  • FLUX - TBD.

Considerations

The statistical service aims to leverage the power of the e-infrastructure in a comprehensible data-management environment. Data would 'live' in this infrastructure from their collection through their publication in a variety of formats to external systems. At all stages, these data would be enriched with a metadata set that provides information on the life-cycle and quality of the data.
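
As a rough illustration of what such life-cycle metadata could look like, the following Python sketch attaches a small record to a dataset and appends an entry at each stage. The class names, field names and stage labels are hypothetical and are not taken from any iMarine or D4Science component; they only mirror the provenance, ownership and quality attributes mentioned in this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LifecycleEvent:
    """One step in a dataset's life, e.g. collection or curation (hypothetical)."""
    stage: str
    actor: str
    note: str = ""


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record that travels with a dataset."""
    provenance: str
    owner: str
    quality: str = "unknown"
    history: List[LifecycleEvent] = field(default_factory=list)

    def record(self, stage: str, actor: str, note: str = "") -> None:
        # Append rather than overwrite, so the full life-cycle stays traceable.
        self.history.append(LifecycleEvent(stage, actor, note))

    def stages(self) -> List[str]:
        # Return the stage labels in the order they were recorded.
        return [event.stage for event in self.history]


# Illustrative values only; no real provider or dataset is implied.
meta = DatasetMetadata(provenance="example provider", owner="example owner")
meta.record("collection", "data provider")
meta.record("curation", "data manager", "codes checked against a reference list")
meta.record("publication", "data manager")
```

Keeping the history as an append-only list is one possible design choice; it lets a consumer of a published dataset see which stages the data went through, even when the quality itself is still marked as unknown.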

At a high level, this vision takes inspiration from the GSBPM, where data are also approached from an integrated perspective.

However, the approach in iMarine goes further, in that the direct availability of, for instance, geospatial and biodiversity data enables rich products that are difficult to find in other e-infrastructures.

Resources and Constraints (The Inputs)

The Business Cases requirements are inputs for the cluster; they come from three Business Cases:

  • the EU Common Fishery Policy;
  • the FAO deep seas fisheries programme;
  • and the UN EAF Ecosystem Approach to fisheries.

Use cases are often not specific to one of the above, but are either a generic statistical data function or, at the other extreme, a very generic data storage environment. Some examples, from the framework level down to detailed requirements, are:

  • Generic storage and distribution solution that can be consumed by external parties; this is where GeoNetwork and the SDMX registry are positioned;
  • Data mining and pattern recognition;
  • A generic tool for data processing, such as R;
  • Validation, QA and QC functionality.

Other inputs

In this cluster, the expected datasets to be managed are contained in:

  • FAO Global and Regional capture datasets;
  • FAO reference data exposed through e.g. the SDMX registry;
  • Community data sources: FAO Tuna Atlas (tropical tuna data), IRD Tuna Atlas (tropical tuna data);
  • Other fisheries data (catches of fisheries targeting tuna, bycatch of tuna fisheries, scientific tagging data);
  • Vessel position data;
  • Species occurrence data.

Species distributions, occurrences data of other fisheries databases, statistical data by geospatial area, biological parameters.

Constraints

Very often, the data that feed this cluster are of 'poor' quality. That does not mean they are unreliable, but their history, precision and accuracy cannot be deduced from the datasets themselves. In addition, data providers are often bound by contractual obligations not to disclose data or their metadata (if these are produced).

Strategy and Actions (from Inputs to Outputs)

The statistical cluster can build on several long-term residents in the D4Science infrastructure:

  • ICIS will be the tool for ingestion and curation. It will be the base on which additional functionality will be developed.
  • CLM will remain the tool of choice for the identification of reference data in datasets. However, it will have to be enriched with capacities to interoperate with community software for code list management.
  • The statistical service will be the tool where data analysis will be performed. This may require the incorporation of a data warehouse in the infrastructure, e.g. to provide trend analysis and frequency analysis on time-series
  • R, the EA-CoP tool of choice, will have to be made available in interoperable scenarios, i.e. not as in the current implementation where data flow in one direction only.
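
To make the trend analysis on time-series mentioned in the list above more concrete, here is a minimal Python sketch that estimates a linear trend with ordinary least squares. It is not the method used by the statistical service, which this page does not specify; the function name `linear_trend` and the yearly capture values are made up for illustration only.

```python
from typing import Sequence


def linear_trend(years: Sequence[float], values: Sequence[float]) -> float:
    """Return the ordinary least-squares slope (value units per year)."""
    if len(years) != len(values) or len(years) < 2:
        raise ValueError("need two or more paired observations")
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    # Sum of squared deviations of the time axis.
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    # Sum of cross-deviations between time and value.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


# Made-up yearly capture figures in tonnes, for illustration only.
years = [2008, 2009, 2010, 2011, 2012]
catch_tonnes = [100.0, 110.0, 120.0, 130.0, 140.0]
slope = linear_trend(years, catch_tonnes)
```

With these illustrative figures the series grows by exactly 10 tonnes each year, so the estimated slope is 10 tonnes per year. In practice the EA-CoP would more likely run such an analysis in R, its tool of choice.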

The partners in WP3 carry a responsibility not only to offer requirements and use cases, but also to contribute tools that may have to be adjusted for inclusion in the wider e-infrastructure.

Appendices (Resources, Documents, Schedules and Others)

Documents

User Interface Harmonization