Statistical cluster

From D4Science Wiki
Revision as of 13:55, 27 January 2012 by Anton.ellenbroek (Talk | contribs) (OpenSDMX CodelistManager)


Components

OpenSDMX CodelistManager

The further development of OpenSDMX in the iMarine project context aims to position OpenSDMX, and thus the iMarine project, as

  1. a supplier of services to other SDMX infrastructures, or
  2. a range of services that can interpret SDMX, e.g. by offering SDMX data access and processing services.

Where the word D4Science is used, it refers to the D4Science infrastructure (an e-Infrastructure operated by the D4Science.org initiative), which is used in the iMarine project.

This document first describes the premises and the SDMX scoping, followed by the functions proposed for implementation:

  • CodelistManager
  • Validation/Curation
  • Artefact Selector
  • Data Visualization
  • SDMX2RDF

In a related context, CNR has identified SDMX processing as an opportunity to pursue. The use cases for data-mining and transformation that could benefit from the processing services are not described here.

Premises

Adopting OpenSDMX is lightweight. Clients may be reluctant to adopt D4Science. They may be exposed to D4Science capabilities through OpenSDMX, and then consider migrating services from OpenSDMX to D4Science. In this way, OpenSDMX can be a cost-effective enabler for D4Science, getting its clients familiar with the infrastructure. Therefore these premises are defined:

  • OpenSDMX does not have a dependency on the D4Science infrastructure.
  • All OpenSDMX artefacts are portable into the D4Science infrastructure.
  • Developments are done in the context of the OpenSDMX community, directly on the OpenSDMX codebase and follow the OpenSDMX release lifecycle.

Scoping

The SDMX specification defines these artefacts: datastructure (DSD), metadatastructure, categoryscheme, conceptscheme, codelist, hierarchicalcodelist, organisationscheme, agencyscheme, dataproviderscheme, dataconsumerscheme, organisationunitscheme, dataflow, metadataflow, reportingtaxonomy, provisionagreement, structureset, process, categorisation, contentconstraint, attachmentconstraint, structure, metadata, schema, data

The artefacts selected to be part of the iMarine project at this stage are datastructure, conceptscheme, organisationscheme, codelist, dataflow and data.

Whether to take the process artefact on board can be discussed further. In the long term it definitely needs to be taken into account, because it can reflect how data and metadata are processed in the system. The Bank of Italy uses its proprietary Expression Language for this purpose. Adopting either in D4Science will require at least an MOU with a large ‘SDMX’ partner. It will not be discussed here.

OpenSDMX is divided into two parts, core and plus. OpenSDMX-Core is the implementation of the SDMX REST specification with the concept of adapters. OpenSDMX-Plus contains all the functions that are additions to core, such as CodelistManager, Validation, Artefact Selector, SDMX2RDF and DataVisualization.

See the diagram below for the dependencies between the software components and how the components relate to each other.

FishFrame2Sdmx.PNG

CodelistManager

Functions distinguished for a CodelistManager are:

  • Maintenance (adding, changing or deleting codes and/or descriptions)
  • Importing from CSV / SDMX / RDF / FishFrame
  • Versioning
  • Publishing (of a new version)
  • Validity (where, when and for whom it is authoritative / reference / candidate)
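The functions listed above can be illustrated with a minimal sketch. It is not taken from the OpenSDMX codebase: the class names `CodelistVersion` and `ManagedCodelist`, the method names, and the idea of holding a draft next to a list of published snapshots are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CodelistVersion:
    """One published snapshot of a codelist (hypothetical model)."""
    version: str
    codes: Dict[str, str]        # code -> description
    status: str = "candidate"    # candidate / reference / authoritative


@dataclass
class ManagedCodelist:
    """Working draft plus published history (illustrative only)."""
    codelist_id: str
    draft: Dict[str, str] = field(default_factory=dict)
    published: List[CodelistVersion] = field(default_factory=list)

    # Maintenance: add a code or change its description.
    def put_code(self, code: str, description: str) -> None:
        self.draft[code] = description

    # Maintenance: delete a code; unknown codes are ignored.
    def delete_code(self, code: str) -> None:
        self.draft.pop(code, None)

    # Versioning and publishing: freeze a copy of the draft under a new version.
    def publish(self, version: str, status: str = "candidate") -> CodelistVersion:
        if any(v.version == version for v in self.published):
            raise ValueError(f"version {version} is already published")
        snapshot = CodelistVersion(version, dict(self.draft), status)
        self.published.append(snapshot)
        return snapshot

    # Most recently published version, or None if nothing was published yet.
    def latest(self) -> Optional[CodelistVersion]:
        return self.published[-1] if self.published else None
```

In this sketch a published version is a copy of the draft, so later maintenance does not alter what was already published. The validity function is reduced to a single status field; the "where, when and for whom" dimensions would need a richer model.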

Possible contexts in which these functions need to be performed are:

  • The codelist is already stored in an existing datastore (a datastore can be a database or a data access layer):
    • All functions are performed on this datastore (A).
    • An initial codelist is loaded from the datastore and copied into the CodelistManager. The subsequent lifecycle happens in the CodelistManager (B).
  • The codelist is a file. The file is loaded into the CodelistManager. Most functions are performed in the CodelistManager. Codes may be added by uploading new codes for the codelist (B).


Impact of option A: the OpenSDMX instance does not have its own database.

Impact of option B: the OpenSDMX instance does have its own database.

Validation/Curation

A vision on this has already been worked out here: http://opensdmxdevelopers.wikispaces.com/Curation. The discussion on the level of validation in the context of SDMX is currently led by Eurostat. Involved parties are the Bank of Italy, Metadata Technology, Agillis and FAO. Whatever the precise outcome of these discussions, it is clear that there is a need for an infrastructure which can load, curate and validate SDMX datasets.

SDMX Artefact Selector

This scenario is inspired by my interpretation of the data.fao.org principle:

  • Guide the user to the data or metadata in a highly user-friendly and pleasant way
  • Give the data to the user
  • So the user can go away to do whatever he wants to do with the data.

data.fao.org offers a simple way to find SDMX data, using the SDMX REST API. There is a need for a user interface which leads the user in a simple way to the SDMX data and metadata. The SDMX Artefact Selector could also be called an SDMX Registry and Repository Browser.
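As a rough illustration of what such a browser would do behind the user interface, the sketch below builds query URLs following the path layout of the SDMX 2.1 REST specification (`resource/agencyID/resourceID/version` for structures, `data/flowRef/key` for data). The base URL, the function names and the example identifiers are hypothetical, and the list of resources is limited to the artefacts in scope for iMarine.

```python
from urllib.parse import quote

# Structural artefacts in scope for iMarine at this stage.
STRUCTURE_RESOURCES = {
    "datastructure", "conceptscheme", "organisationscheme",
    "codelist", "dataflow",
}


def structure_url(base, resource, agency_id="all",
                  resource_id="all", version="latest"):
    """Build a structure query URL: base/resource/agencyID/resourceID/version."""
    if resource not in STRUCTURE_RESOURCES:
        raise ValueError(f"unsupported resource: {resource}")
    parts = [resource, agency_id, resource_id, version]
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


def data_url(base, flow_ref, key="all"):
    """Build a data query URL: base/data/flowRef/key."""
    # '.', '+' and ',' carry meaning in SDMX keys and flow references,
    # so they are left unescaped.
    parts = ["data", flow_ref, key]
    return base.rstrip("/") + "/" + "/".join(quote(p, safe=".+,") for p in parts)
```

For example, `structure_url("http://example.org/sdmx", "codelist", "FAO", "CL_SPECIES")` yields a URL ending in `codelist/FAO/CL_SPECIES/latest`. A selector could present each path segment as one step in a guided choice.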

SDMX2RDF

http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/index.html

There is an interesting group working on the transformation of the SDMX model into RDF. This work can be adopted in order to publish SDMX datasets also in RDF.
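A minimal sketch of what publishing a codelist in RDF could look like is given below. It assumes the common approach of representing a code list as a SKOS concept scheme with one concept per code; the base URI, the function name and the choice of Turtle as output are assumptions for illustration, not the mapping defined by the group mentioned above.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def codelist_to_turtle(base_uri, codelist_id, codes, lang="en"):
    """Serialize a codelist (code -> label) as a SKOS concept scheme in Turtle."""
    scheme = f"<{base_uri}{codelist_id}>"
    lines = [SKOS_PREFIX, "", f"{scheme} a skos:ConceptScheme ."]
    for code, label in sorted(codes.items()):
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(
            f"<{base_uri}{codelist_id}/{code}> a skos:Concept ; "
            f"skos:inScheme {scheme} ; "
            f'skos:notation "{code}" ; '
            f'skos:prefLabel "{escaped}"@{lang} .'
        )
    return "\n".join(lines) + "\n"
```

The sketch assumes that codes contain only characters that are safe inside a URI; a real converter would need to encode them and to handle multilingual labels.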

SDMX Data Visualization

In order to make data visible and findable for search engines, a user interface is needed to visualize the SDMX artefacts. The first artefact to visualize is the SDMX dataset. The DSD can be used to express the data in the different languages.

FishFrame2SDMX

FishFrame is an upcoming standard for data collection and dissemination in the Fisheries domain. Read more here: http://km.fao.org/FIGISwiki/index.php/FishFrame

In addition, a dedicated section in this Cluster page outlines the FishFrame plans in iMarine.

IRD is using FishFrame for data dissemination. As a standard, FishFrame is similar in intention to SDMX, but it covers only the Fisheries domain. IRD advised that a conversion from FishFrame to SDMX makes more sense than the other way around. The rationale is that it is important to publish the FishFrame format according to standards like SDMX, which are accepted outside the Fisheries community. Conversion from FishFrame to SDMX will also result in a profound understanding of how the two standards relate to each other. This is highly valuable knowledge for the iMarine project. Conversion from SDMX to FishFrame is not planned yet, but it is not excluded in the long term.

The picture below shows the position of the FishFrame2SDMX converter:

FishFrame2Sdmx.PNG

The converter will generate SDMX codelists, datastructures and datasets.
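The codelist part of such a conversion could be sketched as follows. The input is assumed to be a list of (code, description) pairs already extracted from a FishFrame source; how FishFrame actually stores them is not covered here. The function name is hypothetical, the namespaces are assumed to be those of SDMX-ML 2.1 and should be checked against the official schemas, and the output is only a codelist fragment, not a complete SDMX structure message.

```python
import xml.etree.ElementTree as ET

# Assumed SDMX-ML 2.1 namespaces; verify against the official schemas.
STR_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"
COM_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def to_sdmx_codelist(codelist_id, agency_id, version, name, codes, lang="en"):
    """Render (code, description) pairs as an SDMX-ML style Codelist fragment."""
    ET.register_namespace("str", STR_NS)
    ET.register_namespace("com", COM_NS)
    lang_attr = {f"{{{XML_NS}}}lang": lang}

    codelist = ET.Element(
        f"{{{STR_NS}}}Codelist",
        {"id": codelist_id, "agencyID": agency_id, "version": version},
    )
    ET.SubElement(codelist, f"{{{COM_NS}}}Name", lang_attr).text = name

    # One Code element per pair, each with a human-readable Name.
    for code, description in codes:
        code_el = ET.SubElement(codelist, f"{{{STR_NS}}}Code", {"id": code})
        ET.SubElement(code_el, f"{{{COM_NS}}}Name", lang_attr).text = description

    return ET.tostring(codelist, encoding="unicode")
```

Datastructures and datasets would need analogous, more involved mappings, since they depend on how FishFrame record types correspond to SDMX dimensions and attributes.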

FishFrame does not have a dissemination protocol like the SDMX REST API. OpenSDMX implements this protocol and the converter can be packaged as an adapter in order to publish FishFrame as SDMX artefacts through the SDMX REST API.

OpenSdmxArtifactFishframe.PNG

The above pattern can be applied one to one in D4Science.