Statistical cluster

From D4Science Wiki
Revision as of 15:29, 18 March 2013 by Anton.ellenbroek (The Statistical Cluster Work Plan)

Position

The management of statistical data is a large domain. It ranges from the collection of observations on species occurrences or capture, through the curation and aggregation of data and their visualization on maps, to the visual and statistical analysis of both observations and time-series. It requires the import of structured data in various formats, with an emphasis on SDMX datasets. The purpose of the cluster is to produce a low-cost, versatile and reliable data suite that covers the workflow of data from collection to publication, and to manage an appropriate set of metadata for each dataset, describing, e.g., its provenance, ownership, and quality.

Compared to other initiatives, iMarine already offers the basic components to load, share, publish and analyze data. This makes the iMarine infrastructure an attractive option for the further development of statistical data components. In addition, the powerful data-processing services are expected to offer substantial benefits to statistical data managers.

Many data owners in the marine statistical domain have difficulty in gaining consistent access to capture data in enough detail and with relevant metadata. There are concerns about the sheer number of datasets that have to be maintained, with multiple data streams and formats putting pressure on software developers. Concerns about the interoperability of software and the related risk of exploding support costs for software maintenance make OS development in a Community of Practice (CoP) a potentially attractive proposition.

This cluster can build on the considerable experience gained in the acquisition and management of data in previous projects. In addition, many services marshalling the data from “Sea to Shelf” are available: curation, metadata collection, transformation, mapping and repository services, to name a few. The aim of the iMarine Biodiversity partners is to provide a stronger, more resilient and flexible framework for Statistical Data Management. The Ecosystem Approach Community of Practice (EA-CoP) expects that services that are difficult to maintain in a single organization, such as data mining, time-series analysis, and modeling, can be offered in a very cost-effective manner through iMarine. Collaboration in an EA-CoP can help to achieve that.

The term Community of Practice was coined to capture an "activity system" that includes individuals who are united in action and in the meaning that "action" has for them and for the larger collective. Communities of practice are "virtual", i.e., they are not formal structures such as departments or project teams. Instead, these communities exist in the minds of their members and are glued together by the connections they have with each other, as well as by their specific shared problems or areas of interest. Knowledge is generated in communities of practice when people participate in problem solving and share the knowledge necessary to solve the problems.

This collaboration needs to be based on reliable and free resources. Access to and maintenance of these resources are the responsibility of all partners of the EA-CoP, and users of the supporting eInfrastructure will have to develop and commit to an open data policy.

An effective statistical data policy is an iMarine EA-CoP policy; it needs to be defined and approved by the iMarine Board. After all, making clear, fair agreements on the component development of OS software and effectively enforcing these rules will increase political and CoP support for iMarine. This proposal lists some of the components that can help achieve this.

We are aware that statistics are just one, albeit important, facet in the iMarine decision-making processes. To facilitate this, the components not only propose implementation actions but also describe the background and the anticipated impact. We are keen to enter into discussion with other iMarine partners and iMarine supporting institutions. We invite other parties (Board partners, iMarine institutions and the CoP) to express their needs.

The Statistical Cluster Work Plan

  • Abstract or Executive Summary;
  • Introduction and Background (The Problems);


Goals and Objectives (The Outputs)

The ensemble of components constituting the statistical cluster can be summarized as:

  • ICIS - The Data Suite offering data management, analysis and production facilities;
  • SPREAD - Positioned midway between the statistical and geospatial services, SPREAD will manage the spatial re-allocation of capture data following political and environmental boundaries;
  • CodelistManager - The shared development of Cotrix, acting as a facility that is agnostic of the persistent storage;
  • Statistical service - The iMarine-specific container and workflow organizer that brings the power of the infrastructure to scientific users;
  • R - The interface to the tool of preference of the EA-CoP, either as an integrated tool in ICIS or as a service in a WPS Hadoop process;
  • SDMX registry - The persistence and user orchestrator of the statistical products;
  • FLUX - TBD.
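
To make the idea of spatial re-allocation concrete, the following is a minimal sketch only. It assumes that the catch reported for a statistical area is split over target zones (for example an EEZ and the high seas) in proportion to the share of the reporting area that falls in each zone. The function name, the area and zone names, and the proportional rule itself are illustrative assumptions; the method SPREAD will actually use is not specified here.

```python
def reallocate_catch(catch_by_area, overlap_fractions):
    """Split each reporting area's catch over target zones.

    ASSUMPTION (not from the work plan): catch is distributed in
    proportion to the fraction of the reporting area lying in each zone.

    catch_by_area:     {reporting_area: catch in tonnes}
    overlap_fractions: {reporting_area: {target_zone: fraction of the area}}
    """
    allocated = {}
    for area, catch in catch_by_area.items():
        fractions = overlap_fractions[area]
        total = sum(fractions.values())
        # The fractions of one reporting area must cover it completely.
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"fractions for {area!r} sum to {total}, expected 1")
        for zone, fraction in fractions.items():
            allocated[zone] = allocated.get(zone, 0.0) + catch * fraction
    return allocated


# Hypothetical figures: two reporting areas overlapping one EEZ and the high seas.
catch = {"area_A": 100.0, "area_B": 40.0}
overlap = {
    "area_A": {"EEZ_X": 0.25, "high_seas": 0.75},
    "area_B": {"EEZ_X": 0.5, "high_seas": 0.5},
}
result = reallocate_catch(catch, overlap)
```

In this sketch the total catch is preserved: the amounts allocated to the target zones add up to the amounts originally reported. A production service would derive the overlap fractions from the geospatial layers rather than take them as input.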

Resources and Constraints (The Inputs)

The Business Case requirements are inputs for the cluster. They come from three Business Cases, grouped as follows:

  • the EU Common Fisheries Policy;
  • the FAO deep seas fisheries programme;
  • and the UN Ecosystem Approach to Fisheries (EAF).

Use cases are often not specific to one of the above, but are either a generic statistical data function or, at the opposite extreme, very generic data storage environments. Some examples, from the framework level down to detailed requirements, are:

  • A generic storage and distribution solution that can be consumed by external parties; this is where GeoNetwork and the SDMX registry are positioned;
  • Data mining and pattern recognition;
  • A generic tool for data processing, such as R;
  • Validation, QA and QC functionality.

Other inputs


  • FAO Global and Regional capture datasets;
  • FAO reference data exposed through e.g. the SDMX registry;
  • Community data sources: FAO Tuna Atlas (tropical tuna data), IRD Tuna Atlas (tropical tuna data);
  • Other fisheries data (catches of fisheries targeting tuna, bycatch of tuna fisheries, scientific tagging data);
  • Vessel position data;
  • Species occurrence data.

Further inputs include species distributions, occurrence data from other fisheries databases, statistical data by geospatial area, and biological parameters.

Constraints


Strategy and Actions (from Inputs to Outputs)

Appendices (Budget, Resources, Documents, Schedule and Others)

The Overall Flow of the Statistical cluster Work Plan

The work plan consists of a main text and appendices. The appendices may include budgets, agreements, external resources, data formats, etc. They are put into appendices at the end of the work plan, as they do not form part of the argument.

Components