ICIS Draft Work Plan Q2-3 2012

From D4Science Wiki

The goal of iMarine BC1, described on the iMarine Business Cases page, is support for the EU Fisheries Policy.

The goal of iMarine BC2, also described on the iMarine Business Cases page, is support to FAO’s deep seas fisheries program.

Both BCs include the management of statistical and reference data as an important objective. The ICIS VRE (Virtual Research Environment) already offers the technical infrastructure to realize the basic data management facilities. This work plan focuses on aligning the existing ICIS VRE with the curation expectations emerging from the BCs.

More advanced or related Work Plans will cover e.g. statistical analysis of data, geospatial analysis of data, advanced curation using external vocabularies, knowledge bases or services, etc.

This ICIS work plan aims to identify overlapping objectives of the Business Cases and translate them into technical objectives. Doing so makes the potential re-use of components evident, which drives down development cost, development time, and maintenance cost while improving quality.

The work plan describes the activities for a six-month period, a span over which both the iMarine Board and the technical teams feel comfortable that the requirements can be met. It also informs decision makers such as the iMarine Board about the potential solution, and may later guide them in their management and review of the activities and help validate the results.

To be approved, the ICIS work plan will be discussed with relevant iMarine Board members, WP3 representatives, PEB and WP6.


PERIOD COVERED

This version of the Work Plan covers the six-month period up to the next iMarine Board Meeting in September 2012.

The period has to cover several activities related to different technologies.

It could not start before April 2012 because the goal and objectives needed negotiating, and staffing was difficult at several partners.


EXECUTIVE SUMMARY

ICIS is the environment to manage statistical data-flows. Statistical here means that the data are measured, either directly by instruments or by collecting and/or estimating aggregated values. These can be offered either as Time-Series or as Observations, where a series of observations can be combined to create a new Time Series.

The typical data-set is a group of dimensional values, attributes, and one or more values, presented as rows in a tabular format. In the current ICIS, curation lacks the functionality to persist its settings. This means that if a user curates a set 'XYZ2010', the same settings cannot be copied to curate set 'XYZ2011'. This will have to be tackled by the ICIS back-end developers. SDMX provides several validation facilities that may be considered, but this requires careful discussion with CoP (Community of Practice) partners before this solution is chosen. In addition, many tools already provide validation features, and ICIS should not allocate too many resources to improving the validation of data.
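The missing "persist and re-use" step could look roughly like the sketch below. It is only an illustration under stated assumptions: the `CurationSettings` class, its two fields, the `apply_settings` helper, the JSON storage format, and the sample values are all hypothetical and do not describe the ICIS back-end.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CurationSettings:
    """Hypothetical container for the choices made during one curation."""
    # Dimensional column name -> name of the reference code list it is checked against.
    column_to_codelist: dict = field(default_factory=dict)
    # Column name -> {raw value: curated value} replacement table.
    replacements: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "CurationSettings":
        return cls(**json.loads(text))


def apply_settings(rows, settings):
    """Return curated copies of the rows; the input rows are left unchanged."""
    curated = []
    for row in rows:
        new_row = dict(row)
        for column, table in settings.replacements.items():
            if column in new_row:
                # Values without a replacement entry are kept as they are.
                new_row[column] = table.get(new_row[column], new_row[column])
        curated.append(new_row)
    return curated


# Settings captured while curating 'XYZ2010' (illustrative values) ...
settings_2010 = CurationSettings(
    column_to_codelist={"species": "species_codelist"},
    replacements={"species": {"cod": "COD", "Cod ": "COD"}},
)
stored = settings_2010.to_json()

# ... are loaded again and applied to 'XYZ2011'.
xyz2011 = [{"species": "cod", "year": 2011, "catch": 12.5}]
curated_2011 = apply_settings(xyz2011, CurationSettings.from_json(stored))
```

Storing the settings as a document separate from the data is what allows the same curation to be repeated on a later dataset with the same structure.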

The proposed solution will also allow for the management of the contents of the dimensional columns. Separate datasets are available as reference data, and the allowed values in a dataset are usually elements from the population in the reference dataset, although there can be exceptions to that rule.

Reference data are not one-dimensional: they may contain descriptions, cover a range (a class such as 1-10, a quartile, or a domain such as fuel-types), or have geospatial or temporal constraints.
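As a minimal sketch of the two paragraphs above, the example below checks the values of one dimensional column against reference entries that carry a description and an optional temporal constraint. The `ReferenceEntry` fields, the `unmatched_rows` helper, and the fuel-type codes are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceEntry:
    """Hypothetical reference data element: more than a bare code."""
    code: str
    description: str
    valid_from: Optional[int] = None  # first year the code may be used
    valid_to: Optional[int] = None    # last year the code may be used

    def allows(self, year: int) -> bool:
        if self.valid_from is not None and year < self.valid_from:
            return False
        if self.valid_to is not None and year > self.valid_to:
            return False
        return True


def unmatched_rows(rows, column, reference, year_column="year"):
    """Return the rows whose value in `column` is not allowed by the reference data."""
    by_code = {entry.code: entry for entry in reference}
    problems = []
    for row in rows:
        entry = by_code.get(row[column])
        if entry is None or not entry.allows(row[year_column]):
            problems.append(row)
    return problems


# Illustrative reference data and dataset rows.
fuel_types = [
    ReferenceEntry("DSL", "Diesel"),
    ReferenceEntry("COA", "Coal", valid_to=1950),
]
rows = [
    {"fuel": "DSL", "year": 2010, "vessels": 40},
    {"fuel": "COA", "year": 2010, "vessels": 1},  # code exists, but is not valid in 2010
    {"fuel": "XXX", "year": 2010, "vessels": 3},  # code is not in the reference data
]
problems = unmatched_rows(rows, "fuel", fuel_types)
```

The rows returned in `problems` are the candidates for the exceptions mentioned above; what happens to them is a policy decision rather than part of the check itself.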

An important feature of ICIS is to match a column and the values it contains to a reference dataset, with the help of a matching tool. The objectives for that matching feature will be described in separate work plans for code-list management and code-list mapping:

  • The VRE for Codelist management will be described here: link
  • The VRE for Codelist Mapping will be described here: link

ICIS is already enriched with many features that merit their own work plans. A full list is not yet ready, but it may include:

  • VTI; for vessel TimeSeries plotting and analysis
  • The Statistical service; for the analysis of data contained in a TS
  • SPREAD; for the geospatial re-allocation, visualization, and geospatial analysis of time series.

A curated dataset is not an isolated entity, but may have to be merged with another dataset. Two operations can be identified here: the join, where the dataset is appended to a previous entity with an identical structural format, and the horizontal merge, where the value columns are added to another entity by matching on the dimensional values.
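The two operations can be sketched on rows held as plain dictionaries. The function names and the sample columns (`area`, `year`, `catch`, `effort`) are hypothetical, and the outer-join behaviour of the horizontal merge is an assumption rather than documented ICIS behaviour.

```python
def append_rows(previous, new):
    """Join (append): both entities must have the same columns."""
    if previous and new and set(previous[0]) != set(new[0]):
        raise ValueError("structures differ; cannot append")
    return list(previous) + list(new)


def horizontal_merge(left, right, dimensions):
    """Add the value columns of `right` to `left`, matching rows on the dimensions.

    Assumed to behave like an outer join: a dimension combination present in
    only one input still yields a row, with the other value columns absent.
    If both inputs carry the same value column, the value from `right` wins.
    """
    merged = {}
    for row in list(left) + list(right):
        key = tuple(row[d] for d in dimensions)
        merged.setdefault(key, {}).update(row)
    return list(merged.values())


# Illustrative data.
catch_2010 = [{"area": "27", "year": 2010, "catch": 100}]
catch_2011 = [{"area": "27", "year": 2011, "catch": 120}]
effort = [{"area": "27", "year": 2010, "effort": 7}]

series = append_rows(catch_2010, catch_2011)
combined = horizontal_merge(series, effort, ["area", "year"])
```

In this sketch the 2011 row of `combined` has no `effort` column, which is the kind of incomplete result that needs an explicit decision when sets only partly overlap.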

The results of a curation have little value if they cannot be published or used interactively in e.g. R.

  • The VRE for SDMX publishing is described here: link
  • The VRE for R integration is described here: link

The power of the D4Science infrastructure is in providing access to and processing of data from external infrastructures. Once a stable ICIS is available, it can enrich Time Series with information calculated on demand, e.g. to include environmental data.

  • The VRE for spatial data enrichment is described here: link
  • The VRE for BI is described here: link

THE INTRODUCTION AND BACKGROUND

This ICIS Work Plan aims to provide a guide to enrich the facility for data management already available in iMarine: the ICIS VRE (Virtual Research Environment). The enrichment sought is derived from:

  1. The D4Science-II validation report and recommendations;
  2. The iMarine partners that have joined the project, and their requirements, especially in BC2;
  3. The progress in CodelistManagement, and the development facilities offered through adopting Maven;
  4. Paragraphs in the DoW (Description of Work);
  5. Requirements stemming from SPREAD and other use-cases.

The DoW, while excellent in providing a longer term vision and identifying Work Packages, does not contain information specific enough to commence a collaborative development effort. In addition, the iMarine project has to support three business cases that have many overlapping requirements. Therefore, a clustering of requirements was recommended in the DoW, and one of these clusters, for statistical data, is best served by extending the existing ICIS VRE (Virtual Research Environment).

The statistical data cluster collects requirements for facilities to harmonize data, to transform data from one resolution to another, to re-allocate data in spatio-temporal dimensions, and to make data discoverable and sharable by using widely accepted formats and data repositories such as those for SDMX.

Not all of these requirements can be addressed at the same time; this Work Plan outlines the facilities that could be developed collaboratively in Q2-3 to extend ICIS.

The existing ICIS forms the base from which to extend the features. ICIS will be the base container that provides the functionality that is critical to all other data management scenarios, be they in VTI, SPREAD, or SDMX scenarios.

GOALS AND OBJECTIVES

The goal of this Work Plan is to deliver full life-cycle curation of datasets of observations and their metadata.

The objectives can be grouped by the proposed sub-solutions:

  1. Data management; Improve merge and join functionality.
  2. Harmonization; Transform data to refer to clearly defined reference code-lists, attribute ranges, and value formats and ranges. Capture the knowledge about the dataset as metadata.
  3. Code-list discovery; search a reference data-set to use in a curation; persist those settings; (More reference data management in another work plan)
  4. Use of code-list in curation; if not all elements can match, define a reference data policy (freeze / add to referenced code-list / remove from dataset / modify data); persist those settings (More on this in another work plan)
  5. Persist results; After curation, save or merge (append to existing table); select from a data-policy (overwrite, flag, ignore); save those settings for future re-use;
  6. Persist curation settings; In addition to the reference data settings, also persist settings for: source-url (if non-csv loading is supported);
  7. For the work-flow support, a solution is not envisaged in this planning period.
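The reference data policy named in objective 4 could be modelled as below. This is a sketch only: the work plan names the four options but does not define them, so the meaning given to each policy here (in particular "freeze") is an assumption, as are the function name, the `mapping` parameter, and the sample gear codes.

```python
from enum import Enum


class ReferencePolicy(Enum):
    FREEZE = "freeze"
    ADD = "add to referenced code-list"
    REMOVE = "remove from dataset"
    MODIFY = "modify data"


def apply_reference_policy(rows, column, codelist, policy, mapping=None):
    """Return (rows, codelist) after handling values of `column` missing from `codelist`.

    The behaviour of each policy is an assumed interpretation, not ICIS behaviour.
    """
    codes = set(codelist)
    result = []
    for row in rows:
        value = row[column]
        if value in codes:
            result.append(dict(row))
        elif policy is ReferencePolicy.FREEZE:
            result.append(dict(row))  # keep the row and the code list as they are
        elif policy is ReferencePolicy.ADD:
            codes.add(value)          # extend the code list with the new value
            result.append(dict(row))
        elif policy is ReferencePolicy.REMOVE:
            continue                  # drop the unmatched row
        elif policy is ReferencePolicy.MODIFY:
            replacement = (mapping or {}).get(value)
            if replacement not in codes:
                raise ValueError(f"no valid replacement for {value!r}")
            result.append({**row, column: replacement})
    return result, sorted(codes)


# Illustrative data.
rows = [{"gear": "TRAWL"}, {"gear": "trawl-old"}]
codelist = ["LINE", "TRAWL"]

removed, _ = apply_reference_policy(rows, "gear", codelist, ReferencePolicy.REMOVE)
_, extended = apply_reference_policy(rows, "gear", codelist, ReferencePolicy.ADD)
modified, _ = apply_reference_policy(
    rows, "gear", codelist, ReferencePolicy.MODIFY, mapping={"trawl-old": "TRAWL"}
)
```

Because the policy is a single value, it can be stored with the other curation settings and replayed on a later dataset.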

The objectives do not mention existing curation and time-series facilities. It is assumed that improvements to e.g. the user-interface or back-end performance are covered elsewhere.

In addition, the objectives aim to deliver a VRE (Virtual Research Environment) that will be the base for many other VREs that manage data. Code-list management, Code-list Mapping, SDMX Generator, and VTI are all examples of functionality that can re-use ICIS as a platform for development.

RESOURCES AND CONSTRAINTS

There are several constraints to overcome for the implementation of the extended functionality for each objective.

The constraints can be grouped by the proposed sub-solutions:

  1. Data management; One constraint on improving merge and join functionality that can already be identified is the lack of the notion of multiple observations in a TS Object. In addition, the effects of joining / merging incomplete sets (e.g. compared to an outer join) have not been analyzed. The impact of the merge / join on the descriptive metadata is not known.
  2. Harmonization; ICIS was not developed to support an extensive rule-frame to support harmonization at data-level. The back-end structure is not known to WP3, and triggers and rules support is basic. Understanding this constraint is critical.
  3. Code-list discovery; Reference data management will be provided by a set of other VREs. Their discovery is already supported in ICIS. Adding new types of data, versioning of code-lists, partial code-lists, and other feature requirements may emerge in the project.
  4. Use of code-lists in curation; A code-list currently cannot be modified from the ICIS VRE (Virtual Research Environment), and unmatchable values can only be discarded.
  5. Persist results; At present, a new TS is created after curation, which can then be merged. It is not yet possible to automate that merge step.
  6. Persist curation settings; The current ICIS has no roll-back or savepoint mechanism. It is important to realize that this will not be developed. The results can be regenerated by repeating earlier steps. This also evidences the need for a flexible capture of curation steps, where the settings can be changed.
  7. For the work-flow support, a solution is not envisaged in this planning period. The constraints here cannot be listed.

The most important constraint to overcome in ICIS is the discovery, access, and modification of code-list data. This requires activity in parallel VRE (Virtual Research Environment) developments for Code-list management and Code-list mapping.

The development of a next version of ICIS is likely to require mostly CNR resources for the implementation. The resources available can be grouped by the proposed sub-solutions:

  1. Data management; Start from ICIS;
  2. Harmonization; Start from ICIS;
  3. Code-list discovery; Start from ICIS and D4S Codelist-manager;
  4. Use of code-lists in curation; Start from ICIS;
  5. Persist results; No resources known;
  6. Persist curation settings; No resources known. Could be SDMX files;
  7. For the work-flow support, a solution is not envisaged in this planning period. The resources here cannot be listed.

STRATEGY AND ACTIONS

The strategy to meet the goal is to:

  1. Discuss the Work Plan at the March TCom and iMarine Board meeting;
  2. Identify resources; a proposed schedule is: WP3 0.5 PM, WP6-FAO 1.5 PM, CNR ?? PM, OBIS 0.5 PM, WP9 0.5 PM;
  3. Document the progress and validation results.

Several approaches can be taken to organize the effort:

  1. CNR continues the development of ICIS.
  2. A suitable application is identified that leverages most of the required functions. This was done with the Infostat system of the Bank of Italy (BoI). However, this requires participation of a larger group of parties beyond the scope of the iMarine project, and an agreement with the BoI. This option is currently on hold.

The approach selected is that CNR leads the implementation effort, while FAO is responsible for describing requirements, contributing to component development, and validating the results:

  1. Defining the objectives and constraints;
  2. Specifying verifiable validation points;
  3. Organizing the validation of released components;
  4. Communicating the validation results.

The actions needed to reach the objectives and implement the strategy can be grouped by objective. The validation will be based on completed actions, and some indicators for validation are included here:

  1. Data management; Improve merge and join functionality.
    1. Validate the effect of a merge in two cases: 2 datasets that share the same dimension but contain different observations, and 2 datasets where 1 set is appended to a previous one.
  2. Harmonization will be validated by performing the following actions;
    1. Load a dataset; a CSV and an SDMX file will have to be loaded.
    2. Curate all dimensional, attribute and value columns; ICIS already supports this, but the rules can be refined somewhat. Support for the date type has to improve. Revert an action. Display curated data next to the original values.
    3. Review these settings in the batch history;
  3. Code-list discovery;
    1. Search a reference data-set to use in a curation;
    2. Save those settings;
    3. Review the settings in the batch history;
  4. Use of code-list data in curation;
    1. If not all elements can match, define a reference data policy (freeze / add to reference / remove from dataset / modify data);
    2. Persist those settings (More on this in another work plan);
    3. Review the settings in the batch history;
  5. Persist results;
    1. After curation, save, join or merge (append to existing table);
    2. Select from a data-policy (overwrite, flag, ignore);
    3. Review the settings and the resulting data-set.
  6. Persist curation settings;
    1. In addition to the reference data settings, also persist settings for: source-url (if non-csv loading is supported);
    2. Review the settings and save as a control file for future datasets;
    3. Review the resulting control file (an XSLT?).
  7. For the work-flow support, a solution is not envisaged in this planning period.
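The data-policy choice in the "Persist results" actions (overwrite, flag, ignore) could be validated against behaviour like the following sketch. The semantics shown for each policy are assumptions, since the work plan only names them; the `save_with_policy` helper, the `conflict` marker column, and the sample rows are hypothetical.

```python
def save_with_policy(existing, incoming, dimensions, policy):
    """Merge `incoming` rows into `existing` rows under a conflict policy.

    Assumed semantics for a row whose dimension values already exist:
      overwrite: the incoming row replaces the existing one
      flag:      the existing row is kept and marked with conflict=True
      ignore:    the existing row is kept and the incoming row is dropped
    """
    if policy not in ("overwrite", "flag", "ignore"):
        raise ValueError(f"unknown policy: {policy}")
    table = {tuple(r[d] for d in dimensions): dict(r) for r in existing}
    for row in incoming:
        key = tuple(row[d] for d in dimensions)
        if key not in table:
            table[key] = dict(row)  # new dimension combination: always appended
        elif policy == "overwrite":
            table[key] = dict(row)
        elif policy == "flag":
            table[key]["conflict"] = True
        # policy == "ignore": nothing to do, the existing row stays as it is
    return list(table.values())


# Illustrative data: one conflicting row and one new row.
existing = [{"area": "27", "year": 2010, "catch": 100}]
incoming = [
    {"area": "27", "year": 2010, "catch": 110},
    {"area": "27", "year": 2011, "catch": 120},
]

overwritten = save_with_policy(existing, incoming, ["area", "year"], "overwrite")
flagged = save_with_policy(existing, incoming, ["area", "year"], "flag")
ignored = save_with_policy(existing, incoming, ["area", "year"], "ignore")
```

Reviewing "the settings and the resulting data-set" then amounts to comparing the three outputs for the same pair of inputs.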

APPENDICES, INCLUDING A SCHEDULE

Vocabulary:

Attribute

Curation

Dimension

Harmonization

Standardization

Validation

Work flow


THE OVERALL FLOW OF THE WORK PLAN

The ICIS Work Plan will be presented and discussed at the TCom, after which a set of planning sessions will follow to allocate resources.

These will pass to PEB for approval. PEB will have to assess the estimated work-load, and propose an implementation schedule. The involved partners will then have to negotiate their involvement.

Validation will be performed when a new version of the VRE (Virtual Research Environment) is released, based on the objectives mentioned above. For the validation of entire components, WP3 will lead, and invite relevant representatives of the EA-CoP (Community of Practice) to provide feed-back and suggestions.


CONCLUSION

ICIS is the base component for tabular data management for the CoP (Community of Practice) in iMarine, and thus needs to provide robust import, storage, modification, and publication facilities.

The current version of ICIS, while already providing important features, must be improved to make it the critical data-tool essential to the CoP.

The curation functionality and the development of work-flow support are the main components that require immediate attention.