Ecosystem Approach Community of Practice: CodelistMapper

From D4Science Wiki

CodelistMapper Profile DRAFT

The CodelistMapper VRE (Virtual Research Environment) aims to provide the management of relationships between codelists, including:

  • a description of existing Codelists,
  • the type of relations between codes,
  • a description of the relationship,
  • a quantifier of that relationship,
  • a synchronizable reference to the codelists,
  • the temporal coverage, and
  • the spatial coverage.

The base requirements have already been collected and partly implemented, e.g. in:

  • the ICIS Advanced Curation and its advanced search and replace;
  • the VRMF application in FAO (F. Fiorellato); although it takes a very different approach, it may provide useful components.

This last application already manages mappings between lists of vessels, gear types, and vessel types. Good progress is reported for name and identity records of persons. The application was demoed to CNR staff in FAO, especially the vessel-matcher, which uses a distance-calculating algorithm that requires many resources.
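The source does not name the distance algorithm used by the vessel-matcher. Purely as an illustration, the sketch below uses the Levenshtein edit distance, a common choice for comparing names; the function names and the sample vessel names are hypothetical. It also hints at why such matching is resource-hungry: comparing every entry of one list with every entry of another costs one distance computation per pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest(name: str, candidates: list[str]) -> tuple[str, int]:
    """Return the candidate with the smallest edit distance to name."""
    scored = ((c, levenshtein(name.upper(), c.upper())) for c in candidates)
    return min(scored, key=lambda pair: pair[1])
```

A call such as closest("Maria Luisa", ["MARIA LUISA II", "SANTA MARIA"]) would propose the first candidate together with its distance, which a curator could then accept or reject.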

The codelists currently used in ICIS Curation are a good starting point. The CodelistMapper VRE would extend these, and could be enriched by the capabilities developed in FAO.

Codelist mapping offers many opportunities, not only to the ICIS VRE, but also as a service to other infrastructures. This profile sets out to chart the initial baseline functionalities that can be combined in a comprehensive VRE.

Problem

Describe the CoP (Community of Practice) issue to be addressed by the component (VRE / service / resource / etc.)

The availability of reliable relations between codelists enables the transformation of data expressed using one set of codelists into a dataset using a different set of codelists. This requires a facility that manages the 1:1, n:1, and 1:n relationships between elements of two codelists, as well as the completeness of the mappings and the quality of each relationship. A tool that manages the datasets can then exploit these relationships to transform data, while also generating information on the quality and reliability of the transformed dataset and its elements.
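As a minimal sketch of what such a facility might hold, the following models a relation between two codes, derives the cardinality of each relation, measures completeness, and recodes a dataset while reporting what could not be transformed. All names (Relation, cardinality, completeness, transform) and the 0.0 to 1.0 quality scale are assumptions made for illustration, not part of any existing D4S or FAO component.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str     # code in the source codelist
    target: str     # code in the target codelist
    quality: float  # hypothetical indicator between 0.0 and 1.0


def cardinality(relations: list[Relation]) -> dict[tuple[str, str], str]:
    """Label each relation as 1:1, 1:n, n:1 or n:m."""
    targets_of, sources_of = defaultdict(set), defaultdict(set)
    for r in relations:
        targets_of[r.source].add(r.target)
        sources_of[r.target].add(r.source)
    labels = {}
    for r in relations:
        many_targets = len(targets_of[r.source]) > 1
        many_sources = len(sources_of[r.target]) > 1
        if many_targets and many_sources:
            label = "n:m"
        elif many_targets:
            label = "1:n"
        elif many_sources:
            label = "n:1"
        else:
            label = "1:1"
        labels[(r.source, r.target)] = label
    return labels


def completeness(relations: list[Relation], source_codes: list[str]) -> float:
    """Share of source codes that have at least one relation."""
    mapped = {r.source for r in relations}
    return len(mapped & set(source_codes)) / len(source_codes)


def transform(records, relations):
    """Recode (code, value) records that map to exactly one target code.

    Returns the recoded records, each carrying the quality of the relation
    used, and the records that could not be recoded unambiguously.
    """
    by_source = defaultdict(list)
    for r in relations:
        by_source[r.source].append(r)
    recoded, unresolved = [], []
    for code, value in records:
        options = by_source.get(code, [])
        if len(options) == 1:
            recoded.append((options[0].target, value, options[0].quality))
        else:
            unresolved.append((code, value))
    return recoded, unresolved
```

In this sketch, 1:n relations are deliberately left unresolved rather than split, since splitting a value across several target codes needs a rule that the profile does not define.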

The generation of such mappings is already achieved in FAO, and the CodelistMapper could be based on that application. The application would benefit from the computing resources available in the D4S infrastructure to reduce the time required to generate the mappings. Currently, it takes 3 to 4 days to generate the mappings for some 300,000 vessels. This generation time is expected to drop if D4S cloud facilities are available, which could also allow the current ad-hoc mapping to be replaced with a more accurate and timely overnight process.

Product

Describe the proposed solution in maximum 3 sentences:

With CodelistMapper, two reference codelists can be loaded and displayed, and the user is offered the option to establish relations between the elements automatically or manually. A facility that establishes relations between similar elements in the two lists, supported by a set of qualifiers and quantifiers, will result in a transparent mapping. The CodelistMapper offers this facility through a VRE, mainly for data curation in other VREs, but also to offer the mapping service to external clients and infrastructures, and to publish the results for consumption by external applications in a variety of formats, such as RDF and SDMX.


Priority to CoP

List proposed solution priority following the iMarine Board priority setting criteria:

  • Identified community: Users now: Nearly all conceived VREs need high-quality and dynamic mappings between codelists. Also beyond the gCube environment, access to reliable mappings (i.e. with quality indicators on completeness, validity, and precision) will be valuable.
  • Potential for co-funding: Good. However, community requirements outside the project are not well understood. How dynamic mappings must be, how to expose partially mapped codelists, how to discover codelists, what distribution formats must be used, and the desired statistical and computational precision all have to be understood. Examples: A codelist of weight classes: what does a reported capture of .5 mean? Can it be .5087, can it be .64? A codelist for periods: what is 2009? The calendar year? The fiscal year?
  • Structural allocation of resources: To be discussed in a. SB, b. iMarine Board
  • Referred in DoW: T3.3, WP9, T9.3, and in the Methodology.
  • Business Cases: Supports BC 1, 2, 3.
  • How does the proposed action generally support sustainability aspects? Codelists are the basis of any harmonization effort. Without properly mapped codelists, no data can realistically be used across systems. Without codelists, there can be no sustainability. D4Science (an e-Infrastructure operated by the D4Science.org initiative) technologies can access and manage data in many formats, and provide different computational services to generate the detailed mappings. Codelist mapping offers a good opportunity for a sustainable D4Science data management framework.
  • How consistent is it with EC regulations/strategies (e.g. INSPIRE, ...): Very much, as most strategies aim to bring data under commonly shared and maintained reference schemes.
  • Re-usability – benefits – compatibility: Very high. The benefit is that with gCube components the codelist mappings can be used in workflow-supported data-upgrade services, transforming unstructured datasets into structured, quality datasets.

Parentage

Relation to CoP Software: Critical to all shared efforts in iMarine where data have to cross a system or domain boundary.

Relation to D4S technologies: For WP6

Does the proposed solution solve other problems associated with EA-CoP Business Cases? For further iMarine Board evaluation. First indications are that interest is high in e.g. BC1:

  • K.Morteo showed his interest in data harmonization
  • The merging of data across data providers in e.g. VTI requires the Codelist-Mapping management;
  • The production of SDMX can only be done if the relation with the codelists used is well understood. E.g. the generation of a partial codelist may require that metadata on provenance and quality are re-calculated and attached;

Also in BC2, prospects are good:

  • Harmonization across sampling and survey methodology and instruments is a key blocking issue in all biological research;
  • Discovery of duplicates is an essential feature in OBIS;
  • Matching similarities across thousands of species names requires a powerful infrastructure;
  • etc

If the proposed solution can be used in another SW scenario (not users!) please describe. For WP6. The expectations are high that the solution developed for ICIS TS harmonization will be of direct use in, e.g., the following cases:

  • SPREAD - Harmonize data to be re-allocated;
  • OBIS - Species mapping; calculate the similarities between 2 entries based on their names (scientific, vernacular), taxonomy, and description;
  • OBIS - Species list mapping; calculate the similarities between 2 lists of species or other reference data sets;
  • VTI - Harmonize data.
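The OBIS cases above could be approached along the following lines. This is a sketch only: it combines per-field text similarities from Python's standard difflib module into a weighted score and proposes, for each entry in one list, the best candidate in the other. The field names, weights, and threshold are hypothetical and would need tuning by domain experts.

```python
from difflib import SequenceMatcher

# Hypothetical weights; a real service would let the curator tune them.
WEIGHTS = {"scientific": 0.6, "vernacular": 0.2, "taxonomy": 0.2}


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def entry_similarity(x: dict, y: dict) -> float:
    """Weighted similarity over the fields that both entries provide."""
    total, weight_sum = 0.0, 0.0
    for field, weight in WEIGHTS.items():
        if x.get(field) and y.get(field):
            total += weight * text_similarity(x[field], y[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def map_lists(left: list[dict], right: list[dict], threshold: float = 0.85):
    """Propose, for each left entry, the best right entry above threshold."""
    proposals = []
    for x in left:
        best = max(right, key=lambda y: entry_similarity(x, y), default=None)
        if best is None:
            continue
        score = entry_similarity(x, best)
        if score >= threshold:
            proposals.append((x["code"], best["code"], round(score, 3)))
    return proposals
```

Each proposal carries its score, so the quantifier of the relationship can be stored alongside the mapping rather than discarded. Note that map_lists compares every pair of entries, which is the part that would benefit from distributed computing on lists with thousands of species names.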

Public

How big is the expected user community after delivery? All VREs and all tabular data managing services will benefit from the data exposed through the CodelistMapper. Outside the iMarine ecosystem, reliable codelists exposed as e.g. RDF will be marketable.

Productivity

Are the proposed measures effective? Very much. The service will boost the quality of not only the codelists mappings themselves, but also of all datasets that use the service.

Does it reduce a known workload? Yes, all efforts to generate, maintain and modify codelists can be done by a few experts, whereas in the current situation, all efforts are repeated without quality indicators.

Price

Is the proposed solution cheap? No, it should only be pursued if the requirements are understood by WP6 and the iMarine Board.

Expected effort in PM: 12 PM at least; 2 PM WP3, 5 PM WP6, etc.

Presentation

How is the component delivered to users? (Design / on-line help / training material / support).

CodelistMapper is conceived to be delivered through a VRE that starts from an available codelist; this can be a CSV file, a codelist from a registry, or another dynamic system. If the source is a dynamic, online repository, a synchronization feature must be considered.

It also requires a user interface to define a new codelist, import an existing one, manage versioning and synchronization, define subsets, and manage the validity over owner, space and time. This is described for the closely related Codelist VRE.

For the quality indicators, access to an additional service is required to define e.g. how the quality must be calculated.

The codelist mappings have to be made available in other VREs, but also to external users as SDMX codelists, RDF, and / or JSON. Access to these codelist mappings is subject to policies that relate to the validity of the codelist, and the access rights of the external user or application.
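To make the two policy dimensions concrete, here is a small sketch of a JSON export that checks access rights and filters relations by validity period. The dictionary layout (id, allowed_roles, valid_from, valid_to) is invented for this example; SDMX and RDF output would need their own serializers and are not shown.

```python
import json
from datetime import date


def export_mapping(mapping: dict, user_roles: set[str], on: date) -> str:
    """Serialize the relations a user may see on a given date as JSON."""
    # Access rights: the user needs at least one of the allowed roles.
    if not (mapping["allowed_roles"] & user_roles):
        raise PermissionError("user may not access this mapping")
    # Validity: keep only relations whose period covers the requested date.
    visible = [
        {"source": r["source"], "target": r["target"], "quality": r["quality"]}
        for r in mapping["relations"]
        if r["valid_from"] <= on <= r["valid_to"]
    ]
    return json.dumps({"id": mapping["id"], "relations": visible}, indent=2)
```

Keeping the policy checks inside the export function means every output format would pass through the same gate, which is one possible design; a separate policy service would work equally well.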

The mappings will have different access policies, translations and units, depending on the user preferences.

Privacy

Are they safe?

There are no privacy issues of legal or physical persons involved.

Need the proposed solution to manage confidential info at data / dataset / organizational level?

Yes, most codelists are not entirely open, and the same is true for their mappings.

Any mapping must be explicitly published before it can be used by others. If a VRE User has created a mapping, the VRE Manager is responsible for publishing it, e.g. as a D4S codelist, in an SDMX repository, or otherwise.

Within the VRE, a mapping can only be shared by the VRE User who generated it. Only the VRE Manager can publish the generated mappings to render them discoverable and visible to other VREs.
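These sharing and publishing rules can be expressed as a few checks. The sketch below is one possible reading of them, with hypothetical names throughout (Mapping, share, publish, can_use, and a managers lookup keyed by VRE name); it is not an existing gCube API.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    owner: str                                      # VRE User who generated it
    vre: str                                        # VRE it was created in
    shared_with: set = field(default_factory=set)   # users inside that VRE
    published: bool = False


def share(mapping: Mapping, actor: str, recipient: str) -> None:
    """Only the user who generated the mapping may share it."""
    if actor != mapping.owner:
        raise PermissionError("only the owner can share a mapping")
    mapping.shared_with.add(recipient)


def publish(mapping: Mapping, actor: str, managers: dict) -> None:
    """Only a manager of the mapping's VRE may publish it."""
    if actor not in managers.get(mapping.vre, set()):
        raise PermissionError("only the VRE Manager can publish a mapping")
    mapping.published = True


def can_use(mapping: Mapping, user: str, user_vre: str) -> bool:
    """Published mappings are visible everywhere; others only when shared."""
    if mapping.published:
        return True
    same_vre = user_vre == mapping.vre
    return same_vre and (user == mapping.owner or user in mapping.shared_with)
```

Under this reading, a mapping owner cannot publish their own mapping unless they also hold the manager role for that VRE.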

Describe security and privacy issues:

Not possible; here WP6 can contribute.

Policy

Are there any policies available that describe data access and sharing?

No, but a beginning has been made in T3.2 by FAO.

Are these really needed?

No. Beyond the already available D4S technologies, no specific policies are known to exist.

Copyright / attribution / metadata / legal

The CodelistMapper can serve as a test case for the management of attribute metadata in the public domain. E.g. an FAO capture dataset may be curated with Eurostat and UNESCO codelists. No policy exists on the correct copyright citation or legal implications of re-publishing data using external reference data.

Perils

Do they introduce moral hazard? (A hazard here is the risk that users will behave more recklessly if they are insulated from the effects of the software, or if they do not understand what it produces, where data come from, what they represent, etc.)

The use of codelist mappings may lead users to believe that the data they describe follow the same quality rules as the codelists themselves. Bad data remain bad data, even in a high-quality system.