Semantic cluster

From D4Science Wiki
Revision as of 18:13, 3 September 2015 by Fabio.sinibaldi (Talk | contribs)


The main purpose of the Cluster work plan (template [Ecosystem_Approach_Community_of_Practice_Overview:_Clusters#Cluster_Work_Plans_in_D4Science here]) is to provide the D4Science Board with a management tool that serves both as a framework for planning activities and as a guide for carrying out that work. Its scope is thus the interface between the Board and the activities of the project's Work Packages. Once drafted, a work plan needs approval from the D4Science Board, following the Board procedures.

Executive Summary

The D4Science Semantic Cluster maintains and promotes a Work Plan (this document) aimed at:

  • organizing collections of requirements gathered from the D4Science Business Cases
  • providing recommendations for the implementation of the D4Science infrastructure.

The requirements are inputs for the cluster. They come from D4Science Business Cases, grouped as follows:

  • Support to regional (Africa) LME pelagic EAF community [Ecosystem_Approach_Community_of_Practice:_D4Science_Business_Cases#BC3_-_Support_to_regional_.28Africa.29_LME_pelagic_EAF_community]
  • the FAO deep seas fisheries programme
  • the UN Ecosystem Approach to Fisheries (EAF)

The recommendations are outputs from the cluster, primarily intended for the D4Science Board, the D4Science project partners (Work Packages) and the Communities of Practice (CoP) identified within the Ecosystem Approach. They are aimed at releasing infrastructure services such as:

  • setting up ontologies from controlled vocabularies of the domain: species taxonomy, fishing vessel and gear codes (FAO, DG-MARE code lists...)
  • creation of Linked Open Data through enrichment of metadata with URIs from ontologies (TLO, Ecoscope, FLOD, WoRMS): bibliographic references, OGC metadata (data sources and related services including processes), EML metadata, .pdf / .doc files
  • a workflow for massive RDF generation, storage and publication (triple store, SPARQL endpoint, OpenSearch)
  • seamless access to metadata catalogues through search engines based on ontologies

Such infrastructure services can be used by the D4Science eScience services (VREs & Apps): species manager, geoexplorer, D4Science search engine.

Introduction and Background (The Problems)

Currently, some datasets are freely available (GBIF, OBIS, INSPIRE...) but difficult to retrieve because the related metadata are heterogeneous. Indeed, the names of creators and the other tags used to annotate these resources with related entities of the domain (species, fishing gears, fisheries...) rarely use the same terms. Data discovery is thus complicated, because users have to try synonyms for the same concept in multiple languages to retrieve the datasets. Ontologies can help in matching terms and improving data discovery.
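As a minimal sketch of how ontology labels could support such term matching, the Python snippet below indexes the labels attached to a concept and resolves free-text keywords to a single URI. The URIs under example.org, the label lists and the function names are assumptions made for illustration; they are not taken from FLOD, Ecoscope or any other ontology named on this page.

```python
# Illustrative only: hypothetical concept URIs, each with labels in several languages.
CONCEPT_LABELS = {
    "http://example.org/species/yellowfin_tuna": [
        "Yellowfin tuna",      # English common name
        "Thunnus albacares",   # scientific name
        "Rabil",               # Spanish common name
    ],
    "http://example.org/gear/purse_seine": [
        "Purse seine",         # English
        "Senne coulissante",   # French
        "Red de cerco",        # Spanish
    ],
}


def normalize(term):
    """Lower-case a term and collapse whitespace so that variants compare equal."""
    return " ".join(term.lower().split())


def build_label_index(concept_labels):
    """Map every normalized label to the URI of the concept it names."""
    index = {}
    for uri, labels in concept_labels.items():
        for label in labels:
            index[normalize(label)] = uri
    return index


def match_keywords(keywords, index):
    """Return the set of concept URIs recognised among free-text keywords."""
    return {index[normalize(k)] for k in keywords if normalize(k) in index}


LABEL_INDEX = build_label_index(CONCEPT_LABELS)
```

With such an index, a dataset tagged "RABIL" and another tagged "thunnus albacares" resolve to the same concept URI, so a search on either term could retrieve both.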

Semantic Web technologies and ontologies enable data producers to create richer metadata. Metadata are usually expressed with XML schemas that use literals as values for tags (such as keywords or persons). This is the case for Dublin Core metadata, OGC metadata and EML metadata. These XML metadata with literals can be transformed into RDF metadata with URIs from ontologies. This can be achieved programmatically with text mining applications.
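The transformation from literal-valued XML metadata to URI-valued RDF can be illustrated with a small Python sketch. It reads a simplified Dublin Core record and writes N-Triples, replacing each keyword with a concept URI when one is known. A plain dictionary lookup stands in here for the text mining step; the sample record, the example.org URIs and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Hypothetical keyword-to-URI lookup (a stand-in for a text mining application).
LOOKUP = {"yellowfin tuna": "http://example.org/species/yellowfin_tuna"}

# Simplified, made-up Dublin Core record with literal keywords.
SAMPLE_XML = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://example.org/dataset/42</dc:identifier>
  <dc:subject>Yellowfin tuna</dc:subject>
  <dc:subject>unknown term</dc:subject>
</record>"""


def to_ntriples(xml_text, lookup):
    """Turn dc:subject literals into N-Triples, using URIs where the lookup knows one."""
    root = ET.fromstring(xml_text)
    resource = root.findtext(f"{{{DC_NS}}}identifier")
    if resource is None:
        raise ValueError("record has no dc:identifier")
    resource = resource.strip()
    lines = []
    for element in root.findall(f"{{{DC_NS}}}subject"):
        literal = (element.text or "").strip()
        uri = lookup.get(literal.lower())
        if uri:
            obj = f"<{uri}>"            # keyword recognised: link to the concept
        else:
            escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'        # keyword unknown: keep the literal
        lines.append(f"<{resource}> <{DCT_SUBJECT}> {obj} .")
    return lines
```

Keeping unmatched keywords as literals means no information is lost while the vocabulary coverage is still incomplete.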

However, the main issue is the lack of ontologies for the domain of Ecosystem Approach to Marine Resources. Many initiatives have been dealing with related sub-domains:

  • species:
    • WoRMS [1], which is not a real ontology but has been translated into RDF [2]
    • NASA Semantic Web for Earth and Environmental Terminology (SWEET ontologies [3])
    • ontologies for ecoinformatics [4]
  • fisheries sciences: Neon with FAO [5]

On top of these ontologies, there is a need to build a new top-level ontology which reuses parts of existing ones (including those for information resources: Dublin Core, FOAF, Dclite4g [6], Genesi-dec [7]...).

Such ontologies can be used to set up knowledge bases by instantiating the underlying classes and properties. Indeed, concepts are not only URIs used to annotate information resources; they also carry a set of properties indicating the relationships between entities of the domain: which species preys on which other species, which fishing gears target these species, where these vessels are fishing... Knowledge bases can thus be used to set up Web portals summarizing knowledge about entities: fact sheets about species, fishing gears, ecosystems, fisheries...
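A knowledge base of this kind can be pictured as a list of (subject, property, object) statements. The Python sketch below, given under assumptions, gathers everything stated about one entity, which is the raw material of a fact sheet. All entity and property names (ex:SpeciesA, ex:isPredatorOf, ex:targets...) are placeholders and do not come from any ontology cited on this page.

```python
from collections import defaultdict

# Hypothetical knowledge base: (subject, property, object) statements.
# Entity and property names are placeholders, not real ecological facts.
TRIPLES = [
    ("ex:SpeciesA", "rdf:type", "ex:Species"),
    ("ex:SpeciesB", "rdf:type", "ex:Species"),
    ("ex:GearX", "rdf:type", "ex:FishingGear"),
    ("ex:SpeciesA", "ex:isPredatorOf", "ex:SpeciesB"),
    ("ex:GearX", "ex:targets", "ex:SpeciesA"),
    ("ex:VesselV", "ex:uses", "ex:GearX"),
    ("ex:VesselV", "ex:fishesIn", "ex:AreaZ"),
]


def fact_sheet(entity, triples):
    """Collect what the knowledge base states about an entity, in both directions."""
    sheet = defaultdict(list)
    for subject, prop, obj in triples:
        if subject == entity:
            # Outgoing statement: the entity has this property value.
            sheet[prop].append(obj)
        elif obj == entity:
            # Incoming statement: another entity points at this one.
            sheet["is object of " + prop].append(subject)
    return dict(sheet)
```

Calling fact_sheet("ex:SpeciesA", TRIPLES) returns the type of the species, the species it preys on and the gear that targets it, which a portal could then render as a page.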

Automated fact sheet generation is a key issue in D4Science, considering that many systems have already set up fact sheets:

  • WoRMS Yellowfin Tuna fact sheet [8]
  • FIRMS Yellowfin Tuna fact sheet [9]
  • Fishbase Yellowfin Tuna fact sheet [10]
  • Encyclopedia Of Life Yellowfin Tuna fact sheet [11]
  • GBIF Yellowfin Tuna fact sheet [12]

Being able to generate such fact sheets directly from RDF requires the content of the underlying information systems to be made available in RDF. D4Science VREs and apps can help to achieve this goal. Indeed, applications like "species manager" can combine information from different sources (OBIS, WoRMS, GBIF, Fishbase...) and export the resulting mapping in RDF (compliant with TLO).
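One possible shape for such an export is sketched below in Python: a reconciliation result that links one entity to its identifiers in several sources is serialized as N-Triples. The use of skos:exactMatch is an assumption of this sketch, not necessarily what the TLO prescribes, and every URI and source name in the mapping is hypothetical.

```python
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical reconciliation result: one entity, identifiers in two made-up sources.
MAPPING = {
    "http://example.org/tlo/species/thunnus_albacares": {
        "source_a": "http://example.org/source-a/taxon/123",
        "source_b": "http://example.org/source-b/species/abc",
    },
}


def mapping_to_ntriples(mapping):
    """Serialize each entity-to-source link as one N-Triples statement."""
    lines = []
    for entity_uri in sorted(mapping):
        sources = mapping[entity_uri]
        for source in sorted(sources):
            # Sorting keeps the output stable between runs.
            lines.append(f"<{entity_uri}> <{SKOS_EXACT_MATCH}> <{sources[source]}> .")
    return lines
```

The resulting lines could be loaded into a triple store, after which the identifiers of the different sources can be followed from a single entity.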


Other domains face similar issues, and research projects like agInfra suggest methods and tools that have to be taken into account in the framework of D4Science.

Goals and Objectives (The Outputs)

Outputs of the cluster are Roadmaps, Tradeoff analysis and Guidelines for the development, deployment and maintenance of infrastructure services involving semantic resources and technology, such as:

  • publication of species manager VRE results (code mapping / reconciliation) in RDF (based on the Top Level Ontology Schema)
  • publication of D4Science geonetwork metadata (about data sources and related services: WMS / WFS / WCS / WPS...) through RDF (based on the GENESI-DEC Schema)
  • RDF generation from various types of information resources (Web pages, OGC metadata / CSW URL, .pdf / .doc files, bibliographic references...)

Such infrastructure services are needed by the D4Science eScience services (VREs & Apps) and other web service endpoints.

A validation process aims at matching the cluster outputs with 'consuming' eScience services such as:

  • a VRE to provide GUIs to facilitate RDF generation through the D4Science Tagger
  • a VRE to provide a search engine for D4Science enabling seamless access to different metadata catalogues (D4Science native metadata element set, OGC, publications, pictures...)
  • Smartfish Web portal
  • Fact sheet generator (e.g. Tuna Atlas Use Case)

Resources and Constraints (The Inputs)

The Business Case requirements are inputs for the cluster. They come from 3 Business Cases that are grouped as follows:

  • Smartfish
  • Tuna Atlas

Other inputs:

  • RDF sources for domain entities: FAO FLOD (species, vessels, areas and related properties), IRD Ecoscope (species, vessels, ecosystems and related properties), WoRMS (taxon ranks and related properties), Species manager VRE (species and codes).
  • RDF sources for information resources metadata: FAO FLOD (publications, ??), IRD Ecoscope (pictures, databases, publications, people...), D4Science geonetwork

Strategy and Actions (from Inputs to Outputs)

Another Wiki page is dedicated to Semantic cluster achievements [Semantic_cluster_achievements] related to the D4Science Board Work Plan [13].

Building on the strengths and skills of the D4Science partners contributing to the Semantic Cluster, the following action plans have been conducted or are underway:

  • leveraging the FLOD and Ecoscope knowledge bases,
  • implementing SPARQL endpoints,
  • implementing OpenSearch,
  • implementing a new schema for RDF metadata (GENESI-DEC),
  • using the FORTH search engine (xSearch) on top of the FLOD and Ecoscope knowledge bases (including OpenSearch for results and SPARQL endpoints for clustering),
  • using the FORTH entity / text mining application with FLOD and Ecoscope to highlight Web pages,
  • using FORTH entity / text mining to annotate new kinds of information resources (bibliographic references, OGC metadata...).


For each of them, it is envisioned (by January 2013) to review and benchmark their added value according to the following D4Science standard review:

  • Who are the Users
  • Who are the co-funding partners
  • What are the D4Science infrastructure resources involved
  • What are the outcomes that match the D4Science Description of Work
  • How do they fit in the EA-CoP business cases
  • How do they contribute to the sustainability of an EA-CoP
  • How far are they re-usable with clear benefits to EA-CoP representatives, and proven compatibility with EA-CoP resources
  • How far are they consistent with EC regulations/strategies such as the open data strategy for Europe [14].
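For the SPARQL endpoint and OpenSearch actions listed in this section, the Python sketch below shows how a client might build request URLs without sending them. It relies on two conventions: the SPARQL Protocol accepts a query over HTTP GET in a `query` parameter, and OpenSearch URL templates use placeholders such as `{searchTerms}` and the optional `{startPage?}`. The endpoint and template URLs under example.org and the function names are hypothetical; they are not the addresses of the FLOD, Ecoscope or xSearch services.

```python
import re
from urllib.parse import quote, urlencode


def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def fill_opensearch_template(template, **params):
    """Substitute {name} and optional {name?} placeholders of an OpenSearch URL template."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""            # optional parameter left empty
        raise KeyError(name)     # required parameter missing
    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


# Hypothetical endpoint and template, for illustration only.
SPARQL_URL = sparql_get_url(
    "http://example.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
)
SEARCH_URL = fill_opensearch_template(
    "http://example.org/search?q={searchTerms}&page={startPage?}",
    searchTerms="yellowfin tuna",
)
```

Building the URLs separately from sending them keeps the sketch independent of any HTTP library and of the availability of a real endpoint.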

Cluster Participants and Roles

  • IRD:
    • provides an ontology about domain entities and related information resources metadata,
    • provides expertise about the domain (Ecosystem Approach to Marine Resources) with underlying research laboratory
  • FAO:
    • provides an ontology which deals with entities of the domain (vessel, gear, linneantaxonomy, port, flagstate, area: sea, eez, statisticaldivision, rfb..),
    • provides Linked Open Data (publications) which are annotated with FLOD ontologies URIs
  • FORTH:
    • provides expertise in setting up ontologies and works on TLO [Top_Level_Ontology]
    • provides tools to annotate information resources and discover them through a search engine exploiting ontologies (for clustering results...)

Appendix A - Resources

  • Wiki page about Top Level Ontology / TLO [Top_Level_Ontology#TLO_Implementation]
  • Ongoing version of TLO [15]
  • Previous version of TLO [16]
  • FORTH xSearch [XSearch]
  • FORTH tagger [XSearchLink]
  • Ecoscope fact sheet example [17]

Appendix B - Budget

Appendix C - Schedule

The Semantic Cluster aligns its work plan with the milestones of its primary 'customer', namely the planned D4Science Board meetings scheduled throughout the lifetime of the D4Science project:

  • Semester 1 (Nov 2011 - Apr. 2012);
    • Mobilization phase: identification of opportunities for collaboration and technologies
    • Semantic Cluster support:
  • Semester 2 (May 2012 - Oct. 2012);
    • Stabilization phase: validation of opportunities and definition of the technology scope
    • Semantic Cluster support:
  • Semester 3 (Nov 2012 - Apr. 2013);
    • Experimentation phase: with technologies, and with expansion of the EA-CoP user base
    • Semantic Cluster support:
  • Semester 4 (May 2013 - Oct. 2013);
    • Validation phase: collaboration structures and EA-CoP requirements consolidation
    • Semantic Cluster support:
  • Semester 5 (Nov 2013 - Apr. 2014);
    • Exploitation phase: operations through EA-CoP collaboration frameworks
    • Semantic Cluster support:

Appendix D - Documents

TCOM Documents

  • OGC/ISO Publishing guidelines for Data and Services Providers. Use Cases and links with the Statistical Cluster (and VREs) and Semantic Cluster (Tuna Atlas fact sheets and indicators) TCOM-4 Oostende, Belgium 23-25 January 2013 at: http://bscw.research-infrastructures.eu/bscw/bscw.cgi/d275308/Geospatial_and_semantic.pdf
  • T10.4-Semantic Data Analysis FORTH 4th TCOM.pdf TCOM-4 Oostende, Belgium 23-25 January 2013 [18]
  • T10.4-FLOD initiative TCOM-4 Oostende, Belgium 23-25 January 2013 [19]

Appendix E - Other

D4Science Technical Guidelines

  • Publishing guidelines for Data and Services Providers [Semantic_cluster_guidelines]