Semantic technologies cluster

From D4Science Wiki
Revision as of 16:25, 27 January 2012 by Anton.ellenbroek

iMarine Partners Position

The iMarine project promotes scientific research conducted in interconnected Virtual Research Environments (VREs). D4Science-II (D4SII) has successfully provided the infrastructure and an initial toolset to enable scientists to use distributed computational power and shared digital resources. The interoperability of D4SII VREs reflected the degree of system connectivity implemented at infrastructure level. In iMarine, semantic technologies offer an additional approach to system interoperability, based on connectivity implemented at data and metadata level.

The motivation for enforcing the use of semantic technologies in iMarine is to complement the organization and consumption of digital resources created inside and outside the infrastructure.

The three actions that make the semantic contribution concrete are: the harmonization of foreign coding systems, the semantic annotation (ref. T6.3) of digital resources, and information retrieval based on semantic search; these represent substantial innovations with respect to D4SII. The CoPs adhering to this approach will have their information assets linked, and will be relieved of the commitment to move the data into a shared space. Linked information is retrievable not only from within, but also from beyond the boundaries of the iMarine infrastructure.

The integration of legacy information systems brings in years of CoP activity results (ref. T6.2). With the semantic approach it becomes possible to implement access to iMarine information assets as one coherent source of facts, rather than a silo of collected contents. This second innovative contribution of the project makes it possible to answer user needs expressed as questions such as:

“What amount of fish was caught in 2008 in the Danish Exclusive Economic Zone by vessels that practice fishing with traps, and have signed fishing agreements with the United Kingdom?”

Three classes of users are targeted: data managers, scientists, and decision makers. Data managers need to process data, and rely on services that harmonize across existing coding systems. Scientists need to collect and access field data, and rely on effective data exchange. Decision makers need to read the explanation of scientific evidence, provided by retrieving relevant documentation.

The semantic approach does not require the replacement of existing technology stacks (from data schemas to engineering components and clients); instead, it adds a richer description of existing data, providing the glue for the co-existence of heterogeneous equivalent formats. It complements existing technological approaches, with a peek into the “Future Web” trends.

The first request to the partners in the iMarine project, and to the wider CoPs, is to participate in a thinking process that considers the iMarine information assets as part of a bigger life science data framework contributed by institutions worldwide. The sooner iMarine enters this virtuous loop of data “sharing”, the sooner it will leverage the effort of companion communities joining the Linked Open Data philosophy.

In the following we describe the streams of operation to start creating the semantic support in iMarine. We indicate the actors, in as much detail as possible, and the proposed scenarios for applying such support. Each section provides background information, the proposed measure and the anticipated impact, to give a complete overview. We also indicate which WPs can host the activities described, to facilitate the involvement of key people in the project. The content of the sections below is intended as a starting point for more focused discussions on open issues.

The Semantic Technologies proposals: objectives and services

The support from semantic technologies in iMarine can be achieved through the following objectives:

O1. Creation of a semantic network of reference data from existing semantic datasets, reusing shared semantic models (e.g. Darwin Core, SKOS, FOAF, GeoOWL, etc.).

O2. Expansion of the iMarine semantic network from O1 with alignments to existing semantic datasets (e.g. WoRMS, Fishbase, Aquamaps, FAO, etc.).

O3. Production of a semantic representation of iMarine information assets (e.g. web pages, GIS layers, statistical time series, images, etc.) using the semantic network from O2.

O4. Implementation of the services for semantic search over the product of O3, and of the services to consume the iMarine semantic network from O2 (e.g. support to statistical reallocation, suggestion of annotation terms, retrieval of equivalent coding systems, classification browsing, etc.).

The main players in the adoption of semantic technologies in iMarine will contribute semantic resources and the semantic technologies required to achieve objectives O1 to O4. The Fisheries and Aquaculture Department of FAO is continuously developing the Fisheries Linked Open Data resource, a repository of semantically related reference data covering multiple disciplines related to the fisheries domain. The Institut de Recherche pour le Développement (IRD) is currently using a semantic dataset of GIS spatial objects to facilitate the aggregation of contextually relevant scientific data. The Information Systems Laboratory of FORTH-ICS will contribute to ontology-based provenance modeling (related to O1), and will investigate applying and extending its recent research results on exploratory search (results clustering, faceted search, instant search, etc.), all related to O4. It could also contribute automatic methods for producing semantic metadata by extracting the metadata embedded in various file types (related to O3). With these initial assets, work towards objectives O1 to O4 starts well above ground zero.

The web-oriented nature of the technology, and its novelty with respect to the D4S-II set of architectural components, indicate that the most suitable way to distribute semantic features is through the deployment of web-based services (ref. T11.1, T11.2, T11.3). This makes it possible to consume the services inside and outside the infrastructure, and allows a good degree of independent development before they become part of the iMarine production environment.

Fisheries Linked Open Data

Background

The statistical division (FIPS) of the Fisheries and Aquaculture Department of FAO promotes and practises standardization policies that traverse, from top to bottom, the stack of divisional technologies. The quest for standards aims to harmonize the divisional knowledge and to facilitate its exchange inside and outside the organization. The opening to semantic technologies, envisaged as a vehicle for data harmonization, has motivated the creation of a divisional semantic knowledge base.

Under D4S-II and its activities for advanced data curation, the FIPS knowledge base took its preliminary shape, including equivalent coding systems mapped by semantic relationships. Towards the end of the project, the domains covered by the knowledge base increased to include: land and marine geography, land and marine geo-politics, fishery legislation, fishery techniques, fishery vessels and ports. The knowledge base was named Fisheries Linked Open Data (FLOD) because it presents links among RDF data (URIs) and is openly accessible.

FLOD is used to keep relationships of equivalent codes between the ASFIS species list and coding systems such as Taxonomic Code, FIGIS, WoRMS, Aquamaps and Fishbase. In addition, FLOD relates Exclusive Economic Zones (EEZ) and High Seas (HS) to the countries that can practice fishing activities in those areas, and under what legal conditions. FLOD relates the EEZ and HS to the FAO area classification list through geospatial links (e.g. intersect, csquare indexing). FLOD also relates the ASFIS species list to species distribution in the FAO area classification list. All the entities in FLOD are potential metadata to use in annotation processes. An example of how the above excerpt of FLOD content is used is provided by SPREAD (developed under D4S-II). SPREAD is an application that performs spatial reallocations of catch statistics collected at regional level (small scale) to FAO level (bigger scale). SPREAD needs to unravel the complex knowledge background of catch time series in order to correctly convey the data from the regional statistical context to the FAO statistical context.
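To make the kinds of relationship listed above more tangible, the following is a minimal sketch of a triple-style store with wildcard pattern matching. It is not the FLOD implementation: the predicate names (`sameAs`, `distributedIn`, `intersects`, `fishingAllowedFor`), all identifiers, and the helper functions are placeholders assumed for illustration.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

# Illustrative data only: every identifier and predicate name is a placeholder,
# not an actual FLOD URI or code.
TRIPLES: list[Triple] = [
    ("asfis:SPECIES_A", "sameAs", "worms:ID_1"),
    ("asfis:SPECIES_A", "sameAs", "fishbase:ID_9"),
    ("asfis:SPECIES_A", "distributedIn", "faoarea:AREA_1"),
    ("eez:ZONE_X", "intersects", "faoarea:AREA_1"),
    ("eez:ZONE_X", "fishingAllowedFor", "country:COUNTRY_Y"),
]


def match(
    triples: list[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def eez_for_species(triples: list[Triple], species: str) -> list[str]:
    """EEZs that intersect any FAO area in which the species is distributed."""
    areas = {o for _, _, o in match(triples, s=species, p="distributedIn")}
    return sorted(s for s, _, o in match(triples, p="intersects") if o in areas)


equivalents = [o for _, _, o in match(TRIPLES, s="asfis:SPECIES_A", p="sameAs")]
zones = eez_for_species(TRIPLES, "asfis:SPECIES_A")
```

Chaining two patterns, as `eez_for_species` does, is the sort of traversal an application such as SPREAD would presumably need when moving from a species code to the areas and zones related to it.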

The more FLOD content is refined and enriched, the better it will serve as:

  1. the source of metadata to annotate the results of VRE processes (e.g. SPREAD reallocation, curated time series, distribution maps, etc.), and
  2. the gluing relationships among the data available within and across VREs.

The iMarine data infrastructure represents an additional data provider for the planned knowledge expansion of FLOD. The directions for expansion are: horizontal, w.r.t. the number of data domains to include (i.e. covering more related data spaces), and vertical, w.r.t. the included data domains (i.e. intensifying the cross-domain relationships). More domains and more relationships directly impact systems integration, in line with the rationales of T3.3, T6.2, T8.4, T9.3, T10.1, T10.2, T11.1, T11.2 and T11.3.

Proposed measure

The proposals to enrich the content of FLOD, consistently with the requirements from other VREs, are:

  1. Engage in a process of validation for the existing relationships of code equivalence between the ASFIS species list and the code lists outside FAO control. This action point will require contacting WoRMS, Fishbase and Aquamaps and mutually approving the alignments in FLOD. At the same time, FLOD can feed the certified equivalences back to the interested code providers for the sake of interoperability; currently these kinds of mappings are not exposed publicly, although each of the sources holds partial connections internally in its DB.
  2. Increase the connections with other RDF sources that are currently registered on the LSID server of the TDWG web site (e.g. Catalogue of Life, NameBank). At the same time, FAO will produce LSIDs from the ASFIS and Taxonomic species lists and register them on the LSID web resolver of the TDWG web site. Through the LSID resolver, other users will access the code correspondences.
  3. Implement data ingestion from VREs, or other sources in the project, to enrich FLOD content (mostly cross-domain relationships). This will require the implementation of adapters from the source formats to RDF. A first ingestion case can be applied to IRD observation files in Ecological Metadata Language (EML), and to the geospatial RDF objects currently part of the IRD knowledge base. FAO will implement the adapters with the software design supervision and support of the NKUA developer team.
  4. Implement a workflow for the maintenance of the semantic datasets created in FLOD (equivalent codes and cross-domain relationships). Currently FLOD has a designed and partially implemented maintenance mechanism based on dataset dependencies, change monitoring and a cascading change process. With the supervision and support of NKUA and FORTH, FAO will integrate its mechanism into a context where the maintenance functionalities are reusable by many (e.g. IRD).
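The maintenance mechanism mentioned in item 4 (dataset dependencies, change monitoring, cascading changes) can be pictured with a small dependency graph. The sketch below is only an assumption about how such a cascade might be computed; the dataset names and the `cascade` function are hypothetical and do not describe the actual FLOD mechanism.

```python
from collections import deque

# Hypothetical dependency map: each dataset lists the datasets derived from it.
DEPENDENTS: dict[str, list[str]] = {
    "asfis_species_list": ["asfis_worms_mapping", "species_area_distribution"],
    "asfis_worms_mapping": ["sameas_index"],
    "species_area_distribution": [],
    "sameas_index": [],
}


def cascade(changed: str, dependents: dict[str, list[str]]) -> list[str]:
    """Return every dataset affected by a change, in breadth-first order."""
    affected: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for derived in dependents.get(current, []):
            if derived not in seen:  # visit each dataset once, even in a diamond
                seen.add(derived)
                affected.append(derived)
                queue.append(derived)
    return affected


to_refresh = cascade("asfis_species_list", DEPENDENTS)
```

Breadth-first order is enough to list what is affected; a workflow in which one dataset is derived from several sources would more likely need a topological order before re-running the ingestion steps.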

These proposals will be tackled in accordance with the tasks in WP3, WP6, WP8, WP9, WP10 and WP11 listed above.

Anticipated impact

With respect to the four items listed in the previous section, the expected impacts are:

  1. A case for the application of the policies on data governance. A certified dataset of equivalent code lists from multiple providers will be a reference resource both inside and outside the project. The fisheries department has collected the requirements for mapping its ASFIS code list to other existing coding systems, and iMarine is the perfect context to provide certification for this product. The certified sources, such as WoRMS, Fishbase, Aquamaps, Catalogue of Life and NameBank, contain other information (e.g. species phylogeny, species nomenclature and the records of its change) that is used by VREs, and for which local copies can be dropped in favour of decentralized maintenance.
  2. Being part of the LSID registered list on the TDWG web resolver enlarges the audience potentially connecting to iMarine data through a certified resource, in accordance with the implementation of the iMarine data policies. Re-use by other consumers will depend on which other information is disseminated with the LSIDs.
  3. If formats in standard use by the scientific community (e.g. EML) are chosen carefully, the data format adapters (to RDF) could attract providers (e.g. IRD) willing to see their content connected in iMarine. The same adapters will ensure that update mechanisms (e.g. for updated or new code lists) only need to reference the updated data source and then run the ingestion process to feed FLOD.
  4. The implementation of RDF data maintenance will have an impact on the synchronization between data sources and targets.

Digital Resource Semantic Annotation

Background

A semantic annotation implies a tagging procedure with terms (metadata) conveying a meaning defined in a separate model (ontology). The annotation of digital resources generates a representation by means of the semantic terms. Such a representation supports the interpretation of a resource whose content is not easily comprehensible by the machines that run retrieval procedures (e.g. images, table values, GIS maps, etc.).

When the semantic terms are shared in a large community, which reuses them to annotate different resource types, a clustering effect takes place (e.g. all reports and images about Yellow Fin Tuna catches in the North Sea). When the semantics of the terms is also shared (reuse of ontologies), retrieval systems will be able to pull together reports and images by extension of meaning and, most importantly, agree on the interpretation of the results.
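The clustering effect described above can be sketched as an inverted index from terms to resources, where each annotation also counts for its broader terms. This is an illustrative sketch only: the term identifiers, file names and the `BROADER` table are invented for the example and stand in for a shared ontology.

```python
from collections import defaultdict

# Hypothetical "broader term" links, standing in for a shared ontology.
BROADER: dict[str, str] = {
    "term:yellowfin_tuna": "term:tuna",
    "term:tuna": "term:fish",
}

# Hypothetical annotations: resource -> terms attached to it.
ANNOTATIONS: dict[str, set[str]] = {
    "report_001.pdf": {"term:yellowfin_tuna", "term:north_sea"},
    "image_042.jpg": {"term:yellowfin_tuna"},
    "map_007.tif": {"term:tuna"},
}


def with_ancestors(term: str, broader: dict[str, str]) -> list[str]:
    """Return the term followed by its chain of broader terms."""
    chain = [term]
    # Stop at the top of the hierarchy, or if a cycle would repeat a term.
    while (parent := broader.get(chain[-1])) is not None and parent not in chain:
        chain.append(parent)
    return chain


def index_by_term(
    annotations: dict[str, set[str]], broader: dict[str, str]
) -> dict[str, set[str]]:
    """Cluster resources under each annotation term and its broader terms."""
    index: dict[str, set[str]] = defaultdict(set)
    for resource, terms in annotations.items():
        for term in terms:
            for expanded in with_ancestors(term, broader):
                index[expanded].add(resource)
    return dict(index)


clusters = index_by_term(ANNOTATIONS, BROADER)
```

Looking up `clusters["term:tuna"]` then returns the report, the image and the map together, although only the map was annotated with that exact term; this is the "extension of meaning" that shared semantics makes possible.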

The semantic web is flooded with metadata and data for the life sciences. Many initiatives, some at global level (e.g. TDWG), and European projects (agInfra) are running activities to establish best practices for describing content with semantic metadata, and to get the most out of it in terms of interoperability, system integration, and data harmonization.

For the VREs in the iMarine infrastructure, the practice of semantic annotation represents a unique opportunity to pursue two goals at once: agree on a comprehensive set of identifiers shared by the VRE operators, and enable the same operators to be interoperable with practitioners outside the infrastructure. The first goal is in line with the rationale of T6.3, T8.4, T9.1, T10.1, T10.4 and T11.3, and promotes synergies with the agInfra project.

Proposed measure

The proposal is to run a three step process:

  1. Identify the types of resources to be annotated, and some collections.
  2. Run semantic annotation using available technologies: COTS, iMarine provided, agInfra provided.
  3. Include semantic metadata in the mainstream processes of the infrastructure (e.g. VRE data exchange, search) and evaluate the results.

The three steps can be repeated over several rounds, each time based on criteria of best return on investment.

In the first round, IRD (among others) will indicate the file types and collections of digital resources they would like to be annotated and linked to their current semantic knowledge base feeding the Ecoscope Portal. FAO will indicate existing technologies that are ready to use for resource annotation (in collaboration with FORTH and agInfra). CNR/NKUA/FORTH will indicate the integration mechanisms to ingest the metadata and feed the VREs in the infrastructure (e.g. processing species observations provided by IRD).

This proposal will be tackled in accordance with the tasks in WP6, WP8, WP9, WP10 and WP11 listed above.

Anticipated impact

Practicing semantic annotation will primarily affect the connection of the digital resources created across VREs; by extension, the same VREs will be networked by sharing, for instance, the subject of the resources they treat. As a secondary effect, annotation will also expose the VRE results in an indirect dissemination activity by means of shared semantic metadata.

SameAs Entity Server

Background

The ICES VRE requires the curation of record values so that they adhere homogeneously to a reference standard (e.g. ISO3 code for all the countries). This requirement gave rise to the idea of consuming FLOD as the source of equivalent code lists openly available on the web.

FLOD is a semantic resource containing information that goes well beyond code equivalence. FLOD is also easy to consume for semantic clients, but less straightforward to integrate into non-semantic architectures. Hence the need to implement a service that focuses only on the equivalent entities in FLOD and is easy to integrate with non-semantic components, such as the ICES VREs or other data management environments that IRD, for instance, may have.

Other websites, WoRMS for instance, provide services to match taxonomic entities, with the following limitations: they work only with scientific names (prone to misspelling errors), accept a limited number of requests at once for the online service (i.e. 2500), and provide matches only with the WoRMS AphiaID.

The sameAs service will: accept codes from any mapped source code list, place no limit on the number of online requests, and provide the equivalent code from any mapped list (e.g. in: worms, out: asfis). The sameAs service will not be limited to taxonomic species, but will cover other reference objects such as countries, water areas, etc.

By design, the sameAs service is meant to foster interoperability among systems that adopt different code lists to reference their data, in line with the rationales of T3.3, T9.3, T11.2 and T11.3.
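The lookup behaviour described for the sameAs service (any mapped list in, any mapped list out) can be sketched by grouping pairwise mappings into equivalence classes. The codes, list names and function names below are hypothetical placeholders; the interface of the actual service is still to be designed, as the proposed measure explains.

```python
Entity = tuple[str, str]  # (code list name, code)

# Hypothetical pairwise mappings; the codes are placeholders, not real identifiers.
MAPPINGS: list[tuple[Entity, Entity]] = [
    (("asfis", "AAA"), ("worms", "1001")),
    (("worms", "1001"), ("fishbase", "F-77")),
    (("iso3", "XXA"), ("iso2", "XA")),
]


def build_index(mappings: list[tuple[Entity, Entity]]) -> dict[Entity, set[Entity]]:
    """Merge pairwise mappings into equivalence classes keyed by each member."""
    groups: dict[Entity, set[Entity]] = {}
    for left, right in mappings:
        merged = groups.get(left, {left}) | groups.get(right, {right})
        for member in merged:  # every member points at the same merged class
            groups[member] = merged
    return groups


def same_as(
    index: dict[Entity, set[Entity]], source: str, code: str, target: str
) -> list[str]:
    """Return the codes in `target` equivalent to `code` from `source`."""
    group = index.get((source, code), set())
    return sorted(c for code_list, c in group if code_list == target)


INDEX = build_index(MAPPINGS)
worms_to_asfis = same_as(INDEX, "worms", "1001", "asfis")
```

Because equivalence is treated as transitive here, `same_as(INDEX, "asfis", "AAA", "fishbase")` also resolves, even though no direct ASFIS to Fishbase mapping was declared. The same index serves non-taxonomic reference objects, as the country-code pair shows.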


Proposed measure

Initially, ICES developers will design the application interface to access the sameAs service content that is functional to the VRE data harmonization process. FAO will implement the application interface, with the support and supervision of NKUA to ensure that the constraints for software integration are respected. In a second phase, the sameAs service could be integrated as a proxy (ENS) among many VREs (e.g. in a cross-VRE search engine). NKUA will then move the integration level of the service to a deeper layer in the infrastructure.

These proposals will be tackled in accordance with the tasks in WP3, WP9 and WP11 listed above.

Anticipated impact

VREs that need to switch code lists can rely on an up-to-date source of equivalent codes, without the burden of locally updating a cloned copy of all the relevant providers (e.g. Aquamaps, WoRMS, Fishbase DBs). Applications that need to know the target coding system in order to query the target data server can use the sameAs service as a proxy, very much like a DNS (some call the sameAs service the Entity Name Server). With an ENS, end users can experience the sameAs service transparently, without having to specify the coding system in their search.


Semantic Search

Background

iMarine will bring under the same umbrella data from multiple disciplines with some degree of relationship among their subjects (e.g. VTI species distribution and catch time series in ICES/SPREAD). With the right level of conceptual data modelling, the operators of iMarine will experience data access in a way that is closer to their background discipline. Semantic search complements full-text search by leveraging the presence of metadata and life science ontology models. The technology developed by the FORTH institute allows the information needs expressed by the users to be expanded through a process of semantic interpretation of the users’ input. A semantic search provides result items with a meaningful explanation of why they are relevant to the search. Items in the result set can be dynamically ranked according to criteria that are based on information facets and indicated by the users. Enabling semantic search requires a process of data analysis, metadata design, and data representation, in line with the activities to be developed in T10.1, T11.2 and T11.3.
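The three behaviours described above (query expansion, explained results, facet-based ranking) can be illustrated with a toy example. This is a generic sketch and not the FORTH technology: the `NARROWER` table, the items, the single `type` facet and the `search` function are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    terms: set[str]
    facets: dict[str, str] = field(default_factory=dict)


# Hypothetical narrower-term table, standing in for an ontology model.
NARROWER: dict[str, list[str]] = {"tuna": ["yellowfin tuna"]}

# Hypothetical annotated items.
ITEMS = [
    Item("catch_series_a", {"yellowfin tuna"}, {"type": "time series"}),
    Item("report_b", {"tuna"}, {"type": "report"}),
    Item("map_c", {"cod"}, {"type": "map"}),
]


def search(query, items, narrower, prefer_type=None):
    """Return (item name, reasons) pairs for the query and its narrower terms."""
    # Expand the query and remember why each expanded term is relevant.
    reasons_by_term = {query: f"annotated with '{query}'"}
    for term in narrower.get(query, []):
        reasons_by_term[term] = f"annotated with '{term}', narrower than '{query}'"

    hits = []
    for item in items:
        reasons = [why for term, why in reasons_by_term.items() if term in item.terms]
        if reasons:
            hits.append((item, reasons))

    # Facet-based ranking: items with the preferred type come first (stable sort).
    if prefer_type is not None:
        hits.sort(key=lambda hit: hit[0].facets.get("type") != prefer_type)
    return [(item.name, reasons) for item, reasons in hits]


results = search("tuna", ITEMS, NARROWER)
reports_first = search("tuna", ITEMS, NARROWER, prefer_type="report")
```

A plain full-text match on "tuna" would miss `catch_series_a` in this toy data if its text only mentioned the narrower term; the expansion step retrieves it and records the reason alongside the hit.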

Proposed measure

FORTH will perform a semantic analysis of the data and content (e.g. document reports, datasets, GIS maps, and images) available in iMarine, with the support of FAO and CNR. In parallel, the semantic models already adopted by the CoPs will be analysed, producing the list of selected ontologies to reuse for representing data and content. According to the use cases defined for the semantic search, a selection of data and content will undergo annotation or triplification to make them compatible with the input to the semantic engine.

NKUA, with the supervision of FORTH, FAO and CNR, will integrate the semantic search engine as an infrastructure component to be used inside and across VREs.

These proposals will be tackled in accordance with the tasks in WP10 and WP11 listed above.

Anticipated impact

The expected impact is to expand the capabilities to perform multilingual search across the data infrastructure, and to retrieve results beyond the relevance calculated by classical information retrieval systems. A result set will include items because they are semantically relevant, or because they are connected through a network of semantic relationships with the items that directly match the user search.


Digital Resource Aggregator

Background

The scenarios for information mash-up became better known to the general public with the success of RSS feeds. Given a topic of choice, some web applications exploit RSS metadata to suggest relevant new items to their readers. With RDF and linked datasets we can push this concept further and enrich content with other relevant snippets of information.

This same document could be enriched with pictures, links, term annotations or other related documents if a monitoring system were analysing the input text, similarly to Zemanta technology. For instance, the FCPPS VRE could produce reports with enriched related content suggested to the report manager. The Aquamaps viewer could include FIGIS factsheets about the species distribution, or taxonomic information gathered by WoRMS, or images from Ecoscope, or time series from the ICES VRE; and the same holds in the other direction for these sources. Enabling a digital resource aggregator fosters interoperability and dissemination of content across VREs, in line with the rationales underlying T10.1, T11.2 and T11.3.

Proposed measure

After producing the digital resource annotations and integrating the semantic search engine in the infrastructure, a pilot VRE like FCPPS, or the Ecoscope portal from IRD, could be equipped with a client to a service that provides references to digital resources relevant to the information context.

NKUA, under the supervision of those responsible for the VRE and together with FAO and FORTH, will work to integrate and test the features that enrich the user information context.

In addition, the agInfra project developed the WebAgris technology, which is capable of returning relevant scientific publications, annotated with AGROVOC terms, from a user search. This service could be directly tested in FCPPS, or Ecoscope, or other selected information environments.

These proposals will be tackled in accordance with the tasks in WP10 and WP11 listed above.

Anticipated impact

Having aggregated content will help create more context around the data provided in a VRE. In a similar way, pushing iMarine data outside the infrastructure boundaries will acknowledge data provenance. If FCPPS is chosen as the VRE to pilot the use of aggregated content, then a report could take advantage of being populated, section by section, with maps, statistical dataset excerpts, biological data, images, and beyond.