Difference between revisions of "X-Link"
Revision as of 10:57, 29 March 2013
General Description
Persons responsible for editing/maintaining this page
- Pavlos Fafalios (fafalios@ics.forth.gr)
- Yannis Marketakis (marketak@ics.forth.gr)
- Julien Barde (julien.barde@ird.fr)
Description
The requirements concern the development of an application (a library, a RESTful service, or something else; this is still open) that, based on a knowledge base, will be able to match named entities found in a file to URIs. The objective is to build upon previous demonstrations of entity mining (highlighting terms in web pages) to meet the needs of the community of users who will create new linked open data for different clients (e.g. search engines). Among the data to be turned into linked data: bibliographic references, metadata (OGC from the geospatial cluster, RDF results from opensearch complying with the GENESI-DEC RDF schema), named entities in documents (Word, PDF files), etc.
In brief, the aforementioned application should:
a) read the content of a file (doc/pdf/XML/RDF) or web page as input,
b) discover named entities of interest (e.g. keywords, Species, Persons, Organizations, etc.) in that file,
c) match each discovered entity with one (ideally) or more entities from the underlying knowledge bases (i.e. URIs of TLO, FLOD, Ecoscope, etc.).
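Steps a) to c) can be sketched as a small pipeline. This is only a minimal illustration under strong assumptions: the input is already plain text, the knowledge base is a hypothetical in-memory dictionary (KNOWLEDGE_BASE) instead of a SPARQL endpoint, and entity discovery is a naive whole-word lookup. All function names are made up for the example.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical in-memory stand-in for a knowledge base: label -> candidate URIs.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "Mediterranean": [
        "http://www.ecoscope.org/ontologies/ecosystems/mediterranean_ecosystem"
    ],
}


def read_content(text: str) -> str:
    """Step a: the input is assumed to be plain text already.
    Real doc/pdf/XML/RDF input would need a format-specific parser."""
    return text


def discover_entities(content: str, known_labels: Iterable[str]) -> List[str]:
    """Step b: naive discovery, i.e. case-sensitive whole-word lookup of known labels."""
    found = []
    for label in known_labels:
        if re.search(r"\b" + re.escape(label) + r"\b", content):
            found.append(label)
    return found


def match_entities(entities: Iterable[str], kb: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Step c: map each discovered entity to its candidate URIs."""
    return {entity: kb.get(entity, []) for entity in entities}


def link(text: str) -> Dict[str, List[str]]:
    """Run steps a, b and c in sequence and return the matchings."""
    content = read_content(text)
    entities = discover_entities(content, KNOWLEDGE_BASE.keys())
    return match_entities(entities, KNOWLEDGE_BASE)
```

For instance, link("Tuna catches in the Mediterranean declined.") would return a dictionary that maps "Mediterranean" to the Ecoscope ecosystem URI.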
The supposed process is sketched in the following figure:
Examples:
1) From the following author description:
<foaf:Person>
<foaf:givenname>C.</foaf:givenname>
<foaf:surname>Mellon-Duval</foaf:surname>
</foaf:Person>
The tagger will find the triple of the related foaf:agent in the Ecoscope SPARQL endpoint (or other endpoints):
http://www.ecoscope.org/ontologies/agents/capucineMellon foaf:name C.Mellon-Duval
2) From the following:
<dc:subject>
<z:AutomaticTag>
<rdf:value>Mediterranean</rdf:value>
</z:AutomaticTag>
</dc:subject>
The tagger will find the triple of the related ecosystem in the Ecoscope SPARQL endpoint (or other endpoints):
http://www.ecoscope.org/ontologies/ecosystems/mediterranean_ecosystem rdfs:label Mediterranean
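A possible way to look up such triples is an exact-match SPARQL query on the property used in each example (foaf:name or rdfs:label). The sketch below only builds the query string; it does not contact any endpoint. The helper names (escape_literal, build_lookup_query) are hypothetical, and the exact-match strategy is an assumption: labels stored with a language tag or in a different spelling would not be found this way.

```python
# Namespace URIs of the two vocabularies used in the examples above.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes so the value is safe inside a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_lookup_query(prop: str, value: str) -> str:
    """Build a query returning every ?uri whose property `prop` equals `value` exactly.

    `prop` is a prefixed name such as "rdfs:label"; its prefix must be in PREFIXES.
    """
    prefix = prop.split(":", 1)[0]
    return (
        f"PREFIX {prefix}: <{PREFIXES[prefix]}>\n"
        f'SELECT ?uri WHERE {{ ?uri {prop} "{escape_literal(value)}" }}'
    )
```

With these assumptions, build_lookup_query("foaf:name", "C.Mellon-Duval") corresponds to example 1 and build_lookup_query("rdfs:label", "Mediterranean") to example 2.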
Difficulties/Challenges:
a) We must limit the probability of erroneous entity-to-URI matchings. Possible solution: a user approves the matchings (however, this may be laborious), or only unambiguous URIs are kept.
b) If an entity has been matched to more than one possible URI, which one should be selected? Possible solution: a user selects the right one (however, this may be laborious).
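The two possible solutions can be combined: keep unambiguous matchings automatically and hand only the ambiguous ones to a user. The following is a minimal sketch of that policy; the function name split_by_ambiguity and the shape of the input (a dictionary from entity to candidate URIs) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_by_ambiguity(
    matchings: Dict[str, List[str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Separate matchings into auto-accepted ones and ones needing user review.

    An entity with exactly one distinct candidate URI is accepted.
    An entity with several distinct candidates is queued for review.
    An entity with no candidate is dropped.
    """
    accepted: Dict[str, str] = {}
    needs_review: Dict[str, List[str]] = {}
    for entity, uris in matchings.items():
        unique = list(dict.fromkeys(uris))  # drop duplicates, keep order
        if len(unique) == 1:
            accepted[entity] = unique[0]
        elif len(unique) > 1:
            needs_review[entity] = unique
    return accepted, needs_review
```

This reduces the manual effort to the ambiguous cases only, but it does not protect against a single wrong candidate being accepted, so challenge a) still applies.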
Related iMarine WP/Tasks
It could be considered related to T10.4 - Semantic Data Analysis Facilities although it was not described in the corresponding milestone: Semantic_Data_Analysis
Related iMarine Deliverables
-
Related Milestones
-
Related Cluster
http://wiki.i-marine.eu/index.php/Semantic_cluster_achievements
Related Presentations/Tutorials
-
Current (development) status
Understand the problem and define the requirements against the challenges.
Demo Scenarios
(to describe one or more ideal scenarios)
The following figure (by Julien) depicts a possible application that exploits the functionality of X-Search-Link:
Design (tentative)
In order to offer the aforementioned functionality, X-Search-Link needs to know:
- the document we want to analyze (e.g. PDF, DOC, RDF, XML, web page, etc.),
- the categories of entities to detect in the document (e.g. Countries, Species, Water Areas, etc., or all possible categories).
In the back-end, for each category of entities X-Search-Link is aware of:
- a knowledge base (specifically a SPARQL endpoint) from which we can find data related to the corresponding category,
- a SPARQL template query for retrieving entities related to a string,
- maybe a list of named entities belonging to the category, and for each entity one or more corresponding URI(s).
The desired result is a list of matchings. Specifically each detected entity in the document will be matched with one or more URI(s).
In addition, the system is able to return all the available categories.
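The back-end knowledge described above could be held in a small registry of categories. The sketch below is one possible data structure, not a decided design: the class names (Category, CategoryRegistry), the $term placeholder in the template query, and the field names are all assumptions.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict, List


@dataclass
class Category:
    """What the back-end knows about one category of entities."""
    name: str
    endpoint: str            # URL of the SPARQL endpoint
    query_template: str      # SPARQL query containing a $term placeholder
    # Optional list of known named entities: label -> one or more URIs.
    known_entities: Dict[str, List[str]] = field(default_factory=dict)

    def query_for(self, term: str) -> str:
        """Fill the template query with a (quote-escaped) search string."""
        safe = term.replace("\\", "\\\\").replace('"', '\\"')
        return Template(self.query_template).substitute(term=safe)


class CategoryRegistry:
    """Holds all configured categories and can list them."""

    def __init__(self) -> None:
        self._categories: Dict[str, Category] = {}

    def add(self, category: Category) -> None:
        self._categories[category.name] = category

    def get(self, name: str) -> Category:
        return self._categories[name]

    def available_categories(self) -> List[str]:
        return sorted(self._categories)
```

A category would then be registered with its endpoint and a template such as SELECT ?uri WHERE { ?uri rdfs:label "$term" } (with the rdfs prefix declared), and available_categories() would provide the list of categories that the system returns.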
Accordingly, an initial step could be to build a software library that implements this functionality.
Furthermore, given the software library, a web service (e.g. with a RESTful API) can be designed and developed (the result can be returned in any format, e.g. XML, RDF triples, CSV, etc.).
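As an illustration of one of the possible result formats, the matchings could be serialized to CSV with one row per entity-URI pair. This is only a sketch: the function name matchings_to_csv and the column names entity and uri are assumptions, and no output format has been decided.

```python
import csv
import io
from typing import Dict, List


def matchings_to_csv(matchings: Dict[str, List[str]]) -> str:
    """Serialize matchings as CSV text: a header row, then one row per (entity, URI) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["entity", "uri"])
    for entity, uris in matchings.items():
        for uri in uris:
            writer.writerow([entity, uri])
    return buffer.getvalue()
```

An entity matched with several URIs simply produces several rows, which keeps the format flat and easy to consume by other clients.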
Given the above (software library and/or web service), one or more portlets could be developed which will exploit X-Search-Link for offering any kind of desired functionality.
An important issue is how to add a new category (type) of entities. In that case, X-Search-Link needs:
- a SPARQL endpoint,
- the URI of the resource class (type), e.g. http://www.ecoscope.org/ontologies/ecosystems_def#shark
When a new category of entities is added, the category becomes available and X-Search-Link can start detecting entities of that category.
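Given only the class URI, a query for the new category could be generated automatically. The sketch below assumes that instances of the class carry an rdfs:label and that a case-insensitive exact match on the label is good enough; both are assumptions, as is the function name build_category_query. It only builds the query text and does not contact the endpoint.

```python
def build_category_query(class_uri: str, term: str) -> str:
    """Build a SPARQL query for instances of `class_uri` whose rdfs:label equals `term`,
    ignoring case. Raises ValueError if the class URI contains characters that are
    not allowed inside a SPARQL <...> IRI reference."""
    if any(ch in class_uri for ch in '<>"{}|\\^` '):
        raise ValueError(f"invalid class URI: {class_uri!r}")
    # Lowercase the term and escape it for use inside a string literal.
    escaped = term.lower().replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?uri ?label WHERE {\n"
        f"  ?uri a <{class_uri}> ;\n"
        "       rdfs:label ?label .\n"
        f'  FILTER(LCASE(STR(?label)) = "{escaped}")\n'
        "}"
    )
```

For the class URI given above, build_category_query("http://www.ecoscope.org/ontologies/ecosystems_def#shark", "White Shark") would restrict the search to resources typed with that class.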
Plans and Next Steps
A tentative plan is to:
a) understand the problem and define the requirements against the challenges (by end of Feb 2013),
b) decide what is required to be designed/implemented, and
c) have a first implementation.
Related Tickets
Requirements
Enriching RDF files with the URIs of Named Entities (#1187)
Design
-
Implementation
-