Difference between revisions of "09.01.2014 Data Ingestion and Publication"

From D4Science Wiki
(Created page with "==Data Publication== *Description of our first approach: **Data returned by search system are served with OAI-PMH protocol at ASL level *Search is used in order to retrieve all ...")
 

Revision as of 18:42, 9 January 2014

Data Publication

Description of first implementation:

  • OAI-PMH publishing is based on data returned by collection browsing (testing approach), via an ASL component.

Planned implementation approach:

  • OAI-PMH "resources" (i.e. datasets) are mapped to search queries that are served by the OAI-PMH protocol provider at ASL level.
  • Search is used so that all available indexed sources can be exposed in a mixed manner, without being restricted to a single source; it is a way of providing data rather than presenting them. (As mentioned, the browsing capability of search is currently being used.)
    • Query construction: queries will also be constructed automatically by exploiting the browsing capability, and sets will be mapped to queries.
  • An endpoint providing information on all datasets exposed by a scope is provided as an ASL HTTP component.
  • Metadata are mapped into a DC format appropriate for OAI-PMH publishing:
    • A custom schema is currently being used; the intent is to transform each different schema to the DC schema via XSLT, in order to achieve OAI-PMH compliance.
    • Transformations of hosted schemas into DC-Lite are expected to be provided as XSLTs when the service is configured.
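
The mapping of OAI-PMH sets to search queries described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `PublishedSet` class, the helper names, and the query syntax are hypothetical and are not part of ASL or the search system; only the OAI-PMH namespace and the `ListSets`, `set`, `setSpec`, and `setName` element names come from the protocol itself.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


@dataclass(frozen=True)
class PublishedSet:
    """One OAI-PMH set backed by a search query (hypothetical structure)."""
    spec: str   # value exposed as setSpec
    name: str   # human-readable value exposed as setName
    query: str  # search query that selects the records (assumed syntax)


def query_for_set(sets, spec):
    """Return the search query behind the set with the given setSpec."""
    for published in sets:
        if published.spec == spec:
            return published.query
    raise KeyError(f"unknown set: {spec}")


def list_sets_xml(sets):
    """Build the ListSets element describing every exposed set."""
    root = ET.Element(f"{{{OAI_NS}}}ListSets")
    for published in sets:
        set_el = ET.SubElement(root, f"{{{OAI_NS}}}set")
        ET.SubElement(set_el, f"{{{OAI_NS}}}setSpec").text = published.spec
        ET.SubElement(set_el, f"{{{OAI_NS}}}setName").text = published.name
    return ET.tostring(root, encoding="unicode")
```

A harvester asking for a given set would then be answered by running `query_for_set(sets, spec)` against the search system, while `list_sets_xml(sets)` plays the role of the endpoint that lists what a scope exposes.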

Notes:

  • Incremental OAI-PMH publishing cannot be supported.

Suggestions (by Leonardo C.):

  • Provide a UI (portlet) for configuring the service (OAI-PMH resource name, search query definition, transformation to DC-Lite).
  • Use an approach similar to the simplified field-mapping index resources to create the DC-Lite schema required by the OAI-PMH protocol. Additionally, use the "common presentation fields" to derive the DC-Lite record, so that it is homogeneous across all schemas.
  • We should follow the general approach of having as many standards as possible supported so that we can maximize adoption, yet OAI-ORE is not a priority.
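
The field-mapping suggestion can be illustrated with a minimal sketch that derives a Dublin Core record from a flat set of presentation fields. The field names in `FIELD_TO_DC` and the function name are assumptions made for the example, not the actual "common presentation fields" of the infrastructure; the two namespaces are the standard `oai_dc` and Dublin Core element namespaces.

```python
from xml.etree import ElementTree as ET

# Standard namespaces for the oai_dc metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from presentation fields to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "date": "date",
}


def to_oai_dc(record):
    """Serialize the mapped fields of `record` as an oai_dc:dc element."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = record.get(field)
        if value:  # skip fields that are missing or empty
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because every schema is reduced to the same small field set before mapping, the resulting records look the same regardless of the source schema; the cost is that fields outside the mapping are dropped.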

Decisions / Concerns / Highlights:

  • A configuration UI must be provided alongside the service components.
  • The "search query" approach will be followed as it allows more complex views of the harvested / hosted metadata to be served by the system.
  • The only blocking issue is that the "presentation" fields may be too few to yield a valuable DC-Lite record.
  • OAI-ORE is not a major priority, so it will be considered after OAI-PMH is completed.
  • Different sets to be published must be identified.
  • The implementation phase will last at least until March, or finish sooner if possible, depending on other concurrent activities.

Data Ingestion

  • Description:
    • The intention is to be able to index various data sources. A number of plugins will be implemented that retrieve data from the sources, provide the data in an XML representation, and forward them to gDTS for transformation and feeding to the index.
  • Intermediate data are not stored anywhere. The process will be triggered during indexing as a program with a unique URI, and a manager will take care of the whole process, from data retrieval to feeding.
  • Example:
    • RDBMS data are represented as XML, further transformed into rowsets, and fed to the index.
  • No incremental update; the index is swapped at the moment. To be investigated in the future.
  • Credentials for all sources are stored on the IS.
  • Plugins will act as data providers, external to the sources (inside or outside the infrastructure), but must be able to access them.
  • Begin with RDBMS data and expand to GIS data, time series, etc. GeoNetwork can be accessed directly; exploit (convert) time series.
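
The RDBMS example above can be sketched end to end as follows. This is a rough illustration only: SQLite stands in for the source database, `to_rowsets` is a plain-Python stand-in for the gDTS transformation, and `SwappableIndex` is a hypothetical toy index; none of these names come from the actual infrastructure.

```python
import sqlite3
from xml.etree import ElementTree as ET


def fetch_as_xml(conn, table):
    """Plugin step: read every row of `table` and represent it as XML."""
    # The table name is interpolated directly; acceptable only in a sketch
    # where it comes from trusted configuration.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_el, "field", name=column)
            field.text = "" if value is None else str(value)
    return root


def to_rowsets(table_xml):
    """Stand-in for the gDTS transformation: one dict per XML row."""
    return [
        {field.get("name"): field.text or "" for field in row.findall("field")}
        for row in table_xml.findall("row")
    ]


class SwappableIndex:
    """Toy index that is rebuilt in full and swapped in, with no incremental update."""

    def __init__(self):
        self._live = {}

    def rebuild(self, rowsets, key):
        fresh = {rowset[key]: rowset for rowset in rowsets}
        self._live = fresh  # replace the whole index in one step

    def get(self, key_value):
        return self._live.get(key_value)
```

Nothing is written to disk between the steps, which mirrors the note that intermediate data are not stored; a full rebuild followed by a swap also means stale entries disappear without any delete logic.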

Other topics

To be discussed in the next TCOM:

  • attach scope notation to the granularity of collection
  • possible extensions of SmartGears