09.01.2014 Data Ingestion and Publication

From D4Science Wiki
Revision as of 18:42, 9 January 2014 by George.kakaletris (Talk | contribs)

Data Publication

Description of first implementation:

  • OAI-PMH publishing is based on data returned by collection browsing (testing approach), via an ASL component.

Planned approach implementation:

  • OAI-PMH "resources" (i.e. datasets) are mapped to search queries that are served by the OAI-PMH protocol provider at ASL level.
  • Search is used so that all available indexed sources can be exposed in a mixed manner, without being restricted to a single source; it is a way of providing data rather than presenting them. (As mentioned, the browsing capability of search is currently being used.)
    • Query construction: queries will also be constructed automatically by exploiting the browsing capability, and sets will be mapped to queries.
  • An endpoint providing information on all datasets exposed by a scope is provided as an ASL HTTP component.
  • Metadata are mapped into DC format appropriate for OAI-PMH publishing:
    • A custom schema is currently being used; the intention is to transform each different schema to the DC schema via XSLT, in order to achieve OAI-PMH compliance.
    • The transformations of hosted schemas into DC-Lite are expected to be provided as XSLTs when the service is configured.
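As a rough illustration of the set-to-query mapping described above, the following Python sketch keeps a table of OAI-PMH sets, each backed by a search query, and builds a ListSets element from it. The set names, query strings and function names are hypothetical; the actual ASL component and its query syntax are not described in these notes.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical configuration: setSpec -> (setName, search query).
# Names and query strings are placeholders, not the real service syntax.
SET_QUERIES = {
    "all": ("All indexed sources", "*"),
    "fisheries": ("Fisheries records", 'subject = "fisheries"'),
}


def list_sets(set_queries):
    """Build a <ListSets> element listing every configured set."""
    root = ET.Element(f"{{{OAI_NS}}}ListSets")
    for spec in sorted(set_queries):
        name, _query = set_queries[spec]
        set_el = ET.SubElement(root, f"{{{OAI_NS}}}set")
        ET.SubElement(set_el, f"{{{OAI_NS}}}setSpec").text = spec
        ET.SubElement(set_el, f"{{{OAI_NS}}}setName").text = name
    return root


def query_for_set(set_queries, set_spec):
    """Return the search query that backs the requested set."""
    if set_spec not in set_queries:
        raise LookupError(f"unknown set: {set_spec}")
    return set_queries[set_spec][1]
```

A harvester request for a given set would then be answered by running `query_for_set(SET_QUERIES, set_spec)` against search and converting the results to DC records.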

Notes:

  • Incremental OAI-PMH publishing cannot be supported.

Suggestions (by Leonardo C.):

  • Provide a UI (portlet) for configuring the service (OAI-PMH resource name, search query definition, transformation to DC-Lite).
  • Use an approach similar to the simplified field mapping of index resources to create the DC-Lite schema required by the OAI-PMH protocol. Additionally, use the "common presentation fields" for deriving the DC-Lite record, so that it is homogeneous across all schemas.
  • We should follow the general approach of having as many standards as possible supported so that we can maximize adoption, yet OAI-ORE is not a priority.
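The field-mapping suggestion could look roughly like the sketch below, which derives an oai_dc record from a dictionary of presentation fields. The field names and the mapping table are assumptions made for illustration; only the two namespace URIs are the standard oai_dc and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from common presentation fields to DC elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
}


def to_dc_record(fields, mapping=FIELD_TO_DC):
    """Build an <oai_dc:dc> element from the mapped presentation fields.

    Fields without a mapping, or with an empty value, are skipped.
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field_name, dc_name in mapping.items():
        value = fields.get(field_name)
        if value:
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return root
```

With such a mapping, a source that supplies only a few presentation fields yields a correspondingly sparse DC record.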

Decisions / Concerns / Highlights:

  • A configuration UI must be provided alongside the service components.
  • The "search query" approach will be followed as it allows more complex views of the harvested / hosted metadata to be served by the system.
  • The only potentially blocking issue is that the "presentation" fields may be too few to yield a useful DC-Lite record.
  • OAI-ORE is not a major priority, so it will be considered after OAI-PMH is completed.
  • Different sets to be published must be identified.
  • The implementation phase will last at least until March; it may finish sooner, depending on other concurrent activities.

Data Ingestion

  • Description:
    • The intention is to be able to index various data sources. A number of plugins will be implemented that retrieve data from the sources and provide them in an XML representation. The data are then forwarded to gDTS for transformation and fed to the index.
  • Intermediate data are not stored anywhere. The process will be triggered during indexing as a program with a unique URI, and a manager will take care of the whole process, from data retrieval to feeding.
  • Example:
    • RDBMS data are represented as XML, further transformed into rowsets, and fed to the index.
  • There is no incremental update; the index is swapped at the moment. Incremental update will be investigated in the future.
  • Credentials for all sources are stored on the IS.
  • Plugins will act as data providers, external to the sources (inside or outside the infrastructure), but must be able to access them.
  • Begin with RDBMS data and expand to GIS data, time series, etc. GeoNetwork can be accessed directly; time series are to be exploited by converting them.

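The ingestion flow listed above can be sketched as a chain of generators, so that no intermediate data are stored. This is only an illustration under stated assumptions: `sqlite3` stands in for an arbitrary RDBMS, `transform` is a placeholder for the gDTS step (the real rowset format is not described here), and the `Index` class with its `rebuild` method is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rdbms_plugin(conn, table):
    """Yield one XML element per row of a table (table name must be trusted)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    for row in cur:
        rec = ET.Element("record")
        for col, val in zip(columns, row):
            ET.SubElement(rec, col).text = str(val)
        yield rec


def transform(records):
    """Placeholder for gDTS: flatten each XML record into a dict 'rowset'."""
    for rec in records:
        yield {child.tag: child.text for child in rec}


class Index:
    """Hypothetical index that is rebuilt in full and then swapped."""

    def __init__(self):
        self.live = []

    def rebuild(self, rowsets):
        fresh = list(rowsets)  # build the new index completely first
        self.live = fresh      # then swap it in; no incremental update
```

A manager would wire the steps together, for example `index.rebuild(transform(rdbms_plugin(conn, "species")))`, where `species` is an illustrative table name.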
Other topics

To be discussed in the next TCOM:

  • attach scope notation to the granularity of collection
  • possible extensions of SmartGears