09.01.2014 Data Ingestion and Publication

From D4Science Wiki
Revision as of 18:16, 9 January 2014 by John.gerbesiotis


Data Publication

  • Description of our first approach:
    • Data returned by the search system is served over the OAI-PMH protocol at the ASL level.
  • Search is used to retrieve all available indexed sources, so publication does not depend on a single source. Search is used here as a way of providing data rather than presenting it. The browsing capability of search is exploited.
  • A custom schema is currently used. We intend to transform each source schema to the dc schema with XSLT, so that OAI-PMH is supported with the dc schema (at least a dc subset).
  • Steps to be fulfilled
  1. Identify the different sets. Much work remains to be done here.
  2. Query construction. Queries will also be constructed automatically by exploiting the browsing capability. Sets must also be mapped to queries.
  3. Mappings of schemas to dc. Mappings will be uploaded as files.
  • The implementation phase will run at least until March, sooner if possible, because no other OAI-PMH data publication is currently available, unlike before.
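
The set-to-query mapping and the schema-to-dc mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the set names, the query strings and the custom field names are hypothetical, and a plain mapping dictionary stands in for the planned XSLT transformation (the Python standard library has no XSLT processor). Only the two namespace URIs are the real ones defined for oai_dc and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Real namespaces used by the OAI-PMH oai_dc metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping of OAI-PMH sets to search queries (step 2).
SET_QUERIES = {
    "species": 'collection = "species"',
    "timeseries": 'collection = "timeseries"',
}

# Hypothetical mapping of custom-schema fields to dc elements (step 3).
# In the planned design this would be an uploaded mapping file driving XSLT.
FIELD_TO_DC = {
    "id": "identifier",
    "name": "title",
    "author": "creator",
    "summary": "description",
}


def query_for_set(set_spec):
    """Return the search query behind a set; raises KeyError if unknown."""
    return SET_QUERIES[set_spec]


def to_oai_dc(record):
    """Convert one custom-schema record (a dict) to an oai_dc element."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, value in record.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue  # unmapped fields are dropped: only a dc subset is served
        ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = str(value)
    return root


if __name__ == "__main__":
    sample = {"id": "rec-1", "name": "Sample record", "internal_flag": True}
    print(query_for_set("species"))
    print(ET.tostring(to_oai_dc(sample), encoding="unicode"))
```

Keeping the mappings as data rather than code mirrors the idea of uploading mapping files: a new source schema would then need a new mapping, not a new implementation.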

Data Ingestion

  • Description:
    • The intention is to be able to index various data sources. A number of plugins will be implemented that retrieve data from the sources and provide it in an XML representation. The data is then forwarded to gDTS for transformation and fed to the index.
  • Intermediate data is not stored anywhere. The process will be triggered during indexing as a program with a unique URI, and a manager will take care of the whole process, from data retrieval to feeding.
  • Example:
    • RDBMS data is represented as XML, further transformed into rowsets, and fed to the index.
  • No incremental update: the index is swapped at the moment. Incremental updates will be investigated in the future.
  • Credentials for all sources are stored on the IS.
  • Plugins will act as data providers, external to the sources (inside or outside the infrastructure), but must be able to access them.
  • Begin with RDBMS data and expand to GIS data, time series, etc. GeoNetwork can be accessed directly; time series are to be exploited (converted).
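
The ingestion flow described above (plugin, XML representation, transformation to rowsets, index feed, no stored intermediates) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the XML and rowset layouts, and the use of an in-memory list as the "index" are hypothetical, and the transformation function is only a stand-in for gDTS. Generators are used so that records stream through the stages without being stored in between.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rdbms_plugin(conn, table):
    """Hypothetical source plugin: yield each row of `table` as an XML record.

    `table` is assumed to be a trusted name, since it is placed in the SQL text.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [desc[0] for desc in cursor.description]
    for row in cursor:
        record = ET.Element("record")
        for column, value in zip(columns, row):
            field = ET.SubElement(record, "field", name=column)
            field.text = "" if value is None else str(value)
        yield record


def to_rowsets(records, collection):
    """Stand-in for the gDTS step: turn each XML record into a rowset."""
    for record in records:
        rowset = ET.Element("rowset", collection=collection)
        row = ET.SubElement(rowset, "row")
        for field in record:
            ET.SubElement(row, field.get("name")).text = field.text
        yield rowset


def feed_index(rowsets, index):
    """Stand-in for the index feed: append serialized rowsets to a list."""
    for rowset in rowsets:
        index.append(ET.tostring(rowset, encoding="unicode"))


def run_ingestion(conn, table, collection):
    """Manager: chain retrieval, transformation and feeding in one pass."""
    new_index = []
    feed_index(to_rowsets(rdbms_plugin(conn, table), collection), new_index)
    return new_index  # no incremental update: the caller swaps in the new index


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE species (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO species VALUES (1, 'Thunnus')")
    for entry in run_ingestion(conn, "species", "demo"):
        print(entry)
```

A plugin for another source type (GIS data, time series) would only need to replace `rdbms_plugin`, as long as it yields records in the same XML representation.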

Other topics

To be discussed in the next TCOM:

  • attach scope notation to the granularity of collection
  • possible extensions of SmartGears