09.01.2014 Data Ingestion and Publication
From D4Science Wiki
Revision as of 17:16, 9 January 2014 by John.gerbesiotis
Data Publication
- Description of our first approach:
- Data returned by the search system are served over the OAI-PMH protocol at the ASL level.
- Search is used to retrieve data from all available indexed sources, so publication does not depend on a single source; it is a way of providing data rather than presenting them. The browsing capability of search is exploited.
- A custom schema is currently used. The intention is to XSLT-transform each different schema to the DC schema, so that OAI-PMH is offered with DC schema support (at least a DC subset).
- Steps to be completed:
- Identify the different sets. Much work remains to be done here.
- Query construction. Queries will also be constructed automatically by exploiting the browsing capability. Sets must also be mapped to queries.
- Mappings of schemas to DC, with file upload of the mappings.
- The implementation phase will last at least until March, sooner if possible, because no other OAI-PMH data publication is currently available, unlike before.
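The schema-to-DC and set-to-query mappings listed above could look roughly like the following Python sketch. It is only an illustration: the notes plan XSLT for the transformation, and a plain dictionary mapping stands in for it here. The field names, set names, and query strings (`FIELD_TO_DC`, `SET_TO_QUERY`) are hypothetical; only the two namespace URIs come from the OAI-PMH `oai_dc` format.

```python
import xml.etree.ElementTree as ET

# Namespaces of the oai_dc metadata format used by OAI-PMH.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from one custom schema's fields to DC elements.
# In the planned design this would be an uploaded mapping (XSLT) per schema.
FIELD_TO_DC = {
    "name": "title",
    "author": "creator",
    "summary": "description",
    "created": "date",
}

# Hypothetical mapping from OAI-PMH set names to search queries.
# The query syntax is an assumption, not the real search grammar.
SET_TO_QUERY = {
    "all": "*",
    "collectionA": 'collection = "collectionA"',
}


def to_oai_dc(record, mapping=FIELD_TO_DC):
    """Build an oai_dc:dc element from a flat dict.

    Fields without a mapping, and empty values, are skipped, so the
    result may cover only a subset of DC.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, value in record.items():
        dc_name = mapping.get(field)
        if dc_name is None or value in (None, ""):
            continue
        child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
        child.text = str(value)
    return root


def query_for_set(set_spec):
    """Return the search query behind a set; raises KeyError if unknown."""
    return SET_TO_QUERY[set_spec]
```

A record such as `{"name": "Example title", "author": "Example author", "internal_id": 7}` would yield a `dc` element with `title` and `creator` children, while `internal_id` is dropped because it has no mapping.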
Data Ingestion
- Description:
- The intention is to be able to index various data sources. A number of plugins will be implemented that retrieve data from the sources and provide them in an XML representation. The data are then forwarded to gDTS for transformation and fed to the index.
- Intermediate data are not stored anywhere. The process will be triggered during indexing as a program with a unique URI, and a manager will take care of the whole process, from data retrieval to feeding.
- Example:
- RDBMS data are represented as XML, further transformed into rowsets, and fed to the index.
- No incremental update; an index swap is used at the moment. To be investigated in the future.
- Credentials for all sources are stored on the IS.
- Plugins will act as data providers, external to the sources (inside or outside the infrastructure), but must be able to access them.
- Begin with RDBMS data and expand to GIS data, time series, etc. GeoNetwork can be accessed directly; time series are to be exploited (converted).
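The RDBMS example above can be sketched as a small streaming pipeline in Python. This is a minimal sketch under several assumptions: `sqlite3` stands in for the real RDBMS, `to_rowset` stands in for the gDTS transformation (the `ROWSET`/`ROW`/`FIELD` layout is hypothetical), and `feed` is any callable representing the index feed. Generators are used so that intermediate data are never stored, as described in the notes.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rdbms_plugin(conn, table):
    """Yield one XML <record> element per row of the given table.

    The table name is interpolated directly, so it must come from
    trusted configuration; column names must be valid XML names.
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    for row in cur:
        rec = ET.Element("record")
        for col, val in zip(columns, row):
            ET.SubElement(rec, col).text = "" if val is None else str(val)
        yield rec


def to_rowset(record, key_field):
    """Stand-in for the gDTS step: wrap a record in a hypothetical rowset."""
    rowset = ET.Element("ROWSET")
    row = ET.SubElement(rowset, "ROW", {"id": record.findtext(key_field, "")})
    for child in record:
        field = ET.SubElement(row, "FIELD", {"name": child.tag})
        field.text = child.text
    return rowset


def run_ingestion(conn, table, key_field, feed):
    """Manager role: retrieve, transform, and feed, one record at a time.

    Returns the number of rowsets handed to the feed callable.
    """
    count = 0
    for record in rdbms_plugin(conn, table):
        feed(to_rowset(record, key_field))
        count += 1
    return count
```

For a quick trial, an in-memory database created with `sqlite3.connect(":memory:")` and a list's `append` method as the `feed` argument are enough to see the rowsets that would reach the index.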
Other topics
to be discussed in next TCOM:
- attach scope notation to the granularity of collection
- possible extensions of SmartGears