09.01.2014 Data Ingestion and Publication
From D4Science Wiki
Revision as of 17:42, 9 January 2014 by George.kakaletris (Talk | contribs)
Data Publication
Description of first implementation:
- OAI-PMH publishing is based on data returned by collection browsing (testing approach), via an ASL component.
Planned approach implementation:
- OAI-PMH "resources" (i.e. datasets) are mapped to search queries that are served by the OAI-PMH protocol provider at ASL level.
- Search is used so that all available indexed sources can be exposed together, without being restricted to a single source; it serves as a way of providing data rather than presenting them. (As mentioned, the browsing capability of search is currently being used.)
- Queries will also be constructed automatically by exploiting the browsing capability, and OAI-PMH sets will be mapped to queries.
- An endpoint providing information on all data sets exposed by a scope is provided as an ASL HTTP component.
- Metadata are mapped into DC format appropriate for OAI-PMH publishing:
- A custom schema is currently being used; the intention is to transform each different schema to the DC schema via XSLT, in order to achieve OAI-PMH compliance.
- The current thinking is that transformations of hosted schemas into DC-Lite will be provided as XSLTs when the service is configured.
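The mapping of OAI-PMH sets to search queries described in the planned approach could be sketched as follows. This is only an illustration under stated assumptions: the set names, the query strings, and the functions `query_for_set` and `list_sets_xml` are hypothetical and do not reflect the actual ASL component or the gCube search syntax. Only the OAI-PMH namespace and the `ListSets` / `set` / `setSpec` / `setName` element names come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical configuration: OAI-PMH setSpec -> (setName, search query).
# The query strings are placeholders, not actual gCube search syntax.
SET_TO_QUERY = {
    "fisheries": ("Fisheries records", 'subject = "fisheries"'),
    "biodiversity": ("Biodiversity records", 'subject = "biodiversity"'),
}


def query_for_set(set_spec):
    """Return the search query that backs the given OAI-PMH set."""
    try:
        return SET_TO_QUERY[set_spec][1]
    except KeyError:
        raise ValueError(f"unknown set: {set_spec}") from None


def list_sets_xml():
    """Build the <ListSets> element describing every configured set."""
    ET.register_namespace("", OAI_NS)
    list_sets = ET.Element(f"{{{OAI_NS}}}ListSets")
    for spec, (name, _query) in sorted(SET_TO_QUERY.items()):
        set_el = ET.SubElement(list_sets, f"{{{OAI_NS}}}set")
        ET.SubElement(set_el, f"{{{OAI_NS}}}setSpec").text = spec
        ET.SubElement(set_el, f"{{{OAI_NS}}}setName").text = name
    return ET.tostring(list_sets, encoding="unicode")
```

A harvester request for a given set would then be answered by running the query returned by `query_for_set` against the search system and converting each hit to a DC record.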
Notes:
- Incremental OAI-PMH publishing cannot be supported.
Suggestions (by Leonardo C.):
- Provide a UI (portlet) for configuring the service (OAI-PMH resource name, search query definition, transformation to DC-Lite).
- Use an approach similar to simplified field mapping index resources for creating the DC-Lite schema required by the OAI-PMH protocol. Additionally, use the "common presentation fields" for deriving the DC-Lite record, so that it is homogeneous across all schemas.
- We should follow the general approach of having as many standards as possible supported so that we can maximize adoption, yet OAI-ORE is not a priority.
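The suggestion to derive the DC-Lite record from the "common presentation fields" could look roughly like the sketch below. It uses a plain field mapping in Python rather than XSLT, and the presentation field names (`title`, `author`, `abstract`, `date`) as well as the function `to_oai_dc` are assumptions made for illustration. Only the `oai_dc` and Dublin Core namespaces are taken from the respective standards.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from common presentation fields to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "date": "date",
}


def to_oai_dc(record):
    """Convert a dict of presentation fields into an oai_dc XML string.

    Fields without a mapping, or with empty values, are skipped.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = record.get(field)
        if value:
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because the same mapping is applied to every source schema, the resulting records are homogeneous; a record that carries only a few presentation fields produces a correspondingly sparse DC record.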
Decisions / Concerns / Highlights:
- A configuration UI must be provided alongside the service components.
- The "search query" approach will be followed as it allows more complex views of the harvested / hosted metadata to be served by the system.
- The only potential blocking issue is that the "presentation" fields may be too few to yield a valuable DC-Lite record.
- OAI-ORE is not a major priority, so it will be considered after OAI-PMH is completed.
- Different sets to be published must be identified.
- The implementation phase is expected to last until at least March, or to finish sooner if other concurrent activities allow.
Data Ingestion
- Description:
- The intention is to be capable of indexing various data sources. A number of plugins will be implemented that retrieve data from the sources and provide them in an XML representation. The data are then forwarded to gDTS for transformation and fed to the index.
- Intermediate data are not stored anywhere. The process will be triggered during indexing as a program with a unique URI, and a manager will take care of the whole process, from data retrieval to feeding.
- Example:
- RDBMS data are represented as XML, further transformed into rowsets, and fed to the index.
- No incremental update is supported; the index is swapped at the moment. Incremental updating will be investigated in the future.
- Credentials for all sources are stored on the IS.
- Plugins will act as data providers external to the sources (inside or outside the infrastructure), but must be able to access them.
- Begin with RDBMS data and expand to GIS data, time series, etc. GeoNetwork can be accessed directly; time series are to be exploited (converted).
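The ingestion flow described above (plugin retrieves RDBMS data as XML, transformation into rowsets, full index swap, no intermediate storage) can be illustrated with the minimal sketch below. Everything in it is a stand-in: `fetch_as_xml` plays the role of a plugin, `to_rowsets` stands in for the gDTS transformation, and the `Index` class is a toy replacement for the real index. SQLite is used only so that the example is self-contained, and credential retrieval from the IS is omitted.

```python
import sqlite3
import xml.etree.ElementTree as ET


def fetch_as_xml(conn, table):
    """Plugin step: read every row of a table and represent it as XML."""
    cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
    columns = [d[0] for d in cur.description]
    root = ET.Element("table", name=table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            ET.SubElement(row_el, "field", name=col).text = str(value)
    return root


def to_rowsets(table_xml):
    """Hypothetical stand-in for the gDTS transformation: XML rows -> dicts."""
    return [
        {f.get("name"): f.text for f in row.findall("field")}
        for row in table_xml.findall("row")
    ]


class Index:
    """Toy index that is rebuilt in full and swapped, never updated in place."""

    def __init__(self):
        self.docs = []

    def swap(self, rowsets):
        self.docs = list(rowsets)


def ingest(conn, table, index):
    """Manager step: retrieve, transform, and feed; intermediates stay in memory."""
    index.swap(to_rowsets(fetch_as_xml(conn, table)))
```

Calling `ingest` a second time replaces the whole index content, which mirrors the "index swap" behaviour noted above rather than an incremental update.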
Other topics
To be discussed in the next TCOM:
- Attach scope notation to the granularity of collection
- Possible extensions of SmartGears