09.01.2014 Data Ingestion and Publication


Agenda:

  • Discussion of the data publication and ingestion facilities planned by NKUA.

Google Hangout

Participants:

  • A. Antoniadis (NKUA), L. Candela (CNR), J. Gerbesiotis (NKUA), G. Kakaletris (NKUA)

Data Publication

Description of first implementation:

  • OAI-PMH publishing is based on data returned by collection browsing (testing approach), via an ASL component.

Planned implementation approach:

  • OAI-PMH "resources" (i.e. datasets) are mapped to search queries that are served by the OAI-PMH protocol provider at ASL level.
    • these "resources" will correspond to OAI-PMH sets;
  • Search is used so that all available indexed sources can be exposed in a combined manner, rather than being restricted to a single collection; this provides a different way of delivering data instead of merely presenting them (as mentioned, the browsing capability of search is currently being used).
    • Query construction will also be automated by exploiting the browsing capability, which likewise drives the mapping of sets to queries.
  • An end-point providing information on all data sets exposed by a scope is provided as an ASL HTTP component.
  • Metadata are mapped into DC format appropriate for OAI-PMH publishing:
    • A custom schema is currently being used; the intent is to XSLT-transform each hosted schema into the DC schema in order to achieve OAI-PMH compliance (a transformation sketch follows this list).
    • Transformations of hosted schemas into DC-Lite are expected to be provided as XSLTs at service configuration time.
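
As an illustration of the XSLT-based mapping described above, a minimal sketch using the standard javax.xml.transform API; the stylesheet name, the sample record, and the class name are assumptions, since the actual transformations will be supplied at service configuration time.

  import java.io.StringReader;
  import java.io.StringWriter;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  /**
   * Applies a per-schema XSLT to turn a hosted metadata record into a DC record
   * suitable for OAI-PMH publishing. Stylesheet and record are illustrative only.
   */
  public class DcTransformerSketch {

      private final Transformer transformer;

      public DcTransformerSketch(StreamSource xsltForHostedSchema) throws Exception {
          // One stylesheet per hosted schema, supplied at service configuration time.
          this.transformer = TransformerFactory.newInstance().newTransformer(xsltForHostedSchema);
      }

      /** Transforms a single metadata record (XML string) into its DC representation. */
      public String toDublinCore(String hostedRecordXml) throws Exception {
          StringWriter dc = new StringWriter();
          transformer.transform(new StreamSource(new StringReader(hostedRecordXml)),
                                new StreamResult(dc));
          return dc.toString();
      }

      public static void main(String[] args) throws Exception {
          // Hypothetical stylesheet mapping a custom schema to DC.
          DcTransformerSketch t = new DcTransformerSketch(new StreamSource("custom-to-dc.xsl"));
          System.out.println(t.toDublinCore("<record><title>Example dataset</title></record>"));
      }
  }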

Notes:

  • Incremental OAI-PMH publishing cannot be supported.

Suggestions (by Leonardo C.):

  • Provide a UI (portlet) for configuring the service (OAI-PMH resource name, search query definition, transformation to DC-Lite); a sketch of such a configuration follows this list.
  • Use an approach similar to the simplified field mapping of index resources for creating the DC-Lite schema required by the OAI-PMH protocol. Additionally, use the "common presentation fields" for deriving the DC-Lite record, so that it is homogeneous across all schemas.
  • We should follow the general approach of supporting as many standards as possible so as to maximize adoption; OAI-ORE, however, is not a priority.
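
Purely as an illustration of what such a per-set configuration might hold (set name, the search query it maps to, and a reference to the DC-Lite XSLT), a hypothetical value class; it is not an existing gCube type and all names are assumptions.

  /**
   * Hypothetical holder for what the configuration portlet would edit per OAI-PMH set:
   * the set name, the search query it maps to, and the XSLT used to derive the
   * DC-Lite record. Not an existing gCube class; names are assumptions.
   */
  public class OaiSetConfiguration {

      private final String setName;         // OAI-PMH set / "resource" name
      private final String searchQuery;     // query served through the ASL search facilities
      private final String dcLiteXsltRef;   // reference to the XSLT producing the DC-Lite record

      public OaiSetConfiguration(String setName, String searchQuery, String dcLiteXsltRef) {
          this.setName = setName;
          this.searchQuery = searchQuery;
          this.dcLiteXsltRef = dcLiteXsltRef;
      }

      public String getSetName()       { return setName; }
      public String getSearchQuery()   { return searchQuery; }
      public String getDcLiteXsltRef() { return dcLiteXsltRef; }
  }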

Decisions / Concerns / Highlights:

  • A configuration UI must be provided alongside the service components.
  • The "search query" approach will be followed as it allows more complex views of the harvested / hosted metadata to be served by the system.
  • The only blocking issue is that the "presentation" fields may be too few to yield a valuable DC-Lite record.
  • OAI-ORE is not a major priority, so it will be considered after OAI-PMH is completed.
  • Different sets to be published must be identified.
  • The implementation phase will run at least until March, or sooner if possible, depending on other concurrent activities.

Data Ingestion

Objective

  • Render Information Retrieval "systems" capable of serving various "sources".

Approach:

  • A hierarchical manifestation of "searchable" objects is assumed to be adequate for IR needs; a similar assumption underlies the TreeManager design.
  • A number of adapters retrieve data from the sources, providing the (meta)data in an XML representation. Each adapter instance has its own configuration for mapping the underlying data model into XML.
    • E.g. the SQL2XML adapter configuration specifies:
      • the connection string;
      • how major entities (each consisting of several tables of a relational model) are mapped into XML-representable objects.
  • A "transformation" exists for each adapter+configuration set: it translates the XML into a flat (currently) rowset.
    • Feeding of more complex elements is currently being evaluated.
    • The two configurations (adapter vs transformation) are kept separate because they might reside on different infrastructures, and because this allows for a greater separation of concerns.
  • A "mediating" component is responsible for getting data from the adapter into the transformation program.
    • It provides a "trigger" through which a source can be requested to be "re-harvested". The process will be triggered during indexing via an endpoint with a unique URI, and the manager will take care of the whole process, from data retrieval to feeding.
    • Periodic, autonomously launched updates can also be provided.
    • Intermediate data are not stored anywhere.

Example: RDBMS data are represented as XML (adapter), further transformed into rowsets (transformation program), and fed into the index (the sink of the gDTS workflow); a sketch of this pipeline follows.
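
A minimal sketch of that pipeline, assuming a plain JDBC source: an adapter step that renders relational rows as XML records and a transformation step that flattens each record into a rowset line. The class names, connection string, query, and flattening logic are all illustrative assumptions, not the actual gCube/gDTS components.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.ResultSetMetaData;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;

  /**
   * Illustrative SQL-to-XML adapter plus rowset transformation: the adapter exposes
   * relational data as XML records, the transformation flattens each record into the
   * row that would be fed to the index. All names and the flattening logic are assumptions.
   */
  public class Sql2XmlAdapterSketch {

      /** Adapter step: serialise each row returned by a query as a simple XML record. */
      static List<String> rowsAsXml(String connectionString, String query) throws Exception {
          List<String> records = new ArrayList<>();
          try (Connection c = DriverManager.getConnection(connectionString);
               Statement s = c.createStatement();
               ResultSet rs = s.executeQuery(query)) {
              ResultSetMetaData md = rs.getMetaData();
              while (rs.next()) {
                  StringBuilder xml = new StringBuilder("<record>");
                  for (int i = 1; i <= md.getColumnCount(); i++) {
                      String field = md.getColumnLabel(i);
                      xml.append('<').append(field).append('>')
                         .append(rs.getString(i))          // escaping omitted in this sketch
                         .append("</").append(field).append('>');
                  }
                  records.add(xml.append("</record>").toString());
              }
          }
          return records;
      }

      /** Transformation step: flatten an XML record into a delimited row (placeholder logic). */
      static String toRow(String recordXml) {
          // The real transformation would be configured per adapter (e.g. as an XSLT).
          return recordXml.replaceAll("<[^>]+>", "|").replaceAll("\\|+", "|");
      }

      public static void main(String[] args) throws Exception {
          // Hypothetical connection string and query; the real ones belong to the adapter configuration.
          for (String record : rowsAsXml("jdbc:postgresql://host/db?user=u&password=p",
                                         "SELECT id, title FROM dataset")) {
              System.out.println(toRow(record)); // would be handed to the index sink
          }
      }
  }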

Notes:

  • The initial approach supports re-feeding of entire "segments" of the index (which are mapped to collections): search requests are served by the "old" index state while the new state is created during an update, and the switch is done at the end of the feed (see the sketch after these notes).
  • Adapters act as data providers external to the sources (within or outside the infrastructure), but they must be able to access them.
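
A sketch of the switch-on-completion behaviour described in the first note, with an in-memory stand-in for an index segment; the names are assumptions. Searches keep being served by the currently published state, and the reference is swapped only once the re-feed has produced the complete new segment.

  import java.util.List;
  import java.util.concurrent.atomic.AtomicReference;
  import java.util.function.Function;

  /**
   * Illustrative "old state keeps serving while the new state is fed" pattern.
   * IndexSegment is a stand-in for the real per-collection index segment.
   */
  public class SegmentSwapSketch {

      /** Minimal stand-in for an index segment mapped to one collection. */
      interface IndexSegment {
          List<String> search(String query);
      }

      private final AtomicReference<IndexSegment> current = new AtomicReference<>();

      public SegmentSwapSketch(IndexSegment initial) {
          current.set(initial);
      }

      /** Search requests are always served by the currently published state. */
      public List<String> search(String query) {
          return current.get().search(query);
      }

      /** A re-feed builds a complete new segment; the switch happens only at the end. */
      public void refeed(Iterable<String> rows, Function<Iterable<String>, IndexSegment> builder) {
          IndexSegment fresh = builder.apply(rows); // may take long; the old segment keeps serving
          current.set(fresh);                       // atomic switch once feeding is done
      }
  }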


Additional info (not mentioned during the meeting)

  • A direct push interface is also under consideration (e.g. directly feeding "some" entries to the index), though issues exist with deletes and modifications.
  • An additional API which polls the adapters for any significant changes is considered.
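
A hypothetical shape for such a change-polling API; the interface and its methods are assumptions made only to illustrate the idea, not an agreed design.

  /**
   * Hypothetical polling contract: the mediating component periodically asks each
   * adapter whether its source has changed significantly since the last harvest and
   * triggers a re-feed only when it has. Not an agreed interface; names are assumptions.
   */
  public interface ChangeAwareAdapter {

      /** Opaque marker of the source state at the last harvest (e.g. a timestamp or checksum). */
      String lastHarvestMarker();

      /** True if the underlying source has changed enough to justify re-harvesting. */
      boolean hasSignificantChangesSince(String marker);

      /** Streams the current content of the source as XML records. */
      Iterable<String> harvest();
  }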

Suggestions/Observations (by Leonardo C.)

  • Exploit the external database definitions currently available in the IS (where passwords are encrypted), so as to avoid the wide diffusion of sensitive information; a sketch of this indirection follows the list.
  • Homogenize the terminology across the system: there are currently "data sources" in Search, indexed data (again, sources of the TM), and now the "adapter-based" sources (providers). Additionally, a manager should manage all of these via a homogenizing UI / tool.
  • Geonetwork data are already exposed as OAI-PMH data, so there is no need to move immediately to geo-data; timeseries, on the other hand, are not provided through any similar approach.
  • The challenge will be put on data managers to express their own model in the hierarchical model of this approach.
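
A sketch of the first suggestion, with a deliberately hypothetical InformationSystemClient standing in for the IS lookup facilities: the adapter configuration keeps only a resource identifier, and connection details (including decrypted credentials) are resolved from the IS at harvest time rather than copied into the adapter's own configuration.

  /**
   * Illustrative indirection: the adapter configuration stores only the IS resource id
   * of an external database; connection details are resolved from the IS when needed.
   * InformationSystemClient and its method are hypothetical.
   */
  public class IsBackedDataSource {

      /** Hypothetical facade over the IS lookup of external database definitions. */
      interface InformationSystemClient {
          /** Returns a ready-to-use JDBC connection string, with credentials decrypted by the IS layer. */
          String resolveConnectionString(String externalDatabaseResourceId);
      }

      private final InformationSystemClient is;
      private final String resourceId; // the only thing kept in the adapter configuration

      public IsBackedDataSource(InformationSystemClient is, String resourceId) {
          this.is = is;
          this.resourceId = resourceId;
      }

      public String connectionString() {
          // Sensitive information stays in the IS; nothing is duplicated in the adapter configuration.
          return is.resolveConnectionString(resourceId);
      }
  }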

Observations/Concerns/Decisions

  • Adapters might not be always deployed inside gCube boundaries and as such they might be configured independently of gCube/IR systems.
  • Adapter configurations (when hosted in D4Science) will point to the IS resources where the sensitive information resides, so that sensitive information is not duplicated; the configurations themselves will also be published resources (for any possible reuse).
  • Roadmap for adapters: begin with RDBMS data, then move to timeseries, then to GIS data.
  • Present all details during the TCom. There might be a need to extend SmartGears or other work so that this is kept as generic and flexible as desired.

Other topics

To be discussed at the next TCom:

  • attaching the scope notation at the granularity of collections
  • possible extensions of SmartGears