R algorithm integration with Statistical Manager
Hypothesis and Thesis
This experiment was performed by FAO to test and assess how easily data managers and developers can plug algorithms (especially R algorithms) into the infrastructure through the Statistical Manager tool, and respond quickly to data analysis needs while benefiting from iMarine computing resources.
The product of this experiment is a basic service that converts an SDMX dataset, provided through an SDMX service URL, to the CSV format. Other similar experiments will handle more complex algorithms and will complement the outcomes of this first basic experiment.
The scope of these algorithm integration experiments is:
- developer/algorithm integrator oriented:
  - to assess how a data manager or developer can plug in an algorithm on their own
  - to identify potential improvements to the ease, speed and sustainability of the R algorithm integration procedure
- end-user oriented:
  - to assess the user-friendliness of the Statistical Manager data analysis tool
Outcome
The results of this experiment show that the procedure for integrating R scripts as data analysis algorithms is quick, straightforward and sustainable, and is worth considering by institutions that wish to plug in data analysis algorithms.
From a developer/algorithm integrator point of view, the benefits are the following:
- The e-infrastructure, by means of the Statistical Manager, provides a fast, straightforward and sustainable algorithm integration procedure, highly recommended for institutions.
- In terms of software tools and programming languages, some basic knowledge is required:
  - Very basic knowledge of Java programming is required: the Statistical Manager is Java-based, and each algorithm has to be handled in a Java class that implements a generic algorithm interface. However, the Statistical Manager is designed in such a way that the implementation is very straightforward and becoming familiar with it is very quick.
  - Knowledge of an IDE (e.g. Eclipse) and SVN is recommended.
  - Additional knowledge of Maven is recommended. It is optional, and required only if data managers intend to build a separate Java project to deliver the algorithms (as done in this exercise).
- Through this procedure, the e-infrastructure offers institutions, especially research institutions, a powerful tool to expose R scripts (often scattered among offices and laboratories) as web services and to benefit from the e-infrastructure computing resources.
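The Java wrapping described above could look roughly like the following sketch. It is not the Statistical Manager API: the `Algorithm` interface, the `SdmxToCsvAlgorithm` class, the `sdmxUrl` parameter and the script file name are all hypothetical names chosen for illustration, and the real interface should be taken from the Statistical Manager documentation. The sketch also assumes that `Rscript` is available on the PATH and that the R script reads its arguments with `commandArgs(trailingOnly = TRUE)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract, for illustration only. The real Statistical Manager
// interface has its own name and methods.
interface Algorithm {
    String getName();

    // Declared input parameters, mapped to their default values.
    Map<String, String> getInputs();

    // Runs the algorithm and returns the exit code of the underlying process.
    int compute(Map<String, String> inputs) throws Exception;
}

// Wraps an R script: each declared input is passed to Rscript as "--name=value".
class SdmxToCsvAlgorithm implements Algorithm {
    private final String scriptPath;

    SdmxToCsvAlgorithm(String scriptPath) {
        this.scriptPath = scriptPath;
    }

    @Override
    public String getName() {
        return "SDMX_TO_CSV";
    }

    @Override
    public Map<String, String> getInputs() {
        Map<String, String> inputs = new LinkedHashMap<>();
        inputs.put("sdmxUrl", ""); // assumed parameter name, no default value
        return inputs;
    }

    // Builds the command line; inputs that were not declared are ignored.
    List<String> buildCommand(Map<String, String> inputs) {
        List<String> command = new ArrayList<>();
        command.add("Rscript");
        command.add(scriptPath);
        for (Map.Entry<String, String> declared : getInputs().entrySet()) {
            String value = inputs.getOrDefault(declared.getKey(), declared.getValue());
            command.add("--" + declared.getKey() + "=" + value);
        }
        return command;
    }

    @Override
    public int compute(Map<String, String> inputs) throws Exception {
        Process process = new ProcessBuilder(buildCommand(inputs)).inheritIO().start();
        return process.waitFor();
    }
}

class AlgorithmDemo {
    public static void main(String[] args) {
        Map<String, String> inputs = new HashMap<>();
        inputs.put("sdmxUrl", "http://example.org/sdmx/data"); // placeholder URL
        // Only prints the command, so the demo runs without R installed.
        System.out.println(new SdmxToCsvAlgorithm("sdmx2csv.R").buildCommand(inputs));
    }
}
```

Keeping the command construction in its own method makes the wrapper testable without an R installation; only `compute` actually launches the script.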
From a user point of view, the benefits are the following:
- The algorithms are exposed in a rich and ergonomic web interface, where users can access three components: (1) a dataspace, where they can upload their data, (2) an environment to execute the algorithms they want, and (3) a place where they can check the computations.
- The tool allows non-R users to execute an R script in a user-friendly way.
- The tool allows users to access the history of the computations they have performed and the computation characteristics (status complete/failed, elapsed time, etc.), and offers the possibility to relaunch a computation.
- In addition, each computation performed by the user comes with a set of metadata.
Activity Workflow
- The activity started with familiarization with the Statistical Manager, relying on both the documentation and a tutorial video made available to facilitate the integration of algorithms.
- A basic R script was created to test the Statistical Manager. This script converts an SDMX-ML dataset to CSV.
- In order to integrate the R script, a separate Java Maven project was created (with the aim of adding further algorithms later).
- A few exchanges with the Statistical Manager developers were required for the project settings, and highlighted that the documentation is scattered in a few places.
- The R script was integrated in the project, tested and sent to the Statistical Manager team for deployment.
- Additional exchanges with the team took place to clarify:
  - algorithm inputs (the difference between a File input and a remote resource, i.e. URL, input)
  - the need for data managers to indicate any R package dependencies to install prior to the algorithm deployment
  - how to add the algorithm to a given category of algorithms (for display purposes in the Statistical Manager user interface)
- The algorithm was successfully deployed and is currently operational in the development portal, and usable in the rich user interface of the Statistical Manager.
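To make the conversion performed by the test algorithm concrete, the sketch below shows the same idea in Java rather than R. It is not the script used in the experiment. It assumes a compact or structure-specific style SDMX-ML layout in which `Series` and `Obs` elements carry their values as XML attributes; generic SDMX-ML, where values sit in child elements, is not handled. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class SdmxCsvSketch {

    // Flattens every Obs element into one CSV row: series attributes + observation attributes.
    static String toCsv(InputStream sdmx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // lets us match elements whatever their prefix
        Document doc = factory.newDocumentBuilder().parse(sdmx);

        List<Map<String, String>> rows = new ArrayList<>();
        Set<String> columns = new TreeSet<>(); // sorted, so the column order is deterministic

        NodeList seriesList = doc.getElementsByTagNameNS("*", "Series");
        for (int i = 0; i < seriesList.getLength(); i++) {
            Element series = (Element) seriesList.item(i);
            Map<String, String> seriesAttributes = attributesOf(series);
            NodeList obsList = series.getElementsByTagNameNS("*", "Obs");
            for (int j = 0; j < obsList.getLength(); j++) {
                Map<String, String> row = new LinkedHashMap<>(seriesAttributes);
                row.putAll(attributesOf((Element) obsList.item(j)));
                columns.addAll(row.keySet());
                rows.add(row);
            }
        }

        StringBuilder csv = new StringBuilder(String.join(",", columns)).append("\n");
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                cells.add(quote(row.getOrDefault(column, "")));
            }
            csv.append(String.join(",", cells)).append("\n");
        }
        return csv.toString();
    }

    // Collects the attributes of an element, skipping namespace declarations.
    static Map<String, String> attributesOf(Element element) {
        Map<String, String> result = new LinkedHashMap<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            if (attribute.getNodeName().startsWith("xmlns")) {
                continue;
            }
            result.put(attribute.getLocalName(), attribute.getNodeValue());
        }
        return result;
    }

    // Quotes a cell only when it contains a comma, a double quote or a line break.
    static String quote(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DataSet><Series FREQ=\"A\" AREA=\"FR\">"
                + "<Obs TIME_PERIOD=\"2010\" OBS_VALUE=\"12.5\"/>"
                + "<Obs TIME_PERIOD=\"2011\" OBS_VALUE=\"13\"/>"
                + "</Series></DataSet>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.print(toCsv(in));
    }
}
```

Because the converter only needs an `InputStream`, a file input and a remote URL input differ only in how the stream is opened: `new FileInputStream(path)` for an uploaded file, or `new URL(serviceUrl).openStream()` for an SDMX service URL.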
Conclusion
TBD
Recommendations & future developments
TBD
Experimentation
TBD