ICES SGVMS

Latest revision as of 17:19, 3 September 2015

=== Hypothesis and Thesis ===

The premise of this activity was a review of the ICES procedure for interpolating vessel routes. A feasibility study was produced on the basis of the 2012 report by the Study Group on VMS data (SGVMS) on vessel data analysis.

Our review is available at the following address: http://goo.gl/risQre

The scope of the SGVMS is to supply ICES expert groups with information and highlights. Interested groups work in the following fields of research: bird ecology, marine mammal ecology, spatial planning, socio-economics. The products of the SGVMS analyses include (i) spatially detailed maps of fishing effort by métier, (ii) trends in effort over time and (iii) identification of regions unimpacted by certain gears.

Starting from this point, the scope of this experiment is to show what the D4Science e-Infrastructure can add to the SGVMS procedures.

In particular, we want to answer the following questions:

* What enhancements can importing the SGVMS tools into the D4Science e-Infrastructure bring to the original process?
* What is the performance of the resulting process?

In this experiment we give an answer to the above questions.

=== Outcome ===

The results of this experiment highlight that there are advantages in integrating the SGVMS tools in the e-Infrastructure. We demonstrate this with a practical example on vessel point interpolation.

We summarize benefits in the following:

# The e-Infrastructure allows for executing processes that are highly demanding in terms of hardware resources;
# The e-Infrastructure enables multi-tenancy and synchronous interrogation of a standalone procedure containing hardcoded inputs and outputs;
# A graphical user interface is automatically generated on top of the procedure;
# The e-Infrastructure allows for executing R scripts on powerful machines;
# The scripts can potentially be fed with datasets already available in the e-Infrastructure;
# The integration allows non-R programmers to use an R script;
# The system enables automatic provenance management: the history of the experiments, the used inputs and the produced outputs are automatically recorded and stored;
# The system allows for sharing inputs, outputs and parameters in an easy way;
# Information is stored on high-availability, distributed storage systems;
# The procedure can be used by external people if the process is published under the WPS standard. Connection is also possible by means of Java thin clients.

=== Activity Workflow ===

A logbook of the activity, from the requirements to the implementation, can be found here.

=== Conclusion ===

Weak points of the sequential SGVMS solution were identified from explicit assertions in the SGVMS documents and from our own considerations. They can be summarized as follows:

# The SGVMS proposes several approaches to vessel track interpolation. Nevertheless, they state that these methods should be compared to each other and tested against a high resolution dataset. This would be useful to assess which of the methods most closely reflects reality. They expect that different methods might appear most suitable depending on gear or fleet;
# The users of their platform should be able to (i) understand the contents of the data being analyzed, (ii) work in a command-line environment, (iii) use adequate resources to ensure standardized but meaningful outputs;
# The SGVMS also reports and encourages other approaches to VMS analysis, e.g. Bayesian models to investigate fishing patterns and models to understand the effect of the resolution of VMS analysis on benthic impact assessments. On the other hand, this requires cross-domain knowledge by users;
# No mention of intersecting ecological models and VMS data is made;
# The procedures cannot manage synchronous calls by different users producing different outputs.

D4Science is endowed with a [https://wiki.gcube-system.org/gcube/index.php/How-to_Implement_Algorithms_for_the_Statistical_Manager#Integrating_R_Scripts framework] to import vessel processing scripts written in the R language. Furthermore, it accommodates the above requirements by means of input/output standardization and e-Infrastructure facilities.

These can be summarized as follows:

# separation between final users and developers of the process;
# multi-tenancy and multi-user facilities;
# resource sharing;
# input/output dataset reusability;
# applicability of models from other domains (e.g. Bayesian models);
# intersection with models developed in other domains (e.g. AquaMaps).

=== Future Development on SGVMS Tools ===

Future development on top of the integration presented here can involve the following points:

# Porting other SGVMS tools onto the e-Infrastructure;
# Using D4Science to extend the SGVMS tools, e.g. to produce FishFrame compliant documents;
# Using D4Science to interpolate large vessel tracks in practice;
# Using time series analysis tools on vessel tracks;
# Extracting backward and forward fishing indicators;
# Distributing VMS related aggregated data products through the infrastructure;
# Intersecting VMS/FishFrame data with niche models (e.g. AquaMaps) to retrieve the list of species possibly involved in catches;
# Applying clustering to vessel tracks to detect similar behaviors by vessels;
# Using Bayesian models to automatically classify fishing activity.

=== Prospects in Using D4Science for VMS Analyses ===

'''Confidentiality''': the data imported and produced by a user are available to that user only. (S)he can decide to share them with other people by means of the e-Infrastructure sharing facilities. Furthermore, R scripts are executed in process sandboxes, which prevent them from installing new software or connecting to external systems;

'''Flexibility''': the Statistical Manager platform is able to invoke R scripts even if they run on external computational systems. The Statistical Manager can harvest processes published under the [http://en.wikipedia.org/wiki/Web_Processing_Service WPS] standard. For such processes, the SM automatically produces a web interface and manages provenance;

'''Extensibility''': users can integrate their own R scripts directly on the SM by themselves, using the [https://wiki.gcube-system.org/gcube/index.php/How-to_Implement_Algorithms_for_the_Statistical_Manager#Integrating_R_Scripts Statistical Manager Java development framework];

'''Reproducibility''': the structure of our system is meant to support experiment reproducibility. The computation summary gives links to inputs, outputs and parameters. The D4Science sharing facilities allow distributing such information;

'''Reportability''': [http://wiki.gcube-system.org/gcube/index.php/Document_Workflows automatic services in the D4Science e-Infrastructure] allow producing reports on top of the produced datasets;

'''Enrichment''': [http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6607976&queryText%3Dcoro methods already developed in other projects] can be applied to correct speed and course using current and wind data, or to exclude fishing activity if wave action was too high.
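To illustrate the WPS interoperability mentioned under '''Flexibility''' above, the following Python sketch builds standard OGC WPS 1.0.0 key-value-pair request URLs. The endpoint and the process identifier are hypothetical placeholders, not the actual Statistical Manager ones; only the query-parameter shape comes from the WPS standard.

```python
from urllib.parse import urlencode

def wps_url(endpoint, request, version="1.0.0", **extra):
    """Build a KVP request URL for an OGC WPS 1.0.0 service."""
    params = {"service": "WPS", "version": version, "request": request}
    params.update(extra)  # e.g. the process identifier for DescribeProcess
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint; any WPS-compliant server exposing an
# interpolation process would be addressed the same way.
base = "https://example.org/wps"
caps = wps_url(base, "GetCapabilities")
desc = wps_url(base, "DescribeProcess", identifier="vessel_interpolation")
```

Harvesting a service then amounts to issuing GetCapabilities, listing the advertised processes, and calling DescribeProcess on each to learn its inputs and outputs.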

=== Experimentation ===

As a first step, we analyzed the VMSTools suite, published by the SGVMS on the Google Code repository. This suite involves several scripts and procedures to analyze VMS data. We extracted the VMS interpolation procedures from VMSTools. These accept a set of VMS trajectories as input and produce interpolated versions, using either Hermite cubic splines or straight-line interpolation. Inputs have to follow the TACSAT2 format. One example of an input file can be found here. A sample of interpolated output is visible here.
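As a rough illustration of what the straight-line variant does between two VMS pings, here is a minimal Python sketch on two made-up pings (the actual VMSTools functions are considerably more sophisticated, and the Hermite spline variant also uses heading and speed):

```python
from datetime import datetime, timedelta

def interpolate_straight(p0, p1, step_minutes=10):
    """Linearly interpolate positions between two VMS pings.

    p0, p1: (datetime, lat, lon) tuples. Returns the points at
    step_minutes resolution, endpoints included.
    """
    t0, lat0, lon0 = p0
    t1, lat1, lon1 = p1
    total = (t1 - t0).total_seconds()
    points = []
    t = t0
    while t <= t1:
        f = (t - t0).total_seconds() / total  # fraction of the leg covered
        points.append((t, lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)))
        t += timedelta(minutes=step_minutes)
    return points

# Two hypothetical pings two hours apart:
p0 = (datetime(2012, 5, 1, 8, 0), 54.0, 3.0)
p1 = (datetime(2012, 5, 1, 10, 0), 54.4, 3.8)
track = interpolate_straight(p0, p1, step_minutes=10)
# 2 hours at 10-minute resolution -> 13 points including both endpoints
```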

We produced a single R script including the necessary VMSTools procedures. The script can be found here. It builds on top of the interpolation and reconstruction functions in VMSTools.

==== Performance ====

As a first step, we evaluated the performance of the R script when running on one of the D4Science machines. The machine was an Intel i7-3770 CPU @ 3.4 GHz, with 16 GB RAM, running 64-bit Ubuntu 12.04. A summary of the performance is reported in a public document, available [http://goo.gl/f3Y3je here].

The memory performance is reported in the following figure, which shows the memory required by the script when run on 97,016 vessel trajectory points at 10-minute resolution. The memory required to completely run the script was more than 20 GB, which is impractical for a standard desktop machine.

[[Image:memorySGVMS.png|thumb|center|upright=4.5|Memory usage by VMSTools on 97,016 vessels points interpolations at 10 minutes resolution.]]

On the other hand, we evaluated the computation time required by the script as the number of points to interpolate varies. The next figure shows the exponential growth of the computation time.

[[Image:computationTimeSGVMS.png|thumb|center|upright=4.5|Variation in the computation time with respect to the number of vessels points to process.]]
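An exponential time curve means that each doubling of the input size multiplies the running time by a constant factor. The sketch below fits t(n) = a·exp(b·n) through two samples and extrapolates; the sample numbers are made up for illustration and are NOT the measured values from the report above.

```python
import math

def fit_exponential(n1, t1, n2, t2):
    """Fit t(n) = a * exp(b * n) through two (size, time) samples."""
    b = (math.log(t2) - math.log(t1)) / (n2 - n1)
    a = t1 / math.exp(b * n1)
    return a, b

# Hypothetical timings in minutes (illustrative only):
a, b = fit_exponential(20_000, 1.0, 40_000, 4.0)
predicted = a * math.exp(b * 80_000)  # extrapolated time for 80k points
```

With these illustrative numbers, each doubling of the point count quadruples the time, so 80,000 points would cost 64 minutes: exactly the kind of growth that makes large interpolations impractical on a single desktop machine.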

Note: '''the script can also be run with lower memory requirements, at the cost of a much longer computation time; this mode was not the focus of this analysis.'''

==== Integration Result ====

We integrated the script with the D4Science e-Infrastructure facilities. This required building an algorithm for the Statistical Manager platform. The Statistical Manager comes with a [https://wiki.gcube-system.org/gcube/index.php/How-to_Implement_Algorithms_for_the_Statistical_Manager#Integrating_R_Scripts development framework] to rapidly integrate R scripts.

The added value of this framework is that it automatically accounts for multiple user requests even if the input/output of the R script is static. This is achieved by performing on-the-fly code injection.
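The Statistical Manager's actual injection mechanism is internal to the framework; the following Python toy sketch only illustrates the general idea of rewriting an R script's hardcoded file names per request, so that concurrent users read and write in separate sandboxes. The placeholder names input.csv / output.csv and the sandbox paths are hypothetical.

```python
def inject_io_paths(r_script: str, input_path: str, output_path: str) -> str:
    """Rewrite the hardcoded input/output file names of an R script so
    that each request works inside its own sandbox directory."""
    script = r_script.replace('"input.csv"', '"%s"' % input_path)
    script = script.replace('"output.csv"', '"%s"' % output_path)
    return script

# A static R script with hardcoded file names (hypothetical example):
original = 'tacsat <- read.csv("input.csv")\nwrite.csv(result, "output.csv")\n'
per_user = inject_io_paths(original, "/sandbox/u42/in.csv", "/sandbox/u42/out.csv")
```

Because the rewriting happens per request, two users invoking the same static script concurrently no longer collide on the same files.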

The result of this activity was that the procedure was deployed on the D4Science e-Infrastructure. This automatically endowed it with a user interface, and with facilities for data sharing and provenance maintenance.

The next figure shows the interface to the procedure, automatically generated by the e-Infrastructure on the basis of the input/output specifications.

[[Image:InterfaceSGVMS.png|thumb|center|upright=4.5|Interface for the Vessels interpolation procedure. The interface was automatically produced by the Statistical Manager.]]

The Statistical Manager dataspace environment reports the used inputs and the produced outputs; the distinction between them is in the metadata, which indicate whether a dataset was "computed" or "imported".

[[Image:inputsoutputsSGVMS.png|thumb|center|upright=4.5|A summary of the TACSAT2 files produced by the computations. Note that inputs are separated from outputs by the Imported\Computed indication.]]
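The imported/computed distinction can be pictured as a simple metadata flag on each dataspace entry. The entry structure below is hypothetical, for illustration only; it is not the Statistical Manager's actual data model.

```python
# Hypothetical dataspace entries; only the "origin" metadata field
# distinguishes inputs ("imported") from outputs ("computed").
datasets = [
    {"name": "tacsat_sample.csv",       "origin": "imported"},
    {"name": "tacsat_interpolated.csv", "origin": "computed"},
]

inputs  = [d["name"] for d in datasets if d["origin"] == "imported"]
outputs = [d["name"] for d in datasets if d["origin"] == "computed"]
```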

The "Check the Computations" environment provides the history of the experiments, along with the used parameters and the produced outputs (provenance).

Report of the input parameters used in a previously run process.

[[Image:checkcomp2SGVMS.png|thumb|center|upright=4.5|Report of the output produced by a previously run process.]]

=== Related links ===

[http://wiki.gcube-system.org/gcube/index.php/Statistical_Manager_Tutorial A tutorial on the Statistical Manager]

[http://wiki.gcube-system.org/gcube/index.php/How-to_Implement_Algorithms_for_the_Statistical_Manager Implementing algorithms with the Statistical Manager Framework]

[http://wiki.gcube-system.org/gcube/index.php/How_to_Interact_with_the_Statistical_Manager_by_client Using the Statistical Manager by external thin clients]

[https://services.d4science.org/group/biodiversitylab/processing-tools The Statistical Manager interface on the D4Science Portal (Biodiversity Lab VRE)]