Difference between revisions of "Procedure Infrastructure Monitoring"

From D4Science Wiki

Revision as of 15:37, 17 July 2013

The monitoring of the D4Science Ecosystem is carried out by Infrastructure Managers, Site Managers, VRE Managers, VO Admins, and Data Managers. This activity is performed on a regular basis using the different tools provided to monitor the status of gCube, gLite, Hadoop and Runtime Resources (see below).

If a new problem is identified, an incident should be reported immediately, following the Incident Management procedure.


gCube Resources

The monitoring of the gCube Resources of the infrastructure is based on several systems:

  • IS Monitoring: Based on information published in the gCube Information System. This information is accessible from:
  • Messaging System: Based on the information published by probes local to each node. This information is used to send emails to Site Managers when problems are found.
  • Nagios: Based on the information gathered by Nagios about the availability of each gHN. In case of problems, Nagios notifies the Infrastructure Managers by mail. The iMarine Data e-Infrastructure Nagios server is available at https://nagios.d4science.org:8443/nagios/
  • Ganglia: Based on the information gathered by a Ganglia server that contacts a series of agents deployed on the infrastructure nodes. The iMarine Data e-Infrastructure Ganglia server is available at https://imarine1.cern.ch/ganglia/
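
As an illustration of the Nagios item above, an availability probe can be written as a small plugin that follows the standard Nagios convention: one line of status text plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is only an assumption about how such a probe might look; it is not the check actually deployed on the infrastructure, and the function names check_tcp and main, as well as the idea of probing a single TCP port on the gHN, are hypothetical.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, message) for a TCP reachability probe.

    Hypothetical helper: it only verifies that a TCP connection can be
    opened, which is a coarse stand-in for "the gHN is available".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is reachable"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution errors.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """Plugin entry point: prints one status line and returns the exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_ghn.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - invalid port: {argv[2]}")
        return UNKNOWN
    code, message = check_tcp(argv[1], port)
    print(message)
    return code
```

To use it as a plugin, the script would end with sys.exit(main(sys.argv)) so that Nagios receives the exit code and can decide whether to notify the Infrastructure Managers.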

UMD Resources

There are several tools to monitor the EGI production infrastructure resources and, consequently, the UMD resources. Many of these tools share the same information source and only provide different views over it. Together, they cover a wide range of monitoring needs.

The table below provides direct links to the status of the different UMD sites of the infrastructure provided by iMarine members:

Site Service Availability
CNR gocdb gstat MyEGI Service Availability
NKUA gocdb gstat MyEGI Service Availability

Hadoop Resources

The Hadoop clusters are monitored through the Hadoop internal monitoring and tracking systems. These tools provide monitoring for MapReduce jobs and for HDFS filesystems.

        MapReduce      HDFS
CNR     Job Tracker    DFS Health

Runtime Resources

The monitoring of the Runtime Resources of the infrastructure is based on two systems:

  • IS Monitoring: Based on information published in the gCube Information System. This information is accessible from:
  • Nagios: Based on the information gathered by Nagios about the availability of each Runtime Resource. For some of the Runtime Resources, additional checks are going to be instrumented in Nagios (e.g. MySQL or PostgreSQL database sizes, or index usage). In case of problems, Nagios notifies the Infrastructure Managers by mail. The iMarine Data e-Infrastructure Nagios server is available at Nagios Server
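
The additional checks mentioned above (for example on database sizes) usually come down to comparing a measured value against warning and critical thresholds. The following sketch shows that comparison in the Nagios plugin style, including the optional performance data after the "|" separator. It is an assumption for illustration only: the function name evaluate_size and the thresholds are hypothetical, and obtaining the actual size (for instance with a SQL query against the database) is deliberately left out.

```python
# Standard Nagios plugin exit codes used by this sketch.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_size(size_bytes, warn_bytes, crit_bytes):
    """Map a database size to a Nagios (exit_code, status_line) pair.

    Hypothetical helper: the caller is assumed to have measured
    size_bytes already; thresholds are inclusive lower bounds.
    """
    if size_bytes >= crit_bytes:
        status, label = CRITICAL, "CRITICAL"
    elif size_bytes >= warn_bytes:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # Text after "|" is Nagios performance data: label=value[UOM];warn;crit
    line = (
        f"{label} - database size {size_bytes} B"
        f" | size={size_bytes}B;{warn_bytes};{crit_bytes}"
    )
    return status, line
```

A wrapper script would print the returned line and exit with the returned code, which is what lets Nagios raise the mail notification described above.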