Difference between revisions of "Procedure Infrastructure Certification"

From D4Science Wiki
  
 
The information about the certification status is accessible through the infrastructure [http://monitor.d4science.research-infrastructures.eu Monitoring] tool. Certification incidents are managed using [https://support.d4science.research-infrastructures.eu/ Support TRAC] tickets. These tickets must be created according to the [[Procedure Infrastructure Incident Management|Incident Management]] procedure. If the ticket is not closed within 5 working days the affected node can be removed from the infrastructure. The monitoring of the gHN certification is carried out by the [[Role Infrastructure Manager|Infrastructure Manager]].
 
 
== UMD Nodes ==
 
 
The procedure for a given site to be certified as part of the EGI production infrastructure depends on the requirements of each EGI federation. Each EGI federation is represented by one National Grid Initiative ([http://www.egi.eu/user-support/ngi_support/ NGI]). The certification process includes the following steps:
 
# Site: requests X.509 user certificates from its national Certification Authority for all site managers;
 
# Site: contacts its NGI to get information about what site-specific information and which statement of acceptance of policy the site has to provide;
 
# NGI: validates the submitted information and adds the site in the EGI [http://goc.egi.eu/ GOCDB] database, setting its certification status to "candidate";
 
# Site: adds any missing information to the GOCDB (security contacts, additional site administrators, etc.);
 
# NGI: validates the submitted information and changes the site certification status to "uncertified";
 
# Site: requests the membership for [https://wiki.egi.eu/wiki/Dteam_vo dteam] and [https://wiki.egi.eu/wiki/OPS_vo ops] Virtual Organisations and subscribes to relevant mailing lists;
 
# Site: installs gLite (with guidance and support of its NGI support contacts);
 
# NGI: starts the execution of certification tests via [https://grid-monitoring.egi.eu/myegi SAM];
 
# NGI: sets the certification status of the site to "certified" and the production status to "production".
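The status changes named in the steps above ("candidate", then "uncertified", then "certified") can be sketched as a simple ordered progression. This is only an illustration: the function name `next_status` is hypothetical, it is not part of GOCDB, and the real GOCDB may define further statuses that the steps above do not mention.

```python
# Status order taken from the steps above; the real GOCDB may define more states.
PROGRESSION = ["candidate", "uncertified", "certified"]


def next_status(current: str) -> str:
    """Return the status that follows `current` in the certification steps."""
    index = PROGRESSION.index(current)  # raises ValueError for unknown statuses
    if index == len(PROGRESSION) - 1:
        raise ValueError(f"'{current}' is already the final status")
    return PROGRESSION[index + 1]
```

In this sketch a site cannot skip a step: it reaches "certified" only by passing through "uncertified" first, which mirrors the two NGI validation steps in the list.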
 
  
 
== Hadoop Nodes ==
 

Latest revision as of 19:13, 27 March 2018

Different certification procedures apply to gCube, gLite, and Hadoop nodes. The certification process refers to nodes because it is applied to the nodes hosting the resources.

gCube Nodes

gCube Nodes are locally managed by a gCube service named gHN Manager. This service is part of the gHN distribution and is automatically made available when the gHN distribution is deployed. The gHN Manager is the active part of the gHN and is responsible for the quality of service delivered by the node. It includes a gHN monitoring component that periodically performs a local certification of the node. This local certification incorporates a number of tests to verify the correct functioning of the gHN. The following gHN characteristics are evaluated:

  1. correctness of gHN profile
  2. correctness of gHN configuration
  3. existence of host certificates
  4. correctness of the connectivity with the Information System
  5. correctness of the deployment, initialization, activation, and upgrade of the services' instances hosted on the gHN.

A gHN, and consequently a gCube node, can be considered as:

  • Started: when the initialisation phase of the gHN is started
  • Ready: when conditions 1. to 3. are properly verified
  • Failed: when at least one condition among 1. to 3. is not properly verified
  • Certified: when condition 5. is properly verified meaning that all services are ready
  • Down: when the gHN is under upgrade, reboot or shutdown
  • Unreachable: when condition 4. is not verified, i.e. the connection with the Information System is temporarily or permanently broken
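As an illustration of how the five conditions map onto the statuses listed above, the following minimal sketch derives a status from a set of check results. It is not the gHN Manager implementation: the names `Checks` and `classify` are hypothetical, and the precedence between statuses (Down first, then Unreachable, then Failed) is an assumption, since the text does not state which status wins when several apply. The "Started" status is left out because it describes the initialisation phase, before any check result is available.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    READY = "Ready"
    FAILED = "Failed"
    CERTIFIED = "Certified"
    DOWN = "Down"
    UNREACHABLE = "Unreachable"


@dataclass
class Checks:
    profile_ok: bool              # condition 1
    configuration_ok: bool        # condition 2
    host_certs_ok: bool           # condition 3
    is_connected: bool            # condition 4
    services_ok: bool             # condition 5
    in_maintenance: bool = False  # upgrade, reboot or shutdown in progress


def classify(c: Checks) -> Status:
    """Map check results to a status (precedence order is an assumption)."""
    if c.in_maintenance:
        return Status.DOWN
    if not c.is_connected:
        return Status.UNREACHABLE
    if not (c.profile_ok and c.configuration_ok and c.host_certs_ok):
        return Status.FAILED
    if c.services_ok:
        return Status.CERTIFIED
    return Status.READY
```

For example, a node whose conditions 1 to 4 hold but whose services are not all ready would be classified as "Ready" under these assumptions, and as "Certified" once condition 5 also holds.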

Any time a gHN is upgraded, its certification is suspended by putting the gHN in the "Down" status, since it is impossible to predict the status after the upgrade. When the upgrade is completed, the certification status is set to "Ready", "Failed" or "Certified". The status of a node can also return to "Ready" because of a failure of a local service instance that is not related to an upgrade operation. For example, in a secure infrastructure, the proxy certificate associated with the service may expire and it may not be possible to renew it automatically.

The possible transitions among the gHN statuses are depicted in the following picture. In addition, it is possible to move from any status to the Unreachable status, and back, at any time. This is because the Unreachable status is usually associated with (hopefully temporary) network failures.

gHN status transition


Besides the normal monitoring activities, the certification information is also used by the gCube VREManager, which deploys gCube services only on gHNs marked as "Certified" or "Ready". The same information is also used to measure the reliability of a node: a node that is frequently in the "Down" status hosts unreliable services requiring frequent software upgrades, and is therefore not appropriate for deploying services that require a dependable node. Lastly, the information about the status of a node can lead to the reallocation of service instances in accordance with the node history.
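The two uses of the status information described above can be sketched as follows. This is not VREManager code: the function names, the node names, and the data shapes (a mapping of node name to current status, and a list of recorded `(node, status)` observations) are assumptions made for illustration.

```python
from collections import Counter

# Statuses on which services may be deployed, as stated in the text.
DEPLOYABLE = {"Certified", "Ready"}


def deployable_nodes(current_status):
    """Return the names of nodes whose current status allows deployment."""
    return sorted(
        name for name, status in current_status.items() if status in DEPLOYABLE
    )


def down_counts(history):
    """Count how often each node was recorded in the "Down" status."""
    return Counter(name for name, status in history if status == "Down")


current = {"node-a": "Certified", "node-b": "Down", "node-c": "Ready"}
history = [("node-a", "Down"), ("node-b", "Down"), ("node-b", "Ready"), ("node-b", "Down")]

candidates = deployable_nodes(current)  # node-a and node-c
counts = down_counts(history)           # node-b was "Down" twice, node-a once
```

A scheduler could then prefer candidates with the lowest "Down" count when placing services that need a dependable node. How such counts are weighted is not specified in the text.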

The information about the certification status is accessible through the infrastructure Monitoring tool. Certification incidents are managed using Support TRAC tickets. These tickets must be created according to the Incident Management procedure. If the ticket is not closed within 5 working days the affected node can be removed from the infrastructure. The monitoring of the gHN certification is carried out by the Infrastructure Manager.
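The 5-working-day limit for closing a ticket can be computed with a small helper like the one below. It is a sketch under stated assumptions: working days are taken to be Monday to Friday, public holidays are ignored, counting starts on the day after the ticket is opened, and the function name `add_working_days` is hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# A ticket opened on Friday 30 March 2018 would be due on Friday 6 April 2018.
deadline = add_working_days(date(2018, 3, 30), 5)
```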

Hadoop Nodes

No certification procedure is applied to Hadoop nodes.


Runtime Resources Nodes

No certification procedure is applied to Runtime Resources nodes.