Difference between revisions of "Procedure Infrastructure Incident Management"

From D4Science Wiki
(Created page with "__TOC__ The management of incidents affecting the production infrastructure is based on TRAC issue tracking system. Open incidents can be browsed using TRAC [https://support.d4s...")
 
Line 28: Line 28:
 
'''2) Prioritization and Assignment'''
 
  
The [[Role Infrastructure Manager|Infrastructure Managers]] assign the correct priority to the incident:
+
The [[WP5 Role Infrastructure Manager|Infrastructure Managers]] assign the correct priority to the incident:
  
 
* High: tickets must be solved within 2 working days or in a Maintenance Release
 
Line 34: Line 34:
 
* Low: tickets must be solved within two weeks or possibly in the next Minor/Major Release
 
  
Tickets must also be assigned to one member of the [[Role Support Team|Support Team]]. This is done either by [[Role Infrastructure Manager|Infrastructure Managers]] or by members of the [[Role Support Team|Support Team]]. The assigned [[Role Support Team|Support Team]] member is automatically notified by email. All email notifications are also sent to the [mailto:support_team@d4science-ii.research-infrastructures.eu support team] mailing list.
+
Tickets must also be assigned to one member of the [[WP5 Role Support Team|Support Team]]. This is done either by [[WP5 Role Infrastructure Manager|Infrastructure Managers]] or by members of the [[WP5 Role Support Team|Support Team]]. The assigned [[WP5 Role Support Team|Support Team]] member is automatically notified by email. All email notifications are also sent to the [mailto:support_team@d4science-ii.research-infrastructures.eu support team] mailing list.
  
  
 
'''3) 1st Line Support'''
 
  
The assigned member of the [[Role Support Team|Support Team]] analyses the ticket and resolves the issue. Reassignment of the ticket to other members of the [[Role Support Team|Support Team]] might occur as a result of ticket analysis and resolution.
+
The assigned member of the [[WP5 Role Support Team|Support Team]] analyses the ticket and resolves the issue. Reassignment of the ticket to other members of the [[WP5 Role Support Team|Support Team]] might occur as a result of ticket analysis and resolution.
  
If the [[Role Support Team|Support Team]] is not able to close the incident, the ticket is escalated to the TCom for further analysis (functional escalation).
+
If the [[WP5 Role Support Team|Support Team]] is not able to close the incident, the ticket is escalated to the TCom for further analysis (functional escalation).
  
  
Line 58: Line 58:
 
'''6) Resolution and Recovery'''
 
  
The [[Role Support Team|Support Team]] or the TCom solves the incident and recovers the affected service to its normal status.
+
The [[WP5 Role Support Team|Support Team]] or the TCom solves the incident and recovers the affected service to its normal status.
  
To indicate that a software problem has been fixed and is now waiting for integration and testing, the [[Role Support Team|Support Team]] or TCom member should assign the milestone "Ready to Release" to the incident ticket.
+
To indicate that a software problem has been fixed and is now waiting for integration and testing, the [[WP5 Role Support Team|Support Team]] or TCom member should assign the milestone "Ready to Release" to the incident ticket.
  
  
 
'''7) Incident Closure'''
 
  
The [[Role Support Team|Support Team]] closes the ticket according to the ticket priority:
+
The [[WP5 Role Support Team|Support Team]] closes the ticket according to the ticket priority:
 
* High Priority: tickets can only be closed when the issue has been solved and fixed in the production ecosystem;
 
 
* Normal/Low Priority: tickets can be closed if the issue was either (1) solved and fixed in the production ecosystem, or (2) fully understood and a new TRAC task created.
 
Line 71: Line 71:
 
The submitter may re-open the ticket if they are not satisfied with the response.
 
  
The [[Role Support Team|Support Team]] maintains a knowledge database of incident descriptions and resolutions. Every closed ticket should be completed with an incident resolution report covering the following points: Problem Description, Affected Services, and Resolution Details. This information must be provided as a final comment on the incident TRAC ticket.
+
The [[WP5 Role Support Team|Support Team]] maintains a knowledge database of incident descriptions and resolutions. Every closed ticket should be completed with an incident resolution report covering the following points: Problem Description, Affected Services, and Resolution Details. This information must be provided as a final comment on the incident TRAC ticket.
  
  
Line 85: Line 85:
  
 
== gLite Nodes ==
 
Tickets related to the usage of gLite nodes are managed according to the procedure defined above. However, the [[Role Support Team|Support Team]] may forward the problem to EGI by creating a new ticket in the EGI [https://gus.fzk.de/pages/home.php GGUS] support service. The [[Role Support Team|Support Team]] is then responsible for monitoring the tickets submitted in GGUS and updating the corresponding TRAC tickets.
+
Tickets related to the usage of gLite nodes are managed according to the procedure defined above. However, the [[WP5 Role Support Team|Support Team]] may forward the problem to EGI by creating a new ticket in the EGI [https://gus.fzk.de/pages/home.php GGUS] support service. The [[Role Support Team|Support Team]] is then responsible for monitoring the tickets submitted in GGUS and updating the corresponding TRAC tickets.
  
 
Tickets related to the installation and upgrade of gLite are submitted by email to the corresponding EGI [http://www.egi.eu/production-infrastructure/Resource-providers/ NGIs]. These tickets are then processed by the [http://www.egi.eu/production-infrastructure/Resource-providers/ NGIs], which either provide an immediate resolution to the problem or forward the problem to the EGI [https://gus.fzk.de/pages/home.php GGUS] support service.
 
  
 
Best-effort support and open discussion on any gLite issue are available through the [mailto:glite-discuss@cern.ch gLite-discuss] mailing list.
 

Revision as of 19:40, 1 December 2011

The management of incidents affecting the production infrastructure is based on the TRAC issue tracking system. Open incidents can be browsed using TRAC report 1. Closed incidents can be browsed using TRAC report 2.

The resolution of infrastructure incidents is based on the following steps:

  1. incident submission
  2. prioritization and assignment
  3. 1st line support
  4. 2nd line support (optional)
  5. management decision (optional)
  6. resolution and recovery
  7. incident closure


1) Incident Submission

Incident tickets are created in TRAC. Tickets must be created as follows:

  • enter your email address
  • fill in the Summary text box
  • fill in the Description text box
  • select the required Priority (optional)
  • add other concerned people in CC (optional)
  • select the appropriate VRE (Virtual Research Environment) (optional)

Note: access to TRAC can be requested by emailing the project support mailing list.
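
The submission checklist above can be sketched as a small data structure with a check for the mandatory fields. This is only an illustration: the class name, field names, and helper function are hypothetical and do not correspond to the TRAC ticket form or to any TRAC API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentSubmission:
    """Hypothetical model of the ticket form; not a TRAC data type."""
    # Mandatory items from the submission checklist.
    email: str
    summary: str
    description: str
    # Items marked "(optional)" in the checklist.
    priority: Optional[str] = None
    cc: List[str] = field(default_factory=list)
    vre: Optional[str] = None


def missing_fields(submission: IncidentSubmission) -> List[str]:
    """Return the names of mandatory fields that are empty or whitespace only."""
    required = ("email", "summary", "description")
    return [name for name in required if not getattr(submission, name).strip()]
```

A submission with an empty description, for example, would be reported as missing only the `description` field.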


2) Prioritization and Assignment

The Infrastructure Managers assign the correct priority to the incident:

  • High: tickets must be solved within 2 working days or in a Maintenance Release
  • Normal: tickets must be solved within one week or in the next Minor/Major Release
  • Low: tickets must be solved within two weeks or possibly in the next Minor/Major Release

Tickets must also be assigned to one member of the Support Team. This is done either by Infrastructure Managers or by members of the Support Team. The assigned Support Team member is automatically notified by email. All email notifications are also sent to the support team mailing list.
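
The time limits attached to each priority can be made concrete with a short sketch. It rests on assumptions that the procedure does not state: working days are taken to be Monday to Friday with public holidays ignored, "one week" and "two weeks" are read as calendar weeks, and the alternative of fixing the ticket in a release is not modelled. The function names are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def resolution_deadline(priority: str, opened: date) -> date:
    """Latest resolution date for a ticket opened on `opened` (assumed reading)."""
    if priority == "High":
        return add_working_days(opened, 2)
    if priority == "Normal":
        return opened + timedelta(weeks=1)
    if priority == "Low":
        return opened + timedelta(weeks=2)
    raise ValueError(f"unknown priority: {priority!r}")
```

Under these assumptions, a High ticket opened on Thursday 1 December 2011 would be due on Monday 5 December 2011, since the weekend does not count towards the two working days.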


3) 1st Line Support

The assigned member of the Support Team analyses the ticket and resolves the issue. Reassignment of the ticket to other members of the Support Team might occur as a result of ticket analysis and resolution.

If the Support Team is not able to close the incident, the ticket is escalated to the TCom for further analysis (functional escalation).


4) 2nd Line Support

The TCom tries to understand and solve the incident.

If the TCom is not able to close the incident, the ticket is escalated for a final decision to the PEB (hierarchical escalation).


5) Management Decision

The PEB takes a final decision to allow the resolution of the ticket.
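
The escalation path described in steps 3 to 5 (Support Team, then TCom, then PEB) can be summarised as an ordered chain. The sketch below is illustrative only; the constant and function names are hypothetical and nothing here is part of TRAC.

```python
from typing import Optional

# Escalation order taken from steps 3 to 5: functional escalation to the TCom,
# then hierarchical escalation to the PEB.
ESCALATION_CHAIN = ["Support Team", "TCom", "PEB"]


def escalate(current_level: str) -> Optional[str]:
    """Return the next level in the chain, or None if the PEB already holds the ticket."""
    index = ESCALATION_CHAIN.index(current_level)  # raises ValueError for unknown levels
    if index + 1 < len(ESCALATION_CHAIN):
        return ESCALATION_CHAIN[index + 1]
    return None
```

Returning `None` at the PEB reflects that the PEB takes the final decision and there is no further level to escalate to.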


6) Resolution and Recovery

The Support Team or the TCom solves the incident and recovers the affected service to its normal status.

To indicate that a software problem has been fixed and is now waiting for integration and testing, the Support Team or TCom member should assign the milestone "Ready to Release" to the incident ticket.


7) Incident Closure

The Support Team closes the ticket according to the ticket priority:

  • High Priority: tickets can only be closed when the issue has been solved and fixed in the production ecosystem;
  • Normal/Low Priority: tickets can be closed if the issue was either (1) solved and fixed in the production ecosystem, or (2) fully understood and a new TRAC task created.

The submitter may re-open the ticket if they are not satisfied with the response.
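
The closure rule can be expressed as a small predicate. This is a sketch under the assumption that the two conditions are tracked as simple flags; the function and parameter names are hypothetical.

```python
def can_close(priority: str, fixed_in_production: bool,
              understood_with_new_task: bool = False) -> bool:
    """Apply the closure rule for a ticket of the given priority.

    `understood_with_new_task` stands for condition (2): the issue is fully
    understood and a new TRAC task has been created.
    """
    if priority not in ("High", "Normal", "Low"):
        raise ValueError(f"unknown priority: {priority!r}")
    if fixed_in_production:
        return True  # condition (1) is sufficient for every priority
    # Only Normal and Low tickets may be closed on the basis of a follow-up task.
    return priority in ("Normal", "Low") and understood_with_new_task
```

For instance, `can_close("High", False, True)` is false, because a High priority ticket stays open until the fix is in the production ecosystem.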

The Support Team maintains a knowledge database of incident descriptions and resolutions. Every closed ticket should be completed with an incident resolution report covering the following points: Problem Description, Affected Services, and Resolution Details. This information must be provided as a final comment on the incident TRAC ticket.
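
One possible way to keep resolution reports uniform is to generate the final comment from the three required points. The layout below is an assumption, since the procedure names the points but does not prescribe a format, and the function name is hypothetical.

```python
from typing import List


def resolution_report(problem_description: str,
                      affected_services: List[str],
                      resolution_details: str) -> str:
    """Build the text of the final ticket comment from the three required points."""
    lines = [
        "Problem Description: " + problem_description,
        "Affected Services: " + ", ".join(affected_services),
        "Resolution Details: " + resolution_details,
    ]
    return "\n".join(lines)
```

The returned text would then be pasted as the last comment on the incident ticket before it is closed.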


gCube Nodes

Tickets related to gCube nodes are followed according to the procedure defined above.


Hadoop Nodes

Tickets related to Hadoop nodes are followed according to the procedure defined above.


gLite Nodes

Tickets related to the usage of gLite nodes are managed according to the procedure defined above. However, the Support Team may forward the problem to EGI by creating a new ticket in the EGI GGUS support service. The Support Team is then responsible for monitoring the tickets submitted in GGUS and updating the corresponding TRAC tickets.

Tickets related to the installation and upgrade of gLite are submitted by email to the corresponding EGI NGIs. These tickets are then processed by the NGIs, which either provide an immediate resolution to the problem or forward the problem to the EGI GGUS support service.

Best-effort support and open discussion on any gLite issue are available through the gLite-discuss mailing list.