Difference between revisions of "Procedure Infrastructure Incident Management"
Andrea.manzi (→gLite/UMD Resources)
Latest revision as of 16:31, 5 April 2018
Incident Management
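The management of incidents affecting the production infrastructure is based on the RedMine issue tracking system. Open incidents can be browsed in RedMine under Open Incident (https://support.d4science.org/projects/d4science/issues?query_id=207). Closed incidents can be browsed in RedMine under Closed Incident (https://support.d4science.org/projects/d4science/issues?query_id=208).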
The resolution of infrastructure incidents is based on the following steps:
- incident submission
- prioritization and assignment
- 1st line support
- 2nd line support (optional)
- management decision (optional)
- resolution and recovery
- incident closure
Incident Submission
Incident tickets are created in the Tracking System (https://support.d4science.org). Tickets must be created as follows:
- enter your email address
- fill the Summary text box
- fill the Description text box
- select the appropriate Application from the list available; if none of the available options applies to your incident, select the "Other" option
- select the required Priority (optional)
- add other concerned people in CC (optional)
- select the appropriate VRE (Virtual Research Environment) (optional)
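The submission checklist above can be sketched as a small validation helper. This is only an illustration: the field keys (`email`, `summary`, `description`, `application`, and so on) are hypothetical names chosen for the example and are not taken from the tracking system itself.

```python
# Hypothetical field keys; the real form may name them differently.
REQUIRED_FIELDS = ("email", "summary", "description", "application")
OPTIONAL_FIELDS = ("priority", "cc", "vre")


def missing_fields(ticket: dict) -> list:
    """Return the required fields that are absent or blank in `ticket`."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(ticket.get(name) or "").strip()
    ]
```

A ticket with an empty description would be reported as incomplete, while the optional fields never appear in the result.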
Prioritization and Assignment
The Infrastructure Managers assign the correct priority to the incident:
- High: tickets must be solved within 2 working days or in a Maintenance Release
- Normal: tickets must be solved within one week or in the next Minor/Major Release
- Low: tickets must be solved within two weeks or possibly in the next Minor/Major Release
Tickets must also be assigned to one member of the Support Team. This is done either by Infrastructure Managers or by members of the Support Team. The assigned Support Team member is automatically notified by email. All email notifications are also sent to the support team mailing list.
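The time limits attached to each priority can be sketched as a deadline calculation. The sketch makes several assumptions that the procedure does not state: working days are Monday to Friday, public holidays are ignored, "one week" and "two weeks" are calendar weeks, and the release-based alternative is not modelled. The function names are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def resolution_deadline(priority: str, opened: date) -> date:
    """Latest resolution date for a ticket opened on `opened` (assumed rules)."""
    if priority == "High":
        return add_working_days(opened, 2)
    if priority == "Normal":
        return opened + timedelta(weeks=1)
    if priority == "Low":
        return opened + timedelta(weeks=2)
    raise ValueError(f"unknown priority: {priority!r}")
```

Under these assumptions, a High priority ticket opened on a Thursday is due the following Monday, because the weekend does not count towards the two working days.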
1st Line Support
The assigned member of the Support Team analyses the ticket and resolves the issue. Reassignment of the ticket to other members of the Support Team might occur as a result of ticket analysis and resolution.
If the Support Team is not able to close the incident, the ticket is escalated to the Technical Committee (TCom) for further analysis (functional escalation).
2nd Line Support
The TCom tries to understand and solve the incident, and may change its priority.
If the TCom is not able to close the incident, the ticket is escalated for a final decision to the management board (hierarchical escalation).
Management Decision
The management board takes a final decision to allow the resolution of the ticket.
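The escalation path described in the three support sections can be sketched as a lookup table. The labels and the `Escalation` type are hypothetical names for illustration; only the order of the steps and the two escalation kinds (functional, hierarchical) come from the procedure.

```python
from typing import NamedTuple


class Escalation(NamedTuple):
    target: str  # who receives the ticket next
    kind: str    # "functional" or "hierarchical"


# Who escalates to whom, as described in the procedure.
ESCALATIONS = {
    "Support Team": Escalation("TCom", "functional"),
    "TCom": Escalation("Management Board", "hierarchical"),
}


def escalate(handler: str) -> Escalation:
    """Return the next escalation step for the current handler."""
    try:
        return ESCALATIONS[handler]
    except KeyError:
        raise ValueError(f"no escalation defined from {handler!r}") from None
```

The management board has no entry in the table because it takes the final decision, so asking for a further escalation raises an error.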
Resolution and Recovery
The Support Team or the TCom solves the incident and recovers the affected service to its normal status.
Different steps apply to the resolution of incident tickets, depending on the nature of the incident:
- To indicate that a problem which required an intervention on the infrastructure has been fixed, the Support Team or TCom member should set the Progress field to 100%.
- To indicate that a software problem has been fixed and is now waiting for integration and testing, the Support Team or TCom member should assign the milestone "Ready to Release" to the incident ticket and set the Progress field to 100%.
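The two resolution cases above differ only in whether a milestone is set. A minimal sketch, using hypothetical dictionary keys (`progress`, `milestone`) rather than the tracker's real field identifiers:

```python
def resolution_updates(software_fix: bool) -> dict:
    """Field changes to apply when an incident is resolved.

    `software_fix` is True when the fix is a software change that still
    waits for integration and testing, False for an infrastructure
    intervention.
    """
    updates = {"progress": 100}
    if software_fix:
        updates["milestone"] = "Ready to Release"
    return updates
```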
Incident Closure
The Support Team closes the ticket according to the ticket priority:
- High Priority: tickets can only be closed when the issue has been solved and fixed in the production ecosystem;
- Normal/Low Priority: tickets can be closed if the issue was either (1) solved and fixed in the production ecosystem, or (2) fully understood and a new issue ticket created with the tracker 'task' type.
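The closure rule above can be sketched as a single predicate. The parameter names are hypothetical; `understood_with_task` stands for case (2), where the issue is fully understood and a follow-up ticket of tracker type 'task' exists.

```python
def can_close(priority: str, fixed_in_production: bool,
              understood_with_task: bool = False) -> bool:
    """Decide whether an incident ticket may be closed."""
    if fixed_in_production:
        # Allowed for every priority.
        return True
    if priority in ("Normal", "Low"):
        # Case (2): fully understood and tracked by a new 'task' ticket.
        return understood_with_task
    # High priority tickets need the fix in production.
    return False
```

With this rule, a High priority ticket stays open even when a follow-up task exists, whereas a Normal or Low priority ticket can be closed in that situation.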
PLEASE NOTE that when an intervention on the production infrastructure is needed to fix an incident (e.g. a service or GHN (gCube Hosting Node) restart), the responsible Support Team member should fill the Intervention Time field of the incident ticket with the number of minutes spent on the infrastructure activity.
The submitter may re-open the ticket if they are not satisfied with the ticket response.
The Support Team maintains a knowledge database with information about the incident descriptions and resolutions. All closed tickets should be completed with an incident resolution report including the following points: Problem Description, Affected Services, and Resolution Details. This information must be provided as a final comment to the incident ticket.
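The final comment could be assembled with a small helper such as the one below. The three section titles come from the procedure; the function name, the plain-text layout, and the check for empty sections are assumptions made for this sketch.

```python
def resolution_report(problem_description: str,
                      affected_services: str,
                      resolution_details: str) -> str:
    """Build the text of the final comment for a closed incident ticket."""
    sections = [
        ("Problem Description", problem_description),
        ("Affected Services", affected_services),
        ("Resolution Details", resolution_details),
    ]
    for title, body in sections:
        if not body.strip():
            raise ValueError(f"section {title!r} must not be empty")
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)
```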