Introduction
Contents
- Introduction
- Flowchart Diagram
- Process
- Roles, Responsibilities, & RACI Matrix
- Appendices
Purpose and Scope
The purpose of this process is to restore normal service operation as quickly as possible, minimize adverse impact on business operations, and ensure that agreed-upon levels of service quality are maintained. Instances of disruption or degradation to normal business operations are referred to as incidents.
The scope of this document is to describe the process for identifying, prioritizing, and resolving incidents for supported services.
Process Goals
The goal of this process is to establish standardized methods for Incident Management.
The Division of IT will maintain its commitment to high standards of excellence through a professional approach to quickly resolving incidents. This process is owned by the Deputy Executive Director: ITEE.
Document Owner
The document owner is responsible for ensuring that this document is accurate and up to date, following agreed processes for review and revision.
This document is owned by the Associate Director of IT Support.
Definitions
Normal service operation: Service operating within the thresholds defined by Service Level Agreements (SLAs) and Operational Level Agreements (OLAs).
Incident: An incident is defined as an unplanned interruption or reduction in quality of an IT service. Many incidents are reported by customers, although incidents can also be identified through monitoring tools or by IT employees. In ServiceNow, incident record numbers begin with ‘INC’ so they are easy to identify.
Impact: A categorization representing how many customers are experiencing disruption or degradation from the incident. That is, how many people are affected by an incident (e.g., a campus-wide network outage has a higher impact than a single malfunctioning Ethernet port).
Urgency: A measure of how quickly the incident needs to be fixed, based on its business criticality to the University (e.g., an incident preventing a customer from receiving information right before a deadline has a higher urgency than a customer's inability to receive general emails).
Priority: A measure of how critical the incident is overall to the business as calculated by a combination of the incident’s Impact and Urgency.
Major Incident: An incident that represents a widespread or high-impact degradation or outage of a designated core, critical, security-related, or safety-related service. Major Incidents are handled through the Major Incident Management process.
Major Incident Management: The process through which Major Incidents are proposed to and evaluated by Major Incident Managers, and then handled.
Flowchart Diagram
Process
Process Inputs and Outputs
User Intake Methods
- Phone
- Self-Service Portal
- Email
Process Inputs
- User Submission – phone, self-service portal, email
- Monitoring Event Trigger
- Support Staff Submission
- Third Party Systems
Process Outputs
- Resolved incidents
- Documented workarounds, solutions, or knowledge articles
- New problems, changes, or incidents
- Instantiation of Major Incident Management process
- Identification of Configuration Items associated with or impacted by incidents
Standard Incident Process Flow
Incident Identification
This is the process step where a reported or suspected incident is identified and initial evaluation is performed. User submission is made through any of the following methods: phone call, self-service, or email.
During this step, the intake agent will identify whether an incident has been or needs to be created.
- Evaluate the customer issue and determine if it relates to a supported service.
- If the incident relates to a non-supported service, notify the user, and cancel the incident.
- If the user is referred to another service, update the incident and close as a referral.
- Identify whether the user has an open incident and if so, provide status updates.
- Identify any duplicate incidents, update the original incident with new information, and close the incident as a duplicate.
- If no incident is required, create a call record.
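The identification checks above can be sketched as a small decision function. This is only an illustration: the `Contact` fields, the `IntakeOutcome` names, and the order in which the checks are applied are assumptions, not part of the documented process or of ServiceNow.

```python
from dataclasses import dataclass
from enum import Enum


class IntakeOutcome(Enum):
    # Hypothetical labels for the outcomes listed in the bullets above.
    CLOSE_AS_REFERRAL = "update the incident and close as a referral"
    CANCEL_UNSUPPORTED = "notify the user and cancel the incident"
    CLOSE_AS_DUPLICATE = "update the original incident and close as a duplicate"
    PROVIDE_STATUS = "provide a status update on the open incident"
    CALL_RECORD = "create a call record"
    LOG_INCIDENT = "proceed to Incident Logging"


@dataclass
class Contact:
    # Hypothetical facts the intake agent gathers during identification.
    supported_service: bool
    referred_elsewhere: bool = False
    is_duplicate: bool = False
    has_open_incident: bool = False
    incident_required: bool = True


def triage(contact: Contact) -> IntakeOutcome:
    """Return the intake outcome; the check order is an assumption."""
    if contact.referred_elsewhere:
        return IntakeOutcome.CLOSE_AS_REFERRAL
    if not contact.supported_service:
        return IntakeOutcome.CANCEL_UNSUPPORTED
    if contact.is_duplicate:
        return IntakeOutcome.CLOSE_AS_DUPLICATE
    if contact.has_open_incident:
        return IntakeOutcome.PROVIDE_STATUS
    if not contact.incident_required:
        return IntakeOutcome.CALL_RECORD
    return IntakeOutcome.LOG_INCIDENT
```

For example, `triage(Contact(supported_service=True))` returns `IntakeOutcome.LOG_INCIDENT`, meaning the agent continues to the logging step.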
Incident Logging
This is the process step where the Service Desk creates new incidents or updates existing incidents with new information. Incidents may also be logged automatically through system integrations or by service owners and providers when receiving early indicators of a service degradation or outage.
Incidents logged for or by VIP users will be automatically assigned to the correct support group for the duration of the incident management process.
During this step, the intake agent will log relevant information related to the user and the user’s issue and create an incident record.
- Select the next unassigned incident in the queue or create a new self-assigned incident, setting the Assignment group and Assigned to fields.
- Validate and update user contact information: name, PID, email address, and phone number.
- Provide a short description of the issue and detailed work notes including symptoms, steps to replicate, timing, and any steps the user took to resolve their issue.
- Submit the incident to create a record of the issue.
Incident Categorization
During this process step, the Service Desk will categorize the incident based on information available at creation. Incidents may be automatically categorized through system integrations or by service owners and providers when receiving early indicators of a service degradation or outage. Service owners and providers may refine categorization as warranted.
- Select the category and subcategory best reflected by the information provided.
- Add associated Configuration Items to the incident.
Incident Prioritization
During this process step, ServiceNow will calculate the priority of the incident using the assessed impact and urgency of the issue.
- Set the impact of the incident using available information.
  - Single User – the user is not aware of anyone else experiencing the issue and there are no related active incidents.
  - Multiple Users/Locations – the user reports other users or locations experiencing the issue or there are multiple related incidents.
  - University Wide – the issue is affecting the entire university.
- Set the urgency of the incident using available information. Necessity of systems in context and patterns of business activity (PBA) should be taken into account when assessing urgency.
  - Low – minor issues not significantly affecting business operations or user productivity.
  - Normal – moderate issues affecting business operations or user productivity but no organizational threat.
  - High – significant issues affecting business operations or user productivity. Potential threat to security or reputation.
  - Critical – severe issues immediately affecting business operations, user productivity, organizational security, or reputation.
- Assess the priority of the incident as calculated by ServiceNow.
Urgency \ Impact | Single User | Multiple Users/Locations | University Wide |
---|---|---|---|
Low | P4 | P4 | P4 |
Normal | P3 | P2 | P2 |
High | P2 | P2 | P1 |
Critical | P2 | P1 | P1 |
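The impact and urgency matrix above amounts to a table lookup. The following is a minimal sketch of that lookup, not ServiceNow's actual implementation; the function name `calculate_priority` and the constant names are hypothetical, and the impact labels follow the wording used in the prioritization steps.

```python
# Impact columns, in the same left-to-right order as the matrix.
IMPACTS = ("Single User", "Multiple Users/Locations", "University Wide")

# One row per urgency level; each tuple lines up with IMPACTS.
PRIORITY_MATRIX = {
    "Low": ("P4", "P4", "P4"),
    "Normal": ("P3", "P2", "P2"),
    "High": ("P2", "P2", "P1"),
    "Critical": ("P2", "P1", "P1"),
}


def calculate_priority(impact: str, urgency: str) -> str:
    """Look up the priority (P1 to P4) for an impact and urgency pair."""
    if impact not in IMPACTS:
        raise ValueError(f"unknown impact: {impact!r}")
    if urgency not in PRIORITY_MATRIX:
        raise ValueError(f"unknown urgency: {urgency!r}")
    return PRIORITY_MATRIX[urgency][IMPACTS.index(impact)]
```

For example, `calculate_priority("University Wide", "High")` returns `"P1"`, while any incident with Low urgency returns `"P4"` regardless of impact.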
Priority of Incident | Classification | Time to Response * | Time to Resolution * |
---|---|---|---|
P4 | Low | - | - |
P3 | Normal | 8 business hours | 3 - 5 business days |
P2 | Severe | 1 business hour | 1 - 2 business days |
P1 | Major | 15 minutes | 2 - 4 hours |
* Actual times will depend on the impacted service. These times are only examples.
A Major Incident may be proposed when the incident priority is P1, or there is a large volume of related P2 incidents, and in addition the work notes indicate an outage or degradation and the impacted service is core infrastructure, is business critical, or affects public safety or information security.
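The proposal criteria can be read as a single condition: a priority trigger (P1, or many related P2 incidents) combined with evidence of an outage and a qualifying service. The sketch below encodes that reading. The class and field names are hypothetical, and the threshold for "a large volume" is not defined in this process, so `RELATED_P2_THRESHOLD` is purely an assumed placeholder.

```python
from dataclasses import dataclass

# Service classifications named in the criteria above.
QUALIFYING_SERVICE_TYPES = {
    "core infrastructure",
    "business critical",
    "public safety",
    "information security",
}

# Assumption: the process does not state what counts as "a large volume".
RELATED_P2_THRESHOLD = 5


@dataclass
class IncidentSummary:
    priority: str                 # "P1" through "P4"
    related_p2_count: int         # number of related P2 incidents
    notes_indicate_outage: bool   # work notes show an outage or degradation
    service_type: str             # classification of the impacted service


def may_propose_major_incident(incident: IncidentSummary) -> bool:
    """Return True when all three groups of criteria are met."""
    priority_trigger = (
        incident.priority == "P1"
        or incident.related_p2_count >= RELATED_P2_THRESHOLD
    )
    return (
        priority_trigger
        and incident.notes_indicate_outage
        and incident.service_type in QUALIFYING_SERVICE_TYPES
    )
```

Meeting these criteria only permits a proposal; the evaluation itself remains with the Major Incident Managers.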
Initial Diagnosis
During this process step, the agent will attempt to identify a workaround or resolution for the incident and document any troubleshooting steps.
- Search internal and external knowledge bases, related incidents, and known errors. Contact the user for additional information as needed.
- Provide the user with troubleshooting steps. Document the outcomes, error messages, screenshots, and any other relevant information.
- Update the incident with new information as it becomes available. Refine categorization, related configuration items, impact, and urgency as needed.
- Resolve or reassign the updated incident for additional investigation. Update the user on status.
Incident Investigation
During this process step, the partner or service provider conducts additional troubleshooting and investigation of the incident, escalating to the service owner(s) or vendor(s) as required.
- Conduct in-depth investigation and troubleshooting, clarifying and refining the issue.
- Provide the user with troubleshooting steps. Document outcomes, error messages, screenshots, and any other relevant information.
- Update the incident with new information as it becomes available. Refine categorization, related configuration items, impact and urgency as needed.
- Resolve or escalate the updated incident to the service owner(s) or vendor(s). Update the user on status.
Incident Resolution and Closure
During this final process step, the incident is resolved and closed by the assignee.
- Document all troubleshooting steps, research, and resolution in the work notes.
- Communicate the resolution to the user.
- Summarize the resolution steps and assign a closure code based on the resolution.
- Resolve the incident.
- Reassign the incident to the Service Desk for user validation and closure or allow the incident to close automatically.
Roles, Responsibilities, & RACI Matrix
Responsible: People or stakeholders who do the work. They must complete the task or objective or make the decision. Several people can be jointly Responsible.
Accountable: Person or stakeholder who is the “owner” of the work. He or she must sign off or approve when the task, objective or decision is complete. This person must make sure that responsibilities are assigned in the matrix for all related activities. Success requires that there is only one person Accountable, which means that “the buck stops there.”
Consulted: People or stakeholders who need to give input before the work can be done and signed off. These people are “in the loop” and active participants.
Informed: People or stakeholders who need to be kept “in the picture.” They need updates on progress or decisions, but they do not need to be formally consulted, nor do they contribute directly to the task or decision.
Activity | Incident Manager | Intake | Service Provider | Service Owner | MI Manager |
---|---|---|---|---|---|
Incident Management Support | A / R | ||||
Incident Identification | A | R | R | ||
Incident Logging | A | R | R | ||
Incident Categorization | A | R | |||
Incident Prioritization | A | R | |||
Initial Diagnosis - T1 | A | R | |||
Pro-Active User Information (Notification) | A / R | R | |||
Incident Resolution - T1 | A | R | |||
Monitoring & Escalation | A / R | R | R | C / I | |
Incident Investigation - T2 | C / I | C / I | R | A | |
Incident Resolution - T2 | C / I | R | A | ||
Major Incident Management | C / I | C / I | C / I | C / I | A / R |
Incident Closure | A | R | C / I | C / I | |
Incident Management Reporting | A / R | C | I | I | C / I |
Incident Process Owner (Deputy Executive Director: ITEE)
- Owns the entire process, including associated procedural steps, roles, responsibilities, changes, improvements, and definitions.
- Matures and adapts the process to changing business needs through annual reviews and Continuous Service Improvement (CSI).
Incident Process Manager (Associate Director of IT Support)
- Responsible for ensuring the Incident process is carried out in a standard way and overseeing day-to-day process execution.
- Responsible for process reporting and metrics as well as proposing process improvements.
Incident Intake – Tier 1
Incident intake is the first support level after self-service and is responsible for gathering information about the customer and the issue as well as logging, categorizing, prioritizing, and performing initial diagnosis.
Service Provider – Tier 2
The group responsible for delivering a supported service or services, including working with vendors and third parties. Provides support to incident intake, as well as advanced troubleshooting and escalation to the service owner as needed.
Service Owner
Either a manager or member of a service provider group or their delegate. This individual is responsible for the lifecycle of a service, including the availability and delivery of that service. Creates incidents when receiving early indicators about potential service degradation or outage.
Appendices
Appendix A: Related Documentation
Process | Process Owner | Relationship to This Process |
---|---|---|
Cyber Security Incident Response | Chief Information Security Officer | Handling of major security incidents |
Major IT Service Issue Process | Deputy Executive Director: User Engagement | Defines the handling and communications of major issues with services |
Appendix B: ServiceNow Incident States
Incident State | Description | Behavior |
---|---|---|
New | Incidents start in this state and remain so until assigned to an individual. | |
Active | The state of an incident after assignment to an individual. | |
Awaiting Problem | The state of an incident when all other work has been completed and a workaround is in place/service is restored, but the underlying issue is not resolved and is associated with a problem record. | Attaches new workarounds from the associated problem to this incident as an additional comment. |
Awaiting User Information | The state of an incident after the user has been contacted for more information, or while awaiting the outcome of actions on the user’s part to resolve the issue. | Pauses SLA timers. |
On Hold | This state is for incidents that cannot be resolved at this time and there is no user communication expected. | Pauses SLA timers. |
Resolved | The issue or disruption is fixed, a workaround is in place, or the user has been referred for assistance. | The incident will stay in the resolved state for three (3) days. During this time the incident can be re-opened. |
Closed | The incident is closed. | The incident has remained in the resolved state for more than three days without being re-opened, or the user has accepted the resolution of the incident. The incident cannot be modified. If the issue recurs, a new incident must be created. |
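Two behaviors in the table above lend themselves to a short illustration: which states pause SLA timers, and when a resolved incident becomes closed. The sketch below is not ServiceNow configuration; the function names are hypothetical, and it assumes the three-day window is measured in calendar days, which the table does not specify.

```python
from datetime import datetime, timedelta

# States whose Behavior column reads "Pauses SLA timers."
SLA_PAUSED_STATES = {"Awaiting User Information", "On Hold"}

# Assumption: three calendar days (the table does not say business days).
AUTO_CLOSE_AFTER = timedelta(days=3)


def sla_timer_paused(state: str) -> bool:
    """Return True if SLA timers are paused in the given incident state."""
    return state in SLA_PAUSED_STATES


def state_after_resolution(
    resolved_at: datetime, now: datetime, user_accepted: bool = False
) -> str:
    """Return "Closed" once the user accepts the resolution or more than
    three days pass without a re-open; otherwise stay "Resolved"."""
    if user_accepted or now - resolved_at > AUTO_CLOSE_AFTER:
        return "Closed"
    return "Resolved"
```

An incident resolved on a Monday morning and left untouched would, under these assumptions, still be "Resolved" on Wednesday and "Closed" by Friday.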
Appendix C: Standard Incident Metrics & KPIs
- First Contact Resolution Rate (FCR): This is the percentage of incidents resolved by the Service Desk on the first interaction with the user. It's calculated as (Number of incidents resolved on first contact / Total number of incidents) * 100.
- Mean Time to Resolution (MTTR): This is the average time taken to resolve an incident from the time it was reported. It's calculated as Total time spent resolving incidents / Total number of resolved incidents.
- Incident Volume: This is the total number of incidents reported in a given period. It's a simple count of incidents.
- Incident Backlog: This is the number of incidents that are open and yet to be resolved. It's a simple count of open incidents.
- Incident Response Time: This is the average time taken to respond to an incident after it has been reported. It's calculated as Total time taken to respond to incidents / Total number of incidents.
- Percentage of Incidents Escalated: This is the percentage of incidents that had to be escalated to a higher level of support. It's calculated as (Number of incidents escalated / Total number of incidents) * 100.
- Customer Satisfaction (CSAT) Score: This is a measure of how satisfied users are with the resolution of their incidents. It's typically calculated through user surveys.
- Service Level Agreement (SLA) Compliance Rate: This is the percentage of incidents resolved within the agreed SLA. It's calculated as (Number of incidents resolved within SLA / Total number of incidents) * 100.
- Cost per Incident: This is the average cost of handling an incident. It's calculated as Total cost of Service Desk operations / Total number of incidents.
- Incident Reopen Rate: This is the percentage of incidents that had to be reopened after they were initially resolved. It's calculated as (Number of incidents reopened / Total number of incidents) * 100.
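Several of the formulas above can be computed directly from a list of incident records. The following is a rough sketch with made-up field names and no real data; it assumes every record in the list is a resolved incident, so the same count serves as the denominator for both the percentages and MTTR.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class IncidentRecord:
    # Hypothetical fields; a real export would use the tool's own columns.
    hours_to_resolve: float
    resolved_on_first_contact: bool
    escalated: bool
    resolved_within_sla: bool
    reopened: bool


def percentage(part: int, total: int) -> float:
    """(part / total) * 100, or 0.0 when there are no incidents."""
    return 100.0 * part / total if total else 0.0


def summarize(incidents: Sequence[IncidentRecord]) -> Dict[str, float]:
    """Compute a subset of the KPIs listed above for resolved incidents."""
    total = len(incidents)
    total_hours = sum(i.hours_to_resolve for i in incidents)
    return {
        "incident_volume": float(total),
        "fcr_pct": percentage(sum(i.resolved_on_first_contact for i in incidents), total),
        "mttr_hours": total_hours / total if total else 0.0,
        "escalated_pct": percentage(sum(i.escalated for i in incidents), total),
        "sla_compliance_pct": percentage(sum(i.resolved_within_sla for i in incidents), total),
        "reopen_rate_pct": percentage(sum(i.reopened for i in incidents), total),
    }
```

Metrics that depend on outside data, such as CSAT scores and Cost per Incident, are left out because they cannot be derived from incident records alone.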