Major Incident

Introduction

A major incident is an emerging or ongoing outage or degradation of a core or critical production service which significantly disrupts university/business operations.

Contents

Introduction
Purpose
Major Incident Scope
Major Incident Definition
Process
How to Associate an Incident with a Major Incident: Link to knowledgebase article
Glossary
Appendix A: Roles and Responsibilities
Appendix B: Priority, Impact, and Urgency
Appendix C: Backup IT services
Appendix D: Communications Templates and Timeline
Appendix E: Core and Critical Services, Service Owners, and Service Providers (Support Groups)
Appendix F: Major Incident Manager Checklist
Process Change History

Purpose

Formulate a coordinated response to major incidents with goals to:

Restore core/critical production services to normal operations as soon as possible.
Inform stakeholders of core/critical production service outages or degradations.
Support activities to identify cause of issues and review organizational response through after-action reviews.

Major Incident Scope

A major incident is an emerging or ongoing outage or degradation of a core or critical production service which significantly disrupts university/business operations. A major incident has at least one of the following characteristics:

A core or critical service is (or is expected to be) inaccessible, unusable, or has performance issues which prevent normal operational usage, and the impact is widespread (e.g., affecting departments, campuses, a building or buildings).
Time-critical business processes important to the university are hindered or stopped as a result of issues with core or critical service(s) regardless of the number of users affected.
4Help/FASTR (Faculty and Staff Technology Resources) determine there is a major incident related to a core/critical service based on specific individuals and/or buildings experiencing an issue.

Major Incident Definition

A major incident can be declared only when the following conditions are met:

An incident record exists (submitted by a user or created by technical personnel)
The incident impacts a core or critical service
The incident's priority ranks P1 (critical) or P2 (high)
Incident work notes indicate an outage or degradation
The incident has been proposed as a major incident
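
The conditions above can be sketched as a simple pre-check. This is a minimal illustration, not part of the ServiceNow configuration: the `Incident` class and its field names are hypothetical, and the sketch assumes that all five conditions must hold before a Major Incident Manager reviews the proposal.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape; these are illustrative names, not ServiceNow fields.
    record_exists: bool
    affects_core_or_critical_service: bool
    priority: str  # "P1" through "P4"
    work_notes_indicate_outage_or_degradation: bool
    proposed_as_major: bool


def unmet_preconditions(incident: Incident) -> list[str]:
    """Return the conditions the incident fails; an empty list means it can be reviewed."""
    checks = [
        (incident.record_exists, "no incident record exists"),
        (incident.affects_core_or_critical_service, "service is not core or critical"),
        (incident.priority in {"P1", "P2"}, "priority is not P1 or P2"),
        (
            incident.work_notes_indicate_outage_or_degradation,
            "work notes do not indicate an outage or degradation",
        ),
        (incident.proposed_as_major, "incident has not been proposed as a major incident"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Returning the list of unmet conditions, rather than a single yes/no answer, gives the reviewer ready-made text for the work note that explains a rejection.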

Clarification for sensitive public-security services: Any outage of a service that underpins the VT Alerts platform will not be posted as an outage on the IT Status page and communications will be limited to internal division channels only: Technical Team and VPIT/Senior Leaders communication channels. We will not communicate to external channels, including departmental IT channels (IT Council and Techsupport groups).

When a major incident is proposed, a Major Incident Manager decides whether to declare it a major incident. The Service Owner will be notified when an incident is proposed as major, and the Major Incident Manager will make the best effort to get input from the Service Owner; however, the Major Incident Manager may declare a major incident without Service Owner input.

An incident will be declared as major if it meets one of the following criteria:

A Service Owner proposes a major incident for their own service
The incident is proposed by the FASTR team and designated as warranting a major incident
An incident is proposed as major with a P1 or P2 priority and corroborated by other reports or information (especially input from Service Owner, Major Incident Manager's further investigation, rate of incoming or recent incidents with similar symptoms)
The Major Incident Manager and at least one other Division of IT or departmental IT staff member can reproduce the symptoms of the service outage or degradation
4Help or frontline support personnel have proposed multiple incidents as major with the same symptoms or for the same service

A Major Incident Manager will reject a major incident proposal if:

The proposed incident is not for a core/critical service
The core/critical service was on the SAMS calendar for a scheduled maintenance that indicated the service would be unavailable
The Service Owner informs the Major Incident Manager the service is not in use (e.g., Course Registration outage when not being used for registration), assuming incident volume and circumstances support this claim
No corroborating information is available from the affected Service Owner, Technical Lead, Service Provider, other incidents, or other technical personnel
The proposed incident lacks the information needed to warrant a major incident, such as work notes, justification, or an appropriate priority

Examples of Incidents vs Major Incidents
Incident	Major Incident
User unable to log into Canvas	Canvas is down and no users can login
Shared printer is not working for multiple users	Network is down on the academic side of campus
User cannot access a particular Banner form they used to be able to access	VT Active Directory failures across the university
User cannot receive SMS or calls from Duo	HokieMart is experiencing slowness
User is unable to send/receive e-mail	Duo reports that they are unable to send SMS or phone calls

Process

View the Process Diagram

Any incident fulfiller (especially 4Help, departmental IT, or other frontline support) logs, categorizes, prioritizes and performs initial diagnosis of an incident. A user may have submitted the incident, or it may have been created by technical personnel who detected an issue.
An incident fulfiller proposes the incident as a major incident candidate based on major incident scope.
1. The Major Incident Manager and Service Owner are notified when an incident has been proposed as major.
2. If Service Owner(s), Technical Lead(s), or Service Provider(s) are aware of an outage or degradation, they may begin investigation and troubleshooting. Any production changes or significant events/findings should be recorded.
A Major Incident Manager reviews the proposed incident and accepts or rejects it. This review includes checking the proposed incident against major incident criteria, gathering information, and seeking input of impacted Service Owner(s).
1. If the incident is deemed major, then the Major Incident Manager promotes it to a major incident and self-assigns it.
2. If the incident is not deemed major, then the Major Incident Manager demotes it with explanation and re-assigns the incident to the proposer.
The Major Incident Manager, after declaring a major incident:
1. Establishes a war room and major incident channel.
2. Gathers impacted Service Owner(s), Service Provider(s), Technical Lead(s) in a war room and major incident channel. The Major Incident Manager is present in the war room for the duration of the major incident.
  1. Service Owner(s), Service Provider(s), Technical Lead(s) join the war room and major incident channel.
    1. The war room is used for coordination of the major incident, inter-team technical troubleshooting, and developing stakeholder communications. The participants should only be those actively involved in coordinating or troubleshooting the major incident:
      1. Service Owner(s) and Technical Lead(s) coordinate changes to production so changes are implemented in a deliberate, controlled way. This includes coordinating with other Service Owner(s), Technical Lead(s), and Service Provider(s).
      2. Technical Lead directs Service Provider team in troubleshooting to eventual resolution of the issue.
      3. Service Owner(s) provide regular updates to the Major Incident Manager, who will in turn communicate updates to stakeholders.
      4. Service Owner(s) notify the Major Incident Manager of additional technical resources needed. The Major Incident Manager will recruit individuals/teams needed.
      5. The Major Incident Manager apprises Service Owner(s) of significant new reports or environmental changes.
    2. The MI (major incident) channel is used for logging production changes, attempted fixes and results, and significant events:
      1. Technical Lead(s) and Service Owner(s) coordinate logging changes to production, attempted technical fixes, and results as they occur in the MI channel.
      2. The Major Incident Manager, Service Owner(s), Technical Lead(s), and Service Provider team(s) record significant events (e.g., new symptoms, expanded scope of impact).
      3. The MI channel is used to get added teams or individuals up to speed in the case the major incident is escalated and expanded to involve other teams.
    3. The Major Incident Manager records significant activities in the major incident record.
3. The Major Incident Manager informs stakeholders of the major incident and provides regular updates:
  1. Post an initial IT status with input from Service Owner(s).
  2. Notify support personnel of major incident declaration in the #itsupport-csc channel
  3. Send initial stakeholder communications with input from Service Owner(s).
Troubleshooting and communication activities continue, with escalation as needed, until resolution. If the incident spans business hours and non-business hours, the current Major Incident Manager will brief the incoming Major Incident Manager, who will take over responsibilities.
The Service Owner(s) notify the Major Incident Manager of resolution of the Major Incident.
1. The Major Incident Manager resolves the major incident record and sends resolution notifications to stakeholders.
2. The Major Incident Manager replaces the IT status with an informational status to be removed at EOB of current or next business day.
The Major Incident Manager assigns Service Owner(s) to complete the after-action report or contribute to its completion. (The AAR template can be found as an attachment at the bottom of this article.)
1. Service Owner(s) consult with their Technical Lead and Service Provider to complete after-action report.
The Major Incident Manager schedules an after-action review meeting, if necessary, to review the after-action report and gather feedback from participants on organizational response to the major incident.
The Major Incident Manager sends the after-action report to internal stakeholders.

Clarifications:

Outside of normal business hours or if the designated Major Incident Manager is unreachable, 4Help will contact the supervisor on-call to serve as the Major Incident Manager.
For core or critical services with verified outage or degradation lasting a brief time (seconds or minutes):
- The Major Incident Manager will create a major incident with an immediately ending outage/degradation and post an informational status acknowledging the temporary outage/degradation. The Major Incident Manager will work with Service Owner(s) to develop and send a brief communication to stakeholders notifying them of the issue. No after-action report or after-action review will be required.
If the Service Owner cannot be reached, the Major Incident Manager will notify a group of major incident contacts who can fulfill the Service Owner's responsibilities.
If there are simultaneous, separate major incidents, a Major Incident Manager may recruit an additional Major Incident Manager who will be assigned to assist with management activities of the potentially separate major incident. Both major incident managers will maintain active communications to coordinate efforts and share information.
The IT Service Status list has some non-core/critical services for which an IT status can be posted, but would not be in scope of the major incident process.
The ITSO Major Security Incident Process can be found here: Major Security Incident Process.

How to Associate an Incident with a Major Incident: Link to knowledgebase article.

Glossary

Core/critical service: A core service is a foundational IT system, infrastructure, platform, or process that is widely leveraged by stakeholders across the university. A critical service is a mission-critical service that requires continuous availability. Breaks in a mission-critical service are intolerable and may immediately cause damage to the university's mission. Only services identified as core or critical can trigger major incidents. Core/critical services are identified by the VP for IT, Division of IT senior leadership, or a Division of IT service owner. A new core/critical service can be added by submitting an incident at 4help.vt.edu to the attention of the ServiceNow team. See Appendix E for a list of core/critical services.
Degradation: Some or all users cannot access some features, experience performance issues, or experience intermittent symptoms.
Impact: The measure of the effect of an incident on business processes or how service levels will be affected. The impact may be defined by accounting for the number of affected users or buildings/locations affected.
Major Incident: A major incident is an unplanned outage or degradation of a core/critical production service.
Outage: No users can utilize any features of the service.
Priority: The combined measure of impact and urgency which determines low, normal, high, or critical support priority.
Urgency: The importance of the service encountering an issue and a measure of how long it will be until the incident has significant impact on the business. It may be viewed as the degree to which the resolution of the incident can be delayed.

Appendix A: Roles and Responsibilities

4Help, frontline IT support: any incident fulfiller, though especially 1st-tier, higher-tier support, and departmental IT.
- Record, assess, classify, and diagnose incidents
- Propose an incident as a major incident
- Associate an incident to an open major incident
- Assist with validating resolution of incidents associated with a major incident
- 4help only: For incidents proposed as major incidents after-hours, contact the on-call supervisor to fulfill Major Incident Manager role
Major Incident Manager: coordinates organizational response to a major incident, oversees communications.
- Ensure a Major Incident Manager is actively investigating a proposed major incident
- Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
- Review and investigate proposed major incidents
- Check SAMS calendar
- Promote/escalate proposed major incident to major incident
- Establish a war room and major incident channel and bring in needed roles
- Be available in the war room for the duration of the major incident or until another Major Incident Manager is named at a shift change.
- Communicate major incident initiation, updates, resolution, and after-action review summary to internal and external stakeholders
- Manage IT service status postings
- Apprise Service Owner(s) of significant events or environmental changes
- Coordinate the overall major incident
- Recruit the appropriate technical resources, escalating to resource managers if needed
- Coordinate verifying incident resolution with frontline support and inform Service Owner(s) of results
- Resolve major incident with input from Service Owner(s)
- Assign Service Owner(s) to complete or contribute to the after-action report
- Schedule and lead after-action review with involved individuals and teams
- Brief new Major Incident Manager in case of handoff (e.g., business hours to after-hours)
Service Owner: an individual in the Division of IT accountable for delivery of a core or critical service.
- Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
- Investigate a proposed major incident and provide information to the Major Incident Manager
- Ensure a Technical Lead from the Service Provider team is identified and actively working on the major incident
- Identify a Technical Lead from Service Provider group if one is not available
- Participate in war room and MI channel
- Provide regular updates to the Major Incident Manager
- Supply information/content for communications and review draft stakeholder communications (initial, regular, and resolution communications) as requested by the Major Incident Manager
- Coordinate with the Technical Lead and ensure logging of production changes, troubleshooting activities, and results of troubleshooting to the major incident channel
- Apprise the Major Incident Manager of significant events or environmental changes
- Notify the Major Incident Manager if additional technical resources are needed
- Notify Business Owner of a major incident with a core/critical service by directing them to IT Service Status
- Inform the Major Incident Manager when a major incident can be resolved.
- Complete or contribute to the after-action report as assigned by Major Incident Manager
- Attend after-action review meeting as assigned by Major Incident Manager
Business Owner: does not have responsibilities or activities defined in major incident response. This is typically an individual outside the Division of IT that owns the service from a university business standpoint, but is not engaged in service delivery activities.
Technical Lead: A member of the Service Provider team which provides support for a core/critical service and oversees technical troubleshooting activities.
- Begin investigation and troubleshooting prior to major incident declaration
- Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
- Participate in war room and MI channel
- Lead and coordinate troubleshooting efforts via war room of respective Service Provider team members
- Coordinate efforts under direction of Service Owner
- Coordinate troubleshooting efforts via war room with other Technical Lead(s)
- Ensure changes to production systems, attempted fixes, and their results are logged in major incident channel (in addition to regular change management practices)
- Complete after-action review as assigned by Major Incident Manager
- Contribute content to after-action review as assigned by Service Owner
- Attend after-action review meeting (as requested)
Service Provider: a group responsible for delivering or supporting a core/critical service
- Participate in war room and MI channel
- Troubleshoot and resolve the major incident under direction of Technical Lead and in coordination with other Service Provider teams via war room.
- Log production changes or significant troubleshooting activities and their results in major incident channel
- Contribute content to after-action review as assigned by Service Owner or Technical Lead
- Attend after-action review meeting (as requested)

Appendix B: Priority, Impact, and Urgency

The priority of an incident, as determined by impact and urgency, is considered when declaring a major incident.

Impact: The measure of the effect of an incident on business processes or how service levels will be affected. The impact may be defined by accounting for the number of affected users or buildings/locations affected.
- University: all users of an IT service are affected, multiple IT services are affected, or a campus or campuses are affected
- Multiple users/building(s): affects many users of an IT service, a small group of users, department, or building
- Single User: single user or room affected
Urgency: A measure of how long it will be until the incident has significant business impact. It may be viewed as the degree to which the resolution of the incident can be delayed.
- Critical: halts a time-sensitive, critical university/business process, causes or indicates a major security issue, all functions of service are unusable or cannot be accessed and no workaround exists, rapidly increasing sphere of impact, rapidly causing incidents
- High: impedes a university/business process, major inconvenience, slowly increasing sphere of impact, causing incidents at a slow rate
- Normal: minor inconvenience, incident effects are stable
- Low: not time-sensitive

Impact and urgency determine an incident's priority as shown in the prioritization matrix below.

Prioritization Matrix (rows: Impact; columns: Urgency)
Impact	Critical	High	Normal	Low
University	P1	P1	P2	P4
Multiple users/building(s)	P1	P2	P2	P4
Single User	P2	P2	P3	P4
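
The matrix can be read as a two-key lookup. The sketch below transcribes it into Python for illustration. The names `PRIORITY_MATRIX`, `priority_for`, and `meets_major_incident_priority` are hypothetical and do not refer to anything in ServiceNow.

```python
# Column order of the matrix, left to right.
URGENCIES = ("Critical", "High", "Normal", "Low")

# One row per impact level, transcribed from the prioritization matrix.
PRIORITY_MATRIX = {
    "University": ("P1", "P1", "P2", "P4"),
    "Multiple users/building(s)": ("P1", "P2", "P2", "P4"),
    "Single User": ("P2", "P2", "P3", "P4"),
}


def priority_for(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair; raises on unknown values."""
    row = PRIORITY_MATRIX[impact]
    return row[URGENCIES.index(urgency)]


def meets_major_incident_priority(impact: str, urgency: str) -> bool:
    """True when the resulting priority is P1 or P2, as the declaration conditions require."""
    return priority_for(impact, urgency) in {"P1", "P2"}
```

For example, `priority_for("Single User", "Critical")` yields P2, so even a single-user incident can reach the priority threshold when its urgency is critical.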

Appendix C: Backup IT services

Purpose	Primary (default) service	Secondary (if primary down)	Tertiary (if secondary down)
Troubleshooting	Zoom	Slack	Teams
Logging changes/events	Slack	Teams
Communications	Email	Slack	Teams
Incident Handling	ServiceNow	Manual record
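
The fallback order in the table can be expressed as an ordered list per purpose. This is a minimal sketch for illustration: `BACKUP_SERVICES` and `pick_service` are hypothetical names, and how a service is judged unavailable is left to the caller.

```python
from typing import Optional

# Services per purpose, in fallback order (primary first), transcribed from the table.
BACKUP_SERVICES = {
    "Troubleshooting": ["Zoom", "Slack", "Teams"],
    "Logging changes/events": ["Slack", "Teams"],
    "Communications": ["Email", "Slack", "Teams"],
    "Incident Handling": ["ServiceNow", "Manual record"],
}


def pick_service(purpose: str, unavailable: set[str]) -> Optional[str]:
    """Return the first listed service for the purpose that is not marked unavailable."""
    for service in BACKUP_SERVICES[purpose]:
        if service not in unavailable:
            return service
    # Every listed option is down; the table defines no further fallback.
    return None
```

With Zoom down, `pick_service("Troubleshooting", {"Zoom"})` selects Slack, which matches the checklist guidance to use Slack as the war room when Zoom is unavailable.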

Appendix D: Communications Templates and Timeline

View Formal Communications Templates here

Estimated Time (after MI proposal)	Item	Sender	Receiver	Communication Channel/Medium
0-10 mins.	Notice of major incident proposal	MI Manager	Service Owner, Service Provider, others	Email, #itsupport-csc (Slack)
10-20 mins.	Notice of major incident declaration	MI Manager	Service Owner, Service Provider, 4Help, others	Slack, IT Status
10-20 mins.	War room/major incident channel notification	MI Manager	Service Owner, Service Provider	Slack, Email
20-30 mins.	Initial MI communications	MI Manager	Stakeholders	Email
As available, at least hourly	Technical updates	Service Owner	MI Manager	Zoom War Room
Hourly	Major Incident updates	MI Manager	Stakeholders	Email, IT Status
0-15 mins. after resolution is confirmed	Incident resolution	Service Owner	MI Manager	MI Slack Channel, Zoom War Room, others
15-30 mins. after resolution is confirmed	Incident resolution	MI Manager	Stakeholders	Email, IT Status
15-30 mins. after resolution is confirmed	Informational status after incident resolution	MI Manager	Stakeholders	IT Status post (informational)

Stakeholders:

Technical (troubleshooting): Service Owner, Service Provider, Technical Lead
Technical (internal awareness): 4Help and other service owners/providers
Internal: VPIT/Senior Leadership Team, Service Owner(s), IT Communications Team
External: techsupport-g email list; broader university community; customers
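
The timeline can be turned into concrete clock times once the proposal time is known. The sketch below is illustrative only: the function names are hypothetical, it treats the upper bound of each estimated window as the latest send time, and it assumes hourly stakeholder updates are counted from the initial communication.

```python
from datetime import datetime, timedelta

# Upper bound of each estimated window, in minutes after the major incident proposal.
PROPOSAL_WINDOWS = [
    ("Notice of major incident proposal", 10),
    ("Notice of major incident declaration", 20),
    ("War room/major incident channel notification", 20),
    ("Initial MI communications", 30),
]


def communication_deadlines(proposed_at: datetime) -> list[tuple[str, datetime]]:
    """Return (item, latest send time) pairs for the pre-resolution notices."""
    return [
        (item, proposed_at + timedelta(minutes=minutes))
        for item, minutes in PROPOSAL_WINDOWS
    ]


def stakeholder_update_times(first_sent_at: datetime, until: datetime) -> list[datetime]:
    """Return hourly update times after the initial communication, up to and including `until`."""
    times = []
    next_update = first_sent_at + timedelta(hours=1)
    while next_update <= until:
        times.append(next_update)
        next_update += timedelta(hours=1)
    return times
```

A Major Incident Manager could use output like this as a personal reminder list. It does not replace the communication tasks in the major incident record.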

Communication Streams outside of Major Incident Process Defined Templates:

Additional communication to other applicable channels such as social media or external Slack channels will be discussed and agreed upon by the Major Incident Manager, Service Owner and the Division Communications Team prior to being sent by the Division of IT Communications Team.

Appendix E: Core and Critical Services, Service Owners, and Service Providers (Support Groups)

Appendix F: Major Incident Manager Checklist

Timeline from time of an incident being proposed as a major incident:

T+10 minutes

Ensure a member of Major Incident Manager group is investigating.
Read existing information on proposed incident (Major Incident Candidate).
Check Planned Maintenance (SAMS) calendar to see if related service had planned unavailability. Was the service announced to be unavailable? Take note of any planned maintenance that happened earlier.
Notify Slack channel #itsupport-csc of major incident proposal and short description of report. Ask for others experiencing same symptoms or issues.
Contact the Service Owner via Slack direct message. Gather information, including:
1. Who is the Technical Lead for the service?
2. Are you aware of an outage or degradation with the service?
3. Have you engaged technical resources to troubleshoot?
4. Do you know the extent of impact (users, locations, other services affected)?
If Service Owner is unresponsive/unavailable, reach out to their backup group members via Slack to identify someone to fill in for Service Owner responsibilities.

T+11 to T+20

Accept or reject proposed major incident, with reasoning via work note. See Major Incident Definition for criteria to aid in making decision.
Establish/join war room (Zoom) and claim host using key.
1. If Zoom unavailable, use Slack #it-major-incident channel as a dual war room and logging location.
Convene Service Owner(s), Technical Lead(s), and needed Service Provider team(s) in war room (contact them using the VT Technical Communication task and with same content via direct Slack message).
At start of war room gather initial information:
1. Is the issue an outage or degradation?
2. What is the extent of impact (who is impacted, how are they impacted)?
3. What are the symptoms?
4. Is there a workaround?
5. Are additional individuals/teams needed?
Add affected Service Owner(s), Technical Lead(s), Service Provider(s) to the #it-major-incident Slack channel.
Post initial IT status with Service Owner input.
Notify #itsupport-csc of major incident declaration, include record number they can associate incidents with.
Depending on scale of issue (e.g., multiple Service Owners, Technical Leads, teams involved), create a 'Technical troubleshooting' breakout room. Direct Technical Leads and Service Provider teams to troubleshoot issue in breakout room.

T+21 to T+30

Draft the initial notification to stakeholders with input from Service Owner(s).
Send initial notification to VPIT / Senior Leadership with input from Service Owner(s).
Send initial notification to departmental IT.

Ongoing:

Track significant events or activities in the #it-major-incident channel.
Communicate regularly to VPIT / Senior Leadership using communication tasks and update IT Service Status regularly. Communicate timing of next status update if longer than an hour is expected between updates.
Provide instructions for workaround or steps for remediation to frontline support.
Bring additional teams/resources into war room and major incident channel.
For major incidents bridging between business-hours/outside business hours, brief replacement Major Incident Manager before handing off the major incident.

Incident Resolution:

Check with Service Owner(s) that major incident can be resolved.
- Work with frontline support to help verify resolution.
Post an additional comment to the major incident record: "We believe we have corrected the issue you were experiencing and are resolving your request for help. Please reply if you are still experiencing issues."
Send resolution communication tasks to stakeholders.
Except for the after-action report communication task, close any running communication tasks.
Click the 'Resolve' button on the major incident.
Notify #itsupport-csc of resolution.
Replace the IT status with informational message. Set time to take informational message down (typically EOB).

After-action report and review:

1. Create a collaboratively edited Google Document on the Major Incident shared Google Team Drive using the After-Action Report template. Name the document with the service(s) impacted by the event and the date the Major Incident started.
2. Assign Service Owner(s) to complete the after-action report, provide link, and set a due date – typically 3 to 5 business days.
3. When necessary, as determined by the Major Incident Manager, in consultation with the Service Owner and their Senior Leader, an After-Action Review (postmortem) meeting will be scheduled. The purpose of the meeting is to capture any improvements for the service, internal processes, and the Major Incident Process.
4. When an After-Action Review meeting is necessary, members of the senior leadership team and the Service Owner will be invited to participate. In addition, the Major Incident Manager will update the After-Action Review, if needed, based on additional findings from the meeting.
5. Update the Major Incident After-Action Review section.
6. Send the After-Action Report using the Communication Task.
7. Close the major incident record.

Process Change History

Major changes to the process are documented here.

Date	Author(s)	Description of Change
2021-04-12	David Duckett, Lucas Sullivan, Joyce Landreth	Initial version of process.
2021-08-25	Joyce Landreth	Added clarification to Appendix D: Communications
2021-09-28	Lucas Sullivan, Joyce Landreth	Added section on Major Security Incident: The ITSO Major Security Incident Process can be found here: https://4help.vt.edu/sp?id=kb_article&sys_id=71a269241b32b810688b2f82604bcbdc
2022-01-20	Lucas Sullivan, Joyce Landreth	Added clarification to After-Action Report Section and updated communications templates. Moved communication templates to an attachment to this process.
2022-07-21	Joyce Landreth	Targeted improvements made: (1) A knowledgebase article will be published to instruct distributed partners to attach related articles to Major Incidents if they receive incidents or calls. (2) User Engagement has developed a Google form to allow us to take calls if our ServiceNow instance is down or unavailable due to local authentication issues.
2022-09-14	Joyce Landreth	Added clarification for sensitive public-security services in the Major Incident Definition section.
