Introduction
A major incident is an emerging or ongoing outage or degradation of a core or critical production service which significantly disrupts university/business operations.
Purpose
Formulate a coordinated response to major incidents with goals to:
- Restore core/critical production services to normal operations as soon as possible.
- Inform stakeholders of core/critical production service outages or degradations.
- Support activities to identify the cause of issues and review the organizational response through after-action reviews.
Major Incident Scope
A major incident is an emerging or ongoing outage or degradation of a core or critical production service which significantly disrupts university/business operations. A major incident has at least one of the following characteristics:
- A core or critical service is (or is expected to be) inaccessible, unusable, or has performance issues which prevent normal operational usage, and the impact is widespread (e.g., affecting departments, campuses, a building or buildings).
- Time-critical business processes important to the university are hindered or stopped as a result of issues with core or critical service(s) regardless of the number of users affected.
- 4Help/FASTR (Faculty and Staff Technology Resources) determine there is a major incident related to a core/critical service based on specific individuals and/or buildings experiencing an issue.
Major Incident Definition
A major incident can only be declared under the conditions that:
- An incident record exists (submitted by a user or created by technical personnel)
- The incident impacts a core or critical service
- The incident's priority ranks P1 (critical) or P2 (high)
- Incident work notes indicate an outage or degradation
- The incident has been proposed as a major incident
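The preconditions above can be sketched as a single check. This is a minimal illustration, assuming that all five conditions must hold together; the `IncidentRecord` class and its field names are hypothetical and do not reflect the ServiceNow schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical stand-in for an incident record (illustrative fields only)."""
    service_is_core_or_critical: bool
    priority: str                                  # "P1" through "P4"
    work_notes_show_outage_or_degradation: bool
    proposed_as_major: bool


def may_be_declared_major(incident: Optional[IncidentRecord]) -> bool:
    """Return True only if every listed precondition holds (assumed to be an AND)."""
    if incident is None:
        # No incident record exists, so nothing can be declared.
        return False
    return (
        incident.service_is_core_or_critical
        and incident.priority in {"P1", "P2"}
        and incident.work_notes_show_outage_or_degradation
        and incident.proposed_as_major
    )
```

Passing these preconditions only makes an incident eligible; the Major Incident Manager still decides whether to declare it.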
Clarification for sensitive public-security services: Any outage of a service that underpins the VT Alerts platform will not be posted as an outage on the IT Status page and communications will be limited to internal division channels only: Technical Team and VPIT/Senior Leaders communication channels. We will not communicate to external channels, including departmental IT channels (IT Council and Techsupport groups).
When a major incident is proposed, a Major Incident Manager decides whether to declare it a major incident. The Service Owner will be notified when an incident is proposed as major, and the Major Incident Manager will make the best effort to get input from the Service Owner; however, the Major Incident Manager may declare a major incident without Service Owner input.
An incident will be declared as major if it meets one of the following criteria:
- A Service Owner proposes a major incident for their own service
- The incident is proposed by the FASTR team and designated as warranting a major incident
- An incident is proposed as major with a P1 or P2 priority and corroborated by other reports or information (especially input from Service Owner, Major Incident Manager's further investigation, rate of incoming or recent incidents with similar symptoms)
- The Major Incident Manager and at least one other Division of IT or departmental IT staff member can reproduce the symptoms of the service outage or degradation
- 4Help or frontline support personnel have proposed multiple incidents as major with same symptoms or for the same service
A Major Incident Manager will reject a major incident proposal if:
- The proposed incident is not for a core/critical service
- The core/critical service was on the SAMS calendar for a scheduled maintenance that indicated the service would be unavailable
- The Service Owner informs the Major Incident Manager the service is not in use (e.g., Course Registration outage when not being used for registration), assuming incident volume and circumstances support this claim
- No corroborating information is available from the affected Service Owner, Technical Lead, Service Provider, other incidents, or other technical personnel
- The proposed incident lacks the information needed to warrant a major incident, such as work notes, justification, or an appropriate priority
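The rejection criteria above lend themselves to a function that returns the reasons a proposal would be rejected, which could then feed the work note that records the decision. This is a rough sketch only: the `Proposal` class and its field names are hypothetical, and each flag stands in for a judgment the Major Incident Manager makes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """Hypothetical summary of a proposed major incident (illustrative fields only)."""
    core_or_critical: bool               # service is identified as core/critical
    scheduled_unavailable_on_sams: bool  # SAMS calendar announced the service as unavailable
    owner_says_not_in_use: bool          # Service Owner's claim, supported by volume and circumstances
    corroborated: bool                   # other reports or personnel confirm the issue
    sufficiently_documented: bool        # work notes, justification, appropriate priority


def rejection_reasons(p: Proposal) -> List[str]:
    """Collect every rejection criterion that applies; an empty list means none apply."""
    reasons = []
    if not p.core_or_critical:
        reasons.append("not a core/critical service")
    if p.scheduled_unavailable_on_sams:
        reasons.append("scheduled maintenance on the SAMS calendar")
    if p.owner_says_not_in_use:
        reasons.append("service not in use per Service Owner")
    if not p.corroborated:
        reasons.append("no corroborating information")
    if not p.sufficiently_documented:
        reasons.append("insufficient information in the incident")
    return reasons
```

An empty result does not by itself mean the incident is declared major; it only means none of the listed rejection criteria apply.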
Examples of Incidents vs Major Incidents
Incident:
- User unable to log into Canvas
- Shared printer is not working for multiple users
- User cannot access a particular Banner form they used to be able to access
- User cannot receive SMS or calls from Duo
- User is unable to send/receive e-mail

Major Incident:
- Canvas is down and no users can log in
- Network is down on the academic side of campus
- VT Active Directory failures across the university
- HokieMart is experiencing slowness
- Duo reports that they are unable to send SMS or phone calls
Process
View the Process Diagram
- Any incident fulfiller (especially 4Help, departmental IT, or other frontline support) logs, categorizes, prioritizes and performs initial diagnosis of an incident. A user may have submitted the incident, or it may have been created by technical personnel who detected an issue.
- An incident fulfiller proposes the incident as a major incident candidate based on major incident scope.
- The Major Incident Manager and Service Owner are notified when an incident has been proposed as major.
- If Service Owner(s), Technical Lead(s), or Service Provider(s) are aware of an outage or degradation, they may begin investigation and troubleshooting. Any production changes or significant events/findings should be recorded.
- A Major Incident Manager reviews the proposed incident and accepts or rejects it. This review includes checking the proposed incident against major incident criteria, gathering information, and seeking input of impacted Service Owner(s).
  - If the incident is deemed major, then the Major Incident Manager promotes it to a major incident and self-assigns it.
  - If the incident is not deemed major, then the Major Incident Manager demotes it with explanation and re-assigns the incident to the proposer.
- The Major Incident Manager, after declaring a major incident:
  - Establishes a war room and major incident channel.
  - Gathers impacted Service Owner(s), Service Provider(s), Technical Lead(s) in a war room and major incident channel. The Major Incident Manager is present in the war room for the duration of the major incident.
- Service Owner(s), Service Provider(s), Technical Lead(s) join the war room and major incident channel.
- The war room is used for coordination of the major incident, inter-team technical troubleshooting, and developing stakeholder communications. The participants should only be those actively involved in coordinating or troubleshooting the major incident:
  - Service Owner(s) and Technical Lead(s) coordinate changes to production so changes are implemented in a deliberate, controlled way. This includes coordinating with other Service Owner(s), Technical Lead(s), and Service Provider(s).
  - Technical Lead directs Service Provider team in troubleshooting to eventual resolution of the issue.
  - Service Owner(s) provide regular updates to the Major Incident Manager, who will in turn communicate updates to stakeholders.
  - Service Owner(s) notify the Major Incident Manager of additional technical resources needed. The Major Incident Manager will recruit individuals/teams needed.
  - The Major Incident Manager apprises Service Owner(s) of significant new reports or environmental changes.
- The MI (major incident) channel is used for logging production changes, attempted fixes and results, and significant events:
  - Technical Lead(s) and Service Owner(s) coordinate logging changes to production, attempted technical fixes, and results as they occur in the MI channel.
  - The Major Incident Manager, Service Owner(s), Technical Lead(s), and Service Provider team(s) record significant events (e.g., new symptoms, expanded scope of impact).
  - The MI channel is used to get added teams or individuals up to speed in the case the major incident is escalated and expanded to involve other teams.
- The Major Incident Manager records significant activities in the major incident record.
- The Major Incident Manager informs stakeholders of the major incident and provides regular updates:
  - Post an initial IT status with input from Service Owner(s).
  - Notify support personnel of major incident declaration in the #itsupport-csc channel.
  - Send initial stakeholder communications with input from Service Owner(s).
- Troubleshooting and communication activities continue, with escalation as needed, until resolution. If the issue bridges business-hours and non-business hours, the Major Incident Manager will brief a new Major Incident Manager who will take over responsibilities.
- The Service Owner(s) notify the Major Incident Manager of resolution of the Major Incident.
- The Major Incident Manager resolves the major incident record and sends resolution notifications to stakeholders.
- The Major Incident Manager replaces the IT status with an informational status to be removed at EOB of current or next business day.
- The Major Incident Manager assigns Service Owner(s) to complete the after-action report or contribute to its completion. (The AAR template can be found as an attachment at the bottom of this article.)
- Service Owner(s) consult with their Technical Lead and Service Provider to complete after-action report.
- The Major Incident Manager schedules an after-action review meeting, if necessary, to review the after-action report and gather feedback from participants on organizational response to the major incident.
- The Major Incident Manager sends the after-action report to internal stakeholders.
Clarifications:
- Outside of normal business hours or if the designated Major Incident Manager is unreachable, 4Help will contact the supervisor on-call to serve as the Major Incident Manager.
- For core or critical services with verified outage or degradation lasting a brief time (seconds or minutes):
  - The Major Incident Manager will create a major incident with an immediately ending outage/degradation and post an informational status acknowledging the temporary outage/degradation. The Major Incident Manager will work with Service Owner(s) to develop and send a brief communication to stakeholders notifying them of the issue. No after-action report or after-action review will be required.
- If the Service Owner cannot be reached, the Major Incident Manager will notify a group of major incident contacts who can fulfill the Service Owner's responsibilities.
- If there are simultaneous, separate major incidents, a Major Incident Manager may recruit an additional Major Incident Manager who will be assigned to assist with management activities of the potentially separate major incident. Both major incident managers will maintain active communications to coordinate efforts and share information.
- The IT Service Status list has some non-core/critical services for which an IT status can be posted, but would not be in scope of the major incident process.
- The ITSO Major Security Incident Process can be found here: Major Security Incident Process.
Glossary
- Core/critical service: A core service is a foundational IT system, infrastructure, platform, or process that is widely leveraged by stakeholders across the university. A critical service is a mission-critical service that requires continuous availability. Breaks in a mission-critical service are intolerable and may immediately cause damage to the university's mission. Only services identified as core or critical can trigger major incidents. Core/critical services are identified by the VP for IT, Division of IT senior leadership, or a Division of IT service owner. A new core/critical service can be added by submitting an incident at 4help.vt.edu to the attention of the ServiceNow team. See Appendix E for a list of core/critical services.
- Degradation: Some or all users cannot access some features, experience performance issues, or experience intermittent symptoms.
- Impact: The measure of the effect of an incident on business processes or how service levels will be affected. The impact may be defined by accounting for the number of affected users or buildings/locations affected.
- Major Incident: A major incident is an unplanned outage or degradation of a core/critical production service.
- Outage: No users can utilize any features of the service.
- Priority: The combined measure of impact and urgency which determines low, normal, high, or critical support priority.
- Urgency: The importance of the service encountering an issue and a measure of how long it will be until the incident has significant impact on the business. It may be viewed as the degree to which the resolution of the incident can be delayed.
Appendix A: Roles and Responsibilities
- 4Help, frontline IT support: any incident fulfiller, though especially 1st-tier, higher-tier support, and departmental IT.
  - Record, assess, classify, and diagnose incidents
  - Propose an incident as a major incident
  - Associate an incident with an open major incident
  - Assist with validating resolution of incidents associated with a major incident
  - 4Help only: For incidents proposed as major incidents after-hours, contact the on-call supervisor to fulfill the Major Incident Manager role
- Major Incident Manager: coordinates organizational response to a major incident, oversees communications.
  - Ensure a Major Incident Manager is actively investigating a proposed major incident
  - Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
  - Review and investigate proposed major incidents
  - Check SAMS calendar
  - Promote/escalate proposed major incident to major incident
  - Establish a war room and major incident channel and bring in needed roles
  - Be available in the war room for the duration of the major incident or until another Major Incident Manager is named at a shift change
  - Communicate major incident initiation, updates, resolution, and after-action review summary to internal and external stakeholders
  - Manage IT service status postings
  - Apprise Service Owner(s) of significant events or environmental changes
  - Coordinate the overall major incident
  - Recruit the appropriate technical resources, escalating to resource managers if needed
  - Coordinate verifying incident resolution with frontline support and inform Service Owner(s) of results
  - Resolve major incident with input from Service Owner(s)
  - Assign Service Owner(s) to complete or contribute to the after-action report
  - Schedule and lead after-action review with involved individuals and teams
  - Brief new Major Incident Manager in case of handoff (e.g., business hours to after-hours)
- Service Owner: an individual in the Division of IT accountable for delivery of a core or critical service.
  - Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
  - Investigate a proposed major incident and provide information to the Major Incident Manager
  - Ensure a Technical Lead from the Service Provider team is identified and actively working on the major incident
  - Identify a Technical Lead from Service Provider group if one is not available
  - Participate in war room and MI channel
  - Provide regular updates to the Major Incident Manager
  - Supply information/content for communications and review draft stakeholder communications (initial, regular, and resolution communications) as requested by the Major Incident Manager
  - Coordinate with the Technical Lead and ensure logging of production changes, troubleshooting activities, and results of troubleshooting to the major incident channel
  - Apprise the Major Incident Manager of significant events or environmental changes
  - Notify the Major Incident Manager if additional technical resources are needed
  - Notify Business Owner of a major incident with a core/critical service by directing them to IT Service Status
  - Inform the Major Incident Manager when a major incident can be resolved
  - Complete or contribute to the after-action report as assigned by Major Incident Manager
  - Attend after-action review meeting as assigned by Major Incident Manager
- Business Owner: does not have responsibilities or activities defined in major incident response. This is typically an individual outside the Division of IT that owns the service from a university business standpoint, but is not engaged in service delivery activities.
- Technical Lead: A member of the Service Provider team which provides support for a core/critical service and oversees technical troubleshooting activities.
  - Begin investigation and troubleshooting prior to major incident declaration
  - Set contact information in ServiceNow and subscribe to receive SMS notifications for Major Incident notifications
  - Participate in war room and MI channel
  - Lead and coordinate troubleshooting efforts via war room of respective Service Provider team members
  - Coordinate efforts under direction of Service Owner
  - Coordinate troubleshooting efforts via war room with other Technical Lead(s)
  - Ensure changes to production systems, attempted fixes, and their results are logged in major incident channel (in addition to regular change management practices)
  - Complete after-action review as assigned by Major Incident Manager
  - Contribute content to after-action review as assigned by Service Owner
  - Attend after-action review meeting (as requested)
- Service Provider: a group responsible for delivering or supporting a core/critical service
  - Participate in war room and MI channel
  - Troubleshoot and resolve the major incident under direction of Technical Lead and in coordination with other Service Provider teams via war room
  - Log production changes or significant troubleshooting activities and their results in major incident channel
  - Contribute content to after-action review as assigned by Service Owner or Technical Lead
  - Attend after-action review meeting (as requested)
Appendix B: Priority, Impact, and Urgency
The priority of an incident, as determined by impact and urgency, is considered when declaring a major incident.
- Impact: The measure of the effect of an incident on business processes or how service levels will be affected. The impact may be defined by accounting for the number of affected users or buildings/locations affected.
  - University: all users of an IT service are affected, multiple IT services are affected, or a campus or campuses are affected
  - Multiple users/building(s): affects many users of an IT service, a small group of users, department, or building
  - Single User: single user or room affected
- Urgency: A measure of how long it will be until the incident has significant business impact. It may be viewed as the degree to which the resolution of the incident can be delayed.
  - Critical: halts a time-sensitive, critical university/business process, causes or indicates a major security issue, all functions of service are unusable or cannot be accessed and no workaround exists, rapidly increasing sphere of impact, rapidly causing incidents
  - High: impedes a university/business process, major inconvenience, slowly increasing sphere of impact, causing incidents at a slow rate
  - Normal: minor inconvenience, incident effects are stable
  - Low: not time-sensitive
Impact and urgency determine an incident's priority as shown in the prioritization matrix below.
Prioritization Matrix
| Impact / Urgency | Critical | High | Normal | Low |
| University | P1 | P1 | P2 | P4 |
| Multiple users/building(s) | P1 | P2 | P2 | P4 |
| Single User | P2 | P2 | P3 | P4 |
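The prioritization matrix is a two-key lookup, so it can be written as a nested dictionary. The sketch below copies the impact and urgency labels and the P1 to P4 values from the matrix; the function names `priority` and `meets_major_incident_priority` are illustrative, not part of any tool used in this process.

```python
# Priority by impact (outer key) and urgency (inner key), as in the matrix.
PRIORITY_MATRIX = {
    "University": {
        "Critical": "P1", "High": "P1", "Normal": "P2", "Low": "P4",
    },
    "Multiple users/building(s)": {
        "Critical": "P1", "High": "P2", "Normal": "P2", "Low": "P4",
    },
    "Single User": {
        "Critical": "P2", "High": "P2", "Normal": "P3", "Low": "P4",
    },
}


def priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair; raises KeyError on unknown labels."""
    return PRIORITY_MATRIX[impact][urgency]


def meets_major_incident_priority(impact: str, urgency: str) -> bool:
    """A major incident requires a priority of P1 (critical) or P2 (high)."""
    return priority(impact, urgency) in {"P1", "P2"}
```

For example, `priority("Single User", "Normal")` returns `"P3"`, which falls below the P1/P2 threshold for a major incident.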
Appendix C: Backup IT services
| Purpose | Primary (default) service | Secondary (if primary down) | Tertiary (if secondary down) |
| Troubleshooting | Zoom | Slack | Teams |
| Logging changes/events | Slack | Teams | |
| Communications | Email | Slack | Teams |
| Incident Handling | ServiceNow | Manual record | |
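The backup table describes an ordered fallback: use the primary service unless it is down, then the secondary, then the tertiary. A minimal sketch of that selection follows, with service names taken from the table; the `pick_service` function and the idea of passing in a set of unavailable services are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Services per purpose, listed in fallback order (primary first).
BACKUP_SERVICES: Dict[str, List[str]] = {
    "Troubleshooting": ["Zoom", "Slack", "Teams"],
    "Logging changes/events": ["Slack", "Teams"],
    "Communications": ["Email", "Slack", "Teams"],
    "Incident Handling": ["ServiceNow", "Manual record"],
}


def pick_service(purpose: str, unavailable: Set[str]) -> Optional[str]:
    """Return the first listed service for the purpose that is not unavailable."""
    for service in BACKUP_SERVICES[purpose]:
        if service not in unavailable:
            return service
    # Every listed option is down; the table defines no further fallback.
    return None
```

With Zoom down, `pick_service("Troubleshooting", {"Zoom"})` selects Slack, which matches the checklist instruction to use Slack as the war room when Zoom is unavailable.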
Appendix D: Communications Templates and Timeline
View Formal Communications Templates here
| Estimated Time (after MI proposal) | Item | Sender | Receiver | Communication Channel/Medium |
| 0-10 mins. | Notice of major incident proposal | MI Manager | Service Owner, Service Provider, others | Email, #itsupport-csc (Slack) |
| 10-20 mins. | Notice of major incident declaration | MI Manager | Service Owner, Service Provider, 4Help, others | Slack, IT Status |
| | War room/major incident channel notification | MI Manager | Service Owner, Service Provider | Slack, Email |
| 20-30 mins. | Initial MI communications | MI Manager | Stakeholders | Email |
| As available, at least hourly | Technical updates | Service Owner | MI Manager | Zoom War Room |
| Hourly | Major Incident updates | MI Manager | Stakeholders | Email, IT Status |
| 0-15 mins. after resolution is confirmed | Incident resolution | Service Owner | MI Manager | MI Slack Channel, Zoom War Room, others |
| 15-30 mins. after resolution is confirmed | Incident resolution | MI Manager | Stakeholders | Email, IT Status |
| | Informational status after incident resolution | MI Manager | Stakeholders | IT Status post (informational) |
Stakeholders:
- Technical (troubleshooting): Service Owner, Service Provider, Technical Lead
- Technical (internal awareness): 4Help and other service owners/providers
- Internal: VPIT/Senior Leadership Team, Service Owner(s), IT Communications Team
- External: techsupport-g email list; broader university community; customers
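The early rows of the communications timeline can be turned into target times once the proposal time is known. The sketch below is one possible reading, not a definition of the process: it assumes the upper bound of each estimated window (10, 20, and 30 minutes after proposal) is treated as the target, and the `MILESTONES` list and `milestone_targets` function are illustrative names.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (minutes after proposal, item), using the upper bound of each estimated window.
MILESTONES: List[Tuple[int, str]] = [
    (10, "Notice of major incident proposal"),
    (20, "Notice of major incident declaration"),
    (30, "Initial MI communications"),
]


def milestone_targets(proposed_at: datetime) -> List[Tuple[str, datetime]]:
    """Return each milestone paired with its target time after the proposal."""
    return [
        (item, proposed_at + timedelta(minutes=minutes))
        for minutes, item in MILESTONES
    ]


if __name__ == "__main__":
    # Example proposal time chosen only for illustration.
    for item, due in milestone_targets(datetime(2022, 9, 14, 9, 0)):
        print(f"{due:%H:%M}  {item}")
```

Recurring items such as hourly updates, and items timed from resolution rather than proposal, are left out because they depend on events that occur later in the incident.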
Communication Streams outside of Major Incident Process Defined Templates:
Additional communication to other applicable channels such as social media or external Slack channels will be discussed and agreed upon by the Major Incident Manager, Service Owner and the Division Communications Team prior to being sent by the Division of IT Communications Team.
Appendix E: Core and Critical Services, Service Owners, and Service Providers (Support Groups)
Log in to ServiceNow to view the list of Major Incident declarable services.
Appendix F: Major Incident Manager Checklist
Timeline from time of an incident being proposed as a major incident:
T+10 minutes
- Ensure a member of Major Incident Manager group is investigating.
- Read existing information on proposed incident (Major Incident Candidate).
- Check Planned Maintenance (SAMS) calendar to see if related service had planned unavailability. Was the service announced to be unavailable? Take note of any planned maintenance that happened earlier.
- Notify Slack channel #itsupport-csc of major incident proposal and short description of report. Ask for others experiencing same symptoms or issues.
- Contact the Service Owner via Slack direct message. Gather information, including:
  - Who is the Technical Lead for the service?
  - Are you aware of an outage or degradation with the service?
  - Have you engaged technical resources to troubleshoot?
  - Do you know the extent of impact (users, locations, other services affected)?
- If Service Owner is unresponsive/unavailable, reach out to their backup group members via Slack to identify someone to fill in for Service Owner responsibilities.
T+11 to T+20
- Accept or reject proposed major incident, with reasoning via work note. See Major Incident Definition for criteria to aid in making decision.
- Establish/join war room (Zoom) and claim host using key.
  - If Zoom unavailable, use Slack #it-major-incident channel as a dual war room and logging location.
- Convene Service Owner(s), Technical Lead(s), and needed Service Provider team(s) in war room (contact them using the VT Technical Communication task and with same content via direct Slack message).
- At start of war room gather initial information:
  - Is the issue an outage or degradation?
  - What is the extent of impact (who is impacted, how are they impacted)?
  - What are the symptoms?
  - Is there a workaround?
  - Are additional individuals/teams needed?
- Add affected Service Owner(s), Technical Lead(s), Service Provider(s) to the #it-major-incident Slack channel.
- Post initial IT status with Service Owner input.
- Notify #itsupport-csc of major incident declaration, include record number they can associate incidents with.
- Depending on scale of issue (e.g., multiple Service Owners, Technical Leads, teams involved), create a 'Technical troubleshooting' breakout room. Direct Technical Leads and Service Provider teams to troubleshoot issue in breakout room.
T+21 to T+30
- Draft the initial notification to stakeholders with input from Service Owner(s).
- Send initial notification to VPIT / Senior Leadership with input from Service Owner(s).
- Send initial notification to departmental IT.
Ongoing:
- Track significant events or activities in the #it-major-incident channel.
- Communicate regularly to VPIT / Senior Leadership using communication tasks and update IT Service Status regularly. Communicate timing of next status update if longer than an hour is expected between updates.
- Provide instructions for workaround or steps for remediation to frontline support.
- Bring additional teams/resources into war room and major incident channel.
- For major incidents bridging between business-hours/outside business hours, brief replacement Major Incident Manager before handing off the major incident.
Incident Resolution:
- Check with Service Owner(s) that major incident can be resolved.
- Work with frontline support to help verify resolution.
- Post an additional comment to the major incident record "We believe we have corrected the issue you were experiencing and are resolving your request for help. Please reply if you are still experiencing issues."
- Send resolution communication tasks to stakeholders.
- Except for the after-action report communication task, close any running communication tasks.
- Click the 'Resolve' button on the major incident.
- Notify #itsupport-csc of resolution.
- Replace the IT status with informational message. Set time to take informational message down (typically EOB).
After-action report and review:
1. Create a collaboratively edited Google Document on the Major Incident shared Google Team Drive using the After-Action Report template. Name the document with the service(s) impacted by the event and the date the Major Incident started.
2. Assign Service Owner(s) to complete the after-action report, provide link, and set a due date – typically 3 to 5 business days.
3. When necessary, as determined by the Major Incident Manager, in consultation with the Service Owner and their Senior Leader, an After-Action Review (postmortem) meeting will be scheduled. The purpose of the meeting is to capture any improvements for the service, internal processes, and the Major Incident Process.
4. When an After-Action Review meeting is necessary, members of the senior leadership team and the Service Owner will be invited to participate. In addition, the Major Incident Manager will update the After-Action Review, if needed, based on additional findings from the meeting.
5. Update the Major Incident After-Action Review section.
6. Send the After-Action Report using the Communication Task.
7. Close the major incident record.
Process Change History
Major changes to the process are documented here.
| Date | Author(s) | Description of Change |
| 2021-04-12 | David Duckett, Lucas Sullivan, Joyce Landreth | Initial version of process. |
| 2021-08-25 | Joyce Landreth | Added clarification to Appendix D: Communications |
| 2021-09-28 | Lucas Sullivan, Joyce Landreth | Added section for Major Security Incident. |
| 2022-01-20 | Lucas Sullivan, Joyce Landreth | Added clarification to After-Action Report Section and updated communications templates. Moved communication templates to an attachment to this process. |
| 2022-07-21 | Joyce Landreth | Targeted improvements made: (1) A knowledgebase article will be published to instruct distributed partners to attach related articles to Major Incidents if they receive incidents or calls. (2) User Engagement has developed a Google form to allow us to take calls if our ServiceNow instance is down or unavailable due to local authentication issues. |
| 2022-09-14 | Joyce Landreth | Added clarification for sensitive public-security services in Definitions section. |