Major Incident


Purpose

Formulate a coordinated response to major incidents with goals to:

Major Incident Scope

A major incident is an emerging or ongoing outage or degradation of a core or critical production service which significantly disrupts university/business operations. A major incident has at least one of the following characteristics:

Major Incident Definition

A major incident can only be declared under the conditions that:

Clarification for sensitive public-security services: Any outage of a service that underpins the VT Alerts platform will not be posted as an outage on the IT Status page. Communications will be limited to internal division channels only (the Technical Team and VPIT/Senior Leaders communication channels). We will not communicate through external channels, including departmental IT channels (IT Council and Techsupport groups).

When a major incident is proposed, a Major Incident Manager decides whether to declare it a major incident. The Service Owner will be notified when an incident is proposed as major, and the Major Incident Manager will make the best effort to get input from the Service Owner; however, the Major Incident Manager may declare a major incident without Service Owner input.

An incident will be declared as major if it meets one of the following criteria:

A Major Incident Manager will reject a major incident proposal if:

Examples of Incidents vs Major Incidents

Incident

  • User unable to log into Canvas
  • Shared printer is not working for multiple users
  • User cannot access a particular Banner form they used to be able to access
  • User cannot receive SMS or calls from Duo
  • User is unable to send/receive e-mail

Major Incident

  • Canvas is down and no users can log in
  • Network is down on the academic side of campus
  • VT Active Directory failures across the university
  • HokieMart is experiencing slowness
  • Duo reports that they are unable to send SMS or phone calls

Process

View the Process Diagram

  1. Any incident fulfiller (especially 4Help, departmental IT, or other frontline support) logs, categorizes, prioritizes and performs initial diagnosis of an incident. A user may have submitted the incident, or it may have been created by technical personnel who detected an issue.
  2. An incident fulfiller proposes the incident as a major incident candidate based on major incident scope.
    1. The Major Incident Manager and Service Owner are notified when an incident has been proposed as major.
    2. If Service Owner(s), Technical Lead(s), or Service Provider(s) are aware of an outage or degradation, they may begin investigation and troubleshooting. Any production changes or significant events/findings should be recorded.
  3. A Major Incident Manager reviews the proposed incident and accepts or rejects it. This review includes checking the proposed incident against major incident criteria, gathering information, and seeking input of impacted Service Owner(s).
    1. If the incident is deemed major, then the Major Incident Manager promotes it to a major incident and self-assigns it.
    2. If the incident is not deemed major, then the Major Incident Manager demotes it with explanation and re-assigns the incident to the proposer.
  4. The Major Incident Manager, after declaring a major incident:
    1. Establishes a war room and major incident channel.
    2. Gathers impacted Service Owner(s), Service Provider(s), Technical Lead(s) in a war room and major incident channel. The Major Incident Manager is present in the war room for the duration of the major incident.
      1. Service Owner(s), Service Provider(s), Technical Lead(s) join the war room and major incident channel.
        1. The war room is used for coordination of the major incident, inter-team technical troubleshooting, and developing stakeholder communications. The participants should only be those actively involved in coordinating or troubleshooting the major incident:
          1. Service Owner(s) and Technical Lead(s) coordinate changes to production so changes are implemented in a deliberate, controlled way. This includes coordinating with other Service Owner(s), Technical Lead(s), and Service Provider(s).
          2. Technical Lead directs Service Provider team in troubleshooting to eventual resolution of the issue.
          3. Service Owner(s) provide regular updates to the Major Incident Manager, who will in turn communicate updates to stakeholders.
          4. Service Owner(s) notify the Major Incident Manager of additional technical resources needed. The Major Incident Manager will recruit individuals/teams needed.
          5. The Major Incident Manager apprises Service Owner(s) of significant new reports or environmental changes.
        2. The MI (major incident) channel is used for logging production changes, attempted fixes and results, and significant events:
          1. Technical Lead(s) and Service Owner(s) coordinate logging changes to production, attempted technical fixes, and results as they occur in the MI channel.
          2. The Major Incident Manager, Service Owner(s), Technical Lead(s), and Service Provider team(s) record significant events (e.g., new symptoms, expanded scope of impact).
          3. The MI channel is used to get added teams or individuals up to speed in the case the major incident is escalated and expanded to involve other teams.
        3. The Major Incident Manager records significant activities in the major incident record.
    3. The Major Incident Manager informs stakeholders of the major incident and provides regular updates:
      1. Post an initial IT status with input from Service Owner(s).
      2. Notify support personnel of the major incident declaration in the #itsupport-csc channel.
      3. Send initial stakeholder communications with input from Service Owner(s).
  5. Troubleshooting and communication activities continue, with escalation as needed, until resolution. If the issue spans business and non-business hours, the Major Incident Manager will brief a new Major Incident Manager, who will take over responsibilities.
  6. The Service Owner(s) notify the Major Incident Manager of resolution of the Major Incident.
    1. The Major Incident Manager resolves the major incident record and sends resolution notifications to stakeholders.
    2. The Major Incident Manager replaces the IT status with an informational status to be removed at EOB of current or next business day.
  7. The Major Incident Manager assigns Service Owner(s) to complete the after-action report or contribute to its completion. (The AAR template can be found as an attachment at the bottom of this article.)
    1. Service Owner(s) consult with their Technical Lead and Service Provider to complete after-action report.
  8. The Major Incident Manager schedules an after-action review meeting, if necessary, to review the after-action report and gather feedback from participants on organizational response to the major incident.
  9. The Major Incident Manager sends the after-action report to internal stakeholders.
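
The propose, promote/demote, and resolve steps above can be read as a small state machine. The Python sketch below is illustrative only: the state names and the `advance` helper are hypothetical and do not come from ServiceNow or any tooling named in this process.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names, mapped loosely onto the numbered steps above.
    LOGGED = "logged"        # step 1: incident logged and diagnosed
    PROPOSED = "proposed"    # step 2: proposed as a major incident candidate
    DEMOTED = "demoted"      # step 3.2: rejected and returned to the proposer
    MAJOR = "major"          # step 3.1: promoted to a major incident
    RESOLVED = "resolved"    # step 6: resolution confirmed and recorded
    AAR_SENT = "aar_sent"    # step 9: after-action report sent


# Allowed transitions between states; anything else is rejected.
ALLOWED = {
    State.LOGGED: {State.PROPOSED},
    State.PROPOSED: {State.MAJOR, State.DEMOTED},
    State.MAJOR: {State.RESOLVED},
    State.RESOLVED: {State.AAR_SENT},
    State.DEMOTED: set(),
    State.AAR_SENT: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state if the transition is allowed, else raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Under this reading, a demoted incident continues as an ordinary incident, so the sketch treats `DEMOTED` as an end state for the major incident path.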

Clarifications:

Glossary

Appendix A: Roles and Responsibilities

Appendix B: Priority, Impact, and Urgency

The priority of an incident, as determined by impact and urgency, is considered when declaring a major incident.

Impact and urgency determine an incident’s priority as shown in the prioritization matrix below.

Prioritization Matrix

Impact                     | Urgency: Critical | Urgency: High | Urgency: Normal | Urgency: Low
University                 | P1                | P1            | P2              | P4
Multiple users/building(s) | P1                | P2            | P2              | P4
Single User                | P2                | P2            | P3              | P4
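
As a rough illustration, the prioritization matrix can be expressed as a lookup keyed by impact and urgency. The values are copied from the matrix above; the `priority` function name is an assumption made for this sketch and is not part of any documented tool.

```python
# Prioritization matrix from Appendix B: impact -> urgency -> priority.
PRIORITY_MATRIX = {
    "University": {
        "Critical": "P1", "High": "P1", "Normal": "P2", "Low": "P4",
    },
    "Multiple users/building(s)": {
        "Critical": "P1", "High": "P2", "Normal": "P2", "Low": "P4",
    },
    "Single User": {
        "Critical": "P2", "High": "P2", "Normal": "P3", "Low": "P4",
    },
}


def priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair (hypothetical helper)."""
    try:
        return PRIORITY_MATRIX[impact][urgency]
    except KeyError:
        raise ValueError(
            f"unknown impact/urgency: {impact!r}, {urgency!r}"
        ) from None
```

For example, `priority("Multiple users/building(s)", "High")` returns `"P2"`.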

Appendix C: Backup IT services

Purpose                | Primary (default) service | Secondary (if primary down) | Tertiary (if secondary down)
Troubleshooting        | Zoom                      | Slack                       | Teams
Logging changes/events | Slack                     | Teams                       |
Communications         | Email                     | Slack                       | Teams
Incident Handling      | ServiceNow                | Manual record               |

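The fallback order in Appendix C can be sketched as an ordered list per purpose, taking the first service that is not currently down. This is a minimal sketch: the `pick_service` helper and the way unavailable services are passed in are assumptions for illustration.

```python
from typing import Optional, Set

# Ordered fallbacks from Appendix C: primary first, then secondary, then tertiary.
BACKUP_SERVICES = {
    "Troubleshooting": ["Zoom", "Slack", "Teams"],
    "Logging changes/events": ["Slack", "Teams"],
    "Communications": ["Email", "Slack", "Teams"],
    "Incident Handling": ["ServiceNow", "Manual record"],
}


def pick_service(purpose: str, unavailable: Set[str]) -> Optional[str]:
    """Return the first listed service for a purpose that is not unavailable.

    Hypothetical helper; returns None when every listed option is down.
    """
    for service in BACKUP_SERVICES[purpose]:
        if service not in unavailable:
            return service
    return None
```

For example, with Zoom down, `pick_service("Troubleshooting", {"Zoom"})` returns `"Slack"`.
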
Appendix D: Communications Templates and Timeline

View Formal Communications Templates here

Estimated Time (after MI proposal) | Item | Sender | Receiver | Communication Channel/Medium
0-10 mins. | Notice of major incident proposal | MI Manager | Service Owner, Service Provider, others | Email, #itsupport-csc (Slack)
10-20 mins. | Notice of major incident declaration | MI Manager | Service Owner, Service Provider, 4Help, others | Slack, IT Status
 | War room/major incident channel notification | MI Manager | Service Owner, Service Provider | Slack, Email
20-30 mins. | Initial MI communications | MI Manager | Stakeholders | Email
As available, at least hourly | Technical updates | Service Owner | MI Manager | Zoom War Room
Hourly | Major Incident updates | MI Manager | Stakeholders | Email, IT Status
0-15 mins. after resolution is confirmed | Incident resolution | Service Owner | MI Manager | MI Slack Channel, Zoom War Room, others
15-30 mins. after resolution is confirmed | Incident resolution | MI Manager | Stakeholders | Email, IT Status
 | Informational status after incident resolution | MI Manager | Stakeholders | IT Status post (informational)

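To make the early part of the timeline concrete, the sketch below computes the latest expected send time for the three items that have an explicit window measured from the MI proposal. The `MILESTONES` list and the `communication_deadlines` function are hypothetical names; only the minute values come from the table above, and the hourly and post-resolution rows are left out.

```python
from datetime import datetime, timedelta

# Upper bound, in minutes after the MI proposal, for each early communication.
MILESTONES = [
    ("Notice of major incident proposal", 10),
    ("Notice of major incident declaration", 20),
    ("Initial MI communications", 30),
]


def communication_deadlines(proposed_at: datetime):
    """Return (item, latest expected send time) pairs for a proposal time."""
    return [
        (item, proposed_at + timedelta(minutes=minutes))
        for item, minutes in MILESTONES
    ]
```

For a proposal made at 09:00, the three deadlines fall at 09:10, 09:20, and 09:30.
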
Stakeholders:

Communication Streams outside of Major Incident Process Defined Templates:  

Additional communication to other applicable channels such as social media or external Slack channels will be discussed and agreed upon by the Major Incident Manager, Service Owner and the Division Communications Team prior to being sent by the Division of IT Communications Team.

Appendix E: Core and Critical Services, Service Owners, and Service Providers (Support Groups)

Log in to ServiceNow to view the list of Major Incident declarable services

Appendix F: Major Incident Manager Checklist

Timeline measured from the time an incident is proposed as a major incident.

T+10 minutes

  1. Ensure a member of Major Incident Manager group is investigating.
  2. Read existing information on proposed incident (Major Incident Candidate).
  3. Check Planned Maintenance (SAMS) calendar to see if related service had planned unavailability. Was the service announced to be unavailable? Take note of any planned maintenance that happened earlier.
  4. Notify Slack channel #itsupport-csc of major incident proposal and short description of report. Ask for others experiencing same symptoms or issues.
  5. Contact the Service Owner via Slack direct message. Gather information, including:
    1. Who is the Technical Lead for the service?
    2. Are you aware of an outage or degradation with the service?
    3. Have you engaged technical resources to troubleshoot?
    4. Do you know the extent of impact (users, locations, other services affected)?
  6. If Service Owner is unresponsive/unavailable, reach out to their backup group members via Slack to identify someone to fill in for Service Owner responsibilities.

T+11 to T+20

  1. Accept or reject proposed major incident, with reasoning via work note. See Major Incident Definition for criteria to aid in making decision.
  2. Establish/join war room (Zoom) and claim host using key.
    1. If Zoom unavailable, use Slack #it-major-incident channel as a dual war room and logging location.
  3. Convene Service Owner(s), Technical Lead(s), and needed Service Provider team(s) in war room (contact them using the VT Technical Communication task and with same content via direct Slack message).
  4. At start of war room gather initial information:
    1. Is the issue an outage or degradation?
    2. What is the extent of impact (who is impacted, how are they impacted)?
    3. What are the symptoms?
    4. Is there a workaround?
    5. Are additional individuals/teams needed?
  5. Add affected Service Owner(s), Technical Lead(s), Service Provider(s) to the #it-major-incident Slack channel.
  6. Post initial IT status with Service Owner input.
  7. Notify #itsupport-csc of major incident declaration, and include the record number that related incidents can be associated with.
  8. Depending on scale of issue (e.g., multiple Service Owners, Technical Leads, teams involved), create a ‘Technical troubleshooting’ breakout room. Direct Technical Leads and Service Provider teams to troubleshoot issue in breakout room.

T+21 to T+30

  1. Draft the initial notification to stakeholders with input from Service Owner(s).
  2. Send initial notification to VPIT / Senior Leadership with input from Service Owner(s).
  3. Send initial notification to departmental IT.

Ongoing:

Incident Resolution:

  1. Check with Service Owner(s) that major incident can be resolved.
    • Work with frontline support to help verify resolution.
  2. Post an additional comment to the major incident record "We believe we have corrected the issue you were experiencing and are resolving your request for help. Please reply if you are still experiencing issues."
  3. Send resolution communication tasks to stakeholders.
  4. Except for the after-action report communication task, close any running communication tasks.
  5. Click the 'Resolve' button on the major incident.
  6. Notify #itsupport-csc of resolution.
  7. Replace the IT status with informational message. Set time to take informational message down (typically EOB).

After-action report and review:

  1. Create a collaboratively edited Google Document on the Major Incident shared Google Team Drive using the After-Action Report template. Name the document with the service(s) impacted by the event and the date the Major Incident started.
  2. Assign Service Owner(s) to complete the after-action report, provide link, and set a due date, typically 3 to 5 business days.
  3. When necessary, as determined by the Major Incident Manager in consultation with the Service Owner and their Senior Leader, an After-Action Review (postmortem) meeting will be scheduled. The purpose of the meeting is to capture any improvements for the service, internal processes, and the Major Incident Process.
  4. When an After-Action Review meeting is necessary, members of the senior leadership team and the Service Owner will be invited to participate. In addition, the Major Incident Manager will update the After-Action Review, if needed, based on additional findings from the meeting.
  5. Update the Major Incident After-Action Review section.
  6. Send the After-Action Report using the Communication Task.
  7. Close the major incident record.
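
The "3 to 5 business days" due date for the after-action report can be worked out with a short helper. This is a sketch under a stated assumption: business days are taken to be Monday through Friday, and university holidays are ignored. The `add_business_days` name is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days after `start`.

    Assumes Monday-Friday business days and does not account for holidays.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current
```

For a report assigned on Thursday 2022-01-20, three business days gives Tuesday 2022-01-25 and five gives Thursday 2022-01-27.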

Process change history

Major changes to the process are documented here.

Date | Author(s) | Description of Change
2021-04-12 | David Duckett, Lucas Sullivan, Joyce Landreth | Initial version of process.
2021-08-25 | Joyce Landreth | Added clarification to Appendix D: Communications.
2021-09-28 | Lucas Sullivan, Joyce Landreth | Added section on Major Security Incident.
2022-01-20 | Lucas Sullivan, Joyce Landreth | Added clarification to After-Action Report section and updated communications templates. Moved communication templates to an attachment to this process.
2022-07-21 | Joyce Landreth | Targeted improvements made:
  • A knowledge base article will be published to instruct distributed partners to attach related articles to Major Incidents if they receive incidents or calls
  • User Engagement has developed a Google form to allow us to take calls if our ServiceNow instance is down or unavailable due to local authentication issues
2022-09-14 | Joyce Landreth | Added clarification for sensitive public-security services in Definitions section.