Major Incident Handbook for Services July 2015 Hotline 617-496-2831 Questions? Contact IT Service Management at

Slides:



Advertisements
Similar presentations
SERVICE MANAGER 9.2 PROBLEM MANAGEMENT TRAINING JUNE 2011.
Advertisements

Using the Self Service BMC Helpdesk
Tivoli Service Request Manager
Program Management Office (PMO) Design
IT Major Incident Management Response and Communication
Business Continuity Training & Awareness by Sulia Toutai (ANZ)
PROJECT RISK MANAGEMENT
Major Incident Process
Problem Management Process Training
A A A N C N U I N F O R M A T I O N T E C H N O L O G Y : IT OPERATIONS 1 Problem Management Jim Heronime, Manager, ITSM Program Tanya Friehauf-Dungca,
IT Strategic Planning Project – Hamilton Campus FY2005.
SE 555 Software Requirements & Specification Requirements Management.
Best Practices – Overview
IS&T Project Management: Project Management 101 June, 2006.
Remedy, a BMC Software company Change Management Maximize Speed and Minimize Risk in the Change Process.
1 Change Management FOR University Medical Group Saint Louis University Click this icon for Audio.
Problem Management Overview
Incident Management ISD Division Office of State Finance.
IS&T Project Management: How to Engage the Customer September 27, 2005.
Problem Management ISD Division Office of State Finance.
ITIL Problem Management Tool Guide Gerald M. Guglielmo ITIL Problem Manager CD-doc
Network security policy: best practices
ITIL Process Management An Overview of Service Management Processes Presented by Jerree Catlin, Sue Silkey & Thelma Simons.
Problem Management SDG teamIT Problem Management Process 2009 Client Services Authors: M. Begley & R. Crompton, Client Services.
ITPD PRODUCTION SUPPORT PROCESS OCTOBER 8, /15/2015 Guiding Principles 1.Support the business area’s needs to execute transactions and expand.
Change Advisory Board COIN v1.ppt Change Advisory Board ITIL COIN June 20, 2007.
Release & Deployment ITIL Version 3
What is Business Analysis Planning & Monitoring?
HUIT Queue Managers Forum May 7, Agenda Welcome The Role of the Service Owner Service Metrics “IT Order Takers” ServiceNow Best Practices, Tips.
Purpose A crisis communication plan coordinates the communication within the organization, as well as between the organization and the media and the public.
Roles and Responsibilities
What if you suspect a security incident or software vulnerability? What if you suspect a security incident at your site? DON’T PANIC Immediately inform:
INFSO-RI Enabling Grids for E-sciencE Incident Response Policies and Procedures Carlos Fuentes
Service Management Processes
ITIL Process Management An Overview of Service Management Processes Thanks to Jerree Catlin, Sue Silkey & Thelma Simons University of Kansas.
Working with the LiveOps Help Desk
What if you suspect a security incident or software vulnerability? What if you suspect a security incident at your site? DON’T PANIC Immediately inform:
December 14, 2011/Office of the NIH CIO Operational Analysis – What Does It Mean To The Project Manager? NIH Project Management Community of Excellence.
Service Transition & Planning Service Validation & Testing
© 2006 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Gathering Network Requirements Designing and Supporting Computer Networks – Chapter.
Event Management & ITIL V3
Developing Plans and Procedures
Avoid Disputes, Not Complaints Presented by: Stuart Ayres and Derek Pullen Stuart Ayres, Scheme Manager Derek Pullen, Scheme Adjudicator.
ITPD PRODUCTION SUPPORT PROCESS OCTOBER 8, /23/2015 Guiding Principles 1.Resolve production issues in a timely and effective manner 2.Manage.
© 2006 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Gathering Network Requirements Designing and Supporting Computer Networks – Chapter.
University of Minnesota Internal\External Sales “The Internal Sales Review Process” An Overview of What Happens During the Review.
ITPD ISSUE MANAGEMENT PROCESS SEPTEMBER 5, 2008
Working with the LiveOps Help Desk
State of Georgia Release Management Training
Erman Taşkın. Information security aspects of business continuity management Objective: To counteract interruptions to business activities and to protect.
Introduction to ITIL and ITIS. CONFIDENTIAL Agenda ITIL Introduction  What is ITIL?  ITIL History  ITIL Phases  ITIL Certification Introduction to.
Staff Assessment Technology Services Department Palmyra Area School District.
Incident Management A disruption in normal or standard business operation that affects the quality of service Goal: restore normal service as quickly as.
Introduction to ITSM processes. CONFIDENTIAL Agenda Problem Management  Overview  High Level process Change Management  Overview  High Level process.
6/13/2015 Visit the Sponsor tables to enter their end of day raffles. Turn in your completed Event Evaluation form at the end of the day in the Registration.
Progress of ITSM at Pomona College and the use of Footprints Information Technology Services IT Service Management and the Tools Supporting it Pomona College,
Problem Management for ITSD “Getting to the root of it” Thatcher Deane Feb 28, 2013.
Business Continuity Planning 101
Info-Tech Research Group1 Info-Tech Research Group, Inc. Is a global leader in providing IT research and advice. Info-Tech’s products and services combine.
ITIL® Core Concepts “Foundations to the Framework” Thatcher Deane 02/12/2010.
Account Management Overview
Ticket Handling, Queue Management and QlikView Dashboard Workshop
IT Service Operation - purpose, function and processes
Description of Revision
Portfolio, Programme and Project
Manage Business Continuity Introductory Brief
ISSUE MANAGEMENT PROCESS MONTH DAY, YEAR
{Project Name} Organizational Chart, Roles and Responsibilities
Presentation transcript:

Major Incident Handbook for Services July 2015 Hotline Questions? Contact IT Service Management at

Table of Contents Overview 3 Incident Priority Levels 4 Report Major Incidents5 Goals of Major Incident Process6 High-Level Process and Steps7 Major Incident Categories8 Key Roles9 Communications10 Process and Procedures11 Assess12 Contain18 Resolve21 Roles & Responsibilities23 Code of Conduct24 Incident Manager26 Technical Teams28 Service Desk and SOC Ops29 Incident Coordinator30 Incident Leader31 RACI Chart32 Appendix33 2

Major Incidents: Overview 3

Incident Priority Levels Crisis Major < 0.1% Critical < 1% High ~ 5% Normal ~ 95% Unplanned service interruption or degradation that disrupts teaching, learning, research, and/or administration. University digital emergency Operational Category 1 Initial assessment or limited impact 2 Known solution or working toward a solution 3 Significant impact with no known solution

Report Major Incidents 5 Provide as much information as possible to facilitate the process: Incident start time Services or applications impacted Impact to users or University functions Teams need for troubleshooting Initial diagnosis or current actions, if any Call the Hotline Anytime! See something, say something… If you think there may be a major incident, call the hotline. If in doubt, err on the side of calling; don’t wait!

Goals of Major Incident Process 6 Minimize negative impact to the institution and its mission. Restore normal service as quickly as possible. –Implement workaround, if it enables a faster resolution. –Balance service restoration (incident management) with gathering root cause information (problem management). Marshal necessary staff and resources for resolution. Communicate appropriately and promptly. –Internal: HUIT, including senior leadership as required –External: Users and stakeholders –For security incidents, different protocols may be necessary. When possible, preserve forensics for further analysis.

Declaration of MI Triggered by Service Desk reporting, event monitoring, and/or individual Preliminary information gathering regarding incident (Incident Coordinator) Initial Coordination Initial assessment and categorization of MI (Service/Offering Owner) Convene service/offering owner and technical team on conference bridge (Incident Coordinator) Notify HUIT staff (Incident Coordinator) and users (Service/Offering Owner) Investigation and Diagnosis Marshaling of appropriate troubleshooting resources (Service/Offering Owner) Ongoing communication (Service/Offering Owner) Escalation as needed (Service/Offering Owner or Incident Coordinator) Vendor management as needed (Service/Offering Owner) Implement temporary workaround or permanent solution (Service/Offering Owner) Resolution and Closure Validation that all services are operational, including downstream (Service/Offering Owner) Gathering of all key information, e.g., end time, actions, root cause when known (Service/Offering Owner and Incident Coordinator) Final communication, including HUIT Alert notification of resolution (Incident Coordinator) Trigger problem review and After Action Review ASSESS CONTAIN RESOLVE 7 Major Incident: High-Level Process and Steps

Major Incident Categories and Flow CategoryDescriptionIncident Leader Role 1 Initial assessment or limited impact with risk of escalation Informed 2 Known solution or working toward a solution Involved, if extended duration 3 Significant impact with no known solution Leading 8 Time-boxed: Assess category status every 30 minutes

Area of Leadership Role Leadership Incident Leader (Member of SLT; typically MD of affected service) Owns progress during category 2 or 3 Major Incident Communicates with CIO/DCIO, other senior HUIT staff, and key stakeholders (e.g., deans) Service and Technical Incident Manager (Service/Offering Owner) Qualifies incident Leads troubleshooting Marshalls resources for troubleshooting and resolution Accountable for communication with users and IT stakeholders Determines when incident has been resolved Conducts After Action Review and problem review Process Incident Coordinator (ITSM) Lead process facilitator Initiates contact with service/offering owner Establishes conference bridge Provides initial communication (e.g., HUIT alerts & website, end users if predefined for affected service) Documents material technical, communications, and related information in ticket 9 Key Roles in Major Incident Management

Major Incident Communications 10 HUITCommunity Declaration Ticket created (Incident Coordinator) HUIT alert - New (Incident Coordinator) Updates to UCIO/DCIO, if category 2 or 3 (Incident Leader) HUIT website and Twitter (Incident Coordinator) Service Desk phone message End users and stakeholders (Service/Offering Owner) Assess and Contain Updates every 30 minutes to Incident Coordinator and, if applicable, Incident Leader (Service/Offering Owner) Updates to UCIO/DCIO, if category 2 or 3 (Incident Leader) HUIT alert – Category Change (Incident Coordinator) As necessary, update end users and stakeholders (Service/Offering Owner) Resolve Declaration of resolution to Incident Coordinator and, if applicable, Incident Leader (Service/Offering Owner) HUIT alert - Resolved (Incident Coordinator) Updates to UCIO/DCIO, if category 2 or 3 (Incident Leader) HUIT website and Twitter (Incident Coordinator) End users and stakeholders (Service/Offering Owner)

Major Incidents: Process and Procedures 11

Declaration of Major Incident 12 Triggers Criteria Responsible for declaration: Incident Manager If above unavailable, Incident Coordinator End UsersSudden flood of tickets StaffPreventative measure, alerts, vendor notifications Urgency Intolerance for delay LMH Risk of Escalation into more widespread issue LMH Size of PopulationLMH Declare Major Incident, if two criteria are M / H. } ASSESS CONTAIN RESOLVE

Initial Coordination Incident Coordinator convenes service/offering owner and technical team on conference bridge. 13 NB: Activity on bridge should focus primarily on service restoration. Root cause is important, but should generally remain secondary. Whenever possible, preserve forensic information in support of further analysis and After Action Review. The Conference Bridge is critical for efficient troubleshooting and centralized communications. All required resource should join within 15 minutes. If no response, Incident Coordinator escalates to Service Owner and/or Director/MD. Once service/offering owner (or proxy) joins, s/he leads call. Initial discussion on bridge: 1. Articulate issue. 2. Assess business impact. 3. Review recent changes. Open separate, technical bridge, if needed. ASSESS CONTAIN RESOLVE

Conference Bridge Phone Numbers and Access BridgeLeaderParticipant MI Bridge # # Secondary Bridge # # Tertiary Bridge # # LEADER *2Begin / end recording *7Lock / unlock conference 72#Roll call 81#Mute all lines 80#Unmute all lines Conference Bridge Features 14 Join the Bridge: or ANYONE *0Operator *6Mute / unmute your line ASSESS CONTAIN RESOLVE

Major Incident Classification Incident Manager (or proxy) provides initial classification. –Based on reported and actual user impact, event monitoring, availability of known solutions, and potential to become a crisis. –If Incident Manager unreachable, this assessment defaults to the Incident Coordinator. Preliminary assessment should be made and then updated every 30 minutes (may be longer, depending on investigation needed). –It is better to err on the side of caution and, if appropriate, downgrade to Critical during or after the incident. If no possible solution is identified within 2 hours, MI category should be upgraded. Incident Manager may adjust classification, as additional information is gathered, based on time of year and/or business needs, etc. 15 ASSESS CONTAIN RESOLVE CategoryDescriptionIncident Leader Role 1 Initial assessment or limited impact with risk of escalation Informed 2 Known solution or working toward a solutionInvolved, if extended 3 Significant impact with no known solutionLeading

Information Security Incidents Notify Information Security Operations of any incidents that may pertain to information security (e.g., system compromise, compromise of administrative credentials). Incidents with a significant information security component should be handled differently from the standard Major Incident protocol. –In the case of system and/or account compromise, sufficient time must be allotted for accurate and detailed assessment of the scope of the incident. –Communications outside of – or even internal to – HUIT are often limited (e.g., no ticket in ServiceNow, no posting on HUIT status page) to avoid “tipping off” the perpetrator (either through our own or public channels) and minimize anxiety. For these incidents, the CISO (or designate) acts as Incident Leader. 16 ASSESS CONTAIN RESOLVE

Initial Communication of a Major Incident Incident Coordinator creates ticket in ServiceNow –Primary channel for internal updates on progress Communications chain initiated for category 2 (extended) or category 3: Incident Manager  Incident Leader  DCIO/CIO Incident Coordinator notifies HUIT staff –HUIT alerts –HUIT website (status.huit.harvard.edu) –Twitter Update of Service Desk phone message with any specific steps/information Incident Manager notifies users and stakeholders 17 ASSESS CONTAIN RESOLVE

Investigation and Diagnosis Guiding Principles & Procedures Pursue multiple leads and parallel work streams as appropriate. Avoid combining seemingly-related incidents. –Continue to troubleshoot as separate incidents until it’s confirmed they are related. Review change calendar to identify any potential causes or impact. Follow troubleshooting checklists; leverage workflows or service maps (if available). 18 ASSESS CONTAIN RESOLVE

Investigation and Diagnosis Incident Manager (aka Service/Offering Owner) Technical investigation and diagnosis Marshaling of appropriate troubleshooting resources –Launch 2nd conference bridge for technical discussions, if needed; coordinate between two conferences to ensure timely updates Vendor management as needed Ongoing status updates for communication needs Incident Coordinator Documentation of all material technical, process, communications, and related information in ticket Additional HUIT Alerts, if MI category changes Incident Leader Hourly updates to CIO/DCIO and key senior-level stakeholders 19 ASSESS CONTAIN RESOLVE

Technical Escalation Escalations raise involvement and awareness of incident to more advanced skill sets or senior decision-making levels. Incident Manager is accountable for the overall escalation process. Current level notifies the next level no later than the hour indicated below. 20 NLT* HourTechnical Troubleshooting Begins Admins, Engineers, Developers, etc. 2Sr. Tech. Engineers and Architects 4Vendor (if applicable) *No Later Than ASSESS CONTAIN RESOLVE

Resolution and Closure Resolution –Incident Manager validates all services are operational, including those downstream –Incident Coordinator gathers all key information (e.g., end time, actions, root cause when known) and includes in ticket Incident Closure – Final communication, including HUIT Alert notification of resolution – Problem review and After Action Review for all involved Occurs following Tuesday during Problem Management meetings Confirm/update root cause Address any open issues or concerns Identify necessary mitigation and next steps, including owners and timelines 21 ASSESS CONTAIN RESOLVE

After Action Review Template Incident # Duration –Start date –End date –Total time Affected Systems Symptoms Chronology and Summary of Events Workaround/Solution 22 Root Cause Analysis –Technical Process & Communication Review –What went well? –What can be improved? Next Steps –Preventative measures to prevent recurrence –Recommendations ASSESS CONTAIN RESOLVE

Major Incidents: Roles & Responsibilities 23

Remember to Act according to the HUIT Values (i.e. user-focused, collaborative, innovative, and open), especially during these critical periods. Report Issues: Call the hotline at ! –“See something, say something.” –Avoid calls to the Service Desk or individual staff. –For crisis, calls may be made directly to the ESF leader. Respond Quickly when contacted by the Incident Coordinator. –Call into the bridge as soon as possible, but no later than 15 minutes after notification. Respect the Process to ensure efficient resolution and maximum collaboration. –Limit conference bridge to essential staff: Avoid unnecessary participants, which may impede progress and clear communication. 24 Code of Conduct

Ensure Timely Access to Appropriate Staff (Yourself or Others). –Maintain and share up-to-date lists of on call information, staff (including vacation and back-up coverage), and phone numbers. –Provide readily available access to needed technical expertise (especially during times of escalation or transition). –Dedicate your own or your team’s time as warranted and attention to ensure speedy service restoration. Coordinate Internal Communications: Material troubleshooting, communications, and decisions should be communicated/validated on conference bridge and in the ticket. 25 Code of Conduct

Incident Manager Responsibilities 26 Confirm and classify Major Incident, based on organizational impact. Lead troubleshooting effort: –Identify, marshal, and deploy technical resources. –Lead discussion on primary conference bridge (and coordinate with technical bridge, if applicable). Approve proposed fix or workaround. Confirm resolution of Major Incident. Accountable for communications with stakeholders and end-users. Responsible for After Action Review: –Review and validate record of events. –Determine and implement preventative measures and next steps. Service/Offering Owner or proxy is accountable for the restoration of an interrupted or degraded service.

Additional Responsibilities for Service Owners 27 To optimize service restoration and prevent future Major Incidents, these activities should be well defined for the MI process: Vendor Management Manage relationships with vendors and set response expectations. –Review of vendor contracts for incident response times –Enforcement of contracts breaches –Process to review root causes of incidents Relationship Management Predefine business impact of services for different levels of outages. Configuration Management Document and map service components and relationships for troubleshooting.

Technical Teams Responsibilities 28 Identify and escalate a Major Incident to Incident Coordinator. Participate in conference bridge, if appropriate. –Join bridge within 15 minutes of HUIT alert or on-call contact. –If not available, an alternate should be pre-identified and easily accessible. Understand technical landscape, including: –Technical or service dependencies, such as downstream effects –Recent changes Troubleshoot and work to resolve incident in accordance with internal procedures. Help document incident details and any material steps taken to resolve underlying problem. Provide regular updates to the Incident Manager and/or Incident Coordinator on status of investigation and resolution of incident. Any HUIT technical resource (Infrastructure, Development, Operations, etc.) that receives alerts or escalations or has a role in restoring normal operations.

Service Desk and SOC Ops Responsibilities 29 Identify Major Incidents and escalate to the Incident Coordinator. Track individual tickets/calls, relating them to the main/master incident. Communicate with end users, individually via calls and and/or generally through an updated incoming message. Provide technical information to the bridge as appropriate or requested. Help document incident details and any material steps taken to resolve the underlying problem. These frontline resources, including after hours, may be the first to identify a Major Incident.

Incident Coordinator Responsibilities 30 Initiate and participate in primary conference bridge, adhering to timelines and guiding principles. Open a Major Incident ticket and maintain ongoing record of events. Maintain update and escalation intervals and associated communications with technical resources, service owners, and management. Ensure that internal communications about a Major Incident are conducted in a timely manner, including: –HUIT alerts at the beginning and resolution –HUIT website postings (statuspage.io) –Twitter Create an associated Problem Record. Facilitates the major incident process through coordination, documentation, and communication.

Incident Leader Responsibilities 31 Supervises service restoration at a high level. Accountable for communicating with CIO, DCIO, and key senior-level stakeholders. Facilitates availability of appropriate troubleshooting resources (e.g., clearing calendars, removing barriers). Oversees decisions about restoration activities that affect other non- impacted services. Has discretion to mobilize resources across HUIT (e.g., other service areas). If warranted, with CIO/DCIO, invokes assessment of incident as Crisis. Owns progress towards service restoration during category 2 (extended) or category 3 Major Incident.

32 RACI Matrix for Key Roles during Major Incidents Activity Incident Leader (category 2/3) Incident Coordinator Service / Offering Owner Service Desk SOC Operations Technical Resource Technical Line Manager Incident Identification C ARRRRR Initial Coordination C ACCCC C Conference Bridge + Open Ticket R ARRRRR Initial Communications R ACCCCC Troubleshooting C/R A Technical Investigation & Diagnosis R CACCRR Escalation R ARRRRR Ongoing Communication R RACCC C Incident Documentation C ACC/R Resolution C/R A After Action Review RRARRRR

Appendix 33

Major Incident and Crisis Workflow 34 AssessContainResolve MI Declared MI ClosedMI AAR Asses s ContainResolve Crisis Closed Crisis AAR Crisis Declared Determine if Crisis

Template: HUIT Alerts Subject: [huit-alerts] INC####### - Description Major Incident Status: Category: NEW field Time Reported: Incident Start: Customer / Business Impact: Services Affected: Major Incident Conference Bridge: ,, # **Representatives from technical and service owner groups are expected on the call bridge for incident assessment within 15 minutes of HUIT alert. Group(s) Responsible: Service Owner Group: IT Service Management (ITSM) Status updates: HUIT Service Status Dashboard ServiceNow: Incident url Current Actions: Incident Coordinator Name HUIT Incident Management *****HUIT INTERNAL USE ONLY***** 35