Customer Engagement Workshop: IT Service Continuity. Phoenix, Aston, 6th May 2015. Paul Gant, Head of BCM Assurance; David Davies, BCM Assurance Consultant.
Agenda 11:00 Registration, refreshments and networking. 11:30 Why get fit, anyway? 11:50 Fictitious live incident. 12:10 Post incident review. 12:30 Steps to success. 12:50 Questions & answers. 13:00 Lunch, tours, event close. 13:30 BCM Assurance sessions by appointment.
Why get fit, anyway?
Introducing BCM Assurance – your personal trainers
What if?
Real Recovery (Invocations) is like a Battle. YOUR ENEMIES: (Lack of) time. You can’t recover what you haven’t backed up. You can’t upgrade recovery technology during an invocation. YOUR FRIENDS: Phoenix. Your preparation.
What does “Preparation” involve? It’s not just about the technology! But aren’t policies, analysis, plans and reports only there to satisfy the auditor? Is there any rhyme or reason to them?
IT Service Continuity Management: Priorities, Dependencies, Plans, Testing, Maintenance
1. What’s needed first? (Priorities)
2. What rests on what? (Dependencies)
3. Make a plan. (Plans)
4. See if it works. (Testing)
5. Keep it up-to-date. (Maintenance)
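The first two steps above (what is needed first, and what rests on what) can be sketched as a dependency ordering problem. The example below is a minimal illustration only: the service names and their dependencies are hypothetical and do not come from the workshop, and a topological sort is just one plausible way to turn a dependency map into a recovery sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical services and dependencies (illustrative, not from the slides).
# Each key depends on every service listed in its set.
depends_on = {
    "network": set(),
    "storage": {"network"},
    "directory": {"network"},
    "database": {"storage", "directory"},
    "order_system": {"database"},
    "email": {"directory"},
}


def recovery_order(graph):
    """Return the services in an order where dependencies come first."""
    return list(TopologicalSorter(graph).static_order())


order = recovery_order(depends_on)
print(order)
```

A sequence like this only says what can be recovered before what. Which services matter most to the business still has to come from the priorities agreed in step 1.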
What goes wrong? Issues reported in the media:
DATACOM co-location datacentre flood, Melbourne, Australia, March 2010: Heavy rain broke a ceiling panel and poured water into the data centre. Water damaged SANs, servers and routers. All equipment was impacted by a 12-hour power outage.
Camera Corner / Connecting Point datacentre fire, Green Bay, Wisconsin, USA, 19th March 2008: Fire alarms but no fire suppression. 75 hosted servers destroyed. “10 day outage” reported, with 98% of services resumed by 1st April.
Phoenix Standby Reasons
Phoenix Invocation Reasons
The recurring dangers that we see: IT recovery requirements haven’t been agreed with the business (through a BIA). IT recovery strategy isn’t joined up (i.e. a full end-to-end solution isn’t there). Strategy isn’t supported by plans and isn’t tested rigorously enough (resulting in inefficiencies and failures during actual recovery).
Fictitious Live Incident (Why have a personal trainer to help you?)
Ground floor: warehouse and second server room (backup SAN and tapes). 2nd (top) floor: offices and server room.
CRITICAL SYSTEMS: Recovery Time Objective 24 hours; Recovery Point Objective 24 hours (disk to disk daily).
NON CRITICAL SYSTEMS: Recovery Time Objective 5 days; Recovery Point Objective 1 day (local tape) and 7 days (offsite tape).
Network link: 1 Gbps.
08:07 Fire
12:15 Servers onsite
12:45 Exec Report
13:15 Start recovery
09:30 Server recovered?
11:45 Recovery stalled
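The scenario's Recovery Time Objectives can be turned into concrete deadlines, which makes it easier to judge the timeline. The sketch below rests on assumptions: the slides give clock times only, so the calendar date is an arbitrary placeholder, and reading "11:45 Recovery stalled" as falling on the day after the fire is an interpretation, not something the slides state. The function names are illustrative.

```python
from datetime import datetime, timedelta

# Assumed placeholder date: the slides give clock times only.
FIRE = datetime(2015, 5, 6, 8, 7)

# Recovery Time Objectives as stated in the scenario.
RTO = {
    "critical": timedelta(hours=24),
    "non_critical": timedelta(days=5),
}


def rto_deadline(incident, tier):
    """Latest time by which systems in this tier should be recovered."""
    return incident + RTO[tier]


def overrun(incident, tier, observed):
    """Time past the RTO deadline, or zero if still within it."""
    return max(observed - rto_deadline(incident, tier), timedelta(0))


critical_deadline = rto_deadline(FIRE, "critical")

# Assumption: "11:45 Recovery stalled" is on the day after the fire.
stalled = datetime(2015, 5, 7, 11, 45)
late_by = overrun(FIRE, "critical", stalled)

print("Critical RTO deadline:", critical_deadline)
print("Overrun when recovery stalled:", late_by)
```

Under these assumptions the 24-hour objective for critical systems expires at 08:07 on the day after the fire, so a recovery that is still stalled at 11:45 has already missed it.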
Post Incident Review (What are the consequences of being unfit?)
Post Incident Review: What went well? (Where were they fit?) What went badly? (Where were they unfit?) What could the IT manager have done differently during the recovery? What could the IT manager have done differently before the recovery?
IT Service Continuity Issues: Have you experienced any of the issues raised? Difficulty in getting board engagement. No business requirements for IT recovery (i.e. no BIA). Single points of failure in key skill sets. Lack of recovery documentation (perhaps no spare time to write it?). Lack of formal testing and test reporting. Any other issues?
The Barriers and Results: What’s stopping you / stopped you from making changes? What would happen if changes aren’t made and you invoke? What would happen if you do make the changes?
Steps to Success (How to become IT service continuity fit.)
What if?
The Steps to Successful IT Service Continuity: 1. Engagement and sponsorship at a strategic level. 2. Balance between the technology and ITSC management. 3. Do all of ITSC, and run it as a repeating programme.
1. Strategy: Talk the Language of the Business
1. Strategy: Engage with the Executive Team. Does the Executive Team know: What are the impacts if IT fails? What are the risks associated with IT failure? What are the RTO and RPO of services, and what do these terms mean? What is the recovery and hand back process?
2. Balance Technology with ITSC Management: Priorities, Dependencies, Plans, Testing, Maintenance
3. Do all of the Programme Steps, and Repeat. [Diagram: BC Readiness plotted against Time, with a Trigger and repeated PEAKs; programme steps: Business Impact Analysis, IT Service Continuity Plan, IT Recovery Testing; stages: Priorities, Dependencies, Plans, Testing, Maintenance]
What if?
Trap 1: The Scope Trap
Trap 2: The Audit Trap
Trap 3: The Importance and Urgency Trap
Trap 4: The Gambler’s (or Optimist’s) Trap
Trap 5: The Hero Trap
Any Questions?
Thank you for participating. Lunch is now ready. Would you like a tour or a meeting?