Office for Information Resources
Crisis Management and DR
Larry K. Peck
Disaster Recovery Consultant
Office of Information Resources
State of Tennessee

Office for Information Resources May 2,
Downtime and Availability - Factors Contributing to Downtime
The following seven categories were identified as the factors contributing to downtime (Gartner, April 2004 survey of 145 IT organizations):

Factor                                                Share
Software System Failure                               26%
Hardware System Failure                               21%
Network or Telecommunications Carrier Failure         15%
Human Error                                           13%
Cause Uncertain or Unknown                            12%
Environmental Factors (Such As Power Outages)         10%
Security Breach or Virus                               3%

Planning and Testing
Strategies
Technologies

Emergency Management Team
- EMT is First Response to Crisis Event
- Identified 1st Responders from various functional and business units
- Disaster Assessment Teams (DAT) – inspect equipment and facilities, report to EMT
- Interfaces together:
    Executive Management
    Financial Management
    Technical Management
    Functional Response Teams
    Press Relations Team
- Conduct TWO “Exercises” per year, 1st planned, 2nd Surprise

Planning-Preparation
Business Impact Analysis (BIA)
- Conducted high level BIA as part of recent study – Annual detailed BIA with every agency now in progress
- Established annual BIA review process

Business impact analysis (BIA) and risk assessment approach: The analysis and report are structured around the following systems and critical, dependent business processes
[Diagram: Enterprise View of Application A, with Agency/Operations View, Media View, and Technology View. Labels: LAN, WAN, MAN, Internal/External, Government/Agency Communications, Customer Service, High Speed Telephony, Financials, Billing, Call Centers, Applications, 3rd Party Technologies, Data Center/NOCs, Telephony Services]

Planning-Preparation
New approach to system criticality identification
- Level 1 – < 5 minute RTO/RPO (0 downtime)
- Level 2 – 8 hour or less RTO/RPO
- Level 3 – 48 hour RTO/RPO
- Level 4 – 72 hour RTO/RPO
- Level 5 – NR – No specific disaster recovery requirements

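As an illustration only, the five levels can be read as thresholds on a system's required recovery time. The sketch below is an assumption about how such a lookup might work, not the State's actual tool: the function name `criticality_level` is hypothetical, the boundaries are treated as inclusive upper limits, and any requirement longer than 72 hours is assumed to fall into Level 5.

```python
from typing import Optional


def criticality_level(rto_minutes: Optional[float]) -> int:
    """Map a required RTO/RPO (in minutes) to a criticality level 1-5.

    Assumptions (not stated on the slide): thresholds are inclusive upper
    bounds, and anything beyond 72 hours is treated as Level 5 (NR).
    """
    if rto_minutes is None:
        return 5  # NR: no specific disaster recovery requirement
    if rto_minutes < 0:
        raise ValueError("RTO cannot be negative")
    if rto_minutes < 5:
        return 1  # under 5 minutes, effectively zero downtime
    if rto_minutes <= 8 * 60:
        return 2  # 8 hours or less
    if rto_minutes <= 48 * 60:
        return 3  # up to 48 hours
    if rto_minutes <= 72 * 60:
        return 4  # up to 72 hours
    return 5  # assumed: longer than 72 hours means no specific requirement
```

A system that must be back within four hours would be passed as `criticality_level(240)` and land in Level 2 under these assumptions.
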
Planning-Preparation
Implemented new web-based Disaster Recovery Application and Inventory Planning Application

Strategies
Outside Analysis and Review
- Confirmed what we thought we knew – our strengths and weaknesses
- DR for Mainframe is mature, stable, and very supportable utilizing 3rd party services
- DR for Distributed Systems is very complex and poorly suited for 3rd party services
- Some existing technologies are still viable
- New approaches are necessary for others
- Migration to self-supporting recovery model is necessary, especially for Distributed Systems

Technologies
- Construction of Second Data Center
    Full Tier III facility*
- Self-Recovery Model (just one example)
    Each data center runs 50% of production
    Each data center runs 50% of total dev/test/training
    DR event – utilize dev/test/training hardware to recover most critical systems
- Various data replication schemes and technologies
- Server Virtualization/Clustering over WAN/HA technologies

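One way to picture the self-recovery model is as a capacity allocation at the surviving site: the hardware normally used for dev/test/training is handed to the failed site's production systems, most critical first. The sketch below is a simplified illustration under stated assumptions, not the State's actual procedure. The `System` type, the `plan_recovery` function, and the notion of abstract "capacity units" are all hypothetical, and Level 5 systems are assumed to be deferred because they carry no recovery requirement.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class System:
    name: str            # hypothetical system name
    level: int           # criticality level, 1 (most critical) to 5 (NR)
    capacity_units: int  # hypothetical measure of hardware the system needs


def plan_recovery(
    failed_site_systems: List[System], repurposable_units: int
) -> Tuple[List[System], List[System]]:
    """Greedily assign repurposed dev/test/training capacity by criticality.

    Returns (recovered, deferred). Level 5 systems are always deferred
    (assumption: no specific disaster recovery requirement).
    """
    recovered: List[System] = []
    deferred: List[System] = []
    remaining = repurposable_units
    # Most critical first; among equals, smaller systems first.
    for system in sorted(failed_site_systems,
                         key=lambda s: (s.level, s.capacity_units)):
        if system.level < 5 and system.capacity_units <= remaining:
            recovered.append(system)
            remaining -= system.capacity_units
        else:
            deferred.append(system)
    return recovered, deferred
```

The greedy ordering is a design simplification: it never delays a more critical system to fit several less critical ones, which matches the slide's emphasis on recovering the most critical systems first.
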
Thoughts
- Plan, Plan, Plan
- Review, Review, Review
- Test, Test, Test
- Revise, Revise, Revise