Business Continuity and Disaster Recovery CIO Strategic Initiative Discussion June 15, 2015
Agenda
- Introductory comments on CIO Initiative (Ben G)
- Central Administration Initiative (Sue W)
  - Goals
  - FY15 work
  - Current DR timeline for CA
  - Appendix (examples of DR options)
- CIO Initiative discussion and next steps (All)
Central Administration BCDR Initiative
HUIT Business Continuity Vision

Vision: To provide technology services that can withstand a broad range of incidents and disasters

Strategic Objectives
- Strengthen our ability to recover the University’s most critical existing systems
- Design new systems to incorporate Business Continuity
- Define and operationalize Crisis Management and Disaster Recovery processes

Guiding Principles
- Academic and administrative needs drive priorities and requirements
- Ensure clear visibility into trade-offs between investment and risk
- Solutions are based on a small set of standardized options rather than custom solutions for each situation
- Our plans and solutions need to be communicated, understood, tested, and validated
- Solutions maximize “bang for the buck”

Key Performance Indicators
- Qualitative understanding of risk and exposure
- Assessment of maturity in our business continuity systems and practices
- Number of critical IT services that have a regularly tested DR solution
- Existence of documented and implemented reference architectures
- Uptime or # of major incidents for critical infrastructure (network, power, etc.)
Goal Framework
The Business Continuity and IT Disaster Recovery (BCDR) initiative has three primary goals:
- Goal #1: Establish and operationalize crisis management processes to handle a broad range of incidents/crises
- Goal #2: Change our existing systems to be able to recover from disasters in a way that meets the business’ needs
- Goal #3: Change the way we implement new technologies to incorporate business continuity
Goal #1 – FY15 Update
The following activities were accomplished in FY15 to establish and operationalize crisis management processes:
- Held Crisis Management workshops in fall 2014
- Developed a runbook with clear crisis management processes, outlined roles and responsibilities, and a checklist of tasks
- Trained staff on Crisis Management
- Held Spring Emergency Management training
- Conducted a tabletop exercise in June 2015 in conjunction with Harvard’s Emergency Management Team
FY15 Result: The crisis process has been developed, documented, and tested with HUIT’s Emergency Team. We are better prepared to manage an IT crisis!
Goal #2 – FY15 Update
The following efforts were completed to improve our current systems and protect them from a disaster:
- Completed assessment of CA critical applications (criticality, roadmap, dependencies)
- Developed an approach to migrate applications into the cloud over three years while embedding DR in design
  - Designed the first Critical 1 application for cloud deployment: Identity and Access Management systems for high availability and disaster recovery in AWS (July release)
- Developed an option to support interim DR
  - Designed and tested interim premise/DR solution
  - Trialed several products (vCloud Air, CloudEndure)
  - POC for Library (Aleph Development Environment) (Complete)
  - Exploratory discussions with PeopleSoft for DR implementation (In Process)
Results: We have a plan to improve DR through full cloud migrations as well as an interim cloud DR solution. We have designed and will soon deploy our first Critical 1 application in the cloud.
LDAP – dif files PIN -
Application Assessment of CA
*The current contract provides a bare-metal DR solution; the majority of applications at Sungard do not conduct regular tests.
Application Criticality Ratings
(DR Tier, Application Description, DR Scenario)
- Foundational: Critical systems dependent on infrastructure (e.g. networking, connectivity, DNS). Support infrastructure for critical business systems.
  DR Scenario: High availability – no interruption in service, no degradation in service
- Tier 1, Mission Critical: Critical services impacted, customer interactions are interrupted (critical business function).
- Tier 2, Critical: Services continue, though data collection and transaction processing may be delayed. Operational workaround exists.
  DR Scenario: Failover – minimal to no interruption in service; degraded service is acceptable
- Tier 3, Important: Support services that can fall back to manual execution. System loss may slow, but will not stop, business-critical activities.
  DR Scenario: DR scenario – application recoverability outside of 30 days
- Deferrable: Non-Critical
Goal #3 – FY15 Update
The following initiatives are underway to change the way we implement technologies as they relate to DR:
- Included DR requirements as part of the cloud migration 5-step process and all new system architectural reviews (Completed)
- Incorporated DR questions into PMO templates for all new systems (Completed)
Result: We are thinking about DR needs up front, before we design a system, and addressing requirements as part of the design process.
Current DR Plan with Cloud Migrations
Columns: Interim DR (CloudEndure**); Cloud / SaaS Migration (FY16, FY17, FY18)

Foundational (Tier 0)
  Applications: Infrastructure, OID, Net, TWS
  Status: SunGard* - Cloud Cloud Cloud

Mission Critical (Tier 1)
  Applications: CS - Blackboard, CS - CS Gold, CS - Message Me, Aleph, Ace, IAM, eMail, SIS, TLT
  Status: Yes No SaaS Partial SaaS SaaS- - SaaS TBD- - - -

Critical (Tier 2)
  Applications: PeopleSoft, Yardi, Oracle eBusiness**, GMAS, CAADS, iSites, Data Warehouse
  Status: Yes- No - - - - TBD- TBD -

*Current contract provides a bare-metal DR solution
**Exploration of Oracle virtualization in FY16, prerequisite for cloud or interim CloudEndure
Appendix
AWS Physical/Logical Architecture

[Diagram: US West Region and US East Region, each containing Availability Zones made up of data centers. Labels shown: Availability Zone 1, Availability Zone 2, Availability Zone 5; DC-w-1a, DC-w-2a, DC-e-1a, DC-e-1b, DC-e-1c, DC-e-2a, DC-e-3a, DC-e-3b]

- Regions are physically and logically independent
- Each Region has 2-5 Availability Zones (AZs)
- Each AZ comprises 1-6 data centers* (which act and perform as one logical DC)
- AZs are similar in concept to the 60 Oxford St. + 1 Summer St. model at Harvard
*Exact number and location of AWS data centers are not publicly released
Standard Cloud Architecture – Active / Active

[Diagram: “Good Design” multi-availability design, with Availability Zone (AZ) 1 and Availability Zone (AZ) 2 synchronized]

Key
- DR Architecture: Active/Active (multi-AZ) design
- Relative Cost: $
- Typical Recovery Time: < 5 min.
- Protects Against: Individual DC or AZ failure or degradation
- Appropriate for: All production applications
- Similar to: 1 Summer St. + 60 Oxford St. HA design
DR Cloud Architecture - Pilot Light

[Diagram: Region 1 (East) with synchronized AZ 1 and AZ 2; code, configuration, and data are copied to Region 2 (West) with <15 min lag using snapshots and replicas]

Key
- DR Architecture: Maintain minimal live infrastructure in remote region
- Relative Cost: $$$
- Typical Recovery Time: < 1 hour
- Protects Against: Regional failure
- Appropriate for: Select Critical 2 applications
- Similar to: DR contracts at Sungard
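
The two appendix options can be compared side by side when matching an application to a DR design. The following is a minimal Python sketch, not part of the original deck: the `DrOption` class, the `cheapest_option` helper, and the numeric encoding of the $ / $$$ cost ratings are illustrative assumptions. Only the recovery times and failure scopes come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    relative_cost: int      # count of "$" signs on the slide (assumed encoding)
    recovery_minutes: int   # upper bound of the slide's "Typical Recovery Time"
    protects_against: str   # "az" or "region", as stated on the slide


OPTIONS = [
    DrOption("Active/Active (multi-AZ)", 1, 5, "az"),
    DrOption("Pilot Light (remote region)", 3, 60, "region"),
]


def cheapest_option(failure_scope: str, max_recovery_minutes: int) -> Optional[DrOption]:
    """Return the lowest-cost option that covers the failure scope and meets
    the recovery-time requirement, or None if no listed option qualifies.

    Simplification: an option is matched only against the scope the slide
    names for it; overlapping coverage is not inferred.
    """
    candidates = [
        o
        for o in OPTIONS
        if o.protects_against == failure_scope
        and o.recovery_minutes <= max_recovery_minutes
    ]
    return min(candidates, key=lambda o: o.relative_cost, default=None)
```

For example, `cheapest_option("region", 30)` returns `None`, which reflects the trade-off on the slides: neither listed option recovers from a regional failure in under an hour.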