Service Recovery & Availability Robert Dickerson June 2010.

Slides:

Advertisements

Similar presentations

Express5800/ft series servers Product Information Fault-Tolerant General Purpose Servers.

Advertisements

VCS 5.0 for VMware ESX.

Tom Hamilton – America’s Channel Database CSE

ITEC474 INTRODUCTION.

IBM SMB Software Group ® ibm.com/software/smb Maintain Hardware Platform Health An IT Services Management Infrastructure Solution.

Virtualization & Disaster Recovery

2  Industry trends and challenges  Windows Server 2012: Modern workstyle, enabled  Access from virtually anywhere, any device  Full Windows experience.

Cloud Computing: Theirs, Mine and Ours Belinda G. Watkins, VP EIS - Network Computing FedEx Services March 11, 2011.

Business Plug-In B4 MIS Infrastructures.

Module – 9 Introduction to Business continuity

Business Continuity Section 3(chapter 8) BC:ISMDR:BEIT:VIII:chap8:Madhu N PIIT1.

1 Vladimir Knežević Microsoft Software d.o.o.. 80% Održavanje 80% Održavanje 20% New Cost Reduction Keep Business Up & Running End User Productivity End.

© 2009 EMC Corporation. All rights reserved. Introduction to Business Continuity Module 3.1.

VERITAS Confidential Copyright © 2002 VERITAS Software Corporation. All Rights Reserved. VERITAS, VERITAS Software, the VERITAS logo, and all other VERITAS.

1 Storage Today Victor Hatridge – CIO Nashville Electric Service (615)

High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ

High Availability 24 hours a day, 7 days a week, 365 days a year… Vik Nagjee Product Manager, Core Technologies InterSystems Corporation.

Service Design – Section 4.4 – Availability Management.

Oracle Data Guard Ensuring Disaster Recovery for Enterprise Data

1 © Copyright 2010 EMC Corporation. All rights reserved. EMC RecoverPoint/Cluster Enabler for Microsoft Failover Cluster.

Why Improve Datacenter Automation? “ Studies show up to 80% of network availability incidents and 60% of security issues can be tied to human error” CIO.

Keith Burns Microsoft UK Mission Critical Database.

Microsoft Virtual Server 2005 Product Overview Mikael Nyström – TrueSec AB MVP Windows Server – Setup/Deployment Mikael Nyström – TrueSec AB MVP Windows.

1© Copyright 2011 EMC Corporation. All rights reserved. EMC RECOVERPOINT/ CLUSTER ENABLER FOR MICROSOFT FAILOVER CLUSTER.

National Manager Database Services

H-1 Network Management Network management is the process of controlling a complex data network to maximize its efficiency and productivity The overall.

CERN IT Department CH-1211 Genève 23 Switzerland t Next generation of virtual infrastructure with Hyper-V Michal Kwiatek, Juraj Sucik, Rafal.

November 2009 Network Disaster Recovery October 2014.

Real Security for Server Virtualization Rajiv Motwani 2 nd October 2010.

ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.

© 2006 Avaya Inc. All rights reserved. Avaya Services Michael Dundon Business Development Manager.

Sofia, Bulgaria | 9-10 October SQL Server 2005 High Availability for developers Vladimir Tchalkov Crossroad Ltd. Vladimir Tchalkov Crossroad Ltd.

Module 9 Planning a Disaster Recovery Solution. Module Overview Planning for Disaster Mitigation Planning Exchange Server Backup Planning Exchange Server.

The Infrastructure Optimization Journey Kamel Abu Ayash Microsoft Corporation.

Module 13 Implementing Business Continuity. Module Overview Protecting and Recovering Content Working with Backup and Restore for Disaster Recovery Implementing.

Neil Sanderson 24 October, Early days for virtualisation Virtualization Adoption x86 servers used for virtualization Virtualization adoption.

Uwe Lüthy Solution Specialist, Core Infrastructure Microsoft Corporation Integrated System Management.

OSIsoft High Availability PI Replication

Essential Capabilities The Platform You Own Choice of Solutions for your Business needs Choice of Solutions for your Business needs Preserve your.

VMware vSphere Configuration and Management v6

Ken Brumfield | Premier Field Engineer Ward Ralston| Group Product Manager Microsoft Corporation.

Ashish Prabhu Douglas Utzig High Availability Systems Group Server Technologies Oracle Corporation.

CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,

Enhancing Scalability and Availability of the Microsoft Application Platform Damir Bersinic Ruth Morton IT Pro Advisor Microsoft Canada

Cloud Computing Lecture 5-6 Muhammad Ahmad Jan.

Data Center Management Microsoft System Center. Objective: Drive Cost of Data Center Management 78% Maintenance 22% New Issue:Issue: 78% of IT budgets.

Component 8/Unit 9aHealth IT Workforce Curriculum Version 1.0 Fall Installation and Maintenance of Health IT Systems Unit 9a Creating Fault Tolerant.

Virtual Machine Movement and Hyper-V Replica

Deploying Highly Available SAP in the Cloud

A Measured Approach to Virtualization Don Mendonsa Lawrence Livermore National Laboratory NLIT 2008 by LLNL-PRES

Service Design.

OSIsoft High Availability PI Replication Colin Breck, PI Server Team Dave Oda, PI SDK Team.

Self Service Service Delivery & Automation Deploy Configure Service Model DC Admin Operate Monitor Virtual Physical Public Cloud Private Cloud Virtual.

Microsoft Dynamics NAV Dynamics NAV 2016 one Azure SQL Dmitry Chadayev Microsoft.

Azure Site Recovery For Hyper-V, VMware, and Physical Environments

Understanding The Cloud

ATS Service Assurance Suite presentation

Providing Application High Availability

Module – 9 Introduction to Business continuity

High Availability 24 hours a day, 7 days a week, 365 days a year…

Maximum Availability Architecture Enterprise Technology Centre.

Bharath Ram Ramanathan, Storage Solutions TME,

How to prepare for the End of License of Windows Server 2012/R2

Introduction of Week 6 Assignment Discussion

Infrastructure, Data Center & Managed Services

SpiraTest/Plan/Team Deployment Considerations

Agenda The current Windows XP and Windows XP Desktop situation

Monitor VMware with SC2012 SP1 Operation Manager & Veeam Microsoft Tools for VMware Integration & Migration Symon Perriman Michael Stafford Senior.

Productive + Hybrid + Intelligent + Trusted

Presentation transcript:

Service Recovery & Availability Robert Dickerson June 2010

Paychex Assets Started in 1971 with $3,000, 40 clients and 1 employee. 2009: over $2B revenue, 500,000+ clients, 13,000 employees. Payroll / Tax Services / 401(k) / Employee Benefits / HR Admin / Time & Labor / Employee Leasing / 1-50 Employees / 50+ Employees Disaster Recovery & Business Continuity 13,000 employees across 100 locations, 9 within Monroe County, 2 in Germany. Data centers running critical applications to support the movement of $1B per day in support of 500,000 businesses. 1,600 windows servers. 2 megawatts of power. 1,400 unix servers. 1,000 terabytes storage. 147tb backed up weekly. 75,000 inbound / 65,000 outbound phone calls each day. Technical infrastructure to support branch offices, data centers and connections to banking and trading partners. The technical solutions in place to recover IT Computer processing functions and restore service to internal and external clients. The ability to restore critical business functions and protect the corporate assets.

High Availability (HA): Minimized Unplanned Downtime Unplanned = Hardware failures, network outages, data corruptions, data center failures, etc. Continuous Operations (CO): Minimize Planned Downtime Planned = Patching, application upgrades, code releases, data center maintenance, etc. High Availability + Continuous Operations = Continuous Availability Definitions

Service Availability Percentages Availability PercentageDowntime per YearDowntime per Month 90%876 Hours72 Hours 99%87.6 Hours7.20 Hours 99.9%8.76 Hours43.0 Min 99.95%4.38 Hours21.9 Min 99.99%52 min4.32 Min %5 min 15 sec25.9 Sec %31.5 sec2.59 Sec %3 sec0.25 Sec Best In Class Platform Availability (per Gartner) High Availability – approx. 5 hours of unplanned downtime per year or 99.95% Continuous Operations – 12 Hours of planned downtime per Year or 99.86% Continuous Availability – 17 Hours of downtime per Year or 99.81%

Cost of Availability Levels (Gartner) The cost of continuous availability increases by multiples as the target level increases. If X represents the cost of 99% availability then 99.9% has a cost equal to 3X, and % has a cost equal to 8X Achieving Continuous Availability in the range of % (just minutes of downtime a year) is very rare and expensive.

Downtime Root Cause Outages affecting Availability Industry Breakdown Unplanned 20% - Hardware, OS, Environmental Factors 40% - Applications 40% - Operations Planned 65% - Application and Database Updates 35% - HW/SW/DC Maintenance, Systems and Applications Management Paychex Unplanned 2% - Hardware, OS, Environmental Factor 22% - Operations (network, configurations, memory, data integrity) 52% - Applications 24% - Unknown (efforts underway to minimize this classification) Contributing factors to downtime: component failure, rate of change, expertise, process & procedures.

High Level Standards for Achieving Increased Availability Infrastructure and Data Resiliency Design and Implement infrastructure that prevents unplanned downtime. Use fault-tolerant hardware and operating systems. Provide redundancy to account for component failures as well as disasters. Limit complexity. Reduce MTTF (Mean Time to Failure). Automation Avoid human intervention for failovers as well as maintenance, test and release procedures. Use hardware load balancers or clustering software to automate failovers. Use automated release and testing tools and procedures. Reduce MTTR (Mean Time to Recovery). IT management processes Improve the maturity level of standard IT management processes (proactive monitoring, change, release, configuration and incident/problem). Reduce amount of change and/or probability that change will cause downtime. Monitor failures and capacity proactively vs. reactively. Understand integrations and maintain consistent configurations. Formally report and investigate incidents affecting availability. Per Gartner, IT Maturity Levels of 3 or above are required in order to be highly available. Application Development Design Applications with availability in mind (non-disruptive installation, no hard-coded limits, dynamic initialization, backwards compatibility) to support minimizing planned downtime (i.e. rolling upgrades. Consider availability requirements early in the design process. Once 1/3 of code has changed, Gartner recommends a complete application rewrite. Testing Test all possible use cases in representative test environments. Complete integrated, performance, and post-release testing.

Strategic HA and CO Technologies (Currently Available) Load Balancing F5 Devices for Load Balancing active/active, redundant configurations and automated failovers Session State Management Coherence*Web for session state management to support active/active configurations Hardware Clustering Veritas and Microsoft Hardware clustering software to mitigate hardware and OS failures Database/Application Clustering Oracle RAC for multi-node database clustering to mitigate hardware or software failures on database server nodes Fault-Tolerant Hardware/Self-Healing OS We use hardware models and Operating Systems that have features to prevent downtime from occurring due to hardware/OS failures. Data Replication Dataguard for database replication and SRDF for storage-based date replication to allow for hardware recovery on standby systems Data Recovery Backups to disk and tape for data recovery. Snapshot and snapback technologies. Proactive Monitoring RUM, Aternity and Edgesight for End User Experience Monitoring. OEM for DB monitoring. SCOM and OVO for hardware/OS/App monitoring. Incident Tracking and Reporting Service Center is currently in used for Problem Management. Automation Tidal for automated releases. F5/SM & WL plugins for failovers. Configuration Management Service Center is currently in used for Configuration Management. Development Lifecycle Quality Center for development lifecycle quality and consistency Design Principles and Standards Troux for architecture principles and standards for quality and consistency of designs. Continuous Operations Hardware Clustering allows for rolling HW/OS patching. Some hardware has hot- swappable components. = Relatively New to Paychex

Strategic HA and CO Technologies (Investigative Stage) Load Balancing All Feature/functions of the F5 devices (LTMs/GTMs) that can enhance availability Session State Management No new technology investigations at this time. Hardware Clustering No new technology investigations at this time. Database/Application Clustering Oracle SOA Suite 11g to allow for implementation across datacenters Fault-Tolerant HW/Self-Healing OS Use of Superdome technologies. Data Replication Active Dataguard for data replication and read-only access to standby. Coherence for data caching and replication. Data Recovery Volume snapshots for data recovery. Proactive Monitoring Central monitoring console, extend use of existing tools for proactive vs. reactive monitoring Incident Tracking and Reporting Event Correlation tools, enhanced service level reporting tools Automation Technologies to: provide automated post-release test to decrease planned downtime windows, further automate release processes and build/turnover processes Configuration Management Is Service Center adequate for mapping application and system interdependencies? Development Lifecycle Technologies to deliver self-healing applications with rolling upgrade capabilites. Design Principles and Standards No new technology investigations at this time. Virtualization Investigate role in providing HA and Continuous Operations solutions. Continuous Operations Transient Logical Standby procedures in Oracle 11g to apply database patches in a rolling fashion

HA and CO Design Considerations and Procedures Redundant Infrastructure Storage, Network, Security, Server infrastructure must have redundant components. Load Balancing Load Balance Across data centers when appropriate. Load Balance based on performance metrics of target vs. using round robin approach. Session State Management Maintain active/active configurations by utilizing Session State tools. Proactive Monitoring Build health monitors into applications to be utilized by F5 devices. Continue process toward making Proactive Monitoring part of the design process not just implementation. Incident Tracking and Reporting Capture all Availability affecting incidents in Service Center while reducing the number incidents classified with unknown root cause. Automation Further Automate failover and recovery procedures, post-release testing and build processes to reduce time and errors. Configuration Management Ensure standby and primary systems have matching configurations. Thoroughly map application and system interdependencies/integrations. Development Lifecycle Matching availability of Build/Release automation tools with Test Engineering requirements Analyze Agile Development Model for impacts across all areas (support, release management, testing) Architect applications with high availability features to allow for more seamless failovers and rolling upgrades as well as self-healing features. Consider rewriting applications that have been over-updated. Determine best balance between frequency and complexity of release. Design Principles and Standards Develop Availability Design standards Service Level Agreements/BIAs Define accurate and comprehensive SLAs and availability requirements early in the project lifecycle process in order to affect approaches and designs. Define conditional SLAs for impaired vs. unavailable. Testing Environments Assess impacts for downtime on build automation and testing tools on testing timelines and completeness. Enhance integrated testing to make it more comprehensive. Design production-like test environments. Refine resource scheduling processes to make the most of the available test environments.

Predictive Model What is it? Model used to predict the overall platform availability based on probabilities of unplanned downtime. What does it provide? Allows for calculating the impact of component changes to overall availability. Can be used to help create Availability roadmap. Provides consistent way to assign availability expectations to given design as well as identify gaps. How is it used? Prediction is attained through industry failure rates and Paychex history. Continuous feedback from actual experience used to update predictions. When is it used? This model can be used during the Solutions Approach and Design Phases to assess the predicted availability percentage of given Solution/Design. It can also be used as part of an assessment of an existing Product. Work is being done to add it to the project lifecycle as a deliverable.

Predictive Model Sample