Five Nines - To Dream the Impractical Dream? Presentation to the CSG Bruce Vincent.

Agenda
- What is driving high-availability?
- How do we judge which services need HA?
- How to achieve HA services?
- What’s working and what’s not

What’s driving high-availability?
- Frankly, [we] are…and should be.
- Central IT services have gone from popular to essential
- Interdependencies of services
- The hassle of outages!
- Choices of providers

Failure without loss of service

Failure with loss of service

How do we judge which services need high-availability?
- If a service isn’t that important, why are we running it?
- Turn it around…why doesn’t it need to be fault-tolerant and scalable?

How to achieve HA services?
- Build in fault-tolerance and scalability by design
- Monitoring and metrics
- Learn from service outages
- Manage Risk - Balance efforts
- Service - Don’t focus on SLA legalese
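To make the "monitoring and metrics" point concrete, here is a minimal sketch that converts an availability target expressed in nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; they do not come from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year (525,600 minutes)


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, a budget that a single unplanned reboot could plausibly exhaust.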

Major Failure with No Outage
Incident Summary: Active Directory Server rebooted after determining that it was in an impaired state, cause under investigation
Incident Started: :59
Incident Stopped: :05
Systems Affected: Godzilla. Server functions are replicated on other servers, no end-user outage
Incident Detail: Server rebooted and has resumed operations. Investigating logged error messages.

As Opposed To…
Incident Summary: A failed memory board on the Production Oracle machine caused Production to be unavailable. The machine was rebooted to clear memory errors.
Incident Started: :15
Incident Stopped: :50
Systems Affected: Oracle Applications Production
Incident Detail: A failed memory board on the Production machine caused Production to be unavailable. The machine was rebooted to clear memory errors.

DOOH!
At Tue Jan 4 10:31:30 PST 2005, the following responses were received:
Incident Summary: Ofdbprd1 was down because of a failed CPU/memory board.
Incident Started: :55
Incident Stopped: :10
Systems Affected: Ofdbprd1 (Oracle Financials transaction database server).
Incident Detail: The failed CPU/memory board was removed from the configuration and the system was brought up on the remaining boards. The failed board has been replaced and will be returned to service on Jan 6.

Make Services Boring (sort of)
- Build it to mask failures
- Control/communicate changes
- Test failover regularly (A-A, A-P)
- Keep service profiles for monitoring and resource trends up-to-date
- Create enterprise system-wide view
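The "build it to mask failures" idea can be sketched numerically. If a service stays up as long as at least one replica is up, and replicas fail independently (a strong assumption that real systems rarely meet), the service is down only when every replica is down at once. The function name and the example figures below are hypothetical and serve only as illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that survives while at least one replica is up.

    Assumes independent replica failures and instantaneous, perfect failover,
    which is an idealization rather than a description of any real deployment.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Probability that all replicas are down at the same time.
    all_down = (1.0 - single) ** replicas
    return 1.0 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

In this idealized model, two 99% replicas give about 99.99%. Shared dependencies and imperfect failover lower that figure in practice, which is one reason to test failover regularly.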