Establishing a Service Level Agreement (SLA)

Presentation transcript:

Establishing a Service Level Agreement (SLA)
  =tg= Thomas Grohser, NTT DATA
  SQL Server MVP, SQL Server Performance Engineering
  Silicon Valley, CA, 4/9/2016

select * from =tg= where topic =
  @@Version                        Remark
  SQL 4.21                         First SQL Server ever used (1994)
  SQL 6.0                          First Log Shipping with failover
  SQL 6.5                          First SQL Server Cluster (NT4.0 + Wolfpack)
  SQL 7.0                          2+ billion rows / month in a single Table
  SQL 2000                         938 days with 100% availability
  SQL 2000 IA64                    First SQL Server on Itanium IA64
  SQL 2005 IA64                    First OLTP long distance database mirroring
  SQL 2008 IA64                    First Replication into mirrored databases
  SQL 2008R2 IA64, SQL 2008R2 x64  First 256 CPUs & >500.000 STMT/sec
                                   First Scale out > 1.000.000 STMT/sec
                                   First time 1.2+ trillion rows in a table
  SQL 2012                         220.000 Transactions per second
                                   1.2 Trillion Rows in a table
  SQL 2014                         400.000 Transactions per second
                                   Fully automated deploy and management
  SQL Next                         NDA
  20 Years with SQL Server
=tg= Thomas Grohser, NTT DATA
  email: tg@grohser.com
  Focus on SQL Server Security, Performance Engineering, Infrastructure and Architecture
  New Tool coming in 2015
  Close relationship with
    SQLCAT (SQL Server Customer Advisory Team)
    SCAN (SQL Server Customer Advisory Network)
    TAP (Technology Adoption Program)
    Product Teams in Redmond
  Active PASS member and PASS Summit Speaker

NTT DATA Overview
  Why NTT DATA for MS Services:
    20,000 professionals – Optimizing balanced global delivery
    $1.6B – Annual revenues with history of above-market growth
    Long-term relationships – >1,000 clients; mid-market to large enterprise
    Delivery excellence – Enabled by process maturity, tools and accelerators
    Flexible engagement – Spans consulting, staffing, managed services, outsourcing, and cloud
    Industry expertise – Driving depth in select industry verticals
    NTT DATA is a Microsoft Gold Certified Partner. We cover the entire MS Stack, from applications to infrastructure to the cloud.
    Proven track record with 500+ MS solutions delivered in the past 20 years

Drawing at the end of the session
  Drop your business card, or fill out the provided blank card, and drop it in the red bag.
  You must be present at the time of the drawing at the end of the session to win.

Agenda
  Why & When?
  What & How?
  Q&A

Why do we need SLAs?
  Management and coworkers need to understand and agree to reality.
  They help you request, and argue for, the resources you need.
  They help you avoid lawsuits.

Rule Number One!
  SLA first, solution later.
  If you already have a solution, don’t agree to an SLA the solution can’t support.

What should be in an SLA?
  Everything
  Operational requirements
  Maintenance windows
  Responsibilities
  Dependencies
  What happens if the SLA is not met

RPO – Recovery Point Objective
  In plain English: how much data can we lose?
  Samples
    Your last log backup is from 12 minutes ago: 12 minutes
    Your last full backup is from last week: 1 week
    You do not have a backup at all
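The samples above boil down to one subtraction: the worst-case data loss at any moment is the age of the newest usable backup. The following is a minimal sketch of that arithmetic, not a SQL Server API; the function names, the timestamps and the `rpo` targets are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def data_loss_exposure(last_backup: datetime, now: datetime) -> timedelta:
    """Worst-case data loss if the database failed at `now`."""
    return now - last_backup


def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return data_loss_exposure(last_backup, now) <= rpo


# Hypothetical example: the last log backup finished 12 minutes ago.
now = datetime(2016, 4, 9, 12, 0)
last_log_backup = now - timedelta(minutes=12)

print(data_loss_exposure(last_log_backup, now))                 # 0:12:00
print(meets_rpo(last_log_backup, now, timedelta(minutes=15)))   # True
print(meets_rpo(last_log_backup, now, timedelta(minutes=5)))    # False
```

A check like this only says what you would lose right now; it says nothing about whether the backup can actually be restored.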

RTO – Recovery Time Objective
  In plain English: how much time after a failure until we have to be available again?
  Samples
    Your restore takes 6 hours: 6+ hours
    Your last backup does not work and you have to go to tape: 24+ hours
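The "6+" in the sample hints that the restore itself is only part of the recovery time. A rough sketch of that reasoning follows, assuming the recovery steps run one after another; the step names and durations are made up for illustration.

```python
from datetime import timedelta
from typing import Mapping


def estimated_recovery_time(steps: Mapping[str, timedelta]) -> timedelta:
    """Sum the durations of all recovery steps, assuming they run sequentially."""
    return sum(steps.values(), timedelta())


def meets_rto(steps: Mapping[str, timedelta], rto: timedelta) -> bool:
    """True if the estimated recovery time fits within the RTO."""
    return estimated_recovery_time(steps) <= rto


# Hypothetical recovery plan: the 6-hour restore is only one of the steps.
restore_plan = {
    "detect and decide": timedelta(minutes=20),
    "restore full backup": timedelta(hours=6),
    "verify and reopen": timedelta(minutes=15),
}

print(estimated_recovery_time(restore_plan))          # 6:35:00
print(meets_rto(restore_plan, timedelta(hours=4)))    # False
```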

Availability
  The time the database is available within a period, divided by the length of that period.
  Don’t confuse luck with availability!
  How fast do you think you can fix data corruption or human error in your database?
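The definition above can be written down directly. This is a small sketch under the assumption that outages are recorded as durations in seconds and do not overlap; the 30-day month and the two outages are invented sample data.

```python
from typing import Sequence


def availability(period_seconds: float, outages_seconds: Sequence[float]) -> float:
    """Fraction of the period during which the database was available."""
    downtime = sum(outages_seconds)
    if not 0 <= downtime <= period_seconds:
        raise ValueError("total downtime must be between 0 and the period length")
    return (period_seconds - downtime) / period_seconds


# Hypothetical 30-day month with two outages of 10 and 35 minutes.
month = 30 * 24 * 3600
print(f"{availability(month, [10 * 60, 35 * 60]):.4%}")   # 99.8958%
```

A month without any recorded outage returns 100 %, which is exactly the "luck" case: the number describes what happened, not what can be promised.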

Availability
  99.0 %
  99.7 %
  99.9 %
  99.99 %
  99.999 % (the famous five nines)
  99.9999 %

Availability: how long can I be offline?
                1 Year                            1 Month
  Availability  Days    Hours  Minutes  Seconds   Days     Hours  Minutes  Seconds
  0 %           365.25  8766   525960   31557600  30.4375  730.5  43830    2629800
  99.0 %        3.65    87.66  5260     315576    0.30     7.31   438      26298
  99.7 %        1.10    26.30  1578     94673     -        2.19   131      7889
  99.9 %        -       8.77   526      31558     -        0.73   44       2630
  99.99 %       -       0.88   53       3156      -        -      4        263
  99.999 %      -       -      5        316       -        -      -        26
  99.9999 %     -       -      0.5      32        -        -      -        3
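The figures in this table follow from a single multiplication. The sketch below recomputes them, assuming the same period lengths the slide uses (a year of 365.25 days and a month of one twelfth of that); the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600       # 31,557,600 seconds
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12   # 2,629,800 seconds


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Downtime budget for a period at the given availability target."""
    return period_seconds * (100.0 - availability_percent) / 100.0


for pct in (99.0, 99.7, 99.9, 99.99, 99.999, 99.9999):
    per_year = allowed_downtime_seconds(pct, SECONDS_PER_YEAR)
    per_month = allowed_downtime_seconds(pct, SECONDS_PER_MONTH)
    print(f"{pct:>8} %  {per_year:>10.0f} s/year  {per_month:>8.0f} s/month")
```

Dividing the results by 60, 3600 or 86400 gives the minutes, hours and days columns.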

Available
  Is a database available when
    It is online in SSMS?
    I can log in?
    I can select data?
    I can update data?
    I can insert data?
    I can change the schema?
    … you get the idea?
  And don’t get me started on defining performance.

When is a database needed?
  Is the database used on the web 24x7, or just in the office from 9 to 5, or just once a month to process payroll?
  Do the availability requirements apply all the time, or just during the periods it’s actually used?

Service Windows
  Specify times when you can service your system.
  The more the better.
  Every night from 11pm till 5am, all day Saturday and Sunday, except the weekend before the year-end results are due.
  First Sunday every month from 2am till 4am.
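A service window is easier to agree on when it is written as a rule that can be checked. The sketch below encodes only the first example above (nightly 11pm to 5am plus all of Saturday and Sunday). How the year-end exception works in detail is an assumption here: any date in the hypothetical `blackout_dates` set allows no maintenance at all.

```python
from datetime import datetime, time

NIGHT_START = time(23, 0)   # 11pm
NIGHT_END = time(5, 0)      # 5am


def in_service_window(moment: datetime, blackout_dates=frozenset()) -> bool:
    """True if maintenance is allowed at `moment` under the example rules."""
    if moment.date() in blackout_dates:
        # Assumption: blackout dates (e.g. the weekend before year-end
        # results are due) block maintenance for the whole day.
        return False
    if moment.weekday() >= 5:
        # Saturday (5) and Sunday (6) are open all day.
        return True
    current = moment.time()
    # The nightly window wraps around midnight.
    return current >= NIGHT_START or current < NIGHT_END


print(in_service_window(datetime(2016, 4, 12, 14, 0)))    # Tuesday afternoon: False
print(in_service_window(datetime(2016, 4, 12, 23, 30)))   # Tuesday night: True
print(in_service_window(datetime(2016, 4, 9, 14, 0)))     # Saturday afternoon: True
```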

Planned versus Unplanned
  Big debate: is planned maintenance part of the yearly downtime or not?
  There is a big difference between the two cases.
  Make sure it’s clearly defined and understood.
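To see how much the definition matters, the same outage log can be scored both ways. This is a sketch under one stated assumption: when planned maintenance is excluded, it is removed from the measured period as well as from the downtime. Other conventions exist, and the outage figures are invented.

```python
from typing import Iterable, Tuple


def availability_two_ways(period_hours: float,
                          outages: Iterable[Tuple[float, bool]]) -> Tuple[float, float]:
    """Return (availability counting all downtime, availability excluding planned).

    Each outage is a (duration_hours, planned) pair.
    """
    outages = list(outages)
    planned = sum(hours for hours, is_planned in outages if is_planned)
    unplanned = sum(hours for hours, is_planned in outages if not is_planned)
    counting_all = (period_hours - planned - unplanned) / period_hours
    # Assumption: planned windows shorten the measured period as well.
    service_hours = period_hours - planned
    excluding_planned = (service_hours - unplanned) / service_hours
    return counting_all, excluding_planned


# Hypothetical year: 8 hours of planned maintenance, 30 minutes of unplanned outage.
both = availability_two_ways(8766, [(8.0, True), (0.5, False)])
print(f"planned counts: {both[0]:.4%}, planned excluded: {both[1]:.4%}")
```

In this made-up year the same system reports roughly 99.90 % under one definition and roughly 99.99 % under the other, which is why the SLA has to say which one applies.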

Monitoring availability
  99.999 % is equivalent to about 5.3 minutes of outage per year, or less than 0.9 seconds per day.
  This requires you to do an availability check at least every 0.4 seconds; otherwise you waste valuable time.
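The monitoring argument is simple arithmetic: a failure can go unnoticed for up to one full check interval, and that time comes out of the downtime budget. A minimal sketch, with illustrative function names and the daily budget computed from a 24-hour day:

```python
SECONDS_PER_DAY = 24 * 3600


def daily_downtime_budget(availability_percent: float) -> float:
    """Seconds of downtime allowed per day at the given availability target."""
    return SECONDS_PER_DAY * (100.0 - availability_percent) / 100.0


def budget_after_detection(availability_percent: float, check_interval: float) -> float:
    """Budget left if a failure goes unnoticed for one full check interval.

    A negative result means detection alone already breaks the daily budget.
    """
    return daily_downtime_budget(availability_percent) - check_interval


print(round(daily_downtime_budget(99.999), 3))          # 0.864
print(round(budget_after_detection(99.999, 0.4), 3))    # 0.464
print(round(budget_after_detection(99.999, 1.0), 3))    # -0.136
```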

Availability
  Having a certain availability vs. guaranteeing it.
  Easy to end up with 100 % availability.
  Hard to guarantee even 99.7 %.

Differentiate between HA, DR and LR
  HA … High Availability
  DR … Disaster Recovery
  LR … Last Resort
  Have different RPO and RTO values for all three cases.
  Define the worst-case scenarios each level has to deal with.

HA … High Availability
  RTO: seconds to minutes
  RPO: zero to seconds
  Automatic failover
  HA site usually close by (< 30 miles)
  Well tested (maybe with each patch or release)

DR … Disaster Recovery
  RTO: minutes to hours
  RPO: seconds to minutes (even hours)
  Manual failover into a prepared environment
  DR site usually several hundred miles away
  Tested from time to time

LR … Last Resort
  RTO: days to weeks
  RPO: minutes to hours (even a whole day)
  Rebuild system from scratch
    Hardware has to be ordered
    Floor space, connectivity to be rented
  LR site usually on a different continent and in a different jurisdiction
  Have a rough plan

Define worst case scenario for HA
  Failure of a single component
  Failure of two components (which are of a different kind)
  Failure of server
  Failure of multiple servers
  Failure of any two components
    That means you need everything at least three (3) times (not so easy for disks)

Define worst case scenario for DR
  Human error
  Failure of server
  Failure of multiple servers
  Partial failure of data center
  Full failure of data center
  Failure of multiple data centers

Define worst case scenario for LR
  Destruction / failure of multiple datacenters
  Natural disaster
  Sabotage
  Political incident (e.g. war, regime change)
  Destruction of planet earth
  …

Outside SQL Server
  Make sure you state that you depend on the underlying infrastructure, and that failures of that infrastructure don’t count against you!
  Make sure no processes are interfering.
  Example: async database mirroring + failover requires an OK for losing data. Only one person is allowed to give the OK. That person is on vacation for 3 weeks, and availability is down to 94.2 %.

Dependencies
  Who needs this database/server?
  What does this server need to operate?
    Power
    Cooling
    Network
    Firewall rules
    Domain Controller
    Other servers (linked server)
    …

Responsibilities
  Who can actually make a decision for a database/server?
  Who owns the data?
  Who needs to be notified if something is wrong?

Backup retention and granularity
  How far must you be able to go back?
    Hours, days, weeks, months, years
  And how accurate must the restore be?
    I need the database restored to November 15, 2008 at 6:27…
  How much time do you need for these historic restores?
  Test them from time to time (you need the resources and time).
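Retention and granularity together decide whether a requested point in time can be reached at all. The sketch below is a simplification that assumes an unbroken log backup chain between the oldest retained full backup and the newest log backup; the function name and all dates are hypothetical.

```python
from datetime import datetime


def can_restore_to(target: datetime,
                   oldest_full_backup: datetime,
                   newest_log_backup: datetime) -> bool:
    """True if `target` lies within the range the retained backups cover.

    Assumption: every log backup between the two timestamps is still
    available, so any point in between can be reached.
    """
    return oldest_full_backup <= target <= newest_log_backup


# Hypothetical request for a restore to a morning in November 2008.
target = datetime(2008, 11, 15, 6, 27)
newest_log = datetime(2016, 4, 9, 12, 0)

print(can_restore_to(target, datetime(2008, 11, 1), newest_log))   # True
print(can_restore_to(target, datetime(2009, 1, 1), newest_log))    # False
```

If only full backups are kept for older periods, the reachable points shrink to the backup times themselves, which is the granularity question the slide raises.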

Tips for keeping the SLA
  Make sure your monitoring and alerting works.
  Monitor your monitoring.
  Test your HA, DR and LR solutions regularly, and especially after every change to your infrastructure.

Surprise! Database Recovery
  It depends
    On the oldest uncommitted transaction
    On the number and size of transactions in flight that need rollback
  So somewhere between
    Less than a second
    Several weeks
  Surprise!

DR and LR Instructions
  Keep printed copies in several places. Trust me, your electronic documentation won’t be there when you are in a DR or LR situation.

Summary
  Remember Rule Number One: SLA first, solution later.
  If you already have a solution, don’t agree to an SLA the solution can’t support.
  The laws of physics apply (even to the best DBA :-)

A good bare minimum SLA
  SLA (HA, DR, LR)
    RTO: Seconds, Minutes, Hours, Days
    RPO: Close to zero, Seconds to minutes, A lot
  Failure modes (single failures)
    Single component failure: HA
    Server failure: HA (DR)
    Datacenter failure: DR
    Network failure between datacenters: HA/DR
  Data loss/corruption
    Deletion: HA/DR/LR
    Corruption: DR (LR)
    Sabotage: LR (DR)
  Concurrent failures
    Any different double failure: HA -> DR
    Any double failure: DR -> LR
    Any triple failure: LR
  Priorities
    Recoverability (no data loss)
    Availability (keep going)
    Performance (keep running)
  Defined maintenance windows and adjusted values based on the day and time of day

Sample SLA
  SLA                          HA             DR                              LR
  Outside maintenance window
    RTO                        < 30 seconds   < 30 minutes                    3 days
    RPO                        Close to zero  Up to 1 minute                  Up to one day
  During maintenance window
    RTO                        N/A            30 minutes after end of window  3 days
    RPO                        N/A            Before maintenance window       Up to one day
  Failure modes (single failures)
    Single component failure: HA
    Server failure: HA
    Datacenter failure: DR
    Network failure between datacenters: DR
  Data loss/corruption
    Deletion: DR
    Corruption: DR
    Sabotage: LR
  Concurrent failures
    Any different double failure: HA
    Any double failure: DR/LR
    Any triple failure: LR
  Priorities
    Recoverability (no data loss)
    Availability (keep going)
    Performance (keep running)
  Defined maintenance windows
    Saturday 1pm till Sunday 9pm, except if EOM
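An SLA like this sample can also be kept as data, so that "which tier handles this failure, and what are its targets?" becomes a lookup. The sketch below transcribes the first set of RTO/RPO values and the single-failure mapping from the sample; the dictionary layout and the function name are illustrative, and "close to zero" is modelled as exactly zero.

```python
from datetime import timedelta

# First set of RTO/RPO values from the sample SLA, per tier.
RTO = {
    "HA": timedelta(seconds=30),
    "DR": timedelta(minutes=30),
    "LR": timedelta(days=3),
}
RPO = {
    "HA": timedelta(0),          # "close to zero", modelled as zero here
    "DR": timedelta(minutes=1),
    "LR": timedelta(days=1),
}

# Single-failure modes and the tier that has to handle them.
TIER_FOR_FAILURE = {
    "single component failure": "HA",
    "server failure": "HA",
    "datacenter failure": "DR",
    "network failure between datacenters": "DR",
    "deletion": "DR",
    "corruption": "DR",
    "sabotage": "LR",
}


def recovery_targets(failure: str):
    """Return (tier, RTO, RPO) for a failure mode listed in the SLA."""
    tier = TIER_FOR_FAILURE[failure]
    return tier, RTO[tier], RPO[tier]


tier, rto, rpo = recovery_targets("datacenter failure")
print(tier, rto, rpo)   # DR 0:30:00 0:01:00
```

A failure mode missing from the table raises a `KeyError`, which mirrors the point of the SLA: anything not listed has not been agreed.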

THANK YOU! and may the force be with you… Questions? thomas.grohser@nttdata.com tg@grohser.com