Schrödinger’s Backup Will your recovery work?

Presentation transcript:

Schrödinger’s Backup Will your recovery work? Patrick Flynn SQL Saturday South Island 8th April 2017

Thank you to our sponsors: Gold Sponsors Silver Sponsors Bronze Sponsors

Please fill out your evaluation forms. You have them in your A4 pack from registration. Please put them in the box at the front of the room. There are spot prizes for completed evaluation forms. Patrick Flynn | Schrödinger’s Backup SQL SATURDAY | #614 | South Island 2017

Who am I
Patrick Flynn
Twitter: @sqllensman
Email: sqllensman@outlook.com
MCM – SQL Server 2008
MCSM – Data Platform
Production DBA for 10+ years

Schrödinger’s cat
A thought experiment devised by Austrian physicist Erwin Schrödinger in 1935. A cat, a flask of poison and a radioactive source are placed in a sealed box. If an internal monitor detects radioactivity, the flask is shattered, releasing the poison that kills the cat. While the box is closed, the cat can be thought of as both alive and dead. Only when the box is opened can its actual state be determined.

Schrödinger’s Backup
Not testing your recovery plan is unknowingly running a Schrödinger’s backup experiment. Until it is tested, a backup can be either good or bad. Only by completing a restore can you be assured that your backup was valid. A failed Schrödinger’s backup experiment will often become an RGE (Resume Generating Event).
GitLab.com (used by 100,000+ organisations), January 31 2017
A tired sysadmin, working late at night in the Netherlands, accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.
Problems Encountered
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- Our backups to S3 apparently don’t work either: the bucket is empty.
- We don’t have solid alerting/paging for when backups fail; we are seeing this in the dev host too now.
So in other words, out of 5 backup/replication techniques deployed, none are working reliably or set up in the first place.
=> We're now restoring a backup from 6 hours ago that worked.
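
Two of the problems above (backup files only a few bytes in size, and an empty bucket) are cheap to detect before any restore is attempted. The following is a minimal sketch, not something shown in the talk: the function name `find_suspect_backups` and the 1024-byte threshold are assumptions chosen for illustration, and a size check only catches obviously broken backups. It does not replace a test restore.

```python
import tempfile
from pathlib import Path

# Assumed threshold for illustration; pick one below your smallest real backup.
MIN_BACKUP_BYTES = 1024


def find_suspect_backups(backup_dir, min_bytes=MIN_BACKUP_BYTES):
    """Return a list of problem descriptions; an empty list means nothing obvious."""
    directory = Path(backup_dir)
    files = sorted(p for p in directory.iterdir() if p.is_file())
    if not files:
        # Mirrors the "bucket is empty" case.
        return [f"{directory}: no backup files found"]
    # Mirrors the "files only a few bytes in size" case.
    return [
        f"{p.name}: only {p.stat().st_size} bytes"
        for p in files
        if p.stat().st_size < min_bytes
    ]


# Demo with throwaway files: one plausible backup and one few-byte file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.bak").write_bytes(b"x" * 4096)
    (Path(tmp) / "bad.bak").write_bytes(b"oops")
    demo_problems = find_suspect_backups(tmp)
    print(demo_problems)

# Demo with an empty directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_problems = find_suspect_backups(tmp)
    print(empty_problems)
```

A check like this only helps if someone is alerted when the list is not empty.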

Restore Strategy Requires
- A defined RPO* and RTO*
- Regular and correct backups
- A tested restore process (automated and manual)
- Documented processes and procedures
* RPO – Recovery Point Objective (how much data loss is acceptable)
* RTO – Recovery Time Objective (how much downtime is acceptable)

A Defined RPO* and RTO*
RPO (Recovery Point Objective): in terms of time, how much data are we willing to lose?
RTO (Recovery Time Objective): our goal for how quickly we can restore the database.
These are a business decision!
You need to know how long a restore will take if everything but the backup files is gone. This includes:
- Time to get the backup files from offsite / tape / Data Domain / NAS, etc.
- Time to copy them to the restore location
- If you don’t have a server to restore to, the time to bring one up and configure it
- How long the restore itself will take
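
The restore-time components listed above add up to a single figure that can be compared against the RTO. The sketch below is illustrative only: the class and field names and the sample durations are assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class RestoreEstimate:
    """Estimated minutes for each step of a restore (hypothetical field names)."""
    fetch_minutes: float      # get backup files from offsite / tape / NAS
    copy_minutes: float       # copy the files to the restore location
    provision_minutes: float  # bring up and configure a server (0 if one exists)
    restore_minutes: float    # run the restore itself

    def total_minutes(self):
        return (self.fetch_minutes + self.copy_minutes
                + self.provision_minutes + self.restore_minutes)


def meets_rto(estimate, rto_minutes):
    """True when the estimated end-to-end restore time fits within the RTO."""
    return estimate.total_minutes() <= rto_minutes


# Made-up sample durations, in minutes.
estimate = RestoreEstimate(fetch_minutes=90, copy_minutes=30,
                           provision_minutes=120, restore_minutes=60)
print(estimate.total_minutes())              # 300
print(meets_rto(estimate, rto_minutes=240))  # False
```

In this made-up example the restore itself is only a fifth of the total, which is why the other steps need to be measured in a fire drill and not guessed.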

Regular and Correct Backups
Backup your databases
- Use CHECKSUM on backups
- RESTORE VERIFYONLY
How do you know when your backups aren’t successful?
- Alerts on failure
- Run reports to check for missing backups
Trace flag 3023
Backup reports: sp_Blitz, SQL script, DBA Reports
Backups: Maintenance Plans, Ola Hallengren, Minion Backup
Building a Centralised Database Maintenance and Monitoring Solution, Manohar Punna, 3:45pm
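
A report for missing backups compares the time of each database's most recent backup against the agreed maximum age. The sketch below shows only that comparison, under stated assumptions: the function name and sample data are hypothetical, and in SQL Server the timestamps would typically come from the backup history in msdb (for example `backup_finish_date` in `msdb.dbo.backupset`), which is not queried here.

```python
from datetime import datetime, timedelta


def find_missing_backups(last_backup_times, now, max_age):
    """Return sorted names of databases whose newest backup is absent or too old.

    last_backup_times maps database name -> datetime of the last successful
    backup, or None if no backup has ever been recorded.
    """
    overdue = []
    for name, finished_at in last_backup_times.items():
        if finished_at is None or now - finished_at > max_age:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical sample data.
now = datetime(2017, 4, 8, 9, 0)
last_backups = {
    "Sales": datetime(2017, 4, 8, 1, 0),  # 8 hours old
    "HR": datetime(2017, 4, 6, 1, 0),     # more than two days old
    "Archive": None,                      # never backed up
}
missing = find_missing_backups(last_backups, now, max_age=timedelta(hours=24))
print(missing)  # ['Archive', 'HR']
```

Treating "never backed up" as overdue matters: a new database that was never added to the backup job produces no failure alert at all.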

Regular and Correct Backups: 3-2-1 Backup Rule
How much do you lose if even just one backup file goes bad?
The accepted rule for backup best practice is the three-two-one rule. If you’re backing something up, you should have:
- At least three copies,
- In two different formats,
- With one of those copies off-site.
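
The three-two-one rule can be written as a simple check over an inventory of backup copies. This is a sketch under assumptions: `BackupCopy` and its fields are hypothetical, and "format" is interpreted here as the storage medium (disk, tape, cloud), which is one common reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # free-text description of where the copy lives
    media: str      # assumed meaning of "format", e.g. "disk", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventories.
good = [
    BackupCopy("local backup drive", "disk", offsite=False),
    BackupCopy("NAS in server room", "disk", offsite=False),
    BackupCopy("cloud storage", "cloud", offsite=True),
]
bad = [
    BackupCopy("local backup drive", "disk", offsite=False),
    BackupCopy("NAS in server room", "disk", offsite=False),
]
print(satisfies_3_2_1(good))
print(satisfies_3_2_1(bad))
```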

Demos of Backup Scripts: Maintenance Plans, Ola Hallengren, Minion Backup

Tested Restore Process
You must test your restores:
- Restore with CHECKSUM
- Automated restore testing
- Restore to test environments
- Manual testing (fire drills)
- Regular CHECKDB
- Minion CheckDB
The NSA may be backing up your data, but you have not seen a successful restore.
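
An automated restore test usually runs the same few steps for every backup: verify the file, restore it under a scratch name, then check the restored database. The sketch below only generates the T-SQL text for those steps and does not connect to a server. The helper names, the sample path and the scratch database name are hypothetical. A real restore to a different name will usually also need `MOVE` clauses to relocate the data and log files, which are omitted here.

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text):
    """Quote a Unicode string literal, escaping single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def restore_test_statements(backup_path, test_db_name):
    """Return T-SQL statements for a basic restore test, in execution order."""
    path = quote_literal(backup_path)
    db = quote_name(test_db_name)
    return [
        # 1. Check that the backup set is complete and readable.
        f"RESTORE VERIFYONLY FROM DISK = {path} WITH CHECKSUM;",
        # 2. Restore under a scratch name (MOVE clauses omitted in this sketch).
        f"RESTORE DATABASE {db} FROM DISK = {path} WITH CHECKSUM, RECOVERY;",
        # 3. Check the restored database for corruption.
        f"DBCC CHECKDB ({db}) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
    ]


# Hypothetical path and database name.
statements = restore_test_statements(r"D:\Backups\Sales.bak", "Sales_RestoreTest")
for statement in statements:
    print(statement)
```

Step 1 alone is not a restore test: only steps 2 and 3 show that the backup produces a usable database.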

Demo of dbatools: Restore Database

Documented Processes and Procedures
The restore strategy must be documented:
- How and where to restore the data
- Currently used versions of relevant software
- Contact information
- Must be kept up to date
- Must be tested
Having the right data selected for backup, and backups running regularly and correctly, is still only part of knowing you have a good backup that can get you back up and running. You must document the procedure for how and where to restore the data and/or application that you are backing up. Restore documentation shouldn’t just contain information about the restore. It should also include hard copies of currently used versions of relevant software, serial numbers for any software, contact numbers for support, and support contract reference numbers. Any documentation should carry the date it was last updated and tested, because wrong documentation can be worse than no documentation at all. All documentation should have a glossary page explaining the acronyms, so that the reader understands what the writer was trying to say. The restore procedure should also be tested periodically: follow the documentation to the letter, and note and apply any changes it needs. Think of it like a smoke alarm: you are supposed to test your smoke alarms when you change your clocks for daylight saving time. Making sure you are alerted that the building is burning down needs to be at least as important as knowing that you could restore your data if it did.

In Summary
Build a restore strategy. Test it! Document it.
Questions?

Resources
Backup Solutions
- http://minionware.net/backup/
- https://ola.hallengren.com/
- https://www.pluralsight.com/courses/sqlserver-database-maintenance-plans
Backup Monitoring
- https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit
- https://dbareports.io/
Testing Restores
- https://dbatools.io/presentations/
- https://sqldbawithabeard.com/2017/03/20/testing-your-sql-server-backups-the-easy-way-with-powershell-dbatools/
- https://thomaslarock.com/2010/05/statistical-sampling-for-verifying-database-backups/

Resources
Scary DBA
- http://www.scarydba.com/2014/01/20/time-for-a-quick-rant/
- http://www.scarydba.com/2016/03/07/backups-are-a-business-decision/
- https://www.simple-talk.com/sql/backup-and-recovery/backup-verification-tips-for-database-backup-testing/
SQL Skills
- The Accidental DBA (Day 6 of 30): Backups: Understanding RTO and RPO
- The Accidental DBA (Day 7 of 30): Backups: Recovery Models and Backup Types
- The Accidental DBA (Day 8 of 30): Backups: Planning a Recovery Strategy
- The Accidental DBA (Day 9 of 30): Backups: Essential BACKUP Options
- The Accidental DBA (Day 10 of 30): Backups: Backup Testing for Validation
- The Accidental DBA (Day 11 of 30): Backups: Backup Storage and Retention