Rie Irish http://www.riepedia.net/ RieIrish@gmail.com @IrishSQL @PASS_WIT https://www.linkedin.com/in/rie-irish-07834a2/


1 Rie Irish http://www.riepedia.net/ RieIrish@gmail.com
@IrishSQL @PASS_WIT

2 So your boss asked for a copy of your DR plan
So your boss asked for a copy of your DR plan. Once you've wiped that deer-in-the-headlights look off your face, you realize "We've got database backups," isn't exactly a plan. You need to come up with a plan to recover your business when disaster strikes. Today we’re going to discuss building this plan by defining what a disaster could be, documenting the business impact, and identifying your limitations. We will show how to use this information to establish metrics (such as RTO and RPO), document current recovery configurations, and design an effective recovery strategy that meets the needs and budget of your business. Attending this session will give you the knowledge and tools to create an effective disaster recovery plan that will make your boss happy and ensure the continuity of your business. Today we’ll run through a checklist of the questions you should ask & answer at your company to come up with your plan. We’ll build a template that you can use to get started with documenting your DR plan.

3 How to Create your Disaster Recovery Plan -Check List
Define what’s important. Define stakeholders. Define Critical Systems. Define RPO & RTO. Document your Infrastructure. Build a Back Up Strategy. Build a Recovery Strategy. Define Disasters. Lay out plan for budgeting & building the infrastructure. Test your plan. I say this is a beginner session because I won’t get too deep into any particular aspect of your DR needs. Every business is different, so what works for one company won’t necessarily be a good fit for yours. We’ll cover things at a higher level & provide you with the tools you need to help your company build a DR plan. My goal today is to give you an outline for your plan and a map to fill in the blanks yourself. We’ll talk about what’s important, define some DR terms, stakeholders, backups and recovery.

4 According to ISO/IEC 27031, the global standard for IT disaster recovery, “Strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery and restoration are put in place.” Your strategy defines WHAT you will do while the plan describes the HOW. As you move through answering all the questions posed in the upcoming pages, keep track of what is identified as important by each team and department. Ultimately, that team may not get to say what goes into immediate business recovery plans. You’ll need this information to submit to those in the company that do make those decisions. You’ll need a clear definition of who thinks a system or process is vital, how it will impact their team and the level of effort it will take to maintain an appropriate level of backup & estimated recovery. The homework you do now will directly affect how decisions on the final plan are made.

5 Knowing what’s important – before a disaster
Knowing how long you can be down and how much data you can lose. Defining what systems and data are important. Knowing your budget. Who is responsible for declaring a disaster? Who is responsible for the actual recovery work? Does everyone understand the role they’ll play? Where do you store your plan, server lists, code, team contact information? How will you access it once disaster strikes? There are lots of questions to ask before you have a disaster. They’re even more important before you build your disaster recovery plan. This is the WHO, WHAT, WHEN & WHERE of your Disaster Recovery Plan. If you can answer the questions on this page, you’re a lot further along than you thought. Your next step is going to be getting it all documented & disseminated. If you can’t answer all those questions, you now have a great place to start. You have to know what’s important. Not just to you as a data professional but to your company as a whole.

6 Who are my stakeholders?
The people who ask & answer the questions: C-level Execs. Who can answer all the questions we’re asking today? Probably not the marketing or billing department. The execs have to answer to the Board after a disaster. They’ll also be the people that have to pay for it.
The people that get the system up and running: Database team, System Administrators, Storage & Infrastructure, Networking, Security, etc. They’re the ones that are going to have to implement your DR plan. They’d better be involved in the development, ongoing review and at least yearly practice of your company’s plan.
The people affected by data loss/system outage: Application Owners, Business Users, Clients (Internal & External). Go through each application/system and ask yourself “Who cares?” then follow up with that person. The users are both internal & external. If your system is down, who can’t do business?

7 Define What’s Critical
If this data was lost, could your department or the company function without it? Long-term or short-term? If this system (application) were down, could your department or the company continue to operate? How long of an outage of system A, B or C is acceptable, before the customer is impacted? Before the company stops making money? Before the company starts losing money? HR dept – employee data & payroll. Accounting dept – billing clients & paying bills. Sales dept – is your CRM database in the cloud, or local?

8 Define what’s critical
Would loss of this data or functionality affect clients and/or sales? Would loss of this data or functionality have financial or regulatory repercussions? Is there any short term alternative for this particular functionality?

9 Questions?

10 RTO & RPO
The Recovery Time Objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity. The Recovery Point Objective (RPO) is the previous point in time to which service must be restored, and defines an amount of data loss which is acceptable to the business. These are terms you see thrown around a lot in sessions, in meetings, on Twitter. First the definitions, then what they actually mean to you.

11 RTO: how Long can you be down?
Or put another way, how long until you have to come back up? What’s the business definition of RTO? What’s the nature of your disaster? Disaster vs Disruption: Is this a BIG D disaster or little d disaster? This is where you might have your VP or even the CEO watching over your shoulder. How long until we’re back online? What are you doing now? What are you going to do next? Is this taking longer than you expected? How long until this finishes? If you’ve ever been in that position, you know this only adds to your stress. Don’t be afraid to very politely tell them they aren’t helping. It’s important to establish baselines ahead of time: copying, restoring, rebuilding, etc. So how long can you be down? What kind of time can pass before you’re fully functional? These are questions to ask AND answer before disaster strikes. You have to develop your plan to deal with all kinds of disasters. Ask around! Chances are good teammates have experience dealing with all different kinds of disasters. Have them help you come up with different scenarios. Then plan. Did you lose a database, an instance or did someone kick the storage out from under everything? Did a tornado hit your main data center and you’re having to fail over to your DR site? Are you the victim of a Denial of Service attack, either directly or indirectly?
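One way to make those baselines usable when someone is watching over your shoulder is to write the timings down and add them up. The sketch below is only an illustration: the step names and durations are invented placeholders, and it assumes the steps run one after another. Replace them with numbers measured in your own environment.

```python
from datetime import timedelta

# Hypothetical baselines; replace with timings measured during practice runs.
BASELINES = {
    "copy backups to DR site": timedelta(minutes=45),
    "restore full backup": timedelta(minutes=90),
    "restore differential and logs": timedelta(minutes=30),
    "repoint applications and smoke test": timedelta(minutes=20),
}


def estimated_recovery(steps: dict[str, timedelta]) -> timedelta:
    """Total time if the steps run sequentially (no parallelism assumed)."""
    return sum(steps.values(), timedelta())


def fits_rto(steps: dict[str, timedelta], rto: timedelta) -> bool:
    """True when the summed baselines fit inside the RTO."""
    return estimated_recovery(steps) <= rto


if __name__ == "__main__":
    print(f"Estimated recovery: {estimated_recovery(BASELINES)}")
    print(f"Fits a 4 hour RTO: {fits_rto(BASELINES, timedelta(hours=4))}")
```

If the total does not fit the RTO the business asked for, that is the number to bring to the conversation before disaster strikes.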

12 RPO: how much data can you lose?
This answer will vary across the business: from none to minutes, hours and days. Sometimes rebuilding is easier than restoring. So how do you know? Executives & legal should help define this. If you can’t meet requirements, BE HONEST. For some systems, like payments and healthcare systems, the answer is zero. The approach and expense for those systems is greatly different than others. The infrastructure, build & design will vary greatly when you’re allowed to lose milliseconds of data. In these cases you’ll likely need to build for DR and maintenance type of fail over at the same time. For slow changing systems – you can restore on Tuesday from a Sunday full backup & a Monday night differential. How do you know? Notice I say HELP define this. Execs can tell you all day how much you can lose or how long you can be down, but your team has to put this into practice. If they’re asking for the impossible, TELL THEM SO. Often, there are contractual obligations to external clients. These are usually outside of your control, so you have to work with what they give you. Encourage your sales & legal departments to work with you while they prep a contract. If they’re promising 5 9’s but you can only guarantee 3 9’s then they need to know. For sales, execs and finance people, use words like “clawback” and “refund of fees”. If they’re the legal department, say “violation of terms”; that usually gets their attention.
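To put the 5 9’s versus 3 9’s gap into terms a sales or legal team can picture, it helps to convert the percentages into minutes of allowed downtime per year. This is a small arithmetic sketch, assuming a 365-day year and counting all downtime against the budget; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 -> 99.9%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Three nines allows roughly 8.8 hours of downtime a year; five nines allows a little over 5 minutes. That difference is what the contract is really promising.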

13 Maximum Tolerable Period of Disruption (MTPD)
MTPD - the maximum amount of time that an enterprise's key products or services can be unavailable or undeliverable after an event that causes disruption to operations, before its stakeholders perceive unacceptable consequences. Simply put, this is when you’re down so long the business viability is irreparably harmed. MTPD is going to be greater than your RTO & RPO and is considered the end of the road, resume-generating, maximum acceptable outage point on the timeline.
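Once RPO, RTO and MTPD are written down per system, a quick sanity check can catch targets that contradict each other or the backup schedule. The sketch below is an assumption-laden illustration: the class, field and function names are hypothetical, and it treats the worst-case data loss as one full log backup interval, which ignores how long the backup itself takes to run.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    system: str
    rpo: timedelta   # how much data loss is acceptable
    rto: timedelta   # how long until service must be restored
    mtpd: timedelta  # outage length at which the business is irreparably harmed


def check_targets(t: RecoveryTargets, log_backup_interval: timedelta) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    issues = []
    # Simplification: worst-case loss is treated as one backup interval.
    if log_backup_interval > t.rpo:
        issues.append(
            f"{t.system}: log backups every {log_backup_interval} "
            f"cannot meet an RPO of {t.rpo}"
        )
    if t.rto >= t.mtpd:
        issues.append(f"{t.system}: RTO {t.rto} should be shorter than MTPD {t.mtpd}")
    return issues


if __name__ == "__main__":
    billing = RecoveryTargets(
        "billing",
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=4),
        mtpd=timedelta(hours=24),
    )
    print(check_targets(billing, log_backup_interval=timedelta(minutes=30)))
```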

14 Questions?

15 Define & Document Technology
Document your Physical Structure/Facilities: Can you describe your company’s hardware set up? You shouldn’t have to do this from memory. Create a detailed list that includes data centers, Co-Lo facilities & office locations. Define & Document Technology: server name, OS, IP address, general overall system purpose (web server, application server, load balancer, etc.), installed software & version.
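The server details listed on this slide map naturally onto a small record that can be exported and stored somewhere you can still reach during a disaster. This is a minimal sketch, not a recommended tool: the `ServerRecord` class, the `to_csv` helper and the sample server are all hypothetical.

```python
import csv
import io
import ipaddress
from dataclasses import dataclass, field


@dataclass
class ServerRecord:
    name: str
    os: str
    ip: str
    purpose: str  # e.g. web server, application server, load balancer
    software: dict[str, str] = field(default_factory=dict)  # product -> version

    def __post_init__(self) -> None:
        # Raises ValueError early if the address is mistyped.
        ipaddress.ip_address(self.ip)


def to_csv(records: list[ServerRecord]) -> str:
    """Render the inventory as CSV text that can be printed or stored off site."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "os", "ip", "purpose", "software"])
    for r in records:
        software = "; ".join(f"{p} {v}" for p, v in sorted(r.software.items()))
        writer.writerow([r.name, r.os, r.ip, r.purpose, software])
    return out.getvalue()


if __name__ == "__main__":
    inventory = [
        ServerRecord("SQLPROD01", "Windows Server 2019", "10.0.0.12",
                     "database server", {"SQL Server": "2019"}),
    ]
    print(to_csv(inventory))
```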

16 Define Suppliers, Delivery Services, Contract Technicians –
Compile a list of contact information for suppliers, account reps, etc. Establish Responsibilities Operations DBA Team Network The most overlooked critical piece

17 Establish contact Plan
Critical Incident Bridge ALL CALL ;439765
Team/Task Owner Team Member Contact
Operations
  Web Servers: Jean Grey
  Application Connectivity: Diana Prince
  Storage: Carol Danvers
Client Services
  Client Failure/Connectivity: Natasha Romanova
Database Administration
  SQL Server Availability Groups: Wanda Maximoff
  Application Deployment Issues: Harleen Quinzel
Source: Fictitious data, for illustration purposes only

18 Building a backup strategy --Any DBA is as good as their last backup
Start with the basics: Based on your RPO, how frequently do you need to take backups? Is a better solution for you a High Availability option like Availability Groups? We’re not just talking database backups. If you use it, you’ll need it. Now that you know how much data you can lose and how long you’ve got to get things back up and running, let’s talk about making that happen. Can you lose up to 15 minutes of data? Can you take a few hours to come back online? Define your database backup procedures. Then document them. Make sure every server is following your own rules. Full backups on Saturday night. Differentials on weeknights. Transaction logs every 15 minutes. Then make sure you know how to back up the tail end of the log. Sure, all DBAs know that’s the answer to an interview question. Most DBAs have never had to do it. Can you do it? More importantly, can you do it at 2am? Do you use Native backups or a product like Idera SQL Safe Backup? Trust me, having a point & click to restore solution when you’re bleary eyed is a big win. High Availability doesn’t replace your need for backups. Let me repeat that. If you have an AG, you’re still going to need backups. But this can change how you plan for disaster. The decisions you make before implementation really matter. Automatic failover or manual? You may want to put your Asynchronous copies in a secondary data center. Same subnet? Do you span everything across or just the “vital stuff”?
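With a schedule like the one described here (weekly full, weeknight differentials, logs every 15 minutes), it is worth being able to say which files you need to reach a given point in time. The sketch below is a simplified illustration under loud assumptions: it orders backups by finish time, whereas SQL Server actually links a restore chain by log sequence numbers, and the `Backup` class, file names and dates are all invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str  # "full", "diff" or "log"
    finished: datetime
    path: str


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Latest full, latest differential after it, then logs up to the target."""
    ordered = sorted(backups, key=lambda b: b.finished)
    fulls = [b for b in ordered if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup finished before the target time")
    full = fulls[-1]
    diffs = [b for b in ordered
             if b.kind == "diff" and full.finished < b.finished <= target]
    chain = [full] + diffs[-1:]
    base = chain[-1].finished
    for b in ordered:
        if b.kind == "log" and b.finished > base:
            chain.append(b)
            if b.finished >= target:
                break  # this log covers the target time
    return chain


backups = [
    Backup("full", datetime(2024, 1, 6, 22, 0), "full_sat.bak"),
    Backup("log", datetime(2024, 1, 8, 21, 45), "log_mon_2145.trn"),
    Backup("diff", datetime(2024, 1, 8, 22, 0), "diff_mon.bak"),
    Backup("log", datetime(2024, 1, 9, 9, 0), "log_0900.trn"),
    Backup("log", datetime(2024, 1, 9, 9, 15), "log_0915.trn"),
    Backup("log", datetime(2024, 1, 9, 9, 30), "log_0930.trn"),
    Backup("log", datetime(2024, 1, 9, 9, 45), "log_0945.trn"),
]
chain = restore_chain(backups, target=datetime(2024, 1, 9, 9, 20))
print([b.path for b in chain])
```

If the target time is later than the last log backup, the chain simply ends at that last log; that gap is the case where the tail-of-the-log backup mentioned above comes in.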

19 It’s not just about databases
Active Directory, External Files, Service Accounts, Encryption keys, SQL Agent jobs, Passwords, Create Logins Script, Contact information, Run Book, Linked Server Info, Drive Layouts, Restore Scripts, Backup Locations, Application Configs, Development Code. Defining things other than DBs means an end-to-end examination of the business: Active Directory, Service Accounts, SQL Agent jobs, Application configuration files, development code base, external files, encryption keys, passwords, contact information, etc. If the database team owns it on a day to day basis, it’s your job to script it ahead of time. If another team owns it, then make sure they maintain an up to date copy of their development code, any external files, passwords, encryption keys, etc.

20 Building your restore strategy Any DBA is only as good as their last restore
Establish recovery baselines. Practice recovery. Prove that your backups work! Code if you’re using native backups. A step by step guide if you’re using an add-on tool. You need to restore databases from backups. If it’s stored locally, then you’re ready to go. Do you need it copied to your DR site? If you do, it should already be there. You’re ready to restore databases. How long does that normally take? Remember that hovering executive? It would be great to have that information available. Practice restoring! This is one of many reasons you have a test box or a staging environment! The biggest payoff to practicing a restore is knowing that your backups work. Routinely restore database backups to a test environment. Run a DBCC check on it to make sure it’s a good one. Do you use a native database backup solution? If so, are you confident of the restore code you need? Add-on product? Make sure the agent is installed on a server you can use in case of a disaster. If you can’t do this, go ahead and use the Idera SQL Backup wizard to restore a backup today, only stop just short of hitting the “GO” button. Instead, choose to script it. Save that output file. You’ll thank me later.

21 How do you get to Carnegie hall
How do you get to Carnegie hall? Or at least get your environment back up? PRACTICE! PRACTICE! PRACTICE!! Some sage advice: Allan Hirt phrased this better than I’ve ever seen. I now have a printed copy of that tweet pinned up in my office. That being said, this seemed to be the one thing everyone agrees on. Document. Practice. Automate. Document. Practice again. Automate more. If your business can’t support a full failover test, there are other options. Consider a tabletop test. Don’t just practice a full disaster. Practice for the little ones. There are lots of little things that can bring a business down. Corruption. Loss of transactions. Denial of Service. Loss of accountability. Loss of reputation.

22 Defining Disaster
"If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." -- Edward A. Murphy Jr
55% Hardware Failure, 22% Human Error, 18% Software Failure, 5% Natural Disaster

23 Disruptions vs Disasters
DDOS, ISP goes down, Power Outage, Natural disaster, Unnatural disaster, Storage corruption, Drive Failure, Malicious insider, Accidental code change. Denial of Service, ISP down & power outage – not the DBA’s problem, but it’s your company’s problem. And if it’s bad enough that you need to move to your secondary data center, it just became your problem. Natural disaster? Is your data center in Florida? Hurricanes. California has earthquakes. Texas has tornadoes. Georgia has the Atlanta Falcons. Unnatural disaster? Ever seen Sharknado? Titanic? That Dwayne “The Rock” Johnson movie where there’s an earthquake and a tsunami or whatever? Malicious insider. Also known as a pissed off employee. They can cause real issues deleting or modifying data that is vital to your company. Be prepared to do a table level restore or a side by side restore to recover data that may have been compromised. Raise your hand if you’ve ever forgotten a where clause. Okay, put your hand down because I can’t see you. Accidental code change. This can happen during a deployment, even a DBA accidentally modifying a stored proc or Ops modifying a config file. What where clause?

24 Building it out -- What’s your budget
Can you replicate production? Wants vs Needs. Hardware, Licensing and Maintenance. Let’s be honest, most companies can’t afford to build a hot standby environment that’s ready to go at a moment’s notice. If your company can, good for you. But if they’re like many companies, DR is currently sharing real estate with Staging or QA. It’s not a hot site. Or it barely has the processing power you’d need to run your business. Lay out what you want your DR site to look like. How you need it to function. How you’re going to keep it up to date. Then lay out what you HAVE to have for your disaster recovery site. I think the final product will lie somewhere in the middle. This means accounting for enough storage space for live databases and backups. Enough web servers to run your applications. Don’t forget to factor in enough time and people to build this out and maintain it on a monthly basis. You have to patch all those servers too. You’ll need to keep versions consistent across these environments. It would be a costly mistake to try to fail over anything and realize things aren’t backward compatible. Lastly, don’t forget licensing costs. Not everything makes you pay for a cold standby, but don’t assume that’s the case. How much will it cost you to build? Definitely less than it will cost you in lost revenue, client trust or public relations.

25 Then Reality sets in…
Begin by describing your gold standard set up. Break that list down so you can more easily identify your wants versus your needs. Lay out what you WANT your DR site to look like and how you want it to function. Identify how you're going to keep it up to date. Then lay out what you NEED to have for your disaster recovery site.

26 Hardware, licensing & maintenance
Don't forget hardware, licensing and maintenance. Plan for enough storage space for the live databases and backups. You'll need enough web servers to run your applications. Don't forget to factor in enough time & the people required to build this out and maintain it on a monthly basis. You'll have to patch all those servers & keep versions aligned.

27 Based on cost, many companies are considering utilizing the cloud for their DR site. This is sometimes referred to as Disaster Recovery as a Service (DRaaS). Smaller companies find that the usage-based cost of cloud services is a great fit for their DR needs. In this model, the secondary infrastructure is idling most of the time. The cost savings are most noticeable in the reduced need for data center space, IT infrastructure and on-site resources. Migrating your DR site to the cloud enables smaller companies to develop disaster recovery solutions that look more like those of larger, enterprise software companies. It’s important to note that migrating to the cloud is far from perfect. It raises additional questions, particularly with security. Are data files stored & transmitted securely? Does your cloud provider meet specific regulatory requirements for security that are required for your industry? How is access to the data controlled? Password protected? Is there two-factor authentication?

28 So you’ve got a plan? Where do you store it?
Where is your plan stored? Hard or soft copy? What’s in this so-called plan? Who has access? Who can modify it? Do you publish it for clients? I know this sounds like an obvious question, but seriously, where is it stored? Is it on a network share? Data center? On a drive you need special permissions to access? I hope your storage isn’t what went down. Can you access your data center or did a hurricane take that out because it’s in Florida? You’ve lost Active Directory, do you even have permissions to access that drive? Fine, I get it. We’ll just print it. Great! Who gets a copy? How often do you update it? Monthly? Quarterly? Put it in a binder? Yeah, IT folks aren’t great with paper copies. Let’s talk access. Does everyone involved get a copy? That might be a security violation… or just a security nightmare. Odds are pretty good it will contain user names, permissions, responsibilities, IP addresses, etc. Your DR plan might be a “how to hack our system” guide if it falls into the wrong hands. You’d really need to be careful about who gets access to this. Unless you’re an executive, don’t make this decision yourself. They get paid a lot of money to make the call on something like this. Find that balance and make sure it’s accessible to the right people at the right time.

29 Disaster Recovery Plan checklist.
Define what’s important. Define your stakeholders for each part of the business. Define your critical systems. Define your company’s RTO & RPO. Document your infrastructure. Document your Technology. Document your contractors, suppliers & service providers. Build a backup strategy. Build a recovery strategy. Establish responsibilities. Compile your plan from what you’ve learned. Test your plan in part, in full or in theory. Build it out. Publish your plan.

30 Plan complete. Get yourself a drink.
If only it were that simple. Any disaster recovery plan is complete for only a moment. It’s a living, breathing document. Businesses change so your disaster recovery plan should too. It’s constantly changing as your environment changes, contracts and expands. Set aside time quarterly for your team to review the document. Encourage your company to build time into a project plan for DR planning every time they build out new systems or develop new products.

31 Questions?

32 RieIrish@gmail.com http://www.riepedia.net/ @IrishSQL @PASS_WIT

