Chapter 9 Business Continuity Planning and Disaster Recovery
BCP and DR (770) An organization is dependent on resources, personnel and tasks performed on a daily basis to be healthy and profitable. Loss or disruption of these resources can be detrimental, causing great damage or even complete destruction of the business. A business MUST have a plan to deal with unforeseen events.
BCP and DR (770) Business Continuity Planning is a broad approach to ensure that a business can function in the event of disruption of normal data processing operations. Disaster Recovery Planning is a subset of BCP. The goal of a DRP is to minimize the effects of a disaster and take necessary steps to ensure that the resources, personnel and business processes are able to resume operation in a timely manner.
Terms for This Chapter Business Continuity Plan – a document describing how an organization responds to an event to ensure critical business functions continue without unacceptable delay or change. Business Continuity Planning – planning to help organizations identify the impacts of potential data processing and operation disruptions and data loss, and formulate recovery plans to ensure the availability of data processing and operational resources. (more)
Terms Business Impact Analysis – Process of analyzing all business functions within the organization to determine the impact of a data processing outage. Business Resumption Planning – BRP develops procedures to initiate the recovery of business operations immediately following an outage or disaster. (more)
Terms (pg 665 ISC book) Contingency Plan – a document providing the procedures for recovering a major application or information system network in the event of an outage or disaster. Continuity of Operations Plan – A document describing the procedures and capabilities to sustain an organization's essential strategic functions at an alternate site for up to 30 days. (more)
Terms Crisis Communications Plan – A document that outlines the procedures for disseminating status reports to personnel and the public in the event of an outage or disaster. Critical System – The hardware and software necessary to ensure the viability of a business unit or organization during an interruption in normal data processing support. (more)
Terms Critical Business Functions – The business functions and processes that MUST be restored immediately to ensure the organization's assets are protected, its goals are met and the organization is in compliance with any regulations and legal responsibilities. (more)
Terms Cyber Incident Response Plan – strategies to detect, respond to and limit the consequences of cyber incidents. Disaster Recovery Plan – A plan that provides detailed procedures to facilitate recovery of capabilities at an alternate site. Disaster Recovery Planning – The process to develop and maintain a Disaster Recovery Plan (more)
Objectives of the BCP (771) The objectives of BCP are the following: provide an immediate response to emergency situations; protect lives and ensure safety*; reduce business impact; resume critical business functions; reduce confusion during a crisis; ensure survivability of the business; get up and running ASAP after a disaster
Business Continuity Planning
BCP Overview (771) The goal of a BCP is ultimately to help a company resume its business functions as soon as possible after a damaging event. If you think about it, a BCP is really part of the larger “security” program. As such, a BCP should be part of the security policy*
Steps in BCP (overview) (772) ISC states 5 Phases in BCP. We will outline them now, and detail them later. 1. Project Initialization – establish a project team and obtain management support 2. Conduct BIA – identify time-critical business processes and determine maximum “outages” 3. Identify Preventative controls 4. Recovery Strategy – identify and select the appropriate recovery alternatives to meet the recovery time requirements. (more)
Creating the BCP (overview) (772) 5. Develop the contingency plan – document the results of the BIA findings and recovery strategies in a written plan 6. Testing, Awareness, and Training – establish the processes for testing the recovery strategies, maintaining the BCP, and ensuring that those involved are aware and trained in the recovery strategies. 7. Maintenance – maintain the plan
BCP: Phase 1 (776) Project Management and Initialization: In this step we must solidify management's support, because without management support, NOTHING will be successful. Develop a “Continuity Planning Policy Statement” – lays out the scope of the BCP project, roles and members, and goals. (more)
BCP: Phase 1 (776) We then must identify a “Business Continuity Coordinator”* (the BCP team leader) Establish a BCP team What types of people/roles should be on the team… Can anyone think of certain positions that should make up the team? (pg 776) Which people will be chosen for the team (more)
BCP: Phase 2 (BIA) (778) Phase 2 of the BCP steps is to conduct a Business Impact Analysis. In short, this step outlines what procedures and resources the company depends on, how important each process is and how long the business can do without each resource. The formal steps are covered next.
Phase 2: BIA (overview) (778) 1. Select individuals to interview to determine what processes* we have to protect 2. Create data gathering techniques to gather data about these processes 3. Identify the company's critical business functions/processes 4. Identify the resources these processes depend on (more)
Phase 2: BIA (overview) (778) 5. Calculate how long these functions can survive without these resources 6. Identify vulnerabilities and threats to these processes 7. Calculate the risk for each business process 8. Document findings and report them to management
BCP Phase 2: Step 1 (779) – Select Interviewees In this step the BCP committee needs to identify the types of people that will be part of the BIA gathering sessions. These people should represent the different departments that make up the business. After determining the general roles, we need to find the actual employees that fill these roles, so we can interview them.
BCP Phase 2: Step 2 – Determine Information Gathering Techniques In this step the BCP team must create data gathering techniques to use when interviewing and gathering other information to support the BCP objectives. (surveys, questionnaires etc)
BCP Phase 2: Step 3 – Identify Critical Business Functions Based on the information gathered by the interviews and the data gathering techniques, we need to now identify which business processes and functions are critical for the successful operation of the business.
BCP Phase 2: Step 4 – Analyze Information Once we know what the important processes are, we need to determine the resources* that these processes depend upon. These resources can be all kinds of things such as servers, data, people, buildings etc! (not just IT related things) Determine “cost”, whether qualitative or quantitative
BCP Phase 2: Step 5 – Determine MTD and prioritization (781) Now we need to prioritize and calculate the maximum time we can survive without the business processes identified in Step 3. This maximum time is called the “Maximum Tolerable Downtime (MTD)*”. Some common MTD classifications follow. Keep in mind when prioritizing things, we have to use quantitative and qualitative analysis to determine just what is critical. For example, loss of some process might not cause immediate financial loss, but could damage reputation or competitive advantage, and that damage could be devastating. (more)
BCP Phase 2: Step 5 (782) Here are some common MTD classifications that you should memorize* Critical: 1 – 4 hours; Urgent: 24 hours; Important: 72 hours; Normal: 7 days; Nonessential: 30 days
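To make the classification concrete, here is a minimal Python sketch that maps a tolerable downtime (in hours) to one of the classes above and sorts functions so the shortest MTD comes first. Treating each value as an upper bound is an assumption of this sketch, and the function names and sample business functions are hypothetical.

```python
# Upper bounds in hours, taken from the classification list above.
# Treating them as "up to and including" is an assumption of this sketch.
MTD_CLASSES = [
    ("Critical", 4),
    ("Urgent", 24),
    ("Important", 72),
    ("Normal", 7 * 24),
    ("Nonessential", 30 * 24),
]


def classify_mtd(tolerable_hours):
    """Return the first class whose upper bound covers the tolerable downtime."""
    for name, upper_bound in MTD_CLASSES:
        if tolerable_hours <= upper_bound:
            return name
    # Anything that can wait longer than 30 days is still nonessential.
    return "Nonessential"


def prioritize(functions):
    """Sort (name, tolerable_hours) pairs so the shortest MTD comes first."""
    return sorted(functions, key=lambda item: item[1])


# Hypothetical business functions, for illustration only.
functions = [("Payroll", 72), ("Order entry", 2), ("Archive search", 500)]

for name, hours in prioritize(functions):
    print(name, hours, classify_mtd(hours))
```

In a real BIA the hours would come from the interviews and analysis in the earlier steps, not from a hard-coded list.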
BCP Phase 2: Step 6 - Threats Now we need to identify vulnerabilities and threats to these processes and the resources that are required for them. (Remember Risk Management/Risk Analysis!) On the next slide we will examine some example threats.
BCP Phase 2: Step 6 Some examples are: equipment malfunction; hacking; failure in utilities (power, WAN connections); critical personnel becoming unavailable; vendors going out of business; data corruption; physical damage (hurricane, earthquake)
BCP Phase 2: Step 7 Determine the probability/risk for each business function.
BCP Phase 2: Step 8 Once we have done this research, we must document and provide our findings to management. Note that at this point we really have not started creating a Business Continuity Plan yet. We’ve just done the research. Once management reviews the findings and gives the OK to proceed, we will actually develop the plan*
BCP Phase 3: Identify Preventative Controls (786) Pretty straightforward, though a lot of work. Now that we know what we need to protect and the threats involved, we look at ways to PREVENT these problems from occurring, so we never have to worry about dealing with them. This is really just doing a Risk Analysis and determining Cost Effective Countermeasures.
BCP Phase 4: Recovery Strategies (788) Ok now we are at the stage where we actually are developing a PLAN for business continuity. Before was just initial research and getting management to give us the “OK” to develop a plan. (more)
BCP Phase 4: Recovery Strategies (787) A more “technical” and “tangible” stage. The idea is to figure out what the company ACTUALLY needs to do to be able to recover the necessary business processes in the event of a catastrophe. Determine the most cost-effective* recovery mechanisms. Formally define the activities and actions that will be implemented and carried out in response to a disaster. These strategies will be based on the 5 main business considerations listed on the next page
Phase 4: Recovery Strategies (787) 5 categories: Business Process Recovery; Facility Recovery; Supply and Technology Recovery; User Environment Recovery; Data Recovery. We will go into more detail on each of these categories coming up.
Business Process Recovery (788) A Business Process is a set of interrelated steps linked through specific activities to accomplish a specific task. For these processes the team must know the components of the process, including: required roles; required resources; input and output mechanisms; workflow steps; required time for completion; how this process interacts with other processes
Facility Recovery (788) Facility Recovery is concerned with the ability to move processing operations to an alternate facility in case of the failure of the main facility. We have multiple methods to deal with this, including: “subscription services” with service bureaus; Reciprocal Agreements; Redundant Sites. Let's look into each of these more
Facility Recovery (791) Subscription services A subscription service is a contract with a 3rd party to provide access to a facility. There is generally a monthly fee to retain the right to use the facility along with a large “Activation” fee and hourly fee when actually using the facility. This is obviously a short term only solution. There are 3 types of subscription services, which we will talk about more in the next slides: Hot Site; Warm Site; Cold Site
Hot Site (790) Hot Site – a facility that is fully configured and ready to operate in a few hours. The only resources missing from a hot site are the actual data and the actual employees. Hardware and software MUST be fully compatible or it’s pointless - Very expensive - Vendor may not have customer specific or proprietary hardware/software + can allow for annual testing + ready within hours
Warm Site (790) A facility that is usually “partially” configured with some computing equipment, but not the actual hard core hardware. I.e. a “hot” site without the expensive stuff. Generally can be up in an acceptable time period. May be better for customers with specific hardware/software needs; the customer will bring computing hardware with them. Most widely used model + cheaper + available for longer timeframe due to reduced costs + good if you have your own custom hardware/software - takes longer to prepare - actual yearly testing not generally possible
Cold Site (790) Supplies basic environment (AC, electrical, plumbing etc), but NO actual computing equipment. Can take a while to activate. + cheaper + available for longer timeframe due to reduced costs + good if you have your own custom hardware/software - may take weeks to get activated and ready - cannot do yearly tests
Reciprocal Agreement (793) RA, also called “Mutual Aid”, is when two companies agree to help each other out in the case of an emergency. Ultimately this is not really practical for most businesses. Can you guys tell me what the Pros and Cons of this are? Can you tell me why this is not really practical?
Redundant Sites (794) Pretty much these are HOT sites, that are OWNED by a company (rather than a service bureau). This also may have live or slightly delayed data backups and some staff. - VERY EXPENSIVE (duplicate costs except for personnel) + best solution if turn around time and ability to recover all processing aspects are required
Multiple Processing Centers (794) Another approach is, rather than having only one center that facilitates a certain business function, to split the work among multiple active centers such that there is no single point of failure. Solid approach. Good scalability for normal business growth. Just make sure that the other centers have more resources than they individually need, in case they need to take on more work due to the failure of another center.
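The spare-capacity requirement can be checked with a few lines of Python. This is a rough sketch under stated assumptions: capacity and load are expressed in the same arbitrary units, only one center fails at a time, and the center names and numbers are hypothetical.

```python
def survives_single_failure(centers):
    """For each center, report whether the others could absorb its load.

    centers maps a center name to a (capacity, current_load) pair.
    Returns a dict mapping each center name to True or False.
    """
    results = {}
    for failed, (_, failed_load) in centers.items():
        # Spare capacity available at every center except the failed one.
        spare = sum(
            capacity - load
            for name, (capacity, load) in centers.items()
            if name != failed
        )
        results[failed] = spare >= failed_load
    return results


# Hypothetical figures, for illustration only.
centers = {"East": (100, 60), "West": (100, 70), "Central": (80, 50)}
print(survives_single_failure(centers))

# Two centers each running near capacity cannot cover for one another.
tight = {"A": (100, 90), "B": (100, 90)}
print(survives_single_failure(tight))
```

A False for any center means that center is still a single point of failure, even though the work is nominally split.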
Supply and Technology Recovery (795) Ok, so we have plans to recover our facilities and our main processing requirements. But what about the “lower level” of things: Hardware Backups; Software Backups; Documentation; Human Resources. These need to be taken into consideration too. We will briefly talk about these in the next few slides
Hardware backups (796) Ok, so we have a space to process, but unless we have a hot site or redundant site, and our building is destroyed… where do we get the servers from? What about the desktops that our staff need? Do we have vendors to provide these, and how long will it take to get new equipment from them? What happens if we have “legacy” equipment… what do we do? We need to take all of these questions into consideration when planning.
Software Backups (797) Like the hardware backups, but specifically about software. How do we get copies of the software, and how do we roll out installs? What about licensing? What about custom software that we had created that we cannot just go out and buy at the store? Software escrow – what is this? Anyone?
Documentation (798) OK, so we have the equipment and software… how do we get it all rolled out and configured the same way it was at the company? Incorrect configurations COULD cause compromises in integrity or confidentiality! (how?) Do we even know how our old network was configured? Can we reproduce it? An important concept for BCP that should be in company policy is that ‘All documentation should be kept up to date and properly protected’
Human Resources (799) What happens if our backup facility is 250 miles away? How do we get people there? What happens if the disaster was a natural catastrophe and some important employees are injured or worse… what do we do now? Executive Succession Planning – what is this?
End User Environment (800) How do we notify the users about a disaster and the change of operating procedure? Once there, we need to have some people on the ground directing issues pertaining to employees. These people should be easily identified. We also need to be concerned with how to manage other tasks that we might not have the resources to do in the traditional manner (for example automated data processing, or normal communication methods). How do we handle that? The BCP team needs to consider these types of issues.
Data Backups (801) How do we ensure we have data to load back into our new offsite systems? Data changes constantly. We need a solution that makes sense and is cost effective (this will vary business to business). We will talk about traditional backup types as well as “electronic vaulting” on the next few slides.
Traditional Backups (802) Traditional backups have some method of backing up files to a removable medium. The first thing to understand about backups is the “archive” bit. Every time a file is altered the “archive” bit is set to notify the system that a file may need to be backed up. Now let's talk about the 3 backup types: Full; Differential; Incremental
Full Backup (802) Simply put, backup every file on the system! Then clear the archive bit of each file This must be done to some degree of regularity, depending on the business needs. + everything gets backed up + if you do a full backup every day, you can restore with only 1 restore operation - Takes a long time, can be expensive to complete in a timely manner
Differential (802) Back up any file that has changed since the last full backup. Steps are: 1. Find any file where the archive bit is set 2. Back up the file 3. DO NOT clear the archive bit. This allows you to quickly restore data in the event of a disaster in 2 operations. Simply: 1. Restore the last full backup 2. Restore the last “differential” backup (more)
Differential Pros/cons (802) Faster than a full backup Can do a full restore with 2 “operations” restore the last full backup, restore the last differential backup Cons Does not have all data on any tape, you still need a full backup to do a complete restore
Incremental (802) The idea is to back up any file that has changed since the last full backup OR the last incremental backup, whichever is more recent. Steps are: 1. Find any file with the archive bit set 2. Back up that file 3. Clear the archive bit (more)
Incremental Pros/Cons (802) Fast to back up nightly Cons To restore requires many operations: restore the last full backup, then restore every incremental backup done since the last full backup (restores are slow). If you lose any of the tapes (full or incremental) you cannot truly restore all data.
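The difference between the three backup types comes down to what happens to the archive bit. The following Python sketch simulates that behavior with an in-memory model; the class and method names are hypothetical, and no real file system attributes are touched.

```python
class FileSystem:
    """Toy model: tracks only the archive bit of each file."""

    def __init__(self, names):
        # New files start with the archive bit set (they have never been backed up).
        self.archive_bit = {name: True for name in names}

    def modify(self, name):
        # Altering a file sets its archive bit.
        self.archive_bit[name] = True

    def full_backup(self):
        # Back up every file, then clear every archive bit.
        backed_up = sorted(self.archive_bit)
        for name in self.archive_bit:
            self.archive_bit[name] = False
        return backed_up

    def differential_backup(self):
        # Back up files whose bit is set, but DO NOT clear the bit.
        return sorted(n for n, bit in self.archive_bit.items() if bit)

    def incremental_backup(self):
        # Back up files whose bit is set, then clear the bit.
        backed_up = sorted(n for n, bit in self.archive_bit.items() if bit)
        for name in backed_up:
            self.archive_bit[name] = False
        return backed_up


fs = FileSystem(["a", "b", "c"])
print(fs.full_backup())          # every file
fs.modify("a")
print(fs.differential_backup())  # changes since the full backup
fs.modify("b")
print(fs.differential_backup())  # still includes "a": the bit was never cleared
print(fs.incremental_backup())   # same files, but now the bits are cleared
fs.modify("c")
print(fs.incremental_backup())   # only what changed since the last incremental
```

Because the differential never clears the bit, each differential grows until the next full backup, while each incremental stays small.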
Which backup is right for you It depends on your needs. Personally I believe in the following strategy: If you can do a full backup every night, do so. If you cannot, then move to differential. If you cannot handle differentials, move to incremental. REMEMBER, for all these to work you still need a full backup periodically.*
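The restore cost of each strategy can be sketched in Python. Given a chronological list of backup types, the hypothetical function below returns the positions of the backups that must be restored, in order. It assumes the schedule after the last full backup uses only one type and deliberately refuses anything else.

```python
def restore_sequence(history):
    """Return indices of the backups to restore, oldest first.

    history is a chronological list of "full", "differential" or
    "incremental" strings (oldest first).
    """
    full_positions = [i for i, kind in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup in history; nothing to restore from")
    last_full = full_positions[-1]

    later = list(range(last_full + 1, len(history)))
    if not later:
        return [last_full]

    kinds = {history[i] for i in later}
    if kinds == {"differential"}:
        # Last full backup plus only the most recent differential.
        return [last_full, later[-1]]
    if kinds == {"incremental"}:
        # Last full backup plus every incremental since then.
        return [last_full] + later
    raise ValueError("mixed or unknown backup types are not handled in this sketch")


print(restore_sequence(["full", "differential", "differential"]))
print(restore_sequence(["full", "incremental", "incremental", "incremental"]))
```

With no full backup in the history the function raises an error, which is the point of the reminder above.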
Discussion of backups Can you mix differential and incremental backups? (Why or Why not?) All backups should be stored both onsite and offsite (why) When storing offsite, would the next building over be appropriate? There should be a clear written process on how to restore files (why) Someone should periodically test the backups by performing restores to a “test” system (why)
Discussion of Backups What situations would a full backup be appropriate What situations would a differential backup be appropriate What situations would an incremental backup be appropriate
Discussion of Backups When choosing an offsite storage facility think of the following: How fast can I get access to my data? What are the hours of the facility? What are the access control protections the facility provides (why do I care?) Are there fire suppression systems? Are there environmental controls?
Non Backup Terms that should be mentioned (804) Disk mirroring / shadowing – copying data to one or more hard drives such that a system has multiple copies of data in case of a drive failure. Disk duplexing – same as shadowing, but using multiple disk controllers. (why?)
Electronic Vaulting (804) Electronic Vaulting* is the idea of sending all changes to a file to a remote site (using non-backup methods). This usually is not done real-time but in batches. (example bank transactions might be copied daily to another office)
Remote Journaling (805) RJ is another method of transmitting data to an offsite facility. However, it is different from EV. It is done in “real-time” (What do I mean by that?) Entire files are not copied, only changes (deltas) to files (also called transaction logs). From the base files and the records of changes you can recreate the current environment.
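Recreating the current environment from base files plus a record of changes can be illustrated with a short Python sketch. The journal format here, a list of (operation, key, value) tuples, is an assumption made for illustration and not the format of any real journaling product.

```python
def replay(base, journal):
    """Apply journal entries, in order, to a copy of the base data."""
    state = dict(base)  # never modify the base copy itself
    for operation, key, value in journal:
        if operation == "set":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown journal operation: {operation}")
    return state


# Hypothetical base data held at the offsite facility.
base = {"acct1": 100, "acct2": 50}

# Hypothetical deltas received since the base copy was taken.
journal = [
    ("set", "acct1", 80),
    ("set", "acct3", 20),
    ("delete", "acct2", None),
]

print(replay(base, journal))
```

Order matters: replaying the same entries in a different sequence can produce a different final state, so the journal must preserve the original transaction order.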
Tape Vaulting (806) A type of backup, however rather than backing up to a local device you “back up” to a remote device.
Phase 4: Restoration Strategies (809) Now that we covered recovery strategies we need to look at a couple of recovery concepts that we will need to understand in the planning stage.
Phase 4: Restoration (809) When planning we must also recognize that there are 3 different teams in DR. Damage Assessment team – assesses the damage. Restoration team – responsible for getting the alternate site into a working functional environment. Salvage team – responsible for starting the process of “recovering” the original site and moving back from the backup site. (cannot stay in the backup site forever ;) Let's look at these in the next slides
Phase 4: Recovery (809) Damage Assessment – Determine cause of disaster; determine potential for further damage; identify affected business functions and assets; identify resources that must be replaced immediately; estimate how long it will take to bring critical functions online; determine whether the BCP should be put into operation
Phase 4: Recovery (809) Restoration Team – should be responsible for getting the alternate site into a working and functioning environment
Phase 4: Recovery (809) Salvage Team – responsible for starting the recovery of the original site. When moving things back to the original site the “most critical functions” should be moved LAST* (why) The least critical functions should be moved first.
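The move-back ordering can be expressed as a simple sort. This Python sketch uses the MTD from the BIA as a stand-in for criticality (longer MTD means less critical), which is an assumption; the function names and values are hypothetical.

```python
def return_order(functions):
    """Order functions for the move back to the original site.

    functions is a list of (name, mtd_hours) pairs. The least critical
    function (longest MTD) is moved first, the most critical one last.
    """
    ordered = sorted(functions, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


# Hypothetical functions and MTD values, for illustration only.
functions = [("Order entry", 2), ("Payroll", 72), ("Archive search", 720)]
print(return_order(functions))
```

This is the reverse of the order used when recovering to the alternate site, where the shortest MTD comes first.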
End of Phase 4: Recovery
Phase 5: Plan design and development (814) Now we need to actually come up with goals and a plan for attaining these goals. These goals must contain certain key information. Responsibility – who are the individuals responsible for what? What is expected of them, and how will they be trained? Authority – in times of crisis, who is in charge? Priorities – what are the critical processes, and what are the priorities? Implementation and Testing – how will we implement our plans, and how will we test them? (more)
Phase 5: Plan Design and Development (814) Strategies Copies of the plan need to be kept in one or more locations. (why) Plans must be in paper and electronic format. Call trees should be implemented
BCP: Phase 6 – Testing (816) OK, so we have this great plan that we’ve spent millions of hours and dollars creating. But does it work, or will it sink and completely fail? Well, we should try testing it. Testing also allows us to see where the plan can be improved, or whether changes in the environment require the plan to be updated (what company doesn’t change and grow?). Testing should be carried out at LEAST once a year.* Any problems that occur should be documented and reported to management.* So what are some testing methods? Next slide
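A call tree is just a mapping from each person to the people they are responsible for calling. The Python sketch below walks such a tree breadth-first to list the calls in the order they would be placed and to find anyone the tree never reaches. The roles and the dictionary layout are hypothetical.

```python
from collections import deque


def call_order(tree, root):
    """Return (caller, callee) pairs in breadth-first order from root."""
    calls = []
    seen = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in seen:  # never call the same person twice
                seen.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


def unreachable(tree, root, everyone):
    """Return the people in `everyone` that the call tree never reaches."""
    reached = {root} | {callee for _, callee in call_order(tree, root)}
    return sorted(set(everyone) - reached)


# Hypothetical call tree, for illustration only.
tree = {
    "Coordinator": ["IT lead", "HR lead"],
    "IT lead": ["Admin 1", "Admin 2"],
    "HR lead": ["Recruiter"],
}

for caller, callee in call_order(tree, "Coordinator"):
    print(f"{caller} calls {callee}")

print(unreachable(tree, "Coordinator", ["Admin 1", "Recruiter", "Night guard"]))
```

Running the unreachable check whenever staff change is one cheap way to notice that the call tree has gone stale.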
Checklist Test (818) BCP is distributed to departments and functional areas for review. The Managers read over and indicate if anything is missing or should be modified. (Manager “checks” off that the plan is OK for their department)
Structured Walk-Through (818) Representatives from each department come together AS A GROUP, they walk through the plan and different scenarios from beginning to end to make sure nothing is left out.
Simulation Test (819) A specific scenario is proposed; all required employees come together, act as if the event has happened, and start taking action to recover. The idea is to see if any problems come up or if any concerns were left out.
Parallel Test (819) Some systems are moved to the alternate site and processing takes place. The results are compared to the real processing to see if anything needs to change.
Full Interruption test (819) Most intrusive test. The original site is actually shut down and processing is moved to the alternate site (really needs to be a hot site). The recovery team fulfills its obligation in preparing the systems and environment for the alternate site. This is a full-blown drill. Requires tons of planning and coordination. These are risky and can cause damage if not managed properly. Senior management approval is required due to the risk involved.*
Maintaining the Plan (819) Now that we have the plan we need to maintain it! Plans become out of date and need constant “refresh”. Why? The BCP may not be integrated into the change management process (it should be though!); infrastructure or environment changes (that never changes… ); company re-organization, layoffs etc; changes in hardware or software; employee turnover (more)
Maintaining the Plan (819) We can help keep the plan updated by taking the following actions Make BCP planning part of every business decision! Insert BCP maintenance responsibilities into job descriptions Include maintenance in personnel evaluations Perform internal audits that include DR and BCP procedures Test the plan yearly