Live Data Center Migration across WANs: A Robust Cooperative Context Aware Approach Kobus Van der Merwe with K.K. Ramakrishnan and Prashant Shenoy.


Page 2 Motivation
- Most network-based services/applications involve components hosted in data centers
  - Internet: mail/Web servers, VoIP, IPTV, P2P directory services, etc.
  - VPNs: mail servers, financial/business applications, etc.
- Many of these services require 24x7 availability
- Any downtime is unacceptable
  - At best it inconveniences users; at worst it has major business impact; typically it has financial implications
- Recent well-publicized outages: Blackberry, Skype
- Objective of our work: business continuity in the face of data center outages, both planned (maintenance) and unplanned (disaster recovery)

Page 3 Motivation cont.
Existing solutions to deal with outages are inadequate:
- Local redundancy solutions
  - Component redundancy (hot-swappable), multiple network connections
  - No protection against data center outages
- Existing cross data center solutions
  - Instance replication
    - Same content/service available in multiple locations
    - Works well for stateless services (e.g., Web servers)
    - Not suitable for stateful applications
  - Remote replication (either synchronous or asynchronous)
    - Partial solution: typically only deals with storage
    - Not seamless: involves server downtime, IP address changes, etc.

Page 4 Our approach
- Basic approach:
  - Seamless live service migration across WANs
    - Including all components: server, data, network
  - Cooperative, migration aware approach
    - Migration manager orchestrates migration across all three subsystems
- In summary:
  - Planned outages
    - Migration of both server and data
      - Live server migration
      - Performed once
    - Atomic switchover of network to complete migration
  - Unplanned outages
    - “Continuous live migration”
      - Server and data continuously replicated to remote site
      - On failure, atomic switchover of network

Page 5 Challenges
- Seamless live server migration across WAN
  - LAN based live server migration enabled by virtual server technologies (Xen, VMware)
  - WAN based server migration
    - Use existing virtual server migration
      - “Management” connectivity to remote site to enable image migration
    - Network support to allow the IP address to migrate with the (virtual) server
    - Migrate storage to remote site
      - Server and storage remain consistent
- Continuous live migration

LAN based live server migration:
- The image of the running virtual server is copied to a new physical platform (while the server is still running on the old platform)
- Server state is synchronized between the two images
- Migration software switches over to the new server with minimal downtime (tens of milliseconds)
- New server is exactly the same as the old server (same IP address, network state stays intact, etc.)
- Storage is handled through network attached storage (NAS), e.g., NFS
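The LAN based procedure above can be illustrated with a toy simulation. This is a minimal sketch assuming a pre-copy style scheme (repeated copy rounds followed by a brief pause), which the slide does not spell out; `live_migrate`, `dirty_rounds`, and `stop_threshold` are hypothetical names, and memory pages are modeled as dictionary entries.

```python
def live_migrate(source_pages, dirty_rounds, stop_threshold=2):
    """Copy an image while the server runs, then pause briefly for the final sync.

    source_pages: dict page_id -> content on the old platform.
    dirty_rounds: list of dicts; each dict holds the writes the still-running
                  server makes during one copy round (hypothetical workload).
    Returns (target image, number of live copy rounds, pages copied while paused).
    """
    target = {}
    to_send = set(source_pages)      # first round: send the whole image
    pending = list(dirty_rounds)
    rounds = 0
    while len(to_send) > stop_threshold:
        for page in to_send:
            target[page] = source_pages[page]
        rounds += 1
        # The server keeps running, so some pages change during the copy.
        writes = pending.pop(0) if pending else {}
        source_pages.update(writes)
        to_send = set(writes)
    # Final step: pause the server, copy the few remaining dirty pages, switch over.
    paused_copy = len(to_send)
    for page in to_send:
        target[page] = source_pages[page]
    return target, rounds, paused_copy
```

The pause only covers the last few dirty pages, which is why the downtime can be short even though the full image is large.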

Page 6 Networking Support
- IP address migration:
  - Challenging to move IP addresses in the current Internet
    - Especially dynamically
  - Isolate impact on the rest of the network
    - Routing protocols don’t change instantly
    - Connectivity changes not under data center control
- Our approach:
  - Allow migration management system to initiate network connectivity change
    - Network provides API to migration manager
  - Time critical changes are kept local
    - Network-wide (routing protocol) changes not time critical
  - Use temporary tunnels to deal with mobility

Page 7 IP Migration Primitive
[Figure: Data Center A (PEa, PSa, VSa) and Data Center B (PEb, PSb, VSa), plus PEc, PEd and the migration software. Legend: Physical Server (PS), Virtual Server (VS).]

Goal: Migrate Virtual Server “a” (VSa) with IP address IPa from Physical Server “a” (PSa) in data center “A” (DCa) to Physical Server “b” (PSb) in data center “B” (DCb)

Network part of migration:
1. Migration software signals to “network” that IPa will (soon) migrate from PEa to PEb
2. “Network” creates a tunnel between PEa and PEb
3. Server migration executed between PSa and PSb
4. Migration software signals to “network” that switchover should take place
5. PEa switches all traffic towards IPa to the tunnel between PEa and PEb, which delivers the traffic to VSa in PSb. (Return traffic does not need to go through the tunnel.)
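The five steps can be sketched as a small orchestration routine. This is only an illustration under stated assumptions: the `Network` class, its `announce_migration` and `switchover` methods, and the string labels are hypothetical stand-ins for whatever API the network would actually expose to the migration manager.

```python
class Network:
    """Hypothetical network-side API exposed to the migration manager."""

    def __init__(self):
        self.tunnels = {}      # ip -> (old_pe, new_pe)
        self.forwarding = {}   # ip -> how the old PE currently forwards traffic

    def announce_migration(self, ip, old_pe, new_pe):
        # Steps 1-2: the manager signals intent; the network builds the tunnel.
        self.tunnels[ip] = (old_pe, new_pe)
        self.forwarding[ip] = f"local:{old_pe}"

    def switchover(self, ip):
        # Steps 4-5: the old PE redirects traffic for ip into the tunnel.
        old_pe, new_pe = self.tunnels[ip]
        self.forwarding[ip] = f"tunnel:{old_pe}->{new_pe}"


def migrate(network, ip, old_pe, new_pe, migrate_server):
    """Run the steps in order and return a log of completed phases."""
    log = []
    network.announce_migration(ip, old_pe, new_pe)
    log.append("tunnel-ready")
    migrate_server()               # step 3: live server migration PSa -> PSb
    log.append("server-migrated")
    network.switchover(ip)
    log.append("traffic-switched")
    return log
```

For example, `migrate(Network(), "IPa", "PEa", "PEb", lambda: None)` walks through the phases with a no-op server migration. The only time-critical network change is the local forwarding update at the old PE.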

Page 8 IP Migration Primitive
[Figure: Data Center A (PEa, PSa, VSa) and Data Center B (PEb, PSb, VSa), plus PEc and PEd. Legend: Physical Server (PS), Virtual Server (VS).]

After the first five steps, server migration is done as far as the migration software is concerned. Traffic towards IPa is “dog-legged” through PEa, so a few more steps remain in the network:
1. PEb starts to advertise a route to IPa with high local preference. At this point there are two valid paths towards IPa: one through PEa and the tunnel, and another directly through PEb. As routers learn about the newly advertised path they will prefer the direct path towards IPa, and the tunnel will “dry out”.
2. When PEa detects no more traffic flowing through the tunnel, it withdraws the route for IPa (if it had a specific route for IPa) and tears down the tunnel.

IP Migration Primitive: Takes care of planned maintenance without storage needs (e.g., VoIP network element)
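The way the tunnel “dries out” can be shown with a toy model of route selection. A minimal sketch, assuming each router simply picks the route with the highest local preference; the preference values 100 and 200, the router names, and the function names are illustrative assumptions.

```python
def best_path(routes):
    """Pick the route with the highest local preference (other tie-breaks omitted)."""
    return max(routes, key=lambda r: r["local_pref"])["via"]


def drain_tunnel(routers, learn_order):
    """Track how many routers still send IPa traffic via PEa (and so the tunnel)."""
    old = {"via": "PEa", "local_pref": 100}   # assumed default preference
    new = {"via": "PEb", "local_pref": 200}   # assumed "high" preference
    tables = {r: [old] for r in routers}
    history = []
    for r in learn_order:
        tables[r].append(new)                 # router r learns the new route
        via_tunnel = sum(1 for t in tables.values() if best_path(t) == "PEa")
        history.append(via_tunnel)
    # PEa tears the tunnel down only once no router prefers the old path.
    tunnel_up = (not history) or history[-1] > 0
    return history, tunnel_up
```

Because the routers converge one by one, none of these network-wide changes has to be instantaneous; the tunnel carries whatever traffic still arrives at PEa in the meantime.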

Page 9 Data Storage
Existing WAN solutions: remote replication
- Maintain a primary/local and a remote storage system
- Replicate data between primary and remote systems
- One of two modes:
  - Synchronous: each write is performed locally and remotely before returning to the “application”
    - Local and remote remain synchronized
    - Poor performance: both throughput and application latency
  - Asynchronous: local and remote are allowed to diverge; replicate a consistent “snapshot”
    - Good performance (high throughput, low local latency)
    - Potential data loss because of divergence

[Figure: Local and Remote storage under synchronous replication and under asynchronous replication.]
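The difference between the two modes can be made concrete with a small sketch. It assumes an in-memory key/value store and a periodic snapshot every `snapshot_every` writes; both the function name and that parameter are hypothetical, chosen only to show where data loss can arise.

```python
def replay_writes(writes, mode, snapshot_every=3):
    """Apply writes locally and replicate them in the given mode.

    Returns (local, remote, pending); 'pending' holds writes that would be
    lost if the local site failed right now.
    """
    local, remote, pending = {}, {}, {}
    for n, (key, value) in enumerate(writes, start=1):
        local[key] = value
        if mode == "sync":
            remote[key] = value          # remote write completes before returning
        else:
            pending[key] = value         # acknowledged locally, not yet remote
            if n % snapshot_every == 0:  # ship a consistent snapshot
                remote.update(pending)
                pending.clear()
    return local, remote, pending
```

In synchronous mode `pending` is always empty, at the cost of a WAN round trip per write; in asynchronous mode the remote copy lags by up to one snapshot interval.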

Page 10 Migration Aware Replication
[Figure: Local and Remote storage with a switch between asynchronous and synchronous replication.]

Our approach:
- Remote replication that can seamlessly move between synchronous and asynchronous replication
- Allow replication mode to be controlled by migration management system:
  - Allow bulk of data to be replicated asynchronously
  - Switch to synchronous when needed
    - Final part of server migration process

IP Migration Primitive + Migration Aware Replication: Takes care of planned maintenance with storage needs
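A mode switch of this kind might look like the following sketch. It assumes that switching to synchronous mode first drains all outstanding asynchronous writes; that detail, the class name `MigrationAwareReplica`, and the `set_mode` method are assumptions for illustration, not the project's actual interface.

```python
class MigrationAwareReplica:
    """Toy replica whose mode is set by an external migration manager."""

    def __init__(self):
        self.mode = "async"
        self.local = {}
        self.remote = {}
        self.pending = {}   # writes not yet applied at the remote site

    def write(self, key, value):
        self.local[key] = value
        if self.mode == "sync":
            self.remote[key] = value
        else:
            self.pending[key] = value

    def set_mode(self, mode):
        if mode == "sync":
            # Drain outstanding asynchronous writes so both copies match
            # before synchronous operation begins.
            self.remote.update(self.pending)
            self.pending.clear()
        self.mode = mode
```

A migration manager would leave the replica in asynchronous mode for the bulk transfer and call `set_mode("sync")` just before the final server switchover, so that storage and server state agree at the moment of cutover.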

Page 11 Unplanned Outages
- Conflicting metrics of concern
  - Recovery point objective (RPO)
    - How much data loss is acceptable?
  - Recovery time objective (RTO)
    - How long can service be down?
  - Cost (overhead of protection)
- Range of meanings of “unplanned”
  - Catastrophic instantaneous failure
    - No notice whatsoever
  - But also imminent failure scenarios
    - Imminent equipment failure (e.g., increase in disk errors; imminent failure of fiber)
    - Developing natural/man-made disasters
      - E.g., flooding/steam pipe burst in NY, probably even with 9/11
      - Minutes to hours to react
- Existing remote replication solutions deal with storage
  - No support for server migration
- Our goal:
  - Replicate data and server to allow for seamless failover

Page 12 Application state requirements
- Limited application state:
  - E.g., VoIP network element that maintains call state (for 3-way calling and mid-call events), or VoD servers (for fast-forward, random access events)
  - Lost session state => application impact
    - Inconvenience
  - RTO small, RPO medium
    - Some state loss is tolerable (drop a few calls), but service has to stay up
  - Instrument application to initiate partial migration when new state has been created
- Stateful applications:
  - E.g., e-commerce applications (shopping cart, auction sites)
  - Lost session state => application impact
    - (At best) inconvenience, (at worst) application correctness, monetary impact
  - RTO small, RPO small (minimize state loss, site has to stay up)
  - Continuous (incremental) server migration
- High integrity applications
  - E.g., financial transactions, other database applications
  - RTO medium, RPO very small (absolutely no data loss; some downtime is preferable)
  - Reduce RTO with continuous (incremental) server migration
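The three application classes can be summarized as a small lookup table. This is a sketch only: the RTO/RPO labels and strategies are taken from the slide, while the `Profile` type, the dictionary keys, and the rule in `replication_mode` (only a "medium" RPO tolerates asynchronous replication) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    rto: str        # how long the service may be down
    rpo: str        # how much state loss is acceptable
    strategy: str   # migration strategy named on the slide


PROFILES = {
    "limited-state": Profile(
        "small", "medium", "application-initiated partial migration"),
    "stateful": Profile(
        "small", "small", "continuous (incremental) server migration"),
    "high-integrity": Profile(
        "medium", "very small", "continuous (incremental) server migration"),
}


def replication_mode(profile):
    """Assumed rule: tolerate asynchronous replication only for a medium RPO."""
    return "async" if profile.rpo == "medium" else "sync"
```

Under this assumed rule, a VoIP element would replicate asynchronously, while e-commerce and financial applications would pay the latency cost of synchronous replication.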

Page 13 Continuous Server Migration
Enabling Technology: VS record/replay
[Figure: On a physical server (PS), RECORD starts with a snapshot of the virtual server (VS) and then records execution state; REPLAY restores the snapshot and replays the execution state.]

Virtual server record/replay: available from VMware
- Efficient recording: track “external” events + times
- Synchronize events with VM state during replay
- Developed as a debugging tool
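The record/replay idea can be shown on a toy state machine. This is a minimal sketch, not VMware's mechanism: it assumes the server is fully deterministic given its external events, and `apply_event`, `record`, and `replay` are hypothetical helper names.

```python
import copy


def apply_event(state, event):
    """Deterministic state transition for a toy server (hypothetical)."""
    kind, value = event
    if kind == "packet":
        state["bytes"] += value
    elif kind == "timer":
        state["ticks"] += 1
    return state


def record(state, external_events):
    """Take a snapshot, then log each external event with its logical time."""
    snapshot = copy.deepcopy(state)
    log = []
    for t, event in enumerate(external_events):
        log.append((t, event))
        apply_event(state, event)
    return snapshot, log


def replay(snapshot, log):
    """Restore the snapshot and re-apply the logged events in order."""
    state = copy.deepcopy(snapshot)
    for _t, event in log:
        apply_event(state, event)
    return state
```

Only the snapshot and the event log need to be stored or shipped; the replayed state matches the live state because every nondeterministic input was captured in the log.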

Page 14 Continuous Server Migration
With migration aware replication
[Figure: Local site RECORDs, execution state is REPLICATEd, remote site REPLAYs.]

- Asynchronously replicate initial snapshot
- Replication of execution state
  - Asynchronous if the application can tolerate some state loss and the execution state represents a consistent checkpoint
  - Synchronous otherwise

IP Migration Primitive + Migration Aware Replication + Continuous Server Migration: Takes care of unplanned outages
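What the remote site can recover after a failure depends on how much of the execution log had been replicated. The sketch below assumes a toy state with a single counter and a log of integer increments; `failover_state` and `replicated_upto` are hypothetical names used only to contrast the two modes.

```python
def failover_state(snapshot, log, replicated_upto):
    """State the remote site can rebuild: snapshot plus the replicated log prefix."""
    state = dict(snapshot)
    for delta in log[:replicated_upto]:
        state["counter"] += delta
    return state


snapshot = {"counter": 10}
log = [1, 2, 3, 4]

# Synchronous replication: every log entry reached the remote site.
sync_state = failover_state(snapshot, log, replicated_upto=len(log))

# Asynchronous replication lagging by one entry: the last update is lost.
async_state = failover_state(snapshot, log, replicated_upto=len(log) - 1)
```

The gap between the two results is the state loss an application must tolerate if it opts for asynchronous replication of execution state.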

Page 15 Status
- Migration aware replication
  - Key building blocks prototyped
  - “Semantic Aware Replication” project
  - Gal Niv (UMass)
- WAN live migration
  - Key building blocks prototyped (without storage)
  - “Live virtual router migration” project
  - Yi Wang (Princeton)
- Continuous Server Migration
  - Just getting off the ground
- Work in progress
  - Many open issues remain!