Multi-level Selective Deduplication for VM Snapshots in Cloud Storage Wei Zhang*, Hong Tang †, Hao Jiang †, Tao Yang*, Xiaogang Li †, Yue Zeng † * University of California at Santa Barbara † Aliyun.com Inc.

Motivations
- Virtual machines on the cloud use frequent backups to improve service reliability
  - Used in Alibaba's Aliyun, the largest public cloud service in China
- High storage demand
  - Heavy daily backup workload across a large number of VMs per cluster
  - Large content duplicates
- Limited resources for deduplication
  - No special hardware or dedicated machines
  - Small CPU & memory footprint

Focus and Related Work
Previous work:
- Version-based incremental snapshot backup
  - Inter-block/VM duplicates are not detected
- Chunk-based file deduplication
  - High cost for chunk lookup
Focus:
- Parallel backup of a large number of virtual disks
  - Large files for VM disk images
Contributions:
- Cost-constrained solution with very limited computing resources
- Multi-level selective duplicate detection and parallel backup

Requirements
- Negligible impact on existing cloud services and VM performance
  - Must minimize CPU and I/O bandwidth consumption for the backup and deduplication workload (e.g., <1% of total resources)
- Fast backup speed
  - Complete backups for 10,000+ users within a few hours each day, during light cloud workload
- Fault tolerance constraint
  - Adding data deduplication should not decrease the degree of fault tolerance

Design Considerations
Design alternatives:
- An external and dedicated backup storage system
- A decentralized backup system co-hosted with the cloud service, with full deduplication
[Diagram: backup service running alongside the cloud service on each node]

Design Considerations
- Decentralized architecture running on a general-purpose cluster, co-hosting both elastic computing and the backup service
- Multi-level deduplication
  - Localizes backup traffic and exploits data parallelism
  - Increases fault tolerance
- Selective deduplication
  - Uses minimal resources while still removing most redundant content and achieving good efficiency

Key Observations
- Inner-VM data characteristics
  - Exploit unchanged data to localize deduplication
- Cross-VM data characteristics
  - A small common dataset dominates the duplicates
  - Zipf-like distribution of VM OS/user data
  - Separate consideration of OS and user data

VM Snapshot Representation
- Data blocks are variable-sized
- Segments are fixed-sized
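The two-level layout above can be sketched as follows. The slides do not specify the chunking algorithm, so a toy byte-wise rolling sum stands in for a real Rabin-style content-defined chunker; the minimum/maximum block bounds are illustrative assumptions, while the 2 MB segment size and ~4 KB average block size come from the evaluation setup.

```python
import hashlib

SEGMENT_SIZE = 2 * 1024 * 1024   # fixed-size segments (2 MB, per the evaluation setup)
AVG_BLOCK_BITS = 12              # boundary on average every 4 KB (2**12 bytes)
MIN_BLOCK, MAX_BLOCK = 1024, 64 * 1024   # illustrative bounds, not from the slides

def split_segments(image: bytes):
    """Cut a disk image into fixed-size segments."""
    return [image[i:i + SEGMENT_SIZE] for i in range(0, len(image), SEGMENT_SIZE)]

def split_blocks(segment: bytes):
    """Cut one segment into variable-sized blocks with a toy rolling hash.

    A boundary is declared wherever the low AVG_BLOCK_BITS bits of a
    byte-wise rolling sum are zero; real systems use Rabin fingerprints.
    """
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(segment):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i + 1 - start
        at_boundary = (h & ((1 << AVG_BLOCK_BITS) - 1)) == 0 and size >= MIN_BLOCK
        if at_boundary or size >= MAX_BLOCK:
            blocks.append(segment[start:i + 1])
            start = i + 1
    if start < len(segment):
        blocks.append(segment[start:])   # tail block may be short
    return blocks

def block_signature(block: bytes) -> str:
    """Content signature used for duplicate detection (SHA-1 here as an example)."""
    return hashlib.sha1(block).hexdigest()
```

Because boundaries depend only on content, an unchanged run of bytes yields the same blocks (and signatures) in every snapshot, which is what makes block-level duplicate detection work.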

Processing Flow of Multi-level Deduplication

Data Processing Steps
1. Segment-level checkup
   - Use the dirty bitmap to see which segments are modified
2. Block-level checkup
   - Divide a segment into variable-sized blocks and compare their signatures with the parent snapshot
3. Checkup against the common dataset (CDS)
   - Identify duplicate blocks from the CDS
4. Write new snapshot blocks
   - Write new content blocks to storage
5. Save recipes
   - Save segment metadata information
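The five steps can be sketched as one pass over a snapshot's segments. This is a minimal illustration, not the paper's implementation: the recipe and index structures are hypothetical, and fixed 4 KB blocks stand in for the variable-sized chunking.

```python
import hashlib

BLOCK = 4 * 1024   # fixed 4 KB blocks stand in for variable-sized chunking here

def _blocks(segment: bytes):
    return [segment[i:i + BLOCK] for i in range(0, len(segment), BLOCK)]

def backup_snapshot(segments, dirty_bitmap, parent_recipe, cds_index, store_append):
    """Sketch of the multi-level checkup flow (names are illustrative).

    segments      : list of segment byte strings for the current snapshot
    dirty_bitmap  : dirty_bitmap[i] is True iff segment i was modified
    parent_recipe : {segment_index: {signature: block_ref}} from the parent snapshot
    cds_index     : {signature: block_ref} for the common dataset (CDS)
    store_append  : callable(block) -> block_ref that stores new content
    """
    recipe = {}
    for i, seg in enumerate(segments):
        # 1. Segment-level checkup via the dirty bitmap.
        if not dirty_bitmap[i]:
            recipe[i] = parent_recipe.get(i, {})   # reuse the parent's segment recipe
            continue
        seg_recipe = {}
        for block in _blocks(seg):
            sig = hashlib.sha1(block).hexdigest()
            if sig in parent_recipe.get(i, {}):    # 2. block-level checkup vs. parent
                seg_recipe[sig] = parent_recipe[i][sig]
            elif sig in cds_index:                 # 3. checkup against the CDS
                seg_recipe[sig] = cds_index[sig]
            else:                                  # 4. write new snapshot blocks
                seg_recipe[sig] = store_append(block)
        recipe[i] = seg_recipe                     # 5. save the segment recipe
    return recipe
```

Note how each level filters work from the next: clean segments never get chunked at all, and only blocks missed by both the parent snapshot and the CDS consume storage bandwidth.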

Architecture of Multi-level VM Snapshot Backup
[Diagram: components within one cluster node]

Status & Evaluation
- Prototype system running on Alibaba's Aliyun cloud
  - Based on Xen
  - 100 nodes; each has 16 cores, 48 GB memory, and 25 VMs
  - Uses <150 MB per machine for backup & deduplication
- Evaluation data from Aliyun's production cluster
  - 41 TB
  - 10 snapshots per VM
  - Segment size: 2 MB
  - Avg. block size: 4 KB

Data Characteristics of the Benchmark
- Each VM uses 40 GB of storage space on average
- OS and user data disks each take ~50% of the space
- OS data
  - 7 mainstream OS releases: Debian, Ubuntu, Redhat, CentOS, and three Windows releases
- User data
  - From 1323 VM users

Impacts of 3-Level Deduplication
- Level 1: segment-level detection within a VM
- Level 2: block-level detection within a VM
- Level 3: common data block detection across VMs

Impact for Different OS Releases

Separate Consideration of OS and User Data
- Both have a Zipf-like data distribution
- But popularity growth differs as the cluster size / number of VM users increases

Commonality Among OS Releases
- 1 GB of common OS metadata covers 70+%

Cumulative Coverage of Popular User Data
- Coverage is the sum, over the covered data blocks, of block size × frequency
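The coverage metric above can be computed as follows; a small sketch (function name and input shape are illustrative) that ranks blocks by popularity and accumulates size × frequency:

```python
def cumulative_coverage(blocks):
    """Fraction of total bytes covered by the k most popular blocks, for each k.

    blocks: (size_in_bytes, frequency) pairs; a block's coverage is
    size * frequency, and blocks are ranked by frequency (Zipf-like).
    """
    ranked = sorted(blocks, key=lambda sf: sf[1], reverse=True)
    total = sum(size * freq for size, freq in ranked)
    fractions, running = [], 0
    for size, freq in ranked:
        running += size * freq
        fractions.append(running / total)
    return fractions
```

With a Zipf-like frequency distribution, this curve rises steeply at first, which is why a small CDS of the most popular blocks captures most cross-VM duplicates.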

Space Saving Compared to Perfect Deduplication as CDS Size Increases
- A 100 GB CDS (1 GB index) achieves 75% of perfect deduplication
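A back-of-envelope check of the 100 GB CDS / 1 GB index figure: with the evaluation's 4 KB average block size, a 100 GB CDS holds roughly 26 million unique blocks. The ~40-byte index entry (a 20-byte SHA-1 signature plus location metadata) is an assumption, not stated in the slides, but it makes the numbers line up:

```python
CDS_BYTES = 100 * 2**30      # 100 GB common dataset
AVG_BLOCK = 4 * 2**10        # 4 KB average block size (from the evaluation setup)
ENTRY_BYTES = 40             # assumed: ~20-byte SHA-1 signature + location metadata

cds_entries = CDS_BYTES // AVG_BLOCK       # ~26 million unique blocks
index_bytes = cds_entries * ENTRY_BYTES    # ~1 GB, consistent with the "1GB index" figure
```

An index of this size can be kept in memory and sharded across nodes, which is what keeps CDS lookups cheap enough for the <1% resource budget.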

Impact of dataset-size increase

Conclusions
Contributions:
- A multi-level selective deduplication scheme for VM snapshots
  - Inner-VM deduplication localizes backup and exposes more parallelism
  - Global deduplication with a small common dataset drawn from OS and data disks
- Uses less than 0.5% of memory per node to meet a stringent cloud resource requirement, while accomplishing 75% of what perfect deduplication achieves
Experiments:
- Achieve 500 TB/hour on a 1000-node cloud cluster
- Reduce backup bandwidth by 92%, from 500 TB/hour raw to 40 TB/hour transferred