Multi-level Selective Deduplication for VM Snapshots in Cloud Storage
Wei Zhang*, Hong Tang†, Hao Jiang†, Tao Yang*, Xiaogang Li†, Yue Zeng†
* University of California at Santa Barbara
† Aliyun.com Inc.
Motivations
Virtual machines on the cloud use frequent backups to improve service reliability
– Used in Alibaba's Aliyun, the largest public cloud service in China
High storage demand
– Daily backup workload: hundreds of TB at Aliyun
– Number of VMs per cluster: 10,000+
Large content duplicates
Limited resources for deduplication
– No special hardware or dedicated machines
– Small CPU & memory footprint
Focus and Related Work
Previous work
– Version-based incremental snapshot backup: inter-block/VM duplicates are not detected
– Chunk-based file deduplication: high cost for chunk lookup
Focus
– Parallel backup of a large number of virtual disks
– Large files for VM disk images
Contributions
– Cost-constrained solution with very limited computing resources
– Multi-level selective duplicate detection and parallel backup
Requirements
Negligible impact on the existing cloud service and VM performance
– Must minimize CPU and I/O bandwidth consumed by the backup and deduplication workload (e.g., <1% of total resources)
Fast backup speed
– Complete backups for 10,000+ users within a few hours each day, during light cloud workload
Fault tolerance constraint
– The addition of data deduplication should not decrease the degree of fault tolerance
Design Considerations
Design alternatives
– An external, dedicated backup storage system
– A decentralized, co-hosted backup system with full deduplication
(Figure: each cluster node runs the backup service alongside the cloud service.)
Design Considerations
Decentralized architecture running on a general-purpose cluster, co-hosting both elastic computing and the backup service
Multi-level deduplication
– Localizes backup traffic and exploits data parallelism
– Increases fault tolerance
Selective deduplication
– Uses minimal resources while still removing most redundant content and achieving good efficiency
Key Observations
Inner-VM data characteristics
– Exploit unchanged data to localize deduplication
Cross-VM data characteristics
– A small set of common data dominates the duplicates
– Zipf-like distribution of VM OS/user data
– Consider OS and user data separately
VM Snapshot Representation
Data blocks are variable-sized
Segments are fixed-size
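To make the two-level representation concrete, here is a minimal Python sketch. The names (Block, Segment, chunk_segment) are illustrative assumptions, and fixed 4 KB cuts stand in for real content-defined chunking; this is not the authors' code.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List

SEGMENT_SIZE = 2 * 1024 * 1024  # fixed-size segments (2 MB in the evaluation)

@dataclass
class Block:
    fingerprint: bytes  # content hash used for duplicate detection
    size: int           # variable: set by content-defined chunking

@dataclass
class Segment:
    offset: int         # segment position in the virtual disk
    blocks: List[Block] = field(default_factory=list)

def chunk_segment(offset: int, data: bytes, avg_block: int = 4096) -> Segment:
    """Split one fixed-size segment into variable-sized blocks.

    A real system would cut at content-defined boundaries (e.g., Rabin
    fingerprints); fixed 4 KB cuts keep this sketch short.
    """
    seg = Segment(offset)
    for i in range(0, len(data), avg_block):
        chunk = data[i:i + avg_block]
        seg.blocks.append(Block(hashlib.sha1(chunk).digest(), len(chunk)))
    return seg
```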
Processing Flow of Multi-level Deduplication
Data Processing Steps
Segment-level checkup
– Use the dirty bitmap to find which segments were modified
Block-level checkup
– Divide each dirty segment into variable-sized blocks and compare their signatures with the parent snapshot
Checkup against the common dataset (CDS)
– Identify duplicate blocks in the CDS
Write new snapshot blocks
– Write new content blocks to storage
Save recipes
– Save segment metadata information
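The steps above can be read as one routine applied to each segment. A hedged sketch follows, reusing SEGMENT_SIZE and chunk_segment from the earlier sketch; the dirty bitmap, parent recipe, CDS index, and store interfaces are assumed stand-ins, not the system's real APIs.

```python
def backup_segment(seg_index, segment_data, dirty_bitmap,
                   parent_recipe, cds_index, store):
    # Segment-level checkup: skip segments the dirty bitmap marks clean.
    if not dirty_bitmap[seg_index]:
        return ("unchanged", seg_index)  # recipe points at the parent's segment

    # Block-level checkup: chunk the dirty segment and compare block
    # signatures against the same segment in the parent snapshot.
    seg = chunk_segment(seg_index * SEGMENT_SIZE, segment_data)
    parent_fps = {b.fingerprint for b in parent_recipe[seg_index].blocks}

    entries, pos = [], 0
    for block in seg.blocks:
        data = segment_data[pos:pos + block.size]
        pos += block.size
        if block.fingerprint in parent_fps:
            entries.append(("parent", block.fingerprint))
        elif block.fingerprint in cds_index:
            # Checkup against the common dataset (CDS).
            entries.append(("cds", block.fingerprint))
        else:
            # Write new snapshot blocks to backup storage.
            store.put(block.fingerprint, data)
            entries.append(("new", block.fingerprint))

    # Save recipe: segment metadata referencing parent, CDS, and new blocks.
    return ("blocks", entries)
```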
Architecture of Multi-level VM Snapshot Backup
(Figure: backup service components within each cluster node.)
Status & Evaluation
Prototype system running on Alibaba's Aliyun cloud, based on Xen
– 100 nodes; each has 16 cores, 48 GB of memory, and 25 VMs
– Uses <150 MB per machine for backup & deduplication
Evaluation data from Aliyun's production cluster
– 41 TB of snapshot data; 10 snapshots per VM
– Segment size: 2 MB; average block size: 4 KB
Data Characteristics of the Benchmark
Each VM uses 40 GB of storage space on average
– OS and user data disks each take ~50% of the space
OS data
– 7 mainstream OS releases: Debian, Ubuntu, RedHat, CentOS, Windows 2003 32-bit, Windows 2003 64-bit, and Windows 2008 64-bit
User data
– From 1323 VM users
Impacts of 3-Level Deduplication
Level 1: segment-level detection within a VM
Level 2: block-level detection within a VM
Level 3: common data block detection across VMs
Impact for Different OS Releases
Separate consideration of OS and user data
Both follow a Zipf-like data distribution
But popularity grows differently as the cluster size and number of VM users increase
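One plausible way to act on this observation (an assumption for illustration, not a detail from the slides) is to budget the common dataset separately for OS and user blocks, keeping the most popular blocks of each category under its own size cap. In the sketch below, each counts argument maps a block fingerprint to its (frequency, size) pair:

```python
def build_cds(os_counts, user_counts, os_budget, user_budget):
    """Pick the most popular blocks per category until each budget is full."""
    def top_blocks(counts, budget):
        cds, used = set(), 0
        # Zipf-like skew: a short prefix of the popularity ordering
        # covers most of the duplicate bytes.
        ranked = sorted(counts.items(), key=lambda kv: kv[1][0], reverse=True)
        for fp, (freq, size) in ranked:
            if used + size > budget:
                break
            cds.add(fp)
            used += size
        return cds
    return top_blocks(os_counts, os_budget) | top_blocks(user_counts, user_budget)
```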
Commonality among OS releases
1 GB of common OS data covers 70+% of the duplicates
Cumulative coverage of popular user data
Coverage is the sum over covered data blocks of block size × frequency
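Expressed in code, the metric might look like the following sketch, where counts maps a block fingerprint to its (frequency, size) pair and the helper name is hypothetical:

```python
def cumulative_coverage(counts, k):
    """Fraction of total pre-dedup bytes covered by the k most popular blocks."""
    ranked = sorted(counts.values(), key=lambda v: v[0], reverse=True)  # (freq, size)
    total = sum(freq * size for freq, size in ranked)
    covered = sum(freq * size for freq, size in ranked[:k])
    return covered / total
```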
Space saving compared to perfect deduplication as CDS size increases
A 100 GB CDS (with a 1 GB index) achieves 75% of perfect deduplication
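As a rough consistency check (not stated on the slide): at the 4 KB average block size used in the evaluation, a 100 GB CDS holds roughly 25 million blocks, so a 1 GB index works out to about 40 bytes per entry, enough for a fingerprint plus a block location, and small enough to keep the whole index in memory.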
Impact of dataset-size increase
Conclusions
Contributions: a multi-level selective deduplication scheme for VM snapshots
– Inner-VM deduplication localizes backup and exposes more parallelism
– Global deduplication with a small common dataset drawn from OS and user data disks
Uses less than 0.5% of memory per node to meet a stringent cloud resource requirement, yet accomplishes 75% of what perfect deduplication does
Experiments
– Achieves 500 TB/hour on a 1000-node cloud cluster
– Reduces bandwidth by 92%, so only 40 TB/hour is actually transferred