Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage
Wei Zhang, Tao Yang, Gautham Narayanasamy – University of California at Santa Barbara
Hong Tang – Alibaba Inc.
USENIX HotStorage ’13
Motivation
Virtual machines in the cloud can use frequent backup to improve service reliability
– Used in Alibaba’s Aliyun, the largest public cloud service in China
High storage demand & large content duplicates
– Daily backup workload: hundreds of TB @ Aliyun
– Number of VMs per cluster: tens of thousands
Seeking inexpensive solutions
Architecture Considerations
An external, dedicated backup storage system
– High network traffic for transferring undeduplicated data
– Expensive
A decentralized, co-hosted backup system with full deduplication
– Lower cost & traffic
Requirements
Non-dedicated resources – co-hosted with existing cloud services
Resource friendly – small memory footprint and CPU usage
Compute and back up tens of thousands of VMs within a few hours each day, during light cloud workload
Focus and Related Work
Previous work
– Inline chunk-based deduplication: high cost for fingerprint lookup
– Speed up fingerprint comparison with approximation (e.g. subsampling, Bloom filters, stateless routing)
Focus of this paper
– Not inline: shortens the overall backup time of many VM images, but not of individual requests
– Not offline: multi-stage parallel backup with small storage overhead & limited computing resources
– Work in progress
Key Ideas
Separation of duplicate detection and data backup – different from inline deduplication
Buffered data redistribution in parallel duplicate detection
– Stage 1: Collect fingerprints in parallel
– Stage 2: Detect duplicates in parallel
– Stage 3: Perform actual VM backup in parallel
VM Snapshot Representation
Data blocks are variable-sized
Segments are fixed-sized
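A minimal Python sketch of this layout, with hypothetical names (`Block`, `Segment`, `pack_segments`): fixed-size segments group variable-size blocks, each identified by a content fingerprint. SHA-1 is shown only as an illustrative fingerprint function; the paper does not specify one here.

```python
import hashlib
from dataclasses import dataclass, field

SEGMENT_SIZE = 2 * 1024 * 1024  # fixed-size segments (2MB in the paper's evaluation)

@dataclass
class Block:
    fingerprint: str   # content hash identifying the block
    size: int          # variable, ~4KB on average in the evaluation

@dataclass
class Segment:
    blocks: list = field(default_factory=list)

def make_block(data: bytes) -> Block:
    # SHA-1 used purely for illustration
    return Block(hashlib.sha1(data).hexdigest(), len(data))

def pack_segments(raw_blocks):
    """Group variable-size blocks into fixed-size segments."""
    segments, cur, used = [], Segment(), 0
    for data in raw_blocks:
        b = make_block(data)
        if used + b.size > SEGMENT_SIZE and cur.blocks:
            segments.append(cur)           # current segment is full
            cur, used = Segment(), 0
        cur.blocks.append(b)
        used += b.size
    if cur.blocks:
        segments.append(cur)
    return segments
```

Segments, not individual blocks, are the unit of dirty-data scanning and I/O, which keeps disk access sequential.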
Stage 1: Deduplication Request Accumulation
➔ Scan dirty data blocks
➔ Exchange & accumulate dedup requests
➔ Map data from a VM-based to a fingerprint-based distribution
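The redistribution step can be sketched as hash-partitioning dedup requests by fingerprint, with per-destination buffers flushed when full. The names (`route`, `accumulate`) and the specific partitioning arithmetic are assumptions for illustration, not the paper's implementation:

```python
def route(fingerprint: str, p: int, q: int):
    """Map a hex fingerprint to (machine, partition): all requests for the
    same fingerprint land in the same partition, so duplicates can be
    detected locally in Stage 2."""
    h = int(fingerprint, 16)
    machine = h % p            # which of the p machines owns this fingerprint
    partition = (h // p) % q   # which of the q on-disk partitions it falls in
    return machine, partition

def accumulate(dirty_blocks, p, q, buffer_limit=1024):
    """Buffer dedup requests per destination machine and flush (here: yield)
    whenever a buffer fills, mimicking the buffered network exchange."""
    buffers = {m: [] for m in range(p)}
    for vm_id, fp in dirty_blocks:
        m, part = route(fp, p, q)
        buffers[m].append((part, fp, vm_id))
        if len(buffers[m]) >= buffer_limit:
            yield m, buffers[m]
            buffers[m] = []
    for m, buf in buffers.items():   # flush whatever remains at the end
        if buf:
            yield m, buf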
Stage 2: Fingerprint Comparison and Summary Output
Load the global index and dedup requests one partition at a time
Compare fingerprints in parallel
Output dedup summaries, mapping from the fingerprint-based back to the VM-based distribution
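The per-partition comparison can be sketched as below. `compare_partition` and its first-writer-wins rule for new fingerprints are illustrative assumptions, not the paper's actual code:

```python
def compare_partition(index: dict, requests):
    """Compare incoming dedup requests against one partition of the global
    index. Returns per-VM summary entries and updates the index with the
    fingerprints seen for the first time."""
    summaries, new_entries = [], {}
    for vm_id, fp in requests:
        if fp in index or fp in new_entries:
            summaries.append((vm_id, fp, "dup"))   # block already stored
        else:
            new_entries[fp] = vm_id                # first writer becomes owner
            summaries.append((vm_id, fp, "new"))   # VM must back this block up
    index.update(new_entries)
    return summaries
```

Because only one partition's index is resident at a time, memory stays bounded regardless of the total index size.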
Stage 3: Non-Duplicate Data Backup
Load dedup summaries
Read dirty segments
Output non-duplicate data blocks
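A sketch of the final backup pass, under the assumption that the summary marks each fingerprint as "new" or "dup" (the helper name `backup_segment` is hypothetical):

```python
def backup_segment(segment_blocks, summary):
    """Write only blocks the dedup summary marked as new; duplicates become
    references to the existing copy in the snapshot metadata."""
    written, refs = [], []
    for fp, data in segment_blocks:
        if summary.get(fp) == "new":
            written.append((fp, data))   # goes to backup storage
        else:
            refs.append(fp)              # recorded in snapshot metadata only
    return written, refs
```

Note that dirty segments are read here a second time (they were already scanned in Stage 1) – the tradeoff the conclusions slide mentions.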
Memory Usage per Machine at Different Stages
Stage 1: Request accumulation
– 1 I/O buffer to read dirty segments
– p network send and p receive buffers for p machines
– q dedup request buffers for local disk writes of q partitions
Stage 2: Fingerprint comparison
– Space for hosting 1 partition index and the corresponding requests
– p network send and p receive buffers, v local summary buffers for disk writes
Stage 3: Non-duplicate backup
– An I/O buffer to read dirty segments and write non-duplicates
– Duplicate summaries within dirty segments
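The small footprint can be sanity-checked with back-of-the-envelope arithmetic for Stage 1. Every buffer size below is an assumption chosen for illustration – only p = 100 machines and the 2MB segment size appear in the paper (which reports ~35MB total):

```python
# Illustrative Stage 1 memory budget; buffer sizes are assumptions.
p = 100                 # machines (from the paper's evaluation)
q = 64                  # on-disk request partitions (assumed)
KB, MB = 1024, 1024 * 1024

io_buffer       = 2 * MB           # one dirty-segment read buffer (2MB segment)
net_buffers     = 2 * p * 64 * KB  # p send + p receive buffers, 64KB each (assumed)
request_buffers = q * 256 * KB     # one write buffer per partition, 256KB (assumed)

total = io_buffer + net_buffers + request_buffers
# total / MB == 30.5 – the same order as the reported ~35MB
```

The point is structural: memory scales with p and q and the chosen buffer sizes, not with the number of VMs or the index size.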
Issues with Incidental Redundancy
Two VM blocks with the same fingerprint are created in parallel on different machines
– Both are identified as new blocks
– All other occurrences are detected as duplicates and logged
The inconsistency is repaired periodically during index updates
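A sketch of that periodic repair: when the new entries from Stage 2 are merged into the global index, a fingerprint that already exists reveals an incidental duplicate, and the extra copy is logged for later reclamation. The name `merge_new_entries` and the log-and-reclaim policy are assumptions:

```python
def merge_new_entries(index: dict, new_entries):
    """Merge Stage 2 output into the global index. When two machines created
    the same block concurrently, keep the indexed copy and log the other
    one for later space reclamation."""
    redundant = []
    for fp, location in new_entries:
        if fp in index:
            redundant.append((fp, location))  # incidental duplicate: log it
        else:
            index[fp] = location
    return redundant
```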
Snapshot Deletion
Mark-and-sweep – a block can be deleted once its reference count reaches zero
Similar to the deduplication stages:
– Scan the metadata and accumulate block reference pointers
– Compute the reference count of each index entry, partition by partition
– Log deletion instructions
Periodically perform a compact operation when the deletion log grows too big
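The sweep over one index partition can be sketched as follows (the `sweep` helper is hypothetical; in the real system the reference counts are accumulated partition by partition, not in one in-memory dictionary):

```python
def sweep(index: dict, live_refs):
    """Mark-and-sweep over one index partition: count references from all
    live snapshots, then log deletion instructions for blocks whose count
    dropped to zero."""
    counts = {fp: 0 for fp in index}
    for fp in live_refs:          # gathered by scanning snapshot metadata
        if fp in counts:
            counts[fp] += 1
    deletion_log = [fp for fp, c in counts.items() if c == 0]
    return deletion_log
```

Deletion is thus batched and deferred, matching the backup pipeline's batch-oriented design rather than deleting inline.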
Evaluation
Evaluated on a cluster of dual quad-core Intel Nehalem 2.4GHz E5530 machines with 24GB memory
Test data from Alibaba’s Aliyun cloud: 41TB, 10 snapshots per VM
Segment size: 2MB; avg. block size: 4KB
Evaluation objectives:
1) Analyze deduplication throughput and effectiveness for a large number of VMs
2) Examine the impact of buffering during metadata exchange
Data Characteristics
Each VM uses 40GB of storage space on average
OS and user data disks: each takes ~50% of the space
OS data
– 7 mainstream OS releases: Debian, Ubuntu, RedHat, CentOS, Win2003 32-bit, Win2003 64-bit, and Win2008 64-bit
User data
– From 1323 VM users
Setting & Resource Usage per Machine
p = 100 machines, 25 VMs per machine
Disk
– 8GB metadata usage
– 10ms local disk seek cost
– 50MB/second I/O per machine: < 16.7% of local I/O bandwidth
Memory usage: ~35MB
CPU: single-threaded execution per machine, 10-13% of a single core
Parallel Time When Memory Limit Varies
Performance with 35MB of Memory per Machine
Option 1: unoptimized data redistribution
Conclusions
Low-cost multi-stage parallel deduplication for simultaneous backup of many VM images
– Co-hosted with other cloud services
– Tradeoff: not optimized for individual backup requests; dirty data is read twice
– Work in progress
Evaluation
– Backup throughput of 100 machines: about 8.76GB per second for 2500 VMs
– Resource friendly to the existing cluster services
Questions?