1
SS-CDC: A Two-stage Parallel Content-Defined Chunking Method for Data Deduplication
Fan Ni, Xing Lin, Song Jiang
2
Data is Growing Rapidly
Most of this data is not transient: it needs to be safely stored. The space required to store it and the resources needed to access it can be huge, and it poses significant management challenges, such as system scalability and fault tolerance. (Data-growth figures from storagenewsletter.com.) Efficient data storage and management have become a major challenge.
3
The Opportunity: Data Duplication is Common
Sources of duplicate data: the same files are stored into the cloud by multiple users; files are continuously updated, generating multiple versions; and checkpointing and repeated data archiving are used (e.g., software packages and data reorganization in backup storage systems). Significant data duplication has been observed: over 90% of the data in backup storage workloads and about 50% in primary storage workloads is duplicate. Because the data can be highly redundant, there is an opportunity to substantially reduce its size.
4
The Deduplication Technique can Help.
When duplication is detected by fingerprinting (e.g., a cryptographic hash of File1 equals that of File2), only one physical copy is stored while both logical files remain accessible. The benefits: saved storage space, I/O bandwidth, and network traffic. Deduplication is an important feature in commercial storage systems, such as the NetApp ONTAP and Dell-EMC Data Domain systems, which explains why it is so widely used in various computing and storage environments. Two questions follow: how to deduplicate more data, and how to deduplicate faster? Faster deduplication matters not only for primary systems but also for backup systems, since it leaves more resources for other data maintenance tasks.
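As a rough sketch of file-level deduplication by fingerprinting (illustrative only; SHA-1 is one possible cryptographic hash, and the in-memory `store`/`catalog` structures are assumptions, not the paper's code):

```python
import hashlib

def fingerprint(path):
    """SHA-1 fingerprint of a file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

store = {}    # fingerprint -> file contents (one physical copy per fingerprint)
catalog = {}  # logical file name -> fingerprint

def put(name, path):
    """Register a logical file; store its contents physically only if unseen."""
    fp = fingerprint(path)
    if fp not in store:
        with open(path, "rb") as f:
            store[fp] = f.read()
    catalog[name] = fp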
5
Deduplicate at Smaller Chunks …
Files are partitioned into chunks, each chunk is fingerprinted, and duplicate chunks are removed, yielding a higher deduplication ratio than file-level deduplication (a chunk-level sketch follows below). Chunking and fingerprinting are the two potentially major sources of cost in deduplication. Can chunking be made very fast?
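A hedged sketch of chunk-level deduplication: chunk the data, fingerprint each chunk, and keep only unique chunks. The `chunk_stream` helper in the usage comment is a hypothetical stand-in for whatever chunking function is used:

```python
import hashlib

def deduplicate(chunks):
    """Keep one physical copy per unique chunk fingerprint.
    Returns the unique chunks plus a recipe to reconstruct the input."""
    store, recipe = {}, []
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()   # chunk fingerprint
        if fp not in store:                    # store unique chunks only
            store[fp] = chunk
        recipe.append(fp)                      # logical sequence of fingerprints
    return store, recipe

# The deduplication ratio is total input size / total stored size.
# store, recipe = deduplicate(chunk_stream(data))   # chunk_stream is hypothetical
```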
6
Fixed-Size Chunking (FSC)
FSC partitions files (or data streams) into equal, fixed-size chunks. It is very fast, but the deduplication ratio can be significantly compromised by the boundary-shift problem. Example: File A contains "HOWAREYOU?OK?REALLY?YES?NO", and File B is initially identical, so one can simply partition both at fixed offsets and detect their duplicate chunks.
7
Fixed-Size Chunking (FSC)
FSC partitions files (or data streams) into equal, fixed-size chunks. It is very fast, but the deduplication ratio can be significantly compromised by the boundary-shift problem. Example: File B is now File A ("HOWAREYOU?OK?REALLY?YES?NO") with a single byte 'H' inserted at the front; every fixed-size boundary after the insertion shifts, so almost none of File B's chunks match File A's, as illustrated in the sketch below.
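A minimal sketch of fixed-size chunking and the boundary-shift problem (the 8-byte chunk size is a toy value chosen to fit the slide's example):

```python
def fixed_size_chunks(data, chunk_size=8):
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a                  # a single byte inserted at the front
# All boundaries in file_b shift by one byte, so almost no chunks match file_a:
print(fixed_size_chunks(file_a))        # [b'HOWAREYO', b'U?OK?REA', ...]
print(fixed_size_chunks(file_b))        # [b'HHOWAREY', b'OU?OK?RE', ...]
```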
8
Content-Defined Chunking (CDC)
To address the boundary-shift problem, CDC was proposed. CDC determines chunk boundaries according to the data contents (a predefined special marker), producing variable-size chunks. Assume the special marker is '?': File A ("HOWAREYOU?OK?REALLY?YES?NO") and File B (the same string with an 'H' inserted at the front) are both cut at the same '?' positions, so their common chunks are still detected. In practice, the marker is determined by applying a rolling hash function to a window of bytes, e.g., hash("YOU?") == predefined-value, which makes chunking even more expensive (likely more than half of the total deduplication cost), as sketched below.
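A minimal sketch of the CDC idea with a rolling hash. A toy polynomial hash is used here for readability, not the Rabin fingerprint used in the paper, and the window size, mask, and chunk-size limits are assumptions:

```python
WINDOW = 48                  # rolling-window size in bytes (assumption)
PRIME, MOD = 257, 1 << 64    # toy polynomial hash, not the paper's Rabin hash
MASK = (1 << 13) - 1         # a boundary is declared when (hash & MASK) == 0

def cdc_chunks(data, min_size=2048, max_size=65536):
    """Cut a chunk wherever the hash of the last WINDOW bytes hits the mask,
    subject to the minimum and maximum chunk sizes."""
    top = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * PRIME + byte) % MOD                 # add the new byte
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])         # content-defined boundary
            start, h = i + 1, 0                      # restart in the next chunk
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```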
9
Parallelizing CDC Chunking Operations
[Figure: a file to be chunked]
10
Parallelizing CDC Chunking Operations
[Figure: the file is divided into segments that threads p0, p1, p2, and p3 chunk in parallel]
11
Parallelizing CDC Chunking Operations
[Figure: the file is divided into segments that threads p0, p1, p2, and p3 chunk in parallel] However, parallelized chunking can compromise the deduplication ratio.
12
Compromised Deduplication Ratio
Deduplication ratio = data size before dedup / data size after dedup (higher is better); for example, 100 GB of raw data stored in 25 GB after deduplication gives a ratio of 4. [Chart: parallel chunking lowers the deduplication ratio, offsetting the advantage of CDC.]
13
Chunks can be Different!
The rule for forming chunks: a chunk usually spans two adjacent markers, but it must be neither too small (≥ minimum chunk size) nor too large (≤ maximum chunk size). Enforcing these limits makes chunking inherently a sequential process. Parallel chunking artificially introduces an extra set of markers (the segment boundaries); these marker positions change with data insertion or deletion, which partially brings back the boundary-shift problem.
14
The Goal of this Research
To design a parallel chunking technique that does not compromise the deduplication ratio at all, and that achieves superlinear speedup of chunking operations.
15
Approach of the Proposed SS-CDC Chunking
Two-stage chunking. Stage 1: produce all markers in parallel on a segmented file. Each thread works on 16 consecutive segments at a time, using AVX-512 SIMD instructions to process the 16 segments in parallel on one core. The markers are recorded in a bit vector. [Figure: a file divided into segments, with one thread handling 16 of them.] A scalar sketch of this stage follows below.
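A scalar sketch of the Stage-1 idea under stated assumptions: each segment is scanned independently, and every rolling-hash match is recorded as a flag in a shared marker vector. The hash parameters, segment size, and the `find_markers`/`stage1` helpers are illustrative, and plain Python threads stand in for the 16-wide AVX-512 lanes used by SS-CDC:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy rolling-hash parameters, matching the CDC sketch earlier (assumptions).
WINDOW, PRIME, MOD, MASK = 48, 257, 1 << 64, (1 << 13) - 1
SEGMENT = 1 << 20            # segment size in bytes (assumption)

def find_markers(data, seg_start, seg_end, marker_bits):
    """Stage 1, one segment: set a flag at every position where the rolling
    hash of the window ending there satisfies the boundary condition."""
    top = pow(PRIME, WINDOW - 1, MOD)
    lo = max(seg_start - WINDOW + 1, 0)  # cover windows straddling the boundary
    h = 0
    for i in range(lo, seg_end):
        if i - lo >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * PRIME + data[i]) % MOD              # add the new byte
        if i >= seg_start and i - lo + 1 >= WINDOW and (h & MASK) == 0:
            marker_bits[i] = 1                       # record a candidate marker

def stage1(data, nthreads=8):
    """Find all candidate markers. Segments are independent, so they can be
    scanned in parallel (SS-CDC uses AVX-512 lanes; threads stand in here)."""
    marker_bits = bytearray(len(data))   # one flag per byte position, for clarity
    with ThreadPoolExecutor(nthreads) as pool:
        futures = [pool.submit(find_markers, data, s,
                               min(s + SEGMENT, len(data)), marker_bits)
                   for s in range(0, len(data), SEGMENT)]
        for f in futures:
            f.result()                   # propagate any per-segment errors
    return marker_bits
```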
16
The Approach of the Proposed SS-CDC Chunking
Two-stage chunking. Stage 2: sequentially determine the chunks based on the marker bit vector, taking the minimum and maximum chunk sizes into account, as sketched below.
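A minimal sketch of the Stage-2 idea, assuming the marker-vector layout from the Stage-1 sketch above (one flag per byte position); the 2 KB/64 KB limits follow the experiment setup later in the deck:

```python
def stage2(data, marker_bits, min_size=2048, max_size=65536):
    """Stage 2: walk the marker bit vector once, sequentially, and emit chunks
    that respect the minimum and maximum chunk sizes."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if (marker_bits[i] and length >= min_size) or length >= max_size:
            chunks.append(data[start:i + 1])    # cut right after position i
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])             # trailing partial chunk
    return chunks

# Usage: markers are found in parallel, chunks are formed sequentially, so the
# resulting chunks do not depend on how the file was segmented in Stage 1.
# chunks = stage2(data, stage1(data))
```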
17
Advantages of SS-CDC
It has no loss of deduplication ratio: the second stage is sequential and generates exactly the same set of chunks as sequential chunking. It potentially achieves superlinear speedup: Stage 1 accounts for about 98% of the chunking time and is parallelized both across and within cores, while Stage 2, with optimization, accounts for less than 2% of the chunking time.
18
Experiment Setup
Hardware: a Dell-EMC PowerEdge T440 server with two Intel Xeon 3.6 GHz CPUs (4 cores and 16 MB LLC each) and 256 GB of DDR4 memory. Software: Ubuntu OS; the rolling-window function is Rabin; minimum/average/maximum chunk sizes are 2 KB/16 KB/64 KB, respectively.
19
The Datasets
Cassandra: Docker images of Apache Cassandra, an open-source storage system
Redis: Docker images of the Redis key-value store database
Debian: Docker images of the Debian Linux distribution (since v7.11)
Linux-src: uncompressed Linux source code (v3.0 – v4.9) downloaded from the Linux Kernel Archives website
Neo4j: Docker images of the Neo4j graph database
Wordpress: Docker images of the WordPress rich content management system
Nodejs: Docker images of JavaScript-based runtime environment packages
20
Single-thread/core Chunking Throughput
SS-CDC consistently achieves about a 3.3X speedup over sequential chunking when using a single thread/core.
21
Multi-thread/core Chunking Throughput
The chunking speedups are superlinear and scale well.
22
Deduplication Ratio Reduction of Existing Parallel CDC
Compared to SS-CDC, existing parallel CDC reduces the deduplication ratio by up to 43%. Using smaller segments leads to a higher reduction.
23
Conclusions
SS-CDC is a parallel CDC technique with high chunking speed and zero deduplication ratio loss. SS-CDC is optimized for SIMD platforms; similar two-stage chunking techniques can be applied on other platforms, such as GPUs.