1
DE-DUPLICATION ALGORITHMS FOR HIGH BANDWIDTH DATA
De-Duplication of Large Scale Data Sets
From Virtualization to Cloud (Spring 2011)
Ariel Szapiro, Leeor Peled
2
Methods of Deduplication
As stated in class, there are three main methods for activating a deduplication engine:
SBA – deduplication is the responsibility of the user side
ILA – inline deduplication as the data is processed
PPA – a batch operation running in the background on the server side
3
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
4
Bloom Filters 101
Basic problem – given a set S = {x1, x2, …, xn}, answer: is y ∊ S?
Demands & relaxations: efficient lookup time; false positives allowed, false negatives are not!
Implementation – map into k locations of an m-wide bit array using k different hash functions.
Insertion – set the k bits on.
Lookup – return true iff all k bits are set.
Delete – impossible! (why?)
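To make the mechanics concrete, here is a minimal Bloom filter sketch in Python. It is not taken from either paper; the values of m and k and the salted-SHA-256 hashing scheme are purely illustrative.

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, item):
            # Simulate k hash functions by salting one hash with k different prefixes.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            # True may be a false positive; False is always correct.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(m=1024, k=7)
    bf.add("chunk-signature-1")
    print("chunk-signature-1" in bf)   # True
    print("other-signature" in bf)     # almost certainly False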
5
Bloom Filter Example
Start with an m-bit array B, filled with 0s. Hash each item xj in S k times; if Hi(xj) = a, set B[a] = 1.
To check if y is in S, check B at Hi(y): all k values must be 1.
A false positive is possible: all k values are 1, but y is not in S.
(n items, m = cn bits, k hash functions)
Slide by Prof. Michael Mitzenmacher, Harvard
6
False Positive Probability
Pr(a specific bit of the filter is 0) is (1 - 1/m)^(kn) ≈ e^(-kn/m).
If ρ is the fraction of 0 bits in the filter, then the false positive probability is (1 - ρ)^k ≈ (1 - e^(-kn/m))^k.
The approximations are valid since ρ is concentrated around E[ρ]; a martingale argument suffices.
Calculus gives the optimum at k = (ln 2)·m/n, so the optimal false positive probability is about (0.6185)^(m/n).
(n items, m = cn bits, k hash functions)
Slide by Prof. Michael Mitzenmacher, Harvard
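A quick numeric check of these formulas (standard Bloom-filter math, not a result from either paper); n and m below are example values only.

    import math

    n, m = 1000, 8 * 1000                 # items and bits (c = 8 bits per item)
    k_opt = round(math.log(2) * m / n)    # optimal number of hash functions
    p0 = math.exp(-k_opt * n / m)         # approx. fraction of bits still 0
    fpp = (1 - p0) ** k_opt               # false-positive probability
    print(k_opt, fpp, 0.6185 ** (m / n))  # fpp is close to 0.6185^(m/n)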
7
Counting Bloom Filters
Suggested by Fan et al. (1998); handles the deletion problem.
Instead of a single bit, keep a counter (usually 4 bits) per location; increment/decrement all k locations on insertion/deletion.
Mind the overflows! (probability on the order of 6E-17…)
What happened to the false deletion problem?
Further upgrades: double/triple hashing, compressed BF, hierarchical BF, space-code BF, spectral BF, …
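A counting-Bloom-filter sketch of the idea above (the textbook technique, not code from the paper); the 4-bit saturation limit follows the slide, everything else is illustrative.

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m, k, max_count=15):        # 4-bit counters saturate at 15
            self.m, self.k, self.max_count = m, k, max_count
            self.counts = [0] * m

        def _positions(self, item):
            for i in range(self.k):
                yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                if self.counts[pos] < self.max_count:  # guard against overflow
                    self.counts[pos] += 1

        def remove(self, item):
            # Removing an item that was never added is what causes false deletions.
            for pos in self._positions(item):
                if self.counts[pos] > 0:
                    self.counts[pos] -= 1

        def __contains__(self, item):
            return all(self.counts[pos] > 0 for pos in self._positions(item))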
8
Proposed Parallel BF (Bhattacherjee et al.)
Deletion is still slow (k re-hashes) and hard to parallelize.
New upgrade: streaming de-duplication. We want to stream over windows of ω chunk sets. When done with group i, we only delete the occurrences of the first set (S_i) and add the items of the next one (S_{i+ω+1}).
For fast deletion, instead of a counter each location keeps an array of ω bits, one for each set in the currently observed window.
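A sketch of how such a windowed Bloom filter array might look, based on our reading of the slides rather than the paper's actual code; the class and method names are invented for illustration.

    import hashlib

    class WindowedBloomFilterArray:
        def __init__(self, m, k, omega):
            self.m, self.k, self.omega = m, k, omega
            self.bits = [[0] * omega for _ in range(m)]   # m rows, one bit per set in the window

        def _positions(self, item):
            for i in range(self.k):
                yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

        def add(self, item, set_idx):
            # set_idx identifies the item's set; it maps to a column inside the window.
            for pos in self._positions(item):
                self.bits[pos][set_idx % self.omega] = 1

        def seen(self, item):
            # Probable duplicate if every hashed row has some bit set for a set in the window.
            return all(any(self.bits[pos]) for pos in self._positions(item))

        def delete_set(self, set_idx):
            # Advance the window: clear the oldest set's column in O(m), no re-hashing needed.
            col = set_idx % self.omega
            for row in self.bits:
                row[col] = 0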
9
Process Flow
Divide the data flow into batches of records. For each batch:
Pre-process: divide into chunks and remove internal duplicates (parallel); merge chunks and remove duplicates.
Process (FE and BE decoupled): compute the k hashes on record signatures (parallel); add to the BFA while removing duplicates (buffered + parallel).
Continue streaming over the next batches while removing inter-batch duplicates (ω-wide); delete sets from the BFA as the window advances.
10
Process Flow Diagram
11
Pre-processing stage
12
BFA Visualization
(Diagram: the BFA is an m × ω bit matrix; each of the m filter locations keeps one bit per set identifier in the current window {S_i, S_{i+1}, …, S_{i+ω}}. Adding ε ∊ S_i sets the bits of S_i's column at the k hashed locations; adding ε' ∊ S_{i+1} does the same in its column; deleting S_{i-ω} clears that set's column. τ appears in the diagram as a threshold parameter.)
13
Scalability Results
14
Conclusion
Since the overall thread count is constant, there is a trade-off between PP and FE+BE threads. The paper analyses a queueing model to find the sweet spot: the PP stage behaves like M/M/k1 and the FE+BE stages behave like G/M/k2. There is no mention of the FE/BE trade-off.
The algorithm scales well in the number of threads, the number of records and the record size. Experimental throughput: 0.81 GB/s.
15
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
16
ILA Method – Inline Process
ILA pros: simplicity; avoids accumulating data as in batch mode.
ILA cons: a full index requires either huge RAM (impractical) or a small RAM index whose disk lookups hurt bandwidth (the chunk-lookup disk bottleneck).
This paper addresses the main weakness of the ILA method, the chunk-lookup disk bottleneck, by using a sparse index and picking only a few candidates for deduplication, i.e. approximate deduplication.
17
A Few Words on Data Segmentation
The data stream entering the storage device is divided into large pieces called segments. Each segment is built from two parts:
Chunks – blocks of the real data.
Manifest – a structure that holds a pointer to each chunk, in the order the chunks appear in the original segment, plus a hash value for each chunk.
18
A Few Words on Data Segments – Quick Example
A data segment consists of the chunks A, B, A, B, A, C.
Chunk container (address – raw data): 0x1 – A, 0x2 – B, 0x3 – C; each unique chunk is stored only once.
Manifest (address – hash value): one entry per chunk in segment order, e.g. 0x1 – 0x234, 0x2 – 0x017, 0x1 – 0x234, 0x2 – 0x017, …, 0x3 – 0x459.
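A toy rendering of the chunk-container and manifest structures from this example (illustrative Python, not the paper's on-disk format; the address allocation and helper names are assumptions).

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class ChunkContainer:
        data: dict = field(default_factory=dict)          # address -> raw chunk bytes

        def store(self, chunk: bytes) -> int:
            addr = len(self.data) + 1                     # toy address allocation
            self.data[addr] = chunk
            return addr

    @dataclass
    class ManifestEntry:
        address: int                                      # where the chunk's data lives
        digest: str                                       # hash of the chunk

    def build_segment(chunks, container):
        # Store each unique chunk once; record every chunk's address and hash in order.
        seen = {}                                         # digest -> address
        manifest = []
        for chunk in chunks:
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in seen:
                seen[digest] = container.store(chunk)
            manifest.append(ManifestEntry(seen[digest], digest))
        return manifest

    container = ChunkContainer()
    manifest = build_segment([b"A", b"B", b"A", b"B", b"A", b"C"], container)
    print(len(container.data), len(manifest))             # 3 unique chunks, 6 manifest entries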
19
Proposed Flow
1. The data stream arrives and is divided into chunks using the Two-Threshold Two-Divisor (TTTD) chunking algorithm.
2. The chunks are grouped into segments using a fixed-size or variable-size segmentation algorithm.
3. The incoming segment's chunks are sampled, e.g. by keeping only hash values with a common prefix; the sampling frequency falls exponentially with the prefix length.
4. Champion(s) are picked from the sparse index in RAM according to a most-similar-segment policy. The RAM stores only the pointer to the manifest, so a read request to the disk is needed.
5. After retrieving the champion manifest from the disk, the deduplication process starts. At its end the new manifest, the new sparse-index entries and the new chunks are stored on disk.
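A rough sketch of the sampling and champion-selection steps (our paraphrase of the flow, not HP's implementation; the zero-prefix sampling rule, the sparse-index layout and all names are assumptions).

    import hashlib
    from collections import Counter

    def chunk_hash(chunk: bytes) -> str:
        return hashlib.sha1(chunk).hexdigest()

    def sample_hooks(chunk_hashes, prefix_bits=4):
        # Keep only hashes whose leading bits are zero; a longer prefix means
        # exponentially fewer samples, matching the slide's description.
        return [h for h in chunk_hashes if int(h, 16) >> (160 - prefix_bits) == 0]

    def pick_champions(hooks, sparse_index, max_champions=3):
        # sparse_index: hook -> list of manifest pointers whose segments contain this hook.
        votes = Counter()
        for hook in hooks:
            for manifest_ptr in sparse_index.get(hook, ()):
                votes[manifest_ptr] += 1
        # Most-similar-segment policy: the manifests sharing the most hooks win.
        return [ptr for ptr, _ in votes.most_common(max_champions)]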
20
Assumptions Used
In the proposed flow an approximation is used to avoid the main downside of inline deduplication, the chunk-lookup disk bottleneck. The approximation is the use of a sparse index, which implies that not all possible duplicate chunks are deduplicated. The assumptions are:
Locality of chunks – if a champion segment shares a few chunks with the incoming segment, it is likely to share many other chunks with it as well.
Locality of segments – most of the deduplication possible for a given segment can be obtained by deduplicating it against a small number of prior segments.
21
Simulation Results
In the graph below, the whole flow is as shown previously, with sparse indexing of the chunks and the mean segment size fixed at 10 MB.
SMB – a synthetic data set representing a small or medium business server backed up to virtual tape.
Workgroup – a synthetic data set representing a small corporate workgroup backed up via tar directly to a NAS interface.
22
Simulation Results
Deduplication factor = original size / deduplicated size.
SMB – a synthetic data set representing a small or medium business server backed up to virtual tape.
Workgroup – a synthetic data set representing a small corporate workgroup backed up via tar directly to a NAS interface.
23
Conclusions
The method presented in the paper has the following strengths:
Simple flow – an inline flow that uses known algorithms for the preprocessing stage (chunk and segment partitioning).
Very small RAM – stores roughly 10-15 prior segments; in fact only the sparse index and the manifest pointers are kept in RAM.
Used in industry – even though not all the details of the flow are presented, the fact that this system is in real use is a major strength.
24
Conclusions (2)
The method presented in the paper has the following weaknesses:
The efficiency of the flow depends crucially on the data set – the simulations show deduplication factors ranging from 2.3 to 13 across the different sets.
Both of the data sets used for evaluation are synthetic – since the flow is so sensitive to its data set, an evaluation on a real data set is badly needed.
25
Main Differences
                           Stream based                      Bloom Filter based
Processing                 inline                            inline
Deduplicating              "similar" chunks (sampled)        consecutive segments
Approximation              chunk sampling, sparse indexing   BF (false positives) + window limits
Throughput                 250 MB/s - 2.5 GB/s               0.81 GB/s
Purpose (proposed)         D2D                               storage saving
Scalability / parallelism  bounded by #champions             overall threads (internal division is optimized)
26
A Note on the Comparison
The trade-offs are between bandwidth, quality (level of deduplication) and space (RAM/disk).
Both approaches place no limit on the data set size (usually 10-100 TB) since they are inline.
Sparse indexing provides flexibility – it can support higher rates by deduplicating less thoroughly.
BF provides guaranteed deduplication within a given window of segments, but limits the bandwidth.
Inherent problem – the BF's strength is having no false negatives, while for de-duplication we require no false positives.
27
Proposal
Augmenting the BF-based approach: instead of an ω-wide window (which assumes temporal locality), generalize to a generic ω-way set "cache" – somewhat similar to the champion approach of the second paper.
Maintain sets that are either recent (from the window) or have been hit by some BF lookups.
This can be implemented via an LRU-like score based on the number of hits in the last de-duplication stage: each cycle, throw away the lowest-scoring set from the BFA and fill in the new set.
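A rough sketch of this eviction policy (this is our own proposal, so the code is only a thought experiment; the function name and data shapes are made up).

    def evict_and_admit(resident_sets, hit_counts, new_set):
        # resident_sets: ids of the sets currently occupying BFA columns.
        # hit_counts: id -> number of BF hits in the last de-duplication stage.
        victim = min(resident_sets, key=lambda s: hit_counts.get(s, 0))
        resident_sets.remove(victim)
        resident_sets.append(new_set)       # the new set takes the freed column
        return victim                       # caller clears this set's column in the BFA

    sets = ["S7", "S8", "S9"]
    victim = evict_and_admit(sets, {"S7": 12, "S8": 0, "S9": 5}, "S10")
    print(victim, sets)                     # S8 evicted; resident sets are now S7, S9, S10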