
1 Similarity based deduplication
By: Lior Aronovich, Ron Asher, Eitan Bachmat, Haim Bitner, Michael Hirsch, Tomi Klein

2 Deduplication
There is a lot of redundancy in stored data, especially in backup data. Deduplication aims to store only the differences between different versions.

3 Different types of deduplication
Inline or offline. Hash comparison or byte-by-byte. Similarity-based or identity-based.

4 Our initial design requirements
Support for a petabyte of physical storage. A deduplication rate of at least 350 MB/sec. Inline. Byte-to-byte (B2B) comparison.

5 Standard approach
Break up the incoming data stream into segments, a few KB in size. The break-up boundaries are computed from patterns in a rolling hash. Identify each segment by a long hash. Check whether the hash belongs to a previous segment; if so, place a pointer to that segment.
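
A minimal Python sketch of this standard approach, assuming illustrative parameters (window size, boundary mask, SHA-256 as the long hash) that are not taken from the talk:

```python
import hashlib

WINDOW = 48           # rolling-hash window size (illustrative)
MASK = (1 << 13) - 1  # boundary when the low 13 bits are zero: ~8 KB average segments

def rolling_segments(data: bytes):
    """Split data into segments where a Karp-Rabin rolling hash matches a pattern."""
    base, mod = 257, (1 << 31) - 1
    power = pow(base, WINDOW, mod)   # coefficient of the byte leaving the window
    h, start = 0, 0
    for i, b in enumerate(data):
        h = (h * base + b) % mod
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * power) % mod
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]   # boundary found: emit the segment
            start = i + 1
    if start < len(data):
        yield data[start:]            # trailing partial segment

def dedup(stream: bytes, index: dict, store: list):
    """Identify each segment by a long hash; reuse a pointer if already stored."""
    pointers = []
    for seg in rolling_segments(stream):
        digest = hashlib.sha256(seg).digest()   # the "long hash" identity
        if digest not in index:                 # unseen segment: store it once
            index[digest] = len(store)
            store.append(seg)
        pointers.append(index[digest])          # pointer to the stored segment
    return pointers
```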

6 Standard approach
Can be fast, can be inline, however:
Doesn't scale to a petabyte of physical storage (the index over KB-sized segments becomes too large). No B2B comparison.

7 Our approach
Break up the incoming stream into chunks a few MB in size. Compute a similarity (not identity!) signature, so that chunks that are alike (even only 50% similar) have signatures that are alike (perhaps only 25% similar). Do a B2B comparison between the incoming chunk and the similar repository segments. Store the differences found by the B2B comparison.
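
A toy end-to-end sketch of this flow, assuming a crude stand-in signature (the k largest block hashes; the next slide refines this) and Python's difflib as a stand-in for the real B2B engine; the block size and k are illustrative:

```python
import difflib, heapq, zlib

def toy_signature(chunk: bytes, k: int = 8, block: int = 64):
    """Crude similarity signature: the k largest CRC32 values over fixed blocks."""
    hashes = [zlib.crc32(chunk[i:i + block])
              for i in range(0, max(len(chunk) - block + 1, 1), block)]
    return set(heapq.nlargest(k, hashes))

def ingest(chunk: bytes, repo: list, sig_index: list):
    """One chunk through the pipeline: signature -> similarity lookup -> B2B diff."""
    sig = toy_signature(chunk)
    # Find the stored chunk sharing the most signature entries (even a partial overlap).
    best, overlap = None, 0
    for idx, old_sig in sig_index:
        shared = len(sig & old_sig)
        if shared > overlap:
            best, overlap = idx, shared
    if best is not None:
        # B2B comparison against the similar chunk; only the differences are new data.
        sm = difflib.SequenceMatcher(None, repo[best], chunk, autojunk=False)
        new_bytes = sum(j2 - j1 for op, i1, i2, j1, j2 in sm.get_opcodes() if op != 'equal')
        print(f"chunk is similar to #{best}; only {new_bytes} bytes are new")
    repo.append(chunk)                       # a real system would store just the delta
    sig_index.append((len(repo) - 1, sig))
```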

8 Similarity signatures
Compute a rolling (Karp-Rabin) hash over all blocks of the chunk. Three possibilities:
(Breen et al.) Take k random block hashes.
(Broder; Heintze; Manber) Take the k largest hashes.
(Our choice) Take the k hashes of blocks which are close to those that produced the k largest hashes.
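
A Python sketch of the three variants; the block size, the reading of "random" as a fixed value-based criterion, and the one-byte neighbor offset in the third method are illustrative assumptions, not the talk's actual parameters:

```python
import heapq

BLOCK = 512  # illustrative block size

def block_hashes(chunk: bytes):
    """Karp-Rabin hash of every overlapping BLOCK-sized window, with its offset."""
    base, mod = 257, (1 << 31) - 1
    power = pow(base, BLOCK, mod)
    h, out = 0, []
    for i, b in enumerate(chunk):
        h = (h * base + b) % mod
        if i >= BLOCK:
            h = (h - chunk[i - BLOCK] * power) % mod
        if i >= BLOCK - 1:
            out.append((h, i - BLOCK + 1))   # (hash value, block start offset)
    return out

def signature_random(chunk: bytes, k: int, modulus: int = 64):
    """Method 1: 'k random block hashes', read here as a fixed pseudo-random
    criterion (hash divisible by modulus) so similar chunks pick similar blocks."""
    return [h for h, _ in block_hashes(chunk) if h % modulus == 0][:k]

def signature_max(chunk: bytes, k: int):
    """Method 2: the k largest block hashes."""
    return heapq.nlargest(k, (h for h, _ in block_hashes(chunk)))

def signature_near_max(chunk: bytes, k: int, offset: int = 1):
    """Method 3 (the talk's choice): hashes of blocks close to those that produced
    the k largest hashes; 'close' is taken here as one byte later."""
    hashes = block_hashes(chunk)
    by_pos = {pos: h for h, pos in hashes}
    sig = []
    for _, pos in heapq.nlargest(k, hashes):
        neighbor = by_pos.get(pos + offset)
        if neighbor is not None:
            sig.append(neighbor)
    return sig
```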

9 Criteria for comparison
Similarity checking speed. Successful identification of the similarity percentage. Low probability of false positives. Likelihood of finding the most similar match.

10 Comparison of methods
The first method (random block hashes) is slow, has many false positives, and is less likely to find the best match than the other methods. The second method (k maximal hashes) is faster, but still has false positives. The third method addresses all of these issues.

11 B2B phase
Once similarity is detected, we know where in the repository the similar data is located, and we have a few anchoring matches. The B2B comparison itself is completely decoupled from the similarity search! We have the anchors and the computed hashes to support the B2B.
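
A sketch of how anchoring matches might drive the B2B phase: each anchor claims that a block of the incoming chunk matches a block in the repository, and the comparison verifies the claim byte by byte, then extends it in both directions. The anchor format and block size here are assumptions:

```python
def extend_anchor(new: bytes, old: bytes, n_pos: int, o_pos: int, block: int):
    """Verify one anchoring match byte by byte and extend it in both directions.
    Returns (start in new, start in old, length), or None on a false positive."""
    if new[n_pos:n_pos + block] != old[o_pos:o_pos + block]:
        return None   # the signature lied; the B2B comparison catches it
    lo = 0   # extend backwards
    while n_pos - lo > 0 and o_pos - lo > 0 and new[n_pos - lo - 1] == old[o_pos - lo - 1]:
        lo += 1
    hi = block   # extend forwards
    while n_pos + hi < len(new) and o_pos + hi < len(old) and new[n_pos + hi] == old[o_pos + hi]:
        hi += 1
    return (n_pos - lo, o_pos - lo, lo + hi)

def matched_regions(new: bytes, old: bytes, anchors, block: int):
    """Verified matching regions; bytes of `new` outside them are the
    differences that must actually be stored."""
    regions = []
    for n_pos, o_pos in anchors:
        r = extend_anchor(new, old, n_pos, o_pos, block)
        if r is not None:
            regions.append(r)
    return regions
```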

12 Implementation
The TS7650 from IBM, formerly from Diligent. Available since 2005. Many clients managing many petabytes. Very large installations.

13 Did we achieve our goals?
Up to 850 MB/sec on a single system node. Up to 1 PB of physically usable storage with only 12 GB of memory. Inline B2B comparison.

14 Some of the competition
The Data Domain 690 series. An HP system, the D2D4000 from 2008 (academic paper in FAST 2009): the only other similarity-based product, but with hash comparison rather than B2B, using a variant of the second method.

15 How do we compare?

16 We are better!!

17 In more detail
Our solution supports 1 PB of physical storage, while Data Domain supports at most 50 TB and the HP product at most 10 TB; we have actual installations far bigger than either of these numbers. Our solution is faster: somewhat faster than Data Domain, much faster than HP. We still find time to do the B2B comparison; they don't. Our solution also has a faster reconstruction rate, and remember, that is what matters in a data outage situation!

18 Customer site

19 40:1 dedupe ratio

20 Daily back-up fluctuations

21 Throughput

22 Thanks!!

