Controlling the Chunk Size in Deduplication Systems


1 Controlling the Chunk Size in Deduplication Systems
M. Hirsch, S.T. Klein, D. Shapira, Y. Toaff (Israel)

2 Background and motivation
Compression
Deduplication
Partition into chunks (4K to 16M)
Apply hash function
Store fingerprints (hash table / B-tree)

3 Background and motivation
Algorithm for storing a repository
Signature size: k bits
Hash table: 2^k entries
[Figure: repository chunks and their signatures indexed in the hash table]
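The storing algorithm on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: fixed-size chunks, a k-bit signature taken from the top bits of SHA-256, and a table of 2^k slots. The names `Repository`, `signature`, `K_BITS` and `CHUNK_SIZE` are hypothetical and do not come from the presentation.

```python
import hashlib

K_BITS = 16          # assumed signature size k; the table has 2**k slots
CHUNK_SIZE = 4096    # assumed fixed chunk size (the slides mention 4K to 16M)


def signature(chunk: bytes, k: int = K_BITS) -> int:
    """Return a k-bit signature taken from the top bits of SHA-256."""
    digest = int.from_bytes(hashlib.sha256(chunk).digest(), "big")
    return digest >> (256 - k)


class Repository:
    def __init__(self, k: int = K_BITS):
        self.k = k
        self.table = [[] for _ in range(2 ** k)]  # slot -> list of chunk ids
        self.chunks = []                          # chunk id -> stored bytes

    def store(self, data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
        """Store data and return the chunk ids that reconstruct it."""
        recipe = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            slot = self.table[signature(chunk, self.k)]
            # Compare contents, since a short signature can collide.
            for chunk_id in slot:
                if self.chunks[chunk_id] == chunk:
                    break
            else:
                # No identical chunk found: store it as a new one.
                chunk_id = len(self.chunks)
                self.chunks.append(chunk)
                slot.append(chunk_id)
            recipe.append(chunk_id)
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild the original data from a list of chunk ids."""
        return b"".join(self.chunks[i] for i in recipe)
```

Storing three blocks where the first and third are identical would keep only two chunks and return a recipe that references the same chunk id twice.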

4 Background and motivation
Chunk size dilemma
Small vs. large chunks: small means more overhead, large means less deduplication
Fixed vs. variable length: fixed is easier, variable is more robust

5 Variable length chunks
Seed, hash function
Expected size of chunk:
BUT great variability
Min and max sizes: 1K, 8K
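A minimal sketch of variable-length chunking with forced minimum and maximum sizes, as described on this slide. Only the 1K and 8K bounds come from the slide. The polynomial rolling hash, the 48-byte window, the base, the modulus and the 12-bit cut condition are illustrative assumptions, and `chunk_boundaries` is a hypothetical name.

```python
WINDOW = 48            # assumed sliding-window length in bytes
MASK_BITS = 12         # assumed: cut when the low 12 bits of the hash are zero
MIN_SIZE = 1024        # 1K lower bound from the slide
MAX_SIZE = 8192        # 8K upper bound from the slide
BASE = 257             # assumed polynomial base
MOD = (1 << 61) - 1    # a Mersenne prime, chosen here for convenience


def chunk_boundaries(data: bytes) -> list:
    """Return the end offsets of chunks found by a polynomial rolling hash."""
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    mask = (1 << MASK_BITS) - 1
    cuts = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so h always covers the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        # Cut on the hash condition, but never below MIN_SIZE,
        # and force a cut once MAX_SIZE is reached.
        if size >= MIN_SIZE and ((h & mask) == 0 or size >= MAX_SIZE):
            cuts.append(i + 1)
            start = i + 1
    if start < len(data):
        cuts.append(len(data))         # trailing partial chunk
    return cuts
```

The forced cut at `MAX_SIZE` depends on where the chunk started rather than on the content, which is the kind of artificial cutoff point the deck goes on to criticize.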

6 Variable length chunks
Problem of artificial cutoff points:
Not robust
Not reproducible
Inconvenient distribution

7 New segmentation procedure
Use sequence of functions and constants
1) All functions are easily calculable
2) There exists an increasing sequence of probabilities such that
3) Conditions are inclusive
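One possible way to satisfy the three requirements above is sketched below. It is an assumption, not the authors' construction: the cut condition is `h % MODULUS < threshold`, and the threshold grows with the current chunk size. A modulus comparison is cheap to calculate, the cut probability `threshold / MODULUS` increases along the sequence, and any hash value that satisfies an earlier condition also satisfies every later one. The names `SCHEDULE`, `threshold_for` and `segment`, and all numeric values, are hypothetical.

```python
MODULUS = 1 << 16

# (chunk size from which the condition applies, threshold); assumed values.
# Thresholds must be non-decreasing so that the conditions are inclusive.
SCHEDULE = [(0, 0), (1024, 8), (2048, 32), (4096, 256), (8192, MODULUS)]


def threshold_for(size, schedule=SCHEDULE):
    """Return the threshold of the last schedule entry whose size is reached."""
    current = 0
    for min_size, threshold in schedule:
        if size >= min_size:
            current = threshold
    return current


def segment(hashes, schedule=SCHEDULE, modulus=MODULUS):
    """Return chunk end offsets, given one hash value per byte position.

    The hash values are assumed to come from any rolling hash
    evaluated at each position of the input.
    """
    cuts = []
    start = 0
    n = 0
    for i, h in enumerate(hashes):
        n = i + 1
        size = n - start
        # The longer the current chunk, the weaker the cut condition.
        if h % modulus < threshold_for(size, schedule):
            cuts.append(n)
            start = n
    if n > start:
        cuts.append(n)                 # trailing partial chunk
    return cuts
```

With this schedule a hash value of 100 cannot end a chunk shorter than 4096 bytes but does end one from that size on, while a hash value of 0 ends a chunk as soon as it reaches 1024 bytes.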

8 New segmentation procedure
Small inserts and deletes

9 New segmentation procedure
To get set

10 New segmentation procedure
P: large random prime
C: large constant

11 Example distribution

12 Cumulative probabilities Individual probabilities

13 Experimental results
                        Number   Avg size   Std dev
Constant                15.7     2127       2347
Constant                5.5      2502       2568
Variable probability    15.8     2176       1014
Variable probability    5.9      2273       1081

14

15 Thank you!

16 Using fractional bits

17

