Controlling the Chunk Size in Deduplication Systems
M. Hirsch, S. T. Klein, D. Shapira, Y. Toaff
Israel
Background and motivation
- Compression by deduplication
- Partition the data into chunks (4K – 16M)
- Apply a hash function to each chunk
- Store the fingerprints in a hash table / B-tree
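The pipeline above can be sketched in a few lines: partition the data into chunks and fingerprint each one. This is a minimal illustration, not the authors' implementation; the 4096-byte chunk size and the use of SHA-1 are assumed values within the slide's stated range.

```python
import hashlib

def fingerprint_chunks(data: bytes, chunk_size: int = 4096):
    """Partition data into fixed-size chunks and fingerprint each one.

    chunk_size = 4096 is an assumed value inside the slide's 4K - 16M range.
    """
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # A cryptographic hash of the chunk content serves as its fingerprint;
        # identical chunks always yield identical fingerprints.
        yield hashlib.sha1(chunk).hexdigest()

prints = list(fingerprint_chunks(b"abc" * 5000))   # 15000 bytes -> 4 chunks
```

Identical chunks map to the same fingerprint, which is what makes duplicate detection a simple lookup.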
Algorithm for storing a repository
- Each chunk gets a signature of k bits
- Signatures are stored in a hash table with 2^k entries
[Diagram: repository chunks indexed through the hash table by their signatures]
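The storing algorithm can be sketched as follows: derive a k-bit signature per chunk and store a chunk only when its signature is new. A minimal sketch, assuming K = 32 and a SHA-256-derived signature (both illustrative choices, not from the slide); a real system would also compare chunk contents to rule out signature collisions.

```python
import hashlib

K = 32  # signature size in bits (assumed); the table conceptually has 2^K entries

def signature(chunk: bytes) -> int:
    """Derive a K-bit signature for a chunk from a cryptographic hash."""
    digest = hashlib.sha256(chunk).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << K) - 1)

def store(repository: list, table: dict, chunk: bytes) -> int:
    """Store the chunk only if its signature is new; return the index
    of the (possibly pre-existing) copy in the repository.

    Note: this sketch treats equal signatures as equal chunks and
    ignores the (rare) possibility of signature collisions.
    """
    sig = signature(chunk)
    if sig in table:              # duplicate signature: reuse the stored chunk
        return table[sig]
    table[sig] = len(repository)
    repository.append(chunk)
    return table[sig]

repo, table = [], {}
first = store(repo, table, b"chunk A")
dup = store(repo, table, b"chunk A")    # deduplicated: nothing new stored
other = store(repo, table, b"chunk B")
```

Only two chunks end up in the repository even though three were submitted.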
Background and motivation
Chunk size dilemma:
- small chunks: more overhead
- large chunks: less deduplication
- fixed size: easier
- variable size: more robust
Variable length chunks
- Apply a hash function to a sliding window (the seed); declare a cutoff point when the hash value satisfies a fixed condition
- Expected size of a chunk: the reciprocal of the condition's probability
- BUT: great variability in chunk sizes
- Impose max and min sizes, e.g., 1K and 8K
Variable length chunks
Problem of artificial cutoff points:
- Not robust
- Not reproducible
- Inconvenient distribution
New segmentation procedure
Use a sequence of functions h_1, …, h_m and constants c_1, …, c_m such that:
1) All functions are easily calculable
2) There exists an increasing sequence of probabilities p_1 < p_2 < ⋯ < p_m such that Pr(h_i(seed) = c_i) = p_i
3) Conditions are inclusive: if h_i(seed) = c_i, then h_j(seed) = c_j for all j > i
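One way to realize such a family of inclusive conditions (an assumption for illustration; the slide does not show the authors' concrete functions) is to require a decreasing number of low-order zero bits of a single hash value, and to weaken the condition as the current chunk grows. The bit counts and size thresholds below are hypothetical parameters.

```python
# Condition i: the lowest KS[i] bits of the seed's hash are zero.
# KS is decreasing, so the probabilities p_i = 2^-KS[i] are increasing,
# and the conditions are inclusive: satisfying condition i implies
# satisfying every later (weaker) condition.
KS = [16, 14, 12, 10]                    # assumed parameters

def satisfies(h: int, level: int) -> bool:
    """True if hash value h meets the condition at the given level."""
    return h & ((1 << KS[level]) - 1) == 0

def is_cutoff(h: int, size: int, thresholds=(2048, 4096, 6144)) -> bool:
    """Apply a strict (low-probability) condition while the current chunk
    is small, and progressively weaker ones as it grows, concentrating
    the chunk-size distribution; thresholds are assumed values."""
    level = sum(size >= t for t in thresholds)   # 0 .. len(KS)-1
    return satisfies(h, level)
```

Inclusivity is what makes cutoff points stable: a position that qualifies under a strict condition still qualifies under every weaker one, so shifting the size thresholds cannot destroy it.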
New segmentation procedure
Small inserts and deletes
New segmentation procedure
To get …, set …
New segmentation procedure
- P: a large random prime
- C: a large constant
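The slide only names the parameters. A minimal sketch of one standard way a large prime P and a large constant C enter such a hash — a Rabin-Karp style polynomial hash modulo P with radix C — which is an assumption here, not the authors' stated construction; the concrete values of P and C are also illustrative stand-ins.

```python
P = 2**61 - 1      # a large (Mersenne) prime, standing in for a random one
C = 2**32 + 15     # a large constant, used here as the radix (assumed role)

def h(seed: bytes) -> int:
    """Polynomial hash of the seed: sum of seed[i] * C^(n-1-i), mod P."""
    value = 0
    for byte in seed:
        value = (value * C + byte) % P
    return value
```

Working modulo a random prime is the classical way to make the hash values behave uniformly regardless of the input distribution.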
Example distribution
[Plot: cumulative and individual probabilities of the chunk size distribution]
Experimental results

                          number   Avg size   Std dev  |  number   Avg size   Std dev
Constant probability       15.7      2127       2347   |    5.5      2502       2568
Variable probabilities     15.8      2176       1014   |    5.9      2273       1081

With variable probabilities the number of chunks and the average chunk size stay essentially the same, while the standard deviation drops to less than half.
Thank you !
Using fractional bits