Published by Hans Becke. Modified over 6 years ago.
1
Controlling the Chunk Size in Deduplication Systems
M. Hirsch, S.T. Klein, D. Shapira, Y. Toaff (Israel)
2
Background and motivation
Compression
Deduplication:
Partition into chunks (4K – 16M)
Apply hash function
Store fingerprints (hash / B-tree)
3
Background and motivation
Algorithm for storing a repository
Signature size: k bits
Hash table: 2^k entries
[Figure: hash table entries pointing to repository chunks]
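The storing algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: SHA-256 as the hash function, the value `K_BITS = 16`, fixed-size chunking, and the names `signature` and `store` are all assumptions made for the example. A k-bit signature indexes a table with 2^k possible slots, and because short signatures can collide, candidate chunks are compared byte for byte before being treated as duplicates.

```python
import hashlib

K_BITS = 16  # assumed signature size k; the table then has 2**k possible slots


def signature(chunk: bytes, k: int = K_BITS) -> int:
    """Return the top k bits of a SHA-256 digest as an integer in [0, 2**k)."""
    digest = hashlib.sha256(chunk).digest()
    return int.from_bytes(digest, "big") >> (256 - k)


def store(data: bytes, chunk_size: int, table: dict, repo: list) -> list:
    """Store data as fixed-size chunks and return the list of chunk indices.

    table maps a k-bit signature to the indices of repository chunks that
    share it; repo is the list of unique chunks stored so far.
    """
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        sig = signature(chunk)
        for idx in table.get(sig, []):
            if repo[idx] == chunk:      # true duplicate: reuse the stored chunk
                recipe.append(idx)
                break
        else:                           # no match: append a new chunk
            repo.append(chunk)
            table.setdefault(sig, []).append(len(repo) - 1)
            recipe.append(len(repo) - 1)
    return recipe
```

Calling `store(b"abcdabcdxyzw", 4, {}, [])` stores two unique chunks and returns a recipe that references the first one twice; the original data is recovered by concatenating `repo[i]` for each index in the recipe.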
4
Background and motivation
Chunk size dilemma
Small chunks: more overhead. Large chunks: less deduplication.
Fixed-size chunks: easier. Variable-size chunks: more robust.
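The robustness side of the dilemma can be seen in a small experiment. The sketch below is illustrative only: the window length, the mask, and the use of SHA-256 over each window (slow, but simple) are assumptions chosen for clarity. Inserting a single byte at the front shifts every fixed-size chunk, while content-defined boundaries move with the data, so only the first chunk changes.

```python
import hashlib


def fixed_chunks(data: bytes, size: int) -> list:
    """Split data into consecutive chunks of exactly `size` bytes (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data: bytes, window: int = 8, mask: int = 0x3F) -> list:
    """Cut after position i when a hash of the preceding `window` bytes matches the mask.

    The boundary depends only on local content, not on the absolute offset.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        digest = hashlib.sha256(data[i - window:i]).digest()
        if digest[-1] & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct chunks that appear in both lists."""
    return len(set(a) & set(b))
```

With pseudo-random `data` and `edited = b"X" + data`, `shared(fixed_chunks(data, 64), fixed_chunks(edited, 64))` is expected to be zero, whereas every content-defined chunk of `data` after the first one reappears unchanged in `edited`.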
5
Variable length chunks
Seed, hash function
Expected size of chunk:
BUT great variability
Min and max sizes: 1K, 8K
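A conventional variable-length chunker with artificial minimum and maximum sizes might look like the sketch below. The polynomial rolling hash, the constants `BASE` and `PRIME`, the 16-byte window, and the 11-bit mask are assumptions for illustration; only the 1K and 8K limits come from the slide. If each position were a cut point independently with probability p, the distance between cuts would be geometric with mean 1/p, which is why an 11-bit mask targets roughly 2K bytes beyond the minimum.

```python
BASE = 257
PRIME = (1 << 61) - 1  # a Mersenne prime, used as the hash modulus


def chunk_ends(data: bytes, mask_bits: int = 11, window: int = 16,
               min_size: int = 1024, max_size: int = 8192) -> list:
    """Return the end offsets of the chunks of `data`.

    A cut is declared when the low `mask_bits` bits of a rolling hash of the
    last `window` bytes are zero, but never before `min_size` bytes and
    always at `max_size` bytes (the artificial cutoff points).
    """
    mask = (1 << mask_bits) - 1
    top = pow(BASE, window - 1, PRIME)  # weight of the byte leaving the window
    ends, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % PRIME  # drop the oldest byte
        h = (h * BASE + byte) % PRIME                 # shift in the new byte
        size = i + 1 - start
        if size < min_size:
            continue
        if (h & mask) == 0 or size >= max_size:
            ends.append(i + 1)
            start = i + 1
    if start < len(data):
        ends.append(len(data))  # trailing partial chunk
    return ends
```

Every chunk except possibly the last has a size between `min_size` and `max_size`; cuts forced at `max_size` do not depend on the content.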
6
Variable length chunks
Problem of artificial cutoff points:
Not robust
Not reproducible
Inconvenient distribution
7
New segmentation procedure
Use sequence of functions and constants
1) All functions are easily calculable
2) There exists an increasing sequence of probabilities such that
3) Conditions are inclusive
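One way to satisfy the three properties is sketched below. The specific choice is an assumption for illustration and is not taken from the slides: condition j requires a decreasing number of zero low-order hash bits as the chunk grows, so each test is a single mask operation, the cut probabilities 2^-13, 2^-12, 2^-11, 2^-10 increase, and any hash value that satisfies an earlier condition also satisfies every later one. The `SCHEDULE` sizes and bit counts are hypothetical.

```python
from bisect import bisect_right

# Hypothetical schedule: (chunk size from which the condition applies,
# number of low-order hash bits that must be zero).
SCHEDULE = [(0, 13), (2048, 12), (4096, 11), (6144, 10)]
STARTS = [start for start, _ in SCHEDULE]


def zero_bits_required(size: int) -> int:
    """Return how many low hash bits must be zero at the current chunk size."""
    return SCHEDULE[bisect_right(STARTS, size) - 1][1]


def is_cut(h: int, size: int) -> bool:
    """Test the condition that applies at this chunk size."""
    bits = zero_bits_required(size)
    return h & ((1 << bits) - 1) == 0


def cut_probabilities() -> list:
    """Per-position cut probability of each condition, for a uniform hash."""
    return [2.0 ** -bits for _, bits in SCHEDULE]
```

Because the conditions are nested, a position that is a cut point under a strict condition remains one under every looser condition, which is what allows boundaries to be found again after small local changes.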
8
New segmentation procedure
Small inserts and deletes
9
New segmentation procedure
To get set
10
New segmentation procedure
P: large random prime
C: large constant
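How a prime modulus and a constant could be combined into a cut test is sketched below. The form used here is an assumption, not the formula from the slide: a fingerprint is multiplied by C, reduced modulo P, and compared with a threshold. The values of `P` and `C` are stand-ins (a fixed Mersenne prime rather than a random one), and `scramble`, `threshold_for`, and `is_cut` are hypothetical names. A threshold test allows any cut probability t/P rather than only powers of two, and raising the threshold keeps the conditions inclusive.

```python
P = (1 << 61) - 1         # stand-in for a large prime (fixed here, not random)
C = 0x9E3779B97F4A7C15    # stand-in for a large constant multiplier


def scramble(value: int) -> int:
    """Map an integer fingerprint to a residue in [0, P)."""
    return (C * value) % P


def threshold_for(probability: float) -> int:
    """Threshold t such that a uniform residue is below t with about this probability."""
    return int(probability * P)


def is_cut(value: int, probability: float) -> bool:
    """Declare a cut when the scrambled fingerprint falls below the threshold."""
    return scramble(value) < threshold_for(probability)
```

Any value that passes the test at probability 0.1 also passes it at 0.3, since the second threshold is larger.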
11
Example distribution
12
[Figure panels: cumulative probabilities; individual probabilities]
13
Experimental results
Method                 Number   Avg size   Std dev
Constant               15.7     2127       2347
Constant               5.5      2502       2568
Variable probability   15.8     2176       1014
Variable probability   5.9      2273       1081
15
Thank you!
16
Using fractional bits