Published by Sudomo Budiono. Modified over 5 years ago.
1
Optimal Partitioning of Data Chunks in Deduplication Systems
M. Hirsch, A. Ish-Shalom, S.T. Klein (Israel)
2
Outline
Background and motivation
Definitions
Problem Statement
Proposed optimal solution
Improvements
3
Background and motivation
Compression
Deduplication: partition the data into chunks (4K – 16M), apply a hash function to each chunk, and store the fingerprints in a hash table or B-tree.
4
Background and motivation
Chunk size dilemma: small chunks mean more overhead; large chunks mean less deduplication.
Similarity: large chunks.
Zoom in on how to store large chunks.
5
Definitions
Data chunks: a sequence of matching and non-matching parts.
Matching (M): old data, referenced by (offset, length).
Non-matching (NM): new data, stored explicitly.
The partition is given but may be altered.
6
Cost: an M element costs one meta-data entry (E bytes); an NM element costs one meta-data entry + size(new data).
Conversion: 1 byte of meta-data = F bytes of stored data.
Flexibility: M elements cannot be merged, but can be declared as NM; NM elements may be merged, but cannot change to M.
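The cost rules above can be written down as a small sketch. The `Element` class, the function names, and the choice to express all costs in stored-data bytes (so that one meta-data entry weighs F·E) are assumptions made for illustration; the slides give only the rules themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    kind: str    # "M" (matching) or "NM" (non-matching)
    length: int  # number of chunk bytes the element covers


def element_cost(el: Element, E: int, F: int) -> int:
    """Cost in stored-data bytes (assumed unit): one meta-data entry of
    E bytes is weighted by F; an NM element also stores its data."""
    meta = F * E
    return meta if el.kind == "M" else meta + el.length


def partition_cost(elements, E: int, F: int) -> int:
    """Total cost of a partition: the sum of its element costs."""
    return sum(element_cost(el, E, F) for el in elements)


def merge_as_nm(elements) -> Element:
    """Declare every element NM and merge them into a single NM element."""
    return Element("NM", sum(el.length for el in elements))


# Hypothetical values: E = 4 bytes per entry, F = 2, so F*E = 8.
parts = [Element("NM", 100), Element("M", 10), Element("NM", 50)]
print(partition_cost(parts, E=4, F=2))                 # 174
print(partition_cost([merge_as_nm(parts)], E=4, F=2))  # 168
```

In this example the merged form is cheaper: two meta-data entries are saved (16 bytes) at the price of storing 10 bytes that were previously only referenced.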
7
Problem Statement
Problem: find the optimal partition, that is, the one minimizing cost.
[Figure: example partitions labeled with the conditions "size < 2 FE", "no gain", "sizes < 4 FE" and "size < FE".]
Saving space AND time and I/O.
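Thresholds of this kind can be checked with simple arithmetic under the cost rules from the Definitions slides. The sketch below assumes costs are measured in stored-data bytes (an M element costs F·E, an NM element F·E plus its length); the function names and the numeric values of E and F are hypothetical, and matching each printed case to a specific picture on the slide is a guess.

```python
def cost(elements, E, F):
    """Assumed cost model: every element pays F*E for its meta-data entry,
    and NM elements additionally pay for their explicitly stored bytes."""
    return sum(F * E + (length if kind == "NM" else 0)
               for kind, length in elements)


def gain_from_merging(elements, E, F):
    """Bytes saved by declaring every element NM and merging all of them
    into a single NM element (negative means the merge is a loss)."""
    merged = [("NM", sum(length for _, length in elements))]
    return cost(elements, E, F) - cost(merged, E, F)


E, F = 4, 2  # hypothetical values, so F*E = 8

# M between two NMs: two entries saved, gain = 2*F*E - size = 16 - 10
print(gain_from_merging([("NM", 100), ("M", 10), ("NM", 50)], E, F))  # 6

# M next to a single NM: one entry saved, gain = F*E - size = 8 - 10
print(gain_from_merging([("M", 10), ("NM", 50)], E, F))  # -2

# Two Ms among three NMs: four entries saved, gain = 4*F*E - (10 + 20)
print(gain_from_merging(
    [("NM", 100), ("M", 10), ("NM", 50), ("M", 20), ("NM", 30)], E, F))  # 2
```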
8
1’s cannot be changed.
Solve separately for every run of 0’s.
Exhaustive search: 2^n possibilities.
Merge optimal solutions for sub-ranges: Dynamic Programming.
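To make the size of the search space concrete, here is a minimal brute-force sketch. It tries every keep/convert choice for the M elements, merges adjacent NM elements, and keeps the cheapest result. The tuple representation, the function names and the cost unit (stored-data bytes, with F·E per meta-data entry) are assumptions; this is a baseline for comparison, not the method proposed in the talk.

```python
from itertools import product


def cost(elements, E, F):
    """Assumed cost model: F*E per element, plus the length of NM data."""
    return sum(F * E + (length if kind == "NM" else 0)
               for kind, length in elements)


def merge_adjacent_nm(elements):
    """Merge every run of consecutive NM elements into one NM element."""
    out = []
    for kind, length in elements:
        if kind == "NM" and out and out[-1][0] == "NM":
            out[-1] = ("NM", out[-1][1] + length)
        else:
            out.append((kind, length))
    return out


def exhaustive_search(elements, E, F):
    """Try all 2^m keep/convert choices for the m matching elements."""
    m_positions = [i for i, (kind, _) in enumerate(elements) if kind == "M"]
    best_cost, best_partition = None, None
    for choice in product((False, True), repeat=len(m_positions)):
        candidate = list(elements)
        for pos, convert in zip(m_positions, choice):
            if convert:  # declare this M element as NM
                candidate[pos] = ("NM", elements[pos][1])
        candidate = merge_adjacent_nm(candidate)
        c = cost(candidate, E, F)
        if best_cost is None or c < best_cost:
            best_cost, best_partition = c, candidate
    return best_cost, best_partition


chunk = [("NM", 100), ("M", 10), ("NM", 50), ("M", 200), ("NM", 30)]
print(exhaustive_search(chunk, E=4, F=2))
# (214, [('NM', 160), ('M', 200), ('NM', 30)])
```

The running time doubles with every additional M element, which is why the slide turns to dynamic programming over sub-ranges.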
9
Proposed optimal solution
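Two tables are kept, indexed by item ranges: the cost of the optimal solution for items i to j, and the optimal partition for items i to j.
[Table figure: rows and columns indexed 1, 2, …, n, with the labels "Solution", "Initialize" and "Fill by diagonals".]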
10
Recursion: for items i to j, either stay with …0, or there is a 1 in some position k, which splits the range into items i to k - 1 and items k + 1 to j.
11
[Figure: items i, …, k - 1, k, k + 1, …, j, with boundary values L and R, a function f(L, R), and the values -1 and 1.]
12
Each item depends on those to its left in the same row and on those below it in the same column.
[Table figure: rows i and columns j, each indexed 1, 2, …, n.]
13
Algorithm Initialization:
14
Main loop
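The initialization and main loop are not reproduced in this transcript. As a rough illustration only, the sketch below fills a cost table by diagonals using the recursion described earlier: a range is either merged into a single NM element, or some M element at position k is kept and the two sides are solved independently. The cost model (F·E per meta-data entry, in stored-data bytes), the 0-based indexing and all names are assumptions, and the boundary function f(L, R) from the slides is not modeled.

```python
def optimal_cost(elements, E, F):
    """Interval DP over (kind, length) elements; returns the minimum cost."""
    n = len(elements)
    if n == 0:
        return 0
    meta = F * E

    # prefix[t] = total length of elements[0:t], for O(1) range sums
    prefix = [0]
    for _, length in elements:
        prefix.append(prefix[-1] + length)

    C = [[0] * n for _ in range(n)]  # C[i][j] used only for i <= j

    def sub(i, j):
        """Cost of an optimal solution for items i..j (0 if the range is empty)."""
        return C[i][j] if i <= j else 0

    for d in range(n):              # d = j - i: fill diagonal by diagonal
        for i in range(n - d):
            j = i + d
            # Option 1: declare everything NM and merge into one element.
            best = meta + prefix[j + 1] - prefix[i]
            # Option 2: keep some matching element k as M and split there.
            for k in range(i, j + 1):
                if elements[k][0] == "M":
                    best = min(best, sub(i, k - 1) + meta + sub(k + 1, j))
            C[i][j] = best
    return C[0][n - 1]


chunk = [("NM", 100), ("M", 10), ("NM", 50), ("M", 200), ("NM", 30)]
print(optimal_cost(chunk, E=4, F=2))  # 214
```

Each entry C[i][j] reads only entries to its left in row i and below it in column j, all of which lie on earlier diagonals. As written, the three nested loops take cubic time and the table takes quadratic space.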
15
Complexity:
16
Improvements
Reducing the time complexity: consider the maximal possible gain.
Sufficient condition to remain 0:
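The slide's own condition is not reproduced in this transcript. One sufficient condition that follows from the cost rules, assuming costs in stored-data bytes, is this: converting an M element can remove at most two meta-data entries (its own neighbors merge with it), so the gain is at most 2·F·E, and an M element of length at least 2·F·E can safely stay M. Such elements then cut the input into independent sub-problems. The function names and the threshold as stated here are assumptions, not necessarily the authors' formulation.

```python
def must_stay_matching(kind, length, E, F):
    """Sufficient (not necessary) test under the assumed cost model:
    the maximal gain from converting one M element is 2*F*E."""
    return kind == "M" and length >= 2 * F * E


def split_into_subproblems(elements, E, F):
    """Return inclusive index ranges (start, end) lying between the
    M elements that are known to stay matching."""
    ranges = []
    start = 0
    for idx, (kind, length) in enumerate(elements):
        if must_stay_matching(kind, length, E, F):
            if start < idx:
                ranges.append((start, idx - 1))
            start = idx + 1
    if start < len(elements):
        ranges.append((start, len(elements) - 1))
    return ranges


chunk = [("NM", 100), ("M", 10), ("NM", 50), ("M", 200), ("NM", 30)]
print(split_into_subproblems(chunk, E=4, F=2))  # [(0, 2), (4, 4)]
```

Each returned range can be optimized on its own, so a cubic-time procedure runs on several short ranges instead of one long one.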
17
Improvements
Reducing the space complexity: […] entries, each of size […].
Reduce to […] space by keeping, for each entry, the index of the optimal k.
Build the partition string recursively.
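A minimal sketch of this idea: alongside each cost, store only the split index k that achieved it (or a sentinel when the whole range is merged), then rebuild the partition string by recursion. The cost model, the sentinel -1, and the output encoding ('1' for an element kept as M, '0' for an element stored as NM) are assumptions chosen for this example and may differ from the encoding used on the slides.

```python
def optimal_partition(elements, E, F):
    """Return (minimum cost, partition string) for (kind, length) elements."""
    n = len(elements)
    if n == 0:
        return 0, ""
    meta = F * E
    prefix = [0]
    for _, length in elements:
        prefix.append(prefix[-1] + length)

    C = [[0] * n for _ in range(n)]   # optimal cost for items i..j
    K = [[-1] * n for _ in range(n)]  # optimal split index, -1 = merge all

    def sub(i, j):
        return C[i][j] if i <= j else 0

    for d in range(n):
        for i in range(n - d):
            j = i + d
            best, best_k = meta + prefix[j + 1] - prefix[i], -1
            for k in range(i, j + 1):
                if elements[k][0] == "M":
                    c = sub(i, k - 1) + meta + sub(k + 1, j)
                    if c < best:
                        best, best_k = c, k
            C[i][j], K[i][j] = best, best_k

    def build(i, j):
        """Rebuild the partition string for items i..j from the K table."""
        if i > j:
            return ""
        k = K[i][j]
        if k == -1:
            return "0" * (j - i + 1)
        return build(i, k - 1) + "1" + build(k + 1, j)

    return C[0][n - 1], build(0, n - 1)


chunk = [("NM", 100), ("M", 10), ("NM", 50), ("M", 200), ("NM", 30)]
print(optimal_partition(chunk, E=4, F=2))  # (214, '00010')
```

Storing one integer per table entry instead of a full partition string keeps each entry at constant size; the string is produced once at the end.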
18
Thank you!