Optimal Partitioning of Data Chunks in Deduplication Systems


1 Optimal Partitioning of Data Chunks in Deduplication Systems
M. Hirsch, A. Ish-Shalom, S.T. Klein, Israel

2 Outline
Background and motivation; Definitions; Problem Statement; Proposed optimal solution; Improvements

3 Background and motivation
Compression; deduplication: partition into chunks (4K – 16M), apply a hash function, store the fingerprints (hash / B-tree).
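The chunk, hash and store pipeline listed on this slide can be sketched as follows. This is only a minimal illustration: the fixed 4K chunk size, the choice of SHA-256, and the names `deduplicate` and `restore` are assumptions made for the example, and a plain dictionary stands in for the hash table or B-tree of fingerprints.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store each distinct chunk once."""
    store = {}   # fingerprint -> chunk bytes (stand-in for a hash table / B-tree)
    recipe = []  # fingerprints, in order, needed to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored chunks."""
    return b"".join(store[fp] for fp in recipe)
```

With data made of three identical chunks and one different chunk, the store holds two chunks while the recipe still lists four fingerprints.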

4 Background and motivation
Chunk size dilemma: small chunks give more overhead; large chunks give less deduplication. Similarity: large chunks. Zoom in on how to store large chunks.

5 Definitions
Data chunks: a sequence of matching and non-matching parts. Matching (M): old data, stored as (offset, length). Non-matching (NM): new data, stored explicitly. The partition is given but may be altered.

6 Cost, conversion and flexibility
Cost: M = meta-data entry (E bytes); NM = meta-data entry + size(new data).
Conversion: 1 byte of meta-data = F bytes of stored data.
Flexibility: M cannot merge but can be declared as NM; NM may merge but cannot change to M.
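The cost model of this slide can be written down as a small function. This is a sketch under stated assumptions: items are represented as hypothetical `(kind, size)` tuples, adjacent NM items are taken to merge into a single meta-data entry (one reading of "NM may merge"), and the values of E and F used below are arbitrary.

```python
from typing import List, Tuple

Item = Tuple[str, int]  # ("M" or "NM", size in bytes); representation is an assumption


def partition_cost(items: List[Item], E: int, F: int) -> int:
    """Cost of a partition, expressed in bytes of stored data.

    Every M item costs one meta-data entry (F * E). A run of adjacent
    NM items is assumed to merge into one entry plus the data itself.
    """
    cost = 0
    prev_kind = None
    for kind, size in items:
        if kind == "M":
            cost += F * E             # meta-data entry only
        elif kind == "NM":
            if prev_kind == "NM":
                cost += size          # merged into the previous NM item
            else:
                cost += F * E + size  # new entry plus its data
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
        prev_kind = kind
    return cost


# Assumed values: E = 16 bytes per entry, F = 4, so one entry costs 64.
original = [("NM", 100), ("M", 50), ("NM", 200)]
declared = [("NM", 100), ("NM", 50), ("NM", 200)]  # the M item declared as NM
saving = partition_cost(original, 16, 4) - partition_cost(declared, 16, 4)
```

Here declaring the 50-byte matching item as NM removes two meta-data entries at the price of storing 50 more bytes, so under this model the saving is 2FE minus the item size.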

7 Problem Statement
Problem: find the optimal partition, that is, the one minimizing cost. Example cases: if size < 2FE; no gain; if sizes < 4FE; if size < FE. Minimizing cost saves space AND time and I/O.

8 1’s cannot be changed. Solve separately for every run of 0’s. Exhaustive search: 2^n possibilities. Dynamic programming: merge optimal solutions for sub-ranges.

9 Proposed optimal solution
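Two tables, with indices running over 1, 2, …, n: the cost of the optimal solution for items i to j, and the optimal partition for items i to j. The slide marks the entries that are initialized and the entry holding the solution. Fill by diagonals.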

10 Recursion
Either stay with 0…0, or there is a 1 in some position k, which splits the range i … j into i … k − 1, k, and k + 1 … j.

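11 (Figure: the range i … j split at position k into i … k − 1, k, and k + 1 … j, with boundary labels L and R, the values −1 and 1, and the function f(L,R).)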

12 Each item depends on those to its left in the same row and below it in the same column
(Figure: table with rows i and columns j, each running over 1, 2, …, n.)
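The idea of merging optimal solutions for sub-ranges in a table filled by diagonals can be sketched on a simplified model. The sketch below is not the recursion from the slides, whose formulas did not survive in this transcript. It assumes that each M item either stays M or is declared NM, that adjacent NM items merge into one entry, and that a range is split at a matching item that is kept. The `(kind, size)` tuples and the name `optimal_partition` are hypothetical.

```python
from typing import List, Optional, Tuple

Item = Tuple[str, int]  # ("M" or "NM", size in bytes); representation is an assumption


def optimal_partition(items: List[Item], E: int, F: int) -> Tuple[int, List[str]]:
    """Return (minimal cost, resulting kinds) under the simplified model."""
    n = len(items)
    if n == 0:
        return 0, []
    entry = F * E  # cost of one meta-data entry, in bytes of stored data
    prefix = [0]
    for _, size in items:
        prefix.append(prefix[-1] + size)

    # cost[i][j]: optimal cost for items i..j (inclusive, 0-based).
    # choice[i][j]: index of a kept M item, or None if i..j becomes one NM item.
    cost = [[0] * n for _ in range(n)]
    choice: List[List[Optional[int]]] = [[None] * n for _ in range(n)]

    def sub(i: int, j: int) -> int:
        return cost[i][j] if i <= j else 0  # an empty range costs nothing

    for length in range(1, n + 1):  # fill by diagonals
        for i in range(n - length + 1):
            j = i + length - 1
            best = entry + prefix[j + 1] - prefix[i]  # whole range as one NM item
            best_k = None
            for k in range(i, j + 1):
                if items[k][0] != "M":
                    continue  # only a matching item can be kept as M
                candidate = sub(i, k - 1) + entry + sub(k + 1, j)
                if candidate < best:
                    best, best_k = candidate, k
            cost[i][j], choice[i][j] = best, best_k

    kinds: List[str] = []

    def build(i: int, j: int) -> None:
        """Rebuild the partition recursively from the stored split indices."""
        if i > j:
            return
        k = choice[i][j]
        if k is None:
            kinds.extend(["NM"] * (j - i + 1))
        else:
            build(i, k - 1)
            kinds.append("M")
            build(k + 1, j)

    build(0, n - 1)
    return cost[0][n - 1], kinds
```

Storing only the split index per table entry and rebuilding the partition recursively keeps the table at one integer per entry, at the price of a reconstruction pass at the end. The three nested loops make this sketch cubic in the number of items.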

13 Algorithm Initialization:

14 Main loop

15 Complexity:

16 Improvements
Reducing the time complexity: consider the maximal possible gain. Sufficient condition to remain 0:

17 Improvements
Reducing the space complexity: instead of storing a full partition in every table entry, keep for each entry only the index (OK) of the optimal k, and build the partition string recursively.

18 Thank you!

