Optimal Partitioning of Data Chunks in Deduplication Systems

M. Hirsch, A. Ish-Shalom, S.T. Klein (Israel)

Outline
Background and motivation
Definitions
Problem statement
Proposed optimal solution
Improvements

Background and motivation
Compression by deduplication:
Partition the data into chunks (4K – 16M)
Apply a hash function to each chunk
Store the fingerprints in a hash table / B-tree

Background and motivation: large chunks
Chunk size dilemma:
Small chunks: more overhead
Large chunks: less deduplication
Similarity-based deduplication uses large chunks
Zoom in on how to store large chunks

Definitions
Data chunks: a sequence of matching and non-matching parts
Matching (M): old data, stored as a pointer (offset, length)
Non-matching (NM): new data, stored explicitly
The partition is given but may be altered

Cost:
M: one meta-data entry (E bytes)
NM: one meta-data entry + size(new data)
Conversion:
1 byte of meta-data = F bytes of stored data
Flexibility:
An M part cannot be merged, but can be declared as NM
NM parts may be merged, but cannot be changed to M
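The cost rules above can be illustrated with a short Python sketch. It is only one reading of the slide: the function name `partition_cost` and the `(kind, length)` representation are assumptions made for illustration, the cost is expressed in stored-data bytes (so one meta-data entry counts as F * E), and adjacent NM parts are assumed to share a single meta-data entry once merged.

```python
def partition_cost(parts, E, F):
    """Cost of a partition, in stored-data byte equivalents.

    parts: list of (kind, length) pairs, kind being "M" or "NM".
    E: size of one meta-data entry in bytes.
    F: number of stored-data bytes that one meta-data byte is worth.
    Assumption: consecutive NM parts are merged and share one entry.
    """
    entry = F * E
    cost = 0
    prev_nm = False
    for kind, length in parts:
        if kind == "M":
            cost += entry          # pointer (offset, length) only
            prev_nm = False
        else:
            if not prev_nm:
                cost += entry      # a new NM run opens one entry
            cost += length         # new data is stored explicitly
            prev_nm = True
    return cost
```

For example, with E = 8 and F = 2, declaring a 10-byte M part that lies between two NM parts as NM lowers the cost from 198 to 176: two entries (32 bytes) are saved at the price of 10 explicit bytes.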

Problem statement
Problem: find the optimal partition, i.e. the one minimizing the cost
Saving space AND time and I/O
Cases shown on the slide (figures not recovered): size < 2FE; no gain; sizes < 4FE; size < FE

1's cannot be changed
Solve separately for every run of n 0's
Exhaustive search: 2^n possibilities
Merge optimal solutions for sub-ranges: dynamic programming
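As a baseline for the dynamic programming approach, the exhaustive search can be sketched in Python. The sketch assumes that the partition is encoded as a bit sequence with 1 for an NM part and 0 for an M part, that one meta-data entry costs F * E data bytes, and that adjacent NM parts share one entry. The names `partition_cost` and `exhaustive_best` are hypothetical.

```python
from itertools import product


def partition_cost(kinds, sizes, E, F):
    """Cost in data-byte equivalents; kinds[i] is 1 (NM) or 0 (M)."""
    entry = F * E
    cost = 0
    prev = 0
    for kind, size in zip(kinds, sizes):
        if kind == 0:
            cost += entry            # one entry per M part
        else:
            if prev == 0:
                cost += entry        # a new NM run opens one entry
            cost += size             # NM data is stored explicitly
        prev = kind
    return cost


def exhaustive_best(kinds, sizes, E, F):
    """Try every way of declaring 0's as 1's; the 1's stay fixed."""
    zeros = [i for i, kind in enumerate(kinds) if kind == 0]
    best = (partition_cost(kinds, sizes, E, F), tuple(kinds))
    for flips in product((0, 1), repeat=len(zeros)):   # 2^n candidates
        trial = list(kinds)
        for pos, bit in zip(zeros, flips):
            trial[pos] = bit
        cost = partition_cost(trial, sizes, E, F)
        if cost < best[0]:
            best = (cost, tuple(trial))
    return best
```

The loop visits 2^n candidates for n zeros, which is only practical for short runs and is mainly useful for checking a faster method on small inputs.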

Proposed optimal solution
Table entries for 1 ≤ i ≤ j ≤ n:
Cost of the optimal solution for items i to j
Optimal partition for items i to j
(Figure: the n × n table, with the cells holding the solution and the initialization marked)
Fill by diagonals

Recursion:
Either stay with 0…0
Or there is a 1 in some position k, which splits the range i to j into i to k - 1, k, and k + 1 to j

(Figure: the range i to j split at position k, with its neighbors L and R and the term f(L,R))

Each item depends on those to its left in the same row and below it in the same column
(Figure: the n × n table with rows i and columns j)
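The table filling can be sketched in Python as follows. This is an illustration under an assumed cost model, not necessarily the authors' exact recurrence: one meta-data entry costs F * E data bytes, adjacent NM parts share one entry, and the run of M parts is bordered by NM parts unless `left_nm` or `right_nm` says otherwise. The function name and parameters are hypothetical. The returned value is the cost contributed by the run, net of entries saved by merging with NM neighbors, so it can be negative.

```python
def optimal_run_cost(sizes, E, F, left_nm=True, right_nm=True):
    """Minimal net cost of a run of M parts with the given sizes."""
    n = len(sizes)
    entry = F * E
    s = [0] + list(sizes)                      # 1-based sizes
    # C[i][j] holds the result for items i..j; C[i][i-1] is an empty range.
    C = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # An empty range means items i-1 and i are adjacent. Inside the
        # run both are 1's; at the ends the outer border decides.
        left = left_nm if i == 1 else True
        right = right_nm if i == n + 1 else True
        C[i][i - 1] = -entry if (left and right) else 0   # merge saves one entry
    for length in range(1, n + 1):             # fill by diagonals
        for i in range(1, n - length + 2):
            j = i + length - 1
            best = length * entry              # stay with 0...0
            for k in range(i, j + 1):          # or a 1 in position k
                best = min(best, C[i][k - 1] + s[k] + entry + C[k + 1][j])
            C[i][j] = best
    return C[1][n]
```

Each entry C[i][j] reads only entries to its left in row i and below it in column j, which is why the diagonals can be filled in order of increasing range length. The three nested loops give cubic time in the length of the run for this sketch.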

Algorithm Initialization:

Main loop

Complexity:

Improvements
Reducing the time complexity:
Consider the maximal possible gain
Sufficient condition to remain 0:
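One way to read the "maximal possible gain" idea is sketched below in Python. It rests on an assumed cost model (one meta-data entry costs F * E data bytes, adjacent NM parts share one entry) and is not taken from the slide's own formula. Under that model, declaring an M part as NM saves at most two entries, when it joins NM parts on both sides, so a part of size at least 2 * F * E can never profit from conversion. The name `forced_zero` is hypothetical.

```python
def forced_zero(sizes, E, F):
    """Flag the M parts that can safely remain 0 (matching).

    Assumption: converting a part of a given size to NM saves at most
    2 * F * E (two meta-data entries) and costs that size in explicit data.
    """
    max_gain = 2 * F * E
    return [size >= max_gain for size in sizes]
```

Parts flagged this way need not be considered as candidate positions k, which shrinks the ranges the main loop has to examine.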

Improvements
Reducing the space complexity:
The table of optimal partitions has n^2 entries, each of size up to n
Reduce by keeping, for each entry, only the index of the optimal k (O(n^2) space)
Build the partition string recursively
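The space reduction can be sketched in Python by storing, next to each cost, only the index of the best split position and rebuilding the partition string afterwards. As before, this is an illustration under an assumed cost model (one meta-data entry costs F * E data bytes, adjacent NM parts share one entry, borders are NM unless stated otherwise), and all names are hypothetical.

```python
def optimal_run_partition(sizes, E, F, left_nm=True, right_nm=True):
    """Return (net cost, bit string) for a run of M parts."""
    n = len(sizes)
    entry = F * E
    s = [0] + list(sizes)                       # 1-based sizes
    C = [[0] * (n + 2) for _ in range(n + 2)]   # costs
    K = [[0] * (n + 2) for _ in range(n + 2)]   # best k; 0 means "stay 0...0"
    for i in range(1, n + 2):
        left = left_nm if i == 1 else True
        right = right_nm if i == n + 1 else True
        C[i][i - 1] = -entry if (left and right) else 0
    for length in range(1, n + 1):              # fill by diagonals
        for i in range(1, n - length + 2):
            j = i + length - 1
            C[i][j] = length * entry
            for k in range(i, j + 1):
                cand = C[i][k - 1] + s[k] + entry + C[k + 1][j]
                if cand < C[i][j]:
                    C[i][j], K[i][j] = cand, k

    def build(i, j):
        """Rebuild the partition string for items i..j from the K table."""
        if i > j:
            return ""
        k = K[i][j]
        if k == 0:
            return "0" * (j - i + 1)
        return build(i, k - 1) + "1" + build(k + 1, j)

    return C[1][n], build(1, n)
```

Only one integer is kept per table entry instead of a whole partition string, and the string for items 1 to n is produced once at the end. For sizes 10, 40, 10 with E = 8 and F = 2, the sketch keeps the middle part as M and declares the two outer parts as NM.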

Thank you!
