Controlling the Chunk Size in Deduplication Systems

Presentation transcript:

Controlling the Chunk Size in Deduplication Systems. M. Hirsch, S.T. Klein, D. Shapira, Y. Toaff. Israel.

Background and motivation: Compression; deduplication. Partition into chunks (4K – 16M). Apply hash function. Store fingerprints (hash / B-tree).
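The three steps on this slide (partition, hash, store fingerprints) can be sketched as follows. This is only an illustration under stated assumptions: fixed-size 4K chunks, SHA-1 as the fingerprint function, and a Python dict standing in for the hash table or B-tree. The names `store_repository` and `restore` are hypothetical and do not come from the talk.

```python
import hashlib

def store_repository(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy per distinct fingerprint."""
    store = {}    # fingerprint -> chunk bytes (stands in for the hash table / B-tree)
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha1(chunk).digest()
        store.setdefault(fingerprint, chunk)  # a duplicate chunk is not stored again
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

For data made of three identical 4K blocks followed by one different block, the store holds two chunks while the recipe lists four fingerprints.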

Background and motivation: Algorithm for storing a repository. Signature size: k bits. Hash table: 2^k entries. [Figure: repository chunks and their entries in the hash table]

Background and motivation: Chunk size dilemma. Small chunks: more overhead. Large chunks: less deduplication. Fixed-size chunks: easier. Variable-size chunks: more robust.

Variable length chunks: Seed; hash function. Expected size of chunk: [formula not preserved]. BUT great variability. Min and max sizes: 1K, 8K.
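A minimal sketch of variable-length chunking with hard minimum and maximum sizes is given here to make the slide concrete. It assumes a boundary is declared wherever the low `k` bits of a hash of the last `window` bytes are all zero, which for a uniform hash happens with probability 2^-k per position. The function name, the window length, the choice of BLAKE2b, and `k = 11` are illustrative assumptions; only the 1K and 8K limits come from the slide.

```python
import hashlib

def chunk_boundaries(data: bytes, window: int = 16, k: int = 11,
                     min_size: int = 1024, max_size: int = 8192):
    """Return the end offsets of content-defined chunks of data."""
    mask = (1 << k) - 1
    cuts = []
    start = 0
    for i in range(len(data)):
        size = i + 1 - start
        if size < min_size:
            continue  # never cut before the minimum size is reached
        seed = data[max(0, i + 1 - window):i + 1]  # the bytes the hash looks at
        h = int.from_bytes(hashlib.blake2b(seed, digest_size=8).digest(), "big")
        # Cut on a hash match, or force an artificial cut at the maximum size.
        if (h & mask) == 0 or size >= max_size:
            cuts.append(i + 1)
            start = i + 1
    if start < len(data):
        cuts.append(len(data))  # trailing chunk, possibly shorter than min_size
    return cuts
```

The forced cut at `max_size` is the kind of artificial cutoff point that depends on position rather than content.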

Variable length chunks: Problem of artificial cutoff points: not robust; not reproducible; inconvenient distribution.

New segmentation procedure: Use sequence of functions and constants. 1) All functions are easily calculable. 2) There exists an increasing sequence of probabilities such that [formula not preserved]. 3) Conditions are inclusive.
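One way to satisfy the three properties above is sketched below. It is an assumption for illustration, not necessarily the construction used in the talk: condition i requires the low k_i bits of the hash to equal the low k_i bits of a constant, and k_i shrinks as the chunk grows. Each test is a mask and a comparison (easily calculable), the match probability 2^-k_i increases as k_i decreases, and any hash that passes a stricter condition also passes every looser one (inclusive). The names `bits_for_size`, `is_cut`, and the values in `SCHEDULE` are hypothetical.

```python
# (chunk size threshold, number of low hash bits that must match); hypothetical values
SCHEDULE = [(1024, 13), (2048, 12), (4096, 11), (6144, 10)]

def bits_for_size(size: int, schedule=SCHEDULE):
    """Return how many hash bits must match at this chunk size, or None if too small."""
    bits = None
    for threshold, k in schedule:
        if size >= threshold:
            bits = k  # later entries have fewer bits, so the condition gets looser
    return bits

def is_cut(h: int, size: int, constant: int, schedule=SCHEDULE) -> bool:
    """Decide whether hash value h marks a chunk boundary at the current chunk size."""
    k = bits_for_size(size, schedule)
    if k is None:
        return False  # below the first threshold no cut is allowed
    mask = (1 << k) - 1
    return (h & mask) == (constant & mask)
```

Because the conditions are nested, a boundary found under the strict condition stays a boundary when the chunk is longer, so no boundary depends on an arbitrary size limit alone.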

New segmentation procedure Small inserts and deletes

New segmentation procedure To get set

New segmentation procedure: P large random prime; C large constant.
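The transcript does not preserve how P and C enter the formulas. A common way to use a large prime modulus in chunking is a Karp-Rabin style rolling hash, sketched below under that assumption. The Mersenne prime 2^61 - 1 stands in for a large random prime so the example is reproducible; `BASE`, `rolling_hashes`, and the window length are hypothetical choices.

```python
P = (1 << 61) - 1  # a known Mersenne prime, standing in for a large random prime
BASE = 256         # each byte is treated as one base-256 digit

def rolling_hashes(data: bytes, window: int):
    """Yield the hash modulo P of every length-`window` substring of data."""
    if len(data) < window:
        return
    top = pow(BASE, window - 1, P)  # weight of the byte that leaves the window
    h = 0
    for byte in data[:window]:
        h = (h * BASE + byte) % P
    yield h
    for i in range(window, len(data)):
        # Drop the oldest byte, shift by one digit, and append the new byte.
        h = ((h - data[i - window] * top) * BASE + data[i]) % P
        yield h
```

Each update costs a constant number of arithmetic operations, so the hash can be evaluated at every byte position of the repository. A cut test would then compare some bits of each hash value with the corresponding bits of a constant such as C.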

Example distribution

Cumulative probabilities Individual probabilities

Experimental results

Method                Number   Avg size   Std dev
Constant              15.7     2127       2347
Constant              5.5      2502       2568
Variable probability  15.8     2176       1014
Variable probability  5.9      2273       1081

Thank you!

Using fractional bits