Witold Litwin (Université Paris Dauphine), Darrell Long (University of California, Santa Cruz), Thomas Schwarz (Universidad Católica del Uruguay). Combining Chunk.

Witold Litwin (Université Paris Dauphine), Darrell Long (University of California, Santa Cruz), Thomas Schwarz (Universidad Católica del Uruguay). Combining Chunk Boundary Calculations and Signature Calculation for Deduplication. 10th International Information and Telecommunication Technologies Conference (I2TS), Dec. 2011, Florianopolis, Brazil. IEEE Latin America Transactions, vol. 10(1), 2011.

Deduplication: how not to store the same data twice. The system breaks data into chunks, calculates a signature for each chunk, and uses the signature to check whether the chunk is already stored. If it is, the chunk is replaced by a pointer to the copy stored elsewhere, and a file manifest allows the file to be reconstructed. Read performance is sometimes horrible, because files need to be reconstructed from their chunks. Deduplication is well suited to write-heavy loads (backup, archival) and is used for web-based storage.

Deduplication is part of the storage solutions of all major providers. It achieves impressive compression rates for backup workloads: up to 20:1 (Zhu, Li, and Patterson, 2008) and up to 30:1 (Mandagere, Zhou, Smith, and Uttamchandani, 2008).

Deduplication can be applied to streams (backup workloads) or to files (archival). For files, a hash of the complete file is used to discover identical files. The system then breaks incoming data into chunks (a typical size is 4 KB or 8 KB), calculates the signature (hash) of each chunk, and looks up the chunk signatures in a database. It creates a file manifest that describes where the chunks are, stores the new chunks and the manifest, and updates the information on chunk signatures.
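The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DedupStore` class, its in-memory dictionaries, the fixed 4 KB chunk size, and the choice of SHA-1 are all hypothetical stand-ins, not the authors' system.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical sketch)."""

    def __init__(self):
        self.chunks = {}     # chunk signature -> chunk bytes
        self.manifests = {}  # file name -> ordered list of chunk signatures

    def put(self, name, data):
        """Chunk the data, store only unseen chunks, and record a manifest."""
        manifest = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            sig = hashlib.sha1(chunk).digest()
            # Trust the signature: keep the chunk only if the signature is new.
            self.chunks.setdefault(sig, chunk)
            manifest.append(sig)
        self.manifests[name] = manifest

    def get(self, name):
        """Rebuild a file by following its manifest."""
        return b"".join(self.chunks[sig] for sig in self.manifests[name])


store = DedupStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
store.put("first", block_a + block_b)
store.put("second", block_b + block_a + block_b)
```

After both calls the store holds only two distinct chunks, while the manifest of the second file lists three signatures. Reading a file back requires one lookup per manifest entry, which is the reconstruction cost mentioned earlier.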

Deduplication (data flow diagram): incoming stream or file → Chunker → chunks → Calculate Signatures → File Print (file signature, chunk sig 1, chunk sig 2, …) → Look up chunks and create file manifest → File Manifest (chunk 1 is at …, chunk 2 is at …, …).

Deduplication suffers from two bottlenecks. The first is an I/O bottleneck: chunk signatures must be looked up in a database. Proposed remedies include Bloom filters, adaptive caching, and extreme binning. The second is that the file must be scanned twice: once to calculate chunk boundaries and once to calculate chunk signatures. Deduplication also rests on a leap of faith: there is no time to verify chunk identity byte by byte, so the system relies on identical chunk signatures to identify identical chunks. The advent of MD5 and SHA-1 convinced people that the risk of falsely identifying chunks is acceptable.

Our proposal: use the calculation of chunk boundaries to strengthen the collision resilience of the chunk signatures.

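Chunk Boundary Calculation: why not fixed-size chunks? Because small changes to the file cannot be found: an insertion or deletion shifts the content of every later chunk. (Figure: a fixed-size sliding window moves forward from the end of the previous chunk. While the window signature is nonzero, sig ≠ 0, the window keeps moving; a chunk boundary is placed where sig = 0.)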
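The weakness of fixed-size chunks is easy to reproduce. The sketch below is an illustration only: the 64-byte chunk size, the random test data, and the helper names `fixed_chunks` and `signatures` are assumptions made for the demonstration.

```python
import hashlib
import random


def fixed_chunks(data, size=64):
    """Split data into consecutive chunks of a fixed size (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signatures(chunks):
    """Set of SHA-1 signatures of the given chunks."""
    return {hashlib.sha1(c).digest() for c in chunks}


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(64 * 32))  # 32 full chunks
edited = b"!" + original                                      # one byte inserted at the front

old_sigs = signatures(fixed_chunks(original))
new_sigs = signatures(fixed_chunks(edited))
shared = len(old_sigs & new_sigs)  # chunks the deduplicator would recognize
```

Because the single inserted byte shifts every later chunk by one position, none of the chunks of the edited file matches a chunk of the original, even though all but one byte of the data is unchanged.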

Chunk Boundary Calculation: with context-defined chunk boundaries, a small, local change in all likelihood affects only one chunk. If the change falls on a chunk boundary, it affects two chunks. The deduplication ratio is much higher than with fixed-size chunks.

Chunk Boundary Calculation: we need to calculate the signature of a sliding window, and can use “rolling hashes” to do so. When the window moves one letter to the right, the signature of the new window can be calculated from: 1. the signature of the old sliding window; 2. the character on the left (leaving the sliding window); 3. the character on the right (entering the sliding window).
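A rolling hash with this three-input update can be sketched as follows. Note the substitutions: this uses a plain polynomial hash modulo a prime, not Rabin fingerprints or algebraic signatures, and the window length, base, modulus, and the rule "boundary where the low 6 bits are zero" are assumed values chosen to stand in for the sig = 0 test.

```python
import random

WINDOW = 16                       # sliding-window length in bytes (assumed)
BASE = 257                        # polynomial base (assumed)
MOD = (1 << 31) - 1               # prime modulus (assumed)
MASK = (1 << 6) - 1               # boundary when the low 6 bits of the hash are zero
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def window_hash(window):
    """Hash of a whole window, computed from scratch."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def roll(h, outgoing, incoming):
    """New window hash from the old hash, the byte leaving, and the byte entering."""
    return ((h - outgoing * TOP) * BASE + incoming) % MOD


def chunk_boundaries(data):
    """End offsets of chunks; the end of the data always closes the last chunk."""
    cuts = []
    if len(data) >= WINDOW:
        h = window_hash(data[:WINDOW])
        for end in range(WINDOW, len(data) + 1):
            if end > WINDOW:
                h = roll(h, data[end - WINDOW - 1], data[end - 1])
            if h & MASK == 0:
                cuts.append(end)
    if data and (not cuts or cuts[-1] != len(data)):
        cuts.append(len(data))
    return cuts


def chunks(data):
    """Split data at the boundaries found by chunk_boundaries."""
    start, out = 0, []
    for end in chunk_boundaries(data):
        out.append(data[start:end])
        start = end
    return out


rng = random.Random(1)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = original[:2000] + b"!" + original[2000:]  # one byte inserted mid-file
old, new = set(chunks(original)), set(chunks(edited))
unchanged = len(old & new)  # chunks of the original found again in the edited file
```

Since each boundary depends only on the 16 bytes in the window, boundaries far from the inserted byte reappear one position later, and the chunks between them are found again. A production chunker would also enforce minimum and maximum chunk sizes, which this sketch omits.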

Chunk Boundary Calculation: one can use Rabin fingerprints (O. Rabin, 1989) or algebraic signatures (Litwin, Schwarz, 2004). We use the latter, because we invented them and because they are marginally better than Rabin fingerprints for our purpose. Both allow cumulative calculation of the signature of the chunk seen so far.
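To make the idea of a cumulative signature concrete, here is a minimal sketch of a one-symbol algebraic signature over GF(2^8), taken here as the XOR-sum of data[i]·α^i. The indexing from 0, the field polynomial 0x11D, the element α = 2, and all function names are assumptions for illustration; the formulation used by the authors may differ in these details.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a primitive polynomial (assumed choice)
ALPHA = 0x02  # primitive element for this polynomial (assumed choice)


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


# Multiplicative inverse of ALPHA, found by exhaustive search over the field.
ALPHA_INV = next(x for x in range(1, 256) if gf_mul(ALPHA, x) == 1)


def gf_pow(a, n):
    """a raised to the n-th power in GF(2^8)."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def signature(data):
    """Direct evaluation: XOR-sum of data[i] * ALPHA**i."""
    sig, power = 0, 1
    for byte in data:
        sig ^= gf_mul(byte, power)
        power = gf_mul(power, ALPHA)
    return sig


def cumulative_signatures(data):
    """Signature of every prefix of data, one field multiplication per new byte."""
    sig, power, out = 0, 1, []
    for byte in data:
        sig ^= gf_mul(byte, power)
        power = gf_mul(power, ALPHA)
        out.append(sig)
    return out


def slide(sig, outgoing, incoming, window):
    """Signature of the window shifted right by one byte."""
    shifted = gf_mul(sig ^ outgoing, ALPHA_INV)
    return shifted ^ gf_mul(incoming, gf_pow(ALPHA, window - 1))


data = bytes(range(7, 107))
prefix_sigs = cumulative_signatures(data)
```

The same field arithmetic serves both purposes: `slide` updates the window signature used to find boundaries, while `cumulative_signatures` carries along a signature of the whole chunk seen so far at the cost of one multiplication per byte.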

Adding to the Chunk Signature: the chunk signature is MD5 or SHA-1. There are attacks using artificial collisions, but they are theoretical so far. There is a small but positive collision probability, that is, a chance that two different chunks share the same signature value. Deduplication then destroys or alters the later file. To keep deduplication acceptable, the resulting data loss must be orders of magnitude smaller than losses from other sources.

Adding to the Chunk Signature

Conditions: we want x nines of assurance against having any collision in a storage system of size N. The number of nines is given by the other failure sources. The figure uses x = 6. Conclusion: adding one byte to the chunk signature increases the possible size of the data set by […] (for big enough x). Example: MD5 has 16 bytes. At six nines, the maximum number of chunks is […]. With two more bytes, it is […]. The possible size changes from hundreds of petabytes to tens of exabytes.
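The effect of extra signature bytes can be estimated with the standard birthday approximation, under which n chunks with b-bit signatures collide with probability of roughly n²/(2·2^b). The sketch below assumes that model and a perfectly flat signature; the function name `max_chunks` is hypothetical, and the model is not necessarily the one behind the figure on the slide.

```python
import math


def max_chunks(signature_bytes, nines):
    """Largest n with n*n / (2 * 2**bits) <= 10**-nines (birthday approximation)."""
    bits = 8 * signature_bytes
    return math.sqrt(2.0 * 10.0 ** -nines * 2.0 ** bits)


base = max_chunks(16, 6)      # a 16-byte signature such as MD5, six nines
one_more = max_chunks(17, 6)  # one extra signature byte
two_more = max_chunks(18, 6)  # two extra signature bytes

ratio_one = one_more / base
ratio_two = two_more / base
```

Under this approximation the number of nines cancels out of the ratios: each extra byte multiplies the admissible number of chunks by √256 = 16, and two extra bytes by 256. A growth factor of that order is compatible with the move from hundreds of petabytes to tens of exabytes.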

Flatness: a signature is flat if the probability that a text has a given signature is constant. Measuring flatness is difficult. No results are known for MD5 or SHA-1, though there is support for almost perfect flatness. Algebraic signatures are perfectly flat for completely random input and are very flat in the experiment undertaken on short words (taken from a password list).

Flatness of Algebraic Signatures

Conclusions: chunk boundary calculations can be reused to strengthen the collision resistance of chunk signatures. This causes no additional calculation costs.