Deduplication in Storage Systems

Deduplication in Storage Systems. Joseph Fernandes, Ewen Pinto, Srinivas Billava

Who are we? Joseph Fernandes (Senior Engineer, Red Hat Storage); Ewen Pinto (VI Sem MCA, NMAMIT, Nitte); Srinivas Billava (VI Sem MCA, NMAMIT, Nitte)

Agenda: What is dedupe; Why dedupe; Types of dedupe; What is deduped; Where it is deduped; When it is deduped; Challenges in dedupe; Current work

What is Deduplication? An intelligent way of storing data: redundant copies are removed and only one instance is stored.

What is Deduplication? Data units are identified by a hash index, and redundant data units are replaced by pointers. The hash algorithm should have minimal collisions. Search should be precise and fast. The system should have a rich metadata filter (modification frequency, I/O sizes, etc.), should deal with the distributed nature of data, and should do load balancing.
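The first two points (a hash index, and pointers in place of redundant units) can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the class name DedupStore, its methods, and the choice of SHA-256 as the low-collision hash are all assumptions.

```python
import hashlib


class DedupStore:
    """Toy store that keeps one physical copy per unique data unit.
    Hypothetical class for illustration; SHA-256 is an assumed hash choice."""

    def __init__(self):
        self.units = {}     # hash index: digest -> stored bytes
        self.refcount = {}  # digest -> number of pointers handed out

    def put(self, data: bytes) -> str:
        """Store a data unit and return its pointer (the hex digest)."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.units:
            self.units[digest] = data  # only the first copy is stored
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def get(self, pointer: str) -> bytes:
        """Follow a pointer back to the single stored instance."""
        return self.units[pointer]


store = DedupStore()
p1 = store.put(b"hello world")
p2 = store.put(b"hello world")  # redundant copy: same pointer, nothing new stored
p3 = store.put(b"other data")
```

Both writes of the same bytes return the same pointer, so the store holds two units for three writes.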

Why dedupe? Reduces Total Cost of Ownership (TCO) for storage and network. Used in backup/archive, disaster recovery, and local/remote replication.

What is deduped? File level (single instancing). File 1 is hashed (HASH 1) and stored once; File 2, which has the same hash, is replaced by a pointer to File 1.

What is deduped? Block level. (Figure: File 1 and File 2 divided into blocks; block B1 identified by HASH 1.)
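A minimal Python sketch of block-level dedupe, assuming a tiny fixed block size and SHA-256 as the block hash. The names BLOCK, blocks, recipes, store_file and read_file are hypothetical and do not come from the talk. Each file becomes an ordered list of pointers into one shared block index.

```python
import hashlib

BLOCK = 4  # block size in bytes, tiny for illustration (assumed value)

blocks = {}   # hash index shared by all files: digest -> block bytes
recipes = {}  # file name -> ordered list of digests (the pointers)


def store_file(name: str, data: bytes) -> None:
    """Split data into blocks and keep only blocks not seen before."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        d = hashlib.sha256(block).hexdigest()
        blocks.setdefault(d, block)  # first copy wins, later copies are dropped
        recipe.append(d)
    recipes[name] = recipe


def read_file(name: str) -> bytes:
    """Rebuild a file by following its pointers in order."""
    return b"".join(blocks[d] for d in recipes[name])


store_file("file1", b"AAAABBBBCCCC")
store_file("file2", b"AAAAXXXXCCCC")  # shares two of its three blocks with file1
```

Six blocks are written in total, but only four unique blocks are kept, and both files can still be read back in full.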

Fixed Block Chunking: The file is divided into equal-length blocks. Pros: faster! Cons: not space efficient!

Fixed Block Chunking (Figure: a file divided into fixed-length blocks.)
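One common reason fixed-length blocks are not space efficient is that a single inserted byte shifts every later block boundary. The Python sketch below illustrates that effect; the function name fixed_chunks and the 4-byte block size are assumptions made for the example.

```python
def fixed_chunks(data: bytes, block_size: int = 4):
    """Split data into equal-length blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


original = b"AAAABBBBCCCCDDDD"
shifted = b"X" + original  # same content with one byte inserted at the front

a = fixed_chunks(original)  # [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
b = fixed_chunks(shifted)   # [b"XAAA", b"ABBB", b"BCCC", b"CDDD", b"D"]
shared = set(a) & set(b)    # blocks that could be deduplicated across the two
```

Here shared is empty: although the two inputs differ by one byte, no block of the shifted copy matches a block of the original.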

Variable Block Chunking: The file is chunked into variable-length blocks. Block size is determined by content. Rolling hash algorithm: Rabin-Karp.
RHash = p^(n-1) * a[0] + p^(n-2) * a[1] + p^(n-3) * a[2] + ... + p * a[n-2] + a[n-1]
If (RHash & fingerprint) == 0 { Chunk! }
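The rolling hash and the cut rule above can be sketched in Python as follows. This is a sketch under stated assumptions, not the speakers' implementation: the base P, the modulus MOD, the window length n, and the bit mask (the value the slide calls the fingerprint) are illustrative, and the names poly_hash and cdc_chunks are hypothetical.

```python
P = 257              # base p of the polynomial (assumed value)
MOD = (1 << 61) - 1  # prime modulus that keeps the hash bounded (assumed value)


def poly_hash(window: bytes) -> int:
    """Direct evaluation of p^(n-1)*a[0] + ... + p*a[n-2] + a[n-1], mod MOD."""
    h = 0
    for byte in window:
        h = (h * P + byte) % MOD
    return h


def cdc_chunks(data: bytes, n: int = 8, mask: int = 0x1F):
    """Cut after every position where the hash of the last n bytes
    satisfies (hash & mask) == 0."""
    if len(data) < n:
        return [data] if data else []
    top = pow(P, n - 1, MOD)  # weight of the byte that leaves the window
    chunks, start = [], 0
    h = poly_hash(data[:n])   # hash of the first full window
    for i in range(n - 1, len(data)):
        if i >= n:
            # O(1) slide: drop data[i - n], then append data[i]
            h = ((h - data[i - n] * top) * P + data[i]) % MOD
        if (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


data = bytes(range(256)) * 4
orig = cdc_chunks(data)
shifted = cdc_chunks(b"X" + data)  # one byte inserted at the front
```

Because each cut depends only on the n bytes before it, inserting a byte at the front can change only the first chunk; every later chunk of orig reappears unchanged in shifted. A practical chunker would normally also enforce minimum and maximum chunk sizes, which this sketch leaves out.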

Variable Block Chunking (Figure: a file divided into variable-length blocks.)

Variable Block Chunking: Pros: space efficiency! Cons: slower!

Where is it deduped? Client side. Pros: less network traffic. Cons: heavier clients (CPU/memory, metadata storage).

Where is it deduped? Server side. Pros: lighter clients. Cons: more network traffic.
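The network trade-off between the two placements can be illustrated with a toy exchange in Python. Everything here is an assumption made for the example: the Server class, its missing and upload methods, and the two backup functions are hypothetical and do not describe any real protocol.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class Server:
    """Hypothetical backup server holding one copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the server does not hold yet."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, chunk: bytes) -> None:
        self.chunks[digest(chunk)] = chunk


def client_side_backup(server: Server, chunks) -> int:
    """Client hashes locally and sends only chunks the server lacks.
    Returns the number of payload bytes sent."""
    by_digest = {digest(c): c for c in chunks}
    sent = 0
    for d in server.missing(list(by_digest)):
        server.upload(by_digest[d])
        sent += len(by_digest[d])
    return sent


def server_side_backup(server: Server, chunks) -> int:
    """Client ships every chunk; the server hashes and deduplicates."""
    sent = 0
    for c in chunks:
        server.upload(c)
        sent += len(c)
    return sent


chunks = [b"aaaa", b"bbbb", b"aaaa", b"cccc"]
s1, s2 = Server(), Server()
client_sent = client_side_backup(s1, chunks)   # 12 bytes: three unique chunks
server_sent = server_side_backup(s2, chunks)   # 16 bytes: all four chunks
second_run = client_side_backup(s1, chunks)    # 0 bytes: server already has all
```

Both servers end up storing the same three chunks; the difference is who pays for hashing and how many bytes cross the network.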

When is it deduped? Inline dedupe; offline dedupe.
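The slide only names the two modes. Under the usual reading (inline dedupe hashes data on the write path, offline dedupe lands data first and folds it in with a later pass), the difference can be sketched in Python as below. The Volume class and its method names are hypothetical, and SHA-256 is an assumed hash choice.

```python
import hashlib


class Volume:
    """Hypothetical volume that supports both dedupe timings."""

    def __init__(self):
        self.layout = []  # logical layout: one digest per deduplicated block
        self.store = {}   # digest -> unique block bytes
        self.raw = []     # blocks written but not yet deduplicated

    def write_inline(self, block: bytes) -> None:
        """Inline: hash on the write path, so duplicates are never stored."""
        d = hashlib.sha256(block).hexdigest()
        self.store.setdefault(d, block)
        self.layout.append(d)

    def write_raw(self, block: bytes) -> None:
        """Offline: land the block as-is and dedupe it later."""
        self.raw.append(block)

    def dedupe_offline(self) -> None:
        """Background pass that folds raw blocks into the deduplicated store."""
        for block in self.raw:
            self.write_inline(block)
        self.raw.clear()


writes = [b"a", b"a", b"b"]

inline = Volume()
for blk in writes:
    inline.write_inline(blk)

offline = Volume()
for blk in writes:
    offline.write_raw(blk)
raw_before = len(offline.raw)  # 3 blocks occupy space until the pass runs
offline.dedupe_offline()
```

Both volumes finish with two unique blocks. The inline volume pays the hashing cost on every write, while the offline volume temporarily holds all three raw blocks.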

Challenges in Dedupe: Single point of failure ("Last line of defense! Or fall off the cliff!"); performance; distributed dedupe.

Current Work: YADL, "Yet Another Dedupe Library". A stream-based, user-space dedupe library for files, objects, or blocks. The future: YADL-E.

Current Work: YADL. https://github.com/YADL/yadl Contributors: Ewen Pinto (ewenpin@gmail.com), Srinivas B (srinivasbillav@gmail.com), Karthik US (kus.karthikus9@gmail.com), Sukumar Poojary (sukumarpoojari92@gmail.com)

THANK YOU