Deduplication in Storage Systems Joseph Fernandes Ewen Pinto Srinivas Billava
Who are we? Joseph Fernandes (Senior Engineer, Red Hat Storage); Ewen Pinto (VI Sem MCA, NMAMIT, Nitte); Srinivas Billava (VI Sem MCA, NMAMIT, Nitte)
Agenda: What is Dedupe; Why Dedupe; Types of Dedupe; What is Deduped; Where is it Deduped; When is it Deduped; Challenges in Dedupe; Current Work
What is Deduplication? An intelligent way of storing data: redundant copies are removed and only one instance is stored.
What is Deduplication? Data units are identified by a hash index. Redundant data units are replaced by pointers. The hash algorithm should have minimal collisions. Search should be precise and fast. Should have a rich metadata filter: modification frequency, IO sizes, etc. Should deal with the distributed nature of data. Should do load balancing.
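The hash index and pointer idea can be sketched in a few lines. This is a minimal illustration, not YADL's implementation: the class name `DedupeStore`, its methods, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy content-addressed store: one stored copy per unique data unit."""

    def __init__(self):
        self.units = {}     # digest -> bytes (the single stored instance)
        self.refcount = {}  # digest -> number of pointers to that instance

    def put(self, data: bytes) -> str:
        """Store data if it is new; always return its digest as the pointer."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.units:
            self.units[digest] = data
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def get(self, pointer: str) -> bytes:
        """Follow a pointer back to the stored instance."""
        return self.units[pointer]


store = DedupeStore()
p1 = store.put(b"hello world")
p2 = store.put(b"hello world")      # redundant copy: only a pointer is added
p3 = store.put(b"something else")
```

Here `p1` and `p2` are the same pointer, and the store holds two instances for three writes. The reference count is what a real system would consult before freeing an instance.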
Why dedupe? Reduces Total Cost of Ownership (TCO): storage, network. Used in: backup/archive, disaster recovery, replication (local/remote).
What is deduped? File Level (single instancing). [Diagram: File 1 and File 2 are hashed; both map to HASH 1, and the duplicate is replaced by a pointer.]
What is deduped? Block Level. [Diagram: File 1 and File 2 are split into blocks; block B1 is hashed to HASH 1.]
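Block-level dedupe can be sketched as follows: each file becomes a "recipe" (an ordered list of block hashes) and each unique block is stored once. The names `store_file`, `read_file`, and `block_index`, the 4-byte block size, and SHA-256 are assumptions chosen to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def store_file(data: bytes, block_index: dict) -> list:
    """Split data into blocks, store each unique block once, return the recipe."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_index.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return recipe


def read_file(recipe: list, block_index: dict) -> bytes:
    """Rebuild a file by following its recipe through the block index."""
    return b"".join(block_index[d] for d in recipe)


block_index = {}
file1 = b"AAAABBBBCCCC"
file2 = b"AAAAXXXXCCCC"   # differs from file1 only in the middle block
recipe1 = store_file(file1, block_index)
recipe2 = store_file(file2, block_index)
```

The two files contain six blocks in total, but only four unique blocks are stored. File-level single instancing would have saved nothing here, because the files are not identical.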
Fixed Block Chunking. The file is divided into equal-length blocks. Pros: Faster! Cons: Not space efficient!
Fixed Block Chunking. [Diagram: a file split into equal-length blocks.]
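One reason fixed-size chunking is not space efficient is the boundary-shift problem: inserting a single byte moves every later block boundary, so previously stored blocks no longer match. A small sketch, where the helper name `fixed_chunks` and the sample data are made up for the example:

```python
def fixed_chunks(data: bytes, size: int) -> list:
    """Cut data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


original = b"ABCDEFGHIJKLMNOP"
shifted = b"!" + original          # one byte inserted at the front

original_blocks = fixed_chunks(original, 4)
shifted_blocks = fixed_chunks(shifted, 4)

# Blocks that appear in both versions and could therefore be deduplicated.
shared = set(original_blocks) & set(shifted_blocks)
```

With these inputs `shared` is empty: the 17-byte file is almost entirely old data, yet no block of it dedupes against the original.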
Variable Block Chunking. The file is chunked into variable-length blocks. Block size is determined by content. Rolling hash algorithm: Rabin-Karp. RHash = p^(n-1) * a[0] + p^(n-2) * a[1] + p^(n-3) * a[2] + ... + p * a[n-2] + a[n-1], where a[0], ..., a[n-1] are the n bytes in the current window. If (RHash & fingerprint) == 0 { Chunk! }
Variable Block Chunking. [Diagram: a file split into variable-length blocks.]
Variable Block Chunking. Pros: Space efficiency! Cons: Slower!
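A minimal sketch of content-defined chunking with a Rabin-Karp style rolling hash follows. It is an illustration under stated assumptions, not YADL's code: the base `P`, the modulus `M` (added here to keep the hash bounded), the window size, and the mask standing in for the slide's "fingerprint" are all arbitrary choices. Production chunkers usually also enforce minimum and maximum chunk sizes, which this sketch omits.

```python
import random

P = 31          # polynomial base (assumed)
M = 1 << 32     # modulus to keep the hash bounded (assumed)
WINDOW = 16     # bytes covered by the rolling hash (assumed)
MASK = 0x3F     # cut when the low 6 bits are zero, roughly every 64 bytes


def cdc_chunks(data: bytes) -> list:
    """Cut data wherever the rolling hash of the last WINDOW bytes matches."""
    top = pow(P, WINDOW - 1, M)    # weight of the oldest byte in the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % M   # slide: drop the oldest byte
        h = (h * P + byte) % M                     # slide: take in the new byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])       # boundary chosen by content
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                # trailing partial chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))

original_chunks = cdc_chunks(data)
shifted_chunks = cdc_chunks(b"!" + data)   # one byte inserted at the front
shared = set(original_chunks) & set(shifted_chunks)
print(len(original_chunks), len(shifted_chunks), len(shared))
```

Because each boundary depends only on the bytes inside the window, the inserted byte can disturb at most the first chunk. Every later chunk of the original reappears unchanged in the shifted stream, which is where the space efficiency comes from. The per-byte hashing is the cost behind "slower".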
Where is it Deduped? Client Side. Pros: less network traffic. Cons: heavier clients (CPU/memory, metadata storage).
Where is it Deduped? Server Side. Pros: lighter clients. Cons: more network traffic.
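The network trade-off between the two placements can be sketched with a toy model. Everything here is hypothetical: the `Server` class, its `missing` and `upload` methods, and the two backup functions are invented for illustration, and the bytes spent sending hashes are ignored.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class Server:
    """Toy dedupe server holding a digest -> chunk index."""

    def __init__(self):
        self.index = {}

    def missing(self, digests: list) -> list:
        """Tell the client which digests the server has not stored yet."""
        return [d for d in digests if d not in self.index]

    def upload(self, chunk: bytes) -> None:
        self.index.setdefault(digest(chunk), chunk)


def server_side_backup(server: Server, chunks: list) -> int:
    """Client sends every chunk; the server dedupes. Returns payload bytes sent."""
    sent = 0
    for chunk in chunks:
        server.upload(chunk)
        sent += len(chunk)
    return sent


def client_side_backup(server: Server, chunks: list) -> int:
    """Client hashes locally and sends only chunks the server lacks."""
    by_digest = {digest(c): c for c in chunks}   # client pays CPU and memory here
    sent = 0
    for d in server.missing(list(by_digest)):
        server.upload(by_digest[d])
        sent += len(by_digest[d])
    return sent


chunks = [b"A" * 100, b"B" * 100, b"A" * 100]
sent_server_side = server_side_backup(Server(), chunks)
sent_client_side = client_side_backup(Server(), chunks)
```

Both servers end up storing the same two chunks, but the server-side path ships 300 payload bytes while the client-side path ships 200, at the cost of hashing and keeping the digest map on the client.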
When is it Deduped? Inline: deduped as the data is being written. Offline: deduped after the data has been written (post-process).
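The difference between the two timings can be sketched as follows. The classes `InlineStore` and `OfflineStore` and their methods are hypothetical names for this example; the point is only where the hashing work happens relative to the write.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class InlineStore:
    """Dedupes on the write path: duplicates never reach storage."""

    def __init__(self):
        self.index = {}    # digest -> block
        self.layout = []   # logical order of written blocks, as digests

    def write(self, block: bytes) -> None:
        d = digest(block)                  # hashing cost is paid per write
        self.index.setdefault(d, block)
        self.layout.append(d)

    def physical_blocks(self) -> int:
        return len(self.index)


class OfflineStore:
    """Lands raw blocks first; a later pass removes the duplicates."""

    def __init__(self):
        self.raw = []      # blocks written but not yet deduped
        self.index = {}
        self.layout = []

    def write(self, block: bytes) -> None:
        self.raw.append(block)             # fast path: no hashing

    def dedupe_pass(self) -> None:
        for block in self.raw:
            d = digest(block)
            self.index.setdefault(d, block)
            self.layout.append(d)
        self.raw = []

    def physical_blocks(self) -> int:
        return len(self.raw) + len(self.index)


writes = [b"x", b"y", b"x", b"x"]
inline, offline = InlineStore(), OfflineStore()
for block in writes:
    inline.write(block)
    offline.write(block)

before_pass = offline.physical_blocks()
offline.dedupe_pass()
after_pass = offline.physical_blocks()
```

The inline store never holds more than two physical blocks. The offline store temporarily holds all four and drops to two only after `dedupe_pass()` runs, trading extra interim space for a cheaper write path.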
Challenges in Dedupe: Single point of failure (“Last line of defense! Or fall off the cliff!”); Performance; Distributed Dedupe
Current Work: YADL, “Yet Another Dedupe Library”. Stream-based user-space dedupe library. File or Object or Block. The Future: YADL-E
Current Work: YADL. https://github.com/YADL/yadl Contributors: Ewen Pinto (ewenpin@gmail.com), Srinivas B (srinivasbillav@gmail.com), Karthik US (kus.karthikus9@gmail.com), Sukumar Poojary (sukumarpoojari92@gmail.com)
THANK YOU