Mitali Rawat, Sonu Agarwal, Sudarshan Avish Maru

Presentation transcript:

Deduplication - Reference counting with backpointers and bounded representation. Mitali Rawat, Sonu Agarwal, Sudarshan Avish Maru

Outline: Data deduplication mechanism. Approaches for handling leaking references. Current implementation in Ceph. Contributions. Our implementation. Simulated results.

Data Deduplication Mechanism

Deduplication Mechanism (contd.): Traditional techniques divide an object into multiple chunks. Each chunk is stored at a physical address that is calculated from its contents, so two chunks with identical contents map to the same address and are stored only once.
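The content-addressing idea on this slide can be sketched in a few lines. This is a minimal illustration, not the Ceph implementation: the fixed 4-byte chunk size, the use of SHA-256 as the fingerprint, and the names `ChunkStore`, `put`, and `write_object` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and deliberately tiny, for illustration only


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split an object into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Toy chunk store: the address of a chunk is a hash of its contents."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        # Assumption: SHA-256 stands in for whatever fingerprint a real system uses.
        address = hashlib.sha256(chunk).hexdigest()
        # An identical chunk hashes to the same address, so it is not stored twice.
        self.chunks.setdefault(address, chunk)
        return address

    def write_object(self, data: bytes) -> list[str]:
        """Store an object and return its recipe: the list of chunk addresses."""
        return [self.put(chunk) for chunk in split_into_chunks(data)]


store = ChunkStore()
recipe_a = store.write_object(b"abcdabcdwxyz")  # chunks: abcd, abcd, wxyz
recipe_b = store.write_object(b"wxyzabcd")      # chunks: wxyz, abcd
```

Two objects with five chunks in total end up occupying only two stored chunks, which is why each stored chunk then needs to know who references it.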

Approaches for handling leaking references: Each chunk can be referenced by multiple objects, so the system needs to keep track of those references.

Approach 1: Store all the references explicitly. However, in a large-scale distributed system like Ceph, a chunk may have millions of references, and storing them explicitly causes huge space overheads, e.g. for a popular byte sequence or chunk.

Approach 2: Instead of storing all references to a chunk explicitly, keep only their total count. However, many systems need explicit backpointers for garbage collection, consistency checks, etc. (e.g. WiscKey). Crashes can also leak references.
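The trade-off between the two approaches can be illustrated with a small sketch. It assumes a hypothetical situation in which object `obj-c` was deleted but a crash left its reference to the chunk behind; the object names and variables are made up for the example.

```python
# Objects that still exist and really reference the chunk (hypothetical names).
live_objects = {"obj-a", "obj-b"}

# Approach 1: explicit backpointers. Space grows with every reference,
# but a consistency check can name the exact stale entry.
backpointers = {"obj-a", "obj-b", "obj-c"}
leaked = backpointers - live_objects

# Approach 2: count only. Constant space, but after recounting the live
# referrers we only learn how many references leaked, not which ones.
ref_count = 3
surplus = ref_count - len(live_objects)
```

With backpointers the stale entry can be removed directly; with a bare count the system knows only that one reference is unaccounted for.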

Current Implementation in Ceph: References to a chunk in Ceph are stored explicitly. Each reference is captured as an hobject_t, which contains information such as the pool ID, hash, etc. However, these references are currently stored in a set<hobject_t>, which is unbounded; for a popular sequence of bytes the set can become huge, and the OSD will be sad.

Contributions Fixed the CAS client to bring it to a working condition. Added support for storing the reference count in addition to the actual backpointers. Implemented an algorithm to create bounded representation of backpointers. Unit tests to verify reference coalescing. These tests currently use scheme 3 but can be easily extended to other schemes.

Our Implementation: A hybrid of the two approaches that stores explicit references as well as the count. Each chunk has a vector of references, keyed by <poolid, hash prefix>, where the hash prefix is some small number of bits (32 bits initially). When the number of references reaches a certain threshold, a coalescing mechanism is executed. If n references are coalesced into one, the new coalesced entry contains n as its count.
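A rough sketch of this hybrid structure follows. It is an illustration under stated assumptions, not the authors' code: the class `ChunkRefs`, the helper key functions, the choice to trigger coalescing when the entry count exceeds the threshold, the use of the low 32 bits as the hash prefix, and the single pool-only coalescing step are all simplifications introduced here.

```python
from collections import Counter

THRESHOLD = 5  # assumed trigger point; the simulated results also use 5


def exact_key(pool_id, obj_hash):
    # Assumption: the 32-bit hash prefix is taken from the low 32 bits.
    return (pool_id, obj_hash & 0xFFFFFFFF)


def pool_only_key(pool_id, obj_hash):
    # Coarsest possible key: drop the hash entirely.
    return (pool_id, None)


class ChunkRefs:
    """Backpointers for one chunk: (pool_id, hash_prefix) -> reference count."""

    def __init__(self):
        self.key_fn = exact_key
        self.entries = Counter()

    def add(self, pool_id, obj_hash):
        self.entries[self.key_fn(pool_id, obj_hash)] += 1
        # Coalesce once the number of stored backpointers passes the threshold.
        if self.key_fn is exact_key and len(self.entries) > THRESHOLD:
            self._coalesce(pool_only_key)

    def _coalesce(self, coarser_key):
        merged = Counter()
        for (pool_id, prefix), count in self.entries.items():
            # n entries that share a coarse key collapse into one entry with count n.
            merged[coarser_key(pool_id, prefix)] += count
        self.entries = merged
        self.key_fn = coarser_key  # later references use the coarse key too

    def total_references(self):
        return sum(self.entries.values())


refs = ChunkRefs()
for h in range(6):
    refs.add(1, 0x1000 + h)  # six distinct objects in pool 1
refs.add(2, 0x99)            # one more object, from pool 2
```

After the sixth reference the six exact entries collapse into one pool-level entry with count 6, so the total count stays exact while the number of stored backpointers stays bounded.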

Coalescing mechanism: Scheme 0: Coalesce two references if they have the same pool and the same 16 least significant bits of the hash. Schemes 1 and 2: Instead of 16 bits, use 8 and 4 bits of the hash respectively. Scheme 3: Coalesce based only on the pool.
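The four schemes differ only in how many low-order hash bits survive in the key, which the following sketch makes concrete. The function names `scheme_key` and `coalesce`, the `SCHEME_BITS` table, and the sample hash values are hypothetical and chosen only to show each scheme merging a little more than the previous one.

```python
from collections import Counter

# Number of least significant hash bits kept by each scheme (scheme 3 keeps none).
SCHEME_BITS = {0: 16, 1: 8, 2: 4, 3: 0}


def scheme_key(scheme, pool_id, obj_hash):
    """Key under which a reference is grouped for the given scheme."""
    mask = (1 << SCHEME_BITS[scheme]) - 1
    return (pool_id, obj_hash & mask)


def coalesce(refs, scheme):
    """Merge references that share a key, summing their counts."""
    merged = Counter()
    for (pool_id, obj_hash), count in refs.items():
        merged[scheme_key(scheme, pool_id, obj_hash)] += count
    return merged


# Five references to the same chunk, keyed by (pool_id, 32-bit hash).
refs = Counter({
    (1, 0x0001ABCD): 1,
    (1, 0x0002ABCD): 1,
    (1, 0x000300CD): 1,
    (1, 0x0004ABED): 1,
    (2, 0x0001ABCD): 1,
})

# Number of backpointers left after applying each scheme.
sizes = {scheme: len(coalesce(refs, scheme)) for scheme in SCHEME_BITS}
```

For this sample the five references shrink to 4, 3, 2, and 2 entries under schemes 0 through 3, and the summed counts always remain 5.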

Simulated results. [Figure: approximate memory used by backpointers as more references are added to the same chunk. Axes: number of references added, memory used (in bytes). All references added from the same pool; threshold on number of backpointers = 5.]

[Figure: approximate memory used by explicit and approximate backpointers as more references are added to the same chunk. Axes: number of references added, memory used (in bytes). All references added from the same pool; threshold on number of backpointers = 5.]

[Figure: number of backpointers stored as more references are added to the same chunk. Axis: number of references added.]

[Table: backpointers for an object when references are being added to it from two different pools; threshold on total number of backpointers = 5. Columns: total number of references, total number of backpointers stored, memory used by backpointers. Surviving values, in source order: added from pool 1: 1, 144; added from pool 2: 2, 288; 3, 432; 4, 576; 5; 6; 7; 8.]

Demo

Future Work: Fixes to dynamic switching between coalescing schemes. Delete path. Modifications to the scrubbing process.

References https://ceph.com/wp-content/uploads/2018/07/ICDCS_2018_mwoh.pdf https://pad.ceph.com/p/deduplication_how_do_we_store_chunk

THANK YOU !!

QUESTIONS?