Isotope: Transactional Isolation for Block Storage

Slides:



Advertisements
Similar presentations
Copyright 2008 Sun Microsystems, Inc Better Expressiveness for HTM using Split Hardware Transactions Yossi Lev Brown University & Sun Microsystems Laboratories.
Advertisements

Gecko Storage System Tudor Marian, Lakshmi Ganesh, and Hakim Weatherspoon Cornell University.
Monitoring Data Structures Using Hardware Transactional Memory Shakeel Butt 1, Vinod Ganapathy 1, Arati Baliga 2 and Mihai Christodorescu 3 1 Rutgers University,
Allocation Methods - Contiguous
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
Chapter 11: File System Implementation
Nested Transactional Memory: Model and Preliminary Architecture Sketches J. Eliot B. Moss Antony L. Hosking.
File System Implementation
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
Supporting Nested Transactional Memory in LogTM Authors Michelle J Moravan Mark Hill Jayaram Bobba Ben Liblit Kevin Moore Michael Swift Luke Yen David.
Home: Phones OFF Please Unix Kernel Parminder Singh Kang Home:
1 I/O Management in Representative Operating Systems.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition Chapter 2: Operating-System Structures Modified from the text book.
Operating Systems Concepts 1. A Computer Model An operating system has to deal with the fact that a computer is made up of a CPU, random access memory.
Ji-Yong Shin Cornell University In collaboration with Mahesh Balakrishnan (MSR SVC), Tudor Marian (Google), and Hakim Weatherspoon (Cornell) Gecko: Contention-Oblivious.
Locking Key Ranges with Unbundled Transaction Services 1 David Lomet Microsoft Research Mohamed Mokbel University of Minnesota.
I/O Systems ◦ Operating Systems ◦ CS550. Note:  Based on Operating Systems Concepts by Silberschatz, Galvin, and Gagne  Strongly recommended to read.
Highly Available ACID Memory Vijayshankar Raman. Introduction §Why ACID memory? l non-database apps: want updates to critical data to be atomic and persistent.
Flashing Up the Storage Layer I. Koltsidas, S. D. Viglas (U of Edinburgh), VLDB 2008 Shimin Chen Big Data Reading Group.
Operating Systems (CS 340 D) Dr. Abeer Mahmoud Princess Nora University Faculty of Computer & Information Systems Computer science Department.
M i SMob i S Mob i Store - Mobile i nternet File Storage Platform Chetna Kaur.
Ji-Yong Shin Cornell University In collaboration with Mahesh Balakrishnan (MSR SVC), Tudor Marian (Google), Lakshmi Ganesh (UT Austin), and Hakim Weatherspoon.
1 File Systems: Consistency Issues. 2 File Systems: Consistency Issues File systems maintains many data structures  Free list/bit vector  Directories.
OSes: 3. OS Structs 1 Operating Systems v Objectives –summarise OSes from several perspectives Certificate Program in Software Development CSE-TC and CSIM,
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 12: File System Implementation File System Structure File System Implementation.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 11: File System Implementation.
12.1 Silberschatz, Galvin and Gagne ©2003 Operating System Concepts with Java Chapter 12: File System Implementation Chapter 12: File System Implementation.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition File System Implementation.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
FILE SYSTEM IMPLEMENTATION 1. 2 File-System Structure File structure Logical storage unit Collection of related information File system resides on secondary.
Transactional Flash V. Prabhakaran, T. L. Rodeheffer, L. Zhou (MSR, Silicon Valley), OSDI 2008 Shimin Chen Big Data Reading Group.
Computer Science Lecture 19, page 1 CS677: Distributed OS Last Class: Fault tolerance Reliable communication –One-one communication –One-many communication.
1 Transactional Flash Vijayan Prabhakaran, Thomas L. Rodeheffer, Lidong Zhou Microsoft Research, Silicon Valley Presented by Sudharsan.
CSCI/CMPE 4334 Operating Systems Review: Exam 1 1.
Storage Systems CSE 598d, Spring 2007 Lecture 13: File Systems March 8, 2007.
Novel Paradigms of Parallel Programming Prof. Smruti R. Sarangi IIT Delhi.
Introduction to Operating Systems Concepts
Introduction to threads
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Virtualization.
Hathi: Durable Transactions for Memory using Flash
Free Transactions with Rio Vista
Algorithmic Improvements for Fast Concurrent Cuckoo Hashing
Transactions and Reliability
Chapter 11: File System Implementation
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Chapter 12: File System Implementation
File System Implementation
CSE451 I/O Systems and the Full I/O Path Autumn 2002
Google Filesystem Some slides taken from Alan Sussman.
Operating Systems ECE344 Lecture 11: SSD Ding Yuan
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung Google Presented by Jiamin Huang EECS 582 – W16.
The Composite-File File System: Decoupling the One-to-one Mapping of Files and Metadata for Better Performance Shuanglong Zhang, Helen Catanese, Andy An-I.
Introduction to Operating Systems
Two Ideas of This Paper Using Permissions-only Cache to deduce the rate at which less-efficient overflow handling mechanisms are invoked. When the overflow.
HashKV: Enabling Efficient Updates in KV Storage via Hashing
Lecture 6: Transactions
Operating System Concepts
CS703 - Advanced Operating Systems
Btrfs Filesystem Chris Mason.
Introduction to Operating Systems
Transactional Memory An Overview of Hardware Alternatives
Free Transactions with Rio Vista
Printed on Monday, December 31, 2018 at 2:03 PM.
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Chapter 15: File System Internals
Operating Systems Structure
Concurrency control (OCC and MVCC)
Module 12: I/O Systems I/O hardwared Application I/O Interface
Lecture 29: File Systems (cont'd)
Presentation transcript:

Isotope: Transactional Isolation for Block Storage Ji-Yong Shin Cornell University In collaboration with Mahesh Balakrishnan (Yale), Tudor Marian (Google), and Hakim Weatherspoon (Cornell) FAST 2016

Multicore and Concurrency Concurrent access to storage is the norm For safe data access, concurrency control is a must

Concurrency Control in Storage Stacks Most modern apps support concurrency control App-specific implementation Typically, locking Concurrency Control (+ Atomicity/Durability) Is Difficult Key-Value Store Filesystem / DB Block I/O Device Driver H/W Device Applications Transactional Block Store (Isolation + Atomicity + Durability)

Why Transactional Block Store? TX TX Simpler applications One common implementation for isolation (and atomicity/durability) TX APIs decouple policy/mechanism TX over application-level constructs (e.g. file, directories, key-value pairs) TX across different applications (e.g. read from file and write to KV store) Key-Value Store Filesystem / DB Block I/O Device Driver H/W Device Applications TX TX TX APIs TX Impl. a Impl. b

TX using optimistic concurrency control yields low overhead End-To-End Argument? Application specific functions should be in end-hosts Transactional isolation is general Pushed down function should not incur unnecessary overheads Isolation can be implemented efficiently Key-Value Store Filesystem / DB Block I/O Device Driver H/W Device Applications TX Many block-level functions, e.g. atomicity, block layer indirection, are already implemented TX using optimistic concurrency control yields low overhead

How do we design a transactional block store How do we design a transactional block store? Isotope Is a transactional block store useful? IsoBT, IsoHT, IsoFS, and ImgStore

Rest of the Talk Isotope Performance Evaluation Conclusion Overview Design and APIs Applications Performance Evaluation Conclusion

Isotope The first block store to support TX isolation MARS and TxFlash only supported TX atomicity Multi-version optimistic concurrency control Keeps multiple versions of block data Speculatively executes TX until commit time One of two semantics supported Strict serializability Snapshot isolation Simple APIs BeginTX/EndTX/AbortTX and more

Isotope Design TX Contexts Temporary Multi-version Index BeginTX(); foo=Read(0); Write(1,boo); Write(3,baz); EndTX(); Application Virtual (Logical) Address Space 1 3 2 … Isotope V54: L2 V55:L5 V55:L4 4 5 TX Contexts Thread Id: ? Begin Time: ? End Time: ? Write ? Temporary Multi-version Index Version:Linear Addr V53: L3 V52: L1 Timestamp Counter: T53 Timestamp Counter: T55 Thread Id: 1 BeginTime: T53 End Time: ? 1 2 3 4 5 … Physical data in a Log (linear address space) N Write 1 Write 3 Write Buffer Tx Decision Engine Thread Id: 1 Start Time: 50 End Time: 51 Write 3 BeginTime: T50 End Time: T52 Write 0 Thread Id: 2 Begin Time: T50 End Time: T53 Write 1 Thread Id: 0 Begin Time: T53 End Time: T54 Queued context (sorted by end time) T52 T53 T54 Yes.

Deciding Transactions Strict serializability based Checks for read/write conflicts BeginTX(); // @ T53 foo=Read(33); Write(25, bar); Write(33, baz); EndTX(); // @ T55 Conflict Window End Time Begin Time TX Decision Engine T55 T54 T53 T53 T52 … R 33 W 25 W 33 W40 W 33 W 88 W 22 W 17 W 33 … Conflict     Commit  Abort  Queued contexts (sorted by end time)

Isotope Challenges and Additional APIs Application must be stateless (no caches) PleaseCache(): caches a data block in internal memory cache Mismatching data access granularity (application vs block) MarkAccessed(): indicates subblock level data access False Conflict TX A Write (0, foo); // modified 1st bit TX B Write (0, bar); // modified last bit 1 2 … Filesystem metadata block Yes.

Implementation Built as device mapper in Linux kernel Logical block device similar to software RAID or LVM Can run on any block device (Disk, SSD, etc.) Log implemented based on Gecko Chain logging design (Logs to multiple drives in round robin) APIs supported using IOCTL calls BeginTX/EndTX/AbortTX MarkAccessed/PleaseCache ReleaseTX/TakeoverTX

Isotope Applications IsoBT and IsoHT IsoFS Device Driver IsoBT and IsoHT C++ library key-value stores Based on persistent B-tree and hashtable ACID Put, Get, Delete, etc. IsoFS FUSE based transactional filesystem Executes arbitrary filesystem ops (read, write, rename, etc.) ACID’ly PleaseCache to handle metadata H/W Device

Ease of Programming Lines of code Lock(); If(!ReadMetadata(…)) { Unlock(); return failure; } ReadData(…); Example: Get() BeginTX(); If(!ReadMetadata(…)) { AbortTX(); return failure; } ReadData(…); EndTX(); Lines of code Simple replacement of locks to BeginTX/EndTX/AbortTX Only few lines of code to add optimizations Application Naïve Lock-Based Isolation Isotope TX APIs (lines modified) Optional APIs (lines added) IsoHT 591 591 (15) 617 (26) IsoBT 1,229 1,229 (12) 1,246 (17) IsoFS 997 997 (19) 1,022 (25) Very easy to build transactional applications using Isotope APIs

Composing Applications ImgStore IsoBT IsoHT Isotope ImgStore Transactional storage with two subsystems IsoBT for metadata and IsoHT for images Case Library Device Driver H/W Device ImgStore Library Model IsoBT IsoHT BeginTX EndTX 1 process with threads

Composing Applications ImgStore IsoBT IsoHT Isotope ImgStore Transactional storage with two subsystems IsoBT for metadata and IsoHT for images Case Library Process Device Driver H/W Device Continues on a transaction given the handle Returns a transaction handle 2 processes with threads ImgStore Process Model BeginTX BeginTX TakeoverTX TakeoverTX IsoBT IsoBT IsoHT IsoHT ReleaseTX ReleaseTX EndTX EndTX Thread Id: X TX Handles through IPC

Composing Applications ImgStore IsoBT IsoHT Isotope ImgStore Transactional storage with two subsystems IsoBT for metadata and IsoHT for images Case Library Process Thread pools Device Driver H/W Device 1 process with 2 different thread pools ImgStore Thread Pool Model ImgStore Thread Pool Model BeginTX BeginTX TakeoverTX TakeoverTX IsoBT IsoBT IsoHT IsoHT ReleaseTX ReleaseTX EndTX EndTX 1. ImgStore was only 150 LoC 2. Easy to build large apps whose TX cross boundaries

Performance Evaluation Micro benchmark Base performance of Isotope? Key-value stores Performance of applications built over Isotope? Filesystems Performance of new and existing filesystems? ImgStore Composition Performance under different composition?

Micro Benchmark (Base Performance of Isotope) Random 3-4KB-reads-3-4KB-writes TX’es from 64 threads Increasing address space (decreasing Tx conflicts) Ran on 3-SSD chain Aborts are cheap Subblock TX mechanism has negligible overhead

Key-Value Stores LevelDB: on RAID0 volume, Sync/Async mode Increasing number of threads on 2 SSDs 8KB data using YCSB workload-a Isotope-based applications perform comparable to existing applications and guarantee strong semantics

Filesystems IsoFS performs comparable to ext2/3 Ext2 and Ext3 on top of Isotope on SSDs Logging benefit All I/Os as singleton transactions IOZone benchmark write/rewrite phase with 8 threads IsoFS performs comparable to ext2/3 ext2/3 saturates SSD with no slowdown

ImgStore Compositions Different compositions of ImgStore YCSB Workload-a 16KB image to/from IsoHT and metadata to/from IsoBT in a TX Small ReleaseTX/TakeoverTX overhead (lib vs thread) Cross process overhead comes from IPC

Conclusion First block storage with TX isolation Simple API: BeginTX, EndTX, AbortTX Low overhead design (nearly free abort and MVCC) Optimizations for fine grained TX and caching Facilitates TX application design 1K LoC transactional KV-stores and filesystem Easy support for composition of TX applications Right time to consider pushing Isolation down the I/O stack

Thank you Questions?