Implementing Linearizability at Large Scale and Low Latency
Collin Lee, Seo Jin Park, Ankita Kejriwal, †Satoshi Matsushita, John Ousterhout
Platform Lab, Stanford University; †NEC

Presentation transcript:

Slide 1: Implementing Linearizability at Large Scale and Low Latency
Collin Lee, Seo Jin Park, Ankita Kejriwal, †Satoshi Matsushita, John Ousterhout
Platform Lab, Stanford University; †NEC

Slide 2: Overview
● Goal: take back consistency in large-scale systems
● Approach: a distinct layer for linearizability
● Reusable Infrastructure For Linearizability (RIFL)
  - Turns at-least-once RPCs → exactly-once RPCs
  - Records RPC results durably
  - To handle reconfiguration, associates metadata with objects
● Implemented on a distributed key-value store, RAMCloud
  - Low latency: < 5% (~500 ns) latency overhead
  - Scalable: supports state for 1M clients
● RIFL simplified implementing transactions
  - A simple distributed transaction commits in ~22 μs
  - Outperforms H-Store

Slide 3: What is Linearizability?
Strongest form of consistency for concurrent systems.
[Timeline figure: a client sends a request and later receives a response; the operation behaves as if it executed exactly once and instantaneously, at some point between those two events.]

Slide 4: What is Missing from Existing Systems?
● Most systems provide at-least-once semantics, not exactly-once
  - Retry (possibly) failed operations after crashes
  - Idempotent semantics: repeated executions are OK?
● At-least-once + idempotency ≠ linearizability
● Need exactly-once semantics!
[Figure: client A writes 2 to a shared object; client B reads 2, then writes 3; client A's delayed retry of write(2) re-executes, so client B's next read returns 2.]
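
To make the failure mode concrete, here is a minimal, hypothetical simulation of the timeline in the figure (plain C++, not tied to any real system): the retried write is idempotent on its own, yet its re-execution erases client B's later write.

#include <cassert>
#include <cstdint>

int main() {
    int64_t object = 0;   // the shared object in the server

    object = 2;           // client A: write(2) executes, but the reply to A is lost
    assert(object == 2);  // client B: read -> 2
    object = 3;           // client B: write(3)
    object = 2;           // client A: retries write(2); at-least-once re-executes it
    assert(object == 2);  // client B: read -> 2; B's write(3) has silently vanished
    return 0;
}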

Slide 5: Architecture of RIFL
● RIFL saves the results of RPCs; if a client retries, the RPC is not re-executed and the saved result is returned
● Key problems:
  - Unique identification of each RPC
  - Durable completion record
  - Retry rendezvous
  - Garbage collection

Slide 6: 1) RPC Identification
● Each RPC must have a unique ID
● Retries use the same ID
● The client assigns RPC IDs
[Diagram: RPC ID = Client ID (64-bit integer, system-wide unique) + Sequence Number (64-bit integer, monotonically increasing, assigned by the client)]
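
A minimal sketch of this identification scheme in C++ (class and field names are illustrative assumptions, not the actual RIFL code):

#include <atomic>
#include <cstdint>

// RPC identifier as described on the slide: a system-wide unique client ID
// plus a sequence number that the client increases monotonically.
struct RpcId {
    uint64_t clientId;
    uint64_t sequence;
};

// Client-side assignment of RPC IDs. A brand-new RPC gets a fresh sequence
// number; a retry must reuse the RpcId it was originally assigned.
class RpcIdFactory {
  public:
    explicit RpcIdFactory(uint64_t clientId)
        : clientId(clientId), nextSequence(1) {}

    RpcId newRpcId() {
        return RpcId{clientId, nextSequence.fetch_add(1)};
    }

  private:
    const uint64_t clientId;
    std::atomic<uint64_t> nextSequence;
};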

Slide 7: 2) Durable Completion Record
● Written when an operation completes
● Same durability as the object(s) being mutated
● Created atomically with the object mutation
[Diagram: completion record = Client ID + Sequence Number + RPC result for the client]

Slide 8: 3) Retry Rendezvous
● Data migration is common in large systems (e.g. crash recovery)
● Retries must be able to find the completion record
[Timeline figure: the client writes through server 1, which stores the object and its completion record (CR); the object and CR later migrate to server 2; the client's retry must reach server 2.]

Slide 9: 3) Retry Rendezvous (cont.)
● Associate each RPC with a specific object
● The completion record follows the object during migration
[Diagram: completion record = Client ID + Sequence Number + Object Identifier + RPC result for the client]
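
Putting Slides 7 and 9 together, a completion record might be laid out as in the sketch below (field names are assumptions; the important point is that the record travels with the object it names):

#include <cstdint>
#include <vector>

// Completion record, per Slides 7 and 9. It is appended durably in the same
// atomic operation as the object mutation it describes, and because it names
// the mutated object it migrates along with that object (e.g. during crash
// recovery), so a retry arriving at the object's new home can still find it.
struct CompletionRecord {
    uint64_t clientId;            // which client issued the RPC
    uint64_t sequenceNumber;      // which of that client's RPCs this is
    uint64_t objectIdentifier;    // the object the RPC mutated (e.g. a key hash)
    std::vector<uint8_t> result;  // the RPC result to return on retries
};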

Slide 10: 4) Garbage Collection
● Lifetime of a completion record ≠ lifetime of the object value
● Can't garbage-collect while the client may still retry
● The server knows a client will never retry if:
  1. The client acknowledges receipt of the RPC result, or
  2. The client's lease expires (client crashes are detected with leases)
[Diagram: client A sends ACK #2, allowing the server to reclaim the completion records for sequence numbers 1 and 2 while keeping those for sequence numbers 3 through 10.]
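
A sketch of the server-side bookkeeping this implies, assuming clients acknowledge results by sequence number and hold leases (names and the exact ACK convention are assumptions):

#include <cstdint>
#include <map>
#include <vector>

// Completion records retained for one client, keyed by sequence number.
class ClientCompletionRecords {
  public:
    void recordFinished(uint64_t sequence, std::vector<uint8_t> result) {
        records[sequence] = std::move(result);
    }

    // The client acknowledges that it has received every result with a
    // sequence number <= ackedThrough; those records can never be needed
    // again, so they are reclaimed.
    void processAck(uint64_t ackedThrough) {
        records.erase(records.begin(), records.upper_bound(ackedThrough));
    }

    // If the client's lease expires, the client has crashed and will never
    // retry, so all of its completion records can be dropped.
    void leaseExpired() {
        records.clear();
    }

  private:
    std::map<uint64_t, std::vector<uint8_t>> records;
};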

Slide 11: Linearizable RPC in Action
[Diagram: client 5 sends Write(2) with ClientID 5, Seq# 9. The server's ResultTracker marks (5, 9) as Started, applies the write (object value 1 → 2), and appends the new object value together with a completion record (CID 5, Seq 9, result: Success) to durable storage in one atomic operation. The ResultTracker then marks (5, 9) as Finished with a pointer to the completion record, and the server replies Success.]
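
The same flow, sketched as server-side C++ pseudocode (the ResultTracker interface and applyWriteDurably are assumptions modeled on the diagram, not RAMCloud's actual API):

#include <cstdint>
#include <optional>
#include <vector>

enum class RpcState { NEW, STARTED, FINISHED };

// Assumed in-memory tracker: maps (clientId, sequence) to its state and,
// once finished, to a pointer at the durable completion record.
struct ResultTracker {
    RpcState state(uint64_t clientId, uint64_t sequence);
    void markStarted(uint64_t clientId, uint64_t sequence);
    void markFinished(uint64_t clientId, uint64_t sequence,
                      const std::vector<uint8_t>& result);
    std::optional<std::vector<uint8_t>> savedResult(uint64_t clientId,
                                                    uint64_t sequence);
};

// Assumed helper: applies the mutation and appends the new object value plus
// the completion record to durable storage in one atomic operation.
std::vector<uint8_t> applyWriteDurably(uint64_t clientId, uint64_t sequence,
                                       const std::vector<uint8_t>& newValue);

// Handling of Write(2) from client 5, seq 9, as in the diagram.
std::vector<uint8_t> handleLinearizableWrite(ResultTracker& tracker,
                                             uint64_t clientId, uint64_t sequence,
                                             const std::vector<uint8_t>& newValue) {
    if (tracker.state(clientId, sequence) == RpcState::FINISHED)
        return *tracker.savedResult(clientId, sequence);  // duplicate: reuse saved result
    // (A retry that arrives while the RPC is still STARTED is answered
    // "in progress"; see Slide 12.)

    tracker.markStarted(clientId, sequence);
    std::vector<uint8_t> result = applyWriteDurably(clientId, sequence, newValue);
    tracker.markFinished(clientId, sequence, result);
    return result;  // "Success"
}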

Slide 12: Handling Retries
[Diagram: client 5 retries Write(2) with the same ClientID 5, Seq# 9. While the ResultTracker still shows the RPC as in progress, the server answers Retry and the client tries again later; once (5, 9) is Finished, the server returns the saved result (Success) from the completion record instead of re-executing the write.]
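
From the client's perspective, a retry loop might look like this sketch (the transport stub and status codes are assumptions; the key point is that the same clientId/sequence pair is reused on every attempt):

#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

enum class RpcStatus { SUCCESS, IN_PROGRESS, RETRY };

// Assumed transport stub: sends write(value) tagged with (clientId, sequence)
// and fills *result when the server reports SUCCESS.
RpcStatus sendWriteRpc(uint64_t clientId, uint64_t sequence,
                       const std::vector<uint8_t>& value,
                       std::vector<uint8_t>* result);

std::vector<uint8_t> linearizableWrite(uint64_t clientId, uint64_t sequence,
                                       const std::vector<uint8_t>& value) {
    std::vector<uint8_t> result;
    // Every attempt carries the same RPC ID, so the server either executes
    // the write once or returns the result saved in the completion record.
    while (sendWriteRpc(clientId, sequence, value, &result) != RpcStatus::SUCCESS) {
        std::this_thread::sleep_for(std::chrono::microseconds(100));  // back off, then retry
    }
    return result;
}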

Slide 13: Performance of RIFL in RAMCloud
● Achieved linearizability without hurting RAMCloud's performance
  - Minimal overhead on latency (< 5%) and throughput (~0%)
  - Supports state for 1M clients

Slide 14: Why RAMCloud?
● General-purpose distributed in-memory key-value storage system
● Durability: 3-way replication
● Fast recovery: 1~2 s after a server crash
● Large scale: 100+ TB across many servers
● Low latency: 4.7 μs reads, 13.5 μs writes (100 B objects)
● RIFL is implemented on top of RAMCloud
  - Core infrastructure: 1,200 lines of C++
  - Per operation: 17 lines of C++ to make an operation linearizable

Slide 15: Experimental Setup
● Servers: Xeon, 4 cores at 3 GHz
● Fast network:
  - Infiniband (24 Gbps)
  - Kernel-bypassing transport (RAMCloud default)

Slide 16: Impact on Latency
[Figure: latency distributions for normal writes vs. linearizable writes.]
● Median latency: 13.5 μs vs. 14.0 μs
● 99th-percentile latency: 69.0 μs vs. μs

Slide 17: Impact on Throughput
[Figure: write throughput with and without RIFL.]

Slide 18: Impact of Many Client States
● Storage impact: 116 B per client → 116 MB for 1M clients
● Latency impact (linearizable write, unloaded):
  - Median with 1 client: 14.0 μs
  - Median with 1M clients: 14.5 μs

Slide 19: Case Study: Distributed Transactions with RIFL
● Extended use of RIFL to more complex operations
● Two-phase commit protocol based on Sinfonia
● RIFL significantly reduced the mechanism needed compared to Sinfonia
● Lowest possible number of round trips for distributed transactions

Slide 20: Transaction Client API

class Transaction {
    read(tableId, key) => blob
    write(tableId, key, blob)
    delete(tableId, key)
    commit() => COMMIT or ABORT
}

● Optimistic concurrency control
● Mutations are "cached" in the client until commit
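
A hypothetical usage sketch built on the API above, written in C++ (the class declaration mirrors the slide's pseudocode; "delete" is renamed "remove" because it is a C++ keyword, and the string-based value handling is an assumption, not the actual RAMCloud client API):

#include <cstdint>
#include <string>

enum class TxResult { COMMIT, ABORT };

// Declaration mirroring the slide's pseudocode API.
class Transaction {
  public:
    std::string read(uint64_t tableId, const std::string& key);
    void write(uint64_t tableId, const std::string& key, const std::string& blob);
    void remove(uint64_t tableId, const std::string& key);  // "delete" on the slide
    TxResult commit();
};

// Example: move 'amount' between two account records. All reads and writes
// are buffered in the client; commit() validates the versions that were read
// and applies both writes atomically, or aborts if either object changed.
bool transfer(Transaction& tx, uint64_t accountTable,
              const std::string& from, const std::string& to, long amount) {
    long fromBalance = std::stol(tx.read(accountTable, from));
    long toBalance = std::stol(tx.read(accountTable, to));
    tx.write(accountTable, from, std::to_string(fromBalance - amount));
    tx.write(accountTable, to, std::to_string(toBalance + amount));
    return tx.commit() == TxResult::COMMIT;  // on ABORT the caller may retry the transaction
}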

Slide 21: Transaction Commit Semantics
● commit() is an atomic multi-object operation
  - Operations cached in the client are transmitted to the servers
  - Each is conditioned on a version (same version ≡ the object didn't change)
[Diagram: the user calls the Transaction API; reads fetch from server storage and their versions plus new values are cached in the client; commit() runs the transaction commit protocol, which checks versions and applies the changes on the servers.]

Slide 22: Transaction Commit Protocol
● Client-driven two-phase commit (2PC)
● RPCs:
  - PREPARE() => VOTE
  - DECISION()
● The fate of the transaction is determined after the first phase
● The client is blocked for only 1 RTT + 1 durable log write
● DECISIONs are processed in the background
[Diagram: the client sends PREPARE to each participant server, which locks the objects, writes durably to its backups, and returns a VOTE; the client then sends DECISION, and the participants' second durable log write is done in the background.]
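
A sketch of the client side of this protocol (the RPC stubs, their argument shapes, and the serial send loop are simplifications and assumptions; in practice the PREPAREs would go out in parallel):

#include <cstdint>
#include <utility>
#include <vector>

enum class Vote { PREPARED, ABORT_VOTE };
enum class Decision { COMMIT, ABORT };

struct PrepareArgs {
    uint64_t tableId;
    uint64_t keyHash;
    uint64_t requiredVersion;        // the version the transaction read
    std::vector<uint8_t> newValue;   // buffered mutation to apply on COMMIT
};

// Assumed RPC stubs. In RIFL, each PREPARE is itself a linearizable RPC, so
// retries after a server crash are safe.
Vote sendPrepare(uint64_t serverId, const PrepareArgs& args);   // locks the object, logs durably
void sendDecisionAsync(uint64_t serverId, Decision decision);   // handled in the background

// Client-driven commit: the transaction's fate is known after phase 1, so the
// caller waits only for the PREPARE round trip (plus the participants' log
// writes); the DECISIONs are delivered asynchronously.
Decision commitTransaction(const std::vector<std::pair<uint64_t, PrepareArgs>>& ops) {
    Decision decision = Decision::COMMIT;
    for (const auto& op : ops) {                    // phase 1: PREPARE
        if (sendPrepare(op.first, op.second) == Vote::ABORT_VOTE) {
            decision = Decision::ABORT;             // any abort vote aborts the transaction
            break;
        }
    }
    for (const auto& op : ops)                      // phase 2: DECISION (background)
        sendDecisionAsync(op.first, decision);
    return decision;
}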

Slide 23: Transaction Recovery on Client Crash
● Server-driven 2PC
● Initiated by a "worried" participant server
● RPCs:
  - START-RECOVERY()
  - REQUEST-ABORT() => VOTE
  - DECISION()
[Diagram: a worried participant sends START-RECOVERY to the recovery coordinator, which sends REQUEST-ABORT to each participant (durable log write, then VOTE) followed by DECISION (second durable log write).]
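
A sketch of the recovery coordinator's job, under the assumptions that it learns the participant list from the worried server and that each REQUEST-ABORT reuses the RPC ID of the client's original PREPARE for that object (names and the ID bookkeeping are illustrative):

#include <cstdint>
#include <vector>

enum class Vote { PREPARED, ABORT_VOTE };
enum class Decision { COMMIT, ABORT };

struct Participant {
    uint64_t serverId;
    uint64_t prepareClientId;   // RPC ID the client used for its PREPARE...
    uint64_t prepareSequence;   // ...so REQUEST-ABORT rendezvouses with it
};

// Assumed RPC stubs for the recovery protocol on this slide.
Vote sendRequestAbort(const Participant& p);              // returns the saved vote if already prepared
void sendDecision(uint64_t serverId, Decision decision);

// If every participant had already voted PREPARED before the client crashed,
// the transaction is committed; otherwise REQUEST-ABORT makes the undecided
// participants vote to abort and the transaction is aborted.
Decision recoverTransaction(const std::vector<Participant>& participants) {
    Decision decision = Decision::COMMIT;
    for (const Participant& p : participants) {
        if (sendRequestAbort(p) == Vote::ABORT_VOTE) {
            decision = Decision::ABORT;
            break;
        }
    }
    for (const Participant& p : participants)
        sendDecision(p.serverId, decision);
    return decision;
}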

Slide 24: RIFL Simplifies Transaction Recovery
● PREPARE() => VOTE is a linearizable RPC
● Server crash: the client simply retries PREPARE
● Client crash: the recovery coordinator sends a "fake" PREPARE (REQUEST-ABORT)
  - It queries the ResultTracker with the same RPC ID the client used
  - It writes a completion record for PREPARE / REQUEST-ABORT
● The race between the client's 2PC and the recovery coordinator's 2PC is therefore safe

Slide 25: Performance of Transactions in RAMCloud
● A simple distributed transaction commits in 22 μs
● On the TPC-C benchmark, RIFL-based transactions outperform H-Store

Slide 26: Latency of Transactions
[Figure: commit latency. 3 servers, 1 object per server: 22 μs. 60 servers, 5 objects per server: 385 μs.]

Slide 27: TPC-C Benchmark
● TPC-C simulates an order-fulfillment system
  - New-Order transaction: 23 reads, 23 writes
  - Used the full TPC-C transaction mix (~10% of transactions are distributed)
  - Latency is measured end to end
  - Modified TPC-C for this benchmark to increase server load
● Compared against H-Store
  - A main-memory DBMS for OLTP
● Two RAMCloud configurations
  - Kernel bypass for maximum performance
  - Kernel TCP for a fair comparison

Slide 28: TPC-C Performance
● TpmC: New-Order transactions committed per minute
● 210,000 TpmC per server
● Latency: ~500 µs
● RAMCloud is faster than H-Store (even over TCP)
● Limited by serial log replication (needs batching)
[Figure: TPC-C throughput and latency compared with H-Store; annotations "10x" and "1000x".]

Slide 29: Should Linearizability Be a Foundation?
● Faster simple operations (e.g. atomic increment) → lower latency, higher throughput
● Can be added on top of existing systems
● Better modular decomposition → transactions are easier to implement
[Diagram: RIFL builds transactions on top of a linearizability layer, whereas a traditional DB derives linearizability from transactions.]

Slide 30: Conclusion
● A distinct layer for linearizability → take back consistency in large-scale systems
● RIFL saves the results of RPCs; if a client retries, it returns the saved result without re-executing
  - ~0.5 μs (< 5%) latency overhead, almost no throughput overhead
● RIFL makes transactions easier
  - RIFL-based RAMCloud transactions: ~20 μs to commit
  - Outperforms H-Store on the TPC-C benchmark

Slide 31: Questions