Cool ideas from RAMCloud
Diego Ongaro, Stanford University
Joint work with Asaf Cidon, Ankita Kejriwal, John Ousterhout, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, et al.

Introduction
● RAMCloud: research project at Stanford University
  ▪ Led by Professor John Ousterhout, started in 2009
● Building a real system (open source, C++)
  ▪ Looking for more users!
1. RAMCloud basics
2. Fast crash recovery
  ▪ Recover failed servers in 1-2 seconds
3. New consensus algorithm called Raft
  ▪ Used to store RAMCloud’s cluster configuration
  ▪ Designed for understandability

DRAM in Storage Systems
● DRAM usage specialized/limited
● Clumsy (consistency with backing store)
● Lost performance (cache misses, backing store)
● Examples: UNIX buffer cache, main-memory databases, large file caches, web indexes entirely in DRAM, memcached, Facebook (200 TB total data, 150 TB cache!), main-memory DBs again

RAMCloud Overview
Harness the full performance potential of large-scale DRAM storage:
● General-purpose storage system
● All data always in DRAM (no cache misses)
● Durable and available
● Scale: 1000 – 10,000 servers, 100+ TB
● Low latency: 5-10 µs remote access
Potential impact: enable a new class of applications

RAMCloud Architecture
[Architecture diagram:]
● 1000 – 10,000 storage servers: commodity servers, each running a master and a backup
● 1000 – 100,000 application servers, each linking the client library
● Coordinator manages the cluster
● Datacenter network, high-speed networking: 5 µs round-trip, full bisection bandwidth
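To make the roles concrete, here is a toy model, not the real RAMCloud client API, of how a client library could route requests: the coordinator owns the key-range-to-master mapping, and every read or write then goes directly to the owning master. The names Coordinator, ClientLibrary, and tabletMap are assumptions for illustration only.

```cpp
// Toy routing model (illustrative, not the RAMCloud client API).
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <unordered_map>

struct Master {                                   // stands in for one storage server
    std::unordered_map<std::string, std::string> objects;
};

struct Coordinator {                              // single source of truth for placement
    std::map<uint64_t, Master*> tabletMap;        // start of key-hash range -> owning master
    Master* lookup(uint64_t keyHash) const {
        return std::prev(tabletMap.upper_bound(keyHash))->second;
    }
};

struct ClientLibrary {                            // linked into each application server
    const Coordinator& coordinator;
    Master* locate(const std::string& key) const {
        // Real clients cache this mapping; requests then go directly to the
        // owning master over the datacenter network.
        return coordinator.lookup(std::hash<std::string>{}(key));
    }
    void write(const std::string& key, const std::string& value) const {
        locate(key)->objects[key] = value;
    }
    std::string read(const std::string& key) const {
        return locate(key)->objects.at(key);
    }
};

int main() {
    Master m1, m2;
    Coordinator coordinator;
    coordinator.tabletMap = {{0, &m1}, {UINT64_MAX / 2, &m2}};
    ClientLibrary client{coordinator};
    client.write("foo", "bar");
    std::cout << client.read("foo") << "\n";      // prints "bar"
}
```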

Durability and Availability
● Challenge: making volatile memory durable
● Keep replicas in DRAM of other servers?
  ▪ 3x system cost, energy
  ▪ Still have to handle power failures
● RAMCloud approach:
  ▪ 1 copy in DRAM
  ▪ Backup copies on disk/flash: durability ~ free!

Buffered Logging
[Diagram: a write request appends to the master’s in-memory log (indexed by a hash table) and is replicated to buffered segments on several backups, which later flush them to disk.]
● All data/changes appended to the log
● No disk I/O during write requests
  ▪ Backups need ~64 MB NVRAM (or battery) for power failures
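For concreteness, a minimal sketch of this write path, assuming three backups per segment; the classes Master, Backup, and LogEntry are illustrative stand-ins, not RAMCloud’s actual classes.

```cpp
// Illustrative sketch of the buffered-logging write path (not RAMCloud code):
// a write appends to the master's in-memory log, updates the hash table, and
// is copied into a buffered segment on each backup; no disk I/O happens before
// the write returns.  Backups flush full segments to disk in the background.
#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct LogEntry { std::string key, value; };

struct Backup {
    std::vector<LogEntry> bufferedSegment;          // held in NVRAM / battery-backed RAM
    void append(const LogEntry& e) { bufferedSegment.push_back(e); }
    // Background task (omitted): once a segment reaches ~8 MB, write it to disk.
};

class Master {
public:
    explicit Master(std::array<Backup*, 3> backups) : backups_(backups) {}

    void write(const std::string& key, const std::string& value) {
        inMemoryLog_.push_back({key, value});       // 1. append to the DRAM log
        hashTable_[key] = inMemoryLog_.size() - 1;  // 2. point the hash table at it
        for (Backup* b : backups_)                  // 3. replicate to backup buffers
            b->append(inMemoryLog_.back());         //    (RPCs in the real system)
        // 4. reply to the client: durability comes from the buffered copies
    }

    const std::string& read(const std::string& key) const {
        return inMemoryLog_[hashTable_.at(key)].value;  // served straight from DRAM
    }

private:
    std::vector<LogEntry> inMemoryLog_;
    std::unordered_map<std::string, std::size_t> hashTable_;
    std::array<Backup*, 3> backups_;
};
```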

Log-Structured Memory
● Log-structured: backup disk and master’s memory
  ▪ Similar to Log-Structured File Systems (Rosenblum 1992)
● Cleaning mechanism reclaims free space (next slide)
● Use memory efficiently (80-90% utilization)
  ▪ Good performance independent of workload and changes
● Use disk bandwidth efficiently (8 MB writes)
● Manage both disk and DRAM with the same mechanism

Log Cleaning
1. Find log segments with lots of free space
2. Copy live objects into new segment(s)
3. Free cleaned segments (reuse for new objects)
Cleaning is incremental and concurrent – no pauses!
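A minimal sketch of one cleaning pass over a simplified in-memory model of segments and the hash table; the names Segment, Object, and cleanOnce are assumptions for illustration, not RAMCloud’s cleaner.

```cpp
// Illustrative sketch of one log-cleaning pass (not RAMCloud's cleaner).
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Object { std::string key, value; };

struct Segment {
    std::vector<Object> objects;
    std::size_t liveBytes = 0;    // decremented as objects are overwritten/deleted
    std::size_t totalBytes = 0;
};

// hashTable maps each key to the segment currently holding its live copy.
void cleanOnce(std::vector<Segment*>& segments,
               std::unordered_map<std::string, Segment*>& hashTable,
               Segment* survivor) {
    if (segments.empty()) return;

    // 1. Find the segment with the most free (dead) space.
    auto victimIt = std::max_element(segments.begin(), segments.end(),
        [](const Segment* a, const Segment* b) {
            return (a->totalBytes - a->liveBytes) < (b->totalBytes - b->liveBytes);
        });
    Segment* victim = *victimIt;

    // 2. Copy live objects into the new (survivor) segment; update the hash table.
    for (const Object& o : victim->objects) {
        auto it = hashTable.find(o.key);
        if (it == hashTable.end() || it->second != victim) continue;  // dead object
        survivor->objects.push_back(o);
        survivor->liveBytes += o.key.size() + o.value.size();
        it->second = survivor;
    }

    // 3. Free the cleaned segment so it can be reused for new objects.
    victim->objects.clear();
    victim->liveBytes = 0;
    segments.erase(victimIt);
    // In RAMCloud this runs incrementally and concurrently with normal requests.
}
```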

Outline
1. RAMCloud basics
2. Fast crash recovery
3. New consensus algorithm called Raft

Crash Recovery Introduction
● Applications depend on 5 µs latency: must recover crashed servers quickly!
● Traditional approaches don’t work
  ▪ Can’t afford latency of paging in from disk
  ▪ Replication in DRAM too expensive, doesn’t solve power loss
● Crash recovery:
  ▪ Need data loaded from backup disks back into DRAM
  ▪ Must replay log to reconstruct hash table
  ▪ Meanwhile, data is unavailable
  ▪ Solution: fast crash recovery (1-2 seconds)
  ▪ If fast enough, failures will not be noticed
● Key to fast recovery: use system scale

Recovery, First Try
[Diagram: dead master, backups, recovery master]
● Master chooses backups statically
  ▪ Each backup mirrors the entire log for the master
● Crash recovery:
  ▪ Choose a recovery master
  ▪ Backups read log segments from disk
  ▪ Transfer segments to the recovery master
  ▪ Recovery master replays the log
● First bottleneck: disk bandwidth:
  ▪ 64 GB / 3 backups / 100 MB/sec/disk ≈ 210 seconds
● Solution: more disks (and backups)

Recovery, Second Try
[Diagram: dead master, ~1000 backups, one recovery master]
● Scatter logs:
  ▪ Each log divided into 8 MB segments
  ▪ Master chooses different backups for each segment (randomly)
  ▪ Segments scattered across all servers in the cluster
● Crash recovery:
  ▪ All backups read from disk in parallel ( MB each)
  ▪ Transmit data over the network to the recovery master
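The scattering step can be sketched as a stand-alone function; scatterSegments and BackupId are assumed names, and this is an illustration of the idea rather than RAMCloud’s actual placement code.

```cpp
// Illustrative sketch of segment scattering (assumed names, not RAMCloud code):
// each 8 MB segment gets its own randomly chosen set of backups, so a master's
// log ends up spread across the whole cluster and can be read back in parallel.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct BackupId { std::size_t index; };

std::vector<std::vector<BackupId>> scatterSegments(std::size_t numSegments,
                                                   std::size_t numBackups,
                                                   std::size_t replicas = 3) {
    std::mt19937 rng(std::random_device{}());
    std::vector<std::vector<BackupId>> placement(numSegments);
    std::vector<std::size_t> candidates(numBackups);
    std::iota(candidates.begin(), candidates.end(), 0);
    for (auto& segmentBackups : placement) {
        std::shuffle(candidates.begin(), candidates.end(), rng);
        for (std::size_t r = 0; r < replicas && r < numBackups; ++r)
            segmentBackups.push_back({candidates[r]});  // fresh random set per segment
    }
    return placement;
}
// A 64 GB master divided into 8 MB segments has ~8000 segments; scattered over
// ~1000 backups, each backup holds (and later reads) only a small share of them.
```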

Single Recovery Master
● New bottlenecks: recovery master’s NIC and CPU
● Recovers MB in one second

Recovery, Third Try
[Diagram: dead master, backups, multiple recovery masters]
● Divide the dead master’s data into partitions
  ▪ Recover each partition on a separate recovery master
  ▪ Partitions based on key ranges, not log segments
  ▪ Each backup divides its log data among recovery masters
  ▪ Highly parallel and pipelined
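A sketch of how a backup might divide its data among recovery masters; hashing keys modulo the number of recovery masters stands in for RAMCloud’s actual key-range partitions, and all names here are illustrative assumptions.

```cpp
// Illustrative sketch of partitioned recovery (assumed names, not RAMCloud code):
// the dead master's key space is split among several recovery masters, and each
// backup routes every log entry it read from disk to the recovery master that
// owns that key.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct LogEntry { std::string key, value; };

struct RecoveryMaster {
    std::vector<LogEntry> rebuiltLog;               // replayed into DRAM + hash table
    void replay(const LogEntry& e) { rebuiltLog.push_back(e); }
};

// Each backup runs this over the segments it holds for the dead master.
void divideAmongRecoveryMasters(const std::vector<LogEntry>& segmentData,
                                std::vector<RecoveryMaster>& recoveryMasters) {
    for (const LogEntry& e : segmentData) {
        std::uint64_t h = std::hash<std::string>{}(e.key);
        std::size_t owner = h % recoveryMasters.size();
        recoveryMasters[owner].replay(e);           // a network transfer in reality
    }
}
```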

Scalability
[Recovery-time graph, from 6 masters / 12 SSDs / 3 GB recovered up to 80 masters / 160 SSDs / 40 GB recovered]
● Nearly flat (needs to reach masters)
● Expected faster: out of memory bandwidth (old servers)

Crash Recovery Conclusion
● Fast: under 2 seconds for memory sizes up to 40 GB
● Nearly “high availability”
● Much cheaper than DRAM replication (and it handles power outages)
● Leverages scale:
  ▪ Uses every disk, every NIC, every CPU, …
  ▪ Recover larger memory sizes with larger clusters
● Leverages low latency:
  ▪ Fast failure detection and coordination

Outline
1. RAMCloud basics
2. Fast crash recovery
3. New consensus algorithm called Raft

RAMCloud Architecture
[Same architecture diagram: 1000 – 100,000 application servers with the client library, 1000 – 10,000 commodity storage servers (master + backup), the datacenter network, and a single coordinator]
● Need to avoid a single point of failure!

Configuration Management
[Diagram: coordinator and backup coordinator backed by configuration management servers]
● Keep a single coordinator server
● Rely on an external system to store the top-level configuration:
  ▪ Cluster membership
  ▪ Data placement
● Typically consensus-based
● Examples:
  ▪ Chubby (Google)
  ▪ ZooKeeper (Yahoo/Apache)
● Unhappy with ZooKeeper, so decided to build our own

Replicated State Machines
● Approach to make any (deterministic) service fault-tolerant
● Each server’s state machine processes the same sequence of requests from its log
● Consensus algorithm: order requests into a replicated log
● Small number of servers (typically 5)
  ▪ Service available with any majority of servers (3)
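The idea in code form: a minimal, illustrative state machine that applies committed log entries in order. A key-value “set” command is assumed as the operation type; the real service and consensus machinery are omitted.

```cpp
// Illustrative sketch of a replicated state machine: every server applies the
// same committed commands, in log order, so all copies of the state stay
// identical.  The consensus algorithm's only job is to make the logs agree.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Command { std::string key, value; };       // e.g. "set key = value"

class StateMachine {
public:
    // commitIndex = number of log entries known to be committed.
    void applyCommitted(const std::vector<Command>& log, std::size_t commitIndex) {
        while (lastApplied_ < commitIndex) {
            const Command& cmd = log[lastApplied_++];
            state_[cmd.key] = cmd.value;          // apply exactly once, in order
        }
    }
    const std::map<std::string, std::string>& state() const { return state_; }

private:
    std::map<std::string, std::string> state_;    // identical on every server
    std::size_t lastApplied_ = 0;
};
```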

New Consensus Algorithm: Raft
● Initial plan: use Paxos
● Paxos is the “industry standard” consensus algorithm, but:
  ▪ Very hard to understand
  ▪ Not a good starting point for real implementations
● Our new algorithm: Raft
  ▪ Primary design goal: understandability
  ▪ Also must be practical and complete

Raft Overview
Designed from the start to manage a replicated log:
1. Elect a leader:
  ▪ A single server at a time creates new entries in the replicated log
  ▪ Elect a new leader when the leader fails
2. Replication:
  ▪ The leader makes other servers’ logs match its own, and
  ▪ notifies other servers when it’s safe to apply operations
3. Safety:
  ▪ Defines when operations are safe to apply
  ▪ Ensures that only servers with up-to-date logs can be elected leader
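As one concrete piece of the safety rules, here is a sketch of the log comparison a server uses when deciding whether to grant its vote, following the Raft paper’s definition of “up-to-date”; the surrounding RPC machinery is omitted.

```cpp
// Sketch of the "up-to-date log" check from the Raft paper: a server grants its
// vote only if the candidate's log is at least as up-to-date as its own, which
// ensures an elected leader already holds every committed entry.
#include <cstdint>
#include <vector>

struct Entry { std::uint64_t term; /* command omitted */ };

// Higher last term wins; if last terms are equal, the longer log wins.
bool candidateLogIsUpToDate(const std::vector<Entry>& myLog,
                            std::uint64_t candidateLastTerm,
                            std::uint64_t candidateLastIndex) {
    std::uint64_t myLastTerm  = myLog.empty() ? 0 : myLog.back().term;
    std::uint64_t myLastIndex = myLog.size();     // 1-based; 0 means an empty log
    if (candidateLastTerm != myLastTerm)
        return candidateLastTerm > myLastTerm;
    return candidateLastIndex >= myLastIndex;
}
```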

User Study
● Taught students both Paxos and Raft
● One-hour videos, one-hour quizzes, short survey
● Raft significantly easier for students to understand

Raft Conclusion
● Raft consensus algorithm
  ▪ Equivalent to Paxos, but designed to be easy to understand
  ▪ Decomposes the problem well
  ▪ Built to manage a replicated log, uses a strong leader
  ▪ Watch our videos!
● LogCabin: configuration service built on Raft
  ▪ Open source, C++
  ▪ Similar to ZooKeeper, not as many features yet
● At least 25 other implementations on GitHub
  ▪ go-raft (Go) quite popular, used by etcd in CoreOS

Overall Conclusion
● RAMCloud goals:
  ▪ Harness the full performance potential of DRAM-based storage
  ▪ Enable new applications: intensive manipulation of big data
● Achieved low latency (at small scale)
  ▪ Not yet at large scale (but scalability is encouraging)
● Fast crash recovery:
  ▪ Recovered 40 GB in under 2 seconds; more with larger clusters
  ▪ Durable + available DRAM storage for the cost of a volatile cache
● Raft consensus algorithm:
  ▪ Making consensus easier to learn and implement

Questions?
● Speaker: Diego Ongaro on Twitter
● Much more at
  ▪ Papers: Case for RAMCloud, Fast Crash Recovery, Log-Structured Memory, Raft, etc.
  ▪ Check out the RAMCloud and LogCabin code
  ▪ Ask questions on the ramcloud-dev and raft-dev mailing lists
  ▪ Watch the Raft and Paxos videos to learn consensus!
● Joint work with Asaf Cidon, Ankita Kejriwal, John Ousterhout, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, et al.