RAMCloud Overview and Update
John Ousterhout, Stanford University
SEDCL Retreat, June 2013

Slide 2: What is RAMCloud?

General-purpose storage system for large-scale applications:
● All data is stored in DRAM at all times
● Large scale: 1,000-10,000 servers, 100+ TB
● Low latency: 5-10 µs remote access time
● As durable and available as disk
● Simple key-value data model (for now)

Project goal: enable a new class of data-intensive applications

Slide 3: RAMCloud Architecture

[Diagram: 1,000-100,000 application servers, each running an application over the RAMCloud client library, connect through the datacenter network to 1,000-10,000 storage servers, each running a master and a backup; a coordinator manages the cluster.]
● Commodity servers
● High-speed networking:
  ▪ 5 µs round-trip
  ▪ Full bisection bandwidth

Slide 4: Data Model: Key-Value Store

Tables hold objects, each with:
● Key (≤ 64 KB)
● Version (64 bits)
● Blob (≤ 1 MB)

Operations:
read(tableId, key) => blob, version
write(tableId, key, blob) => version
cwrite(tableId, key, blob, version) => version (only overwrites if version matches)
delete(tableId, key)
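
To make the version-match semantics of cwrite concrete, here is a minimal C++ sketch of the four operations. It models a single in-memory store; it is not RAMCloud's client library, and the class and method names are illustrative inventions.

// Minimal sketch of the slide's four operations on a toy in-memory
// store. NOT RAMCloud's client library; all names here are invented.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

struct Object {
    std::string blob;
    uint64_t version = 0;   // 64-bit version, bumped on every write
};

class ToyStore {
    // tableId -> (key -> object)
    std::map<uint64_t, std::map<std::string, Object>> tables;

public:
    // read(tableId, key) => blob, version
    std::optional<Object> read(uint64_t tableId, const std::string& key) {
        auto t = tables.find(tableId);
        if (t == tables.end()) return std::nullopt;
        auto o = t->second.find(key);
        if (o == t->second.end()) return std::nullopt;
        return o->second;
    }

    // write(tableId, key, blob) => version (unconditional overwrite)
    uint64_t write(uint64_t tableId, const std::string& key,
                   const std::string& blob) {
        Object& obj = tables[tableId][key];
        obj.blob = blob;
        return ++obj.version;
    }

    // cwrite(tableId, key, blob, version) => version
    // Only overwrites if the caller's version matches the stored one.
    uint64_t cwrite(uint64_t tableId, const std::string& key,
                    const std::string& blob, uint64_t version) {
        auto existing = read(tableId, key);
        if (!existing || existing->version != version)
            throw std::runtime_error("version mismatch");
        return write(tableId, key, blob);
    }

    // delete(tableId, key)  ("remove" because delete is a C++ keyword)
    void remove(uint64_t tableId, const std::string& key) {
        tables[tableId].erase(key);
    }
};

int main() {
    ToyStore store;
    uint64_t v1 = store.write(1, "user:42", "alice");
    uint64_t v2 = store.cwrite(1, "user:42", "alice-v2", v1);  // succeeds
    std::cout << v1 << " -> " << v2 << "\n";                    // prints 1 -> 2
    try {
        store.cwrite(1, "user:42", "stale", v1);  // rejected: v1 is now stale
    } catch (const std::runtime_error& e) {
        std::cout << "cwrite failed: " << e.what() << "\n";
    }
}

The conditional write gives clients optimistic concurrency control: read an object, compute a new value, and cwrite it back; if another client changed the object in the meantime, the version check fails and the client retries.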

Slide 5: Buffered Logging

[Diagram: a write request arrives at the master, which appends it to an in-memory log (indexed by a hash table) and forwards it to buffered segments on several backups; each backup later flushes its buffered segment to disk.]
● Log-structured: both the backup disks and the master's memory
● No disk I/O during write requests
● Log cleaning ~ generational garbage collection
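
A sketch of this write path follows (illustrative only; these are not RAMCloud's actual classes). The point to notice: write() returns after the data is appended in the master's DRAM and buffered in the backups' DRAM, so no disk I/O is on the critical path.

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class Backup {
    std::vector<std::string> bufferedSegment;  // in-memory staging buffer
public:
    void bufferWrite(const std::string& entry) {
        bufferedSegment.push_back(entry);      // no disk I/O here
    }
    void flushToDisk() {
        // Runs asynchronously (e.g. when a segment fills): write the
        // buffered segment to disk sequentially, then recycle the buffer.
        bufferedSegment.clear();
    }
};

class Master {
    std::vector<std::string> inMemoryLog;              // copy served to readers
    std::unordered_map<std::string, size_t> hashTable; // key -> log position
    std::vector<Backup*> replicas;                     // e.g. 3 backups
public:
    explicit Master(std::vector<Backup*> backups)
        : replicas(std::move(backups)) {}

    void write(const std::string& key, const std::string& value) {
        std::string entry = key + "=" + value;
        inMemoryLog.push_back(entry);              // 1. append to in-memory log
        hashTable[key] = inMemoryLog.size() - 1;   // 2. update hash table
        for (Backup* b : replicas)                 // 3. buffer on each backup
            b->bufferWrite(entry);
        // 4. reply to the client now; backups flush to disk in the background
    }
};

int main() {
    Backup b1, b2, b3;
    Master m({&b1, &b2, &b3});
    m.write("user:42", "alice");  // acknowledged without touching disk
}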

Slide 6: Log-Structured Memory

● Don't use malloc for memory management
  ▪ Wastes 50% of memory
● Instead, structure memory as a log
  ▪ Allocate by appending
  ▪ Log cleaning to reclaim free space
  ▪ Control over pointers allows incremental cleaning

Slide 7: Log-Structured Memory, cont'd

● Creates a tradeoff: performance vs. memory utilization
● Two-level cleaner:
  ▪ Disk and memory
  ▪ Memory only (compaction)
● Concurrent cleaning
● Multiple cleaner threads
● 80-90% memory utilization feasible
● Paper under submission
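
A toy sketch of the ideas on slides 6-7, with invented names: allocation is just appending to the current segment, and a single-threaded, memory-only cleaning pass copies live entries into fresh segments, the "compaction" half of the two-level cleaner. The real system also cleans on disk and runs multiple cleaner threads concurrently with normal writes.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Entry {
    std::string key, value;
    bool live = true;
};

struct Segment {
    static constexpr size_t CAPACITY = 4;  // tiny, for illustration
    std::vector<Entry> entries;
    bool full() const { return entries.size() >= CAPACITY; }
};

class Log {
    std::vector<Segment> segments;
    // key -> (segment index, entry index) of the live version
    std::unordered_map<std::string, std::pair<size_t, size_t>> hashTable;

public:
    Log() : segments(1) {}

    // Allocate by appending; an overwrite just marks the old entry dead.
    void write(const std::string& key, const std::string& value) {
        auto it = hashTable.find(key);
        if (it != hashTable.end())
            segments[it->second.first].entries[it->second.second].live = false;
        if (segments.back().full())
            segments.emplace_back();
        segments.back().entries.push_back({key, value});
        hashTable[key] = {segments.size() - 1,
                          segments.back().entries.size() - 1};
    }

    // Compaction: copy live entries into fresh segments, dropping dead
    // ones. Because the log controls all pointers to entries (via the
    // hash table), entries can be relocated incrementally and safely.
    void clean() {
        std::vector<Segment> compacted(1);
        std::unordered_map<std::string, std::pair<size_t, size_t>> newTable;
        for (const Segment& seg : segments)
            for (const Entry& e : seg.entries)
                if (e.live) {
                    if (compacted.back().full())
                        compacted.emplace_back();
                    compacted.back().entries.push_back(e);
                    newTable[e.key] = {compacted.size() - 1,
                                       compacted.back().entries.size() - 1};
                }
        segments = std::move(compacted);
        hashTable = std::move(newTable);
    }
};

int main() {
    Log log;
    for (int i = 0; i < 16; ++i)  // 16 writes, but only 2 distinct keys
        log.write("k" + std::to_string(i % 2), "v" + std::to_string(i));
    log.clean();  // 14 dead entries are reclaimed; only 1 segment remains
}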

Slide 8: RAMCloud RPC

● Transport layer enables experimentation with different networking protocols/technologies
● Basic Infiniband performance (one switch):
  ▪ 100-byte reads: 4.9 µs
  ▪ 100-byte writes (3x replication): 15.3 µs
  ▪ Read throughput (100 bytes, 1 server): 700 Kops/sec
[Diagram: applications/servers call RPC stubs, which run over pluggable transports (TCP, UDP, InfUD, InfEth, custom datagram) onto the underlying networks, e.g. Infiniband queue pairs.]
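
The shape of such a pluggable transport layer might look like the sketch below (class names are illustrative, not RAMCloud's actual code): RPC stubs are written once against an abstract interface, and each network technology plugs in underneath.

#include <string>

class Transport {
public:
    virtual ~Transport() = default;
    // Send a request and block until the response arrives.
    virtual std::string sendRequest(const std::string& request) = 0;
};

class TcpTransport : public Transport {
public:
    std::string sendRequest(const std::string& request) override {
        // ... write the request to a TCP socket, read the response ...
        return "tcp-response to: " + request;
    }
};

class InfUdTransport : public Transport {  // Infiniband unreliable datagram
public:
    std::string sendRequest(const std::string& request) override {
        // ... post a send work request to an Infiniband queue pair ...
        return "infud-response to: " + request;
    }
};

// An RPC stub, written once, works over any transport:
std::string readObject(Transport& t, const std::string& key) {
    return t.sendRequest("READ " + key);
}

int main() {
    TcpTransport tcp;
    InfUdTransport infud;
    readObject(tcp, "user:42");    // same stub, kernel TCP stack
    readObject(infud, "user:42");  // same stub, kernel-bypass Infiniband
}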

Slide 9: RAMCloud Crash Recovery

● Each master scatters its segment replicas across the entire cluster
● On a crash:
  ▪ Coordinator partitions the dead master's tablets
  ▪ Partitions are assigned to different recovery masters
  ▪ Log data is shuffled from backups to the recovery masters
  ▪ Recovery masters replay the log entries
● Total recovery time: 1-2 s
[Diagram: the crashed master's data flows from backups to many recovery masters in parallel.]
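
A sketch of this parallel recovery flow (illustrative only): the coordinator splits the crashed master's data into partitions, and each recovery master replays its partition's log entries concurrently, which is what makes 1-2 s total recovery time possible.

#include <functional>
#include <string>
#include <thread>
#include <vector>

struct LogEntry {
    std::string key, value;
};

// One recovery master: in the real system it would first fetch the
// relevant segment replicas from backups scattered across the cluster.
void recoverPartition(const std::vector<LogEntry>& entries) {
    for (const LogEntry& e : entries) {
        // Replay: re-insert e into this master's log and hash table.
        (void)e;
    }
}

int main() {
    // Coordinator's view: the dead master's log entries, partitioned
    // (here into 4 partitions, chosen for illustration).
    std::vector<std::vector<LogEntry>> partitions(4);
    partitions[0].push_back({"a", "1"});
    partitions[3].push_back({"z", "26"});

    // Assign each partition to a different recovery master; all replay
    // their share of the log in parallel.
    std::vector<std::thread> recoveryMasters;
    for (const auto& p : partitions)
        recoveryMasters.emplace_back(recoverPartition, std::cref(p));
    for (std::thread& t : recoveryMasters)
        t.join();
}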

Slide 10: Status at June 2012 Retreat

● Major system components (barely) working:
  ▪ RPC transports (timeout mechanism new)
  ▪ Basic key-value store (variable-length keys new)
  ▪ Log-structured memory management (log cleaner new)
  ▪ Crash recovery (backup recovery new)
  ▪ Coordinator overhaul just starting
● Overall project goal: push towards a 1.0 release
  ▪ "Least usable system" for real applications

Slide 11: Current Status

● Core system becoming stable
  ▪ Not quite at 1.0, but close!
● First PhDs coming soon:
  ▪ Ryan Stutsman: crash recovery
  ▪ Steve Rumble: log-structured memory
● Many research opportunities still left
● About to start new projects:
  ▪ Higher-level data model
  ▪ New RPC mechanism
  ▪ Cluster management

Slide 12: Progress Towards RAMCloud 1.0

● Fault-tolerant coordinator (Ankita Kejriwal):
  ▪ Server and tablet configuration info now durable
● LogCabin configuration manager (Diego Ongaro)
● Additional recovery mechanisms (Ryan Stutsman):
  ▪ Simultaneous server failures
  ▪ Cold start
  ▪ Overhaul of backup storage management
● RPC retry (John Ousterhout)
● Overhaul of cluster membership management (Stephen Yang):
  ▪ More robust
  ▪ Better performance

Slide 13: Progress Towards 1.0, cont'd

● New in-memory log architecture (Steve Rumble)
● Automated crash tester (Arjun Gopalan):
  ▪ Synthetic workload with consistency checks
  ▪ Force servers to crash randomly
  ▪ Multiple simultaneous failures, coordinator failures
● Goal: run the crash tester for a few weeks with no loss of data
● Current status:
  ▪ Can survive some coordinator and master failures
  ▪ Others still cause crashes
  ▪ Working through bugs
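
The crash tester's structure might look roughly like the sketch below. It is illustrative only: the Cluster class here is a stand-in whose "crashes" never lose data, so the check always passes; the real tester drives an actual RAMCloud cluster.

#include <cstdint>
#include <map>
#include <random>
#include <stdexcept>
#include <string>

class Cluster {  // stand-in for the cluster under test
    std::map<std::string, std::string> data;
public:
    void write(const std::string& k, const std::string& v) { data[k] = v; }
    std::string read(const std::string& k) { return data.at(k); }
    void crashRandomServer() { /* kill a server; crash recovery kicks in */ }
};

int main() {
    Cluster cluster;
    std::map<std::string, std::string> expected;  // model of acked writes
    std::mt19937 rng(42);

    for (int step = 0; step < 10000; ++step) {
        // Synthetic workload whose correct contents are predictable.
        std::string key = "k" + std::to_string(rng() % 100);
        std::string value = "v" + std::to_string(step);
        cluster.write(key, value);
        expected[key] = value;   // record only once the write is acked

        if (rng() % 500 == 0)    // randomly crash servers mid-workload
            cluster.crashRandomServer();

        // Consistency check: every acknowledged write must survive.
        for (const auto& [k, v] : expected)
            if (cluster.read(k) != v)
                throw std::runtime_error("data loss at key " + k);
    }
}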

Slide 14: Configuration Management

● External system for durable storage of top-level configuration information:
  ▪ Cluster membership
  ▪ Tablet configuration
● Typically consensus-based:
  ▪ Chubby (Google)
  ▪ ZooKeeper (Yahoo/Apache)
● Unhappy with ZooKeeper, so decided to build our own (Diego Ongaro):
  ▪ Development started before last year's retreat
  ▪ Initial plan: use the Paxos protocol
[Diagram: the coordinator and a backup coordinator both rely on a set of configuration management servers.]

Slide 15: New Consensus Algorithm: Raft

● Paxos is the "industry standard", but:
  ▪ Very hard to understand
  ▪ Not a good starting point for real implementations
● Our new algorithm: Raft
  ▪ Primary design goal: understandability
  ▪ Must also be practical and complete
  ▪ Result: a new approach to consensus
● Designed around a replicated log from the start
● Strong leader
● A user study shows that Raft is significantly easier to understand than Paxos (stay tuned...)
● Paper under submission
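
To make "strong leader" and "replicated log from the start" concrete, here is a heavily simplified sketch of a Raft follower's log and the AppendEntries consistency check. Elections, commit tracking, and RPC plumbing are omitted, and the class layout is invented for illustration.

#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
    uint64_t term;        // leader's term when the entry was created
    std::string command;  // state-machine command
};

class Follower {
    std::vector<LogEntry> log;  // log index i lives at log[i-1]
public:
    // The leader sends new entries together with the index and term of
    // the entry that immediately precedes them in its own log. The
    // follower accepts only if its log matches at that point; otherwise
    // the leader retries with an earlier prevIndex until the logs agree.
    bool appendEntries(uint64_t prevIndex, uint64_t prevTerm,
                       const std::vector<LogEntry>& entries) {
        if (prevIndex > 0) {
            if (log.size() < prevIndex)
                return false;                       // gap: reject
            if (log[prevIndex - 1].term != prevTerm)
                return false;                       // conflict: reject
        }
        // Drop any conflicting suffix, then append the new entries.
        log.resize(prevIndex);
        log.insert(log.end(), entries.begin(), entries.end());
        return true;
    }
};

int main() {
    Follower f;
    f.appendEntries(0, 0, {{1, "x=1"}});             // accepted: log empty
    bool ok = f.appendEntries(5, 1, {{2, "y=2"}});   // rejected: gap at 5
    (void)ok;
}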

Slide 16: API for Consensus

● Key-value store (Chubby, ZooKeeper)?
  ▪ Applications really want a log?
  ▪ Why build a log on a key-value store on a log?
● Collection of logs (LogCabin)?
  ▪ First approach for RAMCloud, based on Raft
  ▪ Used in the current coordinator implementation
  ▪ However, the log turned out not to be convenient after all
● TreeHouse: key-value store on Raft?
● Export an API for the replicated state machine?
[Diagram: replicated log -> replicated state machine -> ?API?]
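
The last option could be sketched as follows (illustrative names, not an actual design from the project): the application supplies a deterministic apply(), and the consensus layer feeds committed log entries to it in order, so every replica computes the same state.

#include <string>

class StateMachine {
public:
    virtual ~StateMachine() = default;
    // Called once per committed entry, in log order, on every replica.
    virtual void apply(const std::string& command) = 0;
};

class CounterMachine : public StateMachine {  // a trivial example app
    long count = 0;
public:
    void apply(const std::string& command) override {
        if (command == "increment") ++count;
    }
};

// Inside the consensus module, called as each log entry commits:
void onCommit(StateMachine& sm, const std::string& committedEntry) {
    sm.apply(committedEntry);
}

int main() {
    CounterMachine counter;
    onCommit(counter, "increment");  // same calls happen on every replica
    onCommit(counter, "increment");
}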

Slide 17: New Work: Data Model

● Goal: a higher-level data model than just a key-value store:
  ▪ Secondary indexes?
  ▪ Transactions spanning multiple objects and servers?
  ▪ Graph-processing primitives (sets)?
● Can RAMCloud support these without sacrificing latency or scalability?
● Design work just getting underway (Ankita Kejriwal)

Slide 18: New Work: Datacenter RPC

Complete redesign of RAMCloud RPC:
● General purpose (not just RAMCloud)
● Latency:
  ▪ Even lower latency?
  ▪ Explore alternative threading strategies
● Scale:
  ▪ Support 1M clients/server (minimal state per connection)
● Network protocol/API:
  ▪ Optimize for kernel bypass
  ▪ Minimize buffering
  ▪ Congestion control: reservation-based?

Slide 19: Current Status

● Core system becoming stable
  ▪ Not quite at 1.0, but close!
● First PhDs coming soon:
  ▪ Ryan Stutsman: crash recovery
  ▪ Steve Rumble: log-structured memory
● Many research opportunities still left
● About to start new projects:
  ▪ Higher-level data model
  ▪ New RPC mechanism
  ▪ Cluster management