What We Have Learned From RAMCloud
John Ousterhout, Stanford University
(with Asaf Cidon, Ankita Kejriwal, Diego Ongaro, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, and Stephen Yang)
October 16, 2012

Introduction

A collection of broad conclusions we have reached during the RAMCloud project:
● Randomization plays a fundamental role in large-scale systems
● Need new paradigms for distributed, concurrent, fault-tolerant software
● Exciting opportunities in low-latency datacenter networking
● Layering conflicts with latency
● Don't count on locality
● Scale can be your friend

RAMCloud Overview

Harness the full performance potential of large-scale DRAM storage:
● General-purpose key-value storage system
● All data always in DRAM (no cache misses)
● Durable and available
● Scale: 1000-10,000 servers, 100+ TB
● Low latency: 5-10 µs remote access
Potential impact: enable a new class of applications (see the sketch below)
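To make the latency point concrete, here is a back-of-the-envelope sketch; the 100 ms response budget and the comparison latencies are assumptions for illustration, not figures from the talk. At 5-10 µs per read, an application can issue orders of magnitude more sequential, dependent reads per interactive response than with conventional networked storage.

```cpp
// Back-of-the-envelope sketch (numbers are assumptions, not measurements from
// the talk): how many sequential, dependent storage reads fit into one
// interactive response budget at different per-read latencies. This is the
// sense in which 5-10 us remote access enables a new class of applications.
#include <cstdio>

int main() {
    const double budgetUs = 100000.0;                 // assume a 100 ms response budget
    const double latenciesUs[] = {500.0, 10.0, 5.0};  // e.g. networked disk/flash store vs. RAMCloud range
    for (double rttUs : latenciesUs) {
        std::printf("per-read latency %6.1f us -> ~%7.0f sequential reads per response\n",
                    rttUs, budgetUs / rttUs);
    }
    return 0;
}
```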

RAMCloud Architecture

[Diagram: 1000-100,000 application servers, each running an application linked with the RAMCloud client library, connect through the datacenter network to 1000-10,000 commodity storage servers; each storage server runs a master and a backup, and a coordinator manages the cluster.]
High-speed networking:
● 5 µs round-trip
● Full bisection bandwidth

Randomization

Randomization plays a fundamental role in large-scale systems
● Enables decentralized decision-making
● Example: load balancing of segment replicas. Goals:
  ○ Each master decides where to replicate its own segments: no central authority
  ○ Distribute each master's replicas uniformly across the cluster
  ○ Uniform usage of secondary storage on backups
[Diagram: each master (M1, M2, M3) scatters replicas of its segments (M1,S2; M2,S1; M3,S9; ...) across the cluster's backups.]

Randomization, cont'd

● Choose the backup for each replica at random?
  ○ Uneven distribution: worst case = 3-5x average
● Use Mitzenmacher's approach (sketched below):
  ○ Probe several randomly selected backups
  ○ Choose the most attractive one
  ○ Result: almost uniform distribution
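A minimal sketch of this power-of-choices placement, assuming a simple load metric (bytes already stored on each backup); the metric, segment size, and all names are illustrative, not RAMCloud's exact policy.

```cpp
// Sketch of the power-of-choices placement described above: probe a few
// randomly selected backups and keep the least-loaded one. The attractiveness
// metric (bytes already stored) and all names are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

struct Backup { uint64_t bytesStored = 0; };

// Probe `probes` random backups, return the index of the least-loaded one.
size_t chooseBackup(const std::vector<Backup>& backups, int probes, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, backups.size() - 1);
    size_t best = pick(rng);
    for (int i = 1; i < probes; i++) {
        size_t candidate = pick(rng);
        if (backups[candidate].bytesStored < backups[best].bytesStored)
            best = candidate;
    }
    return best;
}

int main() {
    std::mt19937 rng(42);
    std::vector<Backup> backups(1000);
    const uint64_t segmentBytes = 8 * 1024 * 1024;     // assume 8 MB segments
    for (int seg = 0; seg < 100000; seg++) {
        size_t b = chooseBackup(backups, 3, rng);      // 3 probes per replica
        backups[b].bytesStored += segmentBytes;
    }
    uint64_t maxBytes = 0, totalBytes = 0;
    for (const Backup& b : backups) {
        maxBytes = std::max(maxBytes, b.bytesStored);
        totalBytes += b.bytesStored;
    }
    std::printf("most loaded backup holds %.2fx the average\n",
                double(maxBytes) / (double(totalBytes) / backups.size()));
    return 0;
}
```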

Sometimes Randomization is Bad!

● Select 3 backups for each segment at random?
● Problem:
  ○ In a large-scale system, any 3 machine failures result in data loss
  ○ After a power outage, ~1% of servers don't restart
  ○ Every power outage loses data!
● Solution: derandomize backup selection (sketched below)
  ○ Pick the first backup at random (for load balancing)
  ○ Choose the other backups deterministically (replication groups)
  ○ Result: data safe for hundreds of years
  ○ (but, more data is lost in each loss event)
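Here is a sketch of the derandomized scheme, under the assumption that backups are statically partitioned into fixed groups of three and every replica of a segment goes to one group; the grouping detail and all names are illustrative, not RAMCloud's actual code.

```cpp
// Sketch of derandomized selection using fixed replication groups, under the
// assumption that backups are statically partitioned into groups of three and
// all replicas of a segment live in one group. A segment can then be lost only
// if an entire group fails together, which is far less likely than "any 3 of N"
// failing. Names and structure are illustrative.
#include <random>
#include <vector>

static const int kReplicas = 3;

// Backups 0..numBackups-1 are partitioned into consecutive groups of kReplicas.
// Choose the backups for one segment: a random group (for load balancing),
// then deterministic members within that group.
std::vector<int> chooseReplicaGroup(int numBackups, std::mt19937& rng) {
    int numGroups = numBackups / kReplicas;
    std::uniform_int_distribution<int> pickGroup(0, numGroups - 1);
    int group = pickGroup(rng);
    std::vector<int> replicas;
    for (int i = 0; i < kReplicas; i++)
        replicas.push_back(group * kReplicas + i);
    return replicas;
}

int main() {
    std::mt19937 rng(1);
    // 999 backups -> 333 disjoint groups; all three replicas of this segment
    // land in one of them.
    std::vector<int> replicas = chooseReplicaGroup(999, rng);
    (void)replicas;
    return 0;
}
```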

DCFT Code is Hard

● RAMCloud often requires code that is distributed, concurrent, and fault-tolerant:
  ○ Replicate a segment to 3 backups
  ○ Coordinate 100 masters working together to recover a failed server
  ○ Concurrently read segments from ~1000 backups, replay log entries, re-replicate to other backups
● Traditional imperative programming doesn't work: must "go back" after failures
● Result: spaghetti code, brittle, buggy

DCFT Code: Need New Pattern

● Experimenting with a new approach: more like a state machine
● Code divided into smaller units (see the sketch below):
  ○ Each unit handles one invariant or transition
  ○ Event driven (sort of)
  ○ Serialized access to shared state
● These ideas are still evolving
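A minimal sketch of what such a unit might look like for one task (replicating a segment to 3 backups), assuming a simple rules-plus-shared-state structure; this is illustrative, not RAMCloud's actual code.

```cpp
// Minimal sketch of the state-machine-like pattern described above, for one
// task: replicate a segment to 3 backups. Small "rules" each handle one
// condition or transition, shared state is accessed under a lock, and the
// rules are re-applied after every event (RPC completion, backup failure),
// so there is no imperative control flow to "go back" through after a failure.
#include <mutex>
#include <vector>

enum class ReplicaState { NONE, SENDING, DONE, FAILED };

struct SegmentReplication {
    std::mutex lock;                                   // serialized access to shared state
    std::vector<ReplicaState> replicas{ReplicaState::NONE, ReplicaState::NONE, ReplicaState::NONE};
};

// Stub for issuing an asynchronous replication RPC to some backup.
void startReplicationRpc(int replicaIndex) { (void)replicaIndex; }

// Apply the rules once; call this whenever an event arrives.
// Returns true once the goal invariant (3 durable replicas) holds.
bool applyRules(SegmentReplication& s) {
    std::lock_guard<std::mutex> guard(s.lock);
    bool done = true;
    for (size_t i = 0; i < s.replicas.size(); i++) {
        switch (s.replicas[i]) {
        case ReplicaState::NONE:                       // rule: no replica yet, or the
        case ReplicaState::FAILED:                     // previous one failed -> start a new RPC
            startReplicationRpc(static_cast<int>(i));
            s.replicas[i] = ReplicaState::SENDING;
            done = false;
            break;
        case ReplicaState::SENDING:                    // rule: RPC outstanding -> just wait
            done = false;
            break;
        case ReplicaState::DONE:                       // invariant satisfied for this replica
            break;
        }
    }
    return done;
}

int main() {
    SegmentReplication segment;
    applyRules(segment);                               // would be driven by an event loop
    return 0;
}
```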

Low-Latency Networking

● Datacenter evolution, phase #1: scale
● Datacenter evolution, phase #2: latency
  ○ Typical round-trip in 2010: 300 µs
  ○ Feasible today: 5-10 µs
  ○ Ultimate limit: < 2 µs
● No fundamental technological obstacles, but need new architectures (see the sketch below):
  ○ Must bypass the OS kernel
  ○ New integration of the NIC into the CPU
  ○ New datacenter network architectures (no buffers!)
  ○ New network/RPC protocols: user-level, scale, latency (1M clients/server?)
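A sketch of the kernel-bypass style these bullets call for: a user-level dispatcher that busy-polls the NIC instead of taking interrupts. The NIC and RPC interfaces below are invented placeholders, not any real NIC or RPC API.

```cpp
// Hypothetical sketch of the kernel-bypass structure implied by the bullets
// above: the dispatcher busy-polls a memory-mapped, user-space NIC queue
// instead of taking an interrupt and a kernel crossing per packet. UserNic,
// Packet, and handleRpc are invented placeholders.
#include <cstddef>
#include <optional>

struct Packet {
    const void* payload;
    std::size_t length;
};

// Stand-in for a user-space receive queue mapped directly from the NIC.
class UserNic {
public:
    // Non-blocking poll: returns a packet if one has arrived. No system call
    // or interrupt on the fast path. (Stubbed out here.)
    std::optional<Packet> poll() { return std::nullopt; }
};

void handleRpc(const Packet& request) { (void)request; }   // application RPC handler (stub)

void dispatchLoop(UserNic& nic) {
    // Busy-polling burns a core, but avoids interrupt and kernel-crossing
    // overheads and keeps round trips in the few-microsecond range.
    for (;;) {
        if (std::optional<Packet> pkt = nic.poll()) {
            handleRpc(*pkt);
        }
        // A real dispatcher would also poll transmit completions and timers here.
    }
}

int main() {
    // Not invoked here: dispatchLoop(nic) would spin forever by design.
    UserNic nic;
    (void)nic;
    return 0;
}
```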

Layering Conflicts With Latency

The most obvious way to build software: lots of layers. For low latency, must rearchitect with fewer layers.
[Diagram comparing two stacks between Application and Network:]
● Many thin layers: developed bottom-up; problems: complex, high latency
● Few layers: harder to design (top-down and bottom-up), but better architecturally (simpler)

Don't Count On Locality

● Greatest drivers for software and hardware systems over the last 30 years:
  ○ Moore's Law
  ○ Locality (caching, de-dup, rack organization, etc.)
● Large-scale Web applications have huge datasets but less locality (see the sketch below):
  ○ Long tail
  ○ Highly interconnected (social graphs)
[Plot: access frequency vs. items, showing traditional applications concentrated on a few hot items and Web applications with a long tail.]
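To quantify the long-tail point, here is a small illustrative calculation; the item count, cache size, and Zipf exponent are assumptions, not data from the talk. Even a cache holding the hottest 10% of items still misses a nontrivial fraction of accesses, which is why caching alone is less effective for these workloads.

```cpp
// Illustrative calculation: hit rate of a cache that holds the most popular
// items, when accesses follow a Zipf distribution (exponent 1). Numbers are
// assumptions for illustration, not measurements from the talk.
#include <cstdio>

int main() {
    const long numItems = 100000000;        // assume 100M items
    const long cachedItems = numItems / 10; // cache the hottest 10%
    // For Zipf(1), popularity of rank i ~ 1/i, so the fraction of accesses that
    // hit the top k items is approximately H(k)/H(n), using harmonic sums H.
    double hTop = 0.0, hAll = 0.0;
    for (long i = 1; i <= numItems; i++) {
        double p = 1.0 / double(i);
        hAll += p;
        if (i <= cachedItems) hTop += p;
    }
    std::printf("caching %ld%% of items -> ~%.0f%% hit rate, ~%.0f%% of accesses still miss\n",
                100 * cachedItems / numItems, 100.0 * hTop / hAll, 100.0 * (1.0 - hTop / hAll));
    return 0;
}
```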

Make Scale Your Friend

● Large-scale systems create many problems:
  ○ Manual management doesn't work
  ○ Reliability is much harder to achieve
  ○ "Rare" corner cases happen frequently
● However, scale can be a friend as well as an enemy:
  ○ RAMCloud fast crash recovery (see the sketch below)
    - Use 1000's of servers to recover failed masters quickly
    - Since crash recovery is fast, "promote" all errors to server crashes
  ○ Windows Error Reporting (Microsoft)
    - Automated bug reporting
    - Statistics identify the most important bugs
    - Correlations identify buggy device drivers
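A back-of-the-envelope sketch of the fast-crash-recovery argument; all numbers are assumptions, not RAMCloud's measured results. Because a master's log is scattered across many backups, recovery bandwidth scales with the number of servers that participate.

```cpp
// Back-of-the-envelope sketch of why scale makes fast crash recovery possible.
// All numbers are illustrative assumptions: the crashed master's log is
// scattered across many backups, and many recovery masters read and replay it
// in parallel, so recovery time shrinks roughly with cluster participation.
#include <cstdio>
#include <initializer_list>

int main() {
    const double crashedMasterGB = 64.0;   // assume 64 GB of DRAM to reconstruct
    const double perServerGBps  = 0.6;     // assume ~600 MB/s usable per server
                                           // (storage read + network + log replay)
    for (int servers : {1, 10, 100, 1000}) {
        double seconds = crashedMasterGB / (perServerGBps * servers);
        std::printf("%5d recovery servers -> ~%7.1f s to restore %.0f GB\n",
                    servers, seconds, crashedMasterGB);
    }
    return 0;
}
```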

Conclusion

Build big => learn big
● My pet peeve: too much "summer project research"
  ○ 2-3 month projects
  ○ Motivated by conference paper deadlines
  ○ Superficial, not much deep learning
● Trying to build a large system that really works is hard, but intellectually rewarding:
  ○ Exposes interesting side issues
  ○ Important problems identify themselves (recurrences)
  ○ Deeper evaluation (real use cases)
  ○ Shared goal creates teamwork, intellectual exchange
  ○ Overall, deep learning