
Fast Crash Recovery in RAMCloud

Motivation
The role of DRAM has been increasing
– Facebook used 150TB of DRAM for 200TB of disk storage
However, there are limitations
– DRAM is typically used as a cache
– Need to worry about consistency and cache misses

RAMCloud
Keeps all data in RAM at all times
– Designed to scale to thousands of servers
– Hosts terabytes of data
– Provides low latency (5-10 µs) for small reads
Design goals
– High durability and availability without compromising performance

Alternatives
3x replication in RAM
– 3x cost and energy
– Still vulnerable to power failures
RAMCloud keeps one copy in RAM
– Plus two copies on disk
To achieve good availability
– Fast crash recovery (64GB in 1-2 seconds)

RAMCloud Basics
Thousands of off-the-shelf servers
– Each with 64GB of RAM
– With Infiniband NICs
Remote access below 10 µs

Data Model
Key-value store: tables of objects
– Object = 64-bit ID + byte array (up to 1MB) + 64-bit version number (see the sketch below)
No atomic updates across multiple objects
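As a rough illustration of this data model, here is a minimal sketch; the type names and the Table alias are assumptions for illustration, not RAMCloud's actual definitions.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a RAMCloud-style object: a 64-bit identifier,
// an opaque value of up to 1 MB, and a 64-bit version number.
struct Object {
    uint64_t id;                 // object identifier within a table
    std::vector<uint8_t> value;  // opaque byte array, limited to 1 MB
    uint64_t version;            // incremented on every update
};

// A table is simply a collection of objects indexed by ID; there are no
// multi-object transactions, so each write touches exactly one object.
using Table = std::unordered_map<uint64_t, Object>;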

System Structure
A large number of storage servers; each server hosts (sketched below)
– A master, which manages objects in local DRAM and services requests
– A backup, which stores copies of data from other masters on durable storage
A coordinator
– Manages configuration info and object locations
– Not involved in most requests
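A minimal sketch of how these roles sit together on one machine; the class and method names are illustrative only, not RAMCloud's API.

#include <cstdint>

// Illustrative only: every storage server runs both a master service and a
// backup service; a single coordinator elsewhere tracks configuration.
class MasterService {
public:
    // Serves reads and writes against objects kept entirely in local DRAM.
    void handleRequest(uint64_t tableId, uint64_t objectId) { /* ... */ }
};

class BackupService {
public:
    // Stores replicas of log segments belonging to masters on other servers.
    void storeSegment(uint64_t masterId, uint64_t segmentId) { /* ... */ }
};

struct StorageServer {
    MasterService master;  // owns a share of the data, in DRAM
    BackupService backup;  // keeps other masters' data durable on disk
};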

RAMCloud Cluster Architecture (diagram showing clients, masters, backups, and the coordinator)

More on the Coordinator
Maps objects to servers in units of tablets (lookup sketched below)
– A tablet holds a consecutive key range within a single table
For locality reasons
– Small tables are stored entirely on a single server
– Large tables are split across servers
Clients cache the tablet map so they can access servers directly
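A rough sketch of the tablet idea, assuming a flat list of key ranges; the types here are illustrative, not the coordinator's real data structures.

#include <cstdint>
#include <string>
#include <vector>

// Illustrative tablet map: each tablet covers a contiguous range of 64-bit
// keys within one table and is owned by a single master server.
struct Tablet {
    uint64_t tableId;
    uint64_t startKey;     // inclusive
    uint64_t endKey;       // inclusive
    std::string serverId;  // master currently responsible for this range
};

// Clients cache this mapping, so most requests go straight to the right
// master without involving the coordinator at all.
const Tablet* findTablet(const std::vector<Tablet>& tablets,
                         uint64_t tableId, uint64_t key) {
    for (const Tablet& t : tablets)
        if (t.tableId == tableId && key >= t.startKey && key <= t.endKey)
            return &t;
    return nullptr;  // unknown range: fall back to asking the coordinator
}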

Log-structured Storage
Logging approach (write path sketched below)
– Each master keeps its data in an in-memory log
– Log entries are forwarded to backup servers
– Backup servers buffer log entries in battery-backed memory
– Writes complete once all backup servers acknowledge
– A backup flushes its buffer to disk when a segment fills
– 8MB segments are the unit of logging, buffering, and I/O
– Each server can handle 300K 100-byte writes/sec
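The write path could be sketched roughly as follows; the class layout, method names, and the replication factor of 3 are assumptions for illustration, not RAMCloud's actual interfaces.

#include <cstdint>
#include <vector>

// Hypothetical sketch of the durable write path: append to the master's
// in-memory log, hand the entry to each backup's battery-backed segment
// buffer, and report success only after every backup has acknowledged.
// Backups flush a buffered 8 MB segment to disk only when it fills.
struct Backup {
    std::vector<uint8_t> buffer;  // battery-backed segment buffer
    bool bufferEntry(const std::vector<uint8_t>& entry) {
        buffer.insert(buffer.end(), entry.begin(), entry.end());
        return true;              // ack: safe even if this backup loses power
    }
};

struct Master {
    std::vector<uint8_t> log;      // in-memory log, the primary copy
    std::vector<Backup*> backups;  // assumed 3 replicas per segment

    bool write(const std::vector<uint8_t>& entry) {
        log.insert(log.end(), entry.begin(), entry.end());
        for (Backup* b : backups)
            if (!b->bufferEntry(entry))  // wait for every backup's ack
                return false;
        return true;                     // write is now complete
    }
};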

Recovery
When a server crashes, its DRAM contents must be reconstructed
A 1-2 second recovery time is good enough

Using Scale
Simple 3-replica approach
– Recovery limited by the speed of three disks
– ~3.5 minutes to read 64GB of data
Scatter segments over 1,000 disks (arithmetic below)
– Takes ~0.6 seconds to read 64GB
– But a centralized recovery master becomes a bottleneck
– A 10 Gbps network means ~1 min to transfer 64GB to the centralized master
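For reference, the arithmetic behind these figures, assuming roughly 100 MB/s of sequential read bandwidth per disk (an assumption, not stated on the slide):
– 3 disks × 100 MB/s = 300 MB/s, so 64 GB / 300 MB/s ≈ 213 s ≈ 3.5 minutes
– 1,000 disks × 100 MB/s = 100 GB/s, so 64 GB / 100 GB/s ≈ 0.64 s
– 10 Gbps ≈ 1.25 GB/s, so funneling 64 GB through one recovery master's NIC takes 64 / 1.25 ≈ 51 s, about a minute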

RAMCloud
Uses 100 recovery masters
– Cuts the time down to 1 second

Scattering Log Segments
Ideally a uniform random spread, but with additional constraints (placement sketched below)
– Need to avoid correlated failures
– Need to account for heterogeneity of hardware
– Need to avoid overflowing the buffers of individual machines
– Need to account for changing server membership due to failures
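A minimal sketch of constrained random placement; the rack-awareness check, buffer check, and retry count are illustrative assumptions, and the real algorithm also weighs factors such as disk speed.

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Candidate backup for holding one replica of a segment.
struct BackupInfo {
    std::string serverId;
    std::string rackId;      // used to avoid correlated failures
    size_t freeBufferBytes;  // avoid overflowing any one machine's buffer
};

// Pick a backup at random, rejecting candidates that would put two replicas
// of the same segment in one rack or overflow the candidate's buffer.
int chooseBackup(const std::vector<BackupInfo>& backups,
                 const std::vector<int>& alreadyChosen,
                 size_t segmentBytes, std::mt19937& rng) {
    if (backups.empty())
        return -1;
    std::uniform_int_distribution<size_t> pick(0, backups.size() - 1);
    for (int attempt = 0; attempt < 100; ++attempt) {
        size_t i = pick(rng);
        bool sameRack = false;
        for (int j : alreadyChosen)
            if (backups[j].rackId == backups[i].rackId)
                sameRack = true;
        if (!sameRack && backups[i].freeBufferBytes >= segmentBytes)
            return static_cast<int>(i);
    }
    return -1;  // no acceptable candidate found after 100 tries
}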

Failure Detection
Servers periodically ping randomly chosen servers
– 99% chance of detecting a failed server within 5 rounds (rough derivation below)
Recovery then proceeds in three phases
– Setup
– Replay
– Cleanup
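A rough justification for the 99% figure, assuming each of N servers pings one randomly chosen peer per round (an assumption about the ping scheme, not stated on the slide): the probability that a given dead server is pinged by no one in a round is about (1 - 1/N)^N ≈ 1/e ≈ 0.37 for large N, so after 5 independent rounds it remains undetected with probability about 0.37^5 ≈ 0.007, i.e. it is detected with better than 99% probability.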

Setup
Coordinator finds log segment replicas
– By querying all backup servers
Detecting incomplete logs
– Logs are self-describing
Starting partition recoveries
– Each master periodically uploads a "will" to the coordinator, to be used in the event of its demise (sketched below)
– The coordinator carries out the will accordingly
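The will could be pictured as a list of partitions, each covering some of the master's tablets and sized so that one recovery master can replay it quickly; the field names below are illustrative, not an actual wire format.

#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch of a master's "will": how its tablets should be split
// into partitions if it dies, so that each partition can be replayed by a
// single recovery master.
struct TabletRange {
    uint64_t tableId;
    uint64_t startKey;
    uint64_t endKey;
};

struct Partition {
    std::vector<TabletRange> tablets;  // key ranges assigned to this partition
    uint64_t estimatedBytes;           // used to keep partitions balanced
};

struct Will {
    std::string masterId;              // the master this will belongs to
    std::vector<Partition> partitions; // the coordinator assigns each of these
                                       // to a different recovery master
};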

Replay
Recovery masters replay segments in parallel (replay loop sketched below)
Six stages of pipelining
– At segment granularity
– The same ordering of operations on all segments avoids pipeline stalls
Only the primary replicas are involved in recovery
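A minimal sketch of the core of replay on a recovery master: walk a segment's log entries and keep only the newest version of each object. The names and the single-key hash table are simplifying assumptions, not RAMCloud's real structures.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Each log entry carries the object plus its version; stale overwrites may
// appear in older segments, so replay keeps only the highest version seen.
struct LogEntry {
    uint64_t objectId;
    uint64_t version;
    std::vector<uint8_t> value;
};

using HashTable = std::unordered_map<uint64_t, LogEntry>;  // keyed by objectId

void replaySegment(const std::vector<LogEntry>& segment, HashTable& objects) {
    for (const LogEntry& e : segment) {
        auto it = objects.find(e.objectId);
        if (it == objects.end() || it->second.version < e.version)
            objects[e.objectId] = e;  // newer version wins
    }
}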

Cleanup
Bring the new master online to serve requests
Free up segments left over from the previous crash

Consistency
Exactly-once semantics
– Implementation not yet complete
ZooKeeper handles coordinator failures
– A distributed configuration service
– With its own replication

Additional Failure Modes
Current focus
– Recovering DRAM contents after a single master failure
Failed backup server
– Need to know which segments lost replicas on that server
– Re-replicate those segments across the remaining disks

Multiple Failures
Multiple servers fail simultaneously
– Recover each failure independently
– Some recoveries will involve secondary replicas
Based on projections
– With 5,000 servers, recovering 40 masters within a rack would take about 2 seconds
Can't do much when many racks are blacked out

Cold Start
Complete power outage
– Backups contact the coordinator as they reboot
– Need a quorum of backups before starting to reconstruct masters
Current implementation does not perform cold starts

Evaluation
60-node cluster; each node has
– 16GB RAM, 1 disk
– Infiniband (25 Gbps)
User-level apps can talk to the NICs directly, bypassing the kernel

Results
Can recover lost data at 22 GB/s
– A crashed server with 35 GB of storage can be recovered in 1.6 seconds
Recovery time stays nearly flat scaling from 1 to 20 recovery masters, each talking to 6 disks
– Going to 60 recovery masters adds only 10 ms of recovery time
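As a quick consistency check on these numbers: 35 GB recovered in 1.6 seconds is 35 / 1.6 ≈ 22 GB/s of aggregate recovery bandwidth, matching the headline figure.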

Results
Fast recovery significantly reduces the risk of data loss
– Assuming a recovery time of 1 sec
– The risk of data loss for a 100K-node cluster in one year
– A 10x improvement in recovery time improves reliability by 1,000x
Assumes independent failures

Theoretical Recovery Speed Limit
Hard to go much faster than a few hundred msec
– ~150 msec to detect the failure
– ~100 msec to contact every backup
– ~100 msec to read a single segment from disk
(these alone sum to roughly 350 msec)

Risks
Scalability conclusions are based on a small cluster
Performance glitches can be treated as failures
– Triggering unnecessary recoveries
Access patterns can change dynamically
– May lead to unbalanced load