Metrics for RAMCloud Recovery
John Ousterhout
Stanford University

Logging/Recovery Basics
● All data/changes appended to a log:
  ○ One log for each master (kept in DRAM)
  ○ Log data is replicated on disk on 2+ backups
● During recovery:
  ○ Read data from disks on backups
  ○ Replay log to recreate data on master
● Recovery must be fast: 1-2 seconds!
  ○ Only one copy of data in DRAM
  ○ Data unavailable until recovery completes
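A minimal sketch of this logging model, not RAMCloud's actual code: each master keeps its log in DRAM and replicates every appended entry to backups, which hold it on disk; recovery replays a backup's copy. The class and method names here are illustrative assumptions.

```python
class Backup:
    def __init__(self):
        self.disk = []                    # stand-in for the on-disk log copy

    def write(self, entry):
        self.disk.append(entry)           # persist the replicated entry

class Master:
    def __init__(self, backups):
        self.log = []                     # the in-DRAM log (the only DRAM copy)
        self.backups = backups            # 2+ backups hold disk replicas

    def append(self, key, value):
        entry = (key, value)
        self.log.append(entry)            # append to the DRAM log
        for b in self.backups:            # replicate to the chosen backups
            b.write(entry)

def recover(backup):
    """Replay one backup's copy of the log to rebuild the master's data."""
    table = {}
    for key, value in backup.disk:
        table[key] = value                # later entries overwrite earlier ones
    return table
```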

Recovery, First Try
● Master chooses backups statically
  ○ Each backup stores the entire log for the master
● Crash recovery:
  ○ Choose recovery master
  ○ Backups read log info from disk
  ○ Transfer logs to recovery master
  ○ Recovery master replays log
● First bottleneck: disk bandwidth (worked out in the sketch below):
  ○ 64 GB / 2 backups / 100 MB/sec/disk ≈ 320 seconds
● Solution: more disks (more backups)
[Diagram: Recovery Master, Backups]
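A back-of-the-envelope check of the disk-bandwidth bottleneck; the numbers (64 GB log, 2 backups, 100 MB/sec per disk) come straight from the slide.

```python
# Each of the 2 backups must read its full copy of the 64 GB log from disk.
data_mb       = 64 * 1000      # crashed master's DRAM log, in MB
backups       = 2              # each stores a full copy of the log
disk_mb_per_s = 100            # sequential read bandwidth per disk

seconds = (data_mb / backups) / disk_mb_per_s
print(seconds)                 # 320.0 seconds -- far from the 1-2 second goal
```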

Recovery, Second Try
● Scatter logs:
  ○ Each log divided into 8 MB segments
  ○ Master chooses different backups for each segment (randomly)
  ○ Segments scattered across all servers in the cluster
● Crash recovery:
  ○ All backups read from disk in parallel
  ○ Transmit data over network to recovery master
[Diagram: Recovery Master, ~1000 Backups]
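A sketch of the scattering step: cut the log into 8 MB segments and place each segment's disk replicas on randomly chosen backups, so recovery can read the whole log in parallel. The uniform-random placement with no further constraints is a simplifying assumption.

```python
import random

SEGMENT_MB    = 8
DISK_REPLICAS = 2                          # disk copies per segment

def scatter_segments(log_size_mb, backup_ids):
    """Return {segment_index: [ids of backups holding its replicas]}."""
    placement = {}
    for seg in range(log_size_mb // SEGMENT_MB):
        placement[seg] = random.sample(backup_ids, DISK_REPLICAS)
    return placement

placement = scatter_segments(64 * 1024, list(range(1000)))   # 64 GB, 1000 backups
print(len(placement))                      # 8192 segments spread over the cluster
```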

Scattered Logs, cont'd
● Disk no longer a bottleneck (arithmetic worked through below):
  ○ 64 GB / 8 MB/segment / 1000 backups ≈ 8 segments/backup
  ○ 100 ms/segment to read from disk
  ○ 0.8 second to read all segments in parallel
● Second bottleneck: NIC on recovery master
  ○ 64 GB / 10 Gbits/second ≈ 60 seconds
● Solution: more NICs
  ○ Spread work over 100 recovery masters
  ○ 64 GB / 10 Gbits/second / 100 masters ≈ 0.6 second
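Rough arithmetic behind these numbers (the slide rounds generously): parallel disk reads finish in under a second, a single recovery master's 10 Gbit/s NIC becomes the next bottleneck, and spreading work over about 100 recovery masters removes it.

```python
segments       = 64 * 1024 // 8           # 64 GB in 8 MB segments -> 8192
per_backup     = segments / 1000          # ~8 segments per backup
disk_seconds   = per_backup * 0.1         # ~100 ms per segment -> ~0.8 s total

nic_gbit_per_s = 10
one_master_s   = 64 * 8 / nic_gbit_per_s  # 512 Gbit / 10 -> ~51 s (~60 s on slide)
many_masters_s = one_master_s / 100       # ~0.5 s with 100 recovery masters
print(disk_seconds, one_master_s, many_masters_s)
```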

Recovery, Third Try
● Divide each master's data into partitions
  ○ Recover each partition on a separate recovery master
  ○ Partitions based on tables & key ranges, not log segments
  ○ Each backup divides its log data among the recovery masters (see the sketch below)
[Diagram: Dead Master, Recovery Masters, Backups]
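A sketch of partitioned recovery: the dead master's data is split into partitions by table and key range, each partition is assigned to one recovery master, and every backup routes each log object to the recovery master that owns its key. The partition boundaries, recovery-master names, and routing rule below are illustrative, not RAMCloud's actual data structures.

```python
# (table_id, first_key, last_key) -> recovery master id; hypothetical values
partitions = [
    (("T1", 0,         4_999_999), "rm-01"),
    (("T1", 5_000_000, 9_999_999), "rm-02"),
    (("T2", 0,         9_999_999), "rm-03"),
]

def owner(table_id, key):
    """Which recovery master should replay the object (table_id, key)?"""
    for (tbl, lo, hi), master in partitions:
        if tbl == table_id and lo <= key <= hi:
            return master
    raise KeyError("no partition covers this object")

def divide_segment(segment_objects):
    """A backup filters one segment: bucket each object by its owner."""
    buckets = {}
    for table_id, key, value in segment_objects:
        buckets.setdefault(owner(table_id, key), []).append((table_id, key, value))
    return buckets

print(divide_segment([("T1", 7_000_000, b"v1"), ("T2", 42, b"v2")]))
# {'rm-02': [('T1', 7000000, b'v1')], 'rm-03': [('T2', 42, b'v2')]}
```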

Parallelism in Recovery
[Pipeline diagram; labels: Log, Hash Table, Recovery Master, Backup]
● Backup stages: read disk → filter log data → respond to requests from masters; receive backup data → write backup data to disk
● Recovery master stages: request data from backups → add new info to hash table & log → log new data to backups
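A sketch of how the recovery-master stages can be overlapped: one thread keeps requesting filtered segment data from backups while another replays it into the hash table and log. Queue-based pipelining is the general idea being illustrated; the thread layout and function names are assumptions, not RAMCloud internals.

```python
import queue, threading

def fetch_stage(segment_sources, q):
    for fetch in segment_sources:          # each callable returns one segment's objects
        q.put(fetch())                     # "request data from backups"
    q.put(None)                            # sentinel: no more segments

def replay_stage(q, hash_table, log):
    while (objects := q.get()) is not None:
        for key, value in objects:
            hash_table[key] = value        # "add new info to hash table & log"
            log.append((key, value))       # would also be replicated to new backups

def recover_partition(segment_sources):
    q, hash_table, log = queue.Queue(maxsize=4), {}, []
    t = threading.Thread(target=fetch_stage, args=(segment_sources, q))
    t.start()
    replay_stage(q, hash_table, log)       # replay overlaps with fetching
    t.join()
    return hash_table

# Example: two "segments", each a callable returning (key, value) pairs.
sources = [lambda: [(1, "a"), (2, "b")], lambda: [(2, "c")]]
print(recover_partition(sources))          # {1: 'a', 2: 'c'}
```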

Sample Performance Questions
● How fast is recovery?
● What is the current bottleneck?
● How do we fix it?
● What will the next bottleneck be?
● What is the ultimate limit for performance?
● If the network bandwidth drops, will it hurt recovery performance?
● Someone just made a commit and the behavior/performance is different; why?

Measure Deeper
● Performance measurements are often wrong/counterintuitive
● Measure components of performance
● Understand why performance is what it is
[Diagram: stack from Application down to Hardware: "Want to understand performance here? Then measure down to here"]

Utilizations
Where is time being spent? (% of recovery time in each major activity)
● Backup:
  ○ Disk: reading log data, writing new backup data
  ○ CPU: filtering logs
  ○ Network: transmitting logs to recovery masters, receiving new backup data
● Recovery masters:
  ○ CPU: processing incoming log data
  ○ Network: receiving log data, writing new backup data
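One way to collect these utilization numbers: wrap each major activity in a timer and report its share of total recovery time. The activity names are the ones listed above; the metrics class itself is an illustrative sketch, not RAMCloud's instrumentation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class RecoveryMetrics:
    def __init__(self):
        self.elapsed = defaultdict(float)
        self.start = time.monotonic()

    @contextmanager
    def activity(self, name):
        t0 = time.monotonic()
        try:
            yield
        finally:
            self.elapsed[name] += time.monotonic() - t0   # accumulate time in activity

    def utilizations(self):
        total = time.monotonic() - self.start
        return {name: secs / total for name, secs in self.elapsed.items()}

metrics = RecoveryMetrics()
with metrics.activity("disk: reading log data"):
    time.sleep(0.01)            # stand-in for real work
print(metrics.utilizations())   # fraction close to 1.0 in this toy run
```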

Utilizations, cont'd
● Stalls:
  ○ Master waiting for incoming log data
  ○ Master waiting to write new backup data
  ○ Backup delaying RPC response because segment not yet filtered
  ○ Disk reads delayed because disk busy writing
  ○ Can't send RPC because NIC busy
● Break out low-level factors if significant:
  ○ Memcpy for master copying data into log
  ○ Memcpy for backup copying filtered segment data
  ○ …

Efficiency
How efficient are individual components/algorithms?
● Backups:
  ○ Disk read/write bandwidths
  ○ Network in/out bandwidths
  ○ RPCs/sec.
  ○ Filtering: bytes/sec., objects/sec.
● Masters:
  ○ Network in/out bandwidths
  ○ RPCs/sec.
  ○ Efficiency of concurrent RPCs
  ○ Log processing: bytes/sec., objects/sec.
● Miscellaneous:
  ○ memcpy bandwidth
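A sketch of how such efficiency metrics could be derived: count work and the time spent doing it, then divide to get bytes/sec., objects/sec., and RPCs/sec. Field names and the counter class are assumptions; real counters would be incremented in the I/O paths.

```python
class EfficiencyCounters:
    def __init__(self):
        self.bytes = 0
        self.objects = 0
        self.rpcs = 0
        self.seconds = 0.0

    def record(self, nbytes, nobjects=0, nrpcs=0, seconds=0.0):
        self.bytes += nbytes
        self.objects += nobjects
        self.rpcs += nrpcs
        self.seconds += seconds

    def rates(self):
        s = self.seconds or 1e-9    # avoid division by zero
        return {"bytes/sec": self.bytes / s,
                "objects/sec": self.objects / s,
                "rpcs/sec": self.rpcs / s}

disk_read = EfficiencyCounters()
disk_read.record(nbytes=8 * 10**6, nrpcs=1, seconds=0.08)   # one 8 MB segment in 80 ms
print(disk_read.rates())            # bytes/sec -> 1e8, i.e. ~100 MB/sec
```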

Miscellaneous
● Total times:
  ○ For the entire recovery
  ○ For each master to complete its recovery
● Total data volumes:
  ○ Read by each backup from disk
  ○ New data received by each server
  ○ Total new data written to server log
  ○ Total new live data after recovery
● Uniformity of filtering: how evenly does segment data divide between servers? (one way to quantify this is sketched below)
● Unexpected events:
  ○ Reading secondary copies of segments
  ○ Crashes of recovery masters/backups
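A possible way to quantify "uniformity of filtering": compare how many bytes of filtered segment data each recovery master received. The coefficient of variation (std. dev. / mean) is one reasonable summary, chosen here for illustration; the sample numbers are made up.

```python
import statistics

bytes_per_master = [640, 655, 612, 701, 648]   # MB received by each recovery master (hypothetical)

mean = statistics.mean(bytes_per_master)
cv = statistics.pstdev(bytes_per_master) / mean
print(f"mean = {mean:.0f} MB, coefficient of variation = {cv:.3f}")
# A value near 0 means the partitions divided the log data evenly.
```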

Aggregation
● Collect info separately on each master/backup
  ○ Information applies to a single recovery
  ○ Write to local log at end of recovery
● Coordinator collects data from each master/backup at the end of recovery
  ○ Log raw data on coordinator
  ○ Aggregate to generate overall statistics, such as:
    ● Overall utilizations
    ● Effective disk bandwidth (read/write) for the entire cluster
    ● Effective network bandwidth for the entire cluster
    ● Avg. disk utilization for backups reading log data
    ● Disk bandwidth on individual backups: avg., best, worst, std. dev. (see the sketch below)
  ○ Log aggregate info
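A sketch of the aggregation step: each master/backup reports its per-recovery numbers to the coordinator, which logs the raw data and derives cluster-wide statistics (averages, best/worst, standard deviation). The record fields are illustrative, not the actual RAMCloud metrics format.

```python
import statistics

def aggregate(reports):
    """reports: list of dicts, one per server, e.g. {'disk_mb_per_s': 96.3, ...}."""
    summary = {}
    for field in reports[0]:
        values = [r[field] for r in reports]
        summary[field] = {
            "avg":    statistics.mean(values),
            "best":   max(values),
            "worst":  min(values),
            "stddev": statistics.pstdev(values),
        }
    return summary

reports = [{"disk_mb_per_s": 96.3}, {"disk_mb_per_s": 101.8}, {"disk_mb_per_s": 74.9}]
print(aggregate(reports)["disk_mb_per_s"])
```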