
Analysis of RAMCloud Crash Probabilities
Asaf Cidon, Stanford University

Outline
● Motivation
● Segment Loss Probabilities
● Simultaneous Crash Probabilities
● Average Segment Loss Rate
● Numerical Results
● Conclusions & Takeaways

Motivation
● Challenge the assumptions of RAMCloud’s recovery mechanism
  ○ Are 2 disk backups per segment enough?
  ○ Is it a good idea to back up every segment independently and randomly?
  ○ If we suddenly lose all of our memory (e.g., a power outage), is our data protected?
● Estimate the rate and probability of segment loss in RAMCloud

Segment Loss Probabilities

Assumptions
1. Each master places one copy of each of its segments on each of two different backups, chosen uniformly at random and independently
2. A backup cannot hold a segment that belongs to a master with the same master index

Probability of Segment Loss for 2 Backups
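A rough sketch of the calculation behind this slide, under the placement assumptions above (not necessarily the exact expression from the talk); N denotes the number of machines and S the number of segments per master, both symbols introduced only for this sketch. Given that two specific backups crash simultaneously, a particular segment loses both disk copies only if its randomly chosen backup pair is exactly the crashed pair:

\[
\Pr[\text{segment lost from disk}] = \frac{1}{\binom{N-1}{2}},
\qquad
\Pr[\text{at least one of } S \text{ segments lost}] \approx 1 - \left(1 - \frac{1}{\binom{N-1}{2}}\right)^{S}.
\]

Cluster-wide, with roughly N masters, the exponent grows to about N·S, which is what makes the loss probability non-negligible at scale.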

Probability of Segment Loss for 2 Backups and 1 Master
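A similar sketch for this case, again under the assumptions above: a segment disappears from both memory and disk only if its master and both of its backups are among the crashed machines. Given three specific machines crashing simultaneously, each of the three may be the master of S segments whose backup pair must be exactly the other two crashed machines, so

\[
\Pr[\text{at least one segment lost from memory and disk}] \approx 1 - \left(1 - \frac{1}{\binom{N-1}{2}}\right)^{3S}.
\]

For N = 1,000 and S = 8,000 this gives roughly \(1 - e^{-3 \cdot 8000 / 498501} \approx 4.7\%\), in line with the ~5% figure in the conclusions.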

More of the Same

Intermezzo
● What have we done so far?
  ○ We calculated the probability of losing at least one copy of a segment on disk, given two simultaneous backup failures
  ○ We calculated the probability of losing at least one copy of a segment on disk and in memory, given the simultaneous failure of three machines
● What can we do now?
  ○ Try to estimate the rate of simultaneous machine failures in a RAMCloud data center
  ○ Estimate RAMCloud’s annual segment loss rate

Simultaneous Crashes
[Figure: timeline showing failures of machines M1, M4, and M7 against time t, with a window of length T]

Assumptions
1. All machines fail independently of each other
2. Each individual machine fails at a low rate
3. Number of machines >> 1
4. Constant recovery time for all machines
5. If a single machine fails, there is a time slot of 2T when other machine failures count as simultaneous failures

Average Simultaneous Crash Rate
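A first-order sketch of this rate under the assumptions on the previous slide (not necessarily the talk’s exact formula), writing λ for the per-machine failure rate and T for the recovery time, with 2T the simultaneity window:

\[
R_2 \approx N(N-1)\,\lambda^{2}\,(2T),
\qquad
R_3 \approx N\binom{N-1}{2}\,\lambda^{3}\,(2T)^{2},
\]

i.e., the rate Nλ at which some machine fails, times the probability that one (respectively two) of the remaining machines also fails inside the 2T window. The exact constants depend on how the window and the different-racks restriction are accounted for, so these should be read as approximations.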

Average Segment Loss Rate
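As a sketch of how the pieces combine (again, an approximation rather than the slide’s exact expression): the annual segment loss rate is roughly the rate of simultaneous crashes of the relevant size multiplied by the probability that such a crash actually destroys a segment. For 2 backups, where complete loss requires a master and both of its backups to crash together,

\[
\text{annual segment loss rate} \approx R_3 \cdot \left[1 - \left(1 - \frac{1}{\binom{N-1}{2}}\right)^{3S}\right],
\]

and the 3-backup case uses \(R_4\) together with the probability that a master and all three of its backups fall among four simultaneously crashed machines.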

Numerical Results
● Segment loss probabilities are accurate (a simulation cross-check is sketched below)
● Annual simultaneous crash rates and annual segment loss rates are only lower bounds; the real numbers are probably higher
● We do not take into account rare but feasible data-center-wide crash scenarios (e.g., power outages)
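The analytical loss probabilities are straightforward to cross-check by simulation. Below is an illustrative Monte Carlo sketch, not the code behind the tables that follow; the machine count, segment count, and number of trials are assumptions chosen to mirror the 1,000-machine, 8,000-segment configuration. It estimates the probability of losing a segment from both memory and disk when three machines crash simultaneously under uniform random two-way replication.

```python
# Monte Carlo cross-check of the "2 backups, 3 simultaneous crashes" case
# under uniform random replica placement. Illustrative sketch only; the
# parameters are assumptions, not the exact configuration from the talk.
import random

def estimate_loss_probability(n_machines=1000, segs_per_master=8000,
                              crashed=3, trials=1000):
    """Estimate P(at least one segment is lost from both memory and disk)
    when `crashed` machines fail within the same recovery window."""
    losses = 0
    for _ in range(trials):
        failed = set(random.sample(range(n_machines), crashed))
        lost = False
        # A segment vanishes completely only if its master crashed AND
        # both of its disk copies sit on the other crashed machines.
        for master in failed:
            others = [m for m in range(n_machines) if m != master]
            for _ in range(segs_per_master):
                b1, b2 = random.sample(others, 2)  # random backup pair
                if b1 in failed and b2 in failed:
                    lost = True
                    break
            if lost:
                break
        losses += lost
    return losses / trials

if __name__ == "__main__":
    # Expect roughly 0.05 for 1,000 machines with 8,000 segments each,
    # matching the ~5% figure quoted in the conclusions.
    print(estimate_loss_probability())
```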

Numerical Results: Segment Disk Loss
Segment Loss on Disk Probabilities (8,000 segments per machine, 50 machines per rack)
Number of machines: 1,000 | 10,000 | 100,000 | 1,000,000
(probability values not legible in the transcript)

Numerical Results: Segment Disk Loss
Segment Loss on Disk Probabilities (1,000 segments per machine, 50 machines per rack)
Number of machines: 1,000 | 10,000 | 100,000 | 1,000,000
(probability values not legible in the transcript)

Numerical Results: Segment Disk & Memory Loss
Segment Loss on Disk and Memory Probabilities (8,000 segments per machine, 50 machines per rack)
Number of machines: 1,000 | 10,000 | 100,000 | 1,000,000
(probability values not legible in the transcript)

Numerical Results: Rates of Simultaneous Crashes
Annual Simultaneous Crash Rate (each machine fails 2 times a year, 50 machines per rack)
Number of machines: 1,000 | 10,000 | 100,000 | 1,000,000
Annual rate of 2 machines failing simultaneously (different racks): values not legible in the transcript
Annual rate of 3 machines failing simultaneously (different racks): values not legible in the transcript

Numerical Results: Segment Loss Rates
Annual Segment Loss Rate (each machine fails 2 times a year, 50 machines per rack)
Number of machines: 1,000 | 10,000 | 100,000 | 1,000,000
Annual segment loss rate for 2 backups, 3 crashes: values not legible in the transcript
Annual segment loss rate for 3 backups, 4 crashes: 1.313E-13 at 1,000 machines; remaining values not legible in the transcript

Conclusions & Takeaways
● In case of a big data center crash, 2 backups are not enough
  ○ For example: a power outage takes out all memory; if two machines out of a 1,000-machine cluster fail to reboot, the probability of data loss is ~100%
● 2 backups are also risky in case of 3 simultaneous failures
  ○ ~5% probability of data loss with 1,000 machines (a quick check of both figures is sketched below)
● If our independent-crash model is a good approximation, 2 backups are safe for ordinary crash scenarios
● In most cases, using 3 backups instead of 2 significantly reduces crash probabilities
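A back-of-the-envelope check of the two headline numbers, assuming 8,000 segments per machine as in the disk-loss table above and uniform random placement: with N = 1,000 machines there are \(\binom{999}{2} = 498{,}501\) possible backup pairs. After a power outage in which 2 machines fail to reboot, every DRAM copy is gone, so a segment survives only through its disk copies; the expected number of segments with both copies on the two dead machines is about \(1000 \times 8000 / 498{,}501 \approx 16\), giving \(\Pr[\text{some loss}] \approx 1 - e^{-16} \approx 100\%\). For three ordinary simultaneous crashes, the expected number of fully lost segments is about \(3 \times 8000 / 498{,}501 \approx 0.048\), giving \(\Pr[\text{some loss}] \approx 1 - e^{-0.048} \approx 4.7\%\), consistent with the ~5% above.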

Suggestions for Improvement
● The number of backups per segment should be a configurable system parameter
● Consider using 3 backups for important data and 2 backups for ordinary data
  ○ Pros: lower data loss rate; provides a majority in case of inconsistencies
  ○ Con: higher I/O bandwidth for writes
● Consider backing up segments in bigger chunks
  ○ Pros: lower data loss rate; recovery time determined by the slowest machine; fewer backups are easier to manage (smaller tables, less coordination)
  ○ Con: bigger chunks → lower recovery throughput

THANK YOU