Probabilistic Consistency and Durability in RAINS: Redundant Array of Independent, Non-Durable Stores
Andy Huang and Armando Fox, Stanford University
Software Infrastructures Group, Stanford University - ROC Retreat, Lake Tahoe, CA, 10 June 2002

Motivation: [app needs] ≠ [storage guarantees]
- Some types of state in interactive Web applications don't require absolute consistency and durability
  - Examples: shopping cart state, session state
  - Characteristics: lifetime on the order of weeks or minutes; lost updates are not critical
- This state is often stored in storage solutions that provide absolute guarantees (e.g., DBMS, DDS)
- Applications pay the cost of these policies even if they don't need the guarantees
  - Node failure recovery: some read/write operations are suspended
  - Scaling: downtime of some partitions, plus administration

Proposal: Relax consistency and durability
- Approach: Provide a storage solution that apps can tune for desired levels of consistency and durability
- Proposed solution: Store state in a redundant array of independent, non-durable stores (RAINS)
  - Independent: no coordination (such as two-phase commit) among nodes
  - Non-durable: a node doesn't necessarily recover its state after a crash (like a Web cache)
- Hypothesis: Storing data in a RAINS significantly reduces the costs of node failures and scaling
  - Fast recovery (no need to recover data)
  - Reads and writes allowed during recovery and scaling

RAINS Hashtable Overview
- Architecture: App → RLib (RAINS Library) → N RStores (RAINS Stores)
- RLib presents a hashtable-like API to the App
  - Constructor: hashtable(L, P_m), where L = lifetime and P_m = minimum P(consistency)
  - RLib uses L and P_m to determine N
  - Uses some policy to determine the return value of get (e.g., majority)
- Each RStore is an unreliable, non-durable hashtable (RLib can't tell whether a put succeeds or whether a get returns an up-to-date value)
- Consistency and durability mechanism: write to and read from N RStores
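
To make the division of labor concrete, here is a minimal Python sketch of what an RLib-style front end could look like. The class and method names are illustrative only (the slides don't specify an implementation), and the get policy shown is a simple plurality vote standing in for the majority policy described on a later slide.

```python
from collections import Counter

class RLibSketch:
    """Hypothetical hashtable-like front end over N unreliable, non-durable RStores."""

    def __init__(self, rstores, lifetime, p_min):
        # In the real design, N would be derived from lifetime (L) and p_min (P_m);
        # this sketch simply uses every store it is handed.
        self.rstores = rstores

    def put(self, key, value):
        for store in self.rstores:
            try:
                store.put(key, value)   # may silently fail; RLib cannot tell
            except Exception:
                pass                    # a failed put just becomes a lost replica

    def get(self, key):
        values = []
        for store in self.rstores:
            try:
                v = store.get(key)      # may be stale, missing, or time out
            except Exception:
                continue                # timeouts/corruption count as failures (A5)
            if v is not None:
                values.append(v)
        if not values:
            return None                 # no RStore still holds the key
        # Plurality vote as a stand-in for the "majority" policy on a later slide.
        value, _ = Counter(values).most_common(1)[0]
        return value
```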

Probabilistic Model for Determining N
- Given: hashtable(L, P_m)
  - L = lifetime
  - P_m = minimum consistency probability = min(P_C)
- Assume we can calculate these values:
  - P_put = P(put succeeds)
  - P_get = P(get returns a valid value if one exists)
  - T_R = frequency of RStore reboots (chosen so that crashes are very unlikely between reboots, based on the TTF distribution)
- Find:
  - N = number of RStores to write and read = f(L, P_m, P_put, P_get, T_R), such that for all t ≤ L, P_C ≥ P_m
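
Assuming P_C(t) can be evaluated for a given replica count (using the read-set and policy formulas on the later slides), choosing N reduces to a small search. The sketch below is a hypothetical illustration: it treats time as discrete and takes the consistency function as a pluggable argument.

```python
def smallest_n(lifetime, p_min, p_consistency, n_max=64):
    """Return the smallest replica count n such that P_C(t) >= P_m for all t <= L.

    p_consistency(n, t) must compute P_C at (discrete) time t after the put for
    n replicas written; its form depends on the read-set model and get policy.
    """
    for n in range(1, n_max + 1):
        if all(p_consistency(n, t) >= p_min for t in range(lifetime + 1)):
            return n
    return None  # no feasible N found within n_max
```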

P(consistency) Decreases Over Time
[Figure: plot of P_C = f(N, P_put, P_get, T_R) versus time t since the put (t_put = 0). P_C decreases over time; the constraint is that it stays above P_m for all t ≤ L, with T_R marking the reboot period.]

Assumptions
1. The network is perfect and node failures are independent ⇒ individual put and get operations are independent of each other
2. Data isn't recovered on node recovery
3. Rolling reboots: reboots are spaced at regular intervals
4. A distinct value is written on each put
5. RLib can detect corrupt values (e.g., using a CRC)

Read Set – RS(t)
- RS(t) = number of RStores that still have the data from t = t_put
- A2 (data isn't recovered on recovery)
  - ⇒ RS(t) is monotonically decreasing
  - RS(t) = number of RStores that haven't been rebooted since t = t_put
- A3 (reboots spaced at regular intervals)
  - ⇒ RS(t) is a step function bounded by a linear function (decreasing at a constant rate)

Read Set Example: N = 8
[Figure: timeline of N = 8 RStores with reboots staggered across one reboot period T_R. At t_put all 8 RStores hold the value (RS = 8); by t_get, several have rebooted and lost it, leaving RS = 3.]

Read Set Function
RS(t) = ⌈n(1 − t/T_R)⌉
[Figure: step function of RS(t) versus t, starting at n at t = t_put = 0 and stepping down to 0 by t = T_R.]
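
A direct transcription of the read-set function, assuming rolling reboots that retire replicas at a constant rate (function and parameter names are illustrative):

```python
from math import ceil

def read_set_size(n, t, t_r):
    """RS(t) = ceil(n * (1 - t / T_R)), clipped at 0 once every replica has rebooted."""
    return max(ceil(n * (1 - t / t_r)), 0)
```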

Possible Read Outcomes
1. Valid, up-to-date value (U): P_U = P_put × P_get
2. Valid, stale value (S): P_S = (1 − P_put) × P_get
   - By A4 (each put value is distinct), a stale value is counted as inconsistent; in practice it is still possible that S = U, so P_C errs on the pessimistic side
3. Fail - timeout or corrupt value (F): P_F = 1 − P_get
   - By A5 (RLib can detect corrupt values using a CRC), corrupt values and timeouts can be conflated
4. Key not found (N)
   - If a valid value (U/S) exists, assume the "not found" came from a rebooted RStore
   - ⇒ N is not part of the Read Set
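
For reference, the per-replica outcome probabilities follow directly from P_put and P_get; a minimal helper (hypothetical name):

```python
def outcome_probabilities(p_put, p_get):
    """Per-replica read-outcome probabilities from the definitions above."""
    p_u = p_put * p_get          # valid, up-to-date value
    p_s = (1 - p_put) * p_get    # valid, stale value
    p_f = 1 - p_get              # fail: timeout or corrupt value
    return p_u, p_s, p_f
```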

Policies for Determining the Return Value of get
- Let n = RS(t)
- Most up-to-date timestamp:
  - P_C = 1 − (1 − P_U)^n
- Majority: U_n > n/2
  - Let X_i = 1 if the i-th value is up-to-date, 0 if it is a fail or stale
  - Let U_n = X_1 + X_2 + … + X_n (binomial distribution)
  - P(U_n = i) = C(n, i) · P_U^i · (1 − P_U)^(n−i)
  - P_C = P(U_n > n/2) = Σ_{i = n/2+1}^{n} P(U_n = i)
- Majority over Read Set: U_n > S_n
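
The first two policies translate directly into closed forms over the read set. The sketch below computes P_C for each, given n = RS(t) and P_U (function names are illustrative). The third policy (majority over the read set) would additionally need P_S and a sum over joint counts of up-to-date and stale replies, so it is omitted here.

```python
from math import comb

def p_c_newest_timestamp(n, p_u):
    """Policy 1: return the value with the most recent timestamp.
    The read is consistent unless all n read-set replicas miss the up-to-date value."""
    return 1 - (1 - p_u) ** n

def p_c_majority(n, p_u):
    """Policy 2: return the majority value; consistent when U_n > n/2,
    where U_n ~ Binomial(n, p_u)."""
    return sum(comb(n, i) * p_u ** i * (1 - p_u) ** (n - i)
               for i in range(n // 2 + 1, n + 1))
```

Combined with read_set_size and smallest_n sketched earlier, these give one plausible way to evaluate N = f(L, P_m, P_put, P_get, T_R) end to end.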

Questions
- How big is the win?
  - Can we achieve throughput as good as or better than DDS?
  - How cheap can we make failures?
- What is the cost?
  - Can we achieve decent levels of consistency and durability (even with small clusters)?
  - Is there an interesting set of applications that can tolerate relaxed consistency and durability?

Conclusion
- Problem: Storage solutions force applications to pay the cost of absolute consistency and durability even when they aren't needed
- Proposal: Store state in a RAINS, which allows applications to specify the desired level of consistency and durability
- Hypothesis: Relaxing the levels of consistency and durability will drastically decrease the costs of node failures and scaling