Decoupled Storage: “Free the Replicas!”
Andy Huang and Armando Fox
Stanford University
What is decoupled storage (DeStor)?
- Goal: an application-level persistent storage system for Internet services
  - Good recovery behavior
  - Predictable performance
- Related projects
  - A decoupled version of DDS (Gribble)
  - Federated Array of Bricks (HP Labs), but at the application level
  - Session State Server (Ling), but for persistent state
Outline
- Dangers of coupling
- Techniques for decoupling
- Consequences
ROWA – coupling and recovery don’t mix
- Read One (i.e., any): all copies must be consistent
  - Availability coupling: data is locked during recovery to bring a replica up to date
- Write All: writes proceed at the rate of the slowest replica
  - Performance coupling: the system can grind to a halt if one replica degrades
  - Possible causes of degradation: cache warming and garbage collection
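For concreteness, here is a minimal sketch of the classic ROWA discipline over in-memory replica stubs; the Replica class and its methods are illustrative assumptions, not an interface from the talk.

    # Minimal ROWA sketch (illustrative Replica stub, not the talk's interface).
    class Replica:
        def __init__(self):
            self.store = {}

        def write(self, key, val):
            # Every replica must apply the write for ROWA to stay consistent.
            self.store[key] = val

        def read(self, key):
            return self.store.get(key)

    def rowa_write(replicas, key, val):
        # "Write All": completes only after every replica has applied the write,
        # so a single slow or recovering replica stalls the whole operation.
        for r in replicas:
            r.write(key, val)

    def rowa_read(replicas, key):
        # "Read One": any replica is safe to read because all copies are identical.
        return replicas[0].read(key)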
Decoupled ROWA – allow replicas to say “No”
- Write All (but replicas can say “No”)
  - Removes performance coupling: a write can complete without waiting for a degraded replica
  - Removes availability coupling: allowing stale values eliminates the need to lock data during recovery
  - Issue: a read may return a stale value
- Read One (but read all timestamps)
  - Replicas can say “No” to a read_timestamp request
  - Use quorums to make sure enough replicas say “Yes”
Quorums – use up-to-date information
- Perform reads and writes on a majority of the replicas
- Use timestamps to determine the correct value of a read
- Performance coupling
  - Problem: requests are distributed using static information
  - Consequence: one degraded node can slow down over 50% of writes
- Load-balanced quorums: use current load information to select quorum participants (sketched below)
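A sketch of load-balanced quorum selection, assuming each replica exposes a current load estimate (the load attribute and the majority quorum size are illustrative assumptions):

    def select_quorum(replicas):
        # Majority quorum size: Q = floor(N/2) + 1.
        quorum_size = len(replicas) // 2 + 1
        # Static schemes pick a fixed quorum; using current load instead lets a
        # degraded replica be left out of the quorum whenever possible.
        least_loaded = sorted(replicas, key=lambda r: r.load)
        return least_loaded[:quorum_size]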
DeStor – two ways to look at it
- Decoupled ROWA
  - “Write all” is best-effort, but write to at least a majority
  - Read a majority of timestamps to check for staleness
- Load-balanced quorums (with a read optimization)
  - Use dynamic load information
  - Read one value and a majority of timestamps
DeStor write
- Issue write(key,val) to all N replicas
- Wait for a majority to ack before returning success
- Else, time out and retry or return failure
[Diagram: client C sends write v.7 to replicas R1–R4; success is returned once a majority acks]
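A sketch of this write path; the per-replica write(key, val, ts) call and modeling a “No” as a TimeoutError are assumptions for illustration:

    import time

    def destor_write(replicas, key, val, majority):
        ts = time.time()          # client-generated physical timestamp
        acks = 0
        for r in replicas:
            try:
                # Best-effort "write all": a degraded replica may say "No"
                # (modeled here as raising TimeoutError) without blocking us.
                r.write(key, val, ts)
                acks += 1
            except TimeoutError:
                continue
        # Success needs only a majority of acks; otherwise the caller can
        # retry or report failure.
        return acks >= majority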
DeStor read
- Issue {v, tv} = read(key) to a random replica
- Issue get_timestamp(key) to all N replicas
- Find the most recent timestamp t* in T = {t1, t2, …}
- If tv = t*, return v
- Else, issue read(key) to a replica with tn = t*
[Diagram: client C reads a value from one replica and timestamps from R1–R4, returning the value with the newest timestamp]
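A sketch of this read path; read(key) returning a (value, timestamp) pair and get_timestamp(key) follow the slide, while the error handling is an illustrative assumption:

    import random

    def destor_read(replicas, key, majority):
        # Read one value from a random replica...
        val, val_ts = random.choice(replicas).read(key)

        # ...but check its freshness against a majority of timestamps.
        stamped = []
        for r in replicas:
            try:
                stamped.append((r.get_timestamp(key), r))
            except TimeoutError:
                continue              # a degraded replica may say "No"
        if len(stamped) < majority:
            raise RuntimeError("no timestamp quorum")

        newest_ts, newest_replica = max(stamped, key=lambda p: p[0])
        if val_ts == newest_ts:
            return val                # the cheap read was already up to date
        # Otherwise fetch the value from a replica holding the newest timestamp.
        val, _ = newest_replica.read(key)
        return val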
Decoupling further – unlock the data
- 2-phase commit ensures atomicity among replicas
  - Couples replicas between the phases
  - Locking complicates the implementation and recovery
- 2PC not needed for DeStor?
  - Client-generated physical timestamps
  - API: single-operation transactions with no partial updates
  - Assumption: clients operate independently
[Diagrams: independent clients C1 and C2 write to replicas R1–R4 without locks; timestamps order the competing writes]
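One way to realize client-generated physical timestamps with a total order is to break ties with a client id; the tie-breaking rule is an illustrative assumption, not something the slides specify:

    import time

    def make_timestamp(client_id):
        # Physical time first, client id as a tie-breaker so independent clients
        # writing at the same instant still produce a total order (assumption).
        return (time.time(), client_id)

    def apply_write(store, key, val, ts):
        # Single-operation, last-writer-wins update: no locks and no
        # prepare/commit phases between replicas.
        _, old_ts = store.get(key, (None, (0.0, -1)))
        if ts > old_ts:
            store[key] = (val, ts)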
Client failure – what can happen without locks
- Issue: fewer than a majority of replicas are written (some replicas hold v.7, the rest still hold v.6)
- Serializability: once v.7 is read, make sure it becomes the majority value
  - Idea: the write of v.7 “didn’t happen” until it was read
[Diagram: C1’s partial write leaves v.7 on only some of R1–R4 alongside v.6]
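A sketch of that repair-on-read idea, reusing the illustrative replica interface from the write sketch: before returning a value, the reader writes it back until a majority holds it, so a partially written value takes effect only when it is first read.

    def read_with_repair(replicas, key, majority):
        # Find the newest (value, timestamp) pair among the replicas.
        val, ts = max((r.read(key) for r in replicas), key=lambda p: p[1])

        # Write it back so the value about to be observed is held by a majority,
        # even if the original writer crashed after a partial write.
        acks = 0
        for r in replicas:
            try:
                r.write(key, val, ts)
                acks += 1
            except TimeoutError:
                continue
        if acks < majority:
            raise RuntimeError("repair failed: no majority holds the value read")
        return val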
Timestamps – loose synchronization is sufficient
- Unsynchronized clocks
  - Issue: a client’s writes are “lost” because other writers’ timestamps are always more recent
  - Why that’s okay: clients are independent, so they can’t differentiate a “lost” write from an overwritten value
- Caveat: a user is often behind the client requests
  - The user sees inter-request causality
  - NTP synchronizes clocks to within milliseconds, which is sufficient for human-speed interactions
Consequences – behavior is more restricted
- Good recovery behavior
  - Data is available throughout crash and recovery
  - Performance degradation during cache warming doesn’t affect other replicas
- Predictable performance
  - DeStor vs. ROWA: DeStor has better write throughput and latency at the cost of read throughput and latency
  - Key: better degradation characteristics → more predictable performance
Performance: predictable (write throughput, Twrite)
- T1 = throughput of a single replica
- D1 = % degradation of one replica
- D = % system degradation = [−slope / Tmax] · D1
- ROWA: Tmax = T1, slope = −T1 ⇒ D = D1
- DeStor: Tmax = (N/Q)·T1 (so T1 ≤ Tmax ≤ 2·T1), slope = −T1/Q = −2·T1/(N+1) ⇒ D = D1/N
[Plot: write throughput T vs. single-replica degradation D1, for N = 3, 5, 7]
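Plugging a majority quorum Q = (N+1)/2 into the degradation formula, with an illustrative example (N = 5, one replica 60% degraded):

    D = \frac{-\text{slope}}{T_{\max}}\, D_1
    \qquad \text{ROWA: } D = \frac{T_1}{T_1}\, D_1 = D_1
    \qquad \text{DeStor: } D = \frac{T_1/Q}{(N/Q)\, T_1}\, D_1 = \frac{D_1}{N}

    % Illustrative example: N = 5, Q = 3, one replica 60% degraded (D_1 = 0.6)
    %   ROWA:   D = 0.6            (system write throughput drops 60%)
    %   DeStor: D = 0.6 / 5 = 0.12 (system write throughput drops only 12%)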
Performance: slightly degraded (read throughput, Tread)
- ROWA: Tmax = N·T1, slope = −T1 ⇒ D = D1/N
- DeStor: depends on the overhead of the read_timestamp request
  - Tmax = N·T1 − (N/Q)·[overhead], slope = −T1 − (T1/Q)·[overhead] ⇒ D ≈ D1/N
[Plot: read throughput T vs. single-replica degradation D1]
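Using the same illustrative numbers (N = 5, D_1 = 0.6), reads degrade by roughly the same amount under both schemes; DeStor instead pays with a somewhat lower peak read throughput because each read also issues timestamp requests:

    D \approx \frac{D_1}{N} = \frac{0.6}{5} = 0.12
    \qquad \text{for both ROWA and DeStor}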
Research issues – once replicas are free…
- Next step: simulate ROWA and DeStor
  - Measure: read and write throughput/latency
  - Factors: object size, working set size, read-write mix
- Opens up new options for system administration
  - Online repartitioning, scaling, and replica replacement
- Raises new issues for performance optimization
  - When is in-memory replication persistent enough (non-write-through replicas)?
Summary
- Application-level persistent storage system
- Replication scheme
  - Write all, wait for a majority
  - Read any, read a majority of timestamps
- Consequences
  - Data availability throughout recovery
  - Predictable performance when replicas degrade