Slide 1: Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
Slide 2: Persistent hash tables
[Diagram: frontends and app servers connected over a LAN to a hash-table tier in place of the DB]

    Key              Value
    Yahoo! user ID   User profile
    ISBN             Amazon catalog metadata
Slide 3: Two state management challenges
Failure handling:
- Consistency requirements → costly node recovery and a need for reliable failure detection
- Relaxed internal consistency → fast, non-intrusive ("free") recovery
System evolution:
- Large data sets → costly repartitioning, so resources must be provisioned carefully up front
- Free recovery → automatic, online repartitioning
DStore: an easy-to-manage cluster-based persistent hash table for Internet services
Slide 4: DStore architecture
[Diagram: app servers link against Dlib and talk to bricks over the LAN]
- Dlib: exposes the hash-table API and acts as the "coordinator" for distributed operations
- Brick: stores data, writing synchronously to disk
DStore: an easy-to-manage cluster-based persistent hash table for Internet services
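To make the brick's role concrete, here is a minimal Python sketch. All names and the on-disk layout are hypothetical illustrations, not DStore's actual code; the point is only that a brick fsyncs every put before acking, so an acknowledged write survives a crash. The Dlib side is sketched on the next slide.

    import os

    class Brick:
        """One storage node: holds key -> (timestamp, value) and syncs
        every write to disk before acking. Append-only log layout is a
        hypothetical choice for this sketch."""
        def __init__(self, path):
            self.path = path
            os.makedirs(path, exist_ok=True)
            self.store = {}  # in-memory index: key -> (timestamp, value)

        def put(self, key, value, ts):
            with open(os.path.join(self.path, "log"), "a") as f:
                f.write(f"{ts}\t{key}\t{value}\n")
                f.flush()
                os.fsync(f.fileno())  # synchronous write, per the slide
            self.store[key] = (ts, value)
            return True               # ack to the Dlib

        def get(self, key):
            return self.store.get(key)  # (timestamp, value) or None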
Slide 5: Focusing on recovery
Technique 1: Quorums
- Write: send to all bricks, wait for a majority of acks
- Read: read from a majority
- It is OK if some bricks' data differs; a failure just means missing some writes
- Tolerant of brick inconsistency
Technique 2: Single-phase writes
- 2PC: a failure between phases complicates the protocol, the second phase depends on a particular set of bricks, and it relies on reliable failure detection
- A single-phase quorum write can be completed by any majority of bricks, so no request relies on specific bricks and any brick can fail at any time
Result: simple, non-intrusive recovery
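A rough sketch of the single-phase quorum protocol just described, building on the hypothetical Brick from slide 4. A real Dlib would need unique writer timestamps, timeouts, and retries; OSError stands in for a brick or network failure here.

    import time

    class QuorumDlib:
        """Sketch of single-phase quorum reads/writes over a set of bricks."""
        def __init__(self, bricks):
            self.bricks = bricks
            self.majority = len(bricks) // 2 + 1

        def put(self, key, value):
            ts = time.time()               # writer-assigned version timestamp
            acks = 0
            for b in self.bricks:          # send to all bricks...
                try:
                    if b.put(key, value, ts):
                        acks += 1
                except OSError:
                    pass                   # a brick may fail at any time
            return acks >= self.majority   # ...succeed on a majority of acks

        def get(self, key):
            replies, polled = [], 0
            for b in self.bricks:
                try:
                    r = b.get(key)         # (ts, value) or None
                    polled += 1
                    if r is not None:
                        replies.append(r)
                except OSError:
                    pass
                if polled >= self.majority:
                    break                  # reading a majority suffices
            if not replies:
                return None
            ts, value = max(replies, key=lambda r: r[0])  # newest wins
            return value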
Slide 6: Considering consistency
[Diagram: Dlib Dl1 issues write(1) to bricks B1-B3 but fails after updating only B1; Dl2's subsequent reads of x return 0, then 1]
- A Dlib failure can cause a partial write, violating the quorum property
- If the timestamps within a read quorum differ, read-repair restores the majority invariant: a delayed commit of the partial write
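Read-repair can be sketched as a small extension of the quorum read above: if the timestamps in the read quorum disagree, the newest value is written back before being returned, re-establishing the majority invariant. Again a hypothetical sketch over the QuorumDlib from slide 5, not DStore's actual code.

    def get_with_repair(dlib, key):
        """Quorum read plus read-repair (sketch)."""
        replies = []                        # (brick, (ts, value)) pairs
        for b in dlib.bricks:
            r = b.get(key)
            if r is not None:
                replies.append((b, r))
            if len(replies) >= dlib.majority:
                break                       # a majority quorum is enough
        if not replies:
            return None
        ts, value = max((r for _, r in replies), key=lambda r: r[0])
        for b, (t, _) in replies:
            if t < ts:
                b.put(key, value, ts)       # write back the newest value:
                                            # restores the majority invariant
        return value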
Slide 7: Considering consistency (continued)
[Diagram: on Dl2's next read, the partial write(1) is detected and committed across B1-B3]
- A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read
- An individual client's view of DStore is consistent with that of a single centralized server (as in Bayou)
Slide 8: Benchmark: Free recovery
[Graphs: throughput as a brick is killed and then recovers, under worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate)]
Result: recovery is fast and non-intrusive
Slide 9: Benchmark: Automatic failure detection
[Graphs: a fail-stutter fault detected by Pinpoint under a modest policy (anomaly threshold = 8) and an aggressive policy (anomaly threshold = 5)]
Result: false positives have low cost, so anomalous bricks can be rebooted aggressively
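Because a false positive only costs one cheap recovery, the reboot policy itself can be trivial. A hypothetical sketch of such a threshold policy; the actual anomaly detection in the talk is done by Pinpoint, and `reboot` is a stand-in hook.

    from collections import Counter

    class RebootPolicy:
        """Sketch: reboot any brick whose anomaly count crosses a threshold
        (5 = aggressive, 8 = modest, per the benchmark above)."""
        def __init__(self, threshold, reboot):
            self.threshold = threshold
            self.reboot = reboot       # hypothetical hook to restart a brick
            self.anomalies = Counter()

        def report(self, brick_id):
            """Called for each anomaly a Pinpoint-style detector reports."""
            self.anomalies[brick_id] += 1
            if self.anomalies[brick_id] >= self.threshold:
                self.reboot(brick_id)          # free recovery makes this cheap
                self.anomalies[brick_id] = 0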
Slide 10: Online repartitioning (sketched below)
1. Take the brick offline
2. Copy data to the new brick
3. Bring both bricks online
[Diagram: keys in the "01" partition split between the old brick and the new brick]
To the rest of the system, it appears as if the brick just failed and recovered
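A sketch of the three steps, reusing the hypothetical Brick from slide 4. The `moves` predicate and the `online` flag are illustrative, not DStore's interface.

    def repartition(old_brick, new_brick, moves):
        """Sketch of online repartitioning. `moves(key)` says which keys
        belong to the new partition (e.g., the split-off half of "01")."""
        old_brick.online = False                   # 1. take the brick offline
        for key, (ts, value) in list(old_brick.store.items()):
            if moves(key):
                new_brick.put(key, value, ts)      # 2. copy data to new brick
                del old_brick.store[key]
        old_brick.online = True                    # 3. bring both bricks online
        new_brick.online = True
        # While the old brick is offline, quorum reads on the remaining
        # replicas mask its absence: it looks like an ordinary failure
        # followed by a (free) recovery.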
Slide 11: Benchmark: Automatic online repartitioning
[Graphs: throughput while growing from 3 to 6 bricks under evenly distributed load, and from 6 to 12 bricks with a hotspot in the "01" partition; a naive brick-selection policy is shown for comparison]
Results: brick selection is effective, and repartitioning is non-intrusive
Slide 12: Next up for free recovery
- Perform online checkpoints: take the checkpointing brick offline, just like a failure + recovery
- See whether free recovery can simplify online data reconstruction after hard failures
- Any other state management challenges you can think of?
Slide 13: Summary
Free recovery: DStore = Decoupled Storage, managed like a stateless Web farm
- Quorums [spatial decoupling]: cost is extra overprovisioning; gain is fast, non-intrusive recovery
- Single-phase ops [temporal decoupling]: cost is temporarily violating the "majority" invariant; gain is that any brick can fail at any time
Failure handling: fast, non-intrusive
- Mechanism: simple reboot
- Policy: aggressively reboot anomalous bricks
System evolution: "plug-and-play"
- Mechanism: automatic, online repartitioning
- Policy: dynamically add and remove nodes based on predicted load
Slide 14: DStore
An easy-to-manage cluster-based persistent hash table for Internet services
andy.huang@stanford.edu
Slide 15: ACID properties
- Atomicity: a put replaces the existing value and is atomic (multi-operation transactions and partial updates are not supported)
- Consistency: Jane's view of the hash table is consistent with that of a single centralized server (the Bayou session guarantees):
  - Read your writes: Jane sees her own updates
  - Monotonic reads: Jane never reads a value older than one she has read before
  - Writes follow reads: Jane's writes are ordered after any writes (by any user) that Jane has read
  - Monotonic writes: Jane's own writes are totally ordered
- Isolation: there are no multi-operation transactions to isolate
- Durability: updates are synced to disk on multiple servers
Slide 16: Summary
- Quorums = spatial decoupling (between nodes)
  - Gain: fast, non-intrusive recovery
  - Cost: overprovisioning for quorum replication
- Single-phase operations = temporal decoupling
  - Gain: any brick can fail at any time
  - Cost: temporary violation of the quorum majority invariant
- Free recovery addresses the challenges:
  - Handling failures: nodes can fail at any time and recover quickly and non-intrusively
  - System evolution: plug-and-play nodes via automatic, online repartitioning
  - Failure detection: aggressive (false positives are cheap)
  - Resource provisioning: dynamic (repartitioning is cheap)
- Resulting system: can be managed like a stateless Web farm
Slide 17: Algorithm: Wavering reads
[Diagram: C1's write(1) reaches only replica R1 of R1-R3 before C1 fails; C2's successive majority reads of x return 0, then 1, then 0]
- No two-phase commit (it complicates recovery and introduces coupling)
- C1 attempts a write but fails before completion
- The quorum property is violated: reading a majority no longer guarantees that the latest value is returned
- Result: wavering reads; successive reads may alternate between the old and new values
Slide 18: Algorithm: Read writeback
[Diagram: C2 reads {0, 1} from a majority and writes 1 back to R2 and R3]
- Idea: commit a partial write when it is first read
- Commit point: before it, reads return x = 0; after it, x = 1
- Proven linearizable under the fail-stop model
Slide 19: Algorithm: Crash recovery
[Diagram: C1 crashes mid-write(1) and later recovers while its write is still uncommitted]
- Fail-stop is not an accurate model: it implies that the client that generated the request fails permanently
- With writeback, the commit point occurs at some point in the future
- A writer expects its request to succeed or fail, not to remain "in progress"
Slide 20: Algorithm: Write in-progress (sketched below)
[Diagram: C1 records a cookie around its write(1); the next read sees the dangling cookie and reads all replicas]
- Requirement: a partial write must be committed or aborted on the next read
- Record "write in-progress" on the client:
  - On submit: write a "start" cookie
  - On return: write an "end" cookie
  - On read: if a "start" cookie has no matching "end," read from all bricks
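Putting the cookie protocol into the QuorumDlib sketch from slide 5. This is hypothetical code: the cookie store here is an in-memory set for brevity, whereas a write-in-progress record that must survive client crashes would itself be persisted.

    class CookieDlib(QuorumDlib):
        """Quorum client with the write-in-progress cookie from this slide."""
        def __init__(self, bricks):
            super().__init__(bricks)
            self.in_progress = set()   # client-local cookie store (sketch)

        def put(self, key, value):
            self.in_progress.add(key)        # on submit: "start" cookie
            ok = super().put(key, value)
            self.in_progress.discard(key)    # on return: "end" cookie
            return ok

        def get(self, key):
            if key not in self.in_progress:
                return super().get(key)      # no dangling cookie: majority read
            # "start" with no matching "end": the write may be partial.
            # Read all bricks and write the newest version back everywhere,
            # committing (or effectively aborting) the in-progress write.
            replies = []
            for b in self.bricks:
                r = b.get(key)
                if r is not None:
                    replies.append(r)
            if not replies:
                return None
            ts, value = max(replies, key=lambda r: r[0])
            for b in self.bricks:
                b.put(key, value, ts)        # commit the chosen version
            self.in_progress.discard(key)
            return value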
Slide 21: Focusing on recovery
- Technique 1: Quorums
  - Write to at least a majority; read from a majority
  - A failure just means missing a few writes
  - Simple, non-intrusive recovery
- Technique 2: Decouple in time (i.e., between requests) using single-phase operations
  - Lazy read-repair handles Dlib failures
  - No request relies on a specific set of replicas
  - Safe for any node to fail at any time
[Diagram: a Dlib coordinating the set of bricks]