Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
Jan 2004 ROC Retreat, Lake Tahoe, CA
Persistent hash tables

[Diagram: front ends and app servers connected over a LAN to a DB tier backed by a hash table]

Key              Value
Yahoo! user ID   User profile
ISBN             Amazon catalog metadata
Two state management challenges

Failure handling
  - Consistency requirements ⇒ node recovery is costly and reliable failure detection is required
  - Relaxing internal consistency ⇒ fast, non-intrusive ("free") recovery

System evolution
  - Large data sets ⇒ repartitioning is costly, so good resource provisioning is needed up front
  - Free recovery ⇒ automatic, online repartitioning

DStore: an easy-to-manage, cluster-based persistent hash table for Internet services
DStore architecture

[Diagram: app servers, each linked with a Dlib, connected over a LAN to a cluster of bricks]

Dlib: exposes the hash table API and acts as the "coordinator" for distributed operations
Brick: stores data, writing synchronously to disk
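The slide describes Dlib as the client-side library that exposes the hash table API. The following toy stand-in is a sketch only, showing the shape of such an API; the class name DLibStub and its local-dict backing are assumptions for illustration, not the actual DStore interface.

    # A minimal stand-in for the hash-table API a Dlib exposes to app servers.
    # Class and method names are illustrative assumptions, and this toy version
    # is backed by a local dict rather than a brick cluster.
    class DLibStub:
        def __init__(self):
            self._table = {}

        def put(self, key, value):
            # In DStore, the Dlib would coordinate a quorum write across bricks,
            # each of which writes synchronously to disk.
            self._table[key] = value

        def get(self, key):
            # In DStore, the Dlib would read from a majority of bricks.
            return self._table.get(key)

    # Example use from an application server:
    d = DLibStub()
    d.put("user:1234", {"name": "Jane"})   # e.g., Yahoo! user ID -> profile
    print(d.get("user:1234"))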
Focusing on recovery

Technique 1: Quorums
  - Write: send to all bricks, wait for a majority
  - Read: read from a majority
  - Tolerant of brick inconsistency: it is OK if some bricks' data differs
  - A failure just means a brick misses some writes

Technique 2: Single-phase writes
  - 2PC: a failure between phases complicates the protocol, the second phase depends on a particular set of bricks, and correctness relies on reliable failure detection
  - A single-phase quorum write can be completed by any majority of bricks, so no request relies on specific bricks
  - Any brick can fail at any time

Result: simple, non-intrusive recovery
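As a sketch of the two techniques above, the following Python models bricks as in-memory dicts mapping each key to a (timestamp, value) pair; the timestamp scheme, error handling, and function names are assumptions for illustration, not DStore's actual protocol code.

    import time

    def quorum_write(bricks, key, value):
        """Single-phase write: send to all bricks, succeed once a majority applies it."""
        ts = time.time()                      # writes are ordered by timestamp (assumed scheme)
        acks = 0
        for brick in bricks:                  # a real Dlib would send these in parallel
            try:
                brick[key] = (ts, value)
                acks += 1
            except Exception:
                pass                          # a failed brick simply misses this write
        if acks <= len(bricks) // 2:
            raise IOError("write did not reach a majority of bricks")

    def quorum_read(bricks, key):
        """Read from a majority of bricks and return the newest value seen."""
        replies = []
        for brick in bricks:
            try:
                replies.append(brick.get(key))
            except Exception:
                pass
            if len(replies) > len(bricks) // 2:
                break
        newest = max([r for r in replies if r is not None],
                     key=lambda r: r[0], default=None)
        return None if newest is None else newest[1]

    bricks = [{}, {}, {}]
    quorum_write(bricks, "x", 0)
    print(quorum_read(bricks, "x"))           # -> 0, even if one brick missed the write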
Considering consistency

[Diagram: Dl1 issues write(1) for x (initially 0) to bricks B1-B3 but fails after reaching only one brick; Dl2 then reads a majority and sees 0]

A Dlib failure can cause a partial write, violating the quorum property.
If the timestamps returned by a read differ, read-repair restores the majority invariant: the write is committed with a delay.
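The read-repair step described above can be sketched as follows, again modeling brick state as key -> (timestamp, value) dicts (an assumed representation): when a read sees differing timestamps, the newest value is written back so that a majority holds it again.

    def read_with_repair(bricks, key):
        """Majority-style read that repairs a partial write when timestamps differ."""
        replies = {}
        for i, brick in enumerate(bricks):
            if key in brick:
                replies[i] = brick[key]
        if not replies:
            return None
        newest_ts, newest_val = max(replies.values(), key=lambda r: r[0])
        if any(r[0] != newest_ts for r in replies.values()):
            # Timestamps differ: propagate the newest value so a majority
            # holds it again (the "delayed commit").
            for brick in bricks:
                brick[key] = (newest_ts, newest_val)
        return newest_val

    bricks = [{"x": (2, 1)}, {"x": (1, 0)}, {"x": (1, 0)}]   # partial write of x=1
    print(read_with_repair(bricks, "x"))                     # -> 1, and the bricks now agree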
Considering consistency (continued)

[Diagram: Dl1's write(1) of x reaches only one brick; the next read must detect and resolve the partial write]

A write-in-progress cookie can be used to detect partial writes and commit or abort them on the next read.
An individual client's view of DStore is consistent with that of a single centralized server (the Bayou session guarantees).
Benchmark: Free recovery

[Graphs: throughput over time as a brick is killed and then recovers, under worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate)]

Recovery is fast and non-intrusive.
Benchmark: Automatic failure detection

[Graphs: an injected fail-stutter fault under a modest policy (anomaly threshold = 8) and an aggressive policy (anomaly threshold = 5)]

The fail-stutter fault is detected by Pinpoint; because recovery is free, false positives have low cost.
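The reboot policies benchmarked above can be sketched as a simple per-brick anomaly counter with a threshold. How anomalies are actually detected (e.g., by a Pinpoint-style statistical monitor) is outside this sketch, and the class and method names are assumptions.

    from collections import defaultdict

    class RebootPolicy:
        """Reboot a brick once its anomaly count crosses a threshold.
        Because recovery is fast and non-intrusive, false positives are cheap."""

        def __init__(self, threshold, reboot_fn):
            self.threshold = threshold        # e.g., 8 (modest) or 5 (aggressive)
            self.reboot_fn = reboot_fn        # reboot == failure + free recovery
            self.counts = defaultdict(int)

        def report_anomaly(self, brick_id):
            self.counts[brick_id] += 1
            if self.counts[brick_id] >= self.threshold:
                self.reboot_fn(brick_id)
                self.counts[brick_id] = 0

    policy = RebootPolicy(threshold=5, reboot_fn=lambda b: print("rebooting", b))
    for _ in range(5):
        policy.report_anomaly("brick-3")      # a fail-stutter brick gets rebooted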
Online repartitioning

1. Take the brick offline
2. Copy its data to a new brick
3. Bring both bricks online

To the rest of the system, it appears as if the brick just failed and recovered.
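A sketch of the three steps above, using assumed Cluster and Brick classes; the key point is that taking a brick offline for copying looks to the rest of the system exactly like a failure followed by a recovery.

    # Illustrative only: the Cluster/Brick classes and method names are assumptions.
    class Brick:
        def __init__(self):
            self.data, self.online = {}, True
        def load(self, key): return self.data[key]
        def store(self, key, val): self.data[key] = val

    class Cluster:
        def mark_offline(self, brick): brick.online = False   # looks like a failure
        def mark_online(self, brick): brick.online = True     # looks like a recovery

    def repartition(cluster, old_brick, new_brick, keys_to_move):
        cluster.mark_offline(old_brick)           # 1. take the source brick offline
        for key in keys_to_move:                  # 2. copy data to the new brick
            new_brick.store(key, old_brick.load(key))
        cluster.mark_online(old_brick)            # 3. bring both bricks online; any
        cluster.mark_online(new_brick)            #    missed writes are fixed lazily
                                                  #    by read-repair

    cluster, old, new = Cluster(), Brick(), Brick()
    old.store("k1", (1, "v1"))
    repartition(cluster, old, new, ["k1"])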
Benchmark: Automatic online repartitioning

[Graphs: throughput during repartitioning under an evenly distributed load (growing from 3 to 6 bricks) and with a hotspot in the 01 partition (growing from 6 to 12 bricks); a "naive" curve is shown for comparison]

Brick selection is effective, and repartitioning is non-intrusive.
Next up for free recovery

- Perform online checkpoints
  - Take the checkpointing brick offline
  - Treat it just like a failure plus recovery
- See whether free recovery can simplify online data reconstruction after hard failures
- Any other state management challenges you can think of?
Summary

Free recovery: DStore ("Decoupled Storage") can be managed like a stateless Web farm.

Quorums [spatial decoupling]
  - Cost: extra overprovisioning
  - Gain: fast, non-intrusive recovery

Single-phase operations [temporal decoupling]
  - Cost: temporarily violates the "majority" invariant
  - Gain: any brick can fail at any time

Failure handling: fast and non-intrusive
  - Mechanism: simple reboot
  - Policy: aggressively reboot anomalous bricks

System evolution: "plug-and-play"
  - Mechanism: automatic, online repartitioning
  - Policy: dynamically add and remove nodes based on predicted load
ACID properties

Atomicity: a put replaces the existing value and is atomic (multi-operation transactions and partial updates are not supported)

Consistency: Jane's view of the hash table is consistent with that of a single centralized server (the Bayou session guarantees)
  - Read your writes: Jane sees her own updates
  - Monotonic reads: Jane never reads a value older than one she has read before
  - Writes follow reads: Jane's writes are ordered after any writes (by any user) that Jane has read
  - Monotonic writes: Jane's own writes are totally ordered

Isolation: there are no multi-operation transactions to isolate

Durability: updates are synced to disk on multiple servers
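The slides do not spell out how these session guarantees are enforced; purely as an assumed illustration of what the guarantees mean, the sketch below tracks the highest timestamp a client has observed per key, in the style of Bayou-like client-side session checks.

    # NOT from the slides: a generic client-side check of session guarantees such
    # as "read your writes" and "monotonic reads", using per-key timestamps
    # remembered at the client. Whether DStore works this way is an assumption.
    class Session:
        def __init__(self):
            self.seen = {}                       # key -> highest timestamp observed

        def observe_write(self, key, ts):
            self.seen[key] = max(self.seen.get(key, 0), ts)

        def check_read(self, key, ts):
            """A returned value must be at least as new as anything already seen."""
            if ts < self.seen.get(key, 0):
                raise AssertionError("stale read would violate monotonic reads")
            self.seen[key] = ts

    s = Session()
    s.observe_write("x", ts=5)      # Jane writes x at timestamp 5
    s.check_read("x", ts=5)         # reading her own write is fine
    # s.check_read("x", ts=3)       # would raise: older than a value already seen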
Summary

- Quorums = spatial decoupling (between nodes)
  - Gain: fast, non-intrusive recovery
  - Cost: overprovisioning for quorum replication
- Single-phase operations = temporal decoupling
  - Gain: any brick can fail at any time
  - Cost: temporary violation of the quorum majority invariant
- Free recovery addresses the challenges:
  - Handling failures: bricks can fail at any time and recover quickly and non-intrusively
  - System evolution: plug-and-play nodes via automatic, online repartitioning
  - Failure detection: aggressive, since false positives have low cost
  - Resource provisioning: dynamic, since repartitioning has low cost
- Resulting system: can be managed like a stateless Web farm
Algorithm: Wavering reads

- No two-phase commit (it complicates recovery and introduces coupling)
- C1 attempts to write, but fails before completion
- The quorum property is violated: reading a majority no longer guarantees that the latest value is returned
- Result: wavering reads

[Diagram: C1's write(1) for x (initially 0) reaches only replica R1; C2 reads a majority and gets 0]
Algorithm: Read writeback

- Idea: commit a partial write when it is first read
- Commit point: before the writeback, reads return x = 0; after it, they return x = 1
- Proven linearizable under the fail-stop model

[Diagram: C2's read observes C1's partial write(1) and writes it back to a majority]
Algorithm: Crash recovery

- Fail-stop is not an accurate model: it implies the client that generated the request fails permanently
- With writeback, the commit point occurs at some time in the future
- A writer expects its request to succeed or fail, not to remain "in progress"

[Diagram: C1's partial write(1) of x lingers uncommitted until a later read writes it back]
Algorithm: Write in-progress

- Requirement: a partial write must be committed or aborted on the next read
- Record "write in progress" on the client:
  - On submit: write a "start" cookie
  - On return: write an "end" cookie
  - On read: if a "start" cookie has no matching "end," read from all bricks

[Diagram: C1 records a start cookie, its write(1) reaches only R1, and its next read consults all replicas to commit the write]
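A sketch of the cookie logic above: a "start" cookie recorded on submit, an "end" cookie recorded on return, and a read that consults all bricks whenever it finds an unmatched "start". The cookie store (a local dict standing in for client-side storage) and the brick representation follow the earlier sketches and are assumptions.

    cookies = {}                                  # key -> "start" or "end"

    def write_with_cookie(bricks, key, ts, value):
        cookies[key] = "start"                    # on submit: record a "start" cookie
        acks = 0
        for brick in bricks:
            brick[key] = (ts, value)
            acks += 1
        if acks > len(bricks) // 2:
            cookies[key] = "end"                  # on return: record an "end" cookie

    def read_with_cookie(bricks, key):
        if cookies.get(key) == "start":           # unmatched "start": the previous
            replicas = bricks                     # write may be partial, so read all
        else:
            replicas = bricks[: len(bricks) // 2 + 1]   # otherwise a majority suffices
        replies = [b[key] for b in replicas if key in b]
        ts, val = max(replies, key=lambda r: r[0])
        if any(r[0] != ts for r in replies):      # commit the partial write
            for brick in bricks:
                brick[key] = (ts, val)
        cookies[key] = "end"
        return val

    bricks = [{"x": (1, 0)}, {"x": (1, 0)}, {"x": (1, 0)}]
    cookies["x"] = "start"                        # simulate a client that crashed after
    bricks[0]["x"] = (2, 1)                       # its write(1) reached only one brick
    print(read_with_cookie(bricks, "x"))          # -> 1: the next read commits it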
Focusing on recovery

- Technique 1: Quorums
  - Write to at least a majority; read from a majority
  - Failure = missing a few writes
  - Simple, non-intrusive recovery
- Decouple in time (i.e., between requests) using single-phase operations
  - Lazy read-repair handles Dlib failures
  - No request relies on a specific set of replicas
  - It is safe for any node to fail at any time

[Diagram: a Dlib coordinating operations across the bricks]