DStore: Recovery-friendly, self-managing clustered hash table
Andy Huang and Armando Fox, Stanford University
Outline
- Proposal
  - Why? The goal
  - What? The class of state we focus on
  - How? The technique for achieving the goal
- Quorum algorithm and recovery results
- Repartitioning algorithm and availability results
- Conclusion
Why? What? and How?
Why? Simplify state management

                    SIMPLE                      COMPLEX
  Configuration     "plug-and-play"             repartition
  Recovery          simple & non-intrusive      unavailability ~minutes

[Diagram: frontends, app servers, and DB/FS connected over a LAN]
What? Non-transactional data
- User preferences:
  - Explicit: name, address, etc.
  - Implicit: usage statistics (Amazon's "items viewed")
- Collaborative workflow data:
  - Examples: insurance claims, human resources files

[Spectrum: read-mostly (catalogs) | non-transactional read/write (user prefs, workflow data) | transactional (billing)]
How? Decouple using a hash table and quorums
- Hypothesis: a state store designed for non-transactional data can be decoupled so that it can be managed like a stateless system (both techniques are sketched below)
- Technique 1: expose a hash table API
  - The repartitioning scheme is simple (no complex data dependencies)
- Technique 2: use quorums (read/write ≥ majority)
  - Recovery is simple (no special-case recovery mechanism)
  - Recovery is non-intrusive (data remains available throughout)
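As a rough illustration of these two techniques, here is a minimal, single-process Python sketch (my own toy code, not DStore's implementation): a hash-table get/put API in which put writes every brick and succeeds once a majority acknowledge, and get reads a majority and returns the newest timestamped value. The brick count, the in-memory dicts, and the timestamp scheme are assumptions made for the example.

```python
# Toy sketch of a quorum-backed hash table (assumed structures, not DStore code).
import time

N = 3                                 # bricks per replica group (assumption)
MAJORITY = N // 2 + 1
bricks = [{} for _ in range(N)]       # each brick maps key -> (timestamp, value)

def put(key, value):
    """Write all bricks; succeed once a majority acknowledge."""
    stamp = time.time_ns()            # stand-in for whatever versioning DStore uses
    acks = 0
    for brick in bricks:              # a real Dlib would issue these in parallel
        brick[key] = (stamp, value)   # an unreachable brick simply would not ack
        acks += 1
    if acks < MAJORITY:
        raise IOError("write failed: fewer than a majority of bricks acked")

def get(key):
    """Read a majority of bricks and return the newest value seen."""
    replies = [b[key] for b in bricks[:MAJORITY] if key in b]
    if not replies:
        return None
    return max(replies, key=lambda r: r[0])[1]   # newest timestamp wins

put("user:42:prefs", {"items_viewed": 7})
print(get("user:42:prefs"))           # -> {'items_viewed': 7}
```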
Architecture overview
- Brick: stores data
- Dlib: exposes the hash table API to app servers and executes quorum-based reads/writes on the bricks
- Replica groups: bricks storing the same portion of the key space form a replica group (one possible key-to-group mapping is sketched below)

[Diagram: app servers, each with a Dlib, connected to the bricks over a LAN]
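The slide does not say how a Dlib decides which replica group owns a key, so the sketch below assumes a simple scheme: hash the key and use its low-order bits as the replica group id (rgid). The group count, brick names, and choice of SHA-1 are illustrative assumptions, not details from the talk.

```python
# Assumed key-to-replica-group routing: rgid = low-order bits of the key's hash.
import hashlib

GROUP_BITS = 2                        # 2**2 = 4 replica groups (assumption)
replica_groups = {rgid: [f"brick-{rgid}-{i}" for i in range(3)]
                  for rgid in range(2 ** GROUP_BITS)}

def rgid_for(key: str) -> int:
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") & (2 ** GROUP_BITS - 1)

def bricks_for(key: str):
    """Quorum reads/writes for `key` all go to the bricks of its replica group."""
    return replica_groups[rgid_for(key)]

print(rgid_for("user:42:prefs"), bricks_for("user:42:prefs"))
```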
Quorum algorithm
Algorithm: Wavering reads
- No two-phase commit (it complicates recovery and introduces coupling)
- C1 attempts a write, but fails before completion
- Quorum property violated: reading a majority no longer guarantees that the latest value is returned
- Result: wavering reads (illustrated below)

[Diagram: replicas R1-R3 all hold x = 0; C1's write(1) reaches only one replica before C1 fails; C2's majority reads return 1 and then 0]
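To make the failure mode concrete, here is a tiny hypothetical reproduction in Python (same toy replica layout as the earlier sketch): the writer reaches only one of three replicas, and two successive majority reads disagree depending on which replicas happen to respond.

```python
# Hypothetical wavering-reads scenario: a partial write, then two majority reads.
replicas = [{"x": (0, 0)} for _ in range(3)]   # key -> (timestamp, value), all x = 0

replicas[0]["x"] = (1, 1)    # C1's write(1) reaches only replica 0, then C1 fails

def read_majority(pair):
    """Return the newest value among the two replicas that happened to reply."""
    return max((replicas[i]["x"] for i in pair), key=lambda r: r[0])[1]

print(read_majority((0, 1)))   # -> 1: the partial write is visible
print(read_majority((1, 2)))   # -> 0: a later read returns the old value (wavering)
```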
Algorithm: Read writeback
- Idea: commit a partial write when it is first read (sketched below)
- Commit point: reads before it return x = 0; reads after it return x = 1
- Proven linearizable under the fail-stop model

[Diagram: C2's majority read observes C1's partial write(1) and writes the value 1 back to the other replicas]
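A sketch of the writeback step under the same toy layout (the data structures are assumptions, not DStore code): whichever timestamped value wins the majority read is written back to the stale responders, so the partial write is committed the first time anyone observes it.

```python
# Read-writeback sketch: the read commits the partial write it observes.
replicas = [{"x": (1, 1)}, {"x": (0, 0)}, {"x": (0, 0)}]   # key -> (timestamp, value)
MAJORITY = 2

def read_with_writeback(key):
    responders = replicas[:MAJORITY]                 # any majority would do
    newest = max((r[key] for r in responders), key=lambda v: v[0])
    for r in responders:
        if r[key] != newest:
            r[key] = newest                          # writeback: commit the winner
    return newest[1]

print(read_with_writeback("x"))   # -> 1, and a majority now stores (1, 1)
print(read_with_writeback("x"))   # -> 1: every later majority read agrees
```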
Algorithm: Crash recovery
- Fail-stop is not an accurate model: it implies that the client that generated the request fails permanently
- With writeback, the commit point occurs at some time in the future
- A writer expects its request to succeed or fail, not to remain "in progress"

[Diagram: C1's write(1) stops partway; until a later read commits or aborts it, a read can still return 0]
Algorithm: Write in-progress
- Requirement: the write must be committed or aborted by the next read
- Record "write in-progress" on the client (sketched below):
  - On submit: write a "start" cookie
  - On return: write an "end" cookie
  - On read: if a "start" cookie has no matching "end", read all replicas

[Diagram: after C1's interrupted write(1), the next read contacts all of R1-R3 and writes the value 1 back]
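Here is a sketch of the client-side cookie bookkeeping, with assumed names (write, read, in_progress) and an in-memory dict standing in for wherever the real client would persist its cookies; the quorum operations are passed in as callables so the sketch stays self-contained.

```python
# "Write in-progress" cookie sketch (assumed API shapes, not the exact protocol).
in_progress = {}   # key -> True while this client has an outstanding write

def write(key, value, quorum_write):
    in_progress[key] = True          # "start" cookie, recorded before submitting
    quorum_write(key, value)         # write all bricks, wait for a majority
    in_progress.pop(key, None)       # "end" cookie: the write completed

def read(key, read_majority, read_all):
    if in_progress.get(key):
        # An earlier write may have stopped partway: read *all* replicas (and
        # write back the winner) so the write is committed or aborted now.
        return read_all(key)
    return read_majority(key)

# Toy usage with a plain dict standing in for the quorum-replicated store:
store = {}
write("x", 1, lambda k, v: store.__setitem__(k, v))
print(read("x", store.get, store.get))   # -> 1
```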
Algorithm: The common case
- Write all, wait for a majority
  - Normally, all replicas perform the write
- Read a majority
  - Normally, the replicas return non-conflicting values
- Writeback is needed only when a brick has failed, or was temporarily overloaded and missed some writes
- Read-all is needed only when an app server has failed
Recovery results
Results: Simple, non-intrusive recovery
- Normal operation: a majority must complete each write
- Failure: as long as fewer than a majority of bricks fail, writes still succeed
- Recovery: equivalent to having missed a few writes during normal operation
  - Simple: no special-case recovery code
  - Non-intrusive: data remains available throughout
Benchmark: Simple, non-intrusive recovery
- Benchmark:
  - t = 60 s: one brick is killed
  - t = 120 s: the brick is restarted
- Summary:
  - Data remains available during both the failure and the recovery
  - The recovering brick restores full throughput within seconds
Benchmark: Availability under performance faults
- Fault causes: cache warming, garbage collection
- Benchmark: degrade one brick by wasting CPU cycles
- Comparison:
  - DStore: throughput remains steady
  - ROWA (read-one/write-all): throughput is throttled by the slowest brick
Repartitioning algorithm & availability results
Algorithm: Online repartitioning
- Split the replica group ID (rgid), but announce both new IDs
- Take a brick offline (to clients this looks just like a failure)
- Copy its data to a new brick
- Change the rgids and bring both bricks online (the steps are sketched below)

[Figure: a replica group's binary rgid (0010) being split into two longer rgids]
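The binary strings in the figure suggest that a group's rgid gains one bit when it splits, in the spirit of extendible hashing; the sketch below follows that reading, but the exact rgid encoding, the copy step, and the announce mechanism are my assumptions.

```python
# Online-repartitioning sketch: split one replica group's rgid in two (assumed scheme).
groups = {"0010": ["brick-A", "brick-B", "brick-C"]}   # rgid -> bricks (toy routing table)

def split_group(rgid, new_bricks):
    old_bricks = groups[rgid]
    # 1. Announce both child rgids while still serving under the old one,
    #    so lookups resolve throughout the split.
    # 2. Take one brick offline; to the quorum this looks like an ordinary
    #    failure, so reads and writes keep succeeding on the remaining majority.
    # 3. Copy that brick's data to the new bricks (omitted in this toy sketch).
    # 4. Change the rgids and bring both groups online.
    del groups[rgid]
    groups["0" + rgid] = old_bricks
    groups["1" + rgid] = new_bricks

split_group("0010", ["brick-D", "brick-E", "brick-F"])
print(sorted(groups))   # -> ['00010', '10010']
```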
Benchmark: Online repartitioning
- Benchmark:
  - t = 120 s: group 0 is repartitioned
  - t = 240 s: group 1 is repartitioned
- Non-intrusive:
  - Data remains available during the entire process
  - To clients it appears as if a brick simply failed and recovered (except that there are now more bricks)
Conclusion
- Goal: simplify management of non-transactional data
- Techniques: expose a hash table API and use quorums
- Results:
  - Recovery is simple and non-intrusive
  - Repartitioning can be done fully online
- Next steps: true "plug-and-play", automatically repartitioning when bricks are added or removed (simplified by the hash-table partitioning scheme)
- Questions: andy.huang@stanford.edu