1
D³S: Debugging Deployed Distributed Systems. Xuezheng Liu et al., Microsoft Research, NSDI 2008. Presenter: Shuo Tang, CS525@UIUC
2
Debugging distributed systems is difficult
Bugs are difficult to reproduce
– Many machines executing concurrently
– Machines/network may fail
Consistent snapshots are not easy to get
Current approaches
– Multi-threaded debugging
– Model checking
– Runtime checking
3
State of the Art
Example – distributed reader-writer locks
Log-based debugging
– Step 1: add logs
    void ClientNode::OnLockAcquired(…) {
        …
        print_log(m_NodeID, lock, mode);
    }
– Step 2: collect logs
– Step 3: write checking scripts
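To make Step 3 concrete, here is a minimal sketch (not from the paper) of the kind of checking script a developer would have to write by hand: it scans collected log lines, assumed to have the form "<node_id> <lock_id> <mode>" with mode E (exclusive) or S (shared), and flags conflicting holders. Lock releases are ignored for brevity.

    // Hedged sketch of a hand-written log checker; the log format is an
    // assumption, and release events are deliberately ignored.
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>

    int main() {
        // lock_id -> set of (node_id, mode) seen acquiring that lock
        std::map<std::string, std::set<std::pair<std::string, char>>> holders;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string node, lock;
            char mode;  // 'E' or 'S'
            if (!(in >> node >> lock >> mode)) continue;  // skip malformed lines
            auto& h = holders[lock];
            // A new exclusive holder conflicts with anyone; a new shared
            // holder conflicts with an existing exclusive holder.
            for (const auto& [other, other_mode] : h) {
                if (mode == 'E' || other_mode == 'E')
                    std::cout << "conflict on " << lock << ": " << node
                              << " vs " << other << "\n";
            }
            h.insert({node, mode});
        }
    }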
4
Problems
Too much manual effort
Difficult to anticipate what to log
– Too much?
– Too little?
Checking a large system is challenging
– A central checker cannot keep up
– Snapshots must be consistent
5
D³S Contributions
A simple language for writing distributed predicates
Programmers can change what is being checked on the fly
Failure-tolerant consistent snapshots for predicate checking
Evaluation with five real-world applications
6
D³S Workflow
(Figure) Processes expose state at runtime; a checker evaluates the predicate "no conflicting locks" over the exposed state and reports a violation when a conflict is detected.
7
Glance at a D³S Predicate

    V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
    V1: V0 { (conflict: LockID) } as final

    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

    class MyChecker : vertex<V1> {
        virtual void Execute(const V0::Snapshot& snapshot) {
            …. // invariant logic, written in sequential style
        }
        static int64 Mapping(const V0::tuple& t); // guidance for partitioning
    };
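For illustration, a standalone sketch of what the elided invariant logic and the Mapping hint might look like, using simplified stand-in types (ClientID, LockID, Tuple, Snapshot) rather than the real D³S V0/V1 classes:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    // Simplified stand-in types; the real D3S classes are generated from
    // the predicate script above.
    using ClientID = int;
    using LockID = int;
    enum LockMode { SHARED, EXCLUSIVE };
    struct Tuple { ClientID client; LockID lock; LockMode mode; };
    using Snapshot = std::vector<Tuple>;

    // Invariant: a lock held in EXCLUSIVE mode has no other holder.
    void Execute(const Snapshot& snapshot) {
        std::map<LockID, std::vector<Tuple>> byLock;
        for (const Tuple& t : snapshot) byLock[t.lock].push_back(t);
        for (const auto& [lock, holders] : byLock) {
            bool exclusive = false;
            for (const Tuple& t : holders) exclusive |= (t.mode == EXCLUSIVE);
            if (exclusive && holders.size() > 1)
                std::cout << "conflict on lock " << lock << "\n";
        }
    }

    // Partition tuples across checkers by lock id, so all holders of the
    // same lock are examined by the same checker instance.
    int64_t Mapping(const Tuple& t) { return t.lock; }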
8
D³S Parallel Predicate Checker
(Figure) Lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), … Checkers reconstruct snapshots SN1, SN2, … and partition the exposed tuples by key (LockID), so one checker sees (C1, L1, E) and (C5, L1, S) for L1 while another sees (C2, L3, S).
9
Summary of Checking Language
Predicate
– Any property calculated from a finite number of consecutive state snapshots
Highlights
– Sequential programs (with mapping)
– Reuse application types in the script and C++ code
– Binary instrumentation
Support for reducing the overhead (in the paper)
– Incremental checking
– Sampling in time or across snapshots
10
Snapshots
Use Lamport clocks
– Instrument the network library
– 1000 logical clock ticks per second
Problem: how does the checker know whether it has received all necessary states for a snapshot?
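A minimal sketch of the classic Lamport clock rules the instrumentation relies on (this is the textbook algorithm, not the D³S implementation; the 1000-ticks-per-second pacing is omitted): every local event or send stamps the state with tick(), and every receive advances the clock past the sender's timestamp.

    #include <atomic>
    #include <cstdint>

    // Classic Lamport clock: exposed states and outgoing messages are
    // stamped with tick(); received messages advance the local clock.
    class LamportClock {
    public:
        uint64_t tick() { return ++now_; }            // local event / send
        uint64_t onReceive(uint64_t remote_ts) {      // receive
            uint64_t cur = now_.load();
            while (cur < remote_ts &&
                   !now_.compare_exchange_weak(cur, remote_ts)) {}
            return ++now_;
        }
    private:
        std::atomic<uint64_t> now_{0};
    };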
11
Consistent Snapshot Membership
What if a process has no state to expose for a long time?
What if a checker fails?
(Figure) Processes A and B expose states with timestamps, e.g. {(A, L0, S)} at ts=2 and {(B, L1, E)} at ts=6; empty reports (ts=10, ts=12) serve as heartbeats. The checker tracks membership M(t) and per-process snapshots S_A(t), S_B(t): check(6) and check(10) run once both members have reported up to those times, reusing the previous state when nothing changed (S_A(6)=S_A(2), S_B(10)=S_B(6)). After B's failure is detected, M(16)={A} and check(16) proceeds with A alone.
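A hedged sketch of the completeness rule the figure illustrates: check(t) may run once every process in the membership M(t) has reported, via state or heartbeat, up to time t, and a process with nothing new keeps its previously exposed state. Types and field names are illustrative.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    // Tracks per-process reports and decides when a snapshot is complete.
    struct SnapshotTracker {
        std::map<std::string, uint64_t> latest_ts;   // last report per process
        std::map<std::string, std::string> state;    // last exposed state

        void onReport(const std::string& proc, uint64_t ts,
                      const std::string& s) {
            latest_ts[proc] = ts;
            if (!s.empty()) state[proc] = s;         // empty report = heartbeat
        }

        // check(t) may run once every live member has reached time t.
        bool completeAt(uint64_t t,
                        const std::set<std::string>& membership) const {
            for (const auto& p : membership) {
                auto it = latest_ts.find(p);
                if (it == latest_ts.end() || it->second < t) return false;
            }
            return true;
        }
    };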
12
Experimental Method
Debugging five real systems
– Can D³S help developers find bugs?
– Are predicates simple to write?
– Is the checking overhead acceptable?
Case: Chord implementation from i3
– Uses predecessor and successor lists to stabilize
– "Holes" and overlaps
13
Chord Overlay
Perfect ring: no overlap, no hole – aggregated key coverage is 100%
Consistency vs. availability: cannot get both
– Global measure of the factors
– See the tradeoff quantitatively for performance tuning
– Capable of checking detailed key coverage
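As an illustration of the "aggregated key coverage" predicate, a sketch assuming each node exposes its own id and its predecessor's id on a 2^64 identifier ring: coverage of exactly 100% means a perfect ring, less means holes, more means overlaps.

    #include <cstdint>
    #include <vector>

    // Each node's exposed view: its id and its predecessor's id.
    struct NodeView { uint64_t self; uint64_t pred; };

    // Sum the sizes of the owned arcs (pred, self] on a 2^64 ring.
    double keyCoverage(const std::vector<NodeView>& views) {
        const long double ring = 18446744073709551616.0L;  // 2^64
        long double covered = 0;
        for (const auto& v : views) {
            uint64_t arc = v.self - v.pred;  // modular distance (pred, self]
            covered += (arc == 0 ? ring : (long double)arc);  // self==pred: whole ring
        }
        return (double)(covered / ring);     // 1.0 == exactly full coverage
    }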
14
Summary of Results ApplicationLoCPredicatesLoPResults PacificA (Structured data storage) 67,263membership consistency; leader election; consistency among replicas 1183 correctness bugs Paxos implement- ation 6,993consistency in consensus outputs; leader election 502 correctness bugs Web search engine 26,036unbalanced response time of indexing servers 811 performance problem Chord (DHT)7,640aggregate key range coverage; conflict key holders 72tradeoff bw/ availability & consistency BitTorrent client 36,117Health in neighbor set; distribution of downloaded pieces; peer contribution rank 2102 performance bugs; free riders Data center apps Wide area apps
15
Overhead (PacificA)
Less than 8%, in most cases less than 4%; I/O overhead < 0.5%.
Overhead is negligible in the other checked systems.
16
Related Work
Log analysis – Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
Predicate checking at replay time – WiDS Checker [NSDI'07], Friday [NSDI'07]
P2-based online monitoring – P2-monitor [EuroSys'06]
Model checking – MaceMC [NSDI'07], CMC [OSDI'04]
17
Conclusion
Predicate checking is effective for debugging deployed, large-scale distributed systems
D³S enables:
– Changing what is monitored on the fly
– Checking with multiple checkers
– Specifying predicates in a sequential, centralized manner
18
Thank You
Thanks to the authors for providing some of the slides
19
PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper et al. @ Yahoo! Research
Presented by Ying-Yi Liang
* Some slides come from the authors' version
20
What Is the Problem?
The web era: web applications
– Users are picky: low latency, high availability
– Enterprises are greedy: high scalability
– Things move fast: new ideas expire very soon
Two ways of developing a cool web application
– Making your own fire: quick and cool, but tiring and error-prone
– Using huge "powerful" building blocks: wonderful, but the market will have shifted away by the time you are done
Neither way scales very well…
Something is missing – an infrastructure specially tailored for web applications!
21
Web Application Model
Object sharing: blogs, Flickr, Picasa Web, YouTube, …
Social: Facebook, Twitter, …
Listing: Yahoo! Shopping, del.icio.us, news
They require:
– High scalability, availability, and fault tolerance
– Acceptable latency for geographically distributed requests
– A simplified query API
– Some consistency (weaker than sequential consistency)
22
PNUTS – DB in the Cloud
(Figure) A geographically replicated table of records, e.g. rows (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E), created with:
    CREATE TABLE Parts (
        ID VARCHAR,
        StockNumber INT,
        Status VARCHAR
        …
    )
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure
23
Basic Concepts
(Figure) A tablet is a set of records; each record has a primary key and fields:

Primary Key | Field
Grape       | Grapes are good to eat
Lime        | Limes are green
Apple       | Apple is wisdom
Strawberry  | Strawberry shortcake
Orange      | Arrgh! Don't get scurvy!
Avocado     | But at what price?
Lemon       | How much did you pay for this lemon?
Tomato      | Is this a vegetable?
Banana      | The perfect fruit
Kiwi        | New Zealand
24
A view from 10,000-ft
25
PNUTS Storage Architecture
(Figure) Components: clients, REST API, routers, tablet controller, storage units, message broker.
26
Geographic Replication
(Figure) The same components (clients, REST API, routers, tablet controller, storage units) are deployed in each region (Region 1, 2, 3), connected by the message broker.
27
In-region Load Balance
(Figure) Tablets are spread across storage units within a region.
28
Data and Query Models
Simplified relational data model: tables of records
– Typed columns
– Typical data types plus the blob type
– Does not enforce inter-table relationships
Operations: selection, projection (no join, aggregation, …)
Options: point access, range query, multiget
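A hypothetical client interface mirroring this query model — point access, multiget, and range scans with projection; these names are illustrative and are not the real PNUTS API:

    #include <map>
    #include <string>
    #include <vector>

    using Record = std::map<std::string, std::string>;   // column -> value

    class TableClient {
    public:
        // Point access by primary key.
        virtual Record get(const std::string& table,
                           const std::string& key) = 0;

        // Multiget: several point reads in one call.
        virtual std::vector<Record> multiget(
            const std::string& table,
            const std::vector<std::string>& keys) = 0;

        // Range query over [begin, end), projecting only the named columns.
        virtual std::vector<Record> scan(
            const std::string& table,
            const std::string& begin, const std::string& end,
            const std::vector<std::string>& columns) = 0;

        virtual ~TableClient() = default;
    };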
29
Record Assignment
(Figure) The router maps key intervals to storage units: MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1. Records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) are stored on the unit owning their interval.
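A sketch of the router's interval map, assuming intervals are keyed by their lower bound (with "" standing in for MIN); the boundary conventions here are illustrative, not PNUTS's actual tablet map:

    #include <iterator>
    #include <map>
    #include <string>

    class IntervalRouter {
    public:
        // addInterval("Canteloupe", "SU3") means keys from "Canteloupe" up to
        // the next boundary live on SU3; an entry with lower bound "" covers MIN.
        void addInterval(const std::string& lower_bound, const std::string& su) {
            lower_to_su_[lower_bound] = su;
        }
        // Point lookup: the interval with the greatest lower bound <= key.
        std::string lookup(const std::string& key) const {
            auto it = lower_to_su_.upper_bound(key);   // first bound > key
            return it == lower_to_su_.begin() ? "" : std::prev(it)->second;
        }
    private:
        std::map<std::string, std::string> lower_to_su_;
    };

Loading the figure's map (addInterval("", "SU1"), addInterval("Canteloupe", "SU3"), addInterval("Lime", "SU2"), addInterval("Strawberry", "SU1")), lookup("Mango") returns SU2, matching the Lime–Strawberry interval in the figure.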
30
Single Point Update
(Figure) The path of a single-point update: "write key k" passes through routers, message brokers, and a storage unit (SU); a sequence number for key k is assigned and SUCCESS is returned to the client (steps 1–8).
31
Range Query
(Figure) A range query Grapefruit…Pear is split by the router using the interval map (MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1) into Grapefruit…Lime, sent to SU3, and Lime…Pear, sent to SU2.
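The same map also lets the router split a range query at interval boundaries; a hedged sketch (assuming the map contains a MIN entry keyed by ""): splitRange(map, "Grapefruit", "Pear") yields Grapefruit…Lime on SU3 and Lime…Pear on SU2, as in the figure.

    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    struct SubQuery { std::string from, to, storage_unit; };

    // Cut [from, to) at every interval boundary inside it; each piece goes to
    // the storage unit owning that interval. The map is keyed by lower bound
    // and must contain a MIN entry keyed by "".
    std::vector<SubQuery> splitRange(
        const std::map<std::string, std::string>& lower_to_su,
        const std::string& from, const std::string& to) {
        std::vector<SubQuery> parts;
        auto it = lower_to_su.upper_bound(from);      // first boundary > from
        std::string cur = from;
        std::string owner = std::prev(it)->second;    // SU owning 'from'
        for (; it != lower_to_su.end() && it->first < to; ++it) {
            parts.push_back({cur, it->first, owner}); // piece up to boundary
            cur = it->first;
            owner = it->second;
        }
        parts.push_back({cur, to, owner});            // final piece
        return parts;
    }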
32
Relaxed Consistency
ACID transactions / sequential consistency: too strong
– Non-trivial overhead in asynchronous settings
– Users can tolerate stale data in many cases
Go hybrid: eventual consistency + mechanism for SC
Use versioning to cope with asynchrony
(Figure) A record's timeline: insertion creates v. 1 of generation 1; updates produce v. 2 through v. 8; a delete ends the generation.
33
Relaxed Consistency
(Figure) On the version timeline (v. 1 … v. 8, generation 1): read_any() may return either the current version or a stale one.
34
Relaxed Consistency
(Figure) read_latest() returns the current version.
35
Relaxed Consistency
(Figure) read_critical("v.6") returns a version at least as new as v. 6.
36
Relaxed Consistency
(Figure) write() produces the next version of the record.
37
Relaxed Consistency
(Figure) test_and_set_write(v.7) applies the write only if the current version is v. 7; otherwise it returns ERROR.
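Pulling the five calls together, a toy in-memory sketch of their semantics against one record's version history; the call names follow the slides, but the representation and the "stale cutoff" used to simulate a lagging replica are purely illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <utility>
    #include <vector>

    class VersionedRecord {
    public:
        explicit VersionedRecord(std::string v1) : versions_{std::move(v1)} {}

        // read_any(): any replica may answer, possibly with a stale version.
        std::string read_any() const { return versions_[stale_cutoff_ - 1]; }

        // read_latest(): routed so that the current version is returned.
        std::string read_latest() const { return versions_.back(); }

        // read_critical(v): return a version at least as new as v.
        std::string read_critical(size_t v) const {
            if (v > versions_.size())
                throw std::runtime_error("ERROR: version not yet produced");
            return versions_[std::max(v, stale_cutoff_) - 1];
        }

        // write(): blind write, creates the next version.
        size_t write(const std::string& value) {
            versions_.push_back(value);
            return versions_.size();            // new version number
        }

        // test_and_set_write(v, value): apply only if the current version is v.
        size_t test_and_set_write(size_t v, const std::string& value) {
            if (v != versions_.size())
                throw std::runtime_error("ERROR: stale base version");
            return write(value);
        }

        void set_stale_cutoff(size_t v) { stale_cutoff_ = v; }  // simulate lag

    private:
        std::vector<std::string> versions_;     // v. 1, v. 2, ...
        size_t stale_cutoff_ = 1;               // what a lagging replica has
    };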
38
Membership Management
Record timelines should be coherent at each replica
Updates must be applied to the latest version
Use mastership
– Per-record basis
– Only one replica has mastership at any time
– All update requests are sent to the master to get ordered
– Routers & YMB maintain mastership information
– The replica receiving frequent write requests gets the mastership
Leader election service provided by ZooKeeper
39
ZooKeeper
A distributed system is like a zoo: someone needs to be in charge of it.
ZooKeeper is a highly available, scalable coordination service.
ZooKeeper plays two roles in PNUTS
– Coordination service
– Publish/subscribe service
Guarantees: sequential consistency; single system image; atomicity (as in ACID); durability; timeliness
A tiny kernel for upper-level building blocks
40
ZooKeeper: High Availability
High availability via replication
A fault-tolerant persistent store
Provides sequential consistency
41
ZooKeeper: Services
Publish/subscribe service
– Contents stored in ZooKeeper are organized as directory trees
– Publish: write to a specific znode
– Subscribe: read a specific znode
Coordination via automatic name resolution
– By appending a sequence number to names: CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) yields "/…/x-1", "/…/x-2", …
– Ephemeral nodes: znodes that live as long as the session
42
ZooKeeper Example: Lock
1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();
4) test = exists(name of last child before id, true);
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
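The same recipe spelled out in C++ against a hypothetical ZkClient wrapper (ZkClient below is not the real ZooKeeper C or Java API); the structure follows steps 1)–7) above:

    #include <algorithm>
    #include <iterator>
    #include <string>
    #include <vector>

    // Hypothetical client wrapper; only the calls the recipe needs.
    struct ZkClient {
        virtual std::string createEphemeralSequential(const std::string& prefix) = 0;
        virtual std::vector<std::string> getChildren(const std::string& path) = 0;
        virtual bool existsAndWatch(const std::string& path) = 0; // set watch if present
        virtual void waitForWatch() = 0;                          // block until notified
        virtual ~ZkClient() = default;
    };

    void acquireLock(ZkClient& zk, const std::string& lockDir) {
        // 1) id = create(".../locks/x-", SEQUENCE | EPHEMERAL)
        const std::string id = zk.createEphemeralSequential(lockDir + "/x-");
        const std::string myName = id.substr(lockDir.size() + 1);
        while (true) {
            // 2) children = getChildren(".../locks")
            std::vector<std::string> children = zk.getChildren(lockDir);
            std::sort(children.begin(), children.end());
            auto me = std::find(children.begin(), children.end(), myName);
            if (me == children.end()) continue;   // our node missing: re-check
            // 3) lowest sequence number -> we hold the lock
            if (me == children.begin()) return;
            // 4) watch the child immediately before ours
            const std::string prev = lockDir + "/" + *std::prev(me);
            // 5)-7) if it is already gone, re-check; otherwise wait for the watch
            if (zk.existsAndWatch(prev)) zk.waitForWatch();
        }
    }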
43
ZooKeeper Is Powerful
Many core services in distributed systems are built on ZooKeeper
– Consensus
– Distributed locks (exclusive, shared)
– Membership
– Leader election
– Job tracker binding
– …
More information at http://hadoop.apache.org/zookeeper/
44
Experimental Setup
Production PNUTS code
– Enhanced with the ordered table type
Three PNUTS regions
– 2 west coast, 1 east coast
– 5 storage units, 2 message brokers, 1 router
– West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
– East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
Workload
– 1200–3600 requests/second
– 0–50% writes
– 80% locality
45
Scalability
46
Sensitivity to R/W Ratio
47
Sensitivity to Request Dist.
48
Related Work
Google BigTable/GFS
– Fault tolerance and consistency via Chubby
– Strong consistency, but Chubby is not scalable
– Lacks geographic replication support
– Targets analytical workloads
Amazon Dynamo
– Unstructured data
– Peer-to-peer style solution
– Eventual consistency
Facebook Cassandra (still kind of a secret)
– Structured storage over a peer-to-peer network
– Eventual consistency
– Always-writable property: writes succeed even in the face of failures
49
Discussion
Can all web applications tolerate stale data?
Is doing replication entirely across the WAN a good idea?
Single-level router vs. B+-tree-style router hierarchy
Tiny service kernel vs. stand-alone services
Is relaxed consistency just right, or too weak?
Is exposing record versions to applications a good idea?
Should security be integrated into PNUTS?
Using the pub/sub service as undo logs