1
D³S: Debugging Deployed Distributed Systems. Xuezheng Liu et al., Microsoft Research, NSDI 2008. Presenter: Shuo Tang, CS525@UIUC
2
Debugging distributed systems is difficult
Bugs are difficult to reproduce
– Many machines executing concurrently
– Machines/network may fail
Consistent snapshots are not easy to get
Current approaches
– Multi-threaded debugging
– Model checking
– Runtime checking
3
State of the Art
Example – distributed reader-writer locks
Log-based debugging
– Step 1: add logs
    void ClientNode::OnLockAcquired(…) {
        …
        print_log(m_NodeID, lock, mode);
    }
– Step 2: collect logs
– Step 3: write checking scripts
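To make Step 3 concrete, here is a minimal sketch (not from the paper) of the kind of checking script a developer would have to write by hand: it scans collected log lines, assumed to have the form "<node_id> <lock_id> <mode>" with mode E (exclusive) or S (shared), and flags conflicting holders. Lock releases are ignored for brevity.

    // Hedged sketch of a hand-written log checker; the log format is an
    // assumption, and release events are deliberately ignored.
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>

    int main() {
        // lock_id -> set of (node_id, mode) seen acquiring that lock
        std::map<std::string, std::set<std::pair<std::string, char>>> holders;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string node, lock;
            char mode;  // 'E' or 'S'
            if (!(in >> node >> lock >> mode)) continue;  // skip malformed lines
            auto& h = holders[lock];
            // A new exclusive holder conflicts with anyone; a new shared
            // holder conflicts with an existing exclusive holder.
            for (const auto& [other, other_mode] : h) {
                if (mode == 'E' || other_mode == 'E')
                    std::cout << "conflict on " << lock << ": " << node
                              << " vs " << other << "\n";
            }
            h.insert({node, mode});
        }
    }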
4
Problems
Too much manual effort
Difficult to anticipate what to log
– Too much?
– Too little?
Checking a large system is challenging
– A central checker cannot keep up
– Snapshots must be consistent
5
D³S Contributions
A simple language for writing distributed predicates
Programmers can change what is being checked on the fly
Failure-tolerant consistent snapshots for predicate checking
Evaluation with five real-world applications
6
D³S Workflow
(Figure) Processes expose state at runtime; a checker evaluates the predicate "no conflicting locks" over the exposed state and reports a violation when a conflict is detected.
7
Glance at a D³S Predicate

    V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
    V1: V0 { (conflict: LockID) } as final

    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

    class MyChecker : vertex<V1> {
        virtual void Execute(const V0::Snapshot& snapshot) {
            …. // invariant logic, written in sequential style
        }
        static int64 Mapping(const V0::tuple& t); // guidance for partitioning
    };
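For illustration, a standalone sketch of what the elided invariant logic and the Mapping hint might look like, using simplified stand-in types (ClientID, LockID, Tuple, Snapshot) rather than the real D³S V0/V1 classes:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    // Simplified stand-in types; the real D3S classes are generated from
    // the predicate script above.
    using ClientID = int;
    using LockID = int;
    enum LockMode { SHARED, EXCLUSIVE };
    struct Tuple { ClientID client; LockID lock; LockMode mode; };
    using Snapshot = std::vector<Tuple>;

    // Invariant: a lock held in EXCLUSIVE mode has no other holder.
    void Execute(const Snapshot& snapshot) {
        std::map<LockID, std::vector<Tuple>> byLock;
        for (const Tuple& t : snapshot) byLock[t.lock].push_back(t);
        for (const auto& [lock, holders] : byLock) {
            bool exclusive = false;
            for (const Tuple& t : holders) exclusive |= (t.mode == EXCLUSIVE);
            if (exclusive && holders.size() > 1)
                std::cout << "conflict on lock " << lock << "\n";
        }
    }

    // Partition tuples across checkers by lock id, so all holders of the
    // same lock are examined by the same checker instance.
    int64_t Mapping(const Tuple& t) { return t.lock; }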
8
D³S Parallel Predicate Checker
(Figure) Lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), … Checkers reconstruct snapshots SN1, SN2, … and partition the exposed tuples by key (LockID), so one checker sees (C1, L1, E) and (C5, L1, S) for L1 while another sees (C2, L3, S).
9
Summary of Checking Language
Predicate
– Any property calculated from a finite number of consecutive state snapshots
Highlights
– Sequential programs (with mapping)
– Reuse application types in the script and C++ code
– Binary instrumentation
Support for reducing the overhead (in the paper)
– Incremental checking
– Sampling in time or across snapshots
10
Snapshots
Use Lamport clocks
– Instrument the network library
– 1000 logical clock ticks per second
Problem: how does the checker know whether it has received all necessary states for a snapshot?
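A minimal sketch of the classic Lamport clock rules the instrumentation relies on (this is the textbook algorithm, not the D³S implementation; the 1000-ticks-per-second pacing is omitted): every local event or send stamps the state with tick(), and every receive advances the clock past the sender's timestamp.

    #include <atomic>
    #include <cstdint>

    // Classic Lamport clock: exposed states and outgoing messages are
    // stamped with tick(); received messages advance the local clock.
    class LamportClock {
    public:
        uint64_t tick() { return ++now_; }            // local event / send
        uint64_t onReceive(uint64_t remote_ts) {      // receive
            uint64_t cur = now_.load();
            while (cur < remote_ts &&
                   !now_.compare_exchange_weak(cur, remote_ts)) {}
            return ++now_;
        }
    private:
        std::atomic<uint64_t> now_{0};
    };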
11
Consistent Snapshot Membership
What if a process has no state to expose for a long time?
What if a checker fails?
(Figure) Processes A and B expose states with timestamps, e.g. {(A, L0, S)} at ts=2 and {(B, L1, E)} at ts=6; empty reports (ts=10, ts=12) serve as heartbeats. The checker tracks membership M(t) and per-process snapshots S_A(t), S_B(t): check(6) and check(10) run once both members have reported up to those times, reusing the previous state when nothing changed (S_A(6)=S_A(2), S_B(10)=S_B(6)). After B's failure is detected, M(16)={A} and check(16) proceeds with A alone.
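A hedged sketch of the completeness rule the figure illustrates: check(t) may run once every process in the membership M(t) has reported, via state or heartbeat, up to time t, and a process with nothing new keeps its previously exposed state. Types and field names are illustrative.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    // Tracks per-process reports and decides when a snapshot is complete.
    struct SnapshotTracker {
        std::map<std::string, uint64_t> latest_ts;   // last report per process
        std::map<std::string, std::string> state;    // last exposed state

        void onReport(const std::string& proc, uint64_t ts,
                      const std::string& s) {
            latest_ts[proc] = ts;
            if (!s.empty()) state[proc] = s;         // empty report = heartbeat
        }

        // check(t) may run once every live member has reached time t.
        bool completeAt(uint64_t t,
                        const std::set<std::string>& membership) const {
            for (const auto& p : membership) {
                auto it = latest_ts.find(p);
                if (it == latest_ts.end() || it->second < t) return false;
            }
            return true;
        }
    };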
12
Experimental Method
Debugging five real systems
– Can D³S help developers find bugs?
– Are predicates simple to write?
– Is the checking overhead acceptable?
Case: Chord implementation from i3
– Uses predecessor and successor lists to stabilize
– "Holes" and overlaps
13
Chord Overlay
Perfect ring: no overlap, no hole – aggregated key coverage is 100%
Consistency vs. availability: cannot get both
– Global measure of the factors
– See the tradeoff quantitatively for performance tuning
– Capable of checking detailed key coverage
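As an illustration of the "aggregated key coverage" predicate, a sketch assuming each node exposes its own id and its predecessor's id on a 2^64 identifier ring: coverage of exactly 100% means a perfect ring, less means holes, more means overlaps.

    #include <cstdint>
    #include <vector>

    // Each node's exposed view: its id and its predecessor's id.
    struct NodeView { uint64_t self; uint64_t pred; };

    // Sum the sizes of the owned arcs (pred, self] on a 2^64 ring.
    double keyCoverage(const std::vector<NodeView>& views) {
        const long double ring = 18446744073709551616.0L;  // 2^64
        long double covered = 0;
        for (const auto& v : views) {
            uint64_t arc = v.self - v.pred;  // modular distance (pred, self]
            covered += (arc == 0 ? ring : (long double)arc);  // self==pred: whole ring
        }
        return (double)(covered / ring);     // 1.0 == exactly full coverage
    }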
14
Summary of Results ApplicationLoCPredicatesLoPResults PacificA (Structured data storage) 67,263membership consistency; leader election; consistency among replicas 1183 correctness bugs Paxos implement- ation 6,993consistency in consensus outputs; leader election 502 correctness bugs Web search engine 26,036unbalanced response time of indexing servers 811 performance problem Chord (DHT)7,640aggregate key range coverage; conflict key holders 72tradeoff bw/ availability & consistency BitTorrent client 36,117Health in neighbor set; distribution of downloaded pieces; peer contribution rank 2102 performance bugs; free riders Data center apps Wide area apps
15
Overhead (PacificA)
Less than 8%, in most cases less than 4%; I/O overhead < 0.5%.
Overhead is negligible in the other checked systems.
16
Related Work
Log analysis – Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
Predicate checking at replay time – WiDS Checker [NSDI'07], Friday [NSDI'07]
P2-based online monitoring – P2-monitor [EuroSys'06]
Model checking – MaceMC [NSDI'07], CMC [OSDI'04]
17
Conclusion
Predicate checking is effective for debugging deployed, large-scale distributed systems
D³S enables:
– Changing what is monitored on the fly
– Checking with multiple checkers
– Specifying predicates in a sequential, centralized manner
18
Thank You
Thanks to the authors for providing some of the slides
19
PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper et al. @ Yahoo! Research
Presented by Ying-Yi Liang
* Some slides come from the authors' version
20
What Is the Problem?
The web era: web applications
– Users are picky: low latency, high availability
– Enterprises are greedy: high scalability
– Things move fast: new ideas expire very soon
Two ways of developing a cool web application
– Making your own fire: quick and cool, but tiring and error-prone
– Using huge "powerful" building blocks: wonderful, but the market will have shifted away by the time you are done
Neither way scales very well…
Something is missing – an infrastructure specially tailored for web applications!
21
Web Application Model
Object sharing: blogs, Flickr, Picasa Web, YouTube, …
Social: Facebook, Twitter, …
Listing: Yahoo! Shopping, del.icio.us, news
They require:
– High scalability, availability, and fault tolerance
– Acceptable latency for geographically distributed requests
– A simplified query API
– Some consistency (weaker than sequential consistency)
22
PNUTS – DB in the Cloud
(Figure) A geographically replicated table of records, e.g. rows (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E), created with:
    CREATE TABLE Parts (
        ID VARCHAR,
        StockNumber INT,
        Status VARCHAR
        …
    )
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure
23
Basic Concepts
(Figure) A tablet is a set of records; each record has a primary key and fields:

Primary Key | Field
Grape       | Grapes are good to eat
Lime        | Limes are green
Apple       | Apple is wisdom
Strawberry  | Strawberry shortcake
Orange      | Arrgh! Don't get scurvy!
Avocado     | But at what price?
Lemon       | How much did you pay for this lemon?
Tomato      | Is this a vegetable?
Banana      | The perfect fruit
Kiwi        | New Zealand
24
A view from 10,000-ft
25
PNUTS Storage Architecture
(Figure) Components: clients, REST API, routers, tablet controller, storage units, message broker.
26
Geographic Replication
(Figure) The same components (clients, REST API, routers, tablet controller, storage units) are deployed in each region (Region 1, 2, 3), connected by the message broker.
27
In-region Load Balance
(Figure) Tablets are spread across storage units within a region.
28
Data and Query Models
Simplified relational data model: tables of records
– Typed columns
– Typical data types plus the blob type
– Does not enforce inter-table relationships
Operations: selection, projection (no join, aggregation, …)
Options: point access, range query, multiget
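A hypothetical client interface mirroring this query model — point access, multiget, and range scans with projection; these names are illustrative and are not the real PNUTS API:

    #include <map>
    #include <string>
    #include <vector>

    using Record = std::map<std::string, std::string>;   // column -> value

    class TableClient {
    public:
        // Point access by primary key.
        virtual Record get(const std::string& table,
                           const std::string& key) = 0;

        // Multiget: several point reads in one call.
        virtual std::vector<Record> multiget(
            const std::string& table,
            const std::vector<std::string>& keys) = 0;

        // Range query over [begin, end), projecting only the named columns.
        virtual std::vector<Record> scan(
            const std::string& table,
            const std::string& begin, const std::string& end,
            const std::vector<std::string>& columns) = 0;

        virtual ~TableClient() = default;
    };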
29
Record Assignment
(Figure) The router maps key intervals to storage units: MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1. Records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) are stored on the unit owning their interval.
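A sketch of the router's interval map, assuming intervals are keyed by their lower bound (with "" standing in for MIN); the boundary conventions here are illustrative, not PNUTS's actual tablet map:

    #include <iterator>
    #include <map>
    #include <string>

    class IntervalRouter {
    public:
        // addInterval("Canteloupe", "SU3") means keys from "Canteloupe" up to
        // the next boundary live on SU3; an entry with lower bound "" covers MIN.
        void addInterval(const std::string& lower_bound, const std::string& su) {
            lower_to_su_[lower_bound] = su;
        }
        // Point lookup: the interval with the greatest lower bound <= key.
        std::string lookup(const std::string& key) const {
            auto it = lower_to_su_.upper_bound(key);   // first bound > key
            return it == lower_to_su_.begin() ? "" : std::prev(it)->second;
        }
    private:
        std::map<std::string, std::string> lower_to_su_;
    };

Loading the figure's map (addInterval("", "SU1"), addInterval("Canteloupe", "SU3"), addInterval("Lime", "SU2"), addInterval("Strawberry", "SU1")), lookup("Mango") returns SU2, matching the Lime–Strawberry interval in the figure.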
30
Single Point Update
(Figure) The path of a single-point update: "write key k" passes through routers, message brokers, and a storage unit (SU); a sequence number for key k is assigned and SUCCESS is returned to the client (steps 1–8).
31
Range Query
(Figure) A range query Grapefruit…Pear is split by the router using the interval map (MIN–Canteloupe → SU1, Canteloupe–Lime → SU3, Lime–Strawberry → SU2, Strawberry–MAX → SU1) into Grapefruit…Lime, sent to SU3, and Lime…Pear, sent to SU2.
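The same map also lets the router split a range query at interval boundaries; a hedged sketch (assuming the map contains a MIN entry keyed by ""): splitRange(map, "Grapefruit", "Pear") yields Grapefruit…Lime on SU3 and Lime…Pear on SU2, as in the figure.

    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    struct SubQuery { std::string from, to, storage_unit; };

    // Cut [from, to) at every interval boundary inside it; each piece goes to
    // the storage unit owning that interval. The map is keyed by lower bound
    // and must contain a MIN entry keyed by "".
    std::vector<SubQuery> splitRange(
        const std::map<std::string, std::string>& lower_to_su,
        const std::string& from, const std::string& to) {
        std::vector<SubQuery> parts;
        auto it = lower_to_su.upper_bound(from);      // first boundary > from
        std::string cur = from;
        std::string owner = std::prev(it)->second;    // SU owning 'from'
        for (; it != lower_to_su.end() && it->first < to; ++it) {
            parts.push_back({cur, it->first, owner}); // piece up to boundary
            cur = it->first;
            owner = it->second;
        }
        parts.push_back({cur, to, owner});            // final piece
        return parts;
    }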
32
Relaxed Consistency
ACID transactions / sequential consistency: too strong
– Non-trivial overhead in asynchronous settings
– Users can tolerate stale data in many cases
Go hybrid: eventual consistency + mechanism for SC
Use versioning to cope with asynchrony
(Figure) A record's timeline: insertion creates v. 1 of generation 1; updates produce v. 2 through v. 8; a delete ends the generation.
33
Relaxed Consistency
(Figure) On the version timeline (v. 1 … v. 8, generation 1): read_any() may return either the current version or a stale one.
34
Relaxed Consistency
(Figure) read_latest() returns the current version.
35
Relaxed Consistency
(Figure) read_critical("v.6") returns a version at least as new as v. 6.
36
Relaxed Consistency
(Figure) write() produces the next version of the record.
37
Relaxed Consistency
(Figure) test_and_set_write(v.7) applies the write only if the current version is v. 7; otherwise it returns ERROR.
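Pulling the five calls together, a toy in-memory sketch of their semantics against one record's version history; the call names follow the slides, but the representation and the "stale cutoff" used to simulate a lagging replica are purely illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <utility>
    #include <vector>

    class VersionedRecord {
    public:
        explicit VersionedRecord(std::string v1) : versions_{std::move(v1)} {}

        // read_any(): any replica may answer, possibly with a stale version.
        std::string read_any() const { return versions_[stale_cutoff_ - 1]; }

        // read_latest(): routed so that the current version is returned.
        std::string read_latest() const { return versions_.back(); }

        // read_critical(v): return a version at least as new as v.
        std::string read_critical(size_t v) const {
            if (v > versions_.size())
                throw std::runtime_error("ERROR: version not yet produced");
            return versions_[std::max(v, stale_cutoff_) - 1];
        }

        // write(): blind write, creates the next version.
        size_t write(const std::string& value) {
            versions_.push_back(value);
            return versions_.size();            // new version number
        }

        // test_and_set_write(v, value): apply only if the current version is v.
        size_t test_and_set_write(size_t v, const std::string& value) {
            if (v != versions_.size())
                throw std::runtime_error("ERROR: stale base version");
            return write(value);
        }

        void set_stale_cutoff(size_t v) { stale_cutoff_ = v; }  // simulate lag

    private:
        std::vector<std::string> versions_;     // v. 1, v. 2, ...
        size_t stale_cutoff_ = 1;               // what a lagging replica has
    };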
38
Membership Management
Record timelines should be coherent at each replica
Updates must be applied to the latest version
Use mastership
– Per-record basis
– Only one replica has mastership at any time
– All update requests are sent to the master to get ordered
– Routers & YMB maintain mastership information
– The replica receiving frequent write requests gets the mastership
Leader election service provided by ZooKeeper
39
ZooKeeper
A distributed system is like a zoo: someone needs to be in charge of it.
ZooKeeper is a highly available, scalable coordination service.
ZooKeeper plays two roles in PNUTS
– Coordination service
– Publish/subscribe service
Guarantees: sequential consistency; single system image; atomicity (as in ACID); durability; timeliness
A tiny kernel for upper-level building blocks
40
ZooKeeper: High Availability
High availability via replication
A fault-tolerant persistent store
Provides sequential consistency
41
ZooKeeper: Services
Publish/subscribe service
– Contents stored in ZooKeeper are organized as directory trees
– Publish: write to a specific znode
– Subscribe: read a specific znode
Coordination via automatic name resolution
– By appending a sequence number to names: CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) yields "/…/x-1", "/…/x-2", …
– Ephemeral nodes: znodes that live as long as the session
42
ZooKeeper Example: Lock
1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();
4) test = exists(name of last child before id, true);
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
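The same recipe spelled out in C++ against a hypothetical ZkClient wrapper (ZkClient below is not the real ZooKeeper C or Java API); the structure follows steps 1)–7) above:

    #include <algorithm>
    #include <iterator>
    #include <string>
    #include <vector>

    // Hypothetical client wrapper; only the calls the recipe needs.
    struct ZkClient {
        virtual std::string createEphemeralSequential(const std::string& prefix) = 0;
        virtual std::vector<std::string> getChildren(const std::string& path) = 0;
        virtual bool existsAndWatch(const std::string& path) = 0; // set watch if present
        virtual void waitForWatch() = 0;                          // block until notified
        virtual ~ZkClient() = default;
    };

    void acquireLock(ZkClient& zk, const std::string& lockDir) {
        // 1) id = create(".../locks/x-", SEQUENCE | EPHEMERAL)
        const std::string id = zk.createEphemeralSequential(lockDir + "/x-");
        const std::string myName = id.substr(lockDir.size() + 1);
        while (true) {
            // 2) children = getChildren(".../locks")
            std::vector<std::string> children = zk.getChildren(lockDir);
            std::sort(children.begin(), children.end());
            auto me = std::find(children.begin(), children.end(), myName);
            if (me == children.end()) continue;   // our node missing: re-check
            // 3) lowest sequence number -> we hold the lock
            if (me == children.begin()) return;
            // 4) watch the child immediately before ours
            const std::string prev = lockDir + "/" + *std::prev(me);
            // 5)-7) if it is already gone, re-check; otherwise wait for the watch
            if (zk.existsAndWatch(prev)) zk.waitForWatch();
        }
    }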
43
ZooKeeper Is Powerful
Many core services in distributed systems are built on ZooKeeper
– Consensus
– Distributed locks (exclusive, shared)
– Membership
– Leader election
– Job tracker binding
– …
More information at http://hadoop.apache.org/zookeeper/
44
Experimental Setup
Production PNUTS code
– Enhanced with the ordered table type
Three PNUTS regions
– 2 west coast, 1 east coast
– 5 storage units, 2 message brokers, 1 router
– West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
– East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
Workload
– 1200–3600 requests/second
– 0–50% writes
– 80% locality
45
Scalability
46
Sensitivity to R/W Ratio
47
Sensitivity to Request Dist.
48
Related Work
Google BigTable/GFS
– Fault tolerance and consistency via Chubby
– Strong consistency, but Chubby is not scalable
– Lacks geographic replication support
– Targets analytical workloads
Amazon Dynamo
– Unstructured data
– Peer-to-peer style solution
– Eventual consistency
Facebook Cassandra (still kind of a secret)
– Structured storage over a peer-to-peer network
– Eventual consistency
– Always-writable property: writes succeed even in the face of failures
49
Discussion
Can all web applications tolerate stale data?
Is doing replication entirely across the WAN a good idea?
Single-level router vs. B+-tree-style router hierarchy
Tiny service kernel vs. stand-alone services
Is relaxed consistency just right, or too weak?
Is exposing record versions to applications a good idea?
Should security be integrated into PNUTS?
Using the pub/sub service as undo logs