TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems
Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi
ASPLOS '16 4
- More people develop distributed systems
- Distributed systems are hard
- Hard largely because of concurrency
- Concurrency leads to unexpected timings: X should arrive before Y, but X can arrive after Y
- Unexpected timings lead to distributed concurrency (DC) bugs
ASPLOS '16 5
"... be able to reason about the correctness of increasingly more complex distributed systems that are used in production" - Azure engineers & managers, "Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)" [FAST '16]
Understanding distributed-system bugs is important!
ASPLOS '16 6
DC bugs are caused by non-deterministic timing of concurrent events involving more than one node: messages, crashes, reboots, timeouts, computations.
ASPLOS '16 7
(LC bug: local concurrency bug in multi-threaded, single-machine software; the prior LC-bug study is a top-10 most-cited ASPLOS paper)
ASPLOS '16 8
104 bugs from 4 varied distributed systems (Cassandra, HBase, Hadoop MapReduce, ZooKeeper); for each bug in the study: description, source code, and patches
ASPLOS '16 9
TaxDC taxonomy:
- Trigger
  - Timing: order violation, atomicity violation, fault timing, reboot timing
  - Input: fault, reboot, workload
  - Scope: nodes, messages, protocols
- Error & Failure
  - Error: local (memory, semantic, hang, silence), global (wrong, miss, silence)
  - Failure: downtime, data loss, operation failure, performance
- Fix
  - Timing: global synchronization, local synchronization
  - Handling: retry, ignore, accept, others
ASPLOS '16 10
ZooKeeper example:
1. Follower F crashes, reboots, and joins the cluster
2. Leader L syncs a snapshot with F
3. A client requests a new update; F applies it only in memory
4. Sync finishes
5. A client requests another update; F writes this one to disk correctly
6. F crashes, reboots, and joins the cluster again
7. This time L sends only the diff after the update in step 5, so F loses the update from step 3
ASPLOS '16 11
The same ZooKeeper example (steps 1-7 above), classified in TaxDC:
- Timing: atomicity violation + fault timing
- Input: 4 protocols, 2 faults (crashes), 2 reboots
- Error: global
- Failure: data inconsistency
- Fix: delay the message
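To make the failure mode concrete, here is a minimal, hypothetical Python re-enactment of the scenario above (a toy model, not ZooKeeper's actual code; the Follower class, the u1/u2/u3 update names, and the crash_and_reboot helper are all invented):

```python
class Follower:
    """Toy model: self.disk survives crashes; self.mem does not."""
    def __init__(self):
        self.disk, self.mem = [], []

    def crash_and_reboot(self):
        self.mem = list(self.disk)   # steps 1 and 6: memory rebuilt from disk

f = Follower()
f.crash_and_reboot()                       # step 1: F rejoins
f.disk = ["u1"]; f.mem = ["u1"]            # step 2: snapshot sync brings u1
f.mem.append("u2")                         # step 3: u2 applied in memory ONLY
                                           # step 4: sync finishes
f.disk.append("u3"); f.mem.append("u3")    # step 5: u3 persisted correctly
f.crash_and_reboot()                       # step 6: u2 evaporates with memory
# Step 7: L diffs from F's last durable update (u3) and sends nothing more,
# so u2 is lost for good: the data-loss failure in the classification above.
assert "u2" not in f.mem and "u2" not in f.disk
```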
ASPLOS '16 12
(TaxDC taxonomy overview, as on slide 9)
ASPLOS '16 13
(TaxDC taxonomy, Trigger branch highlighted)
Trigger: the conditions that make bugs happen
ASPLOS '16 14
(TaxDC taxonomy, Trigger > Timing highlighted)
What: the untimely moment that makes a bug happen
Why: helps design bug-detection tools
ASPLOS '16 15
Trigger > Timing > Message: "Does the timing involve many messages?" Ex: MapReduce-3274
ASPLOS '16 16
Trigger > Timing > Message: order violation (44%)
Two events X and Y: Y must happen after X, but Y happens before X. Ex: MapReduce-3274
ASPLOS '16 17
Trigger > Timing > Message: order violation (44%), msg-msg race
Ex: MapReduce-3274 (a Kill message races with a Submit message)
ASPLOS '16 18
Msg-msg race sub-patterns: receive-receive race, send-send race, receive-send race
(Diagrams: an HBase example where a new key arrives late after the old key has expired, and MapReduce examples where a kill races with an end report: "kill what job?")
ASPLOS '16 19
Trigger > Timing > Message: order violation (44%): msg-msg race, msg-compute race
Order violation: two events X and Y; Y must happen after X, but Y happens before X. Ex: MapReduce-4157
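As a concrete illustration of an order violation, here is a small, hypothetical Python sketch in the spirit of the MapReduce examples above (TaskTracker, on_register, on_kill, and the task id are invented names, not Hadoop's real API):

```python
class TaskTracker:
    """Invented stand-in for a worker node's task table."""
    def __init__(self):
        self.running = {}

    def on_register(self, task_id):
        # X: the task announces itself and starts running.
        self.running[task_id] = "RUNNING"

    def on_kill(self, task_id):
        # Y: assumed to arrive after X. If Y races ahead of X, there is
        # nothing to kill yet, and the task later runs as a zombie.
        self.running.pop(task_id, None)

t = TaskTracker()
t.on_kill("task-1")        # Y delivered first: the unexpected order
t.on_register("task-1")    # X delivered second
assert t.running["task-1"] == "RUNNING"   # order violation: zombie task
```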
ASPLOS '16 20
Trigger > Timing > Message: atomicity violation (20%)
A message comes in the middle of an atomic operation. Ex: Cassandra-1011, HBase-4729, MapReduce-5009, ZooKeeper-1496
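A minimal, hypothetical sketch of this pattern (the RegionServer, open_region, and split-request names are invented, loosely in the spirit of the HBase example; not any system's real code). The key point is that the "atomic" operation spans two event-loop steps, and a message handler can fire between them:

```python
class RegionServer:
    """Toy event-loop node: handlers run to completion, but a multi-step
    operation may be split across events, leaving a window in between."""
    def __init__(self):
        self.regions = {}

    def open_region_part1(self, name):
        self.regions[name] = "INIT"      # first half of the "atomic" open

    def open_region_part2(self, name):
        self.regions[name] = "OPEN"      # second half

    def on_split_request(self, name):
        # Bug: this handler assumes any region it sees is fully open.
        if self.regions.get(name) != "OPEN":
            raise RuntimeError("split arrived mid-open: inconsistent state")
        # ... split logic would go here ...

rs = RegionServer()
rs.open_region_part1("r1")
try:
    rs.on_split_request("r1")   # message arrives between the two halves
except RuntimeError as e:
    print("atomicity violation:", e)
rs.open_region_part2("r1")
```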
ASPLOS '16 21
Trigger > Timing: fault timing (21%)
A fault (e.g., a crash) at a specific timing. Ex: Cassandra-6415, HBase-5806, MapReduce-3858, ZooKeeper-1653
There is no fault timing in LC bugs; it appears only in DC bugs
ASPLOS '16 22
Trigger > Timing: reboot timing (11%)
A reboot at a specific timing. Ex: Cassandra-2083, Hadoop-3186, MapReduce-5489, ZooKeeper-975
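A tiny, hypothetical sketch of why fault and reboot timing matter: the same crash is harmless at one point and buggy at another (the Node class, commit_pair, and the A/B keys are invented):

```python
class Node:
    """Toy node: self.disk survives a crash; nothing else does."""
    def __init__(self):
        self.disk = {}

    def commit_pair(self, crash_between=False):
        self.disk["A"] = True
        if crash_between:                 # a crash at this exact point...
            raise SystemExit("crash")
        self.disk["B"] = True             # ...leaves A without B

n = Node()
try:
    n.commit_pair(crash_between=True)     # fault injected at the bad timing
except SystemExit:
    pass                                  # reboot: disk survives as-is
# Recovery code that assumes A and B are always written together now
# mis-recovers; the same crash before "A" or after "B" would be harmless.
assert "A" in n.disk and "B" not in n.disk
```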
ASPLOS '16 23
Trigger > Timing: mixed timing (4%)
Ex: the ZooKeeper example above combines an atomicity violation (the update in step 3 arrives in the middle of the snapshot sync) with fault timing (the crash in step 6), leading to the data-loss failure.
ASPLOS '16 24
Trigger > Timing summary: message timing, fault timing, reboot timing
Implication: these simple patterns can inform pattern-based bug-detection tools, etc.
ASPLOS '16 25
(TaxDC taxonomy, Trigger > Input highlighted)
What: the input needed to exercise the buggy code
Why: improve testing coverage
ASPLOS '16 26
Trigger > Input, on the ZooKeeper example (steps 1-7 above): the fault & reboot input is 2 crashes and 2 reboots
ASPLOS '16 27
Trigger > Input: fault
"How many bugs require fault injection?" 37% = no fault, 63% = yes
"What kind of fault, and how many times?" Timeouts: 88% = none, 12% = some. Crashes: 53% = none, 35% = 1 crash, 12% = multiple crashes
Real-world DC bugs are NOT just about message re-ordering, but about faults as well
ASPLOS '16 28
Trigger > Input: reboot
"How many reboots?" 73% = no reboot, 20% = 1 reboot, 7% = multiple reboots
ASPLOS '16 29
Trigger > Input: workload. Ex: a Cassandra Paxos bug (Cassandra-6023) needs 3 concurrent user requests!
"How many protocols must run as input?" 20% = 1 protocol, 80% = 2+ protocols
Implication: DC testing should drive multiple protocols concurrently
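A minimal, hypothetical sketch of that implication: a test driver that releases several client requests at once so their protocol messages can interleave, as in the Cassandra-6023 description above (run_concurrently and the toy put requests are invented):

```python
import threading

def run_concurrently(requests):
    """Release all requests at the same instant so their messages interleave."""
    barrier = threading.Barrier(len(requests))
    results, lock, threads = [], threading.Lock(), []

    def worker(req):
        barrier.wait()                 # all requests start together
        r = req()
        with lock:
            results.append(r)

    for req in requests:
        t = threading.Thread(target=worker, args=(req,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

# Toy usage: three racing "user requests" against a fake store.
store = {}
def put(k, v):
    def req():
        store[k] = v
        return (k, v)
    return req

print(run_concurrently([put("a", 1), put("a", 2), put("a", 3)]))
```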
ASPLOS '16 30
(TaxDC taxonomy, Error & Failure branch highlighted)
What: the first effect of the untimely ordering
Why: help failure diagnosis and bug detection
ASPLOS '16 31
Local error: the error can be observed in one triggering node (46%): null pointer, false assertion, etc.
Implication: identifies opportunities for failure diagnosis and bug detection
ASPLOS '16 32
Global error: the error cannot be observed in any single node (54%)
Many are silent errors and hard to diagnose (hidden errors, no error messages, long debugging)
ASPLOS '16 33
(TaxDC taxonomy, Fix branch highlighted)
What: how developers fix these bugs
Why: help design runtime prevention and automatic patch generation
ASPLOS '16 34
Fix: "Are patches complicated? Are patches adding synchronization?"
Complex fixes add global synchronization (similar to fixing LC bugs by adding synchronization, e.g. lock()) or add new states & transitions
ASPLOS '16 35-38
Fix: many patches are simple, following four patterns:
1. Delay: postpone handling the untimely message
2. Ignore/discard: drop the untimely message
3. Retry: re-deliver the message later
4. Accept: make the existing handler (f(msg); g(msg);) tolerate the unexpected order
ASPLOS '16 39
Simple fixes: delay, ignore, retry, accept. 40% of the fixes are easy (no new computation logic)
Implication: many fixes can inform automatic runtime prevention (a sketch of the four patterns follows)
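Here is a minimal, hypothetical sketch of the four patterns applied inside a single message handler (the Node class, epoch field, and ready/stale predicates are all invented; f(msg); g(msg); from the slide is stood in by apply()):

```python
import collections

class TransientError(Exception):
    pass

class Node:
    """Hypothetical node; the four fix patterns are the point, not this class."""
    def __init__(self):
        self.queue = collections.deque()
        self.epoch = 1
        self.state = "INIT"

    def ready_for(self, msg):
        return self.state != "INIT"       # precondition for handling

    def is_stale(self, msg):
        return msg["epoch"] < self.epoch  # message from an old epoch

    def handle(self, msg):
        if not self.ready_for(msg):
            self.queue.append(msg)        # (1) delay until preconditions hold
            return
        if self.is_stale(msg):
            return                        # (2) ignore/discard
        try:
            self.apply(msg)               # existing logic (the slide's f/g)
        except TransientError:
            self.queue.append(msg)        # (3) retry later
        # (4) accept: alternatively, rewrite apply() so the unexpected
        #     order becomes a legal transition (this one needs new logic).

    def apply(self, msg):
        self.state = msg["op"]

n = Node()
n.handle({"op": "update", "epoch": 1})    # not ready: delayed, not dropped
n.state = "READY"
while n.queue:
    n.handle(n.queue.popleft())           # re-delivery drains the delayed msg
assert n.state == "update"
```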
ASPLOS '16 40
Fixing DC bugs vs. LC bugs: the simple DC patterns (delay, ignore/discard, retry, accept) contrasted with adding synchronization
ASPLOS '16 41
Research opportunities: distributed system model checkers, formal verification, DC bug detection, runtime failure prevention
ASPLOS '16 42
Distributed system model checkers. Reality: the events include messages, crashes, multiple crashes, reboots, multiple reboots, timeouts, computation, and disk faults
State of the art: MoDist [NSDI '09], Demeter [SOSP '11], MaceMC [NSDI '07], SAMC [OSDI '14]
Let's find out how to re-order all these events without exploding the state space!
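To ground the idea, here is a minimal, hypothetical sketch of the core loop of such a model checker: enumerate event orderings and check an invariant in each state (the explore function, the toy register/kill events, and the invariant are all invented; real checkers like SAMC add semantic-aware reduction precisely because this brute force explodes factorially):

```python
import itertools

def explore(events, apply_event, init_state, invariant):
    """Brute-force all interleavings of pending events; report orderings
    that violate the invariant. Real dmcks prune this n! space."""
    bugs = []
    for order in itertools.permutations(events):
        state = init_state()
        for ev in order:
            apply_event(state, ev)
            if not invariant(state):
                bugs.append(order)
                break
    return bugs

# Toy use: two messages the code silently assumes arrive in order.
def init(): return {"registered": False, "zombie": False}
def step(s, ev):
    if ev == "register": s["registered"] = True
    if ev == "kill" and not s["registered"]: s["zombie"] = True
def ok(s): return not s["zombie"]

print(explore(["register", "kill"], step, init, ok))  # finds ('kill', 'register')
```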
ASPLOS '16 43
Formal verification, state of the art: Verdi [PLDI '15] (Raft update protocol, ~6,000 lines of proof), IronFleet [SOSP '15] (Paxos update, lease-based read/write, ~5,000-10,000 lines of proof); these verify only foreground protocols
Challenge: foreground & background protocol interactions. 20% of the bugs involve 1 protocol, 80% involve 2+ protocols (19% = foreground, 52% = background, 29% = a mix)
Let's find out how to better verify more protocol interactions!
ASPLOS '16 44
DC bug detection. State of the art in LC bug detection: pattern-based detection, error-based detection, statistical bug detection
Opportunities for DC bug detection: pattern-based (message, fault, and reboot timing patterns) and error-based (53% of errors are explicit, 47% silent)
Let's leverage these timing patterns and explicit errors to do DC bug detection!
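As one concrete direction, here is a small, hypothetical sketch of pattern-based detection of msg-msg races: compute vector clocks over a message trace and flag deliveries to the same node whose sends are causally unordered (the trace format, node names, and helper functions are all invented, not any tool's real API):

```python
from collections import defaultdict

def happens_before(a, b):
    """Vector-clock comparison: a -> b iff a <= b componentwise and a != b."""
    return all(t <= b.get(n, 0) for n, t in a.items()) and a != b

def detect_msg_races(trace, nodes):
    """trace: list of (src, dst, label) in delivery order (invented format).
    Flags deliveries at the same node whose sends are causally unordered:
    the network could have delivered them in either order."""
    clock = {n: defaultdict(int) for n in nodes}
    delivered = []
    for src, dst, label in trace:
        clock[src][src] += 1
        send = dict(clock[src])                   # snapshot at send time
        for n, t in send.items():                 # receive merges the clock
            clock[dst][n] = max(clock[dst][n], t)
        delivered.append((dst, label, send))
    races = []
    for i in range(len(delivered)):
        for j in range(i + 1, len(delivered)):
            (d1, m1, c1), (d2, m2, c2) = delivered[i], delivered[j]
            if d1 == d2 and not happens_before(c1, c2) \
                        and not happens_before(c2, c1):
                races.append((m1, m2))
    return races

# Toy trace: a submit and a kill from different nodes race at the tracker.
trace = [("client", "tracker", "submit"), ("master", "tracker", "kill")]
print(detect_msg_races(trace, {"client", "master", "tracker"}))
```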
ASPLOS '16 45
Runtime failure prevention. State of the art in LC bug prevention: Deadlock Immunity [OSDI '08], Aviso [ASPLOS '13], ConAir [ASPLOS '13], etc.
Opportunity for DC bug prevention: 40% of the fixes are simple, 60% complex
Let's build runtime prevention techniques that leverage this simplicity!
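A minimal, hypothetical sketch of such a technique: an interposition layer that, given a known racy ordering pattern, holds back the untimely message until its precondition holds, automating the "delay" fix that many of the studied patches apply by hand (the Interposer class, message format, and precondition are all invented):

```python
class Interposer:
    """Toy message-delivery shim: delays messages whose precondition
    (a known safe-ordering pattern) does not yet hold."""
    def __init__(self, precondition):
        self.precondition = precondition
        self.held = []
        self.seen = set()

    def deliver(self, msg, handler):
        self.seen.add(msg["type"])
        if self.precondition(msg, self.seen):
            handler(msg)
            self.flush(handler)            # a delivery may unblock held msgs
        else:
            self.held.append(msg)          # delay the untimely message

    def flush(self, handler):
        ready = [m for m in self.held if self.precondition(m, self.seen)]
        self.held = [m for m in self.held if m not in ready]
        for m in ready:
            handler(m)

# Known pattern: a "kill" must not be handled before its "register".
pre = lambda m, seen: m["type"] != "kill" or "register" in seen
ix = Interposer(pre)
log = []
ix.deliver({"type": "kill"}, log.append)       # held back
ix.deliver({"type": "register"}, log.append)   # delivered, then kill flushed
assert [m["type"] for m in log] == ["register", "kill"]
```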
ASPLOS '16 46
"Why seriously address DC bugs now?" Everything is distributed and large-scale; DC bugs are not uncommon!
"Why is tackling DC bugs possible now?"
1. Open access to source code
2. Pervasive documentation
3. Detailed bug descriptions