Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan

Presentation transcript:

Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach. Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan

Failure Reproduction is Time-Consuming: $156 Billion in 2013. Reproduction serves two purposes: understand the bug, verify the fix.

A Real-World Debugging Procedure (HDFS-6130)
Reporter: Provided error description, log & reproduction steps.
Developer: Can't reproduce.
Reporter: Added some reproduction steps.
Developer: Still cannot reproduce. Could you upload fsimage?
Reporter: Fsimage uploaded. Revised reproduction steps.
Developer: Still cannot reproduce.
. . . [5 days, 29 discussions]
Developer: Reproduced… [after another 8 minutes] Posted the working patch.

Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log. Non-intrusive. Works on JVM-based systems.
Command Line Input (HDFS-4022):
./pensieve -jar ./hadoop-hdfs-2.0.0-alpha.jar // Java bytecode
           -log ./HDFS-logs/ // failure logs
           -error ./HDFS-logs/datanode-2.log#800 // symptoms
datanode-2.log, line 800: ERROR: invalid block

Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log. Non-intrusive. Works on JVM-based systems.
Output Unit Test (HDFS-4022):
initCluster(1); // start 1-machine cluster
create("a.txt",2); // create 2-replica file
append("a.txt","X"); // append to file
addDataNode(); // add a datanode on the fly

Existing Solutions Are Limited
Record-and-replay (deterministic replay). Intrusive: modifies existing software stack. Incurs performance overhead.
Symbolic execution, e.g., ESD [Zamfir EuroSys'10], SherLog [Yuan ASPLOS'10]. Pros: precise & non-intrusive. Cons: hard to scale to large systems.

Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
for(int i=0;i<blocks.length;i++){
  if(blocks[i].genStamp!=VALID_GS)
    log("invalid block. . .");
}
Symbolic execution enumerates every possible execution path, so the condition blocks[i].genStamp!=VALID_GS expands to:
blocks[0].genStamp!=VALID_GS
OR blocks[1].genStamp!=VALID_GS && blocks[0].genStamp==VALID_GS
OR . . .

Scalability of Symbolic Execution (simplified code snippet for HDFS-4022, as above)
Symbolic execution enumerates every possible execution path.
Failure: HDFS-4022
Total branch instructions: 72,943,652
S.E. stopped at (instructions): 693
Condition size: 109,018,324
Pensieve instructions: 166
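The expansion of the loop's path condition shown on the symbolic execution slides can be sketched in a few lines of Java. This is only an illustration of why the formula grows, not how any symbolic execution engine represents constraints; the class and method names are hypothetical, and it enumerates only the paths on which the check first fails at index k.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: one disjunct per loop iteration k, meaning
// "the check fails at index k and passed at every earlier index".
final class PathConditionSketch {

    static List<String> firstFailureConditions(int blockCount) {
        List<String> disjuncts = new ArrayList<>();
        for (int k = 0; k < blockCount; k++) {
            StringBuilder condition = new StringBuilder();
            condition.append("blocks[").append(k).append("].genStamp!=VALID_GS");
            // Every earlier block must have passed the check.
            for (int j = k - 1; j >= 0; j--) {
                condition.append(" && blocks[").append(j).append("].genStamp==VALID_GS");
            }
            disjuncts.add(condition.toString());
        }
        return disjuncts;
    }

    // Number of atomic comparisons across all disjuncts: n(n+1)/2.
    static long totalAtoms(int blockCount) {
        return (long) blockCount * (blockCount + 1) / 2;
    }

    public static void main(String[] args) {
        for (String disjunct : firstFailureConditions(3)) {
            System.out.println(disjunct);
        }
        System.out.println("atoms for 1000 blocks: " + totalAtoms(1000));
    }
}
```

Even this restricted enumeration grows quadratically with the number of blocks, and a real engine must also track every other branch executed before the loop.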

Core Idea – Partial Trace Observation Developers almost never debug a failure by reconstructing its complete execution path. Instead, they construct a simplified trace which only contains events that are likely to be causally relevant to the failure.

How do developers debug HDFS-4022? for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

How do developers debug HDFS-4022? if(this.stage==APPEND){ log("Appending" + b); b.genStamp=newGS; } void rollLog(. . .){ b.genStamp=logGS; } for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

How do developers debug HDFS-4022? if(this.stage==APPEND){ log("Appending" + b); b.genStamp=newGS; } for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

How do developers debug HDFS-4022? Client.java void append(. . .){ stage=APPEND; . . . } if(this.stage==APPEND){ log("Appending" + b); b.genStamp=newGS; } for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

How do developers debug HDFS-4022? Network serialization de-serialization Client.java DataNode.java void append(. . .){ stage=APPEND; . . . } if(this.stage==APPEND){ log("Appending" + b); b.genStamp=newGS; } for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

How do developers debug HDFS-4022? Network serialization de-serialization Client.java DataNode.java void append(. . .){ stage=APPEND; . . . } if(this.stage==APPEND){ log("Appending" + b); b.genStamp=newGS; } Got one user command (append) by looking at 8 instructions! for(int i=0;i<blocks.length;i++) if(blocks[i].genStamp!=VALID_GS) log("invalid block. . .");

Event Chaining Approach
Event – a point in time during execution
Location event – a program location is reached
Condition event – a condition holds
Invocation event – a function is invoked

Event Chaining Approach Location event Condition event Invocation event An event is explained by other events Location event → path conditions 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); 888: ERROR: invalid block… e1:line2(L2) datanode-2.log

Event Chaining Approach Location event Condition event Invocation event An event is explained by other events Location event → path conditions 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); 888: ERROR: invalid block… e2:blocks[i].genStamp!=VALID_GS@L1 e1:L2 datanode-2.log

Event Chaining Approach Location event Condition event Invocation event An event is explained by other events Condition event → definitions 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); e3:L5 888: ERROR: invalid block… e2:blocks[i].genStamp!=VALID_GS@L1 e1:L2 datanode-2.log

Event Chaining Approach Location event Condition event Invocation event An event is explained by other events 6.void append(. . .){ 7. stage=APPEND; … . . . … } 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } e5:L7 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); e4:this.stage==APPEND@L3 e3:L5 888: ERROR: invalid block… e2:blocks[i].genStamp !=VALID_GS@L1 e1:L2 datanode-2.log

Event Chaining Approach Location event Condition event Invocation event An event is explained by other events Location event → function invocation 6.void append(. . .){ 7. stage=APPEND; … . . . … } 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } e6:append() e5:L7 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); e4:this.stage==APPEND@L3 e3:L5 888: ERROR: invalid block… e2:blocks[i].genStamp!=VALID_GS@L1 e1:L2 datanode-2.log

Event Chaining Approach Location event Condition event Invocation event Captures dependency on shared variables 6.void append(. . .){ 7. stage=APPEND; … . . . … } 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } Thread 1 e6:append() Thread 2 e5:L7 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . ."); e4:this.stage==APPEND@L3 e3:L5 888: ERROR: invalid block… e2:blocks[i].genStamp !=VALID_GS@L1 e1:L2 datanode-2.log
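The chain e1 through e6 built up on these slides can be mimicked with a small Java sketch. It assumes a hand-written "explained by" table in place of the bytecode analysis Pensieve actually performs, and every class, method, and field name here is hypothetical; only the event labels come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of backward event chaining for HDFS-4022.
final class EventChainSketch {
    enum Kind { LOCATION, CONDITION, INVOCATION }

    static final class Event {
        final Kind kind;
        final String label;

        Event(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }

        @Override
        public String toString() {
            return kind + ":" + label;
        }
    }

    // Assumption: a hand-written table standing in for static analysis.
    // Key = event label, value = the event that explains it.
    static Map<String, Event> handWrittenExplanations() {
        Map<String, Event> explainedBy = new HashMap<>();
        explainedBy.put("L2", new Event(Kind.CONDITION, "blocks[i].genStamp!=VALID_GS@L1"));
        explainedBy.put("blocks[i].genStamp!=VALID_GS@L1", new Event(Kind.LOCATION, "L5"));
        explainedBy.put("L5", new Event(Kind.CONDITION, "this.stage==APPEND@L3"));
        explainedBy.put("this.stage==APPEND@L3", new Event(Kind.LOCATION, "L7"));
        explainedBy.put("L7", new Event(Kind.INVOCATION, "append()"));
        return explainedBy;
    }

    // Walks backward from the symptom until an invocation event (a user
    // command) is reached or no further explanation is known.
    static List<Event> chainFrom(Event symptom, Map<String, Event> explainedBy) {
        List<Event> chain = new ArrayList<>();
        Event current = symptom;
        while (current != null) {
            chain.add(current);
            if (current.kind == Kind.INVOCATION) {
                break;
            }
            current = explainedBy.get(current.label);
        }
        return chain;
    }

    public static void main(String[] args) {
        Event symptom = new Event(Kind.LOCATION, "L2"); // the logged error
        for (Event event : chainFrom(symptom, handWrittenExplanations())) {
            System.out.println(event);
        }
    }
}
```

The walk alternates between location and condition events and stops at the invocation of append(), the user command the slides arrive at.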

Forking for Multiple Possibilities e2:blocks[i].genStamp !=VALID_GS@L1 e1:L2 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");

Forking for Multiple Possibilities e7:L8 e3:L5 fork e2:blocks[i].genStamp !=VALID_GS@L1 e2 e1 e1:L2 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } … void rollLog(. . .){ 8. b.genStamp=logGS; … } 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");

Priority based scheduling e7:L8 P:500 e3:L5 P:500 P:1000 fork e2:blocks[i].genStamp !=VALID_GS@L1 e2 e1 e1:L2

Priority based scheduling e7:L8 P:500 e3:L5 P:500 e2 e2:blocks[i].genStamp !=VALID_GS@L1 e1 e1:L2 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } … void rollLog(. . .){ 8. b.genStamp=logGS; … } 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");

Priority based scheduling e7:L8 P:0 e3:L5 P:1000 e2 e2:blocks[i].genStamp!=VALID_GS@L1 e1 e1:L2 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } Log files: Appending… … void rollLog(. . .){ 8. b.genStamp=logGS; … } 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");

Priority based scheduling Favors event chains with most matched logs Favors simpler reproduction paths
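The two scheduling preferences can be sketched as a comparator over forked chains. The slides show priority values (0, 500, 1000) but not the formula behind them, so this sketch assumes a simpler ordering: more matched log messages first, then shorter chains. All names are hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical illustration of scheduling forked event chains.
final class ChainSchedulerSketch {

    static final class Chain {
        final String name;
        final int matchedLogs; // log messages on this chain found in the failure logs
        final int length;      // number of events, a stand-in for reproduction complexity

        Chain(String name, int matchedLogs, int length) {
            this.name = name;
            this.matchedLogs = matchedLogs;
            this.length = length;
        }
    }

    // Assumed ordering: most matched logs first, ties broken by shorter chain.
    static PriorityQueue<Chain> newQueue() {
        Comparator<Chain> order = Comparator
                .comparingInt((Chain c) -> c.matchedLogs).reversed()
                .thenComparingInt(c -> c.length);
        return new PriorityQueue<>(order);
    }

    public static void main(String[] args) {
        PriorityQueue<Chain> queue = newQueue();
        queue.add(new Chain("e7:L8 (rollLog)", 0, 3));
        queue.add(new Chain("e3:L5 (append stage)", 1, 3)); // "Appending" appears in the logs
        System.out.println("explore first: " + queue.poll().name);
    }
}
```

With the "Appending" message present in the logs, the chain through e3:L5 is explored before the one through rollLog, matching the outcome on the slide.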

Eliminating Infeasible Event Chains Path conditions: variable substitution, logical conjunction. blocks[i].genStamp!=VALID_GS@L1 newGS!=VALID_GS@L5 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } e3:L5 e2:blocks[i].genStamp!=VALID_GS@L1 e1:L2 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");
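A rough Java sketch of the substitution-and-conjunction check follows. It assumes the value assigned at the definition is a known concrete number, whereas the slide reasons about it symbolically (newGS!=VALID_GS); the constant value and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration: substitute the defined value into the
// conditions collected along a chain and test their conjunction.
final class FeasibilitySketch {
    static final int VALID_GS = 7; // assumed constant, for illustration only

    // Conditions the chain places on genStamp (here only the one from L1).
    static List<IntPredicate> failureConditions() {
        List<IntPredicate> conditions = new ArrayList<>();
        conditions.add(genStamp -> genStamp != VALID_GS);
        return conditions;
    }

    // Variable substitution: replace genStamp by the value assigned at the
    // definition. Logical conjunction: every condition must hold.
    static boolean feasible(int definedValue, List<IntPredicate> conditions) {
        for (IntPredicate condition : conditions) {
            if (!condition.test(definedValue)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // b.genStamp = newGS with newGS = 8: the chain stays feasible.
        System.out.println(feasible(8, failureConditions()));
        // A definition that assigns VALID_GS can never explain the error.
        System.out.println(feasible(VALID_GS, failureConditions()));
    }
}
```

A chain whose substituted conjunction can never hold is dropped rather than explored further.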

Skip Less-Relevant Loops Skips loops when there's no loop-carried dependency (77% of randomly sampled loops in HDFS). Follows loop iterations otherwise. 3.if(this.stage==APPEND){ 4. log("Appending" + b); 5. b.genStamp=newGS; … } e3:L5 e2:blocks[i].genStamp!=VALID_GS@L1 e1:L2 0.for(int i=0;i<blocks.length;i++) 1. if(blocks[i].genStamp!=VALID_GS) 2. log("invalid block. . .");
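The distinction the slide relies on can be illustrated with two hypothetical Java loops. How Pensieve classifies loops is not described in the slides, so this only shows what "loop-carried dependency" means for the checked condition.

```java
// Hypothetical illustration of loops with and without a loop-carried
// dependency on the value being checked.
final class LoopDependencySketch {

    // No loop-carried dependency: each iteration checks its own element,
    // so explaining one failing check does not require earlier iterations.
    static boolean anyInvalid(int[] genStamps, int validGs) {
        for (int i = 0; i < genStamps.length; i++) {
            if (genStamps[i] != validGs) {
                return true;
            }
        }
        return false;
    }

    // Loop-carried dependency: 'total' accumulates across iterations, so the
    // check in one iteration depends on what earlier iterations did.
    static boolean exceedsQuota(int[] sizes, int quota) {
        int total = 0;
        for (int i = 0; i < sizes.length; i++) {
            total += sizes[i];
            if (total > quota) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyInvalid(new int[] {7, 7, 9}, 7));
        System.out.println(exceedsQuota(new int[] {4, 4, 4}, 10));
    }
}
```

The first loop has the same shape as the HDFS-4022 snippet; the second is the kind of loop whose iterations would have to be followed.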

Verification & Testcase Refinement Sacrifices precision for efficiency Event Chain e3: i=2; e2: i>1 e1: log(“ERROR”);

Verification & Testcase Refinement Sacrifices precision for efficiency Event Chain Execution e3: i=2; e2: i>1 e1: log(“ERROR”);

Verification & Testcase Refinement Sacrifices precision for efficiency Event Chain Execution e3: i=2; e2: i>1 X Diverged! … e1: log(“ERROR”);

Verification & Testcase Refinement Sacrifices precision for efficiency e3: i=2 Event Chain Execution e5: N>0 e4: i=0; e2: i>1 X Diverged! … e1: log(“ERROR”);

Verification & Testcase Refinement Sacrifices precision for efficiency e3: i=2 Event Chain e3: i=2 Execution e5: N>0 e6: N<=0 e4: i=0; e2: i>1 e2: i>1 X e1: log(“ERROR”); Diverged! … e1: log(“ERROR”);

Verification & Testcase Refinement Variable modified in a different thread (in paper) e3: i=2 Event Chain e3: i=2 Execution e5: N>0 e6: N<=0 e4: i=0; e2: i>1 e2: i>1 X e1: log(“ERROR”); Diverged! … e1: log(“ERROR”);
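The verify-then-refine loop on these slides can be sketched in Java under strong assumptions: the program under test is a hypothetical three-line method reconstructed from the event labels (i=2, N>0, i=0, i>1), and the refinement step is hard-coded to add the condition e6 (N<=0) rather than derived from a divergence analysis. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration of verification and test case refinement.
final class RefinementSketch {

    // Assumed program: i=2; if (N>0) i=0; if (i>1) log("ERROR");
    static boolean logsError(int n) {
        int i = 2;       // e3
        if (n > 0) {     // e5
            i = 0;       // e4: overwrites the value the chain relied on
        }
        return i > 1;    // e2 guards e1: log("ERROR")
    }

    // Picks the first candidate input satisfying every constraint in the chain.
    static int pickInput(int[] candidates, List<IntPredicate> constraints) {
        for (int candidate : candidates) {
            boolean satisfiesAll = true;
            for (IntPredicate constraint : constraints) {
                if (!constraint.test(candidate)) {
                    satisfiesAll = false;
                    break;
                }
            }
            if (satisfiesAll) {
                return candidate;
            }
        }
        throw new IllegalStateException("no candidate satisfies the event chain");
    }

    // Returns {reproducing input, number of executions tried}.
    static int[] reproduce(int[] candidates) {
        List<IntPredicate> constraints = new ArrayList<>(); // initial chain says nothing about N
        int attempts = 0;
        while (true) {
            int n = pickInput(candidates, constraints);
            attempts++;
            if (logsError(n)) {
                return new int[] {n, attempts};
            }
            // Diverged: hard-coded refinement adds e6 (N<=0) to the chain.
            constraints.add(value -> value <= 0);
        }
    }

    public static void main(String[] args) {
        int[] result = reproduce(new int[] {1, 0, -1});
        System.out.println("N=" + result[0] + " after " + result[1] + " executions");
    }
}
```

The first execution diverges because the chain omitted the branch on N; the refined chain excludes that branch and the second execution reaches the error.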

Evaluation
Evaluated on 18 cases from JVM distributed systems (HDFS, HBase, ZooKeeper, Cassandra), with noisy logs generated from manual reproduction.
Overall result: successfully reproduces 72% of the cases; finishes analysis within 10 min.
Scalability result: average # of events in an event chain: 105.2; # of forked event chains: 1367.2.

Case Study: HDFS-6130 Useful for hard bugs path conditions inferred fsimage.layoutVersion!=TXID_LAYOUT // Use old fsimage initCluster(UPGRADE); restartCluster(); initCluster(UPGRADE); restartCluster(); Developers’ reproduction Pensieve’s reproduction

Case Study: HDFS-4022
Finds a different reproduction than the developers'.
Developers' reproduction: initCluster(4); create("a.txt",3); stopDataNode(3); append("a.txt","data");
Pensieve's reproduction (fewer nodes!): initCluster(3); setConfig("policy", "ALWAYS"); create("a.txt",2); stopDataNode(1); append("a.txt","data");

Limitations
Error is not logged (e.g., silent data loss).
Bugs involving resource exhaustion.
Systems need to have clearly defined input events (e.g., not for compilers).

Related Work
Static program slicing [Weiser81]: obtains static trace but not dynamic partial trace.
Symbolic execution based approaches: ESD, SherLog.
Record-and-replay based approaches: BugRedux [Jin ICSE'12], etc.

Conclusion
Pensieve: automated failure reproduction.
Based on Partial Trace Observation.
Scales to real-world distributed systems.
Non-intrusive and relies on logs.
Pensieve leverages the natural way human beings do failure reproduction.
Thanks!