Published by Hillary Ryan. Modified over 6 years ago.
1
Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach
Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan
2
Failure Reproduction is Time-Consuming
$156 billion in 2013
Understand bug
Verify fix
4
A Real-World Debugging Procedure
HDFS-6130
Reporter: Provided error description, log & reproduction steps.
Developer: Can’t reproduce.
Reporter: Added some reproduction steps.
Developer: Still cannot reproduce. Could you upload fsimage?
Reporter: Fsimage uploaded. Revised reproduction steps.
Developer: Still cannot reproduce.
. . . [5 days, 29 discussions]
Developer: Reproduced…
[after another 8 minutes]
Developer: Posted the working patch.
7
Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM-based systems
Command Line Input (HDFS-4022)
  ./pensieve -jar ./hadoop-hdfs alpha.jar            // Java bytecode
             -log ./HDFS-logs/                       // failure logs
             -error ./HDFS-logs/datanode-2.log#800   // symptoms
datanode-2.log
  . . .
  800: ERROR: invalid block
8
Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM-based systems
Output Unit Test (HDFS-4022)
  initCluster(1);         // start 1-machine cluster
  create("a.txt", 2);     // create 2-replica file
  append("a.txt", "X");   // append to file
  addDataNode();          // add a datanode on the fly
9
Existing Solutions Are Limited
Record-and-replay (deterministic replay)
  Intrusive: modifies existing software stack
  Incurs performance overhead
Symbolic execution
  E.g., ESD [Zamfir EuroSys’10], SherLog [Yuan ASPLOS’10]
  Pros: precise & non-intrusive
  Cons: hard to scale to large systems
13
Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022
  for (int i = 0; i < blocks.length; i++) {
    if (blocks[i].genStamp != VALID_GS)
      log("invalid block. . .");
  }
Enumerates every possible execution path
Path condition for blocks[i].genStamp != VALID_GS:
  blocks[0].genStamp != VALID_GS
  OR blocks[1].genStamp != VALID_GS && blocks[0].genStamp == VALID_GS
  OR . . .
14
Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022
  for (int i = 0; i < blocks.length; i++) {
    if (blocks[i].genStamp != VALID_GS)
      log("invalid block. . .");
  }
Enumerates every possible execution path
Failure: HDFS-4022
  Total branch instructions: 72,943,652
  S.E. stopped at (instructions): 693
  Condition size: 109,018,324
  Pensieve (instructions): 166
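A rough way to see why path enumeration blows up on this loop is to count branch outcomes and to print the disjuncts of the path condition shown on the slide. The Java sketch below does only that. The class and method names are hypothetical, it is not taken from Pensieve or from any symbolic-execution engine, and it assumes that each iteration's `if` can go either way independently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not from Pensieve or any S.E. tool.
class PathEnumerationSketch {

    // Each iteration's `if` is either taken or not, so a loop over n blocks
    // has 2^n distinct branch-outcome sequences (valid for 0 <= n <= 62).
    static long countPaths(int n) {
        return 1L << n;
    }

    // Builds one disjunct per iteration k: "block k is the first invalid block".
    static List<String> firstInvalidDisjuncts(int n) {
        List<String> disjuncts = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            StringBuilder sb = new StringBuilder();
            sb.append("blocks[").append(k).append("].genStamp != VALID_GS");
            for (int j = k - 1; j >= 0; j--) {
                sb.append(" && blocks[").append(j).append("].genStamp == VALID_GS");
            }
            disjuncts.add(sb.toString());
        }
        return disjuncts;
    }

    public static void main(String[] args) {
        System.out.println("paths for 10 blocks: " + countPaths(10));
        for (String d : firstInvalidDisjuncts(3)) {
            System.out.println("OR " + d);
        }
    }
}
```

The number of disjuncts grows with the array length and the number of paths grows exponentially, which is consistent with the slide's point that the condition becomes very large long before the full execution is covered.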
15
Core Idea – Partial Trace Observation
Developers almost never debug a failure by reconstructing its complete execution path. Instead, they construct a simplified trace which only contains events that are likely to be causally relevant to the failure.
17
How do developers debug HDFS-4022?
  if (this.stage == APPEND) {
    log("Appending" + b);
    b.genStamp = newGS;
  }
  void rollLog(. . .) {
    b.genStamp = logGS;
  }
  for (int i = 0; i < blocks.length; i++)
    if (blocks[i].genStamp != VALID_GS)
      log("invalid block. . .");
23
How do developers debug HDFS-4022?
Client.java
  void append(. . .) {
    stage = APPEND;
    . . .
  }
Network (serialization, de-serialization)
DataNode.java
  if (this.stage == APPEND) {
    log("Appending" + b);
    b.genStamp = newGS;
  }
  for (int i = 0; i < blocks.length; i++)
    if (blocks[i].genStamp != VALID_GS)
      log("invalid block. . .");
Got one user command (append) by looking at 8 instructions!
24
Event Chaining Approach
Event – a point in time during execution
  Location event – a program location reached
  Condition event – a condition holds
  Invocation event – a function invoked
25
Event Chaining Approach
Legend: location event, condition event, invocation event
An event is explained by other events
  Location event → path conditions
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
datanode-2.log: 888: ERROR: invalid block…
Event chain: e1: line 2 (L2)
26
Event Chaining Approach
Legend: location event, condition event, invocation event
An event is explained by other events
  Location event → path conditions
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
datanode-2.log: 888: ERROR: invalid block…
Event chain: e2: blocks[i].genStamp → e1: L2
27
Event Chaining Approach
Legend: location event, condition event, invocation event
An event is explained by other events
  Condition event → definitions
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
datanode-2.log: 888: ERROR: invalid block…
Event chain: e3: L5 → e2: blocks[i].genStamp → e1: L2
29
Event Chaining Approach
Legend: location event, condition event, invocation event
An event is explained by other events
  Location event → function invocation
6. void append(. . .) {
7.   stage = APPEND; … …
   }
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
datanode-2.log: 888: ERROR: invalid block…
Event chain: e6: append() → e5: L7 → e3: L5 → e2: blocks[i].genStamp → e1: L2
30
Event Chaining Approach
Legend: location event, condition event, invocation event
Captures dependency on shared variables
6. void append(. . .) {
7.   stage = APPEND; … …
   }
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
datanode-2.log: 888: ERROR: invalid block…
Event chain (spanning Thread 1 and Thread 2): e6: append() → e5: L7 → e3: L5 → e2: blocks[i].genStamp → e1: L2
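The chaining idea on these slides can be pictured as a backward walk from the symptom, where each event points to one event that explains it. The Java sketch below is a minimal illustration under stated assumptions: the `Event` type, the `explainedBy` map and all method names are hypothetical and are not Pensieve's data structures. The slides do not show an event e4, so e3 is linked directly to e5 here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the types and the explainedBy map are hypothetical.
class EventChainSketch {

    enum Kind { LOCATION, CONDITION, INVOCATION }

    static final class Event {
        final String id;
        final Kind kind;
        final String what;

        Event(String id, Kind kind, String what) {
            this.id = id;
            this.kind = kind;
            this.what = what;
        }

        @Override
        public String toString() {
            return id + ":" + what;
        }
    }

    // Walks backward from the symptom, following one chosen explanation per event,
    // until an event has no recorded explanation. The size guard stops a cyclic map.
    static List<Event> chainFrom(Event symptom, Map<String, Event> explainedBy) {
        List<Event> chain = new ArrayList<>();
        Event current = symptom;
        while (current != null && chain.size() <= explainedBy.size()) {
            chain.add(current);
            current = explainedBy.get(current.id);
        }
        return chain;
    }

    // Event labels copied from the HDFS-4022 slides; e4 is not shown there.
    static List<Event> hdfs4022Chain() {
        Event e1 = new Event("e1", Kind.LOCATION, "L2");
        Event e2 = new Event("e2", Kind.CONDITION, "blocks[i].genStamp");
        Event e3 = new Event("e3", Kind.LOCATION, "L5");
        Event e5 = new Event("e5", Kind.LOCATION, "L7");
        Event e6 = new Event("e6", Kind.INVOCATION, "append()");

        Map<String, Event> explainedBy = new HashMap<>();
        explainedBy.put("e1", e2); // location explained by its path condition
        explainedBy.put("e2", e3); // condition explained by a definition of the variable
        explainedBy.put("e3", e5); // assumption: linked directly because e4 is not on the slides
        explainedBy.put("e5", e6); // location explained by the invocation of its function
        return chainFrom(e1, explainedBy);
    }

    public static void main(String[] args) {
        System.out.println(hdfs4022Chain());
    }
}
```

The walk ends at the invocation event `append()`, which is the user command the developers arrived at in the manual debugging example.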
32
Forking for Multiple Possibilities
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
   …
   void rollLog(. . .) {
8.   b.genStamp = logGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
Event chain before the fork: e2: blocks[i].genStamp → e1: L2
Fork (one chain per definition of genStamp):
  e3: L5 → e2 → e1
  e7: L8 → e2 → e1
33
Priority based scheduling
Event chain before the fork: e2: blocks[i].genStamp → e1: L2 (P:1000)
Fork:
  e3: L5 → e2 → e1 (P:500)
  e7: L8 → e2 → e1 (P:500)
34
Priority based scheduling
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
   …
   void rollLog(. . .) {
8.   b.genStamp = logGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
Forked event chains:
  e3: L5 → e2: blocks[i].genStamp → e1: L2 (P:500)
  e7: L8 → e2: blocks[i].genStamp → e1: L2 (P:500)
35
Priority based scheduling
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
   …
   void rollLog(. . .) {
8.   b.genStamp = logGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
Log files: . . . Appending…
Forked event chains:
  e3: L5 → e2: blocks[i].genStamp → e1: L2 (P:1000)
  e7: L8 → e2: blocks[i].genStamp → e1: L2 (P:0)
36
Priority based scheduling
Favors event chains with most matched logs
Favors simpler reproduction paths
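The priority numbers on the preceding slides (1000 before the fork, 500 for each forked chain, then 1000 and 0 once the "Appending" log message is matched) can be mimicked with a small priority queue. The Java sketch below is only an illustration: the even split on fork and the "matched chain takes everything" rule are assumptions chosen to reproduce those numbers, not Pensieve's actual scoring, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: the priority rules below are assumptions, not Pensieve's implementation.
class PrioritySchedulingSketch {

    static final class Chain {
        final String head;
        int priority;

        Chain(String head, int priority) {
            this.head = head;
            this.priority = priority;
        }
    }

    // Assumed rule: forking splits the parent's priority evenly among the alternatives.
    static List<Chain> fork(int parentPriority, String... heads) {
        List<Chain> children = new ArrayList<>();
        for (String head : heads) {
            children.add(new Chain(head, parentPriority / heads.length));
        }
        return children;
    }

    // Assumed rule: the alternative supported by a matched log message takes the
    // parent's whole priority and its siblings drop to 0.
    static void applyLogEvidence(List<Chain> siblings, String matchedHead, int parentPriority) {
        for (Chain c : siblings) {
            c.priority = c.head.equals(matchedHead) ? parentPriority : 0;
        }
    }

    // Picks the chain to analyze next: highest priority first (null if there is none).
    static Chain next(List<Chain> chains) {
        PriorityQueue<Chain> queue =
                new PriorityQueue<>(Comparator.comparingInt((Chain c) -> c.priority).reversed());
        queue.addAll(chains);
        return queue.poll();
    }

    public static void main(String[] args) {
        List<Chain> forked = fork(1000, "e3:L5", "e7:L8");
        applyLogEvidence(forked, "e3:L5", 1000); // the "Appending" message sits next to L5
        System.out.println("next chain: " + next(forked).head);
    }
}
```

Under these assumptions the chain through `e3:L5` is scheduled first, matching the slide where the `rollLog` alternative drops to P:0.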
38
Eliminating Infeasible Event Chains
Path conditions
  Variable substitution
  Logical conjunction
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
Event chain: e3: L5 → e2: blocks[i].genStamp → e1: L2
39
Skip Less-Relevant Loops
Skips loops when there’s no loop-carried dependency
  77% of randomly sampled loops in HDFS
Follows loop iterations otherwise
3. if (this.stage == APPEND) {
4.   log("Appending" + b);
5.   b.genStamp = newGS; …
   }
0. for (int i = 0; i < blocks.length; i++)
1.   if (blocks[i].genStamp != VALID_GS)
2.     log("invalid block. . .");
Event chain: e3: L5 → e2: blocks[i].genStamp → e1: L2
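The distinction between the two kinds of loops can be shown with two small Java methods. Both are hypothetical examples written for this note, not HDFS code: in the first, each iteration reads only its own element, so an analysis can reason about the iteration of interest without the earlier ones; in the second, a running value is carried from one iteration to the next, so the iterations have to be followed in order.

```java
// Illustrative sketch only: both methods are hypothetical examples, not HDFS code.
class LoopDependencySketch {

    // No loop-carried dependency: iteration i reads only genStamps[i] and writes
    // only invalid[i], so no iteration depends on the result of an earlier one.
    static boolean[] invalidFlags(long[] genStamps, long validGs) {
        boolean[] invalid = new boolean[genStamps.length];
        for (int i = 0; i < genStamps.length; i++) {
            invalid[i] = genStamps[i] != validGs;
        }
        return invalid;
    }

    // Loop-carried dependency: `remaining` is written in one iteration and read in
    // the next, so the outcome at iteration i depends on all earlier iterations.
    static int firstOverflowIndex(int[] sizes, int capacity) {
        int remaining = capacity;
        for (int i = 0; i < sizes.length; i++) {
            remaining -= sizes[i];
            if (remaining < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] flags = invalidFlags(new long[] {7, 9, 7}, 7);
        System.out.println("block 1 invalid: " + flags[1]);
        System.out.println("first overflow at: " + firstOverflowIndex(new int[] {4, 4, 4}, 10));
    }
}
```

The `blocks` loop from HDFS-4022 has the shape of the first method, which is why skipping it loses nothing for the purpose of explaining the error log.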
44
Verification & Testcase Refinement
Sacrifices precision for efficiency
Event chain:          e3: i=2; → e2: i>1 → e1: log("ERROR");
Execution:            e3: i=2 → e5: N>0 → e4: i=0; → e2: i>1 X Diverged! …
Refined event chain:  e3: i=2 → e6: N<=0 → e2: i>1 → e1: log("ERROR");
45
Verification & Testcase Refinement
Variable modified in a different thread (in paper)
Event chain:          e3: i=2; → e2: i>1 → e1: log("ERROR");
Execution:            e3: i=2 → e5: N>0 → e4: i=0; → e2: i>1 X Diverged! …
Refined event chain:  e3: i=2 → e6: N<=0 → e2: i>1 → e1: log("ERROR");
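The divergence on these slides can be replayed with a toy program. The Java sketch below is a reconstruction from the event labels e1 to e6 only: the variable `N`, the control flow and all names are assumptions made for illustration, not code from the talk or from Pensieve. The original chain says nothing about `N`, so a candidate input with `N > 0` overwrites `i` and never reaches the error; adding the condition `N <= 0` to the chain makes the replay reach it.

```java
// Illustrative sketch only: a toy program reconstructed from the slide's event labels
// (e1..e6); the variable N and the control flow are assumptions.
class RefinementSketch {

    // Returns true when the symptom e1 (the "ERROR" log) would be reached.
    static boolean reachesError(int n) {
        int i = 2;          // e3: i = 2
        if (n > 0) {        // e5: N > 0
            i = 0;          // e4: i = 0 overwrites the value the chain relied on
        }
        return i > 1;       // e2: i > 1 guards e1: log("ERROR")
    }

    // Verification replays a candidate input and checks that the symptom appears.
    static boolean verify(int candidateN) {
        return reachesError(candidateN);
    }

    public static void main(String[] args) {
        // The original chain leaves N unconstrained, so N = 1 is a legal candidate.
        System.out.println("N = 1 reproduces: " + verify(1));
        // The refined chain adds e6: N <= 0.
        System.out.println("N = 0 reproduces: " + verify(0));
    }
}
```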
46
Evaluation
Evaluated on 18 cases from JVM distributed systems
  HDFS, HBase, ZooKeeper, Cassandra
  with noisy logs generated from manual reproduction
Overall result
  Successfully reproduces 72%
  Finishes analysis within 10 min
Scalability result
  Average # of events in an event chain: 105.2
  # of forked event chains: 1367.2
47
Case Study: HDFS-6130
Useful for hard bugs
Path conditions inferred: fsimage.layoutVersion != TXID_LAYOUT
Developers’ reproduction
  // Use old fsimage
  initCluster(UPGRADE);
  restartCluster();
Pensieve’s reproduction
  initCluster(UPGRADE);
  restartCluster();
48
Case Study: HDFS-4022
Finds a different reproduction than the developers’
Developers’ reproduction
  initCluster(4);
  create("a.txt", 3);
  stopDataNode(3);
  append("a.txt", "data");
Pensieve’s reproduction – fewer nodes!
  initCluster(3);
  setConfig("policy", "ALWAYS");
  create("a.txt", 2);
  stopDataNode(1);
  append("a.txt", "data");
49
Limitations
Error is not logged (e.g., silent data loss)
Bugs involving resource exhaustion
Systems need to have clearly defined input events (e.g., not for compilers)
50
Related Work
Static program slicing [Weiser81]
  Obtains static trace but not dynamic partial trace
Symbolic-execution-based approaches
  ESD, SherLog
Record-and-replay-based approaches
  BugRedux [Jin ICSE’12], etc.
51
Conclusion
Pensieve: automated failure reproduction
  Based on Partial Trace Observation
  Scales to real-world distributed systems
  Non-intrusive and relies on logs
Pensieve leverages the natural way human beings do failure reproduction.
Thanks!