Triage: Diagnosing Production Run Failures at the User's Site. Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou, Department of Computer Science, University of Illinois at Urbana-Champaign.

Presentation transcript:

Triage: Diagnosing Production Run Failures at the User's Site. Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou, Department of Computer Science, University of Illinois at Urbana-Champaign

Joseph Tucek, CS-UIUC, Page 2
Despite all of our effort, production runs still fail. What do we do about these failures?

Page 3: What is (currently) done about end-user failures?
- Dumps leave much manual effort to diagnose
- We still need to reproduce the bug
- This is hard, if not impossible, to do

Page 4: Why on-site diagnosis of production run failures?
- Production run bugs are valuable: not caught in testing, potentially environment specific, and causing real damage to end users
- We can't diagnose production failures off-site: reproduction is hard, the programmer doesn't have the end-user environment, and privacy concerns limit even the reports we do get
- We must diagnose at the end-user's site

Page 5: What do we mean by diagnosis?
- Diagnosis traces back to the underlying fault
- Core dumps tell you about the failure
- Bug detection tells you about some errors
- Existing diagnosis tools are offline
- Propagation chain: a trigger activates the fault (the root cause, i.e. the buggy line of code), which produces an error (incorrect state, e.g. a smashed stack), which leads to the failure (service interruption)

Page 6: What do we need to perform diagnosis? (1)
- We need information about the failure: what is the fault, the error, the propagation tree?
- Off-site: repeatedly inspect the bug (e.g. with a debugger) and run analysis tools targeted at the failure, or at suspected failures
- Off-site techniques don't work on-site: reproducing the bug is non-trivial, we don't know what specific failures will occur, and existing analysis tools are too expensive

Page 7: What do we need to perform diagnosis? (2)
- We need guidance as to what to do next: what analysis should we perform, what is likely to work well, and which variables are interesting?
- Off-site: the programmer decides, based on past knowledge
- On-site, there is no programmer; any decisions must be made automatically

Page 8: What do we need to perform diagnosis? (3)
- We need to try what-ifs with the execution: if we change this input, what happens? Skip this function?
- Off-site: programmers run many input variations, even with differing code
- This is difficult on-site: most replay focuses on minimizing variance, and we can't understand what the results mean

Page 9: What does Triage contribute?
- Enables on-site diagnosis
- Uses systems techniques to make offline analysis tools feasible on-site
- Addresses the three previous challenges
- Allows a new technique, delta analysis
- Human study with real programmers and real bugs shows large savings in time-to-fix

Page 10: Overview
- Introduction
- Addressing the three challenges
- Diagnosis process & design
- Experimental results: human study, overhead
- Related work
- Conclusions

Page 11: Getting information about the failure
- Checkpoint/re-execution can capture the bug: the environment, input, memory state, etc., everything we need to reproduce it
- Benefits: we can relive the failure over and over, and dynamically plug in analysis tools on demand
- Makes the expensive cheap; normal-run overhead is low too

Page 12: Guidance about what to do next
- A human-like diagnosis protocol can guide the diagnosis process
- Repeated replay lets us diagnose incrementally
- Based on past results, we can pick the next step (e.g., if the bug doesn't always repeat, we should look for races)

    Stage | Goal
    ------+--------------------------------
      1   | failure/error type & location
      2   | failure-triggering conditions
      3   | fault-related code & variables

Page 13: Trying what-ifs with the execution
- Flexible re-execution lets us play with what-ifs
- Three types of re-execution: plain (deterministic), loose (allow some variance), wild (introduce potentially large variations)
- Delta analysis extracts how the runs differ

Page 14: Main idea of Triage
- How to get information about the failure? Capture the bug with checkpoint/re-execution, then relive it with various diagnostic techniques
- How to decide what to do? Use a human-like protocol to select analyses, incrementally increasing our understanding of the bug
- How to try out what-if scenarios? Flexible re-execution allows varied executions; delta analysis points out what makes them different

Page 15: Overview
- Introduction
- Addressing the three challenges
- Diagnosis process & design
- Experimental results: human study, overhead
- Related work
- Conclusions

Page 16: Triage architecture
- Checkpointing subsystem
- Analysis tools (e.g. backward slicing, bug detection)
- Control unit (protocol)

Page 17: Triage vs. Rx
- Both are in-memory, and both support variations in execution
- Triage has no output commit and no need for safety; it can even skip code
- Triage considers why the failure occurs and tries to analyze it

Page 18: Failure analysis & delta generation (stages 1 and 2)
- Analysis tools (typical slowdown): bounds checking (1.1x), assertion checking (1x), happens-before (12x), atomicity detection (60x), static core analysis (1x), taint analysis (2x), dynamic slicing (1000x), symbolic execution (1000x), lockset analysis (20x)
- Delta-generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages
- The differences caused by variations are useful as well

Page 19: Delta analysis
Given two runs over the blocks {A B C D E F G X Y}, e.g. run 1 = A B C D E F G and run 2 = A B C X E G Y, compute the basic block vectors:

    run 1:      {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0}
    run 2:      {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1}
    difference: {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1}

Page 20: Delta analysis (continued)
- From delta generation's many runs, Triage finds the most similar pair by comparing their basic block vectors
- Triage then diffs the two closest runs: the minimum edit distance, a.k.a. the shortest edit script
- Example: A B C D E F G vs. A B C X E G Y yields substitute D with X, delete F, and insert Y

Page 21: A bug in TAR

get_directory_contents() in incremen.c, where the segmentation fault (a null pointer dereference) occurs:

    char *
    get_directory_contents (char *path, dev_t device)
    {
      struct accumulator *accumulator;

      /* Recursively scan the given PATH.  */
      {
        char *dirp = savedir (path);
        char const *entry;
        size_t entrylen;
        char *name_buffer;
        size_t name_buffer_size;
        size_t name_length;
        struct directory *directory;
        enum children children;

        if (! dirp)
          savedir_error (path);   /* reports the error but does not return */

        errno = 0;

        name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
        name_buffer = xmalloc (name_buffer_size + 2);
        strcpy (name_buffer, path);
        if (! ISSLASH (path[strlen (path) - 1]))
          strcat (name_buffer, "/");
        name_length = strlen (name_buffer);

        directory = find_directory (path);
        children = directory ? directory->children : CHANGED_CHILDREN;

        accumulator = new_accumulator ();

        if (children != NO_CHILDREN)
          for (entry = dirp;
               (entrylen = strlen (entry)) != 0;   /* segfault: entry is NULL */
               entry += entrylen + 1)

savedir(), which contains the execution difference between passing and failing runs:

    char *
    savedir (const char *dir)
    {
      DIR *dirp;
      struct dirent *dp;
      char *name_space;
      size_t allocated = NAME_SIZE_DEFAULT;
      size_t used = 0;
      int save_errno;

      dirp = opendir (dir);
      if (dirp == NULL)
        return NULL;      /* execution difference: failing runs take this path */

      name_space = xmalloc (allocated);

      errno = 0;
      while ((dp = readdir (dirp)) != NULL)
        {
          char const *entry = dp->d_name;
          if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
            {
              size_t entry_size = strlen (entry) + 1;
              if (used + entry_size < used)
                xalloc_die ();
              if (allocated <= used + entry_size)
                {
                  do
                    {
                      if (2 * allocated < allocated)
                        xalloc_die ();
                      allocated *= 2;
                    }
                  while (allocated <= used + entry_size);

Page 22: Sample Triage report
- Failure point: segfault in lib strlen; stack & heap OK
- Bug detection: deterministic bug; null pointer at incremen.c:207
- Fault propagation:

    dirp = opendir (dir);
    if (dirp == NULL) return NULL;
    dirp = savedir (path);
    entry = dirp;
    strlen (entry)

Page 23: Results – Human Study
- We tested Triage with a human study: 15 programmers drawn from faculty, research programmers, and graduate students (no undergraduates!)
- Measured time to repair bugs, with and without Triage
- Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools, including Valgrind
- 3 simple toy bugs and 2 real bugs: the TAR bug you just saw, and a copy-paste error in BC

Page 24: Time to fix a bug
- We hope that the report is easy to check
- Without Triage: reproduce, find the failure, the error, then the fault, then fix it; with Triage: check the Triage report, then fix it
- We cut out the reproduction step, which is quite unfair to Triage
- Also, we put a time limit; going over is counted as the maximum time

Page 25: Results – Human Study
- For the real bugs, Triage strongly helps (47% less time to fix)
- Better than 99.99% confidence that time with Triage < time without

Page 26: Results – Other Bugs

    Bug        | Δ generation   | Δ analysis | Dynamic slicing
    -----------+----------------+------------+----------------
    Apache     | input element  | 12%        | 8 instructions
    Apache     | input element  | 69%        | 3 instructions
    CVS        | --             | --         | 4 functions
    MySQL      | interleaving   | --         | --
    Squid      | 1 character    | 71%        | 6 instructions
    BC         | array padding  | 98%        | 3 instructions
    Linux-ext  | --             | --         | 6 instructions
    MAN        | --             | --         | 9 functions
    NCOMP      | --             | --         | 5 instructions
    TAR        | file perms     | 68%        | 6 instructions

Page 27: Results – Normal-Run Overhead
- Identical to the checkpoint system (Rx) overhead: under 5%

Page 28: Results – Diagnosis Overhead
- CPU-bound is the worst case; still reasonable, because we're only redoing 200 ms
- Delta analysis is somewhat costly and should be run in the background

Page 29: Related work
- Checkpointing & re-execution: Zap [Osman, OSDI '02], TTVM [King, USENIX '05]
- Bug detection & diagnosis: Valgrind [Nethercote], CCured [Necula, POPL '02], Purify [Hastings, USENIX '92], Eraser [Savage, TOCS '97], [Netzer, PPoPP '91], backward slicing [Weiser, CACM '82], and innumerable others
- Execution variation: input variation (delta debugging [Zeller, FSE '02], fuzzing [B. So]); environment variation (Rx [Qin, SOSP '05], DieHard [Berger, PLDI '06])

Page 30: Conclusions & Future Work
- On-site diagnosis can be made feasible: checkpointing can effectively capture the failure, and expensive off-line analysis can be done on-site
- Privacy issues are minimized
- Also useful for in-house testing: reduces the manual portion of analysis
- Future work: automatic bug hot fixes; visualization of delta analysis

Page 31: Thank you. Questions?
Special thanks to Hewlett-Packard for student scholarship support. This work was supported by NSF, DoE, and Intel.