
Statistical Debugging CS 6340

Motivation Bugs will escape in-house testing and analysis tools –Dynamic analysis (i.e. testing) is unsound –Static analysis is incomplete –Limited resources (time, money, people) Software ships with unknown (and even known) bugs

An Idea: Statistical Debugging Monitor Deployed Code –Online: Collect information from user runs –Offline: Analyze information to find bugs Effectively a “black box” for software

Benefits of Statistical Debugging Actual runs are a vast resource! Crowdsource-based Testing –Number of real runs >> number of testing runs Reality-directed Debugging –Real-world runs are the ones that matter most

Two Key Questions How do we get the data? What do we do with it?

Practical Challenges 1.Complex systems –Millions of lines of code –Mix of controlled and uncontrolled code 2.Remote monitoring constraints –Limited disk space, network bandwidth, power, etc. 3.Incomplete information –Limit performance overhead –Privacy and security [Diagram: remote clients send data over the Internet to a central server]

The Approach Guess behaviors that are “potentially interesting” –Compile-time instrumentation of program Collect sparse, fair subset of these behaviors –Generic sampling framework –Feedback profile + outcome label (success vs. failure) for each run Analyze behavioral changes in successful vs. failing runs to find bugs –Statistical debugging

Overall Architecture [Diagram: Program Source → Compiler (instruments the guessed behaviors) → Deployed Application with Sampler → per-run Profile + success/failure outcome → Statistical Debugging (also uses the guesses) → Top bugs with likely causes]

A Model of Behavior Assume any interesting behavior is expressible as a predicate P on a program state at a particular program point. Observation of behavior = observing P Instrument the program to observe each predicate Which predicates should we observe?

Branches Are Interesting The branch if (p) … else … is instrumented as ++branch_17[p != 0]; so the counter array branch_17 records how often each side is taken (e.g. p == 0: 63 times, p != 0: 0 times). Track predicates: p == 0, p != 0

Return Values Are Interesting The call n = fopen(...); is instrumented as ++call_41[(n == 0) + (n >= 0)]; so the counter array call_41 records the sign of the return value (e.g. n < 0: 23 times, n == 0: 0 times, n > 0: 90 times). Track predicates: n < 0, n == 0, n > 0

What Other Behaviors Are Interesting? Depends on the problem you wish to solve! Examples: –Number of times each loop runs –Scalar relationships between variables (e.g. i < j, i > 42) –Pointer relationships (e.g. p == q, p != null)
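For instance, a scalar relationship between a variable i and the constant 42 could be instrumented with the same counter-array pattern as branch_17 and call_41; the site name scalar_23 below is made up for illustration:

  /* Hypothetical instrumentation site in the style of branch_17 / call_41:
     track how i compares to 42 at this program point. */
  static int scalar_23[3];                     /* 0: i < 42, 1: i > 42, 2: i == 42 */

  static void observe_scalar_23(int i) {
      ++scalar_23[(i == 42) + (i >= 42)];      /* index encodes <, >, or == */
  }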

QUIZ: Identify the Predicates List all predicates tracked for this program, assuming only branches are potentially interesting: void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } }

QUIZ: Identify the Predicates List all predicates tracked for this program, assuming only branches are potentially interesting: c == ‘a’ c != ‘a’ i < 3 i >= 3 void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } }

Summarization and Reporting At the end of a run, the per-site counts (branch_17: p == 0 → 63, p != 0 → 0; call_41: n < 0 → 23, n == 0 → 0, n > 0 → 90) are combined into a single report over the predicates p == 0, p != 0, n < 0, n == 0, n > 0.

Summarization and Reporting Feedback report per run is: –Vector of predicate states: ‒, 0, 1, * –Success/failure outcome label No time dimension, for good or ill
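As a data-structure sketch of such a report (a minimal sketch; the type and field names are assumptions, not from the course):

  #include <stdbool.h>
  #include <stddef.h>

  /* One run's feedback report: an abstract state per predicate plus the outcome. */
  enum pred_state { NOT_OBSERVED, ONLY_FALSE, ONLY_TRUE, BOTH };   /* ‒, 0, 1, * */

  struct feedback_report {
      enum pred_state *states;   /* one entry per instrumented predicate */
      size_t           npreds;
      bool             failed;   /* outcome label: failure vs. success */
  };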

QUIZ: Abstracting Predicate Counts Abstract each predicate’s counts (from branch_17 and call_41) into one of the following states: ‒ Never observed, 0 False at least once, never true, 1 True at least once, never false, * Both true and false at least once. Predicates: p == 0, p != 0, n < 0, n == 0, n > 0.

QUIZ: Abstracting Predicate Counts Answer: p == 0 → 1, p != 0 → 0, n < 0 → *, n == 0 → 0, n > 0 → * (‒ Never observed, 0 False at least once, never true, 1 True at least once, never false, * Both true and false at least once)
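A minimal sketch of this abstraction step, assuming we have per-predicate counts of how often the predicate was observed true and false (the function name is illustrative):

  /* Map a predicate's observation counts to its abstract state:
     '-' never observed, '0' only false, '1' only true, '*' both. */
  static char abstract_state(unsigned true_count, unsigned false_count) {
      if (true_count == 0 && false_count == 0) return '-';
      if (true_count == 0)                     return '0';
      if (false_count == 0)                    return '1';
      return '*';
  }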

QUIZ: Populate the Predicates Populate the predicate vectors and outcome labels for the two runs, on inputs “bba” and “bbb”, for the predicates c == ‘a’, c != ‘a’, i < 3, i >= 3, and the Outcome label (S/F): void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } }

QUIZ: Populate the Predicates void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } } Predicate vectors and outcome labels for the two runs:
                        “bba”   “bbb”
  c == ‘a’                *       0
  c != ‘a’                *       1
  i < 3                   1       *
  i >= 3                  0       *
  Outcome label (S/F)     F       S

The Need for Sampling Tracking all predicates is expensive Decide to examine or ignore each instrumented site: –Randomly –Independently –Dynamically Why? –Fairness –We need an accurate picture of rare events

A Naive Sampling Approach Toss a coin at each instrumentation site, e.g. sample with probability 1/100. The unconditional instrumentation
  ++count_42[p != NULL];
  p = p->next;
  ++count_43[i < max];
  total += sizes[i];
becomes
  if (rand(100) == 0) ++count_42[p != NULL];
  p = p->next;
  if (rand(100) == 0) ++count_43[i < max];
  total += sizes[i];
Too slow!

Some Other Problematic Approaches Sample every k-th site –Violates independence –Might miss predicates “out of phase” Periodic hardware timer or interrupt –Might miss rare events –Not portable

Amortized Coin Tossing Observation: Samples are rare (e.g. 1/100) Idea: Amortize the sampling cost by predicting the time until the next sample Implement as countdown values selected from a geometric distribution, which models how many tails (0) come before the next head (1) for a biased coin Example with sampling rate 1/5: 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, … Next sample after: 6, 4, 5, … sites (counting the sampled site itself)
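A rough sketch of drawing such a countdown (assuming inverse-transform sampling of a geometric distribution; next_countdown and the use of rand() are illustrative, not the course's implementation):

  #include <math.h>
  #include <stdlib.h>

  /* Number of instrumentation sites to skip before the next sample,
     i.e. the number of tails before the next head for a coin with
     success probability `rate` (e.g. 1.0/100). */
  static int next_countdown(double rate) {
      double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0, 1) */
      return (int)floor(log(u) / log(1.0 - rate));
  }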

An Efficient Approach Check the countdown once per region instead of tossing a coin at every site (this region has 2 sites):
  if (countdown >= 2) {
    /* fast path: no sample is due anywhere in this region */
    countdown -= 2;
    p = p->next;
    total += sizes[i];
  } else {
    /* slow path: instrumented copy of the region */
    if (countdown-- == 0) { ++count_42[p != NULL]; countdown = next(); }
    p = p->next;
    if (countdown-- == 0) { ++count_43[i < max]; countdown = next(); }
    total += sizes[i];
  }
This replaces the per-site coin tosses of the naive approach:
  if (rand(100) == 0) ++count_42[p != NULL];
  p = p->next;
  if (rand(100) == 0) ++count_43[i < max];
  total += sizes[i];

Feedback Reports with Sampling Feedback report per run is: –Vector of sampled predicate states (‒, 0, 1, *) –Success/failure outcome label Certain of what we did observe –But may miss some events Given enough runs, samples ≈ reality –Common events seen most often –Rare events seen at proportionate rate

QUIZ: Uncertainty Due to Sampling Check all possible states that a predicate P might take due to sampling. The first column shows the actual state of P (without sampling); the remaining columns are the possibly observed states ‒, 0, 1, *.

QUIZ: Uncertainty Due to Sampling Check all possible states that a predicate P might take due to sampling. The first column shows the actual state of P (without sampling).
  actual ‒ → possibly observed: ‒
  actual 0 → possibly observed: ‒, 0
  actual 1 → possibly observed: ‒, 1
  actual * → possibly observed: ‒, 0, 1, *
Sampling can only miss observations, so the observed state is never stronger than the actual one.

Overall Architecture Revisited [Same diagram as before: Program Source → Compiler → Deployed Application with Sampler → per-run Profile + success/failure outcome → Statistical Debugging → Top bugs with likely causes]

Finding Causes of Bugs We gather information about many predicates –≈ 300K for bc (“bench calculator” program on Unix) Most of these are not predictive of anything How do we find the few useful predicates?

Finding Causes of Bugs How likely is failure when predicate P is observed to be true? F(P) = # failing runs where P is observed to be true S(P) = # successful runs where P is observed to be true Failure(P) = F(P) / (F(P) + S(P)) Example: F(P) = 20, S(P) = 30 ⇒ Failure(P) = 20/50 = 0.4

Tracking Failure Is Not Enough Predicate x == 0 is an innocent bystander Program is already doomed Failure(x == 0) = 1.0 Failure(f == NULL) = 1.0 if (f == NULL) { x = foo(); *f; } int foo() { return 0; }

Tracking Context How likely is failure when predicate P is observed, whether true or false? F(P observed) = # failing runs observing P S(P observed) = # successful runs observing P Context(P) = F(P observed) / (F(P observed) + S(P observed)) Example: F(P observed) = 40, S(P observed) = 80 ⇒ Context(P) = 40/120 = 0.333…

A Useful Measure: Increase() Does P being true increase chance of failure over the background rate? Increase(P) = Failure(P) - Context(P) A form of likelihood ratio testing Increase(P) ≈ 1 ⇒ High correlation with failing runs Increase(P) ≈ -1 ⇒ High correlation with successful runs

Increase() Works 1 failing run: f == NULL 2 successful runs: f != NULL if (f == NULL) { x = foo(); *f; } int foo() { return 0; } Failure(f == NULL) = 1.00, Context(f == NULL) = 0.33, Increase(f == NULL) = 0.67 Failure(x == 0) = 1.00, Context(x == 0) = 1.00, Increase(x == 0) = 0.00
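For concreteness, a small sketch (with hypothetical struct and field names) of how these three metrics can be computed from per-predicate run counts; the main() below reproduces the f == NULL numbers above:

  #include <stdio.h>

  /* Hypothetical per-predicate counts gathered over all runs. */
  struct pred_stats {
      int f_true, s_true;   /* failing / successful runs where P was observed true  */
      int f_obs,  s_obs;    /* failing / successful runs where P was observed at all */
  };

  /* Failure(P) = F(P) / (F(P) + S(P)) */
  static double failure(const struct pred_stats *p) {
      return (double)p->f_true / (p->f_true + p->s_true);
  }

  /* Context(P) = F(P observed) / (F(P observed) + S(P observed)) */
  static double context(const struct pred_stats *p) {
      return (double)p->f_obs / (p->f_obs + p->s_obs);
  }

  /* Increase(P) = Failure(P) - Context(P) */
  static double increase(const struct pred_stats *p) {
      return failure(p) - context(p);
  }

  int main(void) {
      struct pred_stats f_is_null = { 1, 0, 1, 2 };   /* 1 failing, 2 successful runs */
      printf("Increase(f == NULL) = %.2f\n", increase(&f_is_null));   /* prints 0.67 */
      return 0;
  }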

QUIZ: Computing Increase() Compute Failure, Context, and Increase for each predicate, given the two runs:
                        “bba”   “bbb”   Failure   Context   Increase
  c == ‘a’                *       0
  c != ‘a’                *       1
  i < 3                   1       *
  i >= 3                  0       *
  Outcome label (S/F)     F       S

QUIZ: Computing Increase()
                        “bba”   “bbb”   Failure   Context   Increase
  c == ‘a’                *       0       1.0       0.5        0.5
  c != ‘a’                *       1       0.5       0.5        0.0
  i < 3                   1       *       0.5       0.5        0.0
  i >= 3                  0       *       0.0       0.5       -0.5
  Outcome label (S/F)     F       S

Isolating the Bug Increase(c == ‘a’) = 0.5, Increase(c != ‘a’) = 0.0, Increase(i < 3) = 0.0, Increase(i >= 3) = -0.5: c == ‘a’ is the strongest bug predictor. void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } }

A First Algorithm 1.Discard predicates having Increase(P) ≤ 0 – e.g. bystander predicates, predicates correlated with success – Exact value is sensitive to few observations – Use lower bound of 95% confidence interval 2.Sort remaining predicates by Increase(P) – Again, use 95% lower bound – Likely causes with determinacy metrics
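A rough sketch of this filter-and-sort step over the pred_stats counts from the earlier sketch; the 95% lower bound below uses a normal approximation for the difference of two proportions, which is only one plausible reading of the slide, not necessarily the formula used in the original work:

  #include <math.h>
  #include <stdlib.h>

  struct pred_stats { int f_true, s_true, f_obs, s_obs; };   /* as in the earlier sketch */

  /* Lower bound of a 95% confidence interval for Increase(P)
     (normal approximation; an assumption, see above). */
  static double increase_lower_bound(const struct pred_stats *p) {
      double fail = (double)p->f_true / (p->f_true + p->s_true);
      double ctx  = (double)p->f_obs  / (p->f_obs  + p->s_obs);
      double var  = fail * (1 - fail) / (p->f_true + p->s_true)
                  + ctx  * (1 - ctx)  / (p->f_obs  + p->s_obs);
      return (fail - ctx) - 1.96 * sqrt(var);
  }

  /* qsort comparator: highest lower bound first. */
  static int by_lower_bound(const void *a, const void *b) {
      double la = increase_lower_bound(a), lb = increase_lower_bound(b);
      return (la < lb) - (la > lb);
  }

  /* Step 1: drop predicates whose lower bound is <= 0.  Step 2: sort the rest. */
  static size_t filter_and_rank(struct pred_stats *preds, size_t n) {
      size_t kept = 0;
      for (size_t i = 0; i < n; i++)
          if (increase_lower_bound(&preds[i]) > 0)
              preds[kept++] = preds[i];
      qsort(preds, kept, sizeof preds[0], by_lower_bound);
      return kept;   /* number of surviving, ranked predicates */
  }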

Isolating the Bug void main() { int z; for (int i = 0; i < 3; i++) { char c = getc(); if (c == ‘a’) z = 0; else z = 1; assert(z == 1); } } Increase(c == ‘a’) = 0.5, Increase(c != ‘a’) = 0.0, Increase(i < 3) = 0.0, Increase(i >= 3) = -0.5

Isolating a Single Bug in bc void more_arrays() {... /* Copy the old arrays. */ for (indx = 1; indx < old_count; indx++) arrays[indx] = old_ary[indx]; /* Initialize the new elements. */ for (; indx < v_count; indx++) arrays[indx] = NULL;... } #1: indx > scale #2: indx > use_math #3: indx > opterr #4: indx > next_func #5: indx > i_base

It Works! At least for programs with a single bug Real programs typically have multiple, unknown bugs Redundancy in the predicate list is a major problem

Using the Information Multiple useful metrics: Increase(P), Failure(P), F(P), S(P) Organize all metrics in compact visual (bug thermometer)

Sample Report

Multiple Bugs: The Goal Find the best predictor for each bug, without prior knowledge of the number of bugs, sorted by the importance of the bugs.

Multiple Bugs: Some Issues A bug may have many redundant predictors –Only need one –But would like to know correlated predictors Bugs occur on vastly different scales –Predictors for common bugs may dominate, hiding predictors of less common problems

An Idea Simulate the way humans fix bugs Find the first (most important) bug Fix it, and repeat

Revised Algorithm Repeat: Step 1: Compute Increase(), F(), etc. for all predicates Step 2: Rank the predicates Step 3: Add the top-ranked predicate P to the result list Step 4: Remove P and discard all runs where P is true –Simulates fixing the bug corresponding to P –Discarding reduces the rank of correlated predicates Until no runs are left
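A sketch of just the loop structure of this algorithm (the run layout, the ranking function, and all names below are illustrative assumptions, not the original implementation):

  #include <stdbool.h>
  #include <stddef.h>

  struct run { bool failed; const bool *pred_true; };   /* pred_true[p]: was P_p ever true? */

  typedef double (*rank_fn)(size_t pred, const struct run *runs,
                            const bool *live, size_t nruns);

  /* Repeat: rank predicates over the still-live runs, report the top one, then
     discard every run in which it was true ("simulate fixing that bug"). */
  static size_t eliminate(const struct run *runs, bool *live, size_t nruns,
                          size_t npreds, rank_fn rank, size_t *picked) {
      size_t npicked = 0;
      for (;;) {
          size_t best = npreds;
          double best_score = 0.0;
          for (size_t p = 0; p < npreds; p++) {             /* Steps 1-2 */
              double s = rank(p, runs, live, nruns);
              if (s > best_score) { best_score = s; best = p; }
          }
          if (best == npreds)
              break;                                        /* nothing predictive remains */
          picked[npicked++] = best;                         /* Step 3 */
          for (size_t r = 0; r < nruns; r++)                /* Step 4 */
              if (live[r] && runs[r].pred_true[best])
                  live[r] = false;
      }
      return npicked;
  }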

Ranking by Increase(P) Problem: High Increase() scores but few failing runs! Sub-bug predictors: covering special cases of more general bugs

Ranking by F(P) Problem: Many failing runs but low Increase() scores! Super-bug predictors: covering several different bugs together

A Helpful Analogy In the language of information retrieval –Precision = fraction of retrieved instances that are relevant –Recall = fraction of relevant instances that are retrieved In our setting: –Retrieved instances ~ predicates reported as bug predictors –Relevant instances ~ predicates that are actual bug predictors Trivial to achieve only high precision or only high recall Need both high precision and high recall

Combining Precision and Recall Increase(P) has high precision, low recall F(P) has high recall, low precision Standard solution: take the harmonic mean of both, 2 / (1/Increase(P) + 1/F(P)), which rewards high scores in both dimensions
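As a literal translation of that formula into code (the function name is illustrative; both arguments must be positive):

  /* Harmonic mean of Increase(P) and F(P). */
  static double harmonic_rank(double increase_p, double f_p) {
      return 2.0 / (1.0 / increase_p + 1.0 / f_p);
  }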

Sorting by the Harmonic Mean It works!

What Have We Learned? Monitoring deployed code to find bugs Observing predicates as model of program behavior Sampling instrumentation framework Metrics to rank predicates by importance: Failure(P), Context(P), Increase(P),... Statistical debugging algorithm to isolate bugs

Key Takeaway 1 of 2 A lot can be learned from actual executions –Users are executing them anyway –We should capture some of that information

Key Takeaway 2 of 2 Crash reporting is a step in the right direction –But the stack is useful for only about 50% of bugs –Doesn’t characterize successful runs But this is changing... [Screenshot: permission to report a crash] [Screenshot: permission to monitor all runs]