Debugging Intermittent Issues


Debugging Intermittent Issues
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
Northwest C++ Users Group, January 2017

Agenda
- The problem with intermittent issues
- Getting a baseline
- Basic statistics
- Making a change
- Is it fixed?
- Dealing with multi-causal issues

Problem with Intermittent Issues
- Fundamentally you do not know what is causing the problem, therefore you cannot control for it
- You never know from one run to the next if the issue will show up
- From a single run you cannot say anything, so you will need multiple runs
- Once you have multiple runs you need statistics to identify and describe any change in behavior

Getting a Baseline
- A baseline is a reference point or configuration where you control any factor that might be related to the issue
- Since you really don't know what is causing the issue, you likely won't actually control for it – but at least you minimize the number of variables affecting future measurements
- Ideally the goal is simply to control as much of the system as possible

Key Point
- Intermittent failures are NOT really intermittent!!!
- They are caused by some varying condition that you are not measuring, observing, and controlling for
- If the varying condition happens in a particular way, you get a failure – every time
- The debugging problem is to find the condition that is varying, quantify how it varies, then control it

Getting a Baseline
- What you need to control will vary based on what you are debugging
- "Pure software" issues will often be easier to control than "network connected" issues or "hardware related" issues
- Note that "pure software" issues are also far less likely to be intermittent, as there are far fewer sources of randomness
- The most notable source of pure software randomness is system loading leading to race conditions (see the sketch below)
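To make that last bullet concrete, here is a minimal illustration (hypothetical, not from the talk) of the classic case: an unsynchronized counter that only produces a wrong answer when the scheduler interleaves two threads in an unlucky way – exactly the kind of hidden varying condition described above.

```cpp
// race_demo.cpp -- a deliberately buggy program with a classic intermittent
// failure: two threads increment a shared counter with no synchronization.
// Whether the final count is wrong depends on how the OS schedules the
// threads, so the "bug" appears only on some runs, and more often under load.
#include <iostream>
#include <thread>

int counter = 0;  // shared and unprotected -- the hidden varying condition

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // read-modify-write race between the two threads
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    // Expected 200000; any other value means the race fired on this run.
    std::cout << "counter = " << counter
              << (counter == 200000 ? " (pass)" : " (FAIL)") << '\n';
}
```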

Items to look at controlling
- Version of the software you are running
- Physical location of the target system
- Other processes running on the target system
- Network traffic to the target system – ideally connect the system via a wired connection on an isolated network
- Other processes, and response times, of remotely connected systems that the target system depends on

More items to control
- Data being manipulated by the system
- State of the target system – does each run of the system begin in the same state, including the existence and size of debugging files?
- Depending on what is going on, ANYTHING could affect the issue: lighting, time of day, people in the room, temperature – be creative
- You may not get everything on the first pass – for really hard problems, establishing a baseline becomes iterative

Also Control Yourself
- As you work the problem and repeatedly go through the test case, you will unconsciously change your behavior
- Humans naturally learn to avoid failure conditions, and this includes the behavior patterns that trigger bugs
- The result can be VERY subtle – even something as small as a slight change in your keystroke rate can affect your testing

Can't Control It – Measure It
- Many times you will not be able to control a factor that may be important – time of day, for example
- If you cannot control a factor, make every attempt to measure and record its value during your test runs
- Often, after hours of testing and looking at something else, a pattern will emerge in the collected data that strongly hints at the real issue
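One low-effort way to follow this advice is to append one record per run to a log file. A minimal sketch – the factors recorded here (wall-clock time, run duration) are just example stand-ins for whatever you cannot control:

```cpp
// run_log.cpp -- append one CSV row per test run so factors you can only
// observe can later be correlated with failures.
#include <ctime>
#include <fstream>

void logRun(bool failed, double durationSec) {
    std::ofstream log("runs.csv", std::ios::app);  // one growing file per baseline
    std::time_t now = std::time(nullptr);
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", std::localtime(&now));
    log << stamp << ',' << durationSec << ',' << (failed ? "FAIL" : "pass") << '\n';
}

int main() {
    logRun(false, 12.7);  // would normally be called once per test run
}
```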

Baseline Failure Rate
- Part of the baseline configuration is the rate at which you experience the failure condition
- Once you have the test environment as stable and quantified as possible, you need to run the system multiple times and quantify how often you see the failure
- Three basic outcomes here:
  - The failure no longer occurs
  - The failure now happens all the time
  - The failure happens X of Y trials

Baseline Failure Rate
- If you no longer have an intermittent failure (no failure, or constant failure), something you changed in setting up the baseline is likely related to the issue – this is good, because you can now systematically vary the baseline setup to debug
- Most likely you will still get some failures; for the next set of slides we will assume that you have run the system 10 times and saw 3 failures...
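Gathering the baseline counts can be automated. A small sketch, assuming the failure reproduces under a hypothetical ./flaky_test executable that exits nonzero on failure:

```cpp
// baseline.cpp -- run the test N times and report the observed failure rate.
// "./flaky_test" is a placeholder for whatever command reproduces the issue.
#include <cstdlib>
#include <iostream>

int main() {
    const int runs = 10;
    int failures = 0;
    for (int i = 0; i < runs; ++i) {
        if (std::system("./flaky_test") != 0)  // nonzero exit code = failure
            ++failures;
    }
    std::cout << failures << " failures in " << runs << " runs ("
              << 100.0 * failures / runs << "% failure rate)\n";
}
```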

Basic Statistics
- 3 failures in 10 runs gives a failure rate of: 3/10 = 0.3 = 30%
- The success rate is: 1.0 – 0.3 = 0.7 = 70%
- Note that this number is an APPROXIMATION – you can never really know the actual rates
- What is the chance of testing 10 times and not seeing a single failure? It is the chance of success on the first try * chance of success on the second try * ...: 0.7 * 0.7 * ... = 0.7^10 = 0.028 = 2.8%
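The same arithmetic in a few lines of C++, handy for checking variations on the numbers (a sketch using the running example's rates):

```cpp
// odds.cpp -- probability of a clean 10-run streak given the observed
// (approximate!) 30% per-run failure rate from the running example.
#include <cmath>
#include <iostream>

int main() {
    double failRate    = 3.0 / 10.0;          // 3 failures in 10 runs
    double successRate = 1.0 - failRate;      // 0.7
    double cleanStreak = std::pow(successRate, 10);
    std::cout << "P(10 clean runs) = " << 100.0 * cleanStreak << "%\n";  // ~2.8%
}
```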

How many times to test?
- So if we CAN run the test 10 times and not see a failure, how many times do we really need to test?
- Yes, there is a formula to estimate this – I have never actually used it in practice, as several of the variables involved (things like degrees of freedom) are pretty hard to estimate
- Simply use a rule of thumb here – get the chance of not seeing a failure, when there really is one, below either 5% or 1%

How many times to test?
- If I want to get below 1%, will 15 trials be enough? 0.70^15 = 0.0047 = 0.47% – yep, and this approach is good enough for day-to-day work
- Those who are more mathematically inclined could also solve the equation 0.01 = 0.70^X directly – then, of course, make sure to take the next full integer! (Hint: you need $\log_b a^r = r \log_b a$; see the worked solution below)
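Working the hint through (the arithmetic spelled out, not on the original slide):

```latex
\[
0.01 = 0.70^X
\;\Longrightarrow\;
\log 0.01 = X \log 0.70
\;\Longrightarrow\;
X = \frac{\log 0.01}{\log 0.70} \approx \frac{-2.0}{-0.155} \approx 12.9
\]
```

So 13 clean trials already get the chance below 1% (0.70^13 ≈ 0.97%), and 15 gives comfortable margin.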

OK now what?
- We have a controlled test setup
- We have an approximation of the failure rate
- We may also have some clues about what may be causing the problem
- The next step is to attempt a change and observe the failure rate

Is it fixed?
- I made a change and now I don't see the failure condition any longer – I'm done, right?
- Hmm – NO! From the previous slide: what is the chance of testing 10 times and not seeing a single failure? 0.7 * 0.7 * ... = 0.7^10 = 0.028 = 2.8%

Is it fixed?
- One issue with problems that display intermittently is that you can never really know for sure that they are fixed
- The best you can do is estimate the probability that they are fixed and decide whether that is good enough for your application
- You can always run more trials to increase the certainty that you really fixed the problem

Is it fixed?
Assume our original 30% failure rate (70% success rate):

Trials | Formula | Chance of not seeing the failure
-------|---------|---------------------------------
  10   | 0.7^10  | 2.82%
  15   | 0.7^15  | 0.474%
  20   | 0.7^20  | 0.079%
  25   | 0.7^25  | 0.013%
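The table generalizes to the one-line formula from the earlier slide: the number of clean runs needed to push the residual risk below a target is log(risk) / log(success rate), rounded up. A small sketch (it assumes the pre-fix failure rate would still apply if the bug were not actually fixed):

```cpp
// clean_runs.cpp -- minimum consecutive clean runs needed before declaring a
// fix "probably real" at a given residual risk.
#include <cmath>
#include <iostream>

int cleanRunsNeeded(double successRate, double maxRisk) {
    // Solve maxRisk = successRate^X  =>  X = log(maxRisk) / log(successRate)
    return static_cast<int>(std::ceil(std::log(maxRisk) / std::log(successRate)));
}

int main() {
    std::cout << "5% risk: " << cleanRunsNeeded(0.7, 0.05) << " runs\n"   // 9
              << "1% risk: " << cleanRunsNeeded(0.7, 0.01) << " runs\n";  // 13
}
```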

Multi-causal Issues
- I made a change and my failure rate decreased, but not to zero – now what?
- This is a very good indication that you have either affected the original problem, or have actually solved one problem while another remains
- The key to sorting this out is to look for details on the actual failure
- Often, multiple issues have similar-looking failures but are really two different things

Multi-causal Issues
- Attempt to get more information on the exact nature of the remaining failure; things to look at:
  - Stack traces
  - Timing values – how long into the run?
  - Variation in results – cluster plots are helpful here
  - More than one trigger for the failure
- Many times you will stumble upon this information while debugging the original issue
- It is very common for intermittent issues to be multi-causal, as they are ignored when first seen

Multi-causal Issues
- If you suspect you have more than one issue, set up a separate debugging environment for each suspected issue
- You may also be able to make a list of the separate failure cases and failure rates, decomposing the original numbers (see the example below)
- It is helpful to remember that the original failure rate is simply the sum of the individual failure rates (assuming each failing run has a single cause)
- Work the issue with the highest failure rate first
- Keep working down the list until all are solved
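As a hypothetical decomposition of the running example's numbers (not from the talk):

```latex
\[
p_{\text{total}} = p_A + p_B
\qquad\text{e.g.}\qquad
0.30 = \underbrace{0.20}_{\text{issue A}} + \underbrace{0.10}_{\text{issue B}}
\]
```

Fixing issue A first should drop the observed rate to roughly 10%, which is itself a useful check that the decomposition was right.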

Summary
- Intermittent issues are not really intermittent
- You need to track down what unknown variable is changing and handle that variation
- A baseline system configuration, with failure rates, is key to telling whether a change occurred
- Just because you don't see the failure any more is NOT a guarantee that it is fixed
- Multi-causal issues can be separated and worked as individual issues once you have details on each failure

Questions?