Presentation on theme: "The use and misuse of statistics in studies on software development. Things you should know about statistics that you didn’t learn in school Keynote at."— Presentation transcript:

1 The use and misuse of statistics in studies on software development. Things you should know about statistics that you didn’t learn in school Keynote at ICSIE, Dubai, May 5, 2015 Magne Jørgensen

2 The presentation is based on the following papers:
–M. Jørgensen, T. Dybå, D. I. K. Sjøberg, K. Liestøl. Incorrect results in software engineering experiments. How to improve research practices. To appear in Journal of Systems and Software.
–M. Jørgensen. The influence of selection bias on effort overruns in software development projects, Information and Software Technology 55(9):1640-1650, 2013.
–M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development, Journal of Systems and Software, 85(11):2494-2503, 2012.
–M. Jørgensen, T. Halkjelsvik, and B. Kitchenham. How does project size affect cost estimation error? Statistical artefacts and methodological challenges, International Journal of Project Management, 30(7):751-862, 2012.

3 THE EFFECT OF LOW POWER, RESEARCHER BIAS AND PUBLICATION BIAS: How many of the statistical results in software engineering are incorrect?

4 The average statistical power of software engineering studies is only about 30%. This means that even if all studied relationships were true (an unrealistic best case), only 30% of our hypothesis tests should be statistically significant (p<0.05). –If only 50% of our tests are on true relationships, we should observe about 17.5% statistically significant tests. How many of our published statistical tests are actually statistically significant?
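The 17.5% figure follows from simple arithmetic. A minimal sketch, using the slide's numbers (30% average power, 5% significance level, and the slide's assumed 50/50 split between true and false hypotheses):

```python
# Expected fraction of statistically significant tests, given:
# - average statistical power of 0.30 (chance of detecting a true effect)
# - significance level alpha = 0.05 (false-positive rate on null effects)
# - an assumed 50/50 split between true and false hypotheses (slide assumption)
power = 0.30
alpha = 0.05
share_true = 0.50

expected_significant = share_true * power + (1 - share_true) * alpha
print(f"Expected significant fraction: {expected_significant:.3f}")  # 0.175
```

If every studied relationship were true (share_true = 1.0), the expected fraction would equal the power, 30%; both numbers are far below the reported rates of significant results.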

5 Fanelli, Daniele. “‘Positive’ results increase down the hierarchy of the sciences.” PLoS One 5.4 (2010). Computer science studies: 80% significant results (software engineering: 50%)

6

7 Similar calculations for software engineering suggest that at least one third of the statistically significant results in software engineering experiments are incorrect! M. Jørgensen, T. Dybå, D. I. K. Sjøberg, K. Liestøl. Incorrect results in software engineering experiments. How to improve research practices. To appear in Journal of Systems and Software.

8 Illustration of researcher and publication bias: How easy is it to find statistically significant results with “flexible” statistical analyses? I made a simple test...

9 My hypothesis: People with longer names write more complex texts
–Dr. Pensenschneckerdorf: “The results advocate, when presupposing satisfactory statistical power, that the evidence backing up positive effect is weak.”
–Dr. Hart: “We found no effect.”

10 Design and results of study
Variables:
–LengthOfName: length of the surname of the first author
–Complexity1: number of words per paragraph
–Complexity2: Flesch-Kincaid reading level
Data collection:
–The first 20 publications identified by Google Scholar using the search string “software engineering” for the year 2012. (n=20 is a typical, low statistical power sample size for software engineering studies.)
Results were statistically significant!
–r(LengthOfName, Complexity1) = 0.581 (p=0.007)
–r(LengthOfName, Complexity2) = 0.577 (p=0.008)
Conclusion: The analysis rejects the null hypothesis that there is no relationship, i.e., it supports that longer names are connected with more complex texts. Results can be published!?

11 The regression line also supports that there is a relation between length of name and complexity of writing

12 How did I do it? (How to easily get “interesting” results in any study)
Publication bias: Only the two significant measures, out of fourteen measures of paper complexity, were reported.
Researcher bias 1: A (defendable?) post hoc (after looking at the data) change in how to measure name length.
–The use of surname length was motivated by the observation that not all authors gave their full first name.
Researcher bias 2: A (defendable?) post hoc removal of two observations.
–Motivated by the lack of data for the Flesch-Kincaid measure of those two papers.
Low number of observations: Statistical power approx. 0.3 (assuming r=0.3, p<0.05).
–If research were a game where the winners have p<0.05, five studies with 20 observations each would be much better than one study with 100.
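A quick way to see why testing fourteen complexity measures on a small sample is dangerous is to simulate it on pure noise. A minimal sketch (the 14 measures and n=20 are taken from the slides; the data here are random, so every "significant" correlation is spurious):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_runs, n_obs, n_measures = 2000, 20, 14

runs_with_false_positive = 0
for _ in range(n_runs):
    name_length = rng.normal(size=n_obs)  # fake "length of surname"
    significant = False
    for _ in range(n_measures):
        # 14 unrelated "complexity" measures: pure noise
        complexity = rng.normal(size=n_obs)
        r, p = stats.pearsonr(name_length, complexity)
        if p < 0.05:
            significant = True
            break
    runs_with_false_positive += significant

rate = runs_with_false_positive / n_runs
print(f"Share of runs with at least one p<0.05: {rate:.2f}")
```

With independent measures the theoretical rate is 1 - 0.95**14, roughly 0.51: about half of such "studies" of pure noise can report at least one significant result if only the significant measures are published.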

13 ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART I: MISSING VARIABLES LEADING TO INCORRECT CONCLUSIONS

14 Here is my first misuse of statistics.... I measured an increase in the productivity of an IT-department (function points/man-month). The management was happy, since this showed that their newly implemented processes (time boxing) had been successful. Later, to my surprise, when I grouped the projects into those using 4GL and those using 3GL (Cobol), I found a productivity decrease in both groups.

15
                 4GL develop.   3GL develop.   Total
Period 1
  FP                     500           2000     2500
  Effort                 500           4500     5000
  Productivity           1.0           0.44     0.50
Period 2
  FP                    2000           1000     3000
  Effort                1800           3000     4800
  Productivity           0.9           0.33     0.63
Change in
productivity            -0.1          -0.11    +0.13

16 Missing variable in analysis
The increase in total productivity was caused by more and more of the work being done in the higher-productivity environment (4GL).
All teams had decreased their productivity, but the higher-productivity teams had done more of the work.
With missing variables we can get very misleading results. Typically, we are poor at noticing what is NOT there in an analysis.
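The effect in the table is an instance of Simpson's paradox and is easy to reproduce. The sketch below uses illustrative (function points, effort) figures, chosen to match the spirit of the table rather than its exact values: both environments lose productivity from period 1 to period 2, yet the total productivity rises because work shifts toward the higher-productivity 4GL environment.

```python
# Illustrative data (hypothetical, chosen to show the effect):
# (function points, effort in person-months) per environment and period
period1 = {"4GL": (500, 500), "3GL": (2000, 4500)}    # productivities 1.00, 0.44
period2 = {"4GL": (2000, 2200), "3GL": (1000, 3000)}  # productivities 0.91, 0.33

def productivity(fp, effort):
    return fp / effort

def total_productivity(period):
    fp = sum(v[0] for v in period.values())
    effort = sum(v[1] for v in period.values())
    return fp / effort

for env in ("4GL", "3GL"):
    p1 = productivity(*period1[env])
    p2 = productivity(*period2[env])
    print(f"{env}: {p1:.2f} -> {p2:.2f} (decrease)")

t1, t2 = total_productivity(period1), total_productivity(period2)
print(f"Total: {t1:.2f} -> {t2:.2f} (increase!)")
```

The aggregate hides the group variable: once the environment is included in the analysis, the "improvement" disappears.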

17 ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART II: ASSUMING FIXED VARIABLES WHEN THE VARIABLES ARE STOCHASTIC

18 Sir Francis Galton (“Filial regression to mediocrity”): the father of regression analysis, and the first to violate – and be misled by – the fixed variable assumption. Perhaps the most common mistake made in statistical analyses (ref. Milton Friedman, Nobel prize winner in economics).

19 REGRESSION ANALYSIS, ANOVA, T-TEST, CATEGORICAL ANALYSES: ALL OF THEM REQUIRE FIXED VARIABLES. IF WE HAVE STOCHASTIC VARIABLES (RANDOM VARIABLES, VARIABLES WITH MEASUREMENT ERROR), THE RESULTS ARE BIASED.

20 Illustration: Salary discrimination?
Assume an IT-company which:
–Has 100 different tasks it wants completed and for each task hires one male and one female employee (200 workers).
–The “base salary” of a task varies (randomly) from 50,000 to 60,000 USD and is the same for the male and the female employee.
–The actual salary is the “base salary” plus a random, gender-independent bonus, drawn by spinning a “lucky wheel” with numbers (bonuses) between 0 and 10,000.
This should lead to (on average): Salary of female = Salary of male.
A regression analysis with female salary as the dependent variable shows, however, that the females are discriminated against (less likely to get a high bonus)!
–Salary of female = 26100 + 0.56 * Salary of male
On the other hand, with male salary as the dependent variable, the men are discriminated against!
–Salary of male = 26900 + 0.55 * Salary of female
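The slide's construction can be simulated directly. Because each salary is a shared base plus independent noise, regressing either gender's salary on the other yields a slope well below 1, so each group appears "under-rewarded" at the top end. A minimal sketch of the setup described above (exact coefficients will vary with the random draw):

```python
import numpy as np

rng = np.random.default_rng(7)
n_tasks = 100

base = rng.uniform(50_000, 60_000, n_tasks)  # same base salary per task
bonus_m = rng.uniform(0, 10_000, n_tasks)    # independent "lucky wheel" bonuses
bonus_f = rng.uniform(0, 10_000, n_tasks)

salary_m = base + bonus_m
salary_f = base + bonus_f

def slope(x, y):
    # OLS slope of y on x (degree-1 fit; leading coefficient is the slope)
    return np.polyfit(x, y, 1)[0]

b_f_on_m = slope(salary_m, salary_f)
b_m_on_f = slope(salary_f, salary_m)
print(f"female-on-male slope: {b_f_on_m:.2f}")  # well below 1
print(f"male-on-female slope: {b_m_on_f:.2f}")  # also well below 1
```

With both the base and the bonus uniform over 10,000-wide ranges, the theoretical slope is var(base)/(var(base)+var(bonus)) = 0.5 in both directions, close to the 0.55-0.56 on the slide: classic regression to the mean, not discrimination.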

21 [Scatter plot: salary of men vs. salary of women]

22 This violation of the fixed variable assumption may be the reason why most studies report an economy of scale (or linear returns to scale), while the IT-industry experiences a diseconomy of scale. (M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development, Journal of Systems and Software, 85(11):2494-2503, 2012.)

23 ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART III: PUBLICATION BIAS MAKES US BELIEVE IN TOO LARGE EFFECTS

24 “Why most discovered true associations are inflated”, Ioannidis, Epidemiology, Vol 19, No 5, Sept 2008

25 Effect sizes in studies on pair programming Source: Hannay, Jo E., et al. "The effectiveness of pair programming: A meta-analysis." Information and Software Technology 51.7 (2009): 1110-1122.

26 Total publication bias (only statistically significant results are published) implies that the published results have ZERO evidential strength!

27 ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART IV: SOME COMMON TYPES OF REGRESSION ANALYSES ARE EXTREME ON PUBLICATION BIAS

28 Illustration: Building a regression model
Data set:
–Effort variable + 15 other project variables
–Twenty software projects
Regression model:
–Selected the best 4-variable regression model (OLS), based on “best subset”.
–Removed one outlier.
Results:
–R² = 76%
–R²-adj = 70%
–R²-pred = 56%
–MdMRE = 28%

29 Not bad results... especially since all the data were random numbers between 1 and 10! Best subset is a rather extreme type of publication bias, but the same problem is present with stepwise regression. Selecting the best 4-variable model out of 15 candidate variables means that we publish only the best model out of 1365 tested models! (→ extreme selection bias)
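The best-subset experiment is easy to reproduce. The sketch below draws 20 observations of pure noise for "effort" and 15 candidate predictors, fits all C(15,4) = 1365 four-variable OLS models, and keeps the best R². A minimal sketch of the procedure (without the slide's outlier-removal step, which would inflate R² further):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, k = 20, 15, 4

X = rng.uniform(1, 10, size=(n_obs, n_vars))  # 15 random candidate predictors
y = rng.uniform(1, 10, size=n_obs)            # random "effort" (no real signal)

def r_squared(cols):
    # OLS with intercept on the selected predictor columns
    A = np.column_stack([np.ones(n_obs), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

models = list(itertools.combinations(range(n_vars), k))
best = max(models, key=r_squared)
print(f"Tested {len(models)} models; best R^2 = {r_squared(best):.2f}")
```

Reporting only the winning model is publishing the best of 1365 "studies": a substantial R² emerges from data with no signal at all.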

30 What to do about it
–Base regression variable inclusion on a priori judgment of importance.
–Do not use R² or similar measures to assess the goodness of your prediction model.
–Compare the model against reasonable alternatives.
–Increase the statistical power.

31 Last words
“Appearances to the mind are of four kinds. Things either are what they appear to be; or they neither are, nor appear to be; or they are, and do not appear to be; or they are not, and yet appear to be. Rightly to aim in all these cases is the wise man's task.”
Epictetus (AD 55-135), Discourses, Book 1, Chapter 27
… and we have to increase the statistical power of our empirical studies and accept publication of non-significant results.

32 BONUS MATERIAL

33 Example: Is there an economy of scale in IT-projects?
Created a dataset where, for all project sizes, each “true” line of code (LOC) took one hour, i.e., LOC_true = Effort.
–This gives a true productivity of one LOC per hour for all projects regardless of size, i.e., there is no economy of scale.
Each measurement of lines of code had some measurement error added (errors due to forgetting to count lines of code, counting the same code twice, different counting practices in different projects, etc.).
–Observed LOC = true LOC + measurement error
–The observed LOC is now a stochastic variable, not a fixed variable.
Project data were divided into four size groups (very small, small, large, very large) based on their observed lines of code, and the productivity of each category was measured.
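The construction on this slide can be simulated in a few lines: grouping projects by their noisy observed size selects positive measurement errors into the "large" groups, so observed productivity rises with observed size even though true productivity is exactly one LOC per hour everywhere. A sketch under the slide's assumptions (the project-size range and error magnitude are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_projects = 2000

effort = rng.uniform(1_000, 10_000, n_projects)  # hours (illustrative range)
loc_true = effort                                # 1 LOC per hour: no scale effect
loc_observed = loc_true + rng.normal(0, 1_500, n_projects)  # measurement error

# Split projects into four size groups by OBSERVED lines of code
order = np.argsort(loc_observed)
groups = np.array_split(order, 4)
for name, idx in zip(["very small", "small", "large", "very large"], groups):
    prod = (loc_observed[idx] / effort[idx]).mean()
    print(f"{name:>10}: observed productivity = {prod:.2f} LOC/hour")
```

The "very large" group looks more productive than the "very small" group purely because of where the measurement errors land, which is the apparent (incorrect) economy of scale on the next slide.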

34 What we then observe is an (incorrect) economy of scale

