The use and misuse of statistics in studies on software development
Things you should know about statistics that you didn't learn in school
Keynote at ICSIE, Dubai, May 5, 2015
Magne Jørgensen

The presentation is based on the following papers:
–M. Jørgensen, T. Dybå, D. I. K. Sjøberg, and K. Liestøl. Incorrect results in software engineering experiments: How to improve research practices. To appear in Journal of Systems and Software.
–M. Jørgensen. The influence of selection bias on effort overruns in software development projects. Information and Software Technology, 55(9).
–M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development. Journal of Systems and Software, 85(11), 2012.
–M. Jørgensen, T. Halkjelsvik, and B. Kitchenham. How does project size affect cost estimation error? Statistical artefacts and methodological challenges. International Journal of Project Management, 30(7), 2012.

How many of the statistical results in software engineering are incorrect?
THE EFFECT OF LOW POWER, RESEARCHER BIAS AND PUBLICATION BIAS

The average statistical power of software engineering studies is only about 30%. This means that even if all studied relationships were true (an unrealistic best case), only 30% of our hypothesis tests should be statistically significant (p<0.05).
–If only 50% of our tests are on true relationships, we should observe about 17.5% statistically significant tests.
How many of our published statistical tests actually are statistically significant?
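The 17.5% figure is simple arithmetic; a minimal sketch of the calculation, assuming power = 0.30, α = 0.05, and half of all tested relationships being true:

```python
# Expected share of statistically significant hypothesis tests.
power = 0.30   # average statistical power in software engineering studies
alpha = 0.05   # significance level (false-positive rate when the null is true)

# Best case: every tested relationship is actually true.
share_all_true = power  # 0.30

# More realistic: only half of the tested relationships are true.
p_true = 0.5
share_half_true = p_true * power + (1 - p_true) * alpha  # 0.175
```

Any observed share of significant results far above this expectation signals researcher or publication bias rather than unusually lucky science.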

Fanelli, Daniele. "'Positive' results increase down the hierarchy of the sciences." PLoS ONE 5.4 (2010).
Computer science studies: 80% significant results (software engineering: 50%).

Similar calculations for software engineering suggest that at least one third of the statistically significant results in software engineering experiments are incorrect! M. Jørgensen, T. Dybå, D. I. K. Sjøberg, and K. Liestøl. Incorrect results in software engineering experiments: How to improve research practices. To appear in Journal of Systems and Software.

Illustration of researcher and publication bias: How easy is it to find statistically significant results with "flexible" statistical analyses? I made a simple test...

My hypothesis: People with longer names write more complex texts.
–Dr. Pensenschneckerdorf: "The results advocate, when presupposing satisfactory statistical power, that the evidence backing up a positive effect is weak."
–Dr. Hart: "We found no effect."

Design and results of the study
Variables:
–LengthOfName: Length of the surname of the first author
–Complexity1: Number of words per paragraph
–Complexity2: Flesch-Kincaid reading level
Data collection:
–The first 20 publications identified by Google Scholar using the search string "software engineering" for year (n=20 is a typical, low statistical power sample for software engineering studies)
Results were statistically significant!
–r LengthOfName,Complexity1 = (p=0.007)
–r LengthOfName,Complexity2 = (p=0.008)
Conclusion: The analysis rejects the null hypothesis that there is no difference, i.e., it supports the claim that longer names are associated with more complex texts. Results can be published!?
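The reported correlations are Pearson correlations with p-values. A minimal sketch of how such a test is computed, using made-up illustrative data (NOT the study's data; the t statistic is compared against a t distribution with n−2 degrees of freedom to get the p-value):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_stat(r, n):
    """t statistic for H0: rho = 0 (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical data: surname length vs. words per paragraph.
name_len = [4, 6, 7, 9, 11, 12, 14, 18]
words_pp = [35, 40, 38, 52, 55, 60, 58, 75]
r = pearson_r(name_len, words_pp)
t = t_stat(r, len(name_len))
# Significant at p < 0.05 (two-tailed) if t exceeds 2.447 for df = 6.
```

In practice one would use a library routine (e.g. `scipy.stats.pearsonr`) that returns the p-value directly.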

The regression line also supports a relation between length of name and complexity of writing.

How did I do it? (How to easily get "interesting" results in any study)
Publication bias: Only the two (out of fourteen) measures of paper complexity that came out significant were reported.
Researcher bias 1: A (defendable?) post hoc (after looking at the data) change in how to measure name length.
–The use of surname length was motivated by the observation that not all authors gave their first name.
Researcher bias 2: A (defendable?) post hoc removal of two observations.
–Motivated by the lack of data for the Flesch-Kincaid measure for those two papers.
Low number of observations: Statistical power approx. 0.3 (assuming r=0.3, p<0.05).
–If research were a game where the winners have p<0.05, five studies with 20 observations each would be much better than one with 100.
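The cherry-picking step alone is powerful: with 14 candidate complexity measures at α = 0.05, the chance of at least one "significant" result on pure noise is already about 1 − 0.95¹⁴ ≈ 51% (if the measures were independent). A hedged sketch that also simulates the single-test false-positive rate for n = 20 (0.444 is the two-tailed p < 0.05 critical value of |r| for 20 observations):

```python
import math
import random

random.seed(42)

# At least one false positive among 14 independent tests at alpha = 0.05:
p_any = 1 - 0.95 ** 14  # about 0.51

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Correlate pure noise (n = 20) many times; count "significant" results.
hits, trials = 0, 2000
for _ in range(trials):
    x = [random.random() for _ in range(20)]
    y = [random.random() for _ in range(20)]
    if abs(pearson_r(x, y)) > 0.444:  # two-tailed p < 0.05 for n = 20
        hits += 1
false_positive_rate = hits / trials  # close to 0.05 per single test
```

Per test the false-positive rate stays near 5%, but reporting only the best of 14 such tests turns noise into publishable "findings" about half the time.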

ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART I: MISSING VARIABLES LEADING TO INCORRECT CONCLUSIONS

Here is my first misuse of statistics... I measured an increase in the productivity of an IT department (function points per man-month). The management was happy, since this showed that their newly implemented processes (time boxing) had been successful. Later, to my surprise, when I grouped the projects into those using a 4GL and those using a 3GL (Cobol), I found a productivity decrease in both groups.

(Table: function points, effort, and productivity for 4GL development, 3GL development, and in total, for Period 1 and Period 2, with the change in productivity.)

Missing variable in the analysis
The increase in total productivity was caused by more and more of the work being done in the higher-productivity environment (4GL).
All teams had decreased their productivity, but the higher-productivity teams had done more of the work.
With missing variables we can get very misleading results. Typically, we are poor at noticing what is NOT there in an analysis.
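This is Simpson's paradox. A sketch with made-up numbers (not the slide's actual figures) showing how both groups can get less productive while the total productivity rises:

```python
# Hypothetical (function points, effort in man-months) per technology group.
p1 = {"4GL": (100, 50), "3GL": (100, 200)}   # Period 1: prod 2.0 and 0.5
p2 = {"4GL": (180, 100), "3GL": (40, 100)}   # Period 2: prod 1.8 and 0.4

def prod(fp, effort):
    return fp / effort

def total_prod(period):
    fp = sum(f for f, _ in period.values())
    effort = sum(e for _, e in period.values())
    return fp / effort

# Both technology groups got LESS productive...
assert prod(*p2["4GL"]) < prod(*p1["4GL"])   # 1.8 < 2.0
assert prod(*p2["3GL"]) < prod(*p1["3GL"])   # 0.4 < 0.5
# ...yet TOTAL productivity went UP, because the high-productivity
# 4GL group did a larger share of the work in Period 2.
assert total_prod(p2) > total_prod(p1)       # 1.1 > 0.8
```

The aggregate number hides the shifting mix of work between the groups, which is exactly the missing variable.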

ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART II: ASSUMING FIXED VARIABLES WHEN THE VARIABLES ARE STOCHASTIC

Sir Francis Galton ("Filial regression to mediocrity"):
–The father of regression analysis
–The first to violate – and be misled by – the fixed-variable assumption
–Perhaps the most common mistake made in statistical analyses (ref. Milton Friedman, Nobel Prize winner in economics)

REGRESSION ANALYSIS, ANOVA, T-TESTS, CATEGORICAL ANALYSES: ALL OF THEM REQUIRE FIXED VARIABLES. IF WE HAVE STOCHASTIC VARIABLES (RANDOM VARIABLES, VARIABLES WITH MEASUREMENT ERROR), THE RESULTS ARE BIASED.

Illustration: Salary discrimination?
Assume an IT company which:
–Has 100 different tasks it wants completed and for each task hires one male and one female worker (200 workers).
–The "base salary" of a task varies (randomly) from to USD and is the same for the male and the female employee.
–The actual salary is the "base salary" plus a random, gender-independent bonus. This is done through the use of a "lucky wheel" with numbers (bonuses) between 0 and.
This should lead to (on average): Salary of female = Salary of male.
A regression analysis with female salary as the dependent variable shows, however, that the females are discriminated against (less likely to get a high bonus)!
–Salary of female = * Salary of male
On the other hand, with male salary as the dependent variable, the men are discriminated against!
–Salary of male = * Salary of female

(Scatter plot: salary of men vs. salary of women.)
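The lucky-wheel setup is easy to simulate. A sketch with assumed figures (base salary uniform on an arbitrary range, independent uniform bonuses; 400 tasks rather than the slide's 100, only to make the single simulated draw stabler): regressing either gender's salary on the other's gives a slope below 1, so each direction "shows" discrimination.

```python
import random

random.seed(3)

# Hypothetical salary model: shared base per task + independent bonus per worker.
n = 400
base = [random.uniform(30000, 70000) for _ in range(n)]
salary_m = [b + random.uniform(0, 20000) for b in base]
salary_f = [b + random.uniform(0, 20000) for b in base]

def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

b_f_on_m = ols_slope(salary_m, salary_f)  # < 1: "women discriminated against"
b_m_on_f = ols_slope(salary_f, salary_m)  # < 1: "men discriminated against"
```

Both slopes are attenuated toward Var(base) / (Var(base) + Var(bonus)) because the predictor is measured with "error" (the other worker's random bonus); this is regression to the mean, not discrimination.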

This violation of the fixed-variable assumption may be the reason why most studies report an economy of scale (or linear returns to scale), while the IT industry experiences a diseconomy of scale. (M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development. Journal of Systems and Software, 85(11), 2012.)

ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART III: PUBLICATION BIAS MAKES US BELIEVE IN TOO LARGE EFFECTS

“Why most discovered true associations are inflated”, Ioannidis, Epidemiology, Vol 19, No 5, Sept 2008
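The inflation Ioannidis describes (the "winner's curse") is easy to reproduce: simulate many under-powered studies of a small true effect and average only the significant ones. A hedged sketch with assumed numbers (true standardized effect 0.2, n = 20 per study, unit variance):

```python
import random

random.seed(7)

TRUE_EFFECT = 0.2            # assumed true standardized mean difference
N = 20                       # per-study sample size -> low power
SE = 1 / N ** 0.5            # standard error of the mean (sd = 1)

significant = []
for _ in range(5000):
    xs = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    mean = sum(xs) / N
    if abs(mean) > 1.96 * SE:        # two-tailed p < 0.05
        significant.append(abs(mean))

observed = sum(significant) / len(significant)
# Only a minority of studies reach significance, and those that do
# report an effect well above the true 0.2.
```

Any significant result must exceed 1.96 × SE ≈ 0.44 here, so the published average is more than double the true effect by construction.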

Effect sizes in studies on pair programming. Source: Hannay, Jo E., et al. "The effectiveness of pair programming: A meta-analysis." Information and Software Technology 51.7 (2009).

Total publication bias (only statistically significant results are published) implies that published results have ZERO evidential strength!

ERRORS WE FREQUENTLY MAKE IN EMPIRICAL STUDIES – PART IV: SOME COMMON TYPES OF REGRESSION ANALYSIS ARE EXTREME CASES OF PUBLICATION BIAS

Illustration: Building a regression model
Data set:
–Effort variable + 15 other project variables
–Twenty software projects
Regression model:
–Selected the best 4-variable regression model (OLS), based on "best subset"
–Removed one outlier
Results:
–R² = 76%
–R²-adj = 70%
–R²-pred = 56%
–MdMRE = 28%

Not bad results... especially since all the data were random numbers between 1 and 10! Best subset is a rather extreme form of publication bias, but the same problem is present with stepwise regression. A "best 4 out of 15 variables" model means that we publish only the best model out of 1365 tested models! (→ extreme selection bias)
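The selection effect can be reproduced directly: fit all C(15,4) = 1365 four-variable OLS models to pure noise and keep the best one. A sketch under assumed conditions (20 projects, predictors and effort all uniform random on [1, 10]; no outlier removal, so the R² will be lower than the slide's 76%, but still far above what the noise deserves):

```python
import itertools
import random

random.seed(5)

# Pure noise: 20 "projects", 15 candidate predictors, random "effort".
n, p = 20, 15
X = [[random.uniform(1, 10) for _ in range(p)] for _ in range(n)]
y = [random.uniform(1, 10) for _ in range(n)]

def solve(A, b):
    """Solve A @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]

def r_squared(cols):
    """R^2 of an OLS fit of y on the predictors in cols (plus intercept)."""
    Z = [[1.0] + [X[i][j] for j in cols] for i in range(n)]
    k = len(Z[0])
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    rhs = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(A, rhs)
    ybar = sum(y) / n
    ss_res = sum((yi - sum(zi * bi for zi, bi in zip(Z[i], beta))) ** 2
                 for i, yi in enumerate(y))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

r2s = [r_squared(cols) for cols in itertools.combinations(range(p), 4)]
best_r2, avg_r2 = max(r2s), sum(r2s) / len(r2s)
```

The average R² across the 1365 models is roughly what chance predicts for 4 predictors on 20 points, while the reported "best" model looks much better: publishing only the maximum is the selection bias the slide describes.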

What to do about it
–Base regression variable inclusion on a priori judgment of importance.
–Do not use R² or similar measures to assess the goodness of your prediction model.
–Compare the model against reasonable alternatives.
–Increase the statistical power.

Last words
"Appearances to the mind are of four kinds. Things either are what they appear to be; or they neither are, nor appear to be; or they are, and do not appear to be; or they are not, and yet appear to be. Rightly to aim in all these cases is the wise man's task." – Epictetus (AD ), Discourses, Book 1, Chapter 27
... and we have to increase the statistical power of our empirical studies and accept publication of non-significant results.

BONUS MATERIAL

Example: Is there an economy of scale in IT projects?
Created a data set where, for all project sizes, each "true" line of code (LOC) took one hour, i.e., LOC_true = Effort.
–This gives a true productivity of one LOC per hour for all projects regardless of size, i.e., there is no economy of scale.
Each measurement of lines of code had some measurement error added, i.e., errors due to forgetting to count lines of code, counting the same code twice, different counting practices in different projects, etc.
–Observed LOC = true LOC + measurement error
–The observed LOC is now a stochastic variable and not a fixed variable.
Project data were divided into four size groups (very small, small, large, very large) based on their observed lines of code, and the productivity for each category was measured.

What we then observe is an (incorrect) economy of scale.
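The setup above can be sketched in a few lines, with assumed numbers (size range, a 30% relative measurement error, and quartile grouping are all illustrative choices, not the slide's exact figures):

```python
import random

random.seed(11)

# True productivity is exactly 1 LOC/hour for every project: effort = LOC_true.
# Observed LOC adds measurement error; projects are grouped by OBSERVED size.
projects = []
for _ in range(4000):
    loc_true = random.uniform(10000, 20000)               # hypothetical sizes
    effort = loc_true                                     # 1 hour per true LOC
    loc_obs = loc_true + random.gauss(0, 0.3 * loc_true)  # assumed 30% error
    projects.append((loc_obs, loc_obs / effort))          # measured productivity

projects.sort()                                           # sort by observed size
q = len(projects) // 4
mean = lambda vals: sum(vals) / len(vals)
prod_small = mean([p for _, p in projects[:q]])           # "very small" group
prod_large = mean([p for _, p in projects[-q:]])          # "very large" group
# prod_large > 1 > prod_small: an apparent, but spurious, economy of scale.
```

Projects whose LOC was over-counted land in the "large" groups and look over-productive, while under-counted projects do the opposite; the grouping variable and the productivity measure share the same error term, which manufactures the scale effect.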