Experimentation in Computer Science (Part 2)
Experimentation in Software Engineering
Outline:
- Empirical Strategies
- Measurement
- Experiment Process
Experiment Process: Phases
Experiment Idea → Experiment Definition → Experiment Planning → Experiment Operation → Analysis & Interpretation → Presentation & Package → Conclusions
Experiment Process: Phases Defined
- Experiment Idea: ask the right question (insight)
- Experiment Definition: ask the question right
- Experiment Planning: design the experiment to answer the question
- Experiment Operation: collect metrics
- Analysis & Interpretation: statistically evaluate the results and determine their practical consequences
- Presentation & Package: disseminate the results
Experiment Definition: Overview
- Formulate the experiment idea: ask the right question
- Define goals: why conduct the experiment
- State research questions:
  - Descriptive: what percentage of developers use OO?
  - Relational: how does the percentage of experienced developers using OO compare with the percentage of novices using OO?
  - Causal: is the average productivity of developers using OO higher than that of developers using non-OO?
Experiment Definition: Overview – Example
How do test suite size and test case composition affect the costs and benefits of web testing methodologies?
Experiment Planning: Overview
Steps (between Experiment Definition and Experiment Operation):
- Context Selection
- Hypothesis Formulation
- Variables Selection
- Selection of Subjects
- Experiment Design
- Instrumentation
- Validity Evaluation
Experiment Planning: Context Selection
Context: the environment and personnel involved. Dimensions include:
- off-line vs. on-line
- student vs. professional personnel
- toy vs. real problems
- specific vs. general software engineering domain
Selection driver: validity vs. cost
Experiment Planning: Hypothesis Formulation
- Hypothesis: a formal statement related to a research question
- Forms the basis for statistical analysis of the results through hypothesis testing
- Data collected in the experiment is used to reject the hypothesis, if possible
Experiment Planning: Hypothesis Formulation
There are two hypotheses for each question of interest:
- Null hypothesis, H0: describes the state in which the prediction does not hold
- Alternative hypothesis, Ha (also written H1): describes the prediction we believe the evidence will support
The goal of the experiment is to reject H0 with as high significance as possible; rejecting H0 supports the alternative hypothesis.
Experiment Planning: Hypothesis Formulation
Hypothesis testing involves risks:
- Type I error: the probability (α) of rejecting a true null hypothesis; we infer a pattern or relationship that does not exist
- Type II error: the probability (β) of not rejecting a false null hypothesis; we fail to identify a pattern or relationship that does exist
- Power of a statistical test: the probability that the test reveals a true pattern when the null hypothesis is false; power = 1 − β
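Both error rates can be estimated empirically by simulation. A minimal sketch (all names and data are hypothetical; `welch_t` plus the critical value 2.02 approximate a two-sided t-test at α = 0.05 with roughly 38 degrees of freedom):

```python
import random
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / (va / na + vb / nb) ** 0.5

def rejection_rate(mu_a, mu_b, n=20, trials=2000, crit=2.02, seed=1):
    """Fraction of simulated experiments in which H0: mu_a == mu_b is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        a = [rng.gauss(mu_a, 1.0) for _ in range(n)]
        b = [rng.gauss(mu_b, 1.0) for _ in range(n)]
        if abs(welch_t(a, b)) > crit:
            rejected += 1
    return rejected / trials

# H0 true (equal means): the rejection rate estimates P(Type I error), near 0.05.
alpha_hat = rejection_rate(0.0, 0.0)
# H0 false (means differ by one std dev): the rejection rate estimates power = 1 - beta.
power_hat = rejection_rate(0.0, 1.0)
```

Running more trials tightens both estimates; raising `n` increases `power_hat` without changing `alpha_hat`.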
Experiment Planning: Variable Selection
Types of variables to select:
- Independent: manipulated by the investigator or by nature
- Dependent: affected by changes in the independent variables
Also select:
- Measures and measurement scales
- Ranges for the variables
- Specific levels of the independent variables to be used
Experiment Planning: Selection of Subjects/Objects
The selection process strongly affects the ability to generalize results.
Process for selecting subjects/objects:
- Identify the population U
- Draw a sample from U using a sampling technique
Experiment Planning: Selection of Subjects/Objects
Probability sampling:
- Simple random: randomly select subjects from U
- Systematic random: select the first subject from U at random, then select every n-th subject after that
- Stratified random: divide U into strata following a known distribution, then sample randomly within each stratum
Non-probability sampling:
- Convenience: select the nearest, most convenient subjects
- Quota: used to get subjects from various elements of a population; convenience sampling is applied within each element
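The three probability-sampling techniques can be sketched in a few lines of Python (the population `U` and the experience strata are hypothetical):

```python
import random

rng = random.Random(42)
# Hypothetical population U: 100 developers, each tagged with an experience stratum.
U = [{"id": i, "stratum": "expert" if i % 5 == 0 else "novice"} for i in range(100)]

# Simple random sampling: every subject has the same chance of selection.
simple = rng.sample(U, 10)

# Systematic random sampling: random start, then every n-th subject.
step = len(U) // 10
start = rng.randrange(step)
systematic = U[start::step]

# Stratified random sampling: random sampling within each known stratum,
# proportional to the stratum's share of the population.
strata = {}
for subject in U:
    strata.setdefault(subject["stratum"], []).append(subject)
stratified = []
for members in strata.values():
    k = round(10 * len(members) / len(U))
    stratified.extend(rng.sample(members, k))
```

Here 20% of `U` is "expert", so the stratified sample of 10 contains exactly 2 experts by construction, whereas the simple random sample only matches that proportion on average.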
Experiment Planning: Selection of Subjects/Objects
- Larger sample sizes yield lower error
- If the population has large variability, a larger sample is needed
- The chosen data analysis methods may also influence the sample size
- However, a larger sample implies higher cost
- Hence we want a sample as small as possible, yet large enough to generalize from
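The variability/sample-size trade-off can be made concrete with the standard normal-approximation formula for estimating a mean to within a given margin of error (a sketch, assuming approximately normal observations; the numbers are illustrative):

```python
import math

def sample_size(sigma, margin, z=1.96):
    """Minimum n so that a 95% confidence interval for the mean has
    half-width <= margin, given population std dev sigma: n = (z*sigma/margin)^2."""
    return math.ceil((z * sigma / margin) ** 2)

low_var = sample_size(sigma=5.0, margin=2.0)
high_var = sample_size(sigma=15.0, margin=2.0)
# Tripling the std dev multiplies the required sample size by roughly nine,
# which is why high-variability populations are so much more expensive to study.
```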
Experiment Planning: Experiment Design - Principles
- Randomization: statistical methods require observations from independent random variables; applies to the selection of subjects and objects and to the assignment of treatments
- Blocking: given a factor that may affect results but is not of interest, we block subjects, objects, or techniques with respect to that factor and analyze each block independently (e.g., the program in the TSE paper)
- Balancing: assign treatments so that each has an equal number of subjects; not essential, but it simplifies and strengthens the statistical analysis
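All three principles can be combined in a single assignment procedure. A minimal sketch (the subjects, blocking function, and method names are hypothetical):

```python
import random

def assign(subjects, treatments, block_of, seed=0):
    """Blocked, balanced, randomized assignment: within each block,
    shuffle the subjects and deal treatments out round-robin."""
    rng = random.Random(seed)
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of(s), []).append(s)   # blocking
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                           # randomization
        for i, s in enumerate(members):
            assignment[s] = treatments[i % len(treatments)]  # balancing
    return assignment

# Hypothetical: 12 subjects blocked by experience level.
subjects = list(range(12))
plan = assign(subjects, ["old_method", "new_method"],
              block_of=lambda s: "expert" if s < 4 else "novice")
```

Because treatments are dealt round-robin after the shuffle, each treatment gets the same number of subjects overall and within every block, while the individual assignments remain random.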
Experiment Planning: Experiment Design - Design Types
We will consider several design types, suitable for experiments with:
- One factor with two treatments
- One factor with more than two treatments
- Two factors with two treatments each
- More than two factors, each with two treatments
Notation: μi is the mean of the dependent variable for treatment i
Experiment Planning: Experiment Design – 1 Fctr, 2 Trtmts
Design type: completely randomized
Description: simple comparison of means; each subject is randomly assigned exactly one treatment (X marks the assignment), e.g.:

Subject | Trtmt 1 | Trtmt 2
   1    |    X    |
   2    |         |    X
   3    |         |    X
   4    |    X    |
   5    |    X    |
   6    |         |    X

Example hypotheses:
H0: μ1 = μ2
H1: μ1 ≠ μ2, μ1 > μ2, or μ1 < μ2
Examples of analyses: t-test, Mann-Whitney
EXAMPLE: Investigate whether humans using a new testing method detect faults better than humans using a previous method. The factor is the method, the treatments are the old and new methods, and the dependent variable could be the number of faults found.
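For the nonparametric analysis listed here, the Mann-Whitney U statistic is simple enough to compute by hand. A sketch on hypothetical fault counts from a completely randomized design:

```python
def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic: counts how often an observation from
    sample_a outranks one from sample_b (ties count one half)."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical fault counts: each subject used exactly one testing method.
old_method = [3, 5, 4, 4, 2, 5]
new_method = [6, 7, 5, 8, 6, 7]

u_new = mann_whitney_u(new_method, old_method)
n1, n2 = len(new_method), len(old_method)
# U near n1*n2 (here 36) suggests new_method tends to find more faults;
# U near n1*n2/2 (here 18) is consistent with H0: no difference.
```

In practice U is compared against tabulated critical values (or a normal approximation) to decide whether H0: μ1 = μ2 can be rejected.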
Experiment Planning: Experiment Design – 1 Fctr, 2 Trtmts
Design type: paired comparison
Description: compares the difference between the techniques more precisely; each subject applies both treatments, so beware learning effects (randomize the order)
Example hypotheses:
H0: μd = 0 (μd = mean of the per-subject differences)
H1: μd ≠ 0, μd > 0, or μd < 0
Examples of analyses: paired t-test, sign test, Wilcoxon
EXAMPLE: Investigate whether a new testing criterion facilitates fault detection better than a previous criterion. The factor is the criterion, the treatments are use of the old and new criteria, and the dependent variable could be the number of faults found.
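The paired analyses above all start from per-subject differences. A sketch of the sign-test bookkeeping on hypothetical paired data (each subject applied both criteria):

```python
def sign_test_counts(pairs):
    """Sign test bookkeeping for a paired design: count subjects whose
    score improved, worsened, or tied under the new treatment."""
    pos = sum(1 for old, new in pairs if new > old)
    neg = sum(1 for old, new in pairs if new < old)
    ties = len(pairs) - pos - neg
    return pos, neg, ties

# Hypothetical rows: (faults found with old criterion, faults found with new criterion).
pairs = [(3, 5), (4, 4), (2, 6), (5, 7), (4, 6), (3, 3), (5, 8), (2, 4)]
pos, neg, ties = sign_test_counts(pairs)

diffs = [new - old for old, new in pairs]
mean_diff = sum(diffs) / len(diffs)   # sample estimate of mu_d in H0: mu_d = 0
```

The sign test then asks how likely a 6-to-0 split of non-tied pairs would be if improvements and regressions were equally probable; the paired t-test instead tests whether `mean_diff` differs significantly from zero.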
Experiment Planning: Experiment Design – 1 Fctr, 3+ Trtmts
Design type: completely randomized
Description: comparison of means; each subject is randomly assigned exactly one treatment (X marks the assignment), e.g.:

Subject | Trtmt 1 | Trtmt 2 | Trtmt 3
   1    |    X    |         |
   2    |         |    X    |
   3    |         |         |    X
   4    |    X    |         |
   5    |         |    X    |
   6    |         |         |    X

Example hypotheses:
H0: μ1 = μ2 = μ3 = … = μa
H1: μi ≠ μj for at least one pair (i, j)
Examples of analyses: ANOVA, Kruskal-Wallis
EXAMPLE: Investigate whether humans using a new testing method detect faults better than humans using two previous methods. The factor is the method, the treatments are the new method and the two old methods, and the dependent variable could be the number of faults found.
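The one-way ANOVA listed here compares between-treatment variability to within-treatment variability. A self-contained sketch of the F statistic on hypothetical fault counts (one method per subject):

```python
def one_way_f(groups):
    """F statistic for a one-way (completely randomized) ANOVA:
    mean square between treatments over mean square within treatments."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical fault counts under three testing methods:
method_a = [3, 4, 2, 3]
method_b = [5, 6, 5, 4]
method_c = [6, 7, 8, 7]
f_stat = one_way_f([method_a, method_b, method_c])
# Compare f_stat against the F distribution with (k-1, n-k) = (2, 9) df;
# a large value is evidence against H0: mu_1 = mu_2 = mu_3.
```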
Experiment Planning: Experiment Design – 1 Fctr, 3+ Trtmts
Design type: randomized complete block
Description: compares differences; especially useful when there is large variability between subjects; each subject (block) receives every treatment, in randomized order, e.g.:

Subject | Trtmt 1 | Trtmt 2 | Trtmt 3
   1    |    X    |    X    |    X
   2    |    X    |    X    |    X
   3    |    X    |    X    |    X

Example hypotheses:
H0: μ1 = μ2 = μ3 = … = μa
H1: μi ≠ μj for at least one pair (i, j)
Examples of analyses: ANOVA, Kruskal-Wallis
EXAMPLE: Investigate whether a new testing criterion facilitates fault detection better than two previous criteria. The factor is the criterion, the treatments are use of the new and old criteria, and the dependent variable could be the number of faults found.
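One way to see why blocking helps: ranking treatments within each subject cancels out between-subject differences, which is the idea behind the Friedman test (the blocked analog of the Kruskal-Wallis analysis listed above; the data below is hypothetical, and tied ranks are handled crudely for brevity):

```python
def within_block_ranks(blocks):
    """Rank each treatment within each subject (block); comparing average
    ranks across treatments removes between-subject variability."""
    k = len(blocks[0])
    totals = [0.0] * k
    for scores in blocks:
        order = sorted(range(k), key=lambda j: scores[j])
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / len(blocks) for t in totals]

# Hypothetical: each subject (row) used all three criteria (columns);
# entries are faults found. Subjects differ a lot in overall skill,
# but the within-subject ordering is consistent.
blocks = [
    [3, 5, 6],
    [2, 4, 7],
    [4, 4, 5],   # tied scores are ranked in list order here, not averaged
    [3, 6, 8],
]
avg_ranks = within_block_ranks(blocks)
```

Average ranks near (k+1)/2 for every treatment are consistent with H0; a clear ordering, as here, suggests a treatment effect despite the large between-subject variability.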
Experiment Planning: Experiment Design – 2 Fctrs, 2 Trtmts
Design type: 2*2 factorial; each factor has two treatments and every combination of treatments occurs
Assignment of subjects:

         |   Trtmt A1    |   Trtmt A2
Trtmt B1 | Subjects 4, 6 | Subjects 1, 7
Trtmt B2 | Subjects 2, 3 | Subjects 5, 8

Three hypotheses:
- Effect of treatment Ai
- Effect of treatment Bi
- Effect of the interaction between Ai and Bi
Example hypotheses (instantiated for each factor and for the interaction):
H0: τ1 = τ2 = 0 (no effect)
H1: at least one τi ≠ 0
Examples of analyses: ANOVA
EXAMPLE: Investigate regression testability of code using retest-all and regression test selection, in the case where tests are coarse-grained and the case where they are fine-grained. Factor A is the technique, Factor B is the test granularity.
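The three effect estimates behind the factorial ANOVA can be computed directly from the cell means. A sketch on hypothetical scores, with two subjects per cell as in the 2*2 design:

```python
def factorial_effects(cells):
    """Main effects and interaction from a balanced 2*2 factorial.
    cells[(a, b)] holds the observations for factor levels a, b in {1, 2}."""
    mean = {k: sum(v) / len(v) for k, v in cells.items()}
    effect_a = (mean[2, 1] + mean[2, 2] - mean[1, 1] - mean[1, 2]) / 2
    effect_b = (mean[1, 2] + mean[2, 2] - mean[1, 1] - mean[2, 1]) / 2
    # Interaction: does the effect of A change across the levels of B?
    interaction = ((mean[2, 2] - mean[1, 2]) - (mean[2, 1] - mean[1, 1])) / 2
    return effect_a, effect_b, interaction

# Hypothetical fault-detection scores: A = technique, B = test granularity.
cells = {
    (1, 1): [4, 6], (2, 1): [7, 9],
    (1, 2): [5, 5], (2, 2): [12, 14],
}
ea, eb, eab = factorial_effects(cells)
# A nonzero interaction means the techniques cannot be compared
# without also stating the test granularity.
```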
Experiment Planning: Experiment Design – k Fctrs, 2 Trtmts
Given k factors, results can depend on each factor or on interactions among them. A 2^k design has k factors with two treatments each and tests all combinations. Hypotheses and analyses are the same as for the 2*2 factorial.

Fctr A | Fctr B | Fctr C | Subjects
  A1   |   B1   |   C1   |  2, 3
  A2   |   B1   |   C1   |  1, 13
  A1   |   B2   |   C1   |  5, 6
  A2   |   B2   |   C1   |  10, 16
  A1   |   B1   |   C2   |  7, 15
  A2   |   B1   |   C2   |  8, 11
  A1   |   B2   |   C2   |  4, 9
  A2   |   B2   |   C2   |  12, 14
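Generating a full 2^k design and randomly assigning subjects to its runs is mechanical. A sketch (two subjects per combination, mirroring the table above; the seed and subject pool are arbitrary):

```python
import itertools
import random

def full_factorial(levels_per_factor):
    """All treatment combinations of a factorial design."""
    return list(itertools.product(*levels_per_factor))

factors = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
runs = full_factorial(factors)          # 2^3 = 8 combinations

# Randomly deal two subjects to each combination.
rng = random.Random(7)
subjects = list(range(1, 17))
rng.shuffle(subjects)
assignment = {run: (subjects[2 * i], subjects[2 * i + 1])
              for i, run in enumerate(runs)}
```

The same code scales to any k; the cost warning on the next slide follows directly from `len(runs)` doubling with each added factor.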
Experiment Planning: Experiment Design – k Fctrs, 2 Trtmts
As the number of factors grows, expense grows. If high-order interactions can be assumed to be negligible, it is possible to run only a fraction of the complete factorial. This approach is useful in particular for exploratory studies, to identify factors with large effects; results can be strengthened by running the other fractions in sequence.

One-half fractional factorial of the 2^3 design (combinations are selected such that, if any one factor is removed, the remaining design is a full 2^(k-1) factorial):

Fctr A | Fctr B | Fctr C | Subjects
  A2   |   B1   |   C1   |  2, 3
  A1   |   B2   |   C1   |  1, 8
  A1   |   B1   |   C2   |  5, 6
  A2   |   B2   |   C2   |  4, 7
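The half fraction in the table can be constructed by coding the two levels of each factor as -1/+1 and keeping the runs whose level product is +1 (the defining relation I = ABC in design-of-experiments terms). A sketch:

```python
import itertools

# Code level 1 as -1 and level 2 as +1; keep the principal half fraction,
# i.e. the runs (a, b, c) with a*b*c == +1.
full = list(itertools.product((-1, 1), repeat=3))
half = [run for run in full if run[0] * run[1] * run[2] == 1]

# Removing any one factor from the half fraction leaves a full 2^(k-1)
# design in the remaining factors -- the selection rule from the slide:
projected = {(a, b) for a, b, _ in half}
```

With this coding, `half` reproduces exactly the four combinations shown in the table (A2 B1 C1, A1 B2 C1, A1 B1 C2, A2 B2 C2), and `projected` is a full 2^2 design in factors A and B.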