Introducing Concepts of Statistical Inference Beth Chance, John Holcomb, Allan Rossman Cal Poly – San Luis Obispo, Cleveland State University.

Presentation transcript:

Introducing Concepts of Statistical Inference Beth Chance, John Holcomb, Allan Rossman Cal Poly – San Luis Obispo, Cleveland State University

Ptolemaic Curriculum?
“Ptolemy’s cosmology was needlessly complicated, because he put the earth at the center of his system, instead of putting the sun at the center. Our curriculum is needlessly complicated because we put the normal distribution, as an approximate sampling distribution for the mean, at the center of our curriculum, instead of putting the core logic of inference at the center.” – George Cobb (TISE, 2007)

Is this feasible?
Experience at post-calculus level
  - Developed spiral curriculum with logic of inference (for 2 × 2 tables) in chapter 1
  - ISCAM: Investigating Statistical Concepts, Applications, and Methods (Chance, Rossman)
New project (funded by NSF/CCLI)
  - Rethinking for lower mathematical level
  - More complete shift, including focus on entire statistical process as a whole

Workshop goals
Enable you to:
  - Re-examine how you introduce concepts of statistical inference to your students
  - Help your students to understand fundamental concepts of statistical inference
  - Develop students’ understanding of the process of statistical investigations
  - Introduce normal-based methods of inference to complement randomization-based ones

Workshop goals (cont.)
Enable you to:
  - Implement activities based on real data from genuine studies
  - Assess student understanding of inference concepts
  - Make effective use of simulations, both tactile and computer-based

Agenda
Mon pm: Inference for proportion
  - Overview, introductions
  - Statistical significance via simulation
  - Exact binomial inference
  - CI for proportion
  - Transition to normal-based inference for proportion

Agenda (cont.)
Tues am: Inference for 2 × 2 table
  - Simulating randomization test
  - Fisher’s exact test
  - Observational studies, confounding
  - Independent random samples
Tues pm: Comparing 2 groups with quant response
  - Simulating randomization test
  - Matched pairs designs

Agenda (cont.)
Wed am: Assessment issues
  - Strategies for assessing student understanding/learning
  - Preliminary findings
Wed pm: More inference scenarios
  - Comparing several groups (ANOVA, chi-square)
  - Correlation/regression
  - Discussion of implementation issues

Some notes
Agenda is always subject to change
  - Already has changed some!
We’ll discuss some assessment, implementation issues throughout
Please offer questions, comments as they arise
  - Be understanding when we don’t have all the answers!
We’ll also discuss some thorny issues that we have not resolved among ourselves

Introductions Who are you? Where/what do you teach? Why interested in this topic?

Example 1: Helper/hinderer?
Sixteen infants were shown two videotapes with a toy trying to climb a hill
  - One where a “helper” toy pushes the original toy up
  - One where a “hinderer” toy pushes the toy back down
Infants were then presented with the two toys as wooden blocks
  - Researchers noted which toy infants chose
r-Hinderer.html

Example 1: Helper/hinderer?
Data: 14 of the 16 infants chose the “helper” toy
Core question of inference:
  - Is such an extreme result unlikely to occur by chance (random selection) alone …
  - … if there were no genuine preference (null model)?

Analysis options
Could use a binomial probability calculation
We prefer a simulation approach
  - To emphasize issue of “how often would this happen in long run?”
  - Starting with tactile simulation

Strategy
Students flip a fair coin 16 times
  - Count number of heads, representing choices of “helper” toy
  - Fair coin represents null model of no genuine preference
Repeat several times, combine results
  - See how surprising to get 14 or more heads even with “such a small sample size”
  - Approximate (empirical) P-value
Turn to applet for large number of repetitions: st3/BinomDist.html

Results
  - Pretty unlikely to obtain 14 or more heads in 16 tosses of a fair coin, so …
  - Pretty strong evidence that infants do have genuine preference for helper toy and were not just picking at random
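
The coin-flipping strategy can be sketched in a few lines of Python. This is only an illustrative stand-in for the tactile simulation and the applet, not the workshop's own tool; the function names, the 10,000 repetitions, and the seed are assumptions made for the example. The exact binomial tail is included for comparison.

```python
import random
from math import comb

def simulate_p_value(n_infants=16, observed=14, reps=10_000, seed=1):
    """Approximate P(observed or more heads) when each infant is a fair coin."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    extreme = 0
    for _ in range(reps):
        # One repetition: 16 fair-coin tosses, heads = chose the "helper" toy.
        heads = sum(rng.random() < 0.5 for _ in range(n_infants))
        if heads >= observed:
            extreme += 1
    return extreme / reps

def exact_p_value(n_infants=16, observed=14):
    """Exact binomial tail probability under the 50/50 null model."""
    favorable = sum(comb(n_infants, k) for k in range(observed, n_infants + 1))
    return favorable / 2 ** n_infants

if __name__ == "__main__":
    print("simulated p-value:", simulate_p_value())
    print("exact p-value:    ", exact_p_value())
```

Both numbers should come out near 0.002, which is what "pretty unlikely" means here: about 2 repetitions in 1,000 give 14 or more heads.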

Example 1: Helper/hinderer Can do this on day 1 of course Logic of statistical inference/significance Null model, simulation, p-value, significance

Example 2: Kissing Study: 8 of 12 kissing couples lean to right Does this provide evidence against 50/50 model? Does this provide evidence against 75/25 model? What models does this provide evidence against?

Example 2: Kissing
Many new ideas here:
  - Students describe rather than perform simulation
  - Non-significant result (8/12)
  - Null model other than 50/50
  - Looking at lower tail
  - Sample size effect
  - Big idea: Interval of plausible values (CI)
  - Effect of confidence level
  - Importance of random sampling
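
The "interval of plausible values" idea can be illustrated by testing a grid of null models against the 8-of-12 result and keeping those that are not rejected. The sketch below uses exact binomial tails; doubling the smaller tail for a two-sided p-value, the 0.01 grid, and the 5% cutoff are all assumptions chosen for the example, not details taken from the slides.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def tail_p_value(observed, n, p):
    """Smaller of the upper-tail and lower-tail probabilities at `observed`."""
    upper = sum(binom_pmf(k, n, p) for k in range(observed, n + 1))
    lower = sum(binom_pmf(k, n, p) for k in range(0, observed + 1))
    return min(upper, lower)

def plausible_values(observed, n, alpha=0.05):
    """Null proportions (0.01 to 0.99) not rejected at level alpha.

    Two-sided p-value = twice the smaller tail (one common convention).
    """
    grid = [i / 100 for i in range(1, 100)]
    return [p for p in grid
            if min(1.0, 2 * tail_p_value(observed, n, p)) > alpha]

if __name__ == "__main__":
    values = plausible_values(observed=8, n=12)
    print("plausible proportions run from", values[0], "to", values[-1])
```

With only 12 couples, both the 50/50 and the 75/25 models stay in the plausible set, which is the sample size effect the slide points to. Lowering `alpha` widens the set, matching the effect of confidence level.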

Transition to normal-based inference
Two methods to find p-value for proportion:
  - Approximation by simulation
  - Exact binomial calculation
Why should we present normal approx at all?
  - Because it’s commonly used (not good reason)
  - Because even minimally observant student will notice similarities of these simulated distributions
  - Because z-scores convey additional information
    Distance from expected, measured in SDs

Example 1: Baseball Big Bang
Some non-trivial aspects
  - Defining parameter
  - Expressing hypotheses
  - Sampling distribution
The z-score conveys more information than p-value ≈ 0
95% CI
  - Does this produce more/less understanding than forming CI by inverting test?

Example 2: Which tire?
Which tire would you choose?
Fun, simple in-class data collection
  - Almost always in conjectured direction
  - May or may not be significant
Can use simulation or binomial or normal
Investigate effect of sample size

Example 3: Cat Households
Sensible to use normal approx here
H0: π = 1/3, Ha: π ≠ 1/3
z = -10.4, p-value ≈ 0
CI: (.312, .320)
P-value and CI are complementary
  - But provide different information
Statistical vs practical significance

Example 4: Female Senators
95% CI for π: (.096, .244)
Beware of biased sampling methods
If you have access to entire population: no inference to be drawn!

Example 2: Dolphin therapy?
Subjects who suffer from mild to moderate depression were flown to Honduras, randomly assigned to a treatment
Is dolphin therapy more effective than control?
Core question of inference:
  - Is such an extreme difference unlikely to occur by chance (random assignment) alone (if there were no treatment effect)?

Some approaches
Could calculate test statistic, P-value from approximate sampling distribution (z, chi-square)
  - But it’s approximate
  - But conditions might not hold
  - But how does this relate to what “significance” means?
Could conduct Fisher’s Exact Test
  - But there’s a lot of mathematical start-up required
  - But that’s still not closely tied to what “significance” means
    Even though this is a randomization test

Alternative approach
Simulate random assignment process many times, see how often such an extreme result occurs
  - Assume no treatment effect (null model)
  - Re-randomize 30 subjects to two groups (using cards)
    Assuming 13 improvers, 17 non-improvers regardless
  - Determine number of improvers in dolphin group
    Or, equivalently, difference in improvement proportions
  - Repeat large number of times (turn to computer)
  - Ask whether observed result is in tail of distribution
    Indicating saw a surprising result under null model
    Providing evidence that dolphin therapy is more effective
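
The card-shuffling procedure can be sketched as below. The slides give 30 subjects with 13 improvers and 17 non-improvers; the group size of 15 and the observed count of 10 improvers in the dolphin group are assumptions added for this example (they are not stated in the transcript), as are the function name, repetition count, and seed.

```python
import random

def dolphin_randomization(n_subjects=30, n_improvers=13, group_size=15,
                          observed=10, reps=10_000, seed=1):
    """Share of re-randomizations with at least `observed` improvers
    landing in the dolphin group, assuming no treatment effect."""
    rng = random.Random(seed)
    # One "card" per subject: 1 = improver, 0 = non-improver.
    # Under the null model each subject's outcome is fixed regardless of group.
    cards = [1] * n_improvers + [0] * (n_subjects - n_improvers)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)                      # re-randomize the subjects
        in_dolphin_group = sum(cards[:group_size])  # first 15 cards = dolphin group
        if in_dolphin_group >= observed:
            extreme += 1
    return extreme / reps

if __name__ == "__main__":
    print("approximate p-value:", dolphin_randomization())
```

Under these assumed counts the approximate p-value lands close to 0.01, so the observed result sits well out in the tail of the re-randomization distribution.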

Analysis
hins/Dolphins.html

Non-simulation approach
Exact randomization distribution
  - Hypergeometric distribution
  - Fisher’s Exact Test
  - p-value = .0127
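
The exact randomization distribution is hypergeometric, so the p-value can be computed by counting assignments directly. As a sketch: the 30 subjects and 13 improvers come from the slides, while the group size of 15 and the observed 10 improvers in the dolphin group are assumptions for this example. With those values the calculation reproduces the .0127 quoted above.

```python
from math import comb

def fisher_upper_tail(n_subjects, n_improvers, group_size, observed):
    """P(at least `observed` improvers in the treatment group) when
    `group_size` of `n_subjects` are chosen at random (hypergeometric)."""
    total = comb(n_subjects, group_size)        # all equally likely assignments
    max_k = min(n_improvers, group_size)
    favorable = sum(
        comb(n_improvers, k) * comb(n_subjects - n_improvers, group_size - k)
        for k in range(observed, max_k + 1)
    )
    return favorable / total

if __name__ == "__main__":
    p = fisher_upper_tail(n_subjects=30, n_improvers=13, group_size=15, observed=10)
    print(round(p, 4))
```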

Conclusion
Experimental result is statistically significant
  - And what is the logic behind that?
    Observed result very unlikely to occur by chance (random assignment) alone (if dolphin therapy was not effective)

Example 2: Yawning
What’s different here?
Group sizes not the same
So calculating success proportions more important
Experimental result not significant
Lack of surprising-ness is harder for students to spot than surprising-ness
Well-stated conclusion is more challenging, subtle
  - Don’t want to “accept null model”

Example 3: Murderous Nurse?
Murder trial: U.S. vs. Kristin Gilbert
  - Accused of giving patients fatal dose of heart stimulant
  - Data presented for 18 months of 8-hour shifts
  - Relative risk: 6.34

Example 3 (cont.)
Structurally the same as dolphin and yawning examples, but with one crucial difference
  - No random assignment to groups
Observational study
Allows many potential explanations other than “random chance”
  - Confounding variables
  - Perhaps she worked intensive care unit or night shift
  - Is statistical significance still relevant?
    Yes, to see if “random chance” can plausibly be ruled out as an explanation
  - Some statisticians disagree

Example 4: Native Californians?
What’s different here?
Not random assignment to groups
Independent random sampling from populations
So …
  - Scope of conclusions differs
    Generalize to larger populations, but no cause/effect conclusions
  - Use different kind of randomness in simulation
    To model use of randomness in data collection

Example 1: Lingering sleep deprivation?
Does sleep deprivation have harmful effects on cognitive functioning three days later?
  - 21 subjects; random assignment
Core question of inference:
  - Is such an extreme difference unlikely to occur by chance (random assignment) alone (if there were no treatment effect)?

One approach
Calculate test statistic, p-value from approximate sampling distribution

Randomization approach
Simulate randomization process many times under null model, see how often such an extreme result (difference in group means) occurs
Start with tactile simulation using index cards
  - Write each “score” on a card
  - Shuffle the cards
  - Randomly deal out 11 for deprived group, 10 for unrestricted group
  - Calculate difference in group means
  - Repeat many times
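
The index-card steps translate almost line for line into code. The sketch below is an illustration only: the scores are made-up placeholder values (the study's data are not in this transcript), and the function name, repetition count, and seed are likewise assumptions.

```python
import random

# Hypothetical improvement scores, invented for illustration only.
deprived = [-10, -5, 0, 2, 3, 5, 7, 9, 10, 12, 15]       # 11 subjects
unrestricted = [5, 10, 12, 13, 15, 18, 20, 25, 30, 35]   # 10 subjects

def randomization_test(group_a, group_b, reps=10_000, seed=1):
    """Return (observed difference, approximate one-sided p-value) for
    mean(group_b) - mean(group_a) under re-randomization."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_b) / n_b - sum(group_a) / n_a
    cards = list(group_a) + list(group_b)   # write each score on a card
    extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)                  # shuffle the cards
        new_a = cards[:n_a]                 # deal 11 to the deprived group
        new_b = cards[n_a:]                 # remaining 10 to the unrestricted group
        diff = sum(new_b) / n_b - sum(new_a) / n_a
        if diff >= observed - 1e-9:         # small tolerance for float rounding
            extreme += 1
    return observed, extreme / reps

if __name__ == "__main__":
    obs, p = randomization_test(deprived, unrestricted)
    print("observed difference in means:", round(obs, 2))
    print("approximate p-value:", p)
```

Swapping the mean for a median or another statistic only changes the two lines that compute `observed` and `diff`, which is the flexibility in choice of test statistic that the randomization approach offers.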

Example 1: Sleep deprivation (cont.)
Conclusion: Fairly strong evidence that sleep deprivation produces lower improvements, on average, even three days later
  - Justification: Experimental results as extreme as those in the actual study would be quite unlikely to occur by chance alone, if there were no effect of the sleep deprivation

Exact randomization distribution
Exact p-value: 2533/352,716 = .0072
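
The denominator of the exact p-value is the number of ways to choose which 11 of the 21 subjects form the deprived group. A quick check, taking the count of 2533 extreme assignments from the slide as given:

```python
from math import comb

# Number of equally likely random assignments: choose 11 of 21 subjects.
n_assignments = comb(21, 11)

# 2533 is the slide's count of assignments at least as extreme as observed.
exact_p = 2533 / n_assignments

if __name__ == "__main__":
    print(n_assignments, round(exact_p, 4))
```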

Example 2: Age discrimination?
Employee ages:
  - 25, 33, 35, 38, 48, 55, 55, 55, 56, 64
Fired employee ages in bold:
  - 25, 33, 35, 38, 48, 55, 55, 55, 56, 64
Robert Martin: 55 years old
Do the data provide evidence that the firing process was not “random”?
  - How unlikely is it that a “random” firing process would produce such a large average age?

Exact permutation distribution Exact p-value: 6 / 120 =.05
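
With only ten employees, every possible set of firings can be listed. The transcript has lost the bold marking that identified the fired employees, so the sketch below assumes three were fired (120 is the number of ways to choose 3 of 10) and that their ages were 55, 55, and 64. That assumption is not confirmed by the transcript, but it is the combination that reproduces the slide's 6 / 120.

```python
from itertools import combinations

ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]

# Assumption: the fired employees were aged 55, 55 and 64 (bold lost in transcript).
assumed_fired = [55, 55, 64]

def exact_permutation_count(ages, fired):
    """Count equally likely firing sets whose total age is at least the
    observed total (same as comparing average ages, since sizes match)."""
    groups = list(combinations(ages, len(fired)))
    extreme = sum(1 for g in groups if sum(g) >= sum(fired))
    return extreme, len(groups)

if __name__ == "__main__":
    extreme, total = exact_permutation_count(ages, assumed_fired)
    print(extreme, "/", total, "=", extreme / total)
```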

Example 3: Memorizing letters You will be given a string of 30 letters Memorize as many as you can, in order, in 20 seconds

Confidence Intervals based on Randomization Tests (Quantitative)
Invert randomization test
  - Subtract the hypothesized difference from all subjects in group B, re-randomize, add it back to all subjects in group B, compare to observed difference
  - Similar to binomial example (kissing study)
Get standard error from randomization distribution and use observed ± 2 SEs
Get percentiles from randomization distribution and use observed ± percentiles
t-interval
Bootstrapping

Series of Lab Assignments
Lab 1: Helper/Hinderer (Binomial test)
Lab 2: Dolphin Therapy (2x2 table)
Lab 3: Textbook prices (matched pairs from normal population) or JFK/JFKC (randomization on quantitative variable)
Lab 4: Random Babies
Lab 5: One-sample z-test for proportion (Reese’s Pieces)
Lab 6: Sleepless nights (t-test, confidence interval)
Lab 7: Sleep deprivation (randomization test)
Lab 8: Study Hours and GPA (regression with simulation and Minitab output)

Random Babies
Suppose that 4 mothers give birth to baby boys at the same hospital on the same night
Hospital staff returns babies to mothers at random!
How likely is it that …
  - … nobody gets the right baby?
  - … everyone gets the right baby?
  - …

Random Babies
Last Names    First Names
Jones         Jerry
Miller        Marvin
Smith         Sam
Williams      Willy

Random Babies
Last Names    First Names
Jones         Marvin
Miller        Willy
Smith         Sam         (1 match)
Williams      Jerry

Random Babies
Probability distribution
  - 0 matches: 9/24 = 3/8
  - 1 match: 8/24 = 1/3
  - 2 matches: 6/24 = 1/4
  - 3 matches: 0
  - 4 matches: 1/24
Expected value
  - 0(9/24) + 1(8/24) + 2(6/24) + 3(0) + 4(1/24) = 1
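
Because the sample space has only 24 equally likely outcomes, the distribution above can be checked by listing every way of handing out the four babies. A minimal sketch (the function name is an assumption for the example):

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def match_distribution(n=4):
    """Exact distribution of the number of mothers who get their own baby
    when n babies are returned in a uniformly random order."""
    counts = Counter(
        # perm[mother] is the baby handed to that mother; a match is baby == mother.
        sum(baby == mother for mother, baby in enumerate(perm))
        for perm in permutations(range(n))
    )
    total = sum(counts.values())  # n! equally likely outcomes
    return {k: Fraction(counts.get(k, 0), total) for k in range(n + 1)}

if __name__ == "__main__":
    dist = match_distribution()
    for k, prob in dist.items():
        print(k, "matches:", prob)
    print("expected value:", sum(k * prob for k, prob in dist.items()))
```

Three matches are impossible because if three mothers have the right baby, the fourth must too.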

Random Babies
First simulate, then do theoretical analysis
Able to list sample space
Short cuts when outcomes are actually equally likely
Simple, fun applications of basic probability

Naming Presidents List as many U.S. Presidents as you can in reverse chronological order (starting with the current President) Score = # correct before first error

Naming Presidents
Obama, Bush, Clinton, Bush
Reagan, Carter, Ford, Nixon
Johnson, Kennedy, Eisenhower, Truman
Roosevelt, Hoover, Coolidge, Harding
Wilson, Taft, Roosevelt, McKinley
Cleveland, Harrison, Cleveland, Arthur
Garfield, Hayes, Grant, Johnson
Lincoln, Buchanan, Pierce, Fillmore
Taylor, Polk, Tyler, Harrison
Van Buren, Jackson, Adams, Monroe
Madison, Jefferson, Adams, Washington

Naming Presidents
Use sample data to determine 90% t-interval
What percentage of sample values are within this interval?
  - Is this close to 90%?

Naming Presidents
Lessons:
  - Confidence interval is not a prediction interval
  - Pay attention to what the parameter (“it”) is
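
A small numerical illustration of the lesson: the 90% t-interval estimates the mean score, so it need not contain anything like 90% of the individual scores. The scores below are hypothetical values invented for the example (no class data appear in the transcript); 1.833 is the standard t critical value for 90% confidence with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical "presidents named before first error" scores for 10 students.
scores = [2, 4, 5, 5, 7, 8, 10, 12, 15, 22]

T_STAR_90_DF9 = 1.833  # t critical value: 90% confidence, n - 1 = 9 df

def t_interval(data, t_star):
    """Confidence interval for the population mean: mean ± t* · s / sqrt(n)."""
    center = mean(data)
    margin = t_star * stdev(data) / sqrt(len(data))
    return center - margin, center + margin

low, high = t_interval(scores, T_STAR_90_DF9)
share_inside = sum(low <= x <= high for x in scores) / len(scores)

if __name__ == "__main__":
    print("90% t-interval for the mean:", (round(low, 2), round(high, 2)))
    print("share of sample values inside:", share_inside)
```

For these made-up scores only 4 of the 10 values fall inside the interval, far from 90%, because the parameter being estimated is the mean and not an individual score.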

Advantages
You can do this at beginning of course
  - Then repeat for new scenarios with more richness
  - Spiraling could lead to deeper conceptual understanding
Emphasizes scope of conclusions to be drawn from randomized experiments vs. observational studies
Makes clear that “inference” goes beyond data in hand
Very powerful, easily generalized
  - Flexibility in choice of test statistic (e.g. medians, odds ratio)
  - Generalize to more than two groups
Takes advantage of modern computing power

Question #1
Should we match type of randomness in simulation to role of randomness in data collection?
  - Major goal: Recognize distinction between random assignment and random sampling, and the conclusions that each permits
  - Or should we stick to “one crank” (always re-randomize) in the analysis, for simplicity’s sake?
  - For example, with 2 × 2 table, always fix both margins, or only fix one margin (random samples from two independent groups), or fix neither margin (random sampling from one group, then cross-classifying)

Question #2
What about interval estimation?
  - Estimating effect size at least as important as assessing significance
How to introduce this?
  - Invert test
    Test “all” possible values of parameter, see which do not put observed result in tail
    Easy enough with binomial, but not as obvious how to introduce this (or if it’s possible) with 2×2 tables
  - Alternative: Estimate +/- margin-of-error
    Could estimate margin-of-error with empirical randomization distribution or bootstrap distribution

Question #3
How much bootstrapping to introduce, and at what level of complexity?
  - Use to approximate SE only?
  - Use percentile intervals?
  - Use bias-correction?
Too difficult for Stat 101 students?
Provide any helpful insights?

Question #4
What computing tools can help students to focus on understanding ideas?
  - While providing powerful, generalizable tool?
Some possibilities
  - Java applets, Flash
    Very visual, contextual, conceptual; less generalizable
  - Minitab
    Provide students with macros? Or ask them to edit? Or ask them to write their own?
  - R
    Need simpler interface?
  - Other packages?
  - StatCrunch, JMP have been adding resampling capabilities

Question #5
What about normal-based methods?
Do not ignore them!
  - Introduce after students have gained experience with randomization-based methods
  - Students will see t-tests in other courses, research literature
  - Process of standardization has inherent value
  - A common shape often arises for empirical randomization/sampling distributions
    Duh!

Assessment: Developing instruments that assess …
Conceptual understanding of core logic of inference
  - Jargon-free multiple choice questions on interpretation, effect size, etc.
  - “Interpret this p-value in context”: probability of observed data, or more extreme, under randomness, if null model is true
Ability to apply to new studies, scenarios
  - Define null model, design simulation, draw conclusion
  - More complicated scenarios (e.g., compare 3 groups)

Understanding of components of activity/simulation
Designed for use after an in-class activity using simulation.
Example Questions
  - What did the cards represent?
  - What did shuffling and dealing the cards represent?
  - What implicit assumption about the two groups did the shuffling of cards represent?
  - What observational units were represented by the dots on the dotplot?
  - Why did we count the number of repetitions with 10 or more “successes” (that is, why 10)?

Conducting small classroom experiments
Research Questions:
  - Start with a study that has a significant result or a non-significant one?
  - Start with binomial setting or 2×2 table?
  - Do tactile simulations add value beyond computer ones?
  - Do demonstrations of simulations provide less value than student-conducted simulations?

Conclusions/Lessons Learned
Put core logic of inference at center
  - Normal-based methods obscure this logic
  - Develop students’ understanding with randomization-based inference
  - Emphasize connections among
    Randomness in design of study
    Inference procedure
    Scope of conclusions
  - But more difficult than initially anticipated
    “Devil is in the details”

Conclusions/Lessons Learned
Emphasize purpose of simulation
Don’t overlook null model in the simulation
Simulation vs. Real study
Plausible vs. Possible
How much worry about being a tail probability
How much worry about p-value = probability that null hypothesis is true

Thanks very much!
Thanks to NSF (DUE-CCLI # )
Thanks to George Cobb, advisory group
More information:
  - Draft modules, assessment instruments
  - Questions/comments: