
1 Teaching the statistical investigation process with simulation-based inference BETH CHANCE, CAL POLY- SAN LUIS OBISPO NATHAN TINTLE, DORDT COLLEGE

2 Introductions  Beth  Nathan

3 Goals  What/why SBI? (11:00-11:30 ET)  One proportion examples and where to go from here (11:30-11:50)  Q+A (11:50-12:00)  Two group simulation (12:00-12:15)  How to assess and what student performance looks like (12:15-12:35)  How to get started/get more information; Q+A (12:35-12:45)

4 Brief and select history of stat ed  Consensus approach for intro stats by late 1990s, but nexus in early 1980s  Descriptive Statistics  Probability/Design/Sampling Distributions  Inference (testing and intervals)  GAISE College Report (2005)  Six pedagogical suggestions for Stat 101: Conceptual understanding, Active learning, Real data, Statistical literacy and thinking, Use technology, and Use assessments for learning

5 Brief history of stat ed  No real pressure to change content  Major changes  Increased computational resources for data collection and analysis  Recognition of the utility of simulation to enhance student understanding of random processes  Assessment results illustrating that students don’t really (a) improve much pre to post-course on standardized tests of statistical thinking or (b) retain much from a typical introductory statistics course

6 Intro Stat as a Cathedral of Tweaks (a George Cobb analogy)  Boswell, his biographer, famously described Samuel Johnson as a "cathedral of tics" because of his gesticulations and tics.  Thesis: the usual normal-distribution-worshipping intro course is a cathedral of tweaks.

7 The orthodox doctrine  The orthodox doctrine is simple  Central limit theorem justifies use of normal distribution  If observed statistic is in the tails (>2SEs), reject null  Confidence interval is estimate +/- 2SEs

8 The Cathedral of Tweaks  One tower: z vs. t  If we know the population SD we use z  If we estimate the SD we use t... except for proportions; then we use z, not t, even when we estimate the SD...  ...except when you do tests on proportions; then we use the null value in the standard error

9 Still More Tweaks  Another tower: If your data set is not normal you may need to transform  Another tower: If you work with small samples there are guidelines for when you can use methods based on the normal, e.g., n > 30, or np > 5 and n(1-p) > 5

10

11 The consequence  Few students ever leave our course seeing statistics as this

12 The consequence  The better students may get a fuzzy impression

13 The consequence  All too many noses stay too close to the canvas, and see disconnected details

14 A potential solution?  ‘Simulation-based methods’ = simulation, bootstrapping and/or permutation tests (Alt: Resampling, Randomization, etc.)  Use of these methods to:  Estimate/approximate the null distribution for significance tests  Estimate/approximate the margin of error for confidence intervals

15 General trends  Momentum behind simulation-based approach to inference in last 8-10 years  Cobb 2005 talk (USCOTS)  Cobb 2007 paper (TISE)  2011 USCOTS: The Next Big Thing  Continued workshops, sessions – e.g., numerous at eCOTS!

16 General trends  Recent curricula  Lock5 (theory and randomization, more traditional sequence of topics)  Tintle et al. ISI (theory and randomization, four pillars of inference and then chapters based on type of data)  CATALST (emphasis on modelling)  OpenIntro  Others: Statistical Reasoning in Sports (Tabor; geared to HS students)

17 General trends  Many sessions at conferences talking about approach, benefits, questions/concerns  Assessment: Multiple papers (Tintle et al. 2011, Tintle et al. 2012, Tintle et al. 2014, Chance et al. 2014, Swanson et al. 2014); Better on many things, do no harm on others; more papers coming

18 Simulating a single proportion

19 Set-up: Can dogs understand human cues?  A dog is shown two cups (on ground, 2.5 meters from dog) and then given a choice of which one to approach.  Before approaching the cups the researcher leans in one direction or the other  The dog (Harley) chooses the correct cup 9 out of 10 times  Is the dog ‘understanding’ the researcher?

20 Questions for students  What do you think?  Why?

21 In class dialogue  Probably ‘understanding’ the researcher  Assuming some things about the study design  Not always the same cup; same color/kinds of cups; object underneath doesn’t have a scent, etc.  Why ‘understanding the researcher’?  9 out of 10 is ‘convincing’  Why convincing?  Unlikely to happen by chance

22 In class dialogue  What about people who are not convinced? How would you convince them of your 'gut feeling' that 9 out of 10 is 'rare' and 'not likely to happen by chance'?  What would happen by chance is 5 or 6 or 4 or …  Flip a coin

23 In class tactile simulation  Flip coins  Students come to the front and put dots on a dotplot  Illustrate that 9 out of 10 heads is rare, confirming the intuition that 9 out of 10 correct is rare (a simulation sketch follows below)
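
For instructors who want a computational analogue of the tactile coin flipping, here is a minimal sketch in Python. It is our own illustration, not the ISI One Proportion applet; the variable names, the seed, and the 10,000 repetitions are arbitrary choices.

```python
# Simulate the "just guessing" model: each of the 10 trials is a fair coin flip.
import numpy as np

rng = np.random.default_rng(1)            # seed chosen only for reproducibility
n_trials, n_reps = 10, 10_000

# Each entry is the number of "correct" choices in one simulated set of 10 guesses.
sims = rng.binomial(n=n_trials, p=0.5, size=n_reps)

# How often does guessing alone produce 9 or more correct?
p_value = np.mean(sims >= 9)
print(f"Approximate p-value: {p_value:.4f}")   # typically near 0.011 (exact: 11/1024)
```

The dotplot students build by hand is simply a small-sample version of the `sims` distribution.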

24 Applet  http://math.hope.edu/isi (or our Wiley textbook site, Introduction to Statistical Investigations) for links to the rossmanchance.com applets  One Proportion applet demo

25 Take homes  Logic of inference very early in the class  No technical lingo  Follow-up with 6 out of 10. Mechanical arm points at a cup. Dog just guessing?

26 Another quick example  Eight out of the last 10 patients with heart transplants at St. George's Hospital died within 30 days. This made news because heart transplant surgeries were suspended pending an investigation  Historical national data show a ~15% 30-day mortality rate after heart transplant  What do you think? Would you suspend heart transplants at that hospital? Could there be another explanation?  How can we investigate the "random chance" explanation?

27 St. George’s  Simulation  Coin tossing?  Ross a die?  Spinner?  Observations  Where is distributed centered?  Why is it not symmetric?  Do I care?  Where does 8 fall in this distribution?

28 Take homes  Follow-up: 71 out of 361 patients at St. George's have died since 1986 (19.67%)

29 Take homes  Where do you go from here?  P-value/null-alternative hypothesis language  What impacts strength of evidence  Standardized statistics  Normal approximation to the binomial ("theory-based approach"; see the sketch below)  St. George's: process to population
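
As one illustration of the theory-based follow-up, here is a sketch (our own, not the textbook's applet) of the standardized statistic and its normal-approximation p-value for the dog-cues data, alongside the exact binomial tail that the simulation approximates.

```python
import math

n, observed, p0 = 10, 9, 0.5                 # dog-cues data and the "just guessing" null
p_hat = observed / n

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)      # standardized statistic
p_normal = 0.5 * math.erfc(z / math.sqrt(2))         # one-sided normal tail, P(Z > z)
p_exact = sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k)
              for k in range(observed, n + 1))       # exact P(X >= 9)

print(f"z = {z:.2f}, normal-approx p = {p_normal:.4f}, exact p = {p_exact:.4f}")
# With n = 10 the normal approximation (about 0.006) understates the exact
# tail (about 0.011), the kind of discrepancy the simulation lets students
# see for themselves.
```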

30 Take homes  Have them design their own simulations for a while  Technology – not a black box; directly connects to tactile in class simulation  Contrast with traditional approach  Lots of probability scaffolding; abstract theory; disconnection from real data; technical language and notation, etc.  Less ‘spiraling’ and less opportunity to do inference (the main objective?)

31 More take homes  SBI  Integration of GAISE (content and pedagogy)  Keeping data front and center (e.g., 6 steps of inference)  Build on strong conceptual foundation of what inference is  Layer confidence intervals, generalizability and causation on top of this foundation  Through choice of examples they see many other important issues dealing with data collection and messy data, but always in the context of a full statistical investigation

32 Q+A

33 In our course…  Chapter 1 – simulating one proportion (logic of inference – significance testing)  Chapter 2- importance of random samples (scope of inference - generalizing) (one proportion)  Chapter 3- estimation (logic of inference - confidence intervals) (one proportion)  Chapter 4 – randomized experiments vs. observational studies (scope of inference – causation) (two groups)  Chapters 5-7 – comparing two groups (proportions, quant variable, paired)  Chapters 8-10 – comparing multiple groups/regression (association)

34 In our course…  Chapters 5-10  Focus on overall statistical process  Six steps  Integrated presentation of descriptive and inferential statistics  Shuffling to break the association  3S process: Statistic, Simulate, Strength of Evidence  Theory-based approaches predict ‘what would happen if you simulated’ (more or less) and have valid predictions if certain data conditions are met  Simplified versions of those conditions, can always verify with simulation!

35 Lingering effects of sleep deprivation  Participants were trained on a visual discrimination task on the computer and then half were not allowed to sleep that night. Everyone got as much sleep as they wanted on nights 2 and 3 and then the subjects were retested. The response variable is the improvement in their reaction times (positive values indicate how much faster they were on the task the second time)

36 Lingering effects of sleep deprivation  Key question: Could this have happened by random chance alone?  Now: randomness is from the random assignment in the experiment  So what do we need to know?  How does our statistic behave by random chance alone when there really is no treatment effect?  How can we simulate this?

37 Lingering effects of sleep deprivation  Key question: Could this have happened by random chance alone?  Students take 21 index cards and write down each improvement score  The cards are shuffled and 11 are dealt to be the "sleep deprived" group; the remaining 10 are the "unrestricted sleep" group  Under the null model there is no treatment effect: it doesn't matter which group you were assigned to, your outcome would not change  After each shuffle we calculate the new statistic and build up the distribution of values the statistic takes under this model (a code sketch follows below)
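
A minimal code sketch of this shuffling simulation is below. The improvement scores are placeholders (the study's actual values are not reproduced here), and the loop mirrors the card shuffling students do by hand.

```python
# Shuffle-the-cards (permutation) simulation for a two-group comparison.
import numpy as np

rng = np.random.default_rng(3)

# Placeholder improvement scores: 11 "sleep deprived" and 10 "unrestricted sleep".
deprived     = np.array([ 2.0, -5.0, 1.5, 0.0, 3.2, -1.1, 4.0, 2.5, -0.5, 1.0, 0.8])
unrestricted = np.array([12.0,  8.5, 15.0, 6.0, 9.9, 14.2, 7.7, 11.3, 10.1, 13.4])

scores = np.concatenate([deprived, unrestricted])     # the 21 "index cards"
observed = deprived.mean() - unrestricted.mean()      # observed difference in means

null_stats = np.empty(10_000)
for i in range(10_000):
    shuffled = rng.permutation(scores)                          # shuffle the cards
    null_stats[i] = shuffled[:11].mean() - shuffled[11:].mean() # re-deal 11 and 10

# One-sided p-value: how often does shuffling alone produce a difference
# at least as far below zero as the observed one?
p_value = np.mean(null_stats <= observed)
print(f"observed difference = {observed:.2f}, approximate p-value = {p_value:.4f}")
```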

38 Lingering effects of sleep deprivation  Applet demo

39 Follow-up: original data vs. fake (shuffled) data

40 Follow-up: real data vs. fake (shuffled) data

41 Take home messages  Core logic of inference is the same  From this point on, practically a “downhill” slope  Standardized statistic is simply statistic/SE (SE from simulation)  “Quick and dirty” 95% CI is simply +/- 2*SE (SE from simulation)  Alternative choice of statistic is nice and easy  “Why are we using the mean instead of the median if the median is better?”  Students are ‘ready’ to confront different situations  Theory-based is convenient prediction when certain conditions are met – overlay of distribution
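
Continuing the permutation sketch above (so `null_stats` and `observed` are assumed to be in scope), these few lines show the "statistic / SE" and "statistic +/- 2*SE" shortcuts described on the slide; they are an illustration, not the applet's implementation.

```python
# SE of the statistic estimated from the simulated null distribution.
se = null_stats.std()

standardized = observed / se                         # standardized statistic
quick_ci = (observed - 2 * se, observed + 2 * se)    # "quick and dirty" 95% CI

print(f"standardized statistic: {standardized:.2f}")
print(f"approximate 95% CI for the difference in means: "
      f"({quick_ci[0]:.2f}, {quick_ci[1]:.2f})")
```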

42 How do you do assessment?  May ask students to use applets on exam  Applets can be used on personal devices, most can be downloaded locally in advance  But not required  Can be asked to interpret results  Can be asked to design the simulation  Do ask more conceptual questions about logic and scope of inference  Interpretation of p-value

43 What kinds of questions do you ask?  Screen capture and fill in blanks/interpret output  "What values would you use in the applet to…"  "Which graph represents the null distribution?" (e.g., where is it centered)  "Circle the dots that represent the p-value." or "Indicate on the graph how to find the p-value."  "Based on the simulated null distribution, how strong is the evidence against the null hypothesis?"  What-if questions  Show a skewed simulated distribution and ask 'what's wrong' with the theory-based p-value  How would the null distribution change if we increased the sample size?

44 Another example assessment question  Two different approaches were taken in order to yield a p-value.  Option #1. 1000 sets of 20 “coin tosses” were generated where the probability of heads was 10%. Out of the 1000 sets of tosses 129 sets had at least 4 heads occur, and so a p-value of 0.129 is obtained, showing little evidence that more than 10% of Dordt students study more than 35 hours a week.  Option #2. The Theory-Based Inference applet was used, generating a z- score of 1.49 with a p-value of 0.068, yielding moderate evidence that more than 10% of Dordt students study more than 35 hours a week.

45 Another example assessment question  [Screenshots: One Proportion applet results (Option #1) and Theory-Based Inference applet results (Option #2)]  Student question: Briefly explain which p-value (Option #1 or Option #2) is more valid and why.
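
For reference, here is a sketch that reproduces Option #1's simulation, assuming (as the slide implies) an observed result of 4 "successes" in a sample of 20 under a 10% null proportion.

```python
# 1000 sets of 20 "coin tosses" with a 10% chance of heads; count sets with 4+ heads.
import numpy as np

rng = np.random.default_rng(4)
heads_counts = rng.binomial(n=20, p=0.10, size=1000)

p_value = np.mean(heads_counts >= 4)
print(f"Simulated p-value: {p_value:.3f}")   # individual runs typically land near 0.13
```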

46 Assessment results  Major assessment studies are underway  Evidence is mounting for  Improved student conceptual understanding of numerous inferential outcomes  “No harm” on other outcomes  For both stronger and weaker students  Regardless of institution or level of instructor experience with SBI

47 Dordt's before and after story  Methods  Traditional curriculum (Moore 2010) - 94 students; spring 2011  New curriculum (ISI, 2011 version) - 155 students; fall 2011 and spring 2012  All students completed the 40-question CAOS test during the first week of the semester and again during the last week. Students were given course credit for completing the assessment test, but not for their performance, and the test was administered electronically outside of class.  Two instructors taught the course each semester, with one instructor the same every semester and the other different in spring 2011 than in fall 2011/spring 2012

48 Dordt's before and after story: overall performance  Very similar to Tintle et al. (2011) results at another institution  Approximately twice the gains using the new curriculum as compared to the traditional one (11.6% vs. 5.6%; p < 0.001)  [chart of pre-test and post-test means]

49 Dordt's before and after story: CAOS subscales (Randomization vs. Traditional cohorts; pretest, posttest, difference, paired t-test p-value, cohort comparison p-value, 95% CI for cohort difference)
 Data Collection and Design: Randomization 34.8% to 53.1% (+18.2%, p < 0.001); Traditional 34.9% to 36.5% (+1.6%, p = 0.54); cohort p < 0.001, 95% CI (9.2%, 23.9%)
 Descriptive Statistics: Randomization 55.1% to 61.1% (+6.0%, p = 0.015); Traditional 53.5% to 69.6% (+16.1%, p < 0.001); cohort p = 0.014, 95% CI (-2.1%, -18.1%)
 Graphical Representations: Randomization 55.8% to 64.4% (+8.6%, p < 0.001); Traditional 58.5% to 60.9% (+2.4%, p = 0.23); cohort p = 0.03, 95% CI (0.6%, 11.4%)
 Boxplots: Randomization 35.0% to 41.6% (+6.6%, p = 0.010); Traditional 32.4% to 34.1% (+1.6%, p = 0.55); cohort p = 0.18, 95% CI (-2.3%, 12.3%)
 Bivariate Data: Randomization 58.1% to 60.7% (+2.6%, p = 0.28); Traditional 56.4% to 64.8% (+8.4%, p = 0.005); cohort p = 0.12, 95% CI (-13.3%, 1.6%)

50 Dordt's before and after story: averages by topic (Randomization vs. Traditional cohorts)
 Probability: Randomization 31.9% to 56.5% (+24.5%, p < 0.001); Traditional 32.4% to 35.2% (+2.7%, p = 0.52); cohort p < 0.001, 95% CI (10.8%, 32.7%)
 Sampling Variability: Randomization 36.7% to 39.4% (+2.7%, p = 0.22); Traditional 38.7% to 43.5% (+4.8%, p = 0.11); cohort p = 0.57, 95% CI (-9.4%, 5.2%)
 Confidence Intervals: Randomization 37.9% to 51.8% (+13.9%, p < 0.001); Traditional 42.9% to 47.8% (+4.9%, p = 0.12); cohort p = 0.026, 95% CI (1.1%, 16.7%)
 Tests of Significance: Randomization 46.1% to 70.0% (+23.9%, p = 0.000); Traditional 50.0% to 60.6% (+10.6%, p < 0.001); cohort 95% CI (6.6%, 19.9%)

51 Transferability  Fall 2013 and Spring 2014  22 different instructor-semesters  17 different instructors  12 different institutions  N=725; pre-post on a 30-question ISI assessment (adapted from CAOS)  Many different instructional styles (traditional classroom, active learning pedagogy, computer lab, flipped classroom)  Many different institutions (high school, community college, large university, mid-sized university, small liberal arts college)

52 Transferability: overall  Similar findings to the authors' institutions; significantly better overall post-course performance

53 Transferability: by subscale (pretest, posttest, difference, paired t-test p-value)
 Overall: 48.7% to 57.8% (+9.1%, p < 0.001)
 Data Collection and Design: 64.7% to 67.2% (+2.4%, p = 0.03)
 Descriptive Statistics: 36.8% to 44.5% (+7.7%, p < 0.001)
 Graphical Representations: 50.9% to 59.0% (+8.1%, p < 0.001)
 Probability: 35.8% to 47.2% (+11.4%, p < 0.001)
 Sampling Variability: 20.9% to 24.8% (+4.0%, p = 0.001)
 Confidence Intervals: 52.7% to 64.2% (+11.5%, p < 0.001)
 Tests of Significance: 58.7% to 70.5% (+11.8%, p < 0.001)

54 2013-2014 data

55 2013-2014 data  Student and instructor variables; within-section clustering; etc. ('13-'14, '14-'15)

56 2014-2015 data  Student and instructor variables; within-section clustering; etc. ('13-'14, '14-'15)

57 Thinking about student ability levels  Better only for weak students? Only for strong students?

58 Pre- and post-course conceptual understanding stratified by ACT score (mean (SD))
 Low ACT: Consensus (n=21) 41.7 (10.2) to 46.3 (10.1), change 4.0 (11.7); Early-SBI (n=55) 42.7 (10.1) to 54.9 (11.9), change 12.2 (10.5)***; difference in curriculum means 8.2***
 Middle ACT: Consensus (n=34) 46.0 (8.2) to 52.4 (10.3), change 6.5 (9.2)***; Early-SBI (n=48) 43.4 (10.0) to 55.1 (10.8), change 11.2 (11.4)***; difference 4.7*
 High ACT: Consensus (n=36) 51.3 (7.7) to 57.1 (7.7), change 5.8 (9.2)**; Early-SBI (n=49) 47.8 (9.8) to 59.5 (12.0), change 11.8 (10.1)***; difference 6.0*
 Overall: Consensus (n=91) 46.4 (9.3) to 52.0 (11.0), change 5.6 (9.8); Early-SBI (n=152) 44.9 (10.1) to 56.5 (11.6), change 11.6 (10.7); difference 6.0***

59 Pre- and post-course conceptual understanding stratified by pre-course performance among SBI students in 2013-2014 (n=1078) (mean (SD))
 By pre-test concept score: Less preparation (n=291) 35.0 (5.0) to 49.6 (12.2), change 14.7 (12.4)***; Typical preparation (n=586) 50.9 (5.6) to 57.7 (12.4), change 6.8 (12.1)***; Higher preparation (n=201) 68.9 (6.4) to 73.3 (10.7), change 4.3 (9.6)***
 By self-reported college GPA: Weaker (n=193) 45.6 (12.3) to 52.9 (13.1), change 7.3 (11.8)***; Typical (n=654) 50.0 (12.0) to 58.1 (13.6), change 8.1 (12.6)***; Stronger (n=231) 53.8 (13.6) to 64.9 (14.7), change 11.1 (12.2)***
 Overall: 50.0 (12.6) to 58.6 (14.3), change 8.6 (12.5)***

60 Pre- and post-course conceptual understanding by subscale, for less-prepared (by pre-test) and/or weaker (by GPA) SBI students in 2013-2014 (mean (SD))
 Graphical Representations: by pre-test 30.5 (18.4) to 46.5 (20.9), change 15.8 (24.4)***; by GPA 44.0 (25.1) to 51.3 (25.0), change 6.5 (25.9)**
 Data Collection and Design: by pre-test 50.9 (21.2) to 56.6 (23.9), change 5.2 (31.6)**; by GPA 62.6 (23.5) to 60.4 (25.9), change -2.6 (33.4)
 Descriptive Statistics: by pre-test 17.2 (27.2) to 31.8 (35.1), change 14.9 (43.6)***; by GPA 31.3 (34.8) to 36.1 (33.4), change 4.4 (43.0)
 Tests of Significance: by pre-test 40.7 (13.6) to 57.5 (17.4), change 16.6 (21.9)***; by GPA 49.1 (17.0) to 60.0 (17.3), change 10.7 (21.4)***
 Confidence Intervals: by pre-test 33.4 (16.2) to 50.2 (22.7), change 17.1 (25.8)***; by GPA 40.0 (18.5) to 50.4 (23.9), change 10.7 (25.6)***
 Sampling Variability: by pre-test 28.0 (29.3) to 40.6 (34.8), change 12.6 (44.6)***; by GPA 44.8 (35.3) to 44.4 (39.2), change -0.0 (46.4)
 Probability/Simulation: by pre-test 19.9 (28.4) to 38.7 (34.9), change 18.2 (44.4)***; by GPA 29.8 (31.2) to 41.9 (36.4), change 10.9 (41.1)***

61 Discussion  What we know  Increasing interest in the approach  Including high school and the Common Core State Standards  The ISI versions of the curriculum (early, middle, and current) have demonstrated  Improved learning gains and retention in logic and scope of inference compared to the traditional curriculum at the same institutions  These results appear to translate reasonably well to other institutions  'Do no harm' in descriptive statistics and other areas  Preliminary evidence that the more SBI you do, the more beneficial the effect (analysis ongoing)

62 Discussion  What we don't know  Pedagogy? Content? Spiraling?  Conflated!  What you should 'take' and what you can 'leave'; student learning trajectories  Key instructor/institutional requirements for success  How the approach can be improved even further for greater success

63 Our plans…  Assessment initiative  Pre- and post-course concepts and attitudes; common exam questions  Goal: what works, what doesn't, comparisons by institution, instructor, style, etc.; individualized instructor reports to learn about your own students' outcomes  First edition of the ISI curriculum available via Wiley  http://math.hope.edu/isi (sample materials; applets)  Continued conversation  Blog and listserv re: teaching with SBI (www.causeweb.org/sbi)  Numerous articles and FAQ on the blog  Upcoming longer workshops  Philadelphia (June 4-5), Atlanta GA (July 7-9), JSM (July 30), Mathfest (August 5-6), AMATYC (Nov 17-20)

64 Q+A  www.causeweb.org/sbi  Getting started - pilots; on-site training, etc.  Convincing colleagues (in and out of the department)  Technology integration (applets; stat package)  Large class sizes  Why so much time on proportions and not on quantitative data?  What about bootstrapping?

65 Acknowledgments  Entire ISI team (Tintle, Chance, Cobb, Rossman, Roy, Swanson, and VanderStoep)  Funding: NSF (DUE-1140629 and DUE-1323210), Wiley, other funding agencies (HHMI; Teagle Foundation, etc.)

66 References
 Cobb, G. (2007). The Introductory Statistics Course: A Ptolemaic Curriculum? Technology Innovations in Statistics Education, 1(1), 1-15.
 delMas, R., Garfield, J., Ooms, A., & Chance, B. (2007). Assessing Students' Conceptual Understanding after a First Course in Statistics. Statistics Education Research Journal, 6(2), 28-58.
 Holcomb, J., Chance, B., Rossman, A., & Cobb, G. (2010a). Assessing Student Learning About Statistical Inference. Proceedings of the 8th International Conference on Teaching Statistics.
 Holcomb, J., Chance, B., Rossman, A., Tietjen, E., & Cobb, G. (2010b). Introducing Concepts of Statistical Inference via Randomization Tests. Proceedings of the 8th International Conference on Teaching Statistics.
 Tintle, N., Chance, B., Cobb, G., Rossman, A., Roy, S., Swanson, T., & VanderStoep, J. (2016). Introduction to Statistical Investigations. Hoboken, NJ: John Wiley and Sons.
 Tintle, N., VanderStoep, J., Holmes, V-L., Quisenberry, B., & Swanson, T. (2011). Development and assessment of a preliminary randomization-based introductory statistics curriculum. Journal of Statistics Education, 19(1).
 Tintle, N., Topliff, K., VanderStoep, J., Holmes, V-L., & Swanson, T. (2012). Retention of Statistical Concepts in a Preliminary Randomization-Based Introductory Statistics Curriculum. Statistics Education Research Journal, 11(1).

