Thoughts on the Future of Statistics Teaching in the Light of Big Data. Louisiana State University, Stephenson Dept. of Entrepreneurship and Decision Sciences. Helmut Schneider, PhD, and Xuan Wang.
Overview: Hypothesis Testing; Causal Inference; Big Data. References: Judea Pearl, Causal Inference, http://bayes.cs.ucla.edu/home.htm; Miguel A. Hernan and James M. Robins, Causal Inference.
Hypothesis Testing: (1) Formulate a theory. (2) State the hypotheses: H0 versus H1. (3) Take a sample. (4) Compute statistics. (5) Make a decision. What is the reason for these steps?
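The steps above can be sketched with a one-sample, two-sided z-test. This is only an illustration: the sample values, the null mean `mu0`, the known standard deviation `sigma`, and the function name `one_sample_z_test` are all hypothetical and do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 versus H1: mu != mu0 (sigma assumed known)."""
    n = len(sample)
    # Compute the test statistic from the sample.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision: reject H0 when the p-value falls below alpha.
    return z, p_value, p_value < alpha


# Hypothetical sample (illustrative numbers only).
sample = [10.9, 9.8, 11.2, 10.4, 10.7, 9.9, 11.0, 10.6, 10.3]
z, p_value, reject = one_sample_z_test(sample, mu0=10.0, sigma=0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

The decision step only controls the chance of a false rejection at level `alpha`; it says nothing about whether the detected difference is large enough to matter.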
Problem Identification: Traditional Data Sources versus Big Data. Traditional data sources: small volume (low statistical power); limited variety (biased estimates); low velocity (estimates may not be valid in the future). Untapped sources: high volume (high statistical significance, small p-value); high variety (small bias); high velocity (dynamic update of estimates).
Statistical Significance versus Practical Significance Accounting faculty research… Auditors take samples…
Statistical Significance versus Practical Significance. Example: "Cancer Doctors Cite Risks of Drinking Alcohol." 12 million women and over a quarter of a million breast cancer cases. Relative risk increase of 9% versus a risk difference of 0.18 percentage points.
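To see how the two figures on this slide relate, the arithmetic can be sketched as follows. It assumes that the 9% is a relative increase in risk and that both figures describe the same exposed-versus-unexposed comparison; the implied baseline risk is derived here and is not stated on the slide.

```python
relative_increase = 0.09   # 9% relative increase in risk (from the slide)
risk_difference = 0.0018   # 0.18 percentage points (from the slide)

# Assumption: risk_difference = relative_increase * baseline_risk,
# i.e. both numbers refer to the same comparison.
baseline_risk = risk_difference / relative_increase
exposed_risk = baseline_risk + risk_difference
risk_ratio = exposed_risk / baseline_risk

# How many exposed people correspond to one additional case.
persons_per_extra_case = 1 / risk_difference

print(f"implied baseline risk:  {baseline_risk:.2%}")
print(f"implied exposed risk:   {exposed_risk:.2%}")
print(f"risk ratio:             {risk_ratio:.2f}")
print(f"persons per extra case: {persons_per_extra_case:.0f}")
```

Under that assumption the same finding can be reported as a 9% higher risk or as roughly one additional case per 556 people, which is why the relative and absolute measures leave such different impressions.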
Big Data Implications: With enough data, almost any nonzero effect becomes statistically significant. This is how the real world works. Implication for teaching statistics: students need to understand practical significance versus statistical significance.
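A minimal sketch of why volume drives significance: hold a tiny observed mean difference fixed and let only the sample size grow. The effect size, the standard deviation, and the sample sizes below are hypothetical, and the function name `two_sided_p_value` is chosen for this illustration.

```python
from math import erfc, sqrt


def two_sided_p_value(observed_diff, sigma, n):
    """Two-sided z-test p-value for a mean difference from zero (sigma assumed known)."""
    z = observed_diff / (sigma / sqrt(n))
    # erfc keeps precision in the far tail, where 1 - cdf would round to zero.
    return erfc(abs(z) / sqrt(2))


effect = 0.01   # hypothetical, practically negligible difference
sigma = 1.0     # hypothetical standard deviation

sample_sizes = (100, 10_000, 100_000, 1_000_000)
p_values = {n: two_sided_p_value(effect, sigma, n) for n in sample_sizes}

for n, p in p_values.items():
    print(f"n = {n:>9,}: p = {p:.3g}")
```

The effect is the same in every row; only the sample size changes, so the shrinking p-value reflects volume rather than practical importance.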
Causal Inference: Correlation is not causation. Statisticians deal only with correlations, yet they also teach students that spurious correlations exist. Myth: in Big Data, correlation is causation. Need for students to learn to judge causation.
Even in Big Data, correlation is not causation! Need for students to learn about causality.
When Can We Make Causal Claims? Randomized designs. Observational data: well-defined treatment, positivity, exchangeability.
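Of the conditions listed for observational data, positivity is the one that can be checked directly from the data: every covariate stratum should contain both treated and untreated units. A minimal sketch, using hypothetical records and a hypothetical helper named `positivity_violations`:

```python
from collections import defaultdict


def positivity_violations(records):
    """Return the covariate strata in which some treatment level never occurs.

    records: iterable of (stratum, treated) pairs, with treated in {0, 1}.
    """
    levels_seen = defaultdict(set)
    for stratum, treated in records:
        levels_seen[stratum].add(treated)
    return sorted(s for s, levels in levels_seen.items() if levels != {0, 1})


# Hypothetical observational data: (age group, treated?).
records = [
    ("young", 1), ("young", 0),
    ("middle", 1), ("middle", 0),
    ("old", 1), ("old", 1),      # no untreated units in this stratum
]
violations = positivity_violations(records)
print(violations)
```

Here the "old" stratum has no untreated units, so the data alone cannot say what would have happened to that group without treatment. Exchangeability, by contrast, cannot be verified from the data and has to be argued from subject-matter knowledge.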
Confounding: Directed Acyclic Graphs (DAGs). [Diagram with nodes: Treatment, Outcome, Confounder, Factor.] Need for students to learn about confounding and DAGs.
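A small simulation can make confounding concrete. The sketch below assumes the classic three-node DAG in which a binary confounder C affects both treatment T and outcome Y (C → T, C → Y, T → Y); the probabilities, the coefficients, and the true effect of 1.0 are hypothetical choices for illustration, not values from the slides.

```python
import random
from statistics import fmean

random.seed(0)
TRUE_EFFECT = 1.0   # hypothetical causal effect of T on Y

rows = []
for _ in range(100_000):
    c = random.random() < 0.5                      # confounder
    t = random.random() < (0.8 if c else 0.2)      # C -> T
    y = 2.0 * c + TRUE_EFFECT * t + random.gauss(0, 1)   # C -> Y and T -> Y
    rows.append((c, t, y))


def mean_y(rows, t, c=None):
    """Mean outcome among units with treatment t, optionally within stratum c."""
    return fmean(y for ci, ti, y in rows if ti == t and (c is None or ci == c))


# Naive comparison: ignores the confounder and mixes its effect into the estimate.
naive = mean_y(rows, True) - mean_y(rows, False)

# Adjusted comparison: stratify on C, then average the strata by P(C = c).
adjusted = 0.0
for c in (False, True):
    p_c = sum(1 for ci, _, _ in rows if ci == c) / len(rows)
    adjusted += p_c * (mean_y(rows, True, c) - mean_y(rows, False, c))

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

With 100,000 observations both estimates are highly significant, but only the adjusted one lands near the true effect. More data shrinks the standard error of the naive estimate without removing its bias.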
Statistical Significance versus Unbiased Estimates. [Diagram: Causality; Volume (statistical significance); Variety (unbiased estimates); Velocity (timely estimates).]
Causal Inference
Conclusions: Students need to learn the reasons for using hypothesis testing in today's Big Data environment. Students need to learn to judge practical significance versus statistical significance. Students need to learn about DAGs. Students need to learn about methods to establish causation.