Introduction to Basic Statistical Methods Part 1: “Statistics in a Nutshell” UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics.



2 Introduction to Basic Statistical Methods Part 1: “Statistics in a Nutshell” UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics ifischer@wisc.edu

3 UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics ifischer@wisc.edu All slides posted at http://www.stat.wisc.edu/~ifischer/Intro_Stat/UWHC

4 Right-click on image for full .pdf article. Links in article to access datasets.

5 “Statistical Inference” POPULATION: Women in the U.S. who have given birth

6 “Statistical Inference”
Study Question: Has the mean (i.e., average) of X = “Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. But what does that mean (at least in principle)?
[Figure: POPULATION, with Population Distribution of X]

7 “Statistical Inference”
Study Question: Has the mean (i.e., average) of X = “Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. Individual ages from the population tend to collect around a single center with a certain amount of spread, but occasional “outliers” are present in left and right symmetric tails. More precisely…
[Figure: POPULATION, with Population Distribution of X; tails extend ad infinitum…]

8 ~ The Normal Distribution ~
• symmetric about its mean
• unimodal (i.e., one peak), with left and right “tails”
• models many (but not all) naturally-occurring systems
• useful mathematical properties…
[Figure: bell curve labeled with μ = “population mean” and σ = “population standard deviation”]

9 ~ The Normal Distribution ~
• symmetric about its mean
• unimodal (i.e., one peak), with left and right “tails”
• models many (but not all) naturally-occurring systems
• useful mathematical properties…
[Figure: bell curve labeled with μ = “population mean” and σ = “population standard deviation”]

10 ~ The Normal Distribution ~
• symmetric about its mean
• unimodal (i.e., one peak), with left and right “tails”
• models many (but not all) naturally-occurring systems
• useful mathematical properties…
Approximately 95% of the population values are contained between μ – 2σ and μ + 2σ. 95% is called the confidence level. 5% is called the significance level.
[Figure: bell curve labeled with μ = “population mean” and σ = “population standard deviation”; 95% central area within ≈ 2σ of μ, 2.5% in each tail]
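The “≈ 2σ” rule on this slide can be checked numerically. The following is a minimal sketch in R (the language used later in these slides), relying only on the base functions pnorm and qnorm; the variable names are illustrative.

```r
# Proportion of a normal population lying within 2 standard deviations of its mean.
within_2_sd <- pnorm(2) - pnorm(-2)
within_2_sd          # roughly 0.954, i.e., "approximately 95%"

# Exact multiplier that captures 95%, leaving 2.5% in each tail.
z_95 <- qnorm(0.975)
z_95                 # roughly 1.96, which the slides round to 2
```

The result does not depend on the particular values of μ and σ, which is why the rule can be stated in units of σ.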

11 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. μ cannot be found with 100% certainty, but can be estimated with high confidence (e.g., 95%).
[Figure: POPULATION, with Population Distribution of X]

12 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
Sample: x1, x2, x3, x4, x5, … etc…, x400 → FORMULA → sample mean age
Do the data tend to support or refute the null hypothesis? Is the difference STATISTICALLY SIGNIFICANT, at the 5% level? T-test
[Figure: POPULATION, with Population Distribution of X]

13 ~ The Normal Distribution ~ Population Distribution (of ages) X → Samples, size n, … etc… → “Sampling Distribution” (of mean ages), via mathematical proof… Actually, this is a special case of…

14 Actually, this is a special case of… the CENTRAL LIMIT THEOREM: Population Distribution (of ages) X → Samples, size n, … etc… → “Sampling Distribution” (of mean ages) ~ The Normal Distribution ~ … as n gets larger

15 ~ The Normal Distribution ~ Population Distribution (of ages) X versus “Sampling Distribution” (of mean ages): The sample mean values have much less variability about μ than the population values!
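A small simulation can make the reduced variability of sample means concrete. The sketch below is illustrative only: the population values μ = 25.4 and σ = 1.6 are assumptions chosen to resemble the example in these slides, and the seed and number of repetitions are arbitrary.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
mu <- 25.4; sigma <- 1.6; n <- 400   # assumed population mean, SD, and sample size

# Draw 2000 samples of size n and record the mean of each one.
sample_means <- replicate(2000, mean(rnorm(n, mu, sigma)))

sd_of_means <- sd(sample_means)      # spread of the sample means
c(population_sd = sigma,
  sd_of_means   = sd_of_means,
  theoretical   = sigma / sqrt(n))   # sigma / sqrt(n) = 0.08
```

The simulated spread of the sample means should land close to σ/√n, far smaller than the spread σ of individual ages.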

16 ~ The Normal Distribution ~ Population Distribution (of ages): Approximately 95% of the population values are contained between μ – 2σ and μ + 2σ. “Sampling Distribution” (of mean ages): Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. [Figure: 95% central area within ≈ 2σ of μ, 2.5% in each tail]

17 Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. 2σ/√n is called the 95% margin of error. In principle… [Figure: sample means of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, etc…, plotted about μ]

18 Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. 2σ/√n is called the 95% margin of error. But from the samples’ point of view… [Figure: intervals around the means of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, compared with μ]

19 Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. 2σ/√n is called the 95% margin of error. But from the samples’ point of view… Approximately 95% of the intervals from x̄ – 2σ/√n to x̄ + 2σ/√n contain μ, and approx 5% do not. [Figure: intervals around the means of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, compared with μ]

20 ~ The Normal Distribution ~ Population Distribution (of ages): Approximately 95% of the population values are contained between μ – 2σ and μ + 2σ. “Sampling Distribution” (of mean ages): Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. Approximately 95% of the intervals from x̄ – 2σ/√n to x̄ + 2σ/√n contain μ, and approx 5% do not. [Figure: 95% central area within ≈ 2σ of μ, 2.5% in each tail]
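The claim that roughly 95% of such intervals contain μ can be explored by simulation. This is a sketch under stated assumptions, not part of the original slides: μ = 25.4 and σ = 1.6 are assumed population values, the multiplier 2 follows the slides' rounding, and the seed is arbitrary.

```r
set.seed(2)                          # arbitrary seed, for reproducibility only
mu <- 25.4; sigma <- 1.6; n <- 400   # assumed population mean, SD, and sample size
margin <- 2 * sigma / sqrt(n)        # 95% margin of error: 2 * 1.6 / 20 = 0.16

# For each simulated sample, record whether [xbar - margin, xbar + margin] contains mu.
covers <- replicate(2000, {
  xbar <- mean(rnorm(n, mu, sigma))
  (xbar - margin <= mu) && (mu <= xbar + margin)
})

coverage <- mean(covers)             # proportion of intervals that contain mu
coverage
```

Each simulated sample yields a different interval; μ stays fixed, and it is the intervals that succeed or fail at capturing it.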

21 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
SAMPLE, n = 400 ages: x1, x2, x3, x4, x5, … etc…, x400 → FORMULA → sample mean = 25.6
95% margin of error = 2σ/√n. Approximately 95% of the intervals from x̄ – 2σ/√n to x̄ + 2σ/√n contain μ, and approx 5% do not.
PROBLEM! σ is unknown the vast majority of the time!
[Figure: POPULATION]

22 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
SAMPLE, n = 400 ages: x1, x2, x3, x4, x5, … etc…, x400 → FORMULA → sample mean = 25.6
sample variance s² = Σ(xᵢ – x̄)² / (n – 1) = modified average of the squared deviations from the mean
sample standard deviation s = √(s²)
95% margin of error ≈ 2s/√n
[Figure: POPULATION]

23 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
SAMPLE, n = 400 ages: x1, x2, x3, x4, x5, … etc…, x400 → FORMULA → sample mean = 25.6
sample variance s², sample standard deviation s = 1.6
95% margin of error ≈ 2(1.6)/√400 = 0.16
[Figure: POPULATION]

24 Approximately 95% of the intervals from x̄ – (95% margin of error) to x̄ + (95% margin of error) contain μ, and approx 5% do not.

25 95% margin of error = 0.16. BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”). [Figure: interval from 25.44 to 25.76, i.e., 25.6 ± 0.16]
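The interval endpoints follow directly from the sample summary. A minimal sketch in R, using the values reported in the slides (sample mean 25.6, sample standard deviation 1.6, n = 400) and the slides' rounded multiplier of 2; the variable names are illustrative.

```r
xbar <- 25.6    # sample mean (from the slides)
s    <- 1.6     # sample standard deviation (from the slides)
n    <- 400     # sample size (from the slides)

margin <- 2 * s / sqrt(n)                    # 2 * 1.6 / 20 = 0.16
ci <- c(lower = xbar - margin, upper = xbar + margin)
ci                                           # 25.44 and 25.76
```

Using the exact T-score in place of 2 gives the slightly narrower interval 25.44273 to 25.75727 reported by t.test on slide 30.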

26 Two main ways to conduct a formal hypothesis test:
95% CONFIDENCE INTERVAL FOR µ: BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).
“P-VALUE” of our sample: Very informally, the p-value of a sample is the probability (hence a number between 0 and 1) that it “agrees” with the null hypothesis. Hence a very small p-value indicates strong evidence against the null hypothesis. The smaller the p-value, the stronger the evidence, and the more “statistically significant” the finding (e.g., p < .0001).

27 Two main ways to conduct a formal hypothesis test:
95% CONFIDENCE INTERVAL FOR µ: BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).
“P-VALUE” of our sample: Very informally, the p-value of a sample is the probability (hence a number between 0 and 1) that it “agrees” with the null hypothesis. IF H0 is true, then we would expect a random sample mean that is at least 0.2 years away from μ = 25.4 (as ours was) to occur with probability 1.28%.
FORMAL CONCLUSIONS:
• The 95% confidence interval corresponding to our sample mean does not contain the “null value” of the population mean, μ = 25.4 years.
• The p-value of our sample, .0128, is less than the predetermined α = .05 significance level.
Based on our sample data, we may (moderately) reject the null hypothesis H0: μ = 25.4 in favor of the two-sided alternative hypothesis HA: μ ≠ 25.4, at the α = .05 significance level.
INTERPRETATION: According to the results of this study, there exists a statistically significant difference between the mean ages at first birth in 2010 (25.4 years old) and today, at the 5% significance level. Moreover, the evidence from the sample data would suggest that the population mean age today is significantly higher than in 2010, rather than significantly lower.
However, one problem remains…
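The 1.28% figure can be reproduced from the sample summary alone. The sketch below assumes a one-sample, two-sided T-test, which matches the t.test call shown on slide 30; the variable names are illustrative.

```r
xbar <- 25.6; mu0 <- 25.4; s <- 1.6; n <- 400   # values from the slides

# Distance of the sample mean from the null value, in standard-error units.
t_stat <- (xbar - mu0) / (s / sqrt(n))          # 0.2 / 0.08 = 2.5

# Two-sided p-value: probability of a T-score at least this extreme if H0 is true.
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
c(t = t_stat, p = p_value)                      # p is about 0.0128
```

Because the p-value is below α = .05, this agrees with the confidence-interval approach: 25.4 lies outside the 95% interval.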

28 Population Distribution (of ages), Normal Distribution: Approximately 95% of the population values are contained between μ – 2σ and μ + 2σ. “Sampling Distribution” (mean ages), Normal Distribution: Approximately 95% of the sample mean values are contained between μ – 2σ/√n and μ + 2σ/√n. Approximately 95% of the intervals from x̄ – 2σ/√n to x̄ + 2σ/√n contain μ, and approx 5% do not. [Figure: 95% central area within ≈ 2σ of μ, 2.5% in each tail]

29 With the unknown σ replaced by the sample standard deviation s: Population Distribution (of ages), Normal Distribution: Approximately 95% of the population values are contained between μ – 2s and μ + 2s. “Sampling Distribution” (mean ages), T (≈ Normal Distribution …IF n is large, e.g., ≥ 30): Approximately 95% of the sample mean values are contained between μ – 2s/√n and μ + 2s/√n. Approximately 95% of the intervals from x̄ – 2s/√n to x̄ + 2s/√n contain μ, and approx 5% do not. Alas, this introduces “sampling variability.” [Figure: 95% central area, 2.5% in each tail]

30 Edited R code:

    # Generates a normally-distributed random sample of 400 age values,
    # rescaled so that the sample mean is 25.6 and the sample standard deviation is 1.6.
    y = rnorm(400, 0, 1)
    z = (y - mean(y)) / sd(y)
    x = 25.6 + 1.6*z
    sort(round(x, 1))
    #>   [1] 19.6 20.2 20.4 20.5 21.2 22.3 22.3 22.4 22.4 22.4 22.6 22.7 22.7 22.7 22.8
    #>  [16] 23.0 23.0 23.1 23.1 23.2 23.2 23.2 23.2 23.2 23.3 23.4 23.4 23.4 23.5 23.5
    #> etc...
    #> [391] 28.7 28.7 28.9 29.2 29.3 29.4 29.6 29.7 29.9 30.2

    # Calculates sample mean and standard deviation.
    c(mean(x), sd(x))
    #> [1] 25.6  1.6

    # One-sample T-test of H0: mu = 25.4.
    t.test(x, mu = 25.4)
    #>         One Sample t-test
    #> data:  x
    #> t = 2.5, df = 399, p-value = 0.01282
    #> alternative hypothesis: true mean is not equal to 25.4
    #> 95 percent confidence interval:
    #>  25.44273 25.75727
    #> sample estimates:
    #> mean of x
    #>      25.6

31 With the unknown σ replaced by the sample standard deviation s: Population Distribution (of ages), Normal Distribution: Approximately 95% of the population values are contained between μ – 2s and μ + 2s. “Sampling Distribution” (mean ages), T (≈ Normal Distribution …IF n is large, e.g., ≥ 30): Approximately 95% of the sample mean values are contained between μ – 2s/√n and μ + 2s/√n. Approximately 95% of the intervals from x̄ – 2s/√n to x̄ + 2s/√n contain μ, and approx 5% do not. But if n is small… [Figure: 95% central area, 2.5% in each tail]

32 If n is large, T-score ≈ 2. If n is small, T-score > 2. … the “T-score” increases (from ≈ 2 to a max of 12.706 for a 95% confidence level, reached at n = 2) as n decreases ⇒ larger margin of error ⇒ less power to reject, even if a genuine difference exists!
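The growth of the T-score as n shrinks can be tabulated with the base R function qt. This is a minimal sketch; the sample sizes listed are arbitrary illustrations, and the degrees of freedom are n – 1 as in a one-sample T-test.

```r
n <- c(2, 5, 10, 30, 100, 400)        # illustrative sample sizes
t_score <- qt(0.975, df = n - 1)      # two-sided 95% critical values

# T-score for each sample size: 12.706 at n = 2, approaching 1.96 for large n.
round(data.frame(n, t_score), 3)
```

The 95% margin of error is the T-score times s/√n, so a small sample is penalized twice: a larger multiplier and a larger standard error.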

33 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
Sample: x1, x2, x3, x4, x5, … etc…, x400 → FORMULA → sample mean age
Do the data tend to support or refute the null hypothesis? Is the difference STATISTICALLY SIGNIFICANT, at the 5% level? T-test
Two loose ends
[Figure: POPULATION]

34 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. Check?
T-test
Two loose ends: The reasonableness of the normality assumption is empirically verifiable, and in fact formally testable from the sample data. If violated (e.g., skewed) or inconclusive (e.g., small sample size), then “distribution-free” nonparametric tests can be used instead of the T-test. Examples: Sign Test, Wilcoxon Signed Rank Test. (The Mann-Whitney Test is equivalent to the Wilcoxon Rank Sum Test for two independent samples, not to the Signed Rank Test.)
[Figure: POPULATION]
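As a rough illustration of how the normality check and the nonparametric alternatives might look in R, the sketch below uses simulated data in place of real ages. The choice of the Shapiro-Wilk test as the formal normality test is an assumption (the slides do not name one), and the sign test is built here from binom.test rather than a dedicated function.

```r
set.seed(3)                               # arbitrary seed, for reproducibility only
x <- rnorm(400, mean = 25.6, sd = 1.6)    # simulated ages, for illustration only
mu0 <- 25.4                               # null value from the slides

# Formal test of normality (Shapiro-Wilk; an assumed choice of test).
normality <- shapiro.test(x)

# Wilcoxon signed rank test of whether the center of X equals mu0.
signed_rank <- wilcox.test(x, mu = mu0)

# Sign test: under H0, values fall above mu0 with probability 1/2.
sign_test <- binom.test(sum(x > mu0), sum(x != mu0), p = 0.5)

c(normality   = normality$p.value,
  signed_rank = signed_rank$p.value,
  sign_test   = sign_test$p.value)
```

Strictly speaking, these nonparametric tests address the median (or center of symmetry) rather than the mean; for a symmetric distribution the two coincide.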

35 “Statistical Inference” via… “Hypothesis Testing”
Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?
“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)
Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.
Sample: x1, x2, x3, x4, x5, … etc…, x400
T-test
Two loose ends: Sample size n partially depends on the power of the test, i.e., the desired probability of correctly rejecting a false null hypothesis (80% or more).
[Figure: POPULATION]

36 Introduction to Basic Statistical Methods Part 1: Statistics in a Nutshell UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics ifischer@wisc.edu Part 2: Overview of Biostatistics: “Which Test Do I Use??” Sincere thanks to… Judith Payne, Heidi Miller, Samantha Goodrich, Troy Lawrence, YOU!
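The link between sample size and power can be sketched with the base R function power.t.test. The inputs below are assumptions for illustration: a true difference of 0.2 years and a standard deviation of 1.6 are borrowed from the example in these slides, and 80% is the minimum power mentioned above.

```r
# Sample size needed to detect an assumed true difference of 0.2 years with 80% power.
needed <- power.t.test(delta = 0.2, sd = 1.6, sig.level = 0.05,
                       power = 0.80, type = "one.sample",
                       alternative = "two.sided")
n_needed <- ceiling(needed$n)      # round up to a whole number of subjects
n_needed

# Power actually achieved by n = 400 under the same assumptions.
achieved <- power.t.test(n = 400, delta = 0.2, sd = 1.6, sig.level = 0.05,
                         type = "one.sample", alternative = "two.sided")
achieved$power
```

Leaving exactly one of n, delta, sd, power, or sig.level unspecified tells power.t.test which quantity to solve for.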

