Statistical Data Analysis 2011/2012 M. de Gunst Lecture 3.


Statistical Data Analysis: Introduction
Topics: Summarizing data; Exploring distributions (continued); Bootstrap (first part); Robust methods; Nonparametric tests; Analysis of categorical data; Multiple linear regression

Today's topics: Exploring distributions (Chapter 3: 3.5.2, ) and Bootstrap (Chapter 4: 4.1, 4.2)
Exploring distributions, 3.5 Tests for goodness of fit:
- Shapiro-Wilk test for the normal distribution (last week)
- Kolmogorov-Smirnov test for a general distribution
- Chi-square test for goodness of fit for a general distribution
Bootstrap:
- 4.1 Simulation (read yourself)
- 4.2 Bootstrap estimators for a distribution: parametric bootstrap estimators; empirical bootstrap estimators

Exploring distributions: reminder on testing
Ingredients of a test:
- Hypotheses H0 and H1
- Test statistic T
- Distribution of T under H0, and how it changes/shifts under H1
- Rule for when H0 is rejected: rejection rule based either on a critical region or on a p-value
How to perform a test:
1. Describe the above
2. Choose the significance level α
3. Calculate and report the value t of T
4. Report whether t lies in the critical region, or whether the p-value < α
5. Formulate the conclusion of the test: "H0 rejected" or "H0 not rejected"
6. If possible, translate the conclusion to the practical context
NB: when asked to perform a test, you have to do all 6 steps!
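The six steps above can be sketched in R; this is a minimal illustration using a one-sample t-test on simulated data (the data and the hypothesized mean are assumptions for illustration, not from the lecture).

```r
# Minimal sketch of the six testing steps with a one-sample t-test.
set.seed(1)
x <- rnorm(30, mean = 0.4, sd = 1)   # assumed illustrative data

# 1. Hypotheses H0: mu = 0 vs H1: mu != 0; test statistic T is the
#    t-statistic, which has a t(n-1) distribution under H0.
# 2. Choose the significance level:
alpha <- 0.05
# 3. Calculate and report the value t of T (and its p-value):
res <- t.test(x, mu = 0)
t_value <- unname(res$statistic)
p_value <- res$p.value
# 4. Rejection rule based on the p-value:
reject <- p_value < alpha
# 5./6. Conclusion, translated to words:
conclusion <- if (reject) "H0 rejected" else "H0 not rejected"
```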

Tests for goodness of fit: for (one) general distribution
Situation: independent realizations from an unknown distribution F; now H0: F = F0 for one specific distribution F0.
Which statistic gives information about the distribution F?

Kolmogorov-Smirnov test (1)
Independent realizations from an unknown distribution F; H0: F = F0.
Idea: use the empirical distribution function Fn.
This makes sense: for fixed x, n Fn(x) is a random variable with a binom(n, F(x)) distribution, so that for n → ∞, Fn(x) → F(x). Then also under H0, for n → ∞, Fn(x) → F0(x).
Base the test on the distance between Fn and F0.
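As a sketch, the empirical distribution function and its distance to a hypothesized F0 can be computed directly in R (the data and the choice F0 = N(0,1) are assumptions for illustration):

```r
# Empirical distribution function and sup-distance to F0 = N(0,1).
set.seed(2)
x  <- rnorm(50)        # assumed illustrative data
xs <- sort(x)
n  <- length(xs)
# The supremum of |Fn(x) - F0(x)| is attained at the data points;
# check both the upper and lower step of Fn there:
d_plus  <- max((1:n) / n - pnorm(xs))
d_minus <- max(pnorm(xs) - (0:(n - 1)) / n)
Dn <- max(d_plus, d_minus)
# The same value is reported by R's built-in test:
Dn_builtin <- unname(ks.test(x, pnorm)$statistic)
```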

Kolmogorov-Smirnov test (2)
Test statistic: Dn = sup_x |Fn(x) − F0(x)|.
Distribution of Dn under H0: the same for all continuous F0, because F0(Xi) has a uniform distribution on (0, 1) whenever Xi has continuous distribution F0. So Dn is distribution free over the class of continuous distribution functions, and the K-S test is a nonparametric test.
When is H0 rejected? For large values of Dn.
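The distribution-free property can be checked numerically via the probability integral transform: if X has continuous distribution F0, then F0(X) is uniform on (0, 1), so the value of Dn is unchanged when the data are transformed by F0 and tested against the uniform distribution. A small sketch (F0 = Exp(2) is an assumption for illustration):

```r
# Dn is invariant under the probability integral transform.
set.seed(3)
x  <- rexp(40, rate = 2)                       # data from F0 = Exp(2)
d1 <- unname(ks.test(x, pexp, rate = 2)$statistic)
u  <- pexp(x, rate = 2)                        # u_i = F0(x_i) ~ Uniform(0,1)
d2 <- unname(ks.test(u, punif)$statistic)
# d1 and d2 are identical, illustrating why the null distribution of Dn
# does not depend on the (continuous) F0.
```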

Kolmogorov-Smirnov test (3)
Test statistic: Dn. p-values from tables or a computer package.
Note: the standard K-S test with these p-values is not suitable for a composite H0. Then use an adjusted K-S test with adjusted p-values.
Example: for "H0: F is normal" the adjusted test statistic for the K-S test is D_adj = sup_x |Fn(x) − Φ((x − X̄)/S)|.
What is the difference? Additional stochasticity: the estimated parameters X̄ and S are random as well.

Kolmogorov-Smirnov test (4): example, data x
H0: F = N(0,1); H1: F ≠ N(0,1). Test statistic: Dn.
R:
> ks.test(x, pnorm)
        One-sample Kolmogorov-Smirnov test
data: x
D = , p-value =
alternative hypothesis: two-sided
H0 rejected?

Kolmogorov-Smirnov test (5): example, data y
H0: F is normal ← composite null hypothesis; H1: F is not normal.
R:
> ks.test(y, pnorm)
D = , p-value = 6.661e-16
Incorrect: this is a test for H0: F = N(0,1) against H1: F ≠ N(0,1).
> ks.test(y, pnorm, mean = mean(y), sd = sd(y))
D = , p-value =
Incorrect: this is a test for H0: F = N(mean(y), sd(y)²) against H1: F ≠ N(mean(y), sd(y)²); we have not used D_adj, so the p-value does not account for the estimated parameters. The p-value should be adjusted (next week).
> mean(y)
[1]
> sd(y)
[1]

Chi-square test for goodness of fit (1)
Independent realizations from an unknown distribution F; H0: F = F0.
Idea: use the empirical distribution in a different way: divide the real line into intervals I1, …, Ik and compare the number of observations in each interval with the expected number under H0.
Ni = number of observations in Ii
pi = probability of an observation in Ii under F0
Then n pi = expected number of observations in Ii under H0.
Test statistic: X² = Σi (Ni − n pi)² / (n pi)
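The test statistic can be computed by hand in R; a minimal sketch for H0: F = N(0,1) (the data and the breakpoints are assumptions for illustration):

```r
# Pearson's X^2 statistic computed from first principles.
set.seed(4)
x <- rnorm(200)                                # assumed illustrative data
breaks <- c(-Inf, -1, -0.5, 0, 0.5, 1, Inf)    # intervals I_1, ..., I_k
N  <- as.vector(table(cut(x, breaks)))         # observed counts N_i
p  <- diff(pnorm(breaks))                      # p_i under F0 = N(0,1)
np <- length(x) * p                            # expected counts n * p_i
X2 <- sum((N - np)^2 / np)
k  <- length(p)
p_value <- pchisq(X2, df = k - 1, lower.tail = FALSE)
```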

Chi-square test for goodness of fit (2)
Test statistic: X². Distribution of X² under H0: different for different F0, but for n → ∞ the distribution of X² under H0 is chi-square with k − 1 degrees of freedom, the same for all F0. So for large enough n, X² is (approximately) distribution free, and the chi-square test is a nonparametric test.
When is H0 rejected? For large values of X².

Chi-square test for goodness of fit (3)
Test statistic: X². How to choose the intervals I1, …, Ik?
How many? More is better, but not too many. Rule of thumb: at least 5 observations expected in each interval under H0.
Where? About the same number expected in each interval under H0.
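Equiprobable intervals under H0 can be obtained from the quantiles of F0; a sketch for F0 = N(4, 9) (the sample size and number of cells are assumptions chosen to satisfy the rule of thumb):

```r
# Equiprobable cells under H0 via quantiles of F0 = N(4, 9).
n <- 100
k <- min(8, floor(n / 5))     # at most n/5 cells, so each expects >= 5
breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = 4, sd = 3)
# breaks runs from -Inf to Inf, so each interval has probability 1/k
# under H0 and expected count n/k:
np <- n * diff(pnorm(breaks, mean = 4, sd = 3))
```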

Chi-square test for goodness of fit (4): example, data y
H0: F = N(4, 9); H1: F ≠ N(4, 9). Test statistic: X².
R:
> chisquare(y, pnorm, k = 8, lb = 0, ub = 16, mean = 4, sd = 3)
$chisquare
[1]
$pr
[1]
$N
(0,2] (2,4] (4,6] (6,8] (8,10] (10,12] (12,14] (14,16]
$np
[1]
# Expected numbers under H0 do not satisfy the rule of thumb.
Better: choose a suitable vector b of 'breaks':
> chisquare(y, pnorm, breaks = b, mean = 4, sd = 3)

Chi-square test for goodness of fit (5)
Test statistic: X², under H0: χ² with k − 1 degrees of freedom.
The standard chi-square test is not suitable for a composite H0. Then use an adjusted chi-square test with an adjusted chi-square distribution.
Example: for "H0: F is normal" the adjusted chi-square test statistic, with estimated parameters plugged into the cell probabilities, has under H0 a χ² distribution with k − m − 1 degrees of freedom (m = number of estimated parameters), but only for one specific type of estimators.

Recap
Exploring distributions, 3.5 Tests for goodness of fit:
- Kolmogorov-Smirnov test for a general distribution
- Chi-square test for goodness of fit for a general distribution

Bootstrap

Bootstrap: introduction (1)
Example. Data: 59 melting temperatures of beeswax (R object beewax). P = unknown true underlying distribution of the beeswax data.
Estimator of the location of P? Tn = (sample) mean.
Estimate of the location of P? tn = mean(beewax) = 63.589.
How accurate is the estimate? How good is the estimator? Distribution of Tn: broad or narrow?
Main question: how to estimate the unknown distribution of the estimator Tn. Notation: Q_P.
R:
> beewax
[1] ….
> mean(beewax)
[1] 63.589
> sd(beewax)
[1]
> var(beewax)
[1] 0.121

Bootstrap: introduction (2), continued
Simple case: assume P ~ N(μ, σ²); Tn = (sample) mean. What is the distribution Q_P of Tn?
We estimate: N(63.589, 0.121/59). How did we find this?
i) Estimator of P: N((sample) mean, (sample) variance)
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the mean of 59 independent observations from P
iv) Estimator of Q_P: N((sample) mean, (sample) variance/59)
v) Estimate of Q_P: N(63.589, 0.121/59)
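Steps i)-v) can be carried out in R. Since the beewax data set is local to the course, a simulated stand-in with the same summary statistics is used here (an assumption for illustration):

```r
# Normal-theory estimate of Q_P for the sample mean.
set.seed(5)
beewax <- rnorm(59, mean = 63.589, sd = sqrt(0.121))  # stand-in data
n <- length(beewax)
m <- mean(beewax)       # i)-ii): estimate of mu
v <- var(beewax)        # i)-ii): estimate of sigma^2
# iii)-v): under normality, Q_P = N(mu, sigma^2 / n); its estimate is
qp_estimate <- c(mean = m, variance = v / n)
```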

Bootstrap: introduction (3), continued
Other case: assume P ~ N(μ, σ²), but now Tn = (sample) median. What is the distribution Q_P of Tn? How to proceed now?
i) Estimator of P: N((sample) mean, (sample) variance)
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the median of 59 independent observations from P
iv) Estimator of Q_P: ?
v) Estimate of Q_P: ?
This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of 59 independent observations from the unknown P.

Bootstrap: introduction (4), continued
Yet another case: no assumption about P; Tn = (sample) mean. What is the distribution Q_P of Tn? How to proceed now?
i) Estimator of P: ?
ii) Estimate: ?
iii) Q_P is the distribution of the mean of 59 independent observations from P
iv) Estimator of Q_P: ?
v) Estimate of Q_P: ?
This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of 59 independent observations from the unknown P.

Bootstrap estimators for a distribution
This is what the bootstrap is about: estimate the distribution Q_P of a function Tn of n independent observations from the unknown P.
Situation: realizations x1, …, xn of independent random variables with unknown distribution P.
Goal: estimate the distribution Q_P of the estimator Tn.
Cases:
1. Assume P is some parametric distribution with unknown parameters.
2. Assume nothing about P.

Bootstrap estimators for a distribution; case 1: parametric bootstrap estimator (1)
Example (beeswax; case 1). Assume P ~ N(μ, σ²); Tn = (sample) median. What is the distribution Q_P of Tn? How to proceed?
i) Estimator of P: N(X̄, S²) = P̂
ii) Estimate: N(63.589, 0.121)
iii) Q_P is the distribution of the median of 59 independent observations from P
iv) Estimator of Q_P: distribution of the median of 59 independent observations from N(X̄, S²) = P̂
v) Estimate of Q_P: distribution of the median of 59 independent observations from N(63.589, 0.121)
Which distribution is this? Unknown: use the computer to generate realizations from the estimate of Q_P.
The empirical distribution of the generated set is the parametric bootstrap estimate of Q_P.

Bootstrap estimators for a distribution; case 1: parametric bootstrap estimator (2)
(Continued; case 1.) How to generate realizations from the estimate of Q_P, i.e. from the distribution of the median of 59 independent observations from N(63.589, 0.121)?
# 1. Generate one bootstrap sample:
> xstar = rnorm(59, 63.589, sqrt(0.121))
# Check:
> xstar
[1] …..
# Note: xstar is of the same length as beewax.
# 2. Now compute one bootstrap value tstar from xstar:
> tstar = median(xstar)
> tstar
[1]
# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of Q_P.

Bootstrap estimators for a distribution; case 1: parametric bootstrap estimator (3)
(Continued; case 1.) The B values tstar are generated realizations from the estimate of Q_P, i.e. from the distribution of the median of 59 independent observations from N(63.589, 0.121).
Recall: the empirical distribution of the generated set is the parametric bootstrap estimate of Q_P.
Also: the sample variance of the generated set is the parametric bootstrap estimate of the variance of Q_P.
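Steps 1-3 can be written as one loop; a minimal sketch of the full parametric bootstrap for the median (B = 1000 is an arbitrary choice):

```r
# Parametric bootstrap for the sample median.
set.seed(6)
B <- 1000
tstar <- numeric(B)
for (b in 1:B) {
  xstar    <- rnorm(59, 63.589, sqrt(0.121))  # sample from estimated P
  tstar[b] <- median(xstar)                   # bootstrap value of T_n
}
# The empirical distribution of tstar is the parametric bootstrap
# estimate of Q_P; its sample variance estimates the variance of Q_P:
boot_var <- var(tstar)
```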

Bootstrap estimators for a distribution; case 2: empirical bootstrap estimator (1)
Example (beeswax; case 2). Assume nothing about P; Tn = (sample) mean. What is the distribution Q_P of Tn? How to proceed?
i) Estimator of P: empirical distribution of the data = P̂n
ii) Estimate: empirical distribution of the beewax data
iii) Q_P is the distribution of the mean of 59 independent observations from P
iv) Estimator of Q_P: distribution of the mean of 59 independent observations from the empirical distribution of the data
v) Estimate of Q_P: distribution of the mean of 59 independent observations from the empirical distribution of the beewax data
Which distribution is this? Unknown: use the computer to generate realizations from the estimate of Q_P.
The empirical distribution of the generated set is the empirical bootstrap estimate of Q_P.

Bootstrap estimators for a distribution; case 2: empirical bootstrap estimator (2)
(Continued; case 2.) How to generate realizations from the estimate of Q_P, i.e. from the distribution of the mean of 59 independent observations from the empirical distribution of the beewax data?
# 1. Generate one bootstrap sample:
> xstar = sample(beewax, replace = TRUE)
# Check:
> xstar
[1] …..
# Note: xstar is of the same length as beewax and consists of values sampled from the set of beewax values.
# 2. Now compute one bootstrap value tstar from xstar:
> tstar = mean(xstar)
> tstar
[1]
# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of Q_P.

Bootstrap estimators for a distribution; case 2: empirical bootstrap estimator (3)
(Continued; case 2.) The B values tstar are generated realizations from the estimate of Q_P, i.e. from the distribution of the mean of 59 independent observations from the empirical distribution of the beewax data.
Recall: the empirical distribution of this generated set is the empirical bootstrap estimate of Q_P.
Also: the sample variance of this generated set is the empirical bootstrap estimate of the variance of Q_P.
Note: this value is comparable to the value 0.121/59 ≈ 0.0021 of the estimate of the variance of Q_P under the normality assumption for P.
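The comparison with the normal-theory value can be sketched directly; since the beewax data set is local to the course, a simulated stand-in is again used (an assumption for illustration):

```r
# Empirical bootstrap variance of the mean vs the normal-theory s^2/n.
set.seed(7)
beewax <- rnorm(59, 63.589, sqrt(0.121))   # stand-in for the data
B <- 2000
tstar <- replicate(B, mean(sample(beewax, replace = TRUE)))
comparison <- c(bootstrap     = var(tstar),
                normal_theory = var(beewax) / length(beewax))
# The two estimates of the variance of Q_P are of comparable size.
```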

Empirical bootstrap with R
# Can be done in one go with the local R function bootstrap:
> bootstrap = function(x, statistic, B = 100, ...) {
    # returns a vector of B bootstrap values of a real-valued statistic;
    # statistic(x) should be an R function; further arguments of
    # statistic can be passed via ...
    # resampling is done from the empirical distribution of x
    y <- numeric(B)
    for (j in 1:B)
      y[j] <- statistic(sample(x, replace = TRUE), ...)
    y
  }
# Compute 1000 bootstrap values tstar:
> tstarvector = bootstrap(beewax, mean, B = 1000)

Bootstrap: two errors
Recall the goal: estimate the distribution Q_P of a function Tn of n independent observations from the unknown P.
Note: in the bootstrap estimation procedure two types of "errors" are made. Which ones? Given the data:
- First error: the estimate of Q_P is the distribution of Tn of n independent observations from the estimate of P (rather than from P itself).
- Second error: this distribution is in turn estimated by the empirical distribution of the computer-generated realizations from it.
How can we make these errors small? The size of the first error depends on the quality of the estimator of P; the size of the second error can be made small by taking B large.

Recap
Bootstrap:
- 4.1 Simulation (read yourself)
- 4.2 Bootstrap estimators for a distribution: parametric bootstrap estimators; empirical bootstrap estimators

Exploring distributions / Bootstrap: The end