Lecture 5, Goodness of Fit Test

Lecture 5, Goodness of Fit Test

Outline of today:
- Goodness of Fit Test for Discrete Distributions
- Example
- Goodness of Fit Test for Continuous Distributions
- Comparing Proportions

Goodness of Fit Test for Discrete Distributions

The Goodness of Fit Test is used to test whether a sample of observations comes from a given distribution. For this hypothesis, we usually use Pearson's Goodness of Fit Test statistic or Wilks's Likelihood Ratio Test statistic. A previous example, given in the last lecture, was to test whether the number of boys among the first 4 children follows a binomial distribution, which is a discrete distribution. Here is another example.

Example 1: The number of accidents, Y, at a certain intersection was recorded for n = 50 weeks. The results are:

Y            0    1    2    3 or more    Total
Frequency   32   12    6    0            50

Problem of interest: whether the number of accidents follows a Poisson distribution:

P(Y = k) = exp(-a) a^k / k!,  k = 0, 1, 2, ...,  where a is the parameter.

We have E(Y) = a, and

p0 = exp(-a),  p1 = a exp(-a),  p2 = (a^2/2) exp(-a),  p3+ = 1 - p0 - p1 - p2 = 1 - (1 + a + a^2/2) exp(-a).

Since a is unknown, we may estimate it by the sample mean:

a = Y_bar = total number of accidents / total weeks = (0*32 + 1*12 + 2*6 + 0)/50 = 24/50 = .48

Then

m0 = n p0 = 50*exp(-.48) = 30.95
m1 = n p1 = 50*.48*exp(-.48) = 14.85
m2 = n p2 = 50*(.48^2/2)*exp(-.48) = 3.56
m3+ = 50 - (m0 + m1 + m2) = .64 < 1

Y            0    1    2    3 or more    Total
Frequency   32   12    6    0            50

Since the last expected frequency is less than 1, we combine the last two categories to obtain:

Y                     0       1       2 or more    Total
Frequency            32      12       6            50
Expected Frequency   30.95   14.85    4.2          50

The Pearson's Goodness of Fit Test statistic is T = 1.354 with df = 3 - 1 - 1 = 1 (one degree of freedom is lost for estimating a). The 95% table value with 1 df is 3.841. Thus, H0 is not rejected; that is, the Poisson distribution seems to fit the data well.
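The arithmetic above can be reproduced with a short script. The following is a minimal sketch, assuming NumPy and SciPy are available; the variable names are illustrative and not part of the lecture.

```python
import numpy as np
from scipy.stats import chi2

counts = [32, 12, 6, 0]                      # observed frequencies for Y = 0, 1, 2, 3+
n = sum(counts)                              # 50 weeks
a_hat = (0*32 + 1*12 + 2*6 + 3*0) / n        # sample mean = 0.48

# Poisson cell probabilities, with the last two categories combined into "2 or more"
p0 = np.exp(-a_hat)
p1 = a_hat * np.exp(-a_hat)
p2_plus = 1 - p0 - p1
expected = n * np.array([p0, p1, p2_plus])   # about [30.95, 14.85, 4.21]
observed = np.array([32, 12, 6])

T = ((observed - expected) ** 2 / expected).sum()   # about 1.34 (1.354 above uses rounded expected counts)
df = len(observed) - 1 - 1                          # k - 1 - (one estimated parameter) = 1
print(T, df, chi2.ppf(0.95, df), T > chi2.ppf(0.95, df))   # critical value 3.841; do not reject H0
```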

Goodness of Fit Test for Continuous Distributions

The Goodness of Fit Test can also be used for continuous distributions:
Step 1: Partition the range of the possible values of the variable into several classes.
Step 2: Apply the Goodness of Fit Test to the grouped data.

Example 2: Test whether the following 25 observations come from a uniform distribution on the interval 0 to 10:

2.85 8.59 5.95 7.94 1.70 7.14 7.05 5.53 8.01 5.12
7.07 7.88 6.77 9.89 4.27 9.07 4.54 3.86 9.27 9.22
3.99 0.01 7.83 6.34 6.23

Step 1: Partition [0,10] into 5 intervals: [0,2), [2,4), [4,6), [6,8), [8,10]. Under H0, each interval has probability .2 and expected frequency 25*.2 = 5.

Interval         [0,2)   [2,4)   [4,6)   [6,8)   [8,10]
Observed Freq.     2       3       5       9       6
Expected Freq.     5       5       5       5       5

Step 2: Apply the Goodness of Fit Test. The Pearson's Goodness of Fit test statistic is T = (2-5)^2/5 + ... + (6-5)^2/5 = 6.0, with df = 5 - 1 = 4. The 95% table value with df = 4 is 9.488. Thus, H0 is not rejected; that is, the uniform distribution fits the data well.
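A minimal sketch of Example 2, again assuming NumPy and SciPy are available; scipy.stats.chisquare computes the same Pearson statistic as the hand calculation (names are illustrative).

```python
import numpy as np
from scipy.stats import chisquare, chi2

data = np.array([2.85, 8.59, 5.95, 7.94, 1.70, 7.14, 7.05, 5.53, 8.01, 5.12,
                 7.07, 7.88, 6.77, 9.89, 4.27, 9.07, 4.54, 3.86, 9.27, 9.22,
                 3.99, 0.01, 7.83, 6.34, 6.23])

edges = np.array([0, 2, 4, 6, 8, 10])            # 5 equal-width classes on [0, 10]
observed, _ = np.histogram(data, bins=edges)     # [2, 3, 5, 9, 6]
expected = np.full(5, len(data) / 5)             # 5 per class under H0

T, pvalue = chisquare(observed, expected)        # Pearson statistic, df = 5 - 1 = 4
print(T, pvalue, chi2.ppf(0.95, 4))              # 6.0, ~0.20, 9.488; do not reject H0
```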

Goodness of Fit Test for Continuous Distributions

Example 3: A testing institute designs tests for the selection of applicants for high-level industrial positions. A test designed by the institute was given to a sample of 300 executives, and the results were organized into the following frequency distribution:

Score            [0,50)   [50,60)   [60,70)   [70,80)   [80,90)   [90,100)   100 and over
Observed Freq.    0        24        64        120       73        19         0
Expected Freq.    14.25    33.36     63.60     77.58     63.60     33.36      14.25
Probability       .0475    .1112     .2120     .2586     .2120     .1112      .0475

Problem of interest: test H0: the data are distributed as N(75, 15^2), with mean 75 and standard deviation 15. Under H0, the probabilities and expected frequencies of the 7 intervals are computed and listed in the 4th and 3rd rows of the above table, respectively. For example, the 5th interval's probability and expected frequency are computed as

p5 = P(80 < Y < 90) = P((80-75)/15 < Z < (90-75)/15) = P(.333 < Z < 1.0) = .2120, where Z = (Y-75)/15 ~ N(0,1),
m5 = n*p5 = 300*.2120 = 63.60.

The Pearson's Goodness of Fit test statistic is T = 61.89 with df = 7 - 1 = 6. The 95% table value with 6 df is 12.592. Thus, H0 is rejected; that is, the data do not follow N(75, 15^2).
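A sketch of Example 3 under the same assumptions (SciPy available, illustrative names). The slide's probabilities come from a rounded z-table, so the script's statistic (about 61) differs slightly from the 61.89 above; the conclusion is the same.

```python
import numpy as np
from scipy.stats import norm, chi2

edges = np.array([-np.inf, 50, 60, 70, 80, 90, 100, np.inf])
observed = np.array([0, 24, 64, 120, 73, 19, 0])
n = observed.sum()                                    # 300 executives

# Cell probabilities under H0: N(75, 15^2)
probs = np.diff(norm.cdf(edges, loc=75, scale=15))    # close to the Probability row above
expected = n * probs

T = ((observed - expected) ** 2 / expected).sum()     # about 61
df = len(observed) - 1                                # 7 - 1 = 6 (mean and sd were specified, not estimated)
print(T, chi2.ppf(0.95, df))                          # T far exceeds 12.592, so H0 is rejected
```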

Remarks:
1. If the proposed distribution involves parameters that need to be estimated, the number of degrees of freedom needs to be adjusted: df = k - 1 - number of parameters estimated. For example, for a normal distribution with both mean and variance estimated from the data, we lose 2 df's.
2. The number of classes is somewhat arbitrary, usually between 5 and 12, but it is limited by the requirement that no expected frequency be too small (none less than 1, and preferably each at least about 5). For example, for a sample of size 25 with equal-probability classes, using about 5 classes keeps each expected frequency at 5.
3. Theoretically, it is better to choose the class boundaries so that the resulting classes have equal probability, i.e., following the equal-probability model. In practice, however, it may be more convenient to use "natural" boundaries, provided that none of the expected frequencies are too small. For exam marks, for example, natural intervals may be [0,50), [50,60), ..., [80,100] for F, D-, D, D+, ..., A+, etc.
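As an illustration of remark 3, equal-probability class boundaries can be obtained from quantiles of the hypothesized distribution. The sketch below uses a 7-class N(75, 15^2) setup purely as an example and assumes SciPy is available; it is not part of the lecture.

```python
import numpy as np
from scipy.stats import norm

k = 7                                              # desired number of classes
cuts = norm.ppf(np.arange(1, k) / k, loc=75, scale=15)
print(np.round(cuts, 1))                           # interior boundaries; each class then has probability 1/k
```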

Comparing Proportions

Binary Response Variable: Suppose we are interested in comparing two groups (populations) with respect to some binary response variable R, taking the values "Positive Response" and "Negative Response". The probability of a positive response is p1 in Group 1 and p2 in Group 2.

Explanatory Variable: The two groups usually correspond to the categories (levels, classes) of an explanatory variable, C, which is thought to affect the response R.

Problem of Interest: We are interested in analyzing the relationship between R and C by comparing p1 and p2: "p1 = p2" means "C has no effect on R", i.e., "R and C are independent".

Notation: Suppose two samples are drawn:
Sample 1, from Group 1, of size n1, with X1 positive responses;
Sample 2, from Group 2, of size n2, with X2 positive responses.

Estimation

Two-way Contingency Table: The data are usually presented in a 2x2 contingency table:

Response    Group 1    Group 2
Positive    X1         X2
Negative    n1 - X1    n2 - X2
Total       n1         n2

Distributions: Note that X1 and X2 are independent binomial random variables: X1 ~ Binom(n1, p1), X2 ~ Binom(n2, p2).

Estimation: The natural estimators for p1 and p2 are the sample proportions:

p1_hat = X1/n1,  p2_hat = X2/n2,

with E(p1_hat) = p1, Var(p1_hat) = p1(1-p1)/n1, and E(p2_hat) = p2, Var(p2_hat) = p2(1-p2)/n2.

Estimation

Population Proportion Difference: p1 - p2 can be estimated by the sample proportion difference p1_hat - p2_hat, with

E(p1_hat - p2_hat) = p1 - p2,
Var(p1_hat - p2_hat) = p1(1-p1)/n1 + p2(1-p2)/n2,
s.e.(p1_hat - p2_hat) = ( p1_hat(1-p1_hat)/n1 + p2_hat(1-p2_hat)/n2 )^(1/2).

Asymptotic Normality: When both n1 and n2 are large,

p1_hat - p2_hat ~ N( p1 - p2, p1(1-p1)/n1 + p2(1-p2)/n2 )  approximately.

Confidence Interval: This asymptotic normality can be used to test hypotheses and to construct approximate confidence intervals:

The 95% CI is (p1_hat - p2_hat) +/- 1.96 s.e.(p1_hat - p2_hat);
The 90% CI is (p1_hat - p2_hat) +/- 1.645 s.e.(p1_hat - p2_hat).

Example: The Vitamin C Data. The following table is based on a 1961 French study regarding the therapeutic value of ascorbic acid (Vitamin C). The study was double blind, with one group of 140 skiers receiving a placebo while a second group of 139 received 1 gram of ascorbic acid per day. Of interest is the relative occurrence of colds for the two groups:

           Placebo   Vitamin C
Cold        31        17
Not Cold   109       122
Total      140       139

For this data set, we use "Cold" and "Not Cold" as the "Positive Response" and "Negative Response" respectively, and "Placebo" and "Vitamin C" as "Group 1" and "Group 2" respectively. Then

p1_hat = X1/n1 = 31/140 = .2214,  p1_hat(1-p1_hat)/n1 = 1.2314e-3
p2_hat = X2/n2 = 17/139 = .1223,  p2_hat(1-p2_hat)/n2 = 7.7226e-4
p1_hat - p2_hat = .2214 - .1223 = .0991,  s.e.(p1_hat - p2_hat) = (1.2314e-3 + 7.7226e-4)^(1/2) = .04476

The 95% CI is .0991 +/- 1.96*.04476 = .0991 +/- .0877 = [.0114, .1868], which does not contain 0. This suggests that p1 > p2; that is, Vitamin C appears to reduce the proportion of "Cold" responses.
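A minimal sketch reproducing this calculation in plain Python (the names are illustrative):

```python
import math

x1, n1 = 31, 140    # placebo group: colds / total
x2, n2 = 17, 139    # vitamin C group: colds / total

p1_hat = x1 / n1                       # about 0.2214
p2_hat = x2 / n2                       # about 0.1223
diff = p1_hat - p2_hat                 # about 0.0991
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)   # about 0.0448

z = 1.96                               # for a 95% confidence interval
ci = (diff - z * se, diff + z * se)    # about (0.011, 0.187); excludes 0
print(round(diff, 4), round(se, 4), tuple(round(c, 4) for c in ci))
```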