Lecture 5, Goodness of Fit Test
Outline of today:
- Goodness of Fit Test for Discrete Distributions
- Example
- Goodness of Fit Test for Continuous Distributions
- Comparing Proportions
12/3/2018 SA3202, Lecture 5
Goodness of Fit Test for Discrete Distributions
A Goodness of Fit Test checks whether a sample of observations comes from a given distribution. For this hypothesis, we usually use Pearson’s Goodness of Fit Test statistic or Wilks’ Likelihood Ratio Test statistic. An example from the last lecture was testing whether the number of boys among the first 4 children follows a binomial distribution, which is a discrete distribution. Here is another example.

Example 1 The number of accidents, Y, at a certain intersection was recorded for n=50 weeks. The results are:

Y           0    1    2    3 or more   Total
Frequency   32   12   6    0           50
Problem of interest: does the number of accidents follow a Poisson distribution?
P(Y=k) = exp(-a) a^k/k!, k=0,1,2,..., where a is the parameter.
We have E(Y)=a and
p0 = exp(-a),
p1 = a exp(-a),
p2 = (a^2/2) exp(-a),
p3+ = 1-p0-p1-p2 = 1-(1+a+a^2/2) exp(-a).
Since a is unknown, we estimate it by the sample mean:
a = Y_bar = total number of accidents / total weeks = (0*32+1*12+2*6+0)/50 = 24/50 = .48
Then
m0 = n p0 = 50*exp(-.48) = 30.94
m1 = n p1 = 50*.48*exp(-.48) = 14.85
m2 = n p2 = 50*(.48^2/2)*exp(-.48) = 3.56
m3+ = 50-(m0+m1+m2) = .65 < 1

Y           0    1    2    3 or more   Total
Frequency   32   12   6    0           50
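The calculation above can be reproduced with a short Python sketch. The function name poisson_expected is illustrative and not part of the lecture; the frequencies 32, 12, 6, 0 are those appearing in the sample-mean formula above.

```python
import math

# Observed weekly accident counts for Y = 0, 1, 2 and "3 or more".
# The last class is scored as 3, which is harmless here because its frequency is 0.
observed = {0: 32, 1: 12, 2: 6, 3: 0}


def poisson_expected(observed):
    """Return (a_hat, expected), where expected lists m0, m1, m2, m3+."""
    n = sum(observed.values())
    a_hat = sum(k * f for k, f in observed.items()) / n  # sample mean
    # Poisson probabilities for Y = 0, 1, 2
    probs = [math.exp(-a_hat) * a_hat**k / math.factorial(k) for k in range(3)]
    probs.append(1.0 - sum(probs))  # P(Y >= 3) by complement
    return a_hat, [n * p for p in probs]


a_hat, expected = poisson_expected(observed)
print(round(a_hat, 2), [round(m, 2) for m in expected])
```

The last expected frequency comes out below 1, which is what motivates combining classes before the test is applied.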
Since the last expected frequency is less than 1, we combine the last two categories to obtain:

Y                    0       1       2 or more   Total
Frequency            32      12      6           50
Expected Frequency   30.94   14.85   4.21        50

Pearson’s Goodness of Fit Test statistic is T = sum of (observed-expected)^2/expected, which is approximately 1.34 for the table above, with df=3-1-1=1 (three classes, one estimated parameter). The 95% table value with 1 df is 3.84. Since T < 3.84, H0 is not rejected. That is, the data are consistent with a Poisson distribution.
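A minimal Python sketch of this test follows. The helper name pearson_statistic is illustrative; the expected frequencies are recomputed from a = .48 with the last two classes combined. The p-value line relies on the fact that a chi-square variable with 1 df is the square of a standard normal, so P(chi2 > t) = erfc(sqrt(t/2)); the 95% table value 3.84 is hard-coded.

```python
import math


def pearson_statistic(observed, expected):
    """Pearson's T = sum (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n, a_hat = 50, 0.48
p0 = math.exp(-a_hat)
p1 = a_hat * p0
# Classes after combining: Y = 0, Y = 1, Y >= 2
expected = [n * p0, n * p1, n * (1 - p0 - p1)]
observed = [32, 12, 6]

T = pearson_statistic(observed, expected)
df = len(observed) - 1 - 1             # 3 classes, 1 estimated parameter
p_value = math.erfc(math.sqrt(T / 2))  # valid for df = 1 only
reject = T > 3.84                      # 95% chi-square table value, 1 df
print(round(T, 2), df, round(p_value, 3), reject)
```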
Goodness of Fit Test for Continuous Distributions
The Goodness of Fit Test can also be used for continuous distributions:
Step 1: Partition the range of possible values of the variable into several classes.
Step 2: Apply the Goodness of Fit Test to the grouped data.

Example 2 Test whether a sample of 25 observations comes from a uniform distribution on the interval 0 to 10.
Step 1: Partition [0,10] into 5 intervals [0,2), [2,4), [4,6), [6,8), [8,10]. Under H0, each interval has probability .2 and expected frequency 25*.2=5.

Interval        [0,2)   [2,4)   [4,6)   [6,8)   [8,10]
Observed Freq   2       ...     ...     ...     6
Expected Freq   5       5       5       5       5

Step 2: Apply the Goodness of Fit Test. Pearson’s Goodness of Fit Test statistic divides each squared difference by the expected frequency:
T = (2-5)^2/5 + ... + (6-5)^2/5 = 6.0, df=5-1=4.
The 95% table value with df=4 is 9.49. Since T < 9.49, H0 is not rejected. That is, the data are consistent with the uniform distribution.
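The two steps can be sketched in Python. The raw observations are not available here, so the grouped counts below are hypothetical: only the first count (2), the last count (6) and T = 6.0 come from the calculation above, and the three middle counts were picked merely to be consistent with them. The names bin_counts and pearson_statistic are illustrative.

```python
def bin_counts(data, lo, hi, k):
    """Step 1: count observations in k equal-width classes on [lo, hi]."""
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        i = min(int((x - lo) / width), k - 1)  # x == hi goes to the last class
        counts[i] += 1
    return counts


def pearson_statistic(observed, expected):
    """Step 2: Pearson's T = sum (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Tiny made-up sample, only to show how the binning treats boundary values.
demo = bin_counts([0.5, 2.0, 9.9, 10.0], 0, 10, 5)

# Hypothetical grouped data (NOT the lecture's sample); see the note above.
observed = [2, 9, 3, 5, 6]
n, k = sum(observed), len(observed)
expected = [n / k] * k   # uniform H0: each class has probability 1/k

T = pearson_statistic(observed, expected)
df = k - 1               # no parameters estimated
reject = T > 9.49        # 95% chi-square table value, 4 df
print(demo, round(T, 2), df, reject)
```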
Goodness of Fit Test for Continuous Distributions
Example 3 A Testing Institute designs tests for selecting applicants for high-level industrial positions. A test designed by the institute was given to a sample of 300 executives, and the results were organized into a frequency distribution over the following score classes:

Score: [0,50), [50,60), [60,70), [70,80), [80,90), [90,100), 100 and over
Rows of the table: Observed Freq, Expected Freq, Probability

Problem of interest: test H0: the data are distributed as N(75,15^2), with mean 75 and standard deviation 15.
Under H0, the probabilities and expected frequencies of the 7 intervals are computed and listed in the fourth and third rows of the table, respectively. For example, with Z=(Y-75)/15 ~ N(0,1), the 5th interval probability and expected frequency are
p5 = P(80<Y<90) = P((80-75)/15<Z<(90-75)/15) = P(.33<Z<1.0) = .2120 (from the normal table, with .333 rounded to .33),
m5 = n*p5 = 300*.2120 = 63.60.
Pearson’s Goodness of Fit Test statistic is T=61.89, df=7-1=6. The 95% table value with 6 df is 12.59. Since T > 12.59, H0 is rejected. That is, the data do not follow N(75,15^2).
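The class probabilities under H0 can be checked with a short Python sketch using only the standard library. Two assumptions are made here: the first class is treated as Y < 50 and the last as Y >= 100, so that the seven probabilities sum to 1, and exact normal probabilities are used rather than table values, so p5 comes out near .2108 rather than the table-based .2120. The observed frequencies are not available, so T itself is not recomputed.

```python
from statistics import NormalDist

n = 300
h0 = NormalDist(mu=75, sigma=15)

# Interior class boundaries; the two outer classes are open-ended (assumption).
cuts = [50, 60, 70, 80, 90, 100]
cdf = [0.0] + [h0.cdf(c) for c in cuts] + [1.0]

# Probability of each class = difference of consecutive CDF values
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

labels = ["<50", "50-60", "60-70", "70-80", "80-90", "90-100", ">=100"]
for label, p, m in zip(labels, probs, expected):
    print(f"{label:>7}  p={p:.4f}  m={m:.2f}")
```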
Remarks:
- If the proposed distribution involves parameters that need to be estimated, the number of degrees of freedom needs to be adjusted: df = k-1-(number of parameters estimated). E.g., for a normal distribution with both mean and variance estimated from the data, we lose 2 df’s.
- The number of classes is somewhat arbitrary, but it is limited by the rule “no expected frequency is less than 1”. For example, for a sample of size 25, the number of classes should be less than 5.
- Theoretically, it is better to choose the class boundaries so that the resulting classes have equal probability, i.e. following the equi-probability model. In practice, however, it may be convenient to use more “natural” boundaries, provided that none of the expected frequencies are too small. For example, for exam marks, natural intervals may be [0,50), [50,60), ..., [80,100] for F, D-, D, D+, ..., A+ etc.
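As an illustration of the equi-probability idea, the following Python sketch computes boundaries that split a hypothesized normal distribution into k classes of equal probability, together with the adjusted degrees of freedom. The function names, and the choice of N(75, 15^2) with k = 4, are illustrative assumptions rather than part of the remarks above.

```python
from statistics import NormalDist


def equiprobable_cuts(dist, k):
    """Interior boundaries splitting dist into k classes of probability 1/k."""
    return [dist.inv_cdf(i / k) for i in range(1, k)]


def adjusted_df(k, n_estimated):
    """df = k - 1 - (number of parameters estimated from the data)."""
    return k - 1 - n_estimated


cuts = equiprobable_cuts(NormalDist(75, 15), 4)
print([round(c, 2) for c in cuts])
print(adjusted_df(7, 2))  # e.g. 7 classes, mean and variance both estimated
```

With equal-probability classes every expected frequency equals n/k, so the "no expected frequency too small" rule reduces to a bound on k.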
Comparing the Proportions
Binary Response Variable
Suppose we are interested in comparing two groups (populations) with respect to some binary response variable R with values “Positive Response” and “Negative Response”. Let p1 and p2 denote the probabilities of a positive response in Group 1 and Group 2, respectively.

Explanatory Variable
The two groups usually correspond to the categories (levels, classes) of an explanatory variable, C, which is thought to affect the response R.

Problem of Interest
We are interested in analyzing the relationship between R and C by comparing p1 and p2: “p1=p2” means “C has no effect on R”, which means “R and C are uncorrelated or independent”.

Notation
Suppose two samples are drawn:
Sample 1, from Group 1, size n1, X1 positive responses
Sample 2, from Group 2, size n2, X2 positive responses
Estimation
Two-way Contingency Table
The data are usually presented in a 2x2 contingency table:

Response   Group 1   Group 2
Positive   X1        X2
Negative   n1-X1     n2-X2
Total      n1        n2

Distributions
Note that X1 and X2 are independent Binomial random variables: X1~Binom(n1, p1), X2~Binom(n2, p2).

Estimation
The natural estimators for p1 and p2 are the sample proportions:
p1_hat = X1/n1, p2_hat = X2/n2,
with
E(p1_hat)=p1, Var(p1_hat)=p1(1-p1)/n1,
E(p2_hat)=p2, Var(p2_hat)=p2(1-p2)/n2.
Estimation
Population Proportion Difference
p1-p2 can be estimated by the sample proportion difference p1_hat-p2_hat, with
E(p1_hat-p2_hat) = p1-p2,
Var(p1_hat-p2_hat) = p1(1-p1)/n1 + p2(1-p2)/n2,
s.e.(p1_hat-p2_hat) = sqrt( p1_hat(1-p1_hat)/n1 + p2_hat(1-p2_hat)/n2 ).
Asymptotic Normality
When both n1 and n2 are large, approximately
p1_hat-p2_hat ~ N(p1-p2, p1(1-p1)/n1 + p2(1-p2)/n2).

Confidence Interval
This asymptotic normality can be used for hypothesis testing and to construct approximate confidence intervals:
The 95% CI is (p1_hat-p2_hat) +/- 1.96 s.e.(p1_hat-p2_hat)
The 90% CI is (p1_hat-p2_hat) +/- 1.645 s.e.(p1_hat-p2_hat)
Example
The Vitamin C Data: The following table is based on a 1961 French study of the therapeutic value of ascorbic acid (Vitamin C). The study was double blind, with one group of 140 skiers receiving a placebo while a second group of 139 received 1 gram of ascorbic acid per day. Of interest is the relative occurrence of colds in the two groups.

           Placebo   Vitamin C
Cold       31        17
Not Cold   109       122
Total      140       139
For this data set, we take “Cold” and “Not Cold” as “Positive Response” and “Negative Response” respectively, and “Placebo” and “Vitamin C” as “Group 1” and “Group 2” respectively. Then
p1_hat = X1/n1 = 31/140 = .2214, p1_hat(1-p1_hat)/n1 = 1.2314e-3
p2_hat = X2/n2 = 17/139 = .1223, p2_hat(1-p2_hat)/n2 = 7.7226e-4
p1_hat-p2_hat = .2214-.1223 = .0991, s.e.(p1_hat-p2_hat) = (1.2314e-3+7.7226e-4)^(1/2) = .04476
The 95% CI = .0991 +/- 1.96*.04476 = .0991 +/- .0877 = [.0114, .1869], which does not contain 0. This suggests that p1>p2 at the 5% significance level: Vitamin C appears to reduce the proportion of the “Cold” response.
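The interval can be reproduced with a short Python sketch. The function name diff_proportion_ci is illustrative; the normal quantiles 1.96 and 1.645 correspond to the 95% and 90% confidence levels.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Approximate CI for p1 - p2 based on asymptotic normality."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    # Standard error with the sample proportions plugged in
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return diff, se, (diff - z * se, diff + z * se)


# Vitamin C data: 31 colds among 140 placebo skiers, 17 among 139 on Vitamin C.
diff, se, ci95 = diff_proportion_ci(31, 140, 17, 139)
_, _, ci90 = diff_proportion_ci(31, 140, 17, 139, z=1.645)
print(round(diff, 4), round(se, 5))
print([round(b, 4) for b in ci95], [round(b, 4) for b in ci90])
```

The 90% interval is narrower than the 95% interval and also excludes 0.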