Contingency tables and goodness of fit


Chapter 13: Contingency tables and goodness of fit

Part 1: Contingency tables

We'll begin with a simple example and then extend it to create a very general test.

Example

Xi ~ i.i.d. BER(p); H0: p = p0; H1: p ≠ p0.

Derive an asymptotic test and show that it has the form: sum of (observed − expected)^2 / expected.
Identify the distribution of the test statistic (under the null).
Find the p-value with the following data (p0 = 1/2):

Observed (expected):
Successes: 15 (20)
Failures:  25 (20)

Outcome of test statistic = 5/2. P-value = 1 - pchisq(5/2, df = 1) = 0.11.
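The arithmetic above can be checked in a few lines. This is a minimal Python sketch; `chisq_sf` is a hand-rolled stand-in for the slides' R call `1 - pchisq(x, df)` (in practice one would just use `pchisq` or `scipy.stats.chi2.sf`).

```python
import math

def chisq_sf(x, df):
    # Survival function P(X > x) for X ~ chi-square(df), computed from
    # the series for the regularized lower incomplete gamma function.
    # A stand-in for R's 1 - pchisq(x, df); adequate for the small x here.
    a, s = df / 2.0, x / 2.0
    term = math.exp(-s + a * math.log(s) - math.lgamma(a + 1))
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= s / (a + n)
        total += term
    return 1.0 - total

observed = [15, 25]   # successes, failures (n = 40)
expected = [20, 20]   # n * p0 and n * (1 - p0) with p0 = 1/2
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chisq_sf(stat, df=1)   # about 0.11
```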

Example 13.4.1

A 6-sided die is rolled 60 times, and we want to test whether it is fair.
Find the test statistic and its distribution under the null. Discuss the term 'degrees of freedom.' Compute the p-value.

Number:              1       2       3      4       5       6
Observed (expected): 8 (10)  11 (10) 5 (10) 12 (10) 15 (10) 9 (10)

Outcome of test statistic = 6. P-value = 1 - pchisq(6, df = 5) = 0.31.
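The same computation for the die, sketched in Python (statistic only; the p-value then follows from `1 - pchisq(stat, 5)` as in the slide):

```python
observed = [8, 11, 5, 12, 15, 9]   # counts for faces 1 through 6
expected = [60 / 6] * 6            # 10 per face under the fair-die null
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# df = 6 - 1 = 5; p-value = 1 - pchisq(stat, 5) in R, about 0.31
```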

Example 13.3.1

A certain characteristic is believed to be present in
20% of Race 1, 20% of Race 2, 20% of Race 3.
From a sample we find the characteristic present in
20/50 from Race 1, 25/100 from Race 2, 15/50 from Race 3.
Write the 2-way contingency table (race by characteristic status) with expected and actual counts. Identify the test statistic and its distribution under the null. Calculate the outcome of the test statistic and the p-value.

Observed (expected):
         Present  Absent   Total
Race 1:  20 (10)  30 (40)  50
Race 2:  25 (20)  75 (80)  100
Race 3:  15 (10)  35 (40)  50

Outcome of test statistic = 17.19. P-value = 1 - pchisq(17.19, df = 3) = 0.00065.
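As a check on the arithmetic, a Python sketch of the test statistic for this fully specified null (each race tested against p0 = 0.20, so the expected counts need no estimation):

```python
# (present, absent) observed counts per race
observed = [(20, 30), (25, 75), (15, 35)]
p0 = 0.20

stat = 0.0
for present, absent in observed:
    n = present + absent
    stat += (present - n * p0) ** 2 / (n * p0)
    stat += (absent - n * (1 - p0)) ** 2 / (n * (1 - p0))
# df = 3 (one per race); stat works out to 17.1875
```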

Example 13.3.2

Repeat the previous example with the null hypothesis that the presence of the characteristic is the same in each race (but not any value in particular). H0: p1 = p2 = p3.
Demonstrate how to estimate p and use it to obtain expected counts.
Demonstrate that the observed-minus-expected differences sum to zero across both rows and columns, so df = 2. You can also think about this as losing a degree of freedom for each parameter estimated.
Outcome of test statistic = 3.57. P-value = 1 - pchisq(3.57, df = 2) = 0.17.
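A sketch of the pooled-estimate version in Python, with p estimated from the combined sample before computing expected counts:

```python
# (present, absent) observed counts per race
observed = [(20, 30), (25, 75), (15, 35)]
n_total = sum(p + a for p, a in observed)         # 200
p_hat = sum(p for p, a in observed) / n_total     # 60/200 = 0.30

stat = 0.0
for present, absent in observed:
    m = present + absent                          # race sample size
    stat += (present - m * p_hat) ** 2 / (m * p_hat)
    stat += (absent - m * (1 - p_hat)) ** 2 / (m * (1 - p_hat))
# df = 2 (one lost for the estimated p); stat is about 3.57
```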

Example 13.6.1

Is political affiliation associated with opinion on the space program? A survey was given asking political affiliation and whether we should increase, decrease, or maintain current support for the space program. This data is fake.

Observed (expected):
             Increase  Same        Decrease    Total
Republican   8 (9)     12 (10.5)   10 (10.5)   30
Democrat     10 (12)   17 (14)     13 (14)     40
Independent  12 (9)    6 (10.5)    12 (10.5)   30
Total        30        35          35          100

Demonstrate that the observed-minus-expected differences sum to zero along rows and columns, giving 4 degrees of freedom.
Outcome of test statistic = 4.54. P-value = 1 - pchisq(4.54, df = 4) = 0.34.
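The expected counts (row total × column total / n) and the statistic can be sketched generically in Python; the table below is the one reconstructed for this example.

```python
# Observed counts: rows Republican, Democrat, Independent;
# columns Increase, Same, Decrease.
table = [[8, 12, 10],
         [10, 17, 13],
         [12, 6, 12]]
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
n = sum(row_tot)

stat = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / n   # expected under independence
        stat += (obs - exp) ** 2 / exp
# df = (3 - 1) * (3 - 1) = 4; stat is about 4.54
```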

A real example

Remind the class that the chi-squared test statistic has a distribution under the null that is only APPROXIMATELY chi-squared. Provide the rule of thumb that the expected cell count should be 5 or more in 75% or more of the cells.

Fisher's exact test

When the chi-squared test is appropriate, it will give results very similar to Fisher's exact test.
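For a 2x2 table, Fisher's exact test can be sketched directly from the hypergeometric distribution. This is an illustrative pure-Python version (the example table is hypothetical, not data from these slides); in practice one would call R's `fisher.test` or `scipy.stats.fisher_exact`.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    # Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]:
    # sum the hypergeometric probabilities of every table with the same
    # margins that is no more likely than the observed table.
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    denom = comb(n, c1)

    def prob(x):   # P(top-left cell = x), hypergeometric
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical 2x2 table
p = fisher_exact_2x2(3, 1, 1, 3)   # = 34/70, about 0.486
```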

Goodness of fit

We can test whether observed data came from a particular distribution. Until now, we have assumed we knew the family to which the population distribution belonged. For example, to find the MLE we must know the likelihood as a function of the unknown parameter; that assumes the only thing we don't know about the distribution is the parameter.

Example 13.7.1

Let Xi be the repair times (in days) for an airplane part. H0: Xi ~ i.i.d. POI(3).
How can we use a contingency table to test this hypothesis? How do we determine expected cell counts?

Repair time (days):  0  1  2  3  4   5  6  7+
Observed:            1  3  7  6  10  7  6  0

Example 13.7.1

Expected counts can be determined from the null hypothesis and the associated mass function. H0: Xi ~ i.i.d. POI(3).

Repair time (days):  0         1         2         3         4          5         6         7+
Observed (expected): 1 (2.00)  3 (5.96)  7 (8.96)  6 (8.96)  10 (6.72)  7 (4.04)  6 (2.00)  0 (1.36)

There's a problem. What is it? How do we fix it?
The rule of thumb for the approximation is not met: too many cells have low expected counts.

Example 13.7.1

Cells can be combined to create cells with larger expected counts.

Repair time (days):  0-1       2         3         4          5+
Observed (expected): 4 (7.96)  7 (8.96)  6 (8.96)  10 (6.72)  13 (7.40)

There's a problem. What is it? How do we fix it? Compute the p-value.
By arbitrarily choosing which bins to combine, I can get different p-values; I could choose bins so that the p-value comes out lower or higher depending on what I want to show. Fix this by specifying the bins a priori.
P-value = 1 - pchisq(9.22, df = 4) = 0.0558.
Note that the observed − expected differences sum to zero, because there are 40 observations and each one falls in some bin.
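A Python sketch of the binned goodness-of-fit computation, with the bins specified a priori as in the slide. (Using unrounded expected counts gives a statistic of about 9.24 rather than the 9.22 obtained from the rounded table entries.)

```python
import math

mu0, n = 3.0, 40

def pois_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Bins specified a priori: {0, 1}, {2}, {3}, {4}, {5, 6, ...}
probs = [pois_pmf(0, mu0) + pois_pmf(1, mu0),
         pois_pmf(2, mu0),
         pois_pmf(3, mu0),
         pois_pmf(4, mu0)]
probs.append(1.0 - sum(probs))          # tail probability P(X >= 5)

observed = [4, 7, 6, 10, 13]
expected = [n * p for p in probs]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# df = 5 - 1 = 4; with unrounded expected counts stat is about 9.24
```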

Example 13.7.2

Repeat the previous example with a composite null hypothesis: H0: Xi ~ i.i.d. POI(μ).
How do we determine expected cell counts? Don't forget to combine cells as needed. (How and when to choose?) Compute the p-value. What conclusions can we make about the distribution at α = 0.05?

Repair time (days):  0-1       2         3         4          5+
Observed (expected): 4 (4.84)  7 (6.92)  6 (8.44)  10 (7.68)  13 (12.12)

By arbitrarily choosing which bins to combine, I can get different p-values; I could choose bins so that the p-value comes out lower or higher depending on what I want to show. Fix this by specifying the bins a priori.
P-value = 1 - pchisq(1.62, df = 3) = 0.6549. Note that a degree of freedom is lost for the parameter we estimated.
Technically the approximate p-value is between 1 - pchisq(1.62, df = 3) = 0.6549 and 1 - pchisq(1.62, df = 4) = 0.8052, since the expected counts are based on the ordinary MLE rather than the MLE for binned data.
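The composite-null version sketched in Python, estimating μ by the ordinary MLE (the sample mean of the raw data) and then binning:

```python
import math

# Raw observed repair times: value -> count (n = 40)
counts = {0: 1, 1: 3, 2: 7, 3: 6, 4: 10, 5: 7, 6: 6}
n = sum(counts.values())
mu_hat = sum(k * c for k, c in counts.items()) / n   # ordinary MLE = 3.65

def pois_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

probs = [pois_pmf(0, mu_hat) + pois_pmf(1, mu_hat),
         pois_pmf(2, mu_hat),
         pois_pmf(3, mu_hat),
         pois_pmf(4, mu_hat)]
probs.append(1.0 - sum(probs))                       # P(X >= 5)

observed = [4, 7, 6, 10, 13]
stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
# df = 5 - 1 - 1 = 3 (one df lost for the estimated parameter)
```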

MLE for binned data

The MLE for binned data is a bit different. In Example 13.7.2, the likelihood function for binned data (in 5 categories) is

L(μ) = n! / (4! 7! 6! 10! 13!) ⋅ [p1(μ)]^4 [p2(μ)]^7 [p3(μ)]^6 [p4(μ)]^10 [p5(μ)]^13

log L(μ) ∝ 4 log(p1(μ)) + … = 4 log(e^{-μ}(1 + μ)) + ...

The derivatives may not be quite as bad as they look, but with more than a few bins it would be unpleasant to maximize this by hand. You will not be asked to do this on quizzes or exams. On Hw 13.15, assume that observed data are always at the median of the bin, and on Hw 13.16, assume that the observation that is ≥5 is actually 5.
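Although the binned likelihood is unpleasant to maximize by hand, it is easy numerically. Here is a hedged sketch using a crude grid search (the bracket [2, 5] is an assumption; a real implementation would use a proper optimizer such as R's `optimize` or `scipy.optimize`):

```python
import math

counts = [4, 7, 6, 10, 13]   # bins {0, 1}, {2}, {3}, {4}, {5+}

def bin_probs(mu):
    # Poisson(mu) probabilities of the five bins
    p = [math.exp(-mu) * (1 + mu),        # p1 = P(X <= 1)
         math.exp(-mu) * mu ** 2 / 2,     # p2 = P(X = 2)
         math.exp(-mu) * mu ** 3 / 6,     # p3 = P(X = 3)
         math.exp(-mu) * mu ** 4 / 24]    # p4 = P(X = 4)
    p.append(1.0 - sum(p))                # p5 = P(X >= 5)
    return p

def loglik(mu):
    # multinomial log-likelihood, up to the constant multinomial coefficient
    return sum(c * math.log(p) for c, p in zip(counts, bin_probs(mu)))

# Crude grid search over the assumed bracket [2, 5]
grid = [2.0 + 0.001 * i for i in range(3001)]
mu_binned = max(grid, key=loglik)
# mu_binned differs somewhat from the raw-data MLE of 3.65, since the
# 5+ bin only records "at least 5"
```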

Other goodness of fit tests

Kolmogorov-Smirnov: compares the empirical CDF of the sample to the hypothesized CDF. [Figure omitted.]
Shapiro-Wilk: for testing normality.
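The Kolmogorov-Smirnov statistic itself is simple to compute: it is the largest vertical gap between the empirical CDF and the hypothesized CDF. A small sketch with hypothetical data (the p-value would come from the KS distribution, e.g. R's `ks.test`):

```python
def ks_statistic(data, cdf):
    # One-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    # between the empirical CDF and the hypothesized CDF, checked just
    # before and just after each jump of the empirical CDF.
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

# Hypothetical sample tested against a Uniform(0, 1) null
sample = [0.05, 0.20, 0.35, 0.50, 0.90]
D = ks_statistic(sample, lambda x: x)   # = 0.30
```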