
1 Topic 2 LOGIT analysis of contingency tables

2 Contingency table: a cross-classification table containing two or more variables of classification; the purpose is to determine whether these variables are related.

Change in stock prices in January (rows) by change in stock prices in the year (columns), with expected frequencies in parentheses:

         UP          DOWN        TOTAL
UP       22 (16.1)   1 (6.9)     23
DOWN     6 (11.9)    11 (5.1)    17
TOTAL    28          12          40

3 A table of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down over the entire year. H0: whether or not stock prices go up over the entire year is the same regardless of their behaviour in January. H1: otherwise. Expected frequencies are shown in parentheses in the table.

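4 Pearson's chi-square statistic: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in cell (i, j), and r and c are respectively the numbers of rows and columns in the table. Under H0 the statistic is approximately chi-square distributed with (r - 1)(c - 1) degrees of freedom.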

5 In our example, the computed statistic exceeds the critical value, so we reject the null hypothesis. In other words, based on this evidence, the probability that stock prices will go up during the whole year does not seem to be independent of whether or not they go up in January.
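The arithmetic behind this conclusion can be worked out from the counts on slide 2. The sketch below assumes a 5% significance level (the slides do not state one); the margins 17, 28, 12 and 40 are obtained by summing the four cell counts.

```latex
% Expected frequency of a cell: (row total x column total) / grand total
\[
E_{ij} = \frac{R_i\, C_j}{n}, \qquad
E_{11} = \frac{23 \times 28}{40} = 16.1
\]
% Every cell has |O - E| = 5.9, so (O - E)^2 = 34.81
\[
\chi^2 = \frac{34.81}{16.1} + \frac{34.81}{6.9} + \frac{34.81}{11.9} + \frac{34.81}{5.1}
       \approx 2.16 + 5.04 + 2.93 + 6.83 \approx 16.96
\]
% Degrees of freedom: (2 - 1)(2 - 1) = 1; assumed 5% critical value
\[
\chi^2 \approx 16.96 > \chi^2_{1,\,0.05} = 3.84
\]
```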

6
DATA STOCK;
  INPUT F YP JP;
  DATALINES;
;
PROC FREQ DATA=STOCK;
  WEIGHT F;
  TABLES YP*JP / CHISQ CMH;
RUN;
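For readers who want to run the program, here is a minimal sketch with the four cell counts from slide 2 typed in as data lines. The 1/0 coding (1 = up, 0 = down) is an assumption, since the slides do not show the original data lines, and the EXPECTED option is added only to print the expected frequencies next to the observed ones.

```sas
/* Sketch only: data lines rebuilt from the slide-2 table.      */
/* Assumed coding (not shown in the slides): 1 = up, 0 = down.  */
DATA STOCK;
  INPUT F YP JP;   /* F = number of years, YP = year up, JP = January up */
  DATALINES;
22 1 1
1 0 1
6 1 0
11 0 0
;
RUN;

PROC FREQ DATA=STOCK;
  WEIGHT F;                        /* each data line stands for F years */
  TABLES YP*JP / CHISQ EXPECTED;   /* Pearson chi-square and expected counts */
RUN;
```

Because the chi-square test is symmetric in the two variables, swapping the order in the TABLES statement changes the layout of the printed table but not the test.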

7

8

9 Two Way Table
Consider the following SAS program and output:
DATA PENALTY;
  INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
  INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
  MODEL DEATH=BLACKD / D=B;
RUN;

10

11 But suppose we don't have individual-level data. All we have is the following table.
[Table: rows Death, Life, Total; columns Blacks, Nonblacks, Total]

12
DATA CONT1;
  INPUT F BLACKD DEATH;
  DATALINES;
;
PROC GENMOD DATA=CONT1 DESCENDING;
  FREQ F;
  MODEL DEATH=BLACKD / D=B;
RUN;

13

14 Results are identical to those obtained previously. Alternatively, we can run the program
DATA CONT1;
  INPUT DEATH TOTAL BLACKD;
  DATALINES;
;
PROC GENMOD DATA=CONT1;
  MODEL DEATH/TOTAL=BLACKD / D=B;
RUN;

15 And obtain output

16 Points to note: Instead of replicating the observations, GENMOD treats the variable DEATH as having a binomial distribution with the number of trials given by TOTAL. The deviance is 0. Why? The deviance is a likelihood ratio statistic that compares the fitted model with a saturated model. In this case the fitted model is itself the saturated model, with two parameters for two data lines.
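The zero deviance can be seen directly from the binomial deviance formula. In the sketch below, $y_i$ and $n_i$ stand for DEATH and TOTAL on data line $i$, and $\alpha$, $\beta$ are assumed names for the intercept and the BLACKD coefficient.

```latex
% Binomial deviance for grouped data, with fitted counts \hat{y}_i = n_i \hat{p}_i
\[
D = 2 \sum_{i=1}^{2} \left[ y_i \ln\frac{y_i}{\hat{y}_i}
      + (n_i - y_i) \ln\frac{n_i - y_i}{n_i - \hat{y}_i} \right]
\]
% Two data lines and two parameters: the model reproduces each observed proportion
\[
\operatorname{logit}(p_i) = \alpha + \beta\,\mathrm{BLACKD}_i
\;\Longrightarrow\; \hat{p}_i = \frac{y_i}{n_i}
\;\Longrightarrow\; \hat{y}_i = y_i
\]
% Every logarithm is ln(1) = 0, hence
\[
D = 0
\]
```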

17 Three Way Table
Consider the cross-classification of race, gender and possession of a driver's license for a sample of 17- and 18-year-olds.

                  Driver's License
Race    Gender    Yes    No
White   Male      43     134
White   Female    26     149
Black   Male      29     23
Black   Female    22     36

18
DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE / D=B;
RUN;
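A runnable version of this program can be sketched by typing in the counts from the slide-17 table. The dummy coding (WHITE = 1 for white, MALE = 1 for male) is an assumption, since the original data lines are not shown; DIST=BINOMIAL is the long form of D=B.

```sas
/* Sketch only: data lines rebuilt from the slide-17 table.          */
/* Assumed coding: WHITE = 1 for white, 0 for black;                 */
/*                 MALE  = 1 for male,  0 for female.                */
DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES + NO;   /* number of trials for the events/trials syntax */
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
RUN;

PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL = WHITE MALE / DIST=BINOMIAL;   /* main effects only */
RUN;
```

With four data lines and three parameters, the deviance of this model has one degree of freedom, which corresponds to the omitted WHITE*MALE interaction.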

19

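20 Deviance = 0.0583 on 1 degree of freedom. Its p-value can be obtained by executing the SAS program:
DATA _NULL_;
  CHI = 1 - PROBCHI(0.0583,1);
  PUT CHI;
RUN;
The deviance compares the fitted model with the saturated model, which here differs only by the WHITE*MALE interaction. Since the p-value is far above conventional significance levels, there is no evidence of an interaction between the explanatory variables.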
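An equivalent way to get the same upper-tail probability is the SDF (survival distribution function) call, sketched below. The variable name P is arbitrary. For a chi-square value of 0.0583 on 1 degree of freedom, the result should come out at roughly 0.81.

```sas
/* Sketch: upper-tail chi-square probability written to the SAS log. */
DATA _NULL_;
  P = SDF('CHISQUARE', 0.0583, 1);   /* same quantity as 1 - PROBCHI(0.0583, 1) */
  PUT P=;
RUN;
```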

21 To see this more explicitly, let us fit the model with interaction:
DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE WHITE*MALE / D=B;
RUN;

22

23 Interpretation: exponentiating the coefficient of MALE yields 1.91, so the estimated odds of having a driver's license are nearly twice as large for males as for females, after adjusting for racial differences.

24 For WHITE, the highly significant adjusted odds ratio (the exponentiated coefficient) is 0.269, indicating that the odds of having a driver's license for whites are a little more than ¼ of the odds for blacks.
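The link between coefficients and odds ratios in the main-effects model can be written out as follows. The symbols $\beta_0$, $\beta_1$, $\beta_2$ are assumed names, and the coefficient values are back-calculated from the odds ratios quoted on the slides rather than read from the SAS output.

```latex
% Main-effects logit model for the probability p of holding a license
\[
\operatorname{logit}(p) = \ln\frac{p}{1-p}
  = \beta_0 + \beta_1\,\mathrm{WHITE} + \beta_2\,\mathrm{MALE}
\]
% Adjusted odds ratios are exponentiated coefficients
\[
\mathrm{OR}_{\mathrm{MALE}} = e^{\beta_2} = 1.91
  \;\Longrightarrow\; \beta_2 = \ln 1.91 \approx 0.647
\]
\[
\mathrm{OR}_{\mathrm{WHITE}} = e^{\beta_1} = 0.269
  \;\Longrightarrow\; \beta_1 = \ln 0.269 \approx -1.313
\]
```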

25 Four Way Table
Things are slightly more complicated with four-way tables because more interactions are possible. Consider the following table. Our goal is to estimate a LOGIT model for the dependence of working class identification on the other three variables.

26 [Table: Identifies with the Working Class (Yes, No, Total) by Country (France, U.S.), Occupation (Manual, Non-Manual) and Father's Occupation (Manual, Non-Manual). The only surviving cell values are the run-together digits "18485" in the U.S., Non-Manual, Manual row.]

27
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL / D=B;
RUN;

28

29 The terms missing from this model are the interactions: three 2-way interactions and one 3-way interaction. Because 3-way interactions cannot be interpreted easily, let's see if we can get by with just the 2-way interactions.

30
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL
        FRANCE*MANUAL FRANCE*FAMANUAL MANUAL*FAMANUAL / D=B;
RUN;

31

32 Examining the Wald chi-squares, we find that FRANCE*FAMANUAL is highly significant, but the other two interactions are not.

33
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*FAMANUAL / D=B;
RUN;

34

35 Interpretation of results
Coefficient for MANUAL: exp(2.5155) = 12.4, so manual workers have odds of identifying with the working class that are more than 12 times the odds for non-manual workers.
Coefficient for FRANCE*FAMANUAL:

36 If FRANCE=0, the coefficient of FAMANUAL alone represents the effect of FAMANUAL when the respondent lives in the U.S. If FRANCE=1, the combined coefficient 1.13 (the FAMANUAL main effect plus the FRANCE*FAMANUAL interaction) represents the effect of FAMANUAL when the respondent lives in France, and exp[1.13] = 3.1. In France, men whose fathers had a manual occupation have odds of identification that are more than three times the odds for men whose fathers did not have a manual occupation.
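Written as an equation, the country-specific effect of FAMANUAL follows from collecting the terms that multiply it. The symbols $\beta_0, \dots, \beta_4$ are assumed names for the coefficients of the final model and do not appear in the slides.

```latex
% Final model: main effects plus the FRANCE*FAMANUAL interaction
\[
\operatorname{logit}(p) = \beta_0 + \beta_1\,\mathrm{FRANCE} + \beta_2\,\mathrm{MANUAL}
  + \beta_3\,\mathrm{FAMANUAL} + \beta_4\,(\mathrm{FRANCE} \times \mathrm{FAMANUAL})
\]
% Effect on the log-odds of moving FAMANUAL from 0 to 1
\[
\beta_3 + \beta_4\,\mathrm{FRANCE} =
\begin{cases}
\beta_3 & \text{U.S. (FRANCE = 0)} \\
\beta_3 + \beta_4 = 1.13 & \text{France (FRANCE = 1)}
\end{cases}
\]
% Odds ratio for FAMANUAL in France
\[
e^{1.13} \approx 3.1
\]
```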

37 Overdispersion
Overdispersion refers to a situation of lack of fit. Causes of overdispersion: (1) an incorrectly specified model, in which more interactions or nonlinearity are needed; (2) lack of independence of observations due to unobserved heterogeneity at the group level.

38
DATA POSTDOC;
  INPUT NIH DOCS PDOCS;
  DATALINES;
;
PROC GENMOD DATA=POSTDOC;
  MODEL PDOCS/DOCS=NIH / D=B;
RUN;

39

40 Note that the deviance and Pearson χ² clearly indicate model mis-specification. Because there's only one independent variable, we don't have the option of putting in interactions. One can try allowing for nonlinearity by including powers of NIH in the model, but that won't help. It is quite possible that the lack of fit is due to a lack of independence in the observations.

41 There are many characteristics of biochemistry departments besides NIH funding that may have some bearing on whether their graduates seek and get postdoctoral training. Examples are the prestige of the department, whether the department is in an agricultural or medical school, the age of the department and so on. Lack of independence of this kind produces what is called extra-binomial variation: the variance of the dependent variable will be greater than what is expected under the assumption of a binomial distribution.

42 Besides producing a large deviance, extra-binomial variation can result in underestimates of the standard errors and overestimates of the chi-square statistics. Method of adjustment: divide the Pearson chi-square statistic by its degrees of freedom, take the square root, and multiply all the standard errors by that number.
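The adjustment can be summarised in three formulas. This is a sketch of the usual Pearson-based scaling, with $X^2$ the Pearson chi-square, $N$ the number of data lines and $k$ the number of estimated parameters (all symbol names are assumptions).

```latex
% Estimated dispersion: Pearson chi-square over its degrees of freedom
\[
\hat{\phi} = \frac{X^2}{N - k}
\]
% Standard errors are inflated by the square root of the dispersion
\[
\mathrm{SE}_{\mathrm{adj}}(\hat{\beta}_j) = \sqrt{\hat{\phi}}\;\mathrm{SE}(\hat{\beta}_j)
\]
% Wald chi-squares shrink by the same factor squared
\[
\chi^2_{\mathrm{adj}} = \frac{\chi^2_{\mathrm{Wald}}}{\hat{\phi}}
\]
```

The coefficient estimates themselves are unchanged; only their standard errors and test statistics are rescaled.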

43
DATA POSTDOC;
  INPUT NIH DOCS PDOCS;
  DATALINES;
;
PROC GENMOD DATA=POSTDOC;
  MODEL PDOCS/DOCS=NIH / D=B PSCALE;
RUN;

44