Logistic Regression Binary response variable Y (1 – Success, 0 – Failure) Continuous, Categorical independent Variables –Similar to Multiple Regression.


Logistic Regression
Binary response variable Y (1 – Success, 0 – Failure)
Continuous and categorical independent variables
– Similar to multiple regression
– Can use dummy variables to handle categorical independent variables
– Various selection procedures for determining the "best" model: forward, backward, stepwise
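The dummy-variable idea can be sketched in a few lines of Python. This is only an illustration: the predictor levels, the coefficient values, and the helper names `dummy_code` and `success_probability` are hypothetical and do not come from the slides.

```python
import math

def dummy_code(value, levels):
    """Return 0/1 indicators for every level except the first (the reference level)."""
    return [1 if value == level else 0 for level in levels[1:]]

def success_probability(intercept, coefs, x):
    """P(Y = 1 | x) under a logistic model: 1 / (1 + exp(-(b0 + sum of b_k * x_k)))."""
    eta = intercept + sum(b * xk for b, xk in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical categorical predictor with three levels; "low" is the reference level.
levels = ["low", "medium", "high"]
x_low = dummy_code("low", levels)     # no indicator switched on
x_high = dummy_code("high", levels)   # indicator for "high" switched on

# Hypothetical coefficients, chosen only for illustration (not fitted to any data).
b0, b = -1.0, [0.5, 1.2]
p_low = success_probability(b0, b, x_low)
p_high = success_probability(b0, b, x_high)
```

A categorical variable with k levels needs k − 1 dummy variables; the omitted level acts as the baseline against which the other coefficients are interpreted.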

Discrete Multivariate Analysis Analysis of Multivariate Categorical Data

References
1. Fienberg, S. (1980), Analysis of Cross-Classified Data, MIT Press, Cambridge, Mass.
2. Fingleton, B. (1984), Models for Category Counts, Cambridge University Press.
3. Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.

Example 1
In this study we examine n = 1237 individuals, measuring X (systolic blood pressure) and Y (serum cholesterol).

Example 2
The following data were taken from a study of parole success involving 5587 parolees in Ohio between 1965 and 1972 (a ten percent sample of all parolees during this period).

The study involved a dichotomous response Y, based on a one-year follow-up:
– Success (no major parole violation), or
– Failure (returned to prison either as a technical violator or with a new conviction).
The predictors of parole success included are:
1. Type of committed offence (Person offence or Other offence)
2. Age (25 or older, or under 25)
3. Prior record (No prior sentence or Prior sentence)
4. Drug or alcohol dependency (No drug or alcohol dependency, or Drug and/or alcohol dependency)

The data were randomly split into two parts. The counts for each part are displayed in the table, with those for the second part in parentheses. The second part of the data was set aside for a validation study of the model to be fitted in the first part.

Table

Analysis of a Two-way Frequency Table:

Frequency Distribution (Serum Cholesterol and Systolic Blood Pressure)

Joint and Marginal Distributions (Serum Cholesterol and Systolic Blood Pressure) The Marginal distributions allow you to look at the effect of one variable, ignoring the other. The joint distribution allows you to look at the two variables simultaneously.

Conditional Distributions ( Systolic Blood Pressure given Serum Cholesterol ) The conditional distribution allows you to look at the effect of one variable, when the other variable is held fixed or known.

Conditional Distributions (Serum Cholesterol given Systolic Blood Pressure)
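As a minimal sketch of how the joint, marginal and conditional distributions are obtained from a two-way frequency table, the following uses a small hypothetical table (the counts are invented for illustration and are not the blood pressure and cholesterol data).

```python
# Hypothetical 2 x 3 frequency table: rows index X, columns index Y.
table = [[20, 30, 50],
         [10, 40, 50]]

N = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Joint distribution: each cell divided by the grand total.
joint = [[x / N for x in row] for row in table]

# Marginal distributions: row and column totals divided by the grand total.
marginal_x = [r / N for r in row_totals]
marginal_y = [c / N for c in col_totals]

# Conditional distribution of Y given X: each row divided by its row total.
cond_y_given_x = [[x / r for x in row] for row, r in zip(table, row_totals)]

# Conditional distribution of X given Y: each column divided by its column total.
cond_x_given_y = [[table[i][j] / col_totals[j] for j in range(len(col_totals))]
                  for i in range(len(table))]
```

Each row of `cond_y_given_x` sums to 1, and each column of `cond_x_given_y` sums to 1, which is a quick check that the right total was used as the divisor.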

GRAPH: Conditional distributions of Systolic Blood Pressure given Serum Cholesterol

Notation: Let $x_{ij}$ denote the frequency (no. of cases) where X (row variable) is i and Y (column variable) is j.

Different Models
The Multinomial Model: Here the total number of cases N is fixed and $x_{ij}$ follows a multinomial distribution with parameters $\pi_{ij}$.

The Product Multinomial Model: Here the row (or column) totals $R_i$ are fixed and, for a given row i, $x_{ij}$ follows a multinomial distribution with parameters $\pi_{j|i}$.

The Poisson Model: In this case we observe over a fixed period of time and all counts in the table (including row, column and overall totals) follow a Poisson distribution. Let $\mu_{ij}$ denote the mean of $x_{ij}$.

Independence

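Multinomial Model: if X and Y are independent, $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$ and $\mu_{ij} = N\pi_{ij} = N\pi_{i\cdot}\,\pi_{\cdot j}$. The estimated expected frequency in cell (i,j) in the case of independence is $E_{ij} = R_i C_j / N$, where $R_i$ is the total of row i and $C_j$ is the total of column j.

The same can be shown for the other two models, the Product Multinomial model and the Poisson model; namely, the estimated expected frequency in cell (i,j) in the case of independence is $E_{ij} = R_i C_j / N$. Standardized residuals are defined for each cell: $r_{ij} = (x_{ij} - E_{ij})/\sqrt{E_{ij}}$.

The Chi-Square Statistic: $\chi^2 = \sum_i \sum_j r_{ij}^2 = \sum_i \sum_j (x_{ij} - E_{ij})^2 / E_{ij}$.
The Chi-Square test for independence: Reject $H_0$: independence if $\chi^2 > \chi^2_{\alpha}$, the upper-$\alpha$ critical value of the chi-square distribution with d.f. = (I − 1)(J − 1), where I and J are the numbers of rows and columns.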
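The test of independence can be sketched as follows. The sketch assumes the usual estimates: expected frequency = (row total × column total) / N, standardized residual = (observed − expected) / √expected, and chi-square = sum of squared residuals with (rows − 1)(columns − 1) degrees of freedom. The 2 × 2 table and the function name `independence_test` are hypothetical.

```python
import math

def independence_test(table):
    """Expected frequencies, standardized residuals, chi-square and d.f. for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    n_rows, n_cols = len(row_totals), len(col_totals)
    # Expected frequency under independence: (row total * column total) / N.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(n_cols)]
                for i in range(n_rows)]
    # Standardized residual: (observed - expected) / sqrt(expected).
    residuals = [[(table[i][j] - expected[i][j]) / math.sqrt(expected[i][j])
                  for j in range(n_cols)] for i in range(n_rows)]
    chi2 = sum(r * r for row in residuals for r in row)
    df = (n_rows - 1) * (n_cols - 1)
    return expected, residuals, chi2, df

# Hypothetical 2 x 2 table, invented for illustration.
expected, residuals, chi2, df = independence_test([[30, 20], [20, 30]])

# 3.841 is the upper 5% point of the chi-square distribution with 1 d.f.
reject_independence = chi2 > 3.841
```

Cells with large positive or negative standardized residuals show where the observed table departs most from independence.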

Table: Expected frequencies, Observed frequencies, Standardized Residuals; $\chi^2$ = ... (p = ...)

Example
This example studies N = 57,407 cases in which individuals were victimized twice by crimes. The crime of the first victimization (X) and the crime of the second victimization (Y) were noted. The data are tabulated on the following slide.

Table 1: Frequencies

Table 2: Standardized residuals

Table 3: Conditional distribution of second victimization given the first victimization (%)

Log Linear Model

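Recall, if the two variables, rows (X) and columns (Y), are independent then $\mu_{ij} = \mu_{i\cdot}\,\mu_{\cdot j} / \mu_{\cdot\cdot}$ and $\ln \mu_{ij} = \ln \mu_{i\cdot} + \ln \mu_{\cdot j} - \ln \mu_{\cdot\cdot}$.

In general, for a table with I rows and J columns, let $u = \frac{1}{IJ}\sum_i \sum_j \ln \mu_{ij}$, $u_{1(i)} = \frac{1}{J}\sum_j \ln \mu_{ij} - u$, $u_{2(j)} = \frac{1}{I}\sum_i \ln \mu_{ij} - u$ and $u_{12(i,j)} = \ln \mu_{ij} - u - u_{1(i)} - u_{2(j)}$; then

$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$   (1)

where $\sum_i u_{1(i)} = \sum_j u_{2(j)} = \sum_i u_{12(i,j)} = \sum_j u_{12(i,j)} = 0$. Equation (1) is called the log-linear model for the frequencies $x_{ij}$.

Note: X and Y are independent if $u_{12(i,j)} = 0$ for all i, j. In this case the log-linear model becomes $\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$.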
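A small sketch of the two-way log-linear decomposition, assuming the usual sum-to-zero parameterisation: u is the mean of ln μ over all cells, the main effects are row and column means of ln μ minus u, and the interaction is what remains. The tables of expected frequencies and the function name `u_terms` are hypothetical.

```python
import math

def u_terms(mu):
    """Decompose ln(mu[i][j]) into u + u1[i] + u2[j] + u12[i][j] with sum-to-zero constraints."""
    n_rows, n_cols = len(mu), len(mu[0])
    log_mu = [[math.log(m) for m in row] for row in mu]
    u = sum(sum(row) for row in log_mu) / (n_rows * n_cols)
    u1 = [sum(log_mu[i]) / n_cols - u for i in range(n_rows)]
    u2 = [sum(log_mu[i][j] for i in range(n_rows)) / n_rows - u for j in range(n_cols)]
    u12 = [[log_mu[i][j] - u - u1[i] - u2[j] for j in range(n_cols)]
           for i in range(n_rows)]
    return u, u1, u2, u12

# Hypothetical expected frequencies that satisfy independence (rows are proportional).
_, _, _, u12_indep = u_terms([[10.0, 20.0], [30.0, 60.0]])

# Hypothetical expected frequencies with association between rows and columns.
_, _, _, u12_assoc = u_terms([[40.0, 10.0], [10.0, 40.0]])

# In a 2 x 2 table the log odds ratio equals 4 * u12[0][0] under this parameterisation.
log_odds_ratio = math.log((40.0 * 40.0) / (10.0 * 10.0))
```

For the independent table every interaction term is zero up to rounding, while for the associated table the interaction terms carry the whole departure from independence.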

Three-way Frequency Tables

Example
Data from the Framingham Longitudinal Study of Coronary Heart Disease (Cornfield [1962])
Variables
1. Systolic Blood Pressure (X) – < 127, …, …, …
2. Serum Cholesterol – < 200, …, …, …
3. Heart Disease – Present, Absent
The data are tabulated on the next slide.

Three-way Frequency Table

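Log-Linear model for three-way tables
Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$

where each u-term sums to zero over each of its indices.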

Hierarchical Log-linear models for categorical data (three-way tables)
The hierarchical principle: If an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

1. Model (all main effects model): $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)}$, i.e. $u_{12(i,j)} = u_{13(i,k)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$. Notation: [1][2][3]. Description: Mutual independence between all three variables.

2. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)}$, i.e. $u_{13(i,k)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$. Notation: [12][3]. Description: Independence of variable 3 with variables 1 and 2.

3. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(i,k)}$, i.e. $u_{12(i,j)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$. Notation: [13][2]. Description: Independence of variable 2 with variables 1 and 3.

4. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{23(j,k)}$, i.e. $u_{12(i,j)} = u_{13(i,k)} = u_{123(i,j,k)} = 0$. Notation: [23][1]. Description: Independence of variable 1 with variables 2 and 3.

5. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)}$, i.e. $u_{23(j,k)} = u_{123(i,j,k)} = 0$. Notation: [12][13]. Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{23(j,k)}$, i.e. $u_{13(i,k)} = u_{123(i,j,k)} = 0$. Notation: [12][23]. Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(i,k)} + u_{23(j,k)}$, i.e. $u_{12(i,j)} = u_{123(i,j,k)} = 0$. Notation: [13][23]. Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)}$, i.e. $u_{123(i,j,k)} = 0$. Notation: [12][13][23]. Description: Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.

9. Model (the saturated model): $\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$. Notation: [123]. Description: No simplifying dependence structure.

Hierarchical Log-linear models for 3 way table

Model            Description
[1][2][3]        Mutual independence between all three variables.
[1][23]          Independence of variable 1 with variables 2 and 3.
[2][13]          Independence of variable 2 with variables 1 and 3.
[3][12]          Independence of variable 3 with variables 1 and 2.
[12][13]         Conditional independence between variables 2 and 3 given variable 1.
[12][23]         Conditional independence between variables 1 and 3 given variable 2.
[13][23]         Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]     Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]            The saturated model.
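The bracket notation and the hierarchical principle can be made concrete with a short sketch that expands a generating class such as [12][3] into every u-term the model contains. The function name `implied_terms` and the convention that each variable is a single character are assumptions made for this illustration.

```python
from itertools import combinations

def implied_terms(generators):
    """List every u-term implied by a generating class, e.g. ["12", "3"] for [12][3].

    By the hierarchical principle, each bracket brings in the interaction it names
    together with all lower-order interactions and main effects of its variables.
    """
    terms = set()
    for generator in generators:
        for size in range(1, len(generator) + 1):
            for combo in combinations(sorted(generator), size):
                terms.add("".join(combo))
    # Sort by order of the term (main effects first), then by label.
    return sorted(terms, key=lambda t: (len(t), t))

independence_of_3 = implied_terms(["12", "3"])    # model [12][3]
conditional_indep = implied_terms(["12", "13"])   # model [12][13]
saturated = implied_terms(["123"])                # model [123]
```

Any term that does not appear in the returned list is set to zero in that model; for [12][13], for instance, the missing terms are the 2-3 interaction and the three-factor interaction.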

Maximum Likelihood Estimation Log-Linear Model

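For any model it is possible to determine the maximum likelihood estimators of the parameters.
Example: Two-way table – independence – multinomial model

$L = \dfrac{N!}{\prod_{i,j} x_{ij}!} \prod_{i,j} \pi_{ij}^{x_{ij}}$ or, writing $\mu_{ij} = N\pi_{ij}$, $L = \dfrac{N!}{\prod_{i,j} x_{ij}!} \prod_{i,j} \left(\dfrac{\mu_{ij}}{N}\right)^{x_{ij}}$

Log-likelihood: $l = \ln L = K + \sum_i \sum_j x_{ij} \ln \pi_{ij}$, where $K = \ln N! - \sum_i \sum_j \ln x_{ij}!$ does not depend on the parameters. With the model of independence, $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

and $l = K + \sum_i R_i \ln \pi_{i\cdot} + \sum_j C_j \ln \pi_{\cdot j}$, with $\sum_i \pi_{i\cdot} = 1$ and also $\sum_j \pi_{\cdot j} = 1$. Here $R_i = \sum_j x_{ij}$ and $C_j = \sum_i x_{ij}$ are the row and column totals.

Let $g = l + \lambda_1 \left(1 - \sum_i \pi_{i\cdot}\right) + \lambda_2 \left(1 - \sum_j \pi_{\cdot j}\right)$. Now $\dfrac{\partial g}{\partial \pi_{i\cdot}} = \dfrac{R_i}{\pi_{i\cdot}} - \lambda_1 = 0$.

Since $R_i = \lambda_1 \pi_{i\cdot}$ for every i, summing over i gives $\sum_i R_i = \lambda_1 \sum_i \pi_{i\cdot}$, or $N = \lambda_1$.

Hence $\hat{\pi}_{i\cdot} = R_i / N$ and, similarly, $\hat{\pi}_{\cdot j} = C_j / N$. Finally $\hat{\pi}_{ij} = \hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j} = R_i C_j / N^2$.

Hence $\hat{\mu}_{ij} = N \hat{\pi}_{ij} = R_i C_j / N$. Now $\ln \hat{\mu}_{ij} = \ln R_i + \ln C_j - \ln N$.

Hence, for a table with I rows and J columns, $\hat{u} = \frac{1}{I}\sum_i \ln R_i + \frac{1}{J}\sum_j \ln C_j - \ln N$, $\hat{u}_{1(i)} = \ln R_i - \frac{1}{I}\sum_{i'} \ln R_{i'}$ and $\hat{u}_{2(j)} = \ln C_j - \frac{1}{J}\sum_{j'} \ln C_{j'}$. Note that $\ln \hat{\mu}_{ij} = \hat{u} + \hat{u}_{1(i)} + \hat{u}_{2(j)}$.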

Comments
– Maximum likelihood estimates can be computed for any hierarchical log-linear model (i.e. more than 2 variables).
– In certain situations the equations need to be solved numerically.
– For the saturated model (all interactions and main effects), the estimate of $\mu_{ijk\ldots}$ is $x_{ijk\ldots}$.
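One common numerical method for hierarchical log-linear models is iterative proportional fitting (IPF). The slides do not say which algorithm is used, so the following is only a sketch of IPF for the model [12][13][23], which has no closed-form fitted values. The 2 × 2 × 2 counts and the function name `ipf_no_three_factor` are hypothetical, and every observed two-way margin is assumed to be positive.

```python
def ipf_no_three_factor(x, cycles=200):
    """Fit the model [12][13][23] to a three-way table x[i][j][k] by IPF.

    Starting from a table of ones, the fitted values are rescaled in turn so that
    each fitted two-way margin matches the corresponding observed margin.
    """
    n_i, n_j, n_k = len(x), len(x[0]), len(x[0][0])
    m = [[[1.0 for _ in range(n_k)] for _ in range(n_j)] for _ in range(n_i)]
    for _ in range(cycles):
        for i in range(n_i):          # match the (1,2) margin
            for j in range(n_j):
                scale = sum(x[i][j]) / sum(m[i][j])
                for k in range(n_k):
                    m[i][j][k] *= scale
        for i in range(n_i):          # match the (1,3) margin
            for k in range(n_k):
                scale = (sum(x[i][j][k] for j in range(n_j))
                         / sum(m[i][j][k] for j in range(n_j)))
                for j in range(n_j):
                    m[i][j][k] *= scale
        for j in range(n_j):          # match the (2,3) margin
            for k in range(n_k):
                scale = (sum(x[i][j][k] for i in range(n_i))
                         / sum(m[i][j][k] for i in range(n_i)))
                for i in range(n_i):
                    m[i][j][k] *= scale
    return m

# Hypothetical 2 x 2 x 2 table of counts, invented for illustration.
x = [[[10, 20], [30, 40]], [[50, 60], [70, 85]]]
m = ipf_no_three_factor(x)

# With no three-factor interaction, the 1-2 odds ratio is the same at each level of variable 3.
odds_ratio_k0 = m[0][0][0] * m[1][1][0] / (m[0][1][0] * m[1][0][0])
odds_ratio_k1 = m[0][0][1] * m[1][1][1] / (m[0][1][1] * m[1][0][1])
```

A fixed number of cycles keeps the sketch short; a convergence check on the change in fitted values between cycles would be the more usual stopping rule.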

Goodness of Fit Statistics These statistics can be used to check if a log-linear model will fit the observed frequency table

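Goodness of Fit Statistics
The Chi-squared statistic: $\chi^2 = \sum \dfrac{(x - \hat{\mu})^2}{\hat{\mu}}$, summed over all cells, where x is the observed and $\hat{\mu}$ the fitted (expected) frequency.
The Likelihood Ratio statistic: $G^2 = 2 \sum x \ln\!\left(\dfrac{x}{\hat{\mu}}\right)$
d.f. = # cells − # parameters fitted
We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_{\alpha}$, the upper-$\alpha$ critical value of the chi-square distribution with these d.f.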

Example: Variables
1. Systolic Blood Pressure (B)
2. Serum Cholesterol (C)
3. Coronary Heart Disease (H)

Goodness of fit testing of Models
Columns: MODEL, DF, LIKELIHOOD-RATIO CHISQ, PROB., PEARSON CHISQ, PROB.
B,C,H
B,CH
C,BH
H,BC
BC,BH
BH,CH (n.s.)
CH,BC
BC,BH,CH (n.s.)

Possible Models:
1. [BH][CH] – B and C independent given H.
2. [BC][BH][CH] – all two-factor interaction model
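For a conditional independence model such as [BH][CH], the fitted values have a closed form, so the goodness-of-fit statistics can be sketched directly. The sketch indexes the table as x[i][j][k] with the conditioning variable last, and uses fitted value = (i,k margin × j,k margin) / (k margin), with d.f. = (I − 1)(J − 1)K. The counts and function names are hypothetical, not the Framingham data.

```python
import math

def fit_conditional_independence(x):
    """Fitted values for variables 1 and 2 independent given variable 3 (model [13][23])."""
    n_i, n_j, n_k = len(x), len(x[0]), len(x[0][0])
    fitted = [[[0.0] * n_k for _ in range(n_j)] for _ in range(n_i)]
    for k in range(n_k):
        total_k = sum(x[i][j][k] for i in range(n_i) for j in range(n_j))
        for i in range(n_i):
            margin_ik = sum(x[i][j][k] for j in range(n_j))
            for j in range(n_j):
                margin_jk = sum(x[i][j][k] for i in range(n_i))
                fitted[i][j][k] = margin_ik * margin_jk / total_k
    return fitted

def fit_statistics(x, fitted):
    """Pearson chi-square and likelihood ratio G^2 comparing observed and fitted tables."""
    pairs = [(obs, fit) for plane_x, plane_f in zip(x, fitted)
             for row_x, row_f in zip(plane_x, plane_f)
             for obs, fit in zip(row_x, row_f)]
    chi2 = sum((obs - fit) ** 2 / fit for obs, fit in pairs)
    # Cells with an observed count of zero contribute nothing to G^2.
    g2 = 2.0 * sum(obs * math.log(obs / fit) for obs, fit in pairs if obs > 0)
    return chi2, g2

# Hypothetical 2 x 2 x 2 counts: association at level k = 0, none at level k = 1.
x = [[[20, 8], [10, 4]], [[10, 2], [20, 1]]]
fitted = fit_conditional_independence(x)
chi2, g2 = fit_statistics(x, fitted)
df = (2 - 1) * (2 - 1) * 2

# 5.991 is the upper 5% point of the chi-square distribution with 2 d.f.
model_rejected = chi2 > 5.991 and g2 > 5.991
```

A model marked "n.s." in a table of this kind is one whose statistics do not exceed the critical value, so it is retained as an adequate description of the data.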

Model 1: [BH][CH]
Log-linear parameters
Heart Disease – Blood Pressure Interaction

Multiplicative effect Log-Linear Model

Heart Disease - Cholesterol Interaction

Multiplicative effect

Model 2: [BC][BH][CH] Log-linear parameters Blood pressure-Cholesterol interaction:

Multiplicative effect

Heart Disease – Blood Pressure Interaction

Multiplicative effect

Heart Disease - Cholesterol Interaction

Multiplicative effect
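The multiplicative effects are the exponentials of the log-linear parameters: because ln μ is a sum of u-terms, μ itself is a product of the exponentiated terms. The sketch below illustrates this with hypothetical parameter values for a two-way table (they are not the estimates from the heart disease example).

```python
import math

# Hypothetical log-linear parameters for a 2 x 2 table, satisfying sum-to-zero constraints.
u = 3.0
u1 = [0.2, -0.2]
u2 = [-0.5, 0.5]
u12 = [[0.3, -0.3], [-0.3, 0.3]]

# Multiplicative effects: exponentiate each additive (log-scale) parameter.
mult_u12 = [[math.exp(v) for v in row] for row in u12]

# Expected frequency in cell (0, 0), computed on the additive and on the multiplicative scale.
mu_additive = math.exp(u + u1[0] + u2[0] + u12[0][0])
mu_multiplicative = math.exp(u) * math.exp(u1[0]) * math.exp(u2[0]) * mult_u12[0][0]
```

A multiplicative effect above 1 raises the expected frequency of the cell relative to what the lower-order terms alone would give, and an effect below 1 lowers it; sum-to-zero constraints on the log scale mean the multiplicative effects in each row and column multiply to 1.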

Another Example
In this study the following were determined for N = 4353 males:
1. Occupation category
2. Educational level
3. Academic aptitude

1. Occupation categories
a. Self-employed Business
b. Teacher/Education
c. Self-employed Professional
d. Salaried Employed
2. Education levels
a. Low
b. Low/Med
c. Med
d. High/Med
e. High

3. Academic Aptitude
a. Low
b. Low/Med
c. High/Med
d. High

It is common to handle a multiway table by testing for independence in all two-way tables. This is similar to looking at all the bivariate correlations. In this example we learn that:
1. Education is related to Aptitude
2. Education is related to Occupational category
3. Occupational category is related to Aptitude
Can we do better than this?

Fitting various log-linear models
The simplest model that fits is [Apt,Ed][Occ,Ed]. This model implies conditional independence between Aptitude and Occupation given Education.
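The degrees of freedom for a model such as [Apt,Ed][Occ,Ed] follow from the rule d.f. = # cells − # parameters fitted. The sketch below counts parameters under sum-to-zero constraints, where a term involving several variables contributes the product of (number of levels − 1) over those variables. The level counts (4 aptitude, 5 education, 4 occupation categories) are taken from the lists above; the function name `degrees_of_freedom` is hypothetical.

```python
from itertools import combinations
from math import prod

def degrees_of_freedom(levels, generators):
    """d.f. = number of cells minus number of free parameters in a hierarchical model.

    levels: dict mapping variable name to its number of categories.
    generators: list of tuples of variable names, one tuple per bracket.
    """
    terms = set()
    for generator in generators:
        for size in range(1, len(generator) + 1):
            for combo in combinations(sorted(generator), size):
                terms.add(combo)
    # One parameter for the overall mean u, plus the free parameters of each term.
    n_parameters = 1 + sum(prod(levels[v] - 1 for v in term) for term in terms)
    n_cells = prod(levels.values())
    return n_cells - n_parameters

levels = {"Apt": 4, "Ed": 5, "Occ": 4}
df_conditional = degrees_of_freedom(levels, [("Apt", "Ed"), ("Occ", "Ed")])
df_saturated = degrees_of_freedom(levels, [("Apt", "Ed", "Occ")])
```

For this model the count agrees with the direct formula for conditional independence given Education: 5 × (4 − 1) × (4 − 1) = 45, while the saturated model leaves no degrees of freedom.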

Log-linear Parameters Aptitude – Education Interaction

Aptitude – Education Interaction (Multiplicative)

Occupation – Education Interaction

Occupation – Education Interaction (Multiplicative)