GCRC Data Analysis with SPSS Workshop, Session 5: Follow-Up on FEV Data; Binary and Categorical Outcomes (2x2, 2xK, and JxK Tables); Logistic Regression


1 GCRC Data Analysis with SPSS Workshop, Session 5
Follow-up on the FEV data
Binary and categorical outcomes: 2x2 tables, 2xK tables, JxK tables
Logistic regression: definitions, model selection, assessing the model fit
Low birth weight data

2 Regression of LOG(FEV) on 4 Predictors

            Full dataset (N=654)            Subset, age >8 yrs (N=439)
            b   SE(b)   p   STD b   VIF     b   SE(b)   p   STD b   VIF
  Age       [coefficient values not recovered from the slide]
  Ht
  Smoke
  Sex
  R-sq      .81                             .67

3 Contingency Tables
- X and Y are categorical response variables, with I and J categories respectively.
- The probability distribution {π_ij} is the joint distribution of X and Y.
- If X is not random, the joint distribution is not meaningful, but the distribution of Y conditional on X is.
- Marginal distribution: the row {π_i.} and column {π_.j} totals obtained by summing the joint probabilities.
- Conditional distribution: given that a subject is in row i of X, π_j|i is the probability of classification into column j of Y.
- Prospective studies: the totals {n_i.} for X are usually fixed, and each row of J counts is an independent multinomial sample on Y.
- Retrospective studies: the totals {n_.j} for Y are usually fixed, and each column of I counts is an independent multinomial sample on X.
- Cross-sectional studies: the total sample size is fixed, and the IJ cell counts are a multinomial sample.

4 Contingency Tables (continued)
Joint (conditional) and marginal probabilities:

                Y=1             Y=2             Total
    X=1     π_11 (π_1|1)    π_12 (π_2|1)     π_1. (1.0)
    X=2     π_21 (π_1|2)    π_22 (π_2|2)     π_2. (1.0)
    Total      π_.1            π_.2             1.0

5 In any case, if X and Y are independent, π_ij = π_i. π_.j. The maximum likelihood estimates for π_ij are the cell proportions p_ij = n_ij / n. Under the assumption of independence, the expected cell counts are

    m_ij = n π_i. π_.j,  estimated by  m̂_ij = (n_i. n_.j) / n

and the chi-square statistic

    X² = Σ_i Σ_j (n_ij − m̂_ij)² / m̂_ij

with (I−1)(J−1) degrees of freedom can be used to test the null hypothesis of independence.

6 For a single multinomial variable, the analogous statistic, constructed similarly,

    X² = Σ_i (n_i − m_i)² / m_i,  where m_i = n π_i0,

with (I−1) degrees of freedom can be used to compare the observed cell proportions to a distribution with fixed values {π_i0} (also known as a goodness-of-fit test).
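In SPSS this single-variable goodness-of-fit test is available through NPAR TESTS. A minimal syntax sketch, assuming a three-category variable named race in the active dataset and hypothesized proportions of 25%, 50%, and 25% (the variable name and proportions are illustrative only):

    NPAR TESTS
      /CHISQUARE=race
      /EXPECTED=25 50 25.

The values on /EXPECTED are relative frequencies; SPSS divides each by their sum to obtain the hypothesized proportions {π_i0}.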

7 SPSS Output (complete list in the SPSS notes)
Analyze > Descriptive Statistics > Crosstabs, then click Statistics and select Chi-square.

The Likelihood Ratio statistic is a goodness-of-fit statistic similar to Pearson's chi-square; for large sample sizes the two are equivalent. The advantage of the likelihood-ratio chi-square is that it can be subdivided into interpretable parts that add up to the total. For smaller sample sizes it is the statistic to report, and since it approaches the Pearson statistic as n increases, it can be reported in either case.

Fisher's Exact Test is a test for independence in a 2x2 table. It is most useful when the total sample size and the expected cell counts are small. The test holds the marginal totals fixed and computes the hypergeometric probability that n_11 is at least as large as the observed value.
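The same crosstab can be produced in syntax. A minimal sketch, assuming two dichotomous variables named smoke and sex in the active dataset (the names are illustrative): CHISQ requests the Pearson and likelihood-ratio chi-square statistics, and for a 2x2 table SPSS adds Fisher's Exact Test automatically.

    CROSSTABS
      /TABLES=smoke BY sex
      /STATISTICS=CHISQ
      /CELLS=COUNT EXPECTED.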

8 The Fisher's Exact Test output consists of three p-values:
- Left: use this when the alternative to independence is a negative association between the variables; that is, the observations tend to lie in the lower left and upper right.
- Right: use this when the alternative to independence is a positive association between the variables; that is, the observations tend to lie in the upper left and lower right.
- 2-Tail: use this when there is no prior alternative.

Example, with X as rows and Y as columns:

              Y yes   Y no   Total
    X yes       3       7      10
    X no        5      10      15
    Total       8      17      25

    TABLE = [ 3, 7, 5, 10 ]
    Left:   p-value = 0.6069
    Right:  p-value = 0.7263
    2-Tail: p-value = 1.0000

9 Multiple Logistic Regression
The outcome Y is binary, and E(Y|x) = P(Y=1|x) = π(x), where

    π(x) = exp(β_0 + β_1x_1 + … + β_px_p) / [1 + exp(β_0 + β_1x_1 + … + β_px_p)]

The relationship between π(x) and x is S-shaped. The logit (log-odds) transformation, the link function,

    logit[π(x)] = ln[ π(x) / (1 − π(x)) ] = β_0 + β_1x_1 + … + β_px_p

has many of the desirable properties of the linear regression model while relaxing some of its assumptions. The maximum likelihood (ML) model parameters are estimated by iteration.

10 Assumptions for Logistic Regression
- The independent variables are linear in the logit. It is also possible to add explicit interaction and power terms, as in OLS regression.
- The dependent variable need not be normally distributed (it is assumed to follow a distribution in the exponential family, such as normal, Poisson, binomial, or gamma).
- The dependent variable need not be homoscedastic for each level of the independents; there is no homogeneity-of-variance assumption.
- Normally distributed error terms are not assumed.
- The independent variables may be binary, categorical, or continuous.

11 Applications
- Identify risk factors: test H_0: β = 0 for a given predictor while controlling for confounders and other important determinants of the event.
- Classification: predict the outcome for a new observation with a particular constellation of risk factors (a form of discriminant analysis).

12 Design Variables (coding)
In SPSS, designate a predictor as Categorical to get k−1 indicator (design) variables for a k-level factor (a syntax sketch follows the table). For example, RACE:

              D_1   D_2
    White      0     0
    Black      1     0
    Other      0     1
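A syntax sketch of a corresponding model, assuming the low birth weight variables used in the case study at the end of this session (LOW as the outcome; AGE, LWT, and SMOKE as predictors; RACE as a three-level factor). Indicator(1) makes the first category (White) the reference, matching the coding above:

    LOGISTIC REGRESSION VARIABLES low
      /METHOD=ENTER age lwt smoke race
      /CONTRAST (race)=Indicator(1)
      /PRINT=CI(95).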

13 Interpretation of the Parameters
If p is the probability of an event and O is the odds for that event, then

    O = p / (1 − p)

and the link function in logistic regression gives the log-odds:

    ln(O) = β_0 + β_1x_1 + … + β_px_p

14 …and the odds ratio, OR, comparing X=1 to X=0 is

    OR = odds(X=1) / odds(X=0) = [π(1) / (1 − π(1))] / [π(0) / (1 − π(0))]

summarizing this 2x2 layout:

             Y=1        Y=0
    X=1     π(1)     1 − π(1)
    X=0     π(0)     1 − π(0)
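Substituting the logistic model into this definition shows why the coefficients are easy to interpret: for a single binary X, the odds at x are π(x)/(1 − π(x)) = e^{β_0 + β_1 x}, so

\[
\mathrm{OR} \;=\; \frac{e^{\beta_0 + \beta_1 \cdot 1}}{e^{\beta_0 + \beta_1 \cdot 0}} \;=\; e^{\beta_1}
\]

Thus Exp(B) in SPSS output is the odds ratio; for example, β_1 = 0.693 gives OR ≈ 2.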

15 Definitions and Annotated SPSS Output for Logistic Regression
"Virtually any sin that can be committed with least squares regression can be committed with logistic regression. These include stepwise procedures and arriving at a final model by looking at the data. All of the warnings and recommendations made for least squares regression apply to logistic regression as well..." (Gerard Dallal)

16 Assessing the Model Fit
There are several R²-like measures; they are not goodness-of-fit tests but rather attempts to measure strength of association (the defining formulas are sketched below).
- Cox and Snell's R-square imitates the interpretation of multiple R-square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. It is part of SPSS output.
- Nagelkerke's R-square is a further modification of the Cox and Snell coefficient to ensure that it can vary from 0 to 1: it divides Cox and Snell's R-square by its maximum. It will therefore normally be higher than the Cox and Snell measure. It is part of SPSS output and is the most reported of the R-square estimates. See Nagelkerke (1991).
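In the notation of Nagelkerke (1991), with L(0) the likelihood of the intercept-only model, L(β̂) the likelihood of the fitted model, and n the sample size:

\[
R^2_{CS} \;=\; 1 - \left(\frac{L(0)}{L(\hat\beta)}\right)^{2/n},
\qquad
R^2_{N} \;=\; \frac{R^2_{CS}}{\,1 - L(0)^{2/n}\,}
\]

The denominator of R²_N is the maximum attainable value of R²_CS, which is why Nagelkerke's measure can reach 1 while Cox and Snell's cannot.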

17 Hosmer and Lemeshow's Goodness-of-Fit Test tests the null hypothesis that the data were generated by the fitted model:
1. Divide subjects into deciles based on their predicted probabilities.
2. Compute a chi-square statistic from the observed and expected frequencies in those groups.
3. Compute a p-value from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model.
If the Hosmer and Lemeshow statistic has p = .05 or less, we reject the null hypothesis that there is no difference between the observed and model-predicted values of the dependent variable. (This means the model predicts values significantly different from the observed values, i.e., the model fits poorly.)
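In SPSS syntax the test is requested with the GOODFIT keyword on the PRINT subcommand; a sketch continuing the earlier model:

    LOGISTIC REGRESSION VARIABLES low
      /METHOD=ENTER age lwt smoke race
      /CONTRAST (race)=Indicator(1)
      /PRINT=GOODFIT.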

18 [Figure: Observed vs. Predicted, plotting observed against expected frequencies.] This particular model performs better when the event rate is low.

19 Check for Linearity in the Logit
- Box-Tidwell transformation (test): add to the logistic model interaction terms that are the cross-product of each continuous independent with its natural logarithm, [X ln(X)] (see the syntax sketch after this list). If these terms are significant, there is nonlinearity in the logit. This method is not sensitive to small nonlinearities.
- Orthogonal polynomial contrasts, an option in SPSS, may be used. This option treats each independent as a categorical variable and computes logit (effect) coefficients for each category, testing for linear, quadratic, cubic, or higher-order effects. The logit should not change over the contrasts. This method is not appropriate when the independent has a large number of values, which inflates the standard errors of the contrasts.
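A sketch of the Box-Tidwell check in SPSS syntax, assuming a positive-valued continuous predictor lwt from the case-study data (any positive continuous predictor is handled the same way):

    COMPUTE lwt_ln = lwt * LN(lwt).
    EXECUTE.
    LOGISTIC REGRESSION VARIABLES low
      /METHOD=ENTER lwt lwt_ln.

A significant coefficient for lwt_ln is evidence of nonlinearity in the logit for lwt.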

20 Residual Plots
Plot Cook's distance against the predicted probability (a syntax sketch follows). Several other plots suggested in Hosmer & Lemeshow (p. 177) involve further manipulation of the statistics produced by SPSS.

External validation:
- a new sample
- a hold-out sample

Cross-validation (classification):
- n-fold (leave one out)
- V-fold (divide the data into V subsets)
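One way to build this plot in SPSS, as a sketch: save the predicted probabilities and Cook's distances (SPSS names the saved variables PRE_1 and COO_1 by default) and plot one against the other.

    LOGISTIC REGRESSION VARIABLES low
      /METHOD=ENTER age lwt smoke race
      /CONTRAST (race)=Indicator(1)
      /SAVE=PRED COOK.
    GRAPH
      /SCATTERPLOT(BIVAR)=PRE_1 WITH COO_1.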

21 Pitfalls
1. Multiple comparisons (data-driven models, data dredging).
2. Overfitting: complex models fit to a small dataset may fit well in THIS dataset but not generalize; you are modeling the random error.
   - Use at least 10 events per independent variable.
   - Validate: new data to check predictive ability and calibration, or a hold-out sample.
   - Look for sensitivity to a single observation (residuals).
3. Violating the assumptions: more serious in prediction models than in models of association.
4. There are many strategies: don't try them all. Choose one based on the structure of the question, draw primary conclusions based on that one, and examine robustness to the other strategies.

22 Case Study
1. Develop a strategy for analyzing Hosmer & Lemeshow's low birth weight data using LOW as the dependent variable.
2. Try ANCOVA for the same data with BWT (birth weight in grams) as the dependent variable (a syntax sketch follows).
LBW.SAV is on the S drive under GCRC data analysis.
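For step 2, a minimal ANCOVA sketch in SPSS syntax, assuming SMOKE and RACE are treated as factors and AGE and LWT as covariates (other modeling choices are equally defensible):

    UNIANOVA bwt BY smoke race WITH age lwt
      /PRINT=PARAMETER
      /DESIGN=smoke race age lwt.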

23 References
Hosmer, D.W., and Lemeshow, S. (2000). Applied Logistic Regression, 2nd ed. John Wiley & Sons, New York, NY.
Harrell, F.E., Lee, K.L., and Mark, D.B. (1996). "Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors." Statistics in Medicine, 15.
Nagelkerke, N.J.D. (1991). "A Note on a General Definition of the Coefficient of Determination." Biometrika, 78(3). Covers the two measures of R-square for logistic regression found in SPSS output.
Agresti, A. (1990). Categorical Data Analysis. John Wiley & Sons, New York, NY.