1. Analyzing Non-Normal Data with Generalized Linear Models 2010 LISA Short Course Series Sai Wang, Dept. of Statistics.



Presentation Outline
1. Introduction to Generalized Linear Models
2. Binary Response Data - Logistic Regression Model (Ex. Teaching Method)
3. Count Response Data - Poisson Regression Model (Ex. Mining Example)
4. Non-parametric Tests

Normal: continuous, symmetric, mean μ and variance σ²
Bernoulli: 0 or 1, mean p and variance p(1 − p); a special case of the Binomial
Poisson: non-negative integers 0, 1, 2, …, mean λ and variance λ; # of events in a fixed time interval

Generalized Linear Models
Generalized linear models (GLMs) extend ordinary regression to non-normal response distributions.
The response distribution must come from the exponential family of distributions, which includes the Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
3 Components:
Random – identifies the response Y and its probability distribution
Systematic – the explanatory variables in a linear predictor function (Xβ)
Link function – an invertible function g(·) that links the mean of the response, E[Y_i] = μ_i, to the systematic component
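The three components above can be sketched in a few lines of code. This is a minimal illustration (not from the course materials), using the logit link as the running example; the x and β values are made up:

```python
import math

def linear_predictor(x, beta):
    """Systematic component: eta = x^T beta."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def logit(mu):
    """Logit link for a Bernoulli mean, 0 < mu < 1."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real eta back to a mean in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Random component: Y_i ~ Bernoulli(mu_i); link: g(mu_i) = x_i^T beta.
eta = linear_predictor([1.0, 2.0], [0.5, -0.25])  # hypothetical x and beta
mu = inv_logit(eta)                               # mean implied by the model
```

Because the link is invertible, the mean is recovered as μ = g⁻¹(xᵀβ).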

Generalized Linear Models
Model: g(μ_i) = β_0 + β_1 x_i1 + … + β_k x_ik, for i = 1 to n, where n is the # of observations, and j = 1 to k, where k is the # of predictors.
Equivalently, μ_i = g⁻¹(β_0 + β_1 x_i1 + … + β_k x_ik).

Generalized Linear Models
Why do we use GLMs?
Linear regression assumes that the response is distributed normally.
GLMs allow us to analyze the linear relationship between predictor variables and the mean of the response variable when it is not reasonable to assume the data are normally distributed.

Generalized Linear Models
Connection between GLMs and multiple linear regression:
Multiple linear regression is a special case of the GLM.
The response is normally distributed with variance σ².
Identity link function: μ_i = g(μ_i) = x_iᵀβ.

Generalized Linear Models
Predictor Variables
Two types: continuous and categorical.
Continuous predictor variables – examples: time, grade point average, test score, etc. Coded with one parameter: β_j x_j.
Categorical predictor variables – examples: sex, political affiliation, marital status, etc. The actual value assigned to a category is not important (e.g. Sex – Male/Female, M/F, 1/2, 0/1, etc.). Coded differently than continuous variables.

Generalized Linear Models
Predictor Variables cont.
Consider a categorical predictor variable with L categories.
One category is selected as the reference category; the assignment is arbitrary, though some suggest assigning the category with the most observations.
The variable is represented by L − 1 dummy variables, which keeps the model identifiable.

Generalized Linear Models
Predictor Variables cont.
Two types of coding:
Dummy coding (used in R): x_k = 1 if the predictor variable is equal to category k, 0 otherwise; x_k = 0 for all k if the observation is in the reference category.
Effect coding (used in JMP): x_k = 1 if the predictor variable is equal to category k, 0 otherwise; x_k = −1 for all k if the observation is in the reference category.
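The two coding schemes above can be contrasted in a short sketch; the function names and the "year in college" levels are mine, not from the slides:

```python
def dummy_code(category, levels, reference):
    """Dummy coding (R-style): L-1 indicators; the reference row is all 0s."""
    return [1 if category == lev else 0 for lev in levels if lev != reference]

def effect_code(category, levels, reference):
    """Effect coding (JMP-style): like dummy coding, except the reference
    row is all -1s, so each effect sums to zero across the levels."""
    if category == reference:
        return [-1] * (len(levels) - 1)
    return [1 if category == lev else 0 for lev in levels if lev != reference]

levels = ["Freshman", "Sophomore", "Junior", "Senior"]
d = dummy_code("Junior", levels, reference="Senior")    # [0, 0, 1]
e = effect_code("Senior", levels, reference="Senior")   # [-1, -1, -1]
```

Either way, L = 4 categories are carried by L − 1 = 3 columns, which is what keeps the model identifiable.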

Generalized Linear Models
Model Evaluation: −2 Log Likelihood
The likelihood is specified by the random component of the GLM.
For independent observations, the likelihood is the product of the probability distribution functions of the observations.
−2 log likelihood is −2 times the log of the likelihood function; it is used because of its distributional properties – chi-square.

Generalized Linear Models
Saturated Model (Perfect-Fit Model)
Contains a separate indicator parameter for each observation, giving a perfect fit: μ_i = y_i.
Not useful, since there is no data reduction, i.e. the number of parameters equals the number of observations.
It attains the maximum achievable log likelihood (minimum −2 log L), which serves as a baseline for comparison to other model fits.

Generalized Linear Models
Deviance
Let L(β|y) = maximum of the log likelihood for a proposed model, and L(y|y) = maximum of the log likelihood for the saturated model.
Deviance: D(β) = −2 [L(β|y) − L(y|y)]
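The definition above can be checked numerically for binary data, where the saturated model sets μ_i = y_i and so has log likelihood exactly 0. The y and fitted-probability values below are invented for illustration:

```python
import math

def bernoulli_loglik(y, p):
    """Log likelihood of independent Bernoulli observations."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def deviance(loglik_model, loglik_saturated):
    """D(beta) = -2 [ L(beta|y) - L(y|y) ]; zero only for a perfect fit."""
    return -2.0 * (loglik_model - loglik_saturated)

y = [1, 0, 1, 1]
p_hat = [0.8, 0.3, 0.6, 0.7]   # fitted probabilities from some model
D = deviance(bernoulli_loglik(y, p_hat), 0.0)   # positive: fit is imperfect
```

A model that reproduced the data exactly would give D = 0; larger D means a worse fit relative to the saturated baseline.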

Generalized Linear Models
Deviance cont. – Model Chi-Square (formula slide; equation not captured in the transcript)

Generalized Linear Models
Deviance cont.
Lack-of-fit test: the deviance is the likelihood ratio statistic for testing the null hypothesis that the proposed model fits the data as well as the saturated model.
It has an asymptotic chi-squared distribution with N − p degrees of freedom, where p is the number of parameters in the model.
It also allows for the comparison of one model to another using the likelihood ratio test.

Generalized Linear Models
Nested Models
Model 1 – model with p predictor variables {X_1, X_2, …, X_p} and vector of fitted values μ_1.
Model 2 – model with q < p predictor variables {X_1, X_2, …, X_q} and vector of fitted values μ_2.
Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1, i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1.

Generalized Linear Models
Nested Models
Model 2 is a special case of Model 1 in which all the coefficients corresponding to X_{q+1}, X_{q+2}, X_{q+3}, …, X_p are equal to zero.

Generalized Linear Models
Likelihood Ratio Test
Null hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
Alternative hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.

Generalized Linear Models
Likelihood Ratio Test
Likelihood ratio statistic = −2L(y, μ_2) − (−2L(y, μ_1)) = D(y, μ_2) − D(y, μ_1), the difference of the deviances of the two models.
Always D(y, μ_2) ≥ D(y, μ_1), which implies LRT ≥ 0.
The LRT is distributed chi-squared with p − q degrees of freedom.
Later, the likelihood ratio test will be used to test the significance of variables in logistic and Poisson regression models.
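The LRT bookkeeping is simple enough to sketch directly; the deviance values and parameter counts below are hypothetical, not from any fitted model:

```python
def likelihood_ratio_test(dev_reduced, dev_full, p_full, p_reduced):
    """Return (LRT statistic, degrees of freedom) for nested models.
    The statistic is compared against a chi-squared(p - q) critical value."""
    lrt = dev_reduced - dev_full   # D(y, mu_2) - D(y, mu_1), always >= 0
    df = p_full - p_reduced        # number of coefficients set to zero in H0
    return lrt, df

lrt, df = likelihood_ratio_test(dev_reduced=42.7, dev_full=36.1,
                                p_full=6, p_reduced=3)
# lrt is about 6.6 on 3 df; reject H0 if it exceeds the chi-squared cutoff
```

The reduced (nested) model always has the larger deviance, so the statistic is non-negative by construction.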

Generalized Linear Models
Theoretical Example of the Likelihood Ratio Test
3 predictor variables: 1 continuous (X_1: GPA), 1 categorical with 4 categories (X_2, X_3, X_4: year in college), 1 categorical with 2 categories (X_5: sex).
Model 1 – predictor variables {X_1, X_2, X_3, X_4, X_5}.
Model 2 – predictor variables {X_1, X_5}.
Null hypothesis: the variable with 4 categories is not significant to the model (β_2 = β_3 = β_4 = 0).
Alternative hypothesis: the variable with 4 categories is significant.

Generalized Linear Models
Theoretical Example of the Likelihood Ratio Test cont.
Likelihood ratio test statistic = D(y, μ_2) − D(y, μ_1), the difference of the deviance statistics from the two models; equivalently, the difference of the −2 log L values from the two models.
Chi-squared distribution with 5 − 2 = 3 degrees of freedom.

Generalized Linear Models
Model Comparison – Determining Model Fit cont.
Akaike Information Criterion (AIC) – penalizes a model for having many parameters: AIC = −2 log L + 2p, where p is the number of parameters in the model; smaller is better.
Bayesian Information Criterion (BIC) – BIC = −2 log L + ln(n)·p, where p is the number of parameters in the model and n is the number of observations; usually a stronger penalization for additional parameters than AIC.
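Both criteria are one-liners; the −2 log L, p, and n values below are made up for illustration:

```python
import math

def aic(neg2_loglik, p):
    """AIC = -2 log L + 2p; smaller is better."""
    return neg2_loglik + 2 * p

def bic(neg2_loglik, p, n):
    """BIC = -2 log L + ln(n) * p; penalizes each extra parameter more
    than AIC whenever ln(n) > 2, i.e. once n exceeds about 8 observations."""
    return neg2_loglik + math.log(n) * p

a = aic(100.0, p=3)          # 106.0
b = bic(100.0, p=3, n=50)    # 100 + 3 ln(50), larger than the AIC here
```

Since ln(50) ≈ 3.9 > 2, the BIC penalty per parameter is almost twice the AIC penalty in this example.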

Generalized Linear Models
Summary
Setup of the generalized linear model
Continuous and categorical predictor variables
Log likelihood
Deviance and likelihood ratio test – test lack of fit of the model; test the significance of a predictor variable or set of predictor variables in the model
Model comparison

Generalized Linear Models Questions/Comments

Logistic Regression
Consider a binary response variable: a variable with two outcomes, one represented by a 1 and the other by a 0.
Examples: Does the person have a disease? (yes or no) Outcome of a baseball game? (win or loss)

Logistic Regression
Teaching Method Data Set
Found in Aldrich and Nelson (Sage Publications, 1984).
A researcher would like to examine the effect of a new teaching method – Personalized System of Instruction (PSI).
The response variable is whether the student received an A in a statistics class (1 = yes, 0 = no).
Other data collected: the GPA of the student and the score on a test of entering knowledge of statistics (TUCE).

Logistic Regression
Consider the linear probability model p_i = E[y_i] = x_iᵀβ, where
y_i = response for observation i,
x_i = 1 × p vector of covariates for observation i,
p = 1 + k, the number of parameters.

Logistic Regression
This is a GLM with a binomial random component and identity link g(μ) = μ.
Issue: p_i can take on values less than 0 or greater than 1, so the predicted probability for some subjects falls outside of the [0, 1] range.

Logistic Regression
Consider the logistic regression model logit(p_i) = log(p_i / (1 − p_i)) = x_iᵀβ.
This is a GLM with a binomial random component and logit link g(μ) = logit(μ).
The range of values for p_i is 0 to 1.

Logistic Regression
Interpretation of Coefficient β – Odds Ratio
Odds: the ratio of Prob(event) = p to Prob(not event) = 1 − p, i.e. odds = p / (1 − p).
The odds ratio is a statistic that measures the odds of one event relative to the odds of another.
Ex. Say the probability of Event 1 is p_1 and the probability of Event 2 is p_2. Then the odds ratio of Event 2 to Event 1 is OR = [p_2 / (1 − p_2)] / [p_1 / (1 − p_1)].

Logistic Regression
Interpretation of Coefficient β – Odds Ratio cont. (formula slide; equation not captured in the transcript)

Logistic Regression
Interpretation of Coefficient β – Odds Ratio cont.
The value of the odds ratio ranges from 0 to infinity.
A value between 0 and 1 indicates the odds of Event 1 are greater.
A value between 1 and infinity indicates the odds of Event 2 are greater.
A value equal to 1 indicates the events are equally likely.

Logistic Regression
Interpretation of Coefficient β – Odds Ratio cont.
Link to logistic regression: under the logit model, log(p / (1 − p)) = xᵀβ, so the odds are p / (1 − p) = exp(xᵀβ). Thus the odds ratio of Event 2 to Event 1 is [p_2 / (1 − p_2)] / [p_1 / (1 − p_1)] = exp(x_2ᵀβ − x_1ᵀβ).
Note: one should take caution when interpreting parameter estimates – multicollinearity can change the sign, size, and significance of parameters.

Logistic Regression
Interpretation of Coefficient β – Odds Ratio cont.
Consider Event 1: Y = 1 given X (prob = p_1), and Event 2: Y = 1 given X + 1 (prob = p_2).
From our logistic regression model, the odds ratio of Y = 1 per unit increase in X is [p_2 / (1 − p_2)] / [p_1 / (1 − p_1)] = exp(β).
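The identity above can be verified numerically. The intercept and slope here are hypothetical, not estimates from the teaching-method data:

```python
import math

def prob(x, b0, b1):
    """P(Y = 1 | x) under the logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(p):
    return p / (1 - p)

b0, b1 = -1.0, 0.8   # made-up coefficients
# The odds ratio comparing X + 1 to X equals exp(b1), regardless of X.
or_at_2 = odds(prob(3.0, b0, b1)) / odds(prob(2.0, b0, b1))
or_at_5 = odds(prob(6.0, b0, b1)) / odds(prob(5.0, b0, b1))
# both ratios equal exp(0.8), no matter where on the X scale you start
```

This is why a single number, exp(β), summarizes the effect of a continuous predictor on the odds.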

Logistic Regression
Interpretation for a Continuous Predictor Variable
Consider the JMP parameter-estimates output (columns Term, Estimate, Std Error, L-R ChiSquare, Prob>ChiSq, Lower CL, Upper CL for Intercept, GPA, TUCE, PSI[0]; the numeric estimates did not survive in this transcript).
Interpretation of the parameter estimate: exp(β̂_GPA) is the odds ratio between the odds at x + 1 and the odds at x for any GPA score. The ratio of the odds of getting an A between a person with a 3.0 GPA and a person with a 2.0 GPA equals exp(β̂_GPA); in other words, the odds for the person with the 3.0 are exp(β̂_GPA) times the odds for the person with the 2.0. Equivalently, the odds of NOT getting an A for a person with a 3.0 GPA are 1/exp(β̂_GPA) times the odds of NOT getting an A for a person with a 2.0 GPA.

Logistic Regression
Single Categorical Predictor Variable
Consider the JMP parameter-estimates output (columns Term, Estimate, Std Error, L-R ChiSquare, Prob>ChiSq, Lower CL, Upper CL for Intercept, GPA, TUCE, PSI[0]; most numeric estimates did not survive in this transcript).
Interpretation of the parameter estimate: with effect coding, exp(2·β̂_PSI[0]) = 0.0928 is the odds ratio between the odds of getting an A for a student who was not subject to the teaching method and for a student who was. Equivalently, the odds of NOT getting an A without the teaching method are 1/0.0928 ≈ 10.78 times the odds of NOT getting an A with the teaching method.

Logistic Regression
ROC Curve (Receiver Operating Characteristic Curve)
Sensitivity – the proportion of positive cases (Y = 1) that were classified as positive by the model.
Specificity – the proportion of negative cases (Y = 0) that were classified as negative by the model.

Logistic Regression
ROC Curve cont.
Cutoff value – a selected probability such that all cases whose predicted probabilities are above the cutoff are classified as positive (Y = 1) and all cases whose predicted probabilities are below the cutoff are classified as negative (Y = 0). A 0.5 cutoff is commonly used.
ROC curve – a plot of the sensitivity versus one minus the specificity for various cutoff values, with false positives (1 − specificity) on the x-axis and true positives (sensitivity) on the y-axis.
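The cutoff-classification step can be sketched directly; the y labels and predicted probabilities below are invented toy values:

```python
def classify(probs, cutoff=0.5):
    """Label a case positive when its predicted probability meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

def sensitivity(y, yhat):
    """Proportion of actual positives (Y = 1) classified as positive."""
    pairs = [(yi, pi) for yi, pi in zip(y, yhat) if yi == 1]
    return sum(pi for _, pi in pairs) / len(pairs)

def specificity(y, yhat):
    """Proportion of actual negatives (Y = 0) classified as negative."""
    pairs = [(yi, pi) for yi, pi in zip(y, yhat) if yi == 0]
    return sum(1 - pi for _, pi in pairs) / len(pairs)

y     = [1, 1, 0, 0, 1]
probs = [0.9, 0.4, 0.2, 0.6, 0.8]
yhat = classify(probs)          # 0.5 cutoff
sens = sensitivity(y, yhat)     # 2/3: one positive fell below the cutoff
spec = specificity(y, yhat)     # 1/2: one negative fell above it
```

Sweeping the cutoff from 0 to 1 and plotting (1 − specificity, sensitivity) at each value traces out the ROC curve.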

Logistic Regression
ROC Curve cont.
Measure the area under the ROC curve:
Poor fit – area under the ROC curve approximately equal to 0.5.
Good fit – area under the ROC curve close to 1.

Logistic Regression
Summary
Introduction to the logistic regression model
Interpretation of the parameter estimates β – odds ratio
ROC curves
Teaching method example

Logistic Regression Questions/Comments

Poisson Regression
Consider a count response variable: the number of occurrences in a given time frame, with outcomes equal to 0, 1, 2, ….
Examples: the number of penalties during a football game; the number of customers who shop at a store on a given day; the number of car accidents at an intersection.

Poisson Regression
Mining Data Set
Found in Myers (1990).
The response of interest is the number of fractures that occur in upper-seam mines in the coal fields of the Appalachian region of western Virginia.
We want to determine whether the number of fractures is a function of the material in the land and the mining area.
Four possible predictors: inner burden thickness; percent extraction of the lower, previously mined seam; lower seam height; years the mine has been open.

Poisson Regression
Mining Data Set cont. (figure: coal mine seam)

Poisson Regression
Mining Data Set cont. (figure: coal mine upper and lower seams)
Prevalence of overburden fracturing may lead to collapse.

Poisson Regression
Consider the model μ_i = x_iᵀβ, where
Y_i = response for observation i,
x_i = 1×(k+1) vector of covariates for observation i,
p = number of covariates,
μ_i = expected number of events given x_i.
This is a GLM with a Normal random component and identity link g(μ) = μ.
Issue: predicted values range from −∞ to +∞.

Poisson Regression
Consider the Poisson log-linear model log(μ_i) = x_iᵀβ.
This is a GLM with a Poisson random component and log link g(μ) = log(μ).
Predicted response values fall between 0 and +∞.
In the case of a single predictor, an increase of one unit in x multiplies μ by exp(β_1).
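The multiplicative-effect claim above can be checked numerically; the β values are made up, not the mining-data estimates:

```python
import math

def poisson_mean(x, beta):
    """mu = exp(x^T beta): always positive, as a count mean must be."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

beta = [0.5, -0.03]   # hypothetical intercept and slope
ratio = poisson_mean([1, 11], beta) / poisson_mean([1, 10], beta)
# the ratio equals exp(-0.03): each extra unit of x scales the expected
# count by the same multiplicative factor, wherever you start on the x scale
```

The log link also guarantees the fitted mean is strictly positive, which the identity link cannot.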

Poisson Regression
Continuous Predictor Variable
Consider the JMP parameter-estimates output (columns Term, Estimate, Std Error, L-R ChiSquare, Prob>ChiSq, Lower CL, Upper CL for Intercept, Thickness, Pct_Extraction, Height, Age; most numeric estimates did not survive in this transcript).
Interpretation of the parameter estimate: exp(β̂_Age) = 0.9697 is the multiplicative effect on the expected number of fractures for an increase of 1 in the years the mine has been open.

Poisson Regression
Overdispersion in Poisson Regression Models
Overdispersion: more variability in the response than the model allows. For Y_i ~ Poisson(λ_i), E[Y_i] = Var[Y_i] = λ_i; under overdispersion, the variance of the response is much larger than the mean.
Consequences: parameter estimates are still consistent, but standard errors are inconsistent.
Detection: D(β)/(n − p) is large if overdispersion is present.
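The detection rule above is a one-line computation; the deviance, n, and p values here are hypothetical, not the mining-data fit:

```python
def dispersion_statistic(deviance, n, p):
    """D(beta)/(n - p); values well above 1 suggest overdispersion."""
    return deviance / (n - p)

phi = dispersion_statistic(deviance=86.9, n=44, p=5)
# phi is well above 1 here, so overdispersion would be suspected
```

When the model is adequate, the deviance is roughly chi-squared with n − p degrees of freedom, so this ratio should be near 1.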

Poisson Regression
Overdispersion cont. – Remedies
1. Change the linear predictor Xᵀβ – add or subtract regressors, transform regressors, add interaction terms, etc.
2. Change the link function g(Xᵀβ).
3. Change the random component – use the Negative Binomial distribution.

Poisson Regression
Summary
Introduction to the Poisson regression model
Interpretation of β
Overdispersion
Mining example

Poisson Regression Questions/Comments

Non-parametric Tests
Mann–Whitney U test (Wilcoxon rank-sum test)
An alternative to the 2-sample t-test for comparing measurements in two samples of independent observations, used when the measurement is not interval or the distribution is unclear.
Rather than using the original values, the test statistic is based on ranks.
Pros: no normality assumption; robust to outliers.
Cons: less powerful than the t-test if normality holds.
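The rank-based statistic above can be built from scratch in a few lines; the two samples are invented, and the helper names are mine:

```python
def midranks(values):
    """Ranks 1..N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                         # extend over a run of ties
        avg = (i + j) / 2 + 1              # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def mann_whitney_u(sample1, sample2):
    """U for sample1: its rank sum minus the minimum possible rank sum."""
    n1 = len(sample1)
    r = midranks(list(sample1) + list(sample2))
    return sum(r[:n1]) - n1 * (n1 + 1) / 2

u = mann_whitney_u([1.2, 3.4, 5.6], [2.1, 4.3])   # U ranges from 0 to n1*n2
```

Only the ordering of the pooled values matters, which is what makes the test robust to outliers and free of the normality assumption.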

Non-parametric Tests
Kruskal–Wallis test
An alternative to ANOVA for comparing measurements in more than 2 samples of independent observations; it is an extension of the Mann–Whitney U test to 3 or more groups.
If the Kruskal–Wallis test is significant, perform pairwise multiple comparisons using the Mann–Whitney U test with an adjusted significance level.
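The Kruskal–Wallis statistic is a rank-based between-group sum of squares. A self-contained sketch (no tie correction, and assuming no ties for brevity; the groups are invented):

```python
def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum_j n_j (Rbar_j - (N+1)/2)^2 over the groups."""
    data = [v for g in groups for v in g]
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    rank = [0] * n
    for pos, idx in enumerate(order, start=1):   # plain ranks, no ties handled
        rank[idx] = pos
    h, start = 0.0, 0
    for g in groups:
        rbar = sum(rank[start:start + len(g)]) / len(g)
        h += len(g) * (rbar - (n + 1) / 2) ** 2   # group mean rank vs overall
        start += len(g)
    return 12 * h / (n * (n + 1))

# Fully separated groups give a large H; interleaved groups give a small one.
h = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])
```

H is compared against a chi-squared distribution with (number of groups − 1) degrees of freedom.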

Non-parametric Models
The objective is to find an unknown non-linear relationship between a pair of random variables X and Y.
These differ from parametric models in that the model structure is not specified a priori but is instead determined from the data.
'Non-parametric' does not imply an absolute absence of parameters.

Non-parametric Models
Ex. Kernel Regression
Estimation based on a localized weighted average.
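The localized weighted average can be sketched as a Nadaraya–Watson estimator with a Gaussian kernel; the data and bandwidth below are invented:

```python
import math

def kernel_regression(x0, xs, ys, bandwidth):
    """Weighted average of ys, with weights decaying in distance from x0."""
    w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2]
fit = kernel_regression(2.0, xs, ys, bandwidth=0.5)
# the estimate near x0 = 2 is dominated by the nearby observation y = 2.1
```

No functional form for the X–Y relationship is assumed; the bandwidth controls how local the averaging is, playing the role that parameters play in a parametric model.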

Non-parametric Models
These are brief introductions – perhaps a future LISA short course on non-parametric methods?