
EIPB 698E Lecture 10 Raul Cruz-Cano Fall 2013

Comments for future evaluations
– Include only the output used for your conclusions
– Mention p-values explicitly (the Equal Variance p-value is also necessary) and state conclusions clearly
– Everybody can attend next week's review

Proc Reg
The REG procedure is one of many regression procedures in the SAS System. Its basic syntax:

PROC REG options;
  MODEL dependents = regressors;
  BY variables;
  OUTPUT OUT=dataset keyword=names;
RUN;

data blood;
  INFILE 'F:\blood.txt';
  INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

data blood1;
  set blood;
  if gender='Female' then sex=1; else sex=0;
  if bloodtype='A' then typeA=1; else typeA=0;
  if bloodtype='B' then typeB=1; else typeB=0;
  if bloodtype='AB' then typeAB=1; else typeAB=0;
  if age_group='Old' then Age_old=1; else Age_old=0;
run;

proc reg data=blood1;
  model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
run;
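Since a SAS comparison expression evaluates to 1 (true) or 0 (false), the dummy variables above can also be created without if/then blocks. A minimal equivalent sketch, using the same variable names as above:

data blood1;
  set blood;
  /* each comparison returns 1 if true, 0 if false */
  sex     = (gender='Female');
  typeA   = (bloodtype='A');
  typeB   = (bloodtype='B');
  typeAB  = (bloodtype='AB');
  Age_old = (age_group='Old');
run;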

Proc reg output

Analysis of Variance
Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                7
Error              655
Corrected Total    662

DF - These are the degrees of freedom associated with the sources of variance. (1) The total variance has N-1 degrees of freedom (663-1=662). (2) The model degrees of freedom equals the number of parameters minus 1: including the intercept there are 8 parameters, so the model has 8-1=7 degrees of freedom. (3) The residual degrees of freedom is the total DF minus the model DF: 662-7=655.

Proc reg output

Analysis of Variance (table as above)

Sum of Squares - the variability associated with the three sources of variance: total, model, and residual.
SSTotal: the total variability around the mean, Σ(Y - Ȳ)².
SSResidual: the sum of squared errors in prediction, Σ(Y - Ŷ)².
SSModel: the improvement in prediction from using the predicted value of Y over just using the mean of Y; that is, the squared differences between the predicted values of Y and the mean of Y, Σ(Ŷ - Ȳ)².
Note that SSTotal = SSModel + SSResidual, and that SSModel / SSTotal equals R-Square, the proportion of the variance explained by the independent variables.
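As a quick check of the identities SSTotal = SSModel + SSResidual and R-Square = SSModel/SSTotal, the pieces can be computed directly in a data step; the sums of squares below are hypothetical placeholders for the values in the ANOVA table:

data rsq_check;
  SSModel = 120000;               /* hypothetical value, for illustration */
  SSError = 300000;               /* hypothetical value, for illustration */
  SSTotal = SSModel + SSError;    /* total = model + residual */
  RSquare = SSModel / SSTotal;    /* proportion of variance explained */
  put SSTotal= RSquare=;
run;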

Proc reg output

Analysis of Variance (table as above)

Mean Square - These are the Mean Squares: the Sum of Squares divided by the corresponding DF. They are computed so that the F ratio can be formed: the Mean Square Model divided by the Mean Square Residual tests the significance of the predictors in the model.

Proc reg output

Analysis of Variance (table as above)

F Value and Pr > F - The F value is the Mean Square Model divided by the Mean Square Residual. Together, the F value and its p-value answer the question "Do the independent variables predict the dependent variable?" The p-value is compared to your alpha level (typically 0.05); if it is smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable." Note that this is an overall significance test assessing whether the group of independent variables, used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable.
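The Pr > F column can be reproduced from the F statistic and the two degrees of freedom with the PROBF function; a sketch using a hypothetical F value (the DFs 7 and 655 come from the table above):

data f_pvalue;
  F   = 5.0;                      /* hypothetical F value, for illustration */
  dfm = 7;                        /* model degrees of freedom */
  dfe = 655;                      /* residual degrees of freedom */
  p   = 1 - probf(F, dfm, dfe);   /* upper-tail area = Pr > F */
  put p=;
run;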

Proc reg output

Root MSE            48.2       R-Square
Dependent Mean    201.69       Adj R-Sq
Coeff Var          23.90

Root MSE - the standard deviation of the error term, equal to the square root of the Mean Square Residual (or Error).

Proc reg output

(fit statistics as above)

Dependent Mean - the mean of the dependent variable.
Coeff Var - the coefficient of variation, a unit-less measure of variation in the data: the Root MSE divided by the mean of the dependent variable, multiplied by 100 (100*(48.2/201.69) = 23.90).
R-Square and Adj R-Sq - how much of the variability in the dependent variable is explained by the model.

Proc reg output

Parameter Estimates
                    Parameter    Standard
Variable      DF    Estimate     Error        t Value    Pr > |t|
Intercept      1                                          <.0001
sex            1
typeA          1
typeB          1
typeAB         1
Age_old        1
RBC            1
WBC            1

t Value and Pr > |t| - These columns provide the t value and the two-tailed p-value used to test the null hypothesis that the coefficient/parameter is 0.
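If you also want 95% confidence limits for each coefficient alongside these t tests, the MODEL statement in PROC REG accepts the CLB option; a sketch reusing the model above:

proc reg data=blood1;
  /* CLB adds 95% confidence limits for the parameter estimates */
  model cholesterol = sex typeA typeB typeAB Age_old RBC WBC / clb;
run;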

Logistic regression
For binary response models, the response Y of an individual or an experimental unit can take on one of two possible values, denoted for convenience by 1 and 0 (for example, Y=1 if a disease is present, otherwise Y=0). Suppose x is a vector of explanatory variables and p = P(Y=1) is the response probability to be modeled. The logistic regression model has the form

logit(p) = log(p / (1 - p)) = β0 + β1x
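Inverting the logit gives the modeled probability, P(Y=1) = exp(β0 + β1x) / (1 + exp(β0 + β1x)). A minimal sketch with hypothetical coefficient and covariate values:

data logit_demo;
  b0 = -2.0;                         /* hypothetical intercept */
  b1 = 0.05;                         /* hypothetical slope */
  x  = 40;                           /* hypothetical covariate value */
  logit = b0 + b1*x;                 /* log-odds */
  p = exp(logit) / (1 + exp(logit)); /* back-transform to a probability */
  put p=;
run;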

Proc logistic
The following statements are available in PROC LOGISTIC:

PROC LOGISTIC options;
  BY variables;
  CLASS variable;
  MODEL response = predictors;
  MODEL events/trials = predictors;
  OUTPUT OUT=dataset keyword=names;

The PROC LOGISTIC and MODEL statements are required; only one MODEL statement can be specified. The CLASS statement (if used) must precede the MODEL statement.

High school data
The data were collected on 200 high school students, with measurements on various tests, including science, math, reading, and social studies. The response variable is high writing test score (high_write): a writing score greater than or equal to 60 is considered high, and less than 60 is considered low. We explore its relationship with gender, reading test score (read), and science test score (science).

High school data

data new;
  set d.hsb2;
  if write>=60 then high_write=1;
  else high_write=0;
  keep ID female math read science write high_write;
run;

proc logistic data=new descending;
  model high_write = female read science;
run;

Logistic output

Model Information
Data Set                      WORK.NEW
Response Variable             high_write
Number of Response Levels     2
Model                         binary logit
Optimization Technique        Fisher's scoring

Number of Observations Read   200
Number of Observations Used   200

This is the data set used in this procedure. "Binary logit" is the type of regression model that was fit to our data; the terms logit and logistic are interchangeable.

Logistic output

Response Profile
Ordered                    Total
Value      high_write      Frequency
1          1
2          0

Probability modeled is high_write=1.

Ordered Value refers to how SAS orders the levels of the dependent variable. When we specify the DESCENDING option, SAS treats the levels in descending order (high to low), so that when the regression coefficients are estimated, a positive coefficient corresponds to a positive relationship with high write status. By default, SAS models the lower level. The note beneath the table informs us which level of the response variable is being modeled.
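An alternative to DESCENDING that makes the modeled level explicit, and does not depend on the sort order of the levels, is the EVENT= response option on the MODEL statement; a sketch of the same model:

proc logistic data=new;
  /* EVENT='1' models the probability that high_write = 1 */
  model high_write(event='1') = female read science;
run;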

Logistic output

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
              Intercept     Intercept and
Criterion     Only          Covariates
AIC
SC
-2 Log L

The convergence status describes whether the maximum-likelihood algorithm has converged or not, and what kind of convergence criterion was used to assess convergence. The "Intercept Only" column is the model with no predictors, just the intercept term; the "Intercept and Covariates" column is the fitted model. AIC, SC, and -2 Log L are measurements used to assess model fit: the smaller the value, the better the fit.

Logistic output

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio                    3        <.0001
Score                               3        <.0001
Wald                                3        <.0001

These are three asymptotically equivalent chi-square tests. They test the null hypothesis that all of the predictors' regression coefficients are equal to zero. With p < 0.0001, we reject H0 and conclude that at least one of the predictors' regression coefficients is not equal to zero.

Logistic output

Analysis of Maximum Likelihood Estimates
                     Standard      Wald
Parameter   DF   Estimate   Error   Chi-Square   Pr > ChiSq
Intercept    1                                      <.0001
female       1
read         1                                      <.0001
science      1

Here are the parameter estimates along with their p-values. Based on the estimates, our fitted model is log[p / (1-p)] = b0 + b1*female + b2*read + b3*science, with the coefficients b0 through b3 taken from the Estimate column.
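Rather than plugging values into the fitted equation by hand, the OUTPUT statement can save the predicted probability implied by these estimates for every observation; a sketch (the output data set name preds is arbitrary):

proc logistic data=new descending;
  model high_write = female read science;
  /* PREDICTED= stores each observation's estimated P(high_write=1) */
  output out=preds predicted=p_hat;
run;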

Logistic output

Odds Ratio Estimates
                Point        95% Wald
Effect          Estimate     Confidence Limits
female
read
science

The odds ratio is obtained by exponentiating the estimate, exp(Estimate). We can interpret it as follows: for a one-unit change in the predictor variable, the odds of a positive outcome are expected to change by a factor of exp(Estimate), given that the other variables in the model are held constant.
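The point estimate and Wald limits come from exponentiating the coefficient and its endpoints, estimate ± 1.96 standard errors; a sketch with hypothetical values for one predictor:

data or_demo;
  est = 0.10;                      /* hypothetical coefficient estimate */
  se  = 0.02;                      /* hypothetical standard error */
  or       = exp(est);             /* point estimate of the odds ratio */
  or_lower = exp(est - 1.96*se);   /* lower 95% Wald limit */
  or_upper = exp(est + 1.96*se);   /* upper 95% Wald limit */
  put or= or_lower= or_upper=;
run;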

Logistic output

Odds Ratio Estimates (table as above)

If a 95% confidence interval does not cover 1, it suggests the estimate is statistically significant.

Weighted Example
Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome. Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse.
– The outcome of interest is intercourse.
– The predictors are race (white and black) and gender (male and female).
Example from Agresti, A., Categorical Data Analysis, 2nd ed.

Here is a table of the data:

                       Intercourse
Race      Gender       Yes     No
White     Male          43    134
          Female        26    149
Black     Male          29     23
          Female        22     36

Data Set Intercourse

DATA intercourse;
  INPUT white male intercourse count;
  DATALINES;
1 1 1 43
1 1 0 134
1 0 1 26
1 0 0 149
0 1 1 29
0 1 0 23
0 0 1 22
0 0 0 36
;
RUN;

SAS:
– "descending" models the probability that intercourse = 1 (yes) rather than 0 (no).
– "rsquare" requests the R² value from SAS; it is interpreted the same way as the R² from linear regression.
– "lackfit" requests the Hosmer and Lemeshow Goodness-of-Fit Test, which tells you whether the model you have created is a good fit for the data.

PROC LOGISTIC DATA = intercourse descending;
  weight count;
  MODEL intercourse = white male / rsquare lackfit;
RUN;
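Because count here is a cell frequency rather than a sampling weight, the FREQ statement is a common alternative to WEIGHT for grouped data like this; a sketch of the same model (note that standard errors can differ between the two approaches):

proc logistic data=intercourse descending;
  freq count;                      /* treat count as a replication count */
  model intercourse = white male / rsquare lackfit;
run;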

SAS Output: R²

Interpreting the R² value
The R² value is 0.9907. This means that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model.

PROC LOGISTIC Output
The odds of having intercourse are [odds ratio estimate] times greater for males than for females.

Hosmer and Lemeshow GOF Test

H-L GOF Test
The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:
H0: the model is a good fit, vs.
Ha: the model is NOT a good fit.
With this test, we want to FAIL to reject the null hypothesis, because that means our model is a good fit (this is different from most of the hypothesis testing you have seen). Look for a p-value > 0.10 in the H-L GOF test; this indicates the model is a good fit. In this case, the p-value shown in the output exceeds 0.10, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS
Model selection can be applied to both linear and logistic models. If you have multiple predictors and interactions in your model, SAS can systematically select significant predictors using forward selection, backward selection, or stepwise selection.

In forward selection, SAS starts with no predictors in the model. It selects the predictor with the smallest p-value and adds it to the model, then selects the predictor with the smallest p-value from the remaining variables and adds it to the model. It continues doing this until no remaining predictor has a p-value below the entry significance level.

In backward selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model between each elimination. It stops once all of the predictors remaining in the model are statistically significant. A backward-elimination sketch follows the forward-selection example below.

Forward Selection in SAS
We will let SAS select a model for us out of the three predictors: white, male, and white*male. Type the following code into SAS:

PROC LOGISTIC DATA = intercourse descending;
  weight count;
  MODEL intercourse = white male white*male / selection = forward lackfit;
RUN;
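For comparison, a backward-elimination version of the same model might look like the sketch below; SLSTAY= sets the significance level a predictor must meet to remain in the model, and the 0.05 shown here is an assumed choice, not a value from the lecture:

proc logistic data=intercourse descending;
  weight count;
  model intercourse = white male white*male
        / selection=backward slstay=0.05 lackfit;
run;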

Output from Forward Selection: “white” is added to the model

“male” is added to the model

No more predictors are found to be statistically significant

The Final Model:

Hosmer and Lemeshow GOF Test: The model is a good fit