Lesson 11 - Topics Statistical procedures: PROC LOGIST, REG

    Multiple logistic and linear regression
    Introduction to SAS macros
    Programs 21-22

Welcome to lesson 11. In this lesson we will look at more SAS procedures used to perform statistical analyses, namely logistic and linear regression. We will then give an introduction to SAS macros.

Logistic Regression

    Model a binary factor (yes/no) as a function of one or more independent variables.
    TOMHS example: smoking as a function of age, gender, race, and education.

    log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bk*xk

We will first look at logistic regression. Logistic regression models a binary outcome as a function of one or more independent variables. The function is the log-odds, i.e. we model the log-odds of the outcome as a linear function of the independent variables. In our illustration we will look at factors related to cigarette smoking. Specifically, we will look at age, gender, race, and education and their association with reported cigarette smoking. In logistic regression the independent variables can be continuous or discrete. As indicated, the outcome must be binary. The binary factor can be smoking, as we have here, or another binary factor such as disease status.
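As a supplementary sketch (not part of the original slides), the model can be solved for the probability itself. The symbols p, b_0, ..., b_k and x_1, ..., x_k are the ones used on the slide; the second and third lines are standard algebra and do not depend on TOMHS. The third line is the reason an exponentiated coefficient reads as an odds ratio for a one-unit difference in x_j with the other variables held fixed.

```latex
\begin{align*}
\log\frac{p}{1-p} &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k \\
p &= \frac{1}{1 + \exp\bigl(-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)\bigr)} \\
\frac{\operatorname{odds}(x_j + 1)}{\operatorname{odds}(x_j)} &= \exp(b_j)
\end{align*}
```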

    DATA stat ;
      INFILE '~/SAS_Files/tomhsfull.data' ;
      INPUT @1 ptid $10.
            @ age @ sex @ race @ educ @ eversmk @ nowsmk @ energy ;
      if race = 2 then aa = 1; else aa = 0;
      if sex = 2 then women = 1; else women = 0;
      if educ in(1,2,3,4,5,6) then collgrad = 0;
      else if educ in(7,8,9) then collgrad = 1;
      if eversmk = 2 then currsmk = 2; else currsmk = nowsmk;

Here we read in the variables from the TOMHS dataset using the full dataset version (n=902). We read in age, sex, race, education, two questions on smoking status, and the energy intake taken from diet records. We then create a binary variable for African American race status, one for female gender, and one for college graduate status, coding each as 1 or 0. For regression analysis, coding as 0 and 1 makes the interpretation of the regression coefficients easier. We will look at the coding for smoking status on the next slide.

    Did you ever smoke cigarettes?  1 = yes, 2 = no   (Var: eversmk)
    Do you now smoke cigarettes?    1 = yes, 2 = no   (Var: nowsmk)

    if eversmk = 2 then currsmk = 2; else currsmk = nowsmk;

Note: Second question only answered if first question is answered yes.

Two questions were asked in TOMHS related to cigarette smoking. The first was "Have you ever smoked cigarettes?" and, if "yes", then "Do you currently smoke cigarettes?" We want a variable to indicate current smoking status that includes the never smokers. We can get that with the if-then-else statement shown here. If the variable EVERSMK equals 2 (never smoker) then set the (new) variable CURRSMK to 2, otherwise set CURRSMK to the variable NOWSMK. See if you can follow the logic for the different possible answers to the two questions. We now have our outcome variable defined: CURRSMK equals 1 for current smokers and 2 for current non-smokers.
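The logic for the possible answer combinations can be checked with a small sketch like the one below. The three input rows are made-up test values, not TOMHS records, and the dataset name check is hypothetical. The expected CURRSMK values are 1 (ever smoked, smokes now), 2 (ever smoked, does not smoke now), and 2 (never smoked, second question left missing).

```sas
* Hypothetical test rows covering the three answer patterns;
DATA check;
  INPUT eversmk nowsmk;
  * Same statement as on the slide;
  if eversmk = 2 then currsmk = 2;
  else currsmk = nowsmk;
  DATALINES;
1 1
1 2
2 .
;
RUN;

PROC PRINT DATA=check;
RUN;
```

One case the sketch does not cover: if EVERSMK itself is missing, CURRSMK simply takes whatever value NOWSMK has, which will usually also be missing.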

    PROC MEANS;
      VAR age women collgrad aa ;
      CLASS currsmk;
    RUN;

[PROC MEANS output: N and Mean of age, women, collgrad, and aa for each level of currsmk; values were lost in extraction.]

Before we do a fancy regression model let's do a simple PROC MEANS to see if the independent variables differ between smokers and non-smokers. We see here that there are 98 current smokers and 801 non-smokers. Current smokers tended to be younger and more likely to be women, non-college graduates, and African American. Now we will put these variables into a logistic regression model to determine their multivariate associations with smoking status.

    ODS SELECT ParameterEstimates OddsRatios;
    PROC LOGIST;
      MODEL currsmk = age women collgrad aa ;
    RUN;

[Output table "Analysis of Maximum Likelihood Estimates": columns Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows Intercept, age, women, collgrad, aa. Values were lost in extraction.]
[Output table "Odds Ratio Estimates": columns Effect, Point Estimate, 95% Wald Confidence Limits; rows age, women, collgrad, aa. Values were lost in extraction.]

    OR = exp(estimate)
    OR (age) = exp(-0.07) = 0.93

The syntax for logistic regression is the PROC LOGIST statement followed by a MODEL statement. The MODEL statement lists the dependent variable, here the variable CURRSMK, then an equals sign, followed by the list of independent variables. We list each of the four factors we are interested in. We preface the procedure with an ODS SELECT statement that tells SAS to include only the output tables that display the parameter estimates and the odds ratios. The names of these tables can be determined by running the procedure with the ODS TRACE ON statement, which writes the name of each output table to the log.

The output is shown below the procedure call. We see that age and college graduate status are significantly inversely related to smoking and that AA status is positively related to smoking. Female gender is not significantly related to smoking status. Since all four of the factors are in the model, each relationship is adjusted for the other three factors.

The odds ratio table gives us the estimated odds ratio and 95% confidence interval for each factor. The point estimate, computed as the exponential of the regression coefficient, is the odds ratio of smoking for a difference of one unit in the factor. So for age the odds ratio of 0.93 is the effect of one year of age; for education the odds ratio compares the odds of smoking for a college graduate versus a non-college graduate; and for AA the odds ratio of 3.8 tells us that the odds of smoking among AA are nearly 4 times the odds among other ethnicities.
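A minimal sketch of how the table names could be looked up and then selected, assuming the dataset is the stat dataset built earlier. It uses PROC LOGISTIC, the procedure name given in the SAS documentation, where the slides write PROC LOGIST. The EVENT='1' option is my addition to make explicit that the probability of CURRSMK = 1 (current smoker) is modeled; by default the procedure models the lower ordered response value, which here is also 1. The last step is plain arithmetic on the rounded age coefficient from the slide.

```sas
* Step 1: list the ODS table names in the log;
ODS TRACE ON;
PROC LOGISTIC DATA=stat;
  MODEL currsmk(EVENT='1') = age women collgrad aa;
RUN;
ODS TRACE OFF;

* Step 2: rerun, keeping only the two tables of interest;
ODS SELECT ParameterEstimates OddsRatios;
PROC LOGISTIC DATA=stat;
  MODEL currsmk(EVENT='1') = age women collgrad aa;
RUN;

* Odds ratio for a 10-year age difference, from the rounded estimate -0.07;
DATA _NULL_;
  or10 = exp(-0.07 * 10);
  PUT or10= 6.3;
RUN;
```

The 10-year odds ratio comes out at about 0.50, i.e. the odds of smoking roughly halve per decade of age under this model.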

Comparison of univariate versus multivariate results

[Table "Multivariate": columns Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq; rows Intercept, age, women, collgrad, aa. Values were lost in extraction.]
[Table "Univariate (Separate regression runs)": same columns; rows age, women, collgrad, aa. Values were lost in extraction.]

Note: Women more likely to be AA than men in TOMHS and AA more likely to be smokers.

This slide compares the multivariate results, with all four factors in the model, with the univariate results, where each factor is entered in the model alone. The code for the univariate results is not shown here, only the results. For the variables age, college graduate, and AA ethnicity, the regression coefficients from the univariate and multivariate models are nearly identical. However, for the gender factor (female status) the coefficient reverses from being positive (higher rate of smoking among women compared to men) to negative (lower rate in women), although in each case the coefficient is not significantly different from zero. This can happen in any regression because of confounding: the independent variables are correlated with each other as well as associated with the dependent variable. Here we note that, in TOMHS, women were more likely to be AA than men and AA were more likely to be smokers than non-AA. This explains the change in the relationship between gender and smoking once race is in the model.

Linear Regression

    Model a continuous factor as a function of one or more independent variables.
    TOMHS example: energy (calories) intake as a function of age, gender, race, and education.

We will now look at an example using linear regression. Linear regression models a continuous factor as a linear function of one or more independent variables. As in logistic regression, the independent variables can be continuous or discrete. In our illustration we will look at factors related to energy (caloric) intake. Specifically, we will look at age, gender, race, and education and their association with energy intake as calculated from food records.

    ODS SELECT ParameterEstimates ;
    PROC REG;
      MODEL energy = age women collgrad aa ;
    RUN;

[PROC REG output, Model: MODEL1, Dependent Variable: energy. Table "Parameter Estimates": columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, age, women, collgrad, aa. Values were lost in extraction.]

    Energy = *age - 570*women - 109*collgrad - 253*aa
    [intercept and age coefficient lost in extraction]

We have seen the syntax for linear regression earlier in the course. As with logistic regression there is a MODEL statement where the dependent variable is listed, followed (after an = sign) by the list of independent variables. Here we are modeling energy intake as a function of age, gender, education, and AA status, the same variables used before to relate to smoking status. We use the ODS SELECT statement to limit our output to the parameter estimates, i.e. the estimated regression coefficients.

The parameter estimates with their standard errors and significance levels are given under the SAS code. Each parameter estimate gives the estimated effect of a one-unit difference in the independent variable on energy intake. So from the output we see that older persons take in less energy (by 20 calories per year of age), women take in 570 calories less than men, college graduates take in about 110 calories less than non-college graduates, and AA take in about 254 calories less than non-AA. These estimates are "adjusted" for the other factors in the model.
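As a worked illustration of how the adjusted coefficients combine (my own example, using the rounded values from the fitted equation on the slide), compare a woman who is a college graduate with a man who is not, both of the same age and race. The intercept and the age and race terms cancel, leaving only the two coefficients that differ:

```latex
\begin{align*}
\widehat{\text{Energy}}_{\text{woman, grad}} - \widehat{\text{Energy}}_{\text{man, non-grad}}
  &= (-570)(1 - 0) + (-109)(1 - 0) \\
  &= -679 \text{ calories}
\end{align*}
```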

[Table "Multivariate Analysis": columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows age, women, collgrad, aa. Values were lost in extraction.]
[Table "Univariate Analysis (Separate regression runs)": same columns; rows age, women, collgrad, aa. Values were lost in extraction.]

Note: Women less likely to be college graduates and also to have lower caloric intake.

Here we display the parameter estimates from the full model alongside the results from regression models with each variable entered alone. The coefficient for college graduate changed from a non-significant positive value of 41 in the univariate model to a significant negative value of -109 in the multivariate model. This change arises, in part, because in TOMHS women were less likely than men to be college graduates and also had lower caloric intake than men.
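The slides do not show the code for the univariate runs. One way they could be produced is sketched below, assuming the stat dataset from earlier; this is not necessarily how the original results were generated. PROC REG accepts several MODEL statements in one call, which keeps the code short.

```sas
* Sketch: one single-predictor model per independent variable;
ODS SELECT ParameterEstimates;
PROC REG DATA=stat;
  MODEL energy = age;
  MODEL energy = women;
  MODEL energy = collgrad;
  MODEL energy = aa;
RUN;
QUIT;
```

A caveat on this design: when several MODEL statements share one PROC REG call, an observation with a missing value in any variable used by any of the models is dropped from all of them. If the predictors have different missing-data patterns, four separate PROC REG calls can give slightly different estimates.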

    PROC MEANS;
      VAR energy;
      CLASS women aa collgrad;
    RUN;

[PROC MEANS output, Analysis Variable: energy. N and Mean for each combination of women, aa, and collgrad; values were lost in extraction.]

Sometimes, to understand multivariate relationships, you can compute means of the dependent variable by combinations of the independent factors. Here we compute the mean energy intake by the gender, race, and education categories. The regression coefficient for education will essentially be the pooled difference between graduates' and non-graduates' energy intake across the four gender-race categories. This breakdown does not account for any confounding by age.

Macro Variables and Use
    LIBNAME t 'C:\SAS_Files';

    %let nut = kcalbl dcholbl calcbl sodbl;
    %let cat = clinic;

    DATA temp;
      SET t.tomhs (KEEP=ptid &nut &cat);
    RUN;

    PROC MEANS DATA=temp ;
      VAR &nut ;
      CLASS &cat;
      TITLE "PROC Means results for variables &nut by &cat";
    * Makes it easy to modify code;

This example illustrates what are called macro variables. SAS macros are usually considered an advanced topic, but macro variables are pretty easy to use and can shorten your program and add flexibility to it. As with many things in SAS, it is best shown through an example. Note the two %let statements at the top of the program. These define two macro variables called nut and cat. Each macro variable is assigned the value of the text after the equals sign. The macro variable nut is set to a list of variable names related to nutrition. The macro variable cat is set to the single variable clinic, which is used as the CLASS variable. Once defined, we can use these names as shortcuts later in our program. Instead of typing the list of nutrition variables each time, we can type the shortcut name. To tell SAS we are using the shortcut name we place an ampersand before the macro variable name. Look at the different places where &nut is used. When SAS processes your program it will insert the value of &nut, the four nutrition variables, every time it sees &nut. Using macro variables makes it easy to modify code and reduces the possibility of errors because you will be typing less.

Macro Variables

    %let macrovarname = characters ;

    Defined using %LET statement
    Referenced by using &macrovarname
    SAS substitutes the value of macrovarname when it encounters &macrovarname
    Useful for making a program easy to modify
    Usually put near top of program

Here is the general syntax for defining a macro variable and how it is referenced. The keyword %let is followed by the name of the macro variable you are defining, an equals sign, and then the characters you are assigning to the name. The characters will often be a list of variables. To reference the macro variable in your program, place an ampersand immediately before the name. SAS will substitute the value of the macro variable every place it sees the & reference. This can make it much easier to modify your programs. Suppose you wanted to add to the list of nutrient variables you wanted to analyze. Just change the macro variable nut to the new list. The rest of the program can remain the same. In SAS jargon we usually talk of a macro variable called nut (say) as &nut.
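A quick way to see what a macro variable currently holds is the %PUT statement, which writes text to the log. The sketch below reuses the nut and cat macro variables from the earlier example; the two TITLE statements are my own illustration of a related point, namely that macro variables are resolved inside double quotes but not inside single quotes.

```sas
%let nut = kcalbl dcholbl calcbl sodbl;
%let cat = clinic;

* Write the current values to the log;
%put Nutrition variables: &nut;
%put Class variable: &cat;

* Double quotes: &cat is replaced by clinic;
TITLE "Results by &cat";
* Single quotes: the text &cat is printed literally;
TITLE2 'Results by &cat';
```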

Simple Macro to Shorten Code
Suppose I want to compute the change in 4 variables at 3 time points. A macro can help.

    Variables:
      dbp12, dbp24, dbp36 and dbpbl
      sbp12, sbp24, sbp36 and sbpbl
      chol12, chol24, chol36 and cholbl
      gluc12, gluc24, gluc36 and glucbl

    %macro change(v);
      dbpdif&v = dbp&v - dbpbl;
      sbpdif&v = sbp&v - sbpbl;
      choldif&v = chol&v - cholbl;
      glucdif&v = gluc&v - glucbl;
    %mend change;

    option mprint; * Shows code generated in the log;

    data temp;
      set temp;
      %change(12);
      %change(24);
      %change(36);
    run;

So before we quit for the semester I want to show you one example of a macro program to illustrate how macros can be used to shorten and automate programs. Suppose I need to compute the change in several variables at multiple points in time (at 12, 24, and 36 months). I could write repeated blocks of code for the different visits or I could use a macro to help me. Let's write a simple macro to reduce the coding required.

You start by creating the macro with the keyword %MACRO followed by the name of the macro, here called change. In parentheses we add any parameters we will be changing in each call to the macro; here we have just one parameter, called v, to represent the visit. After the macro definition we write the SAS statements we want, using &v as a placeholder for the visit. So looking at the macro we see we want to compute the difference in four variables at a certain time represented by &v. We close the macro with a %MEND statement.

This code doesn't do anything until it is called. We will call it in a data step. It is called by using the macro name preceded by a % sign and giving a value to the macro parameter. We call it three times, once for 12 months (&v=12), once for 24 months (&v=24), and once for 36 months (&v=36). That's all there is to it. To see the code the macro generated we add the option mprint at the top of our program.

Simple Macro to Shorten Code
    %change(12);
    MPRINT(CHANGE): dbpdif12 = dbp12 - dbpbl;
    MPRINT(CHANGE): sbpdif12 = sbp12 - sbpbl;
    MPRINT(CHANGE): choldif12 = chol12 - cholbl;
    MPRINT(CHANGE): glucdif12 = gluc12 - glucbl;
    %change(24);
    MPRINT(CHANGE): dbpdif24 = dbp24 - dbpbl;
    MPRINT(CHANGE): sbpdif24 = sbp24 - sbpbl;
    MPRINT(CHANGE): choldif24 = chol24 - cholbl;
    MPRINT(CHANGE): glucdif24 = gluc24 - glucbl;
    %change(36);
    MPRINT(CHANGE): dbpdif36 = dbp36 - dbpbl;
    MPRINT(CHANGE): sbpdif36 = sbp36 - sbpbl;
    MPRINT(CHANGE): choldif36 = chol36 - cholbl;
    MPRINT(CHANGE): glucdif36 = gluc36 - glucbl;
    run;

    SAS substitutes the value of v everywhere there is an &v

Here is the code generated. We see 3 blocks of code, one for each visit. Every place in the macro that has &v, SAS will substitute the value of v you gave it, either 12, 24, or 36. Chapter 7 of the LSB will give you more information on using macros.
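The three calls could also be folded into the macro itself with a macro %DO loop. This is a sketch of an alternative, not the version from the slides; the macro name change_all is hypothetical. It generates the same twelve assignment statements.

```sas
%macro change_all;
  %local v;
  /* Loop over the visits 12, 24, 36 and generate one block per visit */
  %do v = 12 %to 36 %by 12;
    dbpdif&v  = dbp&v  - dbpbl;
    sbpdif&v  = sbp&v  - sbpbl;
    choldif&v = chol&v - cholbl;
    glucdif&v = gluc&v - glucbl;
  %end;
%mend change_all;

options mprint;
data temp;
  set temp;
  %change_all
run;
```

The parameterized version on the slide is more flexible when visits are irregular, since each call names its own visit; the loop is shorter when the visits follow a fixed step.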

Another Macro Example

Goal of macro named Summary: For a given dataset, give summary statistics using PROC CONTENTS, MEANS, and FREQ and (optionally) display the data using PROC PRINT. Instead of having to write the code each time, write a macro.

Parameters to Macro

    %macro summary ( dataset=, mvar=_numeric_, fvar = _character_, print=N, pvar=_all_);

The name of the macro is summary. Each parameter is followed by an equals sign and its default value.

    dataset: Name of dataset used
    mvar: List of variables to run for PROC MEANS (default is all numeric variables)
    fvar: List of variables to run for PROC FREQ (default is all character variables)
    print: If set to Y then run PROC PRINT (default is N)
    pvar: List of variables to run for PROC PRINT

Remember: a SAS macro generates SAS code when you call it.

    %macro summary ( dataset=, mvar=_numeric_, fvar = _character_, print=N, pvar=_all_);

      proc contents data=&dataset varnum;
      run;

      proc means data=&dataset;
        var &mvar;

      proc freq data=&dataset;
        tables &fvar;

      %if &print = Y %then %do;
        proc print data=&dataset;
          var &pvar;
      %end;

    %mend summary;

The %if block will generate the proc print code only if the macro variable print equals Y.
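One detail worth knowing about the %if test: macro comparisons are text comparisons and are case-sensitive, so a call with print=y (lowercase) would not run PROC PRINT. A minimal sketch of a more forgiving test follows. The macro name summary2 is hypothetical and only the PROC PRINT part is shown; %UPCASE converts the parameter value to uppercase before comparing.

```sas
%macro summary2 (dataset=, print=N, pvar=_all_);
  /* Accept Y or y by uppercasing the value before the comparison */
  %if %upcase(&print) = Y %then %do;
    proc print data=&dataset;
      var &pvar;
    run;
  %end;
%mend summary2;

options mprint;
%summary2 (dataset=tomhs, print=y);
```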

Call to the summary macro with only the dataset given:

    libname t 'C:/PH6420/data/';
    data tomhs;
      set t.tomhs;
    run;

    option mprint;
    * Call with only dataset given;
    %summary (dataset=tomhs);

Code generated:

    MPRINT(SUMMARY): proc contents data=tomhs varnum;
    MPRINT(SUMMARY): run;
    MPRINT(SUMMARY): proc means data=tomhs;
    MPRINT(SUMMARY): var _numeric_;
    MPRINT(SUMMARY): proc freq data=tomhs;
    MPRINT(SUMMARY): tables _character_;

Call to the summary macro with print=Y:

    libname t 'C:/PH6420/data/';
    data tomhs;
      set t.tomhs;
    run;

    option mprint;
    %summary (dataset=tomhs, print=Y);

Code generated:

    MPRINT(SUMMARY): proc contents data=tomhs varnum;
    MPRINT(SUMMARY): run;
    MPRINT(SUMMARY): proc means data=tomhs;
    MPRINT(SUMMARY): var _numeric_;
    MPRINT(SUMMARY): proc freq data=tomhs;
    MPRINT(SUMMARY): tables _character_;
    MPRINT(SUMMARY): proc print data=tomhs;
    MPRINT(SUMMARY): var _all_;

Call to the summary macro with a list of variables for PROC FREQ:

    libname t 'C:/PH6420/data/';
    data tomhs;
      set t.tomhs;
    run;

    option mprint;
    %summary (dataset=tomhs, fvar=clinic sex);

Code generated:

    MPRINT(SUMMARY): proc contents data=tomhs varnum;
    MPRINT(SUMMARY): run;
    MPRINT(SUMMARY): proc means data=tomhs;
    MPRINT(SUMMARY): var _numeric_;
    MPRINT(SUMMARY): proc freq data=tomhs;
    MPRINT(SUMMARY): tables clinic sex;

    proc tabulate data=_last_ noseps;
      class group;
      var sbp12;
      table (group all),
            (sbp12)*(n*f=7.0 mean*f=7.1 std*f=7.1 min*f=7.1 max*f=7.1)/rts=30;
    run;

[PROC TABULATE output: N, Mean, Std, Min, and Max of "Diastolic BP at 12-Months" for each Study Group (1-6) and for All; values were lost in extraction.]

MACRO BRKSPSS: Creates tabulate table for each var in dlist by group
    %macro brkspss (grp,dlist,data=_last_,dec=3,all=all);
      %do i = 1 %to 100;
        %let depvar = %scan(&dlist,&i);
        %if %length(&depvar) = 0 %then %goto done;
        proc tabulate data=&data noseps;
          class &grp;
          var &depvar;
          table (&grp &all),
                (&depvar)*(n*f=7.0 mean*f=7.&dec std*f=7.&dec min*f=7.&dec max*f=7.&dec)/rts=30;
        run;
      %end;
      %done:
    %mend brkspss;

    %brkspss(group,dbp12 sbp12 chol12);
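The looping pattern in BRKSPSS picks the i-th word out of the list with %SCAN and stops when the word is empty. The sketch below isolates that pattern in a small hypothetical macro, listvars, that only writes each variable name to the log. Instead of looping to 100 and jumping out with %GOTO, it counts the words first with the COUNTW function called through %SYSFUNC; this is an alternative, not the approach used on the slide.

```sas
%macro listvars(dlist);
  %local i depvar;
  /* COUNTW returns the number of words in the list */
  %do i = 1 %to %sysfunc(countw(&dlist));
    %let depvar = %scan(&dlist, &i);
    %put Variable &i is &depvar;
  %end;
%mend listvars;

%listvars(dbp12 sbp12 chol12);
```

For the call shown, the log should contain three lines naming dbp12, sbp12, and chol12 in turn. Counting the words up front also removes the implicit limit of 100 variables.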

MACRO BRKSPSS: Creates tabulate table for each var by group
    LIBNAME t '~/PH6420/2017/Data/';
    DATA stat;
      set t.tomhs;
    RUN;

    * Example calls;
    %brkspss(group,dbp12 sbp12 chol12);
    %brkspss(group,dbp12 sbp12 chol12, dec=1);  * Just 1-decimal;
    %brkspss(group,dbp12 sbp12 chol12, all=);   * No totals;

Output from last call: First 2 variables.
[PROC TABULATE output: N, Mean, Std, Min, and Max by Study Group (1-6), one table for "Diastolic BP at 12-Months" and one for "Systolic BP at 12-Months"; values were lost in extraction.]

Where to put macro?

1. At beginning of program before you call it.

    %macro brkspss(parameters);
      … macro code
    %mend brkspss;

    %brkspss (group, dbp12 sbp12, data=tomhs);

2. Save as separate sas file and %include file on top of program.

    %include '/folderpath/brkspss.sas';

    %brkspss(group, dbp12 sbp12, data=tomhs);
