Analysis of Complex Survey Data Day 3: Regression
Today’s schedule
Part I: Basic review of common regressions and when to use them
Part II: Introduction to
– PROC REGRESS
– PROC RLOGIST
– PROC LOGLINK
– PROC MULTILOG
Regression
Typically in epidemiologic research, our outcomes fall into five major types:
– Continuous (normally distributed or skewed)
– Counts
– Binary
– Ordinal
– Nominal
Continuous outcome, normally distributed: Linear regression
Continuous outcome, right skewed: Poisson regression
Counts: Poisson regression
Binary outcome: Logistic regression
Ordinal
Polytomous regression, cumulative logit link function
– Likert scales
– Ordered categorical scales (age, income)
The cumulative logit link function assumes that the effect of the exposure on going from level 1 to level 2 of the outcome is the same as its effect on going from level 2 to level 3 (the proportional odds assumption).
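One common way to write the cumulative logit model, shown here as a sketch for a single exposure x and an outcome Y with ordered levels 1, ..., J. The symbols α_j and β are notation introduced for illustration, and the sign convention differs between software packages:

```latex
\log\frac{P(Y \le j \mid x)}{P(Y > j \mid x)} = \alpha_j + \beta x,
\qquad j = 1, \dots, J-1
```

Each cutpoint j has its own intercept α_j, but there is only one slope β. That single shared slope is the assumption described above.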
Nominal
Polytomous regression, general logit link function
– Race
– Diagnosis (depression versus anxiety versus substance use disorder)
The general logit link function gives a separate estimate for each level of the outcome, so the effect of going from 1 to 2 can differ from the effect of going from 2 to 3.
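For comparison, a sketch of the general (generalized) logit model for a single exposure x and an outcome Y with unordered levels 1, ..., J. Using the last level J as the reference category is an assumption made here for illustration, as are the symbols α_j and β_j:

```latex
\log\frac{P(Y = j \mid x)}{P(Y = J \mid x)} = \alpha_j + \beta_j x,
\qquad j = 1, \dots, J-1
```

Here every non-reference level gets its own slope β_j, so nothing forces the exposure effect to be the same across levels.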
Categorizing your exposure
Check assumptions regarding the functional form of the relationship between the exposure and the outcome
– E.g., the relationship between age and alcohol use disorders. We would not want to enter age as a continuous variable because we do not think age is linearly related to risk of alcohol use disorders
If you decide to categorize a continuous variable, cutpoints are best chosen from precedent in the literature
– Relying on data-driven cutpoints will make your work hard to compare with other work in the literature
If there is no precedent:
– Use quartiles, or
– Break up the exposure into small categories, and examine the relationship with the outcome in a regression model with no other predictors (on the log odds scale if using logistic regression).
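The no-precedent route above can be sketched in a few lines of Python. This is a minimal, unweighted illustration only: it ignores survey weights and the sample design, which the real analysis would handle in SUDAAN. The function names and the toy data are hypothetical, and the 0.5 added to each cell is a convenience so that an empty cell does not produce log(0).

```python
import math
from bisect import bisect_right
from statistics import quantiles


def quartile_category(values):
    """Return (cutpoints, categories); category 0 is the lowest quartile."""
    cuts = quantiles(values, n=4)  # three cutpoints: Q1, median, Q3
    # A value exactly equal to a cutpoint is placed in the upper category.
    cats = [bisect_right(cuts, v) for v in values]
    return cuts, cats


def empirical_log_odds(categories, outcomes):
    """Unweighted log odds of a 0/1 outcome within each exposure category."""
    result = {}
    for c in sorted(set(categories)):
        ys = [y for k, y in zip(categories, outcomes) if k == c]
        events = sum(ys)
        nonevents = len(ys) - events
        # 0.5 is added to both cells so that empty cells do not give log(0).
        result[c] = math.log((events + 0.5) / (nonevents + 0.5))
    return result


# Hypothetical toy data: ages 20-59, outcome present only at age 50 and over.
toy_ages = list(range(20, 60))
toy_outcome = [1 if age >= 50 else 0 for age in toy_ages]
toy_cuts, toy_cats = quartile_category(toy_ages)
toy_log_odds = empirical_log_odds(toy_cats, toy_outcome)
```

If the log odds rise or fall in roughly equal steps across categories, a linear term may be reasonable. If not, keep the exposure categorical.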
Choosing covariates
Most important: DO NOT SKIP THE GROUNDWORK!
– Check associations with exposure and outcome
– Check associations among covariates
– Categorize the covariates appropriately
When should something be evaluated as a moderator, and when should it be a confounder/covariate?
– Most of the time, it is clear: do you think that the relationship between exposure and outcome will be the same across levels of the third variable, or do you think it will be different?
– If you do not have an a priori hypothesis and are just trying to build a solid statistical model, try it as a moderator first. If the interaction is significant, leave it in as a moderator.
– Because interaction terms are sometimes difficult to interpret on their own, consider fitting separate models within subsets (strata) of the moderator instead.
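As a rough illustration of the subset idea, the sketch below computes an unweighted odds ratio within each level of a third variable from 2x2 counts. It is a simplification: there are no survey weights, no design-based standard errors, and no adjustment for other covariates. The function names and any counts you pass in are hypothetical.

```python
def odds_ratio(exp_cases, exp_noncases, unexp_cases, unexp_noncases):
    """Odds ratio from a 2x2 table; raises ZeroDivisionError if a needed cell is 0."""
    return (exp_cases * unexp_noncases) / (exp_noncases * unexp_cases)


def stratified_odds_ratios(tables):
    """Map each stratum label to the odds ratio of its 2x2 table.

    `tables` maps a stratum label to a tuple of counts:
    (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).
    """
    return {stratum: odds_ratio(*cells) for stratum, cells in tables.items()}


# Hypothetical counts for two strata of a third variable.
example_tables = {
    "stratum A": (40, 60, 20, 80),
    "stratum B": (15, 85, 12, 88),
}
example_ors = stratified_odds_ratios(example_tables)
```

Clearly different stratum-specific estimates point toward treating the third variable as a moderator. Similar estimates suggest handling it as a covariate.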
LAB 3: Regression in SUDAAN