Published by Yuliani Hardja. Modified over 6 years ago.
1
Lecture 10: cohort analysis (part 1): Intro to regression and dummy variables
Jeffrey E. Korte, PhD
BMTRY 747: Foundations of Epidemiology II
Department of Public Health Sciences
Medical University of South Carolina
Spring 2015
2
Measures of association: linked to study design
e.g. if you have a dichotomous outcome:
  Linear regression – not okay
    Assumes continuous normal outcome
  Logistic regression – okay
    Assumes follow-up is the same for all subjects
    Predictor variables can be continuous, dichotomous, ordinal, nominal
    Produces odds ratio for each variable
    Regression coefficient is natural log of adjusted OR
3
Measures of association: linked to study design
If you have a dichotomous outcome (cont):
  Proportional hazards regression – usually better
    Also called “Cox models”, “survival analysis”
    Takes into account different follow-up possible for each subject; different risk sets for each case
    Predictor variables can be continuous, dichotomous, ordinal, nominal
    Produces hazard ratio for each variable
    Assumes that the hazard ratio is constant over time
4
Measures of association: linked to study design
If you have a dichotomous outcome (cont):
  Log binomial or negative binomial
    Can directly estimate relative risk
    Preferable to logistic regression when outcome is common
    Good for causal models in cohort studies where you are not modeling time-to-event
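To illustrate why the relative risk is preferred when the outcome is common, here is a minimal sketch that computes the risk ratio and the odds ratio from a 2x2 table. The counts are hypothetical and chosen only for illustration. The function name is likewise an assumption, not part of the lecture.

```python
def risk_ratio_and_odds_ratio(a, b, c, d):
    """Return (risk ratio, odds ratio) for a 2x2 table.

    a = exposed cases,   b = exposed non-cases
    c = unexposed cases, d = unexposed non-cases
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    return risk_ratio, odds_ratio


# Hypothetical rare outcome: risks of 2% (exposed) and 1% (unexposed).
rare = risk_ratio_and_odds_ratio(20, 980, 10, 990)

# Hypothetical common outcome: risks of 60% (exposed) and 30% (unexposed).
common = risk_ratio_and_odds_ratio(600, 400, 300, 700)

print("rare outcome:   RR=%.2f OR=%.2f" % rare)
print("common outcome: RR=%.2f OR=%.2f" % common)
```

In both tables the risk ratio is 2, but the odds ratio stays close to it only when the outcome is rare. With the common outcome the odds ratio moves well away from the risk ratio.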
5
Measures of association
Odds ratio
  Can be obtained from cross-sectional, case-control, cohort study, randomized controlled trial
Rate ratio, risk ratio, hazard ratio
  Poisson regression, log binomial, survival analysis
  Can be obtained from cohort study, randomized controlled trial
Cross-sectional study can give you risk ratio, prevalence ratio, odds ratio
6
Measures of association
Analysis strategy should progress in stages
  Univariate analysis
  Bivariate analysis
  Modeling
Analysis methods are linked to:
  Study design
  Question of interest
  Structure of data and variable coding
  Appropriate measure of association
7
Linear regression
Assume continuous normal outcome (e.g. blood pressure)
Model the actual outcome (no link function)
Coefficient is the weight for each variable
Model predicts value of Y among unexposed (i.e. all X=0), and change in Y associated with a one-unit increase for each variable, adjusted for other variables
Can obtain predicted value of Y for an individual
8
Logistic regression
Assume dichotomous outcome (e.g. preterm birth)
Model the “logit” link function (log odds)
Coefficient is the weight for each variable
Intercept is log odds of outcome among unexposed
Other betas are the increased “log odds” associated with a one-unit increase for each variable, adjusted for the other variables
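As a small sketch of the logit link described above, the following code converts between probabilities and log odds and computes predicted probabilities from a one-predictor model. The coefficient values `beta0` and `beta1` and all function names are hypothetical, chosen only for illustration.

```python
import math


def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds):
    """Probability corresponding to a given log odds."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical fitted coefficients (assumptions, not from the lecture).
beta0 = -2.0  # intercept: log odds of the outcome among the unexposed
beta1 = 0.7   # increase in log odds for a one-unit increase in X1


def predicted_probability(x1):
    """Predicted probability of the outcome for a given value of X1."""
    return inverse_logit(beta0 + beta1 * x1)


p_unexposed = predicted_probability(0)
p_exposed = predicted_probability(1)

print("log odds, unexposed:", logit(p_unexposed))
print("difference in log odds:", logit(p_exposed) - logit(p_unexposed))
print("odds ratio:", math.exp(beta1))
```

The log odds among the unexposed recovers the intercept. The difference in log odds between exposed and unexposed recovers the coefficient for X1.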
9
Logistic regression
This works out so that each beta (except for the intercept) is the adjusted log odds ratio for a one-unit increase in the exposure variable of interest
To obtain adjusted odds ratio, exponentiate the beta
10
Why that works Odds:
11
Why that works Odds ratio: estimate odds of disease in exposed (X1=1) versus unexposed (X1=0)
12
Why that works
Exponential mathematics: $e^{x+y} = e^{x} \cdot e^{y}$
13
Why that works Adjusted OR for X1 (hold X2 steady)
14
Why that works
If X1 is (1=yes, 0=no), then:
  exposed: $\beta_1 X_{1e} = \beta_1$
  unexposed: $\beta_1 X_{1u} = 0$
Note that $e^{0} = 1$
15
Why that works So therefore, in logistic regression:
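The steps on the preceding slides can be written out as one chain. This is a sketch that assumes a model with two predictors, the exposure $X_1$ (1=yes, 0=no) and one adjustment variable $X_2$ held at the same value in both groups. The notation is a reconstruction from the surrounding text, not a copy of the original slide equations.

```latex
\begin{align*}
\text{odds}(X_1, X_2) &= \frac{p}{1-p} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2} \\[4pt]
\text{OR}_{X_1} &= \frac{\text{odds}(X_1 = 1, X_2)}{\text{odds}(X_1 = 0, X_2)}
  = \frac{e^{\beta_0 + \beta_1 \cdot 1 + \beta_2 X_2}}{e^{\beta_0 + \beta_1 \cdot 0 + \beta_2 X_2}} \\[4pt]
 &= \frac{e^{\beta_0} \, e^{\beta_1} \, e^{\beta_2 X_2}}{e^{\beta_0} \, e^{0} \, e^{\beta_2 X_2}}
  = \frac{e^{\beta_1}}{1} = e^{\beta_1}
\end{align*}
```

The intercept and the $X_2$ terms cancel because they are identical in the numerator and the denominator, leaving only the exponentiated coefficient for $X_1$.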
16
Why that works
Note: exponentiated beta is simply the adjusted OR for a one-unit increase in the variable, no matter how it is coded
  Dichotomous 1=yes, 0=no
  Dichotomous 0=yes, 1=no
  Dichotomous 85=yes, 0=no
  Continuous ranging from 85 to 450
  Ordinal (1, 2, 3, 4, 5)
17
Why that works One-unit increase for continuous variable:
18
Implication
Odds ratio is assumed to be the same for a one-unit increase anywhere along the scale of the exposure variable
  e.g. 1 versus 0, 2 versus 1, 3 versus 2, etc.
  i.e. the relationship is linear in the logit
This is why it is called a “multiplicative model”
  Each unit increase in exposure is assumed to multiply the odds, rather than add to them, by a defined amount
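A quick numerical check of this implication, assuming a one-predictor logistic model with hypothetical coefficients: the odds ratio for a one-unit increase comes out the same wherever it is evaluated, and a ten-unit increase multiplies the odds by the one-unit odds ratio raised to the tenth power.

```python
import math

# Hypothetical coefficients (assumptions, not from the lecture).
beta0 = -3.0
beta1 = 0.13


def odds(x):
    """Odds of the outcome at exposure value x under the assumed model."""
    return math.exp(beta0 + beta1 * x)


def odds_ratio(x_high, x_low):
    """Odds ratio comparing exposure x_high with exposure x_low."""
    return odds(x_high) / odds(x_low)


# One-unit increases evaluated at several points along the scale.
one_unit = [odds_ratio(x + 1, x) for x in (0, 1, 2, 50)]

# A ten-unit increase multiplies the odds by exp(beta1) ten times over.
ten_unit = odds_ratio(60, 50)

print("one-unit ORs:", one_unit)
print("exp(beta1):  ", math.exp(beta1))
print("ten-unit OR: ", ten_unit)
```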
19
Linearity in the logit
Dichotomous variable: okay
  Two points always form a perfect line
Continuous or ordinal variables: maybe
  Need to test the assumption by constructing “dummy variables”
    Categorize the variable with regular intervals
    Define reference category
      All other categories will be compared to it
    Assess whether dose-response curve is smooth
20
Dummy variables: example
Outcome variable: Type II diabetes
Continuous exposure variable: age
Fit as continuous variable: logistic regression model will assume the OR is the same for each one-unit increase in age
  e.g. OR=1.14 (95% CI: – 1.19)
Does this fit the data well?
  Find out by testing dummy variables for age groups
21
Dummy variables: example
Age range: 18-85
Choose categories with regular intervals to test linearity in logit
Reference category should be chosen as described earlier
  Robust sample size
  May be convenient to have it be the category with expected lowest (or highest) risk
22
Dummy variables: example
Make 7 categories:
  18-29 (reference)
  30-39
  40-49
  50-59
  60-69
  70-79
  80-85
23
Dummy variables: example
6 dummy variables for 7 categories: each dichotomous (1=yes, 0=no)
  18-29 (reference)
  30-39: dummy variable 1 (AGE2)
  40-49: dummy variable 2 (AGE3)
  50-59: dummy variable 3 (AGE4)
  60-69: dummy variable 4 (AGE5)
  70-79: dummy variable 5 (AGE6)
  80-85: dummy variable 6 (AGE7)
24
Dummy variables: example
Age group   AGE2  AGE3  AGE4  AGE5  AGE6  AGE7
18-29          0     0     0     0     0     0
30-39          1     0     0     0     0     0
40-49          0     1     0     0     0     0
50-59          0     0     1     0     0     0
60-69          0     0     0     1     0     0
70-79          0     0     0     0     1     0
80-85          0     0     0     0     0     1
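The dummy coding for the seven age categories can be sketched in a few lines. The variable names AGE2 to AGE7 follow the earlier slide. The function name and the assumption that age is recorded in whole years between 18 and 85 are illustrative choices, not part of the lecture.

```python
# (lower bound, upper bound, dummy variable name); None marks the reference group.
AGE_GROUPS = [
    (18, 29, None),
    (30, 39, "AGE2"),
    (40, 49, "AGE3"),
    (50, 59, "AGE4"),
    (60, 69, "AGE5"),
    (70, 79, "AGE6"),
    (80, 85, "AGE7"),
]

DUMMY_NAMES = [name for _, _, name in AGE_GROUPS if name is not None]


def age_dummies(age):
    """Return the six dummy variables (1=yes, 0=no) for an age in whole years."""
    dummies = {name: 0 for name in DUMMY_NAMES}
    for low, high, name in AGE_GROUPS:
        if low <= age <= high:
            if name is not None:
                dummies[name] = 1
            return dummies
    raise ValueError(f"age {age} is outside the 18-85 range")


print(age_dummies(25))  # reference group: all dummies are 0
print(age_dummies(34))  # only AGE2 is 1
print(age_dummies(85))  # only AGE7 is 1
```

A subject in the reference category has all six dummies equal to 0, so each fitted coefficient compares one category against 18-29.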
25
Dummy variables: example
This coding strategy results in independent comparisons of risk between the reference group and each other category.
Each dummy variable corresponds to the comparison between that category (dummy variable = 1) and the reference category.
26
Dummy variables: example
Possible results (logistic regression):
  18-29 (reference)
  30-39: AGE2: OR=1.5 (0.7 – 2.2)
  40-49: AGE3: OR=2.0 (0.9 – 2.9)
  50-59: AGE4: OR=2.5 (1.3 – 3.1)
  60-69: AGE5: OR=3.0 (1.8 – 4.1)
  70-79: AGE6: OR=3.5 (2.0 – 4.9)
  80-85: AGE7: OR=4.0 (2.4 – 6.8)
27
Dummy variables: example
Possible results (logistic regression):
  18-29 (reference)
  30-39: AGE2: OR=1.5 (0.7 – 2.2)
  40-49: AGE3: OR=2.0 (0.9 – 2.9)
  50-59: AGE4: OR=2.5 (1.3 – 3.1)
  60-69: AGE5: OR=2.6 (1.2 – 3.7)
  70-79: AGE6: OR=2.4 (1.1 – 3.3)
  80-85: AGE7: OR=2.5 (1.1 – 3.4)
28
Dummy variables: example
Possible results (logistic regression):
  18-29 (reference)
  30-39: AGE2: OR=0.92 (0.64 – 1.4)
  40-49: AGE3: OR=1.2 (0.76 – 1.5)
  50-59: AGE4: OR=1.6 (1.0 – 2.4)
  60-69: AGE5: OR=2.6 (1.2 – 3.7)
  70-79: AGE6: OR=4.8 (2.1 – 7.1)
  80-85: AGE7: OR=7.5 (2.9 – 13.4)
29
Dummy variables
To confirm linearity in the logit:
  Beta coefficients for successive categories should progress in a linear scale (e.g. each category has a beta approximately 0.4 units higher than the previous beta)
  Odds ratios for successive categories should progress in a multiplicative scale (e.g. each category has an odds ratio approximately 1.5 times the previous odds ratio, since $e^{0.4} \approx 1.5$)
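One way to apply this check is to convert the odds ratios for successive categories back to betas and look at the step between neighbours. The sketch below does this for the point estimates in the three "possible results" slides. The function name is an assumption, and the check ignores the confidence intervals, so it is a rough screen rather than a formal test.

```python
import math


def successive_beta_steps(odds_ratios):
    """Differences between successive betas, starting from the reference (beta = 0)."""
    betas = [0.0] + [math.log(o) for o in odds_ratios]
    return [later - earlier for earlier, later in zip(betas, betas[1:])]


# Point estimates for AGE2..AGE7 from the three example result slides.
scenario_a = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
scenario_b = [1.5, 2.0, 2.5, 2.6, 2.4, 2.5]
scenario_c = [0.92, 1.2, 1.6, 2.6, 4.8, 7.5]

for label, ors in (("A", scenario_a), ("B", scenario_b), ("C", scenario_c)):
    steps = successive_beta_steps(ors)
    print(label, [round(s, 2) for s in steps])
```

Roughly equal steps are consistent with linearity in the logit. Steps that shrink toward zero or change sign suggest the odds ratio is not constant per category.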
30
Dummy variables
If the exposure-disease relationship is not linear in the logit, it may be advisable to create new dummy variables to use in modeling
  Choose categories (including the designation of reference category) in some meaningful way, based on theoretical considerations and sample size
  This is in contrast to choosing standard-width categories to assess log-linearity
31
Dummy variables
Final notes:
  Any simple dichotomous variable is an example of a dummy variable
  Also known as “nominal” variables
    Three types of variables: continuous, ordinal, nominal