POPLHLTH 304 Regression (modelling) in Epidemiology Simon Thornley (Slides adapted from Assoc. Prof. Roger Marshall)

Which method does not control for confounding? A) Stratification B) Exclusion criteria C) Regression modelling D) Objective assessment of outcomes

Observational epidemiology: Studies in epidemiology are usually “observational”. Myriad factors determine the occurrence of disease. Separating the effects of specific factors from those of others (confounding variables) is often difficult. Regression models are a useful alternative to stratification.

Stratification difficulty: There are often too many confounders, so stratifying leads to too many strata, e.g. 4 categories of age, 2 of sex and 4 of ethnicity give 4 × 2 × 4 = 32 strata. Empty cells are problematic. We need a better, more statistically efficient way to deal with the problem: we want to control for (many) possible confounding variables while estimating the effect (relative risk or odds ratio) of an exposure of interest. Building statistical models is one solution.
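
The growth in the number of strata can be sketched in a few lines of Python. Only the category counts (4, 2, 4) come from the slide; the category labels and the cohort size of 200 are made-up assumptions for illustration.

```python
from itertools import product

# Labels are hypothetical; only the counts (4 x 2 x 4) come from the slide.
age_groups = ["<40", "40-54", "55-69", "70+"]
sexes = ["female", "male"]
ethnic_groups = ["A", "B", "C", "D"]

# Every combination of the three confounders is one stratum.
strata = list(product(age_groups, sexes, ethnic_groups))
n_strata = len(strata)

# Hypothetical cohort size, to show how thinly the data are spread.
n_subjects = 200
mean_per_stratum = n_subjects / n_strata

print(n_strata, mean_per_stratum)
```

With an average of only about six subjects per stratum in this hypothetical cohort, some strata would be likely to contain no cases, or no subjects at all.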

What is a [statistical regression] model? It is usually regarded as a formula that relates an outcome Y to one or more predictors (exposures) $X_1$, $X_2$, … of Y. The formula imposes a framework that we assume describes how Y is related to $X_1$, $X_2$, … in the real world. The model is specified in terms of unknown parameters, which are estimated from data (“model fitting”).

Linear (regression) model: We may often consider that “Y increases with X”, e.g. blood pressure increases with age. We may also consider that it does so “linearly”. Data seem to support this, though with much variability.

Straight line model (simple linear regression model) for how Y “depends on X”. [Figure: straight line drawn through a scatter of points; horizontal axis X, vertical axis Y.] This is the model structure or framework. Fitting to data involves drawing a “good fitting” line through the points; the line gives the mean Y for a given X, E(Y|X).
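
One common way to choose a “good fitting” line is ordinary least squares. The slide does not name a fitting criterion, so treat that choice as an assumption. The sketch below uses made-up age and blood pressure values, and the function name `fit_line` is illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the line E(Y|X) = b0 + b1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-products / sum of squares of X about its mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: age (years) and systolic blood pressure (mmHg).
ages = [30, 40, 50, 60, 70]
sbp = [118, 125, 129, 138, 140]

b0, b1 = fit_line(ages, sbp)
# Fitted mean blood pressure at age 55, i.e. E(Y | X = 55).
mean_sbp_at_55 = b0 + b1 * 55
print(b0, b1, mean_sbp_at_55)
```

Here the fitted slope is the estimated increase in mean blood pressure per year of age in these illustrative data.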

Regression: the relationship between X and Y. Y “depends” on X (rather than X depending on Y). Y is the dependent (outcome, disease) variable. X is the independent (exposure, predictor, covariate) variable.

Consider 2 potential predictors of Y, say $X_1$ and $X_2$. We can plot the data as scatter points in a 3-dimensional space. [Figure: 3-D scatter plot with axes Y, $X_1$ and $X_2$.]

The analog of a line in 2-D is a plane in 3-D. [Figure: plane fitted through the points, with axes Y, $X_1$ and $X_2$.]

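The straight line model with just $X_1$ is $E(Y \mid X_1) = \beta_0 + \beta_1 X_1$. Extending this to a plane gives $E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. Or, further, $E(Y \mid X_1, X_2, \dots) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$ Here E(Y|…) means “average Y given …”.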

Binary Y in epidemiology: In epidemiology, Y is often a binary disease/no disease outcome. $X_1$, $X_2$ etc. are risk factors for the disease, one of which may be an exposure of interest and the others confounders.

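Logistic model: we need to modify the linear model to account for a binary Y, the occurrence of disease D. Again, information on $X_1$, $X_2$, … is collapsed into a risk score $Q = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$, but the relationship between the probability of disease and Q now follows the logistic formula: $P(D \mid X_1, X_2, \dots) = e^{Q} / (1 + e^{Q})$.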

The logistic regression formula is of the form: Probability of CHD $= e^{Q} / (1 + e^{Q})$, where Q is a weighted sum of risk factors (a linear score). For example: Q = -5.32 + 1.09 × SMOKE + 0.41 × SEX, where SEX = 1 if man, 0 if woman, and SMOKE = 1 if smokes, 0 if not. The values -5.32, 1.09 and 0.41 are estimated from the data and are the “beta-coefficients”.

The model gives a probability for each of the 4 combinations. Smoking man: Q = -5.32 + 1.09 × 1 + 0.41 × 1 = -3.82, Prob $= e^{-3.82} / (1 + e^{-3.82}) = 0.0214$. Nonsmoking man: Q = -5.32 + 1.09 × 0 + 0.41 × 1 = -4.91, Prob = 0.00732. Nonsmoking woman: Q = -5.32, Prob = 0.00486. Smoking woman: Q = -5.32 + 1.09 = -4.23, Prob = 0.01434.
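
The four probabilities can be reproduced with a short Python sketch. It uses the intercept -5.32 that appears in the worked calculations; the function and constant names are illustrative and not part of the slides.

```python
import math

# Beta-coefficients from the worked example.
INTERCEPT, B_SMOKE, B_SEX = -5.32, 1.09, 0.41

def linear_score(smoke, sex):
    """Risk score Q for SMOKE (1 = smokes) and SEX (1 = man)."""
    return INTERCEPT + B_SMOKE * smoke + B_SEX * sex

def chd_probability(smoke, sex):
    """Logistic formula: e^Q / (1 + e^Q)."""
    q = linear_score(smoke, sex)
    return math.exp(q) / (1 + math.exp(q))

probabilities = {
    "smoking man": chd_probability(1, 1),
    "nonsmoking man": chd_probability(0, 1),
    "nonsmoking woman": chd_probability(0, 0),
    "smoking woman": chd_probability(1, 0),
}

for group, p in probabilities.items():
    print(f"{group}: {p:.5f}")
```

The printed values agree with the slide to within rounding of the last digit.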

Relative risk estimates: The RR for smoking (in men) is 0.0214/0.00732 = 2.92. The RR for smoking (in women) is 0.01434/0.00486 = 2.95. Notice that these are both approximately $e^{1.09} = 2.97$, i.e. taking the exponential of a variable's beta-coefficient estimates its RR. (Strictly, $e^{1.09} = 2.97$ is the disease odds ratio, but this is approximately equal to the RR when the disease is rare.)
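
The comparison between the relative risk and the odds ratio can be checked numerically. This is a minimal sketch using the same coefficients (intercept -5.32); the helper names are illustrative.

```python
import math

# Beta-coefficients from the worked example.
INTERCEPT, B_SMOKE, B_SEX = -5.32, 1.09, 0.41

def risk(smoke, sex):
    """Probability of disease from the logistic formula."""
    q = INTERCEPT + B_SMOKE * smoke + B_SEX * sex
    return math.exp(q) / (1 + math.exp(q))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Relative risk: ratio of probabilities, smokers vs nonsmokers.
rr_men = risk(1, 1) / risk(0, 1)
rr_women = risk(1, 0) / risk(0, 0)

# Odds ratio: ratio of odds, smokers vs nonsmokers.
or_men = odds(risk(1, 1)) / odds(risk(0, 1))
or_women = odds(risk(1, 0)) / odds(risk(0, 0))

print(f"RR men {rr_men:.2f}, RR women {rr_women:.2f}")
print(f"OR men {or_men:.2f}, OR women {or_women:.2f}, exp(beta) {math.exp(B_SMOKE):.2f}")
```

The odds ratio equals exp(1.09) in both sexes, because the model contains no interaction term, while the relative risks are slightly smaller and differ a little between men and women. Computed from unrounded probabilities the RR in men is closer to 2.93; the slide's 2.92 comes from dividing the rounded probabilities.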

Why the logistic formula? Answer: P(D|…) is always between 0 and 1, whatever the value of Q, i.e. it behaves as a probability should.
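
A quick numerical check illustrates the point. The sketch below evaluates the logistic formula over a range of scores; the range -20 to 20 is an arbitrary choice for illustration, and the two-branch form is only there to avoid floating-point overflow for large Q.

```python
import math

def logistic(q):
    """e^Q / (1 + e^Q), written to avoid overflow for large |Q|."""
    if q >= 0:
        return 1 / (1 + math.exp(-q))
    e = math.exp(q)
    return e / (1 + e)

scores = list(range(-20, 21, 5))
probs = [logistic(q) for q in scores]

for q, p in zip(scores, probs):
    print(f"Q = {q:4d}  P = {p:.6f}")
```

However negative or positive the score, the result stays inside (0, 1) and rises steadily with Q, whereas a straight line in Q would eventually fall below 0 or exceed 1.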

We can include as many variables in Q as we like: Q = -5.45 + 1.23 SMOKE + 0.31 SEX + 0.124 AGE - 0.2 ETHNIC … but the model may be too ambitious, i.e. can a single model really be expected to account accurately for the effects of numerous variables?

Logistic model in epidemiology, controlling for confounding: Y is the occurrence of disease in a cohort study. $X_1$ is a binary exposure of interest. $X_2$, $X_3$, … are confounding risk factors. $\beta_1$ is the effect of $X_1$ “controlling” for the effects of $X_2$, $X_3$ etc.

Relative risk: In fact $e^{\beta_1}$ is approximately the relative risk of $X_1$ (strictly it is the odds ratio). It is assumed to be the same for all values of $X_2$, $X_3$, i.e. there is no effect modification/interaction. The RR is (assumed to be) the same whatever $X_2$, $X_3$ etc., e.g. if $X_1$ is smoking, $X_2$ is age and $X_3$ is alcohol, the RR for smoking is the same whatever the age or alcohol intake.

Case-control studies: The development so far is for cohort studies (since probabilities of disease P(D|…) are estimable in a cohort study), but the model can be used for case-control studies too (even though probabilities of disease are not estimable). We can still use $e^{\beta_1}$ as the RR estimate.
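
Why the model still works for case-control data can be sketched numerically. The example assumes that selection into the study depends only on case status and not on exposure. The sampling fractions (all cases, 1% of controls) are made up, and the coefficients are those of the smoking example for women.

```python
import math

def risk(q):
    """Logistic formula: probability of disease for score Q."""
    return 1 / (1 + math.exp(-q))

def odds(p):
    return p / (1 - p)

def sampled_risk(p, f_case, f_control):
    """Proportion of cases among sampled subjects with population risk p."""
    return f_case * p / (f_case * p + f_control * (1 - p))

INTERCEPT, B_SMOKE = -5.32, 1.09
F_CASE, F_CONTROL = 1.0, 0.01  # hypothetical sampling fractions

p_smoker, p_nonsmoker = risk(INTERCEPT + B_SMOKE), risk(INTERCEPT)
population_or = odds(p_smoker) / odds(p_nonsmoker)

cc_smoker = sampled_risk(p_smoker, F_CASE, F_CONTROL)
cc_nonsmoker = sampled_risk(p_nonsmoker, F_CASE, F_CONTROL)
case_control_or = odds(cc_smoker) / odds(cc_nonsmoker)

# Only the intercept changes: it shifts by log(F_CASE / F_CONTROL).
intercept_shift = math.log(odds(cc_nonsmoker)) - INTERCEPT

print(population_or, case_control_or, intercept_shift)
```

Under this assumption the case-control sampling inflates the apparent probabilities of disease, so they are no longer interpretable, but the odds ratio for smoking is unchanged.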

Logistic modelling advantages: can adjust for many confounders at once; beta coefficients give odds ratio estimates of relative risk, valid if the disease is rare; deals with “interactions” (effect modification) if necessary; easy to do on a computer; gives confidence intervals, P-values etc.; can be applied to case-control data.

Disadvantages: a model is just a model, not necessarily reality; it is a black box approach and one can lose touch with the data; it requires decisions (which variables to include in the model? how to code variables? continuous or dichotomised?); ORs are not valid as RRs for a non-rare disease (in a cohort).
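
The last point can be illustrated by holding the odds ratio fixed and varying how common the disease is. In this sketch the odds ratio is exp(1.09) from the smoking example, while the baseline risks are made-up values; `rr_from_or` is an illustrative helper name.

```python
import math

def rr_from_or(odds_ratio, baseline_risk):
    """Relative risk implied by an odds ratio at a given risk in the unexposed."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    exposed_risk = exposed_odds / (1 + exposed_odds)
    return exposed_risk / baseline_risk

odds_ratio = math.exp(1.09)
baseline_risks = [0.005, 0.05, 0.1, 0.3, 0.5]  # hypothetical risks in the unexposed
relative_risks = [rr_from_or(odds_ratio, p0) for p0 in baseline_risks]

for p0, rr in zip(baseline_risks, relative_risks):
    print(f"baseline risk {p0:.3f}: OR {odds_ratio:.2f}, RR {rr:.2f}")
```

With a rare disease the RR is close to the OR, but as the baseline risk rises the RR falls well below it, so reading the OR as an RR overstates the effect.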

Logistic regression is favoured in epidemiology because: A) It can be used to adjust for many confounders at once B) It enhances statistical power over stratification C) It results in an outcome that is constrained between zero and one (the domain of a probability).

How do you estimate an odds ratio from a logistic model? A) It is equal to the beta coefficient B) It is equal to the exponential of the beta coefficient C) It is equal to the logit of the sum of the products of the variables and the beta coefficients.

Which one of the following statements is true? A) The choice of independent and dependent variable in regression modelling is unimportant B) A regression model estimates the average value of the dependent variable, given the values of a number of independent variables C) Independent variables are outcomes and dependent variables are exposures.
