Introduction to confounding and DAGs

Professor Adrian Esterman

Good afternoon everyone. I often think of the data coming from a clinical trial as a mixture of signal and noise. As biostatisticians, one of our most important tasks is to maximise the signal and minimise the noise. But what is this noise? It actually has two components. The first component is RANDOM error. We cannot escape random error, but we can reduce it, primarily by increasing the sample size. [GIVE EXAMPLE OF DETERMINING AGE OF AUDIENCE BY SAMPLE SIZE] The second component is SYSTEMATIC error, or bias. Unfortunately, increasing the sample size does not reduce bias; we just end up with an even larger biased trial. [GIVE EXAMPLE OF DETERMINING AGE OF AUDIENCE IF ONLY WOMEN PICKED] So, we try to design trials to minimise bias if we can, and measure and adjust for it if we can’t make it smaller.
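The difference between the two kinds of error can be sketched with a small simulation. This is an illustration only, not part of the talk: the audience (5,000 men averaging 50 years and 5,000 women averaging 40 years) and the function name `mean_error` are invented for the example.

```python
import random
import statistics

random.seed(1)

# Hypothetical audience: men average 50 years, women average 40 years.
men = [random.gauss(50, 10) for _ in range(5000)]
women = [random.gauss(40, 10) for _ in range(5000)]
audience = men + women
true_mean = statistics.mean(audience)


def mean_error(pool, n, repeats=200):
    """Average absolute error of the sample mean over repeated samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(pool, n)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)


# Columns: sample size, error when sampling everyone, error when sampling women only.
for n in (10, 100, 1000):
    print(n, round(mean_error(audience, n), 2), round(mean_error(women, n), 2))
```

Under these assumptions the error of the random samples shrinks as the sample size grows, while the women-only samples stay roughly five years too low however many people are drawn.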

Bias: confounding bias
There are literally dozens of different types of bias that studies are prone to, and these can occur at any stage in the research process. However, arguably the most important is CONFOUNDING BIAS, the topic of this talk. I note that because of their design, randomised controlled trials are free from confounding bias, but may still be subject to other types.
Pannucci C, Wilkins E. Identifying and Avoiding Bias in Research. Plast Reconstr Surg. 2010 Aug; 126(2): 619–625.

Confounding: three definitions (classical, collapsibility, counterfactual)
So let’s take a closer look at confounding. There are three common definitions of confounding, namely the classical, collapsibility, and counterfactual definitions. The classical definition is the one most of us are taught, and the one seen in textbooks of epidemiology.

Classical approach
“Bias of the estimated effect of an exposure on an outcome due to the presence of a common cause of the exposure and the outcome” (Porta M. Dictionary of Epidemiology, 5th ed, 2008)
[Diagram: Confounder, Exposure, Outcome]
The word confounding comes from the Latin confundere, or Old French confondre, which mean mixing up. A confounder is a third variable that mixes up the association between exposure and outcome.

Classical approach
A variable is a confounder if three criteria are met: it must be associated with the exposure; it must be a causal risk factor for the outcome in the unexposed population; it must not be an intermediate cause of the outcome.
In the classical definition, three things must be true. Firstly, the variable must be associated with the exposure. This is usually shown by demonstrating an imbalance in the confounding variable between the exposure groups. Second, it must be a risk factor for the outcome in the unexposed population. This is often put much less precisely as having to be associated with the outcome. Finally, it cannot be on the causal pathway between exposure and outcome. If it is, then it is a mediating variable rather than a confounding variable.

Here is a well-known example of confounding
A study found a strong association between coffee drinking and pancreatic cancer. However, when smoking status was adjusted for, the association disappeared. In this example, if smoking were not a confounder, then a crude effect measure, such as the rate ratio between drinkers and non-drinkers of coffee, would be regarded as a suitable overall measure of effect. On the other hand, if smoking is a confounder, then the choice of an overall measure of effect is not as clear.

Collapsibility
A variable is confounding if: the effect measure is homogeneous across strata defined by the confounder; and the crude and adjusted effect measures are unequal.
The collapsibility definition of confounding is a bit less intuitive to understand. It comprises two parts. The first is the requirement that there is no effect modification. In other words, different strata of the potential confounding variable should have a similar association between exposure and outcome. The second part says that if you compare the crude measure of association between exposure and outcome with an adjusted measure based on the potential confounding variable, they should not be equal. This is known as lack of collapsibility.

Collapsibility: chemical exposure status (E) and lung cancer status (D)
Outcome      Exposed   Not exposed   Total
Cancer          27          14          41
No cancer       48          67         115
Total           75          81         156
Relative Risk = (27/75) / (14/81) = 2.1
Here we look at the ten-year risk for lung cancer in workers who may be exposed to a particular chemical compound.

Collapsibility: non-smokers
Outcome      Exposed   Not exposed   Total
Cancer           1           2           3
No cancer       24          48          72
Total           25          50          75
Relative Risk = (1/25) / (2/50) = 1.0
Now let’s see what happens if we stratify by smoking status. Here are the non-smokers. The relative risk has reduced from 2.1 to 1.0. Now let’s look at the smokers.

Collapsibility: smokers
Outcome      Exposed   Not exposed   Total
Cancer          26          12          38
No cancer       24          19          43
Total           50          31          81
Relative Risk = (26/50) / (12/31) = 1.3
The relative risk is now 1.3. In other words, the relative risk in the two strata was very similar, close to 1, whereas the crude relative risk was 2.1. Smoking status is clearly a confounder by the collapsibility definition.
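The arithmetic on these three slides can be checked with a few lines of Python. This is a minimal sketch using the counts from the tables; the function and variable names are my own, not part of the talk.

```python
def relative_risk(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk in the exposed divided by risk in the unexposed."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


# Each stratum: (cases exposed, total exposed, cases unexposed, total unexposed)
strata = {
    "non-smokers": (1, 25, 2, 50),
    "smokers": (26, 50, 12, 31),
}

# Adding the strata together recovers the crude (collapsed) table.
crude_counts = tuple(sum(column) for column in zip(*strata.values()))
crude_rr = relative_risk(*crude_counts)
stratum_rr = {name: relative_risk(*counts) for name, counts in strata.items()}

print("crude counts:", crude_counts)
print("crude RR:", round(crude_rr, 1))
print("stratum RRs:", {name: round(rr, 1) for name, rr in stratum_rr.items()})
```

The stratum-specific relative risks (1.0 and 1.3) are close to each other, while the crude value (2.1) is not close to either, which is the lack of collapsibility described above.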

Counterfactual
We recruit six people with painful osteoarthritis of their knees into a study. We choose three at random, and give them the same dose of Panadol. The other three get nothing. One hour later, we ask them if their knees are still painful.
The third definition of confounding is the counterfactual approach. I will try to explain it with an example. We have six patients with osteoarthritis of the knee. We measure their level of pain. We choose three at random, and give them Panadol. One hour later we measure pain levels in all six subjects. This is a typical randomised controlled trial design.

Counterfactual
[Table: Subject, Took Panadol, Pain gone; rows for Alice, Ben, Charlie, David, Edward and Fred. Surviving cells: Alice, No; Ben, Yes. The remaining cell values were lost in extraction.]
Here are the results after one hour. Ben was given the Panadol, and his pain disappeared. Does that mean that taking Panadol relieves pain? The answer is that we cannot tell from a single individual. If we knew what would have happened if Ben had not taken the Panadol, then we would have a better idea. This is called a counterfactual. In fact, apart from cross-over trials, for any individual, we only know the results of their exposure or non-exposure, and not the counterfactual. That being the case, rather than looking at individuals, we must compare the average effect of those exposed against those not exposed. In a randomised controlled trial, these are comparable, with the only difference between the two groups being the exposure. However, in observational studies, the non-exposed group may not be a good counterfactual population.

Counterfactual unexposed cohort
“Confounding is present if the substitute population imperfectly represents what the target would have been like under the counterfactual condition” (Maldonado & Greenland, Int J Epi 2002;31:422-29)
[Diagram: counterfactual unexposed cohort; substitute, unexposed cohort]
Since, apart from crossover trials, we do not have a counterfactual for individuals, we have to rely on population-level results. For a randomised controlled trial, the real unexposed cohort is expected to match the counterfactual unexposed cohort because of the randomisation. However, for observational studies, the unexposed cohort might not be a good representation, thus introducing confounding.
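The idea of a substitute population can be made concrete with simulated potential outcomes. The sketch below is an illustration under invented assumptions: disease severity is a hypothetical prognostic factor, and all probabilities are made up. In a simulation we can see both potential outcomes for everyone, which is never possible in a real study.

```python
import random

random.seed(11)

# Each person: (severe disease?, pain gone if untreated?, pain gone if treated?)
population = []
for _ in range(50000):
    severe = random.random() < 0.5
    gone_untreated = random.random() < (0.1 if severe else 0.5)
    gone_treated = random.random() < (0.4 if severe else 0.8)
    population.append((severe, gone_untreated, gone_treated))

# True average effect, using both potential outcomes for every person.
true_effect = sum(p[2] - p[1] for p in population) / len(population)


def observed_difference(treated_flags):
    """Risk difference using only the outcome each person actually reveals."""
    treated = [p[2] for p, t in zip(population, treated_flags) if t]
    untreated = [p[1] for p, t in zip(population, treated_flags) if not t]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


# Randomised trial: a coin toss decides who is treated.
randomised = [random.random() < 0.5 for _ in population]
# Observational study: severe cases are more likely to take the drug.
self_selected = [random.random() < (0.8 if p[0] else 0.2) for p in population]

rct_estimate = observed_difference(randomised)
observational_estimate = observed_difference(self_selected)
print(round(true_effect, 2), round(rct_estimate, 2), round(observational_estimate, 2))
```

With randomisation the untreated group stands in well for what the treated group would have experienced without the drug, so the estimate is close to the true effect. With self-selection the untreated group is a poor substitute, and the estimate is badly off.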

Control of confounding
Can be controlled: At the design stage As part of the analysis Confounding can be eliminated or minimised by careful choice of study design. It can be adjusted for as part of the statistical analysis.

Minimising confounding at the design stage
Three approaches Randomization Restriction Matching When designing a study, there are three approaches we can use to minimise the possibility of confounding.

Randomization Breaks links between exposure and confounders In a randomised controlled trial, randomisation maximises the chance of an even balance of potential confounding variables between study groups. This has the effect of breaking any link between exposure and confounding variables. Importantly, this is true for both measured and unmeasured confounding variables. However, even in randomised controlled trials, there may still be an imbalance in confounding variables between study groups purely by chance.

Restriction
Confounding cannot occur if the distribution of the potential confounding variables is similar across exposure or disease categories.
We can try to ensure an even balance of a confounding variable across exposure categories by restricting study subjects to only those falling within specific levels of the confounding variable. For example, an investigator might select only subjects of exactly the same age or the same sex. Clearly, the main problems with restricting recruitment into a study are the impact on sample size, and whether or not the results can be generalised. Restriction is also tricky when there are many confounding variables. Many drug trials are now using adaptive sampling methods which use sequential restriction to ensure an even balance of covariates.

Matching
Confounding cannot occur if the distribution of the potential confounding variables is similar across exposure or disease categories.
Matching is commonly used in case-control studies, and sometimes in cohort studies. It can be either pairwise or frequency matching. It also has implications for sample size, in that often a suitable match cannot be found. Matched data often require a special type of statistical analysis to allow for the lack of independence of observations. Like restriction, matching is also problematic when there are many variables that need matching on.

Minimising confounding at the analysis stage
Two simple procedures: Stratified analyses Multivariable analysis As long as bias can be measured, it can be adjusted for, and this is certainly true for confounding bias. There are two relatively easy statistical solutions to adjusting for confounding, stratified analysis and multivariable analysis.

Stratification
Confounding cannot occur if the distribution of the potential confounding variables is similar across exposure or disease categories.
Here we use the same principle as before, but at the analysis stage, rather than the design stage. Suppose that gender is a confounding variable. We would split the data into two strata, males and females, and then evaluate the exposure-outcome association within each stratum. Then, within each stratum, the confounder cannot confound because it does not vary.

Statistical analysis: stratification
[Figure: rate of Down’s syndrome per 1,000 live births, by birth order]
Here we see a graph of the relationship between birth order and the rate of Down’s syndrome births. Clearly, later order babies have a higher risk of being born with Down’s syndrome. However, mother’s age is a confounder here. Let’s repeat the above graph, but for different maternal age groups.

Stratification: cases of Down’s syndrome by birth order and mother's age
[Figure: cases per 1,000 by birth order (1 to 5), for maternal age groups < 20, 20-24, 25-29, 30-34, 35-39 and 40+]
Here we see that within each maternal age stratum, the relationship disappears. We can obtain an adjusted effect measure by undertaking a stratified analysis using a Mantel-Haenszel approach. Stratified analysis works best when there are few strata (i.e. if only 1 or 2 confounders have to be controlled).
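As a rough illustration of a Mantel-Haenszel adjusted estimate: the Down’s syndrome counts are not given in the talk, so this sketch reuses the lung cancer tables stratified by smoking status from the collapsibility slides. The function name is my own, and the formula is the standard Mantel-Haenszel pooled risk ratio for cohort data.

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled risk ratio for cohort (cumulative incidence) data.

    Each stratum is (cases exposed, total exposed, cases unexposed, total unexposed).
    """
    numerator = 0.0
    denominator = 0.0
    for cases_exp, n_exp, cases_unexp, n_unexp in strata:
        stratum_total = n_exp + n_unexp
        numerator += cases_exp * n_unexp / stratum_total
        denominator += cases_unexp * n_exp / stratum_total
    return numerator / denominator


smoking_strata = [
    (1, 25, 2, 50),    # non-smokers
    (26, 50, 12, 31),  # smokers
]

adjusted_rr = mantel_haenszel_rr(smoking_strata)
print("Mantel-Haenszel adjusted RR:", round(adjusted_rr, 2))
```

For these data the adjusted relative risk comes out at about 1.3, well below the crude value of 2.1.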

Multivariable analysis
If the number of potential confounders is large, multivariable analysis offers a suitable solution.
In multivariable analysis, we add all possible confounding variables into the regression model. This method used to be called multivariate analysis, but we changed the name to multivariable, to distinguish it from the situation where you have two or more dependent variables. We now have regression models to handle most situations where multivariable modelling is required. However, the approach is not without its assumptions and limitations. Firstly, there should be an adequate number of observations for each parameter in the model to avoid over-fitting. Secondly, if the covariates are highly correlated amongst themselves, we can run into ill-conditioning. One way round this, as we shall see shortly, is adjustment using a propensity score.

Treatment effects: propensity scores, inverse probability weights, instrumental variables
Another technique which is gaining in popularity is the use of treatment effects. Here, as well as examining the association between exposure and outcome, we look at and use predictors of being in the exposed group. For a randomised controlled trial, this is clearly unnecessary, because randomisation is expected to give an even spread of potential confounders between exposure groups. However, for observational studies, there are likely to be predictors of exposure. Currently, there are three main ways of handling treatment effects: propensity score adjustment, inverse probability weighting, and instrumental variables.

Treatment effects Propensity scores
Replace collection of confounding covariates with a function of these covariates, called the propensity score. This score is then used just as if it were the only confounding covariate. Propensity scoring is a method of obtaining a single continuous smoothed summary score that can be used to control for a collection of confounding variables in a study. It is the probability that a person will be exposed given a set of observed covariates. The propensity score is usually obtained by logistic regression, where the dependent variable is whether or not the individual is in the exposed group, and the predictor variables are all the potential confounding variables. The predicted probability of being in the exposed group is the propensity score. An assumption of the propensity score approach is that all confounding variables have been measured. In the setting of an observational cohort study this is a problematic assumption when using exposure at study baseline. The propensity score can be used as a single covariate, or for matched analysis.

Treatment effects Inverse probability weighting
Use the propensity score to undertake a weighted analysis.
Inverse probability weighting also uses the propensity score, but in a very different way. Suppose that age is a confounding variable, and the average age of the exposed group is 50 years, but only 40 years in the non-exposed group. We first create a propensity score based on a single covariate, age. If an individual in the exposed group has an age of 35 years, well below the expected mean age of 50 years, then their propensity score might be 0.2; in other words, there is a low predicted probability of them being in the exposed group. The weighting for that individual would therefore be 1 divided by 0.2, or 5, so that individual would be represented in the data set 5 times. This would have the effect of lowering the mean age in the exposed group towards that of the non-exposed group, thus making the two groups more comparable. The actual weighting is found by taking the inverse of the propensity score for the exposed group, and the inverse of (1 – propensity score) for the non-exposed group.
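The weighting rule can be tried on a small example. The sketch below does not use the age example from the talk, for which no data are given. Instead it reuses the lung cancer counts from the collapsibility slides, with smoking status as the single covariate. With one binary covariate the propensity score can be read straight from the table (the proportion exposed in each smoking group), so no logistic regression is needed. All variable names are my own.

```python
# Each stratum: (cases exposed, total exposed, cases unexposed, total unexposed)
strata = {
    "non-smokers": (1, 25, 2, 50),
    "smokers": (26, 50, 12, 31),
}

# Running totals of weighted cases and weighted people in each exposure group.
weighted = {"exposed": [0.0, 0.0], "unexposed": [0.0, 0.0]}

for cases_exp, n_exp, cases_unexp, n_unexp in strata.values():
    # Propensity score: probability of being exposed given smoking status.
    propensity = n_exp / (n_exp + n_unexp)
    weight_exposed = 1 / propensity
    weight_unexposed = 1 / (1 - propensity)
    weighted["exposed"][0] += weight_exposed * cases_exp
    weighted["exposed"][1] += weight_exposed * n_exp
    weighted["unexposed"][0] += weight_unexposed * cases_unexp
    weighted["unexposed"][1] += weight_unexposed * n_unexp

risk_exposed = weighted["exposed"][0] / weighted["exposed"][1]
risk_unexposed = weighted["unexposed"][0] / weighted["unexposed"][1]
ipw_rr = risk_exposed / risk_unexposed
print("weighted group sizes:", round(weighted["exposed"][1]), round(weighted["unexposed"][1]))
print("IPW relative risk:", round(ipw_rr, 2))
```

After weighting, each exposure group has the smoking distribution of the whole cohort of 156, and the relative risk falls from the crude 2.1 to about 1.3.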

Treatment effects: instrumental variables
An instrumental variable (Z) is a proxy for the Exposure variable (X). It allows us to estimate the average effect of Exposure (X) on Outcome (Y) in the study population from two effects of Z: the average effect of Z on Y, and the average effect of Z on X.
Assoc_ZY = Assoc_ZX × Assoc_XY
An instrumental variable (IV) is one that is associated with the exposure, is not associated with confounding variables, and is associated with the outcome only through the exposure variable. If you collect data on the IV and are willing to make some additional assumptions, then you can estimate the average effect of X on Y, regardless of whether you measured the covariates. We can write the Z-Y association as a product of the Z-X and X-Y associations, and solve this equation for the X-Y association. Importantly, IVs can adjust for both measured and unmeasured covariates. For example, in a study of the effect of BMI, assessed at age 7 years, on the risk of developing asthma, genotypes in the FTO gene were used as an IV. These are known to be associated with BMI, could only be associated with asthma through BMI, and are not associated with any covariates because of their random distribution in the population. IV analysis is usually undertaken by two-stage least squares. Studies using IVs are not common because of the difficulty of finding a suitable IV.
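Solving the product relationship for the X-Y association gives the ratio of the Z-Y association to the Z-X association. The simulation below is a sketch under invented assumptions (a binary instrument, an unmeasured confounder U, and a true effect of 2.0); none of the numbers come from the talk. Covariances are used as the measure of association.

```python
import random

random.seed(7)
N = 20000
TRUE_EFFECT = 2.0  # assumed causal effect of X on Y

z, x, y = [], [], []
for _ in range(N):
    zi = random.choice([0, 1])   # instrument, e.g. a genotype
    u = random.gauss(0, 1)       # unmeasured confounder of X and Y
    xi = 1.0 * zi + u + random.gauss(0, 1)
    yi = TRUE_EFFECT * xi + 3.0 * u + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)


def covariance(a, b):
    """Population covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


# Naive regression slope of Y on X: distorted by the unmeasured confounder.
naive_slope = covariance(x, y) / covariance(x, x)
# IV estimate: Assoc_XY = Assoc_ZY / Assoc_ZX.
iv_estimate = covariance(z, y) / covariance(z, x)
print(round(naive_slope, 2), round(iv_estimate, 2))
```

The naive slope is well above 2 because U pushes X and Y in the same direction, while the ratio estimate is close to the assumed effect even though U was never measured. With a single instrument and no covariates, two-stage least squares gives the same ratio.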

Directed Acyclic Graphs (DAGs)
DAGs consist of two elements: variables (or nodes) and unidirectional arrows (or paths)
[Diagram: nodes A, B, C, D and E]
Now we come to the main part of my talk, which is an introduction to Directed Acyclic Graphs, or DAGs. Surprisingly, DAGs consist of just two elements: variables (or nodes in mathematical speak), and unidirectional arrows or paths.

Arrows
Arrows go from exposure to outcome (E -> O). Bidirectional arrows are not allowed.
These graphs are called DIRECTED because arrows are only allowed to go in one direction. One could argue that this is not very realistic because of things like reverse causation, but it does keep it simple, and ensures that things go forward over time.

Arrows Circularity is not allowed X E C O
The graphs are called ACYCLIC because no Outcome can be a cause of its own Exposure, and by that route, a cause of itself.

A natural path between two variables
[Diagram: nodes A, C, D and Z]
A natural path between two variables is a sequence of arrows, regardless of their direction, that connects them. Think of it as a path you can walk along, or a bridge. Here we see a natural path between A and Z.

A causal path between two variables (also called “directed path”)
[Diagrams: example graphs with nodes A, B, C, D and Z]
A causal or directed path between two variables is a natural path in which all of the arrows point in the same direction. For example, in the right hand corner there are 4 causal paths between A and Z. They are:
A->Z
A->B->Z
A->C->D->Z
A->B->C->D->Z
A->C<-B->Z is not a causal path because one of the arrows is pointing in the wrong direction, but it is a natural path. In this same diagram, B and C are known as children of A, and A is the parent of B and C. D and Z are descendants of A, and A is an ancestor of D and Z.
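The four causal paths can be found mechanically. In this minimal sketch the arrows are read off the paths listed in the talk; the dictionary layout and function names are my own.

```python
# Each node maps to its children (the nodes its arrows point to).
dag = {
    "A": ["B", "C", "Z"],
    "B": ["C", "Z"],
    "C": ["D"],
    "D": ["Z"],
    "Z": [],
}


def directed_paths(graph, start, end, path=None):
    """Return every path from start to end that follows the arrows."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for child in graph[start]:
        found.extend(directed_paths(graph, child, end, path))
    return found


def descendants(graph, node):
    """All nodes reachable from node by following arrows."""
    reached = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in reached:
            reached.add(current)
            stack.extend(graph[current])
    return reached


causal_paths = directed_paths(dag, "A", "Z")
for p in causal_paths:
    print("->".join(p))
print("descendants of A:", sorted(descendants(dag, "A")))
```

Because the search only ever follows arrows forwards, A->C<-B->Z is never produced: it is a natural path but not a causal one.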

Types of natural paths between Exposure (E) and Outcome (O)
Causal paths, confounding paths, colliding paths
There are three types of natural paths between Exposure and Outcome, and we will look at each in turn with an example.

Causal path E C O Here, the exposure E is causally related to the outcome (O) via the mediating variable (C) We have already defined a causal path. Here we see Exposure linked to outcome via the mediating variable C. Mediating variables are those which form part of a causal pathway.

Causal path example Sexual promiscuity Cervical cancer HPV
As an example, sexual promiscuity is a risk factor for Human Papilloma Virus, which itself is a risk factor for cervical cancer. So in this case, HPV is a mediating variable.

Confounding path C E O Here, C is a parent of both Exposure and Outcome. C is known as a confounding variable In a confounding path, the confounding variable is a parent of both Exposure and Outcome

Confounding path example
Smokes Drinks coffee Pancreatic cancer This is the same example we saw earlier. In DAG terminology, Smoking status is a parent of Coffee consumption and Pancreatic cancer

Colliding path C E O Here, C is a child of Exposure and Outcome. C is known as a collider Here we see a Colliding path. C is a child of both Exposure and Outcome, and is known as a collider

Colliding path example
[Diagram: Pneumonia, Ulcer, Admitted to hospital, Coughing, Abdominal pains]
Here we see that admission to hospital is a collider, as it is a child of both Pneumonia and having an Ulcer.

Backdoor path C O E A backdoor path is a non-causal path from E to O.
It usually starts with an arrow pointing toward the Exposure. It is a path that would remain if we were to remove any arrows pointing away from E (these are the potentially causal paths from E, sometimes called frontdoor paths). Backdoor paths between E and O generally indicate common causes of E and O. The simplest possible backdoor path is the simple confounding situation seen above.

Open path E O Any causal or directed path between Exposure and Outcome implies that Exposure is associated with Outcome. In the above case, we are implying that Exposure causes Outcome. The path is called “open” in the sense that an association between Exposure and Outcome is possible.

Open path E C O The path is also open if there are one or more mediating variables on the causal path.

Open path C E O If there is a backdoor path, this means that there are now two paths open: the direct causal path between Exposure and Outcome, and the indirect backdoor path through the confounding variable. This is what causes the bias. We can BLOCK the backdoor path by conditioning on the confounding variable C. We show this by a square around the conditioned variable. <CLICK> We condition on C by any of the strategies we discussed before, for example restriction, covariate adjustment, inverse probability weighting, etc.

Blocked path C O E A collider automatically blocks a path so that the only association is the direct causal path between Exposure and Outcome. However, if you condition on a collider, <click> it has the effect of opening the indirect path, thus potentially introducing bias.
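The effect of conditioning on a collider can be seen in a simulation of the hospital admission example. This is a sketch with invented numbers: pneumonia and ulcer are generated independently, each with 10% prevalence, and the admission probabilities are assumptions made for illustration.

```python
import random

random.seed(3)

# Each person: (pneumonia?, ulcer?, admitted to hospital?)
people = []
for _ in range(100000):
    pneumonia = random.random() < 0.10
    ulcer = random.random() < 0.10  # generated independently of pneumonia
    # Assumed admission probabilities: high with either condition, low otherwise.
    p_admit = 0.80 if (pneumonia or ulcer) else 0.05
    admitted = random.random() < p_admit
    people.append((pneumonia, ulcer, admitted))


def ulcer_risk_ratio(rows):
    """Risk of ulcer in people with pneumonia relative to those without."""
    with_pneumonia = [ulcer for pneumonia, ulcer, _ in rows if pneumonia]
    without_pneumonia = [ulcer for pneumonia, ulcer, _ in rows if not pneumonia]
    risk_with = sum(with_pneumonia) / len(with_pneumonia)
    risk_without = sum(without_pneumonia) / len(without_pneumonia)
    return risk_with / risk_without


rr_everyone = ulcer_risk_ratio(people)
# Restricting to admitted patients is one way of conditioning on the collider.
rr_admitted = ulcer_risk_ratio([row for row in people if row[2]])
print(round(rr_everyone, 2), round(rr_admitted, 2))
```

In the whole population the risk ratio is close to 1, as expected for two unrelated conditions. Among admitted patients a strong inverse association appears, purely because admission is a common effect of both.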

Collider (and confounder) are path-specific terms
[Diagram: nodes A, B, C, D and Z]
Notably, a variable called a collider (or a confounder) on one path need not be a collider (or a confounder) on another path. Here we see that C is a collider on (ABCDZ) and a confounder on (ACZ).

Summary: confounder versus collider
Main attribute: a confounder is a common cause; a collider is a common effect
Association: a confounder contributes to the association between its effects; a collider does not contribute to the association between its causes
Type of path: confounder, open path; collider, blocked path
Effect of conditioning:
Bias before conditioning? Confounder: yes, confounding bias. Collider: no
Bias after conditioning? Collider: colliding bias
Here is a summary of what we have learnt so far. A confounder is a common cause of Exposure and Outcome. The association between Exposure and Outcome may be biased through the backdoor path that is left open, but this path can be blocked by conditioning on the confounder. A collider is a common effect of Exposure and Outcome, and blocks the indirect path, so cannot cause bias. Conditioning on a collider opens the backdoor path and may introduce bias.

Example Age SES Junk diet BMI Diabetes
Here, our main interest is the association between Body Mass Index and the development of diabetes. Junk food diet is clearly a confounder on the backdoor path: BMI, Junk food diet, Diabetes. However, Junk food diet is also a collider on the path: BMI, Age, Junk food diet, SES and Diabetes. If we condition on Junk Food diet to block the backdoor path, we then open up a new path between Age and SES, and this may introduce bias. <Click>

Example Age SES Junk diet BMI Diabetes
However, if we adjust for Age and Junk food diet, or SES and Junk food diet, this blocks the path that was opened by conditioning on Junk food diet.
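The blocking rules can be applied mechanically to this example. The sketch below assumes the arrows implied by the description of the slide (Junk diet into BMI and Diabetes, Age into BMI and Junk diet, SES into Junk diet and Diabetes, BMI into Diabetes); the slide image itself is not available, so treat the edge list as an assumption. Function names are my own.

```python
# Assumed arrows, written as (cause, effect).
edges = {
    ("Junk diet", "BMI"), ("Junk diet", "Diabetes"),
    ("Age", "BMI"), ("Age", "Junk diet"),
    ("SES", "Junk diet"), ("SES", "Diabetes"),
    ("BMI", "Diabetes"),
}


def neighbours(node):
    """Nodes joined to node by an arrow in either direction."""
    return {b if a == node else a for a, b in edges if node in (a, b)}


def natural_paths(start, end, path=None):
    """Every route from start to end, ignoring arrow direction."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    routes = []
    for nxt in neighbours(start):
        if nxt not in path:
            routes.extend(natural_paths(nxt, end, path))
    return routes


def descendants(node):
    """All nodes reachable from node by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found


def is_open(path, conditioned):
    """Check each intermediate variable on the path against the blocking rules."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if not ({mid} | descendants(mid)) & conditioned:
                return False
        elif mid in conditioned:
            # Conditioning on a non-collider blocks the path.
            return False
    return True


def open_backdoor_paths(exposure, outcome, conditioned):
    """Open paths that start with an arrow pointing into the exposure."""
    return [p for p in natural_paths(exposure, outcome)
            if (p[1], exposure) in edges and is_open(p, set(conditioned))]


for adjust in ([], ["Junk diet"], ["Junk diet", "Age"], ["Junk diet", "SES"]):
    print(adjust, open_backdoor_paths("BMI", "Diabetes", adjust))
```

Under the assumed arrows, adjusting for nothing leaves three backdoor paths open. Adjusting for Junk diet alone closes those three but opens the path through Age and SES. Adding either Age or SES leaves no open backdoor path, which matches the conclusion above.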

Here is the same diagram, which I have drawn in the DAG analysis program DAGitty.

Conclusions
Common methods to assess confounding can lead to bias by: omitting important confounders; adjusting for colliders.
DAGs are used to: identify confounders and colliders; clearly state how you believe the system works.
Omitting important confounders leaves residual bias. Adjusting for non-confounders can introduce bias.
