
Decision Analysis Lecture 13


1 Decision Analysis Lecture 13
Tony Cox. My course web site:

2 Five Challenges
Learning: How to infer ("learn") correct DAG models from observational data?
- Under what conditions is this possible? ("Identifiability")
- When possible, how to do it? (Learning algorithms)
- How to characterize remaining uncertainties?
Inference: How to use a DAG model to infer probable values of unobserved variables from values of observed variables?
Prediction: How to use a DAG to predict how changing X would change Y? ("Manipulative causality")
Attribution: How to use a DAG to attribute effects on Y to X? How to define and estimate direct, total, controlled direct, natural direct, natural indirect, and mediated effects of X on Y?
Generalization: How to generalize answers from study population(s) to other populations? (The "transportability" or "external validity" question)

3 Key steps
1. Use Bayesian network (BN) learning algorithms to identify plausible causal DAG structures
   - What DAG structure(s) best explain the data?
   - What parts of the DAG structure are identified by the data?
2. Quantify conditional probability tables (CPTs) using BN learning algorithms or CART trees
   - Estimate direct effects of parents at each node
3. Validate quantified causal BNs
4. Use causal BNs to answer practical questions: prediction, attribution, optimization, explanation, inference, generalization

4 Learning causal DAGs from data (“Structure learning”)

5 Example: What affects what? How much? Which interactions matter?
Data: CDC Behavioral Risk Factor Surveillance System (BRFSS) and EPA data (Cox, )

6 Data → Model → Predictions
Causal analytics: algorithms to learn causal models from data and apply them to answer queries.
Pipeline: Data → (causal analytics) → Model → (Monte Carlo simulation) → Predictions

7 How to get from data to causal predictions… objectively?
Probabilistic causal prediction: doing X will change the conditional probability distribution of Y, given covariates Z.
Goal: manipulative causation (vs. associational, counterfactual, predictive, computational, etc.)
Data: observed (X, Y, Z) values
Challenge: How will changing X change the probabilities of Y values?

8 Model uncertainty undermines valid causal predictions from data
How would cutting exposure concentration C in half affect future response rate R?

Community | Concentration, C | Income, I | Response rate, R
A         | 4                | 100       | 8
B         | 8                | 60        | 16
C         | 12               | 20        | 24

Model 1: R = 2C. If this is a valid structural equation (causal model), then ∆R = 2∆C. The corresponding DAG is: C → R

9 Non-identifiability: Multiple models fit data but make different predictions
How would cutting exposure concentration C in half affect future response rate R? There is no way to determine this from the historical data alone.

Community | Concentration, C | Income, I | Mortality rate, R
A         | 4                | 100       | 8
B         | 8                | 60        | 16
C         | 12               | 20        | 24

Model 1: R = 2C (I = 140 – 10C)
Model 2: R = 35 – 0.5C – 0.25I
Model 3: R = 28 – 0.2I (C = 14 – 0.1I)
All three models fit the data exactly, so decreasing C could decrease R, increase it, or leave it unchanged.
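A minimal check in base R makes the non-identifiability concrete: all three models reproduce the observed table exactly, so these data points cannot distinguish among them.

```r
# All three candidate models fit the observed (C, I, R) data perfectly.
obs <- data.frame(C = c(4, 8, 12), I = c(100, 60, 20), R = c(8, 16, 24))

m1 <- 2 * obs$C                          # Model 1: R = 2C
m2 <- 35 - 0.5 * obs$C - 0.25 * obs$I    # Model 2: R = 35 - 0.5C - 0.25I
m3 <- 28 - 0.2 * obs$I                   # Model 3: R = 28 - 0.2I

all(m1 == obs$R)   # TRUE
all(m2 == obs$R)   # TRUE
all(m3 == obs$R)   # TRUE
```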

10 Implications
Ambiguous associations obscure the dependency of outputs on decision inputs, making sound modeling and causal inference more difficult.
Causal conclusions are not purely data-driven (hypothesis → data → conclusion). Instead, they conflate data and modeling assumptions (hypothesis/model/assumptions → conclusions ← data).
This undermines sound (objective, trustworthy, well-justified, independently repeatable, verifiable) inference when conclusions rest on untested assumptions.
Ambiguous associations are common in practice.
Wanted: a way to reach valid, robust (model-independent) conclusions from data, using methods that can be fully specified before seeing the data.

11 Example: Identifiability of causal DAG structures from data
Principle: effects are not conditionally independent of their direct causes. This can be used as a screen for possible causes in a multivariate database.
Suppose we had an "oracle" (e.g., a perfect CART tree or BN learning algorithm) for detecting conditional independence. Which of these could it distinguish among?
1. X → Z → Y (e.g., exposure → lifestyle → health)
2. Z → X → Y (e.g., lifestyle → exposure → health)
3. X → Y ← Z (e.g., exposure → health ← lifestyle)
4. X → Y → Z (e.g., exposure → health → lifestyle)
5. X ← Z → Y (e.g., exposure ← lifestyle → health)

12 Identifiability of causal DAG structures from data
X  Z  Y (e.g., exposure  lifestyle  health) Z  X  Y (e.g., lifestyle  exposure  health) X  Y  Z (e.g., exposure  health  lifestyle) X  Y  Z (e.g., exposure  health  lifestyle) X  Z  Y (e.g., exposure  lifestyle  health) In 1 and 5, but not the rest, X and Y are conditionally independent given Z Markov equivalence class can be identified In 4, but not the rest, X and Z are conditionally independent given Y In 2, but not the rest, Z and Y are conditionally independent given X In 3, X and Z are unconditionally independent but conditionally dependent given Y

13 Partial identifiability of DAGs based on conditional independence
Conditional independence (CI) typically allows some, but not all, DAG structures to be rejected based on data.
Example: X → Y → Z and X ← Y ← Z imply the same CI relations (they are Markov-equivalent), so CI tests cannot tell them apart.
Testable implications of DAG structures based on CI can be enumerated using current software such as dagitty (Textor, 2015).
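A minimal sketch using the dagitty R package (Textor, 2015) shows the point: two Markov-equivalent chains yield identical lists of testable implications.

```r
# Enumerate testable CI implications of two Markov-equivalent chains.
library(dagitty)

g1 <- dagitty("dag { X -> Y ; Y -> Z }")   # X -> Y -> Z
g2 <- dagitty("dag { Z -> Y ; Y -> X }")   # X <- Y <- Z

impliedConditionalIndependencies(g1)   # X _||_ Z | Y
impliedConditionalIndependencies(g2)   # X _||_ Z | Y (identical output)
```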

14 Other principles for identifying DAG structure from data
- Likelihood principle (bnlearn score-based algorithms): choose the DAG model that maximizes the likelihood of the data
- Composition principle: if X → Y → Z, then dz/dx = (dz/dy)(dy/dx)
- Time-series analysis: information flows from causes to their effects over time (transfer entropy; Yin & Yao, 2016)
- Measurement-error analysis in nonlinear models: effect = f(cause) + error (LiNGAM software)
- Homogeneity/invariance principles (Li et al., 2015)

15 LiNGAM basic idea Plot of y = f(x) + error looks different from plot of y = f(x + error). The one with well-behaved residuals identifies the correct causal model if it effect = f(cause) + error (and f is nonlinear and/or error is non-Gaussian) Homoskedasticity Heteroskedasticity Shimizu et al., UAI,

16 CAT’s DAG structure learning using bnlearn package (default settings)
With default settings the arrows are not causal: Income and Smoking are not causes of Male, and Smoking is not a cause of Hispanic.

17 Causal DAG learned when Age, Sex, and Hispanic are specified as sources
A causal DAG structure emerges from knowledge-based constraints on the directions of arrows, as in the sketch below.
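A minimal sketch with the bnlearn package: blacklist all arrows pointing into the declared source variables so that learned arrows respect background knowledge. The data frame name "survey" and its columns are assumptions standing in for the BRFSS-style example data.

```r
# Constrain structure learning with background knowledge (bnlearn).
library(bnlearn)

sources <- c("Age", "Sex", "Hispanic")
others  <- setdiff(names(survey), sources)

bl <- expand.grid(from = others, to = sources,
                  stringsAsFactors = FALSE)   # forbidden arcs

dag <- hc(survey, blacklist = bl)   # score-based hill-climbing search
arcs(dag)                           # arrows now respect the constraints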

18 Different learning algorithms produce slightly different DAG models
Example difference: whether Age directly affects Income.
For maximum robustness of conclusions, use an ensemble of DAGs produced using different algorithms and random subsets of the data, as in the sketch below.
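A minimal sketch of the ensemble idea with bnlearn: boot.strength() learns a DAG on each bootstrap resample, and averaged.network() keeps only arcs that recur often enough. The data frame "survey" is again an assumption.

```r
# Bootstrap model averaging over learned DAGs (bnlearn).
library(bnlearn)

strengths <- boot.strength(survey, R = 200, algorithm = "hc")
avg.dag   <- averaged.network(strengths, threshold = 0.85)
arcs(avg.dag)   # the arcs that are robust across resamples
```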

19 Different learning algorithms produce slightly different DAG models
In all models, Heart disease depends on the same parents, and Income and Hispanic are ancestors of PM2.5 and Heart disease.

20 Discovering DAG structure resolves ambiguous associations
How would cutting exposure concentration C in half affect future response rate R? This cannot be determined from the historical data alone, but each candidate model implies a different DAG:

Community | Concentration, C | Income, I | Mortality rate, R
A         | 4                | 100       | 8
B         | 8                | 60        | 16
C         | 12               | 20        | 24

Model 1: R = 2C (I = 140 – 10C); DAG: I ← C → R
Model 2: R = 35 – 0.5C – 0.25I; DAG: C → R ← I
Model 3: R = 28 – 0.2I (C = 14 – 0.1I); DAG: C ← I → R
Without knowing the DAG, decreasing C could decrease R, increase it, or leave it unchanged.

21 Discovering DAG structure resolves ambiguous associations
How would cutting exposure concentration C in half affect future response rate R? (Same data table as the previous slide.) Once the DAG is known, the answer is determined:
Model 1, Income ← Conc → Mortality: mortality would be halved
Model 2, Conc → Mortality ← Income: mortality would increase
Model 3, Conc ← Income → Mortality: mortality would not change

22 Validating DAG structures
- Compare results from multiple algorithms using different principles: conditional-independence constraints, likelihood scores, LiNGAM, composition
- Cross-validation
- Occam's Window: Bayesian model-averaging (ensemble modeling) for DAG models
- Internal consistency checks of effect estimates using multiple adjustment sets

23 Wrap-up on causal DAG structure-learning
- Practical algorithms are available now to learn BN DAG structures from data (for the example data set, run time < 1 second)
- To obtain BNs in which arrows have causal interpretations, knowledge-based constraints must be imposed: Which variables are sources or sinks? Are any arrows required, allowed, or forbidden?
- Different DAG-learning algorithms give slightly different DAGs, but the main results are robust

24 Key steps
1. Use Bayesian network (BN) learning algorithms to identify plausible causal DAG structures
   - What DAG structure(s) best explain the data?
   - What parts of the DAG structure are identified by the data?
2. Quantify conditional probability tables (CPTs) using BN learning algorithms or CART trees
   - Estimate direct effects of parents at each node
3. Validate quantified causal BNs
4. Use causal BNs to answer practical questions: prediction, attribution, optimization, explanation, inference, generalization

25 Basics of dagitty software: d-separation and conditional independence

26 dagitty functions
- List empirically testable implications (pairwise marginal and conditional independencies involving only observed quantities)
- Identify (minimal) adjustment sets: what must and must not be conditioned on to estimate dependencies along causal paths

27 Three connection types (Pearl, 1988; Charniak, 1991)
Linear: A  B  C Example: family-out  dog-out  hear-bark A and C are conditionally independent, given B Diverging: A  B  C Example: light-on  family-out  dog-out B is a confounder (of association between A & C) Converging (B is a “collider”): A  B  C Example: family-out  dog-out  bowel problem A and C are “conditionally dependent”, given B! “Explaining away” observed value of B

28 d-connection A path from A to C is d-connecting (given the facts/observations in E) if every interior node is of one of the three connection types and is: Not in E if it is linear or diverging. Rationale: Don’t create conditional independence! Not A  E  C Not A  E  C In E (or has descendant in E) if it is converging. Rationale: Create dependence via competing explanations Yes to: A  E  C Two nodes with no d-connecting path between them are d-separated.

29 Interpreting d-connection
"Roughly speaking, two nodes are d-connected if there is a causal path between them, or there is evidence that renders the two nodes correlated with each other." (Charniak, 1991)
X and Y are d-connected given E if and only if they are not conditionally independent of each other given E.
Causal path: all arrows point from exposure (X) toward response (Y).
This is much more general than traditional path analysis or linear structural equation models (SEMs): it allows nonlinear dependencies, high-order interactions, non-Gaussian errors, and inter-individual heterogeneity.

30 Checking d-separation
List all undirected paths from X to Y.
- If any non-collider vertex on a path is in the evidence set E (conditioned on), that path is blocked: A → E → C, A ← E → C.
- If any collider on a path is not in E (and has no descendant in E), that path is blocked: A → B ← C with B not in E.
X and Y are d-separated (no causal influence flows from X to Y) if every undirected path between them is blocked.
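These checks can be run mechanically with the dagitty package; a minimal sketch for the chain and collider cases above:

```r
# d-separation queries with dagitty.
library(dagitty)

chain <- dagitty("dag { A -> B ; B -> C }")      # A -> B -> C
dseparated(chain, "A", "C", list())      # FALSE: path is open
dseparated(chain, "A", "C", list("B"))   # TRUE: conditioning on B blocks it

coll <- dagitty("dag { A -> B ; C -> B }")       # A -> B <- C
dseparated(coll, "A", "C", list())       # TRUE: the collider blocks the path
dseparated(coll, "A", "C", list("B"))    # FALSE: conditioning on B opens it
```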

31 Examples of d-separation
X  U  V  W  Y X and V are unconditionally independent X and V are dependent, conditioned on U U and W are unconditionally dependent U and W are CI given V X and Y are unconditionally independent X and Y are not CI | U and W

32 d-connectivity calculations support dagitty functions
- List empirically testable implications (pairwise marginal and conditional independencies involving only observed quantities)
- Identify (minimal) adjustment sets: what to condition on, and what not to, to estimate dependencies along causal paths
- Leave causal paths unblocked to estimate total effects and path coefficients in linear models

33 Adjustment sets for total effects
Key idea: an adjustment set Z for estimating the total causal effect of X on Y must block all non-causal paths but no causal paths. (X = exposure, Y = response, Z = covariates to adjust for.)
Adjustment set(s) can be computed automatically from the DAG; this algorithm solves the variable-selection problem for regression modeling.
If such a set exists, then adjusting for it (by conditioning on the variables in this set as covariates) allows the total effect of X on Y to be non-parametrically identified.
Adjustment formula: P(y | do(x)) = Σ_z P(y | x, z) P(z), summing over the values z of Z.
P(z) can be obtained from a BN solver such as Netica; P(y | x, z) can be estimated by CART trees, random forests, etc.
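A minimal sketch combining both steps: dagitty computes the adjustment set, and the adjustment formula is then applied by direct counting. The data frame "dat" with binary (0/1) columns X, Y, Z is a hypothetical example in which Z confounds the X → Y relation.

```r
# Find an adjustment set, then apply the adjustment formula by counting.
library(dagitty)

g <- dagitty("dag { Z -> X ; Z -> Y ; X -> Y }")
adjustmentSets(g, exposure = "X", outcome = "Y")   # prints: { Z }

# P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) * P(Z = z)
p.do <- function(dat, x) {
  sum(sapply(unique(dat$Z), function(z) {
    mean(dat$Y[dat$X == x & dat$Z == z] == 1) * mean(dat$Z == z)
  }))
}

p.do(dat, 1) - p.do(dat, 0)   # estimated total causal effect of X on Y
```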

34 What effects do we want to identify and estimate?
X  Y  Z Total effect of X on Y: How would changing value of X from x to x’ change the distribution of Y? What is Pr(y | do(x’))? Direct effect of X on Y: How would changing X from old x to new x’ change the distribution of Y if Z is held fixed? Pr(y | do(x’), do(z)) Controlled direct effect = E(Y | do(x’), do(z)) – E(Y | do(x), do(z)) Pure or natural direct effect: Effect on E(Y) of changing X from x to x’ (meaning from do(x) to do(x’)) while holding Z fixed at the value it had for do(x). Natural indirect effect: Change in E(Y) if X is held fixed at x (i.e., do(x)) and Z adjusts to the values it would have for do(x’) Mediated effect: Effect of changing X on changing Y via changing Z Pearl,

35 What effects can we identify and estimate?
Some effects can be uniquely identified from associational data and a DAG structure. "Identifiable" = uniquely determined by the data and assumptions.
Typical DAG model assumptions:
- The model shows all relevant variables and their conditional independence and dependence relationships
- Causal Markov and faithfulness assumptions (Spirtes and Zhang, 2014)
Other effects may not be identifiable without more data, e.g., time series and interventions.
Special algorithms and software can enumerate all identifiable effects, given a DAG model: Pearl's do-calculus; dagitty software (Textor, 2015).

36 Interpret BN structure using dagitty: List testable implications
Click DAG under B Bayesian Network to generate dagitty interpretations of the DAG. This automatically shows testable implications, identifiable effects, and minimal adjustment sets for estimating them with minimal bias.
Notation: _||_ = "is conditionally independent of"

37 Interpret BN structure using dagitty: List identifiable path coefficients
Click DAG under B Bayesian Network to generate dagitty interpretations of the DAG, including the identifiable path coefficients.

38 Interpret BN structure using dagitty: List path coefficients identifiable via IV
Click DAG under B Bayesian Network to generate dagitty interpretations of a DAG. This lists path coefficients identifiable via instrumental variables (IVs).
Instrumental variable (IV): a variable that changes X but not Y directly; observe how Y moves.

39 Dagitty lists total effects identifiable via regression modeling
Click DAG under B Bayesian Network to see identifiable total effects. dagitty selects the variables to include, solving the automated variable-selection problem.

40 Wrap-up on dagitty and d-separation
- d-connectivity is the graph property that allows information to flow between variables; it links conditional independence to DAG structure
- Algorithms and software now allow automatic interpretation of DAG models: which effects can be identified from data, and how
- What to condition on: automated variable selection
- What can be estimated: path coefficients, regression coefficients, non-parametric structural equations, and conditional probability tables (CPTs)

41 Making DAG learning easier: Causal Analytics Toolkit (CAT) http://cox-associates.com/downloads/

42 Learning goals for this section
See how to apply R packages to carry out causal analytics based on information-theoretic principles and algorithms:
- Causal Analytics Toolkit (CAT) front end for R packages
- BN learning algorithms
- CART trees
- randomForest ensembles
- partial dependence plots
Apply them to example data.

43 Principles of most successful causal effect estimation algorithms
Information principle: causes are informative about (help to predict) their effects.
- So, exploit predictive analytics algorithms! Use DAGs, trees, Random Forests, etc. to find informative variables.
Propagation principle: changes in causes help to predict and explain changes in their effects; information flows from causes to their effects over time.
Use non-parametric effect estimates:
- CART trees estimate conditional probabilities directly from data, with no parametric model, avoiding errors from modeling biases, specification errors, and uncertain assumptions
- They allow nonlinearities and interactions
- Average over ensembles of hundreds of non-parametric estimates/predictions

44 CAT applies R and Python packages via Excel
Load data in Excel, then click Excel to R to send it to R.
Example data: Los Angeles air basin, 1461 days (Lopiano et al., 2015; thanks to Stan Young for the data)
- PM2.5 data from CARB
- Elderly mortality ("AllCause75") from the CA Department of Health
- Daily min and max temperatures and max relative humidity from ORNL and EPA
Risk question: does PM2.5 exposure increase elderly mortality risk? If so, by how much? P(AllCause75 | PM2.5) = ? E(AllCause75 | PM2.5) = ?

45 Using CAT to examine associations: Plotting the data
Send data from Excel to R: highlight columns, then click "Excel to R".
Select columns to analyze: click on column headers; Ctrl-click toggles selection.
Click Plots to view frequency distributions, scatter plots, correlations, and smooth regression curves.
Finding: PM2.5 is slightly negatively associated with mortality.

46 Using CAT to examine associations: Plotting more data
Same steps as the previous slide: send data from Excel to R, select columns, and click Plots.
Findings: temperature is positively associated with PM2.5, and negatively associated with mortality.

47 Other visualizations of correlations: Corrgram

48 Other visualizations for partial and total correlations: qgraph network
Red = negative, green = positive; thicker and closer = stronger. This visualization assumes linear relationships.

49 Nonparametric BN networks detect many more dependencies than partial (linear) correlations

50 Basic ideas of information-based Causal Analytics
- Use a (DAG) network to show which variables provide direct information about each other: arrows between variables show they are informative about each other
- Learn network structure directly from data: scoring algorithms, constraint algorithms, hybrids
- Carefully check conclusions ("In non-parametric analyses we trust!"); do power analyses using simulation
- Interpret neighbors in the network as potential direct causes (satisfying a necessary condition)
- Use partial and total dependence plots learned from data (based on averaging over many trees) to quantify the relation between inputs and dependent variables

51 Run BN structure discovery algorithms
Click B Bayesian Network to generate the DAG structure.
- Only variables connected to the response variable by an arrow are identified as potential direct causes
- Multiple pathways between two variables reveal potential direct and indirect effects. Example: direct and indirect paths between tmax and AllCause75.

52 By contrast, regression estimates total associations, given an assumed model
Click Automatic under Regression Models: CAT selects and runs appropriate regression models and reports the results.
- A quasi-Poisson regression model shows a significant positive total C-R association between PM2.5 and elderly mortality (AllCause75)
- There is a significant negative total association between temperature and elderly mortality

53 Confirm or refute/refine BN structure with additional non-parametric tests
Conditioning on very different values of a direct cause should change the distribution of the response variable. If the response variable's distribution does not change, then any association between them may be due to indirect pathways (e.g., confounding).

54 Confirm or refute/refine BN structure with additional non-parametric tests
Conditioning on very different values of a direct cause should change the distribution of the response variable. If the response variable's distribution does not change, then any association between them may be due to indirect pathways (e.g., confounding).

55 Key steps
1. Use Bayesian network (BN) learning algorithms to identify plausible causal DAG structures
   - What DAG structure(s) best explain the data?
   - What parts of the DAG structure are identified by the data?
2. Quantify conditional probability tables (CPTs) using BN learning algorithms or CART trees
   - Estimate direct effects of parents at each node
3. Validate quantified causal BNs
4. Use causal BNs to answer practical questions: prediction, attribution, optimization, explanation, inference, generalization

56 Quantify total causal relations using CART trees, Random Forest, etc.
Procedure: grow a tree or random forest for the response variable using only its parents in the causal DAG model.

57 Quantify CPTs using CART trees, Random Forest, etc.
Procedure: grow a tree or random forest for the response variable using only its parents in the causal DAG model, as in the sketch below.
In the example, the conditional probability of heart disease varies 10-fold based on age, sex, smoking, and income.
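A minimal sketch with rpart (CART): estimate the CPT for heart disease from its DAG parents only. The data frame "survey" and its column names and codings (HeartDisease as a factor; Sex, Smoking as factors; Income as a numeric group) are assumptions for illustration.

```r
# CART tree estimating P(HeartDisease | DAG parents).
library(rpart)

tree <- rpart(HeartDisease ~ Age + Sex + Smoking + Income,
              data = survey, method = "class")
print(tree)   # each leaf is a conditional probability estimate

# Conditional probability of heart disease for one covariate pattern:
predict(tree, newdata = data.frame(Age = 70, Sex = "Male",
                                   Smoking = "yes", Income = 3),
        type = "prob")
```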

58 Identify implications for effects of actions
In the example tree, reducing smoking most reduces heart disease risk among men over 68 and among people in lower income groups (1-4).

59 Quantify direct causal relations
Procedure: to quantify direct (potentially causal) relations after controlling for other variables, use a partial dependence plot of response Y vs. (potential) cause X, as in the sketch below.
- The randomForest algorithm averages multiple independent conditional probability predictions of the outcome Y for each value of X
- Rationale: the DAG structure shows that the relation might be causal (X helps to predict Y); partial dependence estimates the size of the potential effect
- Data-based simulation of conditional expected values generates the curve
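A minimal sketch with the randomForest package. The column names (AllCause75, PM2.5, tmin, tmax, MAXRH) follow the Los Angeles example, but the data frame "la" and the exact names are assumptions.

```r
# Partial dependence of the response on one DAG parent (randomForest).
library(randomForest)

rf <- randomForest(AllCause75 ~ PM2.5 + tmin + tmax + MAXRH, data = la)

# Average prediction as PM2.5 varies, averaging over the observed joint
# distribution of the other parents:
partialPlot(rf, pred.data = la, x.var = "PM2.5",
            main = "Partial dependence of AllCause75 on PM2.5")
```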

60 Key steps
1. Use Bayesian network (BN) learning algorithms to identify plausible causal DAG structures
   - What DAG structure(s) best explain the data?
   - What parts of the DAG structure are identified by the data?
2. Quantify conditional probability tables (CPTs) using BN learning algorithms or CART trees
   - Estimate direct effects of parents at each node
3. Validate quantified causal BNs
4. Use causal BNs to answer practical questions: prediction, attribution, optimization, explanation, inference, generalization

61 Validating quantified DAG models
- Compare results from multiple algorithms using different principles: conditional-independence constraints, likelihood scores, LiNGAM, composition
- Cross-validation
- Occam's Window: Bayesian model-averaging (ensemble modeling) for DAG models
- Internal consistency checks of effect estimates using multiple adjustment sets

62 Validate quantified relations in hold-out samples
CAT currently quantifies uncertainty using bootstrap and cross-validation approaches for Random Forest ensembles. Averaging over many trees reduces MSE (mean squared prediction error).

63 Validation test: Do variables not joined by arrows have flat PDPs?
[Figure: partial dependence plots for the Boston and Los Angeles data sets]
- PM2.5 has a positive regression coefficient as a predictor of AllCause75 in both data sets
- PM2.5 is not a significant causal predictor in either data set
- The C-R function learned from Boston does not apply in LA (and vice versa)

64 Key steps
1. Use Bayesian network (BN) learning algorithms to identify plausible causal DAG structures
   - What DAG structure(s) best explain the data?
   - What parts of the DAG structure are identified by the data?
2. Quantify conditional probability tables (CPTs) using BN learning algorithms or CART trees
   - Estimate direct effects of parents at each node
3. Validate quantified causal BNs
4. Use causal BNs to answer practical questions: prediction, attribution, optimization, explanation, inference, generalization

65 Use BN/influence diagram (ID) to optimize actions and decisions

66 Estimating implied dependencies
- Quantify total effects (or dependency plots) via BN algorithms. Procedure: compose the DAG's CPTs along causal paths using Monte Carlo simulation; Gibbs sampling for E(Y | do(x), z) and P(z). (A sketch follows.)
- Quantify direct effects via partial dependence plots
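A minimal sketch with bnlearn of estimating P(Y | do(x)) by graph surgery plus Monte Carlo: mutilated() removes arcs into the intervened node and clamps it, and cpdist() then simulates from the modified network. The fitted network and the node/state names (Smoking, HeartDisease) are assumptions carried over from the earlier sketches.

```r
# Interventional distribution via graph surgery and simulation (bnlearn).
library(bnlearn)

fitted  <- bn.fit(dag, survey)                       # CPTs from data
surgery <- mutilated(fitted, evidence = list(Smoking = "no"))

sim <- cpdist(surgery, nodes = "HeartDisease", evidence = TRUE, n = 10^5)
prop.table(table(sim$HeartDisease))   # P(HeartDisease | do(Smoking = "no"))
```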

67 Extensions
- Detecting latent variables and unmeasured confounders
- Measurement error
- Transportability
- Dynamic simulation
- Decision optimization: influence diagrams

68 Testing DAGs for hidden (“latent”) variables and confounders
To test for effects of unobserved ("hidden" or "latent") confounders, partition the study population into disjoint subsets, e.g., men vs. women, or younger vs. older. If the mortality rate in one subset appears as a direct cause of mortality in the other, then there is probably an omitted confounder that affects both.

69 Detecting hidden/omitted variables

70 Detecting unmeasured confounders
Example: year is not a direct cause of elderly mortality in women (F75).

71 Transportability: Causal laws and mechanisms hold across settings
Example model (or theory) structure for causes of response:
- Quantify Pr(mortality | age, sex, exposure): the conditional C-R relation, a conditional probability table (CPT)
- Response is conditionally independent of other variables, given the values of its direct parents in this network (a DAG model)
A valid causal model or law (CPT) describing underlying mechanisms should be the same in all studies:
- It can be "transported" (generalized) across applications
- It does not change based on arrows into age, sex, or exposure
- Otherwise, the causal theory needs to be expanded
(Figure: a directed acyclic graph (DAG) structure.)

72 Example: Testing transportability
Partial dependence relations between exposure (PM2.5) and mortality counts in two different cities look very different.

73 Summary of CAT’s causal analytics
- Screen for total, partial, and temporal associations and information relations
- Learn BN network structure from data
- Estimate quantitative dependence relations among neighboring variables: use partial dependence plots (Random Forest ensembles of non-parametric trees); use trees to quantify multivariate dependencies on multiple neighbors simultaneously
- Validate on hold-out samples
- Check internal consistency (dagitty, transportability, possible omitted variables)

74 Wrap-up on CAT
Modern software makes it easy to apply information-based causal analytics to data.
- The entire analysis process can be automated: click Analyze in CAT
- This minimizes the roles of modeling choices, p-hacking, confirmation bias, etc.
- Limited but useful outputs: possible causal relationships detected and quantified directly from data (predictive causality, not necessarily manipulative)

75 Inference engines and influence diagrams based on BN technology

76 Five Challenges
Learning: How to infer ("learn") correct DAG models from observational data?
- Under what conditions is this possible? ("Identifiability")
- When possible, how to do it? (Learning algorithms)
- How to characterize remaining uncertainties?
Inference: How to use a DAG model to infer probable values of unobserved variables from values of observed variables?
Prediction and optimization: How to use a DAG to predict how changing X would change Y? To optimize the choice of X? ("Manipulative causality")
Attribution: How to use a DAG to attribute effects on Y to X? How to define and estimate direct, total, controlled direct, natural direct, natural indirect, and mediated effects of X on Y?
Generalization: How to generalize answers from study population(s) to other populations? (The "transportability" or "external validity" question)

77 Inference: Design of a probabilistic inference expert system BN
- The system stores a joint PDF for many variables, Pr(X1, X2, …, Xn), as a causal Bayesian network
- The user specifies a query (a subset of variables, Q) and some evidence (observed or assumed values of other variables, K)
- The system calculates Pr(Q | K) to answer the query
- Method: Pr(Q | K) = Pr(Q and K) / Pr(K); Pr(K) is obtained from the joint PDF by marginalizing out (summing over all values of) the other variables
- Monte Carlo: in practice, sample many times from the joint PDF to estimate the ratio Pr(Q and K) / Pr(K)
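bnlearn's cpquery() implements exactly this ratio method; a minimal sketch, with the fitted network and the node/state names assumed from the earlier sketches:

```r
# Monte Carlo estimate of Pr(Q | K) by logic sampling (bnlearn).
library(bnlearn)

cpquery(fitted,
        event    = (HeartDisease == "yes"),             # query Q
        evidence = (Smoking == "yes" & Sex == "Male"),  # evidence K
        n = 10^6)                                       # sample size
```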

78 "Marginalizing out"
Given a joint PDF Pr(K, U), the marginal distribution of K is:
Pr(K = k) = Σ_u Pr(K = k, U = u)
where K = known/assumed values and U = unknown/uncertain values.
In general, given a joint PDF for many variables, Pr(X1, X2, …, Xn), we can "marginalize out" the PDF for any variable (or the joint PDF for any subset of variables) by summing/integrating over all values of all other variables.
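A tiny worked example in base R, using a made-up joint table for two binary variables: summing each row across U gives the marginal Pr(K).

```r
# Marginalizing out U from a joint table Pr(K, U).
joint <- matrix(c(0.20, 0.10,    # K = 0 row: U = 0, U = 1
                  0.30, 0.40),   # K = 1 row: U = 0, U = 1
                nrow = 2, byrow = TRUE,
                dimnames = list(K = c(0, 1), U = c(0, 1)))

rowSums(joint)   # Pr(K = 0) = 0.30, Pr(K = 1) = 0.70
```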

79 Key BN idea: Factoring a Joint PDF
For two variables, we have Pr(X = x, Y = y) = Pr(X = x) Pr(Y = y | X = x), or more briefly, Pr(x, y) = Pr(x) Pr(y | x).
For n variables, we have (perhaps after renumbering the variables so that each variable's direct parents precede it in the new ordering):
Pr(x1, x2, …, xn) = Pr(x1) Pr(x2 | x1) Pr(x3 | x1, x2) … Pr(xn | x1, x2, …, xn-1)
Most of these terms simplify for sparse DAGs!
Example: find Pr(x1, x2, x3, x4) in the DAG X1 → X2 → X3 ← X4.
A: Pr(x1, x2, x3, x4) = Pr(x1) Pr(x2 | x1) Pr(x4) Pr(x3 | x2, x4)

