Workshop Files www.phil.cmu.edu/projects/tetrad_download/download/acis.2018.workshop
Goals
Basic working knowledge of graphical causal models
Basic working knowledge of Tetrad 6
Basic understanding of search algorithms
Causal Estimation vs. Causal Search: Estimation (Potential Outcomes)
Causal question: effect of Zidovudine on survival among HIV-positive men (Hernan, et al., 2000)
Problem: confounders (CD4 lymphocyte count) vary over time, and they depend on previous treatment with Zidovudine
Estimation method discussed: marginal structural models
Assumptions:
  Treatment is measured reliably
  Measured covariates are sufficient to capture the major sources of confounding
  The model of treatment given the past is accurate
Output: effect estimate with confidence intervals
Fundamental problem: estimation/inference is conditional on the model
Causal Estimation vs. Causal Search: Search (Causal Graphical Models)
Causal question: which genes regulate flowering in Arabidopsis?
Problem: over 25,000 potential genes
Method: graphical model search
Assumptions:
  RNA microarray measurement is a reasonable proxy for gene expression
  Causal Markov assumption
  Etc.
Output: suggestions for follow-up experiments
Fundamental problem: model space grows super-exponentially with the number of variables
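To make the "super-exponential" claim concrete, here is a minimal sketch that counts labeled DAGs on n variables using Robinson's recurrence. The function name `count_dags` is introduced here for illustration only and is not part of Tetrad.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the growth for small n; the counts explode long before n reaches
# the thousands of variables found in gene-expression data.
for n in range(1, 9):
    print(n, count_dags(n))
```

Even at a handful of variables, exhaustive enumeration is already impractical, which is why search algorithms rely on constraints or greedy scoring.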
Example*: Which genes regulate flowering time in Arabidopsis thaliana?
Why is this an interesting question to answer? Timing of flowering is an important agronomical trait that greatly affects yield. Thus, improved knowledge about genes controlling flowering time is of substantial economic value.
* Stekhoven DJ, et al. Causal stability ranking. Bioinformatics 28 (2012) 2819-2823.
Genetic Regulatory Networks
Micro-array data: ~22,000 variables; N = 47 gene expression profiles of 4-day-old seedlings for which subsequent flowering time was also measured (Affymetrix ATH1 arrays)
Causal Discovery
Candidate Regulators of Flowering Time
Greenhouse experiments on flowering time
Candidate Regulators of Flowering Time
Considered the 25 genes that robustly emerged as “candidate causes” according to a customized discovery procedure based on PC
5 of those 25 genes: known regulators of flowering
13 of those 25 genes: not known regulators, and mutant seeds available
Greenhouse Experiments
Seeds were plated on Murashige and Skoog medium, stratified for 2 days at 4 °C, grown on plates for 10 days, and transferred into soil in Conviron growth chambers
Flowering time was measured in days to bolting
Seed types yielding 4 or more plants were considered for analysis
Results
Flowering time: 9 seed types with 4 or more plants
4/9: shorter flowering time (p < 0.05) than the control, wild-type plants
Other methods (correlation, high-dimensional regression, etc.) performed essentially at chance
Outline
Representing/Modeling Causal Systems
  Causal Graphs
  Parametric Models
    Bayes Nets
    Linear Structural Equation Models
Tetrad: Complete Causal Modeling Tool
Tetrad Demo & Hands-On
Graph nodes: Smoking, Stained_Teeth, LC
Build and save two acyclic causal graphs:
  Build the Smoking graph pictured above
  Build your own graph with 4 variables
Build 2 causal graphs - one for smoking, yf, lc
Parametric Models
Instantiated Models
Causal Bayes Networks
The joint distribution factors according to the causal graph:
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
Causal Bayes Networks
The joint distribution factors according to the causal graph:
P(S, YF, LC) = P(S) P(YF | S) P(LC | S) = f(θ)
All variables binary [0,1]: θ = {θ1, θ2, θ3, θ4, θ5}
P(S = 0) = θ1                 P(S = 1) = 1 - θ1
P(YF = 0 | S = 0) = θ2        P(LC = 0 | S = 0) = θ4
P(YF = 1 | S = 0) = 1 - θ2    P(LC = 1 | S = 0) = 1 - θ4
P(YF = 0 | S = 1) = θ3        P(LC = 0 | S = 1) = θ5
P(YF = 1 | S = 1) = 1 - θ3    P(LC = 1 | S = 1) = 1 - θ5
Causal Bayes Networks
The joint distribution factors according to the causal graph:
P(S, YF, LC) = P(S) P(YF | S) P(LC | S) = f(θ)
All variables binary [0,1]: θ = {θ1, θ2, θ3, θ4, θ5}
P(S, YF, LC) = P(S) P(YF | S) P(LC | YF, S) = f(θ)
All variables binary [0,1]: θ = {θ1, θ2, θ3, θ4, θ5, θ6, θ7}
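The parameter counts on this slide (5 versus 7) follow from the factorization: each variable needs one free parameter per non-redundant category, for every configuration of its parents. A minimal sketch, with the hypothetical helper `free_parameters` and graphs encoded as parent lists:

```python
from math import prod


def free_parameters(parents, categories):
    """Count free parameters of a categorical Bayes net.

    parents: dict mapping each variable to a list of its parents.
    categories: dict mapping each variable to its number of categories.
    """
    return sum(
        (categories[v] - 1) * prod(categories[p] for p in parents[v])
        for v in parents
    )


binary = {"S": 2, "YF": 2, "LC": 2}

# Graph 1: S -> YF, S -> LC
graph1 = {"S": [], "YF": ["S"], "LC": ["S"]}
# Graph 2: S -> YF, S -> LC, YF -> LC
graph2 = {"S": [], "YF": ["S"], "LC": ["YF", "S"]}

print(free_parameters(graph1, binary))
print(free_parameters(graph2, binary))
```

Adding the edge YF → LC doubles the number of parent configurations for LC, which accounts for the two extra parameters.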
Causal Bayes Networks
The joint distribution factors according to the causal graph:
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
P(S = 0) = .7                 P(S = 1) = .3
P(YF = 0 | S = 0) = .99       P(LC = 0 | S = 0) = .95
P(YF = 1 | S = 0) = .01       P(LC = 1 | S = 0) = .05
P(YF = 0 | S = 1) = .20       P(LC = 0 | S = 1) = .80
P(YF = 1 | S = 1) = .80       P(LC = 1 | S = 1) = .20
P(S = 1, YF = 1, LC = 1) = ?
Causal Bayes Networks
The joint distribution factors according to the causal graph:
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
Using the conditional probabilities from the previous slide:
P(S = 1, YF = 1, LC = 1) = P(S = 1) P(YF = 1 | S = 1) P(LC = 1 | S = 1)
P(S = 1, YF = 1, LC = 1) = .3 * .80 * .20 = .048
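The same calculation can be sketched in a few lines of Python. The conditional probability tables are copied from the slide; the dictionary layout and the function name `joint` are illustrative choices, not Tetrad's representation.

```python
# Conditional probability tables from the slide.
p_s = {0: 0.7, 1: 0.3}
# Indexed as table[s][value]: P(YF = value | S = s), P(LC = value | S = s).
p_yf_given_s = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.20, 1: 0.80}}
p_lc_given_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.80, 1: 0.20}}


def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) P(YF | S) P(LC | S)."""
    return p_s[s] * p_yf_given_s[s][yf] * p_lc_given_s[s][lc]


answer = joint(1, 1, 1)  # .3 * .80 * .20

# Sanity check: the eight joint probabilities should sum to 1.
total = sum(joint(s, yf, lc) for s in (0, 1) for yf in (0, 1) for lc in (0, 1))
print(round(answer, 3), round(total, 6))
```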
Tetrad Demo & Hands-On
Use the DAG you built for Smoking, YF, and LC
Define the Bayes PM (number and values of categories for each variable)
Attach a Bayes IM to the Bayes PM
Fill in the Conditional Probability Tables (make the values plausible)
Structural Equation Models
Causal Graph
Structural Equations: for each variable X ∈ V, an assignment equation X := f_X(immediate-causes(X), ε_X)
Exogenous Distribution: joint distribution over the exogenous variables: P(ε)
1. Example of a recursive structural equation model without correlated errors
2. One can show that the assumption of independence of errors guarantees correctness of the probabilistic interpretation
3. This represents both probability and causality
Linear Structural Equation Models
Path diagram, Causal Graph
Equations:
  Education := ε_Education
  Income := β1 Education + ε_Income
  Longevity := β2 Education + ε_Longevity
Exogenous Distribution: P(ε_Education, ε_Income, ε_Longevity)
  - for all i ≠ j, ε_i ⊥ ε_j (pairwise independence)
  - no variance is zero
Structural Equation Model: V = BV + E
E.g. (ε_Education, ε_Income, ε_Longevity) ~ N(0, Σ²), with Σ² diagonal and no variance zero
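As a rough illustration of the linear SEM on this slide (Education causing both Income and Longevity, with independent Gaussian errors), the sketch below simulates data and compares the sample covariance of Income and Longevity with the value the model implies. The coefficient values 0.8 and 0.5, the unit error variances, and the function names are assumptions made for the example; the slide does not give numbers.

```python
import random


def simulate_sem(n, b_income=0.8, b_longevity=0.5, seed=0):
    """Draw n cases from the three-variable linear SEM (hypothetical coefficients)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Independent standard-normal error terms, one per variable.
        e_ed, e_inc, e_lon = (rng.gauss(0.0, 1.0) for _ in range(3))
        education = e_ed
        income = b_income * education + e_inc
        longevity = b_longevity * education + e_lon
        rows.append((education, income, longevity))
    return rows


def covariance(xs, ys):
    """Unbiased sample covariance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


education, income, longevity = zip(*simulate_sem(20000))

# Implied covariance: b_income * b_longevity * Var(Education) = 0.8 * 0.5 * 1.
implied = 0.8 * 0.5 * 1.0
sample = covariance(income, longevity)
print(f"implied {implied:.3f}, sample {sample:.3f}")
```

Income and Longevity are correlated here only through their common cause; neither appears in the other's equation.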
Standardized SEMs
Attach a SEM PM to your 3-4 variable graph
Attach a SEM IM to the SEM PM
Change the coefficient values
Attach a Standardized SEM IM to the SEM PM, or to the SEM IM
Compare the Implied Matrices
Estimation
Estimation
Tetrad Demo and Hands-on
Build the standardized SEM IM shown below
Generate simulated data, N = 1000
Estimate model
Save session as “Estimate1”
Estimation
Coefficient Inference vs. Model Fit
Coefficient inference:
  Null: coefficient = 0, e.g., β(X1→X3) = 0
  p-value = P(estimated value of β(X1→X3) ≥ .4788 | β(X1→X3) = 0 and the rest of the model is correct)
  Reject the null (coefficient is “significant”) when p-value < α, with α usually = .05
Coefficient Inference vs. Model Fit
Coefficient inference:
  Null: coefficient = 0, e.g., β(X1→X3) = 0
  p-value = P(estimated value of β(X1→X3) ≥ .4788 | β(X1→X3) = 0 and the rest of the model is correct)
  Reject the null (coefficient is “significant”) when p-value < α, with α usually = .05
Model fit:
  Null: model is correctly specified (constraints true in the population)
  p-value = P(f(Deviation(Σ_ML, S)) ≥ 5.7137 | model correctly specified)
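The coefficient test can be sketched as follows. This is a simplified two-sided version using a normal approximation, not necessarily the computation Tetrad performs. The estimate .4788 comes from the slide; the standard error of 0.10 is a made-up value for illustration.

```python
from math import erfc, sqrt


def two_sided_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, normal approximation."""
    z = estimate / std_error
    # P(|Z| >= |z|) for a standard normal Z.
    return erfc(abs(z) / sqrt(2.0))


alpha = 0.05
estimate = 0.4788   # estimated coefficient from the slide
std_error = 0.10    # hypothetical standard error, not from the slide

p = two_sided_p_value(estimate, std_error)
reject = p < alpha  # "significant" when the p-value falls below alpha
print(p, reject)
```

Note the direction of the two tests: for a coefficient, a small p-value is evidence that the edge matters; for model fit, a small p-value is evidence against the model.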
Tetrad Demo and Hands-on
Create two DAGs with the same variables, each with one edge flipped, and attach a SEM PM to each new graph (copy and paste by selecting nodes, Ctrl-C to copy, and then Ctrl-V to paste)
Estimate each new model on the data produced by the original graph
Check p-values of:
  Edge coefficients
  Model fit
Save session as “estimation2”
Charitable Giving
What influences giving? Sympathy? Impact?
"The Donor is in the Details", Organizational Behavior and Human Decision Processes, Issue 1, 15-23, C. Cryder, with G. Loewenstein, R. Scheines.
N = 94
TangibilityCondition [1,0]: randomly assigned experimental condition
Imaginability [1..7]: How concrete scenario I
Sympathy [1..7]: how much sympathy for target
Impact [1..7]: how much impact will my donation have
AmountDonated [0..5]: how much actually donated
Theoretical Hypothesis
Tetrad Demo and Hands-on Load charity.txt (tabular – not covariance data) Build graph of theoretical hypothesis Build SEM PM from graph Estimate PM, check results
Break
Search
(Diagram labels: Models, Data)
Representing/Modeling Causal Systems
Estimation and Model Fit
Hands-on with Real Data
PC & FGES (causally sufficient case)
FCI (allow latent confounding)
Regression vs. Causal Search
Search Methods
Constraint-Based Searches
  PC, FCI
  Pointwise, but not uniformly, consistent
Scoring Searches
  FGES
  Scores: BIC, AIC, etc.
  Search: hill climbing, genetic algorithms, simulated annealing
  Difficult to extend to latent variable models
  Meek and Chickering: Greedy Equivalence Search (GES)
Latent Variable Psychometric Model Search
  BPC, MIMbuild, etc.
Linear non-Gaussian models (LiNGAM)
Models with cycles
IMaGES, and more!
Tetrad Demo and Hands On
Tetrad Demo and Hands-on
Go to “estimation2.tet”
Add Search node (from Simulation1)
  - “forbid latent common causes”
  - choose “FGES”
Add a “Graph” node to search result: choose “Dag in Pattern”
Add a PM to Graph
Estimate the PM on the data
Compare model fit to the model fit for the true model
Background Knowledge: Tetrad Demo and Hands-on
Create new session
Select “Simulate from a given graph, and then Search” from the Pipeline menu
Build the graph below, PM, IM, and generate sample data, N = 1,000
Execute PC search, α = .05
Background Knowledge: Tetrad Demo and Hands-on
Add “Knowledge” node
Create “Tiers” as shown below
Execute PC search again, α = .05
Compare results (Search2) to previous search (Search1)
Background Knowledge: Direct and Indirect Consequences
Panels: True Graph; PC Output, Background Knowledge; PC Output, No Background Knowledge
Background Knowledge: Direct and Indirect Consequences
Panels: True Graph; PC Output, Background Knowledge; PC Output, No Background Knowledge
Annotations: Direct Consequence of Background Knowledge; Indirect Consequence of Background Knowledge
College Plans (Search)
Load in data; use file: College_plans_data
Set Variables = discrete
Enter Background Knowledge:
  Tier 1: Sex, SES
  Tier 2: IQ
  Tier 3: pe, cp
Add search node
Choose FCI (Allow Latent Common Causes)
Add another search node
Set Bootstrap = 100
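Tiered background knowledge is commonly read as a temporal ordering: a variable in a later tier cannot cause a variable in an earlier tier. The sketch below enumerates the directed edges such an ordering rules out for the tiers listed on this slide. The function `forbidden_edges` is a hypothetical helper, not Tetrad's API, and it assumes edges within a tier remain allowed.

```python
def forbidden_edges(tiers):
    """List (cause, effect) pairs ruled out by a tier ordering.

    An edge is forbidden when its tail is in a later tier than its head.
    Edges within the same tier are left unconstrained in this sketch.
    """
    forbidden = []
    for later_index, later_tier in enumerate(tiers):
        for earlier_tier in tiers[:later_index]:
            for cause in later_tier:
                for effect in earlier_tier:
                    forbidden.append((cause, effect))
    return forbidden


# Tiers from the College Plans slide.
tiers = [["Sex", "SES"], ["IQ"], ["pe", "cp"]]
edges = forbidden_edges(tiers)

for cause, effect in edges:
    print(f"{cause} -> {effect} is forbidden")
```

Ruling out these orientations can also orient other edges indirectly, which is the direct/indirect distinction drawn on the earlier Background Knowledge slides.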
Regression vs. Causal Search
Multiple regression (X = putative cause, Y = putative effect):
  Identify all potential confounders Z, such that Z is prior to X, Z is associated with X, and Z is associated with Y
  Regress Y on X, Z
  Coefficient on X is taken as evidence of X → Y
Causal search:
  Find/compute all the causal models that are indistinguishable given background knowledge and data
  Represent features common to all such models
Multiple regression is the wrong tool for causal discovery
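One way the regression recipe can mislead is sketched below, under an assumed data-generating model that is not from the slides: X has no effect on Y, but a measured covariate Z shares one unmeasured cause with X (L1) and another with Y (L2). Z is associated with both X and Y, so the recipe includes it, and the coefficient on X then moves away from zero. All names and coefficients are hypothetical.

```python
import random


def cov(a, b):
    """Unbiased sample covariance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def ols_two_regressors(y, x, z):
    """Slopes of y on x and z (with intercept), via the 2x2 normal equations."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    b_x = (szz * sxy - sxz * szy) / det
    b_z = (sxx * szy - sxz * sxy) / det
    return b_x, b_z


rng = random.Random(0)
xs, ys, zs = [], [], []
for _ in range(20000):
    l1, l2 = rng.gauss(0, 1), rng.gauss(0, 1)  # unmeasured common causes
    zs.append(l1 + l2 + rng.gauss(0, 1))       # Z <- L1, Z <- L2
    xs.append(l1 + rng.gauss(0, 1))            # X <- L1
    ys.append(l2 + rng.gauss(0, 1))            # Y <- L2, no X -> Y edge

simple_slope = cov(xs, ys) / cov(xs, xs)       # Y on X alone: near 0
b_x, b_z = ols_two_regressors(ys, xs, zs)      # Y on X and Z: b_x near -0.2
print(f"Y ~ X: {simple_slope:.3f};  Y ~ X + Z: b_x = {b_x:.3f}, b_z = {b_z:.3f}")
```

In this setup, adjusting for Z creates an association between X and Y that does not reflect any causal effect, so a "significant" coefficient on X would be misread as evidence of X → Y.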
Regression vs. Causal Search
Foreign Investment
Does foreign investment in 3rd World countries inhibit democracy?
Timberlake, M. and Williams, K. (1984). Dependence, political exclusion, and government repression: Some cross-national evidence. American Sociological Review 49, 141-146.
N = 72
PO: degree of political exclusivity
CV: lack of civil liberties
EN: energy consumption per capita (economic development)
FI: level of foreign investment
Foreign Investment: Correlations
      po      fi      en      cv
po    1.0
fi    -.175   1.0
Case Study: Foreign Investment, Regression Results
po = .217*fi - .176*en + .880*cv
      fi       en       cv
SE    (.058)   (.059)   (.060)
t     3.941    -2.99    14.6
P     .0002    .0044    .0000
Interpretation: foreign investment increases political repression
Case Study: Foreign Investment Alternative Models There is no model with testable constraints (df > 0) that is not rejected by the data, in which FI has a positive effect on PO.
Charitable Giving (Search)
Load in charity data
Add search node
Enter Background Knowledge:
  Tangibility is exogenous
  Amount Donated is endogenous only
  Tangibility → Imaginability is required
Choose and execute one of the “Pattern searches”
Add a “Graph” node to search result: choose “Dag in Pattern”
Add a PM to Graph
Estimate the PM on the data
Compare model fit to the hypothetical model
For More https://www.ccd.pitt.edu/