Workshop Files www.phil.cmu.edu/projects/tetrad_download/download/acis.2018.workshop.

Slides:



Advertisements
Similar presentations
Probability models- the Normal especially.
Advertisements

Structural Equation Modeling
Discovering Cyclic Causal Models by Independent Components Analysis Gustavo Lacerda Peter Spirtes Joseph Ramsey Patrik O. Hoyer.
Topic Outline Motivation Representing/Modeling Causal Systems
Outline 1)Motivation 2)Representing/Modeling Causal Systems 3)Estimation and Updating 4)Model Search 5)Linear Latent Variable Models 6)Case Study: fMRI.
The IMAP Hybrid Method for Learning Gaussian Bayes Nets Oliver Schulte School of Computing Science Simon Fraser University Vancouver, Canada
Challenges posed by Structural Equation Models Thomas Richardson Department of Statistics University of Washington Joint work with Mathias Drton, UC Berkeley.
Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.
Chapter 12 Simple Regression
Evaluating Hypotheses Chapter 9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics.
Evaluating Hypotheses Chapter 9 Homework: 1-9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics ~
Fall 2006 – Fundamentals of Business Statistics 1 Chapter 13 Introduction to Linear Regression and Correlation Analysis.
Pengujian Parameter Koefisien Korelasi Pertemuan 04 Matakuliah: I0174 – Analisis Regresi Tahun: Ganjil 2007/2008.
Linear Regression and Correlation Analysis
Chapter 13 Introduction to Linear Regression and Correlation Analysis
1 gR2002 Peter Spirtes Carnegie Mellon University.
1 Center for Causal Discovery: Summer Workshop June 8-11, 2015 Carnegie Mellon University.
1 Day 2: Search June 9, 2015 Carnegie Mellon University Center for Causal Discovery.
CAUSAL SEARCH IN THE REAL WORLD. A menu of topics  Some real-world challenges:  Convergence & error bounds  Sample selection bias  Simpson’s paradox.
Bayes Net Perspectives on Causation and Causal Inference
1 Part 2 Automatically Identifying and Measuring Latent Variables for Causal Theorizing.
AM Recitation 2/10/11.
1 Tetrad: Machine Learning and Graphcial Causal Models Richard Scheines Joe Ramsey Carnegie Mellon University Peter Spirtes, Clark Glymour.
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.
Understanding Statistics
Education Research 250:205 Writing Chapter 3. Objectives Subjects Instrumentation Procedures Experimental Design Statistical Analysis  Displaying data.
CJT 765: Structural Equation Modeling Class 7: fitting a model, fit indices, comparingmodels, statistical power.
Inferential Statistics 2 Maarten Buis January 11, 2006.
Specification Error I.
© 2003 Prentice-Hall, Inc.Chap 13-1 Basic Business Statistics (9 th Edition) Chapter 13 Simple Linear Regression.
Name: Angelica F. White WEMBA10. Teach students how to make sound decisions and recommendations that are based on reliable quantitative information During.
1 Tutorial: Causal Model Search Richard Scheines Carnegie Mellon University Peter Spirtes, Clark Glymour, Joe Ramsey, others.
1 Causal Data Mining Richard Scheines Dept. of Philosophy, Machine Learning, & Human-Computer Interaction Carnegie Mellon.
1 Center for Causal Discovery: Summer Workshop June 8-11, 2015 Carnegie Mellon University.
Nov. 13th, Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour, and many others Dept. of Philosophy & CALD Carnegie Mellon.
Part 2: Model and Inference 2-1/49 Regression Models Professor William Greene Stern School of Business IOMS Department Department of Economics.
Learning Linear Causal Models Oksana Kohutyuk ComS 673 Spring 2005 Department of Computer Science Iowa State University.
Discussion of time series and panel models
Talking Points Joseph Ramsey. LiNGAM Most of the algorithms included in Tetrad (other than KPC) assume causal graphs are to be inferred from conditional.
CJT 765: Structural Equation Modeling Class 8: Confirmatory Factory Analysis.
Chapter 10 Inference for Regression
Chapter 15 The Chi-Square Statistic: Tests for Goodness of Fit and Independence PowerPoint Lecture Slides Essentials of Statistics for the Behavioral.
1 Day 2: Search June 9, 2015 Carnegie Mellon University Center for Causal Discovery.
The SweSAT Vocabulary (word): understanding of words and concepts. Data Sufficiency (ds): numerical reasoning ability. Reading Comprehension (read): Swedish.
1 Day 2: Search June 14, 2016 Carnegie Mellon University Center for Causal Discovery.
Tetrad 1)Main website: 2)Download:
Chapter 9 Introduction to the t Statistic
Inferring Regulatory Networks from Gene Expression Data BMI/CS 776 Mark Craven April 2002.
Step 1: Specify a null hypothesis
Day 3: Search Continued Center for Causal Discovery June 15, 2015
CJT 765: Structural Equation Modeling
Chapter 11 Simple Regression
12 Inferential Analysis.
Workshop Files
CHAPTER 26: Inference for Regression
Elementary Statistics
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
Center for Causal Discovery: Summer Short Course/Datathon
Center for Causal Discovery: Summer Short Course/Datathon
1 Department of Engineering, 2 Department of Mathematics,
Causal Data Mining Richard Scheines
Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour,
Extra Slides.
12 Inferential Analysis.
CHAPTER 12 More About Regression
15.1 The Role of Statistics in the Research Process
Searching for Graphical Causal Models of Education Data
Structural Equation Modeling
Presentation transcript:

Workshop Files www.phil.cmu.edu/projects/tetrad_download/download/acis.2018.workshop

Goals Basic working knowledge of graphical causal models Basic working knowledge of Tetrad 6 Basic understanding of search algorithms

Causal Estimation vs. Causal Search Estimation (Potential Outcomes) Causal Question: Effect of Zidovudine on Survival among HIV-positive men (Hernan, et al., 2000) Problem: confounders (CD4 lymphocyte count) vary over time, and they are dependent on previous treatment with Zidovudine Estimation method discussed: marginal structural models Assumptions: Treatment measured reliably Measured covariates sufficient to capture major sources of confounding Model of treatment given the past is accurate Output: Effect estimate with confidence intervals Fundamental Problem: estimation/inference is conditional on the model 1

Causal Estimation vs. Causal Search Search (Causal Graphical Models) Causal Question: which genes regulate flowering in Arbidopsis Problem: over 25,000 potential genes. Method: graphical model search Assumptions: RNA microarray measurement reasonable proxy for gene expression Causal Markov assumption Etc. Output: Suggestions for follow-up experiments Fundamental Problem: model space grows super-exponentially with the number of variables 1

Which genes regulate flowering time in Arabidopsis thaliana? Example* Which genes regulate flowering time in Arabidopsis thaliana? Why is this an interesting question to answer?: Timing of flowering is an important agronomical trait that greatly affects yield. Thus, improved knowledge about genes controlling flowering time is of substantial economic value. * Stekhoven DJ, et al. Causal stability ranking. Bioinformatics 28 (2012) 2819-2823.

Genetic Regulatory Networks Micro-array data ~22,000 variables N = 47 gene expression profiles of 4-day old seedlings for which subsequent flowering time was also measured Affymetrix ATH1 arrays Causal Discovery Candidate Regulators of Flowering time Greenhouse experiments on flowering time

Candidate Regulators of Flowering Time Considered the 25 genes that robustly emerged as “candidate causes” according to a customized discovery procedure based on PC 5 of those 25 genes : known regulators of flowering 13 of those 25 genes: not known regulators, and mutant seeds available

Greenhouse Experiments Seeds were plated on Murashige and Skoog medium, stratified for 2 days at 4o C, grown on plates for 10 days, and transferred into soil in Conviron growth chambers Flowering time was measured in days to bolting Seed types yielding 4 or more plants were considered for analysis Flowering time

Results Flowering time 9 seed types  4 or more plants 4/9 : shorter flowering time (p < 0.05) than the control, wild-type plants Other methods (correlation, high dimensional regression, etc.) performed essentially at chance

Outline Representing/Modeling Causal Systems Causal Graphs Parametric Models Bayes Nets Linear Structural Equation Models

Tetrad: Complete Causal Modeling Tool

Tetrad Demo & Hands-On Smoking Stained_Teeth LC Build and Save two acyclic causal graphs: Build the Smoking graph picture above Build your own graph with 4 variables Build 2 causal graphs - one for smoking, yf, lc

Parametric Models

Instantiated Models

Causal Bayes Networks The Joint Distribution Factors According to the Causal Graph, P(S,YF, L) = P(S) P(YF | S) P(LC | S)

Causal Bayes Networks The Joint Distribution Factors According to the Causal Graph, P(S) P(YF | S) P(LC | S) = f() All variables binary [0,1]:  = {1, 2,3,4,5, } P(S = 0) = 1 P(S = 1) = 1 - 1 P(YF = 0 | S = 0) = 2 P(LC = 0 | S = 0) = 4 P(YF = 1 | S = 0) = 1- 2 P(LC = 1 | S = 0) = 1- 4 P(YF = 0 | S = 1) = 3 P(LC = 0 | S = 1) = 5 P(YF = 1 | S = 1) = 1- 3 P(LC = 1 | S = 1) = 1- 5

Causal Bayes Networks The Joint Distribution Factors According to the Causal Graph, P(S,YF, LC) = P(S) P(YF | S) P(LC | S) = f() All variables binary [0,1]:  = {1, 2,3,4,5, } P(S,YF, LC) = P(S) P(YF | S) P(LC | YF, S) = f() All variables binary [0,1]:  = {1, 2,3,4,5, 6,7, }

Causal Bayes Networks The Joint Distribution Factors According to the Causal Graph, P(S,YF, L) = P(S) P(YF | S) P(LC | S) P(S = 0) = .7 P(S = 1) = .3 P(YF = 0 | S = 0) = .99 P(LC = 0 | S = 0) = .95 P(YF = 1 | S = 0) = .01 P(LC = 1 | S = 0) = .05 P(YF = 0 | S = 1) = .20 P(LC = 0 | S = 1) = .80 P(YF = 1 | S = 1) = .80 P(LC = 1 | S = 1) = .20 P(S=1,YF=1, LC=1) = ?

Causal Bayes Networks The Joint Distribution Factors According to the Causal Graph, P(S,YF, L) = P(S) P(YF | S) P(LC | S) P(S = 0) = .7 P(S = 1) = .3 P(YF = 0 | S = 0) = .99 P(LC = 0 | S = 0) = .95 P(YF = 1 | S = 0) = .01 P(LC = 1 | S = 0) = .05 P(YF = 0 | S = 1) = .20 P(LC = 0 | S = 1) = .80 P(YF = 1 | S = 1) = .80 P(LC = 1 | S = 1) = .20 P(S=1,YF=1, LC=1) = P(S=1) P(YF=1 | S=1) P(LC = 1 | S=1) P(S=1,YF=1, LC=1) = .3 * .80 * .20 = .048

Tetrad Demo & Hands-On Use the DAG you built for Smoking, YF, and LC Define the Bayes PM (# and values of categories for each variable) Attach a Bayes IM to the Bayes PM Fill in the Conditional Probability Tables (make the values plausible). Build 2 causal graphs - one for smoking, yf, lc

Structural Equation Models Causal Graph Structural Equations For each variable X  V, an assignment equation: X := fX(immediate-causes(X), eX) 1. example of recursive structural equation model without correlated errors 2. can show that assumption of independence of errors guarantees correctness of probabilitic interpretation 3. this represents both probability and causality Exogenous Distribution: Joint distribution over the exogenous vars : P(e)

Linear Structural Equation Models Path diagram Causal Graph Equations: Education := Education Income :=Educationincome Longevity :=EducationLongevity Exogenous Distribution: P(ed, Income,Income ) - i≠j ei  ej (pairwise independence) - no variance is zero 1. example of recursive structural equation model without correlated errors 2. can show that assumption of independence of errors guarantees correctness of probabilitic interpretation 3. this represents both probability and causality Structural Equation Model: V = BV + E E.g. (ed, Income,Income ) ~N(0,2) 2 diagonal, - no variance is zero

Standardized SEMs Attach a SEM PM to your 3-4 variable graph Attach a SEM IM to the SEM PM Change the coefficient values. Attach a Standardized SEM IM to the SEM PM, or the SEM IM Compare the Implied Matrices Build 2 causal graphs - one for smoking, yf, lc

Estimation

Estimation

Tetrad Demo and Hands-on Build the standardized SEM IM shown below Generate simulated data N=1000 Estimate model. Save session as “Estimate1”

Estimation

Coefficient inference vs. Model Fit Coefficient Inference: Null: coefficient = 0, e.g., bX1 X3 = 0 p-value = p(Estimated value bX1 X3 ≥ .4788 | bX1 X3 = 0 & rest of model correct) Reject null (coefficient is “significant”) when p-value < a, a usually = .05

Coefficient inference vs. Model Fit Coefficient Inference: Null: coefficient = 0, e.g., bX1 X3 = 0 p-value = p(Estimated value bX1 X3 ≥ .4788 | bX1 X3 = 0 & rest of model correct) Reject null (coefficient is “significant”) when p-value < < a, a usually = .05, Model fit: Null: Model is correctly specified (constraints true in population) p-value = p(f(Deviation(Sml,S)) ≥ 5.7137 | Model correctly specified)

Tetrad Demo and Hands-on Create two DAGs with the same variables – each with one edge flipped, and attach a SEM PM to each new graph (copy and paste by selecting nodes, Ctl-C to copy, and then Ctl-V to paste) Estimate each new model on the data produced by original graph Check p-values of: Edge coefficients Model fit Save session as: “estimation2”

Charitable Giving What influences giving? Sympathy? Impact? "The Donor is in the Details", Organizational Behavior and Human Decision Processes, Issue 1, 15-23, C. Cryder, with G. Loewenstein, R. Scheines. N = 94 TangibilityCondition [1,0] Randomly assigned experimental condition Imaginability [1..7] How concrete scenario I Sympathy [1..7] How much sympathy for target Impact [1..7] How much impact will my donation have AmountDonated [0..5] How much actually donated 1

Theoretical Hypothesis

Tetrad Demo and Hands-on Load charity.txt (tabular – not covariance data) Build graph of theoretical hypothesis Build SEM PM from graph Estimate PM, check results

Break

Search Models  Data Representing/Modeling Causal Systems Estimation and Model fit Hands on with Real Data Models  Data PC & FGES (causally sufficient case) FCI (allow latent confounding) Regression vs. Casual Search

Search Methods Scoring Searches Constraint Based Searches PC, FCI Pointwise, but not uniformly consistent Scoring Searches FGES Scores: BIC, AIC, etc. Search: Hill Climb, Genetic Alg., Simulated Annealing Difficult to extend to latent variable models Meek and Chickering Greedy Equivalence Class (GES) Latent Variable Psychometric Model Search BPC, MIMbuild, etc. Linear non-Gaussian models (Lingam) Models with cycles Images, and more!!!

Tetrad Demo and Hands On

Tetrad Demo and Hands-on Go to “estimation2.tet” Add Search node (from Simulation1) - “forbid latent common causes” - choose “FGES” Add a “Graph” node to search result: choose “Dag in Pattern” Add a PM to Graph Estimate the PM on the data Compare model-fit to model fit for true model

Backround Knowledge Tetrad Demo and Hands-on Create new session Select “Simulate from a given graph, and then Search” from Pipeline menu Build graph below, PM, IM, and generate sample data N=1,000. Execute PC search, a = .05

Backround Knowledge Tetrad Demo and Hands-on Add “Knowledge” node Create “Tiers” as shown below. Execute PC search again, a = .05 Compare results (Search2) to previous search (Search1)

Backround Knowledge Direct and Indirect Consequences True Graph PC Output Background Knowledge PC Output No Background Knowledge

Backround Knowledge Direct and Indirect Consequences True Graph Direct Consequence Of Background Knowledge Indirect Consequence Of Background Knowledge PC Output Background Knowledge PC Output No Background Knowledge

College Plans (Search) Load in College_plans_data Use file: College_plans_data Set Variables = discrete Enter Background Knowledge: Tier 1: Sex , SES Tier 2: IQ Tier 3: pe, cp Add search node Choose FCI (Allow Latent Common Causes) Add another search node Set Bootstrap = 100

Regression vs. Causal Search Multiple Regression (X = putative cause , Y = putative effect) Identify all potential confounders Z, such that Z is prior to X Z is associated with X, and Z is associated with Y Regress Y on X, Z Coefficient on X is evidence of X  Y Causal Search: Find/compute all the causal models that are indistinguishable given background knowledge and data Represent features common to all such models Multiple Regression is the wrong tool for Causal Discovery

Regression vs. Causal Search

Foreign Investment Does Foreign Investment in 3rd World Countries inhibit Democracy? Timberlake, M. and Williams, K. (1984). Dependence, political exclusion, and government repression: Some cross-national evidence. American Sociological Review 49, 141-146. N = 72 PO degree of political exclusivity CV lack of civil liberties EN energy consumption per capita (economic development) FI level of foreign investment 1

Foreign Investment Correlations po fi en cv po 1.0 fi -.175 1.0

Case Study: Foreign Investment Regression Results po = .217*fi - .176*en + .880*cv SE (.058) (.059) (.060) t 3.941 -2.99 14.6 P .0002 .0044 .0000 Interpretation: foreign investment increases political repression 1

Case Study: Foreign Investment Alternative Models There is no model with testable constraints (df > 0) that is not rejected by the data, in which FI has a positive effect on PO.

Charitable Giving (Search) Load in charity data Add search node Enter Background Knowledge: Tangibility is exogenous Amount Donated is endogenous only Tangibility  Imaginability is required Choose and execute one of the “Pattern searches” Add a “Graph” node to search result: “choose Dag in Pattern” Add a PM to Graph Estimate the PM on the data Compare model-fit to hypothetical model

For More https://www.ccd.pitt.edu/