Part II: Graphical models

Challenges of probabilistic models
Specifying well-defined probabilistic models with many variables is hard (for modelers)
Representing probability distributions over those variables is hard (for computers/learners)
Computing quantities using those distributions is hard (for computers/learners)

Representing structured distributions
Four random variables, each with domain {0,1}:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin

Joint distribution
The joint distribution assigns a probability to each of the 16 assignments of (x1, x2, x3, x4), from 0000 to 1111
Requires 15 numbers to specify (the 16th is fixed because the probabilities sum to 1)
N binary variables: 2^N - 1 numbers
Similar cost when computing conditional probabilities

How can we use fewer numbers?
Four random variables, each with domain {0,1}:
X1: coin toss produces heads
X2: coin toss produces heads
X3: coin toss produces heads
X4: coin toss produces heads

Statistical independence
Two random variables X1 and X2 are independent if P(x1|x2) = P(x1)
e.g. coin flips: P(x1=H|x2=H) = P(x1=H) = 0.5
Independence makes it easier to represent and work with probability distributions
We can exploit the product rule: P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1, x2) P(x4|x1, x2, x3)
If x1, x2, x3, and x4 are all independent, this reduces to P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)
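
A minimal Python sketch of this point (fair coins assumed; the numbers are illustrative): under independence, four one-number marginals determine all 16 joint probabilities.

```python
from itertools import product

# One number per variable: P(Xi = 1) for four independent fair coin flips.
p = [0.5, 0.5, 0.5, 0.5]

def joint(x):
    """P(x1, x2, x3, x4) as a product of marginals (valid only under independence)."""
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi == 1 else (1 - pi)
    return prob

# The 16 joint probabilities still sum to 1, but only 4 numbers had to be specified.
total = sum(joint(x) for x in product([0, 1], repeat=4))
print(joint((1, 0, 1, 1)), total)  # 0.0625 1.0
```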

Expressing independence Statistical independence is the key to efficient probabilistic representation and computation This has led to the development of languages for indicating dependencies among variables Some of the most popular languages are based on “graphical models”

Part II: Graphical models Introduction to graphical models representation and inference Causal graphical models causality learning about causal relationships Graphical models and cognitive science uses of graphical models an example: causal induction

Graphical models
Express the probabilistic dependency structure among a set of variables (Pearl, 1988)
Consist of:
a set of nodes, corresponding to variables
a set of edges, indicating dependency
a set of functions defined on the graph that specify a probability distribution

Undirected graphical models
Consist of:
a set of nodes
a set of edges
a potential for each clique, multiplied together to yield the distribution over variables
Examples:
statistical physics: Ising model, spin glasses
early neural networks (e.g. Boltzmann machines)
[Figure: nodes X1–X5 joined by undirected edges]

Directed graphical models
Consist of:
a set of nodes
a set of edges
a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
Constrained to directed acyclic graphs (DAGs)
Called Bayesian networks or Bayes nets
[Figure: nodes X1–X5 joined by directed edges]

Bayesian networks and Bayes
Two different problems:
Bayesian statistics is a method of inference
Bayesian networks are a form of representation
There is no necessary connection:
many users of Bayesian networks rely upon frequentist statistical methods
many Bayesian inferences cannot be easily represented using Bayesian networks

Properties of Bayesian networks Efficient representation and inference exploiting dependency structure makes it easier to represent and compute with probabilities Explaining away pattern of probabilistic reasoning characteristic of Bayesian networks, especially early use in AI

Efficient representation and inference
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
[Bayes net: X3 → X1, X4 → X1, X3 → X2, with distributions P(x4), P(x3), P(x2|x3), P(x1|x3, x4)]

The Markov assumption
Every node is conditionally independent of its non-descendants, given its parents
Via the product rule, the joint distribution therefore factorizes as
P(x1, x2, …, xn) = ∏i P(xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi

Efficient representation and inference
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)
Numbers needed: 1 for P(x4), 1 for P(x3), 2 for P(x2|x3), 4 for P(x1|x3, x4): total = 8 (vs 15)
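
A minimal sketch of this factorization (the conditional probability values below are invented for illustration; only the network structure, X3 → X1 ← X4 and X3 → X2, follows the example):

```python
# CPTs for the network X4 -> X1 <- X3 -> X2 (values are illustrative, not from the slides).
p_x3 = 0.1                        # P(x3 = 1): friend has psychic powers
p_x4 = 0.1                        # P(x4 = 1): friend has two-headed coin
p_x2_given_x3 = {0: 0.0, 1: 0.5}  # P(x2 = 1 | x3): pencil levitates
p_x1_given_x3_x4 = {              # P(x1 = 1 | x3, x4): coin toss produces heads
    (0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.99, (1, 1): 1.0,
}

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) = P(x1|x3,x4) P(x2|x3) P(x3) P(x4)."""
    def bern(p, x):
        return p if x == 1 else 1 - p
    return (bern(p_x1_given_x3_x4[(x3, x4)], x1)
            * bern(p_x2_given_x3[x3], x2)
            * bern(p_x3, x3)
            * bern(p_x4, x4))

# 1 + 1 + 2 + 4 = 8 numbers specify the whole 16-entry joint distribution.
print(joint(1, 0, 0, 0))
```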

Reading a Bayesian network The structure of a Bayes net can be read as the generative process behind a distribution Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents

Reading a Bayesian network
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
[Bayes net: X3 → X1, X4 → X1, X3 → X2]
P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)

Reading a Bayesian network The structure of a Bayes net can be read as the generative process behind a distribution Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents Simple rules for determining whether two variables are dependent or independent Independence makes inference more efficient

Computing with Bayes nets
P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)
Computing a marginal directly from the joint means summing over all settings of the remaining variables, e.g.
P(x1) = Σx2,x3,x4 P(x1, x2, x3, x4)   (sum over 8 values)

Computing with Bayes nets
Exploiting the factorization reduces the work: since Σx2 P(x2|x3) = 1,
P(x1) = Σx3,x4 P(x1|x3, x4) P(x3) P(x4)   (sum over 4 values)
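
A sketch of the two computations (reusing the invented CPT values from the sketch above; the point is only how many terms each sum needs):

```python
from itertools import product

p_x3, p_x4 = 0.1, 0.1
p_x2_given_x3 = {0: 0.0, 1: 0.5}
p_x1_given_x3_x4 = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.99, (1, 1): 1.0}

def bern(p, x):
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4):
    return (bern(p_x1_given_x3_x4[(x3, x4)], x1) * bern(p_x2_given_x3[x3], x2)
            * bern(p_x3, x3) * bern(p_x4, x4))

# Naive marginal: sum the joint over the 8 settings of (x2, x3, x4).
p_x1_naive = sum(joint(1, x2, x3, x4) for x2, x3, x4 in product([0, 1], repeat=3))

# Exploiting the factorization: sum_{x2} P(x2|x3) = 1, so only (x3, x4) remain (4 terms).
p_x1_structured = sum(bern(p_x1_given_x3_x4[(x3, x4)], 1) * bern(p_x3, x3) * bern(p_x4, x4)
                      for x3, x4 in product([0, 1], repeat=2))

print(p_x1_naive, p_x1_structured)  # identical values
```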

Computing with Bayes nets
Inference algorithms for Bayesian networks exploit dependency structure
Message-passing algorithms: "belief propagation" passes simple messages between nodes, exact for tree-structured networks
More general inference algorithms:
exact: "junction tree"
approximate: Monte Carlo schemes (see Part IV)

Properties of Bayesian networks Efficient representation and inference exploiting dependency structure makes it easier to represent and compute with probabilities Explaining away pattern of probabilistic reasoning characteristic of Bayesian networks, especially early use in AI

Explaining away
[Bayes net: Rain → Grass Wet ← Sprinkler]
Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
P(w = 1 | r, s) = 1 if r = 1 or s = 1, and 0 otherwise
Rain and Sprinkler are a priori independent, with prior probabilities P(r = 1) and P(s = 1)

Explaining away
Compute probability it rained last night, given that the grass is wet:
P(r = 1 | w = 1) = P(w = 1 | r = 1) P(r = 1) / P(w = 1)
Since P(w = 1 | r = 1) = 1 and P(w = 1) = P(r = 1) + P(s = 1) - P(r = 1) P(s = 1),
P(r = 1 | w = 1) = P(r = 1) / [P(r = 1) + P(s = 1) - P(r = 1) P(s = 1)]
The denominator lies between P(s = 1) and 1, so the posterior is at least as large as the prior P(r = 1): wet grass is evidence for rain

Explaining away
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
P(r = 1 | w = 1, s = 1) = P(w = 1 | r = 1, s = 1) P(r = 1 | s = 1) / P(w = 1 | s = 1)
Both P(w = 1 | r = 1, s = 1) and P(w = 1 | s = 1) equal 1, so
P(r = 1 | w = 1, s = 1) = P(r = 1 | s = 1) = P(r = 1)

Explaining away
"Discounting" to prior probability: once the sprinkler is known to have been on, the wet grass provides no additional evidence for rain
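
A numerical check of explaining away (the priors 0.2 and 0.3 are invented; the wet-grass rule is the deterministic OR assumed above):

```python
from itertools import product

p_rain, p_sprinkler = 0.2, 0.3   # illustrative priors; Rain and Sprinkler independent

def p_wet(r, s):
    """Grass is wet iff it rained or the sprinkler was on (deterministic OR)."""
    return 1.0 if (r or s) else 0.0

def prior(r, s):
    return (p_rain if r else 1 - p_rain) * (p_sprinkler if s else 1 - p_sprinkler)

# P(rain | wet)
num = sum(p_wet(1, s) * prior(1, s) for s in [0, 1])
den = sum(p_wet(r, s) * prior(r, s) for r, s in product([0, 1], repeat=2))
p_rain_given_wet = num / den

# P(rain | wet, sprinkler on): condition on s = 1 as well
num = p_wet(1, 1) * prior(1, 1)
den = sum(p_wet(r, 1) * prior(r, 1) for r in [0, 1])
p_rain_given_wet_sprinkler = num / den

print(p_rain, p_rain_given_wet, p_rain_given_wet_sprinkler)
# prior 0.2, raised by wet grass (about 0.45), discounted back to 0.2 once the sprinkler is known
```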

Contrast w/ production system
Formulate IF-THEN rules:
IF Rain THEN Wet
IF Wet THEN Rain
Rules do not distinguish directions of inference
Requires combinatorial explosion of rules, e.g. IF Wet AND NOT Sprinkler THEN Rain

Contrast w/ spreading activation
Excitatory links: Rain → Wet, Sprinkler → Wet
Observing rain, Wet becomes more active
Observing grass wet, Rain and Sprinkler become more active
Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!

Contrast w/ spreading activation
Excitatory links: Rain → Wet, Sprinkler → Wet
Inhibitory link between Rain and Sprinkler
Observing grass wet, Rain and Sprinkler become more active
Observing grass wet and sprinkler, Rain becomes less active: explaining away

Contrast w/ spreading activation
Adding a new variable (e.g. Burst pipe) requires more inhibitory connections
Not modular:
whether a connection exists depends on what other connections exist
big holism problem
combinatorial explosion

Contrast w/ spreading activation (McClelland & Rumelhart, 1981)

Graphical models
Capture dependency structure in distributions
Provide an efficient means of representing and reasoning with probabilities
Allow kinds of inference that are problematic for other representations, e.g. explaining away:
hard to capture in a production system
more natural than with spreading activation

Part II: Graphical models Introduction to graphical models representation and inference Causal graphical models causality learning about causal relationships Graphical models and cognitive science uses of graphical models an example: causal induction

Causal graphical models
Graphical models:
represent statistical dependencies among variables (i.e. correlations)
can answer questions about observations
Causal graphical models:
represent causal dependencies among variables (Pearl, 2000)
express underlying causal structure
can answer questions about both observations and interventions (actions upon a variable)

Bayesian networks
Nodes: variables
Links: dependency
Each node has a conditional probability distribution
Data: observations of x1, ..., x4
[Example network over X1 (coin toss produces heads), X2 (pencil levitates), X3 (friend has psychic powers), X4 (friend has two-headed coin), with P(x4), P(x3), P(x2|x3), P(x1|x3, x4)]

Causal Bayesian networks
Nodes: variables
Links: causality
Each node has a conditional probability distribution
Data: observations of and interventions on x1, ..., x4
[Same example network, with the links now read as causal relationships]

Interventions
Cut all incoming links for the node that we intervene on
Compute probabilities with the "mutilated" Bayes net
Example: intervening on X2 (hold down the pencil) cuts the link X3 → X2
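
A sketch of intervention as graph surgery, contrasting observation with intervention on X2 (again using the invented CPT values from above):

```python
p_x3, p_x4 = 0.1, 0.1
p_x2_given_x3 = {0: 0.0, 1: 0.5}
p_x1_given_x3_x4 = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.99, (1, 1): 1.0}

def bern(p, x):
    return p if x == 1 else 1 - p

def observe_x2(x2):
    """P(x3 = 1 | X2 = x2): ordinary conditioning in the intact network."""
    num = bern(p_x2_given_x3[1], x2) * p_x3
    den = sum(bern(p_x2_given_x3[x3], x2) * bern(p_x3, x3) for x3 in [0, 1])
    return num / den

def intervene_x2(x2):
    """P(x3 = 1 | do(X2 = x2)): cut the incoming link to X2, so X3 keeps its prior."""
    return p_x3

print(observe_x2(0), intervene_x2(0))
# Seeing the pencil fail to levitate lowers belief in psychic powers;
# holding it down yourself tells you nothing about X3.
```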

Learning causal graphical models
Strength: how strong is a relationship?
Structure: does a relationship exist?
[Example: candidate causes B and C and effect E]

Causal structure vs. causal strength
Strength: how strong is a relationship?
Answering this requires defining the nature of the relationship, e.g. strength parameters w0 and w1 on the links B → E and C → E

Parameterization
Generic
Structures: h1 = C → E ← B; h0 = B → E
Parameterization: under h1, P(E = 1 | C, B) takes a separate value for each of the four combinations of C and B; under h0, P(E = 1 | C, B) depends only on B

Parameterization
Linear
Structures: h1 = C → E ← B; h0 = B → E
w0, w1: strength parameters for B, C
Under h1, P(E = 1 | C, B) = w0·B + w1·C; under h0, P(E = 1 | C, B) = w0·B:
  C  B    h1: P(E = 1 | C, B)    h0: P(E = 1 | C, B)
  0  0    0                      0
  1  0    w1                     0
  0  1    w0                     w0
  1  1    w0 + w1                w0

Parameterization
“Noisy-OR”
Structures: h1 = C → E ← B; h0 = B → E
w0, w1: strength parameters for B, C
Under h1, P(E = 1 | C, B) = w0·B + w1·C - w0·w1·B·C (each cause acts through an independent mechanism); under h0, P(E = 1 | C, B) = w0·B:
  C  B    h1: P(E = 1 | C, B)    h0: P(E = 1 | C, B)
  0  0    0                      0
  1  0    w1                     0
  0  1    w0                     w0
  1  1    w0 + w1 - w0·w1        w0
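
The two parameterizations written as functions (a minimal sketch; w0 and w1 are the strength parameters for B and C as on the slides):

```python
def p_effect_linear(c, b, w0, w1):
    """Linear parameterization: P(E = 1 | C=c, B=b) = w1*c + w0*b (must stay <= 1)."""
    return w1 * c + w0 * b

def p_effect_noisy_or(c, b, w0, w1):
    """Noisy-OR: each present cause independently fails to produce E with prob (1 - w)."""
    return 1.0 - (1.0 - w1) ** c * (1.0 - w0) ** b

for c, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(c, b, p_effect_linear(c, b, 0.3, 0.4), p_effect_noisy_or(c, b, 0.3, 0.4))
# (1,1): linear gives w1 + w0 = 0.7; noisy-OR gives w1 + w0 - w1*w0 = 0.58
```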

maximize i P(bi,ci,ei; w0, w1) Parameter estimation Maximum likelihood estimation: maximize i P(bi,ci,ei; w0, w1) Bayesian methods: as in Part I

Causal structure vs. causal strength
Structure: does a relationship exist? (h1 = C → E ← B vs. h0 = B → E)

Approaches to structure learning
Constraint-based:
dependency from statistical tests (e.g. χ²)
deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce an inductive problem to a deductive problem

Approaches to structure learning
Constraint-based: dependency from statistical tests (e.g. χ²); deduce structure from dependencies (Pearl, 2000; Spirtes et al., 1993)
Bayesian: compute the posterior probability of structures given observed data, P(h|data) ∝ P(data|h) P(h), comparing P(h1|data) with P(h0|data) (Heckerman, 1998; Friedman, 1999)

Bayesian Occam’s Razor
[Figure: P(d | h) plotted over all possible data sets d, for h0 (no relationship) and h1 (relationship)]
For any model h, Σd P(d | h) = 1, so a more flexible model must spread its probability over more possible data sets

Causal graphical models
Extend graphical models to deal with interventions as well as observations
Respecting the direction of causality results in efficient representation and inference
Two steps in learning causal models:
strength: parameter estimation
structure: structure learning

Part II: Graphical models Introduction to graphical models representation and inference Causal graphical models causality learning about causal relationships Graphical models and cognitive science uses of graphical models an example: causal induction

Uses of graphical models
Understanding existing cognitive models: e.g., neural network models
Representation and reasoning: a way to address holism in induction (cf. Fodor)
Defining generative models: mixture models, language models (see Part IV)
Modeling human causal reasoning

Human causal reasoning
How do people reason about interventions? (Gopnik, Glymour, Sobel, Schulz, Kushnir & Danks, 2004; Lagnado & Sloman, 2004; Sloman & Lagnado, 2005; Steyvers, Tenenbaum, Wagenmakers & Blum, 2003)
How do people learn about causal relationships?
parameter estimation (Shanks, 1995; Cheng, 1997)
constraint-based models (Glymour, 2001)
Bayesian structure learning (Steyvers et al., 2003; Griffiths & Tenenbaum, 2005)

Causation from contingencies
Contingency data count the four combinations of cause and effect:
                  C present (c+)   C absent (c-)
E present (e+)          a                c
E absent (e-)           b                d
"Does C cause E?" (rate on a scale from 0 to 100)

Two models of causal judgment
Delta-P (Jenkins & Ward, 1965): ΔP = P(e+|c+) - P(e+|c-)
Power PC (Cheng, 1997): power = ΔP / (1 - P(e+|c-))
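
Both measures as functions of the contingency counts a, b, c, d defined above (a minimal sketch):

```python
def delta_p(a, b, c, d):
    """Delta-P = P(e+ | c+) - P(e+ | c-)."""
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    """Power PC (Cheng, 1997): Delta-P / (1 - P(e+ | c-)), for generative causes."""
    p_e_given_no_c = c / (c + d)
    return delta_p(a, b, c, d) / (1.0 - p_e_given_no_c)

# Example: 6 of 8 cases with C show the effect, 2 of 8 without C do.
print(delta_p(6, 2, 2, 6), causal_power(6, 2, 2, 6))  # 0.5, 0.666...
```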

Buehner and Cheng (1997)
[Bar graphs comparing human judgments ("People") with the predictions of ΔP and causal power across contingency conditions]

Buehner and Cheng (1997)
Conditions with constant ΔP, but changing human judgments

Buehner and Cheng (1997)
Conditions with constant causal power, but changing human judgments

Buehner and Cheng (1997)
Conditions with ΔP = 0, but changing human judgments

Causal structure vs. causal strength
Strength: how strong is a relationship? (strength parameters w0, w1 on the links B → E and C → E)
Structure: does a relationship exist? (h1 = C → E ← B vs. h0 = B → E)

Causal strength
Assume the structure h1: C → E ← B, with strength parameters w0 (for B) and w1 (for C)
ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C):
linear → ΔP, noisy-OR → causal power

Causal structure
Hypotheses: h1 = C → E ← B; h0 = B → E
Bayesian causal inference: support = P(d|h1) / P(d|h0)
This likelihood ratio (Bayes factor) gives evidence in favor of h1
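
A sketch of this comparison, approximating P(d|h1) and P(d|h0) by averaging the noisy-OR likelihood over uniform priors on the strength parameters (the counts and the coarse grid are illustrative):

```python
import math

# Contingency counts (illustrative): a, b = e+/e- with C present; c, d = e+/e- with C absent.
a, b, c, d = 6, 2, 2, 6

def likelihood(w0, w1):
    """P(data | w0, w1) under noisy-OR, with the background cause B always present."""
    p_with_c = 1.0 - (1.0 - w0) * (1.0 - w1)   # P(e+ | c+)
    p_without_c = w0                            # P(e+ | c-)
    return (p_with_c ** a * (1.0 - p_with_c) ** b
            * p_without_c ** c * (1.0 - p_without_c) ** d)

grid = [(i + 0.5) / 100 for i in range(100)]   # midpoint grid on (0, 1)

# h1: B and C both cause E; average the likelihood over uniform priors on w0 and w1.
p_d_h1 = sum(likelihood(w0, w1) for w0 in grid for w1 in grid) / len(grid) ** 2

# h0: only B causes E; this is the w1 = 0 slice, averaged over w0 alone.
p_d_h0 = sum(likelihood(w0, 0.0) for w0 in grid) / len(grid)

bayes_factor = p_d_h1 / p_d_h0
print(bayes_factor, math.log(bayes_factor))   # values above 1 (log above 0) favor a C -> E link
```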

Buehner and Cheng (1997)
[Bar graphs: human judgments ("People") against ΔP (r = 0.89), causal power (r = 0.88), and causal support (r = 0.97)]

The importance of parameterization
Noisy-OR incorporates mechanism assumptions:
generativity: causes increase the probability of their effects
each cause is sufficient to produce the effect
causes act via independent mechanisms
(Cheng, 1997)
Consider other models:
statistical dependence: χ² test
generic parameterization (cf. Anderson, 1990)

[Bar graphs: human judgments ("People") compared with causal support under noisy-OR, the χ² statistic, and causal support under the generic parameterization]

Generativity is essential
[Bar graphs: human judgments and causal support for conditions with P(e+|c+) = P(e+|c-) ranging over 8/8, 6/8, 4/8, 2/8, 0/8]
Predictions result from a "ceiling effect": ceiling effects only matter if you believe a cause increases the probability of an effect

Blicket detector (Dave Sobel, Alison Gopnik, and colleagues)
"See this? It's a blicket machine. Blickets make it go."
"Let's put this one on the machine. Oooh, it's a blicket!"

“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)
Two objects: A and B
Trial 1 (AB trial): A and B on detector, detector active
Trial 2 (A trial): A on detector, detector active
4-year-olds judge whether each object is a blicket
A: a blicket (100% say yes)
B: probably not a blicket (34% say yes)

Possible hypotheses
[The candidate causal structures relating blocks A and B to detector activation E]

Bayesian inference
Evaluating causal models in light of data: P(h|d) ∝ P(d|h) P(h)
Inferring a particular causal relation: sum over hypotheses, e.g. P(A → E | d) = Σh P(A → E | h) P(h|d)

Bayesian inference
With a uniform prior on hypotheses, and the generic parameterization:
[Bar graph: probability of being a blicket for A and B; values shown near 0.32 and 0.34]

Modeling backwards blocking
Assume:
Links can only exist from blocks to detectors
Blocks are blickets with prior probability q
Blickets always activate detectors, but detectors never activate on their own: deterministic noisy-OR, with wi = 1 and w0 = 0

Modeling backwards blocking
Hypotheses about which blocks are blickets, their priors, and the detector activations they predict:
                       h00         h01         h10        h11
Prior                  (1 - q)²    (1 - q) q   q (1 - q)  q²
P(E=1 | A=0, B=0)      0           0           0          0
P(E=1 | A=1, B=0)      0           0           1          1
P(E=1 | A=0, B=1)      0           1           0          1
P(E=1 | A=1, B=1)      0           1           1          1

Modeling backwards blocking
The AB trial (detector active with A and B both on the machine) conditions on P(E=1 | A=1, B=1) = 1, which the hypotheses predict as 0, 1, 1, 1 respectively: h00 is ruled out, and h01, h10, h11 remain in proportion to their priors

Modeling backwards blocking
After the AB trial, three hypotheses remain:
                       h01         h10         h11
Prior                  (1 - q) q   q (1 - q)   q²
P(E=1 | A=1, B=0)      0           1           1
P(E=1 | A=1, B=1)      1           1           1
The A trial (detector active with A alone) then rules out h01, so A is a blicket with probability 1, and P(B is a blicket) = q² / (q² + q(1 - q)) = q: belief about B falls back to its prior
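
A sketch of this computation (q = 0.3 is an illustrative prior; detectors are deterministic, so each trial simply eliminates inconsistent hypotheses):

```python
q = 0.3   # prior probability that a block is a blicket (illustrative value)

# Hypotheses: (A is a blicket, B is a blicket) with independent priors.
hypotheses = {
    (0, 0): (1 - q) ** 2,
    (0, 1): (1 - q) * q,
    (1, 0): q * (1 - q),
    (1, 1): q ** 2,
}

def detector_activates(a_on, b_on, a_blicket, b_blicket):
    """Deterministic noisy-OR with w_i = 1, w_0 = 0: any blicket on the machine activates it."""
    return bool((a_on and a_blicket) or (b_on and b_blicket))

def update(posterior, a_on, b_on, activated=True):
    """Condition on one trial: keep only hypotheses that predict the observed outcome."""
    new = {h: p for h, p in posterior.items()
           if detector_activates(a_on, b_on, *h) == activated}
    z = sum(new.values())
    return {h: p / z for h, p in new.items()}

posterior = update(hypotheses, a_on=True, b_on=True)    # Trial 1: A and B, detector active
posterior = update(posterior, a_on=True, b_on=False)    # Trial 2: A alone, detector active

p_a_blicket = sum(p for (a, b), p in posterior.items() if a)
p_b_blicket = sum(p for (a, b), p in posterior.items() if b)
print(p_a_blicket, p_b_blicket)   # A: 1.0;  B: back to the prior q
```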

Manipulating prior probability (Tenenbaum, Sobel, Griffiths, & Gopnik, submitted)
[Bar graphs: blicket judgments for A and B after the initial phase, the AB trial, and the A trial]

Summary
Graphical models provide solutions to many of the challenges of probabilistic models:
defining structured distributions
representing distributions on many variables
efficiently computing probabilities
Causal graphical models provide tools for defining rational models of human causal reasoning and learning