Conditional Probability Distributions Eran Segal Weizmann Institute

Last Time
- Local Markov assumptions – basic BN independencies
- d-separation – all independencies via graph structure
- G is an I-Map of P if and only if P factorizes over G
- I-equivalence – graphs with identical independencies
- Minimal I-Map
  - All distributions have I-Maps (sometimes more than one)
  - A minimal I-Map does not capture all independencies in P
- Perfect Map – not every distribution P has one
- PDAGs
  - Compact representation of I-equivalence graphs
  - Algorithm for finding PDAGs

CPDs
- Thus far we have ignored the representation of CPDs
- Today we cover the range of CPD representations:
  - Discrete
  - Continuous
  - Sparse
  - Deterministic
  - Linear

Table CPDs
- Entry for each joint assignment of X and Pa(X)
- For each parent assignment pa_X: Σ_x P(X = x | pa_X) = 1
- Most general representation: represents every discrete CPD
- Limitations
  - Cannot model continuous RVs
  - Number of parameters is exponential in |Pa(X)|
  - Cannot model large in-degree dependencies
  - Ignores structure within the CPD
- Example: network I → S with table CPDs P(I) (entries for i0, i1) and P(S | I) (a row for each value of I, columns s0, s1)
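
A minimal sketch of how a table CPD might be stored in code. The variable names I and S and the probability values are illustrative, not taken from the slide:

```python
import itertools

# A table CPD maps each joint assignment of the parents to a distribution
# over the child's values; the number of rows (and hence parameters)
# grows exponentially with the number of parents.
def make_table_cpd(child_vals, parent_vals_list, rows):
    """rows: dict from a tuple of parent values to a dict over child values."""
    for pa in itertools.product(*parent_vals_list):
        dist = rows[pa]
        assert set(dist) == set(child_vals)
        assert abs(sum(dist.values()) - 1.0) < 1e-9  # each row sums to 1
    return rows

# Illustrative CPD P(S | I) for binary I and S.
p_s_given_i = make_table_cpd(
    child_vals=["s0", "s1"],
    parent_vals_list=[["i0", "i1"]],
    rows={("i0",): {"s0": 0.95, "s1": 0.05},
          ("i1",): {"s0": 0.20, "s1": 0.80}},
)
print(p_s_given_i[("i1",)]["s1"])  # P(S = s1 | I = i1)
```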

Structured CPDs
- Key idea: reduce parameters by modeling P(X | Pa(X)) without explicitly modeling an entry for every joint assignment of Pa(X)
- Trade-off: lose expressive power (cannot represent every CPD)

Deterministic CPDs
- There is a function f: Val(Pa(X)) → Val(X) such that
  P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise
- Examples
  - OR, AND, NAND functions
  - Z = Y + X (continuous variables)
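
A small sketch, assuming binary-valued variables, of building a deterministic CPD from a function: the table places all probability mass on f(parents). Names and values are illustrative:

```python
import itertools

def deterministic_cpd(f, parent_vals_list, child_vals):
    """Build a table CPD where P(x | pa) = 1 iff x == f(*pa)."""
    cpd = {}
    for pa in itertools.product(*parent_vals_list):
        x = f(*pa)
        cpd[pa] = {v: (1.0 if v == x else 0.0) for v in child_vals}
    return cpd

# Deterministic OR of two binary parents.
or_cpd = deterministic_cpd(lambda a, b: a or b, [[0, 1], [0, 1]], [0, 1])
print(or_cpd[(1, 0)])  # {0: 0.0, 1: 1.0}
```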

Deterministic CPDs
- Replace spurious dependencies with deterministic CPDs
  - Instead of S depending directly on both T1 and T2 (a table P(S | T1, T2) with a row per joint assignment), introduce a deterministic node T with T1, T2 → T and T → S, so S only needs the smaller table P(S | T)
  - Here P(T | T1, T2) is the deterministic OR: (t0, t0) → T = t0 with probability 1; every other assignment → T = t1 with probability 1
- Need to make sure that the deterministic CPD is compactly stored

Deterministic CPDs
- Induce additional conditional independencies
- Example: network T1, T2 → T with children S1 and S2 of T; if T is any deterministic function of T1, T2, then Ind(S1 ; S2 | T1, T2)

Deterministic CPDs
- Induce additional conditional independencies
- Example: C is an XOR deterministic function of A and B, with D a child of A and E a child of B; then Ind(D ; E | B, C), since B and C together determine A

Deterministic CPDs
- Induce additional conditional independencies
- Example: T is an OR deterministic function of T1, T2 (same network, with children S1 and S2 of T); then Ind(S1 ; S2 | T1 = t1), since T1 = t1 alone already determines T = t1
- These are context-specific independencies

Context Specific Independencies
- Let X, Y, Z be pairwise disjoint RV sets
- Let C be a set of variables and c ∈ Val(C)
- X and Y are contextually independent given Z and c, denoted (X ⊥c Y | Z, c), if:
  P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0

Tree CPDs
- Full table CPD for P(D | A, B, C): a row for each of the 8 joint assignments of A, B, C – 8 parameters
- Tree CPD: internal nodes test a parent variable, leaves hold distributions over D
  - The tree first splits on A, then on B or C depending on the branch
  - Leaf distributions over (d0, d1): (0.2, 0.8), (0.4, 0.6), (0.7, 0.3), (0.9, 0.1)
  - Only 4 parameters instead of 8
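
A minimal sketch of a tree CPD as a nested structure. The split-on-A-then-C-or-B shape and which leaf gets which distribution are assumptions for illustration; the leaf values are the ones from the slide:

```python
# A tree CPD: internal nodes name the parent variable to test,
# leaves hold a distribution over the child D as (P(d0), P(d1)).
tree_cpd = {
    "split": "A",
    "a0": {"split": "C", "c0": (0.2, 0.8), "c1": (0.4, 0.6)},
    "a1": {"split": "B", "b0": (0.7, 0.3), "b1": (0.9, 0.1)},
}

def lookup(node, assignment):
    """Walk the tree using the parent assignment until a leaf is reached."""
    while isinstance(node, dict):
        node = node[assignment[node["split"]]]
    return node

# Only 4 leaf distributions instead of 8 table rows.
print(lookup(tree_cpd, {"A": "a0", "B": "b1", "C": "c1"}))  # (0.4, 0.6)
```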

Gene Regulation: Simple Example
[Figure: a regulated gene controlled by an activator and a repressor; three regulatory states are distinguished by which regulators are active; expression of the regulators and the regulated gene is measured on DNA microarrays]

Regulation Tree
[Figure: a regulation program for the module genes, represented as a tree over regulator expression – first test "Activator?" (true/false), then "Repressor?"; the leaves correspond to States 1–3, each with its own expression profile for the module genes]
Segal et al., Nature Genetics ’03

Respiration Module
[Figure: learned regulation program and module genes for the respiration module; Hap4 + Msn4 are known to regulate the module genes – are the module genes known targets of the predicted regulator?]
Segal et al., Nature Genetics ’03

Rule CPDs
- A rule r is a pair (c ; p) where c is an assignment to a subset of variables C and p ∈ [0,1]. Let Scope[r] = C
- A rule-based CPD P(X | Pa(X)) is a set of rules R s.t.
  - For each rule r ∈ R: Scope[r] ⊆ {X} ∪ Pa(X)
  - For each assignment (x, u) to {X} ∪ Pa(X) we have exactly one rule (c ; p) ∈ R such that c is compatible with (x, u); then we have P(X = x | Pa(X) = u) = p

Rule CPDs Example
- Let X be a variable with Pa(X) = {A, B, C}
  - r1: (a1, b1, x0 ; 0.1)
  - r2: (a0, c1, x0 ; 0.2)
  - r3: (b0, c0, x0 ; 0.3)
  - r4: (a1, b0, c1, x0 ; 0.4)
  - r5: (a0, b1, c0 ; 0.5)
  - r6: (a1, b1, x1 ; 0.9)
  - r7: (a0, c1, x1 ; 0.8)
  - r8: (b0, c0, x1 ; 0.7)
  - r9: (a1, b0, c1, x1 ; 0.6)
- Note: each assignment maps to exactly one rule
- Rules cannot always be represented compactly within tree CPDs
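
A small sketch of rule-CPD lookup, storing the rules above as (partial assignment, probability) pairs and finding the unique rule compatible with a full assignment:

```python
# Each rule is (partial_assignment, probability); a valid rule set covers
# every full assignment to {X} ∪ Pa(X) with exactly one compatible rule.
rules = [
    ({"A": "a1", "B": "b1", "X": "x0"}, 0.1),
    ({"A": "a0", "C": "c1", "X": "x0"}, 0.2),
    ({"B": "b0", "C": "c0", "X": "x0"}, 0.3),
    ({"A": "a1", "B": "b0", "C": "c1", "X": "x0"}, 0.4),
    ({"A": "a0", "B": "b1", "C": "c0"}, 0.5),
    ({"A": "a1", "B": "b1", "X": "x1"}, 0.9),
    ({"A": "a0", "C": "c1", "X": "x1"}, 0.8),
    ({"B": "b0", "C": "c0", "X": "x1"}, 0.7),
    ({"A": "a1", "B": "b0", "C": "c1", "X": "x1"}, 0.6),
]

def rule_prob(full_assignment, rules):
    """Return p from the unique rule whose context is compatible with the assignment."""
    matches = [p for ctx, p in rules
               if all(full_assignment[var] == val for var, val in ctx.items())]
    assert len(matches) == 1, "a rule CPD has exactly one compatible rule per assignment"
    return matches[0]

# P(X = x0 | A = a1, B = b1, C = c0) comes from rule r1.
print(rule_prob({"A": "a1", "B": "b1", "C": "c0", "X": "x0"}, rules))  # 0.1
```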

Tree CPDs and Rule CPDs
- Can represent every discrete function
- Can be easily learned and dealt with in inference
- But, some functions are not represented compactly
  - XOR in tree CPDs: cannot split in one step on {a0, b1} and {a1, b0}
- Alternative representations exist
  - Complex logical rules

Context Specific Independencies
- Tree CPD for P(D | A, B, C): split on A at the root; one branch then splits on B, the other on C; leaf distributions (0.2, 0.8), (0.4, 0.6), (0.7, 0.3), (0.9, 0.1)
- A = a1 ⇒ Ind(D ; C | A = a1) – the A = a1 branch never tests C
- A = a0 ⇒ Ind(D ; B | A = a0) – the A = a0 branch never tests B
- Reasoning by cases implies that Ind(B ; C | A, D)

Independence of Causal Influence
- Causes X1, …, Xn with a common effect Y (network X1, …, Xn → Y)
- General case: Y has a complex dependency on X1, …, Xn
- Common case
  - Each Xi influences Y separately
  - The influences of X1, …, Xn are combined into an overall influence on Y

Example 1: Noisy OR
- Two causes X1, X2 whose effects on Y are independent
- Y = y1 cannot happen unless at least one of X1, X2 occurs
- The failure probabilities multiply:
  P(Y = y0 | X1 = x1^1, X2 = x2^1) = P(Y = y0 | x1^1, x2^0) · P(Y = y0 | x1^0, x2^1)
- Resulting CPD P(Y | X1, X2) (with the noise parameters 0.9 and 0.8 used on the next slide):

  X1     X2     P(Y = y0)   P(Y = y1)
  x1^0   x2^0   1           0
  x1^0   x2^1   0.2         0.8
  x1^1   x2^0   0.1         0.9
  x1^1   x2^1   0.02        0.98

Noisy OR: Elaborate Representation
- Introduce auxiliary variables X'1, X'2: network X1 → X'1, X2 → X'2, and X'1, X'2 → Y
- Noisy CPD 1: noise parameter λX1 = 0.9
  P(X'1 = x'1^1 | X1 = x1^1) = 0.9, P(X'1 = x'1^1 | X1 = x1^0) = 0
- Noisy CPD 2: noise parameter λX2 = 0.8
  P(X'2 = x'2^1 | X2 = x2^1) = 0.8, P(X'2 = x'2^1 | X2 = x2^0) = 0
- Y is a deterministic OR of X'1, X'2

Noisy OR: Elaborate Representation
- The decomposition results in the same distribution:
  P(Y | X1, X2) = Σ_{X'1, X'2} P(Y | X'1, X'2) · P(X'1 | X1) · P(X'2 | X2)
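
A short sketch, assuming the noise parameters 0.9 and 0.8 above, that marginalizes out the auxiliary variables X'1, X'2 and checks the result against the direct noisy-OR computation:

```python
import itertools

lam = {1: 0.9, 2: 0.8}  # noise parameters for X1 and X2 (values from the slide)

def noisy_cpd(x_prime, x, lam_i):
    """P(X'_i = x_prime | X_i = x): a cause fires only if it is on, then with prob lam_i."""
    p_on = lam_i if x == 1 else 0.0
    return p_on if x_prime == 1 else 1.0 - p_on

def p_y_decomposed(y, x1, x2):
    """Marginalize the auxiliary X'1, X'2; Y is a deterministic OR of X'1, X'2."""
    total = 0.0
    for xp1, xp2 in itertools.product([0, 1], repeat=2):
        if int(xp1 or xp2) == y:
            total += noisy_cpd(xp1, x1, lam[1]) * noisy_cpd(xp2, x2, lam[2])
    return total

def p_y_direct(y, x1, x2):
    """Direct noisy-OR: failure probabilities of the active causes multiply (no leak)."""
    p_y0 = (1 - lam[1]) ** x1 * (1 - lam[2]) ** x2
    return p_y0 if y == 0 else 1 - p_y0

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert abs(p_y_decomposed(1, x1, x2) - p_y_direct(1, x1, x2)) < 1e-12
print("decomposition matches the direct noisy-OR CPD")
```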

Noisy OR: General Case
- Y is a binary variable with n binary parents X1, ..., Xn
- The CPD P(Y | X1, ..., Xn) is a noisy OR if there are n+1 noise parameters λ0, λ1, ..., λn such that
  P(Y = y0 | X1, ..., Xn) = (1 − λ0) · ∏_{i : Xi = xi^1} (1 − λi)
  P(Y = y1 | X1, ..., Xn) = 1 − (1 − λ0) · ∏_{i : Xi = xi^1} (1 − λi)
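
A minimal generic implementation of the formula above; the leak term λ0 and the per-cause parameters are hypothetical inputs:

```python
from math import prod

def noisy_or_p_y1(x, lam, lam0=0.0):
    """P(Y = y1 | x) for binary causes x[i] ∈ {0,1}, noise parameters lam[i], leak lam0."""
    p_y0 = (1 - lam0) * prod(1 - lam[i] for i in range(len(x)) if x[i] == 1)
    return 1 - p_y0

print(noisy_or_p_y1([1, 1], [0.9, 0.8]))  # ≈ 0.98, matching the two-cause example
```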

Noisy OR Independencies
- In the decomposed network (X1, …, Xn with auxiliary X'1, …, X'n and Y their deterministic OR):
  for all i ≠ j: Ind(Xi ; Xj | Y = y0)

Generalized Linear Models
- Model is a soft version of a linear threshold function
- Example: logistic function for binary variables X1, ..., Xn, Y:
  P(Y = y1 | X1, ..., Xn) = sigmoid(w0 + Σ_{i=1}^{n} wi·Xi), where sigmoid(z) = e^z / (1 + e^z)
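
A tiny sketch of a logistic CPD; the weights w0, w1, w2 are made up for illustration:

```python
from math import exp

def logistic_cpd_p_y1(x, w, w0):
    """P(Y = y1 | x) as a logistic function of a weighted sum of the binary parents."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))

# Illustrative weights: each active cause pushes Y toward y1.
print(logistic_cpd_p_y1([1, 0], w=[2.0, 1.5], w0=-1.0))  # sigmoid(1.0) ≈ 0.73
```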

General Formulation
- Let Y be a random variable with parents X1, ..., Xn
- The CPD P(Y | X1, ..., Xn) exhibits independence of causal influence (ICI) if it can be described by a network fragment in which each Xi has a child Zi, Z is a child of Z1, ..., Zn with a deterministic CPD P(Z | Z1, ..., Zn), and Y depends only on Z
- Noisy OR: Zi has a noise model, Z is an OR function, Y is the identity CPD
- Logistic: Zi = wi·1(Xi = 1), Z = Σ Zi, Y is the logistic function of Z

General Formulation
- Key advantage: O(n) parameters
- As stated, not all that useful, since any complex CPD can be represented through a sufficiently complex deterministic CPD

Continuous Variables
- One solution: discretize
  - Often requires too many value states
  - Loses domain structure
- Other solution: use a continuous function for P(X | Pa(X))
  - Can combine continuous and discrete variables, resulting in hybrid networks
  - Inference and learning may become more difficult

Gaussian Density Functions
- Among the most common continuous representations
- Univariate case: X ~ N(μ ; σ²) with density
  p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Gaussian Density Functions
- A multivariate Gaussian distribution over X1, ..., Xn has
  - Mean vector μ
  - n x n positive definite covariance matrix Σ (positive definite: xᵀΣx > 0 for every x ≠ 0)
- Joint density function:
  p(x) = (2π)^(−n/2) · |Σ|^(−1/2) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
- μi = E[Xi], Σii = Var[Xi], Σij = Cov[Xi, Xj] = E[Xi·Xj] − E[Xi]·E[Xj] (i ≠ j)
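
A small sketch evaluating this density with NumPy and cross-checking against SciPy; the mean and covariance values are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, sigma):
    """Multivariate Gaussian density, written out from the formula above."""
    n = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-n / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # positive definite
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, sigma), multivariate_normal.pdf(x, mean=mu, cov=sigma))  # same value
```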

Gaussian Density Functions
- Marginal distributions are easy to compute
- Independencies can be determined from the parameters:
  if X = X1, ..., Xn have a joint normal distribution N(μ ; Σ), then Ind(Xi ; Xj) iff Σij = 0
- Does not hold in general for non-Gaussian distributions

Linear Gaussian CPDs
- Y is a continuous variable with parents X1, ..., Xn
- Y has a linear Gaussian model if it can be described using parameters β0, ..., βn and σ² such that
  P(Y | x1, ..., xn) = N(β0 + β1·x1 + ... + βn·xn ; σ²)
- Vector notation: P(Y | x) = N(β0 + βᵀx ; σ²)
- Pros
  - Simple
  - Captures many interesting dependencies
- Cons
  - Fixed variance (variance cannot depend on parents' values)
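
A minimal sketch that samples Y from a linear Gaussian CPD given parent values; the coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_gaussian(x, beta0, beta, sigma):
    """Draw Y ~ N(beta0 + beta . x ; sigma^2) given parent values x."""
    mean = beta0 + np.dot(beta, x)
    return rng.normal(mean, sigma)

print(sample_linear_gaussian(x=[1.0, -0.5], beta0=0.3, beta=[2.0, 1.0], sigma=0.5))
```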

Linear Gaussian Bayesian Network
- A linear Gaussian Bayesian network is a Bayesian network where
  - All variables are continuous
  - All of the CPDs are linear Gaussians
- Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions

Equivalence Theorem
- Y is a linear Gaussian of its parents X1, ..., Xn:
  P(Y | x) = N(β0 + βᵀx ; σ²)
- Assume that X1, ..., Xn are jointly Gaussian with N(μ ; Σ)
- Then:
  - The marginal distribution of Y is Gaussian with N(μY ; σY²), where
    μY = β0 + βᵀμ and σY² = σ² + βᵀΣβ
  - The joint distribution over {X, Y} is Gaussian, with
    Cov[Xi ; Y] = Σ_j βj·Σij
- Linear Gaussian BNs define a joint Gaussian distribution
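
A quick numerical sketch (with arbitrary parameters) checking the marginal mean and variance of Y given by the theorem against Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Jointly Gaussian parents X = (X1, X2) and a linear Gaussian child Y.
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
beta0, beta, sigma2 = 0.5, np.array([2.0, -1.0]), 0.25

# Theorem: mu_Y = beta0 + beta^T mu,  var_Y = sigma^2 + beta^T Sigma beta.
mu_y = beta0 + beta @ mu
var_y = sigma2 + beta @ Sigma @ beta

# Monte Carlo check: sample X, then Y | X from the linear Gaussian CPD.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = beta0 + X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=len(X))
print(mu_y, Y.mean())   # both ≈ 4.5
print(var_y, Y.var())   # both ≈ 4.65
```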

Converse Equivalence Theorem
- If {X, Y} have a joint Gaussian distribution, then P(Y | X) is a linear Gaussian CPD
- Implications of the equivalence
  - The joint distribution has a compact representation: O(n²)
  - We can easily transform back and forth between Gaussian distributions and linear Gaussian Bayesian networks
  - Representations may differ in parameters
    - Example: a Gaussian distribution with a full covariance matrix has O(n²) parameters, while a linear Gaussian network such as the chain X1 → X2 → ... → Xn needs only O(n)

Hybrid Models
- Models of continuous and discrete variables
  - Continuous variables with discrete parents
  - Discrete variables with continuous parents
- Conditional Linear Gaussians (CLG)
  - Y: continuous variable
  - X = {X1, ..., Xn}: continuous parents
  - U = {U1, ..., Um}: discrete parents
  - For each assignment u of the discrete parents, a separate linear Gaussian:
    P(Y | u, x) = N(βu,0 + Σ_i βu,i·xi ; σu²)
- A Conditional Linear Gaussian Bayesian network is one where
  - Discrete variables have only discrete parents
  - Continuous variables have only CLG CPDs
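
A small sketch of a CLG CPD: the discrete parent selects which linear Gaussian applies; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# One linear Gaussian (beta0, beta, sigma) per assignment of the discrete parent U.
clg_params = {
    "u0": (0.0, np.array([1.0, 0.5]), 0.3),
    "u1": (2.0, np.array([-0.5, 1.5]), 1.0),
}

def clg_sample(u, x):
    """Sample Y ~ N(beta0_u + beta_u . x ; sigma_u^2) for discrete parent value u."""
    beta0, beta, sigma = clg_params[u]
    return rng.normal(beta0 + beta @ np.asarray(x), sigma)

print(clg_sample("u1", [0.2, -1.0]))
```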

Hybrid Models
- Continuous parents for discrete children
  - Threshold models
  - Linear sigmoid (a logistic function of the continuous parents)

Summary: CPD Models
- Deterministic functions
- Context-specific dependencies
- Independence of causal influence
  - Noisy OR
  - Logistic function
- CPDs capture additional domain structure