Conditional Probability Distributions Eran Segal Weizmann Institute
Last Time Local Markov assumptions – basic BN independencies d-separation – all independencies via graph structure G is an I-Map of P if and only if P factorizes over G I-equivalence – graphs with identical independencies Minimal I-Map All distributions have I-Maps (sometimes more than one) Minimal I-Map does not capture all independencies in P Perfect Map – not every distribution P has one PDAGs Compact representation of I-equivalence graphs Algorithm for finding PDAGs
CPDs Thus far we ignored the representation of CPDs Today we will cover the range of CPD representations Discrete Continuous Sparse Deterministic Linear
Table CPDs Entry for each joint assignment of X and Pa(X) For each pa x : Most general representation Represents every discrete CPD Limitations Cannot model continuous RVs Number of parameters exponential in |Pa(X)| Cannot model large in-degree dependencies Ignores structure within the CPD S Is0s0 s1s1 i0i i1i I i0i0 i1i P(S|I)P(I) I S
Structured CPDs Key idea: reduce parameters by modeling P(X|Pa X ) without explicitly modeling all entries of the joint Lose expressive power (cannot represent every CPD)
Deterministic CPDs There is a function f: Val(Pa X ) Val(X) such that Examples OR, AND, NAND functions Z = Y+X (continuous variables)
Deterministic CPDs S T1T1 T2T2 s0s0 s1s1 t0t0 t0t t0t0 t1t t1t1 t0t t1t1 t1t T1T1 S T2T2 T1T1 T T2T2 S S Ts0s0 s1s1 t0t t1t T T1T1 T2T2 t0t0 t1t1 t0t0 t0t0 10 t0t0 t1t1 01 t1t1 t0t0 01 t1t1 t1t1 01 Replace spurious dependencies with deterministic CPDs Need to make sure that deterministic CPD is compactly stored
Deterministic CPDs Induce additional conditional independencies Example: T is any deterministic function of T 1,T 2 T1T1 T T2T2 S1S1 Ind(S 1 ;S 2 | T 1,T 2 ) S2S2
Deterministic CPDs Induce additional conditional independencies Example: C is an XOR deterministic function of A,B D C A E B Ind(D;E | B,C)
Deterministic CPDs Induce additional conditional independencies Example: T is an OR deterministic function of T 1,T 2 T1T1 T T2T2 S1S1 Ind(S 1 ;S 2 | T 1 =t 1 ) S2S2 Context specific independencies
Context Specific Independencies Let X,Y,Z be pairwise disjoint RV sets Let C be a set of variables and c Val(C) X and Y are contextually independent given Z and c, denoted (X c Y | Z,c) if:
Tree CPDs D ABCd0d0 d1d1 a0a0 b0b0 c0c a0a0 b0b0 c1c a0a0 b1b1 c0c a0a0 b1b1 c1c a1a1 b0b0 c0c a1a1 b0b0 c1c a1a1 b1b1 c0c A1A1 b1b1 C1C A D BC A D BC A B C (0.2,0.8) (0.4,0.6) (0.7,0.3) (0.9,0.1) a0a0 a1a1 b1b1 b0b0 c1c1 c0c0 4 parameters 8 parameters
ActivatorRepressor Regulated gene ActivatorRepressor Regulated gene Activator Regulated gene Repressor State 1 Activator State 2 Activator Repressor State 3 Gene Regulation: Simple Example Regulated gene DNA Microarray Regulators DNA Microarray Regulators
truefalse true false Regulation Tree Activator? Repressor? State 1State 2State 3 true Regulation program Module genes Activator expression Repressor expression Segal et al., Nature Genetics ’03
Respiration Module Regulation program Module genes Hap4+Msn4 known to regulate module genes Module genes known targets of predicted regulators? Predicted regulator Segal et al., Nature Genetics ’03
Rule CPDs A rule r is a pair (c;p) where c is an assignment to a subset of variables C and p [0,1]. Let Scope[r]=C A rule-based CPD P(X|Pa(X)) is a set of rules R s.t. For each rule r R Scope[r] {X} Pa(X) For each assignment (x,u) to {X} Pa(X) we have exactly one rule (c;p) R such that c is compatible with (x,u). Then, we have P(X=x | Pa(X)=u) = p
Rule CPDs Example Let X be a variable with Pa(X) = {A,B,C} r1: (a 1, b 1, x 0 ; 0.1) r2: (a 0, c 1, x 0 ; 0.2) r3: (b 0, c 0, x 0 ; 0.3) r4: (a 1, b 0, c 1, x 0 ; 0.4) r5: (a 0, b 1, c 0 ; 0.5) r6: (a 1, b 1, x 1 ; 0.9) r7: (a 0, c 1, x 1 ; 0.8) r8: (b 0, c 0, x 1 ; 0.7) r9: (a 1, b 0, c 1, x 1 ; 0.6) Note: each assignment maps to exactly one rule Rules cannot always be represented compactly within tree CPDs
Tree CPDs and Rule CPDs Can represent every discrete function Can be easily learned and dealt with in inference But, some functions are not represented compactly XOR in tree CPDs: cannot split in one step on a 0,b 1 and a 1,b 0 Alternative representations exist Complex logical rules
Context Specific Independencies A D BC C (0.2,0.8)(0.4,0.6)(0.7,0.3)(0.9,0.1) a0a0 a1a1 b1b1 b0b0 c1c1 c0c0 B A A D BC A=a 1 Ind(D,C | A=a 1 ) A D BC A=a 0 Ind(D,B | A=a 0 ) Reasoning by cases implies that Ind(B,C | A,D)
Independence of Causal Influence X1X1 Y X2X2 XnXn... Causes: X 1,…X n Effect: Y General case: Y has a complex dependency on X 1,…X n Common case Each X i influences Y separately Influence of X 1,…X n is combined to an overall influence on Y
Example 1: Noisy OR Two independent effects X 1, X 2 Y=y 1 cannot happen unless one of X 1, X 2 occurs P(Y=y 0 | X 1 =x 1 0, X 2 =x 2 0 ) = P(X 1 =x 1 0 )P(X 2 =x 2 0 ) Y X1X1 X2X2 y0y0 y1y1 x10x10 x20x20 10 x10x10 x21x x11x11 x20x x11x11 x21x X1X1 Y X2X2
Noisy OR: Elaborate Representation Y X’ 1 X’ 2 y0y0 y1y1 x10x10 x20x20 10 x10x10 x21x21 01 x11x11 x20x20 01 x11x11 x21x21 01 X’ 1 Y X’ 2 X1X1 X2X2 Deterministic OR X’ 1 X1X1 x10x10 x11x11 x10x10 10 x11x Noisy CPD 1 X’ 2 X2X2 x20x20 x21x21 x20x20 10 x21x Noisy CPD 2 Noise parameter X 1 =0.9 Noise parameter X 1 =0.8
Noisy OR: Elaborate Representation Decomposition results in the same distribution
Noisy OR: General Case Y is a binary variable with k binary parents X 1,...X n CPD P(Y | X 1,...X n ) is a noisy OR if there are k+1 noise parameters 0, 1,..., n such that
Noisy OR Independencies X’ 1 Y X’ 2 X1X1 X2X2 X’ n XnXn... i j: Ind(X i X j | Y=y o )
Generalized Linear Models Model is a soft version of a linear threshold function Example: logistic function Binary variables X 1,...X n,Y
General Formulation Let Y be a random variable with parents X 1,...X n The CPD P(Y | X 1,...X n ) exhibits independence of causal influence (ICI) if it can be described by The CPD P(Z | Z 1,...Z n ) is deterministic Z1Z1 Z Z2Z2 X1X1 X2X2 Y ZnZn XnXn... Noisy OR Z i has noise model Z is an OR function Y is the identity CPD Logistic Z i = w i 1(X i =1) Z = Z i Y = logit (Z)
General Formulation Key advantage: O(n) parameters As stated, not all that useful as any complex CPD can be represented through a complex deterministic CPD
Continuous Variables One solution: Discretize Often requires too many value states Loses domain structure Other solution: use continuous function for P(X|Pa(X)) Can combine continuous and discrete variables, resulting in hybrid networks Inference and learning may become more difficult
Gaussian Density Functions Among the most common continuous representations Univariate case:
Gaussian Density Functions A multivariate Gaussian distribution over X 1,...X n has Mean vector n x n positive definite covariance matrix positive definite: Joint density function: i =E[X i ] ii =Var[X i ] ij =Cov[X i,X j ]=E[X i X j ]-E[X i ]E[X j ] (i j)
Gaussian Density Functions Marginal distributions are easy to compute Independencies can be determined from parameters If X=X 1,...X n have a joint normal distribution N( ; ) then Ind(X i X j ) iff ij =0 Does not hold in general for non-Gaussian distributions
Linear Gaussian CPDs Y is a continuous variable with parents X 1,...X n Y has a linear Gaussian model if it can be described using parameters 0,..., n and 2 such that Vector notation: Pros Simple Captures many interesting dependencies Cons Fixed variance (variance cannot depend on parents values)
Linear Gaussian Bayesian Network A linear Gaussian Bayesian network is a Bayesian network where All variables are continuous All of the CPDs are linear Gaussians Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions
Equivalence Theorem Y is a linear Gaussian of its parents X 1,...X n : Assume that X 1,...X n are jointly Gaussian with N( ; ) Then: The marginal distribution of Y is Gaussian with N( Y ; Y 2 ) The joint distribution over {X,Y} is Gaussian where Linear Gaussian BNs define a joint Gaussian distribution
Converse Equivalence Theorem If {X,Y} have a joint Gaussian distribution then Implications of equivalence Joint distribution has compact representation: O(n 2 ) We can easily transform back and forth between Gaussian distributions and linear Gaussian Bayesian networks Representations may differ in parameters Example: Gaussian distribution has full covariance matrix Linear Gaussian XnXn X2X2 X1X1...
Hybrid Models Models of continuous and discrete variables Continuous variables with discrete parents Discrete variables with continuous parents Conditional Linear Gaussians Y continuous variable X = {X 1,...,X n } continuous parents U = {U 1,...,U m } discrete parents A Conditional Linear Bayesian network is one where Discrete variables have only discrete parents Continuous variables have only CLG CPDs
Hybrid Models Continuous parents for discrete children Threshold models Linear sigmoid
Summary: CPD Models Deterministic functions Context specific dependencies Independence of causal influence Noisy OR Logistic function CPDs capture additional domain structure