CS B553: Algorithms for Optimization and Learning
Bayesian Networks
Agenda
- Probabilistic inference queries
- Top-down inference
- Variable elimination
Probability Queries
Given: some probabilistic model over variables X.
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y, i.e., P(Y | E = e).
This is the inference problem.
Answering Inference Problems with the Joint Distribution
Easiest case: Y = X \ E. Then P(Y | E = e) = P(Y, e) / P(e); the denominator makes the probabilities sum to 1. Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e).
Otherwise, let W = X \ (E ∪ Y). Then P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e), with P(e) = Σ_y Σ_w P(Y = y, W = w, e).
Inference with the joint distribution takes O(2^|X \ E|) time for binary variables.
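Brute-force inference with the joint distribution can be sketched in a few lines. This is only an illustrative helper (the `query` function, the variable names, and the dict-of-tuples joint representation are assumptions, not from the slides):

```python
def query(joint, var_names, query_vars, evidence):
    """Answer P(query_vars | evidence) by brute-force marginalization.

    joint maps full assignments (tuples of 0/1, ordered as var_names)
    to probabilities. Cost is O(2^|X \\ E|) for binary variables."""
    idx = {v: i for i, v in enumerate(var_names)}
    unnorm = {}
    for assign, p in joint.items():
        # Skip assignments inconsistent with the evidence E = e.
        if any(assign[idx[v]] != val for v, val in evidence.items()):
            continue
        # Group the remaining mass by the query variables' values.
        key = tuple(assign[idx[v]] for v in query_vars)
        unnorm[key] = unnorm.get(key, 0.0) + p
    z = sum(unnorm.values())  # z = P(e); dividing makes the result sum to 1
    return {k: v / z for k, v in unnorm.items()}

# Toy joint over two binary variables X, Y.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
posterior = query(joint, ["X", "Y"], ["X"], {"Y": 1})
print(posterior)  # P(X=0|Y=1) = 0.2/0.6, P(X=1|Y=1) = 0.4/0.6
```

Note the exponential cost: the loop visits every entry of the joint table, which is exactly what the slides' later methods avoid.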
Answering Inference Problems with the Joint Distribution
Another common case: Y = {Q} (a single query variable). Can we do better than brute-force marginalization of the joint distribution?
Top-Down Inference
The alarm network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls.

P(B) = 0.001   P(E) = 0.002

B E | P(A|B,E)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J|A)   A | P(M|A)
T | 0.90     T | 0.70
F | 0.05     F | 0.01

Suppose we want to compute P(Alarm).
1. P(A) = Σ_{b,e} P(A, b, e)
2. P(A) = Σ_{b,e} P(A|b,e) P(b) P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95·0.001·0.002 + 0.94·0.001·0.998 + 0.29·0.999·0.002 + 0.001·0.999·0.998 = 0.00252
Now suppose we want to compute P(MaryCalls):
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
2. P(M) = 0.70·0.00252 + 0.01·(1 - 0.00252) = 0.0117
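The two top-down computations above can be checked directly in code. A minimal sketch using the CPT numbers from the slides:

```python
# CPTs from the slides' alarm network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_M_given_A = {True: 0.70, False: 0.01}

# Step 1: P(a) = sum over b, e of P(a|b,e) P(b) P(e)
p_a = sum(P_A[(b, e)]
          * (P_B if b else 1 - P_B)
          * (P_E if e else 1 - P_E)
          for b in (True, False) for e in (True, False))

# Step 2: P(m) = P(m|a) P(a) + P(m|not a) P(not a)
p_m = P_M_given_A[True] * p_a + P_M_given_A[False] * (1 - p_a)
print(round(p_a, 5), round(p_m, 4))  # 0.00252 0.0117
```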
Top-Down Inference with Evidence
Suppose we want to compute P(Alarm | Earthquake), i.e., P(A|e):
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A|b, e) P(b)
3. P(A|e) = 0.95·0.001 + 0.29·0.999 = 0.29066
Top-Down Inference
Only works if the graph of ancestors is a polytree, with evidence given on ancestor(s) of Q. Efficient: O(d) time, where d is the number of ancestors of a variable (|Pa_X| assumed bounded by a constant). Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node.
Naïve Bayes Classifier
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
Structure: Class → Feature_1, Feature_2, …, Feature_n.
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
Given the features, what is the class? E.g., Spam / Not Spam, or English / French / Latin, with word occurrences as the features.
Normalization Factors
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
The 1/Z term is a normalization factor so that P(C | F_1, …, F_n) sums to 1:
Z = Σ_c P(C = c) Π_i P(F_i | C = c)
Z is different for each value of F_1, …, F_n and is often left implicit. Usual implementation: first compute the unnormalized distribution P(C) Π_i P(F_i = f_i | C) for all values of C, then perform a normalization step in O(|Val(C)|) time.
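The usual implementation described above might look like this. A minimal sketch; the dict-based CPT layout and the toy spam numbers are illustrative assumptions:

```python
def nb_posterior(prior, cond, features):
    """P(C|f_1..f_n) for a naive Bayes model: compute the unnormalized
    P(C) * prod_i P(f_i|C) for each class, then normalize in O(|Val(C)|)."""
    unnorm = {}
    for c in prior:
        p = prior[c]
        for i, f in enumerate(features):
            p *= cond[c][i][f]       # P(F_i = f | C = c)
        unnorm[c] = p
    z = sum(unnorm.values())         # the normalization factor Z
    return {c: p / z for c, p in unnorm.items()}

# Toy spam model: two binary word-occurrence features.
prior = {"spam": 0.3, "ham": 0.7}
cond = {"spam": [{True: 0.8, False: 0.2}, {True: 0.6, False: 0.4}],
        "ham":  [{True: 0.1, False: 0.9}, {True: 0.5, False: 0.5}]}
posterior = nb_posterior(prior, cond, [True, True])
print(posterior)  # spam: 0.144/0.179, ham: 0.035/0.179
```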
Note: Numerical Issues in Implementation
Suppose P(f_i | c) is very small for all i, e.g., the probability that a given uncommon word f_i appears in a document. The product P(C) Π_i P(F_i | C) with large n will be exceedingly small and might underflow.
A more numerically stable solution:
- Compute log P(C) + Σ_i log P(f_i | C) for all values of C
- Compute b = max_c [log P(c) + Σ_i log P(f_i | c)]
- Then P(C | f_1, …, f_n) = exp(log P(C) + Σ_i log P(f_i | C) - b) / Z', with Z' a normalization factor
This is a common trick when dealing with products of many small numbers.
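The max-subtraction trick above, sketched in code (same hypothetical CPT layout as before; with no underflow the answer matches the direct product):

```python
import math

def nb_posterior_log(prior, cond, features):
    """Numerically stable naive Bayes: work in log space, subtract the
    max score b before exponentiating, then normalize by Z'."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for i, f in enumerate(features):
            s += math.log(cond[c][i][f])   # log P(f_i | c)
        scores[c] = s
    b = max(scores.values())               # b = max_c [log P(c) + sum_i ...]
    shifted = {c: math.exp(s - b) for c, s in scores.items()}
    z = sum(shifted.values())              # Z' in the slide's notation
    return {c: v / z for c, v in shifted.items()}

prior = {"spam": 0.3, "ham": 0.7}
cond = {"spam": [{True: 0.8}, {True: 0.6}],
        "ham":  [{True: 0.1}, {True: 0.5}]}
posterior = nb_posterior_log(prior, cond, [True, True])
print(round(posterior["spam"], 4))  # 0.8045, same as the direct product
```

Subtracting b guarantees the largest exponent is exp(0) = 1, so the sum never underflows to zero even with thousands of features.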
Naïve Bayes Classifier
Given only some of the features, what is the distribution over the class?
P(C | F_1, …, F_k) = (1/Z) P(C, F_1, …, F_k)
= (1/Z) Σ_{f_{k+1}, …, f_n} P(C, F_1, …, F_k, f_{k+1}, …, f_n)
= (1/Z) P(C) Σ_{f_{k+1}, …, f_n} Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} P(f_j | C)
= (1/Z) P(C) Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)
= (1/Z) P(C) Π_{i=1…k} P(F_i | C)
The unobserved features simply marginalize out, since each Σ_{f_j} P(f_j | C) = 1.
For General Bayes Nets
Exact inference: variable elimination
- Efficient for polytrees and certain "simple" graphs
- NP-hard in general
Approximate inference
- Monte Carlo sampling techniques
- Belief propagation (exact in polytrees)
Sum-Product Formulation
Suppose we want to compute P(A):
P(A) = Σ_{b,e} P(A|b,e) P(b) P(e)
Sum-Product Formulation
Replace the CPTs of B, E, and A with the single factor τ(A,B,E) = P(A|B,E) P(B) P(E) (the product step):

A B E | τ(A,B,E)
T T T | 1.9e-6
T T F | 0.000938
T F T | 0.000579
T F F | 0.000997
F T T | 1e-7
F T F | 5.988e-5
F F T | 0.00141858
F F F | 0.996

Then P(A) = Σ_{b,e} τ(A, b, e) (the sum step): summing the A = T rows gives P(A = T) ≈ 0.00252, and the A = F rows give P(A = F) ≈ 0.99748.
Probability Queries
Computing P(Y, E) in a BN is a sum-product operation:
P(Y, E) = Σ_w P(Y, W = w, E) = Σ_w φ(Y ∪ E, W = w)
with φ defined over a set of variables as the product of their CPTs: φ(X) = Π_{X_i ∈ X} P(X_i | Pa_{X_i}).
Idea of variable elimination: rearrange the order of the sums and products into a recursive set of smaller sum-products.
Variable Elimination
Consider the linear network X1 → X2 → X3, so that P(X) = P(X1) P(X2|X1) P(X3|X2).
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2|x1) P(X3|x2)
     = Σ_{x2} P(X3|x2) Σ_{x1} P(x1) P(x2|x1)     (rearrange the sums)
     = Σ_{x2} P(X3|x2) τ(x2)
The factor τ(x2) = Σ_{x1} P(x1) P(x2|x1) is computed once for each value of X2, cached, and reused for both values of X3.
How many multiplications and additions are saved? Brute force uses 2·4·2 = 16 multiplications and 2·3 = 6 additions; with caching, VE uses 4 + 4 = 8 multiplications and 2 + 2 = 4 additions. This can lead to huge gains in larger networks.
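The chain computation, with τ cached and reused, fits in a few lines (the CPT numbers here are made up for illustration; the slides do not give any):

```python
# Chain X1 -> X2 -> X3 with binary variables; hypothetical CPTs.
P_X1 = [0.6, 0.4]                    # P(X1 = 0), P(X1 = 1)
P_X2 = [[0.7, 0.3], [0.2, 0.8]]      # P_X2[x1][x2] = P(X2 = x2 | X1 = x1)
P_X3 = [[0.9, 0.1], [0.4, 0.6]]      # P_X3[x2][x3] = P(X3 = x3 | X2 = x2)

# Inner sum first: tau(x2) = sum_x1 P(x1) P(x2 | x1), computed once.
tau = [sum(P_X1[x1] * P_X2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]

# Outer sum reuses the cached tau for both values of X3.
p_x3 = [sum(P_X3[x2][x3] * tau[x2] for x2 in (0, 1)) for x3 in (0, 1)]
print(p_x3)  # a distribution over X3 that sums to 1
```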
VE in the Alarm Example
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
          = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)
          = P(E) Σ_b P(b) τ1(j, m, E, b)     (τ1 computed for all values of E, b; note τ1(j,m,E,b) = P(j,m|E,b))
          = P(E) τ2(j, m, E)                 (τ2 computed for all values of E; note τ2(j,m,E) = P(j,m|E))
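The full elimination for P(E | j, m) can be run with the slides' CPTs, with j = m = true observed (the function name and loop structure are just one way to write it):

```python
# Alarm-network CPTs from the slides.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}      # P(JohnCalls = true | a)
P_M = {True: 0.70, False: 0.01}      # P(MaryCalls = true | a)

def p_e_joint(e):
    """P(E = e, j, m) as P(E) * sum_b P(b) * sum_a P(a|E,b) P(j|a) P(m|a)."""
    outer = 0.0
    for b in (True, False):
        inner = 0.0                   # tau1(j, m, e, b)
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            inner += p_a * P_J[a] * P_M[a]
        outer += (P_B if b else 1 - P_B) * inner
    return (P_E if e else 1 - P_E) * outer

unnorm = {e: p_e_joint(e) for e in (True, False)}
z = sum(unnorm.values())              # z = P(j, m)
posterior = {e: p / z for e, p in unnorm.items()}
print(round(posterior[True], 3))      # 0.176
```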
What Order to Perform VE?
For tree-like BNs (polytrees), order the eliminations so that parents come before children. The number of entries in each intermediate probability table is then 2^k, where k is the number of parents of a node. If the number of parents per node is bounded, VE runs in linear time!
In other networks, intermediate factors may become large.
Non-Polytree Networks
Network: A → B, A → C, B → D, C → D.
P(D) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|a) P(D|b,c)
     = Σ_b Σ_c P(D|b,c) Σ_a P(a) P(b|a) P(c|a)
No more simplifications: the inner sum over a produces a factor over both B and C.
Do τ-Factors Correspond to Conditional Distributions?
Sometimes, but not necessarily (illustrated on a small network over A, B, C, D).
Implementation Notes
- How to implement multidimensional factors?
- How to efficiently implement sum-product?
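One possible answer to both questions, sketched as a dict-backed factor over binary variables (the `Factor` class and its methods are illustrative assumptions; a real implementation would more likely use multidimensional arrays):

```python
from itertools import product

class Factor:
    """A multidimensional factor: an ordered scope of variable names plus
    a table mapping value tuples to nonnegative reals (binary vars here)."""
    def __init__(self, scope, table):
        self.scope = scope
        self.table = table

    def multiply(self, other):
        """Pointwise product over the union of the two scopes."""
        scope = self.scope + [v for v in other.scope if v not in self.scope]
        table = {}
        for assign in product((0, 1), repeat=len(scope)):
            a = dict(zip(scope, assign))
            table[assign] = (self.table[tuple(a[v] for v in self.scope)]
                             * other.table[tuple(a[v] for v in other.scope)])
        return Factor(scope, table)

    def sum_out(self, var):
        """Marginalize var out of the factor (the 'sum' in sum-product)."""
        i = self.scope.index(var)
        scope = self.scope[:i] + self.scope[i + 1:]
        table = {}
        for assign, p in self.table.items():
            key = assign[:i] + assign[i + 1:]
            table[key] = table.get(key, 0.0) + p
        return Factor(scope, table)

# Reproduce tau(x2) from the chain example: multiply P(X1) by P(X2|X1),
# then sum X1 out.
f1 = Factor(["X1"], {(0,): 0.6, (1,): 0.4})
f2 = Factor(["X1", "X2"], {(0, 0): 0.7, (0, 1): 0.3,
                           (1, 0): 0.2, (1, 1): 0.8})
tau = f1.multiply(f2).sum_out("X1")
print(tau.scope, tau.table)  # tau(x2) is 0.5 for both values of X2
```

Variable elimination is then just an alternating sequence of `multiply` and `sum_out` calls in the chosen elimination order.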