Probabilistic Reasoning (slides © Daniel S. Weld)
Power of Cond. Independence
Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear! Conditional independence is the most basic & robust form of knowledge about uncertain environments.
Bayes' Rule
Simple proof from the definition of conditional probability:
1. P(E|H) = P(E∧H) / P(H)   (Def. cond. prob.)
2. P(H|E) = P(E∧H) / P(E)   (Def. cond. prob.)
3. P(E∧H) = P(E|H) · P(H)   (Mult. by P(H) in line 1)
4. P(H|E) = P(E|H) · P(H) / P(E)   (Substitute #3 in #2). This is Bayes' rule. QED
Use to Compute Diagnostic Probability from Causal Probability
E.g. let M be meningitis, S be stiff neck.
P(M) = 0.0001, P(S) = 0.1, P(S|M) = 0.8
P(M|S) = P(S|M) · P(M) / P(S) = 0.8 × 0.0001 / 0.1 = 0.0008
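A one-line check of this arithmetic in Python (a minimal sketch; the variable names are ours, not the slides'):

```python
# Bayes' rule: diagnostic P(M|S) from causal P(S|M).
p_m, p_s, p_s_given_m = 0.0001, 0.1, 0.8
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # 0.0008
```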
Bayes' Rule & Cond. Independence
Combining the two: if E1, …, En are conditionally independent given H, then
P(H | E1, …, En) = α · P(E1|H) · … · P(En|H) · P(H)
Bayes Nets
In general, a joint distribution P over a set of variables (X1 × … × Xn) requires exponential space for representation & inference. BNs provide a graphical representation of the conditional independence relations in P:
- usually quite compact
- requires assessment of fewer parameters, those being quite natural (e.g., causal)
- efficient (usually) inference: query answering and belief update
BNs: What do the arrows really mean?
Topology may happen to encode causal structure, but topology is only guaranteed to encode conditional independence.
Independence
If X1, X2, … Xn are mutually independent, then
P(X1, X2, … Xn) = P(X1) · P(X2) · … · P(Xn)
The joint can be specified with n parameters, cf. the usual 2^n − 1 parameters required. While such extreme independence is unusual, conditional independence is common, and BNs exploit this conditional independence.
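A minimal Python sketch of this parameter saving, using hypothetical marginals for three mutually independent binary variables:

```python
import itertools

p_true = [0.3, 0.6, 0.9]  # hypothetical P(Xi = true), one parameter per variable

def joint(assignment):
    # Under mutual independence: P(x1, x2, x3) = P(x1) * P(x2) * P(x3).
    prob = 1.0
    for p, x in zip(p_true, assignment):
        prob *= p if x else 1 - p
    return prob

# All 2**3 joint entries are determined by just 3 parameters, and they sum to 1.
print(sum(joint(a) for a in itertools.product([True, False], repeat=3)))
```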
An Example Bayes Net
[network: Earthquake, Burglary → Alarm → Nbr1Calls, Nbr2Calls]
Pr(E=t) = 0.002   Pr(B=t) = 0.001
Pr(A=t | E,B):   e,b: 0.95   e,¬b: 0.29   ¬e,b: 0.94   ¬e,¬b: 0.01
Pr(N1=t | A):   a: 0.9   ¬a: 0.05
Pr(N2=t | A):   a: 0.7   ¬a: 0.01
Earthquake Example (con't)
[network: Earthquake, Burglary → Alarm → Nbr1Calls, Nbr2Calls]
If I know whether Alarm went off, no other evidence influences my degree of belief in Nbr1Calls:
P(N1 | N2,A,E,B) = P(N1|A)
also: P(N2 | A,E,B) = P(N2|A) and P(E|B) = P(E)
By the chain rule we have
P(N1,N2,A,E,B) = P(N1|N2,A,E,B) · P(N2|A,E,B) · P(A|E,B) · P(E|B) · P(B)
             = P(N1|A) · P(N2|A) · P(A|B,E) · P(E) · P(B)
Full joint requires only 10 parameters (cf. 32): 10 = 2+2+4+1+1
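For instance, one joint entry under this factorization, using the CPT values from the previous slide (the choice of entry is ours):
P(n1, n2, a, ¬e, b) = 0.9 · 0.7 · 0.94 · 0.998 · 0.001 ≈ 5.9 × 10⁻⁴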
BNs: Qualitative Structure
- Graphical structure of a BN reflects conditional independence among variables
- Each variable X is a node in the DAG
- Edges denote direct probabilistic influence (usually interpreted causally); parents of X are denoted Par(X)
- X is conditionally independent of all non-descendants given its parents
- A graphical test exists for more general independence: the "Markov blanket"
Given Parents, X is Independent of Non-Descendants
For Example
[network diagram: Earthquake → Radio; Earthquake, Burglary → Alarm → Nbr1Calls, Nbr2Calls]
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
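A small Python sketch of this definition, computing MB(X) from a parents-dict encoding of the example DAG (the encoding is our choice, not the slides'):

```python
parents = {
    "Earthquake": [], "Burglary": [], "Radio": ["Earthquake"],
    "Alarm": ["Earthquake", "Burglary"],
    "Nbr1Calls": ["Alarm"], "Nbr2Calls": ["Alarm"],
}

def markov_blanket(x):
    # MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X)), excluding X itself.
    children = [n for n, ps in parents.items() if x in ps]
    mb = set(parents[x]) | set(children)
    for c in children:
        mb |= set(parents[c])  # children's other parents
    mb.discard(x)
    return mb

print(markov_blanket("Earthquake"))  # {'Radio', 'Alarm', 'Burglary'}
```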
Conditional Probability Tables
[network: Earthquake → Radio; Earthquake, Burglary → Alarm → Nbr1Calls, Nbr2Calls]
Pr(B=t) = 0.05   Pr(B=f) = 0.95
Pr(A=t | E,B), with the complement in parentheses:
  e,b: 0.9 (0.1)   e,¬b: 0.2 (0.8)   ¬e,b: 0.85 (0.15)   ¬e,¬b: 0.01 (0.99)
Conditional Probability Tables
For a complete specification of the joint distribution, quantify the BN: for each variable X, specify the CPT P(X | Par(X)). The number of parameters is locally exponential in |Par(X)|.
If X1, X2, … Xn is any topological sort of the network, then we are assured:
P(Xn, Xn−1, … X1) = P(Xn | Xn−1, …, X1) · P(Xn−1 | Xn−2, …, X1) · … · P(X2 | X1) · P(X1)
              = P(Xn | Par(Xn)) · P(Xn−1 | Par(Xn−1)) · … · P(X1)
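A Python sketch of this chain-rule evaluation, with each CPT stored as a dict from parent-value tuples to P(X=true | parents); the encoding is ours and the numbers follow the earlier alarm example:

```python
parents = {"E": [], "B": [], "A": ["E", "B"], "N1": ["A"], "N2": ["A"]}
cpt = {
    "E": {(): 0.002}, "B": {(): 0.001},
    "A": {(True, True): 0.95, (True, False): 0.29,   # keyed by (e, b)
          (False, True): 0.94, (False, False): 0.01},
    "N1": {(True,): 0.9, (False,): 0.05},
    "N2": {(True,): 0.7, (False,): 0.01},
}

def joint_prob(assignment, parents, cpt):
    # P(x1..xn) = product over variables of P(xi | Par(xi)).
    prob = 1.0
    for x, value in assignment.items():
        p_true = cpt[x][tuple(assignment[p] for p in parents[x])]
        prob *= p_true if value else 1 - p_true
    return prob

print(joint_prob({"E": False, "B": True, "A": True, "N1": True, "N2": True},
                 parents, cpt))  # ≈ 5.9e-4, matching the worked entry above
```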
Bayes Nets Representation Summary
- Bayes nets compactly encode joint distributions
- Guaranteed independencies of distributions can be deduced from the BN graph structure
- D-separation gives precise conditional independence guarantees from the graph alone
- A Bayes net's joint distribution may have further (conditional) independence that is not detectable until you inspect its specific distribution
Inference in BNs
Inference: calculating some useful quantity from a joint probability distribution. The graphical independence representation yields efficient inference schemes.
- Posterior probability: P(Q | E1=e1, …, Ek=ek)
- Most likely explanation: argmax_q P(Q=q | E=e)
Computations are organized by network topology.
P(B | J=true, M=true)
[network: Earthquake, Burglary → Alarm → JohnCalls, MaryCalls; Earthquake → Radio]
P(b|j,m) = α · P(b) · Σ_e P(e) · Σ_a P(a|b,e) · P(j|a) · P(m|a)
Inference by Enumeration
Given unlimited time, inference in BNs is easy. Recipe:
1. State the marginal probabilities you need
2. Figure out ALL the atomic probabilities you need
3. Calculate and combine them
E.g., P(B | J=true, M=true) as above.
In this simple method, we only need the BN to synthesize the joint entries.
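A brute-force enumeration sketch in Python, reusing `joint_prob`, `parents`, and `cpt` from the chain-rule sketch above, so the only model input really is the BN:

```python
import itertools

def enumerate_query(query_var, evidence):
    # P(query | evidence) by summing ALL consistent joint entries:
    # exponential in the number of variables.
    vars_ = list(parents)
    dist = {}
    for q in (True, False):
        total = 0.0
        for vals in itertools.product((True, False), repeat=len(vars_)):
            a = dict(zip(vars_, vals))
            if a[query_var] == q and all(a[k] == v for k, v in evidence.items()):
                total += joint_prob(a, parents, cpt)
        dist[q] = total
    z = sum(dist.values())           # normalize away P(evidence)
    return {q: p / z for q, p in dist.items()}

print(enumerate_query("B", {"N1": True, "N2": True}))  # P(B | n1, n2)
```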
Inference by Enumeration?
Variable Elimination
Why is inference by enumeration so slow? You join up the whole joint distribution before you sum out the hidden variables, so you end up repeating a lot of work!
Idea: interleave joining and marginalizing! This is called "Variable Elimination". It is still NP-hard, but usually much faster than inference by enumeration.
We'll need some new notation to define VE.
Factor 1
Joint distribution: P(X,Y)
- entries P(x,y) for all x, y
- sums to 1
Selected joint: P(x,Y)
- a slice of the joint distribution
- entries P(x,y) for fixed x, all y
- sums to P(x)
Factor 2
Family of conditionals: P(X | Y)
- multiple conditionals
- entries P(x | y) for all x, y
- sums to |Y|
Single conditional: P(Y | x)
- entries P(y | x) for fixed x, all y
- sums to 1
Specified family: P(y | X)
- entries P(y | x) for fixed y, but for all x
- sums to … who knows!
In general, when we write P(Y1 … YN | X1 … XM), it is a "factor": a multi-dimensional array whose values are all P(y1 … yN | x1 … xM). Any assigned X or Y is a dimension missing (selected) from the array.
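A tiny Python illustration of these sums, with a factor stored as a dict from assignment pairs to numbers (the 0.8/0.2/0.1/0.9 entries are hypothetical):

```python
# Family of conditionals P(Y | X) over binary X, Y, keyed by (x, y).
P_Y_given_X = {
    (True, True): 0.8, (True, False): 0.2,
    (False, True): 0.1, (False, False): 0.9,
}
# The whole family sums to |X| = 2:
print(sum(P_Y_given_X.values()))  # 2.0
# A single conditional P(Y | x) for fixed x sums to 1:
print(sum(v for (x, y), v in P_Y_given_X.items() if x))  # 1.0
```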
Example: Traffic Domain
Random variables: R (Raining), T (Traffic), L (Late for class!)
First query: P(L)
Operation 1: Join Factors
First basic operation: joining factors. Combining factors is just like a database join:
- get all factors over the joining variable
- build a new factor over the union of the variables involved
Example: join on R. Computation for each entry: pointwise products.
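A Python sketch of the join, with each factor represented as a dict over assignment tuples plus an explicit scope list (this representation is our choice, not the slides'); the traffic numbers are hypothetical:

```python
def join(f1, scope1, f2, scope2):
    # Pointwise product over the union of the two scopes, keeping only
    # entry pairs that agree on the shared (joining) variables.
    scope = list(dict.fromkeys(scope1 + scope2))
    out = {}
    for a1, v1 in f1.items():
        for a2, v2 in f2.items():
            merged = dict(zip(scope1, a1))
            other = dict(zip(scope2, a2))
            if all(merged.get(k, v) == v for k, v in other.items()):
                merged.update(other)
                out[tuple(merged[x] for x in scope)] = v1 * v2
    return out, scope

P_R = {(True,): 0.1, (False,): 0.9}
P_T_given_R = {(True, True): 0.8, (True, False): 0.2,    # keyed by (r, t)
               (False, True): 0.1, (False, False): 0.9}
P_RT, scope = join(P_R, ["R"], P_T_given_R, ["R", "T"])
print(scope, P_RT)   # joint factor P(R, T)
```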
Example: Multiple Joins
Operation 2: Eliminate
Second basic operation: marginalization. Take a factor and sum out a variable; this shrinks the factor to a smaller one. It is a projection operation.
Example: summing R out of P(R,T) to obtain P(T), as in the sketch below.
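The matching sum-out operation in Python, reusing the factor representation and the P_RT factor from the join sketch above:

```python
def eliminate(f, scope, var):
    # Sum the named variable out of the factor (a projection).
    i = scope.index(var)
    out = {}
    for a, v in f.items():
        key = a[:i] + a[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return out, scope[:i] + scope[i + 1:]

P_T, scope = eliminate(P_RT, ["R", "T"], "R")
print(scope, P_T)   # P(T): {(True,): 0.17, (False,): 0.83}
```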
Multiple Elimination
P(L): Marginalizing Early!
Marginalizing Early (2)
Evidence
If there is evidence, start with factors that select that evidence.
- No evidence: the initial factors are just the CPTs P(R), P(T|R), P(L|T).
- Computing P(L | +r), the initial factors become P(+r), P(T|+r), P(L|T).
We eliminate all vars other than query + evidence.
Evidence II
The result will be a selected joint of query and evidence. E.g., for P(L | +r), we'd end up with the factor P(+r, L). To get our answer, just normalize this!
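Two small helpers in the same factor representation: restricting a factor to observed evidence, and normalizing the final selected joint (again a sketch, not the slides' notation):

```python
def restrict(f, scope, var, value):
    # Keep only rows consistent with var=value, dropping that dimension.
    i = scope.index(var)
    out = {a[:i] + a[i + 1:]: v for a, v in f.items() if a[i] == value}
    return out, scope[:i] + scope[i + 1:]

def normalize(f):
    # Turn the selected joint, e.g. P(+r, L), into the conditional P(L | +r).
    z = sum(f.values())
    return {a: v / z for a, v in f.items()}
```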
General Variable Elimination
Query: P(Q | E1=e1, …, Ek=ek)
Start with initial factors: local CPTs (but instantiated by evidence).
While there are still hidden variables (not Q or evidence):
1. Pick a hidden variable H
2. Join all factors mentioning H
3. Eliminate (sum out) H
Join all remaining factors and normalize.
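Putting it together: a compact variable-elimination sketch that reuses `join`, `eliminate`, `restrict`, and `normalize` from the previous sketches; the P(L|T) numbers are hypothetical:

```python
def restrict_all(f, scope, evidence):
    for var, val in evidence.items():
        if var in scope:
            f, scope = restrict(f, scope, var, val)
    return f, scope

def variable_elimination(factors, evidence, hidden_order):
    # factors: list of (table, scope) pairs, one per CPT.
    factors = [restrict_all(f, s, evidence) for f, s in factors]
    for h in hidden_order:                       # pick a hidden variable H
        touching = [fs for fs in factors if h in fs[1]]
        factors = [fs for fs in factors if h not in fs[1]]
        if not touching:
            continue
        f, s = touching[0]
        for f2, s2 in touching[1:]:              # join all factors mentioning H
            f, s = join(f, s, f2, s2)
        factors.append(eliminate(f, s, h))       # eliminate (sum out) H
    f, s = factors[0]
    for f2, s2 in factors[1:]:                   # join remaining factors
        f, s = join(f, s, f2, s2)
    return normalize(f), s                       # ... and normalize

P_L_given_T = {(True, True): 0.3, (True, False): 0.7,    # keyed by (t, l)
               (False, True): 0.1, (False, False): 0.9}
factors = [(P_R, ["R"]), (P_T_given_R, ["R", "T"]), (P_L_given_T, ["T", "L"])]
print(variable_elimination(factors, {"R": True}, ["T"]))  # P(L | +r)
```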
Variable Elimination Bayes Rule
Example 2
Notes on VE
- Each operation is simply a multiplication of factors and summing out a variable
- Complexity is determined by the size of the largest factor: linear in the number of vars, exponential in the largest factor
- Elimination ordering greatly impacts factor size
  - optimal elimination orderings: NP-hard
  - heuristics, special structure (e.g., polytrees)
- On tree-structured graphs, variable elimination runs in polynomial time, like tree-structured CSPs
- Practically, inference is much more tractable using structure of this sort