
1 Radu Marinescu 4C @ University College Cork

2 Uncertainty in medical diagnosis  Diseases produce symptoms  In diagnosis, observed symptoms => disease ID  Uncertainties Symptoms may not occur Symptoms may not be reported Diagnostic tests are not perfect – False positive, false negative How do we estimate confidence?  P(disease | symptoms, tests) = ?

3 Uncertainty in medical decision-making  Physicians, patients must decide on treatments  Treatments may not be successful  Treatments may have unpleasant side effects Choosing treatments  Weigh risks of adverse outcomes People are BAD at reasoning intuitively about probabilities  Provide systematic analysis

4 Probabilistic modeling with joint distributions Conditional independence and factorization Belief (or Bayesian) networks  Example networks and software Inference in belief networks  Exact inference Variable elimination, join-tree clustering, AND/OR search  Approximate inference Mini-clustering, belief propagation, sampling

5 Judea Pearl. “Probabilistic reasoning in intelligent systems”, 1988 Stuart Russell & Peter Norvig. “Artificial Intelligence. A Modern Approach”, 2002 (Ch 13-17) Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks" http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference" http://www.ics.uci.edu/~csp/R48a.ps Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference" http://www.ics.uci.edu/~csp/r62a.pdf Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models". http://www.ics.uci.edu/~csp/r126.pdf

6 A problem domain is modeled by a list of (discrete) random variables: X1, X2, …, Xn Knowledge about the problem is represented by a joint probability distribution: P(X1, X2, …, Xn)

7 Alarm (Pearl88)  Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911  Problem: estimate the probability of a burglary based on who has or has not called  Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)  Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)

8 Defines probabilities for all possible value assignments to the variables in the set

9 What is the probability of burglary given that Mary called, P(B=y | M=y)? Compute the marginal probability: P(B=y, M=y) = ∑E,A,J P(B=y, E, A, J, M=y) Compute the answer (reasoning by conditioning): P(B=y | M=y) = P(B=y, M=y) / P(M=y)
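For concreteness, this brute-force computation can be written directly over the full joint. A minimal Python sketch, assuming binary variables and using the CPT numbers from Russell & Norvig's version of the Burglary example (an assumption; the slides do not list the numbers):

```python
from itertools import product

# CPTs for the Burglary-Earthquake-Alarm example; the numeric values are
# taken from Russell & Norvig's textbook version (an assumption, the slides
# themselves do not give numbers).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) as the product of the five CPT entries."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Reasoning by conditioning: P(B=y | M=y) = P(B=y, M=y) / P(M=y),
# where each term is a sum of the joint over the remaining variables.
p_b_m = sum(joint(True, e, a, j, True) for e, a, j in product([True, False], repeat=3))
p_m   = sum(joint(b, e, a, j, True) for b, e, a, j in product([True, False], repeat=4))
print("P(B=y | M=y) =", p_b_m / p_m)
```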

10 Probability theory well-established and well understood In theory, can perform arbitrary inference among the variables given a joint probability. This is because the joint probability contains information of all aspects of the relationships among the variables  Diagnostic inference: From effects to causes Example: P(B=y | M=y)  Predictive inference: From causes to effects Example: P(M=y | B=y)  Combining evidence: P(B=y | J=y, M=y, E=n) All inference sanctioned by probability theory and hence has clear semantics

11 In the Alarm example:  32 numbers needed (parameters)  Quite unnatural to assess P(B=y, E=y, A=y, J=y, M=y)  Computing P(B=y | M=y) takes 29 additions In general,  P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution  Knowledge acquisition difficult (complex, unnatural)  Exponential storage and inference

12 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks  Example networks and software Inference in belief networks  Exact inference  Approximate inference Miscellaneous  Mixed networks, influence diagrams, etc.

13 Overcome the problem of exponential size by exploiting conditional independencies  The chain rule of probability: P(X1, X2, …, Xn) = P(X1) P(X2|X1) ⋯ P(Xn|X1, …, Xn-1)  No gains yet. The number of parameters required by the factors is still O(2^n)

14 A random variable X is conditionally independent of a set of random variables Y given a set of random variables Z if  P(X | Y, Z) = P(X | Z) Intuitively:  Y tells us nothing more about X than we know by knowing Z  As far as X is concerned, we can ignore Y if we know Z

15 About P(Xi|X1,…,Xi-1):  Domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that: given pa(Xi), Xi is independent of all variables in {X1,…,Xi-1} \ pa(Xi), i.e. P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi)) Then P(X1, X2, …, Xn) = ∏i P(Xi | pa(Xi)) Joint distribution factorized! The number of parameters might have been substantially reduced

16 pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A} Conditional probability tables (CPT)

17 Model size reduced from 32 to 2+2+4+4+8=20 Model construction easier  Fewer parameters to assess  Parameters more natural to assess, e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc. Inference easier. We will see this later.
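A short sketch verifying this parameter count from the parent sets on slide 16 (binary domains assumed):

```python
# Parent sets from slide 16; all variables are binary.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
k = 2  # domain size

full_joint_entries = k ** len(parents)                                 # 2^5 = 32
factored_entries = sum(k ** (1 + len(pa)) for pa in parents.values())  # 2+2+8+4+4 = 20
print(full_joint_entries, factored_entries)                            # 32 20
```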

18 Probabilistic modeling with joint distributions Conditional Independence and factorization Belief networks  Example networks and software Inference in belief networks  Exact inference  Approximate inference

19 Graphically represent the conditional independency relationships:  Construct a directed graph by drawing an arc from X j to X i iff X j  pa(X i )  Also attach the CPT P(X i | pa(X i )) to node X i BE A JM P(B)P(E) P(A|B,E) P(J|A)P(M|A)

20 A belief network is:  A directed acyclic graph (DAG), where: each node represents a random variable and is associated with the conditional probability of the node given its parents  Represents the joint probability distribution: P(X1, …, Xn) = ∏i P(Xi | pa(Xi))  A variable is conditionally independent of its non-descendants given its parents

21 3 basic independence structures Burglary Alarm JohnCalls 1: chain Burglary Alarm Earthquake 2: common descendants MaryCalls Alarm JohnCalls 3: common ancestors

22 Burglary Alarm JohnCalls 1. JohnCalls is independent of Burglary given Alarm

23 Burglary Alarm Earthquake 2. Burglary is independent of Earthquake not knowing Alarm. Burglary and Earthquake become dependent given Alarm!!

24 MaryCalls Alarm JohnCalls 3. MaryCalls is independent of JohnCalls given Alarm.

25 BN models many conditional independence relations relating distant variables and sets, which are defined in terms of the graphical criterion called d-separation d-separation = conditional independence  Let X, Y and Z be three sets of nodes  If X and Y are d-separated by Z, then X and Y are conditionally independent given Z: P(X|Y, Z) = P(X|Z) d-separation in the graph:  A is d-separated from B given C if every undirected path between them is blocked Path blocking  3 cases that expand on three basic independence structures

26 Path blocking at a node Z on a path between X and Y:  With a "linear" substructure (X → Z → Y) or a "wedge" substructure (X ← Z → Y, common ancestor): the path is blocked if Z is in C  With a "vee" substructure (X → Z ← Y, common descendant): the path is blocked if neither Z nor any of its descendants is in C

27 Example: nodes 1-5 with edges 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5. X = {2} and Y = {3} are d-separated by Z = {1}: path 2 ← 1 → 3 is blocked by 1 ∈ Z; path 2 → 4 ← 3 is blocked because 4 and all its descendants are outside Z. X = {2} and Y = {3} are not d-separated by Z = {1,5}: path 2 ← 1 → 3 is blocked by 1 ∈ Z, but path 2 → 4 ← 3 is activated because 5 (which is a descendant of 4) is in Z; learning the value of consequence 5 renders causes 2 and 3 dependent.

28 Given a probability distribution P on a set of variables {X 1, …, X n }, a belief network B representing P is a minimal I-map (Pearl88)  I-mapness: every d-separation condition displayed in B corresponds to a valid conditional independence relationship in P  Minimal: none of the arrows in B can be deleted without destroying its I-mapness

29 Rewrite the full joint probability using the product rule:
P(B,E,A,J,M) = P(J|B,E,A,M) P(B,E,A,M) = P(J|A) P(B,E,A,M)
= P(J|A) P(M|B,E,A) P(B,E,A) = P(J|A) P(M|A) P(B,E,A)
= P(J|A) P(M|A) P(A|B,E) P(B,E)
= P(J|A) P(M|A) P(A|B,E) P(B) P(E)

30 The "alarm" network: Monitoring Intensive-Care Patients. 37 variables, 509 parameters (instead of 2^37). (Figure: the full network, with variables such as HR, BP, CVP, SAO2, ARTCO2, VENTLUNG, INTUBATION, PULMEMBOLUS, etc.)

31 GeNIe (University of Pittsburgh) - free  http://genie.sis.pitt.edu SamIam (UCLA) - free  http://reasoning.cs.ucla.edu/SamIam/ Hugin - commercial  http://www.hugin.com Netica - commercial  http://www.norsys.com UCI Lab - free but no GUI  http://graphmod.ics.uci.edu/

32

33 Belief networks are used in:  Genetic linkage analysis  Speech recognition  Medical diagnosis  Probabilistic error correcting coding  Monitoring and diagnosis in distributed systems  Troubleshooting (Microsoft)  …

34 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks  Exact inference  Approximate inference

35 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search (tree, graph)

36 Smoking Bronchitis Lung cancer X-ray Dyspnoea P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) ?

37 Belief updating Maximum probable explanation (MPE) Maximum a posteriori hypothesis (MAP)

38 Belief updating by Variable Elimination (network with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C)): P(A|E=0) = α P(A,E=0) = α ∑E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = α P(A) ∑E=0 ∑D ∑C P(C|A) ∑B P(B|A) P(D|A,B) P(E|B,C), where the innermost sum defines λB(A,D,C,E).

39 Moralize ("marry parents") the graph, then place each CPT in the bucket of its latest variable. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A)
Bucket D:
Bucket E: E=0
Bucket A: P(A)

40 ELIMINATION: multiply (*) and sum (∑). bucket(B): { P(E|B,C), P(D|A,B), P(B|A) } → λB(A,C,D,E) = ∑B P(B|A)·P(D|A,B)·P(E|B,C) OBSERVED BUCKET: bucket(B): { P(E|B,C), P(D|A,B), P(B|A), B=1 } → λB(A) = P(B=1|A), λB(A,D) = P(D|A,B=1), λB(E,C) = P(E|B=1,C)

41

42

43 Elimination operator: ∑∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → P(A,E=0)
w* = 4 "induced width" (max clique size)
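A compact sketch of bucket elimination for P(A, E=0) along this ordering. The CPT values are random placeholders (the slides give no numbers), and factors are plain Python dicts rather than any particular library's API:

```python
import itertools, random

def make_cpt(var, parents, rng):
    """Random placeholder CPT P(var | parents) over binary variables."""
    scope = tuple(parents) + (var,)
    table = {}
    for pa in itertools.product([0, 1], repeat=len(parents)):
        p = rng.random()
        table[pa + (0,)] = p
        table[pa + (1,)] = 1 - p
    return scope, table

def multiply(f, g):
    fs, ft = f
    gs, gt = g
    scope = tuple(dict.fromkeys(fs + gs))            # union of scopes
    table = {}
    for asg in itertools.product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, asg))
        table[asg] = ft[tuple(a[v] for v in fs)] * gt[tuple(a[v] for v in gs)]
    return scope, table

def sum_out(f, var):
    fs, ft = f
    scope = tuple(v for v in fs if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(x for v, x in zip(fs, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return scope, table

rng = random.Random(0)
cpts = [make_cpt("A", [], rng), make_cpt("B", ["A"], rng), make_cpt("C", ["A"], rng),
        make_cpt("D", ["A", "B"], rng), make_cpt("E", ["B", "C"], rng)]

order = ["A", "E", "D", "C", "B"]
buckets = {v: [] for v in order}
for f in cpts:                                       # bucket of the latest scope variable
    buckets[max(f[0], key=order.index)].append(f)

evidence = {"E": 0}
for var in reversed(order[1:]):                      # process buckets B, C, D, E
    if not buckets[var]:
        continue
    combined = buckets[var][0]
    for f in buckets[var][1:]:
        combined = multiply(combined, f)
    if var in evidence:                              # observed bucket: instantiate, do not sum
        s, t = combined
        scope = tuple(v for v in s if v != var)
        table = {tuple(dict(zip(s, asg))[v] for v in scope): val
                 for asg, val in t.items() if dict(zip(s, asg))[var] == evidence[var]}
        msg = (scope, table)
    else:
        msg = sum_out(combined, var)
    dest = max(msg[0], key=order.index) if msg[0] else order[0]
    buckets[dest].append(msg)                        # pass the message down

result = buckets["A"][0]
for f in buckets["A"][1:]:
    result = multiply(result, f)
print("P(A, E=0):", {a: round(v, 4) for a, v in result[1].items()})
```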

44 B C D E A A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) Induced width of the ordering w*(d) || max width of the nodes A BC ED

45 w*(d) – induced width of the moral graph along ordering d A BC ED “Moral” graph B C D E A w*(d 1 ) = 4 E D C B A w*(d 2 ) = 2

46 Finding a minimum induced-width ordering is NP-complete. A tree has induced width of 1. Greedy ordering algorithms:  Min-width  Min induced-width  Max-cardinality  Min-fill (thought to be the best)  Anytime min-width (via Branch-and-Bound)
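As an illustration of the min-fill heuristic listed above, a small sketch that repeatedly eliminates the node whose removal adds the fewest fill edges; the example graph is the moral graph of the five-variable network from the earlier slides, and the function returns the elimination order together with its induced width:

```python
def min_fill_ordering(adj):
    """Greedy min-fill: pick the node whose elimination adds the fewest fill
    edges, connect its remaining neighbors, remove it, and repeat.
    `adj` maps node -> set of neighbors; returns (order, induced width)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    order, width = [], 0
    while adj:
        def fill(v):
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill)
        width = max(width, len(adj[v]))          # width of v when it is eliminated
        nbrs = list(adj[v])
        for i in range(len(nbrs)):               # add the fill edges
            for j in range(i + 1, len(nbrs)):
                adj[nbrs[i]].add(nbrs[j]); adj[nbrs[j]].add(nbrs[i])
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
        order.append(v)
    return order, width

# Moral graph of the A,B,C,D,E example (edges AB, AC, AD, BC, BD, BE, CE):
moral = {"A": {"B", "C", "D"}, "B": {"A", "C", "D", "E"},
         "C": {"A", "B", "E"}, "D": {"A", "B"}, "E": {"B", "C"}}
print(min_fill_ordering(moral))                  # finds an ordering of induced width 2
```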

47 Smoking Bronchitis Lung Cancer X-ray Dyspnoea

48 Probabilistic decoding  A stream of bits is transmitted across a noisy channel and the problem is to recover the transmitted stream given the observed output and parity check bits x0x0 x1x1 x2x2 x3x3 x4x4 u0u0 u1u1 u2u2 u3u3 u4u4 y0uy0u y1uy1u y2uy2u y3uy3u y4uy4u y0xy0x y1xy1x y2xy2x y3xy3x y4xy4x Transmitted bits Parity check bits Received bits (observed) Received parity check bits (observed)

49 Medical diagnosis  Given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms Symptom2 Symptom3 Symptom4 Symptom5 Symptom1 Symptom6 Disease1 Disease2Disease4 Disease6 Disease5 Disease3 Disease7

50 Genetic linkage analysis  Given the genotype information of a pedigree, infer the maximum likelihood haplotype configuration (maternal and paternal) of the unobserved individuals 2 1 A B a b A a B b 3 genotyped haplotype S 23m L 21f L 21m L 23m X 21 S 23f L 22f L 22m L 23f X 22 X 23 S 13m L 11f L 11m L 13m X 11 S 13f L 12f L 12m L 13f X 12 X 13 Locus 1 Locus 2 (Fishelson & Geiger, 2002)

51 Most Probable Explanation by Variable Elimination (same network; CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C)): MPE = maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C), where the innermost maximization defines λB(A,D,C,E).

52 Maxing out a variable from a function:
f(A,B,C):
A B C   f(A,B,C)
T T T   0.03
T T F   0.07
T F T   0.54
T F F   0.36
F T T   0.06
F T F   0.14
F F T   0.48
F F F   0.32
max out B →
A C   f(A,C)
T T   0.54
T F   0.36
F T   0.48
F F   0.32
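The max-out operation is a one-line pass over the factor table; a small sketch reproducing the two tables above:

```python
# f(A,B,C) from the table on this slide; keys are (A, B, C) truth values.
f_abc = {(True, True, True): 0.03, (True, True, False): 0.07,
         (True, False, True): 0.54, (True, False, False): 0.36,
         (False, True, True): 0.06, (False, True, False): 0.14,
         (False, False, True): 0.48, (False, False, False): 0.32}

def max_out(table, axis):
    """Eliminate one variable (given by its position in the key) by maximization."""
    out = {}
    for key, val in table.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = max(out.get(reduced, 0.0), val)
    return out

f_ac = max_out(f_abc, axis=1)   # max over B
print(f_ac)  # {(T,T): 0.54, (T,F): 0.36, (F,T): 0.48, (F,F): 0.32}, as in the slide
```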

53 Elimination/combination operators: max, ∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → MPE value
width: 4, 3, 1, 1, 0; w* = 4 "induced width" (max clique size)

54 Recovering the MPE assignment (forward pass over the buckets):
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A)
a' = argmax P(A)·λE(A)
e' = 0
d' = argmax λC(a',D,e')
c' = argmax P(C|a')·λB(a',d',C,e')
b' = argmax P(e'|B,c')·P(d'|a',B)·P(B|a')
Return (a', b', c', d', e')

55 w*(d) – induced width of the moral graph along ordering d A BC ED “Moral” graph B C D E A w*(d 1 ) = 4 E D C B A w*(d 2 ) = 2

56 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search (tree, graph)

57 Motivation  BE computes P(evidence) or P(X|evidence) where X is the last variable in the ordering  What if we need all marginal probabilities P(X i |evidence), where X i  {X 1, X 2, …, X n } ? Run BE n times with X i being the last variable Inefficient! – induced width may vary significantly from one ordering to another SOLUTION: Bucket-Tree Elimination (BTE)

58 Variable elimination can be viewed as message passing (elimination) along a bucket tree. Any node (bucket) can be the root. Complexity: time and space exponential in the induced width. Ordering: A, B, C, D, E.
Bucket E: P(E|B,C)
Bucket D: P(D|A,B)
Bucket C: P(C|A), λE(B,C)
Bucket B: P(B|A), λD(A,B), λC(A,B)
Bucket A: P(A), λB(A)

59 Bucket Tree  A bucket tree has each bucket B i as a node and there is an arc from B i to B j if the function created at B i was placed in B j Graph-based definition  Let G d be the induced graph along d. Each variable X and its earlier neighbors is a node B X. There is an arc from B X to B Y if Y is the closest parent to X.

60 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) Belief network E D C B A Induced graph E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) Bucket tree P(C|A)

61 u XnXn X2X2 X1X1 v h(u,v) … Compute the message: h(x 1,u) h(x n,u) elim(u,v) = vars(u) – vars(v)

62 E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B) A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A)

63 E,B,C : P(E|B,C) A,B,D : P(D|A,B) A,B,C : P(C|A) B,A : P(B|A) A : P(A) E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B)

64 G,F F,B,CD,B,A A,B,CB,A A F B,CA,B A G,F F,B,CD,B,A A,B,C F B,C A,B G,F A,B,C,D,F F A BC FD P(A) P(B|A) P(F|B,C) P(D|A,B) Time-space trade off! G P(C|A) P(G|F)

65 A tree decomposition for a belief network ‹X,D,G,P› is a triple ‹T,χ,ψ›, where T=(V,E) is a tree, and χ and ψ are labeling functions, associating with each vertex v  V two sets χ(v)  V and ψ(v)  P such that:  For each function (CPT) p i  P there is exactly one vertex such that p i  ψ(v) and scope(p i )  χ(v)  For each variable X i  X, the set {v  V | X i  χ(v)} forms a connected sub-tree (running intersection property) A join-tree is a tree decomposition where all clusters are maximal  E.g., a bucket-tree is a tree decomposition but not a join-tree

66 The width (aka treewidth) of a tree decomposition ‹T,χ,ψ› is max |χ(v)|, and its hyperwidth is max |ψ(v)|. Given two adjacent vertices u and v of a tree decomposition, the separator of u and v is defined as sep(u,v) = χ(u) ∩ χ(v)

67 Good join trees using triangulation  Create induced graph G’ along some ordering d  Identify all maximal cliques in G’  Order cliques {C 1, C 2, …, C t } by rank of the highest vertex in each clique  Form the join tree by connecting each C i to a predecessor C j (j < i) sharing the largest number of vertices with C i

68 E D C B A Induced graph A BC ED Moral graph ECB C3C3 DBA C2C2 CBA C1C1 P(A) P(B|A) P(C|A) P(E|B,C)P(D|A,B) BC P(E|B,C) P(D|A,B) P(A), P(B|A), P(C|A) AB Treewidth = 3 Separator size = 2 χ(C 3 ) ψ(C 3 ) separators

69 Join tree for the network with CPTs P(A), P(B|A), P(C|A,B), P(D|B), P(F|C,D), P(E|B,F), P(G|E,F):
Cluster 1: ABC: P(A), P(B|A), P(C|A,B)
Cluster 2: BCDF: P(D|B), P(F|C,D)
Cluster 3: BEF: P(E|B,F)
Cluster 4: EFG: P(G|E,F)
Separators: BC (1-2), BF (2-3), EF (3-4)

70 A B CD F E G ABC BCDF BEF EFG BC BF EF 1 2 3 4 Time: O(exp(w+1)) Space: O(exp(sep))

71 Correctness and completeness  Algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence Time complexity: O(deg × (n+N) × d^(w*+1)) Space complexity: O(N × d^sep) » deg = max degree of a node in T » n = number of variables (= number of CPTs) » N = number of nodes in T » d = maximum domain size » w* = induced width » sep = separator size

72 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  Cycle cutset scheme  VE+C hybrid  AND/OR search (tree, graph)

73 0000 01010101 0101010101010101 0101 E C D B A 01 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=0|A=0,B=0) P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=1|A=0,B=0) … P(A=0)P(B=1|A=0)P(C=1|A=0)P(E=0|B=1,C=1)P(D=1|A=0,B=1) ∑ = P(A=0, E=0)

74 0000 01010101 0101010101010101 0101 E C D B A 01 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) P(A=0, E=0)P(A=1, E=0)

75 IDEA: condition until w* of the remaining graph gets small enough! 0000 0101 E C D B A 01 Search Elimination A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) w* = 1w* loop cutset w* = ww* = 0 searchw-cutsetelimination

76 Condition until we get a polytree (no loops)  subset of conditioning variables = loop-cutset A BC ED BC ED A=0 BC ED A=1 P(B|D=0) = P(B,A=0|D=0) + P(B,A=1|D=0) Loop-cutset method is time exponential in loop-cutset size and linear space!

77 Identify a w-cutset, C w, of the network  Finding smallest loop-cutset/w-cutset is NP-hard For each assignment of the cutset, solve by VE the conditioned subproblem Aggregate the solutions over all cutset assignments Time complexity: exp(|C w | + w) Space complexity: exp(w)
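A schematic sketch of the cutset-conditioning loop just described; brute-force enumeration of the remaining variables stands in for the variable-elimination call on each conditioned subproblem, and the toy factors in the usage example are placeholders:

```python
import itertools

def cutset_conditioning(variables, cutset, factors, evidence):
    """Schematic cutset conditioning: enumerate assignments to the cutset,
    solve each conditioned subproblem (here by brute force, standing in for
    variable elimination), and aggregate the results -> unnormalized P(evidence).
    `factors` are callables taking a full assignment dict; all variables binary."""
    rest = [v for v in variables if v not in cutset and v not in evidence]
    total = 0.0
    for cut_vals in itertools.product([0, 1], repeat=len(cutset)):
        asg = dict(evidence)
        asg.update(zip(cutset, cut_vals))
        sub = 0.0                                   # conditioned subproblem
        for rest_vals in itertools.product([0, 1], repeat=len(rest)):
            asg.update(zip(rest, rest_vals))
            p = 1.0
            for f in factors:
                p *= f(asg)
            sub += p
        total += sub                                # aggregate over cutset assignments
    return total

# toy usage with placeholder factors over a 3-variable loop A-B-C (an assumption):
fA  = lambda a: 0.6 if a["A"] == 0 else 0.4
fAB = lambda a: 0.9 if a["A"] == a["B"] else 0.1
fBC = lambda a: 0.8 if a["B"] == a["C"] else 0.2
fCA = lambda a: 0.7 if a["C"] == a["A"] else 0.3
print(cutset_conditioning(["A", "B", "C"], cutset=["A"],
                          factors=[fA, fAB, fBC, fCA], evidence={"C": 1}))
```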

78

79 Eliminate

80

81

82

83 Condition

84 ...

85 All algorithms generalize to any graphical model  Through general operations of combination and marginalization  General BE, BTE, CTE, VE+C  Applicable to Markov networks, to constraint optimization, to counting number of solutions in SAT/CSP, etc.

86 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  Cycle cutset scheme  AND/OR search (tree, graph)

87 Search: Conditioning Complete Incomplete Gradient Descent Complete Incomplete Tree Clustering Variable Elimination Mini-Clustering(i) Mini-Bucket(i) Stochastic Local Search DFS search Inference: Elimination Time: exp(treewidth) Space:exp(treewidth) Time: exp(n) Space: linear AND/OR search Time: exp( treewidth*log n ) Space: linear Hybrids Space: exp(treewidth) Time: exp(treewidth) Time: exp(pathwidth) Space: exp(pathwidth) Belief Propagation Bucket Elimination

88 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  Cycle cutset  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

89 A D BC E F Ordering: A B E C D F A D BC E F 01010101 0101010101010101 01010101010101010101010101010101 0101 E C F D B A 01

90 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 A D BC E F A D B CE F Moral graphDFS tree A D BC E F A D BC E F

91 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 E 01010101 0 C 101010101010101 F 0101010101010101010101010101010101010101010101010101010101010101 D 01010101010101010101010101010101 0 B 101 A 01 AND/OR OR A D BC E F A D B CE F 1 1 1 0 1 0

92 92 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 E 01010101 0 C 101010101010101 F 0101010101010101010101010101010101010101010101010101010101010101 D 01010101010101010101010101010101 0 B 101 A 01 AND/OR OR A D BC E F A D B CE F AND/OR size: exp(4), OR size exp(6)

93 The AND/OR search tree of R relative to a spanning-tree, T, has:  Alternating levels of: OR nodes (variables) and AND nodes (values) Successor function:  The successors of OR nodes X are all its consistent values along its path  The successors of AND are all X child variables in T A solution is a consistent subtree Task: compute the value of the root node A D BC E F A D B CE F

94 (a) Graph; (b) DFS tree, depth = 3; (c) Pseudo tree, depth = 2; (d) Chain, depth = 6. (Figure: the same 7-node graph arranged in each of the four ways.) (Freuder85, Bayardo & Miranker95)

95 N = number of nodes, P = number of parents. MIN-FILL ordering. 100 instances.

96 Finding a minimum-depth DFS tree or pseudo tree is NP-complete, but: given a tree decomposition whose treewidth is w*, there exists a pseudo tree T of G whose depth m satisfies m <= w* log n (Bayardo & Miranker96, Bodlaender & Gilbert91)

97 FA C BD E E A C B D F (AF) (EF) (A) (AB) (AC) (BC) (AE) (BD) (DE) Bucket-tree based on dd: A B C E D F E A C B D F Induced graph E A C B DF Bucket-tree used as pseudo tree AND/OR search tree Bucket-tree ABE A ABC ABAB BDEBDEAEF bucket-A bucket-E bucket-B bucket-C bucket-Dbucket-F (AE) (BE)

98 Depth-first traversal of the induced graph constructed along some elimination ordering (e.g., min-fill)  Sometimes can get slightly different trees than those obtained from the bucket-tree Recursive decomposition of the dual hypergraph while minimizing the separator size at each step  Functions (CPTs) are vertices in the dual hypergraph, while variables are hyperedges  Separator = set of hyperedges (i.e., variables)

99 Bayesian Networks Repository

100 Theorem: Any AND/OR search tree based on a pseudo tree is sound and complete (expresses all and only solutions) Theorem: Size of the AND/OR search tree is O(n k^m); size of the OR search tree is O(k^n) Theorem: Size of the AND/OR search tree can be bounded by O(exp(w* log n)) Related to: (Freuder85; Dechter90, Bayardo et al. 96, Darwiche01, Bacchus et al. 03) When the pseudo tree is a chain we get an OR space

101 Random graphs with 20 nodes, 20 edges and 2 values per node

102 v(n) is the value of the tree T(n) for the task:  Optimization (MPE): v(n) is the optimal solution in T(n)  Belief updating: v(n) is the probability of evidence in T(n). Goal: compute the value of the root node recursively using DFS search of the AND/OR tree. Theorem: Complexity of AND/OR DFS search is:  Space: O(n)  Time: O(n k^m)  Time: O(exp(w* log n))
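A minimal sketch of this depth-first value computation for the probability of evidence; the pseudo tree, CPT numbers and evidence below are illustrative placeholders, not the example from the slides:

```python
import random, itertools

rng = random.Random(1)
def cpt(var, parents):
    """Random placeholder CPT over binary variables, stored as (scope, table)."""
    t = {}
    for pa in itertools.product([0, 1], repeat=len(parents)):
        p = rng.random(); t[pa + (0,)] = p; t[pa + (1,)] = 1 - p
    return (tuple(parents) + (var,), t)

cpts = [cpt("A", []), cpt("B", ["A"]), cpt("C", ["A"]),
        cpt("E", ["A", "B"]), cpt("D", ["B", "C"])]
pseudo_children = {"A": ["B"], "B": ["E", "C"], "C": ["D"], "E": [], "D": []}
evidence = {"E": 0, "D": 1}

def weight(var, assignment):
    """w(X,x): product of CPTs mentioning X whose scope is now fully assigned."""
    w = 1.0
    for scope, table in cpts:
        if var in scope and all(v in assignment for v in scope):
            w *= table[tuple(assignment[v] for v in scope)]
    return w

def value_or(var, assignment):
    """Value of an OR node: marginalize (sum) over the variable's values."""
    total = 0.0
    values = [evidence[var]] if var in evidence else [0, 1]
    for x in values:
        assignment[var] = x
        v_and = 1.0                           # AND node: product of child OR values
        for child in pseudo_children[var]:
            v_and *= value_or(child, assignment)
        total += weight(var, assignment) * v_and
        del assignment[var]
    return total

print("P(evidence) =", value_or("A", {}))     # value of the root node
```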

103 (Figure: weighted AND/OR search tree for a 5-variable network over A, B, C, D, E.) CPTs and evidence: A: P(A); B: P(B|A); C: P(C|A); E: P(E|A,B), evidence E=0; D: P(D|B,C), evidence D=1. Arc weight w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path.

104 OR node 1 A 2k w(A,1) w(A,2) w(A,k) v(A,1) … AND node 0 X1X1 X2X2 XmXm … v(X 1 )v(X 2 )v(X m ) NOTE: the value of a terminal AND node is 1 the weight of an OR-AND arc for which no CPTs are fully instantiated is 1

105 AND node: combination operator (product). OR node: marginalization operator (summation). Value of a node = updated belief for the sub-problem below. (Figure: the weighted AND/OR tree with node values computed bottom-up.) Result: P(D=1, E=0) = 0.3028·0.6 + 0.1559·0.4 = 0.24408

106 k = domain size m = depth of pseudo-tree n = number of variables w*= treewidth

107 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

108 Any two nodes that root identical sub-trees or sub-graphs can be merged

109

110 A D BC E F GH J K A D B CE F G H J K A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 0101 0101 0101 0101 OR AND 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01

111 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 0101 0101 0101 0101 OR AND 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 A D BC E F GH J K A D B CE F G H J K

112 One way of recognizing nodes that can be merged context(X) = ancestors of X in the pseudo tree that are connected to X, or to descendants of X [ ] [A] [AB] [AE] [BC] [AB] A D B EC F pseudo tree A E C B F D A E C B F D

113 .7.8 0 A B 0 EC 0 D 01 1 D 01 01 1 EC 0 D 01 1 D 01 01 1 B 0 EC 0101 1 EC 0101 A D BC E.7.8.9.5 Evidence: E=0.4.5.7.2.8.2.8.1.9.1.9.4.6.1.9.6.4.9.8.9.5.7.5.8.9.7.5.4.5.7.2.88.54.89.52.352.27.623.104.3028.1559.24408.3028.1559 A D B CE [ ] [A] [AB] [BC] [AB] Context Cache table for D Result: P(D=1,E=0)

114 C 0 K 0 H 0 L 01 NN 0101 FFF 1 1 0101 F G 01 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 G 01 G 01 G 01 M 01 M 01 M 01 M 01 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 K 0 H 0 L 01 NN 0101 1 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 BA C E FG H J D K M L N O P C HK D M F G A B E J O L N P [AB] [AF] [CHAE] [CEJ] [CD] [CHAB] [CHA] [CH] [C] [ ] [CKO] [CKLN] [CKL] [CK] [C] (C K H A B E J L N O D P M F G)

115 Theorem: The maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree. C HK D M F G A B E J O L N P [AB] [AF] [CHAE] [CEJ] [CD] [CHAB] [CHA] [CH] [C] [ ] [CKO] [CKLN] [CKL] [CK] [C] (C K H A B E J L N O D P M F G) BA C E FG H J D K M L N O P max context size = treewidth

116 G E K F L H C B A M J D E K L H C A M J ABC BDEF BDFG EFH FHK HJKLM treewidth = 3 = (max cluster size) - 1 ABC BDEFGEFHFHKJKLM pathwidth = 4 = (max cluster size) - 1 D G B F TREE CHAIN

117 AO(i): searches depth-first, caching on i-contexts  i = the max size of a cache table (i.e. the number of variables in a context) Spectrum: i = 0: Space O(n), Time O(exp(w* log n)); intermediate i: Space O(exp(i)), Time O(exp(m_i + i)); i = w*: Space O(exp(w*)), Time O(exp(w*))

118 k = domain size n = number of variables w*= treewidth pw*= pathwidth w* ≤ pw* ≤ w* log n

119 Recursive Conditioning (RC) (Darwiche01)  Can be viewed as an AND/OR graph search algorithm guided by tree  Guiding tree structure is called “dtree” Value Elimination (VE) (Bacchus et al.03)  Also an AND/OR graph search algorithm using an advanced caching scheme based on components rather than graph-based contexts  Can use dynamic variable orderings

120 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

121 A C BK G L DF H M J E A C B K G L D F H M J E A C BK G L DF H M J E C B K G L D F H M J E 3-cutset A C BK G L DF H M J E C K G L D F H M J E 2-cutset A C BK G L DF H M J E L D F H M J E 1-cutset

122 A C B K G L D F H M J E A C B K G L D F H M J E A C B K G L D F H M J E pseudo tree1-cutset treemoral graph

123 AO(i): searches depth-first, caching on i-contexts  i = the max size of a cache table (i.e. the number of variables in a context) Spectrum: i = 0: Space O(n), Time O(exp(w* log n)); intermediate i: Space O(exp(i)), Time O(exp(m_i + i)); i = w*: Space O(exp(w*)), Time O(exp(w*))

124 Definition:  T_w is a w-cutset tree relative to backbone pseudo tree T, iff T_w roots T and when removed, yields treewidth w. Theorem:  AO(i) time complexity for backbone T is time O(exp(i+m_i)) and space O(i), m_i is the depth of the T_i tree. Better than w-cutset: O(exp(i+c_i)) where c_i is the number of nodes in T_i

125 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search for Most Probable Explanations

126 Solved by BE in time and space exponential in treewidth w* Solved by Conditioning in linear space and time exponential in the number of variables n It can be solved by AND/OR search:  Tree search: space O(n), time O(exp(w* log n))  Graph search: time and space O(exp(w*))

127 (Figure: the weighted AND/OR search tree from earlier, repeated.) CPTs and evidence: A: P(A); B: P(B|A); C: P(C|A); E: P(E|A,B), evidence E=0; D: P(D|B,C), evidence D=1. Arc weight w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path.

128 OR node 1 A 2k w(A,1) w(A,2) w(A,k) v(A,1) … AND node 0 X1X1 X2X2 XmXm … v(X 1 )v(X 2 )v(X m ) NOTE: the value of a terminal AND node is 1 the weight of an OR-AND arc for which no CPTs are fully instantiated is 1

129 AND node: combination operator (product). OR node: marginalization operator (maximization). Value of a node = MPE value for the sub-problem below. (Figure: the weighted AND/OR tree with node values computed bottom-up by max/product.) Result: MPE(D=1, E=0) = max(0.12·0.6, 0.081·0.4) = 0.072

130 Branch and Bound in an OR search tree (Lawler & Wood66): g(n) = cost of the search path to n; h(n) estimates the optimal cost below n; upper bound UB(n) = g(n) · h(n); prune below n if UB(n) ≤ LB, the current lower bound (best solution found so far).

131 0 D 0 (A=0, B=0, C=0, D=0) 0 A BC 0 0 A BC 00 D 1 (A=0, B=0, C=0, D=1) 0 A BC 01 D 0 (A=0, B=1, C=0, D=0) 0 A BC 01 D 1 (A=0, B=1, C=0, D=1) A BC D Pseudo tree Extension(T’) – solution trees that extend T’

132 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 v(D,0) 3 350 0 9 tip nodes F 1 3 35 0 F v(F) A B C DE F A B CD E F f*(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * v(D,0) * v(F)

133 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 h(D,0) = 4 3 350 0 9 tip nodes F 1 3 35 0 F h(F) = 5 A B C DE F A B CD E F f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’) h(n) ≥ v(n)

134 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 1 0 D E E 0101 01 C 1 0 B 01 f(T’) ≤ LB LB (Marinescu and Dechter, 05)

135 Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down)  Evaluate f(T') of the current partial solution sub-tree T', and prune the search if f(T') ≤ LB  Expand the tip node n by generating its successors PROPAGATE (bottom-up)  Update the value of the parent p of n OR nodes: maximization AND nodes: product
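A small sketch of this depth-first Branch-and-Bound loop, written over a plain OR search tree for simplicity (the AND/OR version adds problem decomposition but prunes with the same f(T') ≤ LB test); the factor weights and the trivial heuristic in the usage example are placeholders:

```python
def dfs_branch_and_bound(variables, factors, heuristic):
    """Schematic depth-first Branch and Bound for MPE over an OR search tree.
    `factors[X]` is a function(assignment)->weight applied when X is assigned;
    `heuristic(i, assignment)` must upper-bound the best achievable product
    for variables[i:]. All variables are binary."""
    best = {"lb": 0.0, "asg": None}

    def expand(i, g, assignment):
        if i == len(variables):                    # full assignment reached
            if g > best["lb"]:
                best["lb"], best["asg"] = g, dict(assignment)
            return
        var = variables[i]
        for value in (0, 1):
            assignment[var] = value
            g_new = g * factors[var](assignment)   # extend the path cost g(n)
            # prune if the optimistic bound g(n)*h(n) cannot beat the best so far
            if g_new * heuristic(i + 1, assignment) > best["lb"]:
                expand(i + 1, g_new, assignment)
            del assignment[var]

    expand(0, 1.0, {})
    return best["lb"], best["asg"]

# toy usage with made-up weights and the trivial (admissible) heuristic h = 1.0:
fs = {"A": lambda a: 0.6 if a["A"] else 0.4,
      "B": lambda a: 0.9 if a["B"] == a["A"] else 0.1}
print(dfs_branch_and_bound(["A", "B"], fs, lambda i, a: 1.0))   # (0.54, {'A': 1, 'B': 1})
```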

136 The principle of relaxed models  Mini-Bucket Elimination for belief networks (Pearl86)

137 Min-fill pseudo tree. Time limit 1 hour. (Sang et al.05)

138 (Fishelson&Geiger02) Min-fill pseudo tree. Time limit 3 hours.

139 Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down)  Evaluate f(T’) of the current partial solution sub-tree T’, and prune search if f(T’) ≤ LB  If not in cache, expand the tip node n by generating its successors PROPAGATE (bottom-up)  Update value of the parent p of n OR nodes: maximization AND nodes: multiplication  Cache value of n, based on context

140 Best-first search expands first the node with the best heuristic evaluation function among all nodes encountered so far It never expands nodes whose cost is beyond the optimal one, unlike depth-first search algorithms (Dechter & Pearl85) Superior among memory intensive algorithms employing the same heuristic function

141 Maintains the set of best partial solution trees EXPAND (top-down)  Traces down marked connectors from root (best partial solution tree)  Expands a tip node n by generating its successors n’  Associate each successor with heuristic estimate h(n’) Initialize v(n’) = h(n’) REVISE (bottom-up)  Updates node values v(n) OR nodes: maximization AND nodes: multiplication  Marks the most promising solution tree from the root  Label the nodes as SOLVED: OR is SOLVED if marked child is SOLVED AND is SOLVED if all children are SOLVED Terminate when root node is SOLVED [specializes Nilsson’s AO* to graphical models (Nilsson80)] (Marinescu & Dechter, 07)

142 Min-fill pseudo tree. Time limit 1 hour.

143 Solved by BE in time and space exponential in constrained induced width w* Solved by AND/OR search:  Tree search: space O(n), time O(exp(w* log n))  Graph search: time and space O(exp(w*))

144 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) A BC ED Moralize (marry parents) Variables A and B are the hypothesis variables, variable E is evidence

145 Bucket elimination for MAP (SUM buckets for E, D, C; MAX buckets for B, A):
Bucket E: P(E|B,C), E = 0
Bucket D: P(D|A,B)
Bucket C: P(C|A), λE(B,C)
Bucket B: P(B|A), λC(A,B), λD(A,B)
Bucket A: P(A), λB(A) → MAP value

146 Elimination order is important: SUM variables are eliminated first, followed by the MAX variables  ordering: A, B, C, D, E is legal  ordering: A, C, D, E, B is illegal Induced width corresponding to a legal elimination order is called constrained induced width cw*  Typically it may be far larger than the unconstrained induced width, ie cw* ≥ w* When interleaving MAX and SUM (using unconstrained orderings) the result is an Upper Bound on the MAP value  Can be used as a guiding heuristic function for search

147 AND node: combination operator (product). OR node: MAX for hypothesis variables, SUM otherwise. (Figure: the weighted AND/OR tree with node values computed bottom-up.) Result: MAP(D=1, E=0) = max(0.162·0.6, 0.0936·0.4) = 0.0972

148 Pseudo tree must be consistent with the constrained elimination order Graph search via context-based caching Time and space complexity  Tree search: Space linear, time O(exp(cw*log n))  Graph search: Time and space O(exp(cw*))

149 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks  Exact inference  Approximate inference

150 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling

151 Search: Conditioning Complete Incomplete Gradient Descent Complete Incomplete Tree Clustering Variable Elimination Mini-Clustering(i) Mini-Bucket(i) Stochastic Local Search DFS search Inference: Elimination Time: exp(treewidth) Space:exp(treewidth) Time: exp(n) Space: linear AND/OR search Time: exp( treewidth*log n ) Space: linear Hybrids Space: exp(treewidth) Time: exp(treewidth) Time: exp(pathwidth) Space: exp(pathwidth) Belief Propagation Bucket Elimination

152 Given a belief network and some evidence, MPE = ? MPE = maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C), where the innermost maximization defines λB(A,D,C,E) (Variable Elimination).

153 Elimination operator: max∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → MPE
width: 4, 3, 1, 1, 0; w* = 4 "induced width" (max clique size)

154 Computation in a bucket is time and space exponential in the number of variables involved (i.e., width) Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables The idea is similar to i-consistency: bound the size of recorded dependencies (Dechter 2003)

155 Split a bucket into mini-buckets => bound complexity. For maximization, splitting the bucket of X into mini-buckets Q1 and Q2 gives maxX ∏(Q1 ∪ Q2) ≤ (maxX ∏ Q1) · (maxX ∏ Q2), so processing the mini-buckets separately yields an upper bound.

156 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C) P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) Upper Bound on MPE value λE(A)λE(A) λ B (A,D) λ D (A,E) 4 variables: split 3 variables: OK 2 variables: OK 1 variable: OK Mini-buckets max∏

157 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C), P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) λE(A)λE(A) λ B (A,D) λ D (A,E) a’ = argmax P(A) ∙ λ E (A) e’ = 0 d’ = argmax λ C (a’,D,e’) ∙ ∙ λ C (a’,D) c’ = argmax P(C|a’) ∙ ∙ λ C (C,e’) b’ = argmax P(e’|B,c’) ∙ ∙ P(d’|a’,B) ∙ P(B|a’) Return (a’, b’, c’, d’, e’) A Lower Bound can also be computed as the probability of the sub-optimal assignment P(a’, b’, c’, d’, e’)

158 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C) P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) Upper Bound on P(evidence) λE(A)λE(A) λ B (A,D) λ D (A,E) 4 variables: split 3 variables: OK 2 variables: OK 1 variable: OK Mini-buckets ∑∏

159 If we process all mini-buckets by summation then we get an unnecessarily large upper bound on the probability of evidence Tighter upper bound  Process first mini-bucket by summation and remaining ones by maximization We can also get a lower bound on P(evidence)  Process first mini-bucket by summation and remaining ones by minimization

160 Controlling parameter i (called i-bound)  Maximum number of distinct variables in a mini-bucket  Outputs both a lower and an upper bound Complexity: O(exp(i)) time and space As i-bound increases, both accuracy and time complexity increase  Clearly, if i = w*, then we have pure BE Possible use of mini-bucket approximations  As anytime algorithms (Dechter & Rish, 1997)  As heuristic functions for depth-first and best-first search (Kask & Dechter, 2001), (Marinescu & Dechter, 2005)
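The i-bound controls how a bucket gets split. A small sketch of one greedy mini-bucket partitioning step (placing the largest scopes first into the first mini-bucket that still fits is one common rule, not necessarily the exact one used in the cited papers):

```python
def partition_into_minibuckets(scopes, i_bound):
    """Greedy mini-bucket partitioning: place each function (represented by its
    scope, a frozenset of variable names) into the first mini-bucket whose
    combined scope stays within i_bound variables; otherwise open a new one.
    Processing mini-buckets separately yields a bound instead of the exact message."""
    minibuckets = []                      # list of [combined_scope, [member scopes]]
    for scope in sorted(scopes, key=len, reverse=True):
        for mb in minibuckets:
            if len(mb[0] | scope) <= i_bound:
                mb[0] |= scope
                mb[1].append(scope)
                break
        else:
            minibuckets.append([set(scope), [scope]])
    return minibuckets

# bucket(B) = { P(E|B,C), P(D|A,B), P(B|A) } with i-bound 3 splits into
# { P(E|B,C) } and { P(D|A,B), P(B|A) }, as on the earlier mini-bucket slide.
bucket_B = [frozenset("BCE"), frozenset("ABD"), frozenset("AB")]
for combined, members in partition_into_minibuckets(bucket_B, 3):
    print(sorted(combined), [sorted(m) for m in members])
```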

161 Static Mini-Buckets  Pre-compiled  Reduced overhead  Less accurate  Static variable ordering Dynamic Mini-Buckets  Computed dynamically  Higher overhead  High accuracy  Dynamic variable ordering

162 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 h(D,0) = 4 3 350 0 9 tip nodes F 1 3 35 0 F h(F) = 5 A B C DE F A B CD E F f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’) h(n) ≥ v(n)

163 A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(A,D) f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (A,B,C) h C (A,B) AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h*(a, b, c) = h D (a, b, c) * h E (b, c) (Dechter99)

164 A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (B,C) h C (B) h D (A) f(A,D) D mini-buckets AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (a) * h D (b, c) * h E (b, c) ≥ h*(a, b, c) MBE(3)

165 A f(a,b) B f(b,C) C f(b,F) F f(a,G) f(F,G) G f(b,E) f(C,E) E f(a,D) f(b,D) f(C,D) D h G (F) h F () h B () h E (C)h D (C) h C () AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (c) * h E (c) = h*(a, b, c) MBE(3)

166 s1196 ISCAS’89 circuit.

167 Mini-Bucket Elimination  Mini-clustering (tree decompositions) Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling  Particle filtering

168 Correctness and completeness:  Algorithm CTE is correct, i.e. it computes the exact posterior joint probability of all single variables (or subsets) and the evidence. Time complexity: O(deg × (n+N) × d^(w*+1)) Space complexity: O(N × d^sep) where deg = the maximum degree of a node, n = number of variables (= number of CPTs), N = number of nodes in the tree decomposition, d = the maximum domain size of a variable, w* = the induced width, sep = the separator size

169 A B C p(a), p(b|a), p(c|a,b) B C D F p(d|b), p(f|c,d) h (1,2) (b,c) B E F p(e|b,f), h (2,3) (b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF sep(2,3)={B,F} elim(2,3)={C,D} G E F C D B A

170 Motivation:  Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem  When the induced width w* is big, CTE algorithm becomes infeasible The basic idea:  Try to reduce the size of the cluster (the exponent); partition each cluster into mini-clusters with less variables  Accuracy parameter i = maximum number of variables in a mini-cluster  The idea was explored for variable elimination (MBE)

171 Split a cluster into mini-clusters => bound complexity

172 A B C p(a), p(b|a), p(c|a,b) B E F p(e|b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF Cluster Tree Elimination Mini-Clustering, i=3 G E F C D B A B C D F p(d|b), p(f|c,d) 2 B C D F p(d|b), h (1,2) (b,c), p(f|c,d) sep(2,3)= {B,F} elim(2,3) = {C,D} C D F B C D C D F p(f|c,d) p(d|b), h (1,2) (b,c) p(f|c,d)

173 EF BF BC ABC 2 4 1 3 BEF EFG BCDF

174 Correctness and completeness:  Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X i,e) of each variable and each of its values. Time & space complexity: O(exp(i))

175 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling  Particle filtering

176 Belief propagation is exact for poly-trees (Pearl, 1988) IBP - applying BP iteratively to cyclic networks No guarantees for convergence Works well for many coding networks

177 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI The graph IBP works on (dual graph) A D I B E J F G C H Belief network P(A) P(B|A,C) P(C) P(D|A,B,E)P(E|B,C) P(F|C,D,E) P(G|H,F) P(H) P(I|F,G)P(J|H,G,I)

178 IBP is applied to a loopy network iteratively  not an anytime algorithm  when it converges, it converges very fast MC applies bounded inference along a tree decomposition  MC is an anytime algorithm controlled by i-bound  MC converges in two passes up and down the tree IJGP combines:  the iterative feature of IBP  the anytime feature of MC

179  Apply Cluster Tree Elimination to any join-graph  We commit to graphs that are minimal I-maps  Avoid cycles as long as I-mapness is not violated  Result: use minimal arc-labeled join-graphs

180 A D I B E J F G C H A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI Belief networkThe graph IBP works on (dual graph)

181 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABABBCBC BEBE C C DEDECECE F H F FGFGGHGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABABBCBC C DEDECECE H F FGFGGHGH GI

182 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGFGGHGH GIGI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGHGH GIGI

183 a) Minimal arc-labeled join graphb) Join-graph obtained by collapsing nodes of graph a) c) Minimal arc-labeled join graph A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC CDECECE F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC DECECE F FGH GI

184 ABCDE FGHIGHIJ CDEF CDE F GHI a) Minimal arc-labeled join graphb) Tree decomposition ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI

185 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI ABCDE FGHIGHIJ CDEF CDE F GHI more accuracy less complexity

186 ABCDE FGI BCE GHIJ CDEF FGH BCBC CDE CECE F FGH GI ABCDE p(a), p(c), p(b|ac), p(d|abe),p(e|b,c) h(3,1)(bc) BCD CDEF BCBC CDE CECE 13 2 h (3,1) (bc) h (1,2) Minimal arc-labeled: sep(1,2)={D,E} elim(1,2)={A,B,C} Non-minimal arc-labeled: sep(1,2)={C,D,E} elim(1,2)={A,B}

187 We want arc-labeled decompositions such that:  the cluster size (internal width) is bounded by i (the accuracy parameter)  the width of the decomposition as a graph (external width) is as small as possible – closer to a tree Possible approaches to build decompositions:  partition-based algorithms - inspired by the mini-bucket decomposition  grouping-based algorithms

188 G E F C D B A a) schematic mini-bucket(i), i=3 b) minimal arc-labeled join-graph decomposition CDB CAB BA A CB P(D|B) P(C|A,B) P(A) BA P(B|A) FCD P(F|C,D) GFE EBF BF EF P(E|B,F) P(G|F,E) B CD BF A F G: (GFE) E: (EBF) (EF) F: (FCD) (BF) D: (DB) (CD) C: (CAB) (CB) B: (BA) (AB) (B) A: (A)

189 IJGP(i) applies BP to min arc-labeled join-graph, whose cluster size is bounded by i On join-trees IJGP finds exact beliefs! IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman and Weiss, 2001) Complexity of one iteration:  time: O(deg(n+N) d i+1 )  space: O(Nd  )

190 evidence=0 evidence=5

191 evidence=0evidence=5

192

193 IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC Empirical evaluation showed the potential of IJGP, which improves with iteration and most of the time with i-bound, and scales up to large networks IJGP is almost always superior, often by a high margin, to IBP and MC Based on all our experiments, we think that IJGP provides a practical breakthrough to the task of belief updating #CSP: can use IJGP to generate solution counts estimates for depth-first Branch-and-Bound search

194 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling

195 Structural Approximations  Eliminate some dependencies Remove edges Mini-Bucket and Mini-Clustering approaches Local Search  Approach for optimization tasks: MPE, MAP Favorite MAX-CSP/WCSP/WSAT local search solver! Sampling  Generate random samples and compute values of interest from samples, not original network

196 Input: Bayesian network with set of nodes X Sample = a tuple with assigned values s=(X 1 =x 1,X 2 =x 2,…,X k =x k ) Tuple may include all variables (except evidence) or a subset Sampling schemas dictate how to generate samples (tuples) Ideally, samples are distributed according to P(X|E)

197 Given a set of variables X = {X 1, X 2, … X n } that represent joint probability distribution  (X) and some function g(X), we can compute expected value of g(X) :

198 Given independent, identically distributed samples (iid) S 1, S 2, …S T from  (X), it follows from Strong Law of Large Numbers: A sample S t is an instantiation:

199 Given random variable X, D(X)={0, 1} Given P(X) = {0.3, 0.7} Generate k=10 samples: 0,1,1,1,0,1,1,0,1,0 Approximate P'(X): P'(X=0) = 4/10 = 0.4, P'(X=1) = 6/10 = 0.6

200 Given random variable X, D(X)={0, 1} Given P(X) = {0.3, 0.7} Sample X  P (X)  draw random number r  [0, 1]  If (r < 0.3) then set X=0  Else set X=1 Can generalize for any domain size

201 Same idea: generate a set of samples T Estimate posterior marginal P(X i |E) from samples Challenge: X is a vector and P(X) is a huge distribution represented by BN Need to know:  How to generate a new sample ?  How many samples T do we need ?  How to estimate P(E=e) and P(X i |e) ?

202 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

203 Forward Sampling  Case with No evidence E={}  Case with Evidence E=e

204 Input: Bayesian network X= {X 1,…,X N }, N- #nodes, T - # samples Output: T samples Process nodes in topological order – first process the ancestors of a node, then the node itself: 1.For t = 1 to T 2. For i = 1 to N 3. X i  sample x i t from P(x i | pa i )

205 What does it mean to sample x i t from P(X i | pa i ) ? Assume D(X i )={0,1} Assume P(X i | pa i ) = (0.3, 0.7) Draw a random number r from [0,1] If r falls in [0,0.3], set X i = 0 If r falls in [0.3,1], set X i = 1 010.3 r

206 X1X1 X4X4 X2X2 X3X3

207 Task: given T samples {S1, S2, …, ST}, estimate P(Xi = xi): P'(Xi = xi) = (number of samples in which Xi = xi) / T. Basically, count the proportion of samples where Xi = xi
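A minimal sketch of forward sampling and the counting estimator above, on a toy two-node chain with placeholder CPT numbers:

```python
import random

def forward_sample(nodes, cpts, rng):
    """Draw one sample in topological order: x_i ~ P(X_i | pa_i).
    `nodes` is a topological order; cpts[X] = (parents, table) where
    table[parent_values] = P(X=1 | parents). All variables are binary."""
    sample = {}
    for x in nodes:
        parents, table = cpts[x]
        p1 = table[tuple(sample[p] for p in parents)]
        sample[x] = 1 if rng.random() < p1 else 0
    return sample

# toy two-node chain A -> B with placeholder numbers (an assumption):
cpts = {"A": ([], {(): 0.3}),
        "B": (["A"], {(0,): 0.8, (1,): 0.1})}
rng = random.Random(0)
samples = [forward_sample(["A", "B"], cpts, rng) for _ in range(10000)]

# estimate P(B=1) as the proportion of samples with B=1 (the counter above)
print(sum(s["B"] == 1 for s in samples) / len(samples))   # ≈ 0.8*0.7 + 0.1*0.3 = 0.59
```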

208 Input: Bayesian network X= {X 1,…,X N }, N- #nodes E – evidence, T - # samples Output: T samples consistent with E 1.For t=1 to T 2. For i=1 to N 3. X i  sample x i t from P(x i | pa i ) 4. If X i in E and X i  x i, reject sample: 5. i = 1 and go to step 2

209 X1X1 X4X4 X2X2 X3X3

210 Let Y be a subset of evidence nodes s.t. Y=u

211 Theorem: Let  s (y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most  with probability at least 1-  it is enough to have: Derived from Chebychev’s Bound.

212 Advantages: P(x i | pa(x i )) is readily available Samples are independent ! Drawbacks: If evidence E is rare (P(e) is low), then we will reject most of the samples! Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves! If P(e) is small, T will become very big!

213 Forward Sampling  High Rejection Rate Fix evidence values  Gibbs sampling (MCMC)  Likelihood Weighting  Importance Sampling

214 Forward Sampling  High rejection rate  Samples are independent Fix evidence values  Gibbs sampling (MCMC)  Likelihood Weighting  Importance Sampling

215 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

216 Markov Chain Monte Carlo method (Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994) Samples are dependent, form Markov Chain Sample from P’(X|e) which converges to P(X|e) Guaranteed to converge when all P > 0 Methods to improve convergence:  Blocking  Rao-Blackwellised

217 A sample t  [1,2,…], is an instantiation of all variables in the network: Sampling process  Fix values of observed variables e  Instantiate node values in sample x 0 at random  Generate samples x 1,x 2,…x T from P(X|e)  Compute posteriors from samples

218 Generate sample x t+1 from x t : In short, for i=1 to N: Process All variables In Some Order

219 Markov blanket: markov(Xi) = the parents of Xi, the children of Xi, and the children's other parents. Sampling Xi only requires its Markov blanket: P(Xi | x_{-i}) = P(Xi | markov_i) ∝ P(Xi | pa_i) · ∏ over children Xj of P(xj | pa_j)

220 Input: X, E Output: T samples {x t } Fix evidence E Generate samples from P(X | E) 1.For t = 1 to T (compute samples) 2. For i = 1 to N (loop through variables) 3. X i  sample x i t from P(X i | markov t \ X i )

221 Query: P(xi|e) = ? Method 1: count the number of samples where Xi = xi: P'(xi|e) = (1/T) ∑t 1{xi^t = xi} Method 2: average probability (mixture estimator): P'(xi|e) = (1/T) ∑t P(xi | markov_i^t)
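A small Gibbs-sampling sketch on a three-node chain with the evidence clamped, using the Markov-blanket conditionals from slide 219 and the counting estimator (Method 1); the CPT numbers are placeholders:

```python
import random

# Chain A -> B -> C with evidence C = 1; placeholder CPTs, binary variables.
P_A = {(): 0.3}                              # P(A=1)
P_B = {(0,): 0.8, (1,): 0.2}                 # P(B=1 | A)
P_C = {(0,): 0.1, (1,): 0.9}                 # P(C=1 | B)

def bernoulli(table, parents, value):
    p1 = table[parents]
    return p1 if value == 1 else 1 - p1

def gibbs(T, rng):
    state = {"A": 0, "B": 0, "C": 1}         # evidence C=1 stays clamped
    counts = {"A": 0, "B": 0}
    for _ in range(T):                        # no burn-in in this sketch
        # P(A | markov(A)) ∝ P(A) * P(B | A): parent term times child term
        w = [bernoulli(P_A, (), a) * bernoulli(P_B, (a,), state["B"]) for a in (0, 1)]
        state["A"] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        # P(B | markov(B)) ∝ P(B | A) * P(C | B)
        w = [bernoulli(P_B, (state["A"],), b) * bernoulli(P_C, (b,), 1) for b in (0, 1)]
        state["B"] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        counts["A"] += state["A"]; counts["B"] += state["B"]
    return {v: c / T for v, c in counts.items()}

print(gibbs(50000, random.Random(0)))        # estimates of P(A=1|C=1), P(B=1|C=1)
```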

222 X = {X 1,X 2,…,X 9 } E = {X 9 } X1 X4 X8X5 X2 X3 X9 X7 X6

223 X 1 = x 1 0 X 6 = x 6 0 X 2 = x 2 0 X 7 = x 7 0 X 3 = x 3 0 X 8 = x 8 0 X 4 = x 4 0 X 5 = x 5 0 X1 X4 X8X5 X2 X3 X9 X7 X6

224 X 1  P (X 1 |X 0 2,…,X 0 8,X 9 ) E = {X 9 } P (X 1 =0 |X 0 2,X 0 3,X 9 } = αP(X 1 =0)P(X 0 2 |X 1 =0)P(X 3 0 |X 1 =0) P (X 1 =1 |X 0 2,X 0 3,X 9 } = αP(X 1 =1)P(X 0 2 |X 1 =1)P(X 3 0 |X 1 =1) X1 X4 X8X5 X2 X3 X9 X7 X6

225 X 2  P(X 2 |X 1 1,…,X 0 8,X 9 } E = {X 9 } Markov blanket for X 2 is: {X 2, X 1, X 4, X 5, X 3 } X1 X4 X8X5 X2 X3 X9 X7 X6

226

227 We want to sample from P(X | E) But … starting point is random Solution: throw away first K samples Known As “Burn-In” What is K ? Hard to tell. Use intuition. Alternatives: sample first sample values from approximate P(x|e)  For example, run IBP first

228 Converges to the stationary distribution π*: π* = π* P, where P is the transition kernel with p_ij = P(Xi → Xj) Guaranteed to converge iff the chain is:  irreducible  aperiodic  ergodic (∀ i,j: p_ij > 0)

229 Advantage :  guaranteed to converge to P(X|E), as long as P i > 0 Disadvantage :  convergence may be slow Problems:  Samples are dependent !  Statistical variance is too big in high-dimensional problems

230 Objectives: 1.Reduce dependence between samples (autocorrelation)  Skip samples  Randomize Variable Sampling Order 2.Reduce variance  Blocking Gibbs Sampling  Rao-Blackwellisation

231 Pick only every k-th sample (Geyer, 1992)  Can reduce dependence between samples!  Increases variance!  Wastes samples!

232 Random Scan Gibbs Sampler  Pick each next variable X i for update at random with probability p i,  i p i = 1. In the simplest case, p i are distributed uniformly.  In some instances, reduces variance (MacEachern, Peruggia, 1999)

233 Sample several variables together, as a block Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (x t,y t,z t ), compute next sample: X t+1  P(y t,z t )=P(w t ) (y t+1,z t+1 )=W t+1  P(x t+1 ) + Can improve convergence greatly when two variables are strongly correlated! - Domain of the block variable grows exponentially with the #variables in a block!

234 Do not sample all variables! Sample a subset! Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (x t,y t ), compute next sample: x t+1  P(y t ) y t+1  P(x t+1 )

235 Bottom line: reducing number of variables in a sample reduce variance!

236 Standard Gibbs: P(x|y,z),P(y|x,z),P(z|x,y)(1) Blocking: P(x|y,z), P(y,z|x)(2) Rao-Blackwellised: P(x|y), P(y|x)(3) Var3 < Var2 < Var1 ( Liu, Wong, Kong, 1994 ) XY Z

237 Select C  X (possibly cycle-cutset), |C| = m Fix evidence E Initialize nodes with random values: For i=1 to m: c i to C i = c 0 i For t=1 to n, generate samples: For i=1 to m: C i =c i t+1  P(c i |c 1 t+1,…,c i-1 t+1,c i+1 t,…,c m t,e)

238 Generate sample c t+1 from c t :

239 How to choose C ?  Special case: C is cycle-cutset, O(N)  General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.  Pick C wisely so as to minimize w  notion of w- cutset

240 C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w Complexity of exact inference:  bounded by w ! Cycle-cutset is a special case!

241 Query:  c i  C, P(c i |e)=? same as Gibbs: Special case of w-cutset Query: P(x i |e) = ? computed while generating sample t compute after generating sample t (easy because C is a cut-set)

242 X1 X7 X5 X4 X2 X9 X8 X3 E=x 9 X6

243 X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 2 :

244 X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 5 :

245 X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 2 |e) for sampling node X 2 : Sample 1 Sample 2 Sample 3

246 X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 3 |e) for non-sampled node X 3 :

247 MSE vs. #samples (left) and time (right) Non-Ergodic (1 deterministic CPT entry) |X| = 179, |C| = 8, 2<= D(X i )<=4, |E| = 35 Exact Time = 122 sec using Loop-Cutset Conditioning

248 MSE vs. #samples (left) and time (right) Ergodic, |X| = 360, D(X i )=2, |C| = 21, |E| = 36 Exact Time > 60 min using Cutset Conditioning Exact Values obtained via Bucket Elimination

249 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

250 “Clamping” evidence + Forward sampling + Weighting samples by evidence likelihood Works well for likely evidence!

251 eeeee Sample in topological order over X ! eeee x i  P(X i |pa i ) P(X i |pa i ) is a look-up in CPT!

252

253 Estimate posterior marginals P(Xi | e) from the weighted samples: P'(xi | e) = ∑t w^(t) · 1{Xi^(t) = xi} / ∑t w^(t), where w^(t) is the likelihood weight of sample t

254 Converges to exact posterior marginals Generates samples fast Sampling distribution is close to prior (especially if E  Leaf Nodes) Increasing sampling variance  Convergence may be slow  Many samples with P(x (t) )=0 rejected
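A minimal likelihood-weighting sketch: evidence is clamped, non-evidence variables are forward-sampled, and each sample is weighted by the likelihood of the clamped values (toy chain and placeholder numbers, as in the earlier sampling sketch):

```python
import random

def likelihood_weighting(nodes, cpts, evidence, T, rng):
    """Likelihood weighting sketch: clamp evidence, forward-sample the rest,
    weight each sample by the likelihood of the clamped evidence values.
    cpts[X] = (parents, table) with table[parent_values] = P(X=1 | parents)."""
    weighted = []
    for _ in range(T):
        sample, w = {}, 1.0
        for x in nodes:                                   # topological order
            parents, table = cpts[x]
            p1 = table[tuple(sample[p] for p in parents)]
            if x in evidence:
                sample[x] = evidence[x]                   # clamp, do not sample
                w *= p1 if evidence[x] == 1 else 1 - p1   # multiply in its likelihood
            else:
                sample[x] = 1 if rng.random() < p1 else 0
        weighted.append((sample, w))
    return weighted

# posterior marginal P(A=1 | C=1) on a toy chain A -> B -> C (placeholder CPTs)
cpts = {"A": ([], {(): 0.3}), "B": (["A"], {(0,): 0.8, (1,): 0.2}),
        "C": (["B"], {(0,): 0.1, (1,): 0.9})}
ws = likelihood_weighting(["A", "B", "C"], cpts, {"C": 1}, 50000, random.Random(0))
num = sum(w for s, w in ws if s["A"] == 1)
den = sum(w for s, w in ws)
print(num / den)
```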

255 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

256 In general, it is hard to sample from target distribution P(X|E) Generate samples from sampling (proposal) distribution Q(X) Weigh each sample against P(X|E)

257

258 Given a distribution called the proposal distribution Q (such that P(Z=z,e)>0 => Q(Z=z)>0) w(Z=z) is called importance weight

259 Underlying principle, Approximate Average over a set of numbers by an average over a set of sampled numbers

260 Express the problem as computing the average over a set of real numbers Sample a subset of real numbers Approximate the true average by sample average.  True Average: Average of (0.11, 0.24, 0.55, 0.77, 0.88,0.99)=0.59  Sample Average over 2 samples: Average of (0.24, 0.77) = 0.505

261 Express Q in product form:  Q(Z)=Q(Z 1 )Q(Z 2 |Z 1 )….Q(Z n |Z 1,..Z n-1 ) Sample along the order Z 1,..Z n Example:  Q(Z 1 )=(0.2,0.8)  Q(Z 2 |Z 1 )=(0.2,0.8,0.1,0.9)  Q(Z 3 |Z 1,Z 2 )=Q(Z 3 |Z 1 )=(0.5,0.5,0.3,0.7)

262 Each Sample Z=z  Sample Z 1 =z 1 from Q(Z 1 )  Sample Z 2 =z 2 from Q(Z 2 |Z 1 =z1)  Sample Z 3 =z 3 from Q(Z 3 |Z1=z1) Generate N such samples
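A generic importance-sampling sketch of the estimator described above: sample z from the proposal Q and average the weights w(z) = P(z,e)/Q(z); the numbers are placeholders for a single binary Z:

```python
import random

# Estimate P(e) = sum_z P(z, e) by sampling z ~ Q and averaging w(z) = P(z,e)/Q(z).
P_ze = {0: 0.02, 1: 0.18}          # placeholder unnormalized P(Z=z, e); true P(e) = 0.20
Q    = {0: 0.5, 1: 0.5}            # proposal distribution over Z

def importance_estimate(T, rng):
    total = 0.0
    for _ in range(T):
        z = 0 if rng.random() < Q[0] else 1   # sample z ~ Q
        total += P_ze[z] / Q[z]               # importance weight w(z)
    return total / T

print(importance_estimate(100000, random.Random(0)))   # ≈ 0.20
```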

263 Q= Prior Distribution = CPTs of the Bayesian network

264 lung Cancer Smoking X-ray Bronchitis Dyspnoea P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S) P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

265 lung Cancer Smoking X-ray Bronchitis Dyspnoea P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S) Q=Prior Q(S,C,D)=Q(S)*Q(C|S)*Q(D|C,B=0) =P(S)P(C|S)P(D|C,B=0) Sample S=s from P(S) Sample C=c from P(C|S=s) Sample D=d from P(D|C=c,B=0)

266

267

