Slide 1: Aug 25th, 2001. Copyright © 2001, Andrew W. Moore. Probabilistic Models. Brigham S. Anderson, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~brigham, brigham@cmu.edu. Probability = relative area.
2 OUTLINE
Probability: axioms, rules
Probabilistic Models: probability tables, Bayes nets
Machine Learning: inference, anomaly detection, classification
3 Overview. The point of this section: 1. Introduce probability models. 2. Introduce Bayes Nets as powerful probability models.
4 Representing P(A,B,C). How can we represent the function P(A,B,C)? It ideally should contain all the information in this Venn diagram. [Venn diagram over A, B, C with region probabilities 0.30, 0.25, 0.10, 0.10, 0.05, 0.05.]
5 The Joint Probability Table. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: P(A, B, C):
A B C | Prob
0 0 0 | 0.30
0 0 1 | 0.05
0 1 0 | 0.10
0 1 1 | 0.05
1 0 0 | 0.05
1 0 1 | 0.10
1 1 0 | 0.25
1 1 1 | 0.10
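To make the recipe concrete, here is a minimal sketch in Python (mine, not part of the original deck) of slide 5's table as a dictionary keyed by truth assignment; the name `joint` and the axiom checks are illustrative scaffolding.

```python
from itertools import product

# Joint distribution P(A,B,C) from slide 5, one probability per truth assignment.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Axiom check: the 2^M rows of a joint over M boolean variables sum to 1,
# and every combination of values has a row.
assert abs(sum(joint.values()) - 1.0) < 1e-9
assert all(key in joint for key in product((0, 1), repeat=3))
```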
6 Using the Prob. Table. Once you have the PT you can ask for the probability of any logical expression involving your attributes.
7 Using the Prob. Table: P(Poor, Male) = 0.4654.
8 Using the Prob. Table: P(Poor) = 0.7604.
9 Inference with the Prob. Table
10 Inference with the Prob. Table: P(Male | Poor) = P(Poor, Male) / P(Poor) = 0.4654 / 0.7604 = 0.612.
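A hedged sketch of the inference these slides perform. The census table behind P(Poor, Male) is not reproduced in the deck, so the A,B,C table from slide 5 stands in; the helper `prob` and the lambda event tests are my own scaffolding.

```python
# P(event) = sum of all joint rows where the event holds (slide 6's rule);
# P(E1 | E2) = P(E1 and E2) / P(E2) (slides 9-10).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """Sum the probabilities of the rows (a, b, c) satisfying `event`."""
    return sum(p for row, p in joint.items() if event(*row))

# Same shape as P(Male | Poor), here phrased as P(A | B).
p_a_given_b = prob(lambda a, b, c: a and b) / prob(lambda a, b, c: b)
print(p_a_given_b)  # (0.25 + 0.10) / (0.10 + 0.05 + 0.25 + 0.10) = 0.7
```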
11 Inference is a big deal I’ve got this evidence. What’s the chance that this conclusion is true? I’ve got a sore neck: how likely am I to have meningitis? I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep? I see that there’s a long queue by my gate and the clouds are dark and I’m traveling by US Airways. What’s the chance I’ll be home tonight? (Answer: 0.00000001) There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis
12 The Full Probability Table. Problem: We usually don't know P(A,B,C)! Solutions: 1. Construct it analytically. 2. Learn it from data.
13 Learning a Probability Table. Build a PT for your attributes in which the probabilities are unspecified, then fill in each row with the fraction of training records matching it; e.g. the (1,1,0) row gets the fraction of all records in which A and B are True but C is False.
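A sketch of that counting rule, assuming records arrive as tuples of boolean values; `Counter` and the toy records are my own scaffolding, standing in for the census data used on the next slide.

```python
from collections import Counter

def learn_joint_table(records):
    """Estimate P(row) as the fraction of records equal to that row."""
    counts = Counter(records)
    n = len(records)
    return {row: c / n for row, c in counts.items()}

# Toy data standing in for the UCI census records of slide 14.
records = [(1, 1, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1)]
print(learn_joint_table(records))  # {(1,1,0): 0.5, (0,0,0): 0.25, (1,0,1): 0.25}
```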
14 Example of Learning a Prob. Table This PTable was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
15 Where are we? We have covered the fundamentals of probability. We have become content with PTables. And we even know how to learn PTables from data.
16 Probability Table as Anomaly Detector. Our PTable is our first example of something called an Anomaly Detector. An anomaly detector tells you how likely a datapoint is given a model: data point x goes in, P(x) comes out.
17 Evaluating a Model. Given a record x, a model M can tell you how likely the record is: P(x | M). Given a dataset with R records, a model can tell you how likely the dataset is: P(dataset | M) = P(x_1 ^ x_2 ^ … ^ x_R | M) = ∏_{k=1}^{R} P(x_k | M) (under the assumption that all records were independently generated from the model's probability function).
18 A small dataset: Miles Per Gallon From the UCI repository (thanks to Ross Quinlan) 192 Training Set Records
21 Log Probabilities. Since probabilities of datasets get so small we usually use log probabilities: log P(dataset | M) = ∑_k log P(x_k | M).
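A sketch of the dataset score in log form, under the slides' i.i.d. assumption. `model` is any function returning P(x); the `floor` argument is my own addition, anticipating the hard-wired minimum probability mentioned on slide 27.

```python
import math

def log_likelihood(dataset, model, floor=1e-20):
    """log P(dataset | M) = sum_k log P(x_k | M), assuming i.i.d. records.
    Rows the model deems impossible are floored (cf. slide 27) so a single
    unseen combination does not send the score to minus infinity."""
    return sum(math.log(max(model(x), floor)) for x in dataset)

# Usage with a learned table `pt` (a dict of row -> probability):
# score = log_likelihood(test_records, lambda x: pt.get(x, 0.0))
```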
23 Summary: The Good News. Prob. Tables allow us to learn P(X) from data. Can do anomaly detection: spot suspicious / incorrect records. Can do inference, P(E_1|E_2): automatic doctor, help desk, etc. Can do Bayesian classification (coming soon…).
24 Summary: The Bad News. Learning a Prob. Table for P(X) is for cowboy data miners: those probability tables get big fast! With so many parameters relative to our data, the resulting model could be pretty wild.
25 Using a test set. An independent test set with 196 cars has a much worse log likelihood (actually it's a billion quintillion quintillion quintillion quintillion times less likely). Density estimators can overfit, and the full joint density estimator is the overfittiest of them all!
26 Overfitting Prob. Tables. If a test record is ever assigned probability zero (so the dataset's log likelihood is minus infinity), it means there are certain combinations that we learned are impossible.
27 Using a test set. The only reason that our test set didn't score minus infinity is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20. We need P() estimators that are less prone to overfitting.
28 Naïve Density Estimation The problem with the Probability Table is that it just mirrors the training data. In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from our Table. We need something which generalizes more usefully. The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.
29 Probability Models.
Full Prob. Table: no assumptions; overfitting-prone; scales horribly.
Naïve Density: strong assumptions; overfitting-resistant; scales incredibly well.
30 Independently Distributed Data. Let x[i] denote the i'th field of record x. The independently distributed assumption says that for any i, v, u_1, u_2, …, u_{i-1}, u_{i+1}, …, u_M: P(x[i]=v | x[1]=u_1, …, x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, …, x[M]=u_M) = P(x[i]=v). Or in other words, x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}. This is often written as x[i] ⊥ {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.
31 A note about independence. Assume A and B are Boolean Random Variables. Then "A and B are independent" if and only if P(A|B) = P(A). "A and B are independent" is often notated as A ⊥ B.
32 Independence Theorems. Assume P(A|B) = P(A). Then P(A, B) = P(A|B) P(B) = P(A) P(B). Assume P(A|B) = P(A). Then P(B|A) = P(A|B) P(B) / P(A) = P(A) P(B) / P(A) = P(B).
33 Independence Theorems. Assume P(A|B) = P(A). Then P(~A|B) = 1 - P(A|B) = 1 - P(A) = P(~A). Assume P(A|B) = P(A). Then P(A|~B) = (P(A) - P(A,B)) / P(~B) = P(A)(1 - P(B)) / (1 - P(B)) = P(A).
34 Multivalued Independence. For Random Variables A and B, A ⊥ B if and only if for all u, v: P(A=u | B=v) = P(A=u), from which you can then prove things like P(A=u ^ B=v) = P(A=u) P(B=v) and P(B=v | A=u) = P(B=v).
35 Back to Naïve Density Estimation. Let x[i] denote the i'th field of record x: Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}. Example: Suppose that each record is generated by randomly shaking a green die and a red die. Dataset 1: A = red value, B = green value. Dataset 2: A = red value, B = sum of values. Dataset 3: A = sum of values, B = difference of values. Which of these datasets violates the naïve assumption?
36 Using the Naïve Distribution. Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? P(A ^ ~B ^ C ^ ~D) = P(A) P(~B) P(C) P(~D).
37 Naïve Distribution General Case. Suppose x[1], x[2], …, x[M] are independently distributed: P(x[1]=u_1, x[2]=u_2, …, x[M]=u_M) = ∏_{i=1}^{M} P(x[i]=u_i). So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand, and so we can do any inference. But how do we learn a Naïve Density Estimator?
38 Learning a Naïve Density Estimator. Another trivial learning algorithm: estimate P(x[i]=u) as the fraction of records in which x[i]=u.
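A minimal sketch of both halves of the Naïve estimator for boolean attributes: learning the per-attribute fractions, then composing any joint row by multiplication as on slide 36. The function names and toy records are mine.

```python
def learn_naive(records, m):
    """Per-attribute estimates: P(x[i]=1) = fraction of records with x[i]=1."""
    n = len(records)
    return [sum(r[i] for r in records) / n for i in range(m)]

def naive_prob(row, p1):
    """P(row) = prod_i P(x[i]=row[i]) under the naive independence assumption."""
    prob = 1.0
    for v, p in zip(row, p1):
        prob *= p if v else (1.0 - p)
    return prob

# E.g. P(A ^ ~B ^ C ^ ~D) = P(A) P(~B) P(C) P(~D), as on slide 36:
p1 = learn_naive([(1, 0, 1, 0), (1, 1, 1, 0), (0, 0, 1, 1)], 4)
print(naive_prob((1, 0, 1, 0), p1))  # (2/3)(2/3)(1)(2/3) ~ 0.296
```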
39 Contrast.
Joint DE: can model anything; no problem to model "C is a noisy copy of A"; given 100 records and more than 6 boolean attributes will screw up badly.
Naïve DE: can model only very boring distributions; "C is a noisy copy of A" is outside Naïve's scope; given 100 records and 10,000 multivalued attributes will be fine.
40 Empirical Results: "Hopeless". The "hopeless" dataset consists of 40,000 records and 21 boolean attributes called a, b, c, …, u. Each attribute in each record is generated 50-50 randomly as 0 or 1. Despite the vast amount of data, "Joint" overfits hopelessly and does much worse. [Figure: average test set log probability during 10 folds of k-fold cross-validation (described in a future Andrew lecture).]
41 Empirical Results: "Logical". The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped. [Figures: the DE learned by "Joint" and the DE learned by "Naive".]
43 Empirical Results: "MPG". The "MPG" dataset consists of 392 records and 8 attributes. [Figures: a tiny part of the DE learned by "Joint", and the DE learned by "Naive".]
44 Empirical Results: "Weight vs MPG". Suppose we train only from the "Weight" and "MPG" attributes. [Figures: the DE learned by "Joint" and the DE learned by "Naive".]
46 "Weight vs MPG": The best that Naïve can do. [Figures: the DE learned by "Joint" and the DE learned by "Naive".]
47 Reminder: The Good News. We have two ways to learn a Density Estimator from data, and there are vastly more impressive Density Estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more). Density estimators can do many good things: anomaly detection; inference, P(E_1|E_2) (automatic doctor / help desk etc); an ingredient for Bayes Classifiers.
48 Probability Models.
Full Prob. Table: no assumptions; overfitting-prone; scales horribly.
Naïve Density: strong assumptions; overfitting-resistant; scales incredibly well.
Bayes Nets: carefully chosen assumptions; overfitting and scaling properties depend on the assumptions.
49 Bayes Nets
50 Bayes Nets.
What are they? Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty.
What are they used for? Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining.
Where did they come from? Cross-fertilization of ideas between the artificial intelligence, decision analysis, and statistics communities.
Why the sudden interest? Development of propagation algorithms, followed by availability of easy-to-use commercial software and a growing number of creative applications.
How are they different from other knowledge representation and probabilistic analysis tools? Uncertainty is handled in a mathematically rigorous yet efficient and simple way.
51 Bayes Net Concepts. 1. Chain Rule: P(A,B) = P(A) P(B|A). 2. Conditional Independence: P(B|A) = P(B).
52 A Simple Bayes Net. Let's assume that we already have P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48. How would you rewrite this using the Chain rule?
53 Review: Chain Rule. P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg), where P(Mpg): P(good) = 0.4, P(bad) = 0.6, and P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79.
54 Review: Chain Rule. P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg):
P(good, low) = P(good) * P(low|good) = 0.4 * 0.89 ≈ 0.36
P(good, high) = P(good) * P(high|good) = 0.4 * 0.11 ≈ 0.04
P(bad, low) = P(bad) * P(low|bad) = 0.6 * 0.21 ≈ 0.12
P(bad, high) = P(bad) * P(high|bad) = 0.6 * 0.79 ≈ 0.48
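A quick numeric check, in the same sketch style as above, that the two chain-rule factors reproduce the joint from slide 52 (up to the slides' rounding); the dictionary names are mine.

```python
# Slide 52's joint and slide 53's factors.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {("low", "good"): 0.89, ("high", "good"): 0.11,
                     ("low", "bad"): 0.21, ("high", "bad"): 0.79}
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad", "low"): 0.12, ("bad", "high"): 0.48}

# Chain rule: each joint entry equals P(Mpg) * P(Horse | Mpg).
for (mpg, horse), p in joint.items():
    reconstructed = p_mpg[mpg] * p_horse_given_mpg[(horse, mpg)]
    print(mpg, horse, p, round(reconstructed, 3))  # matches to ~0.01
```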
55 How to Make a Bayes Net. P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg). [Net: Mpg → Horse.]
56 How to Make a Bayes Net. P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg). [Net: Mpg → Horse.] P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79.
57 How to Make a Bayes Net. [Net: Mpg → Horse, with P(Mpg) at the Mpg node and P(Horse|Mpg) at the Horse node.] Each node is a probability function; each arc denotes conditional dependence.
58 How to Make a Bayes Net. So, what have we accomplished thus far? Nothing; we've just "Bayes Net-ified" P(Mpg, Horse) using the Chain rule. The real excitement starts when we wield conditional independence.
59 How to Make a Bayes Net Before we apply conditional independence, we need a worthier opponent than puny P(Mpg, Horse)… We’ll use P(Mpg, Horse, Accel)
60 How to Make a Bayes Net. Rewrite the joint using the Chain rule: P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse). Note: Obviously, we could have written this 3! = 6 different ways: P(M, H, A) = P(M) * P(H|M) * P(A|M,H) = P(M) * P(A|M) * P(H|M,A) = P(H) * P(M|H) * P(A|H,M) = P(H) * P(A|H) * P(M|H,A) = …
61 How to Make a Bayes Net. [Net: Mpg → Horse, Mpg → Accel, Horse → Accel.] Rewrite the joint using the Chain rule: P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse).
62 How to Make a Bayes Net. P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79. P(Accel|Mpg,Horse): P(low|good,low) = 0.97, P(low|good,high) = 0.15, P(low|bad,low) = 0.90, P(low|bad,high) = 0.05, P(high|good,low) = 0.03, P(high|good,high) = 0.85, P(high|bad,low) = 0.10, P(high|bad,high) = 0.95. (Note: I made these up…)
63 How to Make a Bayes Net. A Miracle Occurs! You are told by God (or another domain expert) that Accel is independent of Mpg given Horse: P(Accel | Mpg, Horse) = P(Accel | Horse).
64 How to Make a Bayes Net. [Net: Mpg → Horse → Accel.] P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79. P(Accel|Horse): P(low|low) = 0.95, P(low|high) = 0.11, P(high|low) = 0.05, P(high|high) = 0.89.
65 How to Make a Bayes Net. Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!
66 Independence. "The Acceleration does not depend on the Mpg once I know the Horsepower." This can be specified very simply: P(Accel | Mpg, Horse) = P(Accel | Horse). This is a powerful statement! It required extra domain knowledge: a different kind of knowledge than numerical probabilities. It needed an understanding of causation.
67 Bayes Net-Building Tools. 1. Chain Rule: P(A,B) = P(A) P(B|A). 2. Conditional Independence: P(B|A) = P(B). This is Ridiculously Useful in general.
68 Bayes Nets Formalized. A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where: V is a set of vertices; E is a set of directed edges joining vertices, with no loops of any length allowed. Each vertex in V contains the following information: the name of a random variable, and a probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.
73 Independence. We've stated: P(M) = 0.6, P(S) = 0.3, P(S | M) = P(S). From these statements, we can derive the full joint pdf:
M S | Prob
T T | 0.18
T F | 0.42
F T | 0.12
F F | 0.28
And since we now have the joint pdf, we can make any queries we like.
74 The good news. We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values).
75 The good news. We can do inference. We can compute any conditional probability: P(E_1 | E_2) = P(E_1 ^ E_2) / P(E_2) = (sum of joint entries matching E_1 ^ E_2) / (sum of joint entries matching E_2). Suppose you have m binary-valued variables in your Bayes Net and expression E_2 mentions k variables. How much work is the above computation?
76 The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables.
77 The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?
78 The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.
79 Case Study I Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA). Diagnostic system for lymph-node diseases. 60 diseases and 100 symptoms and test-results. 14,000 probabilities Expert consulted to make net. 8 hours to determine variables. 35 hours for net topology. 40 hours for probability table values. Apparently, the experts found it quite easy to invent the causal links and probabilities. Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
80 What you should know The meanings and importance of independence and conditional independence. The definition of a Bayes net. Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net. The slow (exponential) method for computing arbitrary, conditional probabilities. The stochastic simulation method and likelihood weighting.
82 Chain Rule. P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg), and equally P(Mpg, Horse) = P(Horse) * P(Mpg | Horse).
P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horse | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79.
P(Horse): P(low) = 0.48, P(high) = 0.52. P(Mpg | Horse): P(good|low) = 0.72, P(bad|low) = 0.28, P(good|high) = 0.08, P(bad|high) = 0.92.
P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48.
84 Making a Bayes Net. 1. Start with a Joint: P(Horse, Mpg). 2. Rewrite the joint with the Chain Rule: P(Horse) * P(Mpg | Horse). 3. Each component becomes a node (with arcs).
85 P(Mpg, Horsepower). We can rewrite P(Mpg, Horsepower) as a Bayes net! 1. Decompose with the Chain rule. 2. Make each element of the decomposition into a node. [Net: Mpg → Horsepower.] P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horsepower | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79.
86 P(Mpg, Horsepower). The other decomposition: [Net: Horsepower → Mpg.] P(Horse): P(low) = 0.48, P(high) = 0.52. P(Mpg | Horse): P(good|low) = 0.72, P(bad|low) = 0.28, P(good|high) = 0.08, P(bad|high) = 0.92.
87 The Joint Distribution. P(h, f) as a table: P(h ^ f) = 0.01, P(~h ^ f) = 0.01, P(h ^ ~f) = 0.09, P(~h ^ ~f) = 0.89.
88 The Joint Distribution. [Figure: the h, f joint extended with a third variable a, shown as a table over all combinations of h/~h, f/~f, a/~a.]
89 Joint Distribution. Up to this point, we have represented a probability model (or joint distribution) in two ways. Venn Diagrams: visual, good for tutorials. Probability Tables: efficient, good for computation. Example, P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48.
90 Where do Joint Distributions come from? Idea One: Expert Humans with lots of free time. Idea Two: Simpler probabilistic facts and some algebra. Example: Suppose you knew P(A) = 0.7, P(B|A) = 0.2, P(B|~A) = 0.1, P(C|A,B) = 0.1, P(C|A,~B) = 0.8, P(C|~A,B) = 0.3, P(C|~A,~B) = 0.1. Then you can automatically compute the PT using the chain rule P(x, y, z) = P(x) P(y|x) P(z|x,y). Later: Bayes Nets will do this systematically.
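A sketch automating slide 90's "Idea Two": the seven given numbers plus the chain rule produce all eight rows of the joint. The dictionary names (`p_b_given_a` and so on) are my own scaffolding.

```python
from itertools import product

# The seven facts from slide 90.
p_a = 0.7
p_b_given_a = {True: 0.2, False: 0.1}                     # P(B|A), P(B|~A)
p_c_given_ab = {(True, True): 0.1, (True, False): 0.8,
                (False, True): 0.3, (False, False): 0.1}  # P(C|A,B), ...

# Chain rule: P(x, y, z) = P(x) P(y|x) P(z|x,y).
joint = {}
for a, b, c in product((True, False), repeat=3):
    pa = p_a if a else 1 - p_a
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
    joint[(a, b, c)] = pa * pb * pc

assert abs(sum(joint.values()) - 1.0) < 1e-9  # a proper distribution
```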
91 Where do Joint Distributions come from? Idea Three: Learn them from data! Prepare to be impressed…
93 A more interesting case. M: Manuela teaches the class. S: It is sunny. L: The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather, and Andrew is more likely to arrive late than Manuela. Let's begin with writing down knowledge we're happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6. Lateness is not independent of the weather and is not independent of the lecturer. We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.
94 A more interesting case. M: Manuela teaches the class. S: It is sunny. L: The lecturer arrives slightly late. P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6, P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2. Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven (the savings are larger for larger numbers of variables).
95 A more interesting case. P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6, P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2. Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
97 A bit of notation. P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6, P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2. [Net: S → L ← M, with P(S) = 0.3 at S, P(M) = 0.6 at M, and the four P(L | M, S) entries at L.] Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S". Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S. This kind of stuff will be thoroughly formalized later.
98 An even cuter trick. Suppose we have these three events: M: Lecture taught by Manuela. L: Lecturer arrives late. R: Lecture concerns robots. Suppose: Andrew has a higher chance of being late than Manuela, and Andrew has a higher chance of giving robotics lectures. What kind of independence can we find? How about: P(L | M) = P(L)? P(R | M) = P(R)? P(L | R) = P(L)?
99 Conditional independence. Once you know who the lecturer is, then whether they arrive late doesn't affect whether the lecture concerns robots: P(R | M, L) = P(R | M) and P(R | ~M, L) = P(R | ~M). We express this in the following way: "R and L are conditionally independent given M", which is also notated by the diagram L ← M → R. Given knowledge of M, knowing anything else in the diagram won't help us with L, etc.
100 Conditional Independence formalized. R and L are conditionally independent given M if for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y). More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets: P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments).
101 Example: "Shoe-size is conditionally independent of Glove-size given height, weight and age" means: for all s, g, h, w, a: P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g).
102 Example: "Shoe-size is conditionally independent of Glove-size given height, weight and age" does not mean: for all s, g, h: P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g).
103 Conditional independence. [Net: L ← M → R.] "R and L conditionally independent given M." We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we've fully specified L's behavior. Ditto for R. P(M) = 0.6, P(L | M) = 0.085, P(L | ~M) = 0.17, P(R | M) = 0.3, P(R | ~M) = 0.6.
104 Conditional independence. P(M) = 0.6, P(L | M) = 0.085, P(L | ~M) = 0.17, P(R | M) = 0.3, P(R | ~M) = 0.6. Conditional Independence: P(R | M, L) = P(R | M), P(R | ~M, L) = P(R | ~M). Again, we can obtain any member of the Joint prob dist that we desire: P(L=x ^ R=y ^ M=z) = P(M=z) P(L=x | M=z) P(R=y | M=z).
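A sketch that builds slide 104's joint from its five parameters and numerically confirms the conditional independence claim; the function and dictionary names are mine.

```python
# Parameters from slide 103.
p_m = 0.6
p_l_given_m = {True: 0.085, False: 0.17}   # P(L|M), P(L|~M)
p_r_given_m = {True: 0.3, False: 0.6}      # P(R|M), P(R|~M)

def joint(l, r, m):
    """P(L=l ^ R=r ^ M=m) = P(M=m) P(L=l|M=m) P(R=r|M=m)."""
    pm = p_m if m else 1 - p_m
    pl = p_l_given_m[m] if l else 1 - p_l_given_m[m]
    pr = p_r_given_m[m] if r else 1 - p_r_given_m[m]
    return pm * pl * pr

# Check: P(R | M ^ L) should equal P(R | M) = 0.3.
p_r_given_ml = joint(True, True, True) / (
    joint(True, True, True) + joint(True, False, True))
print(p_r_given_ml)  # 0.3
```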
105 Assume five variables T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L) L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S) R only directly influenced by M (i.e. R is conditionally independent of L,S, given M) M and S are independent
106 Making a Bayes net S M R L T Step One: add variables. Just choose the variables you’d like to be included in the net. T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny
107 Making a Bayes net. Step Two: add links. The link structure must be acyclic. If node X is given parents Q_1, Q_2, …, Q_n, you are promising that any variable that's a non-descendant of X is conditionally independent of X given {Q_1, Q_2, …, Q_n}. [Net so far: S → L ← M, M → R, L → T.] T: The lecture started by 10:35. L: The lecturer arrives late. R: The lecture concerns robots. M: The lecturer is Manuela. S: It is sunny.
108 Making a Bayes net. Step Three: add a probability table for each node. The table for node X must list P(X | Parent Values) for each possible combination of parent values: P(S)=0.3, P(M)=0.6, P(R|M)=0.3, P(R|~M)=0.6, P(T|L)=0.3, P(T|~L)=0.8, P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2.
109 Making a Bayes net. Two unconnected variables may still be correlated. Each node is conditionally independent of all non-descendants in the tree, given its parents. You can deduce many other conditional independence relations from a Bayes net (see the next lecture).
111 Building a Bayes Net. 1. Choose a set of relevant variables. 2. Choose an ordering for them. 3. Assume they're called X_1 .. X_m (where X_1 is the first in the ordering, X_2 is the second, etc). 4. For i = 1 to m: (a) Add the X_i node to the network. (b) Set Parents(X_i) to be a minimal subset of {X_1 … X_{i-1}} such that we have conditional independence of X_i and all other members of {X_1 … X_{i-1}} given Parents(X_i). (c) Define the probability table of P(X_i = k | Assignments of Parents(X_i)).
112 Example Bayes Net Building Suppose we’re building a nuclear power station. There are the following random variables: GRL : Gauge Reads Low. CTL : Core temperature is low. FG : Gauge is faulty. FA : Alarm is faulty AS : Alarm sounds If alarm working properly, the alarm is meant to sound if the gauge stops reading a low temp. If gauge working properly, the gauge is meant to read the temp of the core.
113 Computing a Joint Entry. How to compute an entry in a joint distribution? E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)? [Net and tables as on slide 108: P(S)=0.3, P(M)=0.6, P(R|M)=0.3, P(R|~M)=0.6, P(T|L)=0.3, P(T|~L)=0.8, P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2.]
114 Computing with Bayes Net.
P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).
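Plugging slide 108's table values into the final line of the derivation gives a concrete number; a quick sketch (the variable names are mine):

```python
p_s, p_m = 0.3, 0.6
p_t_given_l = 0.3
p_r_given_not_m = 0.6
p_l_given_not_m_s = 0.1

# P(T ^ ~R ^ L ^ ~M ^ S) = P(T|L) P(~R|~M) P(L|~M^S) P(~M) P(S)
entry = p_t_given_l * (1 - p_r_given_not_m) * p_l_given_not_m_s * (1 - p_m) * p_s
print(entry)  # 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
```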
115 The general case.
P(X_1=x_1 ^ X_2=x_2 ^ … ^ X_{n-1}=x_{n-1} ^ X_n=x_n)
= P(X_n=x_n | X_{n-1}=x_{n-1} ^ … ^ X_1=x_1) * P(X_{n-1}=x_{n-1} ^ … ^ X_1=x_1)
= P(X_n=x_n | X_{n-1}=x_{n-1} ^ … ^ X_1=x_1) * P(X_{n-1}=x_{n-1} | X_{n-2}=x_{n-2} ^ … ^ X_1=x_1) * P(X_{n-2}=x_{n-2} ^ … ^ X_1=x_1)
= …
= ∏_{i=1}^{n} P(X_i=x_i | X_{i-1}=x_{i-1} ^ … ^ X_1=x_1) = ∏_{i=1}^{n} P(X_i=x_i | Assignments of Parents(X_i)).
So any entry in the joint pdf table can be computed, and so any conditional probability can be computed.
119 Where are we now? We have a methodology for building Bayes nets. We don't require exponential storage to hold our probability table; storage is only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables, and we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g., what could we do to compute P(R | T, ~S)? [Net and tables as on slide 108.] Step 1: Compute P(R ^ T ^ ~S) (the sum of all the rows in the Joint that match R ^ T ^ ~S). Step 2: Compute P(~R ^ T ^ ~S) (the sum of all the rows in the Joint that match ~R ^ T ^ ~S). Step 3: Return P(R ^ T ^ ~S) / (P(R ^ T ^ ~S) + P(~R ^ T ^ ~S)). Each sum is obtained by the "computing a joint probability entry" method of the earlier slides: 4 joint computes per sum. A sketch of the whole recipe follows below.
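A minimal sketch of that three-step recipe on the S, M, L, R, T net, using the tables from slide 108; the factored `joint` function and the table names are my own scaffolding, and the free variables M and L are summed out by brute-force enumeration.

```python
from itertools import product

# Tables from slide 108. Parents: L <- (M, S); R <- M; T <- L.
P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1,
       (False, True): 0.1, (False, False): 0.2}   # P(L | M, S)
P_R = {True: 0.3, False: 0.6}                     # P(R | M)
P_T = {True: 0.3, False: 0.8}                     # P(T | L)

def joint(s, m, l, r, t):
    """One joint entry via the factored form of the net."""
    p = (P_S if s else 1 - P_S) * (P_M if m else 1 - P_M)
    p *= P_L[(m, s)] if l else 1 - P_L[(m, s)]
    p *= P_R[m] if r else 1 - P_R[m]
    p *= P_T[l] if t else 1 - P_T[l]
    return p

# P(R | T, ~S): sum the matching joint rows, then normalize (steps 1-3).
num = sum(joint(False, m, l, True, True) for m, l in product((True, False), repeat=2))
den = num + sum(joint(False, m, l, False, True) for m, l in product((True, False), repeat=2))
print(num / den)
```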