CS 484 – Artificial Intelligence
Announcements
  Homework 8 due today, November 13
  ½ to 1 page description of final project due Thursday, November 15
  Current Events: Christian - now; Jeff - Thursday
  Research Paper due Tuesday, November 20
Probabilistic Reasoning Lecture 15
Probabilistic Reasoning
  Logic deals with certainties: A → B
  Probabilities are expressed in a notation similar to that of predicates in First Order Predicate Calculus:
    P(R) = 0.7
    P(S) = 0.1
    P(¬(A Λ B) V C) =
  1 = certain; 0 = certainly not
What's the probability that either A is true or B is true?
  P(A V B) = P(A) + P(B) – P(A Λ B)
  [Venn diagram: overlapping circles A and B; the overlap is A Λ B]
Conditional Probability
  Conditional probability refers to the probability of one thing given that we already know another to be true:
    P(B|A) = P(B Λ A) / P(A)
  This states the probability of B, given A.
  [Venn diagram: overlapping circles A and B; the overlap is A Λ B]
Calculate P(R|S), given that the probability of rain is 0.7, the probability of sun is 0.1, and the probability of rain and sun is 0.01:
  P(R|S) = P(R Λ S) / P(S) = 0.01 / 0.1 = 0.1
  Note: in general, P(A|B) ≠ P(B|A)
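The rain and sun calculation can be checked with a few lines of Python. This is a minimal sketch; the helper name `conditional` is an illustrative choice, not something from the lecture.

```python
def conditional(p_joint, p_condition):
    """Return P(X | Y) = P(X and Y) / P(Y)."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have non-zero probability")
    return p_joint / p_condition


# Values quoted on the slide.
p_rain, p_sun, p_rain_and_sun = 0.7, 0.1, 0.01

p_rain_given_sun = conditional(p_rain_and_sun, p_sun)   # P(R|S)
p_sun_given_rain = conditional(p_rain_and_sun, p_rain)  # P(S|R)

print(round(p_rain_given_sun, 4), round(p_sun_given_rain, 4))
```

Computing both directions makes the slide's note concrete: P(R|S) and P(S|R) share a numerator but divide by different marginals.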
Joint Probability Distributions
  A joint probability distribution represents the combined probabilities of two or more variables.
  [Table: joint probabilities with columns A, ¬A and rows B, ¬B; Venn diagram of A, B and A Λ B]
  This table shows, for example, that
    P(A Λ B) = 0.11
    P(¬A Λ B) = 0.09
  Using this, we can calculate P(A):
    P(A) = P(A Λ B) + P(A Λ ¬B) = 0.11 + 0.63 = 0.74
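Marginalising a joint distribution can be sketched in Python as follows. Only 0.11, 0.09 and P(A) = 0.74 appear in the slide text. The other two entries are derived here as an assumption: 0.63 = 0.74 − 0.11, and 0.17 so that the four entries sum to 1.

```python
# Joint distribution keyed by (a, b).
# 0.11 and 0.09 are quoted on the slide; 0.63 and 0.17 are derived (see lead-in).
JOINT = {
    (True, True): 0.11,
    (False, True): 0.09,
    (True, False): 0.63,
    (False, False): 0.17,
}


def marginal_a(joint, a=True):
    """Sum out B to obtain P(A = a)."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)


def marginal_b(joint, b=True):
    """Sum out A to obtain P(B = b)."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)


p_a = marginal_a(JOINT)
p_b = marginal_b(JOINT)
# Conditional probability read off the same table: P(A|B) = P(A and B) / P(B).
p_a_given_b = JOINT[(True, True)] / p_b

print(round(p_a, 2), round(p_b, 2), round(p_a_given_b, 2))
```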
Bayes’ Theorem
  Bayes’ theorem lets us calculate a conditional probability:
    P(B|A) = P(A|B) · P(B) / P(A)
  P(B) is the prior probability of B.
  P(B|A) is the posterior probability of B.
Bayes' Theorem Deduction
  Recall:
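The steps of the deduction did not survive in the slide text. The block below sketches the standard argument from the definition of conditional probability, which may differ in detail from what was shown in lecture.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A \land B)}{P(B)}
  &&\Rightarrow\quad P(A \land B) = P(A \mid B)\,P(B) \\
P(B \mid A) &= \frac{P(A \land B)}{P(A)}
  &&\Rightarrow\quad P(A \land B) = P(B \mid A)\,P(A) \\
P(B \mid A)\,P(A) &= P(A \mid B)\,P(B)
  &&\Rightarrow\quad P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}
\end{align*}
```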
Medical Diagnosis
  Data:
    80% of the time you have a cold, you also have a high temperature.
    At any one time, 1 in every 10,000 people has a cold.
    1 in every 1000 people has a high temperature.
  Suppose you have a high temperature. What is the likelihood that you have a cold?
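The question on this slide is a direct application of Bayes' theorem, P(C|HT) = P(HT|C) · P(C) / P(HT). A minimal sketch, with illustrative function and variable names:

```python
def bayes(p_evidence_given_h, p_h, p_evidence):
    """Posterior P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_evidence_given_h * p_h / p_evidence


p_ht_given_cold = 0.8   # 80% of colds come with a high temperature
p_cold = 1 / 10_000     # prior probability of a cold
p_ht = 1 / 1_000        # probability of a high temperature

p_cold_given_ht = bayes(p_ht_given_cold, p_cold, p_ht)
print(round(p_cold_given_ht, 4))
```

Under these figures the posterior is about 0.08. A cold almost always produces a high temperature, but a high temperature only weakly indicates a cold, because colds are ten times rarer than high temperatures.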
Witness Reliability
  A hit-and-run incident has been reported, and an eyewitness has stated she is certain that the car was a white taxi. How likely is it that she is right?
  Facts:
    Yellow taxi company has 90 cars
    White taxi company has 10 cars
    Expert says that, given the foggy weather, the witness has a 75% chance of correctly identifying the taxi
Witness Reliability – Prior Probability
  Imagine the witness is shown a sequence of 1000 cars
    Expect 900 to be yellow and 100 to be white
  Given 75% accuracy, how many will she say are white and how many yellow?
    Of 900 yellow cars, she says yellow for 675 and white for 225
    Of 100 white cars, she says yellow for 25 and white for 75
  What is the probability that the witness says white?
  How likely is it that she is right?
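The counting argument for the witness can be sketched in Python. It assumes, as the slide does, that the 75% accuracy applies equally to yellow and white cars; the variable names are illustrative.

```python
YELLOW, WHITE = 900, 100   # expected cars of each colour out of 1000
ACCURACY_PERCENT = 75      # chance of naming the colour correctly

# Integer arithmetic keeps the counts exact.
yellow_called_yellow = YELLOW * ACCURACY_PERCENT // 100
yellow_called_white = YELLOW - yellow_called_yellow
white_called_white = WHITE * ACCURACY_PERCENT // 100
white_called_yellow = WHITE - white_called_white

says_white = yellow_called_white + white_called_white
p_says_white = says_white / (YELLOW + WHITE)
# Of all the times she says "white", how often is the car really white?
p_white_given_says_white = white_called_white / says_white

print(says_white, p_says_white, p_white_given_says_white)
```

Although she is right 75% of the time on any single car, most of her "white" reports come from misidentified yellow cars, so a "white" report is correct only a quarter of the time.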
Comparing Conditional Probabilities
  Medical diagnosis
    Probability of a high temperature (HT) given a cold (C): P(HT|C) = 0.8
    Probability of a high temperature given plague (P): P(HT|P) = 0.99
  Relative likelihood of cold and plague:
    P(C|HT) / P(P|HT) = (P(HT|C) · P(C)) / (P(HT|P) · P(P))
Simple Bayesian Concept Learning (1)
  P(H|E) is used to represent the probability that some hypothesis, H, is true, given evidence E.
  Let us suppose we have a set of hypotheses H_1 … H_n.
  For each H_i:
    P(H_i|E) = P(E|H_i) · P(H_i) / P(E)
  Hence, given a piece of evidence, a learner can determine which is the most likely explanation by finding the hypothesis that has the highest posterior probability.
Simple Bayesian Concept Learning (2)
  In fact, this can be simplified.
  Since P(E) is independent of H_i, it will have the same value for each hypothesis.
  Hence, it can be ignored, and we can find the hypothesis with the highest value of:
    P(E|H_i) · P(H_i)
  We can simplify this further if all the hypotheses are equally likely, in which case we simply seek the hypothesis with the highest value of P(E|H_i).
  This is the likelihood of E given H_i.
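The two selection rules on this slide can be sketched as follows. The hypothesis names and all of the numbers are hypothetical, chosen only so that the two rules disagree; nothing here comes from the lecture data.

```python
# Hypothetical priors P(H_i) and likelihoods P(E | H_i), for illustration only.
PRIORS = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
LIKELIHOODS = {"H1": 0.1, "H2": 0.4, "H3": 0.5}


def map_hypothesis(priors, likelihoods):
    """Pick the hypothesis maximising P(E | H_i) * P(H_i); P(E) is ignored."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


def ml_hypothesis(likelihoods):
    """Pick the hypothesis maximising P(E | H_i); valid when priors are equal."""
    return max(likelihoods, key=likelihoods.get)


best_map = map_hypothesis(PRIORS, LIKELIHOODS)
best_ml = ml_hypothesis(LIKELIHOODS)
print(best_map, best_ml)
```

With unequal priors the rules can pick different hypotheses, which is why the likelihood-only shortcut is restricted to the equally likely case.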
Bayesian Belief Networks (1)
  A belief network shows the dependencies between a group of variables.
  Two variables A and B are independent if the likelihood that A will occur has nothing to do with whether B occurs.
  C and D are dependent on A; D and E are dependent on B.
  The Bayesian belief network has probabilities associated with each link. E.g., P(C|A) = 0.2, P(C|¬A) = 0.4
Bayesian Belief Networks (2)
  A complete set of probabilities for this belief network might be:
    P(A) = 0.1
    P(B) = 0.7
    P(C|A) = 0.2
    P(C|¬A) = 0.4
    P(D|A Λ B) = 0.5
    P(D|A Λ ¬B) = 0.4
    P(D|¬A Λ B) = 0.2
    P(D|¬A Λ ¬B) =
    P(E|B) = 0.2
    P(E|¬B) = 0.1
Bayesian Belief Networks (3)
  We can now calculate conditional probabilities:
    P(A,B,C,D,E) = P(E|A,B,C,D) · P(D|A,B,C) · P(C|A,B) · P(B|A) · P(A)
  In fact, we can simplify this, since there are no dependencies between certain pairs of variables – between E and A, for example. Hence:
    P(A,B,C,D,E) = P(E|B) · P(D|A Λ B) · P(C|A) · P(B) · P(A)
College Life Example
  C = that you will go to college
  S = that you will study
  P = that you will party
  E = that you will be successful in your exams
  F = that you will have fun
  [Belief network diagram with nodes C, S, P, E, F]
College Life Example
  [Belief network diagram with nodes C, S, P, E, F]

  P(C) = 0.2

  C      P(S)
  true   0.8
  false  0.2

  C      P(P)
  true   0.6
  false  0.5

  S      P      P(E)
  true   true   0.6
  true   false  0.9
  false  true   0.1
  false  false  0.2

  P      P(F)
  true   0.9
  false  0.7
College Example
  Using the tables to solve problems such as
    P(C = true, S = true, P = false, E = true, F = false) = P(C, S, ¬P, E, ¬F)
  General solution
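In a belief network, a joint probability is the product of each variable's probability given its parents. The sketch below applies that factorisation to the College Life tables. The parent sets (C for S and P; S and P for E; P for F) are read off the table columns, and the function names are illustrative.

```python
from itertools import product

# Conditional probability tables from the College Life slide.
# Each entry is the probability that the variable is true.
P_C = 0.2
P_S_GIVEN_C = {True: 0.8, False: 0.2}
P_P_GIVEN_C = {True: 0.6, False: 0.5}
P_E_GIVEN_SP = {
    (True, True): 0.6,
    (True, False): 0.9,
    (False, True): 0.1,
    (False, False): 0.2,
}
P_F_GIVEN_P = {True: 0.9, False: 0.7}


def prob(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(c, s, p, e, f):
    """P(C=c, S=s, P=p, E=e, F=f) as a product over each node given its parents."""
    return (
        prob(P_C, c)
        * prob(P_S_GIVEN_C[c], s)
        * prob(P_P_GIVEN_C[c], p)
        * prob(P_E_GIVEN_SP[(s, p)], e)
        * prob(P_F_GIVEN_P[p], f)
    )


# P(C, S, not P, E, not F) = 0.2 * 0.8 * 0.4 * 0.9 * 0.3
result = joint(True, True, False, True, False)
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(round(result, 5), round(total, 5))
```

Summing `joint` over all 32 assignments should give 1, which is a quick sanity check that the tables were transcribed consistently.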
Noisy-V Function
  Want to assume we know all reasons for a possible event
  E.g. Medical Diagnosis System
    P(HT|C) = 0.8
    P(HT|P) = 0.99
    Assume P(HT|C V P) = 1 (?)
  Assumption clearly not true
  Leak node – represents all other causes
    P(HT|O) = 0.9
  Define noise parameters – conditional probabilities for ¬HT
    P(¬HT|C) = 1 – P(HT|C) = 0.2
    P(¬HT|P) = 1 – P(HT|P) = 0.01
    P(¬HT|O) = 1 – P(HT|O) = 0.1
  Further assumption – the causes of a high temperature are independent of each other and the noise parameters are independent
Noisy-V Function
  Benefit of Noisy-V Function
  If cold, plague, and other are all false, P(¬HT) = 1
  Otherwise, P(¬HT) is equal to the product of the noise parameters for all the variables that are true
    E.g. if plague and other are true and cold is false, P(HT) = 1 – (0.01 * 0.1) = 0.999
  Benefit – don’t need to store as many values as the Bayesian belief network
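The rule above (commonly called noisy-OR) needs only one noise parameter per cause. A minimal sketch, using the noise parameters implied by P(HT|C) = 0.8, P(HT|P) = 0.99 and P(HT|O) = 0.9; the dictionary keys and function name are illustrative.

```python
import math

# Noise parameters P(not HT | cause) = 1 - P(HT | cause).
NOISE = {"cold": 0.2, "plague": 0.01, "other": 0.1}


def p_high_temperature(active_causes):
    """P(HT) given the set of causes that are true.

    P(not HT) is the product of the noise parameters of the active causes;
    with no active cause the empty product is 1, so P(HT) is 0.
    """
    p_not_ht = math.prod(NOISE[cause] for cause in active_causes)
    return 1.0 - p_not_ht


p_plague_and_other = p_high_temperature({"plague", "other"})
p_none = p_high_temperature(set())
print(round(p_plague_and_other, 4), p_none)
```

Three stored numbers cover all eight combinations of causes, whereas a full conditional probability table for HT with three Boolean parents would need eight entries.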
Bayes’ Optimal Classifier
  A system that uses Bayes’ theorem to classify data.
  We have a piece of data y, and are seeking the correct hypothesis from H_1 … H_5, each of which assigns a classification to y.
  The probability that y should be classified as c_j is:
    P(c_j | x_1, …, x_n) = Σ_{i=1..m} P(c_j | H_i) · P(H_i | x_1, …, x_n)
  x_1 to x_n are the training data, and m is the number of hypotheses.
  This method provides the best possible classification for a piece of data: on average, no other method using the same hypotheses and prior knowledge does better.
  Example: given some data, classify it as true or false

  Hypothesis  P(H_i | x_1, …, x_n)  P(false | H_i)  P(true | H_i)
  H_1         0.2                   0               1
  H_2         0.3                   0               1
  H_3         0.1                   1               0
  H_4         0.25                  0               1
  H_5         0.15                  1               0

    P(true | x_1, …, x_n) = 0.2 + 0.3 + 0.25 = 0.75
    P(false | x_1, …, x_n) = 0.1 + 0.15 = 0.25
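The example on this slide weights each hypothesis's vote by its posterior probability. A minimal sketch using the five hypotheses listed; the data layout and function name are illustrative.

```python
# (posterior P(H_i | x_1..x_n), P(true | H_i)) for H_1..H_5, from the slide.
HYPOTHESES = [(0.2, 1), (0.3, 1), (0.1, 0), (0.25, 1), (0.15, 0)]


def class_probability(hypotheses, label):
    """Sum over hypotheses of P(label | H_i) * P(H_i | training data)."""
    total = 0.0
    for posterior, p_true in hypotheses:
        p_label = p_true if label else 1 - p_true
        total += p_label * posterior
    return total


p_true = class_probability(HYPOTHESES, True)
p_false = class_probability(HYPOTHESES, False)
classification = p_true > p_false
print(round(p_true, 2), round(p_false, 2), classification)
```

The classifier therefore labels the data true (0.75 against 0.25). H_2, the single most probable hypothesis, also says true here, but the two methods need not agree in general.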
The Naïve Bayes Classifier (1)
  A vector of data is assigned a single classification: P(c_i | d_1, …, d_n)
  The classification with the highest posterior probability is chosen.
  The hypothesis which has the highest posterior probability is the maximum a posteriori, or MAP, hypothesis.
  In this case, we are looking for the MAP classification.
  Bayes’ theorem is used to find the posterior probability:
    P(c_i | d_1, …, d_n) = P(d_1, …, d_n | c_i) · P(c_i) / P(d_1, …, d_n)
The Naïve Bayes Classifier (2)
  Since P(d_1, …, d_n) is a constant, independent of c_i, we can eliminate it, and simply aim to find the classification c_i for which the following is maximised:
    P(d_1, …, d_n | c_i) · P(c_i)
  We now assume that all the attributes d_1, …, d_n are independent of each other, given the classification
  So P(d_1, …, d_n | c_i) can be rewritten as:
    P(d_1 | c_i) · P(d_2 | c_i) · … · P(d_n | c_i)
  The classification c_i for which P(c_i) times this product is highest is chosen to classify the data.
Classifier Example
  Training Data
    x  y  z  Classification
    2  3  2  A
    4  1  4  B
    1  3  2  A
    2  4  3  A
    4  2  4  B
    2  1  3  C
    1  2  4  A
    2  3  3  B
    2  2  4  A
    3  3  3  C
    3  2  1  A
    1  2  1  B
    2  1  4  A
    4  3  4  C
    2  2  4  A
  New piece of data to classify: (x = 2, y = 3, z = 4)
  Want P(c_i | x=2, y=3, z=4)
    P(A) * P(x=2|A) * P(y=3|A) * P(z=4|A)
    P(B) * P(x=2|B) * P(y=3|B) * P(z=4|B)
    P(C) * P(x=2|C) * P(y=3|C) * P(z=4|C)
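The products for this example can be computed directly from the training table. The sketch below estimates every probability as a raw frequency, with no smoothing; the function and variable names are illustrative.

```python
from collections import Counter

# Training rows (x, y, z, classification) copied from the slide.
TRAINING = [
    (2, 3, 2, "A"), (4, 1, 4, "B"), (1, 3, 2, "A"), (2, 4, 3, "A"), (4, 2, 4, "B"),
    (2, 1, 3, "C"), (1, 2, 4, "A"), (2, 3, 3, "B"), (2, 2, 4, "A"), (3, 3, 3, "C"),
    (3, 2, 1, "A"), (1, 2, 1, "B"), (2, 1, 4, "A"), (4, 3, 4, "C"), (2, 2, 4, "A"),
]


def naive_bayes_scores(rows, query):
    """Return P(c) * prod_j P(d_j | c) for each classification c."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(c)
        for j, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[j] == value)
            score *= matches / count  # P(d_j = value | c)
        scores[label] = score
    return scores


scores = naive_bayes_scores(TRAINING, (2, 3, 4))
best = max(scores, key=scores.get)
print({label: round(s, 4) for label, s in scores.items()}, best)
```

For A the product is 8/15 · 5/8 · 2/8 · 4/8 = 1/24, which beats B (1/120) and C (2/135), so the new data is classified as A. These scores are not normalised; dividing each by their sum would give posterior probabilities.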
M-estimate
  Problem with too little training data
    Data to classify: (x=1, y=2, z=2)
    P(x=1|B) = 1/4
    P(y=2|B) = 2/4
    P(z=2|B) = 0
    A single zero factor makes the whole product for B zero
  Avoid the problem by using the M-estimate, which pads the computation with additional samples
    Conditional probability = (a + mp) / (b + m)
    m = 5 (equivalent sample size)
    p = 1 / num_values_for_category (1/4 for x)
    a = number of training examples with the category value and the classification (for x=1 and B, a = 1)
    b = number of training examples with the classification (for B, b = 4)
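The M-estimate formula can be sketched as below, with m = 5 as on the slide. The slide gives p = 1/4 only for x. Using 1/4 for y and z as well is an assumption, based on each attribute taking four distinct values (1 to 4) in the training data.

```python
def m_estimate(a, b, m, p):
    """Smoothed conditional probability (a + m*p) / (b + m)."""
    return (a + m * p) / (b + m)


M = 5              # equivalent sample size, from the slide
P_UNIFORM = 1 / 4  # 1 / number of values; given for x, assumed for y and z
B_COUNT = 4        # training examples classified as B

p_x1_given_b = m_estimate(1, B_COUNT, M, P_UNIFORM)  # x=1 occurs once in B
p_y2_given_b = m_estimate(2, B_COUNT, M, P_UNIFORM)  # y=2 occurs twice in B
p_z2_given_b = m_estimate(0, B_COUNT, M, P_UNIFORM)  # z=2 never occurs in B

print(round(p_x1_given_b, 4), round(p_y2_given_b, 4), round(p_z2_given_b, 4))
```

The unseen combination z=2 with B now gets a small non-zero estimate (1.25/9), so it no longer wipes out the whole product for B.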
Collaborative Filtering
  A method that uses Bayesian reasoning to suggest items that a person might be interested in, based on their known interests.
  If we know that Anne and Bob both like A, B and C, and that Anne likes D, then we guess that Bob would also like D.
    P(Bob likes Z | Bob likes A, Bob likes B, …, Bob likes Y)
  Can be calculated using decision trees:
  [Decision tree diagram]