Bayes Rule
The product rule gives us two ways to factor a joint probability:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Therefore,
P(A | B) = P(B | A) P(A) / P(B)
Why is this useful?
–Can get diagnostic probability P(Cavity | Toothache) from causal probability P(Toothache | Cavity)
–Can update our beliefs based on evidence
–Key tool for probabilistic inference
Rev. Thomas Bayes
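A minimal Python sketch of the rule itself; the cavity/toothache numbers below are made-up values used purely for illustration:

def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Diagnostic probability from causal probability (all three inputs are assumed values):
p_toothache_given_cavity = 0.6   # causal: P(Toothache | Cavity)
p_cavity = 0.2                   # prior: P(Cavity)
p_toothache = 0.25               # evidence: P(Toothache)
print(bayes(p_toothache_given_cavity, p_cavity, p_toothache))  # P(Cavity | Toothache) = 0.48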

Bayes Rule example Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

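Plugging the given numbers into Bayes' rule:
P(Rain | ForecastRain) = P(ForecastRain | Rain) P(Rain) / P(ForecastRain)
= (0.9)(0.014) / [(0.9)(0.014) + (0.1)(0.986)]
≈ 0.0126 / 0.1112 ≈ 0.11
So despite the forecast, the probability of rain on the wedding day is only about 11%.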

Bayes rule: Another example 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
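Working through Bayes' rule with these numbers:
P(Cancer | Positive) = P(Positive | Cancer) P(Cancer) / P(Positive)
= (0.8)(0.01) / [(0.8)(0.01) + (0.096)(0.99)]
= 0.008 / 0.10304 ≈ 0.078
So even after a positive mammography, the probability of breast cancer is only about 7.8%.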

Independence
Two events A and B are independent if and only if P(A ∧ B) = P(A) P(B)
–In other words, P(A | B) = P(A) and P(B | A) = P(B)
–This is an important simplifying assumption for modeling; e.g., Toothache and Weather can be assumed to be independent
Are two mutually exclusive events independent?
–No, but for mutually exclusive events we have P(A ∨ B) = P(A) + P(B)
Conditional independence: A and B are conditionally independent given C iff P(A ∧ B | C) = P(A | C) P(B | C)

Conditional independence: Example
Toothache: boolean variable indicating whether the patient has a toothache
Cavity: boolean variable indicating whether the patient has a cavity
Catch: whether the dentist's probe catches in the cavity
If the patient has a cavity, the probability that the probe catches in it doesn't depend on whether he/she has a toothache:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Therefore, Catch is conditionally independent of Toothache given Cavity
Likewise, Toothache is conditionally independent of Catch given Cavity:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
Equivalent statement: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Conditional independence: Example
How many numbers do we need to represent the joint probability table P(Toothache, Cavity, Catch)?
2^3 − 1 = 7 independent entries
Write out the joint distribution using the chain rule:
P(Toothache, Catch, Cavity) = P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity) = P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)
How many numbers do we need to represent these distributions?
1 + 2 + 2 = 5 independent numbers
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
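A short Python sketch of this factorization; the probability values below are made up for illustration, and only the structure (1 + 2 + 2 numbers) follows the slide:

# Reconstructing the full joint P(Toothache, Catch, Cavity) from the
# 1 + 2 + 2 = 5 numbers that conditional independence requires.
p_cavity = {True: 0.2, False: 0.8}                  # P(Cavity = true); 1 independent number
p_catch_given_cavity = {True: 0.9, False: 0.2}      # P(Catch = true | Cavity)
p_toothache_given_cavity = {True: 0.6, False: 0.1}  # P(Toothache = true | Cavity)

def joint(toothache, catch, cavity):
    """P(Toothache, Catch, Cavity) under the conditional independence assumption."""
    p_t = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
    p_c = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    return p_cavity[cavity] * p_c * p_t

# Sanity check: the 2^3 = 8 joint entries sum to 1.
total = sum(joint(t, c, v) for t in (True, False) for c in (True, False) for v in (True, False))
assert abs(total - 1.0) < 1e-9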

Probabilistic inference
Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
–Partially observable, stochastic, episodic environment
–Examples: X = {spam, not spam}, e = message; X = {zebra, giraffe, hippo}, e = image features
Bayes decision theory:
–The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
–The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e)
–This is the Maximum a Posteriori (MAP) decision

MAP decision
Value x of X that has the highest posterior probability given the evidence E = e:
x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x)
(posterior ∝ likelihood × prior)
Maximum likelihood (ML) decision:
x_ML = argmax_x P(e | x)
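A tiny numeric sketch of the difference, reusing the zebra/giraffe/hippo example above; the probability values are made up:

p_x = {"zebra": 0.80, "giraffe": 0.15, "hippo": 0.05}          # prior P(x) (assumed values)
p_e_given_x = {"zebra": 0.10, "giraffe": 0.40, "hippo": 0.20}  # likelihood P(e | x) (assumed values)

x_ml = max(p_e_given_x, key=p_e_given_x.get)             # argmax_x P(e | x)       -> "giraffe"
x_map = max(p_x, key=lambda x: p_e_given_x[x] * p_x[x])  # argmax_x P(e | x) P(x)  -> "zebra"

With a prior this strongly in favor of zebra, the ML and MAP decisions disagree: the likelihood alone picks giraffe, while the posterior picks zebra.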

Naïve Bayes model
Suppose we have many different types of observations (symptoms, features) E_1, …, E_n that we want to use to obtain evidence about an underlying hypothesis X
MAP decision involves estimating
P(X = x | E_1, …, E_n) ∝ P(X = x) P(E_1, …, E_n | X = x)
–If each feature E_i can take on k values, how many entries are in the (conditional) joint probability table P(E_1, …, E_n | X = x)? (k^n entries for each value of x)

Naïve Bayes model
Suppose we have many different types of observations (symptoms, features) E_1, …, E_n that we want to use to obtain evidence about an underlying hypothesis X
MAP decision involves estimating P(X = x | E_1, …, E_n) ∝ P(X = x) P(E_1, …, E_n | X = x)
We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:
P(E_1, …, E_n | X = x) = P(E_1 | X = x) P(E_2 | X = x) … P(E_n | X = x)
–If each feature can take on k values, what is the complexity of storing the resulting distributions? (n tables of k values each, i.e., O(nk) per value of x, instead of k^n)
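For a concrete sense of the savings: with n = 20 binary features (k = 2), the full conditional joint table has 2^20 ≈ 1,000,000 entries per hypothesis value, while the naïve Bayes factorization needs only 20 × 2 = 40.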

Naïve Bayes Spam Filter
MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)
We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)

Naïve Bayes Spam Filter
We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
The message is a sequence of words (w_1, …, w_n)
Bag of words representation
–The order of the words in the message is not important
–Each word is conditionally independent of the others given the message class (spam or not spam)
Under this assumption, P(message | spam) = P(w_1, …, w_n | spam) = ∏_i P(w_i | spam)
Our filter will classify the message as spam if
P(spam) ∏_i P(w_i | spam) > P(¬spam) ∏_i P(w_i | ¬spam)
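A minimal sketch of this decision rule in Python, assuming the parameters (the prior and the per-word likelihoods) have already been estimated; the names below are placeholders, and log-probabilities are used only to avoid numerical underflow on long messages:

import math

def classify_as_spam(words, p_spam, p_word_given_spam, p_word_given_ham):
    """Return True iff P(spam) * prod_i P(w_i | spam) > P(not spam) * prod_i P(w_i | not spam)."""
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for w in words:
        log_spam += math.log(p_word_given_spam[w])
        log_ham += math.log(p_word_given_ham[w])
    return log_spam > log_ham

# Toy usage with made-up parameters:
p_word_given_spam = {"free": 0.05, "meeting": 0.002}
p_word_given_ham = {"free": 0.005, "meeting": 0.03}
print(classify_as_spam(["free", "free", "meeting"], 0.33, p_word_given_spam, p_word_given_ham))  # True

Words missing from the tables would raise a KeyError here; the smoothing discussed a few slides below is what handles unseen words in practice.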

Bag of words illustration
[figure: US Presidential Speeches Tag Cloud]

Naïve Bayes Spam Filter
P(spam | w_1, …, w_n) ∝ P(spam) ∏_i P(w_i | spam)
(posterior ∝ prior × likelihood)

Parameter estimation
In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)
–These are the parameters of the probabilistic model
–How do we obtain the values of these parameters?
[figure: example parameter tables — prior P(spam) = 0.33, P(¬spam) = 0.67, alongside tables of P(word | spam) and P(word | ¬spam)]

Parameter estimation
How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
–Empirically: use training data
P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)
–This is the maximum likelihood (ML) estimate, i.e., the estimate that maximizes the likelihood of the training data:
∏_d ∏_i P(w_{d,i} | class_d), where d is the index of a training document and i is the index of a word

Parameter estimation
How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
–Empirically: use training data
P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)
Parameter smoothing: dealing with words that were never seen or seen too few times
–Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did
P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V)
(V: total number of unique words)
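A sketch of this estimation step, assuming the training data is a list of (word list, label) pairs; the function name and data layout are illustrative, not from the slides:

from collections import Counter

def estimate_parameters(training_data, vocabulary):
    """ML estimate of the prior plus Laplacian-smoothed (add-one) word likelihoods.

    training_data: list of (words, label) pairs, where label is "spam" or "ham".
    """
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = {"spam": 0, "ham": 0}
    for words, label in training_data:
        counts[label].update(words)
        n_docs[label] += 1

    p_spam = n_docs["spam"] / (n_docs["spam"] + n_docs["ham"])
    V = len(vocabulary)  # total number of unique words
    likelihoods = {}
    for label in ("spam", "ham"):
        total = sum(counts[label].values())
        # Add-one smoothing: pretend every vocabulary word was seen one extra time.
        likelihoods[label] = {w: (counts[label][w] + 1) / (total + V) for w in vocabulary}
    return p_spam, likelihoods["spam"], likelihoods["ham"]

The smoothed likelihoods and the prior can then be plugged directly into the classification sketch above.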

Summary of model and parameters
Naïve Bayes model:
P(spam | w_1, …, w_n) ∝ P(spam) ∏_i P(w_i | spam)
P(¬spam | w_1, …, w_n) ∝ P(¬spam) ∏_i P(w_i | ¬spam)
Model parameters:
–prior: P(spam), P(¬spam)
–likelihood of spam: P(w_1 | spam), P(w_2 | spam), …, P(w_n | spam)
–likelihood of ¬spam: P(w_1 | ¬spam), P(w_2 | ¬spam), …, P(w_n | ¬spam)

Bag-of-word models for images
Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

Bag-of-word models for images
1. Extract image features
2. Learn "visual vocabulary"
3. Map image features to visual words

Bayesian decision making: Summary
Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
Inference problem: given some evidence E = e, what is P(X | e)?
Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x_1, e_1), …, (x_n, e_n)}