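Bayes Rule
$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
How is this rule derived?
Using Bayes rule for probabilistic inference:
$P(\text{Cause} \mid \text{Evidence}) = \dfrac{P(\text{Evidence} \mid \text{Cause})\, P(\text{Cause})}{P(\text{Evidence})}$
– P(Cause | Evidence): diagnostic probability
– P(Evidence | Cause): causal probability
Rev. Thomas Bayes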
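One way to answer the derivation question is to write the joint probability with the product rule in two different orders and equate the results. The sketch below uses generic events A and B (symbols assumed here for illustration) and then substitutes Cause and Evidence.

```latex
\begin{align*}
P(A, B) &= P(A \mid B)\, P(B) = P(B \mid A)\, P(A) \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0 \\
P(\text{Cause} \mid \text{Evidence}) &= \frac{P(\text{Evidence} \mid \text{Cause})\, P(\text{Cause})}{P(\text{Evidence})}
\end{align*}
```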
Bayesian decision theory
Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence E = e
– Partially observable, stochastic, episodic environment
– Examples:
  X = {spam, not spam}, e = message
  X = {zebra, giraffe, hippo}, e = image features
– The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
– What is the agent's optimal estimate of the value of X?
Maximum a posteriori (MAP) decision: under this 0/1 loss, the expected loss of guessing x is 1 − P(X = x | e), so the value of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e)
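MAP decision
X = x: value of query variable
E = e: evidence
$x^* = \arg\max_x P(X = x \mid E = e) = \arg\max_x \dfrac{P(E = e \mid X = x)\, P(X = x)}{P(E = e)} = \arg\max_x P(E = e \mid X = x)\, P(X = x)$
– P(X = x | E = e): posterior
– P(E = e | X = x): likelihood
– P(X = x): prior
The denominator P(E = e) does not depend on x, so it can be dropped from the maximization.
Maximum likelihood (ML) decision:
$x^* = \arg\max_x P(E = e \mid X = x)$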
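The difference between the MAP and ML decisions can be seen on a small numeric example. The sketch below is illustrative only: the prior and likelihood values are hypothetical numbers, and the function names are not from the slides.

```python
# Hypothetical numbers for one fixed observation e (illustration only).
prior = {"zebra": 0.7, "giraffe": 0.2, "hippo": 0.1}       # P(X = x)
likelihood = {"zebra": 0.2, "giraffe": 0.6, "hippo": 0.3}  # P(e | X = x)


def posterior(prior, likelihood):
    """Return P(X = x | e) for every x using Bayes rule."""
    joint = {x: likelihood[x] * prior[x] for x in prior}
    evidence = sum(joint.values())  # P(e), the normalizing constant
    return {x: p / evidence for x, p in joint.items()}


def map_decision(prior, likelihood):
    """Value of x maximizing likelihood * prior (P(e) does not change the argmax)."""
    return max(prior, key=lambda x: likelihood[x] * prior[x])


def ml_decision(likelihood):
    """Value of x maximizing the likelihood alone, ignoring the prior."""
    return max(likelihood, key=likelihood.get)


post = posterior(prior, likelihood)
# Under 0/1 loss, the expected loss of guessing x is 1 - P(X = x | e).
expected_loss = {x: 1.0 - p for x, p in post.items()}

print("MAP:", map_decision(prior, likelihood))
print("ML: ", ml_decision(likelihood))
```

With these numbers the MAP decision is zebra while the ML decision is giraffe, because the strong prior on zebra outweighs the higher likelihood of giraffe.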
Example: Spam Filter
We have X = {spam, ¬spam}, E = message. What should be our decision criterion?
– Compute P(spam | message) and P(¬spam | message), and assign the message to the class that gives the higher posterior probability
$P(\text{spam} \mid \text{message}) \propto P(\text{message} \mid \text{spam})\, P(\text{spam})$
$P(\neg\text{spam} \mid \text{message}) \propto P(\text{message} \mid \neg\text{spam})\, P(\neg\text{spam})$
Example: Spam Filter
We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
How do we represent the message?
– Bag of words model: the order of the words is not important
If the message consists of words (w_1, …, w_n), how do we compute P(w_1, …, w_n | spam)?
– Naïve Bayes assumption: each word is conditionally independent of the others given the message class
$P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})$
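Example: Spam Filter
Our filter will classify the message as spam if
$P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) > P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$
In practice, the likelihoods are small numbers and their product can underflow, so we compare logs instead:
$\log P(\text{spam}) + \sum_{i=1}^{n} \log P(w_i \mid \text{spam}) > \log P(\neg\text{spam}) + \sum_{i=1}^{n} \log P(w_i \mid \neg\text{spam})$
Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(w_i | spam), P(w_i | ¬spam)
These parameters need to be learned from a training set (a representative sample of messages marked with their classes)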
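The log-space decision rule can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the parameter values are hypothetical, the class label "not_spam" stands in for ¬spam, and the code assumes every word of the message already has a likelihood entry for both classes.

```python
import math

# Why logs are needed: a product of many small likelihoods underflows.
tiny = [1e-3] * 400
product = math.prod(tiny)                 # underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in tiny)  # stays finite (about -2763.1)

# Hypothetical model parameters (not learned from real data).
params = {
    "spam": {
        "log_prior": math.log(0.4),
        "log_likelihood": {
            "free": math.log(0.05),
            "money": math.log(0.04),
            "meeting": math.log(0.001),
        },
    },
    "not_spam": {
        "log_prior": math.log(0.6),
        "log_likelihood": {
            "free": math.log(0.005),
            "money": math.log(0.005),
            "meeting": math.log(0.03),
        },
    },
}


def log_score(words, log_prior, log_likelihood):
    """log P(class) + sum_i log P(w_i | class) under the Naive Bayes assumption."""
    return log_prior + sum(log_likelihood[w] for w in words)


def classify(words, params):
    """Return the class with the highest log score (the MAP decision)."""
    scores = {
        label: log_score(words, p["log_prior"], p["log_likelihood"])
        for label, p in params.items()
    }
    return max(scores, key=scores.get)


print(classify(["free", "money"], params))
print(classify(["meeting"], params))
```

Because the logarithm is monotonic, comparing log scores gives the same decision as comparing the original products.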
Parameter estimation
Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(w_i | spam), P(w_i | ¬spam)
Estimation by empirical word frequencies in the training set:
$P(w_i \mid \text{spam}) = \dfrac{\text{\# of occurrences of } w_i \text{ in spam messages}}{\text{total \# of words in spam messages}}$
– This happens to be the parameter estimate that maximizes the likelihood of the training data:
$\prod_{d=1}^{D} \prod_{i=1}^{n_d} P(w_{d,i} \mid \text{class}_d)$
d: index of training document (D documents in total), i: index of a word (n_d words in document d)
Parameter estimation
Estimation by empirical word frequencies in the training set gives probability zero to any word that never occurs in a class, and a single zero factor makes the whole product zero.
Parameter smoothing: dealing with words that were never seen or seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did
$P(w_i \mid \text{spam}) = \dfrac{\text{\# of occurrences of } w_i \text{ in spam messages} + 1}{\text{total \# of words in spam messages} + V}$
V: total number of unique words in the vocabulary
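Estimation with Laplacian smoothing can be sketched as a counting routine. This is a minimal illustration under stated assumptions: the three training messages are made up, the priors are estimated as the fraction of messages in each class (the slides do not say how priors are estimated), and the names `train` and `alpha` are hypothetical.

```python
from collections import Counter


def train(messages, alpha=1):
    """Estimate priors and smoothed word likelihoods.

    messages: list of (list_of_words, label) pairs.
    alpha: pseudo-count added to every vocabulary word (1 = Laplacian smoothing).
    """
    vocab = {w for words, _ in messages for w in words}
    label_counts = Counter(label for _, label in messages)

    # Count word occurrences separately for each class.
    word_counts = {label: Counter() for label in label_counts}
    for words, label in messages:
        word_counts[label].update(words)

    # Assumption: prior = fraction of training messages with that label.
    priors = {label: n / len(messages) for label, n in label_counts.items()}

    likelihoods = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[label] = {
            w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab
        }
    return priors, likelihoods


# Made-up training set, for illustration only.
training = [
    (["free", "money", "free"], "spam"),
    (["meeting", "tomorrow"], "not_spam"),
    (["free", "meeting"], "not_spam"),
]
priors, likelihoods = train(training)
print(likelihoods["spam"]["free"])      # (2 + 1) / (3 + 4)
print(likelihoods["spam"]["tomorrow"])  # (0 + 1) / (3 + 4), nonzero despite zero count
```

The smoothed likelihoods for each class still sum to one over the vocabulary, because the V added to the denominator matches the V pseudo-counts added across the numerators.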
Bayesian decision making: Summary
Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
– Inference problem: given some evidence E = e, what is P(X | e)?
– Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x_1, e_1), …, (x_n, e_n)}
Bag-of-word models for images
Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
Bag-of-word models for images
1. Extract image features
2. Learn “visual vocabulary”
3. Map image features to visual words
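The three steps can be sketched in plain Python. Several things here are assumptions, not taken from the slides: random vectors stand in for real image feature descriptors (step 1), the vocabulary is learned with a basic k-means clustering (a common choice for step 2, not necessarily the one intended), and the final histogram of visual-word counts is added to show how the mapped words give a bag-of-words representation. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Step 3: index of the visual word (cluster center) closest to a descriptor."""
    return min(range(len(vocabulary)), key=lambda j: sq_dist(descriptor, vocabulary[j]))


def learn_vocabulary(descriptors, k, n_iters=10, seed=0):
    """Step 2 (assumed method): basic k-means; the k centers are the visual words."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bag_of_visual_words(descriptors, vocabulary):
    """Count how often each visual word occurs among an image's descriptors."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1
    return histogram


# Step 1 placeholder: random 4-D vectors instead of real image features.
rng = random.Random(1)
training_descriptors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(100)]
image_descriptors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]

vocabulary = learn_vocabulary(training_descriptors, k=5)
histogram = bag_of_visual_words(image_descriptors, vocabulary)
print(histogram)
```

The resulting histogram plays the same role for an image that word counts play for a message in the spam filter.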