Bayesian Networks Tamara Berg CS 560 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan

Announcements HW3 released online this afternoon, due Oct 29, 11:59pm. Demo by Mark Zhao!

Bayesian decision making Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E. Inference problem: given some evidence E = e, what is P(X | e)? Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1, e1), …, (xn, en)}

Inference Graphs can have observed (shaded) and unobserved nodes. If nodes are always unobserved they are called hidden or latent variables. Probabilistic inference is the problem of computing a conditional probability distribution over the values of some of the nodes (the “hidden” or “unobserved” nodes), given the values of other nodes (the “evidence” or “observed” nodes).

Probabilistic inference A general scenario: –Query variables: X –Evidence (observed) variables: E = e –Unobserved variables: Y If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

Inference Inference: calculating some useful quantity from a joint probability distribution. Examples: –Posterior probability: P(Q | E1 = e1, …, Ek = ek) –Most likely explanation: argmax_q P(Q = q | E1 = e1, …, Ek = ek) (Example network: the burglary net with nodes B, E, A, J, M.)

Inference – computing conditional probabilities Marginalization: P(X = x) = Σ_y P(X = x, Y = y) Conditional probabilities: P(X | e) = P(X, e) / P(e), where P(e) = Σ_x P(X = x, e)

Inference by Enumeration Given unlimited time, inference in BNs is easy. Recipe: –State the marginal probabilities you need –Figure out ALL the atomic probabilities you need –Calculate and combine them Example: the burglary network with nodes B, E, A, J, M

Example: Enumeration In this simple method, we only need the BN to synthesize the joint entries.
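As a concrete illustration (not part of the original deck), here is a minimal Python sketch of inference by enumeration on a burglar-alarm network; the CPT numbers are assumed, commonly used textbook values, and the helper names are chosen for this example.

import itertools
# Inference by enumeration on the burglar-alarm net (B, E, A, J, M).
# CPT numbers below are assumed illustrative values, not taken from the slides.
p_b, p_e = 0.001, 0.002                                  # P(B=true), P(E=true)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}       # P(A=true | B, E)
p_j = {True: 0.90, False: 0.05}                          # P(J=true | A)
p_m = {True: 0.70, False: 0.01}                          # P(M=true | A)
def pr(p_true, value):
    """Probability of a boolean value, given P(var = true)."""
    return p_true if value else 1.0 - p_true
def joint(b, e, a, j, m):
    """One entry of the full joint, synthesized from the CPTs."""
    return (pr(p_b, b) * pr(p_e, e) * pr(p_a[(b, e)], a) *
            pr(p_j[a], j) * pr(p_m[a], m))
def query_burglary(j, m):
    """P(B | J=j, M=m): sum the joint over the hidden vars E, A, then normalize."""
    unnorm = {b: sum(joint(b, e, a, j, m)
                     for e, a in itertools.product((True, False), repeat=2))
              for b in (True, False)}
    z = sum(unnorm.values())
    return {b: p / z for b, p in unnorm.items()}
print(query_burglary(j=True, m=True))   # with these numbers, P(B=true | +j, +m) is roughly 0.28

The recipe is exactly the slide's: state the marginal you need, synthesize every atomic probability from the CPTs, then combine and normalize.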

Probabilistic inference A general scenario: –Query variables: X –Evidence (observed) variables: E = e –Unobserved variables: Y If we know the full joint distribution P(X, E, Y), how can we perform inference about X? Problems: –Full joint distributions are too large –Marginalizing out Y may involve too many summation terms

Inference by Enumeration?

Variable Elimination Why is inference by enumeration on a Bayes Net inefficient? –You join up the whole joint distribution before you sum out the hidden variables –You end up repeating a lot of work! Idea: interleave joining and marginalizing! –Called “Variable Elimination” –Choosing the elimination order that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration

General Variable Elimination Query: P(Q | E1 = e1, …, Ek = ek) Start with initial factors: –Local CPTs (but instantiated by evidence) While there are still hidden variables (not Q or evidence): –Pick a hidden variable H –Join all factors mentioning H –Eliminate (sum out) H Join all remaining factors and normalize
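A minimal Python sketch of this recipe (an illustration, not code from the course): factors are represented as (variables, table) pairs over boolean variables, evidence is assumed to have already been instantiated into the initial factors, and the join / sum-out helpers follow the two steps named on the slide.

from itertools import product
# A factor is a pair (variables, table): variables is a tuple of names and
# table maps each boolean assignment tuple to a number.
def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    variables = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assignment in product((True, False), repeat=len(variables)):
        env = dict(zip(variables, assignment))
        table[assignment] = (t1[tuple(env[v] for v in v1)] *
                             t2[tuple(env[v] for v in v2)])
    return variables, table
def sum_out(factor, var):
    """Eliminate var from a factor by summing over its values."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for assignment, value in table.items():
        key = assignment[:i] + assignment[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + value
    return new_vars, new_table
def variable_elimination(factors, hidden_vars):
    """Interleave joining and summing out, then combine remaining factors and normalize."""
    for h in hidden_vars:
        related = [f for f in factors if h in f[0]]
        factors = [f for f in factors if h not in f[0]]
        if related:
            joined = related[0]
            for f in related[1:]:
                joined = join(joined, f)
            factors.append(sum_out(joined, h))
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f)
    variables, table = result
    z = sum(table.values())
    return variables, {a: v / z for a, v in table.items()}

Picking the elimination order well matters for speed, but, as the slide notes, any sensible order already beats full enumeration.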

Example: Variable elimination [based on slides taken from UMBC CMSC 671, 2005]
Query: what is the probability that a student attends class, given that they pass the exam?
Network: attend, study → prepared; prepared, attend, fair → pass
Priors: P(at) = .8, P(st) = .6, P(fa) = .9
P(pr | at, st):  at=T,st=T: .9   at=T,st=F: .5   at=F,st=T: .7   at=F,st=F: .1
P(pa | pr, at, fa):  TTT: .9  TTF: .1  TFT: .7  TFF: .1  FTT: .7  FTF: .1  FFT: .2  FFF: .1  (column order: pr, at, fa)

Join study factors: multiply P(pr | at, st) by P(st) to form the joint factor P(pr, st | at). (P(at), P(fa), and P(pa | pr, at, fa) are unchanged.)
 pr st at | P(pr|at,st)  P(st)  P(pr,st|at)
 T  T  T  |   .9          .6     .54
 T  F  T  |   .5          .4     .20
 T  T  F  |   .7          .6     .42
 T  F  F  |   .1          .4     .04
 F  T  T  |   .1          .6     .06
 F  F  T  |   .5          .4     .20
 F  T  F  |   .3          .6     .18
 F  F  F  |   .9          .4     .36

Marginalize out study: sum the joint factor over study to obtain P(pr | at); the prepared and study nodes merge into a single factor.
 pr st at | P(pr|at,st)  P(st)  P(pr,st|at)  P(pr|at)
 T  T  T  |   .9          .6     .54          .74
 T  F  T  |   .5          .4     .20
 T  T  F  |   .7          .6     .42          .46
 T  F  F  |   .1          .4     .04
 F  T  T  |   .1          .6     .06          .26
 F  F  T  |   .5          .4     .20
 F  T  F  |   .3          .6     .18          .54
 F  F  F  |   .9          .4     .36

Remove “study”: the remaining factors are P(at), P(fa), P(pa | pr, at, fa), and the new factor
 pr at | P(pr|at)
 T  T  |  .74
 T  F  |  .46
 F  T  |  .26
 F  F  |  .54

Join factors “fair”: multiply P(pa | pr, at, fa) by P(fa) to form the joint factor P(pa, fa | at, pr). (Only the pa = T rows are shown; P(at) and P(pr | at) are unchanged.)
 pa pr at fa | P(pa|pr,at,fa)  P(fa)  P(pa,fa|at,pr)
 T  T  T  T  |   .9             .9     .81
 T  T  T  F  |   .1             .1     .01
 T  T  F  T  |   .7             .9     .63
 T  T  F  F  |   .1             .1     .01
 T  F  T  T  |   .7             .9     .63
 T  F  T  F  |   .1             .1     .01
 T  F  F  T  |   .2             .9     .18
 T  F  F  F  |   .1             .1     .01

Marginalize out “fair”: sum over fair to obtain P(pa | at, pr); the pass and fair nodes merge into a single factor.
 pa pr at fa | P(pa|pr,at,fa)  P(fa)  P(pa,fa|at,pr)  P(pa|at,pr)
 T  T  T  T  |   .9             .9     .81             .82
 T  T  T  F  |   .1             .1     .01
 T  T  F  T  |   .7             .9     .63             .64
 T  T  F  F  |   .1             .1     .01
 T  F  T  T  |   .7             .9     .63             .64
 T  F  T  F  |   .1             .1     .01
 T  F  F  T  |   .2             .9     .18             .19
 T  F  F  F  |   .1             .1     .01

Marginalize out “fair” (result): the remaining factors are P(at), P(pr | at), and the new factor
 pa pr at | P(pa|at,pr)
 T  T  T  |  .82
 T  T  F  |  .64
 T  F  T  |  .64
 T  F  F  |  .19

Join factors “prepared”: multiply P(pa | at, pr) by P(pr | at) to form the joint factor P(pa, pr | at). (Only pa = T rows shown; P(at) is unchanged.)
 pa pr at | P(pa|at,pr)  P(pr|at)  P(pa,pr|at)
 T  T  T  |  .82           .74      .6068
 T  T  F  |  .64           .46      .2944
 T  F  T  |  .64           .26      .1664
 T  F  F  |  .19           .54      .1026

Join factors “prepared” (continued): the pass and prepared nodes merge into a single factor; summing out prepared gives P(pa | at).
 pa pr at | P(pa|at,pr)  P(pr|at)  P(pa,pr|at)  P(pa|at)
 T  T  T  |  .82           .74      .6068        .7732
 T  T  F  |  .64           .46      .2944        .3970
 T  F  T  |  .64           .26      .1664
 T  F  F  |  .19           .54      .1026

Join factors “prepared” (result): the remaining factors are P(at) and the new factor
 pa at | P(pa|at)
 T  T  |  .7732
 T  F  |  .397

Join factors: multiply P(pa | at) by P(at) to form P(pa, at), then normalize over attend to answer the query.
 pa at | P(pa|at)  P(at)  P(pa,at)  Normalized: P(at|pa)
 T  T  |  .7732     .8     .6186     .886
 T  F  |  .397      .2     .0794     .114

Join factors (final): the attend and pass nodes merge into a single factor; normalizing gives P(at = T | pa = T) ≈ .886 and P(at = F | pa = T) ≈ .114, so a student who passes attended class with probability of about 0.89.
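For readers who want to check the arithmetic, here is a short Python script (not from the slides) that recomputes P(attend | pass) by brute-force enumeration over the same CPTs; it should agree with the variable-elimination result above.

from itertools import product
# Enumeration check of the worked example: P(attend | pass = T).
p_at, p_st, p_fa = 0.8, 0.6, 0.9
p_pr = {(True, True): 0.9, (True, False): 0.5,     # keyed by (attend, study)
        (False, True): 0.7, (False, False): 0.1}
p_pa = {(True, True, True): 0.9, (True, True, False): 0.1,    # keyed by (prepared, attend, fair)
        (True, False, True): 0.7, (True, False, False): 0.1,
        (False, True, True): 0.7, (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}
def pr(p_true, v):
    return p_true if v else 1.0 - p_true
unnorm = {True: 0.0, False: 0.0}
for at, st, fa, prep in product((True, False), repeat=4):
    unnorm[at] += (pr(p_at, at) * pr(p_st, st) * pr(p_fa, fa) *
                   pr(p_pr[(at, st)], prep) * p_pa[(prep, at, fa)])   # pass fixed to True
z = sum(unnorm.values())
print({at: round(p / z, 3) for at, p in unnorm.items()})   # {True: 0.886, False: 0.114}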

Bayesian network inference: Big picture Exact inference is intractable –There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks Approximate inference –Sampling, variational methods, message passing / belief propagation…

Approximate Inference: Sampling


Sampling – the basics... Scrooge McDuck gives you an ancient coin. He wants to know what P(H) is. You have no homework, and nothing good is on television – so you toss it 1 million times. You obtain x Heads and x Tails. What is P(H)?

Sampling – the basics... P(H) = 0.7. Why?
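Spelling out the estimate implicit in the answer (not stated explicitly on the slide): the relative-frequency, i.e. maximum-likelihood, estimate of the coin's bias is

\hat{P}(H) = \frac{\#\text{heads observed}}{\#\text{tosses}}

so, for example, 700,000 heads out of 1 million tosses would give 0.7.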

Monte Carlo Method Who is more likely to win? Green or Purple? What is the probability that green wins, P(G)? Two ways to solve this: 1. Compute the exact probability. 2. Play 100,000 games and see how many times green wins.

Approximate Inference Sampling is a hot topic in machine learning, and it’s really simple. Basic idea: –Draw N samples from a distribution –Compute an approximate posterior probability –Show this converges to the true probability P Why sample? –Learning: get samples from a distribution you don’t know –Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)

Forward Sampling Network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass
P(C):  +c 0.5   -c 0.5
P(S | C):  +c: +s 0.1, -s 0.9   -c: +s 0.5, -s 0.5
P(R | C):  +c: +r 0.8, -r 0.2   -c: +r 0.2, -r 0.8
P(W | S, R):  +s,+r: +w 0.99, -w 0.01   +s,-r: +w 0.90, -w 0.10   -s,+r: +w 0.90, -w 0.10   -s,-r: +w 0.01, -w 0.99
Samples: +c, -s, +r, +w   -c, +s, -r, +w   …

Forward Sampling This process generates samples with probability S_PS(x1, …, xn) = Π_i P(xi | Parents(Xi)), i.e. the BN’s joint probability. If we sample n events, then in the limit we would converge to the true joint probabilities. Therefore, the sampling procedure is consistent.
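A minimal Python sketch of forward (prior) sampling on the sprinkler network, using the CPT numbers from the slide above (the helper names are ours, not the course's):

import random
# Sprinkler-network CPTs from the slide (each entry is the probability of the + value).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                     # P(+s | C)
P_R = {True: 0.8, False: 0.2}                     # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,   # P(+w | S, R)
       (False, True): 0.90, (False, False): 0.01}
def flip(p):
    return random.random() < p
def forward_sample():
    """Sample (C, S, R, W) in topological order, each given its already-sampled parents."""
    c = flip(P_C)
    s = flip(P_S[c])
    r = flip(P_R[c])
    w = flip(P_W[(s, r)])
    return c, s, r, w
samples = [forward_sample() for _ in range(100_000)]
print(sum(s[3] for s in samples) / len(samples))   # estimate of P(+w)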

Example We’ll get a bunch of samples from the BN: +c, -s, +r, +w; +c, +s, +r, +w; -c, +s, +r, -w; +c, -s, +r, +w; -c, -s, -r, +w. If we want to know P(W): –We have counts: +w appears 4 times, -w appears once –Normalize to get P(W) = <+w: 0.8, -w: 0.2> (an estimate from only five samples) –This will get closer to the true distribution with more samples –Can estimate anything else, too –What about P(C| +w)? P(C| +r, +w)? P(C| -r, -w)? –Fast: can use fewer samples if less time (what’s the drawback?)

Rejection Sampling Let’s say we want P(C): –No point keeping all samples around –Just tally counts of C as we go Let’s say we want P(C| +s): –Same thing: tally C outcomes, but ignore (reject) samples which don’t have S = +s –This is called rejection sampling –It is also consistent for conditional probabilities (i.e., correct in the limit) Samples: +c, -s, +r, +w; +c, +s, +r, +w; -c, +s, +r, -w; +c, -s, +r, +w; -c, -s, -r, +w
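Continuing the sketch above (again an illustration, reusing forward_sample from the forward-sampling snippet), rejection sampling for P(+c | +s) simply throws away samples that disagree with the evidence:

def rejection_estimate(n=100_000):
    """Estimate P(+c | +s): tally C only over samples consistent with S = +s."""
    kept = c_true = 0
    for _ in range(n):
        c, s, r, w = forward_sample()
        if not s:            # reject samples that contradict the evidence S = +s
            continue
        kept += 1
        c_true += c
    return c_true / kept if kept else float("nan")
print(rejection_estimate())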

Sampling Example There are 2 cups. –The first contains 1 penny and 1 quarter –The second contains 2 quarters Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
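One way to work this out (not shown on the slide), by Bayes' rule or by tallying samples in the spirit of this section: the other coin is a quarter exactly when cup 2 was picked, and

P(\text{cup 2} \mid Q) = \frac{P(Q \mid \text{cup 2})\,P(\text{cup 2})}{P(Q \mid \text{cup 1})\,P(\text{cup 1}) + P(Q \mid \text{cup 2})\,P(\text{cup 2})} = \frac{1 \cdot \tfrac12}{\tfrac12 \cdot \tfrac12 + 1 \cdot \tfrac12} = \frac{2}{3}

so the probability that the other coin is also a quarter is 2/3.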

Likelihood Weighting Problem with rejection sampling: –If evidence is unlikely, you reject a lot of samples –You don’t exploit your evidence as you sample –Consider P(B | +a) Idea: fix evidence variables and sample the rest Problem: sample distribution not consistent! Solution: weight by probability of evidence given parents

Likelihood Weighting Sampling distribution if z is sampled and e is fixed evidence: S_WE(z, e) = Π_i P(zi | Parents(Zi)). Now, samples have weights: w(z, e) = Π_j P(ej | Parents(Ej)). Together, the weighted sampling distribution is consistent: S_WE(z, e) · w(z, e) = Π_i P(zi | Parents(Zi)) · Π_j P(ej | Parents(Ej)) = P(z, e)

Likelihood Weighting (same sprinkler network and CPTs as on the Forward Sampling slide). Samples: +c, +s, +r, +w …

Likelihood Weighting Example Inference: –Sum over the weights of samples that match the query value –Divide by the total sample weight What is P(C | +w, +r)? (Table of weighted samples with columns Cloudy, Rainy, Sprinkler, Wet Grass, Weight.)
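A sketch of likelihood weighting for this query (an illustration reusing flip and the CPT dictionaries from the forward-sampling snippet): the evidence R = +r, W = +w is fixed, only C and S are sampled, and each sample is weighted by the probability of the evidence given its parents.

def likelihood_weighted_estimate(n=100_000):
    """Estimate P(+c | +r, +w) with likelihood weighting."""
    total = c_weight = 0.0
    for _ in range(n):
        c = flip(P_C)                  # sample non-evidence variables as usual
        s = flip(P_S[c])
        weight = P_R[c]                # evidence R = +r: weight by P(+r | c), do not sample R
        weight *= P_W[(s, True)]       # evidence W = +w: weight by P(+w | s, +r)
        total += weight
        if c:
            c_weight += weight
    return c_weight / total
print(likelihood_weighted_estimate())   # approaches the exact P(+c | +r, +w) as n grows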

Likelihood Weighting Likelihood weighting is good: –We have taken evidence into account as we generate the sample –E.g. here, W’s value will get picked based on the evidence values of S, R –More of our samples will reflect the state of the world suggested by the evidence Likelihood weighting doesn’t solve all our problems: –Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence) We would like to consider evidence when we sample every variable.

Markov Chain Monte Carlo* Idea: instead of sampling from scratch, create samples that are each like the last one. Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(b | c): the evidence c stays fixed while the other variables are resampled one at a time. Properties: now samples are not independent (in fact they’re nearly identical), but sample averages are still consistent estimators! What’s the point: both upstream and downstream variables condition on evidence.

Gibbs Sampling
1. Set all evidence E to e.
2. Do forward sampling to obtain x1, ..., xn.
3. Repeat:
   1. Pick any variable Xi uniformly at random.
   2. Resample xi' from p(Xi | x1, ..., xi-1, xi+1, ..., xn).
   3. Set all other xj' = xj.
   4. The new sample is x1', ..., xn'.
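A Gibbs-sampling sketch for the sprinkler network (an illustration reusing flip and the CPT dictionaries from the forward-sampling snippet; the evidence choice R = +r, W = +w is ours): C and S are resampled in turn from their conditionals given everything else, which only involve each variable's Markov blanket.

def gibbs_estimate(n=100_000, burn_in=1_000):
    """Estimate P(+c | +r, +w) by Gibbs sampling over the free variables C and S."""
    c, s = flip(P_C), True             # arbitrary initial state for the free variables
    kept = c_count = 0
    for t in range(n + burn_in):
        # Resample C given S = s and evidence R = +r:  p(C) proportional to P(C) P(s|C) P(+r|C)
        pc_t = P_C * (P_S[True] if s else 1 - P_S[True]) * P_R[True]
        pc_f = (1 - P_C) * (P_S[False] if s else 1 - P_S[False]) * P_R[False]
        c = flip(pc_t / (pc_t + pc_f))
        # Resample S given C = c and evidence R = +r, W = +w:  p(S) proportional to P(S|c) P(+w|S,+r)
        ps_t = P_S[c] * P_W[(True, True)]
        ps_f = (1 - P_S[c]) * P_W[(False, True)]
        s = flip(ps_t / (ps_t + ps_f))
        if t >= burn_in:
            kept += 1
            c_count += c
    return c_count / kept
print(gibbs_estimate())   # should agree, in the limit, with the likelihood-weighting estimate

This systematic sweep over C then S is a common variant of step 3; picking the variable uniformly at random, as the slide states, works the same way.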

Markov Blanket Markov blanket of X: 1. All parents of X. 2. All children of X. 3. All parents of children of X (except X itself). X is conditionally independent from all other variables in the BN, given all variables in the Markov blanket (besides X).

Inference Algorithms Exact algorithms –Elimination algorithm –Sum-product algorithm –Junction tree algorithm Sampling algorithms –Importance sampling –Markov chain Monte Carlo Variational algorithms –Mean field methods –Sum-product algorithm and variations –Semidefinite relaxations

Summary Sampling can be really effective –The dominating approach to inference in BNs Approaches: –Forward (/Prior) Sampling –Rejection Sampling –Likelihood Weighted Sampling

Bayesian decision making Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E. Inference problem: given some evidence E = e, what is P(X | e)? Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1, e1), …, (xn, en)}

Learning in Bayes Nets Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net. Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

Parameter learning Suppose we know the network structure (but not the parameters), and have a training set of complete observations. Training set:
 Sample | C S R W
   1    | T F T T
   2    | F T F T
   3    | T F F F
   4    | T T T T
   5    | F T F T
   6    | T F T F
   …    | … … … …
(The CPT entries P(C), P(S | C), P(R | C), P(W | S, R) are the unknowns to be learned.)

Parameter learning Suppose we know the network structure (but not the parameters), and have a training set of complete observations –P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
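For instance (a sketch, not from the slides), estimating P(S | C) from the six training rows above by relative frequency:

from collections import Counter
# Rows (C, S, R, W) from the training-set slide above.
rows = [(True, False, True, True), (False, True, False, True),
        (True, False, False, False), (True, True, True, True),
        (False, True, False, True), (True, False, True, False)]
pair_counts = Counter((c, s) for c, s, _, _ in rows)
parent_counts = Counter(c for c, _, _, _ in rows)
p_s_given_c = {c: pair_counts[(c, True)] / parent_counts[c] for c in (True, False)}
print(p_s_given_c)   # P(+s | +c) = 1/4 = 0.25, P(+s | -c) = 2/2 = 1.0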

Parameter learning Incomplete observations: expectation maximization (EM) algorithm for dealing with missing data. Training set:
 Sample | C S R W
   1    | ? F T T
   2    | ? T F T
   3    | ? F F F
   4    | ? T T T
   5    | ? T F T
   6    | ? F T F
   …    | … … … …
(The CPT entries P(C), P(S | C), P(R | C), P(W | S, R) are still the unknowns, but now C is never observed.)

Learning from incomplete data Observe real-world data –Some variables are unobserved –Trick: fill them in EM algorithm –E = Expectation –M = Maximization (Training set as on the previous slide, with C unobserved.)

EM-Algorithm Initialize CPTs (e.g. randomly, uniformly). Repeat until convergence: –E-Step: Compute the posterior over the unobserved variable for all samples i and values x (inference) –M-Step: Update the CPT tables
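Spelled out for the table above, where only C is unobserved (the notation q_i is ours, not the slide's): with the current parameters, the E-step computes a soft assignment for each sample, and the M-step re-estimates the CPTs from the resulting expected counts.

\text{E-step: } q_i(c) = P(C = c \mid s_i, r_i, w_i) \propto P(c)\,P(s_i \mid c)\,P(r_i \mid c) \quad \text{for all samples } i \text{ and values } c

\text{M-step: } \hat{P}(c) = \frac{1}{N}\sum_i q_i(c), \qquad \hat{P}(s \mid c) = \frac{\sum_i q_i(c)\,\mathbf{1}[s_i = s]}{\sum_i q_i(c)}, \quad \text{and similarly for } \hat{P}(r \mid c)

The table P(W | S, R) can be estimated from ordinary counts, since S, R, and W are fully observed in every sample.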

Parameter learning What if the network structure is unknown? –Structure learning algorithms exist, but they are pretty complicated… Training set:
 Sample | C S R W
   1    | T F T T
   2    | F T F T
   3    | T F F F
   4    | T T T T
   5    | F T F T
   6    | T F T F
   …    | … … … …
(Now even the edges among C, S, R, W must be learned.)

Summary: Bayesian networks Joint Distribution Independence Inference Parameters Learning

Naïve Bayes

Bayesian inference, Naïve Bayes model

Bayes Rule The product rule gives us two ways to factor a joint probability: P(A, B) = P(B | A) P(A) = P(A | B) P(B). Therefore, P(A | B) = P(B | A) P(A) / P(B). Why is this useful? –Key tool for probabilistic inference: can get diagnostic probability from causal probability E.g., P(Cavity | Toothache) from P(Toothache | Cavity) Rev. Thomas Bayes

Bayes Rule example Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

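Working the numbers from the problem statement (this derivation is a reconstruction, with R = rain on the wedding day and F = a forecast of rain):

P(R \mid F) = \frac{P(F \mid R)\,P(R)}{P(F \mid R)\,P(R) + P(F \mid \neg R)\,P(\neg R)} = \frac{0.9 \times 0.014}{0.9 \times 0.014 + 0.1 \times 0.986} = \frac{0.0126}{0.1112} \approx 0.11

So despite the forecast, rain on the wedding day is only about 11% likely, because rainy days are so rare to begin with.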

Bayes rule: Another example 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
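For comparison, the same calculation for this slide (a reconstruction from the stated numbers, with C = has breast cancer and + = positive mammography):

P(C \mid +) = \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid \neg C)\,P(\neg C)} = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99} = \frac{0.008}{0.10304} \approx 0.078

That is under 8%, even though the test is fairly sensitive, because the prior P(C) is small.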

Probabilistic inference Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e –Examples: X = {spam, not spam}, e = message X = {zebra, giraffe, hippo}, e = image features

Bayesian decision theory Let x be the value predicted by the agent and x* be the true value of X. The agent has a loss function, which is 0 if x = x* and 1 otherwise. Expected loss: Σ_{x*} L(x, x*) P(x* | e). What is the estimate of X that minimizes the expected loss? –The one that has the greatest posterior probability P(x|e) –This is called the Maximum a Posteriori (MAP) decision

MAP decision Value x of X that has the highest posterior probability given the evidence E = e: x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) (posterior ∝ likelihood × prior). Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x)

Naïve Bayes model Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X. MAP decision involves estimating P(X | E1, …, En) ∝ P(X) P(E1, …, En | X) –If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

Naïve Bayes model Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X. MAP decision involves estimating P(X | E1, …, En) ∝ P(X) P(E1, …, En | X). We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E1, …, En | X) = Π_i P(Ei | X) –If each feature can take on k values, what is the complexity of storing the resulting distributions?

Exchangeability De Finetti Theorem of exchangeability (bag of words theorem): the joint probability distribution underlying the data is invariant to permutation.

Naïve Bayes model Posterior: P(X = x | E1 = e1, …, En = en) ∝ P(x) Π_i P(ei | x) (posterior ∝ prior × likelihood). MAP decision: choose the x that maximizes this product.

A Simple Example – Naïve Bayes C – Class, F – Features. We only specify (parameters): P(C), the prior over class labels, and P(F | C), how each feature depends on the class.

Slide from Dan Klein

Spam filter MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

Spam filter MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message) We have P(spam | message) ∝ P(message | spam)P(spam) and P(¬spam | message) ∝ P(message | ¬spam)P(¬spam) To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)

Naïve Bayes Representation Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) Likelihood: bag of words representation –The message is a sequence of words (w1, …, wn) –The order of the words in the message is not important –Each word is conditionally independent of the others given message class (spam or not spam)


Bag of words illustration US Presidential Speeches Tag Cloud



Naïve Bayes Representation Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) Likelihood: bag of words representation –The message is a sequence of words (w1, …, wn) –The order of the words in the message is not important –Each word is conditionally independent of the others given message class (spam or not spam) –Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)

Summary: Decision rule General MAP rule for Naïve Bayes: choose the class that maximizes posterior ∝ prior × likelihood, i.e. P(class | w1, …, wn) ∝ P(class) Π_i P(wi | class). Thus, the filter should classify the message as spam if P(spam) Π_i P(wi | spam) > P(¬spam) Π_i P(wi | ¬spam)
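A minimal Naïve Bayes spam-filter sketch in Python (an illustration; the tiny training set and helper names are made up, not taken from the course materials). It uses log probabilities so the product of many word likelihoods does not underflow, and add-one smoothing for words unseen in a class.

import math
from collections import Counter
def train(messages):
    """messages: list of (word_list, label) pairs with label 'spam' or 'ham'."""
    priors = Counter(label for _, label in messages)
    word_counts = {"spam": Counter(), "ham": Counter()}
    for words, label in messages:
        word_counts[label].update(words)
    vocab = {w for words, _ in messages for w in words}
    return priors, word_counts, vocab
def classify(words, priors, word_counts, vocab):
    """MAP decision: pick the label with the larger log posterior."""
    total = sum(priors.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(priors[label] / total)                  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)    # add-one smoothing
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)
training = [(["win", "money", "now"], "spam"),
            (["meeting", "at", "noon"], "ham"),
            (["free", "money"], "spam"),
            (["lunch", "at", "noon"], "ham")]
model = train(training)
print(classify(["free", "money", "now"], *model))   # -> 'spam'

Taking logs and smoothing are standard implementation details; they do not change the MAP decision rule above.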

Slide from Dan Klein

Percentage of documents in training set labeled as spam/ham Slide from Dan Klein

In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein

In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein