CSE 446: Naïve Bayes Winter 2012 Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & Dan Klein

Today
– Gaussians
– Naïve Bayes
– Text Classification

Long Ago
– Random variables, distributions
– Marginal, joint & conditional probabilities
– Sum rule, product rule, Bayes rule
– Independence, conditional independence

Last Time
                                   Prior     Hypothesis returned
  Maximum Likelihood Estimate      Uniform   The most likely
  Maximum A Posteriori Estimate    Any       The most likely
  Bayesian Estimate                Any       Weighted combination

Bayesian Learning
Use Bayes rule:
  P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)
Or equivalently:
  posterior ∝ data likelihood × prior   (P(data) is just a normalization constant)

Conjugate Priors?

Those Pesky Distributions
                                        Binary {0, 1}   M Values
  Single event                          Bernoulli       Multinomial
  Sequence (N trials, N = α_H + α_T)    Binomial        Multinomial
  Conjugate prior                       Beta            Dirichlet
(The event and sequence distributions are discrete; the Beta and Dirichlet priors are continuous.)

What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do for me? You say: Let me tell you about Gaussians…

Some properties of Gaussians
– Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
    X ~ N(μ, σ²), Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
– The sum of independent Gaussians is Gaussian:
    X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), Z = X + Y  ⇒  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
– Easy to differentiate, as we will see soon!
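A quick Monte Carlo check of these two properties (an illustrative Python sketch, not part of the original slides; all names and numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, size=1_000_000)

# Affine transformation: Y = aX + b should be N(a*mu + b, a^2 * sigma^2).
a, b = 0.5, 4.0
y = a * x + b
print(y.mean(), y.var())   # close to a*mu + b = 5.0 and a^2 * sigma^2 = 2.25

# Sum of independent Gaussians: Z = X + W should be N(mu + mu_w, sigma^2 + sigma_w^2).
w = rng.normal(-1.0, 2.0, size=1_000_000)
z = x + w
print(z.mean(), z.var())   # close to 2.0 + (-1.0) = 1.0 and 9.0 + 4.0 = 13.0
```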

Learning a Gaussian
– Collect a bunch of data
  – Hopefully i.i.d. samples, e.g., exam scores x_1, …, x_N
– Learn parameters
  – Mean: μ
  – Variance: σ²
(The slide's table of example exam scores x_i is omitted.)

MLE for Gaussian
Probability of i.i.d. samples D = {x_1, …, x_N}:
  P(D | μ, σ) = (1 / (σ√(2π)))^N ∏_{i=1}^{N} exp(−(x_i − μ)² / (2σ²))
Log-likelihood of the data:
  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σ_{i=1}^{N} (x_i − μ)² / (2σ²)

Your second learning algorithm: MLE for the mean of a Gaussian
What’s the MLE for the mean? Set the derivative of the log-likelihood with respect to μ to zero:
  d/dμ ln P(D | μ, σ) = Σ_i (x_i − μ) / σ² = 0  ⇒  μ_MLE = (1/N) Σ_i x_i

MLE for the variance
Again, set the derivative to zero:
  d/dσ ln P(D | μ, σ) = −N/σ + Σ_i (x_i − μ)² / σ³ = 0  ⇒  σ²_MLE = (1/N) Σ_i (x_i − μ_MLE)²

Learning Gaussian parameters
MLE:
  μ_MLE = (1/N) Σ_i x_i
  σ²_MLE = (1/N) Σ_i (x_i − μ_MLE)²
BTW, the MLE for the variance of a Gaussian is biased
– The expected value of the estimate is not the true parameter!
– Unbiased variance estimator: σ²_unbiased = (1/(N−1)) Σ_i (x_i − μ_MLE)²
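A minimal NumPy sketch of these estimators (illustrative; the function and variable names are my own, and the exam scores are made up):

```python
import numpy as np

def gaussian_mle(samples):
    """MLE for the mean and variance of a Gaussian, plus the unbiased variance."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    mu = x.sum() / n                                 # mu_MLE = (1/N) sum_i x_i
    var_mle = ((x - mu) ** 2).sum() / n              # biased: divides by N
    var_unbiased = ((x - mu) ** 2).sum() / (n - 1)   # unbiased: divides by N - 1
    return mu, var_mle, var_unbiased

scores = [85.0, 99.0, 77.0, 92.0, 89.0]              # hypothetical exam scores
print(gaussian_mle(scores))
```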

Bayesian learning of Gaussian parameters
Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart distribution
Prior for the mean:
  P(μ) = N(η, λ²), i.e., a Gaussian with its own mean η and variance λ²

MAP for the mean of a Gaussian
With prior μ ~ N(η, λ²) and known data variance σ², the MAP estimate is a precision-weighted combination of the sample mean and the prior mean:
  μ_MAP = ((N/σ²) x̄ + (1/λ²) η) / (N/σ² + 1/λ²)
As N grows, the data term dominates and μ_MAP approaches μ_MLE.
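A small sketch of this precision-weighted combination, assuming the data variance σ² is known; the prior values and data below are hypothetical:

```python
import numpy as np

def map_mean(samples, sigma2, prior_mean, prior_var):
    """MAP estimate of a Gaussian mean under a N(prior_mean, prior_var) prior,
    assuming the data variance sigma2 is known."""
    x = np.asarray(samples, dtype=float)
    data_precision = len(x) / sigma2      # N / sigma^2
    prior_precision = 1.0 / prior_var     # 1 / lambda^2
    return (data_precision * x.mean() + prior_precision * prior_mean) / (
        data_precision + prior_precision)

scores = [85.0, 99.0, 77.0]
print(map_mean(scores, sigma2=100.0, prior_mean=75.0, prior_var=25.0))
```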

Supervised Learning of Classifiers: Find f
Given: a training set {(x_i, y_i) | i = 1 … n}
Find: a good approximation to f : X → Y
Examples: what are X and Y?
– Spam detection: map an email to {Spam, Ham}  (classification)
– Digit recognition: map pixels to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}  (classification)
– Stock prediction: map new and historic prices, etc. to the real numbers

Bayesian Categorization
Let the set of categories be {c_1, c_2, …, c_n}, and let E be the description of an instance.
Determine the category of E by computing, for each c_i:
  P(c_i | E) = P(c_i) P(E | c_i) / P(E)
P(E) can be ignored, since it is a common factor across all categories.

Text classification
– Classify emails: Y = {Spam, NotSpam}
– Classify news articles: Y = {what is the topic of the article?}
– Classify webpages: Y = {Student, Professor, Project, …}
What should we use for the features, X?

Example: Spam Filter
Input: an email
Output: spam / ham
Setup:
– Get a large collection of example emails, each labeled “spam” or “ham”
– Note: someone has to hand-label all this data!
– Want to learn to predict the labels of new, future emails
Features: the attributes used to make the ham / spam decision
– Words: FREE!
– Text patterns: $dd, CAPS
– For email specifically, semantic features: SenderInContacts
– …
Example messages:
“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION ADDRESSES FOR ONLY $99”
“Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”

Features
X is the word sequence in the document; X_i is the i-th word in the article.

Features for Text Classification
X is the sequence of words in the document, so X (and hence P(X | Y)) is huge!!!
– An article has at least 1000 words, X = {X_1, …, X_1000}
– X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., the Webster Dictionary (or more): 10,000 words, etc.
– That gives 10,000^1000 possible documents, far more than the number of atoms in the universe (≈10^80): we may have a problem…

Bag of Words Model
Typical additional assumption:
– Position in the document doesn’t matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the same distribution)
– Ignore the order of words
Example: “When the lecture is over, remember to wake up the person sitting next to you in the lecture room.”

Bag of Words Model (continued)
Under this assumption the example sentence is just the multiset: in is lecture lecture next over person remember room sitting the the the to to up wake when you
Typical additional assumption:
– Position in the document doesn’t matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the same distribution)
– Ignore the order of words
– Sounds really silly, but often works very well!
– From now on: X_i = Boolean: “word i is in the document”; X = X_1 ∧ … ∧ X_n

Bag of Words Approach
Represent a document by its vector of word counts, e.g.:
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
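A minimal sketch of turning a document into such a count vector (the function name and tokenization are my own choices):

```python
import re
from collections import Counter

def bag_of_words(document):
    """Map a document to word counts, ignoring order and position."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)

doc = ("When the lecture is over, remember to wake up the "
       "person sitting next to you in the lecture room.")
print(bag_of_words(doc))   # Counter({'the': 3, 'lecture': 2, 'to': 2, ...})
```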

Bayesian Categorization
  P(y_i | X) ∝ P(y_i) P(X | y_i)
Need to know:
– Priors: P(y_i)
– Conditionals: P(X | y_i)
P(y_i) is easily estimated from data:
– If n_i of the examples in D are in class y_i, then P(y_i) = n_i / |D|
Conditionals:
– X = X_1 ∧ … ∧ X_n
– Estimate P(X_1 ∧ … ∧ X_n | y_i)
Problem! Too many possible instances to estimate (exponential in n), even with the bag of words assumption!

Need to Simplify Somehow
Too many probabilities:
  P(x_1 ∧ x_2 ∧ x_3 | spam), P(x_1 ∧ x_2 ∧ ¬x_3 | spam), P(x_1 ∧ ¬x_2 ∧ x_3 | spam), …, P(¬x_1 ∧ ¬x_2 ∧ ¬x_3 | ¬spam)
Can we assume some are the same?
  e.g., P(x_1 ∧ x_2) = P(x_1) P(x_2)?

Conditional Independence
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
  (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)
e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Equivalent to:
  P(X, Y | Z) = P(X | Z) P(Y | Z)

Naïve Bayes
Naïve Bayes assumption:
– Features are independent given the class:
    P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
– More generally:
    P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)
How many parameters now? Suppose X is composed of n binary features: only O(n) parameters per class, instead of O(2^n) for the full joint.

The Naïve Bayes Classifier
Given:
– Prior P(Y)
– n conditionally independent features X_1, …, X_n given the class Y
– For each X_i, a likelihood P(X_i | Y)
Decision rule:
  y* = h_NB(x) = argmax_y P(y) ∏_i P(x_i | y)
(Graphical model: a class node Y with an arrow to each feature node X_1, X_2, …, X_n.)

MLE for the parameters of NB
Given a dataset, count occurrences. The MLE for discrete NB is simply:
– Prior: P(Y = y) = Count(Y = y) / N
– Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Count(Y = y)
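A compact sketch of these counting estimates for Boolean bag-of-words features (illustrative and unsmoothed; function and variable names are mine):

```python
from collections import Counter, defaultdict

def train_nb_mle(docs, labels):
    """Unsmoothed MLE for Boolean-feature NB: P(y) and P(word present | y)."""
    class_counts = Counter(labels)
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    doc_freq = defaultdict(Counter)          # per class: number of docs containing each word
    for words, y in zip(docs, labels):
        for w in set(words):                 # X_i is Boolean: "word i is in document"
            doc_freq[y][w] += 1
    likelihoods = {y: {w: c / class_counts[y] for w, c in counts.items()}
                   for y, counts in doc_freq.items()}
    return priors, likelihoods

docs = [["free", "money", "now"], ["meeting", "at", "noon"], ["free", "meeting"]]
labels = ["spam", "ham", "ham"]
print(train_nb_mle(docs, labels))
```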

Subtleties of NB Classifier 1: Violating the NB Assumption
Usually, features are not conditionally independent:
  P(X_1, …, X_n | Y) ≠ ∏_i P(X_i | Y)
The estimated probabilities P(Y | X) are then often biased towards 0 or 1. Nonetheless, NB is the single most used classifier out there:
– NB often performs well, even when the assumption is violated
– [Domingos & Pazzani ’96] discuss some conditions for good performance

Subtleties of NB classifier 2: Overfitting
What if you never see a training example where X_i = a for class Y = b? Then the MLE gives P(X_i = a | Y = b) = 0, and a single zero factor wipes out the whole product: the classifier can never predict class b for such an instance, no matter what the other features say.

2 wins!!

For Binary Features: We already know the answer!
MAP: use the most likely parameter under the posterior.
A Beta prior is equivalent to extra observations for each feature.
As N → ∞, the prior is “forgotten”.
But for small sample sizes, the prior is important!

That’s Great for the Binomial
Works for Spam / Ham.
What about multiple classes?
– E.g., given a Wikipedia page, predicting its type

Multinomials: Laplace Smoothing
Laplace’s estimate:
– Pretend you saw every outcome k extra times: P_LAP,k(x) = (c(x) + k) / (N + k|X|)
– What’s Laplace with k = 0? (just the MLE)
– k is the strength of the prior
– Can derive this as a MAP estimate for a multinomial with a Dirichlet prior
Laplace for conditionals:
– Smooth each conditional distribution independently: P_LAP,k(x | y) = (c(x, y) + k) / (c(y) + k|X|)
Example: having observed the flips H, H, T, the MLE gives P(H) = 2/3, while Laplace with k = 1 gives P(H) = (2 + 1) / (3 + 2) = 3/5.
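A tiny sketch of the smoothed estimate (the function name is mine; the coin example matches the H, H, T flips above):

```python
def laplace_estimate(count_x, total, num_values, k=1.0):
    """P_LAP,k(x) = (c(x) + k) / (N + k * |X|)."""
    return (count_x + k) / (total + k * num_values)

# Observed H, H, T: MLE P(H) = 2/3, Laplace with k = 1 gives P(H) = 3/5.
print(laplace_estimate(count_x=2, total=3, num_values=2, k=1.0))  # 0.6
print(laplace_estimate(count_x=2, total=3, num_values=2, k=0.0))  # MLE: 0.666...
```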

Naïve Bayes for Text
Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w_1, w_2, …, w_m} based on the probabilities P(w_j | c_i).
Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in each category exactly once.

Easy to Implement
But… if you do… it probably won’t work…

Probabilities: Important Detail!
  P(spam | X_1 … X_n) = ∏_i P(spam | X_i)
Any more potential problems here?
– We are multiplying lots of small numbers: danger of underflow! (e.g., a product of 57 factors of 0.5 is already ≈ 7 × 10⁻¹⁸)
– Solution? Use logs and add: p_1 · p_2 = e^(log p_1 + log p_2)
– Always keep probabilities in log form.
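A short sketch of scoring in log space to avoid underflow (names are mine; the 57-factor example is only meant to show the magnitudes involved):

```python
import math

def log_score(log_prior, log_likelihoods):
    """Sum log-probabilities instead of multiplying probabilities."""
    return log_prior + sum(log_likelihoods)

probs = [0.5] * 57
print(math.prod(probs))                                          # about 6.9e-18: heading toward underflow
print(log_score(math.log(0.4), [math.log(p) for p in probs]))    # about -40.4, well-behaved
```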

Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes
– i.e., the class with maximum posterior probability…
– are usually fairly accurate (?!?!?)
However, due to the inadequacy of the conditional independence assumption…
– the actual posterior-probability estimates are not accurate
– output probabilities are generally very close to 0 or 1

Bag of words model
Under this assumption the example sentence is just: in is lecture lecture next over person remember room sitting the the the to to up wake when you
Typical additional assumption:
– Position in the document doesn’t matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the same distribution)
– “Bag of words” model: the order of words on the page is ignored
– Sounds really silly, but often works very well!

NB with Bag of Words for text classification
Learning phase:
– Prior P(Y): count how many documents come from each topic (+ prior)
– P(X_i | Y): for each topic, count how many times each word appears in documents of that topic (+ prior); remember this distribution is shared across all positions i
Test phase:
– For each document, use the naïve Bayes decision rule

Twenty News Groups results

Learning curve for Twenty News Groups

Example: Digit Recognition
Input: images / pixel grids
Output: a digit 0-9
Setup:
– Get a large collection of example images, each labeled with a digit
– Note: someone has to hand-label all this data!
– Want to learn to predict the labels of new, future digit images
Features: the attributes used to make the digit decision
– Pixels: (6,8) = ON
– Shape patterns: NumComponents, AspectRatio, NumLoops
– …

A Digit Recognizer Input: pixel grids Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs)
Simple version:
– One feature F_ij for each grid position
– Possible feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image
– Each input maps to a feature vector, e.g., one binary entry per pixel
– Here: lots of features, each binary valued
Naïve Bayes model:
  P(Y | F) ∝ P(Y) ∏_{i,j} P(F_{i,j} | Y)
Are the features independent given the class? What do we need to learn?
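A small sketch of the binarization step described above, thresholding pixel intensity at 0.5 (the grid values are made up):

```python
import numpy as np

def binary_pixel_features(image):
    """One Boolean feature F_ij per grid position: on iff intensity > 0.5."""
    image = np.asarray(image, dtype=float)
    return image > 0.5

grid = np.array([[0.0, 0.7, 0.9],
                 [0.2, 0.8, 0.1],
                 [0.0, 0.6, 0.0]])
print(binary_pixel_features(grid).astype(int))
```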

Example Distributions

What if we have continuous X_i?
E.g., character recognition: X_i is the intensity of the i-th pixel.
Gaussian Naïve Bayes (GNB):
  P(X_i = x | Y = y_k) = (1 / (σ_ik √(2π))) exp(−(x − μ_ik)² / (2σ_ik²))
Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Estimating Parameters: Y discrete, X_i continuous
Maximum likelihood estimates:
  Mean:     μ_ik = Σ_j X_i^j δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
  Variance: σ²_ik = Σ_j (X_i^j − μ_ik)² δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
where j indexes training examples and δ(z) = 1 if z is true, else 0.
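A minimal Gaussian NB sketch following these estimates, with a per-class, per-feature mean and variance (names and data are mine; the tiny variance floor is a numerical-stability choice the slides don't discuss):

```python
import numpy as np

def train_gnb(X, y):
    """Estimate P(y) and per-class mean/variance of each continuous feature."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),         # mu_ik
                     Xc.var(axis=0) + 1e-9)   # sigma^2_ik, with a tiny floor
    return params

def predict_gnb(params, x):
    """Pick the class maximizing log P(y) + sum_i log N(x_i; mu_ik, sigma^2_ik)."""
    def log_posterior(prior, mu, var):
        return (np.log(prior)
                - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))
    return max(params, key=lambda c: log_posterior(*params[c]))

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 0.5], [3.2, 0.4]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(train_gnb(X, y), np.array([1.1, 2.1])))   # -> 0
```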

Example: GNB for classifying mental states (fMRI)
– ~1 mm resolution
– ~2 images per second
– 15,000 voxels/image
– non-invasive, safe
– measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response ~10 sec)
[Mitchell et al.]

Brain scans can track activation with precision and sensitivity [Mitchell et al.]

Gaussian Naïve Bayes: learned μ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.]

Learned Bayes Models – means for P(BrainActivity | WordCategory), shown separately for animal words and people words. Pairwise classification accuracy: 85% [Mitchell et al.]

Bayes Classifier is Optimal!
Learn: h : X → Y
– X – features
– Y – target classes
Suppose you know the true P(Y | X):
– Bayes classifier: h_Bayes(x) = argmax_y P(Y = y | X = x)
Why is this optimal?

Optimal classification
Theorem:
– The Bayes classifier h_Bayes is optimal: for any other classifier h, error_true(h_Bayes) ≤ error_true(h).
– Why? For every x, h_Bayes picks the class with the largest posterior probability, so it minimizes the probability of error at each x, and hence the expected error overall.

What you need to know about Naïve Bayes
Naïve Bayes classifier
– What’s the assumption
– Why we use it
– How do we learn it
– Why is Bayesian estimation important
Text classification
– Bag of words model
Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given the class
Optimal decision using the Bayes Classifier

Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category c_i ∈ C
  Let D_i be the subset of documents in D in category c_i
    P(c_i) = |D_i| / |D|
  Let T_i be the concatenation of all the documents in D_i
  Let n_i be the total number of word occurrences in T_i
  For each word w_j ∈ V
    Let n_ij be the number of occurrences of w_j in T_i
    Let P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
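A direct Python sketch of this training procedure (illustrative; the data structures and names are my own):

```python
from collections import Counter

def train_text_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of categories."""
    vocab = {w for doc in docs for w in doc}                       # V
    priors, cond = {}, {}
    for c in set(labels):
        docs_c = [doc for doc, y in zip(docs, labels) if y == c]   # D_i
        priors[c] = len(docs_c) / len(docs)                        # P(c_i) = |D_i| / |D|
        counts = Counter(w for doc in docs_c for w in doc)         # n_ij for each word
        n_c = sum(counts.values())                                 # n_i
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab))         # (n_ij + 1) / (n_i + |V|)
                   for w in vocab}
    return vocab, priors, cond
```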

Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
  argmax_{c_i ∈ C} P(c_i) ∏_{k=1}^{n} P(a_k | c_i)
where a_k is the word occurring in the k-th position of X
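A matching sketch of the test-time rule, computed in log space; the model numbers below are made up, and skipping out-of-vocabulary words is one common choice that the pseudocode leaves open:

```python
import math

def classify_text_nb(vocab, priors, cond, doc):
    """Return argmax_c P(c) * prod_k P(a_k | c), computed in log space;
    out-of-vocabulary words are simply skipped."""
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(cond[c][w]) for w in doc if w in vocab)
    return max(priors, key=log_score)

# Tiny hand-built model (hypothetical Laplace-style estimates).
vocab = {"free", "money", "meeting", "noon"}
priors = {"spam": 0.5, "ham": 0.5}
cond = {"spam": {"free": 3/8, "money": 2/8, "meeting": 1/8, "noon": 1/8},
        "ham":  {"free": 1/8, "money": 1/8, "meeting": 2/8, "noon": 2/8}}
print(classify_text_nb(vocab, priors, cond, ["free", "money", "please"]))   # -> "spam"
```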

Naïve Bayes Time Complexity
Training time: O(|D| L_d + |C||V|), where L_d is the average length of a document in D.
– Assumes V and all D_i, n_i, and n_ij are pre-computed in O(|D| L_d) time during one pass through all of the data.
– Generally just O(|D| L_d), since usually |C||V| < |D| L_d
Test time: O(|C| L_t), where L_t is the average length of a test document.
Very efficient overall: linearly proportional to the time needed just to read in all the data.

Multi-Class Categorization
– Pick the category with max probability
– Create many 1-vs-other classifiers
– Use a hierarchical approach (wherever a hierarchy is available), e.g. Entity → {Person, Location}, Person → {Scientist, Artist}, Location → {City, County, Country}