Naïve Bayes William W. Cohen
Again filched from:
Probabilistic and Bayesian Analytics
Andrew W. Moore, School of Computer Science, Carnegie Mellon University
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to the source repository of Andrew’s tutorials. Comments and corrections gratefully received.
Probability - what you need to really, really know:
– Probabilities are cool
– Random variables and events
– The Axioms of Probability
– Independence, binomials, multinomials
– Conditional probabilities
– Bayes Rule
– MLEs, smoothing, and MAPs
– The joint distribution
– Inference
– Density estimation and classification
– Naïve Bayes density estimators and classifiers
– Conditional independence… more on this next week!
Copyright © Andrew W. Moore  The Axioms of Probability
Some of A Joint Distribution
Columns: A B C D E, p
is the effect of the
is the effect of a
The effect of this
to this effect: “
be the effect of the
…
not the effect of any
…
does not affect the general
does not affect the question
any manner affect the principle
(values in the probability column p not recoverable)
Coupled Temporal Scoping of Relational Facts. P.P. Talukdar, D.T. Wijaya and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2012.
Understanding Semantic Change of Words Over Centuries. D.T. Wijaya and R. Yeniterzi. In Workshop on Detecting and Exploiting Cultural Diversity on the Social Web (DETECT) at CIKM, 2011.
A Project Idea
Problem for non-native speakers: article selection in English
– “I plan to use an SVM to classify….”
– “The SVM I used was libsvm….”
– “I bought a shrunken head in the Amazon”
– “I bought a shrunken head on Amazon”
Question 1: can you learn how to select articles accurately from big data?
– Google n-grams?
– Pre-parsed text?
Question 2: can you learn an article-selection algorithm that clusters the different cases in a cognitively plausible way?
– There are ~60 rules/clusters that are taught (but 6 cover most cases); we have a few examples of each
– People exhibit a power-law learning curve within cases of the same rule; we can test to see how well a given clustering fits student performance data
– This is a semi-supervised learning problem, or maybe a constrained clustering problem, or maybe ….
Nan Li (my student, finishing this year) is working on the ITS side of this problem and is interested in helping out.
Big ML (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
Performance …
Pattern          Used   Errors
P(C|A,B,D,E)      101        1
P(C|A,B,D)        157        6
P(C|B,D)          163       13
P(C|B)            244       78
P(C)               58       31
Is this good performance? Do other brute-force estimates of joint probabilities have the same problem?
Flashback
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the “Census Income” dataset. [Kohavi, 1996]
Number of Instances: 48,842
Number of Attributes: 14 (in UCI’s copy of the dataset) + 1; 3 (here)
Size of table: 2^15 = 32,768 (if all binary), an average of about 1.5 examples per row
Actual m = 1,974,927,360 (if continuous attributes are binarized)
Copyright © Andrew W. Moore Naïve Density Estimation The problem with the Joint Estimator is that it just mirrors the training data. We need something which generalizes more usefully. The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.
Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D) ?
Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? P(A) P(~B) P(C) P(~D)
Copyright © Andrew W. Moore  Naïve Distribution General Case
Suppose X1, X2, …, Xd are independently distributed. Then
P(X1=x1 ^ X2=x2 ^ … ^ Xd=xd) = P(X1=x1) * P(X2=x2) * … * P(Xd=xd)
So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this?
Copyright © Andrew W. Moore  Learning a Naïve Density Estimator
Another trivial learning algorithm!
MLE: P(Xj=xj) = #records with Xj=xj / #records
Dirichlet (MAP): P(Xj=xj) = (#records with Xj=xj + m*qj) / (#records + m), with prior weight m and prior probability qj
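A minimal sketch of this estimator in Python, assuming discrete attributes and the Dirichlet (MAP) smoothing above with a uniform prior qj = 1/|dom(Xj)|; the class and method names are invented for illustration, not a reference implementation:

```python
from collections import defaultdict

class NaiveDensityEstimator:
    """Estimate P(X1=x1 ^ ... ^ Xd=xd) as a product of per-attribute estimates."""

    def __init__(self, m=1.0):
        self.m = m            # Dirichlet/MAP smoothing weight (m=0 gives the MLE)
        self.counts = []      # counts[j][v] = #records with Xj=v
        self.n = 0            # total number of training records

    def fit(self, records):
        d = len(records[0])
        self.counts = [defaultdict(int) for _ in range(d)]
        for row in records:
            for j, v in enumerate(row):
                self.counts[j][v] += 1
        self.n = len(records)
        return self

    def marginal(self, j, v):
        # MAP estimate with uniform prior qj = 1/|dom(Xj)|, dom estimated from the data
        q = 1.0 / len(self.counts[j])
        return (self.counts[j].get(v, 0) + self.m * q) / (self.n + self.m)

    def prob(self, row):
        # Naive assumption: multiply the per-attribute marginals
        p = 1.0
        for j, v in enumerate(row):
            p *= self.marginal(j, v)
        return p


# Usage: each record is a tuple of discrete attribute values
est = NaiveDensityEstimator().fit([(1, 0, 1), (1, 1, 1), (0, 0, 1)])
print(est.prob((1, 0, 1)))
```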
Is this an interesting learning algorithm?
For n-grams, what is P(C=effect | A=will)?
– In joint: P(C=effect | A=will) = 0.38
– In naïve: P(C=effect | A=will) = P(C=effect) = #[C=effect] / #totalNgrams = 0.94 (!)
What is P(C=effect | B=no)?
– In joint: P(C=effect | B=no) = …
– In naïve: P(C=effect | B=no) = P(C=effect) = 0.94
No.
Copyright © Andrew W. Moore  Independently Distributed Data
Review: A and B are independent if
– Pr(A,B) = Pr(A)Pr(B)
– Sometimes written: A ⊥ B
A and B are conditionally independent given C if
– Pr(A,B|C) = Pr(A|C)*Pr(B|C)
– Written: A ⊥ B | C
Bayes Classifiers
If we can do inference over Pr(X,Y)…
… in particular compute Pr(X|Y) and Pr(Y)
– We can compute
Pr(Y=y | X=x) = Pr(X=x | Y=y) Pr(Y=y) / Pr(X=x) = Pr(X=x | Y=y) Pr(Y=y) / Σ_y’ Pr(X=x | Y=y’) Pr(Y=y’)
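A small illustration of that decision rule, assuming Pr(X=x|Y=y) and Pr(Y=y) are already available as Python dictionaries (the names here are made up for the sketch):

```python
def bayes_classify(x, classes, pr_x_given_y, pr_y):
    """Pick the class y maximizing Pr(X=x|Y=y) * Pr(Y=y).

    The denominator Pr(X=x) is the same for every candidate y,
    so it can be dropped when we only need the argmax.
    """
    return max(classes, key=lambda y: pr_x_given_y[(x, y)] * pr_y[y])
```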
Can we make this interesting? Yes!
Key ideas:
– Pick the class variable Y
– Instead of estimating P(X1,…,Xn,Y) = P(X1)*…*P(Xn)*P(Y), estimate P(X1,…,Xn|Y) = P(X1|Y)*…*P(Xn|Y)
– Or, assume P(Xi|Y) = Pr(Xi | X1,…,Xi-1, Xi+1,…,Xn, Y)
– Or, that Xi is conditionally independent of every Xj, j != i, given Y
– How to estimate? MLE
The Naïve Bayes classifier – v1
Dataset: each example has
– A unique id (why? for debugging the feature extractor)
– d attributes X1,…,Xd; each Xi takes a discrete value in dom(Xi)
– One class label Y in dom(Y)
You have a train dataset and a test dataset
The Naïve Bayes classifier – v1
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute Pr(y’,x1,…,xd) = [ Π_j C(“Y=y’ ^ Xj=xj”) / C(“Y=y’”) ] * C(“Y=y’”) / C(“Y=ANY”)
– Return the best y’
The Naïve Bayes classifier – v1
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute Pr(y’,x1,…,xd) = [ Π_j C(“Y=y’ ^ Xj=xj”) / C(“Y=y’”) ] * C(“Y=y’”) / C(“Y=ANY”)
– Return the best y’
This will overfit, so …
The Naïve Bayes classifier – v1
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute Pr(y’,x1,…,xd) = [ Π_j (C(“Y=y’ ^ Xj=xj”) + m*qj) / (C(“Y=y’”) + m) ] * (C(“Y=y’”) + m*qy) / (C(“Y=ANY”) + m)
– Return the best y’
where: qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m = 1
This will underflow, so …
The Naïve Bayes classifier – v1
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute log Pr(y’,x1,…,xd) = Σ_j log[ (C(“Y=y’ ^ Xj=xj”) + m*qj) / (C(“Y=y’”) + m) ] + log[ (C(“Y=y’”) + m*qy) / (C(“Y=ANY”) + m) ]
– Return the best y’
where: qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m = 1
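A compact Python sketch of v1 as spelled out above: a hashtable of event counts, Dirichlet smoothing with m = 1, and log probabilities to avoid underflow. The dataset layout and function names are assumptions for illustration, not the assignment’s required interface.

```python
import math
from collections import defaultdict

def train_nb_v1(train):
    """train: iterable of (id, y, [x1, ..., xd]) examples. Returns the event counter C."""
    C = defaultdict(int)
    for _id, y, xs in train:
        C[("Y", "ANY")] += 1
        C[("Y", y)] += 1
        for j, x in enumerate(xs):
            C[("YX", y, j, x)] += 1          # event "Y=y ^ Xj=x"
    return C

def classify_nb_v1(C, xs, dom_y, dom_sizes, m=1.0):
    """Return the y' maximizing log Pr(y', x1, ..., xd) under the smoothed model."""
    qy = 1.0 / len(dom_y)
    best, best_score = None, float("-inf")
    for y in dom_y:
        # log of the smoothed class prior
        score = math.log((C[("Y", y)] + m * qy) / (C[("Y", "ANY")] + m))
        for j, x in enumerate(xs):
            qj = 1.0 / dom_sizes[j]          # qj = 1/|dom(Xj)|
            score += math.log((C[("YX", y, j, x)] + m * qj) / (C[("Y", y)] + m))
        if score > best_score:
            best, best_score = y, score
    return best
```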
The Naïve Bayes classifier – v2
For text documents, what features do you use? One common choice:
– X1 = first word in the document
– X2 = second word in the document
– X3 = third …
– X4 = …
– …
But: Pr(X13=hockey | Y=sports) is probably not that different from Pr(X11=hockey | Y=sports)… so instead of treating them as different variables, treat them as different copies of the same variable
The Naïve Bayes classifier – v2
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ X=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute Pr(y’,x1,…,xd) = [ Π_j C(“Y=y’ ^ X=xj”) / C(“Y=y’”) ] * C(“Y=y’”) / C(“Y=ANY”)
– Return the best y’
The Naïve Bayes classifier – v2
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ X=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): Compute log Pr(y’,x1,…,xd) = Σ_j log[ (C(“Y=y’ ^ X=xj”) + m*qj) / (C(“Y=y’”) + m) ] + log[ (C(“Y=y’”) + m*qy) / (C(“Y=ANY”) + m) ]
– Return the best y’
where: qj = 1/|V|, qy = 1/|dom(Y)|, m = 1
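The same sketch adapted to v2, where every word position is treated as a copy of one variable X and qj = 1/|V|; again the data layout and names are illustrative assumptions.

```python
import math
from collections import defaultdict

def train_nb_v2(train):
    """train: iterable of (id, y, words). All word positions share one variable X."""
    C = defaultdict(int)
    vocab = set()
    for _id, y, words in train:
        C[("Y", "ANY")] += 1
        C[("Y", y)] += 1
        for w in words:
            C[("YX", y, w)] += 1             # event "Y=y ^ X=w"
            vocab.add(w)
    return C, vocab

def classify_nb_v2(C, vocab, words, dom_y, m=1.0):
    """Return the y' maximizing log Pr(y', x1, ..., xd) with qj = 1/|V|."""
    qy = 1.0 / len(dom_y)
    qj = 1.0 / len(vocab)
    def log_pr(y):
        s = math.log((C[("Y", y)] + m * qy) / (C[("Y", "ANY")] + m))
        for w in words:
            s += math.log((C[("YX", y, w)] + m * qj) / (C[("Y", y)] + m))
        return s
    return max(dom_y, key=log_pr)
```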
The Naïve Bayes classifier – v2
You have a train dataset and a test dataset. To classify documents, these might be:
– academic, FacultyHome William W. Cohen Research Professor Machine Learning Department Carnegie Mellon University Member of the Language Technology Institute the joint CMU-Pitt Program in Computational Biology the Lane Center for Computational Biology and the Center for Bioimage Informatics Director of the Undergraduate Minor in Machine Learning Bio Teaching Projects Publications recent all Software Datasets Talks Students Colleagues Blog Contact Info Other Stuff …
– commercial, Search Images Videos ….
– …
How about for n-grams?
The Naïve Bayes classifier – v2
You have a train dataset and a test dataset. To do spelling correction these might be:
– ng1223  effect  a_the b_main d_of e_the
– ng1224  affect  a_shows b_not d_mice e_in
– ….
I.e., encode the event Xi=w as another event X=i_w.
Question: are there any differences in behavior?
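A tiny helper illustrating that encoding (the function name is made up for the sketch):

```python
def encode(words):
    """Encode the positional event Xi=w as a single pooled event X=i_w."""
    return [str(i) + "_" + w for i, w in enumerate(words, start=1)]

# encode(["shows", "not", "mice", "in"]) -> ['1_shows', '2_not', '3_mice', '4_in']
```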
Complexity of Naïve Bayes
You have a train dataset and a test dataset
Initialize an “event counter” (hashtable) C
For each example id, y, x1,…,xd in train:   [Complexity: O(n), n = size of train]
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ X=xj”)++
For each example id, y, x1,…,xd in test:   [Complexity: O(|dom(Y)|*n’), n’ = size of test]
– For each y’ in dom(Y): Compute log Pr(y’,x1,…,xd) = Σ_j log[ (C(“Y=y’ ^ X=xj”) + m*qj) / (C(“Y=y’”) + m) ] + log[ (C(“Y=y’”) + m*qy) / (C(“Y=ANY”) + m) ]
– Return the best y’
where: qj = 1/|V|, qy = 1/|dom(Y)|, m = 1
Assume the hashtable holding all counts fits in memory; both passes are sequential reads.
Complexity of Naïve Bayes
You have a train dataset and a test dataset
Process:
– Count events in the train dataset: O(n1), where n1 is the total size of train
– Write the counts to disk: O(min(|dom(X)|*|dom(Y)|, n1)), which is O(|V|) if V is the vocabulary and dom(Y) is small
– Classify the test dataset: O(|V| + n2)
– Worst-case memory usage: O(min(|dom(X)|*|dom(Y)|, n1))
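A minimal sketch of the counting step of that process, assuming one tab-separated example per line of a plain text file; the file layout and function names are invented for illustration:

```python
from collections import defaultdict

def count_events(train_path, counts_path):
    """Stream over the train file once, then write the event counts to disk."""
    C = defaultdict(int)
    with open(train_path) as f:
        for line in f:                        # sequential read, one pass over train
            _id, y, *words = line.rstrip("\n").split("\t")
            C[("Y", "ANY")] += 1
            C[("Y", y)] += 1
            for w in words:
                C[("YX", y, w)] += 1          # event "Y=y ^ X=w"
    with open(counts_path, "w") as out:       # at most min(|dom(X)|*|dom(Y)|, n1) lines
        for event, n in C.items():
            out.write(repr(event) + "\t" + str(n) + "\n")
```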
Naïve Bayes v2
This is one example of a streaming classifier:
– Each example is read only once
– You can create a classifier and perform classifications at any point
– Memory is minimal (<< O(n)); ideally it would be constant, traditionally less than O(sqrt(N))
– Order doesn’t matter
  Nice because we may not control the order of examples in real life
  This is a hard one to get a learning system to have!
There are few competitive learning methods that are as stream-y as naïve Bayes…
First assignment
Implement naïve Bayes v2
Run and test it on Reuters RCV2
– O(100k) newswire stories
– One of the largest widely-used classification datasets
– Details on the wiki
– Turn in by next Monday
Hint to all:
– The next assignment will be a Naïve Bayes that does not use a hashtable for event counts (Thursday’s lecture)
– You will want to reuse some stuff from this assignment later….