Naïve Bayes and Hadoop Shannon Quinn.

Coupled Temporal Scoping of Relational Facts. P. P. Talukdar, D. T. Wijaya and T. M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2012.
Understanding Semantic Change of Words Over Centuries. D. T. Wijaya and R. Yeniterzi. In Workshop on Detecting and Exploiting Cultural Diversity on the Social Web (DETECT) at CIKM, 2011.

Some of A Joint Distribution

A     B       C       D     E          p
is    the     effect  of    the        …
is    the     effect  of    a          …
The   effect  of      this  …          …
to    this    effect  :     "          …
be    the     effect  of    the        …
…     …       …       …     …          …
not   the     effect  of    any        …
…     …       …       …     …          …
does  not     affect  the   general    …
does  not     affect  the   question   …
any   manner  affect  the   principle  …

Flashback. Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset [Kohavi, 1996]. Number of Instances: 48,842. Number of Attributes: 14 (in UCI's copy of the dataset) + 1; 3 (here). Size of table: 2^15 = 32,768 (if all binary) → an average of 1.5 examples per row. Actual m = 1,974,927,360 (if continuous attributes are binarized).

Copyright © Andrew W. Moore Naïve Density Estimation The problem with the Joint Estimator is that it just mirrors the training data (and is *hard* to compute). We need something which generalizes more usefully. The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.

Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D) ?

Copyright © Andrew W. Moore. Using the Naïve Distribution. Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? It is P(A) P(~B) P(C) P(~D).
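A tiny Python sketch of that computation; the marginal values below are made up purely for illustration:

    # Hypothetical marginals for four independent binary attributes
    marginals = {"A": 0.3, "B": 0.6, "C": 0.1, "D": 0.9}

    def p(attr, value):
        # P(attr = value) for a binary attribute
        return marginals[attr] if value else 1.0 - marginals[attr]

    # P(A ^ ~B ^ C ^ ~D) = P(A) * P(~B) * P(C) * P(~D)
    print(p("A", True) * p("B", False) * p("C", True) * p("D", False))  # 0.3 * 0.4 * 0.1 * 0.1 = 0.0012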

Copyright © Andrew W. Moore. Naïve Distribution General Case. Suppose X1, X2, …, Xd are independently distributed. Then P(X1=u1 ^ X2=u2 ^ … ^ Xd=ud) = P(X1=u1) * P(X2=u2) * … * P(Xd=ud). So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this?

Copyright © Andrew W. Moore. Learning a Naïve Density Estimator. Another trivial learning algorithm! MLE: estimate P(Xi=u) = (# records in which Xi=u) / (total # of records). Dirichlet (MAP): estimate P(Xi=u) = (# records in which Xi=u + m*q) / (total # of records + m), where q is a prior probability for the value u (e.g. 1/|dom(Xi)|) and m is the strength of the prior.
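A short sketch of the counting step in Python; the records/domains argument names are just for illustration, and setting m=0 gives the MLE while m>0 gives the Dirichlet (MAP) estimate:

    from collections import defaultdict

    def learn_naive_density(records, domains, m=0):
        """Per-attribute value counts. m=0 gives the MLE; m>0 gives the
        Dirichlet (MAP) estimate with prior q = 1/|dom(Xi)| per attribute."""
        counts = defaultdict(lambda: defaultdict(int))
        n = 0
        for rec in records:                      # each record is a dict {attribute: value}
            n += 1
            for attr, val in rec.items():
                counts[attr][val] += 1

        def p(attr, val):
            q = 1.0 / len(domains[attr])         # uniform prior over the attribute's domain
            return (counts[attr][val] + m * q) / (n + m)

        return p

    # Any row of the implied joint distribution is a product of per-attribute estimates:
    # P(X1=u1 ^ ... ^ Xd=ud) = P(X1=u1) * ... * P(Xd=ud)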

Copyright © Andrew W. Moore. Independently Distributed Data. Review: A and B are independent if Pr(A,B) = Pr(A)Pr(B); sometimes written A ⊥ B. A and B are conditionally independent given C if Pr(A,B|C) = Pr(A|C)*Pr(B|C); written A ⊥ B | C.

Bayes Classifiers. If we can do inference over Pr(X,Y) – in particular, compute Pr(X|Y) and Pr(Y) – then we can compute Pr(Y=y|X=x) = Pr(X=x|Y=y)Pr(Y=y) / Pr(X=x) = Pr(X=x|Y=y)Pr(Y=y) / Σ over y' of Pr(X=x|Y=y')Pr(Y=y').

Can we make this interesting? Yes! Key ideas: – Pick the class variable Y. – Instead of estimating P(X1,…,Xn,Y) = P(X1)*…*P(Xn)*P(Y), estimate P(X1,…,Xn|Y) = P(X1|Y)*…*P(Xn|Y). – Or, assume P(Xi|Y) = Pr(Xi|X1,…,Xi-1,Xi+1,…,Xn,Y). – Or, that Xi is conditionally independent of every Xj, j!=i, given Y. – How to estimate? MLE.

The Naïve Bayes classifier – v1. Dataset: each example has – A unique id. Why? For debugging the feature extractor. – d attributes X1,…,Xd; each Xi takes a discrete value in dom(Xi). – One class label Y in dom(Y). You have a train dataset and a test dataset.

The Naïve Bayes classifier – v1. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ Xj=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute Pr(y',x1,…,xd) = [Πj C("Y=y' ^ Xj=xj") / C("Y=y'")] * C("Y=y'") / C("Y=ANY") – Return the best y'.

The Naïve Bayes classifier – v1. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ Xj=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute Pr(y',x1,…,xd) = [Πj C("Y=y' ^ Xj=xj") / C("Y=y'")] * C("Y=y'") / C("Y=ANY") – Return the best y'. This will overfit, so …

The Naïve Bayes classifier – v1. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ Xj=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute Pr(y',x1,…,xd) = [Πj (C("Y=y' ^ Xj=xj") + m*qj) / (C("Y=y'") + m)] * (C("Y=y'") + m*qy) / (C("Y=ANY") + m) – Return the best y'. Where: qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m=1. This will underflow, so …

The Naïve Bayes classifier – v1. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ Xj=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute log Pr(y',x1,…,xd) = Σj log[(C("Y=y' ^ Xj=xj") + m*qj) / (C("Y=y'") + m)] + log[(C("Y=y'") + m*qy) / (C("Y=ANY") + m)] – Return the best y'. Where: qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m=1.
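A compact Python sketch of v1, assuming examples arrive as (id, y, [x1, …, xd]) tuples; the dom_y and dom_x_sizes arguments (needed for the priors qy and qj) are hypothetical inputs the caller supplies:

    import math
    from collections import defaultdict

    def train_v1(train):
        C = defaultdict(int)                                  # the "event counter" hashtable
        for ex_id, y, xs in train:
            C["Y=ANY"] += 1
            C["Y=" + y] += 1
            for j, xj in enumerate(xs):
                C["Y=%s ^ X%d=%s" % (y, j, xj)] += 1
        return C

    def classify_v1(C, xs, dom_y, dom_x_sizes, m=1):
        best_y, best_lp = None, float("-inf")
        qy = 1.0 / len(dom_y)
        for y in dom_y:
            lp = math.log((C["Y=" + y] + m * qy) / (C["Y=ANY"] + m))
            for j, xj in enumerate(xs):
                qj = 1.0 / dom_x_sizes[j]                     # qj = 1/|dom(Xj)|
                lp += math.log((C["Y=%s ^ X%d=%s" % (y, j, xj)] + m * qj) / (C["Y=" + y] + m))
            if lp > best_lp:
                best_y, best_lp = y, lp
        return best_y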

The Naïve Bayes classifier – v2. For text documents, what features do you use? One common choice: – X1 = first word in the document – X2 = second word in the document – X3 = third … – X4 = … – … But: Pr(X13=hockey|Y=sports) is probably not that different from Pr(X11=hockey|Y=sports)… so instead of treating them as different variables, treat them as different copies of the same variable.

The Naïve Bayes classifier – v2. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ X=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute Pr(y',x1,…,xd) = [Πj C("Y=y' ^ X=xj") / C("Y=y'")] * C("Y=y'") / C("Y=ANY") – Return the best y'.

The Naïve Bayes classifier – v2. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. For each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ X=xj")++ For each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute log Pr(y',x1,…,xd) = Σj log[(C("Y=y' ^ X=xj") + m*qj) / (C("Y=y'") + m)] + log[(C("Y=y'") + m*qy) / (C("Y=ANY") + m)] – Return the best y'. Where: qj = 1/|V| (V = the vocabulary), qy = 1/|dom(Y)|, m=1.
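The v2 change amounts to pooling all word positions into one shared variable X and smoothing with 1/|V|. A sketch along the lines of the v1 code above, assuming each example is (id, y, list of tokens):

    import math
    from collections import defaultdict

    def train_v2(train):
        C, vocab = defaultdict(int), set()
        for ex_id, y, words in train:
            C["Y=ANY"] += 1
            C["Y=" + y] += 1
            for w in words:
                C["Y=%s ^ X=%s" % (y, w)] += 1                # one shared variable X for every position
                vocab.add(w)
        return C, vocab

    def classify_v2(C, vocab, words, dom_y, m=1):
        qx, qy = 1.0 / len(vocab), 1.0 / len(dom_y)           # qx = 1/|V|
        best_y, best_lp = None, float("-inf")
        for y in dom_y:
            lp = math.log((C["Y=" + y] + m * qy) / (C["Y=ANY"] + m))
            for w in words:
                lp += math.log((C["Y=%s ^ X=%s" % (y, w)] + m * qx) / (C["Y=" + y] + m))
            if lp > best_lp:
                best_y, best_lp = y, lp
        return best_y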

Complexity of Naïve Bayes. You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C. Training – for each example id, y, x1,…,xd in train: – C("Y=ANY")++; C("Y=y")++ – For j in 1..d: C("Y=y ^ X=xj")++ Complexity: O(n), where n = size of train. Testing – for each example id, y, x1,…,xd in test: – For each y' in dom(Y): compute log Pr(y',x1,…,xd) = Σj log[(C("Y=y' ^ X=xj") + m*qj) / (C("Y=y'") + m)] + log[(C("Y=y'") + m*qy) / (C("Y=ANY") + m)] – Return the best y'. Complexity: O(|dom(Y)|*n'), where n' = size of test. Where: qj = 1/|V|, qy = 1/|dom(Y)|, m=1. Both passes are sequential reads.

MapReduce!

Inspiration not Plagiarism. This is not the first lecture ever on MapReduce. I borrowed from William Cohen, who borrowed from Alona Fyshe, and she borrowed from: Jimmy Lin, Google, Cloudera.

Introduction to MapReduce. 3 main phases: – Map (send each input record to a key) – Sort (put all of one key in the same place; handled behind the scenes in Hadoop) – Reduce (operate on each key and its set of values). Terms come from functional programming: – map(lambda x: x.upper(), ["shannon", "p", "quinn"]) → ['SHANNON', 'P', 'QUINN'] – reduce(lambda x, y: x + "-" + y, ["shannon", "p", "quinn"]) → "shannon-p-quinn"
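The same two calls as a runnable Python 3 snippet (in Python 3, reduce lives in functools):

    from functools import reduce

    names = ["shannon", "p", "quinn"]
    print(list(map(lambda x: x.upper(), names)))    # ['SHANNON', 'P', 'QUINN']
    print(reduce(lambda x, y: x + "-" + y, names))  # shannon-p-quinn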

MapReduce overview: Map → Shuffle/Sort → Reduce

MapReduce in slow motion Canonical example: Word Count Example corpus:
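The corpus itself is not reproduced in this transcript, but a minimal word count in the Hadoop Streaming style looks roughly like the sketch below (mapper.py and reducer.py are hypothetical file names; each script reads stdin and writes tab-separated key/value lines, and Hadoop's shuffle supplies the sort between them):

    # mapper.py: emit (word, 1) for every word on every input line
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

    # reducer.py: input arrives sorted by key, so sum each run of identical words
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

Locally you can approximate the whole pipeline with: cat some_corpus.txt | python mapper.py | sort | python reducer.py (some_corpus.txt is a placeholder file name).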

Other Hadoop input formats (besides the default TextInputFormat): KeyValueInputFormat, SequenceFileInputFormat.

Is any part of this wasteful? Remember: moving data around and writing to/reading from disk are very expensive operations. No reducer can start until all mappers are done and the data in its partition has been sorted.

Common pitfalls. You have no control over the order in which reduces are performed. You have "no" control over the order in which you encounter reduce values – sort of (more later). The only ordering you should assume is that reducers always start after mappers.

Common pitfalls. You should assume your Maps and Reduces will be taking place on different machines with different memory spaces. Don't make a static variable and assume that other processes can read it – they can't. – It may appear that they can when run locally, but they can't. – No really, don't do this.

Common pitfalls. Do not communicate between mappers or between reducers: – overhead is high – you don't know which mappers/reducers are actually running at any given point – there's no easy way to find out what machine they're running on, and you shouldn't be looking for them anyway.

Assignment 1 Naïve Bayes on Hadoop Document classification Due Thursday, Jan 29 by 11:59:59pm

Other reminders. If you haven't yet signed up for a student presentation, do so! – I will be assigning slots starting Friday at 5pm, so sign up before then. – You need to be on BitBucket to do this. Mailing list for questions. Office hours: Mondays 9-10:30am.