ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics: Classification: Naïve Bayes
Readings: Barber 10.1-10.3


Administrativia
HW2
– Due: Friday 03/06, 03/15, 11:55pm
– Implement linear regression, Naïve Bayes, Logistic Regression
Need a couple of catch-up lectures
– How about 4-6pm?

Administrativia
Mid-term
– When: March 18, class timing
– Where: In class
– Format: Pen-and-paper. Open-book, open-notes, closed-internet. No sharing.
– What to expect: a mix of multiple choice or true/false questions, “Prove this statement”, “What would happen for this dataset?”
– Material: Everything from the beginning of class up to (and including) SVMs

New Topic: Naïve Bayes (your first probabilistic classifier)
Classification: x → y, where y is discrete

Error Decomposition
Approximation/Modeling Error
– You approximated reality with a model
Estimation Error
– You tried to learn the model with finite data
Optimization Error
– You were lazy and couldn’t/didn’t optimize to completion
Bayes Error
– Reality just sucks

Classification
Learn h: X → Y
– X: features
– Y: target classes
Suppose you know P(Y|X) exactly; how should you classify?
– Bayes classifier (see the rule below). Why?
Slide Credit: Carlos Guestrin
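The Bayes classifier rule referenced on this slide did not survive in this transcript; it is the standard rule of predicting the most probable class under P(Y|X), shown here for completeness:

```latex
h_{\text{Bayes}}(x) \;=\; \arg\max_{y \in \mathcal{Y}} \; P(Y = y \mid X = x)
```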

Optimal classification
Theorem: The Bayes classifier h_Bayes is optimal!
– That is, its true error is no larger than that of any other classifier h
Proof:
Slide Credit: Carlos Guestrin

Generative vs. Discriminative
Generative Approach
– Estimate p(x|y) and p(y)
– Use Bayes Rule to predict y
Discriminative Approach
– Estimate p(y|x) directly, OR
– Learn a “discriminant” function h(x)

Generative vs. Discriminative
Generative Approach
– Assume some functional form for P(X|Y), P(Y)
– Estimate p(X|Y) and p(Y)
– Use Bayes Rule to calculate P(Y|X=x)
– Indirect computation of P(Y|X) through Bayes rule
– But can generate a sample of the data: P(X) = Σ_y P(y) P(X|y)
Discriminative Approach
– Estimate p(y|x) directly, OR
– Learn a “discriminant” function h(x)
– Direct, but cannot obtain a sample of the data, because P(X) is not available
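For reference, the Bayes-rule computation the generative approach relies on (a standard identity, spelled out here since the slide's formula is not in the transcript):

```latex
P(Y = y \mid X = x) \;=\; \frac{P(X = x \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')}
```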

Generative vs. Discriminative
Generative:
– Today: Naïve Bayes
Discriminative:
– Next: Logistic Regression
NB & LR related to each other.

How hard is it to learn the optimal classifier?
Categorical data: how do we represent these? How many parameters?
– Class-Prior, P(Y): Suppose Y is composed of k classes
– Likelihood, P(X|Y): Suppose X is composed of d binary features
Complex model → High variance with limited data!!!
Slide Credit: Carlos Guestrin

Independence to the rescue
Slide Credit: Sam Roweis

The Naïve Bayes assumption
Naïve Bayes assumption:
– Features are independent given the class
– More generally, the joint likelihood factorizes over all d features (see the formula below)
How many parameters now? Suppose X is composed of d binary features
Slide Credit: Carlos Guestrin
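The assumption and the parameter counts the slide asks about, written out; these are the standard figures for k classes and d binary features:

```latex
% Naive Bayes assumption: features are conditionally independent given the class
P(X_1, \dots, X_d \mid Y) \;=\; \prod_{i=1}^{d} P(X_i \mid Y)
% Free parameters with k classes and d binary features:
%   without the assumption: (k - 1) + k\,(2^d - 1)
%   with the NB assumption: (k - 1) + k\,d
```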

The Naïve Bayes Classifier
Given:
– Class-Prior P(Y)
– d conditionally independent features X given the class Y
– For each X_i, we have likelihood P(X_i|Y)
Decision rule: pick the class maximizing P(Y=y) times the product of the per-feature likelihoods P(X_i=x_i|Y=y)
If the assumption holds, NB is the optimal classifier!
Slide Credit: Carlos Guestrin

MLE for the parameters of NB
Given dataset:
– Count(A=a, B=b): number of examples where A=a and B=b
MLE for NB, simply:
– Class-Prior: P(Y=y) = Count(Y=y) / N
– Likelihood: P(X_i=x_i | Y=y) = Count(X_i=x_i, Y=y) / Count(Y=y)
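A minimal sketch of the counting-based MLE above, plus the resulting decision rule, for categorical features; the function names, array shapes, and the log-space trick are illustrative choices, not from the slides:

```python
import numpy as np

def nb_mle(X, y, n_classes, n_values):
    """MLE for Naive Bayes with categorical features.
    X: (N, d) integer feature matrix, y: (N,) integer class labels."""
    N, d = X.shape
    prior = np.zeros(n_classes)               # P(Y = c)
    lik = np.zeros((n_classes, d, n_values))  # P(X_i = v | Y = c)
    for c in range(n_classes):
        Xc = X[y == c]
        prior[c] = len(Xc) / N                         # Count(Y=c) / N
        for i in range(d):
            for v in range(n_values):
                lik[c, i, v] = np.mean(Xc[:, i] == v)  # Count(X_i=v, Y=c) / Count(Y=c)
    return prior, lik

def nb_predict(x, prior, lik):
    """Decision rule: argmax_c  log P(Y=c) + sum_i log P(X_i = x_i | Y=c)."""
    scores = np.log(prior) + sum(np.log(lik[:, i, xi]) for i, xi in enumerate(x))
    return int(np.argmax(scores))
```

Note that a count of zero gives a zero probability (and a log of zero), which is exactly the problem the smoothing slides below address.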

HW1

Naïve Bayes: in-class demo
Variables:
– Y = {did-not-watch superbowl, watched superbowl}
– X_1 = {domestic, international}
– X_2 = {<2 years, >=2 years}
Estimate:
– P(Y=1)
– P(X_1=0 | Y=0), P(X_2=0 | Y=0)
– P(X_1=0 | Y=1), P(X_2=0 | Y=1)
Prediction: argmax_y P(Y=y) P(x_1|Y=y) P(x_2|Y=y)

Subtleties of NB classifier 1 – Violating the NB assumption
Usually, features are not conditionally independent
Probabilities P(Y|X) often biased towards 0 or 1
Nonetheless, NB is a very popular classifier
– NB often performs well, even when the assumption is violated
– [Domingos & Pazzani ’96] discuss some conditions for good performance
Slide Credit: Carlos Guestrin

Subtleties of NB classifier 2 – Insufficient training data
What if you never see a training instance where X_1=a when Y=c?
– e.g., Y = {NonSpam}, X_1 = {‘Nigeria’}
– Then P(X_1=a | Y=c) = 0
Thus, no matter what values X_2,…,X_d take:
– P(Y=c | X_1=a, X_2,…,X_d) = 0
What now???
Slide Credit: Carlos Guestrin

Recall MAP for Bernoulli-Beta
MAP: use the most likely parameter (see the estimate below)
Beta prior is equivalent to extra coin flips
Slide Credit: Carlos Guestrin
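The MAP estimate being recalled, for a Bernoulli parameter θ with a Beta(a, b) prior and n_H heads in n flips; this is the standard posterior mode, with symbols chosen here since the slide's formula is not in the transcript:

```latex
\hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta} \, p(\theta \mid \mathcal{D})
\;=\; \frac{n_H + a - 1}{\,n + a + b - 2\,}
```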

Bayesian learning for NB parameters – a.k.a. smoothing
Prior on parameters
– Dirichlet everything!
MAP estimate (see the sketch below)
Now, even if you never observe a feature/class combination, the posterior probability is never zero
Slide Credit: Carlos Guestrin
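A minimal sketch of the smoothed ("add-α") estimate that replaces the raw count ratio from the MLE slide; α = 1 gives Laplace smoothing, and the variable names are illustrative:

```python
def smoothed_likelihood(count_xy, count_y, n_values, alpha=1.0):
    """P(X_i = v | Y = c) under a symmetric Dirichlet(alpha) prior:
    (Count(X_i=v, Y=c) + alpha) / (Count(Y=c) + alpha * n_values).
    Strictly positive even for (value, class) pairs never seen in training."""
    return (count_xy + alpha) / (count_y + alpha * n_values)
```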

Text classification
Classify e-mails
– Y = {Spam, NotSpam}
Classify news articles
– Y = {what is the topic of the article?}
Classify webpages
– Y = {Student, professor, project, …}
What about the features X?
– The text!
Slide Credit: Carlos Guestrin

Features X are entire document – X_i for i-th word in article
Slide Credit: Carlos Guestrin

NB for Text classification
P(X|Y) is huge!!!
– Article has at least 1000 words: X = {X_1,…,X_1000}
– X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.
NB assumption helps a lot!!!
– P(X_i=x_i|Y=y) is just the probability of observing word x_i in a document on topic y
Slide Credit: Carlos Guestrin

Bag of Words model
Typical additional assumption: position in the document doesn’t matter: P(X_i=a|Y=y) = P(X_k=a|Y=y)
– “Bag of words” model – order of words on the page ignored
– Sounds really silly, but often works very well!
Example sentence: When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
Slide Credit: Carlos Guestrin

Bag of Words model
Typical additional assumption: position in the document doesn’t matter: P(X_i=x_i|Y=y) = P(X_k=x_i|Y=y)
– “Bag of words” model – order of words on the page ignored
– Sounds really silly, but often works very well!
The same sentence as a bag of words: in is lecture lecture next over person remember room sitting the the the to to up wake when you
Slide Credit: Carlos Guestrin

Bag of Words model
Example word-count vector: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, …, Zaire 0
Slide Credit: Carlos Guestrin
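A minimal bag-of-words featurizer in the spirit of the count table above; the whitespace tokenization and the fixed vocabulary are simplifying assumptions:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary.
    Word order is ignored, exactly as the bag-of-words assumption states."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical usage:
#   bag_of_words("the lecture is over", ["lecture", "over", "the", "zaire"])
#   -> [1, 1, 1, 0]
```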

Object → Bag of ‘words’
Slide Credit: Fei Fei Li

(figure only)
Slide Credit: Fei Fei Li

(figure) Bag-of-words recognition pipeline: feature detection & representation → codewords dictionary → image representation; learning yields category models (and/or) classifiers; recognition yields a category decision
Slide Credit: Fei Fei Li

What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel
Gaussian Naïve Bayes (GNB):
Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ)
Slide Credit: Carlos Guestrin
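The class-conditional Gaussian likelihood GNB uses, written out since the slide's formula is not in the transcript (i indexes the feature, k the class):

```latex
P(X_i = x \mid Y = y_k) \;=\; \frac{1}{\sigma_{ik}\sqrt{2\pi}}
\exp\!\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right)
```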

Estimating Parameters: Y discrete, X_i continuous
Maximum likelihood estimates: (see the sketch below)
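A minimal sketch of those maximum likelihood estimates: for each class, the mean and variance of each feature are simply the empirical mean and (biased) variance over that class's examples. Names and shapes here are illustrative:

```python
import numpy as np

def gnb_mle(X, y, n_classes):
    """MLE for Gaussian Naive Bayes: per-class, per-feature mean and variance.
    X: (N, d) real-valued features, y: (N,) integer class labels."""
    d = X.shape[1]
    mu = np.zeros((n_classes, d))
    var = np.zeros((n_classes, d))
    for c in range(n_classes):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)   # mu_{ic}: average of feature i over class-c examples
        var[c] = Xc.var(axis=0)   # sigma^2_{ic}: MLE (biased) variance, ddof=0 by default
    return mu, var
```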

What you need to know about NB
Optimal decision using the Bayes Classifier
Naïve Bayes classifier
– What’s the assumption
– Why we use it
– How do we learn it
– Why Bayesian estimation of NB parameters is important
Text classification
– Bag of words model
Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given the class

Generative vs. Discriminative
Generative Approach
– Estimate p(x|y) and p(y)
– Use Bayes Rule to predict y
Discriminative Approach
– Estimate p(y|x) directly, OR
– Learn a “discriminant” function h(x)

Today: Logistic Regression
Main idea
– Think about a 2-class problem {0,1}
– Can we regress to P(Y=1 | X=x)?
Meet the Logistic or Sigmoid function
– Crunches real numbers down to 0-1
Model
– In regression: y ~ N(w’x, λ²)
– Logistic Regression: y ~ Bernoulli(σ(w’x))
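The sigmoid and the resulting logistic regression model from this slide, in standard form:

```latex
\sigma(z) \;=\; \frac{1}{1 + e^{-z}}, \qquad
P(Y = 1 \mid x, w) \;=\; \sigma\!\left(w_0 + w^\top x\right)
```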

Understanding the sigmoid
(figure: three sigmoid curves) w_0=0, w_1=1; w_0=2, w_1=1; w_0=0, w_1=0.5
Slide Credit: Carlos Guestrin
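A small sketch that reproduces the three 1-D sigmoid curves σ(w_0 + w_1·x) listed above; the plotting details are assumptions, only the parameter settings come from the slide:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-6, 6, 200)
for w0, w1 in [(0, 1), (2, 1), (0, 0.5)]:
    # Larger w0 shifts the curve left (midpoint at x = -w0/w1);
    # smaller w1 makes the transition shallower.
    plt.plot(x, sigmoid(w0 + w1 * x), label=f"w0={w0}, w1={w1}")
plt.legend()
plt.xlabel("x")
plt.ylabel("P(Y=1 | x)")
plt.show()
```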