ECE 5424: Introduction to Machine Learning

Presentation transcript:

ECE 5424: Introduction to Machine Learning
Topics:
- Classification: Naïve Bayes
Readings: Barber 10.1-10.3
Stefan Lee
Virginia Tech

Administrivia
HW2
- Due: Friday 09/28, 10/3, 11:55pm
- Implement linear regression, Naïve Bayes, Logistic Regression
Next Tuesday's Class
- Review of topics
- Assigned readings on convexity, with an optional (useful) video
- Might be on the exam, so brush up on this and on stochastic gradient descent
(C) Dhruv Batra

Administrivia
Midterm Exam
- When: October 6th, class timing
- Where: In class
- Format: Pen-and-paper. Open-book, open-notes, closed-internet. No sharing.
- What to expect: a mix of multiple-choice or true/false questions, "Prove this statement", "What would happen for this dataset?"
- Material: everything from the beginning of class to Tuesday's lecture
(C) Dhruv Batra

New Topic: Naïve Bayes (your first probabilistic classifier)
Classification: input x, discrete output y
(C) Dhruv Batra

Error Decomposition
- Approximation/Modeling Error: you approximated reality with a model
- Estimation Error: you tried to learn the model with finite data
- Optimization Error: you were lazy and couldn't/didn't optimize to completion
- Bayes Error: reality just sucks
(C) Dhruv Batra

Classification
Learn: h: X → Y
- X – features
- Y – target classes
Suppose you know P(Y|X) exactly; how should you classify?
Bayes classifier: h_Bayes(x) = argmax_y P(Y=y | X=x)
Why?
Slide Credit: Carlos Guestrin

Optimal classification
Theorem: the Bayes classifier h_Bayes is optimal! That is, error_true(h_Bayes) ≤ error_true(h) for every classifier h.
Proof: see the sketch below.
Slide Credit: Carlos Guestrin
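The derivation on the original slide was an image; the following is a standard reconstruction of the argument, not a verbatim copy of the slide:

```latex
\[
P\big(h(x)\neq Y \mid X=x\big)
  = 1 - P\big(Y=h(x)\mid X=x\big)
  \;\ge\; 1 - \max_{y} P\big(Y=y\mid X=x\big)
  = P\big(h_{\mathrm{Bayes}}(x)\neq Y \mid X=x\big)
\]
```

This holds for every classifier h and every input x, so taking the expectation over x gives error_true(h_Bayes) ≤ error_true(h).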

Generative vs. Discriminative
Generative Approach
- Estimate p(x|y) and p(y)
- Use Bayes Rule to predict y
Discriminative Approach
- Estimate p(y|x) directly, OR
- Learn a "discriminant" function h(x)
(C) Dhruv Batra

Generative vs. Discriminative
Generative Approach
- Assume some functional form for P(X|Y), P(Y)
- Estimate P(X|Y) and P(Y)
- Use Bayes Rule to calculate P(Y|X=x)
- Indirect computation of P(Y|X) through Bayes rule
- But can generate a sample of the data, since P(X) = Σ_y P(y) P(X|y)
Discriminative Approach
- Estimate p(y|x) directly, OR
- Learn a "discriminant" function h(x)
- Direct, but cannot obtain a sample of the data, because P(X) is not available
(C) Dhruv Batra
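As a minimal sketch of the generative recipe above (hypothetical toy tables, not the lecture's code), in Python:

```python
# Toy generative model: class prior p(y) and per-class likelihood p(x|y) for a
# single categorical feature with values "a"/"b". All numbers are made up.
p_y = {0: 0.6, 1: 0.4}
p_x_given_y = {0: {"a": 0.7, "b": 0.3},
               1: {"a": 0.2, "b": 0.8}}

def p_x(x):
    # Marginal p(x) = sum_y p(y) p(x|y): this is what lets a generative model score/sample data.
    return sum(p_y[y] * p_x_given_y[y][x] for y in p_y)

def predict(x):
    # Bayes rule: p(y|x) is proportional to p(y) p(x|y); the normalizer p(x) cancels in the argmax.
    return max(p_y, key=lambda y: p_y[y] * p_x_given_y[y][x])

print(p_x("a"), predict("a"))  # 0.5, class 0
```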

Generative vs. Discriminative
- Generative: today, Naïve Bayes
- Discriminative: next, Logistic Regression
- NB & LR are related to each other
(C) Dhruv Batra

How hard is it to learn the optimal classifier?
Categorical data: how do we represent these? How many parameters?
- Class-Prior, P(Y): suppose Y is composed of k classes; then k-1 parameters
- Likelihood, P(X|Y): suppose X is composed of d binary features; then (2^d - 1) parameters for each class
Complex model → high variance with limited data!!!
Slide Credit: Carlos Guestrin
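A quick back-of-the-envelope check of these counts (a sketch; the numbers are standard, not from the slide): with d binary features and k classes, the class prior has k-1 free parameters and the full joint likelihood has (2^d - 1) per class, while Naïve Bayes (introduced on the next slides) needs only d per class.

```python
def full_joint_params(d, k):
    # P(Y): k-1 free parameters; full P(X|Y): 2^d - 1 free parameters per class.
    return (k - 1) + k * (2 ** d - 1)

def naive_bayes_params(d, k):
    # P(Y): k-1; P(X_i|Y): one Bernoulli parameter per feature per class.
    return (k - 1) + k * d

print(full_joint_params(d=30, k=2))   # 2147483647 -> hopeless with limited data
print(naive_bayes_params(d=30, k=2))  # 61
```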

Independence to the rescue (C) Dhruv Batra Slide Credit: Sam Roweis

The Naïve Bayes assumption
Features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
More generally: P(X1, …, Xd | Y) = ∏_i P(Xi | Y)
How many parameters now? Suppose X is composed of d binary features: just d parameters per class (d·k in total), instead of (2^d - 1)·k.
(C) Dhruv Batra Slide Credit: Carlos Guestrin

The Naïve Bayes Classifier
Given:
- Class-Prior P(Y)
- d conditionally independent features X given the class Y
- For each Xi, the likelihood P(Xi|Y)
Decision rule: y* = h_NB(x) = argmax_y P(Y=y) ∏_i P(Xi=xi | Y=y)
If the assumption holds, NB is the optimal classifier!
(C) Dhruv Batra Slide Credit: Carlos Guestrin
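A minimal Python sketch of this decision rule (illustrative names, computed in log space to avoid underflow; not the course's reference implementation):

```python
import math

def nb_predict(x, priors, likelihoods):
    """x: list of 0/1 feature values.
    priors: dict mapping class y -> P(Y=y).
    likelihoods: dict mapping class y -> list of P(X_i=1 | Y=y), one per feature.
    Returns argmax_y P(Y=y) * prod_i P(X_i=x_i | Y=y)."""
    best_y, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for xi, p1 in zip(x, likelihoods[y]):
            score += math.log(p1 if xi == 1 else 1.0 - p1)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```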

MLE for the parameters of NB
Given a dataset: Count(A=a, B=b) is the number of examples where A=a and B=b
MLE for NB is simply:
- Class-Prior: P(Y=y) = Count(Y=y) / N
- Likelihood: P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)
(C) Dhruv Batra
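A sketch of these counting estimates for binary features (assumed data format: a list of (x, y) pairs with x a 0/1 feature list; names are illustrative):

```python
from collections import Counter, defaultdict

def nb_mle(data, num_features):
    class_counts = Counter(y for _, y in data)
    n = len(data)
    # Class-prior: P(Y=y) = Count(Y=y) / N
    priors = {y: c / n for y, c in class_counts.items()}

    # Count(X_i=1, Y=y) for every feature i and class y
    ones = defaultdict(lambda: [0] * num_features)
    for x, y in data:
        for i, xi in enumerate(x):
            ones[y][i] += xi
    # Likelihood: P(X_i=1 | Y=y) = Count(X_i=1, Y=y) / Count(Y=y)
    likelihoods = {y: [ones[y][i] / class_counts[y] for i in range(num_features)]
                   for y in class_counts}
    return priors, likelihoods
```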

HW1 (C) Dhruv Batra

Naïve Bayes: in-class demo
Variables:
- Y = {haven't watched an American football game, have watched an American football game}
- X1 = {domestic, international}
- X2 = {<2 years at VT, >=2 years}
Estimate:
- P(Y=1)
- P(X1=0 | Y=0), P(X2=0 | Y=0)
- P(X1=0 | Y=1), P(X2=0 | Y=1)
Prediction: argmax_y P(Y=y) P(x1|Y=y) P(x2|Y=y)
(C) Dhruv Batra

Subtleties of NB classifier 1 – Violating the NB assumption
Usually, features are not conditionally independent: P(X1, …, Xd | Y) ≠ ∏_i P(Xi | Y)
Probabilities P(Y|X) are then often biased towards 0 or 1
Nonetheless, NB is a very popular classifier: NB often performs well, even when the assumption is violated
[Domingos & Pazzani '96] discuss some conditions for good performance
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Subtleties of NB classifier 2 – Insufficient training data
What if you never see a training instance where X1=a when Y=c?
- e.g., Y={NonSpamEmail}, X1={'Nigeria'}
- Then P(X1=a | Y=c) = 0
Thus, no matter what values X2, …, Xd take: P(Y=c | X1=a, X2, …, Xd) = 0
What now???
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Recall MAP for Bernoulli-Beta
MAP: use the most likely parameter (the posterior mode)
A Beta prior is equivalent to having seen extra coin flips
(C) Dhruv Batra Slide Credit: Carlos Guestrin
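The formula this slide refers to was an image; this is the standard Beta-Bernoulli MAP result, stated under the usual Beta(β_H, β_T) parameterization (the deck's own notation may differ):

```latex
\[
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid \mathcal{D})
  = \frac{n_H + \beta_H - 1}{\,n_H + n_T + \beta_H + \beta_T - 2\,}
\]
```

so the prior behaves like β_H - 1 extra heads and β_T - 1 extra tails added to the observed counts n_H, n_T.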

Bayesian learning for NB parameters – a.k.a. smoothing
Prior on parameters: Dirichlet all the things!
Use the MAP estimate
Now, even if you never observe a feature/class combination, the posterior probability is never zero
(C) Dhruv Batra Slide Credit: Carlos Guestrin
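A minimal sketch of this smoothing for binary features, adding a pseudo-count alpha everywhere (the symmetric Dirichlet/Beta special case; variable names are illustrative):

```python
from collections import Counter, defaultdict

def nb_map(data, num_features, alpha=1.0):
    """Smoothed NB estimates: alpha acts like alpha extra observations of each value."""
    class_counts = Counter(y for _, y in data)
    n, k = len(data), len(class_counts)
    priors = {y: (c + alpha) / (n + alpha * k) for y, c in class_counts.items()}

    ones = defaultdict(lambda: [0] * num_features)
    for x, y in data:
        for i, xi in enumerate(x):
            ones[y][i] += xi
    # Each Bernoulli gets alpha pseudo-counts for "1" and alpha for "0",
    # so P(X_i=1|Y=y) is never exactly 0 or 1.
    likelihoods = {y: [(ones[y][i] + alpha) / (class_counts[y] + 2 * alpha)
                       for i in range(num_features)]
                   for y in class_counts}
    return priors, likelihoods
```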

Text classification
- Classify e-mails: Y = {Spam, NotSpam}
- Classify news articles: Y = {what is the topic of the article?}
- Classify webpages: Y = {Student, professor, project, …}
What about the features X? The text!
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Features X are the entire document – Xi is the ith word in the article
(C) Dhruv Batra Slide Credit: Carlos Guestrin

NB for Text classification
P(X|Y) is huge!!!
- An article has at least 1000 words: X = {X1, …, X1000}
- Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., the Webster Dictionary (or more): 10,000 words, etc.
The NB assumption helps a lot!!!
- P(Xi=xi | Y=y) is just the probability of observing word xi in a document on topic y
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Bag of Words model
Typical additional assumption: position in the document doesn't matter: P(Xi=a|Y=y) = P(Xk=a|Y=y)
- "Bag of words" model – order of words on the page ignored
- Sounds really silly, but often works very well!
Example sentence: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Bag of Words model
Typical additional assumption: position in the document doesn't matter: P(Xi=a|Y=y) = P(Xk=a|Y=y)
- "Bag of words" model – order of words on the page ignored
- Sounds really silly, but often works very well!
The same sentence as a bag of words: "in is lecture lecture next over person remember room sitting the the the to to up wake when you"
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Bag of Words model
A document as a vector of word counts over the vocabulary:
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, oil 1, …, Zaire 0
(C) Dhruv Batra Slide Credit: Carlos Guestrin
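A minimal sketch of turning a document into such a count vector (the document string below is made up so that it reproduces the counts listed on the slide):

```python
from collections import Counter

vocab = ["aardvark", "about", "all", "africa", "apple", "anxious", "gas", "oil", "zaire"]

def bag_of_words(text, vocab=vocab):
    counts = Counter(text.lower().split())   # word order is thrown away here
    return [counts[w] for w in vocab]

doc = "about all about Africa gas oil all"
print(bag_of_words(doc))  # [0, 2, 2, 1, 0, 0, 1, 1, 0]
```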

Object Bag of ‘words’ (C) Dhruv Batra Slide Credit: Fei Fei Li

(C) Dhruv Batra Slide Credit: Fei Fei Li

[Diagram: the visual bag-of-words pipeline: feature detection & representation → codewords dictionary → image representation; learning yields category models (and/or) classifiers; recognition yields a category decision.]
(C) Dhruv Batra Slide Credit: Fei Fei Li

What if we have continuous Xi?
E.g., character recognition: Xi is the ith pixel
Gaussian Naïve Bayes (GNB): P(Xi=x | Y=y_k) is a Gaussian with mean μ_ik and standard deviation σ_ik
Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of Xi (i.e., σ_k), or both (i.e., σ)
(C) Dhruv Batra Slide Credit: Carlos Guestrin

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
- Mean: μ̂_ik = average of Xi over the examples with Y = y_k
- Variance: σ̂²_ik = average of (Xi - μ̂_ik)² over the examples with Y = y_k
(C) Dhruv Batra
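A sketch of these estimates and the resulting Gaussian likelihood (plain Python, illustrative names; a real implementation would vectorize and guard against zero variance):

```python
import math

def gnb_fit(X, y):
    """X: list of feature lists (floats); y: list of class labels.
    Returns per-class (mean, variance) lists, one entry per feature."""
    params = {}
    for c in set(y):
        rows = [x for x, yc in zip(X, y) if yc == c]
        n, d = len(rows), len(rows[0])
        mu = [sum(r[i] for r in rows) / n for i in range(d)]
        var = [sum((r[i] - mu[i]) ** 2 for r in rows) / n for i in range(d)]
        params[c] = (mu, var)
    return params

def gaussian_pdf(x, mu, var):
    # Likelihood P(X_i = x | Y = y) under the fitted Gaussian.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```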

What you need to know about NB
- Optimal decision using the Bayes Classifier
- Naïve Bayes classifier: what the assumption is, why we use it, how we learn it
- Why Bayesian estimation of NB parameters is important
- Text classification: bag of words model
- Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class
(C) Dhruv Batra

Generative vs. Discriminative (recap)
Generative Approach
- Estimate p(x|y) and p(y)
- Use Bayes Rule to predict y
Discriminative Approach
- Estimate p(y|x) directly, OR
- Learn a "discriminant" function h(x)
(C) Dhruv Batra

Today: Logistic Regression
Main idea:
- Think about a 2-class problem {0,1}
- Can we regress to P(Y=1 | X=x)?
- Meet the logistic or sigmoid function: it crunches real numbers down to (0,1)
Model:
- In (linear) regression: y ~ N(w'x, λ²)
- Logistic Regression: y ~ Bernoulli(σ(w'x))
(C) Dhruv Batra
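A minimal sketch of the logistic regression model on this slide (weights are placeholders; this is just the prediction side, not the learning):

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def p_y1_given_x(x, w, w0=0.0):
    # P(Y=1 | x) = sigmoid(w0 + w'x); y ~ Bernoulli of this probability.
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

def classify(x, w, w0=0.0):
    return 1 if p_y1_given_x(x, w, w0) >= 0.5 else 0
```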

Understanding the sigmoid
Plots of σ(w0 + w1·x) for: w0=2, w1=1; w0=0, w1=1; w0=0, w1=0.5
(C) Dhruv Batra Slide Credit: Carlos Guestrin
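Evaluating σ(w0 + w1·x) at a few points for the three settings on this slide (a sketch to read off the qualitative effect: w0 shifts the curve, w1 controls its steepness):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for w0, w1 in [(2, 1), (0, 1), (0, 0.5)]:
    vals = [round(sigmoid(w0 + w1 * x), 2) for x in (-4, -2, 0, 2, 4)]
    print(f"w0={w0}, w1={w1}:", vals)
# (2, 1) is the (0, 1) curve shifted to the left; (0, 0.5) is a flatter version of (0, 1).
```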