Crash Course on Machine Learning Part II

Crash Course on Machine Learning Part II Several slides from Luke Zettlemoyer, Carlos Guestrin, and Ben Taskar

We have talked about… MLE, MAP and conjugate priors, Naïve Bayes

Another probabilistic approach!!! Naïve Bayes: directly estimate the data distribution P(X,Y)! Challenging due to the size of the distribution! Make the Naïve Bayes assumption: only need P(X_i|Y)! But wait, we classify according to arg max_Y P(Y|X). Why not learn P(Y|X) directly?

Discriminative vs. generative. [Figure with three panels over x = data: a generative model ("the artist"), a discriminative model ("the lousy painter"), and the resulting classification function.]

Logistic Regression. Learn P(Y|X) directly! Assume a particular functional form: the sigmoid applied to a linear function of the data, P(Y=1|X) = 1 / (1 + exp(−z)) with z = w_0 + Σ_i w_i X_i. The logistic (sigmoid) function 1/(1+exp(−z)) squashes any real z into (0, 1).
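
To make the functional form concrete, here is a minimal NumPy sketch of the sigmoid applied to a linear function of the data; the weights and the input point are made-up values for illustration, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    """P(Y=1 | X=x) under the logistic regression model: sigmoid of a linear function of x."""
    return sigmoid(w0 + np.dot(w, x))

# Toy example with made-up weights for two features.
w, w0 = np.array([1.5, -2.0]), 0.3
x = np.array([0.8, 0.1])
print(p_y1_given_x(x, w, w0))  # a probability in (0, 1)
```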

Logistic Regression: decision boundary. Prediction: output the Y with the highest P(Y|X). For binary Y, output Y=0 if w·X + w_0 < 0 (and Y=1 otherwise). A linear classifier!

Loss functions / Learning Objectives: Likelihood vs. Conditional Likelihood. Generative (Naïve Bayes) loss function: the data likelihood, ln P(D | w) = Σ_j ln P(x^j, y^j | w). But the discriminative (logistic regression) loss function is the conditional data likelihood, ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w). Doesn't waste effort learning P(X); focuses on P(Y|X), all that matters for classification. Discriminative models cannot compute P(x^j | w)!

Conditional Log Likelihood: l(w) = Σ_j ln P(y^j | x^j, w) = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ]; the two expressions are equal because y^j is in {0,1}. Remaining steps: substitute the sigmoid definitions, expand the logs, and simplify.
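
A small sketch of this conditional log likelihood for labels in {0,1}, assuming the sigmoid form of P(Y=1|x, w) from the earlier slide; the data and weights below are synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(X, y, w, w0):
    """l(w) = sum_j [ y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w) ]."""
    p1 = sigmoid(w0 + X @ w)  # P(Y=1 | x_j, w) for every example
    return np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Synthetic data: 4 examples, 2 features, binary labels.
X = np.array([[0.5, 1.0], [1.2, -0.3], [-0.7, 0.8], [2.0, 1.5]])
y = np.array([1, 0, 0, 1])
print(conditional_log_likelihood(X, y, w=np.array([0.4, -0.1]), w0=0.0))
```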

Logistic Regression Parameter Estimation: Maximize the Conditional Log Likelihood, w* = arg max_w Σ_j ln P(y^j | x^j, w). Good news: l(w) is a concave function of w, so there are no locally optimal (non-global) solutions! Bad news: no closed-form solution for maximizing l(w). Good news: concave functions are "easy" to optimize.

Optimizing a concave function – Gradient ascent. The conditional likelihood for logistic regression is concave! Gradient ascent is the simplest of optimization approaches (e.g., conjugate gradient ascent is much better). Gradient: ∇_w l(w) = [∂l(w)/∂w_0, …, ∂l(w)/∂w_n]. Update rule: w^(t+1) = w^(t) + η ∇_w l(w) evaluated at w^(t), with learning rate η > 0.

Maximize Conditional Log Likelihood: Gradient ascent. Differentiating the conditional log likelihood gives ∂l(w)/∂w_i = Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ], i.e., feature value times (true label minus predicted probability), summed over training examples.

Gradient Ascent for LR. Gradient ascent algorithm (learning rate η > 0): do: w_0 ← w_0 + η Σ_j [ y^j − P(Y=1 | x^j, w) ]; for i = 1…n (iterate over weights): w_i ← w_i + η Σ_j x_i^j [ y^j − P(Y=1 | x^j, w) ]; until "change" < ε. Loop over training examples!
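
The loop above can be sketched roughly as follows; the learning rate, the convergence tolerance, and the toy dataset are illustrative choices rather than values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, eta=0.05, tol=1e-6, max_iters=10000):
    """Maximize the conditional log likelihood by batch gradient ascent.

    Gradient for each weight: dl/dw_i = sum_j x_i^j (y^j - P(Y=1|x^j, w)).
    """
    n_examples, n_features = X.shape
    w, w0 = np.zeros(n_features), 0.0
    for _ in range(max_iters):
        error = y - sigmoid(w0 + X @ w)  # y^j - P(Y=1 | x^j, w), per example
        grad_w0 = np.sum(error)
        grad_w = X.T @ error
        w0 += eta * grad_w0
        w += eta * grad_w
        if max(abs(grad_w0), np.max(np.abs(grad_w))) < tol:  # "change" < eps
            break
    return w, w0

# Toy run on a small linearly separable dataset.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, -1.0], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])
w, w0 = fit_logistic_gradient_ascent(X, y)
print(np.round(sigmoid(w0 + X @ w)))  # predicted labels; should match y on this toy set
```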

Large parameters… The maximum likelihood solution prefers higher weights: higher likelihood of (properly classified) examples close to the decision boundary, and larger influence of the corresponding features on the decision, which can cause overfitting!!! Regularization: penalize high weights (again, more on this later in the quarter).

How about MAP? One common approach is to define a prior on w: a normal distribution with zero mean and identity covariance. Often called regularization; helps avoid very large weights and overfitting. MAP estimate: w* = arg max_w ln [ p(w) ∏_j P(y^j | x^j, w) ].

M(C)AP as Regularization. Add log p(w) to the objective: for a zero-mean Gaussian prior, ln p(w) = −(λ/2) Σ_i w_i² + const. Quadratic penalty: drives weights towards zero. Adds a negative linear term, −λ w_i, to each weight's gradient.
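
A hedged sketch of how the quadratic penalty changes a single gradient-ascent step, assuming a zero-mean Gaussian prior with precision λ (λ is an illustrative hyperparameter; the bias w_0 is left unregularized by convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient_step(X, y, w, w0, eta=0.1, lam=1.0):
    """One gradient-ascent step on  log p(w) + sum_j log P(y^j | x^j, w)
    with a zero-mean Gaussian prior, i.e. log p(w) = -(lam/2) * ||w||^2 + const.
    The prior contributes an extra -lam * w_i term to each weight's gradient."""
    error = y - sigmoid(w0 + X @ w)
    w0_new = w0 + eta * np.sum(error)            # bias left unregularized
    w_new = w + eta * (X.T @ error - lam * w)    # quadratic penalty pulls w toward zero
    return w_new, w0_new

# Example step on synthetic data with made-up starting weights.
X = np.array([[0.5, 1.0], [1.2, -0.3], [-0.7, 0.8]])
y = np.array([1, 0, 0])
print(map_gradient_step(X, y, w=np.array([0.2, -0.4]), w0=0.0, lam=0.5))
```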

MLE vs. MAP. Maximum conditional likelihood estimate: w*_MLE = arg max_w Σ_j ln P(y^j | x^j, w). Maximum conditional a posteriori estimate: w*_MAP = arg max_w [ ln p(w) + Σ_j ln P(y^j | x^j, w) ], which for the Gaussian prior subtracts the quadratic penalty (λ/2) Σ_i w_i² from the objective.

Logistic regression vs. Naïve Bayes. Consider learning f: X → Y, where X is a vector of real-valued features <X_1 … X_n> and Y is boolean. Could use a Gaussian Naïve Bayes classifier: assume all X_i are conditionally independent given Y; model P(X_i | Y = y_k) as Gaussian N(μ_ik, σ_i); model P(Y) as Bernoulli(q, 1−q). What does that imply about the form of P(Y|X)? Cool!!!!

Derive the form of P(Y|X) for continuous X_i: P(Y=1|X) = P(Y=1)P(X|Y=1) / [P(Y=1)P(X|Y=1) + P(Y=0)P(X|Y=0)] = 1 / (1 + exp( ln(P(Y=0)/P(Y=1)) + Σ_i ln(P(X_i|Y=0)/P(X_i|Y=1)) )). Up to now, all the arithmetic holds for any Naïve Bayes model. The prior-ratio term looks like a setting for w_0; can we solve for the w_i? Yes, but only in the Gaussian case.

Ratio of class-conditional probabilities: for Gaussian X_i with class-independent variances, ln(P(X_i|Y=0)/P(X_i|Y=1)) = ((μ_i0 − μ_i1)/σ_i²) X_i + (μ_i1² − μ_i0²)/(2σ_i²). A linear function of X_i! Coefficients expressed with the original Gaussian parameters! …

Derive the form of P(Y|X) for continuous X_i (continued): putting the pieces together, P(Y=1|X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)), with w_i = (μ_i0 − μ_i1)/σ_i² and w_0 = ln((1−π)/π) + Σ_i (μ_i1² − μ_i0²)/(2σ_i²), where π = P(Y=1). Exactly the logistic regression form (up to the sign convention of the weights).
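
To make "coefficients expressed with the original Gaussian parameters" concrete, here is a small sketch that plugs class means, shared per-feature variances, and the class prior into the linear form above, and checks it against the generative computation of P(Y=1|x); all parameter values are made up for illustration.

```python
import numpy as np

def gnb_to_lr_weights(mu0, mu1, sigma2, pi1):
    """Map Gaussian Naive Bayes parameters (class-independent variances) to the
    weights of the implied logistic form  P(Y=1|X) = 1 / (1 + exp(w0 + w . x)).

    mu0, mu1 : per-feature means for class 0 and class 1
    sigma2   : per-feature variances (shared across classes)
    pi1      : prior P(Y=1)
    """
    w = (mu0 - mu1) / sigma2
    w0 = np.log((1 - pi1) / pi1) + np.sum((mu1**2 - mu0**2) / (2 * sigma2))
    return w, w0

def p_y1_gnb(x, mu0, mu1, sigma2, pi1):
    """P(Y=1|x) computed directly from the generative model, for comparison."""
    log_ratio = np.log((1 - pi1) / pi1) + np.sum(
        ((x - mu1) ** 2 - (x - mu0) ** 2) / (2 * sigma2))
    return 1.0 / (1.0 + np.exp(log_ratio))

# Made-up GNB parameters for two features.
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
sigma2, pi1 = np.array([1.0, 0.5]), 0.3
w, w0 = gnb_to_lr_weights(mu0, mu1, sigma2, pi1)
x = np.array([1.2, 0.4])
print(1.0 / (1.0 + np.exp(w0 + w @ x)), p_y1_gnb(x, mu0, mu1, sigma2, pi1))  # identical
```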

Gaussian Naïve Bayes vs. Logistic Regression. Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ set of Logistic Regression parameters. Can go both ways; we only did one way. Representation equivalence, but only in a special case!!! (GNB with class-independent variances). But what's the difference??? LR makes no assumptions about P(X|Y) in learning!!! Loss function!!! Optimize different functions → obtain different solutions.

Naïve Bayes vs. Logistic Regression Consider Y boolean, Xi continuous, X=<X1 ... Xn> Number of parameters: Naïve Bayes: 4n +1 Logistic Regression: n+1 Estimation method: Naïve Bayes parameter estimates are uncoupled Logistic Regression parameter estimates are coupled

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]. Generative vs. discriminative classifiers: asymptotic comparison (# training examples → infinity). When the model is correct, GNB (with class-independent variances) and LR produce identical classifiers. When the model is incorrect, LR is less biased: it does not assume conditional independence, therefore LR is expected to outperform GNB.

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]. Generative vs. discriminative classifiers: non-asymptotic analysis of the convergence rate of parameter estimates (n = # of attributes in X), i.e., the size of training data needed to get close to the infinite-data solution. Naïve Bayes needs O(log n) samples; Logistic Regression needs O(n) samples. GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

What you should know about Logistic Regression (LR). Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solution differs because of the objective (loss) function. In general, NB and LR make different assumptions: NB assumes features are independent given the class → an assumption on P(X|Y); LR assumes only the functional form of P(Y|X), with no assumption on P(X|Y). LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by conditional likelihood: no closed-form solution, but concave → global optimum with gradient ascent. Maximum conditional a posteriori corresponds to regularization. Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.

Decision Boundary

Voting (Ensemble Methods). Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data. Output class: (weighted) vote of each classifier. Classifiers that are most "sure" will vote with more conviction, and classifiers will be most "sure" about a particular part of the space. On average, do better than a single classifier! But how??? Force classifiers to learn about different parts of the input space? Use different subsets of the data? Weigh the votes of different classifiers?

BAGGing = Bootstrap AGGregation (Breiman, 1996). For i = 1, 2, …, K: T_i ← randomly select M training instances with replacement; h_i ← learn(T_i) [ID3, NB, kNN, neural net, …]. Now combine the h_i together with uniform voting (w_i = 1/K for all i).
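
A rough sketch of the procedure, assuming scikit-learn is available and using its DecisionTreeClassifier as the base learn() routine (any other learner would do); K, M, and the toy data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner for illustration

def bagging_fit(X, y, K=25, M=None, seed=None):
    """BAGGing: learn K classifiers, each on M instances sampled with replacement."""
    rng = np.random.default_rng(seed)
    M = M or len(X)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=M)  # bootstrap sample T_i
        ensemble.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Uniform voting: each h_i gets weight 1/K."""
    votes = np.stack([h.predict(X) for h in ensemble])  # shape (K, n_examples)
    return (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote for 0/1 labels

# Toy data: label is 1 when x0 + x1 > 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
ensemble = bagging_fit(X, y, K=25, seed=0)
print((bagging_predict(ensemble, X) == y).mean())  # training accuracy of the vote
```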

Decision Boundary

shades of blue/red indicate strength of vote for particular classification

Fighting the bias-variance tradeoff. Simple (a.k.a. weak) learners are good (e.g., naïve Bayes, logistic regression, decision stumps or shallow decision trees): low variance, don't usually overfit. Simple (a.k.a. weak) learners are bad: high bias, can't solve hard learning problems. Can we make weak learners always good??? No!!! But often yes…

Boosting [Schapire, 1989]. Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t: weight each training example by how incorrectly it was classified; learn a hypothesis h_t and a strength for this hypothesis, α_t. Final classifier: a weighted vote, H(x) = sign(Σ_t α_t h_t(x)). Practically useful and theoretically interesting.

time = 0. Blue/red = class; size of dot = weight; weak learner = decision stump: a horizontal or vertical line.

time = 1 this hypothesis has 15% error and so does this ensemble, since the ensemble contains just this one hypothesis

time = 2

time = 3

time = 13

time = 100

time = 300 overfitting!!

Learning from weighted data. Consider a weighted dataset: D(i) is the weight of the i-th training example (x_i, y_i). Interpretations: the i-th training example counts as if it occurred D(i) times; if I were to "resample" the data, I would get more samples of "heavier" data points. Now, always do weighted calculations: e.g., for Naïve Bayes MLE, redefine Count(Y=y) to be the weighted count Count(Y=y) = Σ_j D(j) δ(y^j = y). Setting D(j) = 1 (or any constant value!) for all j recreates the unweighted case.
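
A tiny sketch of the weighted count, with D as a vector of per-example weights (the labels and weights below are made up):

```python
import numpy as np

def weighted_count(y, D, label):
    """Weighted Count(Y = label) = sum_j D(j) * 1[y_j == label]."""
    return np.sum(D * (y == label))

y = np.array([1, 0, 1, 1, 0])
D = np.array([0.4, 0.1, 0.1, 0.3, 0.1])         # example weights, summing to 1
print(weighted_count(y, D, label=1))             # ≈ 0.8
print(weighted_count(y, np.ones_like(y), 1))     # D(j)=1 recovers the unweighted count: 3
```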

How? Many possibilities. Will see one shortly! Final Result: linear sum of “base” or “weak” classifier outputs.

What t to choose for hypothesis ht? [Schapire, 1989] Idea: choose t to minimize a bound on training error! Where

What t to choose for hypothesis ht? [Schapire, 1989] Idea: choose t to minimize a bound on training error! Where And If we minimize t Zt, we minimize our training error!!! We can tighten this bound greedily, by choosing t and ht on each iteration to minimize Zt. ht is estimated as a black box, but can we solve for t? This equality isn’t obvious! Can be shown with algebra (telescoping sums)!

Summary: choose t to minimize error bound [Schapire, 1989] We can squeeze this bound by choosing t on each iteration to minimize Zt. For boolean Y: differentiate, set equal to 0, there is a closed form solution! [Freund & Schapire ’97]:

Strong, weak classifiers. If each classifier is (at least slightly) better than random, ε_t < 0.5, another bound on the training error holds: (1/m) Σ_j δ(H(x^j) ≠ y^j) ≤ ∏_t Z_t ≤ exp(−2 Σ_t (1/2 − ε_t)²). What does this imply about the training error? It will reach zero! And it will get there exponentially fast! Is it hard to achieve better-than-random training error? Why will it reach zero? Our bound is exp(−x) for an ever-increasing x!!! (as long as we stay better than 50%)
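
As a small numeric illustration of this bound (the per-round errors are assumed values): if every round's weighted error ε_t is at most 0.4, the bound exp(−2 Σ_t (1/2 − ε_t)²) shrinks geometrically with T.

```python
import numpy as np

def training_error_bound(epsilons):
    """Bound on training error: exp(-2 * sum_t (1/2 - eps_t)^2)."""
    gammas = 0.5 - np.asarray(epsilons)
    return np.exp(-2.0 * np.sum(gammas ** 2))

for T in (10, 50, 200):
    print(T, training_error_bound([0.4] * T))   # ≈ 0.82, 0.37, 0.018
```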

Boosting results – digit recognition [Schapire, 1989]. [Figure: test error and training error vs. number of boosting rounds.] Boosting seems to be robust to overfitting: the test error can decrease even after the training error is zero!!!

Boosting generalization error bound [Freund & Schapire, 1996]: error_true(H) ≤ error_train(H) + Õ(√(T·d / m)). Constants: T = number of boosting rounds: higher T → looser bound (what does this imply?); d = VC dimension of the weak learner, measuring the complexity of the classifier: higher d → bigger hypothesis space → looser bound; m = number of training examples: more data → tighter bound. Theory does not match practice: boosting is robust to overfitting, and the test-set error decreases even after the training error is zero. We need better analysis tools; we'll come back to this later in the quarter.

Logistic Regression as Minimizing a Loss. Logistic regression assumes P(Y=y | X=x, w) = 1 / (1 + exp(−y (w_0 + w·x))) and tries to maximize the conditional data likelihood, for Y = {−1, +1}: max_w ∏_j P(y^j | x^j, w). Equivalent to minimizing the log loss: Σ_j ln(1 + exp(−y^j (w_0 + w·x^j))).

Boosting and Logistic Regression. Logistic regression is equivalent to minimizing the log loss Σ_j ln(1 + exp(−y^j f(x^j))), while boosting minimizes a similar loss function, the exponential loss (1/m) Σ_j exp(−y^j f(x^j)). Both are smooth approximations of the 0/1 loss!
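
A short sketch comparing the two surrogate losses and the 0/1 loss as functions of the margin y·f(x); the margin values are arbitrary examples.

```python
import numpy as np

def zero_one_loss(margin):
    """Per-example 0/1 error indicator as a function of the margin y * f(x)."""
    return (margin <= 0).astype(float)

def log_loss(margin):
    """Logistic regression's per-example loss: ln(1 + exp(-y f(x)))."""
    return np.log(1.0 + np.exp(-margin))

def exp_loss(margin):
    """Boosting's per-example loss: exp(-y f(x))."""
    return np.exp(-margin)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # y * f(x) for a few hypothetical examples
print(zero_one_loss(margins))
print(np.round(log_loss(margins), 3))   # smooth surrogate: small for confident correct predictions
print(np.round(exp_loss(margins), 3))   # smooth surrogate: penalizes confident mistakes heavily
```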