10-601 Introduction to Machine Learning, School of Computer Science
Logistic Regression
Matt Gormley, Lecture 4, September 12, 2016
Readings: Murphy Ch. 8.1-3, 8.6; Elkan (2014) Notes
Slides: Courtesy William Cohen

Reminders
Homework 2: extension, due Friday (9/16) at 5:30pm
Recitation schedule posted on course website

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Background: Hyperplanes Why don’t we drop the generative model and try to learn this hyperplane directly?

Background: Hyperplanes
Hyperplane (Definition 1): the set of points x with w · x = b, for a weight vector w and intercept b.
Hyperplane (Definition 2): absorb the intercept by adding a constant feature x_0 = 1, so the set becomes {x : w · x = 0}.
Half-spaces: the hyperplane splits the space into {x : w · x > 0} and {x : w · x < 0}; a linear classifier predicts by which half-space x falls in.

Learning as Optimization

Learning as optimization: the main idea today
Make two assumptions:
- the classifier is linear, like naïve Bayes, i.e. roughly of the form f_w(x) = sign(x · w)
- we want to find the classifier w that "fits the data best"
Formalize this as an optimization problem:
- Pick a loss function J(w, D), and find argmin_w J(w, D)
- OR: pick a "goodness" function f_D(w), often Pr(D|w), and find argmax_w f_D(w)

Learning as optimization: warmup
Find: the optimal (MLE) parameter θ of a binomial
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1, and k of them are 1

Learning as optimization: warmup
Goal: find the best parameter θ of a binomial.
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1, and k of them are 1.
Maximize Pr(D|θ) = θ^k (1 − θ)^(n−k). Setting the derivative to zero,
d/dθ [ θ^k (1 − θ)^(n−k) ] = k θ^(k−1) (1 − θ)^(n−k) − (n − k) θ^k (1 − θ)^(n−k−1) = 0,
the solutions are θ = 0, θ = 1, or
k(1 − θ) − (n − k)θ = 0
k − kθ − nθ + kθ = 0
nθ = k
θ = k/n

Learning as optimization with gradient ascent
Goal: learn the parameter w of …
Dataset: D = {(x_1, y_1), …, (x_n, y_n)}
- Use your model to define Pr(D|w) = …
- Set w to maximize the likelihood.
Usually we use numeric methods to find the optimum, i.e. gradient ascent: repeatedly take a small step in the direction of the gradient.
(The difference between ascent and descent is only the sign; I'm going to be sloppy here about that.)

Gradient ascent
To find argmax_x f(x):
- Start with x_0
- For t = 1, 2, …: x_{t+1} = x_t + λ f'(x_t), where λ is small
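To make the update rule concrete, here is a minimal sketch (not from the slides) that applies it to the binomial warm-up above; the helper name gradient_ascent, the step size, and the toy values n = 20, k = 7 are all illustrative.

```python
def gradient_ascent(grad, x0, lam=0.001, steps=5000):
    """Repeatedly take a small step in the direction of the gradient:
    x_{t+1} = x_t + lam * grad(x_t)."""
    x = x0
    for _ in range(steps):
        x = x + lam * grad(x)
    return x

# Warm-up from the previous slide: maximize log Pr(D|theta) = k log(theta) + (n-k) log(1-theta),
# whose derivative k/theta - (n-k)/(1-theta) vanishes at theta = k/n.
n, k = 20, 7  # hypothetical dataset: n coin flips, k of them heads
d_log_lik = lambda theta: k / theta - (n - k) / (1 - theta)

theta_hat = gradient_ascent(d_log_lik, x0=0.5)
print(theta_hat, k / n)  # both should be close to 0.35
```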

Gradient descent Likelihood: ascent Loss: descent

Pros and cons of gradient descent
- Simple and often quite effective on ML tasks
- Often very scalable
- Only applies to smooth (differentiable) functions
- Might find a local minimum, rather than a global one

Pros and cons of gradient descent There is only one local optimum if the function is convex

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Learning Linear Classifiers with Gradient Ascent. Part 1: The Idea

Using gradient ascent for linear classifiers
Replace sign(x · w) with something differentiable: e.g. the logistic(x · w).
[Figure: the step function sign(x) next to the smooth logistic curve.]

Using gradient ascent for linear classifiers
- Replace sign(x · w) with something differentiable: e.g. the logistic(x · w), where logistic(z) = 1 / (1 + e^(−z)).
- Define a function on the data that formalizes "how well w predicts the data" (a loss function).
- Assume y = 0 or y = 1. Our optimization target is the log of the conditional likelihood:
  LCL(w) = Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ], with p_i = logistic(x_i · w).
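A minimal sketch of these two pieces in code, assuming NumPy arrays and {0, 1} labels; the function names are my own.

```python
import numpy as np

def logistic(z):
    """The differentiable replacement for sign: logistic(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def log_cond_likelihood(w, X, y):
    """LCL(w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],
    with p_i = logistic(x_i . w) and y_i in {0, 1}."""
    p = logistic(X @ w)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
```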

Using gradient ascent for linear classifiers
- Replace sign(x · w) with something differentiable: e.g. the logistic(x · w).
- Define a function on the data that formalizes "how well w predicts the data" (the log likelihood).
- Differentiate the likelihood function and use gradient ascent:
  Start with w_0. For t = 1, 2, …: w_{t+1} = w_t + λ ∇LCL_D(w_t), where λ is small.

Stochastic Gradient Descent (SGD)
(The difference between ascent and descent is only the sign; we'll be sloppy and use the more common "SGD".)
Goal: learn the parameter θ of …
Dataset: D = {x_1, …, x_n}, or maybe D = {(x_1, y_1), …, (x_n, y_n)}
- Write down Pr(D|θ) as a function of θ.
- For large datasets this is expensive: we don't want to load all the data D into memory, and the gradient depends on all the data.
- An alternative:
  - pick a small subset B of examples, with |B| << |D|
  - approximate the gradient using them ("on average" this is the right direction)
  - take a step in that direction
  - repeat…
- B = one example is a very popular choice.

Using SGD for logistic regression
- P(y|x) = logistic(x · w)
- Define the likelihood (the LCL above).
- Differentiate the LCL function and take gradient steps (ascent on the LCL, i.e. descent on its negation):
  Start with w_0. For t = 1, …, T (until convergence), for each example (x, y) in D:
  w_{t+1} = w_t + λ ∇L_{x,y}(w_t), where λ is small and L_{x,y} is the log conditional likelihood of that one example.
- More steps and a noisier path toward the optimum, but each step is cheaper.

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Learning Linear Classifiers with Gradient Descent. Part 2: The Gradient
I will start by deriving the gradient for one example (as in SGD) and then move to "batch" gradient descent.

The likelihood on one example (x, y), with p = logistic(x · w), is
L_{x,y}(w) = y log p + (1 − y) log(1 − p).
We're going to dive into the inner piece, dp/dw_j. Using the derivative of the logistic, dσ(z)/dz = σ(z)(1 − σ(z)):
dp/dw_j = p (1 − p) x_j,
and chaining through the log terms gives
dL_{x,y}/dw_j = ( y/p − (1 − y)/(1 − p) ) · p (1 − p) x_j = (y − p) x_j.

Breaking it down: SGD for logistic regression
- P(y|x) = logistic(x · w)
- Define the per-example objective L_{x,y}(w) = y log p + (1 − y) log(1 − p), with p = logistic(x · w).
- Differentiate it and take gradient steps:
  Start with w_0. For t = 1, …, T (until convergence), for each example (x, y) in D:
  w_{t+1} = w_t + λ (y − p) x, where λ is small.
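Putting the pieces together, a hedged sketch of the whole loop (assuming NumPy, a dense feature matrix X, and labels y in {0, 1}; the function name and default hyperparameters are illustrative):

```python
import numpy as np

def sgd_logistic_regression(X, y, lam=0.1, epochs=10):
    """Per-example SGD for logistic regression, using the gradient derived above:
    for each (x_i, y_i), w <- w + lam * (y_i - p_i) * x_i, with p_i = logistic(x_i . w).
    (Ascent on the log conditional likelihood, i.e. descent on its negation.)"""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += lam * (y[i] - p) * X[i]
    return w

# Usage: w = sgd_logistic_regression(X, y), with X of shape (n, d) and y in {0, 1}^n.
```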

Details: Picking the learning rate
- Use grid search in log-space over small values on a tuning set: e.g., 0.01, 0.001, …
- Sometimes, decrease it after each pass: e.g. by a factor of 1/(1 + d·t), where t is the epoch; sometimes 1/t².
- Fancier techniques I won't talk about: adaptive gradients that scale the gradient differently for each dimension (AdaGrad, Adam, …).
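One way such a decay schedule might look in code (the constants are placeholders, not course-prescribed values):

```python
def decayed_rate(lam0, epoch, d=1.0):
    """Decrease the learning rate after each pass: lam0 / (1 + d * t), t = epoch index.
    (A 1/t^2 schedule would instead return lam0 / (epoch + 1) ** 2.)"""
    return lam0 / (1.0 + d * epoch)
```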

Details: Debugging
Check that the analytic gradient is indeed a locally good approximation to the change in the likelihood: the "finite difference test".
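A sketch of that test, assuming NumPy; here f is the objective (e.g. the LCL on a small dataset) and grad_f its analytic gradient, both hypothetical callables.

```python
import numpy as np

def finite_difference_check(f, grad_f, w, eps=1e-5):
    """Compare the analytic gradient grad_f(w) against a centered finite-difference
    estimate (f(w + eps*e_j) - f(w - eps*e_j)) / (2*eps), one coordinate at a time.
    Returns the largest absolute discrepancy; it should be tiny if the gradient is right."""
    g_analytic = np.asarray(grad_f(w), dtype=float)
    g_numeric = np.zeros_like(g_analytic)
    for j in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[j] = eps
        g_numeric[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return float(np.max(np.abs(g_analytic - g_numeric)))
```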

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Logistic Regression: OBSERVATIONS

Convexity and logistic regression
The negative LCL is convex (equivalently, the LCL itself is concave): there is only one local optimum, so gradient descent/ascent will find the global one.

Non-stochastic gradient descent
In batch gradient descent, average the gradient over all the examples D = {(x_1, y_1), …, (x_n, y_n)}:
∇_w [ (1/n) Σ_i L_{x_i, y_i}(w) ] = (1/n) Σ_i (y_i − p_i) x_i.
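A sketch of that averaged gradient, assuming NumPy and the same conventions as in the earlier sketches:

```python
import numpy as np

def batch_gradient(w, X, y):
    """Average gradient of the LCL over the whole dataset:
    (1/n) * sum_i (y_i - p_i) * x_i, with p_i = logistic(x_i . w)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (y - p) / X.shape[0]

# Batch gradient ascent then repeats: w = w + lam * batch_gradient(w, X, y), until convergence.
```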

Non-stochastic gradient descent
The component of this gradient for feature j can be interpreted as a difference between the expected value of y given x_j = 1 in the data and the expected value of y given x_j = 1 as predicted by the model. Gradient ascent tries to make those two equal.
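In symbols, for a binary feature x_j in {0, 1} (my rendering of the interpretation above, not copied from the slide):

```latex
\frac{\partial\,\mathrm{LCL}(w)}{\partial w_j}
  = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij}
  = \underbrace{\sum_{i:\,x_{ij}=1} y_i}_{\text{count of } y=1 \text{ in the data when } x_j=1}
  \;-\; \underbrace{\sum_{i:\,x_{ij}=1} p_i}_{\text{count the model expects when } x_j=1},
  \qquad p_i = \mathrm{logistic}(x_i \cdot w).
```

At an optimum the gradient is zero, so the observed and predicted counts agree.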

This LCL function "overfits"
The gradient for feature j is the difference between the expected value of y given x_j = 1 in the data and the expected value of y given x_j = 1 as predicted by the model, and gradient ascent tries to make those equal. That's impossible for some w_j! E.g., if x_j = 1 only in positive examples, the gradient for w_j is always positive.

This LCL function "overfits"
As above: gradient ascent tries to match the data's expectation of y given x_j = 1 with the model's, which is impossible for some w_j; e.g., if a feature appears only in positive examples, its gradient is always positive.
Using this LCL function for text: practically, it's important to discard rare features to get good results.
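A sketch of that preprocessing step; the threshold min_count = 5 and the function name are arbitrary illustrative choices.

```python
from collections import Counter

def frequent_features(docs, min_count=5):
    """Keep only features (e.g. word types) that occur in at least min_count documents.
    Rare features that happen to appear only in positive examples are a common
    source of overfitting for the unregularized LCL objective."""
    df = Counter(f for doc in docs for f in set(doc))
    return {f for f, c in df.items() if c >= min_count}

# Usage: keep = frequent_features(train_docs); filtered = [f for f in doc if f in keep]
```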

This LCL function "overfits"
Overfitting is often a problem in supervised learning. When you fit the data (maximize the LCL, i.e. minimize the loss), are you fitting "real structure" in the data or "noise" in the data? Will the patterns you see appear in a test set or not?
[Figure: as more features are added, error/LCL on the training set D keeps improving, while error/LCL on an unseen test set D_test eventually gets worse.]

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Regularized Logistic Regression

Regularized logistic regression as a MAP
Optimizing our LCL objective maximizes the log conditional likelihood of the data (LCL): like MLE, max_w Pr(D|w), but focusing on just part of D (the labels, given the inputs).
Another possibility: introduce a prior over w and maximize something like a MAP estimate, Pr(w|D) = (1/Z) Pr(D|w) Pr(w).
If the prior on each w_j is a zero-mean Gaussian, then Pr(w_j) = (1/Z) exp(−w_j² / (2σ²)).
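A sketch of the algebra, writing the prior variance as σ² (so the penalty strength μ on the next slide corresponds to 1/(2σ²)):

```latex
\log \Pr(w \mid D)
  = \log \Pr(D \mid w) + \log \Pr(w) - \log Z
  = \mathrm{LCL}(w) \;-\; \frac{1}{2\sigma^2}\sum_j w_j^2 \;+\; \text{const},
\qquad \Pr(w_j) \propto \exp\!\Big(-\frac{w_j^2}{2\sigma^2}\Big).
```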

Regularized logistic regression
Replace the LCL (log conditional likelihood, our objective) with the LCL plus a penalty for large weights, e.g.
LCL(w) − μ Σ_j w_j².
So the update for w_j becomes
w_j ← w_j + λ ( Σ_i (y_i − p_i) x_{ij} − 2μ w_j ),
or, per example in SGD,
w_j ← w_j + λ ( (y − p) x_j − 2μ w_j ).

Breaking it down: regularized SGD for logistic regression
- P(y|x) = logistic(x · w)
- Define the penalized objective L_{x,y}(w) − μ Σ_j w_j².
- Differentiate it and take gradient steps:
  Start with w_0. For t = 1, …, T (until convergence), for each example (x, y) in D:
  w ← w + λ ( (y − p) x − 2μ w ), where λ is small.
- The −2μw term pushes w toward the all-zeros vector ("weight decay").
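A hedged sketch of the regularized loop (same assumptions as the earlier SGD sketch; μ is an illustrative hyperparameter, and applying the full penalty at every example, rather than scaling it by 1/n or applying it lazily, is a simplification):

```python
import numpy as np

def sgd_logistic_regression_l2(X, y, lam=0.1, mu=0.01, epochs=10):
    """Per-example SGD with an L2 penalty ("weight decay"):
    w <- w + lam * ((y_i - p_i) * x_i - 2 * mu * w).
    The -2*mu*w term shrinks w toward the all-zeros vector."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += lam * ((y[i] - p) * X[i] - 2.0 * mu * w)
    return w
```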

Outline
- Background: Hyperplanes
- Learning as Optimization
  - MLE Example
  - Gradient descent (in pictures)
  - Gradient descent for Linear Classifiers
- Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)
- Logistic Regression and Overfitting
  - (Non-stochastic) Gradient Descent
  - Difference of expectations
- Regularization
  - L2 Regularization
  - Regularization as MAP estimation
- Discriminative vs. Generative Classifiers

Discriminative and generative classifiers

Generative vs. Discriminative
Generative classifiers (example: Naïve Bayes):
- Define a joint model of the observations x and the labels y: p(x, y)
- Learning maximizes the (joint) likelihood
- Use Bayes' Rule to classify based on the posterior p(y|x)
Discriminative classifiers (example: Logistic Regression):
- Directly model the conditional p(y|x)
- Learning maximizes the conditional likelihood
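For reference, the standard forms of the two modelling choices (my notation, with logistic regression's conditional written out explicitly):

```latex
\text{Generative (e.g.\ Na\"ive Bayes):}\qquad
  p(x, y) = p(y)\, p(x \mid y), \qquad
  p(y \mid x) = \frac{p(y)\, p(x \mid y)}{\sum_{y'} p(y')\, p(x \mid y')} \\[4pt]
\text{Discriminative (e.g.\ Logistic Regression):}\qquad
  p(y = 1 \mid x) = \mathrm{logistic}(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}
```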

Generative vs. Discriminative: finite sample analysis (Ng & Jordan, 2002)
[Assume that we are learning from a finite training dataset.]
- If the model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression.
- If the model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes.

[Figure: solid curves: Naïve Bayes (NB); dashed curves: Logistic Regression (LR).]

Naïve Bayes makes stronger assumptions about the data, but needs fewer examples to estimate its parameters. (Solid: NB; dashed: LR.)
"On Discriminative vs Generative Classifiers: …", Andrew Ng and Michael Jordan, NIPS 2001.

Generative vs. Discriminative: Learning (Parameter Estimation)
- Naïve Bayes: parameters are decoupled, so there is a closed-form solution for the MLE.
- Logistic Regression: parameters are coupled, so there is no closed-form solution; we must use iterative optimization techniques instead.

Summary
- Discriminative classifiers directly model the conditional, p(y|x).
- Logistic regression is a simple linear classifier that retains a probabilistic semantics.
- Parameters in LR are learned by iterative optimization (e.g. SGD).
- Regularization helps to avoid overfitting.