CS 782 – Machine Learning, Lecture 4: Linear Models for Classification • Probabilistic generative models • Probabilistic discriminative models

Probabilistic Generative Models
We have shown that, for two classes with Gaussian class-conditional densities sharing the same covariance matrix Σ, the posterior probability of class C₁ is a logistic sigmoid acting on a linear function of x:
p(C₁|x) = σ(wᵀx + w₀), with w = Σ⁻¹(μ₁ − μ₂) and w₀ = −½ μ₁ᵀΣ⁻¹μ₁ + ½ μ₂ᵀΣ⁻¹μ₂ + ln[p(C₁)/p(C₂)].
Decision boundary: the set of points where p(C₁|x) = p(C₂|x) = 0.5, i.e. the hyperplane wᵀx + w₀ = 0, which is linear in x.
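For reference, here is a worked restatement of where the sigmoid form comes from (my own summary of the standard derivation, not a transcription of the slide's equations):
\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
= \frac{1}{1 + \exp(-a)} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}.
\]
When the class-conditional densities are Gaussians with a shared covariance matrix, the quadratic terms in x cancel inside a, leaving the linear form a = wᵀx + w₀ given above.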

Probabilistic Generative Models
K > 2 classes: we can show the following result. The posterior probabilities are given by a softmax of linear functions of x:
p(Cₖ|x) = exp(aₖ) / ∑ⱼ exp(aⱼ), with aₖ = wₖᵀx + wₖ₀,
when the class-conditional densities are Gaussians sharing the same covariance matrix.
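For completeness (a standard result, stated here in the same notation as the maximum likelihood slides below), the linear coefficients in that case are:
\[
w_k = \Sigma^{-1} \mu_k,
\qquad
w_{k0} = -\tfrac{1}{2}\, \mu_k^{\mathsf{T}} \Sigma^{-1} \mu_k + \ln p(C_k).
\]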

Maximum likelihood solution
We have a parametric functional form for the class-conditional densities: p(x|Cₖ) = N(x | μₖ, Σ). We can estimate the parameters and the prior class probabilities using maximum likelihood.
• Two-class case with a shared covariance matrix Σ.
• Training data: {xₙ, tₙ}, n = 1, …, N, where tₙ = 1 denotes class C₁ and tₙ = 0 denotes class C₂; the class priors are p(C₁) = π and p(C₂) = 1 − π.

Maximum likelihood solution
For a data point xₙ from class C₁ we have tₙ = 1, and therefore
p(xₙ, C₁) = p(C₁) p(xₙ|C₁) = π N(xₙ | μ₁, Σ).
Similarly, for a data point xₙ from class C₂ we have tₙ = 0, and therefore
p(xₙ, C₂) = p(C₂) p(xₙ|C₂) = (1 − π) N(xₙ | μ₂, Σ).

Maximum likelihood solution
Assuming the observations are drawn independently, we can write the likelihood function as follows:
p(t, X | π, μ₁, μ₂, Σ) = ∏ₙ [π N(xₙ | μ₁, Σ)]^tₙ [(1 − π) N(xₙ | μ₂, Σ)]^(1 − tₙ),
where t = (t₁, …, t_N)ᵀ.

Maximum likelihood solution
We want to find the values of the parameters that maximize the likelihood function, i.e., fit the model that best describes the observed data. As usual, we consider the log of the likelihood:
ln p(t, X | π, μ₁, μ₂, Σ) = ∑ₙ { tₙ [ln π + ln N(xₙ | μ₁, Σ)] + (1 − tₙ) [ln(1 − π) + ln N(xₙ | μ₂, Σ)] }.
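As a concrete illustration (not part of the original slides), this log likelihood can be evaluated with a few lines of NumPy/SciPy; the function name and argument layout are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, t, pi, mu1, mu2, Sigma):
    """Log likelihood of the two-class, shared-covariance Gaussian model.

    X        : (N, D) array of input vectors x_n
    t        : (N,) array of labels, t_n = 1 for class C1 and t_n = 0 for class C2
    pi       : prior probability p(C1)
    mu1, mu2 : (D,) class means
    Sigma    : (D, D) shared covariance matrix
    """
    log_N1 = multivariate_normal.logpdf(X, mean=mu1, cov=Sigma)  # ln N(x_n | mu1, Sigma)
    log_N2 = multivariate_normal.logpdf(X, mean=mu2, cov=Sigma)  # ln N(x_n | mu2, Sigma)
    return np.sum(t * (np.log(pi) + log_N1) + (1 - t) * (np.log(1 - pi) + log_N2))
```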

Maximum likelihood solution
We first maximize the log likelihood with respect to π. The terms that depend on π are
∑ₙ { tₙ ln π + (1 − tₙ) ln(1 − π) }.
Setting the derivative with respect to π to zero and rearranging gives
π = (1/N) ∑ₙ tₙ = N₁ / (N₁ + N₂),
where N₁ and N₂ denote the number of training points in classes C₁ and C₂.

Maximum likelihood solution
Thus, the maximum likelihood estimate of π is the fraction of points in class C₁ (for example, if 30 of 100 training points belong to C₁, then π = 0.3).
The result generalizes to the multiclass case: the maximum likelihood estimate of the prior πₖ = p(Cₖ) is given by the fraction of points in the training set that belong to Cₖ.

Maximum likelihood solution
We now maximize the log likelihood with respect to μ₁. The terms that depend on μ₁ are
∑ₙ tₙ ln N(xₙ | μ₁, Σ) = −½ ∑ₙ tₙ (xₙ − μ₁)ᵀ Σ⁻¹ (xₙ − μ₁) + const.
Setting the derivative with respect to μ₁ to zero gives
μ₁ = (1/N₁) ∑ₙ tₙ xₙ.

Maximum likelihood solution
Thus, the maximum likelihood estimate of μ₁ is the sample mean of all the input vectors assigned to class C₁.
By maximizing the log likelihood with respect to μ₂ we obtain the analogous result,
μ₂ = (1/N₂) ∑ₙ (1 − tₙ) xₙ,
the sample mean of the input vectors assigned to class C₂.

Maximum likelihood solution
Maximizing the log likelihood with respect to Σ, we obtain the maximum likelihood estimate
Σ = (N₁/N) S₁ + (N₂/N) S₂, where Sₖ = (1/Nₖ) ∑_{n ∈ Cₖ} (xₙ − μₖ)(xₙ − μₖ)ᵀ.
• Thus, the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the two classes.
• This result extends to K classes.
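The closed-form estimates above are easy to compute directly. Here is a minimal NumPy sketch (my own illustration with assumed variable names, not code from the lecture):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates for the two-class Gaussian model with a shared covariance.

    X : (N, D) array of inputs; t : (N,) array of 0/1 labels (1 = class C1).
    Returns pi = N1/N, the class means mu1 and mu2, and Sigma = (N1/N)*S1 + (N2/N)*S2.
    """
    t = np.asarray(t)
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                # fraction of points in class C1
    mu1 = X[t == 1].mean(axis=0)               # sample mean of class C1 inputs
    mu2 = X[t == 0].mean(axis=0)               # sample mean of class C2 inputs
    S1 = np.cov(X[t == 1].T, bias=True)        # ML (biased) covariance of class C1
    S2 = np.cov(X[t == 0].T, bias=True)        # ML (biased) covariance of class C2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2      # weighted average of the two
    return pi, mu1, mu2, Sigma
```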

Probabilistic Discriminative Models
Two-class case: p(C₁|φ) = σ(wᵀφ).
Multiclass case: p(Cₖ|φ) = exp(wₖᵀφ) / ∑ⱼ exp(wⱼᵀφ).
Here φ = φ(x) is a fixed feature vector computed from the input x.
Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.

Probabilistic Discriminative Models
Advantages:
• Fewer parameters to be determined.
• Improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions.

Probabilistic Discriminative Models
Two-class case: p(C₁|φ) = y(φ) = σ(wᵀφ).
In the terminology of statistics, this model is known as logistic regression.
Assuming an M-dimensional feature space φ, how many parameters do we need to estimate? Just the M components of w.

Probabilistic Discriminative Models
How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)? For an M-dimensional feature space: 2M parameters for the two means, M(M + 1)/2 for the shared covariance matrix, and one for the prior p(C₁), i.e. M(M + 5)/2 + 1 parameters in total, which grows quadratically with M. For example, with M = 100, logistic regression has 100 parameters while the generative model has 5,251.

Logistic Regression
We use maximum likelihood to determine the parameters of the logistic regression model. For a training set {φₙ, tₙ}, n = 1, …, N, with tₙ ∈ {0, 1} and yₙ = p(C₁|φₙ) = σ(wᵀφₙ), the likelihood function is
p(t | w) = ∏ₙ yₙ^tₙ (1 − yₙ)^(1 − tₙ).

Logistic Regression
We consider the negative logarithm of the likelihood, which gives the cross-entropy error function:
E(w) = −ln p(t | w) = −∑ₙ { tₙ ln yₙ + (1 − tₙ) ln(1 − yₙ) }.

Logistic Regression
We compute the derivative of the error function with respect to w (the gradient ∇E(w)). Since yₙ = σ(wᵀφₙ), by the chain rule we first need the derivative of the logistic sigmoid function:
dσ/da = σ(a) (1 − σ(a)).

Logistic Regression
Combining these results, the gradient of the error function takes a simple form:
∇E(w) = ∑ₙ (yₙ − tₙ) φₙ.
Each data point contributes the prediction error (yₙ − tₙ) multiplied by the basis vector φₙ.
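A worked version of the chain-rule step behind that result (my own restatement, using yₙ = σ(wᵀφₙ)):
\[
\nabla E(w)
= -\sum_{n=1}^{N} \left( \frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n} \right) y_n (1 - y_n)\, \phi_n
= \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.
\]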

• The gradient of E at w gives the direction of steepest increase of E at w. We need to minimize E, so we update w by moving in the direction opposite to the gradient:
w^(τ+1) = w^(τ) − η ∇E(w^(τ)), where η > 0 is the learning rate.
This technique is called gradient descent.
• It can be shown that E is a convex function of w. Thus, it has a unique minimum.
• An efficient iterative technique also exists for finding the optimal parameters w (Newton-Raphson optimization).
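To make the batch procedure concrete, here is a minimal gradient-descent sketch (my own illustration; the learning rate and iteration count are arbitrary choices, not values from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_batch(Phi, t, eta=0.01, n_iters=1000):
    """Batch gradient descent for two-class logistic regression.

    Phi : (N, M) design matrix whose rows are the feature vectors phi_n
    t   : (N,) array of 0/1 targets
    Minimizes the cross-entropy error E(w) using grad E = sum_n (y_n - t_n) phi_n.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)        # y_n = sigma(w^T phi_n) for every point
        grad = Phi.T @ (y - t)      # gradient of the error function
        w = w - eta * grad          # step in the direction opposite to the gradient
    return w
```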

Batch vs. on-line learning
• The computation of the above gradient requires processing the entire training set (batch technique).
• If the data set is large, this can be costly.
• For real-time applications, in which data become available as a continuous stream, we may want to update the parameters as data points are presented to us (on-line technique).

On-line learning
• After the presentation of each data point n, we compute the contribution of that data point to the gradient (the stochastic gradient):
∇Eₙ(w) = (yₙ − tₙ) φₙ.
• The on-line updating rule for the parameters becomes:
w^(τ+1) = w^(τ) − η (yₙ − tₙ) φₙ.
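A corresponding on-line sketch (again my own illustration, reusing the assumed names from the batch version):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_online(Phi, t, eta=0.01, n_epochs=10):
    """Stochastic (on-line) gradient descent: update w after every data point
    using grad E_n = (y_n - t_n) phi_n."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in np.random.permutation(N):       # present the points in random order
            y_n = sigmoid(Phi[n] @ w)
            w = w - eta * (y_n - t[n]) * Phi[n]  # on-line update rule
    return w
```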

Multiclass Logistic Regression
Multiclass case: p(Cₖ|φ) = yₖ(φ) = exp(aₖ) / ∑ⱼ exp(aⱼ), with activations aₖ = wₖᵀφ.
We use maximum likelihood to determine the parameters of this model.

Multiclass Logistic Regression
Using a 1-of-K coding scheme for the targets (tₙ is a binary vector with tₙₖ = 1 if xₙ belongs to class Cₖ and 0 otherwise), the likelihood is
p(T | w₁, …, w_K) = ∏ₙ ∏ₖ yₙₖ^tₙₖ, where yₙₖ = yₖ(φₙ).
We consider the negative logarithm of the likelihood, the cross-entropy error function for the multiclass problem:
E(w₁, …, w_K) = −∑ₙ ∑ₖ tₙₖ ln yₙₖ.

Multiclass Logistic Regression
We compute the gradient of the error function with respect to one of the parameter vectors, wⱼ. Since each output yₙₖ depends on all of the activations aₙⱼ = wⱼᵀφₙ, the chain rule is needed.

Multiclass Logistic Regression
Thus, we need to compute the derivatives of the softmax function with respect to the activations. For the case k = j:
∂yₖ/∂aₖ = yₖ (1 − yₖ).

Multiclass Logistic Regression
Thus, we need to compute the derivatives of the softmax function (continued). For the case k ≠ j:
∂yₖ/∂aⱼ = −yₖ yⱼ.

Multiclass Logistic Regression
Compact expression:
∂yₖ/∂aⱼ = yₖ (Iₖⱼ − yⱼ),
where Iₖⱼ = 1 if k = j and Iₖⱼ = 0 otherwise.

Multiclass Logistic Regression
Combining these results, the gradient of the error function with respect to wⱼ is
∇_{wⱼ} E(w₁, …, w_K) = ∑ₙ (yₙⱼ − tₙⱼ) φₙ,
which has the same form as in the two-class case: prediction error times basis vector.
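A worked version of that calculation (my own restatement; it uses the softmax derivative above and the fact that ∑ₖ tₙₖ = 1 for 1-of-K coded targets):
\[
\nabla_{w_j} E
= -\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{nk}}{y_{nk}}\, y_{nk}\, (I_{kj} - y_{nj})\, \phi_n
= \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n.
\]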

• It can be shown that E is a convex function of the parameters w₁, …, w_K. Thus, it has a unique minimum.
• For a batch solution, we can use the Newton-Raphson optimization technique.
• On-line solution (stochastic gradient descent): after presenting data point n, update each parameter vector with
wⱼ^(τ+1) = wⱼ^(τ) − η (yₙⱼ − tₙⱼ) φₙ, for j = 1, …, K.
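Finally, a minimal stochastic-gradient sketch for the multiclass model (my own illustration; it assumes 1-of-K coded targets and uses the gradient derived above):

```python
import numpy as np

def softmax(a):
    a = a - a.max()                 # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def fit_softmax_online(Phi, T, eta=0.01, n_epochs=10):
    """Multiclass logistic regression trained by stochastic gradient descent.

    Phi : (N, M) design matrix
    T   : (N, K) matrix of 1-of-K coded targets
    W   : (M, K) matrix whose columns are the parameter vectors w_k
    """
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(n_epochs):
        for n in np.random.permutation(N):
            y_n = softmax(W.T @ Phi[n])                  # y_nk = softmax_k(w_k^T phi_n)
            W = W - eta * np.outer(Phi[n], y_n - T[n])   # grad_{w_j} E_n = (y_nj - t_nj) phi_n
    return W
```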