10701 / 15781 Machine Learning

The distribution of w When the assumptions of the linear model are correct, the ML estimator follows a simple Gaussian distribution: ŵ ~ N(w, σ²(XᵀX)⁻¹), where X is the matrix of inputs and σ² is the noise variance. This results from our assumption about the distribution of y (Gaussian noise around the linear prediction).

Bias Many estimators are unbiased. Among these, we are interested in the minimum variance unbiased estimator (MVUE). Maximum likelihood estimators can be biased (though they are asymptotically unbiased).

Topics: Experiment selection, Modeling, Supervised learning (classification), Unsupervised learning (clustering), Data analysis (estimation and regression)

Today: Logistic regression, Regularization. Next time: Clustering

Categorical regression So far we have discussed regression with real valued outcomes. However, in some cases we are interested in categorical or binary outcomes, where the output is not real valued. Examples: classifying documents, web pages, protein function, etc.

Logistic regression Logistic regression represents the probability of category i using a function of a linear regression on the input variables. With k categories and category k as the reference class, P(y=i|x) = exp(w_i·x) / (1 + Σ_{j<k} exp(w_j·x)) for i<k, and P(y=k|x) = 1 / (1 + Σ_{j<k} exp(w_j·x)). These probabilities sum to 1 over all categories.
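As a hedged illustration, here is a minimal NumPy sketch of these class probabilities with category k as the reference class (the function name, the weight layout, and the example numbers are assumptions, not from the slides):

```python
import numpy as np

def category_probs(W, x):
    """Multiclass logistic regression probabilities with the last
    category (k) used as the reference class.

    W : (k-1, d) array of weight vectors w_1, ..., w_{k-1}
    x : (d,) input vector
    Returns a length-k array of probabilities that sums to 1.
    """
    scores = np.exp(W @ x)          # exp(w_i . x) for i < k
    denom = 1.0 + scores.sum()      # 1 + sum_{j<k} exp(w_j . x)
    return np.append(scores / denom, 1.0 / denom)

# Example: 3 categories, 2-dimensional input
W = np.array([[0.5, -1.0], [1.5, 0.2]])
print(category_probs(W, np.array([1.0, 2.0])))  # entries sum to 1
```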

Logistic regression The name comes from the logit transformation: logit(p) = log(p / (1 - p)), which the model sets equal to a linear function of the inputs, w_0 + w·x.
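A small sketch checking that the logit and the logistic squashing function are inverses of each other (function names and example values are mine, not from the lecture):

```python
import numpy as np

def logit(p):
    """Log-odds: the transformation that gives logistic regression its name."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))            # ~[-2.197, 0.0, 2.197]
print(sigmoid(logit(p)))   # recovers [0.1, 0.5, 0.9]
```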

Binary logistic regression We only need one set of parameters: P(y=1|x) = 1 / (1 + exp(-(w_0 + w·x))). This results in a “squashing function” (the logistic sigmoid) which turns linear predictions into probabilities.
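A minimal NumPy sketch of this squashing behaviour (the helper names and example values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(w, w0, x):
    """P(y=1 | x) for binary logistic regression."""
    return sigmoid(w0 + np.dot(w, x))

# Large positive linear predictions give probabilities near 1,
# large negative ones give probabilities near 0.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```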

Logistic regression vs. Linear regression

Example

Log likelihood For n training examples (x_i, y_i) with y_i in {0,1}, the conditional log likelihood is l(w) = Σ_i [ y_i log P(y=1|x_i, w) + (1 - y_i) log(1 - P(y=1|x_i, w)) ]. Note: this log likelihood is concave in w, so it has no local maxima other than the global maximum.
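A short sketch of evaluating this log likelihood on a toy data set (the names and numbers are illustrative, not from the lecture):

```python
import numpy as np

def log_likelihood(w, X, y):
    """Conditional log likelihood of binary labels under logistic regression.
    X includes a leading column of 1s so that w[0] plays the role of the bias."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([0.0, 1.0]), X, y))  # higher (closer to 0) is better
```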

Maximum likelihood estimation There is no closed form solution, so we rely on numerical optimization. Common (but not the only) approaches: line search, simulated annealing, gradient descent/ascent, Newton's method. The gradient of the log likelihood takes a simple form driven by the prediction error: ∂l(w)/∂w_j = Σ_i (y_i - P(y=1|x_i, w)) x_ij.

Gradient descent

Gradient ascent Iteratively updating the weights in this fashion increases the likelihood each round, and we eventually reach the maximum. We are near the maximum when changes in the weights are small; thus, we can stop when the sum of the absolute values of the weight differences is less than some small number.
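A minimal sketch of this gradient ascent loop with the stopping rule described above (the learning rate, tolerance, and toy data are assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, tol=1e-5, max_iter=10000):
    """Gradient ascent on the conditional log likelihood.

    X : (n, d) inputs with a leading column of 1s for the bias term
    y : (n,) labels in {0, 1}
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ w)            # current P(y=1 | x_i)
        grad = X.T @ (y - p)          # prediction error times the inputs
        w_new = w + lr * grad
        # Stop when the total absolute change in the weights is small
        if np.sum(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: a 1-d problem with an added bias column (non-separable labels)
X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0, 0, 1, 0, 1, 1])
print(fit_logistic(X, y))             # learned [bias, slope]
```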

Example We get a monotonically increasing log likelihood of the training labels as a function of the number of iterations

Convergence The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction, i.e., when the gradient is zero: Σ_i (y_i - P(y=1|x_i, w)) x_i = 0. This condition means (again) that the prediction error is uncorrelated with the components of the input vector.

Additive models We can extend the logistic regression model to additive (logistic) models by applying the linear weights to basis functions of the inputs: P(y=1|x) = g(w_0 + Σ_i w_i φ_i(x)), where g is the logistic squashing function. We are free to choose the basis functions φ_i(x); for example, each φ_i can act as an adjustable “feature detector”, as in the sketch below.
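One possible choice of adjustable feature detectors is Gaussian radial basis functions; the sketch below is an illustration under that assumption (the centres, width, and function names are mine):

```python
import numpy as np

def rbf_features(x, centers, width=1.0):
    """Adjustable 'feature detectors': each phi_i responds most strongly
    when x is near its centre."""
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

def additive_logistic(x, w0, w, centers, width=1.0):
    """Additive logistic model: sigmoid(w0 + sum_i w_i * phi_i(x))."""
    z = w0 + np.dot(w, rbf_features(x, centers, width))
    return 1.0 / (1.0 + np.exp(-z))

centers = np.array([-1.0, 0.0, 1.0])    # one detector per centre
w = np.array([2.0, -1.0, 2.0])
print(additive_logistic(0.5, w0=-0.5, w=w, centers=centers))
```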

Regularization With more basis functions and parameters, the model becomes flexible enough to fit any relation (or noise) in the data. We therefore need ways of constraining or “regularizing” such models other than simply limiting the number of parameters. One option is to limit the “effective” number of parameter choices, where this effective number depends on the resolution at which we distinguish parameter settings.

Regularization: basic idea Consider the set of (0/1) coins parameterized by the probability p of getting “1”. How many coins are there? Case 1: an uncountable number. Case 2: 5 coins (p_1, ..., p_5) suffice if we only need the prediction of any other coin (indexed by p) to be no more than ε = 0.1 away: for any p, |p - p_j| ≤ 0.1 for at least one j. Case 3: only 1 coin if ε = 0.5.
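A quick numeric check of this covering argument (a sketch assuming p ranges over the full interval [0, 1]):

```python
import math

def num_coins(eps):
    """Smallest number of centres p_1, ..., p_m needed so that every
    p in [0, 1] is within eps of some centre (a covering number)."""
    return math.ceil(1.0 / (2.0 * eps))

print(num_coins(0.1))  # 5 centres, e.g. 0.1, 0.3, 0.5, 0.7, 0.9
print(num_coins(0.5))  # 1 centre, e.g. 0.5
```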

Regularization: logistic regression Consider a simple logistic regression model parameterized by w = (w_0, w_1), and assume that x ∈ [-1, 1], i.e., that the inputs remain bounded. We can then divide the parameter space into regions with centers w(1), w(2), ... such that the predictions of any w (for any x ∈ [-1, 1]) are close to those of one of the centers.

Regularization: logistic regression We can view the different discrete parameter choices w(1), w(2), ..., w(n), ... as “centroids” of regions in the parameter space such that, within each region, every parameter vector makes predictions close to those of its centroid.

Regularizing logistic regression We can regularize the model by imposing a penalty in the estimation criterion that encourages ||w|| to remain small. Maximum penalized log likelihood criterion: maximize l(w) - (C/2)||w||², where larger values of C impose stronger regularization. More generally, we can assign penalties based on prior distributions over the parameters, i.e., add log P(w) to the log likelihood criterion.

Regularization: gradient ascent As with the non-regularized version, we maximize the likelihood, now with the penalty term included. Taking the derivative with respect to w_j gives ∂/∂w_j [ l(w) - (C/2)||w||² ] = Σ_i (y_i - P(y=1|x_i, w)) x_ij - C w_j. Thus, our new update rule becomes w_j ← w_j + η ( Σ_i (y_i - P(y=1|x_i, w)) x_ij - C w_j ) for a small learning rate η, as sketched below.
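A sketch of this regularized update, mirroring the earlier gradient ascent loop (C, the learning rate, and the toy data are assumptions; for simplicity the bias weight is penalized here too):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_l2(X, y, C=1.0, lr=0.1, tol=1e-5, max_iter=10000):
    """Gradient ascent on the penalized log likelihood l(w) - (C/2)||w||^2.
    Larger C shrinks the weights more strongly toward zero."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - C * w   # prediction-error term minus penalty term
        w_new = w + lr * grad
        if np.sum(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

X = np.column_stack([np.ones(6), np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0, 0, 1, 0, 1, 1])
print(fit_logistic_l2(X, y, C=1.0))    # smaller ||w|| than the unpenalized fit
```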

Training vs. test errors How do the training and test conditional log likelihoods behave as a function of the regularization parameter C?

Acknowledgment These slides are based in part on slides from previous machine learning classes taught by Andrew Moore at CMU and Tommi Jaakkola at MIT. I thank Andrew and Tommi for letting me use their slides.