Matt Gormley Lecture 5 September 14, 2016

School of Computer Science, 10-601 Introduction to Machine Learning. Linear Regression. Matt Gormley, Lecture 5, September 14, 2016. Readings: Bishop 3.1

Reminders: Homework 2 extension: due Sunday (9/18) at 5:30pm. Homework 3: released today/tomorrow – 1.5 weeks to complete it. Recitation schedule posted on course website. Extra Octave recitation – Thursday (time TBD).

Reminders Having trouble finding a seat…? To attend this class, you must be formally registered (e.g. letter grade, audit, pass/fail)

Outline
Linear Regression: simple example; model
Learning (aka. Least Squares): Gradient Descent; SGD (aka. Least Mean Squares (LMS)); Closed Form (aka. the Normal Equations)
Advanced Topics: Geometric and Probabilistic Interpretation of LMS; L2 Regularization (aka. Ridge Regression); L1 Regularization (aka. LASSO); Features (aka. non-linear basis functions)

Linear regression. Our goal is to estimate w from a training set of (xi, yi) pairs. Optimization goal: minimize the squared error (least squares). Why least squares? It minimizes the squared distance between the measurements and the predicted line, it has a nice probabilistic interpretation, and the math is pretty (see HW).
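The objective itself appears only as an image on the slide; for the one-dimensional, no-bias model y ≈ wx it is conventionally written as

J(w) = \sum_{i=1}^{N} (y_i - w x_i)^2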

Solving linear regression. To optimize in closed form, we take the derivative with respect to w and set it to zero:
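The derivation the colon points to is an image in the original; for the 1-D, no-bias objective above it works out to

\frac{\partial}{\partial w} \sum_{i} (y_i - w x_i)^2 = -2 \sum_{i} x_i (y_i - w x_i) = 0
\quad\Rightarrow\quad
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}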

Linear regression. Given an input x we would like to compute an output y. In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement or other noise. (Figure: the observed values of y are what we are trying to predict from x.)

Regression example Generated: w=2 Recovered: w=2.03 Noise: std=1

Regression example Generated: w=2 Recovered: w=2.05 Noise: std=2

Regression example Generated: w=2 Recovered: w=2.08 Noise: std=4

Bias term. So far we assumed that the line passes through the origin. What if it does not? No problem: simply change the model to y = w0 + w1 x + ε. We can still use least squares to determine w0 and w1.

Linear Regression. Data: inputs are continuous vectors of length K; outputs are continuous scalars. What are some example problems of this form?

Linear Regression Data: Inputs are continuous vectors of length K. Outputs are continuous scalars. Prediction: Output is a linear function of the inputs. (We assume x1 is 1) Learning: finds the parameters that minimize some objective function.
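The prediction rule on this slide appears as an image; written out in the θ notation of the later slides (with x1 = 1 absorbing the bias term) it is

\hat{y} = h_{\theta}(x) = \theta^{T} x = \sum_{k=1}^{K} \theta_k x_k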

Least Squares Learning: finds the parameters that minimize some objective function. We minimize the sum of the squares: Why? Reduces distance between true measurements and predicted hyperplane (line in 1D) Has a nice probabilistic interpretation
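The sum-of-squares objective being minimized, again in the θ notation of the later slides (the factor of 1/2 is a common convention that simplifies the gradient and does not change the minimizer):

J(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2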

Least Squares. Learning finds the parameters that minimize the sum of squared errors. This is a very general optimization setup, and we could solve it in lots of ways. Today, we'll consider three ways.

Least Squares Learning: Three approaches to solving Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient) Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient) Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Gradient Descent In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).
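A minimal sketch of batch gradient descent for the least-squares objective above. The step size, tolerance, and iteration cap are illustrative choices, not values from the lecture.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iter=10000):
    """Batch gradient descent for least-squares linear regression.

    X: (N, K) design matrix (include a column of ones for the bias term).
    y: (N,) vector of targets.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        residual = X @ theta - y           # (N,) prediction errors
        grad = X.T @ residual              # gradient of 1/2 * sum of squared errors
        theta -= lr * grad                 # step opposite the gradient
        if np.linalg.norm(grad) < tol:     # convergence test (see next slide)
            break
    return theta
```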

Gradient Descent There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance. Alternatively we could check that the reduction in the objective function from one iteration to the next is small.

Stochastic Gradient Descent (SGD) Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm We need a per-example objective:
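The per-example objective is an image in the transcript; the standard choice is one term of the sum above,

J^{(i)}(\theta) = \frac{1}{2} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2,
\qquad
J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta)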

Stochastic Gradient Descent (SGD). Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. Let's start by calculating the partial derivative of the per-example objective for Linear Regression.

Partial Derivatives for Linear Reg.
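The derivatives on this slide are images in the original; differentiating the objectives above with respect to a single parameter θ_k gives

\frac{\partial J^{(i)}(\theta)}{\partial \theta_k} = \left( \theta^{T} x^{(i)} - y^{(i)} \right) x_k^{(i)},
\qquad
\frac{\partial J(\theta)}{\partial \theta_k} = \sum_{i=1}^{N} \left( \theta^{T} x^{(i)} - y^{(i)} \right) x_k^{(i)}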

Partial Derivatives for Linear Reg. The per-example derivative is used by SGD (aka. LMS); the full-sum derivative is used by Gradient Descent.

Least Mean Squares (LMS) Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm
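A minimal sketch of the resulting LMS updates, one example at a time. The learning rate, epoch count, and shuffling scheme are illustrative choices.

```python
import numpy as np

def lms(X, y, lr=0.01, epochs=10, seed=0):
    """SGD for least-squares linear regression, i.e. the LMS update rule:
        theta <- theta + lr * (y_i - theta . x_i) * x_i
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit examples in random order
            error = y[i] - X[i] @ theta     # per-example prediction error
            theta += lr * error * X[i]      # step opposite the per-example gradient
    return theta
```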

Optimization for Linear Reg. vs. Logistic Reg. We can use the same tricks for both: regularization; tuning the learning rate on development data; shuffling examples; going out-of-core (if the data can't fit in memory) and streaming over it; local hill climbing yields the global optimum (both problems are convex); etc. But Logistic Regression does not have a closed-form solution for the MLE parameters… what about Linear Regression?

Least Squares Learning: Three approaches to solving Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient) Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient) Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Background: Matrix Derivatives. For a function f mapping matrices to reals, define the matrix derivative ∇_A f(A) as the matrix of partial derivatives ∂f/∂A_ij, and recall the trace tr(A) = Σ_i A_ii. Some facts of matrix derivatives (without proof). © Eric Xing @ CMU, 2006-2011

The normal equations. Write the cost function in matrix form; to minimize J(θ), take the derivative with respect to θ and set it to zero. © Eric Xing @ CMU, 2006-2011
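The matrix-form cost and the resulting normal equations are images on the slide; reconstructed in the slide's notation they are

J(\theta) = \frac{1}{2} (X\theta - y)^{T} (X\theta - y),
\qquad
\nabla_{\theta} J(\theta) = X^{T} X \theta - X^{T} y = 0
\quad\Rightarrow\quad
\theta^{*} = (X^{T} X)^{-1} X^{T} y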

Comments on the normal equations. In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible. The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum. What if X has less than full column rank? Regularization (later). The column rank of a matrix A is the maximal number of linearly independent columns of A; likewise, the row rank is the maximal number of linearly independent rows. A square matrix is full rank if all of its columns are independent, i.e. no column vector of A can be expressed as a linear combination of the other columns. A simple test for determining whether a square matrix is full rank is to calculate its determinant: if the determinant is zero, there are linearly dependent columns and the matrix is not full rank. One can also compute the singular value decomposition of the matrix; if the smallest singular value is near or equal to zero, the matrix is likely not full rank ("singular"). For a symmetric positive definite matrix M there exists a unique lower triangular matrix L, with strictly positive diagonal elements, such that M = LL^T (the Cholesky factorization). © Eric Xing @ CMU, 2006-2011
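A minimal sketch of the closed-form solution in code. Solving the linear system is preferred to forming the inverse explicitly, and an SVD-based solver such as np.linalg.lstsq also covers the case where X has less than full column rank.

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares: solve (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def least_squares_svd(X, y):
    """SVD-based solver; also works when X has less than full column rank."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```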

Direct and iterative methods. Direct methods: we can obtain the solution in a single step by solving the normal equations, using Gaussian elimination or QR decomposition; this terminates in a finite number of steps, but it can be infeasible when data arrive as a real-time stream or are very large. Iterative methods: stochastic or steepest gradient descent; these converge only in a limiting sense, but they are more attractive for large practical problems, and caution is needed when choosing the learning rate α. © Eric Xing @ CMU, 2006-2011

Convergence rate. Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the learning rate α is sufficiently small. A formal analysis of LMS needs more math; in practice, one can use a small α or gradually decrease α. From the normal equations we have a characterization of where we converge to; from the steepest descent equation we can characterize where LMS is expected to converge on average. © Eric Xing @ CMU, 2006-2011

Convergence curves. For the batch method, the training MSE is initially large due to the uninformed initialization; in the online update, the N updates per epoch reduce the MSE to a much smaller value. © Eric Xing @ CMU, 2006-2011

Least Squares Learning: Three approaches to solving Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient) pros: conceptually simple, guaranteed convergence cons: batch, often slow to converge Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient) pros: memory efficient, fast convergence, less prone to local optima cons: convergence in practice requires tuning and fancier variants Approach 3: Closed Form (set derivatives equal to zero and solve for parameters) pros: one shot algorithm! cons: does not scale to large datasets (matrix inverse is bottleneck)

Matching Game. Goal: match the algorithm to its update rule. Algorithms: 1. SGD for Logistic Regression; 2. Least Mean Squares; 3. Perceptron (next lecture). The candidate update rules (4, 5, 6) appear only as images on the slide. Answer choices: A. 1=5, 2=4, 3=6; B. 1=5, 2=6, 3=4; C. 1=6, 2=4, 3=4; D. 1=5, 2=6, 3=6; E. 1=6, 2=6, 3=6

Geometric Interpretation of LMS. The predictions on the training data are ŷ = Xθ* = X(X^T X)^{-1} X^T y. Note that ŷ is the orthogonal projection of y into the space spanned by the columns of X. This is the best we can do when we predict y with a linear model of x! © Eric Xing @ CMU, 2006-2011
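The projection claim follows because the residual is orthogonal to every column of X, a one-line check from the closed form above:

X^{T} (y - \hat{y}) = X^{T} y - X^{T} X (X^{T} X)^{-1} X^{T} y = 0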

Probabilistic Interpretation of LMS. Let us assume that the target variable and the inputs are related by the equation y_i = θ^T x_i + ε_i, where ε_i is an error term capturing unmodeled effects or random noise. Now assume that ε_i follows a Gaussian with mean 0 and standard deviation σ; then we have a conditional density for each y_i, and by the independence assumption the likelihood of the data is the product of these densities. © Eric Xing @ CMU, 2006-2011
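The densities that appear as images on the slide, written out under that Gaussian noise assumption:

p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \theta^{T} x_i)^2}{2\sigma^2} \right),
\qquad
L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta)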

Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is as shown below. Do you recognize the last term? Yes, it is the least-squares cost. Thus, under the independence assumption, LMS is equivalent to MLE of θ! © Eric Xing @ CMU, 2006-2011
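Taking the log of the likelihood above gives

\ell(\theta) = \log L(\theta) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{N} (y_i - \theta^{T} x_i)^2,

and the last term is, up to the constant factor 1/σ², exactly the least-squares cost J(θ), so maximizing ℓ(θ) is the same as minimizing J(θ).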

Non-linear basis functions. So far we only used the observed values x1, x2, … However, linear regression can be applied in the same way to functions of these values. E.g., to add a term w·x1x2, add a new variable z = x1x2, so each example becomes: x1, x2, …, z. As long as these functions can be directly computed from the observed values, the model is still linear in the parameters and the problem remains a multivariate linear regression problem.
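A small illustration of this idea (the data and coefficient values here are our own made-up example, not from the slides): append the interaction feature z = x1·x2 as an extra column and fit by ordinary least squares.

```python
import numpy as np

def add_interaction(X):
    """X has columns [1, x1, x2]; append the product feature z = x1 * x2."""
    z = X[:, 1] * X[:, 2]
    return np.column_stack([X, z])

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = 1.0 + 2.0 * X[:, 1] - 3.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=100)

theta, *_ = np.linalg.lstsq(add_interaction(X), y, rcond=None)
print(theta)  # approximately [1, 2, 0, -3]: the interaction weight is recovered
```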

Non-linear basis functions. What type of functions can we use? A few common examples: polynomial, φ_j(x) = x^j for j = 0 … n; Gaussian; sigmoid; logs. Any function of the input values can be used; the solution for the parameters of the regression remains the same.
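The Gaussian and sigmoid formulas are images in the transcript; the standard forms (as in Bishop, Ch. 3, with center μ_j and scale s) are

\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2 s^2} \right),
\qquad
\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right) \text{ with } \sigma(a) = \frac{1}{1 + e^{-a}}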

General linear regression problem. Using our new notation for the basis functions, linear regression can be written as y = Σ_{j=0}^{n} w_j φ_j(x), where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined, and φ_0(x) = 1 for the intercept term.

An example: polynomial basis vectors on a small dataset From Bishop Ch 1

0th Order Polynomial n=10

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting Root-Mean-Square (RMS) Error:
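The RMS error referred to here, as defined in Bishop, Ch. 1:

E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{*}) / N},

where E(w*) is the sum-of-squares error of the fitted polynomial and N is the number of data points. Dividing by N allows comparison across data set sizes, and the square root puts the error on the same scale as the target variable.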

Polynomial Coefficients

Regularization Penalize large coefficient values
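The penalized error function shown on this slide, in Bishop's notation, adds a squared-norm penalty with weight λ:

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2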

Regularization: (figure: the regularized 9th order polynomial fit)

Polynomial Coefficients (table comparing coefficients with no regularization, where they are huge, against regularized fits, where they shrink)

Over-regularization:

Example: Stock Prices Suppose we wish to predict Google’s stock price at time t+1 What features should we use? (putting all computational concerns aside) Stock prices of all other stocks at times t, t-1, t-2, …, t - k Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets Do we believe that all of these features are going to be useful?

Ridge Regression adds an L2 regularizer to Linear Regression, which prefers parameters close to zero. Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters.
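The objective is an image on the slide; the standard form of ridge regression, with regularization weight λ, is

J_{\mathrm{ridge}}(\theta) = \sum_{i=1}^{N} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2 + \lambda \lVert \theta \rVert_2^2,

which still has a closed-form solution, θ = (X^T X + λI)^{-1} X^T y, so the regularizer also fixes the rank-deficient case mentioned earlier.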

LASSO adds an L1 regularizer to Linear Regression, which yields sparse parameters (exact zeros). Bayesian interpretation: MAP estimation with a Laplace prior on the parameters.
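Again the objective is an image; the standard form of the LASSO, with regularization weight λ, is

J_{\mathrm{lasso}}(\theta) = \sum_{i=1}^{N} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2 + \lambda \lVert \theta \rVert_1 = \sum_{i=1}^{N} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^2 + \lambda \sum_{k} |\theta_k|,

which, unlike ridge regression, has no closed-form solution in general.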

Ridge Regression vs. Lasso. (Figure: level sets of J(β), i.e. the βs with constant J(β), plotted in the (β1, β2) plane against the constraint regions: βs with constant L2 norm for ridge, βs with constant L1 norm for lasso.) Lasso (the L1 penalty) results in sparse solutions – vectors with more zero coordinates. Good for high-dimensional problems – we don't have to store all coordinates! © Eric Xing @ CMU, 2006-2011

Data Set Size: 9th Order Polynomial