Regression. 10-601 Machine Learning.


Regression Machine Learning

Outline
- Regression vs. Classification
- Linear regression: another discriminative learning method
  - As optimization: gradient descent
  - As matrix inversion (Ordinary Least Squares)
- Overfitting and bias-variance
- Bias-variance decomposition for classification

What is regression?

Where we are: Inputs → Classifier → predict a category (already covered); Inputs → Density Estimator → predict a probability (already covered); Inputs → Regressor → predict a real number (today).

Regression examples

Prediction of menu prices (Chahuneau, Gimpel, … and Smith, EMNLP 2012).

A decision tree for classification: each leaf predicts a class (Play / Don't Play).

A regression tree: each leaf stores the Play times of the training examples that reach it (e.g., 30m and 45m; 20m, 30m and 45m; 0m, 0m and 15m; 0m and 0m) and predicts their average (Play ≈ 37, 32, 5, 0).

Theme for the week: learning as optimization

Types of learners. Two types of learners:
1. Generative: make assumptions about how the data is generated (given the class), e.g., naïve Bayes.
2. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression.
Today: another discriminative learner, but for regression tasks.

Regression as optimization: Least Mean Squares (LMS). Toy problem #2.

Linear regression. Given an input x we would like to compute an output y. For example:
- predict height from age
- predict Google's price from Yahoo's price
- predict distance from the wall from sensor readings

Linear regression. Given an input x we would like to compute an output y. In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement or other noise. (In the plot, the dots are the observed values and the line is what we are trying to predict.)

Linear regression. Our goal is to estimate w from training data of pairs (x_i, y_i). Optimization goal: minimize the squared error (least squares):

  w* = argmin_w Σ_i (y_i − w x_i)^2

Why least squares?
- minimizes the squared distance between the measurements and the predicted line
- has a nice probabilistic interpretation (see HW)
- the math is pretty

Solving linear regression. To optimize, we just take the derivative of the squared error with respect to w; the term w x_i inside the sum is the prediction. Compare to the gradient we derived for logistic regression.

Solving linear regression. To optimize in closed form, we take the derivative with respect to w and set it to 0, which gives

  w = Σ_i x_i y_i / Σ_i x_i^2

i.e., covar(X,Y)/var(X) if mean(X) = mean(Y) = 0.
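To make the closed form concrete, here is a minimal NumPy sketch (my own illustrative code, not from the slides; the synthetic data and variable names are assumptions):

```python
import numpy as np

# Synthetic data for the no-intercept model y = w*x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)

# Closed form for a single weight: w = sum_i(x_i * y_i) / sum_i(x_i^2),
# which equals covar(X, Y) / var(X) when mean(X) = mean(Y) = 0.
w_hat = np.sum(x * y) / np.sum(x * x)
print(w_hat)  # should be close to the generating value w = 2
```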

Regression example Generated: w=2 Recovered: w=2.03 Noise: std=1

Regression example Generated: w=2 Recovered: w=2.05 Noise: std=2

Regression example Generated: w=2 Recovered: w=2.08 Noise: std=4

Bias term. So far we assumed that the line passes through the origin. What if it does not? No problem, simply extend the model to y = w0 + w1 x + ε, where w0 is the intercept (bias term). We can use least squares to determine w0 and w1. A simpler solution is coming soon.

Multivariate regression. What if we have several inputs, e.g., the stock prices of Yahoo, Microsoft and Ebay for the Google price prediction task? This becomes a multivariate regression problem. Again, it is easy to model: y = w0 + w1 x1 + … + wk xk + ε.

Not all functions can be approximated well by a line/hyperplane, e.g., y = 10 + 3x^2 + ε. In some cases we would like to use polynomial or other terms based on the input data; are these still linear regression problems? Yes: as long as the equation is linear in the coefficients, it is still a linear regression problem.

Non-linear basis functions. So far we only used the observed values x1, x2, … directly. However, linear regression can be applied in the same way to functions of these values. E.g., to add a term w·x1x2, add a new variable z = x1x2, so each example becomes x1, x2, …, z. As long as these functions can be computed directly from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem.

Non-linear basis functions. How can we use this to add an intercept term? Add a new "variable" z = 1 with weight w0.

Non-linear basis functions. What type of functions can we use? A few common examples:
- Polynomial: φj(x) = x^j for j = 0 … n
- Gaussian: φj(x) = exp(−(x − μj)^2 / (2s^2))
- Sigmoid: φj(x) = 1 / (1 + exp(−(x − μj)/s))
- Logs: φj(x) = log(x + 1)
Any function of the input values can be used; the solution for the parameters of the regression remains the same.

General linear regression problem. Using our new notation for the basis functions, linear regression can be written as

  y = Σ_{j=0..k} wj φj(x) + ε

where φj(x) can be either xj, for multivariate regression, or one of the non-linear basis functions we defined, and φ0(x) = 1 for the intercept term.
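To make the basis-function notation concrete, here is a small illustrative sketch (not from the slides) that builds the design matrix with φ0(x) = 1 and polynomial basis functions φj(x) = x^j; any other function computable from the inputs could be substituted:

```python
import numpy as np

def design_matrix(x, degree):
    """Return the n-by-(degree+1) matrix with Phi[i, j] = phi_j(x_i) = x_i**j.

    phi_0(x) = 1 supplies the intercept term; higher powers are polynomial
    basis functions.  Any other function computable from the inputs
    (Gaussian bumps, sigmoids, logs, products like x1*x2) could be used
    instead; the model stays linear in the weights w."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** j for j in range(degree + 1)])

Phi = design_matrix([0.0, 0.5, 1.0, 2.0], degree=3)
print(Phi)  # rows are examples, columns are basis functions
```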

Learning/Optimizing Multivariate Least Squares Approach 1: Gradient Descent

Gradient descent

Gradient Descent for Linear Regression. Goal: minimize the following loss function:

  J(w) = Σ_{i=1..n} ( y_i − Σ_{j=0..k} wj φj(x_i) )^2

where the outer sum runs over the n examples and the inner sum over the k+1 basis vectors.

Gradient Descent for Linear Regression. Learning algorithm:
- Initialize weights w = 0
- For t = 1, … until convergence:
  - Predict for each example x_i using w: ŷ_i = Σ_j wj φj(x_i)
  - Compute the gradient of the loss: g_j = −2 Σ_i (y_i − ŷ_i) φj(x_i); this is a vector g
  - Update: w = w − λg, where λ is the learning rate
A minimal sketch of this loop follows.
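A minimal sketch of batch gradient descent in NumPy (illustrative; the data, learning rate, and iteration count are assumptions, and a real implementation would test for convergence instead of running a fixed number of steps):

```python
import numpy as np

def gd_linear_regression(Phi, y, lr=0.005, iters=2000):
    """Batch gradient descent on J(w) = sum_i (y_i - w . phi(x_i))^2.
    The learning rate must be small enough for the scale of the data."""
    w = np.zeros(Phi.shape[1])           # initialize weights to 0
    for _ in range(iters):
        preds = Phi @ w                  # predictions with the current w
        g = -2.0 * Phi.T @ (y - preds)   # gradient of the loss (a vector)
        w = w - lr * g                   # update step; lr is the learning rate
    return w

# Tiny demo on synthetic data generated as y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
Phi = np.column_stack([np.ones_like(x), x])   # phi_0 = 1, phi_1 = x
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=50)
print(gd_linear_regression(Phi, y))           # roughly [1.0, 2.0]
```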

Gradient Descent for Linear Regression. We can use any of the tricks we used for logistic regression:
- stochastic gradient descent (if the data is too big to fit in memory)
- regularization
- …

Linear regression is a convex optimization problem. Proof sketch: differentiate again; the second derivative of the loss with respect to each weight is 2 Σ_i φj(x_i)^2 ≥ 0, so the loss is convex and gradient descent will reach a global optimum.

Multivariate Least Squares Approach 2: Matrix Inversion

OLS (Ordinary Least Squares) solution. Goal: minimize the same squared-error loss J(w) = Σ_i (y_i − Σ_j wj φj(x_i))^2, now written in matrix form.

Notation: collect the n examples and the k+1 basis vectors into matrices. Let X be the n × (k+1) design matrix with entries X_ij = φj(x_i), let y be the n-entry vector of targets, and let w be the (k+1)-entry weight vector. Then the loss is

  J(w) = ||y − Xw||^2

Taking the gradient, ∇J(w) = −2 X^T (y − Xw); setting it to zero gives the normal equations X^T X w = X^T y. (Recall the single-variable case: w = Σ_i x_i y_i / Σ_i x_i^2, i.e., covar(X,Y)/var(X) when mean(X) = mean(Y) = 0; the matrix solution generalizes this.)

LMS for the general linear regression problem. Deriving w we get

  w = (X^T X)^-1 X^T y

where X is the n × (k+1) design matrix, y is the n-entry target vector, and w is the (k+1)-entry weight vector. This solution is also known as the "pseudo-inverse". Another reason to start with an objective function: you can see when two learning methods are the same! A sketch of this computation follows.
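A minimal sketch of the closed-form solution (illustrative; it solves the normal equations rather than forming the inverse explicitly, which is cheaper and numerically safer):

```python
import numpy as np

def ols_fit(Phi, y):
    """Ordinary least squares: solve the normal equations
    (Phi^T Phi) w = Phi^T y for the weight vector w."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# np.linalg.lstsq(Phi, y, rcond=None)[0] computes the same pseudo-inverse
# solution via a factorization and also handles rank-deficient Phi.
```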

LMS versus gradient descent.
LMS (closed-form) solution:
+ very simple in Matlab or something similar
− requires a matrix inverse, which is expensive for a large matrix
Gradient descent:
+ fast for large matrices
+ stochastic GD is very memory efficient
+ easily extended to other cases
− parameters to tweak (how to decide convergence? what is the learning rate? …)

Regression and Overfitting

An example: polynomial basis vectors on a small dataset (from Bishop, Ch. 1).

0th Order Polynomial fit (n = 10 data points)

1st Order Polynomial fit

3rd Order Polynomial fit

9th Order Polynomial fit

Over-fitting. Root-Mean-Square (RMS) error, E_RMS = sqrt(2 E(w*) / N), compared on the training and test sets as the polynomial order grows.

Polynomial Coefficients

Data set size: the 9th Order Polynomial fit again, now with more data; over-fitting decreases as the dataset grows.

Regularization: penalize large coefficient values by adding a penalty λ||w||^2 to the squared-error loss.

Regularization: the 9th-order fit with a moderate penalty λ.

Polynomial coefficients under three regularization settings: none, λ = exp(−18), and a huge λ.

Over-regularization: when λ is too large, the fit under-fits the data.

Regularized Gradient Descent for Linear Regression. Goal: minimize the following loss function:

  J(w) = Σ_i ( y_i − Σ_j wj φj(x_i) )^2 + λ Σ_j wj^2

The gradient simply picks up an extra 2λwj term for each weight; a sketch follows.
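A sketch of the gradient for this regularized loss (illustrative; `lam` is the regularization strength, and in practice the intercept weight w0 is often left unpenalized):

```python
import numpy as np

def ridge_gradient(w, Phi, y, lam):
    """Gradient of J(w) = sum_i (y_i - w . phi(x_i))^2 + lam * ||w||^2.
    The gradient descent update w <- w - lr * g is unchanged; only the
    gradient gains the 2 * lam * w penalty term."""
    return -2.0 * Phi.T @ (y - Phi @ w) + 2.0 * lam * w
```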

Probabilistic Interpretation of Least Squares

A probabilistic interpretation. Our least-squares solution can also be motivated by a probabilistic interpretation of the regression problem: assume y = Σ_j wj φj(x) + ε, where ε is Gaussian noise with mean 0 and variance σ^2. The MLE for w in this model is the same as the solution we derived for the least-squares criterion.

Understanding Overfitting: Bias-Variance

Example (Tom Dietterich, Oregon State).

Example (Tom Dietterich, Oregon State): the same experiment, repeated with 50 samples of 20 points each.

Two sources of error:
- The true function f can't be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H.
- We don't get the best hypothesis from H because of noise / a small sample size → Error 2. Fix: a less expressive set of hypotheses H.
(Noise behaves similarly to Error 1.)

Bias-Variance Decomposition: Regression

Bias and variance for regression. For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2).
- Bias: the class of models can't fit the data. Fix: a more expressive model class.
- Variance: the class of models could fit the data, but doesn't because it's hard to fit. Fix: a less expressive model class.

Bias-variance decomposition of error. Notation: h_D is the regressor learned from dataset D, f is the true function, and ε is noise. Fix a test case x, then do this experiment:
1. Draw a size-n sample D = (x_1, y_1), …, (x_n, y_n)
2. Train a linear regressor h_D using D
3. Draw one test example (x, f(x) + ε)
4. Measure the squared error of h_D on that one example x
What's the expected error? (A small simulation of this experiment appears below.)
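A small simulation of this experiment with synthetic data (everything here, the true function, noise level, sample size, and test point, is an illustrative assumption):

```python
import numpy as np

def f(x):
    return np.sin(x)                         # assumed "true" function

rng = np.random.default_rng(0)
sigma, n, trials, x_test = 0.3, 20, 500, 5.0

preds = []
for _ in range(trials):
    xs = rng.uniform(0, 2 * np.pi, size=n)        # 1. draw a size-n sample D
    ys = f(xs) + rng.normal(scale=sigma, size=n)
    Phi = np.column_stack([np.ones(n), xs])        # 2. train a linear regressor h_D
    w = np.linalg.lstsq(Phi, ys, rcond=None)[0]
    preds.append(w[0] + w[1] * x_test)             # h_D(x) at the fixed test point

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2            # squared bias at x
variance = preds.var()                             # variance at x
noise = sigma ** 2                                 # irreducible noise
print(bias2, variance, noise, bias2 + variance + noise)  # ~ expected squared error
```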

Bias-variance decomposition of error. Notation, to simplify: write E_D[h_D(x)] for the long-term expectation of the learner's prediction on this x, averaged over many datasets D.

Bias-variance decomposition of error. The expected squared error decomposes as

  E[(y − h_D(x))^2] = E[ε^2] + (f(x) − E_D[h_D(x)])^2 + E_D[(h_D(x) − E_D[h_D(x)])^2]

i.e., noise + bias^2 + variance.

- Variance: the squared difference between our long-term expectation of the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a single dataset D (ŷ = h_D(x)).
- Bias^2: the squared difference between the best possible prediction for x, f(x), and our long-term expectation of what the learner will do averaged over many datasets D, E_D[h_D(x)].

Illustration: bias and variance of the predictions at x = 5.

Bias-variance decomposition. This is something real that you can (approximately) measure experimentally, if you have synthetic data. Different learners and model classes have different tradeoffs:
- large bias / small variance: few features, heavy regularization, highly pruned decision trees, large-k k-NN, …
- small bias / high variance: many features, less regularization, unpruned trees, small-k k-NN, …

Bias and variance. For classification, we can also decompose the error of a learned classifier into two terms: bias and variance.
- Bias: the class of models can't fit the data. Fix: a more expressive model class.
- Variance: the class of models could fit the data, but doesn't because it's hard to fit. Fix: a less expressive model class.

Another view of a decision tree: each internal node tests a feature against a threshold (e.g., Sepal_length > 5.7, Sepal_width > 2.8, length > 5.1, width > 3.1, length > 4.6), so the tree corresponds to an axis-aligned partition of the feature space into rectangular regions, each predicting one class.

Bias-Variance Decomposition: Measuring

Bias-variance decomposition. This is something real that you can (approximately) measure experimentally:
- if you have synthetic data
- … or if you're clever
- you need to somehow approximate E_D[h_D(x)]
- i.e., construct many variants of the dataset D

Background: "bootstrap" sampling.
Input: dataset D. Output: many variants of D: D_1, …, D_T.
For t = 1, …, T:
- D_t = { }
- For i = 1 … |D|: pick (x, y) uniformly at random from D (i.e., with replacement) and add it to D_t
Some examples never get picked (~37%); some are picked 2x, 3x, … (See the sketch below.)
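A minimal sketch of bootstrap sampling (illustrative names; D can be any list of (x, y) pairs):

```python
import numpy as np

def bootstrap_variants(D, T, seed=0):
    """Return T bootstrap variants of dataset D: each variant has |D| examples
    drawn from D uniformly at random with replacement."""
    rng = np.random.default_rng(seed)
    n = len(D)
    return [[D[i] for i in rng.integers(0, n, size=n)] for _ in range(T)]

# In any one variant, roughly 37% of the original examples never appear,
# since (1 - 1/n)^n approaches 1/e; others appear 2x, 3x, ...
```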

Measuring bias and variance with "bootstrap" sampling. Create B bootstrap variants of D (to approximate many draws of D). For each bootstrap dataset:
- T_b is the dataset; U_b are the "out of bag" examples
- train a hypothesis h_b on T_b
- test h_b on each x in U_b
Now for each example (x, y) we have many predictions h_1(x), h_2(x), …, so we can estimate (ignoring noise):
- variance: the ordinary variance of h_1(x), …, h_n(x)
- bias: average(h_1(x), …, h_n(x)) − y
A sketch of this procedure follows.
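An illustrative sketch of this procedure, assuming `fit` and `predict` stand in for whatever learner you use; the returned summary (mean squared bias and mean variance) is one reasonable aggregation of the per-example quantities defined above:

```python
import numpy as np

def bootstrap_bias_variance(X, y, fit, predict, B=50, seed=0):
    """Approximate per-example bias and variance with bootstrap resampling.
    Each example is scored only by the hypotheses h_b whose bootstrap
    training set T_b left it out of bag (U_b); noise is ignored."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = [[] for _ in range(n)]
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap training set T_b
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag examples U_b
        h_b = fit(X[idx], y[idx])
        for i in oob:
            preds[i].append(predict(h_b, X[i]))
    bias = np.array([np.mean(p) - y[i] for i, p in enumerate(preds) if p])
    var = np.array([np.var(p) for p in preds if p])
    return np.mean(bias ** 2), np.mean(var)     # mean squared bias, mean variance
```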

Applying bias-variance analysis. By measuring the bias and variance on a problem, we can determine how to improve our model:
- if bias is high, we need to allow our model to be more complex
- if variance is high, we need to reduce the complexity of the model
Bias-variance analysis also suggests a way to reduce variance: bagging.

Bagging

Bootstrap Aggregation (Bagging) Use the bootstrap to create B variants of D Learn a classifier from each variant Vote the learned classifiers to predict on a test example

Bagging (bootstrap aggregation). Breaking it down:
- input: dataset D and YFCL (your favorite classifier learner)
- output: a classifier h_D-BAG
- use the bootstrap to construct variants D_1, …, D_T
- for t = 1, …, T: train YFCL on D_t to get h_t
- to classify x with h_D-BAG: classify x with h_1, …, h_T and predict the most frequently predicted class for x (majority vote)
Note that you can use any learner you like! You can also test h_t on the "out of bag" examples. (A sketch appears below.)
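A sketch of this procedure (illustrative; `train_yfcl` stands in for your favorite classifier learner and is assumed to return a callable hypothesis h_t):

```python
import numpy as np

def bagging_predict(X_train, y_train, x, train_yfcl, T=25, seed=0):
    """Train YFCL on T bootstrap variants of the training data and classify x
    by majority vote over the resulting hypotheses h_1, ..., h_T."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)              # bootstrap variant D_t
        h_t = train_yfcl(X_train[idx], y_train[idx])  # train YFCL on D_t
        votes.append(h_t(x))                          # h_t's prediction for x
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                  # most frequently predicted class
```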

Experiments (Freund and Schapire).

Learning curves: solid = naïve Bayes, dashed = logistic regression.

Bagged, minimally pruned decision trees

Generally, bagged decision trees eventually outperform the linear classifier if the data is large enough and clean enough.

Bagging (bootstrap aggregation). Experimentally:
- especially with minimal pruning, decision trees have low bias but high variance
- bagging usually improves performance for decision trees and similar methods
- it reduces variance without increasing the bias (much)