Machine Learning, WS 13/14 – Lecture 14: Introduction to Regression (12.12.2013, Bastian Leibe, RWTH Aachen)


Slide 1: Machine Learning – Lecture 14: Introduction to Regression. 12.12.2013. Bastian Leibe, RWTH Aachen. http://www.vision.rwth-aachen.de, leibe@vision.rwth-aachen.de

Slide 2: Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Decision Trees & Randomized Trees, Model Selection, Regression Problems
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Slide 3: Topics of This Lecture
- Regression: motivation (polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression, basis functions)
- Regularization revisited (regularized least-squares, the Lasso, discussion)
- Kernels (dual representation, kernel ridge regression)

Slide 4: Regression
Learning to predict a continuous function value.
- Given: a training set X = {x_1, …, x_N} with target values T = {t_1, …, t_N}.
- Learn a continuous function y(x) to predict the function value for a new input x.
Steps towards a solution:
- Choose a form of the function y(x, w) with parameters w.
- Define an error function E(w) to optimize.
- Optimize E(w) for w to find a good solution. (This may involve math.)
- Derive the properties of this solution and think about its limitations.

Slide 5: Example: Polynomial Curve Fitting
Toy dataset: generated from the function sin(2πx), with a small amount of Gaussian random noise added (blue dots).
Goal: fit a polynomial function to this data.
Note: the polynomial is a nonlinear function of x, but a linear function of the coefficients w_j. (Image source: C.M. Bishop, 2006)
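As an illustration, a minimal sketch of how such a toy set could be generated (the sample size N = 10 and the noise level 0.3 are illustrative assumptions, not values stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                                   # assumed number of training points
x = np.linspace(0.0, 1.0, N)             # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets around sin(2*pi*x)
```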

Slide 6: Error Function
How do we determine the values of the coefficients w?
- We need to define an error function to be minimized.
- This function specifies how a deviation from the target value should be weighted.
Popular choice: the sum-of-squares error (see the reconstruction below).
- Minimize the error: compute the derivative and set it to zero. (Image source: C.M. Bishop, 2006)
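The definition itself is an image on the original slide and is not reproduced in the transcript; the standard sum-of-squares form from Bishop (Eq. 1.2), presumably what the slide shows, is:

```latex
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2
```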

Slide 7: Recap: Minimizing the Error
How do we minimize the error? Solution (always!): compute the derivative and set it to zero.
- Since the error is a quadratic function of w, its derivative will be linear in w.
- ⇒ The minimization has a unique solution.

Slide 8: Least-Squares Regression
We are given training data points x_1, …, x_N and associated function values t_1, …, t_N.
Start with a linear regressor:
- Try to enforce y(x_n, w) = t_n, i.e. one linear equation for each training data point / label pair.
- Same basic setup as in least-squares classification; only the target values are now continuous. (Slide credit: Bernt Schiele)

Slide 9: Least-Squares Regression
Setup:
- Step 1: Define the form of the linear regressor.
- Step 2: Rewrite the condition y(x_n, w) = t_n for all training pairs.
- Step 3: Collect everything in matrix-vector notation.
- Step 4: Find the least-squares solution (shown in the sketch below). (Slide credit: Bernt Schiele)
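The solution formula is not reproduced in the transcript; it is the standard pseudo-inverse / normal-equations solution, sketched below under the assumption that the design matrix Phi (one row per data point) and the target vector t are already given:

```python
import numpy as np

def least_squares_fit(Phi, t):
    """Least-squares weights for the model y = Phi @ w.

    Equivalent to the normal-equations solution w = (Phi^T Phi)^{-1} Phi^T t,
    but computed via np.linalg.lstsq for better numerical behaviour.
    """
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w
```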

Slide 10: Regression with Polynomials
How can we fit arbitrary polynomials using least-squares regression?
- We introduce a feature transformation using basis functions, e.g. φ_j(x) = x^j.
- Fitting a cubic polynomial then amounts to ordinary least squares on the transformed features (see the sketch below). (Slide credit: Bernt Schiele)
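A minimal sketch of such a feature transformation, assuming scalar inputs and the monomial basis φ_j(x) = x^j:

```python
import numpy as np

def polynomial_features(x, M):
    """Design matrix with columns 1, x, x^2, ..., x^M (basis functions phi_j(x) = x^j)."""
    return np.vander(np.asarray(x, dtype=float), N=M + 1, increasing=True)

# Cubic fit: reuse the least-squares solver from above on the transformed features.
# w = least_squares_fit(polynomial_features(x, 3), t)
```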

Slide 11: Varying the Order of the Polynomial
Fits for different polynomial orders M. Which one should we pick? For M = 9: massive overfitting! (Image source: C.M. Bishop, 2006)

Slide 12: Analysis of the Results
Results for different values of M:
- Best representation of the original function sin(2πx) is obtained with M = 3.
- M = 9 fits the training data perfectly, but represents the original function poorly.
Why is that? After all, M = 9 contains M = 3 as a special case! (Image source: C.M. Bishop, 2006)

Slide 13: Overfitting
Problem: the training data contains some noise.
- A higher-order polynomial can fit this noise perfectly; we say it is overfitting to the training data.
The goal is a good prediction of future data.
- Our target function should fit the training data well, but also generalize.
- Measure generalization performance on an independent test set.

Slide 14: Measuring Generalization
E.g., the root-mean-square (RMS) error E_RMS.
Motivation:
- Division by N lets us compare different data set sizes.
- The square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
Figure: training and test E_RMS versus M; the annotation marks where overfitting sets in. (Image source: C.M. Bishop, 2006)
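The definition shown on the slide is not in the transcript; Bishop defines E_RMS = sqrt(2 E(w*) / N), which for the sum-of-squares error is just the root of the mean squared residual. A small sketch:

```python
import numpy as np

def rms_error(y_pred, t):
    """E_RMS = sqrt(2 * E(w) / N) with E(w) = 0.5 * sum((y - t)^2)."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(t)) ** 2))
```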

Slide 15: Analyzing Overfitting
Example: polynomial of degree 9.
- Overfitting becomes less of a problem with more data.
- Figure: with relatively little data, overfitting is typical; with enough data, we get a good estimate. (Slide adapted from Bernt Schiele; image source: C.M. Bishop, 2006)

Slide 16: What Is Happening Here?
The coefficients get very large: fitting the data from before with polynomials of increasing order and inspecting the resulting coefficients (table not reproduced here) shows the magnitudes exploding for large M. (Image source: C.M. Bishop, 2006; slide credit: Bernt Schiele)

Slide 17: Regularization
What can we do then? How can we apply the approach to data sets of limited size, while still using relatively complex and flexible models?
Workaround: regularization.
- Penalize large coefficient values.
- Here we have simply added a quadratic regularizer, which is simple to optimize.
- The resulting form of the problem is called ridge regression.
- (Note: w_0 is often omitted from the regularizer.)

Slide 18: Ridge Regression
Setting the gradient to zero yields a closed-form solution (see the sketch below).
Effect of regularization: it keeps the matrix to be inverted well-conditioned.
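The closed-form solution itself is an image on the slide; it is the standard ridge formula w = (λI + Φ^T Φ)^{-1} Φ^T t, sketched below (Phi, t, and the regularization strength lam are assumed given):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Ridge regression weights: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.

    Adding lam*I shifts all eigenvalues of Phi^T Phi up by lam,
    which is why the inverse stays well-conditioned.
    """
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```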

Slide 19: Results with Regularization (M = 9). (Image source: C.M. Bishop, 2006)

Slide 20: RMS Error for the Regularized Case
Effect of regularization: the trade-off parameter λ now controls the effective model complexity and thus the degree of overfitting. (Image source: C.M. Bishop, 2006)

Slide 21: Summary
We have seen several important concepts: linear regression, overfitting, the role of the amount of data, the role of model complexity, and regularization.
How can we approach this more systematically?
- We would like to work with complex models.
- How can we prevent overfitting systematically?
- How can we avoid the need for validation on separate test data?
- What does it actually mean to do regularization?

Slide 22: Topics of This Lecture
- Regression: motivation (polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression, basis functions)
- Regularization revisited (regularized least-squares, the Lasso, discussion)
- Kernels (dual representation, kernel ridge regression)

Slide 23: Linear Basis Function Models
Generally, we consider models of the form y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x), where the φ_j(x) are known as basis functions.
- Typically, φ_0(x) = 1, so that w_0 acts as a bias.
- In the simplest case, we use linear basis functions: φ_d(x) = x_d.
Let's take a look at some other possible basis functions… (Slide adapted from C.M. Bishop, 2006)

Slide 24: Linear Basis Function Models (2)
Polynomial basis functions: φ_j(x) = x^j.
Properties: global; a small change in x affects all basis functions. (Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006)

Slide 25: Linear Basis Function Models (3)
Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)² / (2s²)).
Properties: local; a small change in x affects only nearby basis functions. μ_j and s control location and scale (width). (Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006)

Slide 26: Linear Basis Function Models (4)
Sigmoidal basis functions: φ_j(x) = σ((x − μ_j) / s), where σ(a) = 1 / (1 + exp(−a)).
Properties: local; a small change in x affects only nearby basis functions. μ_j and s control location and scale (slope). (Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006)
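A compact sketch of the two local basis families just described, assuming scalar inputs x and an array of centres mu:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)): local, bell-shaped."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.exp(-(x - mu[None, :]) ** 2 / (2.0 * s ** 2))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    x = np.asarray(x, dtype=float)[:, None]
    return 1.0 / (1.0 + np.exp(-(x - mu[None, :]) / s))

# Example: Phi = gaussian_basis(x_train, mu=np.linspace(0, 1, 9), s=0.1)
# can be plugged into least_squares_fit or ridge_fit from the sketches above.
```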

Slide 27: Discussion
General regression formulation:
- In principle, we can perform regression in arbitrary spaces and with many different types of basis functions.
- However, there is a caveat… Can you see what it is?
Example: polynomial curve fitting with M = 3.
- The number of coefficients grows like D^M for D input dimensions.
- ⇒ The approach quickly becomes impractical for high dimensions.
- This is again the curse of dimensionality.

Slide 28: Topics of This Lecture
- Regression: motivation (polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression, basis functions)
- Regularization revisited (regularized least-squares, the Lasso, discussion)
- Kernels (dual representation, kernel ridge regression)

Slide 29: Regularization Revisited
Consider the error function: data term + regularization term, where λ is called the regularization coefficient.
With the sum-of-squares error function and a quadratic regularizer, we obtain a problem that is minimized in closed form (see the reconstruction below). (Slide adapted from C.M. Bishop, 2006)
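The formulas are images on the original slide; the standard forms from Bishop (Eqs. 3.27–3.28), presumably what is shown, are:

```latex
\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2
                      + \frac{\lambda}{2}\, \mathbf{w}^{\top}\mathbf{w},
\qquad
\mathbf{w}^{\star} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\top} \mathbf{t}
```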

Slide 30: Regularized Least-Squares
Let's look at more general regularizers, the "L_q norms" with penalty Σ_j |w_j|^q.
- q = 2 gives ridge regression; q = 1 gives the Lasso. (Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006)

Slide 31: Recall: Lagrange Multipliers.

Slide 32: Regularized Least-Squares
We want to minimize the regularized error. This is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint Σ_j |w_j|^q ≤ η (for some suitably chosen η).

Slide 33: Regularized Least-Squares
Effect: sparsity for q ≤ 1; the minimization tends to set many coefficients exactly to zero.
Why is this good? Why don't we always do it, then? Any problems? (Image source: C.M. Bishop, 2006)

Slide 34: The Lasso
Consider a least-squares regressor with an L_1 penalty on the coefficients; this formulation is known as the Lasso.
Properties:
- L_1 regularization ⇒ the solution will be sparse (only few coefficients will be non-zero).
- The L_1 penalty makes the problem non-linear in the targets, and there is no closed-form solution.
- We need to solve a quadratic programming problem.
- However, efficient algorithms are available with essentially the same computational cost as for ridge regression. (Image source: C.M. Bishop, 2006)
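One family of such efficient algorithms is cyclic coordinate descent with soft thresholding; the sketch below is a bare-bones illustration of that idea (not the specific solver referenced in the lecture), minimizing 0.5·||t − Φw||² + λ·||w||_1:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Closed-form solution of the one-dimensional lasso subproblem."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(Phi, t, lam, n_sweeps=200):
    """Minimize 0.5*||t - Phi@w||^2 + lam*||w||_1 by cyclic coordinate descent."""
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = (Phi ** 2).sum(axis=0)              # Phi_j^T Phi_j, precomputed
    for _ in range(n_sweeps):
        for j in range(M):
            # residual with feature j's current contribution removed
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w
```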

Slide 35: Lasso as Bayes Estimation
Interpretation as Bayes estimation: we can think of |w_j|^q as (proportional to) the negative log-prior density for w_j.
Prior for the Lasso (q = 1): the Laplacian distribution. (Image source: Friedman, Hastie, Tibshirani, 2009)
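The density is not reproduced in the transcript; the standard Laplacian form, which makes the connection explicit, is:

```latex
p(w_j) = \frac{1}{2\tau} \exp\!\left( -\frac{|w_j|}{\tau} \right)
\;\;\Rightarrow\;\;
-\log p(\mathbf{w}) = \frac{1}{\tau} \sum_{j} |w_j| + \text{const}
```

Maximizing the posterior under this prior therefore adds exactly the L_1 penalty of the lasso to the data term.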

Slide 36: Analysis
Equicontours of the prior distribution for different q.
- For q ≤ 1, the prior is not uniform in direction, but concentrates more mass on the coordinate directions.
- The case q = 1 (lasso) is the smallest q for which the constraint region is still convex; non-convexity makes the optimization problem more difficult.
- In the limit q → 0, the regularization term simply counts the number of non-zero coefficients. This is known as best subset selection. (Image source: Friedman, Hastie, Tibshirani, 2009)

Slide 37: Discussion
We might also try other values of q besides 0, 1, 2…
- However, experience shows that this is not worth the effort.
- Values of q ∈ (1, 2) are a compromise between the lasso and ridge regression.
- However, |w_j|^q with q > 1 is differentiable at 0 and therefore loses the lasso's ability to set coefficients exactly to zero.

Slide 38: Topics of This Lecture
- Regression: motivation (polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression, basis functions)
- Regularization revisited (regularized least-squares, the Lasso, discussion)
- Kernels (dual representation, kernel ridge regression)

Slide 39: Kernel Methods for Regression
Dual representations:
- Many linear models for regression and classification can be reformulated in terms of a dual representation, in which predictions are based on linear combinations of a kernel function evaluated at the training data points.
- For models based on a fixed nonlinear feature-space mapping φ(x), the kernel function is given by k(x, x') = φ(x)^T φ(x').
- We have seen that by substituting the inner product by the kernel, we can obtain interesting extensions of many well-known algorithms…
- Let's try this for regression as well.

Slide 40: Dual Representations: Derivation
Consider a regularized linear regression model. Setting the gradient of J(w) to zero gives a solution that can be written as a linear combination of the φ(x_n), i.e. w = Φ^T a, with coefficients that are functions of w: a_n = −(1/λ)(w^T φ(x_n) − t_n).

Slide 41: Dual Representations: Derivation
Dual definition:
- Instead of working with w, we can formulate the optimization in terms of a by substituting w = Φ^T a into J(w).
- Define the kernel (Gram) matrix K = ΦΦ^T with elements K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m).
- The sum-of-squares error can then be written entirely in terms of K (see the reconstruction below).
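The dual error is shown as an image on the slide; the standard form from Bishop (Eq. 6.5), presumably what is displayed, is:

```latex
J(\mathbf{a}) = \frac{1}{2}\, \mathbf{a}^{\top} \mathbf{K} \mathbf{K} \mathbf{a}
              - \mathbf{a}^{\top} \mathbf{K} \mathbf{t}
              + \frac{1}{2}\, \mathbf{t}^{\top} \mathbf{t}
              + \frac{\lambda}{2}\, \mathbf{a}^{\top} \mathbf{K} \mathbf{a}
```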

Slide 42: Kernel Ridge Regression
Solving for a, we obtain a = (K + λI_N)^{-1} t.
Prediction for a new input x: writing k(x) for the vector with elements k_n(x) = k(x_n, x), we get y(x) = k(x)^T a = k(x)^T (K + λI_N)^{-1} t.
- The dual formulation allows the solution to be expressed entirely in terms of the kernel function k(x, x').
- The resulting method is known as kernel ridge regression and allows us to perform non-linear regression. (Image source: Christoph Lampert)
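A minimal sketch of this dual solution with a Gaussian (RBF) kernel; the kernel choice and the default values of lam and gamma are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2) for row-wise 2-D inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, t, lam=1e-2, gamma=1.0):
    """Dual coefficients a = (K + lam*I)^{-1} t."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_new, X_train, a, gamma=1.0):
    """y(x) = k(x)^T a, where k(x)_n = k(x, x_n)."""
    return rbf_kernel(X_new, X_train, gamma) @ a
```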

Slide 43: References and Further Reading
More information on linear regression, including a discussion of regularization, can be found in Chapters 1.5.5 and 3.1–3.2 of the Bishop book. Additional information on the Lasso, including efficient algorithms to solve it, can be found in Chapter 3.4 of the Hastie book.
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd edition, Springer, 2009.

