Chapter 3: Linear Methods for Regression
The Elements of Statistical Learning
Aaron Smalter

Chapter Outline
3.1 Introduction
3.2 Linear Regression Models and Least Squares
3.3 Multiple Regression from Simple Univariate Regression
3.4 Subset Selection and Coefficient Shrinkage
3.5 Computational Considerations

3.1 Introduction
A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X1, ..., Xp.
Simple, precomputer-era model.
Can outperform nonlinear models when there are few training cases, a low signal-to-noise ratio, or sparse data.
Can also be applied to transformations of the inputs.

3.2 Linear Regression Models and Least Squares
Vector of inputs: X = (X1, X2, ..., Xp)
Predict a real-valued output: Y

3.2 Linear Regression Models and Least Squares (cont'd)
Inputs may be derived from various sources:
Quantitative inputs
Transformations of inputs (log, square, square root)
Basis expansions (X2 = X1^2, X3 = X1^3)
Numeric coding (map one multi-level input into many Xi's)
Interactions between variables (X3 = X1 * X2)
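
A minimal NumPy sketch of constructing such derived inputs; the variables here are hypothetical:

import numpy as np

# Hypothetical raw inputs: two quantitative measurements per observation.
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, size=100)
x2 = rng.uniform(1, 10, size=100)

# Derived inputs of the kinds listed above.
features = np.column_stack([
    x1,            # quantitative input
    np.log(x1),    # transformation
    x1 ** 2,       # basis expansion
    x1 * x2,       # interaction
])

# Design matrix with a leading column of 1's for the intercept.
X = np.column_stack([np.ones(len(x1)), features])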

3.2 Linear Regression Models and Least Squares (cont'd)
A set of training data (x1, y1), ..., (xN, yN) is used to estimate the parameters.
Each xi is a vector of feature measurements: xi = (xi1, xi2, ..., xip)^T

3.2 Linear Regression Models and Least Squares (cont'd)
Pick a set of coefficients B = (B0, B1, ..., Bp)^T in order to minimize the residual sum of squares (RSS):
RSS(B) = sum_{i=1..N} (yi - B0 - sum_{j=1..p} xij Bj)^2

3.2 Linear Regression Models and Least Squares (cont'd)
How do we pick B so that the RSS is minimized?
Let X be the N x (p+1) matrix whose rows are the input vectors (with a 1 in the first position) and whose columns are the feature vectors.
Let y be the N x 1 vector of outputs.

3.2 Linear Regression Models and Least Squares (cont'd)
Rewrite the RSS as RSS(B) = (y - XB)^T (y - XB).
Differentiate with respect to B: dRSS/dB = -2 X^T (y - XB).
Set the first derivative to zero (assuming X has full column rank, so X^T X is positive definite): X^T (y - XB) = 0.
Obtain the unique solution: B^ = (X^T X)^{-1} X^T y.
Fitted values are: y^ = X B^ = X (X^T X)^{-1} X^T y.
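
A short NumPy sketch of this least squares solution via the normal equations (simulated data, illustrative only):

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), first column of 1's
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Solve the normal equations X^T X B = X^T y rather than forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat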

3.2.2 The Gauss-Markov Theorem
Asserts that the least squares estimates of the parameters have the smallest variance among all linear unbiased estimates.
Consider estimating any linear combination of the parameters, θ = a^T B.

3.2.2 The Gauss-Markov Theorem (cont'd)
The least squares estimate is θ^ = a^T B^ = a^T (X^T X)^{-1} X^T y.
If X is fixed, this is a linear function c0^T y of the response vector y.
If the linear model is correct, a^T B^ is an unbiased estimate of a^T B.
Gauss-Markov Theorem: if c^T y is any other linear estimator that is unbiased for a^T B, then Var(a^T B^) <= Var(c^T y).

3.2.2 The Gauss-Markov Theorem (cont'd)
However, an unbiased estimator is not always best.
A biased estimator may exist with a smaller mean squared error, trading a little bias for a larger reduction in variance.
In reality, most models are approximations of the truth and therefore biased; we need to pick a model that balances bias and variance.

3.3 Multiple Regression from Simple Univariate Regression
A linear model with p > 1 inputs is a multiple linear regression model.
Suppose we first have a univariate model (p = 1) with no intercept: Y = XB + e.
The least squares estimate and residuals are B^ = (sum_i xi yi) / (sum_i xi^2) and ri = yi - xi B^.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
With inner product notation we can rewrite this as B^ = <x,y> / <x,x> and r = y - x B^.
We will use univariate regression as a building block for multiple least squares regression.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
Suppose the inputs x1, x2, ..., xp are orthogonal (<xj,xk> = 0 for all j != k).
Then the multiple least squares estimates Bj^ are equal to the univariate estimates <xj,y> / <xj,xj>.
When inputs are orthogonal, they have no effect on each other's parameter estimates.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
Observed inputs are not usually orthogonal, so we must make them so.
Given an intercept and a single input x, the least squares coefficient of x has the form B1^ = <x - xbar*1, y> / <x - xbar*1, x - xbar*1>, where xbar is the mean of x and 1 is the vector of ones.
This is the result of two simple regression applications:
1. regress x on 1 to produce the residual z = x - xbar*1
2. regress y on the residual z to give the coefficient B1^

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
In this procedure, the phrases "regress b on a" or "b is regressed on a" refer to:
A simple univariate regression of b on a with no intercept, producing the coefficient gamma^ = <a,b> / <a,a>,
And the residual vector b - gamma^ a.
We say b is adjusted for a, or is "orthogonalized" with respect to a.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
We can generalize this approach to p inputs (regression by successive orthogonalization, i.e. the Gram-Schmidt procedure):
Initialize z0 = x0 = 1; for j = 1, ..., p, regress xj on z0, ..., z_{j-1} to produce the residual zj; finally, regress y on the last residual zp.
The result of the algorithm is Bp^ = <zp, y> / <zp, zp>.
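
A compact NumPy sketch of regression by successive orthogonalization, assuming the design matrix carries a leading column of 1's (simulated data):

import numpy as np

def successive_orthogonalization(X, y):
    """Regression by successive orthogonalization (Gram-Schmidt).

    X is N x (p+1) with a leading column of ones; returns the coefficient
    of the last column, computed from its residual zp.
    """
    N, P = X.shape
    Z = np.zeros_like(X, dtype=float)
    Z[:, 0] = X[:, 0]
    for j in range(1, P):
        zj = X[:, j].astype(float)
        for k in range(j):
            gamma = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            zj = zj - gamma * Z[:, k]
        Z[:, j] = zj
    # Coefficient of the last input equals <zp, y> / <zp, zp>.
    zp = Z[:, -1]
    return (zp @ y) / (zp @ zp)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(scale=0.1, size=50)
print(successive_orthogonalization(X, y))  # close to 3.0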

3.3.1 Multiple Outputs
Suppose we have multiple outputs Y1, Y2, ..., YK, predicted from the inputs X0, X1, X2, ..., Xp.
Assume a linear model for each output: Yk = B0k + sum_j Xj Bjk + ek.
In matrix notation: Y = XB + E, where
Y is the N x K response matrix, with ik entry yik
X is the N x (p+1) input matrix
B is the (p+1) x K parameter matrix
E is the N x K matrix of errors

3.3.1 Multiple Outputs (cont'd)
Generalization of the univariate loss function: RSS(B) = sum_k sum_i (yik - fk(xi))^2 = tr[(Y - XB)^T (Y - XB)].
The least squares estimate has the same form: B^ = (X^T X)^{-1} X^T Y.
Multiple outputs do not affect each other's least squares estimates.

3.4 Subset Selection and Coefficient Shrinkage
As mentioned, unbiased estimators are not always best.
Two reasons why we are often not satisfied with the least squares estimates:
1. Prediction accuracy: low bias but large variance; we can improve accuracy by shrinking coefficients or setting some to zero.
2. Interpretation: with a large number of predictors, the signal gets lost in the noise; we want to find a smaller subset with the strongest effects.

3.4.1 Subset Selection
Improve estimator performance by retaining only a subset of the variables.
Use least squares to estimate the coefficients of the retained inputs.
Several strategies:
Best subset regression
Forward stepwise selection
Backward stepwise selection
Hybrid stepwise selection

3.4.1 Subset Selection (cont'd)
Best subset regression
For each k in {0, 1, 2, ..., p}, find the subset of size k that gives the smallest residual sum of squares.
The leaps and bounds procedure (Furnival and Wilson, 1974) makes this feasible for p up to 30-40.
Typically choose k so that an estimate of expected prediction error is minimized.
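
A brute-force sketch of best subset selection for small p; this is an exhaustive search rather than the leaps and bounds algorithm, and the data are simulated:

import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustively search all subsets of size k for the smallest RSS.

    X is N x p (no intercept column); an intercept is always included.
    Returns (best_columns, best_rss).
    """
    N, p = X.shape
    best_cols, best_rss = None, np.inf
    for cols in combinations(range(p), k):
        Xk = np.column_stack([np.ones(N)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss = np.sum((y - Xk @ beta) ** 2)
        if rss < best_rss:
            best_cols, best_rss = cols, rss
    return best_cols, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 2 * X[:, 1] - 3 * X[:, 4] + rng.normal(scale=0.5, size=80)
print(best_subset(X, y, k=2))  # expected to pick columns 1 and 4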

3.4.1 Subset Selection (cont'd)
Forward stepwise selection
Searching all subsets is time consuming; instead, find a good path through them.
Start with the intercept, then sequentially add the predictor that most improves the fit.
"Improved fit" is judged by the F-statistic: add the predictor that gives the largest F value.
Stop adding when no predictor produces a significantly large F value.
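
A greedy forward-selection sketch; here candidates are ranked by the reduction in RSS rather than by a formal F test, and the data are simulated:

import numpy as np

def forward_stepwise(X, y, max_terms):
    """Greedily add the column that most reduces the RSS."""
    N, p = X.shape
    chosen, remaining = [], list(range(p))
    for _ in range(max_terms):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = chosen + [j]
            Xk = np.column_stack([np.ones(N), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = 1.5 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.5, size=100)
print(forward_stepwise(X, y, max_terms=2))  # expected: columns 0 and 5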

3.4.1 Subset Selection (cont'd)
Backward stepwise selection
Similar to the previous procedure, but starts with the full model and sequentially deletes predictors.
Drop the predictor giving the smallest F value.
Stop when every predictor remaining in the model produces a significantly large F value.

3.4.1 Subset Selection (cont'd)
Hybrid stepwise selection
Consider both forward and backward moves at each iteration and make the "best" move.
Needs a parameter that sets the threshold between choosing an "add" or a "drop" operation.
The stopping rules (for all stepwise methods) find only a local optimum; they are not guaranteed to find the best model.

3.4.3 Shrinkage Methods
Subset selection is a discrete process (a variable is either retained or dropped), so it can give high variance.
Shrinkage methods are similar but use a continuous process, avoiding the high variance.
We examine two methods:
Ridge regression
Lasso

3.4.3 Shrinkage Methods (cont'd)
Ridge regression
Shrinks the regression coefficients by imposing a penalty on their size.
The ridge coefficients minimize a penalized residual sum of squares: B^ridge = argmin_B { sum_i (yi - B0 - sum_j xij Bj)^2 + lambda * sum_j Bj^2 },
where lambda >= 0 is a complexity parameter controlling the amount of shrinkage.
Mitigates the high variance produced when many input variables are correlated.

3.4.3 Shrinkage Methods (cont'd)
Ridge regression
Can be reparameterized using centered inputs, replacing each xij with xij - xbar_j.
Estimate the intercept B0 by ybar = (1/N) sum_i yi, and the remaining coefficients by a ridge regression without intercept.
The ridge regression solution is then given by B^ridge = (X^T X + lambda*I)^{-1} X^T y, where I is the p x p identity matrix.
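
A minimal NumPy sketch of this closed-form ridge solution, using centered inputs as described above (the penalty value and data are arbitrary):

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression on centered inputs.

    Returns (intercept on the original input scale, coefficient vector).
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)
print(ridge(X, y, lam=10.0))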

3.4.3 Shrinkage Methods (cont'd)
The singular value decomposition (SVD) of the centered input matrix X has the form X = U D V^T.
U and V are N x p and p x p orthogonal matrices:
Columns of U span the column space of X
Columns of V span the row space of X
D is a p x p diagonal matrix with entries d1 >= ... >= dp >= 0 (the singular values).

3.4.3 Shrinkage Methods (cont'd)
Using the SVD, rewrite the least squares fitted vector: X B^ls = U U^T y.
Now the ridge fitted vector is: X B^ridge = sum_j u_j * [dj^2 / (dj^2 + lambda)] * <u_j, y>.
A greater amount of shrinkage is applied to the directions u_j with smaller dj^2.
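
A short NumPy check of this identity: the direct ridge fit and the SVD form with shrinkage factors dj^2 / (dj^2 + lambda) coincide (simulated, centered data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                       # centered inputs
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)
y = y - y.mean()
lam = 10.0

# Direct closed-form ridge fit.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
fit_direct = X @ beta

# Same fit through the SVD: shrink each component <u_j, y> by d_j^2 / (d_j^2 + lambda).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)
fit_svd = U @ (shrinkage * (U.T @ y))
print(np.allclose(fit_direct, fit_svd))      # True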

3.4.3 Shrinkage Methods (cont'd)
The SVD of the centered matrix X expresses the principal components of the variables in X.
The eigendecomposition of X^T X is X^T X = V D^2 V^T.
The eigenvectors vj are the principal component directions of X.
The first principal component direction has the largest sample variance; the last one has the minimum variance.

3.4.3 Shrinkage Methods (cont'd)
Ridge regression implicitly assumes that the response varies most in the directions of high variance of the inputs.
It therefore shrinks the coefficients of the low-variance components more than those of the high-variance ones.

3.4.3 Shrinkage Methods (cont'd)
Lasso
Shrinks coefficients, like ridge regression, but with some important differences.
The estimate is defined by B^lasso = argmin_B sum_i (yi - B0 - sum_j xij Bj)^2 subject to sum_j |Bj| <= t.
Again, we reparameterize with centered inputs and fit the model without an intercept.
The ridge penalty sum_j Bj^2 is replaced by the lasso penalty sum_j |Bj|.

3.4.3 Shrinkage Methods (cont'd)
The lasso solution is computed using quadratic programming.
The lasso is a kind of continuous subset selection: making t very small forces some coefficients to be exactly zero.
If t is larger than t0 = sum_j |Bj^ls| (the least squares value), the lasso estimates are the same as the least squares estimates.
If t is chosen as half of this threshold, the least squares coefficients are shrunk by about 50% on average.
Choose t to minimize an estimate of prediction error.
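
For illustration, scikit-learn's Lasso solves the equivalent penalized (lambda) form by coordinate descent rather than the constrained (t) form via quadratic programming; the sketch below simply contrasts its sparse solution with ordinary least squares on simulated data:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)   # alpha plays the role of the penalty parameter

print(np.round(ols.coef_, 2))        # all coefficients nonzero
print(np.round(lasso.coef_, 2))      # several coefficients exactly zero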

3.4.4 Methods Using Derived Input Directions
Sometimes we have a large number of inputs and want instead to work with a smaller number of derived inputs that are linear combinations of the originals.
Different methods for constructing the linear combinations:
Principal components regression
Partial least squares

3.4.4 Methods Using Derived Input Directions (cont'd)
Principal Components Regression (PCR)
Constructs the linear combinations using the principal components z_m = X v_m.
The regression is a sum of univariate regressions: y^_(M) = ybar*1 + sum_{m=1..M} theta^_m z_m, with theta^_m = <z_m, y> / <z_m, z_m>.
The z_m are each linear combinations of the original xj.
Similar to ridge regression, but instead of shrinking the principal components we discard the p - M components with the smallest eigenvalues.
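
A brief NumPy sketch of principal components regression as just described (the choice M = 3 and the data are arbitrary):

import numpy as np

def pcr(X, y, M):
    """Principal components regression keeping the first M components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                                     # first M principal components z_m = X v_m
    theta = (Z.T @ (y - y_mean)) / (Z**2).sum(axis=0)     # univariate coefficients <z_m, y> / <z_m, z_m>
    beta = Vt[:M].T @ theta                               # coefficients on the original (centered) inputs
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)
print(pcr(X, y, M=3))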

3.4.4 Methods Using Derived Input Directions (cont'd)
Partial least squares (PLS)
The linear combinations use y in addition to X.
Begin by computing the univariate regression coefficient of y on each xj (with standardized inputs): phi_1j = <xj, y>.
Construct the derived input z1 = sum_j phi_1j xj.
In the construction of each zm, the inputs are weighted by the strength of their univariate effect on y.
The outcome y is regressed on z1, then each xj is orthogonalized with respect to z1.
Iterate until M <= p directions have been obtained.
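
A compact sketch of the PLS iteration just described, returning fitted values only (inputs are standardized inside the function; data are simulated):

import numpy as np

def pls_fit(X, y, M):
    """Partial least squares fitted values after M directions."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    y_hat = np.zeros_like(yc)
    for _ in range(M):
        phi = Xm.T @ yc                       # univariate strengths <x_j, y>
        z = Xm @ phi                          # derived input z_m
        theta = (z @ yc) / (z @ z)            # regress y on z_m
        y_hat = y_hat + theta * z
        # Orthogonalize each input with respect to z_m before the next step.
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))
    return y_hat + y.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)
fitted = pls_fit(X, y, M=2)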

3.4.6 Multiple Outcome Shrinkage and Selection
The least squares estimates in a multi-output linear model are simply the collection of individual estimates for each output.
To apply selection and shrinkage, we could apply a univariate technique individually to each outcome, or simultaneously to all outcomes.

3.4.6 Multiple Outcome Shrinkage and Selection (cont'd)
Canonical correlation analysis (CCA)
A data reduction technique that combines the responses; similar in spirit to PCA.
Finds a sequence of uncorrelated linear combinations of the inputs, X v_m, and a corresponding sequence of uncorrelated linear combinations of the responses, Y u_m,
that maximize the correlations Corr^2(Y u_m, X v_m).
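
For illustration, scikit-learn's CCA can extract such paired directions; the component count and data below are arbitrary:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Responses share structure with the inputs, plus noise.
Y = X[:, :3] @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(200, 4))

cca = CCA(n_components=2)
X_scores, Y_scores = cca.fit_transform(X, Y)

# Correlation between the first pair of canonical variates.
print(np.corrcoef(X_scores[:, 0], Y_scores[:, 0])[0, 1])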

3.4.6 Multiple Outcome Shrinkage and Selection (cont'd)
Reduced-rank regression
Formalizes the CCA approach using a regression model that explicitly pools information across responses.
Solve the multivariate regression problem subject to a rank constraint: B^rr(m) = argmin_{rank(B)=m} sum_i (yi - B^T xi)^T Cov(E)^{-1} (yi - B^T xi).

Questions