Linear Methods for Regression Dept. Computer Science & Engineering, Shanghai Jiao Tong University

Outline: the simple linear regression model; multiple linear regression; model selection and shrinkage (the state of the art).

Preliminaries. Data: (x_1, y_1), ..., (x_N, y_N), where x_i is the predictor (regressor, covariate, feature, independent variable) and y_i is the response (dependent variable, outcome). We denote the regression function by f(x) = E(Y | X = x); this is the conditional expectation of Y given x. The linear regression model assumes a specific linear form, f(x) = β_0 + Σ_j x_j β_j, which is usually thought of as an approximation to the truth.

Fitting by least squares. Minimize the residual sum of squares RSS(β_0, β_1) = Σ_{i=1}^N (y_i − β_0 − β_1 x_i)^2. The solutions are β̂_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)^2 and β̂_0 = ȳ − β̂_1 x̄. The quantities ŷ_i = β̂_0 + β̂_1 x_i are called the fitted or predicted values, and r_i = y_i − ŷ_i are called the residuals.
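To make the closed-form solution concrete, here is a minimal numpy sketch; the function name, the synthetic data, and the seed are illustrative choices, not part of the original slides.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0_hat, b1_hat)."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

b0, b1 = fit_simple_ols(x, y)
fitted = b0 + b1 * x          # the fitted (predicted) values
residuals = y - fitted        # the residuals
print(b0, b1)
```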

Gaussian Distribution. The normal distribution with arbitrary center μ and variance σ^2 has density N(x | μ, σ^2) = (2πσ^2)^{−1/2} exp(−(x − μ)^2 / (2σ^2)).

Standard errors and confidence intervals. We often assume further that y_i = β_0 + β_1 x_i + ε_i, where the errors ε_i have mean zero and variance σ^2. The standard error of the slope estimate is se(β̂_1) = σ̂ / sqrt(Σ_i (x_i − x̄)^2), with σ̂^2 = Σ_i (y_i − ŷ_i)^2 / (N − 2). Under the additional assumption of normality for the ε_i, a 95% confidence interval for β_1 is β̂_1 ± 1.96 · se(β̂_1).
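A small sketch of how the interval above could be computed, assuming the same kind of synthetic data as before; it uses the 1.96 normal quantile rather than the exact t quantile, which is a simplification.

```python
import numpy as np

def slope_confidence_interval(x, y, z=1.96):
    """Approximate 95% confidence interval for the slope of a simple linear regression."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    resid = y - (b0 + b1 * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)                 # noise variance estimate
    se_b1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))    # standard error of the slope
    return b1 - z * se_b1, b1 + z * se_b1

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
print(slope_confidence_interval(x, y))
```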

Fitted Line and Standard Errors (figure).

Fitted regression line with pointwise standard error bands (figure).

Multiple linear regression. The model is y_i = β_0 + Σ_j x_ij β_j + ε_i. Equivalently, in matrix notation, f = Xβ, where f is the N-vector of predicted values, X is the N × p matrix of regressors (with ones in the first column), and β is a p-vector of parameters.

Estimation by least squares. Choose β to minimize RSS(β) = (y − Xβ)^T (y − Xβ). Setting the derivative to zero gives the normal equations X^T (y − Xβ) = 0, with solution β̂ = (X^T X)^{−1} X^T y and fitted values ŷ = Xβ̂ = X(X^T X)^{−1} X^T y.
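A hedged numpy sketch of the estimator above on synthetic data; np.linalg.lstsq is shown alongside the explicit normal equations because it is numerically safer when X^T X is nearly singular.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])   # ones in the first column
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # (X^T X)^{-1} X^T y via the normal equations
beta_hat2, *_ = np.linalg.lstsq(X, y, rcond=None)     # equivalent, numerically more stable
y_hat = X @ beta_hat                                  # fitted values X beta_hat
print(beta_hat)
```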

The bias-variance tradeoff. A good measure of the quality of an estimator f̂(x) is the mean squared error. Let f_0(x) be the true value of f(x) at the point x. Then MSE(f̂(x)) = E[(f̂(x) − f_0(x))^2]. This can be written as variance + bias^2: MSE(f̂(x)) = Var(f̂(x)) + [E f̂(x) − f_0(x)]^2. Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a tradeoff between bias and variance.

The bias-variance tradeoff. If the linear model is correct for a given problem, then the least squares prediction f̂ is unbiased, and has the lowest variance among all unbiased estimators that are linear functions of y. But there can be (and often exist) biased estimators with smaller MSE. Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile.

The bias-variance tradeoff. Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, and the lasso. In reality models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth.

Model Selection. Often we prefer a restricted estimate because of its reduced estimation variance.

Analysis of time series data. Two approaches: the frequency domain (Fourier); see the discussion of wavelet smoothing. And the time domain, whose main tool is the auto-regressive (AR) model of order k: y_t = β_0 + β_1 y_{t−1} + ... + β_k y_{t−k} + ε_t, fit by linear least squares regression on lagged data.
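A minimal sketch of fitting AR(k) by least squares on lagged data, with a simulated AR(2) series as input; all names and the simulation are illustrative.

```python
import numpy as np

def fit_ar(y, k):
    """Least squares fit of y_t = b0 + b1*y_{t-1} + ... + bk*y_{t-k}."""
    N = len(y)
    target = y[k:]                                               # y_k, ..., y_{N-1}
    lags = np.column_stack([y[k - j:N - j] for j in range(1, k + 1)])
    X = np.column_stack([np.ones(N - k), lags])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef                                                  # [b0, b1, ..., bk]

rng = np.random.default_rng(3)
y = np.zeros(500)
for t in range(2, 500):                                          # simulate an AR(2) series
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
print(fit_ar(y, 2))                                              # roughly [0, 0.6, -0.3]
```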

Variable subset selection. We retain only a subset of the coefficients and set the rest to zero. There are different strategies: all subsets regression finds, for each s ∈ {0, 1, 2, ..., p}, the subset of size s that gives the smallest residual sum of squares. The question of how to choose s involves the tradeoff between bias and variance: we can use cross-validation (see below).

Variable subset selection. Rather than search through all possible subsets, we can seek a good path through them. Forward stepwise selection starts with the intercept and then sequentially adds into the model the variable that most improves the fit. The improvement in fit is usually based on the F ratio: F = (RSS(smaller model) − RSS(larger model)) / (RSS(larger model) / (N − k − 2)), where the larger model contains k + 1 predictors plus the intercept.
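A rough sketch of forward stepwise selection driven by the F ratio above; the stopping threshold f_enter and all function names are illustrative choices, and real implementations differ in many details.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on the intercept (column 0) plus the selected columns."""
    Xs = X[:, [0] + cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return r @ r

def forward_stepwise(X, y, f_enter=4.0):
    """Greedily add the variable with the largest F ratio until no F exceeds f_enter."""
    N, p = X.shape
    selected, remaining = [], list(range(1, p))
    while remaining:
        rss_old = rss(X, y, selected)
        scores = []
        for j in remaining:
            rss_new = rss(X, y, selected + [j])
            dof = N - (len(selected) + 2)            # residual dof of the larger model
            scores.append(((rss_old - rss_new) / (rss_new / dof), j))
        best_f, best_j = max(scores)
        if best_f < f_enter:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(4)
N, p = 200, 8
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = 2.0 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(size=N)
print(forward_stepwise(X, y))                        # typically recovers columns 1 and 3
```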

Variable subset selection. Backward stepwise selection starts with the full OLS model and sequentially deletes variables. There are also hybrid stepwise selection strategies which add the best variable and delete the least important variable, in a sequential manner. Each procedure has one or more tuning parameters: the subset size, and the P-values for adding or dropping terms.

Model Assessment. Objectives: 1. choose a value of a tuning parameter for a technique; 2. estimate the prediction performance of a given model. For both of these purposes, the best approach is to run the procedure on an independent test set, if one is available. If possible one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2). Often there is insufficient data to create a separate validation or test set; in this instance cross-validation is useful.

K-Fold Cross-Validation. The primary method for estimating a tuning parameter (such as subset size): divide the data into K roughly equal parts (typically K = 5 or 10).

K-Fold Cross-Validation. For each k = 1, 2, ..., K, fit the model with parameter λ to the other K − 1 parts, giving β̂^{−k}(λ), and compute its error in predicting the kth part: E_k(λ) = Σ_{i in kth part} (y_i − x_i β̂^{−k}(λ))^2. This gives the cross-validation error. Do this for many values of λ and choose the value of λ that makes the cross-validation error smallest.

K-Fold Cross-Validation. In our variable-subsets example, λ is the subset size, β̂^{−k}(λ) are the coefficients for the best subset of size λ found from the training set that leaves out the kth part of the data, and E_k(λ) is the estimated test error for this best subset.

K-Fold Cross-Validation. From the K cross-validation training sets, the K test error estimates are averaged to give CV(λ) = (1/K) Σ_{k=1}^K E_k(λ). Note that a different subset of size λ will (probably) be found from each of the K cross-validation training sets. That doesn't matter: the focus is on the subset size, not the actual subset.
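A generic K-fold cross-validation sketch in the spirit of the last three slides. The fit_first_lam stand-in simply keeps the first λ predictors instead of running a real best-subset search, so it only illustrates the CV mechanics; all names are hypothetical, and the per-fold error here is a mean rather than a sum, which does not change the chosen λ for equal fold sizes.

```python
import numpy as np

def kfold_cv(X, y, fit_predict, params, K=5, seed=0):
    """Return the CV error curve: CV(lam) averaged over K folds, for each lam in params."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    cv = []
    for lam in params:
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            pred = fit_predict(X[train], y[train], X[test], lam)
            errs.append(np.mean((y[test] - pred) ** 2))          # E_k(lam), per observation
        cv.append(np.mean(errs))                                  # average over the K folds
    return np.array(cv)

def fit_first_lam(X_tr, y_tr, X_te, lam):
    """Crude stand-in for 'best subset of size lam': intercept plus the first lam columns."""
    cols = list(range(lam + 1))                                   # column 0 is the intercept
    beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return X_te[:, cols] @ beta

rng = np.random.default_rng(5)
N, p = 200, 8
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=N)
sizes = list(range(0, p))
cv_curve = kfold_cv(X, y, fit_first_lam, sizes)
print(sizes[int(np.argmin(cv_curve))])                            # chosen subset size
```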

The Bootstrap approach. The bootstrap works by sampling N times with replacement from the training set to form a "bootstrap" data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set. This process is repeated many times and the results are averaged. The bootstrap is most useful for estimating standard errors of predictions. It sometimes produces better estimates than cross-validation (a topic of current research).
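A minimal sketch of using the bootstrap to estimate the standard error of an OLS prediction at a single point x0; the function name, the number of resamples B, and the data are illustrative.

```python
import numpy as np

def bootstrap_prediction_se(X, y, x0, B=500, seed=0):
    """Bootstrap standard error of the least squares prediction x0^T beta_hat."""
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)                  # sample N rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        preds[b] = x0 @ beta
    return preds.std(ddof=1)

rng = np.random.default_rng(6)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=N)
print(bootstrap_prediction_se(X, y, x0=np.array([1.0, 0.5])))
```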

Shrinkage methods: ridge regression. The ridge estimator is defined by β̂^ridge = argmin_β { Σ_i (y_i − β_0 − Σ_j x_ij β_j)^2 + λ Σ_j β_j^2 }. Equivalently, minimize Σ_i (y_i − β_0 − Σ_j x_ij β_j)^2 subject to Σ_j β_j^2 ≤ t.

Shrinkage methods. The parameter λ > 0 penalizes each β_j in proportion to its size. The solution is β̂^ridge = (X^T X + λI)^{−1} X^T y, where I is the identity matrix. This is a biased estimator that, for some value of λ > 0, may have smaller mean squared error than the least squares estimator. Note that λ = 0 gives the least squares estimator; if λ → ∞, then β̂^ridge → 0.
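A sketch of the closed-form ridge solution, assuming the inputs are standardized and the response is centered so that the intercept can be handled separately; that convention is an assumption on my part, not stated on the slide.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y for centered y and standardized X."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
N, p = 100, 10
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)              # standardize the inputs
y = 2.0 * X[:, 0] + rng.normal(size=N)
y = y - y.mean()                                      # center the response

print(ridge(X, y, lam=0.0))                           # lam = 0 recovers least squares
print(ridge(X, y, lam=50.0))                          # coefficients shrink toward zero
```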

The Lasso. The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y. The lasso is defined by β̂^lasso = argmin_β Σ_i (y_i − β_0 − Σ_j x_ij β_j)^2 subject to Σ_j |β_j| ≤ t.

The Lasso. Notice that the ridge penalty Σ_j β_j^2 is replaced by Σ_j |β_j|. This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them. Because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection.
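In practice the lasso is often solved by coordinate descent (soft-thresholding) rather than generic quadratic programming. The sketch below works on the equivalent Lagrangian form, (1/2) Σ_i (y_i − Σ_j x_ij β_j)^2 + λ Σ_j |β_j|, with centered X and y; it is my own illustrative code, not the algorithm referenced on the slide.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for centered X and y, penalty lam * sum_j |beta_j|."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]            # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(8)
N, p = 200, 10
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                               # sparse truth
y = X @ beta_true + rng.normal(size=N)
y = y - y.mean()
print(lasso_cd(X, y, lam=20.0))                                # most null coefficients become exactly zero
```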

The Lasso. The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using, say, cross-validation. Ridge vs lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero.
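A tiny numerical check of the orthogonal-input claim above. With orthonormal columns (Q^T Q = I) and the Lagrangian forms of the two penalties, ridge scales every OLS coefficient by 1/(1 + λ) while the lasso soft-thresholds them at λ; the exact constants depend on how the penalty is parameterized, so treat this as illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
N, p = 200, 3
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))           # orthonormal columns: Q^T Q = I
beta_true = np.array([3.0, 0.5, -1.0])
y = Q @ beta_true + rng.normal(scale=0.1, size=N)

beta_ols = Q.T @ y                                      # OLS, since Q^T Q = I
lam = 1.0
beta_ridge = beta_ols / (1.0 + lam)                     # proportional shrinkage
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)   # translate and truncate
print(beta_ols, beta_ridge, beta_lasso)
```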

A family of shrinkage estimators. Consider the criterion β̃ = argmin_β Σ_i (y_i − x_i^T β)^2 subject to Σ_j |β_j|^q ≤ t, for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown for the case of two inputs (figure).

Use of derived input directions: principal components regression. We choose a set of linear combinations of the x_j's, and then regress the outcome on these linear combinations. The particular combinations used are the sequence of principal components of the inputs. These are uncorrelated and ordered by decreasing variance. If S is the sample covariance matrix of the x_j's, then the eigenvector equations S q_j = d_j^2 q_j define the principal components of S.

Principal components of some input data points (figure). The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than those of the high-variance components.

PCA regression. Write q_1, ..., q_p for the ordered principal components, ordered from largest to smallest value of d_j^2. Then principal components regression computes the derived input columns z_j = X q_j and then regresses y on z_1, ..., z_J for some J ≤ p.

PCA regression. Since the z_j's are orthogonal, this regression is just a sum of univariate regressions: ŷ^pcr = ȳ · 1 + Σ_{j=1}^J θ̂_j z_j, where θ̂_j = ⟨z_j, y⟩ / ⟨z_j, z_j⟩ is the univariate regression coefficient of y on z_j. Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix.
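A compact sketch of principal components regression via the SVD, assuming centered X and y; the function name and data are my own illustrative choices.

```python
import numpy as np

def pcr(X, y, J):
    """Principal components regression keeping the first J components of centered X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)    # principal directions are the rows of Vt
    Z = X @ Vt.T[:, :J]                                  # derived inputs z_1, ..., z_J
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)           # <z_j, y> / <z_j, z_j>
    return Vt.T[:, :J] @ theta                           # coefficients back in the original coordinates

rng = np.random.default_rng(10)
N, p = 150, 8
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()
print(pcr(X, y, J=3))
```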

PCA regression. Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than to the larger ones; principal components regression discards the p − J smallest-eigenvalue components.

Partial least squares. This technique also constructs a set of linear combinations of the x_j's for regression, but unlike principal components regression it uses y (in addition to X) for this construction. We assume that each x_j is centered and begin by computing the univariate regression coefficient of y on each x_j: φ̂_j = ⟨x_j, y⟩.

Partial least squares. From this we construct the derived input z_1 = Σ_j φ̂_j x_j, which is the first partial least squares direction. The outcome y is regressed on z_1, giving coefficient θ̂_1 = ⟨z_1, y⟩ / ⟨z_1, z_1⟩; then we orthogonalize y, x_1, ..., x_p with respect to z_1. We continue this process until J directions have been obtained.
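A bare-bones sketch of the partial least squares iteration described on the last two slides, assuming centered X and y; a production implementation (for example in scikit-learn) handles scaling and deflation more carefully, so this is only illustrative.

```python
import numpy as np

def pls_fit_predict(X, y, J):
    """Return in-sample PLS predictions using J directions, for centered X and y."""
    Xk = X.copy()
    y_hat = np.zeros(len(y))                             # y is centered, so start at zero
    for _ in range(J):
        phi = Xk.T @ y                                   # univariate coefficients <x_j, y>
        z = Xk @ phi                                     # derived direction z_m
        theta = (z @ y) / (z @ z)                        # regress y on z_m
        y_hat += theta * z
        Xk = Xk - np.outer(z, (z @ Xk) / (z @ z))        # orthogonalize each x_j w.r.t. z_m
    return y_hat

rng = np.random.default_rng(11)
N, p = 150, 6
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()
print(np.mean((y - pls_fit_predict(X, y, J=2)) ** 2))    # in-sample mean squared error
```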

Ridge vs PCR vs PLS vs Lasso. Recent studies have shown that ridge and PCR outperform PLS in prediction, and they are simpler to understand. The lasso outperforms ridge when there are a moderate number of sizable effects rather than many small effects; it also produces more interpretable models. These are still topics of ongoing research.