1
CpSc 881: Machine Learning
Regression
2
Copyright Notice: Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!
3
Regression problems The goal is to make quantitative (real-valued) predictions on the basis of a vector of features or attributes. Examples: house prices, stock values, survival time, fuel efficiency of cars, etc. Questions: What can we assume about the problem? How do we formalize the regression problem? How do we evaluate predictions?
4
A generic regression problem
The input attributes are given as fixed-length vectors x = [x1,...,xd]T, where each component xi may be discrete or real valued. The outputs are assumed to be real valued, y ∈ R (or a restricted subset of the real values). We have access to a set of n training examples, Dn = {(x1,y1),...,(xn,yn)}, sampled independently at random from some fixed but unknown distribution P(x,y). The goal is to minimize the prediction error/loss on new examples (x,y) drawn at random from the same distribution P(x,y). The loss may be, for example, the squared loss Loss(y, ŷ) = (y − ŷ)², where ŷ denotes our prediction in response to x.
5
Linear regression We need to define a class of functions (types of predictions we will try to make), such as linear predictions f(x; w) = w1 x + w0, where w1, w0 are the parameters we need to set.
6
Estimation criterion We need an estimation criterion so as to be able to select appropriate values for our parameters (w1 and w0) based on the training set Dn = {(x1,y1),...,(xn,yn)}. For example, we can use the empirical loss Jn(w0, w1) = (1/n) Σi (yi − w1 xi − w0)².
7
Empirical loss Ideally, we would like to find the parameters w1, w0 that minimize the expected loss (assuming unlimited training data) E[(y − w1 x − w0)²], where the expectation is over samples from P(x,y). When the number of training examples n is large, however, the empirical error is approximately what we want.
8
Estimating the parameters
We minimize the empirical squared loss Jn(w0, w1) = (1/n) Σi (yi − w1 xi − w0)². By setting the derivatives with respect to w1 and w0 to zero, ∂Jn/∂w0 = −(2/n) Σi (yi − w1 xi − w0) = 0 and ∂Jn/∂w1 = −(2/n) Σi (yi − w1 xi − w0) xi = 0, we get necessary conditions for the "optimal" parameter values.
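As a minimal sketch (not from the slides; the toy data below are invented for illustration), solving these two conditions gives the familiar closed-form slope and intercept:

```python
import numpy as np

# Toy 1-D training data (invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Setting the derivatives of the empirical squared loss to zero gives
# w1 = Cov(x, y) / Var(x) and w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w0, w1)  # estimated intercept and slope
```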
9
Types of error Structural error measures the error introduced by the limited function class (infinite training data): E[(y − w0* − w1* x)²], where (w0*, w1*) are the optimal linear regression parameters. Approximation error measures how close we can get to the optimal linear predictions with limited training data: E[((w0* + w1* x) − (ŵ0 + ŵ1 x))²], where (ŵ0, ŵ1) are the parameter estimates based on a small training set (and therefore themselves random variables).
10
Multivariate Regression
Write matrix X and Y thus: X = [x1ᵀ; ... ; xRᵀ] (an R × m matrix whose rows are the inputs) and Y = [y1; ... ; yR] (an R × 1 vector of outputs); there are R datapoints and each input has m components. The linear regression model assumes a vector w such that Out(x) = wᵀx = w1 x[1] + w2 x[2] + ... + wm x[m]. The result is the same as in the one-dimensional case: w = (XᵀX)⁻¹XᵀY.
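A quick NumPy sketch of this multivariate solution w = (XᵀX)⁻¹XᵀY (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R, m = 100, 3                       # R datapoints, each input with m components
X = rng.normal(size=(R, m))         # design matrix
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=R)

# Normal equations: w = (X^T X)^{-1} X^T Y
w = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent but more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```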
11
Beyond linear regression
The linear regression functions are convenient because they are linear in the parameters, not necessarily in the input x. We can easily generalize these classes of functions to be non-linear functions of the inputs x but still linear in the parameters w. For example, the mth-order polynomial prediction f(x; w) = w0 + w1 x + w2 x² + ... + wm x^m.
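To make the point concrete, here is a small sketch (illustrative data and helper function, not from the slides): the mth-order polynomial fit is still ordinary linear least squares once x is expanded into the basis [1, x, x², ..., x^m].

```python
import numpy as np

def poly_features(x, m):
    """Expand a 1-D input into the polynomial basis [1, x, x^2, ..., x^m]."""
    return np.vstack([x ** k for k in range(m + 1)]).T

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

Phi = poly_features(x, m=3)                    # non-linear in x, linear in w
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # same least-squares machinery as before
y_hat = Phi @ w
```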
12
Subset Selection and Shrinkage: Motivation
Bias-Variance Trade-off. Goal: choose a model to minimize error. Method: sacrifice a little bit of bias to reduce the variance. Better interpretation: find the strongest factors in the input space.
13
Shrinkage Intuition: continuous version of subset selection
Goal: impose a penalty on model complexity to get lower variance. Two examples: Ridge regression and the Lasso.
14
Ridge Regression Penalize by the sum of squares of the parameters: ŵridge = argminw Σi (yi − w0 − Σj xij wj)² + λ Σj wj². Or, equivalently, minimize Σi (yi − w0 − Σj xij wj)² subject to Σj wj² ≤ t.
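A minimal sketch of the penalized form, using the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy (the intercept is left unpenalized; here the data are centered so it can be dropped; the data are invented):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y.
    Assumes the columns of X and the response y have been centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=50)
y -= y.mean()

print(ridge(X, y, lam=0.0))    # lambda = 0 recovers ordinary least squares
print(ridge(X, y, lam=10.0))   # larger lambda shrinks the coefficients toward zero
```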
15
Understanding of Ridge Regression
Find the orthogonal principal components (basis vectors), and then apply a greater amount of shrinkage to the basis vectors with small variance. Assumption: y varies most in the directions of high variance. Intuitive example: stop words in text classification, if we assume no covariance between words. Relates to MAP estimation: if β ~ N(0, τ²I) and y ~ N(Xβ, σ²I), then the ridge estimate is the mean (and mode) of the posterior of β, with λ = σ²/τ². Note also that the ridge solution is not invariant to the scale of the inputs (e.g., measuring a time variable in seconds vs. minutes changes the result), so the inputs are normally standardized first.
Q: How does ridge regression shrinkage achieve variance reduction to account for the correlation present in the input matrix X? This didn't quite make sense in paragraph 2 on page 70. (Weng-Keen Wong)
A: This is done using the SVD. V holds the eigenvectors of the covariance matrix and D the eigenvalues. If xi and xj are highly correlated, we get a large eigenvalue in one direction and a small one in the other. When doing feature selection the larger-variance direction gets selected, and the smaller-variance direction is less likely to be selected because it is shrunk more.
Q: The book claims that when there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. What does that mean? According to my understanding, there seems to be a contradiction with the text classification problem: what result should we expect if two features are both relevant and also highly correlated (for example, the class is "politics" and there are two words, say "Bush" and "Clinton")? Should we filter out these two words? (Liu Yan)
Q: p. 60: How does ridge regression correspond to being the posterior mean of the β distribution? (Ashish R Venugopal)
Q: About ridge regression: could you explain "the mean or mode of a posterior distribution, with a suitably chosen prior distribution..." on p. 60? (Yiming)
A: The mode (maximum) is the mean because the posterior is Gaussian. See my notes on page 71.
Q: Ridge regression is designed to solve ill-posed problems. However, someone mentioned that ridge regression is equivalent to the "weight decay" used in neural networks. I have no idea about it; could you explain a little? (Jian)
A: See the neural networks chapter (page 356). We can discuss it later after we have covered that.
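The SVD view above can be illustrated with a short sketch (random data; the near-collinear pair is constructed on purpose): writing the centered X = U D Vᵀ, ridge scales the contribution of each principal direction by dj²/(dj² + λ), so low-variance directions are shrunk the most.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 2] + 0.01 * rng.normal(size=100)   # two nearly collinear columns
X -= X.mean(axis=0)

# X = U D V^T: columns of V are the principal directions, d_j their singular values
U, d, Vt = np.linalg.svd(X, full_matrices=False)

lam = 1.0
shrinkage = d**2 / (d**2 + lam)   # ridge shrinkage factor per principal direction

print(d)           # the collinear pair produces one very small singular value
print(shrinkage)   # that low-variance direction is shrunk far more than the others
```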
16
Lasso (Least Absolute Shrinkage and Selection Operator)
A popular model selection and shrinkage estimation method. In a linear regression set-up: y ∈ R^n: continuous response; X ∈ R^{n×p}: design matrix; β ∈ R^p: parameter vector. The lasso estimator is then defined as β̂lasso = argminβ ||y − Xβ||² + λ Σj |βj|, where λ ≥ 0, and a larger λ sets some β̂j exactly to 0.
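A short sketch of the sparsity effect using scikit-learn's Lasso (its alpha parameter plays the role of λ up to a scaling convention; the data are invented with a sparse true coefficient vector):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]                 # sparse ground truth
y = X @ beta_true + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)          # larger alpha => more coefficients exactly 0

print(ols.coef_.round(2))                   # every coefficient is nonzero
print(lasso.coef_.round(2))                 # many coefficients are set exactly to zero
```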
17
Lasso (Least Absolute Shrinkage and Selection Operator)
Features: Sparse solutions. Write the lasso in its constrained form, with Σj |βj| ≤ t. Let β̂0 be the full least-squares estimate and t0 = Σj |β̂j0|. Values t < t0 will cause shrinkage. Let s = t / t0 be the scaled Lasso parameter.
18
Why Lasso? LASSO is proposed because: Ridge regression is not parsimonious. Ridge regression may generate huge prediction errors when the matrix of true (unknown) coefficients is sparse. LASSO can outperform RR if: the true (unknown) coefficients are composed of a lot of zeros.
19
Why Lasso? Prediction Accuracy Assume y = f(x) + ε with E[ε] = 0 and Var(ε) = σ²; then the prediction error of an estimate f̂(x) is E[(y − f̂(x))²] = σ² + Bias²(f̂(x)) + Var(f̂(x)). OLS estimates often have low bias but large variance; the Lasso can improve the overall prediction accuracy by sacrificing a little bias to reduce the variance of the predicted value.
20
Why Lasso? Interpretation
In many cases, the response is determined by just a small subset of the predictor variables.
21
How to solve the problem?
The absolute-value constraint Σj |βj| ≤ t can be translated into 2^p linear inequality constraints (p stands for the number of predictor variables): Gβ ≤ t·1, where G is a 2^p × p matrix whose rows are all the sign vectors (±1, ..., ±1), corresponding to the 2^p linear inequality constraints. But direct application of this procedure is not practical due to the fact that 2^p may be very large. Lawson, C. and Hanson, R. (1974) Solving Least Squares Problems. Prentice Hall.
22
How to solve the problem?
Outline of the Algorithm: sequentially introduce the inequality constraints. In practice, the average number of iteration steps required is in the range (0.5p, 0.75p), so the algorithm is acceptable. Lawson, C. and Hanson, R. (1974) Solving Least Squares Problems. Prentice Hall.
23
Group Lasso In some cases not only continuous but also categorical predictors (factors) are present, and the lasso solution is not satisfactory when it selects only individual dummy variables rather than whole factors. Extending the lasso penalty, the group lasso estimator is β̂ = argminβ ||y − Σg Xg βg||² + λ Σg √|Ig| ||βg||2, where Ig is the index set belonging to the g-th group of variables. The penalty does the variable selection at the group level and is intermediate between the ℓ1- and ℓ2-type penalties. It encourages that either βj = 0 for all j ∈ Ig or βj ≠ 0 for all j ∈ Ig.
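As a sketch of the penalty term only (not a full group-lasso solver; the grouping and values below are invented), the quantity added to the squared error is λ Σg √|Ig| ||βg||2:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group-lasso penalty: lam * sum_g sqrt(|I_g|) * ||beta[I_g]||_2.
    `groups` is a list of index arrays I_g, e.g. the dummy variables of one factor."""
    return lam * sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)

beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]  # three factors
print(group_lasso_penalty(beta, groups, lam=0.1))
# The norm of a whole group hits zero together, which is what drives
# selection at the factor level rather than per dummy variable.
```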
24
Elastic Net Compromise Between ℓ1 and ℓ2 to Improve Reliability
25
Elastic Net [Figure: penalty contours comparing the ridge penalty (λ2), the lasso penalty (λ1), and the elastic net penalty, which combines both.]
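A brief sketch using scikit-learn's ElasticNet, whose l1_ratio parameter interpolates between the ridge and lasso penalties (invented data with one correlated pair of predictors):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)   # a highly correlated pair
y = X[:, 0] + X[:, 1] + rng.normal(size=100)

# l1_ratio near 0 behaves like ridge, l1_ratio = 1 is pure lasso
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(enet.coef_.round(2))   # correlated predictors tend to be kept or dropped together
```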
26
Principal component regression
Goal: use linear combinations of the inputs as the inputs in the regression. Usually the derived input directions are orthogonal to each other. Principal component regression: obtain the directions vm using the SVD of X, then use zm = X vm as the inputs in the regression.
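A compact sketch of principal component regression (random data; inputs assumed centered): the directions vm come from the SVD of X and the regression is run on the derived inputs zm = X vm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)                      # PCR assumes centered inputs
y = X @ rng.normal(size=6) + rng.normal(size=100)
y_c = y - y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are the directions v_m
M = 3                                    # keep the M highest-variance components
Z = X @ Vt[:M].T                         # derived orthogonal inputs z_m = X v_m

theta, *_ = np.linalg.lstsq(Z, y_c, rcond=None)    # regress y on the derived inputs
beta_pcr = Vt[:M].T @ theta              # map coefficients back to the original inputs
```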
27
Partial Least Squares Idea: find directions that have high variance and high correlation with y. Unlike general multiple linear regression, PLS regression can handle strongly collinear data and data in which the number of predictors is larger than the number of observations. PLS builds the relationship between the response and the predictors through a few latent variables constructed from the predictors. The number of latent variables is much smaller than the number of original predictors. Q: Can you please provide a geometric explanation of what each step of the Partial Least Squares algorithm is doing? (Weng-Keen Wong)
28
Partial Least Squares Let vector y (n×1) denote the single response, matrix X (n×p) the n observations of the p predictors, and matrix T (n×h) the n values of the h latent variables. The latent variables are linear combinations of the original predictors, T = XW, where matrix W (p×h) contains the weights. Then the response and the observations of the predictors can be expressed using T as follows (Wold S., et al. 2001): X = TP + E and y = TC + f, where matrix P (h×p) is called the loadings and vector C (h×1) contains the regression coefficients of T. The matrix E (n×p) and vector f (n×1) are the random errors of X and y. PLS regression decomposes X and y simultaneously to find a set of latent variables that explain the covariance between X and y as much as possible.
29
Partial Least Squares PLS regression also establishes the relation between the response y and the original predictors X as a multiple regression model, y = XB + f', where vector f' (n×1) holds the regression errors and B (p×1) holds the PLS regression coefficients, which can be calculated as B = W(PW)⁻¹C. Then the significant predictors can be selected based on the values of the regression coefficients from the PLS regression, which is called the PLS-Beta method.
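A small sketch using scikit-learn's PLSRegression (invented data with more predictors than observations, a case ordinary least squares cannot handle directly):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 60))            # n = 40 observations, p = 60 predictors
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)

pls = PLSRegression(n_components=3).fit(X, y)   # h = 3 latent variables
T = pls.transform(X)      # scores of the latent variables (n x h)
B = pls.coef_             # regression coefficients relating y to the original X
y_hat = pls.predict(X)
```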
30
PCR vs. PLS vs. Ridge Regression
PCR discards the smallest-eigenvalue components (the low-variance directions). The mth component vm solves: maximize Var(Xα) subject to ||α|| = 1 and αᵀS vℓ = 0 for ℓ = 1, ..., m−1. PLS shrinks the low-variance directions while inflating the high-variance directions. The mth component vm solves: maximize Corr²(y, Xα) Var(Xα) subject to ||α|| = 1 and αᵀS vℓ = 0 for ℓ = 1, ..., m−1. Ridge Regression: shrinks the coefficients of the principal components; low-variance directions are shrunk more.