Statistical Learning Dong Liu Dept. EEIS, USTC
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
A motivating example 1/2
What is the height of Mount Qomolangma? A piece of knowledge involving one variable. How do we obtain this "knowledge" from data? We have a series of measurements, and, for example, we can use their (arithmetic) mean:

def HeightOfQomolangma():
    # The "knowledge" hard-coded as a constant
    return 8848.0

def SLHeightOfQomolangma(data):
    # The "knowledge" learned from data: the arithmetic mean of the measurements
    return sum(data) / len(data)
A motivating example 2/2
Or, separating the learning stage from the using stage:

hQomo = 0

def LearnHeightOfQomolangma(data):
    # Learning: estimate the height from the measurements
    global hQomo
    hQomo = sum(data) / len(data)

def UseHeightOfQomolangma():
    # Using: return the learned height
    return hQomo
Why the arithmetic mean? Least squares: the mean is the value h that minimizes the sum of squared deviations, min_h Σ_n (x_n − h)^2. In statistical learning, we often formulate such optimization problems and try to solve them. How to formulate? How to solve?
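To make the least-squares claim concrete, here is a minimal sketch checking numerically that the minimizer of the sum of squared deviations is the arithmetic mean; the measurement values and the grid of candidate heights are made-up assumptions for illustration.

import numpy as np

# Hypothetical measurements of the height
data = np.array([8848.2, 8847.8, 8848.5, 8847.9, 8848.1])

# Closed form: dJ/dh = -2 * sum(x_n - h) = 0  =>  h = mean(x)
h_closed = data.mean()

# Brute-force check on a grid of candidate heights
candidates = np.linspace(8847.0, 8849.0, 2001)
sse = ((data[:, None] - candidates[None, :]) ** 2).sum(axis=0)
h_grid = candidates[np.argmin(sse)]

print(h_closed, h_grid)  # both are (approximately) the arithmetic mean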
From the statistical perspective
The height of Qomolangma is a random variable, which follows a specific probability distribution, for example a Gaussian (normal) distribution. The measurements are observations of the random variable and are used to estimate the distribution. Assumption: the observations are independent and identically distributed (i.i.d.).
Maximum likelihood estimation
Likelihood function: p(x_n | μ, σ^2), viewed as a function of the parameter μ. Overall likelihood function (recall i.i.d.): p(x_1, ..., x_N | μ, σ^2) = Π_n p(x_n | μ, σ^2). We need to find the parameter that maximizes the overall likelihood: μ_ML = argmax_μ Π_n p(x_n | μ, σ^2). For a Gaussian, maximizing the log-likelihood reduces to least squares!
More is implied
We can also estimate other parameters, e.g. the variance σ^2. We can use other estimators, like the unbiased one that divides by N − 1 instead of N. We can give range (interval) estimation rather than point estimation.
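A small sketch, on the same made-up measurements, of the biased (ML) and unbiased variance estimators mentioned above.

import numpy as np

# Illustrative data (assumed values): compare the ML variance estimate, which
# divides by N, with the unbiased estimate, which divides by N - 1.
data = np.array([8848.2, 8847.8, 8848.5, 8847.9, 8848.1])
N = len(data)
mean = data.mean()

var_ml = ((data - mean) ** 2).sum() / N              # biased (ML) estimate
var_unbiased = ((data - mean) ** 2).sum() / (N - 1)  # unbiased estimate

print(var_ml, var_unbiased)  # var_unbiased = var_ml * N / (N - 1)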
Correlated variables
The height of Mount Qomolangma is correlated with the season (Spring, Summer, Fall, Winter). So what is the correlation between the two variables? Why not an affine function:

def UseSeasonalHeight(x, a, b):
    # x is a numeric code for the season
    return a*x + b
Least squares
We formulate the optimization problem as min_{a,b} Σ_n (y_n − a x_n − b)^2, and (fortunately) it has a closed-form solution. The resulting fit is seemingly not good; how to improve?
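A sketch of the closed-form solution via the normal equations; the season codes and height values below are hypothetical, and the result is cross-checked against NumPy's built-in polynomial fit.

import numpy as np

# Hypothetical season codes and measured heights
datax = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
datay = np.array([8848.9, 8849.6, 8850.4, 8851.5,
                  8849.1, 8849.8, 8850.6, 8851.3])

X = np.column_stack([datax, np.ones_like(datax)])  # design matrix [x, 1]
# Solve the normal equations (X^T X) [a, b]^T = X^T y
a, b = np.linalg.solve(X.T @ X, X.T @ datay)

print(a, b)
print(np.polyfit(datax, datay, 1))  # same result from NumPy's built-in fit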
Variable (re)mapping
Previously we used the coding Spring → 1, Summer → 2, Fall → 3, Winter → 4. Now we use Summer → 1, Spring/Fall → 2, Winter → 3.

def ErrorOfHeight(datax, datay, a, b):
    # datax and datay are assumed to be NumPy arrays
    fity = UseSeasonalHeight(datax, a, b)
    error = datay - fity
    return sum(error**2)

Result: Season: 3.6146646254138832; Remapped season: 0.9404394822254982. The remapped coding clearly reduces the fitting error.
From the statistical perspective
We have two random variables. Height: a dependent, continuous variable. Season: an independent, discrete variable. We consider the season's probability distribution, the height's (conditional) probability distribution given the season, and the overall likelihood function.
History review
Adrien-Marie Legendre (French, 1752-1833); Carl Friedrich Gauss (German, 1777-1855)
Notes
Correlation is not causation, but it inspires efforts at interpretation. Remapped/latent variables are important.
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
As we are not fully confident about our data
Height is correlated with season, but also with other variables. Can we constrain the level of correlation between height and season? In other words, we want to constrain the slope parameter. We have two choices: given a range of possible values of the slope parameter, find the least-squares solution within that range; or minimize the squared error and the (e.g. square of the) slope parameter simultaneously.
The Lagrange multiplier
Constrained form: minimize Σ_n (y_n − a x_n − b)^2 subject to a bound on a^2. Unconstrained form: minimize Σ_n (y_n − a x_n − b)^2 + λ a^2. Solutions for increasing regularization weights (see the sketch after this slide):
Reg 0: y = 1.184060 * x + 8846.369904, error: 0.940439
Reg 1: y = 1.014908 * x + 8846.708207, error: 1.112113
Reg 2: y = 0.888045 * x + 8846.961934, error: 1.466188
Reg 3: y = 0.789373 * x + 8847.159277, error: 1.875104
Reg 4: y = 0.710436 * x + 8847.317152, error: 2.286357
Reg 5: y = 0.645851 * x + 8847.446322, error: 2.678452
Reg 6: y = 0.592030 * x + 8847.553964, error: 3.043435
Reg 7: y = 0.546489 * x + 8847.645045, error: 3.379417
Reg 8: y = 0.507454 * x + 8847.723115, error: 3.687209
Reg 9: y = 0.473624 * x + 8847.790776, error: 3.968753
Reg 10: y = 0.444022 * x + 8847.849978, error: 4.226370
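A sketch of the unconstrained (penalized) form, sweeping the regularization weight as in the printout above; the data values are hypothetical, so the printed numbers will differ from the slide.

import numpy as np

# Minimize sum_n (y_n - a*x_n - b)^2 + reg * a^2, penalizing only the slope a.
datax = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)   # assumed season codes
datay = np.array([8848.9, 8849.6, 8850.4, 8851.5,
                  8849.1, 8849.8, 8850.6, 8851.3])          # assumed heights

X = np.column_stack([datax, np.ones_like(datax)])
for reg in range(11):
    # Normal equations with the penalty added to the slope entry only
    A = X.T @ X + np.diag([reg, 0.0])
    a, b = np.linalg.solve(A, X.T @ datay)
    error = np.sum((datay - (a * datax + b)) ** 2)
    print(f"Reg {reg}: y = {a:f} * x + {b:f}, error: {error:f}")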
More about the Lagrange multiplier method 1/2 Example: No constraint With equality as constraint With inequality as constraint
More about the Lagrange multiplier method 2/2
A necessary (but not sufficient) condition for convex optimization. Using the Lagrange multiplier method with inequality constraints leads to the KKT (Karush-Kuhn-Tucker) conditions.
What & Why is regularization? A process of introducing additional information in order to solve an ill-posed problem Why Want to introduce additional information Have difficulty in solving the ill-posed problem Without regularization: With regularization: 2018/11/10 Chap 1. Linear Regression
From the statistical perspective
The Bayes formula: p(θ | X) ∝ p(X | θ) p(θ). Maximum a posteriori (MAP) estimation (Bayesian estimation): maximize the posterior instead of the likelihood. We need to specify a prior, e.g. a zero-mean Gaussian on the slope. Finally it reduces to regularized least squares, with the regularization weight determined by the noise variance and the prior variance.
Bayesian interpretation of regularization
The prior is the "additional information" (many statisticians question this point). How much regularization to apply depends on how confident we are about the data and how confident we are about the prior.
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
Polynomial curve fitting
y(x, w) = Σ_j w_j φ_j(x): the φ_j are basis functions (powers of x for a polynomial) and the w_j are weights. Another form separates the weights from the bias: y(x, w) = w_0 + Σ_j w_j φ_j(x).
Basis functions
Global vs. local: polynomial basis functions are global, while Gaussian and sigmoidal basis functions are local. Other choices: Fourier basis (sinusoidal), wavelets, splines. (See the sketch after this slide.)
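A sketch of the three basis-function families named above; the centers and scale parameters are illustrative choices.

import numpy as np

def polynomial_basis(x, degree):
    # Global basis: powers x^0, ..., x^degree
    return np.vstack([x ** j for j in range(degree + 1)]).T

def gaussian_basis(x, centers, s):
    # Local basis: exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, centers, s):
    # Local basis: logistic sigmoid of (x - mu_j) / s
    a = (x[:, None] - centers[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 3).shape)                          # (5, 4)
print(gaussian_basis(x, np.linspace(-1, 1, 9), 0.2).shape)   # (5, 9)
print(sigmoid_basis(x, np.linspace(-1, 1, 9), 0.2).shape)    # (5, 9)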
Variable remapping
Using basis functions remaps the variable(s) in a non-linear manner and changes the dimensionality, so as to enable a simpler (linear) model.
Maximum likelihood
Assume observations come from a deterministic function with additive Gaussian noise: t = y(x, w) + ε with ε ~ N(0, β^(-1)). Then p(t | x, w, β) = N(t | y(x, w), β^(-1)). Given observed inputs X = {x_1, ..., x_N} and targets t = (t_1, ..., t_N), the likelihood function is p(t | X, w, β) = Π_n N(t_n | w^T φ(x_n), β^(-1)).
Maximum likelihood and least squares
Maximizing the (log-)likelihood with respect to w is equivalent to minimizing E_D(w) = (1/2) Σ_n (t_n − w^T φ(x_n))^2, which is known as the sum of squared errors (SSE).
Maximum likelihood solution
The solution is w_ML = (Φ^T Φ)^(-1) Φ^T t, where Φ is the design matrix with entries Φ_nj = φ_j(x_n), and Φ† = (Φ^T Φ)^(-1) Φ^T is the (Moore-Penrose) pseudo-inverse.
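A sketch of the maximum-likelihood solution via the pseudo-inverse of the design matrix; the noisy-sine data, Gaussian basis centers, and widths are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # assumed noisy sine data

centers = np.linspace(0, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))  # design matrix

w_ml = np.linalg.pinv(Phi) @ t                        # pseudo-inverse solution
w_ml_check = np.linalg.lstsq(Phi, t, rcond=None)[0]   # equivalent least-squares solve
print(np.allclose(w_ml, w_ml_check))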
Geometrical interpretation
Let t = (t_1, ..., t_N) and let the columns of Φ be φ_1, ..., φ_M; they span a subspace S of the N-dimensional data space. Then the least-squares prediction y = Φ w_ML is the orthogonal projection of t onto the subspace S, which minimizes the Euclidean distance between t and y.
Regularized least squares
Construct the "joint" error function: data term + regularization term. Using the SSE as the data term and a quadratic regularization term (ridge regression), E(w) = (1/2) Σ_n (t_n − w^T φ(x_n))^2 + (λ/2) w^T w, the solution is w = (λI + Φ^T Φ)^(-1) Φ^T t.
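A short sketch of the ridge solution, continuing the illustrative noisy-sine setup from the previous sketch.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

centers = np.linspace(0, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

lam = 0.1   # assumed regularization weight
M = Phi.shape[1]
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)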
Equivalent kernel
For a new input x, the predicted output is y(x) = φ(x)^T (λI + Φ^T Φ)^(-1) Φ^T t = Σ_n k(x, x_n) t_n, where k(x, x_n) = φ(x)^T (λI + Φ^T Φ)^(-1) φ(x_n) is the equivalent kernel. Predictions can be calculated directly from the equivalent kernel, without calculating the parameters.
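A sketch of prediction via the equivalent kernel, reusing the illustrative ridge setup; it checks that weighting the training targets by the kernel gives the same prediction as using the fitted parameters.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

centers = np.linspace(0, 1, 9)
def phi(u):
    # Gaussian basis features (illustrative centers and width)
    return np.exp(-(np.atleast_1d(u)[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

Phi, lam = phi(x), 0.1
A_inv = np.linalg.inv(lam * np.eye(len(centers)) + Phi.T @ Phi)

x_new = 0.5
k = phi(x_new) @ A_inv @ Phi.T                          # equivalent kernel values k(x_new, x_n)
y_kernel = (k @ t).item()                               # prediction directly from the kernel
y_weights = (phi(x_new) @ (A_inv @ Phi.T @ t)).item()   # prediction via the weights
print(np.isclose(y_kernel, y_weights), k.sum())         # same prediction; inspect sum of kernel values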
Equivalent kernel for Gaussian basis functions
Equivalent kernel for other basis functions
Polynomial; sigmoidal. The equivalent kernel is "local": nearby points receive larger weights.
Properties of the equivalent kernel
Its values sum to 1 when λ = 0; it may take negative values; it can be seen as an inner product between feature vectors.
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
Example (reproduced from PRML)
Generate 100 data sets, each having 25 points, from a sine function plus Gaussian noise. Perform ridge regression on each data set with 24 Gaussian basis functions and different values of the regularization weight. (A sketch of this experiment follows this slide.)
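A sketch of this experiment; the noise level, basis widths, random seed, and the three regularization weights are assumptions, so the numbers are illustrative rather than a reproduction of the figures.

import numpy as np

rng = np.random.default_rng(0)
centers = np.linspace(0, 1, 24)

def design(x, s=0.1):
    # 24 Gaussian basis functions (assumed width s)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def fit_one(lam):
    # One data set: 25 points from sin(2*pi*x) plus Gaussian noise, ridge fit
    x = rng.uniform(0, 1, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
    Phi = design(x)
    return np.linalg.solve(lam * np.eye(24) + Phi.T @ Phi, Phi.T @ t)

x_grid = np.linspace(0, 1, 100)
for lam in (1e-3, 1.0, 100.0):          # low, moderate, high regularization
    preds = np.array([design(x_grid) @ fit_one(lam) for _ in range(100)])
    avg = preds.mean(axis=0)
    bias2 = np.mean((avg - np.sin(2 * np.pi * x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lambda={lam:g}: bias^2={bias2:.3f}, variance={var:.3f}")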
Simulation results 1/3
High regularization: the variance is small but the bias is large. Fitted curves (20 of the fits shown); the average curve over the 100 fits.
Simulation results 2/3
Moderate regularization. Fitted curves (20 of the fits shown); the average curve over the 100 fits.
Simulation results 3/3
Low regularization: the variance is large but the bias is small. Fitted curves (20 of the fits shown); the average curve over the 100 fits.
Bias-variance decomposition
The expected squared loss splits into a model-dependent term and an intrinsic "noise" term; consider the first term. Suppose we have a dataset D and we can calculate the parameters based on it. Then we take the expectation with respect to the dataset and finally obtain: expected "loss" = (bias)^2 + variance + noise.
Bias-variance trade-off
An over-regularized model will have high bias, while an under-regularized model will have high variance. How can we achieve the trade-off? For example, by cross validation (to be discussed later).
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
Other forms?
Least squares; ridge regression; norm-regularized regression with the Lq norm: ||w||_q = (Σ_j |w_j|^q)^(1/q).
Different norms
Unit-ball shapes for different values of q. What about q = 0 and q = ∞?
Best subset selection
Define the l0 "norm" as the number of non-zero entries, ||w||_0 = #{j : w_j ≠ 0}. Best subset selection regression minimizes the SSE under a constraint (or penalty) on ||w||_0; the solution is also known as "sparse." Unfortunately, this problem is NP-hard.
Example: why do we need sparsity?
fMRI data help us understand the brain's functionality. Brain fMRI data may consist of 10,000-100,000 voxels, and we want to identify the most relevant ones (anchor points).
The L1 norm replacing the l0 "norm"
Interestingly, we can use the L1 norm to replace the l0 "norm" and still achieve a sparse solution.* Geometric interpretation (left: L1 norm; right: L2 norm): the least-squares error contours first touch the L1 ball at a corner, where some coefficients are exactly zero; this is where the sparsity comes from.
* Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.
LASSO regression
LASSO: least absolute shrinkage and selection operator.* Bayesian interpretation: the Laplace distribution as the prior.
* Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288.
Solution of LASSO regression
Consider a special case in which the least-squares solution is available in closed form (e.g. an orthonormal design). Then the LASSO solution is obtained by soft thresholding the least-squares solution: coefficients whose magnitude falls below a threshold are set exactly to zero, and the remaining ones are shrunk toward zero by a constant.
Comparison between best subset, LASSO, and ridge
Consider the special case of an orthonormal design matrix. Best subset: hard thresholding of the LS solution (small coefficients dropped, large ones kept unchanged). Ridge: uniform shrinkage of the LS solution. LASSO: soft thresholding (small coefficients set to zero, large ones shrunk by a constant). (See the sketch after this slide.)
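A coordinate-wise sketch of the three estimators for an orthonormal design. It assumes the penalized form (1/2)||t − Φw||^2 + penalty and, for illustration, uses the same threshold for hard and soft thresholding; with other scalings or a subset-size constraint, the thresholds change by a constant factor.

import numpy as np

w_ls = np.array([3.0, 1.2, -0.4, 0.05, -2.5])   # hypothetical least-squares coefficients
lam = 1.0                                        # assumed regularization weight

best_subset = np.where(np.abs(w_ls) > lam, w_ls, 0.0)        # hard thresholding
ridge = w_ls / (1.0 + lam)                                    # uniform shrinkage
lasso = np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam, 0.0)   # soft thresholding

print(best_subset)   # small coefficients dropped, large ones kept unchanged
print(ridge)         # every coefficient shrunk by the same factor
print(lasso)         # small ones set to zero, large ones shrunk toward zero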
Implications of different norms
Along the spectrum from q = 0 to q = ∞: best subset corresponds to q = 0, LASSO to q = 1, ridge to q = 2. A sparse solution requires q ≤ 1, while convex optimization requires q ≥ 1; q = 1 (LASSO) is the only choice that gives both.
Chapter 1. Linear Regression: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach
Bayesian linear regression
Define a prior for the parameters, e.g. p(w) = N(w | 0, α^(-1) I). Note the likelihood function is p(t | X, w, β) = Π_n N(t_n | w^T φ(x_n), β^(-1)). The posterior is then Gaussian, p(w | t) = N(w | m_N, S_N), with S_N^(-1) = αI + β Φ^T Φ and m_N = β S_N Φ^T t.
MAP estimation
The maximum a posteriori (MAP) estimate is the posterior mode, which for the Gaussian posterior equals the posterior mean m_N. Compared with the maximum likelihood estimate, it is shrunk toward the prior mean; compared with the ridge regression solution, it coincides with ridge when λ = α/β. (See the sketch after this slide.)
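A sketch of the Gaussian posterior and the MAP/ridge equivalence, on the illustrative noisy-sine data; the values of α and β are assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
centers = np.linspace(0, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

alpha, beta = 2.0, 1.0 / 0.3 ** 2               # assumed hyper-parameters
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t                    # posterior mean = MAP estimate

# Ridge regression with lambda = alpha / beta gives the same weights
w_ridge = np.linalg.solve((alpha / beta) * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
print(np.allclose(m_N, w_ridge))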
How to set the prior
If we use a zero-mean Gaussian prior, the Bayesian (MAP) estimate is equivalent to the ridge regression solution. If we use a zero-mean Laplace prior, the Bayesian estimate (with no closed-form expression) is equivalent to the LASSO regression solution. Conjugate prior: a prior chosen so that the posterior and the prior belong to the same family of distributions, e.g. Gaussian.
Example (reproduced from PRML)
0 data points observed. The true parameters for the simulation are fixed in advance.
Simulation results 1/3: 1 data point observed. Likelihood; posterior; data space.
Simulation results 2/3: 2 data points observed. Likelihood; posterior; data space.
Simulation results 3/3: 20 data points observed. Posterior; data space. The variance of the posterior decreases as the number of data points increases.
Predictive distribution
In the Bayesian framework, every variable has a distribution, including the predicted output for a new input: p(t | x, data) = N(t | m_N^T φ(x), σ_N^2(x)) with σ_N^2(x) = 1/β + φ(x)^T S_N φ(x). As N increases, the second term of the variance vanishes.
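A sketch of the predictive mean and variance at a new input, reusing the illustrative posterior quantities from the previous sketch.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
centers = np.linspace(0, 1, 9)
def phi(u):
    return np.exp(-(np.atleast_1d(u)[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

alpha, beta = 2.0, 1.0 / 0.3 ** 2               # assumed hyper-parameters
Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_star = 0.25
phi_star = phi(x_star)[0]
pred_mean = m_N @ phi_star
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star   # second term shrinks as N grows
print(pred_mean, pred_var)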
Example (reproduced from PRML)
Sinusoidal data, 9 Gaussian basis functions, 1 data point. Shown: the true function, the predictive mean, the predictive variance, and different predicted functions sampled from the posterior.
Simulation results 1/3 Sinusoidal data, 9 Gaussian basis functions, 2 data points
Simulation results 2/3 Sinusoidal data, 9 Gaussian basis functions, 4 data points
Simulation results 3/3 Sinusoidal data, 9 Gaussian basis functions, 25 data points
Model selection Polynomial curve fitting: How to set the order?
From the statistical perspective
Bayesian model selection: given the dataset, estimate the posterior probabilities of different models. "ML" model selection: choose the model that maximizes the model evidence function.
Calculation of the model evidence for Bayesian linear regression: details, cf. PRML.
Example (reproduced from PRML)
More about the hyper-parameters
We can "estimate" the hyper-parameters α and β based on e.g. the ML criterion (maximizing the evidence). Define the eigenvalues of β Φ^T Φ as λ_i.
Interpretation 1/3
The eigenvalues of αI + β Φ^T Φ are α + λ_i.
Interpretation 2/3
By decreasing α, more parameters become "learnt from data." The quantity γ = Σ_i λ_i / (α + λ_i) measures the effective number of "learnt" parameters.
Interpretation 3/3
Recall that for estimating the Gaussian variance, the ML estimator divides by N while the unbiased estimator divides by N − 1. Now for Bayesian linear regression, the noise variance is re-estimated as 1/β = (1/(N − γ)) Σ_n (t_n − m_N^T φ(x_n))^2, dividing by the number of data points minus the number of effectively learnt parameters.
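A sketch of the evidence-based re-estimation loop described on the last few slides, run on the illustrative noisy-sine data; the basis centers, widths, noise level, and initial values of α and β are assumptions.

import numpy as np

# Re-estimation: gamma = sum_i lambda_i / (alpha + lambda_i) with lambda_i the
# eigenvalues of beta * Phi^T Phi; alpha <- gamma / (m_N^T m_N);
# 1/beta <- (1/(N - gamma)) * sum_n (t_n - m_N^T phi(x_n))^2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
centers = np.linspace(0, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))
N, M = Phi.shape

alpha, beta = 1.0, 1.0                     # arbitrary initial values
for _ in range(20):
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    lam_i = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam_i / (alpha + lam_i))
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)

print(alpha, beta, gamma)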
Notes
The hyper-parameters can themselves be regarded as random variables and integrated into the Bayesian framework.
Chapter summary
Dictionary: bias-variance decomposition; equivalent kernel; Gaussian distribution; Laplace distribution; KKT condition; model selection; prior (conjugate prior); posterior; regularization; sparsity.
Toolbox: basis functions; best subset selection; Lagrange multiplier; LASSO regression; least squares; MAP (Bayesian) estimation; ML estimation; ridge regression.