Statistical Learning
Dong Liu, Dept. EEIS, USTC

Chapter 1. Linear Regression
Outline: From one to two; Regularization; Basis functions; Bias-variance decomposition; Different regularization forms; Bayesian approach

A motivating example 1/2
What is the height of Mount Qomolangma? A piece of knowledge about a single variable. How do we obtain this "knowledge" from data? We have a series of measurements; for example, we can use the (arithmetic) mean:

def HeightOfQomolangma():
    return 8848.0

def SLHeightOfQomolangma(data):
    return sum(data) / len(data)

A motivating example 2/2
Or we can separate the learning step from the using step:

hQomo = 0

def LearnHeightOfQomolangma(data):   # learning
    global hQomo
    hQomo = sum(data) / len(data)

def UseHeightOfQomolangma():         # using
    return hQomo

Why arithmetic mean? Least squares
In statistical learning, we often formulate such optimization problems and try to solve them. How to formulate? How to solve?
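As a sketch of the formulation hinted at above (writing the measurements as $x_1, \dots, x_N$, a notation introduced here for illustration): the least-squares estimate is the value of $h$ that minimizes the sum of squared deviations, and setting the derivative to zero recovers the arithmetic mean,

\[
\hat{h} = \arg\min_{h} \sum_{n=1}^{N} (x_n - h)^2, \qquad
\frac{d}{dh} \sum_{n=1}^{N} (x_n - h)^2 = -2 \sum_{n=1}^{N} (x_n - h) = 0
\;\Rightarrow\;
\hat{h} = \frac{1}{N} \sum_{n=1}^{N} x_n .
\]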

From the statistical perspective
The height of Qomolangma is a random variable that follows a specific probability distribution, for example a Gaussian (normal) distribution $\mathcal{N}(x \mid \mu, \sigma^2)$. The measurements are observations of the random variable and are used to estimate the distribution. Assumption: the observations are independent and identically distributed (i.i.d.).

Maximum likelihood estimation
Likelihood function: $p(x_n \mid \mu, \sigma^2)$, viewed as a function of the parameters $\mu$ and $\sigma^2$. Overall likelihood function (recall i.i.d.): $\prod_{n=1}^{N} p(x_n \mid \mu, \sigma^2)$. We need to find the parameter that maximizes the overall likelihood, $\hat{\mu} = \arg\max_{\mu} \prod_{n} p(x_n \mid \mu, \sigma^2)$, and it reduces to least squares!
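A sketch of why the Gaussian assumption brings us back to least squares: the logarithm of the overall likelihood is

\[
\ln \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln\bigl(2\pi\sigma^2\bigr),
\]

so maximizing over $\mu$ is exactly minimizing $\sum_n (x_n - \mu)^2$.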

More is implied
We can also estimate other parameters, e.g. the variance $\sigma^2$. We can use other estimators, such as the unbiased one. We can give interval estimation rather than point estimation.
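For example (standard results, stated here for concreteness), the ML and unbiased variance estimators differ only in the normalization:

\[
\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2,
\qquad
\hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2 .
\]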

Correlated variables
The height of Mount Qomolangma is correlated with the season (Spring, Summer, Fall, Winter). So what is the correlation between the two variables? Why not an affine function:

def UseSeasonalHeight(x, a, b):
    return a * x + b

Least squares
We formulate the optimization problem as minimizing $\sum_n \bigl(y_n - (a x_n + b)\bigr)^2$ over $a$ and $b$, and (fortunately) it has a closed-form solution. The resulting fit is seemingly not good; how can we improve it?
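A minimal sketch of the closed-form fit in NumPy (the function name fit_affine is an assumption, not the original slide code):

import numpy as np

def fit_affine(datax, datay):
    # Closed-form least squares for y = a*x + b, via the normal equations
    x = np.asarray(datax, dtype=float)
    y = np.asarray(datay, dtype=float)
    X = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X [a, b]^T - y||^2
    return a, b

The returned pair (a, b) can be plugged into UseSeasonalHeight above.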

Variable (re)mapping
Previously we used the mapping Spring → 1, Summer → 2, Fall → 3, Winter → 4; now we use Summer → 1, Spring/Fall → 2, Winter → 3. The fitting error improves accordingly:

def ErrorOfHeight(datax, datay, a, b):
    fity = UseSeasonalHeight(datax, a, b)
    error = datay - fity
    return sum(error**2)

Season: 3.6146646254138832
Remapped season: 0.9404394822254982

From the statistical perspective
We have two random variables: height, a dependent, continuous variable, and season, an independent, discrete variable. The season has its own probability distribution, the height's distribution is conditioned on the season, and the overall likelihood function is the product over the observed (season, height) pairs.

History review
Least squares dates back to Adrien-Marie Legendre (French, 1752-1833) and Carl Friedrich Gauss (German, 1777-1855).

Notes
Correlation is not causality, but it inspires efforts at interpretation. Remapped/latent variables are important.

Chapter 1. Linear Regression. Next section: Regularization

As we are not confident about our data
Height is correlated with season, but also with other variables. Can we constrain the level of correlation between height and season? In other words, we want to constrain the slope parameter. We have two choices:
1. Given a range of possible values of the slope parameter, minimize the squared error within that range.
2. Minimize the squared error and (e.g. the square of) the slope parameter simultaneously.

The Lagrange multiplier
Constrained form: $\min_{a,b} \sum_n \bigl(y_n - (a x_n + b)\bigr)^2$ subject to a bound on $a^2$. Unconstrained form: $\min_{a,b} \sum_n \bigl(y_n - (a x_n + b)\bigr)^2 + \lambda a^2$. Solutions for increasing regularization weight:
Reg 0: y = 1.184060 * x +8846.369904, error: 0.940439
Reg 1: y = 1.014908 * x +8846.708207, error: 1.112113
Reg 2: y = 0.888045 * x +8846.961934, error: 1.466188
Reg 3: y = 0.789373 * x +8847.159277, error: 1.875104
Reg 4: y = 0.710436 * x +8847.317152, error: 2.286357
Reg 5: y = 0.645851 * x +8847.446322, error: 2.678452
Reg 6: y = 0.592030 * x +8847.553964, error: 3.043435
Reg 7: y = 0.546489 * x +8847.645045, error: 3.379417
Reg 8: y = 0.507454 * x +8847.723115, error: 3.687209
Reg 9: y = 0.473624 * x +8847.790776, error: 3.968753
Reg 10: y = 0.444022 * x +8847.849978, error: 4.226370
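A minimal sketch of how such a regularization sweep could be produced (the stand-in data, the reading of "Reg k" as weight lambda = k, and penalizing only the slope are assumptions made to mirror the printout above):

import numpy as np

def ridge_affine(datax, datay, lam):
    # Minimize sum((y - (a*x + b))**2) + lam * a**2; only the slope is penalized
    x = np.asarray(datax, dtype=float)
    y = np.asarray(datay, dtype=float)
    X = np.column_stack([x, np.ones_like(x)])
    A = X.T @ X + lam * np.diag([1.0, 0.0])   # regularize the slope only
    a, b = np.linalg.solve(A, X.T @ y)
    return a, b

# Illustrative stand-in data (not the original measurements)
rng = np.random.default_rng(0)
seasons = np.tile([1.0, 2.0, 3.0], 4)
heights = 8847.0 + 1.2 * seasons + rng.normal(0.0, 0.5, seasons.size)

for reg in range(11):
    a, b = ridge_affine(seasons, heights, reg)
    err = np.sum((heights - (a * seasons + b)) ** 2)
    print(f"Reg {reg}: y = {a:f} * x +{b:f}, error: {err:f}")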

More about the Lagrange multiplier method 1/2
Example: compare the solutions of the same objective with no constraint, with an equality constraint, and with an inequality constraint.

More about the Lagrange multiplier method 2/2
A stationary point gives a necessary (but not sufficient) condition for optimality. Applying the Lagrange multiplier method to the constrained problem yields the Karush-Kuhn-Tucker (KKT) conditions.

What & Why is regularization? A process of introducing additional information in order to solve an ill-posed problem Why Want to introduce additional information Have difficulty in solving the ill-posed problem Without regularization: With regularization: 2018/11/10 Chap 1. Linear Regression

From the statistical perspective
The Bayes formula: $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$. Maximum a posteriori (MAP) estimation (Bayesian estimation) maximizes the posterior. We need to specify a prior, e.g. a zero-mean Gaussian on the slope. Finally it reduces to regularized least squares, with the regularization weight given by the ratio of the noise variance to the prior variance.
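A sketch of the reduction, assuming a zero-mean Gaussian prior $\mathcal{N}(a \mid 0, \sigma_0^2)$ on the slope and Gaussian observation noise of variance $\sigma^2$:

\[
-\ln p(a, b \mid \mathcal{D})
= \frac{1}{2\sigma^2} \sum_n \bigl(y_n - (a x_n + b)\bigr)^2 + \frac{a^2}{2\sigma_0^2} + \text{const}
\;\Rightarrow\;
\lambda = \frac{\sigma^2}{\sigma_0^2} .
\]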

Bayesian interpretation of regularization
The prior is the "additional information" (many statisticians question this point). How much regularization to apply depends on how confident we are about the data and how confident we are about the prior.

Chapter 1. Linear Regression. Next section: Basis functions

Polynomial curve fitting
The model is a linear combination of basis functions, $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j \phi_j(x)$, with polynomial basis functions $\phi_j(x) = x^j$ and weights $w_j$. Another form separates the weights from a bias term: $y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j \phi_j(x)$.

Basis functions
Global vs. local: polynomial basis functions are global, while Gaussian and sigmoid basis functions are local. Other choices: Fourier basis (sinusoidal), wavelet, spline.
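A minimal sketch of these basis functions in NumPy (function and parameter names are illustrative):

import numpy as np

def poly_basis(x, j):
    # Polynomial basis: global, a change in x affects the value everywhere
    return x ** j

def gauss_basis(x, mu, s):
    # Gaussian basis: local, appreciable only near the centre mu
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu, s):
    # Sigmoidal basis: a local transition around mu
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))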

Variable remapping
Using basis functions remaps the variable(s) in a non-linear manner and changes the dimensionality, so as to enable a simpler (linear) model.

Maximum likelihood
Assume observations come from a deterministic function with additive Gaussian noise: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \beta^{-1})$. Then $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\bigl(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\bigr)$. Given observed inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{t} = (t_1, \dots, t_N)^\mathrm{T}$, the likelihood function is $p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr)$.

Maximum likelihood and least squares
Maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing $E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl(t_n - \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2$, which is known as the sum of squared errors (SSE).

Maximum likelihood solution
The solution is $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\mathrm{T} \mathbf{t} = \boldsymbol{\Phi}^{\dagger} \mathbf{t}$, where $\boldsymbol{\Phi}$ is the design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$ and $\boldsymbol{\Phi}^{\dagger}$ is its (Moore-Penrose) pseudo-inverse.
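A minimal sketch in NumPy (design_matrix and ml_weights are illustrative names; the pseudo-inverse is computed with np.linalg.pinv):

import numpy as np

def design_matrix(x, basis_fns):
    # Phi[n, j] = phi_j(x_n)
    return np.column_stack([phi(x) for phi in basis_fns])

def ml_weights(Phi, t):
    # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the Moore-Penrose pseudo-inverse
    return np.linalg.pinv(Phi) @ t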

Geometrical interpretation
Let $\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}_{\mathrm{ML}}$, and let the columns of $\boldsymbol{\Phi}$ be $\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_M$; they span a subspace $\mathcal{S}$ of $\mathbb{R}^N$. Then $\mathbf{y}$ is the orthogonal projection of $\mathbf{t}$ onto the subspace $\mathcal{S}$, so as to minimize the Euclidean distance $\|\mathbf{y} - \mathbf{t}\|$.

Regularized least squares
Construct a "joint" error function: data term + regularization term. Using SSE as the data term and a quadratic regularization term (ridge regression), $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl(t_n - \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2 + \frac{\lambda}{2} \mathbf{w}^\mathrm{T} \mathbf{w}$, the solution is $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\mathrm{T} \mathbf{t}$.
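A minimal sketch of the ridge solution (illustrative names, same conventions as the ML sketch above):

import numpy as np

def ridge_weights(Phi, t, lam):
    # w = (lam * I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)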

Equivalent kernel
For a new input $\mathbf{x}$, the predicted output is $y(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\mathrm{T} \mathbf{w} = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n$, where $k(\mathbf{x}, \mathbf{x}_n)$ is the equivalent kernel. Predictions can be calculated directly from the equivalent kernel, without calculating the parameters.
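A minimal sketch of the equivalent kernel for the regularized least-squares predictor above (phi_x stands for the basis-function vector of the new input; the names are assumptions):

import numpy as np

def equivalent_kernel(phi_x, Phi, lam=0.0):
    # Weights k(x, x_n) such that y(x) = sum_n k(x, x_n) * t_n
    M = Phi.shape[1]
    S = np.linalg.inv(lam * np.eye(M) + Phi.T @ Phi)
    return phi_x @ S @ Phi.T                       # shape (N,)

# Prediction without explicitly forming the weight vector:
# y_pred = equivalent_kernel(phi_x, Phi, lam) @ t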

Equivalent kernel for Gaussian basis functions (figure)

Equivalent kernel for other basis functions
Polynomial and sigmoidal basis functions: the equivalent kernel is still "local," i.e. nearby points receive more weight.

Properties of the equivalent kernel
It sums to 1 if λ is 0; it may have negative values; and it can be seen as an inner product of feature vectors.

Chapter 1. Linear Regression. Next section: Bias-variance decomposition

Example (reproduced from PRML)
Generate 100 data sets, each having 25 points, from a sine function plus Gaussian noise. Perform ridge regression on each data set with 24 Gaussian basis functions and different values of the regularization weight.

Simulation results 1/3
High regularization: the variance is small but the bias is large. (Figures: fitted curves, 20 of the fits shown; and the average curve over 100 fits.)

Simulation results 2/3
Moderate regularization. (Figures: fitted curves, 20 of the fits shown; and the average curve over 100 fits.)

Simulation results 3/3
Low regularization: the variance is large but the bias is small. (Figures: fitted curves, 20 of the fits shown; and the average curve over 100 fits.)

Bias-variance decomposition
The expected loss decomposes into two terms; the second term is intrinsic "noise," so consider the first term. Suppose we have a dataset $\mathcal{D}$ and calculate the parameters, and hence the prediction $y(\mathbf{x}; \mathcal{D})$, from it; then we take the expectation with respect to the dataset. Finally we have: expected loss = (bias)$^2$ + variance + noise.

Bias-variance trade-off
An over-regularized model will have a high bias, while an under-regularized model will have a high variance. How can we achieve the trade-off? For example, by cross validation (to be discussed later). A simulation sketch of the decomposition follows.
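A minimal simulation sketch of the experiment and decomposition above (100 data sets of 25 points follow the slide; the regularization weight, noise level, and basis width are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def gauss_design(x, centres, s=0.1):
    # Design matrix of Gaussian basis functions plus a bias column
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])

def ridge_fit(Phi, t, lam):
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

L, N = 100, 25                     # data sets and points per set, as on the slide
lam, noise_std = 0.5, 0.3          # illustrative regularization weight and noise level
centres = np.linspace(0, 1, 24)    # 24 Gaussian basis functions
x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)     # the true (noise-free) function
preds = np.empty((L, x_test.size))

for l in range(L):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_std, N)
    w = ridge_fit(gauss_design(x, centres), t, lam)
    preds[l] = gauss_design(x_test, centres) @ w

bias2 = np.mean((preds.mean(axis=0) - h) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"(bias)^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {noise_std**2:.4f}")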

Chapter 1. Linear Regression. Next section: Different regularization forms

Other forms?
Least squares; ridge regression; and, more generally, norm-regularized regression with an Lq norm penalty, $\|\mathbf{w}\|_q = \bigl(\sum_j |w_j|^q\bigr)^{1/q}$.

Different norms
What about q = 0 and q = ∞?

Best subset selection
Define the l0 "norm" as the number of non-zero entries, $\|\mathbf{w}\|_0 = \#\{j : w_j \neq 0\}$. Best subset selection regression penalizes or constrains this count; the resulting solution is known as "sparse." Unfortunately, this problem is NP-hard.

Example: why do we need sparsity?
fMRI data help us understand the brain's functionality. Brain fMRI data may consist of 10,000 to 100,000 voxels, and we want to identify the most relevant ones as anchor points.

L1 norm replacing the l0 "norm"
Interestingly, we can use the L1 norm to replace the l0 "norm" and still achieve a sparse solution.* Geometric interpretation: the L1 constraint region has corners on the coordinate axes, so the least-squares contours typically first touch it at a sparse point, whereas the L2 ball has no corners. (Figure: left, L1 norm; right, L2 norm.)
* Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.

LASSO regression
LASSO: least absolute shrinkage and selection operator.* Bayesian interpretation: a Laplace distribution as the prior on the weights.
* Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288.

Solution of LASSO regression
Consider a special case in which the least-squares solution is available in closed form; the LASSO solution is then obtained from it by soft thresholding (shrink towards zero and clip at zero).

Comparison between best subset, LASSO, and ridge
Consider the special case of an orthonormal design matrix. Best subset: hard thresholding of the least-squares solution. Ridge: uniform shrinkage of the least-squares solution. LASSO: soft thresholding of the least-squares solution.
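A minimal sketch of the three rules applied to the least-squares coefficients in the orthonormal case (function names and the best-subset thresholding convention are assumptions):

import numpy as np

def best_subset(w_ls, thresh):
    # Hard thresholding: keep a least-squares coefficient only if it is large enough
    return np.where(np.abs(w_ls) > thresh, w_ls, 0.0)

def ridge_shrink(w_ls, lam):
    # Uniform shrinkage of the least-squares coefficients
    return w_ls / (1.0 + lam)

def lasso_soft(w_ls, lam):
    # Soft thresholding: shrink towards zero and clip at zero
    return np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam, 0.0)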

Implications of different norms
Along the Lq axis from q = 0 to q = ∞: best subset corresponds to q = 0, LASSO to q = 1, and ridge to q = 2. Penalties with q ≤ 1 give sparse solutions, while penalties with q ≥ 1 give convex optimization problems; q = 1 (LASSO) enjoys both.

Chapter 1. Linear Regression. Next section: Bayesian approach

Bayesian linear regression
Define a prior for the parameters, e.g. $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$. Note that the likelihood function is $p(\mathbf{t} \mid \mathbf{w}) = \prod_n \mathcal{N}\bigl(t_n \mid \mathbf{w}^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr)$. The posterior is Gaussian, $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^\mathrm{T} \mathbf{t}$.
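A minimal sketch of the posterior computation (the names m_N and S_N follow the formulas above; the function itself is illustrative):

import numpy as np

def posterior(Phi, t, alpha, beta):
    # Prior N(w | 0, alpha^{-1} I) and noise precision beta give a Gaussian posterior
    # N(w | m_N, S_N) with S_N^{-1} = alpha I + beta Phi^T Phi and m_N = beta S_N Phi^T t
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N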

MAP estimation
The maximum a posteriori (MAP) estimate is the posterior mean $\mathbf{m}_N$. Compare it with the maximum likelihood estimate $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\mathrm{T} \mathbf{t}$ and with the ridge regression solution: MAP coincides with ridge regression when $\lambda = \alpha / \beta$.

How to set the prior
If we use a zero-mean Gaussian prior, the Bayesian (MAP) estimate is equivalent to the ridge regression solution. If we use a zero-mean Laplace prior, the Bayesian estimate (which has no closed-form expression) is equivalent to the LASSO regression solution. A conjugate prior makes the posterior and the prior follow the same family of distributions, e.g. Gaussian.

Example (reproduced from PRML)
0 data points observed: the figure shows the prior over the parameters and functions sampled from it. The parameters used for the simulation follow PRML.

Simulation results 1/3
1 data point observed. (Figure columns: likelihood, posterior, data space.)

Simulation results 2/3
2 data points observed. (Figure columns: likelihood, posterior, data space.)

Simulation results 3/3
20 data points observed. (Figure columns: posterior, data space.) The variance of the posterior decreases as the number of data points increases.

Predictive distribution
In the Bayesian framework, every variable has a distribution, including the predicted output given an input: $p(t \mid \mathbf{x}, \mathbf{t}) = \mathcal{N}\bigl(t \mid \mathbf{m}_N^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}),\, \sigma_N^2(\mathbf{x})\bigr)$ with $\sigma_N^2(\mathbf{x}) = \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^\mathrm{T} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$. As N increases, the second term of the variance will vanish.
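A minimal sketch of the predictive mean and variance, reusing the posterior quantities m_N and S_N from the sketch above (the function name and argument convention are illustrative):

import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # p(t | x, D) = N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x));
    # the second variance term shrinks as the number of data points grows
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# Usage: m_N, S_N = posterior(Phi, t, alpha, beta); mean, var = predictive(phi_x, m_N, S_N, beta)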

Example (reproduced from PRML)
Sinusoidal data, 9 Gaussian basis functions, 1 data point. (Figure: true function, predictive mean, predictive variance, and different functions sampled from the posterior.)

Simulation results 1/3
Sinusoidal data, 9 Gaussian basis functions, 2 data points.

Simulation results 2/3
Sinusoidal data, 9 Gaussian basis functions, 4 data points.

Simulation results 3/3
Sinusoidal data, 9 Gaussian basis functions, 25 data points.

Model selection
Polynomial curve fitting: how to set the order?

From the statistical perspective
Bayesian model selection: given the dataset, estimate the posterior probability of each candidate model. "ML" model selection: choose the model that maximizes the model evidence function.

Calculation of model evidence for Bayesian linear regression
For details, cf. PRML.

Example (reproduced from PRML)

More about the hyper-parameters
We can "estimate" the hyper-parameters α and β based on, e.g., the ML (evidence) criterion. Define the eigenvalues $\lambda_i$ of $\beta \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi}$.

Interpretation 1/3
The eigenvalues $\lambda_i$ of $\beta \boldsymbol{\Phi}^\mathrm{T} \boldsymbol{\Phi}$ measure how strongly each parameter direction is determined by the data, relative to the prior precision α.

Interpretation 2/3
By decreasing α, more parameters become "learnt from data." The quantity $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$ measures the effective number of "learnt" parameters.

Interpretation 3/3
Recall that for estimating Gaussian parameters we have the unbiased variance estimator $\frac{1}{N-1} \sum_n (x_n - \hat{\mu})^2$. Now for Bayesian linear regression we analogously have $\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \bigl(t_n - \mathbf{m}_N^\mathrm{T} \boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2$.
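A minimal sketch of the evidence-based re-estimation loop suggested by the last few slides (the alternating update follows PRML Section 3.5.2; variable names and the iteration count are illustrative):

import numpy as np

def reestimate_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=20):
    # Alternate between the posterior and the updates
    #   gamma = sum_i lambda_i / (alpha + lambda_i)   (lambda_i: eigenvalues of beta Phi^T Phi)
    #   alpha = gamma / (m_N^T m_N)
    #   1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma)
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma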

Notes
The hyper-parameters can be further regarded as random variables and integrated into the Bayesian framework.

Chapter summary
Dictionary: bias-variance decomposition, equivalent kernel, Gaussian distribution, Laplace distribution, KKT condition, model selection, prior (conjugate prior), posterior, regularization, sparsity.
Toolbox: basis functions, best subset selection, Lagrange multiplier, LASSO regression, least squares, MAP (Bayesian) estimation, ML estimation, ridge regression.