1
Pattern Recognition and Machine Learning
Linear Models for Regression
2
Introduction
- Predict a continuous target variable t from a D-dimensional vector x of input variables.
- The models are linear functions of the adjustable parameters, not necessarily linear functions of the input variables: linear combinations of a fixed set of nonlinear functions of x, called basis functions.
- We aim to model p(t|x), so that we can minimize the expected loss, not just to find a function t = y(x).
- Note: linear models have significant limitations (discussed in the final section).
3
Outline
- Linear Basis Function Models
- The Bias-Variance Decomposition
- Bayesian Linear Regression
- Bayesian Model Comparison
- The Evidence Approximation
- Limitations of Fixed Basis Functions
4
Linear Basis Function Models
- The simplest linear model: y(x, w) = w0 + w1 x1 + … + wD xD (linear in both the parameters and the inputs).
- With fixed nonlinear basis functions: y(x, w) = Σj wj φj(x) = w^T φ(x), which is still linear in the parameters w (linear regression).
- Usually φ0(x) = 1; in this case w0 is called the bias parameter.
- The basis functions φj(x) can be seen as performing feature extraction.
5
Examples (1/2)
- Polynomial basis functions: φj(x) = x^j. Limitation: they are global functions of x, so a change in one region of input space affects all others. Alternative: spline functions (piecewise polynomials).
- 'Gaussian' basis functions: φj(x) = exp(-(x - μj)^2 / (2 s^2)), where μj governs the location and s the spatial scale (no probabilistic interpretation is required, hence the quotes).
6
Examples (2/2)
- Sigmoidal basis functions: φj(x) = σ((x - μj)/s), where σ(a) = 1/(1 + e^-a). Equivalently one can use the 'tanh' function, since tanh(a) = 2σ(2a) - 1.
- Fourier basis functions: each basis function represents a specific frequency and has infinite spatial extent. Wavelets, in contrast, are localized in both space and frequency.
- In the rest of the chapter we will not assume a particular form of basis functions, so even the identity φ(x) = x is fine.
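To make the basis-function families above concrete, here is a minimal Python/NumPy sketch (not part of the original slides) that builds the design matrix Φ, with elements Φnj = φj(xn), for polynomial, Gaussian and sigmoidal bases; the centre grid and the width s are illustrative choices.

```python
import numpy as np

def polynomial_basis(x, M):
    """Phi[n, j] = x_n ** j for j = 0..M-1 (x**0 = 1 is the bias column)."""
    return np.vstack([x ** j for j in range(M)]).T

def gaussian_basis(x, centres, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), with a leading bias column."""
    g = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), g])

def sigmoidal_basis(x, centres, s):
    """Phi[n, j] = sigma((x_n - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    a = (x[:, None] - centres[None, :]) / s
    return np.hstack([np.ones((len(x), 1)), 1.0 / (1.0 + np.exp(-a))])

# Design matrices for N = 5 inputs on [-1, 1].
x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, M=4).shape)                          # (5, 4)
print(gaussian_basis(x, np.linspace(-1, 1, 9), s=0.2).shape)   # (5, 10)
print(sigmoidal_basis(x, np.linspace(-1, 1, 9), s=0.2).shape)  # (5, 10)
```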
7
Maximum likelihood and least squares (1/2)
- Suppose the target is given by an unknown deterministic function plus Gaussian noise: t = y(x, w) + ε, i.e. p(t | x, w, β) = N(t | y(x, w), β^-1), where β is the noise precision.
- If we assume a squared loss function, the optimal prediction is the conditional mean of t: E[t | x] = ∫ t p(t | x) dt = y(x, w).
- For inputs X = (x1, …, xN) and targets t = (t1, …, tN)^T, the likelihood is p(t | X, w, β) = Πn N(tn | w^T φ(xn), β^-1).
- x will always be a conditioning variable, so it can be omitted from the notation.
8
Maximum likelihood and least squares (2/2)
- Log likelihood: ln p(t | w, β) = (N/2) ln β - (N/2) ln(2π) - β ED(w), where ED(w) = (1/2) Σn {tn - w^T φ(xn)}^2 is the sum-of-squares error.
- Maximizing with respect to w gives the normal equations for the least squares problem: wML = (Φ^T Φ)^-1 Φ^T t, where Φ is the N×M design matrix with elements Φnj = φj(xn).
- Maximizing with respect to β gives 1/βML = (1/N) Σn {tn - wML^T φ(xn)}^2.
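A short sketch of the normal-equations solution above, assuming a Python/NumPy setting; the toy sinusoidal data and the cubic polynomial design matrix are illustrative, not from the slides. The pseudo-inverse is used instead of an explicit matrix inverse, which also copes with the nearly collinear case discussed on the next slide.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood: w_ML = (Phi^T Phi)^-1 Phi^T t (via the pseudo-inverse)
    and the ML noise precision, 1/beta_ML = mean squared residual."""
    w_ml = np.linalg.pinv(Phi) @ t
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
    return w_ml, beta_ml

# Toy usage: noisy sinusoid fitted with M = 4 polynomial basis functions.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = np.vstack([x ** j for j in range(4)]).T   # design matrix, Phi[n, j] = x_n**j
w_ml, beta_ml = fit_ml(Phi, t)
print(w_ml, beta_ml)
```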
9
Geometry of least squares
- Consider an N-dimensional space in which t is a vector. Each basis function, evaluated at the N data points, is also a vector in this space, denoted φj.
- If M < N, the φj span a linear subspace S of dimensionality M.
- Let y be the N-dimensional vector with elements y(xn, w); y is an arbitrary linear combination of the φj.
- The sum-of-squares error is then (1/2)||y - t||^2, so the least-squares solution is the orthogonal projection of t onto S.
- Numerical difficulties arise when the φj are nearly collinear; these can be addressed with the pseudo-inverse (SVD).
10
Sequential learning
- Stochastic (sequential) gradient descent: w^(τ+1) = w^(τ) - η ∇En, where η is the learning rate and En is the error of the n-th point.
- For the sum-of-squares error function this gives w^(τ+1) = w^(τ) + η (tn - w^(τ)T φ(xn)) φ(xn), known as the Least Mean Squares (LMS) algorithm.
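A minimal sketch of the LMS update in NumPy; the learning rate, number of epochs and zero initialization are illustrative choices, not from the slides.

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_epochs=100):
    """Sequential (stochastic) gradient descent on the sum-of-squares error:
    w <- w + eta * (t_n - w^T phi_n) * phi_n, one data point at a time."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in range(N):
            w += eta * (t[n] - w @ Phi[n]) * Phi[n]   # LMS update for the n-th point
    return w
```

With a sufficiently small η the sequence of weight vectors approaches (a small neighbourhood of) the batch least-squares solution wML.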
11
Regularized least squares (1/2)
- Total error: ED(w) + λ EW(w), where λ controls the relative importance of the regularization term.
- Weight-decay regularizer: EW(w) = (1/2) w^T w.
- The total error is then minimized by w = (λI + Φ^T Φ)^-1 Φ^T t.
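A sketch of the closed-form regularized solution in NumPy, solving the linear system rather than forming an explicit inverse; λ = 0 recovers the unregularized maximum-likelihood solution.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares (weight decay): w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```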
12
Regularized least squares (2/2)
- More general regularizer: (λ/2) Σj |wj|^q. For q = 2 we recover the quadratic (weight-decay) regularizer; q = 1 is known as the lasso.
- Contours of the regularization term for M = 2: minimizing the regularized error is equivalent to minimizing ED(w) subject to the constraint Σj |wj|^q ≤ η, and for q = 1 (the lasso) this constraint drives many coefficients exactly to zero, giving a sparse solution.
13
Multiple outputs
- For K > 1 outputs, denoted by the target vector t, we could use different basis functions for each output, but it is common to use the same set: y(x, W) = W^T φ(x), where W is an M×K parameter matrix.
- In this case p(t | x, W, β) = N(t | W^T φ(x), β^-1 I), and for N observations collected in the N×K matrix T = (t1, t2, …, tN)^T the maximum-likelihood solution is WML = (Φ^T Φ)^-1 Φ^T T.
- This decouples over the different output variables: the k-th column of WML is wk = (Φ^T Φ)^-1 Φ^T tk, where tk is the k-th column of T.
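A small sketch of the multi-output solution, assuming the N targets are stacked as rows of an N×K matrix T; the check makes the column-wise decoupling explicit.

```python
import numpy as np

def fit_multi_output(Phi, T):
    """W_ML = (Phi^T Phi)^-1 Phi^T T for an N x K matrix of targets T."""
    W = np.linalg.pinv(Phi) @ T                              # shape (M, K)
    # Decoupling: column k of W depends only on the k-th column of T.
    assert np.allclose(W[:, 0], np.linalg.pinv(Phi) @ T[:, 0])
    return W
```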
14
Outline (next: The Bias-Variance Decomposition)
15
Expected loss (again)
- Assume the squared loss function, for which the optimal prediction is the conditional expectation h(x) = E[t | x] (the regression function, which is not known precisely). Note that p(t | x), once modeled, can be combined with any loss function.
- Expected squared loss: E[L] = ∫ {y(x) - h(x)}^2 p(x) dx + ∫∫ {h(x) - t}^2 p(x, t) dx dt.
- The first term is what we want to minimize by our choice of y(x); the second integrand does not depend on y(x): it is the intrinsic noise in the data and cannot be reduced.
16
A frequentist treatment
- Suppose y(x) = y(x, w). For a given data set D of size N we obtain an estimate of w, and hence a prediction function y(x; D).
- Imagine an ensemble of data sets of size N, each drawn independently; we can then consider the average prediction ED[y(x; D)].
- The expected squared loss then decomposes as: expected loss = (bias)^2 + variance + noise, where
  (bias)^2 = ∫ {ED[y(x; D)] - h(x)}^2 p(x) dx,
  variance = ∫ ED[{y(x; D) - ED[y(x; D)]}^2] p(x) dx,
  noise = ∫∫ {h(x) - t}^2 p(x, t) dx dt.
17
The bias-variance trade-off
- (bias)^2: how much the average prediction over all data sets differs from the desired regression function.
- variance: how much the individual predictions for single data sets vary around their average (sensitivity to the particular data set).
- noise: the intrinsic variability of the targets, which cannot be reduced.
- Flexible models have low bias and high variance; rigid models have high bias and low variance. The best predictive performance is a trade-off between the two.
18
Example (1/3)
- L = 100 sinusoidal data sets, each with N = 25 points. For each data set D^(l) we fit a model with 24 Gaussian basis functions plus a constant term (so M = 25) by minimizing the regularized error for various values of λ.
- Figures: the individual fits y^(l)(x), l = 1..100, and their average over the 100 fits, for several values of λ.
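A sketch of this experiment in NumPy. The basis width, the noise standard deviation (0.3) and the grid of λ values are assumptions made for illustration; the slides specify only L = 100, N = 25 and M = 25.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, M = 100, 25, 25                        # data sets, points per set, parameters
centres = np.linspace(0, 1, M - 1)           # 24 Gaussian basis functions + bias
s = 0.1                                      # basis width (an assumption)

def design(x):
    g = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), g])

x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)               # the true regression function h(x)

for lam in [np.exp(2.6), np.exp(-0.31), np.exp(-2.4)]:
    preds = np.empty((L, len(x_test)))
    for l in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = design(x_test) @ w        # y^(l)(x) on a test grid
    y_bar = preds.mean(axis=0)               # average prediction over the L fits
    bias2 = np.mean((y_bar - h) ** 2)        # (bias)^2
    variance = np.mean((preds - y_bar) ** 2) # variance
    print(f"ln(lam) = {np.log(lam):+.2f}  bias^2 = {bias2:.4f}  variance = {variance:.4f}")
```

Large λ gives high bias and low variance; small λ gives the opposite, reproducing the trade-off described above.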
19
Example (2/3)
20
Example (3/3)
- In practice the integrals are approximated by averages over the data points and the L fits: (bias)^2 ≈ (1/N) Σn {ȳ(xn) - h(xn)}^2 and variance ≈ (1/N) Σn (1/L) Σl {y^(l)(xn) - ȳ(xn)}^2, where ȳ(x) = (1/L) Σl y^(l)(x).
- Plotting (bias)^2, variance and their sum against ln λ, the optimal value is ln λ = -0.31, which coincides with the minimum of the test-set error.
- The bias-variance decomposition is of limited practical use: it requires an ensemble of data sets, and if we had multiple data sets we would usually merge them into one larger training set.
21
Outline (next: Bayesian Linear Regression)
22
Introduction
- Maximum likelihood tends to lead to over-fitting when the model is too complex.
- Adding a regularization term is one way to alleviate the problem; holding out validation data is another, provided we have enough data.
- Bayesian learning is an alternative approach that avoids over-fitting without the need for held-out data.
23
Parameter distribution (1/2)
- With β known and X omitted, the likelihood is p(t | w) = Πn N(tn | w^T φ(xn), β^-1), the exponential of a quadratic function of w.
- The conjugate prior over w is therefore Gaussian: p(w) = N(w | m0, S0).
- Using Bayes' theorem for Gaussian variables, the posterior is also Gaussian: p(w | t) = N(w | mN, SN), with mN = SN (S0^-1 m0 + β Φ^T t) and SN^-1 = S0^-1 + β Φ^T Φ.
- Since the posterior is Gaussian, its mode coincides with its mean, so wMAP = mN.
- The posterior can serve as the prior for the next batch of data, which allows sequential learning.
24
Parameter distribution (2/2)
- In the following we consider a zero-mean isotropic Gaussian prior, p(w | α) = N(w | 0, α^-1 I), which leads to mN = β SN Φ^T t and SN^-1 = αI + β Φ^T Φ.
- Limiting cases: for α → 0 the posterior mean reduces to the maximum-likelihood solution (wMAP = wML); for α → ∞ the posterior reverts to the prior, p(w | t, α) = p(w | α).
- Log of the posterior: ln p(w | t) = -(β/2) Σn {tn - w^T φ(xn)}^2 - (α/2) w^T w + const, i.e. a sum-of-squares error function plus a quadratic regularization term with λ = α/β.
25
Example: Straight-line fitting
- Linear model: y(x, w) = w0 + w1 x. Actual function (to be learnt): f(x, a) = a0 + a1 x, with a0 = -0.3 and a1 = 0.5.
- Synthetic data: draw xn uniformly from U(x | -1, 1), evaluate f(xn, a), and add Gaussian noise with standard deviation 0.2.
- We assume the noise variance is known, so β = (1/0.2)^2 = 25. We also set the prior precision to α = 2.0.
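A sketch of the posterior computation for this straight-line example, using the values given above (a0 = -0.3, a1 = 0.5, α = 2.0, β = 25); the number of observations is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
a0, a1 = -0.3, 0.5
alpha, beta = 2.0, 25.0                      # prior precision, (known) noise precision

N = 20                                       # number of observations (illustrative)
x = rng.uniform(-1, 1, N)
t = a0 + a1 * x + rng.normal(0, 0.2, N)

Phi = np.column_stack([np.ones(N), x])       # phi(x) = (1, x): straight-line model
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ t                 # posterior mean (= w_MAP)
print(m_N)                                   # approaches (-0.3, 0.5) as N grows
```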
27
Predictive distribution
- Usually we are not interested in the posterior p(w | t) itself but in the predictive distribution p(t | x, t, α, β) = ∫ p(t | x, w, β) p(w | t, α, β) dw.
- This evaluates to p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)), where σN^2(x) = 1/β + φ(x)^T SN φ(x).
- The first term of σN^2(x) is the noise in the data; the second reflects the uncertainty in w and shrinks to zero as N → ∞.
28
Example (1/2)
- Data generated from y = sin(2πx) plus Gaussian noise; the model is a linear combination of Gaussian basis functions.
- In the figures, the red lines show the means of the predictive distributions and the shaded areas span one standard deviation around them.
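A sketch of the predictive distribution for a Gaussian-basis model fitted to noisy sin(2πx) data, as in this example; the number of basis functions, their width and the value of α are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, (1 / 0.2) ** 2            # prior and noise precisions (assumed)
centres, s = np.linspace(0, 1, 9), 0.15      # 9 Gaussian basis functions + bias

def phi(x):
    x = np.atleast_1d(x)
    g = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), g])

x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

x_new = np.linspace(0, 1, 100)
P = phi(x_new)
mean = P @ m_N                               # predictive mean, m_N^T phi(x) (red line)
var = 1 / beta + np.sum(P @ S_N * P, axis=1) # predictive variance sigma_N^2(x)
std = np.sqrt(var)                           # plotted as the shaded band
```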
29
Example (2/2)
- Plots of y(x, w) for various samples of w drawn from the posterior p(w | t).
- The predictive distribution integrates over all such individual functions y(x, w).
30
Equivalent kernel (1/2)
- Combining y(x, w) = w^T φ(x) with w = mN = β SN Φ^T t gives y(x, mN) = Σn k(x, xn) tn, where k(x, x') = β φ(x)^T SN φ(x') is called the smoother matrix or equivalent kernel.
- The predictive mean is thus a weighted sum of the training targets: a linear smoother.
- Figure: k(x, x') for 11 Gaussian basis functions over [-1, 1], evaluated at 200 values of x equally spaced over the same interval.
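A sketch that computes the equivalent kernel for 11 Gaussian basis functions on [-1, 1] and 200 equally spaced inputs, matching the figure description above; α, β and the basis width are illustrative assumptions.

```python
import numpy as np

alpha, beta, s = 2.0, 25.0, 0.1              # assumed precisions and basis width
centres = np.linspace(-1, 1, 11)             # 11 Gaussian basis functions

def phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

x_grid = np.linspace(-1, 1, 200)             # 200 equally spaced inputs
Phi = phi(x_grid)
S_N = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)

K = beta * Phi @ S_N @ Phi.T                 # K[i, j] = k(x_i, x_j)
row = K[100]                                 # k(x, x') for x = x_grid[100] (near 0)
print(x_grid[np.argmax(row)])                # the kernel peaks at x' close to x
print(row.sum())                             # smoother weights sum to roughly 1
```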
31
Equivalent kernel (2/2)
- Other types of basis functions: k(x, x') for x = 0, plotted as a function of x', for polynomial and sigmoidal basis functions.
- General result: the equivalent kernel is a localized function around x even when the basis functions themselves are not localized, so nearby training points receive higher weight.
- Covariance between the predictions: cov[y(x), y(x')] = φ(x)^T SN φ(x') = β^-1 k(x, x'), so predictions at nearby points are highly correlated, while those at more distant points are less so.
32
Properties of the equivalent kernel
- The prediction at a new point x is a weighted sum of the targets, y(x, mN) = Σn k(x, xn) tn, and the weights sum (approximately) to one: Σn k(x, xn) ≈ 1.
- The kernel can be written in the form of an inner product: k(x, z) = ψ(x)^T ψ(z), where ψ(x) = β^(1/2) SN^(1/2) φ(x).
33
Outline (next: Bayesian Model Comparison)
34
Introduction
- Over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values.
- Models can then be compared directly on the training data, without a separate validation set.
35
Probability of a model
- Suppose we wish to compare a set of L models {Mi}, where a model refers to a probability distribution over the observed data D. Then p(Mi | D) ∝ p(Mi) p(D | Mi), where p(D | Mi) is the model evidence or marginal likelihood.
- The Bayes factor for two models is the ratio of their evidences, p(D | Mi) / p(D | Mj).
- Note: here a model is defined by the type and number of its basis functions; p(D | Mi) = ∫ p(D | w, Mi) p(w | Mi) dw marginalizes over the parameters of the model.
36
Predictive distribution over models
- The full predictive distribution is a mixture over models: p(t | x, D) = Σi p(t | x, Mi, D) p(Mi | D).
- Approximation: use only the most probable model (model selection).
- Model evidence: p(D | Mi) = ∫ p(D | w, Mi) p(w | Mi) dw.
37
Example
- Suppose a single model parameter w, and suppose that both the prior and the posterior over w are flat (rectangular), with widths Δw_prior and Δw_posterior.
- Then p(D) ≈ p(D | wMAP) (Δw_posterior / Δw_prior), so ln p(D) ≈ ln p(D | wMAP) + ln(Δw_posterior / Δw_prior); the second term is negative, since the posterior is narrower than the prior.
- With M parameters: ln p(D) ≈ ln p(D | wMAP) + M ln(Δw_posterior / Δw_prior). The (negative) penalty term grows in magnitude with M, penalizing complex models.
38
Models of different complexity
- Model evidence can favor models of intermediate complexity: a simple model can generate only a limited range of data sets, while a very complex one spreads its predictive probability thinly over many possible data sets, so for the observed data set a model of intermediate complexity often has the highest evidence.
39
Remarks on Bayesian Model Comparison
- If the correct model is among those compared, Bayesian model comparison favors the correct model on average.
- The model evidence is sensitive to the prior over the model's parameters.
- In practice it is always wise to keep aside an independent test set of data for the final evaluation.
40
Outline (next: The Evidence Approximation)
41
Evidence approximation (1/2)
- There are two hyperparameters: the noise precision β and the prior precision α of w.
- A fully Bayesian treatment marginalizes over all parameters (x omitted): p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ, which is analytically intractable.
- Approximation (the evidence approximation): if p(α, β | t) is sharply peaked around values α̂ and β̂, then p(t | t) ≈ p(t | t, α̂, β̂) = ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw.
42
Evidence approximation (2/2)
- We need to estimate α̂ and β̂; their values will, of course, depend on the training data: p(α, β | t) ∝ p(t | α, β) p(α, β).
- If the prior p(α, β) is relatively flat, the peak is found by maximizing the marginal likelihood p(t | α, β).
- Two approaches for maximizing this log likelihood: analytically (our approach here) or via expectation maximization.
43
Evaluation of the evidence function
- The evidence function is obtained by integrating over the weight parameters: p(t | α, β) = ∫ p(t | w, β) p(w | α) dw.
- After several steps we find: ln p(t | α, β) = (M/2) ln α + (N/2) ln β - E(mN) - (1/2) ln |A| - (N/2) ln(2π), where A = αI + β Φ^T Φ (= SN^-1), mN = β A^-1 Φ^T t, and E(mN) = (β/2) ||t - Φ mN||^2 + (α/2) mN^T mN.
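A sketch of this log evidence in NumPy, followed by a comparison of polynomial orders in the spirit of the next slide; the synthetic data are illustrative, while α = 5×10^-3 and β = 11.1 are values quoted elsewhere in the slides.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear-Gaussian model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi            # A = S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Compare polynomial models of increasing order by their evidence.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
for M in range(1, 10):
    Phi = np.vstack([x ** j for j in range(M)]).T         # M basis functions x^0..x^(M-1)
    print(M, round(float(log_evidence(Phi, t, alpha=5e-3, beta=11.1)), 2))
```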
44
Example: Polynomial regression (again)
- Plot of the model evidence ln p(t | α, β) versus the polynomial order M, for the sinusoidal underlying function, with α = 5×10^-3.
45
Maximizing the evidence function wrt α
- Define the eigenvalues λi of β Φ^T Φ, i.e. (β Φ^T Φ) ui = λi ui.
- Setting the derivative of ln p(t | α, β) with respect to α to zero gives α = γ / (mN^T mN), where γ = Σi λi / (α + λi).
- Since γ and mN themselves depend on α, this is solved by iterative estimation between α, γ and mN.
46
Maximizing the evidence function wrt β
- Setting the derivative with respect to β to zero gives the iterative estimate 1/β = (1/(N - γ)) Σn {tn - mN^T φ(xn)}^2.
- If both α and β need to be determined, their values must be re-estimated together after each update of γ.
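A sketch of the joint iterative re-estimation of α and β, with γ recomputed at each step, assuming NumPy; the initial values and iteration count are arbitrary.

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Re-estimate alpha and beta by maximizing the evidence:
    gamma = sum_i lambda_i / (alpha + lambda_i),
    alpha = gamma / (m_N^T m_N),
    1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma)."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)     # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig0                      # eigenvalues of beta * Phi^T Phi
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        gamma = np.sum(lam / (alpha + lam))    # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma, m_N

# Usage, e.g. with a design matrix Phi and targets t built as in the earlier sketches:
# alpha_hat, beta_hat, gamma, m_N = evidence_maximization(Phi, t)
```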
47
Effective number of parameters (1/2)
- The eigenvalues λi measure the curvature of the likelihood function along the eigenvector directions ui (a rotated parameter space aligned with the ui).
- In the illustration 0 < λ1 < λ2: the direction with large λ (here w2) is well determined by the data, while the direction with small λ (here w1) is not and is set mainly by the prior.
- The ratio λi / (α + λi) is close to 1 for well-determined directions (λi >> α) and close to 0 otherwise, so γ = Σi λi / (α + λi) is the effective total number of well-determined parameters.
48
Effective number of parameters (2/2)
- Compare the evidence-based estimate 1/β = (1/(N - γ)) Σn {tn - mN^T φ(xn)}^2 with the maximum-likelihood estimate 1/βML = (1/N) Σn {tn - wML^T φ(xn)}^2.
- Recall (chapter 1) that the unbiased estimate of a Gaussian variance, where a single parameter (the mean) is fitted, is σ^2 = (1/(N - 1)) Σn (xn - μML)^2.
- The factor N - γ plays the same role: it corrects for the γ parameters that are effectively determined by (and fitted to) the data.
49
Example: Estimating α
- In the sinusoidal data set with 9 Gaussian basis functions we have M = 10 parameters. We set β to its true value (11.1) and determine α from the evidence.
- One plot shows γ (red) and α mN^T mN (blue) against ln α; the evidence-maximizing α is where the two curves intersect.
- A second plot shows ln p(t | α, β) (red) and the test-set error (blue) against ln α.
50
Example: γ versus the wi's
- Plot of the individual parameters wi against the effective number of well-determined parameters γ.
- As α ranges from ∞ down to 0, γ ranges from 0 up to M: a smaller α (broader prior) leaves more parameters free to be determined by the data, while a larger α shrinks them toward zero.
51
Large data sets
- For N >> M all parameters are well determined and γ ≈ M. The re-estimation equations then simplify to α = M / (mN^T mN) and 1/β = (1/N) Σn {tn - mN^T φ(xn)}^2.
- No eigenvalues need to be computed, but iterations are still needed.
52
Outline (next: Limitations of Fixed Basis Functions)
53
Limitations
- The basis functions φj(x) (both their form and their number) are fixed before the training data are observed.
- The number of basis functions needed grows exponentially with the dimensionality D of the input space.
- More flexible models are therefore needed: neural networks, support vector machines.
- However, real data sets typically have two helpful properties: the data lie close to a nonlinear manifold of much lower intrinsic dimensionality, and the target values often depend only on a subset of the input variables.