1
Machine learning, pattern recognition and statistical data modelling
Lecture 12: The last lecture
Coryn Bailer-Jones
2
What is machine learning?
Data description and interpretation
- finding simpler relationships between variables (predictors and responses)
- finding natural groups or classes in data
- relating observables to physical quantities
Prediction
- capturing the relationship between "inputs" and "outputs" for a set of labelled data, with the goal of predicting outputs for unlabelled data ("pattern recognition")
Learning from data
- dealing with noise
- coping with high dimensions (many potentially relevant variables)
- fitting models to data
- generalizing
3
Concepts: types of problems
Supervised learning
- predictors (x) and responses (y)
- infer P(y | x), perhaps modelled as f(x; w)
- discrete y is a classification problem; real-valued y is regression
Unsupervised learning
- no distinction between predictors and responses
- infer P(x), or things about it, e.g.
  - no. of modes/classes (mixture modelling, peak finding)
  - low-dimensional projections (descriptions) (PCA, SOM, MDS)
  - outlier detection (discovery)
4
Concepts: probabilities and Bayes
Bayes' theorem:
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, since $p(y, x) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
- $p(x \mid y)$: likelihood of x given y
- $p(y)$: prior over y
- $p(y \mid x)$: posterior probability of y given x
Two levels of inference (learning):
1. Prediction: x = predictors (input), y = response(s) (output)
2. Model fitting: x = data, y = model parameters
Conditioning on a model H:
$p(y \mid x, H) = \frac{p(x \mid y, H)\, p(y \mid H)}{p(x \mid H)}$, where the denominator is 'just' a normalization constant,
$p(x \mid H) = \int p(x \mid y, H)\, p(y \mid H)\, dy$: the evidence for the model
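A minimal numerical sketch of the prediction-level use of Bayes' rule; the priors and the Gaussian class likelihoods are invented for illustration only:

```python
import numpy as np

# Toy two-class example: y in {0, 1}, scalar predictor x.
prior = np.array([0.7, 0.3])
mean = np.array([0.0, 2.0])
sigma = np.array([1.0, 1.0])

def likelihood(x, k):
    """p(x | y = k) under the assumed Gaussian class-conditional model."""
    return np.exp(-(x - mean[k])**2 / (2 * sigma[k]**2)) / (sigma[k] * np.sqrt(2 * np.pi))

x = 1.2
joint = np.array([likelihood(x, k) * prior[k] for k in range(2)])   # p(x | y) p(y)
evidence = joint.sum()                                              # p(x)
posterior = joint / evidence                                        # p(y | x)
print(posterior)
```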
5
Concepts: solution procedure
- need some kind of expression for P(y | x) or P(x), e.g. f(x; w) = P(y | x)
- parametric, semi-parametric, or non-parametric, e.g. for density estimation and nonlinear regression:
  - parametric: Gaussian distribution P(x), spline f(x)
  - semi-parametric: sum of several Gaussians, additive model, local regression
  - non-parametric: k-nn, kernel estimate
- parametric models: fit to data
  - need to infer the adjustable parameters, w, from the data
  - generally minimize a loss function on a labelled data set w.r.t. w
  - compare different models
6
Concepts: objective function
Different loss functions are suitable for continuous (regression) or discrete (classification) problems. Let $f(\mathbf{x}_i)$ be the (real-valued) model prediction for target $y_i$, with residual $r_i = y_i - f(\mathbf{x}_i)$:
- residual sum of squares (RSS): $\sum_i (y_i - f(\mathbf{x}_i))^2$
- L-norm: $\sum_i |y_i - f(\mathbf{x}_i)|^L$
- exponential: $\sum_i \exp(-y_i f(\mathbf{x}_i))$
- $\epsilon$-insensitive: $\sum_i v_i$ where $v_i = 0$ if $|r_i| \le \epsilon$, and $|r_i| - \epsilon$ otherwise
- Huber: $\sum_i v_i$ where $v_i = r_i^2/2$ if $|r_i| \le c$, and $c|r_i| - c^2/2$ otherwise
For discrete outputs (e.g. via argmax) we have 0-1 loss and cross-entropy.
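A short numpy sketch of these loss functions; the thresholds eps and c are example values, not prescribed ones:

```python
import numpy as np

def rss(y, f):
    """Residual sum of squares."""
    return np.sum((y - f)**2)

def l_norm(y, f, L=1):
    """Sum of absolute residuals raised to the power L."""
    return np.sum(np.abs(y - f)**L)

def exponential(y, f):
    """Exponential loss, for labels y in {-1, +1} and a real-valued score f."""
    return np.sum(np.exp(-y * f))

def eps_insensitive(y, f, eps=0.1):
    """Zero inside the epsilon tube, linear outside it."""
    r = np.abs(y - f)
    return np.sum(np.where(r <= eps, 0.0, r - eps))

def huber(y, f, c=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f)
    return np.sum(np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2))
```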
7
Loss functions
8
Models: linear modelling (linear least squares)
Data: $\{\mathbf{x}_i, y_i\}$, $\mathbf{x} = (x_1, x_2, \ldots, x_j, \ldots, x_p)$
Model: $\mathbf{y} = \mathbf{X}\beta$
Least squares solution: $\hat\beta = \arg\min_\beta \sum_{i=1}^N \big(y_i - \sum_{j=1}^p x_{i,j}\beta_j\big)^2$
In matrix form this is $RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$;
minimize w.r.t. $\beta$ and the solution is $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
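A toy numpy sketch of the normal-equation solution; the data and coefficients are invented:

```python
import numpy as np

# Toy data: N points in p dimensions with known coefficients plus noise.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal-equation solution beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# np.linalg.lstsq(X, y, rcond=None) gives the same fit with better numerical behaviour.
print(beta_hat)
```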
9
Concepts: maximum likelihood (as a loss function)
Let $f(\mathbf{x}_i \mid \mathbf{w})$ be the function estimate for $y_i$. The probability of getting these for all $N$ training points is
$p(\mathrm{Data} \mid \mathbf{w}) = \prod_{i=1}^N p(f(\mathbf{x}_i) \mid \mathbf{w}) \equiv L$
$L$ is the likelihood. In practice we minimize the negative log likelihood
$E = -\ln L = -\sum_{i=1}^N \ln p(f(\mathbf{x}_i) \mid \mathbf{w})$
If we assume that the model predictions follow an i.i.d. Gaussian distribution about the true values, then
$p(f(\mathbf{x}_i) \mid \mathbf{w}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(f(\mathbf{x}_i) - y_i)^2}{2\sigma^2}\Big)$
$-\ln L(\mathbf{w}) = \frac{N}{2}\ln 2\pi + N\ln\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^N \big(f(\mathbf{x}_i \mid \mathbf{w}) - y_i\big)^2$
i.e. ML with constant (unknown) noise corresponds to minimizing RSS w.r.t. the model parameters $\mathbf{w}$.
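A small numeric check of this equivalence, with toy residuals and an assumed fixed noise level sigma:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 50, 0.2
y = rng.normal(size=N)
f = y + sigma * rng.normal(size=N)      # stand-in model predictions

rss = np.sum((f - y)**2)
neg_log_L = (N / 2) * np.log(2 * np.pi) + N * np.log(sigma) + rss / (2 * sigma**2)
# For fixed sigma, -ln L = RSS / (2 sigma^2) + constants, so maximizing the
# likelihood is the same as minimizing the residual sum of squares.
print(neg_log_L)
```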
10
Concepts: generalization and regularization
- given a specific set of data, we nonetheless want a general solution
- therefore, we must make some kind of assumption(s):
  - smoothness in functions
  - priors on model parameters (or functions, or predictions)
  - restricting the model space
- regularization involves a free parameter, although this can also be inferred from the data
11
Models: penalized linear modelling (ridge regression)
Data: $\{\mathbf{x}_i, y_i\}$, $\mathbf{x} = (x_1, x_2, \ldots, x_j, \ldots, x_p)$
Model: $\mathbf{y} = \mathbf{X}\beta$
Penalized least squares solution: $\hat\beta = \arg\min_\beta \Big\{ \sum_{i=1}^N \big(y_i - \sum_{j=1}^p x_{i,j}\beta_j\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \Big\}$ where $\lambda \ge 0$
In matrix form this is $RSS(\beta, \lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\,\beta^T\beta$;
minimize w.r.t. $\beta$ and the solution is $\hat\beta_{\rm ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
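A sketch of the closed-form ridge solution on the same kind of toy data as the least-squares example above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.0))    # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))   # larger lam shrinks the coefficients
```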
12
Models: ridge regression (as regularization)
$\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{A}\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T\mathbf{y}$
The squared singular values $d_j^2$ measure the variance of the data projected onto the principal directions $\mathbf{v}_j$.
$df(\lambda) = \mathrm{tr}(\mathbf{A}) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}$
- the regularization projects the data onto the principal components and downweights ("shrinks") them, with the low-variance directions shrunk the most
- it limits the model space
- one free parameter: a large $\lambda$ implies a large degree of regularization, and $df(\lambda)$ is small
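A short sketch of computing the effective degrees of freedom from the singular values (toy matrix):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom df(lam) = sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values d_j of X
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(0).normal(size=(50, 5))
print(ridge_df(X, 0.0))      # equals p when lam = 0
print(ridge_df(X, 100.0))    # shrinks towards 0 as lam grows
```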
13
Models: ridge regression vs. df(λ)
© Hastie, Tibshirani, Friedman (2001)
14
Models: splines © Hastie, Tibshirani, Friedman (2001)
15
Concepts: regularization (in splines)
Avoid knot selection by taking all data points as knots. Avoid overfitting via regularization, that is, minimize a penalized sum of squares
$RSS(f, \lambda) = \sum_{i=1}^N \big(y_i - f(x_i)\big)^2 + \lambda \int \big(f''(t)\big)^2\, dt$
where $f$ is the fitting function with continuous second derivatives.
$\lambda = 0 \Rightarrow f$ is any function which interpolates the data (could be wild)
$\lambda = \infty \Rightarrow$ straight-line least squares fit (no second derivative tolerated)
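scipy's smoothing splines expose the same trade-off through a smoothing factor s (a residual budget rather than the λ above), so the following is only an analogous sketch on invented data:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)

# s = 0 forces interpolation through every point (the "wild" limit);
# a large s buys a much smoother fit.
wiggly = UnivariateSpline(x, y, s=0)
smooth = UnivariateSpline(x, y, s=50)
print(wiggly(5.0), smooth(5.0))
```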
16
Concepts: regularization (in smoothing splines)
The solution is a cubic spline with knots at each of the $x_i$, i.e.
$f(x) = \sum_{j=1}^N h_j(x)\,\theta_j$
The penalized residual sum of squares (error) to be minimized is
$RSS(\theta, \lambda) = (\mathbf{y} - \mathbf{H}\theta)^T(\mathbf{y} - \mathbf{H}\theta) + \lambda\,\theta^T\mathbf{N}\theta$
where $\mathbf{H}_{ij} = h_j(x_i)$ and $\mathbf{N}_{jk} = \int h''_j(t)\, h''_k(t)\, dt$
The solution is $\hat\theta = (\mathbf{H}^T\mathbf{H} + \lambda\mathbf{N})^{-1}\mathbf{H}^T\mathbf{y}$
Compare to ridge regression: $\hat\beta_{\rm ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
17
Concepts: regularization (in smoothing splines)
18
Concepts: regularization in ANNs and SVMs
In a feedforward neural network, regularization can be done with weight decay:
$E = \frac{1}{2}\sum_n^N \sum_k (O_{k,n} - T_{k,n})^2 + \frac{\lambda}{2}\sum w^2$
In SVMs the regularization comes in the initial formulation (margin maximization), with the error (loss) function as the constraint:
$E = \|\mathbf{w}\|^2 + C\sum_i^n \xi_i$  s.t.  $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \ge 0$, $\xi_i \ge 0$
The regularization parameter is $1/C$.
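A minimal sketch of the weight-decay error term, with toy outputs, targets and weights:

```python
import numpy as np

def weight_decay_error(outputs, targets, weights, lam):
    """E = 0.5 * sum_n sum_k (O_kn - T_kn)^2 + 0.5 * lam * sum w^2."""
    data_term = 0.5 * np.sum((outputs - targets)**2)
    penalty = 0.5 * lam * np.sum(weights**2)
    return data_term + penalty

# Toy numbers: 4 patterns, 2 outputs, 10 network weights.
rng = np.random.default_rng(3)
O, T, w = rng.normal(size=(4, 2)), rng.normal(size=(4, 2)), rng.normal(size=10)
print(weight_decay_error(O, T, w, lam=0.01))
```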
19
Concepts: model comparison and selection
Cross validation
- n-fold, leave-one-out, generalized
- compare and select models using just the training set
- accounts for model complexity plus bias from a finite-sized training set
Bayes Information Criterion / Akaike Information Criterion
- $AIC = -2\ln L + 2k$
- $BIC = -2\ln L + k\ln N$
- $k$ is the no. of parameters; $N$ is the no. of training vectors
- the smallest BIC or AIC corresponds to the optimal model
Bayesian evidence for model (hypothesis) H, P(D | H)
- the probability that the data arise from the model, marginalized over all model parameters
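A sketch of comparing two fits with AIC and BIC; the log-likelihoods and parameter counts are hypothetical numbers:

```python
import numpy as np

def aic(log_L, k):
    return -2.0 * log_L + 2.0 * k

def bic(log_L, k, N):
    return -2.0 * log_L + k * np.log(N)

# Hypothetical fits: (maximized log-likelihood, no. of parameters) on N = 200 points.
N = 200
for log_L, k in [(-310.0, 3), (-305.0, 8)]:
    print(k, aic(log_L, k), bic(log_L, k, N))
```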
20
Concepts: Occam's razor and Bayesian evidence
D = data, H = hypothesis (model), w = model parameters
$p(\mathbf{w} \mid D, H_i) = \frac{p(D \mid \mathbf{w}, H_i)\, p(\mathbf{w} \mid H_i)}{p(D \mid H_i)}$, i.e. Posterior = (Likelihood × Prior) / Evidence
A simpler model, H1, predicts less of the data space, so the evidence naturally penalizes more complex models.
(after MacKay 1992)
21
Concepts: curse of dimensionality
- to retain density, the no. of vectors must grow exponentially with the no. of dimensions
- generally we cannot do this
- overcome the curse in various ways:
  - make assumptions: structured regression
  - limit the model space: generalized additive models, basis functions and kernels
22
Models: basis expansions
$p$-dimensional data: $\mathbf{X} = (X_1, X_2, \ldots, X_j, \ldots, X_p)$
Basis expansion: $f(\mathbf{X}) = \sum_{m=1}^M \beta_m h_m(\mathbf{X})$
- linear model: $h_m(X) = X_m$, $m = 1, \ldots, p$
- quadratic terms: $h_m(X) = X_j X_k$
- higher-order terms
- other transformations, e.g. $h_m(X) = \log(X_j)$, $\sqrt{X_j}$
- split the range with an indicator function: $h_m(X) = I(L_m \le X_j \le U_m)$
- generalized additive models: $h_m(X) = h_m(X_m)$, $m = 1, \ldots, p$
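A small sketch of building a design matrix from a few example basis functions (the particular transforms and range are arbitrary choices):

```python
import numpy as np

def basis_expansion(X):
    """Stack example basis functions h_m(X) as columns:
    linear terms, a quadratic cross-term, a log transform and an indicator."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        x1, x2,                                       # h_m(X) = X_m
        x1 * x2,                                      # h_m(X) = X_j X_k
        np.log(np.abs(x1) + 1.0),                     # a nonlinear transformation
        ((0.0 <= x2) & (x2 <= 1.0)).astype(float),    # I(L_m <= X_j <= U_m)
    ])

X = np.random.default_rng(4).normal(size=(6, 2))
H = basis_expansion(X)   # a linear model is then fitted in these expanded features
print(H.shape)
```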
23
Models: MLP neural network basis functions
$y_k = \sum_{j=1}^J w_{j,k} H_j$
$H_j = g(v_j)$ where $v_j = \sum_{i=1}^p w_{i,j} x_i$
$g(v_j) = \frac{1}{1 + e^{-v_j}}$
$J$ sigmoidal basis functions
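A minimal forward pass implementing these equations (no bias terms, matching the formula above; the layer sizes and weights are toy values):

```python
import numpy as np

def mlp_forward(x, W_hidden, W_out):
    """y_k = sum_j w_jk * g(v_j) with v_j = sum_i w_ij * x_i and sigmoidal g."""
    v = x @ W_hidden                   # hidden pre-activations v_j
    H = 1.0 / (1.0 + np.exp(-v))       # sigmoid g(v_j)
    return H @ W_out                   # outputs y_k

rng = np.random.default_rng(5)
p, J, K = 3, 5, 2                      # inputs, hidden units, outputs
y = mlp_forward(rng.normal(size=p), rng.normal(size=(p, J)), rng.normal(size=(J, K)))
print(y)
```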
24
Models: radial basis function neural networks
$y_k(\mathbf{x}) = w_{k0} + \sum_{j=1}^J w_{j,k}\,\phi_j(\mathbf{x})$, where $\phi_j(\mathbf{x}) = \exp\!\Big(-\frac{\|\mathbf{x} - \mu_j\|^2}{2\sigma_j^2}\Big)$ are the radial basis functions.
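The corresponding RBF forward pass, again with toy centres, widths and weights:

```python
import numpy as np

def rbf_forward(x, centres, widths, w, w0):
    """y(x) = w0 + sum_j w_j * exp(-||x - mu_j||^2 / (2 sigma_j^2))."""
    phi = np.exp(-np.sum((centres - x)**2, axis=1) / (2.0 * widths**2))
    return w0 + phi @ w

rng = np.random.default_rng(6)
centres = rng.normal(size=(4, 2))          # J = 4 centres mu_j in 2 dimensions
y = rbf_forward(rng.normal(size=2), centres, np.ones(4), rng.normal(size=4), w0=0.1)
print(y)
```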
25
Concepts: optimization
With gradient information
- gradient descent
- add second-derivative (Hessian) information: Newton, quasi-Newton, Levenberg-Marquardt, conjugate gradients
- pure gradient methods get stuck in local minima: random restarts, committee/ensemble of models, momentum terms (non-gradient info.)
Without gradient information
- expectation-maximization (EM) algorithm
- simulated annealing
- genetic algorithms
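A sketch of plain gradient descent with a momentum term, on a toy quadratic error surface (learning rate, momentum and step count are example settings):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, momentum=0.9, n_steps=100):
    """Gradient descent with a momentum term; grad(w) must return dE/dw."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = momentum * v - lr * grad(w)   # momentum smooths successive steps
        w = w + v
    return w

# Toy quadratic E(w) = sum(w^2), gradient 2w, minimum at w = 0.
print(gradient_descent(lambda w: 2.0 * w, w0=[3.0, -2.0]))
```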
26
Concepts: marginalization (Bayes again)
We are often not interested in the actual model parameters, $\mathbf{w}$; these are just a means to an end. That is, we are interested in $P(y \mid \mathbf{x})$, whereas model inference gives $P(y \mid \mathbf{x}, \mathbf{w})$.
A Bayesian marginalizes over parameters of no interest:
$P(y \mid \mathbf{x}) = \int P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid \mathbf{x})\, d\mathbf{w}$
$P(\mathbf{w} \mid \mathbf{x})$ is the prior over the model weights (conditioned on the input data, but we could assume independence).
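A Monte Carlo sketch of this marginalization integral; the predictive model and the distribution over w are hypothetical stand-ins:

```python
import numpy as np

# Approximate P(y | x) = integral of P(y | x, w) P(w | x) dw by averaging the
# conditional predictions over samples of w.
rng = np.random.default_rng(7)

def p_y1_given_x_w(x, w):
    """Toy predictive model: probability of y = 1 as a logistic function of w * x."""
    return 1.0 / (1.0 + np.exp(-w * x))

w_samples = rng.normal(loc=1.0, scale=0.5, size=10_000)   # draws from an assumed P(w | x)
p_y1 = np.mean(p_y1_given_x_w(0.8, w_samples))            # marginalized prediction
print(p_y1)
```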