1
Machine learning, pattern recognition and statistical data modelling
Lecture 12: The last lecture
Coryn Bailer-Jones
2
What is machine learning?
Data description and interpretation
- finding simpler relationships between variables (predictors and responses)
- finding natural groups or classes in data
- relating observables to physical quantities
Prediction
- capturing the relationship between "inputs" and "outputs" for a set of labelled data, with the goal of predicting outputs for unlabelled data ("pattern recognition")
Learning from data
- dealing with noise
- coping with high dimensions (many potentially relevant variables)
- fitting models to data
- generalizing
3
Concepts: types of problems
Supervised learning
- predictors (x) and responses (y)
- infer P(y | x), perhaps modelled as f(x; w)
- discrete y is a classification problem; real-valued y is regression
Unsupervised learning
- no distinction between predictors and responses
- infer P(x), or things about it, e.g.
  - no. of modes/classes (mixture modelling, peak finding)
  - low-dimensional projections (descriptions) (PCA, SOM, MDS)
  - outlier detection (discovery)
4
Concepts: probabilities and Bayes
Bayes' theorem:
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, since $p(y, x) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
- $p(x \mid y)$: likelihood of x given y
- $p(y)$: prior over y
- $p(y \mid x)$: posterior probability of y given x
Two levels of inference (learning):
1. Prediction: x = predictors (input), y = response(s) (output)
2. Model fitting: x = data, y = model parameters
Conditioning on a model H:
$p(y \mid x, H) = \frac{p(x \mid y, H)\, p(y \mid H)}{p(x \mid H)}$, where the denominator is 'just' a normalization constant,
$p(x \mid H) = \int p(x \mid y, H)\, p(y \mid H)\, dy$: the evidence for the model
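A minimal numerical sketch of the prediction-level use of Bayes' rule; the priors and the Gaussian class likelihoods are invented for illustration only:

```python
import numpy as np

# Toy two-class example: y in {0, 1}, scalar predictor x.
prior = np.array([0.7, 0.3])
mean = np.array([0.0, 2.0])
sigma = np.array([1.0, 1.0])

def likelihood(x, k):
    """p(x | y = k) under the assumed Gaussian class-conditional model."""
    return np.exp(-(x - mean[k])**2 / (2 * sigma[k]**2)) / (sigma[k] * np.sqrt(2 * np.pi))

x = 1.2
joint = np.array([likelihood(x, k) * prior[k] for k in range(2)])   # p(x | y) p(y)
evidence = joint.sum()                                              # p(x)
posterior = joint / evidence                                        # p(y | x)
print(posterior)
```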
5
Concepts: solution procedure
- need some kind of expression for P(y | x) or P(x), e.g. f(x; w) = P(y | x)
- parametric, semi-parametric, or non-parametric, e.g. for density estimation and nonlinear regression:
  - parametric: Gaussian distribution P(x), spline f(x)
  - semi-parametric: sum of several Gaussians, additive model, local regression
  - non-parametric: k-nn, kernel estimate
- parametric models: fit to data
  - need to infer the adjustable parameters, w, from the data
  - generally minimize a loss function on a labelled data set w.r.t. w
  - compare different models
6
Concepts: objective function
Different loss functions are suitable for continuous (regression) or discrete (classification) problems. Let $f(\mathbf{x}_i)$ be the (real-valued) model prediction for target $y_i$, with residual $r_i = y_i - f(\mathbf{x}_i)$:
- residual sum of squares (RSS): $\sum_i (y_i - f(\mathbf{x}_i))^2$
- L-norm: $\sum_i |y_i - f(\mathbf{x}_i)|^L$
- exponential: $\sum_i \exp(-y_i f(\mathbf{x}_i))$
- $\epsilon$-insensitive: $\sum_i v_i$ where $v_i = 0$ if $|r_i| \le \epsilon$, and $|r_i| - \epsilon$ otherwise
- Huber: $\sum_i v_i$ where $v_i = r_i^2/2$ if $|r_i| \le c$, and $c|r_i| - c^2/2$ otherwise
For discrete outputs (e.g. via argmax) we have 0-1 loss and cross-entropy.
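A short numpy sketch of these loss functions; the thresholds eps and c are example values, not prescribed ones:

```python
import numpy as np

def rss(y, f):
    """Residual sum of squares."""
    return np.sum((y - f)**2)

def l_norm(y, f, L=1):
    """Sum of absolute residuals raised to the power L."""
    return np.sum(np.abs(y - f)**L)

def exponential(y, f):
    """Exponential loss, for labels y in {-1, +1} and a real-valued score f."""
    return np.sum(np.exp(-y * f))

def eps_insensitive(y, f, eps=0.1):
    """Zero inside the epsilon tube, linear outside it."""
    r = np.abs(y - f)
    return np.sum(np.where(r <= eps, 0.0, r - eps))

def huber(y, f, c=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f)
    return np.sum(np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2))
```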
7
Loss functions
8
Models: linear modelling (linear least squares)
Data: $\{\mathbf{x}_i, y_i\}$, $\mathbf{x} = (x_1, x_2, \ldots, x_j, \ldots, x_p)$
Model: $\mathbf{y} = \mathbf{X}\beta$
Least squares solution: $\hat\beta = \arg\min_\beta \sum_{i=1}^N \big(y_i - \sum_{j=1}^p x_{i,j}\beta_j\big)^2$
In matrix form this is $RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$;
minimize w.r.t. $\beta$ and the solution is $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
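A toy numpy sketch of the normal-equation solution; the data and coefficients are invented:

```python
import numpy as np

# Toy data: N points in p dimensions with known coefficients plus noise.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal-equation solution beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# np.linalg.lstsq(X, y, rcond=None) gives the same fit with better numerical behaviour.
print(beta_hat)
```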
9
Concepts: maximum likelihood (as a loss function)
Let $f(\mathbf{x}_i \mid \mathbf{w})$ be the function estimate for $y_i$. The probability of getting these for all $N$ training points is
$p(\mathrm{Data} \mid \mathbf{w}) = \prod_{i=1}^N p(f(\mathbf{x}_i) \mid \mathbf{w}) \equiv L$
$L$ is the likelihood. In practice we minimize the negative log likelihood
$E = -\ln L = -\sum_{i=1}^N \ln p(f(\mathbf{x}_i) \mid \mathbf{w})$
If we assume that the model predictions follow an i.i.d. Gaussian distribution about the true values, then
$p(f(\mathbf{x}_i) \mid \mathbf{w}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(f(\mathbf{x}_i) - y_i)^2}{2\sigma^2}\Big)$
$-\ln L(\mathbf{w}) = \frac{N}{2}\ln 2\pi + N\ln\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^N \big(f(\mathbf{x}_i \mid \mathbf{w}) - y_i\big)^2$
i.e. ML with constant (unknown) noise corresponds to minimizing RSS w.r.t. the model parameters $\mathbf{w}$.
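A small numeric check of this equivalence, with toy residuals and an assumed fixed noise level sigma:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 50, 0.2
y = rng.normal(size=N)
f = y + sigma * rng.normal(size=N)      # stand-in model predictions

rss = np.sum((f - y)**2)
neg_log_L = (N / 2) * np.log(2 * np.pi) + N * np.log(sigma) + rss / (2 * sigma**2)
# For fixed sigma, -ln L = RSS / (2 sigma^2) + constants, so maximizing the
# likelihood is the same as minimizing the residual sum of squares.
print(neg_log_L)
```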
10
Concepts: generalization and regularization
- given a specific set of data, we nonetheless want a general solution
- therefore, we must make some kind of assumption(s):
  - smoothness in functions
  - priors on model parameters (or functions, or predictions)
  - restricting the model space
- regularization involves a free parameter, although this can also be inferred from the data
11
Models: penalized linear modelling (ridge regression)
Data: $\{\mathbf{x}_i, y_i\}$, $\mathbf{x} = (x_1, x_2, \ldots, x_j, \ldots, x_p)$
Model: $\mathbf{y} = \mathbf{X}\beta$
Penalized least squares solution: $\hat\beta = \arg\min_\beta \Big\{ \sum_{i=1}^N \big(y_i - \sum_{j=1}^p x_{i,j}\beta_j\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \Big\}$ where $\lambda \ge 0$
In matrix form this is $RSS(\beta, \lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\,\beta^T\beta$;
minimize w.r.t. $\beta$ and the solution is $\hat\beta_{\rm ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
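A sketch of the closed-form ridge solution on the same kind of toy data as the least-squares example above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.0))    # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))   # larger lam shrinks the coefficients
```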
12
Models: ridge regression (as regularization)
$\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{A}\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T\mathbf{y}$
The squared singular values $d_j^2$ measure the variance of the data projected onto the principal directions $\mathbf{v}_j$.
$df(\lambda) = \mathrm{tr}(\mathbf{A}) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}$
- the regularization projects the data onto the principal components and downweights ("shrinks") them, with the low-variance directions shrunk the most
- it limits the model space
- one free parameter: a large $\lambda$ implies a large degree of regularization, and $df(\lambda)$ is small
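A short sketch of computing the effective degrees of freedom from the singular values (toy matrix):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom df(lam) = sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values d_j of X
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(0).normal(size=(50, 5))
print(ridge_df(X, 0.0))      # equals p when lam = 0
print(ridge_df(X, 100.0))    # shrinks towards 0 as lam grows
```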
13
Models: ridge regression vs. df(λ)
© Hastie, Tibshirani, Friedman (2001)
14
Models: splines © Hastie, Tibshirani, Friedman (2001)
15
Concepts: regularization (in splines)
Avoid knot selection by taking all data points as knots. Avoid overfitting via regularization, that is, minimize a penalized sum of squares
$RSS(f, \lambda) = \sum_{i=1}^N \big(y_i - f(x_i)\big)^2 + \lambda \int \big(f''(t)\big)^2\, dt$
where $f$ is the fitting function with continuous second derivatives.
$\lambda = 0 \Rightarrow f$ is any function which interpolates the data (could be wild)
$\lambda = \infty \Rightarrow$ straight-line least squares fit (no second derivative tolerated)
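scipy's smoothing splines expose the same trade-off through a smoothing factor s (a residual budget rather than the λ above), so the following is only an analogous sketch on invented data:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)

# s = 0 forces interpolation through every point (the "wild" limit);
# a large s buys a much smoother fit.
wiggly = UnivariateSpline(x, y, s=0)
smooth = UnivariateSpline(x, y, s=50)
print(wiggly(5.0), smooth(5.0))
```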
16
Concepts: regularization (in smoothing splines)
The solution is a cubic spline with knots at each of the $x_i$, i.e.
$f(x) = \sum_{j=1}^N h_j(x)\,\theta_j$
The penalized residual sum of squares (error) to be minimized is
$RSS(\theta, \lambda) = (\mathbf{y} - \mathbf{H}\theta)^T(\mathbf{y} - \mathbf{H}\theta) + \lambda\,\theta^T\mathbf{N}\theta$
where $\mathbf{H}_{ij} = h_j(x_i)$ and $\mathbf{N}_{jk} = \int h''_j(t)\, h''_k(t)\, dt$
The solution is $\hat\theta = (\mathbf{H}^T\mathbf{H} + \lambda\mathbf{N})^{-1}\mathbf{H}^T\mathbf{y}$
Compare to ridge regression: $\hat\beta_{\rm ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix
17
Concepts: regularization (in smoothing splines)
18
Concepts: regularization in ANNs and SVMs
In a feedforward neural network, regularization can be done with weight decay:
$E = \frac{1}{2}\sum_n^N \sum_k (O_{k,n} - T_{k,n})^2 + \frac{\lambda}{2}\sum w^2$
In SVMs the regularization comes in the initial formulation (margin maximization), with the error (loss) function as the constraint:
$E = \|\mathbf{w}\|^2 + C\sum_i^n \xi_i$  s.t.  $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \ge 0$, $\xi_i \ge 0$
The regularization parameter is $1/C$.
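A minimal sketch of the weight-decay error term, with toy outputs, targets and weights:

```python
import numpy as np

def weight_decay_error(outputs, targets, weights, lam):
    """E = 0.5 * sum_n sum_k (O_kn - T_kn)^2 + 0.5 * lam * sum w^2."""
    data_term = 0.5 * np.sum((outputs - targets)**2)
    penalty = 0.5 * lam * np.sum(weights**2)
    return data_term + penalty

# Toy numbers: 4 patterns, 2 outputs, 10 network weights.
rng = np.random.default_rng(3)
O, T, w = rng.normal(size=(4, 2)), rng.normal(size=(4, 2)), rng.normal(size=10)
print(weight_decay_error(O, T, w, lam=0.01))
```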
19
Concepts: model comparison and selection
Cross validation
- n-fold, leave-one-out, generalized
- compare and select models using just the training set
- accounts for model complexity plus bias from a finite-sized training set
Bayes Information Criterion / Akaike Information Criterion
- $AIC = -2\ln L + 2k$
- $BIC = -2\ln L + k\ln N$
- $k$ is the no. of parameters; $N$ is the no. of training vectors
- the smallest BIC or AIC corresponds to the optimal model
Bayesian evidence for model (hypothesis) H, P(D | H)
- the probability that the data arise from the model, marginalized over all model parameters
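A sketch of comparing two fits with AIC and BIC; the log-likelihoods and parameter counts are hypothetical numbers:

```python
import numpy as np

def aic(log_L, k):
    return -2.0 * log_L + 2.0 * k

def bic(log_L, k, N):
    return -2.0 * log_L + k * np.log(N)

# Hypothetical fits: (maximized log-likelihood, no. of parameters) on N = 200 points.
N = 200
for log_L, k in [(-310.0, 3), (-305.0, 8)]:
    print(k, aic(log_L, k), bic(log_L, k, N))
```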
20
Concepts: Occam's razor and Bayesian evidence
D = data, H = hypothesis (model), w = model parameters
$p(\mathbf{w} \mid D, H_i) = \frac{p(D \mid \mathbf{w}, H_i)\, p(\mathbf{w} \mid H_i)}{p(D \mid H_i)}$, i.e. Posterior = (Likelihood × Prior) / Evidence
A simpler model, H1, predicts less of the data space, so the evidence naturally penalizes more complex models.
(after MacKay 1992)
21
Concepts: curse of dimensionality
- to retain density, the no. of vectors must grow exponentially with the no. of dimensions
- generally we cannot do this
- overcome the curse in various ways:
  - make assumptions: structured regression
  - limit the model space: generalized additive models, basis functions and kernels
22
Models: basis expansions
$p$-dimensional data: $\mathbf{X} = (X_1, X_2, \ldots, X_j, \ldots, X_p)$
Basis expansion: $f(\mathbf{X}) = \sum_{m=1}^M \beta_m h_m(\mathbf{X})$
- linear model: $h_m(X) = X_m$, $m = 1, \ldots, p$
- quadratic terms: $h_m(X) = X_j X_k$
- higher-order terms
- other transformations, e.g. $h_m(X) = \log(X_j)$, $\sqrt{X_j}$
- split the range with an indicator function: $h_m(X) = I(L_m \le X_j \le U_m)$
- generalized additive models: $h_m(X) = h_m(X_m)$, $m = 1, \ldots, p$
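A small sketch of building a design matrix from a few example basis functions (the particular transforms and range are arbitrary choices):

```python
import numpy as np

def basis_expansion(X):
    """Stack example basis functions h_m(X) as columns:
    linear terms, a quadratic cross-term, a log transform and an indicator."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        x1, x2,                                       # h_m(X) = X_m
        x1 * x2,                                      # h_m(X) = X_j X_k
        np.log(np.abs(x1) + 1.0),                     # a nonlinear transformation
        ((0.0 <= x2) & (x2 <= 1.0)).astype(float),    # I(L_m <= X_j <= U_m)
    ])

X = np.random.default_rng(4).normal(size=(6, 2))
H = basis_expansion(X)   # a linear model is then fitted in these expanded features
print(H.shape)
```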
23
Models: MLP neural network basis functions
$y_k = \sum_{j=1}^J w_{j,k} H_j$
$H_j = g(v_j)$ where $v_j = \sum_{i=1}^p w_{i,j} x_i$
$g(v_j) = \frac{1}{1 + e^{-v_j}}$
$J$ sigmoidal basis functions
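A minimal forward pass implementing these equations (no bias terms, matching the formula above; the layer sizes and weights are toy values):

```python
import numpy as np

def mlp_forward(x, W_hidden, W_out):
    """y_k = sum_j w_jk * g(v_j) with v_j = sum_i w_ij * x_i and sigmoidal g."""
    v = x @ W_hidden                   # hidden pre-activations v_j
    H = 1.0 / (1.0 + np.exp(-v))       # sigmoid g(v_j)
    return H @ W_out                   # outputs y_k

rng = np.random.default_rng(5)
p, J, K = 3, 5, 2                      # inputs, hidden units, outputs
y = mlp_forward(rng.normal(size=p), rng.normal(size=(p, J)), rng.normal(size=(J, K)))
print(y)
```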
24
Models: radial basis function neural networks
$y_k(\mathbf{x}) = w_{k0} + \sum_{j=1}^J w_{j,k}\,\phi_j(\mathbf{x})$, where $\phi_j(\mathbf{x}) = \exp\!\Big(-\frac{\|\mathbf{x} - \mu_j\|^2}{2\sigma_j^2}\Big)$ are the radial basis functions.
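The corresponding RBF forward pass, again with toy centres, widths and weights:

```python
import numpy as np

def rbf_forward(x, centres, widths, w, w0):
    """y(x) = w0 + sum_j w_j * exp(-||x - mu_j||^2 / (2 sigma_j^2))."""
    phi = np.exp(-np.sum((centres - x)**2, axis=1) / (2.0 * widths**2))
    return w0 + phi @ w

rng = np.random.default_rng(6)
centres = rng.normal(size=(4, 2))          # J = 4 centres mu_j in 2 dimensions
y = rbf_forward(rng.normal(size=2), centres, np.ones(4), rng.normal(size=4), w0=0.1)
print(y)
```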
25
Concepts: optimization
With gradient information
- gradient descent
- add second-derivative (Hessian) information: Newton, quasi-Newton, Levenberg-Marquardt, conjugate gradients
- pure gradient methods get stuck in local minima: random restarts, committee/ensemble of models, momentum terms (non-gradient info.)
Without gradient information
- expectation-maximization (EM) algorithm
- simulated annealing
- genetic algorithms
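A sketch of plain gradient descent with a momentum term, on a toy quadratic error surface (learning rate, momentum and step count are example settings):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, momentum=0.9, n_steps=100):
    """Gradient descent with a momentum term; grad(w) must return dE/dw."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = momentum * v - lr * grad(w)   # momentum smooths successive steps
        w = w + v
    return w

# Toy quadratic E(w) = sum(w^2), gradient 2w, minimum at w = 0.
print(gradient_descent(lambda w: 2.0 * w, w0=[3.0, -2.0]))
```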
26
Concepts: marginalization (Bayes again)
We are often not interested in the actual model parameters, $\mathbf{w}$; these are just a means to an end. That is, we are interested in $P(y \mid \mathbf{x})$, whereas model inference gives $P(y \mid \mathbf{x}, \mathbf{w})$.
A Bayesian marginalizes over parameters of no interest:
$P(y \mid \mathbf{x}) = \int P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid \mathbf{x})\, d\mathbf{w}$
$P(\mathbf{w} \mid \mathbf{x})$ is the prior over the model weights (conditioned on the input data, but we could assume independence).
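A Monte Carlo sketch of this marginalization integral; the predictive model and the distribution over w are hypothetical stand-ins:

```python
import numpy as np

# Approximate P(y | x) = integral of P(y | x, w) P(w | x) dw by averaging the
# conditional predictions over samples of w.
rng = np.random.default_rng(7)

def p_y1_given_x_w(x, w):
    """Toy predictive model: probability of y = 1 as a logistic function of w * x."""
    return 1.0 / (1.0 + np.exp(-w * x))

w_samples = rng.normal(loc=1.0, scale=0.5, size=10_000)   # draws from an assumed P(w | x)
p_y1 = np.mean(p_y1_given_x_w(0.8, w_samples))            # marginalized prediction
print(p_y1)
```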