Machine learning, pattern recognition and statistical data modelling
Lecture 6: Kernel methods and additive models
Coryn Bailer-Jones
Topics
- Think globally, act locally: kernel methods
- Generalized Additive Models (GAMs) for regression (classification next week)
- Confidence intervals
Kernel methods
In the first lecture we looked at kernel methods for density estimation, e.g. a Gaussian kernel of width h in d dimensions, estimated from N data points:

$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left( -\frac{\lVert x - x_n \rVert^2}{2h^2} \right)$$
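A minimal R sketch of this estimate for d = 1 (the sample, bandwidth h and evaluation grid below are illustrative choices, not from the lecture):

    ## Gaussian kernel density estimate, coded directly from the formula above
    set.seed(1)
    xn <- rnorm(50)                        # N = 50 data points
    h  <- 0.4                              # kernel width
    fhat <- function(x) mean(exp(-(x - xn)^2 / (2 * h^2))) / sqrt(2 * pi * h^2)
    xg <- seq(-4, 4, length.out = 200)
    plot(xg, sapply(xg, fhat), type = "l", ylab = "density estimate")
    lines(density(xn, bw = h), lty = 2)    # R's built-in estimator, for comparison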
K-NN kernel density estimation
Overcome the fixed kernel size by varying the search volume V until it contains K neighbours:

$$\hat{f}(x) = \frac{K}{NV}$$

where K = no. of neighbours, N = total no. of points and V = volume occupied by the K neighbours. © Bishop (1995)
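A one-dimensional sketch of this estimate (assuming V = 2 × distance to the K-th neighbour; the data and K are illustrative):

    ## k-NN density estimate f(x) = K/(N V)
    set.seed(1)
    xn <- rnorm(50)
    knn_density <- function(x, K = 10) {
      dK <- sort(abs(x - xn))[K]     # distance to the K-th nearest neighbour
      K / (length(xn) * 2 * dK)      # in 1D the volume is V = 2 dK
    }
    xg <- seq(-4, 4, length.out = 200)
    plot(xg, sapply(xg, knn_density), type = "l", ylab = "density estimate")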
One-dimensional kernel smoothers: k-NN

$$k\text{-NN smoother:}\qquad \widehat{E}(Y \mid X = x) = \hat{f}(x) = \mathrm{Ave}\left( y_i \mid x_i \in N_k(x) \right)$$

where $N_k(x)$ is the set of the $k$ points nearest to $x$ in (e.g.) squared distance. The drawback is that the estimator is not smooth in $x$. [Figure: $k = 30$.] © Hastie, Tibshirani, Friedman (2001)
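A minimal sketch of the k-NN smoother on toy data (k = 30 as in the figure); the steps in the resulting curve show the lack of smoothness:

    ## k-NN regression smoother: average the y_i of the k nearest x_i
    set.seed(2)
    x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
    knn_smooth <- function(x0, k = 30) {
      idx <- order(abs(x - x0))[1:k]   # the neighbourhood N_k(x0)
      mean(y[idx])                     # Ave(y_i | x_i in N_k(x0))
    }
    xg <- seq(0, 1, length.out = 200)
    plot(x, y); lines(xg, sapply(xg, knn_smooth))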
One-dimensional kernel smoothers: Epanechnikov
Instead give more distant points less weight, e.g. with the Nadaraya–Watson kernel-weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

using the Epanechnikov kernel

$$K_\lambda(x_0, x_i) = D\!\left( \frac{|x_i - x_0|}{\lambda} \right) \quad\text{where}\quad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Could generalize the kernel to have variable width:

$$K_\lambda(x_0, x_i) = D\!\left( \frac{|x_i - x_0|}{h_\lambda(x_0)} \right)$$

© Hastie, Tibshirani, Friedman (2001)
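A sketch of the Nadaraya–Watson estimate with the Epanechnikov kernel (toy data; λ = 0.2 as in the figure two slides on):

    ## Nadaraya-Watson kernel-weighted average
    set.seed(2)
    x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
    D  <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)   # Epanechnikov
    nw <- function(x0, lambda = 0.2) {
      w <- D(abs(x - x0) / lambda)   # K_lambda(x0, x_i)
      sum(w * y) / sum(w)            # weighted average of the y_i
    }
    xg <- seq(0, 1, length.out = 200)
    plot(x, y); lines(xg, sapply(xg, nw))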
Kernel comparison

$$\text{Epanechnikov:}\quad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

$$\text{Tri-cube:}\quad D(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

© Hastie, Tibshirani, Friedman (2001)
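The two profiles are easy to compare directly (illustrative plotting code):

    ## D(t) for both kernels; the tri-cube is flatter on top and smoother at |t| = 1
    t <- seq(-1.5, 1.5, length.out = 301)
    epanechnikov <- ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)
    tricube      <- ifelse(abs(t) <= 1, (1 - abs(t)^3)^3, 0)
    plot(t, epanechnikov, type = "l", ylab = "D(t)")
    lines(t, tricube, lty = 2)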
k-NN and Epanechnikov kernels [figure: k = 30, λ = 0.2]
- The Epanechnikov kernel has fixed width: bias approximately constant, variance not.
- k-NN has adaptive width: constant variance, bias varying as 1/density.
- Free parameters: k or λ.
© Hastie, Tibshirani, Friedman (2001)
Locally-weighted averages can be biased at boundaries, because the kernel is asymmetric at the boundary. © Hastie, Tibshirani, Friedman (2001)
Local linear regression
Solve a linear least-squares problem in a local region to predict at a single point $x_0$:

$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \right]^2, \qquad \hat{f}(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$$

Green points: effective kernel. © Hastie, Tibshirani, Friedman (2001)
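A sketch of this fit via kernel-weighted least squares (the toy data and bandwidth are assumptions; R's built-in loess(y ~ x, degree = 1) is the standard equivalent):

    ## local linear regression: weighted least squares around each target point
    set.seed(2)
    x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
    D <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)    # Epanechnikov
    loclin <- function(x0, lambda = 0.2) {
      w   <- D(abs(x - x0) / lambda)              # kernel weights K_lambda(x0, x_i)
      fit <- lm(y ~ x, weights = w)               # local weighted least squares
      predict(fit, newdata = data.frame(x = x0))  # evaluate the line at x0 only
    }
    xg <- seq(0, 1, length.out = 200)
    plot(x, y); lines(xg, sapply(xg, loclin))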
Local quadratic regression © Hastie, Tibshirani, Friedman (2001)
Bias-variance trade-off: higher-order local fits reduce bias at the cost of increased variance, especially at the boundary (see previous slide). © Hastie, Tibshirani, Friedman (2001)
Kernels in higher dimensions
- Kernel smoothing and local regression generalize to higher dimensions...
- ...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension.
- In general we need to make assumptions about the underlying data/true function and use structured regression/classification.
Generalized Additive Model
Could model a p-dimensional set of data using

$$Y(X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

The idea is to fit each 1D function separately and then provide an algorithm to iteratively combine them. Do this by minimizing the penalized residual sum of squares

$$\mathrm{PRSS} = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$$

Could use a variety of smoothers for each $f_j$, with the corresponding penalty; here we use cubic smoothing splines. To make the solution unique we must fix, e.g.,

$$\hat\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i \quad\text{in which case}\quad \sum_{i=1}^{N} f_j(x_{ij}) = 0 \;\;\forall j$$

Avoiding the curse: split the p-dimensional problem into p one-dimensional ones.
Backfitting algorithm for additive models
$S_j$ is a smoothing spline, fit as a function of the $x_{ij}$ to the partial residuals, i.e. to what should be explained by $f_j$. The re-centring of each $\hat{f}_j$ is in principle not required. © Hastie, Tibshirani, Friedman (2001)
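A minimal backfitting sketch for two predictors, with smooth.spline standing in for the smoother $S_j$ (the toy data and df = 6 are assumptions, not the lecture's choices):

    ## backfitting for y = alpha + f1(x1) + f2(x2) + noise
    set.seed(3)
    n  <- 200
    x1 <- runif(n); x2 <- runif(n)
    y  <- sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.2)
    alpha <- mean(y)               # fix alpha = mean(y) for uniqueness
    f1 <- f2 <- rep(0, n)
    for (iter in 1:20) {           # cycle until the f_j stabilise
      ## smooth the partial residuals: what f_j should explain
      f1 <- predict(smooth.spline(x1, y - alpha - f2, df = 6), x1)$y
      f1 <- f1 - mean(f1)          # re-centre so that sum_i f_1(x_i1) = 0
      f2 <- predict(smooth.spline(x2, y - alpha - f1, df = 6), x2)$y
      f2 <- f2 - mean(f2)
    }
    plot(x1, f1)                   # recovered f_1 (up to centring)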
Generalized additive models on the rock data
Application of the gam{gam} package to the rock data set (shipped with R in the datasets package). See the R scripts on the lecture web site.
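A sketch of the kind of fit those scripts perform (the model formula and smoothing choices here are assumptions, not necessarily the lecture's exact script):

    ## additive model for log permeability of the rock samples
    library(gam)           # the gam{gam} package
    data(rock)             # 48 rock samples: area, peri, shape -> perm
    fit <- gam(log(perm) ~ s(area) + s(peri) + s(shape), data = rock)
    summary(fit)
    par(mfrow = c(1, 3))
    plot(fit, se = TRUE)   # each fitted f_j with pointwise standard-error bands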
Confidence intervals with splines
The spline function estimate is

$$\hat{\mathbf{f}}(x) = \mathbf{H}\hat{\boldsymbol\theta} = \mathbf{H}\left( \mathbf{H}^T\mathbf{H} + \lambda\boldsymbol\Omega_N \right)^{-1} \mathbf{H}^T \mathbf{y} = \mathbf{S}_\lambda \mathbf{y}$$

The smoother matrix $\mathbf{S}_\lambda$ depends only on the $x_i$ and $\lambda$, but not on $\mathbf{y}$.

$$\mathbf{V} = \mathrm{Var}\!\left[ \hat{\mathbf{f}}(x) \right] = \mathbf{S}_\lambda \mathbf{S}_\lambda^T \sigma^2$$

$\sqrt{\mathrm{diag}(\mathbf{V})}$ gives the pointwise error estimates on either the training data or new data.
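These pointwise errors can be computed directly: the sketch below assembles $\mathbf{S}_\lambda$ column by column by smoothing the unit vectors $e_i$, then estimates $\sigma^2$ roughly from the residuals (toy data; all choices illustrative):

    ## pointwise standard errors of a smoothing-spline fit via S_lambda
    set.seed(4)
    n <- 100
    x <- sort(runif(n))
    y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
    dfree <- 8                                  # effective degrees of freedom
    S <- sapply(1:n, function(i) {
      e <- numeric(n); e[i] <- 1
      smooth.spline(x, e, df = dfree)$y         # S e_i = i-th column of S
    })
    fhat   <- S %*% y
    sigma2 <- sum((y - fhat)^2) / (n - dfree)   # rough residual variance estimate
    se     <- sqrt(diag(S %*% t(S)) * sigma2)   # sqrt(diag V)
    plot(x, y); lines(x, fhat)
    lines(x, fhat + 2 * se, lty = 2); lines(x, fhat - 2 * se, lty = 2)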
R packages for Generalized Additive Models
- gam{gam}: the same as the package implemented in S-PLUS
- gam{mgcv}: a variant on the above
- bruto{mda}: automatically selects between a smooth fit (cubic spline), a linear fit and omitting the variable altogether
Summary
Kernel methods
- improvements over nearest neighbours to reduce (or control) bias
- local linear and quadratic regression
Generalized Additive Models
- defeat (cheat?) the curse of dimensionality by dividing the problem into p one-dimensional fitting problems
- typically use kernel or spline smoothers
- fit by an iterative backfitting algorithm
MARS (multivariate adaptive regression splines)
- piecewise linear basis functions
- if pairwise interactions between dimensions are prevented, it is an additive model