C.A.L. Bailer-Jones. Machine Learning. Model selection and combination
Machine learning, pattern recognition and statistical data modelling
Lecture 10. Model selection and combination
Coryn Bailer-Jones
Topics
● SVMs for regression
● How many clusters in mixture models?
  – model selection
  – BIC and AIC
● Combining weak learners
  – boosting
  – classification and regression trees
A reminder of Support Vector Machines
● SVMs operate on the principle of separating hyperplanes
  – maximize the margin
  – only data points near the margin are relevant (the support vectors)
● Nonlinearity via kernels
  – possible because data appear only as dot products
  – project data into a higher dimensional space
  – projection is only implicit (big computational saving)
● Solution
  – Lagrangian dual
  – convex cost function, therefore unique solution
SVMs for regression
In the Teff prediction problem using PS1 simulated data, the training set error increases monotonically with increasing ε for fixed C and gamma.
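A minimal sketch of the ε-insensitive loss commonly used in SVM regression, assuming the standard form max(0, |y − f(x)| − ε). The function name and the toy numbers are illustrative, not taken from the lecture. Residuals inside the ε tube cost nothing, so widening the tube leaves more of the training residuals unpenalized, which is consistent with a looser fit to the training set.

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Mean epsilon-insensitive loss: residuals with |r| <= eps cost nothing."""
    losses = [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


def n_outside_tube(y_true, y_pred, eps):
    """Number of points whose residual exceeds eps (these carry a penalty)."""
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) > eps)


# Toy targets and predictions (made up for illustration)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.7, 3.0, 4.6]

for eps in (0.0, 0.2, 0.5):
    print(eps,
          eps_insensitive_loss(y_true, y_pred, eps),
          n_outside_tube(y_true, y_pred, eps))
```

As ε grows, fewer points lie outside the tube and the penalized loss shrinks, so the optimizer has less reason to track the training targets closely.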
Mixture models
Gaussian mixture model applied to the geyser data
Application of the Mclust{mclust} program to the faithful{datasets} data set. See the R scripts on the lecture web site.
Model selection methods
● Two levels of inference
  – optimal parameters for a given model
  – optimal model. How do we choose between two different models, even different types (e.g. SVM or mixture model)?
  – both involve a fundamental trade-off between fit on training data and model complexity: within a given model we can include a regularization term
● One approach is cross validation
  – let predictive error be your guide
  – variants: k-fold, leave-one-out, generalized
  – strictly need a third data set for model comparison (second used for regularization parameter fixing)
  – CV is also slow and still depends on a specific (and finite) data set
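A small sketch of k-fold cross validation used to compare two models by their held-out error. Everything here is an illustrative assumption: the two candidate models (a constant and a least-squares line), the deterministic toy data, and the fold assignment by index modulo k. It uses the Python standard library only.

```python
def fit_constant(xs, ys):
    """Model 1: predict the training mean everywhere."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Model 2: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def kfold_mse(fit, xs, ys, k):
    """Mean squared error on held-out folds (fold of point i is i mod k)."""
    total, count = 0.0, 0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
        count += len(test)
    return total / count


# Toy data: a straight line plus a fixed alternating perturbation
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

cv_const = kfold_mse(fit_constant, xs, ys, k=5)
cv_line = kfold_mse(fit_line, xs, ys, k=5)
print("constant:", cv_const, "line:", cv_line)
```

The model with the lower held-out error is preferred. If the same folds were also used to tune a regularization parameter, a further independent set would be needed for an unbiased comparison.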
Model Selection: Akaike Information Criterion (AIC)
– smaller for better model fits
– expectation values taken w.r.t. truth
– C is constant over all models, g
– factor 2 for “historical” reasons
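The notes on this slide match the usual Kullback-Leibler motivation for the AIC. A sketch in standard notation follows; the notation is an assumption, not recovered from the slide: f is the true distribution, g(x | θ) a candidate model, L(θ̂) its maximized likelihood and k its number of free parameters.

```latex
\begin{align}
I(f, g) &= \int f(x) \ln \frac{f(x)}{g(x \mid \theta)} \, dx
         = \underbrace{E_f[\ln f(x)]}_{C} - E_f[\ln g(x \mid \theta)] \\
\mathrm{AIC} &= -2 \ln L(\hat{\theta}) + 2k
\end{align}
```

I(f, g) is smaller for models closer to the truth, and C does not depend on g, so only the second term matters when comparing models. The AIC estimates that term from the maximized log likelihood, with 2k correcting for having fitted k parameters to the same data. The model with the smallest AIC is preferred.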
Model Selection: Bayesian Information Criterion (BIC)
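For reference, the BIC in its standard form, written in assumed notation: L(θ̂) is the maximized likelihood, k the number of free parameters and N the number of data points.

```latex
\begin{equation}
\mathrm{BIC} = -2 \ln L(\hat{\theta}) + k \ln N
\end{equation}
```

As with the AIC, the model with the smallest value is preferred. Since ln N > 2 once N > e² ≈ 7.4, the BIC penalizes extra parameters more heavily than the AIC for all but very small data sets.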
Mclust models (covariance parametrizations)
Model selection using R mclust package
Application of the Mclust{mclust} package to the faithful{datasets} data set. See the R scripts on the lecture web site.
Note that what mclust reports as the BIC is actually the negative of the BIC!
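A short sketch of the sign convention, written in Python rather than R so that it is self-contained. The log-likelihood values below are hypothetical, invented purely for illustration, and are not mclust output. The parameter counts assume an unconstrained covariance matrix per component in two dimensions (6G − 1 parameters for G components), and N = 272 is the size of the faithful data set. mclust defines its BIC as 2 ln L − k ln N, so there the largest value is best.

```python
import math


def aic(loglik, k):
    """Standard AIC: smaller is better."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Standard BIC: smaller is better."""
    return -2.0 * loglik + k * math.log(n)


def mclust_style_bic(loglik, k, n):
    """Sign convention used by mclust: larger is better."""
    return 2.0 * loglik - k * math.log(n)


# (label, maximized log likelihood, number of free parameters)
# HYPOTHETICAL log likelihoods, for illustration only
candidates = [("G=1", -1200.0, 5), ("G=2", -1130.0, 11), ("G=3", -1126.0, 17)]
n = 272  # number of observations in the faithful data set

best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n))[0]
best_by_mclust = max(candidates, key=lambda c: mclust_style_bic(c[1], c[2], n))[0]
best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
print(best_by_bic, best_by_mclust, best_by_aic)
```

Minimizing the standard BIC and maximizing the mclust-style value select the same model, since one is the negative of the other.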
Boosting: combining weak classifiers
© Hastie, Tibshirani, Friedman (2001)
A weak learner: classification trees
● N continuous or discrete input variables; multiple output classes
● make successive binary splits on a single variable
  – makes a hierarchical partitioning of the data space
  – fits a simple model (a constant!) to each partition, i.e. all objects in a partition are of a single class
● iteratively grow the tree by splitting each node on variable j at point s which reduces the loss (error) the most
● properties of resulting trees
  – partitioning can form non-contiguous regions for each class
  – class boundaries are parallel to axes
● regression trees
  – split by minimizing the sum of squares in each partition (fitted value in a partition is just the average)
● regularization
  – grow a large tree, then prune back
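The greedy split step for a regression tree can be sketched as an exhaustive search over variables j and split points s. This is a minimal illustration under stated assumptions: candidate split points are taken as midpoints between consecutive distinct values, the function names are made up, and the toy data are invented. It is not the rpart implementation.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty partition)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, loss) for the split x_j <= s with the smallest total sum of squares."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate split points: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            s = 0.5 * (lo + hi)
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            loss = sum_sq(left) + sum_sq(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


# Toy data: the response steps up when variable 0 exceeds 3.5; variable 1 is uninformative
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.0], [6.0, 6.0]]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

j, s, loss = best_split(X, y)
print("split on variable", j, "at", s, "with loss", loss)
```

Growing a tree repeats this search inside each resulting partition. The fitted value in each partition is the mean of its responses.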
Classification tree © Venables & Ripley (2002)
Regression tree © Hastie, Tibshirani, Friedman (2001)
Example of boosting a classification tree
Boosted tree performance © Hastie, Tibshirani, Friedman (2001)
Boosting
● boosting fits an additive model: each classifier is a basis function
● it is equivalent to doing “forward stagewise modelling”:
  – it adds a new term (basis function) without modifying existing ones
  – the next term is the one which minimizes the loss function over the choice of models
● it can be shown that it does this by using an exponential rather than square error loss
  – this is more logical anyway for indicator variables
  – see Hastie et al. sections 10.3 & 10.4 for a proof
● trees often used as the weak learner
  – they are fast to fit
● boosting shows good performance on many problems
● many variants on the basic model
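A compact sketch of boosting with a very weak learner, following the AdaBoost.M1 recipe as given in Hastie et al. The weak learner here is a decision stump (a tree with a single split) on one input variable. The toy data, the function names and the choice to renormalize the weights each round are illustrative assumptions.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` for x <= threshold, else -polarity."""
    return polarity if x <= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the smallest weighted error."""
    values = sorted(set(xs))
    # One threshold below all points (a constant classifier) plus all midpoints
    thresholds = [values[0] - 0.5] + [0.5 * (a + b) for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for p in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, p) != y)
            if best is None or err < best[0]:
                best = (err, t, p)
    return best


def adaboost(xs, ys, n_rounds):
    """Fit n_rounds stumps; return a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, t, p = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = math.log((1.0 - err) / err)
        ensemble.append((alpha, t, p))
        # Up-weight the misclassified points, then renormalize to sum to 1
        weights = [w * math.exp(alpha) if stump_predict(x, t, p) != y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1


def training_errors(xs, ys, n_rounds):
    ensemble = adaboost(xs, ys, n_rounds)
    return sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)


# Toy data: the -1 class sits in the middle, so no single stump can separate it
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]

print("errors with 1 stump:", training_errors(xs, ys, 1))
print("errors with 3 stumps:", training_errors(xs, ys, 3))
```

Each round adds one new stump with weight alpha and leaves the earlier stumps untouched, which is the additive, stagewise structure described above.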
Forward stagewise additive modelling algorithm
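A sketch of the generic procedure in the notation of Hastie et al. The symbols are assumed from that text: b(x; γ) is a basis function with parameters γ, β its coefficient, L the loss function, and M the number of terms.

```latex
\begin{align*}
&\text{1. Initialise } f_0(x) = 0. \\
&\text{2. For } m = 1, \dots, M: \\
&\qquad (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma}
   \sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\bigr) \\
&\qquad f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)
\end{align*}
```

With the exponential loss L(y, f) = exp(−y f) and classifiers as basis functions, this procedure reproduces AdaBoost.M1 (Hastie et al., section 10.4).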
R packages for boosting
● ada
  – many types of boosting
  – nice manual/document (Culp et al. 2006) which covers the chisq example
● adabag
  – Adaboost.M1 and bagging with trees
● boost
● gbm
  – various loss functions, also for regression
● GAMBoost
  – fits GAMs using boosting
● rpart
  – classification and regression trees (basis for many boosting algorithms)
Vapnik-Chervonenkis (VC) dimension
● the effective number of parameters is not a very general measure of model complexity
● consider a class of functions f(x, α), indexed by parameters α, for separating two-class data
● a given set of points is shattered by this class if, no matter how the class labels are assigned, a member of this class can separate them
● the VC dimension of the class of functions is the largest number of points in some configuration which can be shattered by members of the class
  – it is not necessary that all configurations be shattered
● VC dimension is an alternative measure of the “capacity” of a model to fit complex data
Vapnik-Chervonenkis (VC) dimension: figures © Hastie, Tibshirani, Friedman (2001)
VC dimension of models
● Generally a linear indicator function in p dimensions has VC dimension p+1
● sin(αx) has an infinite VC dimension (see Burges 1998)
  – but note that four equally spaced points cannot be shattered
● k-nn also has infinite VC dimension
● SVMs have very large, even infinite VC dimension
● VC dimension measures the “capacity” of a model
  – larger VC dimension gives more flexibility...
  – ...but potentially poor performance (unconstrained)
● VC dimension can be used for model selection
  – it can be used to calculate an upper bound on the true error given the error on a (finite) training set (cf. AIC, BIC)
  – larger VC dimension sometimes means poorer generalization ability (bias/variance trade-off)
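Shattering can be checked by brute force for small point sets. The sketch below does this for the simplest case, a linear indicator function in one dimension, taken here as sign(s·(x − t)) with threshold t and orientation s = ±1. The function names are illustrative assumptions. It tries every assignment of class labels and asks whether some member of the class reproduces it.

```python
from itertools import product


def can_separate(points, labels):
    """True if some 1-D linear indicator sign(s * (x - t)) reproduces the labels."""
    values = sorted(set(points))
    # It is enough to try one threshold in each gap, plus one beyond each end
    thresholds = ([values[0] - 1.0]
                  + [0.5 * (a + b) for a, b in zip(values, values[1:])]
                  + [values[-1] + 1.0])
    for t in thresholds:
        for s in (+1, -1):
            predicted = [1 if s * (x - t) > 0 else -1 for x in points]
            if predicted == list(labels):
                return True
    return False


def is_shattered(points):
    """True if every possible labelling of the points can be separated."""
    return all(can_separate(points, labels)
               for labels in product((-1, 1), repeat=len(points)))


print(is_shattered([0.0, 1.0]))       # two points
print(is_shattered([0.0, 1.0, 2.0]))  # three points
```

Two points on a line can be shattered but three cannot, because the labelling (+, −, +) is out of reach of a single threshold. This gives a VC dimension of 2, in agreement with p+1 for p = 1.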
Summary
● model selection
  – account for model complexity and bias from a finite-sized training set
  – evaluate the error (log likelihood) on the training sample and apply a 'correction'
● classification and regression trees (CART)
  – greedy, top-down partitioning algorithm (then 'prune' back)
  – splits (partition boundaries) are parallel to the axes
  – constant fit to partitions
● boosting
  – combine weak learners (e.g. CART) to get a powerful additive model
  – iteratively build up models by reweighting data