Steep learning curves Reading: Bishop Ch. 3.0, 3.1
Administrivia Reminder: Microsoft on campus for recruiting Next Mon, Feb 5 FEC141, 11:00 AM All welcome
Viewing and re-viewing Last time: (4)5 minutes of math: function optimization Measuring performance Today: Cross-validation Learning curves
Separation of train & test Fundamental principle (1st amendment of ML): Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!
Holdout data Usual to “hold out” a separate set of data for testing; not used to train classifier A.k.a., test set, holdout set, evaluation set, etc. E.g., accuracy measured on the training data is the training set (or empirical) accuracy; accuracy measured on the held-out data is the test set (or generalization) accuracy
Gotchas... What if you’re unlucky when you split data into train/test? E.g., all train data are class A and all test are class B? No “red” things show up in training data Best answer: stratification Try to make sure class (+feature) ratios are same in train/test sets (and same as original data) Why does this work?
Gotchas... What if you’re unlucky when you split data into train/test? E.g., all train data are class A and all test are class B? No “red” things show up in training data Almost as good: randomization Shuffle data randomly before split Why does this work?
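A minimal sketch of a stratified split, to make the two remedies above concrete. The function name `stratified_split`, the 25% test fraction, and the 80/20 toy labels are all assumptions for illustration; it stratifies on the class label only, not on feature values.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) with class ratios roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # randomize within each class
        n_test = round(len(idxs) * test_frac)  # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels sorted by class: a naive "first 75% / last 25%" split
# would put every B in the test set and none in the training set.
labels = ["A"] * 80 + ["B"] * 20
train_idx, test_idx = stratified_split(labels)
```

Plain shuffling gets the same ratios only in expectation; stratifying enforces them on every split, which matters most when the data set is small.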
Gotchas What if the data is small? N=50 or N=20 or even N=10 Can’t do perfect stratification Can’t get representative accuracy from any single train/test split
Gotchas No good answer Common answer: cross-validation Shuffle data vectors Break into k chunks Train on first k-1 chunks Test on last 1 Repeat, with a different chunk held-out Average all test accuracies together
Gotchas In code:
    for (i=0; i<k; ++i) {
        [Xtrain,Ytrain,Xtest,Ytest] = splitData(X,Y,N/k,i);
        model[i] = train(Xtrain,Ytrain);
        cvAccs[i] = measureAcc(model[i],Xtest,Ytest);
    }
    avgAcc = mean(cvAccs);
    stdAcc = stddev(cvAccs);
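The loop above could be made runnable roughly as follows. This is a sketch under stated assumptions: `train` and `measure_acc` stand in for whatever learner is being evaluated (here a hypothetical nearest-centroid classifier on one-dimensional features), and the synthetic Gaussian data exists only so the example executes.

```python
import random
import statistics

def train(X, y):
    """Hypothetical learner: one centroid (mean feature value) per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}

def measure_acc(model, X, y):
    """Fraction of points whose nearest centroid has the correct label."""
    hits = 0
    for x, label in zip(X, y):
        guess = min(model, key=lambda c: abs(x - model[c]))
        hits += (guess == label)
    return hits / len(y)

def cross_validate(X, y, k, seed=0):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)        # shuffle before partitioning
    folds = [idx[i::k] for i in range(k)]   # k nearly equal chunks
    accs = []
    for i in range(k):
        test_idx = folds[i]                 # hold out chunk i
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train([X[j] for j in train_idx], [y[j] for j in train_idx])
        accs.append(measure_acc(model,
                                [X[j] for j in test_idx],
                                [y[j] for j in test_idx]))
    return accs

# Synthetic two-class data (an assumption, not from the lecture).
rng = random.Random(1)
X = [rng.gauss(0, 1) for _ in range(30)] + [rng.gauss(3, 1) for _ in range(30)]
y = [0] * 30 + [1] * 30

cv_accs = cross_validate(X, y, k=5)
avg_acc = statistics.mean(cv_accs)
std_acc = statistics.stdev(cv_accs)
```

Every data point is used for testing exactly once and for training k-1 times, which is what makes the averaged accuracy less dependent on any single split.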
CV in pix [Figure: original data [X; y], random shuffle to [X’; y’], k-way partition into [X1’ Y1’], [X2’ Y2’], ..., [Xk’ Yk’], giving k train/test sets and k accuracies (e.g., 53.7%, 85.1%, 73.2%)]
But is it really learning? Now we know how well our models are performing But are they really learning? Maybe any classifier would do as well E.g., a default classifier (pick the most likely class) or a random classifier How can we tell if the model is learning anything?
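One way to answer that question is to compute the accuracy of the default classifier and compare against it. The sketch below assumes a hypothetical helper `majority_baseline_acc` and made-up class counts; a trained model is only interesting if it clearly beats this number.

```python
from collections import Counter

def majority_baseline_acc(y_train, y_test):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(label == majority for label in y_test) / len(y_test)

# Made-up, imbalanced labels for illustration.
y_train = ["A"] * 70 + ["B"] * 30
y_test = ["A"] * 16 + ["B"] * 4

baseline = majority_baseline_acc(y_train, y_test)
print(f"default-classifier accuracy: {baseline:.2f}")
```

With imbalanced classes the baseline can look deceptively high, so a model reporting, say, 80% accuracy on this data would have learned nothing beyond the class prior.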
The learning curve Train on successively larger fractions of data Watch how accuracy (performance) changes [Figure: three curves labeled Learning, Static classifier (no learning), and Anti-learning (forgetting)]
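A learning curve can be sketched as below. Everything named here is an assumption for illustration: the nearest-centroid `train`/`accuracy` pair stands in for a real learner, and the Gaussian data from `make_data` is synthetic.

```python
import random

def train(X, y):
    """Hypothetical learner: one centroid (mean feature value) per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}

def accuracy(model, X, y):
    """Fraction of points whose nearest centroid has the correct label."""
    hits = 0
    for x, label in zip(X, y):
        guess = min(model, key=lambda c: abs(x - model[c]))
        hits += (guess == label)
    return hits / len(y)

def learning_curve(X_train, y_train, X_test, y_test, fractions, seed=0):
    """Train on growing random subsets; always test on the same test set."""
    order = list(range(len(y_train)))
    random.Random(seed).shuffle(order)
    curve = []
    for frac in fractions:
        n = max(1, round(frac * len(order)))   # at least one training point
        subset = order[:n]
        model = train([X_train[i] for i in subset],
                      [y_train[i] for i in subset])
        curve.append((frac, accuracy(model, X_test, y_test)))
    return curve

def make_data(n_per_class, rng):
    """Synthetic two-class data (an assumption, not from the lecture)."""
    X = ([rng.gauss(0, 1) for _ in range(n_per_class)]
         + [rng.gauss(2, 1) for _ in range(n_per_class)])
    y = [0] * n_per_class + [1] * n_per_class
    return X, y

rng = random.Random(42)
X_train, y_train = make_data(50, rng)
X_test, y_test = make_data(20, rng)

fractions = [i / 10 for i in range(1, 11)]     # 10%, 20%, ..., 100%
curve = learning_curve(X_train, y_train, X_test, y_test, fractions)
for frac, acc in curve:
    print(f"{frac:.1f}  {acc:.3f}")
```

Plotting accuracy against the training fraction gives the curve: a rising line suggests learning, a flat line suggests the model is no better than a static classifier.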
Measuring variance Cross validation helps you get better estimate of accuracy for small data Randomization (shuffling the data) helps guard against poor splits/ordering of the data Learning curves help assess learning rate/asymptotic accuracy Still one big missing component: variance Definition: Variance of a classifier is the fraction of error due to the specific data set it’s trained on
Measuring variance Variance tells you how much you expect your classifier/performance to change when you train it on a new (but similar) data set E.g., take 5 samplings of a data source; train/test 5 classifiers Accuracies: 74.2, 90.3, 58.1, 80.6, 90.3 Mean accuracy: 78.7% Std dev of acc: 13.4% Variance is usually a function of both classifier and data source High variance classifiers are very susceptible to small changes in data
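The summary statistics for the five accuracies above can be reproduced with the standard library. Note that the quoted 13.4% matches the sample standard deviation (dividing by n-1), which is what `statistics.stdev` computes.

```python
import statistics

accs = [74.2, 90.3, 58.1, 80.6, 90.3]   # the five accuracies from the example

mean_acc = statistics.mean(accs)
std_acc = statistics.stdev(accs)        # sample std dev (divides by n - 1)

print(f"mean = {mean_acc:.1f}%, std = {std_acc:.1f}%")
# prints "mean = 78.7%, std = 13.4%"
```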
Putting it all together Suppose you want to measure the expected accuracy of your classifier, assess learning rate, and measure variance all at the same time?
    for (i=0; i<10; ++i) {                          // variance reps
        shuffle data
        do 10-way CV partition of data
        for each train/test partition {             // xval
            for (pct=0.1; pct<=0.9; pct+=0.1) {     // LC
                subsample pct fraction of training set
                train on subsample, test on test set
            }
        }
        avg across all folds of CV partition
        generate learning curve for this partition
    }
    get mean and std across all curves
Putting it all together “hepatitis” data
5 minutes of math... Decision trees make very few assumptions about data Don’t know anything about relations between instances, except sets induced by feature splits No sense of spatial/topological relations among data Often, our data is real, honest-to-Cthulhu, mathematically sound vector data As opposed to the informal sense of vector that I have used so far Often comes endowed with a natural inner product and norm
5 minutes of math Mathematicians like to study the properties of spaces in general From linear algebra, you’ve already met the notion of a vector space: Definition: a vector space, V, is a set of elements (vectors) plus a scalar field, F, such that the following properties hold: Vector addition: a + b ∈ V for all a, b ∈ V Scalar multiplication: c a ∈ V for all c ∈ F and a ∈ V Linearity; commutativity; associativity; etc.
5 minutes of math By themselves, vector spaces are only partially useful They get more useful when you add a norm and an inner product
5 minutes of math Definition: a norm, ||.||, is a function of a single vector ( ∈ V ) that returns a scalar ( ∈ F ) such that for all a, b ∈ V and c ∈ F : ||a|| ≥ 0, with ||a|| = 0 only when a = 0; ||c a|| = |c| ||a||; ||a+b|| ≤ ||a|| + ||b|| Intuition: norm gives you the length of a vector A vector space+norm ⇒ Banach space (*)
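As a concrete instance, the familiar Euclidean length satisfies these properties. The sketch below checks them numerically on a few hand-picked vectors; the helper name `euclid_norm` and the example vectors are assumptions, and a numeric spot check is an illustration, not a proof.

```python
import math

def euclid_norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

a = [3.0, 4.0]
b = [1.0, -2.0]
c = -2.0

# Non-negativity: the length is never negative, and zero only for the zero vector.
nonneg = euclid_norm(a) >= 0 and euclid_norm([0.0, 0.0]) == 0.0

# Homogeneity: scaling a vector by c scales its length by |c|.
scaled = [c * x for x in a]
homogeneous = math.isclose(euclid_norm(scaled), abs(c) * euclid_norm(a))

# Triangle inequality: the direct route is never longer than the detour.
summed = [x + y for x, y in zip(a, b)]
triangle = euclid_norm(summed) <= euclid_norm(a) + euclid_norm(b)

print(euclid_norm(a), nonneg, homogeneous, triangle)
```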
5 minutes of math Definition: an inner product, 〈 ∙, ∙ 〉, is a function of two vectors ( ∈ V ) that returns a scalar ( ∈ F ) such that: Symmetry Linearity in first variable Non-negativity Non-degeneracy A vector space+inner product ⇒ Hilbert space (*)
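Written out, the four properties take roughly the following form. This sketch assumes a real scalar field (over the complex numbers, symmetry becomes conjugate symmetry); the last line shows the norm that every inner product induces, which is how the two definitions connect.

```latex
\begin{align*}
\text{Symmetry:} \quad
  & \langle a, b \rangle = \langle b, a \rangle \\
\text{Linearity in first variable:} \quad
  & \langle c\,a + b,\; d \rangle = c\,\langle a, d \rangle + \langle b, d \rangle \\
\text{Non-negativity:} \quad
  & \langle a, a \rangle \ge 0 \\
\text{Non-degeneracy:} \quad
  & \langle a, a \rangle = 0 \iff a = 0 \\[4pt]
\text{Induced norm:} \quad
  & \|a\| = \sqrt{\langle a, a \rangle}
\end{align*}
```

For example, the ordinary dot product on real vectors satisfies all four properties, and its induced norm is the Euclidean length.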