The basic notions related to machine learning

Feature extraction
A vital step before the actual learning: we have to create the input feature vector.
The optimal feature set is task-dependent.
Ideally, the features are recommended by an expert of the given domain.
In practice, however, we (the engineers) usually have to choose them ourselves.
A good feature set contains few features, all of them relevant.
In many practical tasks it is not clear which features are relevant.
  E.g. influenza: fever is relevant, eye color is irrelevant, age ???
When we are unsure, why not simply include the feature?
It is not that simple: including irrelevant features makes learning more difficult for two reasons:
  The curse of dimensionality
  It introduces noise into the data, which many algorithms have difficulty handling

Curse of dimensionality
Too many features make learning more difficult.
Number of features = number of dimensions of the feature space.
Learning becomes harder in higher-dimensional spaces.
Example: consider the following simple algorithm.
  Learning: we divide the feature space into small hypercubes and count the examples falling into each. We label each cube with the class that has the most examples in it.
  Classification: a new test case is always assigned the label of the cube it falls into.
The number of cubes increases exponentially with the number of dimensions!
With a fixed number of examples, more and more cubes remain empty.
More and more examples are required to reach a given density of examples.
Real learning algorithms are cleverer, but the problem is the same.
More features require many more training examples.
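
The growth in the number of cubes can be illustrated with a minimal sketch. It assumes points drawn uniformly at random from the unit hypercube and 4 bins per dimension; the function name `empty_cube_fraction` and all parameter values are chosen only for illustration.

```python
import random

def empty_cube_fraction(n_examples, n_dims, bins_per_dim, seed=0):
    """Fraction of hypercubes that contain no example.

    Assumption: the examples are uniform random points in [0, 1)^n_dims.
    """
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_examples):
        # A cube is identified by one bin index per dimension.
        cube = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        occupied.add(cube)
    total_cubes = bins_per_dim ** n_dims
    return 1.0 - len(occupied) / total_cubes

for d in (1, 2, 3, 5, 8):
    frac = empty_cube_fraction(n_examples=1000, n_dims=d, bins_per_dim=4)
    print(f"dims={d}  cubes={4 ** d}  empty fraction={frac:.3f}")
```

With the number of examples fixed at 1000, the 8-dimensional grid already has 65536 cubes, so at least 98% of them must stay empty no matter how the points fall.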

The effect of irrelevant features
Irrelevant features may make the learning algorithms less efficient.
Example: the nearest neighbor method.
  Learning: we simply store the training examples.
  Classification: a new example is assigned the label of its nearest neighbor.
Good features: the points of the same class fall close to each other.
What if we include a noise-like feature? The points are randomly scattered along the new dimension, and the distance relations fall apart.
Most learning algorithms are cleverer, but their operation is also disturbed by an irrelevant (noise-like) feature.
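
A minimal sketch of this effect, under assumed toy data: one relevant feature that separates two classes well, plus an optional noise-like feature with a much larger scale. The data generator and its constants are hypothetical, not taken from the slides.

```python
import math
import random

def nearest_neighbor_label(train, x):
    """Return the label of the stored training example closest to x."""
    best = min(train, key=lambda example: math.dist(example[0], x))
    return best[1]

def accuracy(train, test):
    hits = sum(nearest_neighbor_label(train, x) == label for x, label in test)
    return hits / len(test)

def make_data(n, rng, with_noise):
    """Toy data (assumption): the class means are 0 and 3 on the relevant feature."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        features = [rng.gauss(3.0 * label, 0.5)]
        if with_noise:
            # Irrelevant feature: independent of the label, with a large range.
            features.append(rng.uniform(0.0, 1000.0))
        data.append((features, label))
    return data

rng = random.Random(0)
acc_clean = accuracy(make_data(100, rng, False), make_data(200, rng, False))
acc_noisy = accuracy(make_data(100, rng, True), make_data(200, rng, True))
print(f"relevant feature only: {acc_clean:.2f}")
print(f"with noise feature:    {acc_noisy:.2f}")
```

The noise feature dominates the Euclidean distance because of its scale, so the nearest neighbor is chosen almost independently of the class.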

Optimizing the feature space
We usually try to pick the best features manually.
But of course, there are also automatic methods for this.
Feature selection algorithms
  They retain M<N features from the original set of N features.
We can reduce the feature space not only by throwing away the less relevant features, but also by transforming the feature space.
Feature space transformation methods
  The new features are obtained as some combination of the old features.
  We usually also reduce the number of dimensions at the same time (the new feature space has fewer dimensions than the old one).
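
As one possible illustration of feature selection, the sketch below ranks the features by the absolute Pearson correlation between each feature and the class label and keeps the M best ones. This simple filter is an assumption made for the example; the slides do not name a specific selection algorithm, and the helper names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; defined as 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_features(X, y, m):
    """Return the indices of the m features most correlated with the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:m])

# Toy data (assumption): N = 3 features, the last one is constant.
X = [[0.1, 5.0, 7.0], [0.9, 3.0, 7.0], [0.2, 4.0, 7.0], [0.8, 4.0, 7.0]]
y = [0, 1, 0, 1]

kept = select_features(X, y, m=2)                   # retain M = 2 of N = 3
reduced = [[row[j] for j in kept] for row in X]     # reduced feature vectors
print(kept, reduced)
```

Scoring each feature on its own ignores interactions between features, which is one reason why more elaborate selection and transformation methods exist.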

Evaluating the trained model
Based on the training examples, the algorithm constructs a model (hypothesis) of the function (x1,…,xN) → c.
This model can estimate the value of the function for any (x1,…,xN).
Our main goal is not to learn the labels of the training samples perfectly, but to generalize to examples not seen during training.
How can we estimate the generalization ability?
We leave out a subset of the training examples during training → test set.
Evaluation:
  We evaluate the model on the test set → estimated class labels.
  We compare the estimated labels with the true labels.
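
The hold-out procedure can be sketched as follows. The toy data, the 20% test fraction and the threshold "model" are assumptions made only to keep the example self-contained.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the labeled examples and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(model, test_set):
    """Fraction of test examples whose estimated label differs from the true one."""
    wrong = sum(model(x) != c for x, c in test_set)
    return wrong / len(test_set)

# Toy data (assumption): the true label is 1 exactly when x > 5.
examples = [(x, int(x > 5)) for x in range(20)]
train_set, test_set = train_test_split(examples)

# Stand-in for training: a threshold chosen from the training set only.
threshold = max(x for x, c in train_set if c == 0)

def model(x):
    return int(x > threshold)

print(len(train_set), len(test_set), error_rate(model, test_set))
```

The test examples play no role in choosing the threshold, so the measured error rate is an estimate of the performance on unseen data.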

Evaluating the trained model 2
How to quantify the estimation error for a regression task:
Example: the algorithm outputs a straight line; the error is shown by the yellow arrows.
Summarizing the error indicated by the yellow arrows:
  Mean squared error, or
  Root-mean-squared error
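
For reference, the two measures named on this slide have the standard definitions below. The symbols are assumptions introduced here: n is the number of test samples, y_i the true target value and \hat{y}_i the model output for the i-th sample, so each y_i - \hat{y}_i corresponds to one arrow.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
              = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

The RMSE is expressed in the same unit as the target variable, which often makes it easier to interpret than the MSE.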

Evaluating the trained model 3
Quantifying the error for a classification task:
Simplest solution: the classification error rate
  error rate = (number of incorrectly classified test samples) / (number of all test samples)
More detailed error analysis: with the help of the confusion matrix.
  It helps us understand which classes are missed by the algorithm.
  It also allows us to define an error function that counts different mistakes with different weights.
  For this we can define a weight matrix over the cells of the confusion matrix.
“0-1 loss”: it weights the elements of the main diagonal by 0 and all other cells by 1.
  This is the same as the classification error rate.
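
A small sketch of the confusion matrix and of a weighted loss computed from it. The labels below are made up for illustration, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] = number of test samples of true class t classified as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def weighted_loss(matrix, weights):
    """Average cost per test sample; weights[t][p] is the cost of cell (t, p)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    cost = sum(matrix[t][p] * weights[t][p]
               for t in range(size) for p in range(size))
    return cost / total

# Hypothetical test labels for a 3-class task.
true_labels      = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
predicted_labels = [0, 0, 1, 1, 1, 2, 0, 2, 2, 2]
cm = confusion_matrix(true_labels, predicted_labels, n_classes=3)

# 0-1 loss: weight 0 on the main diagonal, weight 1 everywhere else.
zero_one = [[0 if t == p else 1 for p in range(3)] for t in range(3)]
print(cm)
print(weighted_loss(cm, zero_one))   # equals the classification error rate
```

Replacing `zero_one` with any other weight matrix gives an error function that penalizes some confusions more than others.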

Evaluating the trained model 4
We can also weight the different mistakes differently.
This is most common when we have only two classes.
Example: diagnosing an illness.
The cost matrix is of size 2x2:
  Error 1: false negative: the patient is ill, but the machine said no.
  Error 2: false positive: the machine said yes, but the patient is not ill.
  These have different costs!
Metrics: see fig.
Metrics preferred by doctors:
  Sensitivity: tp/(tp+fn)
  Specificity: tn/(tn+fp)
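
The two metrics, and a cost that weights the two error types differently, can be computed as in the sketch below. The counts and the two cost values are hypothetical numbers chosen for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of the ill patients that the machine detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of the healthy patients that the machine cleared."""
    return tn / (tn + fp)

def average_cost(fn, fp, n_samples, cost_fn, cost_fp):
    """Average cost per patient when the two error types cost differently."""
    return (fn * cost_fn + fp * cost_fp) / n_samples

# Hypothetical test results for 100 patients.
tp, fn, tn, fp = 8, 2, 85, 5

print(sensitivity(tp, fn))    # 8 / 10
print(specificity(tn, fp))    # 85 / 90
# Assumption: a missed illness costs ten times as much as a false alarm.
print(average_cost(fn, fp, n_samples=100, cost_fn=10.0, cost_fp=1.0))
```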

“No Free Lunch” theorem
There is no universal learning algorithm that outperforms all other algorithms on all possible tasks.
The optimal learning algorithm is always task-dependent.
For every learning algorithm one can find tasks on which it performs well and tasks on which it performs poorly.
Demonstration: the hypotheses of Method 1 and Method 2 on the same examples.
Which hypothesis is correct? It depends on the real distribution.

“No Free Lunch” theorem 2
Put another way: averaged over “all possible tasks”, the performance of all learning algorithms is the same.
OK, but then what is the point of constructing machine learning algorithms?
We should concentrate on one type of task rather than trying to solve all tasks with one algorithm!
It makes sense to look for a good algorithm for, e.g., speech recognition or face recognition.
You should be very careful when making claims like “algorithm A is better than algorithm B”.
Machine learning databases: for the objective evaluation of machine learning algorithms over a broad range of tasks.
  E.g. the UCI Machine Learning Repository

Generalization vs. overfitting
No Free Lunch theorem: we can never be sure that the trained model generalizes correctly to cases not seen during training.
But then, how should we choose among the possible hypotheses?
Experience: increasing the complexity of the model increases its flexibility, so it becomes more and more accurate on the training examples.
However, at some point its performance on the test examples starts dropping!
This phenomenon is called overfitting: after learning the general properties, the model starts to learn the peculiarities of the given finite training set.

The “Occam’s razor” heuristic
Experience: usually the simpler model generalizes better.
But of course, a model that is too simple is not good either.
Einstein: “Things should be explained as simple as possible. But no simpler.” This is practically the same as the Occam’s razor heuristic.
The optimal model complexity is different for each task.
How can we find the optimum point shown in the figure?
Theoretical approach: we formalize the complexity of a hypothesis.
Minimum Description Length principle: we seek the hypothesis h for which K(h,D)=K(h)+K(D|h) is minimal.
  K(h): the complexity of the hypothesis h
  K(D|h): the complexity of representing the data set D given the hypothesis h
  K(): Kolmogorov complexity

Bias and variance
Another formalism for a model being “too simple” or “too complex”, for the case of regression.
Example: we fit the red polynomial to the blue points; green is the optimal solution.
  Polynomial of too low degree: cannot fit the examples → bias
  Polynomial of too high degree: fits the examples, but oscillates in between them → variance
Formally: select a random training set D with n elements and run the training on it.
Repeat this many times, and analyze the expectation of the squared error between the approximation g(x,D) and the original function F(x) at a given point x.

Bias-variance trade-off
Bias: the difference between the average of the estimates and F(x).
  If it is not 0, then the model is biased: it has a tendency to over- or underestimate F(x).
  Increasing the model complexity (in our example, the degree of the polynomial) decreases the bias.
Variance: the variance of the estimates (their average squared difference from the average estimate).
  A large variance is not good (we get quite different estimates depending on the choice of D).
  Increasing the model complexity increases the variance.
Optimum: somewhere in between.
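
The two quantities fit together in the standard decomposition of the expected squared error at a point x. The notation is an assumption that follows the text: E_D denotes the expectation over the random choice of the training set D, g(x,D) is the estimate and F(x) the original function.

```latex
E_D\!\left[ \big( g(x,D) - F(x) \big)^2 \right]
  = \underbrace{\big( E_D[\, g(x,D) \,] - F(x) \big)^2}_{\text{bias}^2}
  + \underbrace{E_D\!\left[ \big( g(x,D) - E_D[\, g(x,D) \,] \big)^2 \right]}_{\text{variance}}
```

The cross term vanishes because E_D[ g(x,D) - E_D[g(x,D)] ] = 0, so a low expected error requires both terms to be small at the same time.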

Finding the optimal complexity: a practical approach
(Almost) all machine learning algorithms have meta-parameters.
These allow us to tune the complexity of the model.
  E.g. polynomial fitting: the degree of the polynomial
They are called meta-parameters (or hyperparameters) to distinguish them from the real parameters (e.g. for polynomials: the coefficients).
Different meta-parameter values result in slightly different models.
How can we find the optimal meta-parameters?
We separate a small validation (also called development) set from the training set.
Overall, our data is divided into train, dev and test sets.
We repeat the training on the train set several times with different meta-parameter values.
We evaluate the resulting models on the dev set (to estimate the red curve of the figure).
Finally, we evaluate the model that performed best on the dev set on the test set.
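
The train-dev-test procedure can be sketched for the polynomial example as follows. The sketch assumes NumPy is available; the target function, the noise level, the set sizes and the range of degrees are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical target function, chosen only for illustration.
    return np.sin(3.0 * x)

def sample(n):
    """Draw n noisy examples of the target function on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_function(x) + rng.normal(0.0, 0.2, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = sample(40)
x_dev, y_dev = sample(20)
x_test, y_test = sample(20)

# Meta-parameter: the degree. Real parameters: the fitted coefficients.
degrees = range(1, 9)
models = {d: np.polyfit(x_train, y_train, d) for d in degrees}
train_err = {d: mse(models[d], x_train, y_train) for d in degrees}
dev_err = {d: mse(models[d], x_dev, y_dev) for d in degrees}

# Choose the degree on the dev set, then report the error on the test set.
best_degree = min(degrees, key=lambda d: dev_err[d])
test_err = mse(models[best_degree], x_test, y_test)

for d in degrees:
    print(f"degree={d}  train MSE={train_err[d]:.4f}  dev MSE={dev_err[d]:.4f}")
print(f"selected degree={best_degree}  test MSE={test_err:.4f}")
```

The training error can only decrease as the degree grows, so it cannot be used to choose the degree. The test set is touched only once, after the choice has been made on the dev set.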