MACHINE LEARNING 3. Supervised Learning
Learning a Class from Examples
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: positive (+) and negative (–) examples
Input representation (following expert suggestions): x1: price, x2: engine power; all other attributes are ignored.

Training set X
Class C
Assume a class model (an axis-aligned rectangle): (p1 ≤ price ≤ p2) and (e1 ≤ engine power ≤ e2)

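A minimal sketch of the rectangle class model, assuming a car is described by a (price, engine power) pair. The class name `RectangleHypothesis` and the numeric thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectangleHypothesis:
    """Axis-aligned rectangle: (p1 <= price <= p2) and (e1 <= engine power <= e2)."""
    p1: float
    p2: float
    e1: float
    e2: float

    def predict(self, price: float, engine_power: float) -> bool:
        # True means "family car" (positive), False means negative.
        return self.p1 <= price <= self.p2 and self.e1 <= engine_power <= self.e2


# Hypothetical thresholds, for illustration only.
h = RectangleHypothesis(p1=15_000, p2=30_000, e1=90, e2=160)
print(h.predict(22_000, 120))  # a point inside the rectangle
print(h.predict(60_000, 300))  # a point outside the rectangle
```

The four numbers fully determine the hypothesis, so learning the class amounts to choosing p1, p2, e1, and e2.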
S, G, and the Version Space
Most specific hypothesis: S
Most general hypothesis: G
Any h ∈ H between S and G is consistent with the training set; these hypotheses make up the version space (Mitchell, 1997).

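For the rectangle class, the most specific hypothesis S can be sketched as the tightest rectangle around the positive examples. The function names and the toy data below are assumptions made for illustration; G is not computed here because the most general rectangle is not unique in general.

```python
def most_specific_hypothesis(examples):
    """Return (p1, p2, e1, e2) of the tightest rectangle enclosing all positives.

    `examples` is a list of ((price, engine_power), label) pairs,
    where label is True for a positive example.
    """
    positives = [x for x, label in examples if label]
    if not positives:
        raise ValueError("need at least one positive example")
    prices = [p for p, _ in positives]
    powers = [e for _, e in positives]
    return min(prices), max(prices), min(powers), max(powers)


def is_consistent(rect, examples):
    """True if the rectangle labels every example exactly as given."""
    p1, p2, e1, e2 = rect
    return all(
        (p1 <= p <= p2 and e1 <= e <= e2) == label
        for (p, e), label in examples
    )


# Toy data (hypothetical): price in thousands, engine power in horsepower.
train = [
    ((18, 100), True), ((25, 140), True), ((21, 120), True),
    ((10, 60), False), ((40, 250), False),
]
S = most_specific_hypothesis(train)
print(S, is_consistent(S, train))
```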
Hypothesis class H
Error of h ∈ H on the training set X: the number of training examples that h misclassifies

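The error of a hypothesis on a labeled sample can be counted directly. The sketch below assumes a hypothesis is any callable that maps an input to True or False; the rectangle thresholds and the data are hypothetical.

```python
def empirical_error(h, examples):
    """Count the examples whose predicted label differs from the given label."""
    return sum(1 for x, r in examples if h(x) != r)


def h(x):
    # Hypothetical rectangle hypothesis over (price, engine_power).
    price, engine_power = x
    return 15 <= price <= 30 and 90 <= engine_power <= 160


sample = [
    ((20, 120), True),   # predicted positive, labeled positive
    ((50, 300), False),  # predicted negative, labeled negative
    ((28, 170), True),   # predicted negative, labeled positive: an error
    ((16, 95), False),   # predicted positive, labeled negative: an error
]
errors = empirical_error(h, sample)
print(errors)
```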
Generalization
Problem of generalization: how well will our hypothesis classify future examples?
In our example, a hypothesis is characterized by four numbers (p1, p2, e1, e2).
Choose the best one: it should include all positive examples and no negative ones.
There are infinitely many such hypotheses when the parameters are real-valued.

Doubt
In some applications, a wrong decision is very costly.
We may reject an instance (make no decision) if it falls between S (most specific) and G (most general).

VC Dimension
We assumed that H (the hypothesis space) includes the true class C.
H should be flexible enough, that is, have enough capacity, to include C.
We need some measure of the “flexibility” (complexity) of a hypothesis space.
We can try to increase the complexity of the hypothesis space.

VC Dimension
N points can be labeled in 2^N ways as +/–.
H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest such N.
An axis-aligned rectangle can shatter at most 4 points, so VC(H) = 4 for rectangles.

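Shattering by axis-aligned rectangles can be checked by brute force on small point sets. This is a sketch under one observation: if any rectangle is consistent with a labeling, then the bounding box of the positive points is consistent too. The point sets are hypothetical, and the second check only shows that one particular 5-point set is not shattered; it is not a proof for all 5-point sets.

```python
from itertools import product


def rectangle_consistent(points, labels):
    """True if some axis-aligned rectangle labels `points` exactly as `labels`."""
    positives = [p for p, lab in zip(points, labels) if lab]
    if not positives:
        return True  # a rectangle placed away from all points labels everything negative
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    # The bounding box of the positives is the smallest candidate; if it
    # captures a negative point, so does every rectangle containing the positives.
    return not any(
        x_lo <= x <= x_hi and y_lo <= y <= y_hi
        for (x, y), lab in zip(points, labels)
        if not lab
    )


def shatters(points):
    """True if every one of the 2^N labelings is realized by some rectangle."""
    return all(
        rectangle_consistent(points, labels)
        for labels in product([False, True], repeat=len(points))
    )


diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shatters(diamond))             # four points in a diamond layout
print(shatters(diamond + [(0, 0)]))  # the same points plus the center
```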
Probably Approximately Correct (PAC) Learning
Fix a target probability of classification error (planned future).
The actual error depends on the training sample (past).
We want the actual error probability (actual future) to be less than the target with high probability.

How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
Let’s calculate how many samples we need for S, the most specific hypothesis.
The error region between C and S consists of four strips; each strip has probability at most ε/4.
Pr that one instance misses a strip: 1 − ε/4
Pr that N instances miss a strip: (1 − ε/4)^N
Pr that N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
We require 1 − 4(1 − ε/4)^N ≥ 1 − δ, that is, 4(1 − ε/4)^N ≤ δ.
Using (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, which gives N ≥ (4/ε) log(4/δ).

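The bound N ≥ (4/ε) log(4/δ) is easy to evaluate numerically. The sketch below assumes the natural logarithm (which is what the exp step in the derivation implies); the function name `pac_sample_size` and the example values of ε and δ are hypothetical.

```python
import math


def pac_sample_size(epsilon, delta):
    """Smallest integer N with N >= (4 / epsilon) * ln(4 / delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((4 / epsilon) * math.log(4 / delta))


epsilon, delta = 0.1, 0.05
n = pac_sample_size(epsilon, delta)
print(n)
# Sanity check against the step before the exp approximation.
print(4 * (1 - epsilon / 4) ** n <= delta)
```

Because the bound depends on 1/ε linearly but on 1/δ only logarithmically, tightening the error target costs far more examples than tightening the confidence.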
Noise
Imprecision in recording the input attributes
Errors in labeling the data points (teacher noise)
Additional attributes not taken into account (hidden or latent), e.g. the same price and engine power with different labels because of color
The effect of these attributes is modeled as noise.
The class boundary might not be simple, requiring a more complicated hypothesis space/model.

Noise and Model Complexity
Use the simpler model because it is:
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Better at generalizing (lower variance; Occam’s razor)

Occam’s Razor
If the actual class is simple and there is mislabeling or noise, the simpler model will generalize better.
A simpler model results in more errors on the training set.
It will generalize better because it does not try to explain the noise in the training sample.
Simple explanations are more plausible.

Multiple Classes
General case: K classes, e.g. family, sports, and luxury cars
Classes can overlap.
Each class can use the same or a different hypothesis class.
What if an instance falls into two classes? Sometimes it is worth rejecting it.

Multiple Classes, C_i, i = 1, ..., K
Train hypotheses h_i(x), i = 1, ..., K: h_i(x) = 1 if x ∈ C_i, and 0 if x ∈ C_j, j ≠ i

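One way to combine K per-class hypotheses is sketched below, assuming one rectangle per class and a reject decision whenever zero or several hypotheses accept the instance. The helper names, class rectangles, and test points are all hypothetical.

```python
def make_rectangle(p1, p2, e1, e2):
    """Return a hypothesis h(x) for the rectangle [p1, p2] x [e1, e2]."""
    def h(x):
        price, engine_power = x
        return p1 <= price <= p2 and e1 <= engine_power <= e2
    return h


def classify(x, hypotheses):
    """Return the single class whose hypothesis accepts x, or None to reject."""
    accepted = [name for name, h in hypotheses.items() if h(x)]
    return accepted[0] if len(accepted) == 1 else None


# Hypothetical rectangles: price in thousands, engine power in horsepower.
hypotheses = {
    "family": make_rectangle(15, 30, 90, 160),
    "sports": make_rectangle(25, 60, 150, 400),
    "luxury": make_rectangle(50, 150, 120, 500),
}
print(classify((20, 120), hypotheses))  # accepted by one class only
print(classify((28, 155), hypotheses))  # falls into two classes: rejected
print(classify((5, 40), hypotheses))    # falls into no class: rejected
```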
Regression
The output is not Boolean (yes/no) or a class label but a numeric value.
Training set of examples
Interpolation: fit a function (e.g. a polynomial) to the training examples
Extrapolation: predict the output for any x
Regression: the observed output includes added noise
Assumption: the noise comes from hidden variables
Approximate the output by a model g(x)

Regression
Empirical error on the training set
The hypothesis space is the set of linear functions.
Calculate the best parameters by taking partial derivatives of the error and setting them to zero.

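For a linear model g(x) = w1·x + w0 with squared error, setting the partial derivatives with respect to w0 and w1 to zero gives a closed-form solution. The sketch below assumes a single input variable and squared error as the empirical error; the function names and data are hypothetical.

```python
def fit_line(xs, rs):
    """Least-squares fit of g(x) = w1 * x + w0; returns (w0, w1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    r_mean = sum(rs) / n
    sxy = sum(x * r for x, r in zip(xs, rs)) - n * x_mean * r_mean
    sxx = sum(x * x for x in xs) - n * x_mean ** 2
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w1 = sxy / sxx
    w0 = r_mean - w1 * x_mean
    return w0, w1


def mean_squared_error(xs, rs, w0, w1):
    """Average squared difference between the outputs and the fitted line."""
    return sum((r - (w1 * x + w0)) ** 2 for x, r in zip(xs, rs)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]  # noise-free points on r = 2x + 1
w0, w1 = fit_line(xs, rs)
print(w0, w1, mean_squared_error(xs, rs, w0, w1))
```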
Higher-order polynomials
Model Selection & Generalization
Learning is an ill-posed problem; the data are not sufficient to find a unique solution.
Each training example rules out the hypotheses inconsistent with it.
Hence the need for an inductive bias: assumptions about H, e.g. rectangles in our example.
Generalization: how well a model performs on new data
Overfitting: H is more complex than C or f
Underfitting: H is less complex than C or f

Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.

Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as:
Training set (50%): to train a model
Validation set (25%): to select a model (e.g. the degree of a polynomial)
Test (publication) set (25%): to estimate the error
Use resampling when data are scarce.

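The 50/25/25 split can be sketched for polynomial degree selection as follows. This assumes NumPy is available and uses synthetic data from a hypothetical cubic target with added noise; the degree range 1 to 7 is also an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a cubic target plus Gaussian noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = x**3 - x + rng.normal(0.0, 0.05, size=200)

# Shuffle once, then split 50% / 25% / 25%.
idx = rng.permutation(200)
train, val, test = idx[:100], idx[100:150], idx[150:]


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


models, val_errors = {}, {}
for degree in range(1, 8):
    # Fit on the training set only.
    models[degree] = np.polyfit(x[train], y[train], degree)
    # Compare candidate degrees on the validation set.
    val_errors[degree] = mse(models[degree], x[val], y[val])

best_degree = min(val_errors, key=val_errors.get)
# The test set is touched once, after model selection is finished.
test_error = mse(models[best_degree], x[test], y[test])
print(best_degree, test_error)
```

Keeping the test set out of both fitting and degree selection is what makes the final error an honest estimate.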
Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)