1 Computational Learning Theory: PAC, IID, VC Dimension, SVM
Artificial Intelligence (Kunstmatige Intelligentie) / RuG, KI2 - 5
Marius Bulacu & prof. dr. Lambert Schomaker
2 Learning
Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
Learning modifies the agent's decision mechanisms to improve performance.
3 Learning Agents
4 Learning Element
Design of a learning element is affected by:
– which components of the performance element are to be learned
– what feedback is available to learn these components
– what representation is used for the components
Type of feedback:
– supervised learning: correct answers for each example
– unsupervised learning: correct answers not given
– reinforcement learning: occasional rewards
5 Inductive Learning
Simplest form: learn a function from examples.
– f is the target function
– an example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples.
This is a highly simplified model of real learning:
– it ignores prior knowledge
– it assumes the examples are given
6-10 Inductive Learning Method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting (slides 6-10 are figure slides showing successive curve fits).
11 Inductive Learning Method
Construct/adjust h to agree with f on the training set. E.g., curve fitting.
Occam's razor: prefer the simplest hypothesis consistent with the data.
12 Occam's Razor
William of Occam (1285-1349, England): "If two theories explain the facts equally well, then the simpler theory is to be preferred."
Rationale:
– There are fewer short hypotheses than long hypotheses.
– A short hypothesis that fits the data is unlikely to be a coincidence.
– A long hypothesis that fits the data may be a coincidence.
Formal treatment in computational learning theory.
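As an added illustration of slides 6-12 (not from the original deck), the sketch below fits a low-degree and a high-degree polynomial to the same small synthetic data set; both hypotheses agree closely with the training examples, and Occam's razor says to prefer the simpler one. The data set, noise level and polynomial degrees are arbitrary choices for the example.

```python
# Added sketch: curve fitting and Occam's razor on a synthetic, roughly linear data set.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 0.5 + rng.normal(scale=0.05, size=x.size)   # noisy samples of f

for degree in (1, 7):
    coeffs = np.polyfit(x, y, degree)                      # hypothesis h: a polynomial
    max_err = np.max(np.abs(np.polyval(coeffs, x) - y))    # training error of h
    print(f"degree {degree}: max training error = {max_err:.4f}")

# The degree-7 polynomial passes (almost) exactly through the points and the line fits
# them to within the noise; Occam's razor prefers the line, which also generalizes better.
```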
13 The Problem
Why does learning work? How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?
The answer is provided by computational learning theory.
14 The Answer
Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong. Therefore it must be Probably Approximately Correct (PAC).
15 The Stationarity Assumption
The training and test sets are drawn randomly from the same population of examples using the same probability distribution. Therefore training and test data are Independently and Identically Distributed (IID): "the future is like the past".
16 How many examples are needed? (sample complexity)
Let ε be the probability that h and f disagree on an example, δ the acceptable probability that a wrong hypothesis consistent with all examples exists, and |H| the size of the hypothesis space. The number of examples N must satisfy
N ≥ (1/ε) (ln(1/δ) + ln |H|)
17 Formal Derivation
(figure: the hypothesis space H, containing the target f and the subset Hbad)
H: the set of all possible hypotheses. Hbad: the set of "wrong" hypotheses, i.e., those whose error exceeds ε.
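The derivation itself is not present in the scraped text; the following is a sketch of the standard PAC argument (as in Russell & Norvig), reconstructed here rather than taken from the slides.

```latex
% Reconstruction of the standard sample-complexity argument (not verbatim from the deck).
% A hypothesis h_b in H_bad has error greater than \epsilon, so it agrees with f on a
% single random example with probability at most 1-\epsilon; for N independent examples:
\[
  P(h_b \text{ is consistent with all } N \text{ examples}) \le (1-\epsilon)^N
\]
\[
  P(H_{\mathrm{bad}} \text{ contains a consistent hypothesis})
    \le |H_{\mathrm{bad}}|\,(1-\epsilon)^N \le |H|\,(1-\epsilon)^N
\]
% Requiring this to be at most \delta and using 1-\epsilon \le e^{-\epsilon}:
\[
  |H|\,e^{-\epsilon N} \le \delta
  \quad\Longrightarrow\quad
  N \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)
\]
```

For example, for the space of all Boolean functions of 4 binary attributes, |H| = 2^16, and with ε = δ = 0.1 the bound asks for N ≥ 10 (ln 10 + 16 ln 2) ≈ 134 examples.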
18 What if the hypothesis space is infinite?
We can't use our result for finite H. We need some other measure of complexity for H:
– the Vapnik-Chervonenkis (VC) dimension
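The scraped deck does not spell out the definition at this point, so the following is a hedged sketch of the standard material: the VC dimension of H is the size of the largest set of points that H can shatter (i.e., realize every possible labeling of), and it takes over the role of ln |H| in the sample-complexity bound.

```latex
% Standard definition and bound (reconstruction, exact constants deliberately omitted).
\[
  VC(H) = \max\{\, m : \text{some set of } m \text{ points is shattered by } H \,\}
\]
\[
  N = O\!\left(\frac{1}{\epsilon}\left( VC(H)\,\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right)\right)
\]
```

For example, linear separators in the plane shatter three points in general position but no set of four (the XOR configurations on the later slides are the counterexample for the four corner points), so their VC dimension is 3.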
22 Shattering two binary dimensions over a number of classes
To understand the principle of shattering sample points into classes, we will look at the simple case of two binary-valued dimensions.
23-39 (figure slides) The 2-D feature space spanned by f1, f2 ∈ {0, 1} contains four sample points. Successive slides colour the points into two classes to enumerate the possible labelings: 2 left vs 2 right, top vs bottom, right vs left, bottom vs top, the four single-point "outlier" configurations (lower-right, lower-left, upper-left, upper-right), and finally the two XOR configurations (A and B).
40 2-D feature space, two classes: 16 hypotheses
(figure: the 16 labelings, numbered 0-15, of the four cells f1 ∈ {0, 1} × f2 ∈ {0, 1})
A "hypothesis" here is a possible class partitioning of all data samples.
41 2-D feature space, two classes, 16 hypotheses
(figure: the same 16 labelings, with the two XOR configurations highlighted)
The two XOR class configurations (2/16 of the hypotheses) require a non-linear separatrix.
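The 14-vs-2 count on slides 40-41 can be checked mechanically. The sketch below (an addition, not from the deck; numpy and scipy are assumed) enumerates all 16 two-class labelings of the four corner points and tests each for strict linear separability with a small feasibility LP.

```python
# Added sketch: which of the 16 two-class labelings of the 2-D binary feature space
# admit a linear separatrix? Expected outcome: 14 do, the 2 XOR labelings do not.
from itertools import product

import numpy as np
from scipy.optimize import linprog

points = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)

def linearly_separable(labels):
    """Feasibility LP: does some (w, b) satisfy y_i * (w . x_i + b) >= 1 for all i?"""
    y = np.where(np.array(labels) == 1, 1.0, -1.0)
    # Rewrite each constraint as  -y_i * [x_i1, x_i2, 1] . [w1, w2, b] <= -1.
    A_ub = -y[:, None] * np.hstack([points, np.ones((4, 1))])
    b_ub = -np.ones(4)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

non_separable = [labels for labels in product([0, 1], repeat=4)
                 if not linearly_separable(labels)]
print(f"{16 - len(non_separable)} separable, {len(non_separable)} not:", non_separable)
# Prints 14 separable and the two XOR configurations (0, 1, 1, 0) and (1, 0, 0, 1).
```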
42-43 (figure slides) XOR: a possible non-linear separation of the two classes in the f1-f2 plane.
44-45 2-D feature space, three classes: how many hypotheses?
(figure: the enumeration of labelings continues, now with three classes per cell)
With 4 cells and 3 classes there are 3^4 = 81 possible hypotheses.
46 Maximum, discrete space
Four classes: 4^4 = 256 hypotheses.
Assuming there are no more classes than discrete cells, the maximum number of hypotheses is
Nhyp,max = nclasses^ncells
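A quick sanity check of this counting rule (an addition, not from the deck): assigning one of nclasses labels independently to each of ncells cells gives nclasses ** ncells labelings, which reproduces the numbers quoted on slides 40-46.

```python
# Added sketch: count the possible class labelings ("hypotheses") of a discretized space.
from itertools import product

def n_hypotheses(n_cells, n_classes):
    # Each cell independently receives one of n_classes labels.
    return n_classes ** n_cells

# Cross-check by explicit enumeration for the 2x2 binary feature space (4 cells).
assert n_hypotheses(4, 2) == len(list(product(range(2), repeat=4))) == 16
assert n_hypotheses(4, 3) == 81   # three classes
assert n_hypotheses(4, 4) == 256  # four classes
```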
47 2-D feature space, three classes…
(figure: the four points in the f1-f2 plane coloured with three classes)
In this example, two of the classes are each linearly separable from the rest, but the third class is not linearly separable from the rest of the classes.
48 2-D feature space, four classes…
(figure: the four points in the f1-f2 plane, four classes)
Minsky & Papert: simple table lookup or logic will do nicely.
49 2-D feature space, four classes…
(figure: the four classes in the f1-f2 plane)
Spheres or radial-basis functions may offer a compact class encapsulation in case of limited noise and limited overlap (but in the end the data will tell: experimentation is required!).
50 SVM (1): Kernels
(figure: a complicated separation boundary in the (f1, f2) plane vs. a simple separating hyperplane in a higher-dimensional (f1, f2, f3) space)
Kernels (polynomial, radial basis, sigmoid) provide an implicit mapping to a higher-dimensional space where linear separation is possible.
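As a minimal illustration of the mapping idea behind kernels (an addition, not from the deck; a kernel SVM performs such a mapping implicitly rather than explicitly), the XOR labeling is not linearly separable in (f1, f2) but becomes separable once a third feature f3 = f1*f2 is added; the hyperplane below is one hand-picked separator in the mapped space.

```python
# Added sketch: an explicit feature map makes the XOR configuration linearly separable.
import numpy as np

X = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)
y = np.array([0, 1, 1, 0])                        # XOR labels

phi = np.column_stack([X, X[:, 0] * X[:, 1]])     # map (f1, f2) -> (f1, f2, f1*f2)
w, b = np.array([1.0, 1.0, -2.0]), -0.5           # a separating hyperplane in 3-D
predictions = (phi @ w + b > 0).astype(int)

print(predictions)                                # [0 1 1 0], matching the XOR labels
assert np.array_equal(predictions, y)
```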
51 SVM (2): Max Margin
(figure: support vectors, the maximum margin and the "best" separating hyperplane in the (f1, f2) plane)
From all possible separating hyperplanes, select the one that gives the maximum margin. The solution is found by quadratic optimization ("learning"). The max-margin hyperplane gives good generalization.
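A small hedged example of the max-margin idea (an addition, not from the deck; scikit-learn is assumed to be available, and the toy data set is arbitrary): fit a linear SVM with a large C so that it behaves like a hard-margin classifier, then read off the support vectors and the margin width 2/||w||.

```python
# Added sketch: max-margin linear SVM on a toy 2-D data set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],    # class 0
              [2.0, 2.0], [2.0, 3.0], [3.0, 2.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # large C approximates a hard margin

w = clf.coef_[0]                                      # normal vector of the hyperplane
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```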