
Computational Learning Theory: PAC, IID, VC Dimension, SVM
Kunstmatige Intelligentie (Artificial Intelligence) / RuG, KI2 - 5
Marius Bulacu & prof. dr. Lambert Schomaker

2 Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write everything down
- Learning modifies the agent's decision mechanisms to improve performance

3 Learning Agents

4 Learning Element
- Design of a learning element is affected by:
  - which components of the performance element are to be learned
  - what feedback is available to learn these components
  - what representation is used for the components
- Types of feedback:
  - Supervised learning: correct answers are given for each example
  - Unsupervised learning: correct answers are not given
  - Reinforcement learning: occasional rewards

5 Inductive Learning
- Simplest form: learn a function from examples
  - f is the target function
  - an example is a pair (x, f(x))
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples
- This is a highly simplified model of real learning:
  - it ignores prior knowledge
  - it assumes the examples are given

6-11 Inductive Learning Method
- Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples)
- E.g., curve fitting: several candidate curves of increasing complexity can be fitted to the same training points
- Occam's razor: prefer the simplest hypothesis consistent with the data (see the sketch below)
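To make the curve-fitting example concrete, here is a minimal sketch of my own (not from the slides; it assumes numpy and an illustrative noisy quadratic target): a high-degree polynomial typically matches the training points almost exactly, while a low-degree fit generalizes better, which is Occam's razor in action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: noisy samples of an underlying quadratic target f(x).
x_train = np.linspace(-1, 1, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(-1, 1, 100)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2          # noise-free test targets

for degree in (1, 2, 9):
    # Fit a polynomial hypothesis h of the given degree to the training set.
    h = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = np.mean((h(x_train) - y_train) ** 2)
    test_mse = np.mean((h(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```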

12 Occam's Razor
William of Occam (England)
- "If two theories explain the facts equally well, then the simpler theory is to be preferred."
- Rationale:
  - There are fewer short hypotheses than long hypotheses.
  - A short hypothesis that fits the data is unlikely to be a coincidence.
  - A long hypothesis that fits the data may be a coincidence.
- Formal treatment in computational learning theory

13 The Problem
Why does learning work? How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?
The answer is provided by computational learning theory.

14 The Answer
Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong. Therefore it must be Probably Approximately Correct (PAC).

15 The Stationarity Assumption
The training and test sets are drawn randomly from the same population of examples using the same probability distribution. Therefore training and test data are Independently and Identically Distributed (IID): "the future is like the past".

16 How many examples are needed?
Sample complexity: m ≥ (1/ε) (ln(1/δ) + ln |H|)
- m: number of examples
- ε: probability that h and f disagree on an example
- δ: probability that a wrong hypothesis consistent with all examples exists
- |H|: size of the hypothesis space
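As a quick check of the numbers this bound produces, here is a minimal sketch (my own; it assumes the reconstructed bound above, and the example values of ε, δ and |H| are purely illustrative):

```python
import math

def pac_sample_complexity(epsilon: float, delta: float, size_h: int) -> int:
    """Smallest m with m >= (1/epsilon) * (ln(1/delta) + ln|H|): with probability
    at least 1 - delta, every hypothesis in H consistent with m i.i.d. examples
    has true error at most epsilon."""
    return math.ceil((1.0 / epsilon) * (math.log(1.0 / delta) + math.log(size_h)))

# Illustrative numbers: |H| = 2**32 hypotheses, 10% error tolerance, 95% confidence.
print(pac_sample_complexity(epsilon=0.1, delta=0.05, size_h=2**32))  # -> 252
```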

17 Formal Derivation
- H: the set of all possible hypotheses
- f ∈ H: the target function is assumed to be in H
- H_BAD: the set of "wrong" hypotheses (hypotheses whose error exceeds ε)
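The slide only names the ingredients; the standard counting argument that produces the bound on slide 16 (not spelled out in the transcript, sketched here for completeness) runs as follows:

```latex
\begin{align*}
&\text{For any } h_b \in H_{\mathrm{BAD}} \text{ (true error} > \epsilon\text{):}\quad
  P(h_b \text{ agrees with one random example}) \le 1 - \epsilon \\
&\Rightarrow\ P(h_b \text{ consistent with } m \text{ i.i.d. examples}) \le (1-\epsilon)^m \\
&\Rightarrow\ P(\text{some } h_b \in H_{\mathrm{BAD}} \text{ is consistent})
  \le |H|\,(1-\epsilon)^m \le |H|\,e^{-\epsilon m} \quad\text{(union bound)} \\
&\text{Require } |H|\,e^{-\epsilon m} \le \delta \text{ and solve for } m:\quad
  m \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)
\end{align*}
```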

18 What if the hypothesis space is infinite?
- We can't use our result for finite H
- We need some other measure of complexity for H: the Vapnik-Chervonenkis (VC) dimension
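For orientation (standard material, not from the slides): with an infinite H the ln|H| term in the sample-complexity bound is replaced by a term proportional to the VC dimension, e.g. in the commonly quoted form

```latex
m \;=\; O\!\left(\frac{1}{\epsilon}\left(\mathrm{VC}(H)\,\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right)
```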

19-21 (figure-only slides; no text in the transcript)

22 Shattering two binary dimensions over a number of classes
In order to understand the principle of shattering sample points into classes, we will look at the simple case of two dimensions of binary value.

23-39 Figure-only slides (2-D binary feature space, axes f1 and f2): successive two-class labelings of the four cells. Slide titles:
- 23 2-D feature space
- 24 2-D feature space, 2 classes
- 25 the other class
- 26 2 left vs 2 right
- 27 top vs bottom
- 28 right vs left
- 29 bottom vs top
- 30 lower-right outlier
- 31 lower-left outlier
- 32 upper-left outlier
- 33 upper-right outlier
- 34 etc.
- 35-37 2-D feature space
- 38 XOR configuration A
- 39 XOR configuration B

40 2-D feature space, two classes: 16 hypotheses
Each of the four cells (f1 = 0 or 1, f2 = 0 or 1) can be assigned one of the two classes, giving 2^4 = 16 labelings.
A "hypothesis" = a possible class partitioning of all data samples.

41 2-D feature space, two classes, 16 hypotheses
Two of the 16 labelings are XOR class configurations: 2/16 of the hypotheses require a non-linear separatrix.
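To back up the 2/16 claim, here is a minimal enumeration sketch of my own (not from the slides); the small grid of candidate lines is an assumption that happens to be sufficient for these four corner points:

```python
from itertools import product

# The four cells of the 2-D binary feature space (slides 23-39).
points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Candidate separating lines w1*f1 + w2*f2 + b = 0; for these corner points
# this grid contains a witness for every linearly separable labeling.
candidates = [(w1, w2, b) for w1 in (-1, 0, 1) for w2 in (-1, 0, 1)
              for b in (-1.5, -0.5, 0.5, 1.5)]

def linearly_separable(labels):
    """True if some candidate line puts every +1 point strictly on one side
    and every -1 point strictly on the other."""
    return any(all(label * (w1 * f1 + w2 * f2 + b) > 0
                   for (f1, f2), label in zip(points, labels))
               for w1, w2, b in candidates)

# Enumerate all 2^4 = 16 labelings ("hypotheses") of the four points.
separable = 0
for labels in product((-1, 1), repeat=4):
    if linearly_separable(labels):
        separable += 1
    else:
        print("needs a non-linear separatrix:", dict(zip(points, labels)))
print(f"{separable}/16 labelings are linearly separable")
```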

42-43 XOR, a possible non-linear separation (figure-only slides; axes f1, f2)

44-45 2-D feature space, three classes: how many hypotheses?
With three class labels for each of the four cells: 3^4 = 81 possible hypotheses.

46 Maximum, discrete space
- Four classes: 4^4 = 256 hypotheses
- Assume that there are no more classes than discrete cells
- N_hyp_max = nclasses^ncells
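A two-line check of these counts (an illustrative sketch; the function name is my own):

```python
# Each of the ncells discrete cells can take any of the nclasses labels,
# so the number of possible class partitionings is nclasses ** ncells.
def max_hypotheses(ncells: int, nclasses: int) -> int:
    return nclasses ** ncells

print(max_hypotheses(4, 2), max_hypotheses(4, 3), max_hypotheses(4, 4))  # 16 81 256
```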

47 2-D feature space, three classes (figure-only; axes f1, f2)
In this example, two of the classes are each linearly separable from the rest, but the third class is not linearly separable from the rest of the classes.

48 2-D feature space, four classes (figure-only; axes f1, f2)
Minsky & Papert: simple table lookup or logic will do nicely.

49 2-D feature space, four classes (figure-only; axes f1, f2)
Spheres or radial-basis functions may offer a compact class encapsulation in the case of limited noise and limited overlap (but in the end the data will tell: experimentation is required!)

50 SVM (1): Kernels
- A complicated separation boundary in the original feature space (f1, f2) becomes a simple separation boundary, a hyperplane, after mapping to a higher-dimensional space (f1, f2, f3)
- Kernels: polynomial, radial basis, sigmoid
- Implicit mapping to a higher-dimensional space where linear separation is possible
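To connect this to the XOR slides above, here is a minimal sketch of my own (it assumes scikit-learn; the dataset and parameters are illustrative, not from the slides): a linear kernel cannot separate the XOR labeling, while an RBF kernel implicitly maps it to a space where it is separable.

```python
import numpy as np
from sklearn.svm import SVC

# The XOR configuration from slides 38-43.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel training accuracy:", linear_svm.score(X, y))  # below 1.0
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))     # 1.0 on this toy set
```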

51 SVM (2): Max Margin
Figure: support vectors, max margin, "best" separating hyperplane (axes f1, f2).
- From all the possible separating hyperplanes, select the one that gives the maximum margin.
- The solution is found by quadratic optimization: the "learning".
- The maximum margin gives good generalization.
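For reference, the quadratic optimization the slide refers to is usually written in its hard-margin primal form (standard material, not shown on the slide):

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\qquad\text{subject to}\qquad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\qquad i = 1,\dots,m
```

The margin between the two supporting hyperplanes is 2/||w||, so minimizing ||w||^2 maximizes the margin; the training points for which the constraint holds with equality are the support vectors.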