
1 Computational Learning Theory and Kernel Methods Tianyi Jiang March 8, 2004

2 General Research Question “Under what conditions is successful learning possible and impossible?” “Under what conditions is a particular learning algorithm assured of learning successfully?” - Mitchell, 1997

3 Computational Learning Theory 1. Sample Complexity 2. Computational Complexity 3. Mistake Bound - Mitchell, 1997

4 Problem Setting Instance Space: X, with a fixed (stationary) distribution D Concept Class: C, s.t. each c ∈ C is a function c: X → {0,1} Hypothesis Space: H General Learner: L

5 Error of a Hypothesis [Figure: instance space showing the regions covered by c and by h; the true error is the probability mass of the region where c and h disagree]

6 PAC Learnability True Error: error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)] Difficulties in getting 0 error: 1. Multiple hypotheses may be consistent with the training examples 2. The training examples can mislead the Learner

7 PAC-Learnable Learner L will, with probability at least (1-δ), output a hypothesis h s.t. error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c), where n = size of a training example and size(c) = encoding length of c in C

8 Consistent Learner & Version Space Consistent Learner: outputs hypotheses that perfectly fit the training data whenever possible Version Space: VS_{H,E} is ε-exhausted with respect to c and D if every hypothesis h in VS_{H,E} has error_D(h) < ε

9 Version Space [Figure: hypothesis space H with ε = .21; each hypothesis is labeled with its true error and its training error r, e.g. (error=.1, r=.2), (error=.3, r=.1), (error=.3, r=.4), (error=.2, r=.3), (error=.2, r=0), (error=.1, r=0); the hypotheses with r=0 form VS_{H,E}]

10 Sample Complexity for Finite Hypothesis Spaces Theorem (ε-exhausting the version space): If H is finite and E is a sequence of m ≥ 1 independently drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1 the probability that VS_{H,E} is NOT ε-exhausted (with respect to c) is at most |H|e^{-εm}

11 Upper bound on sufficient number of training examples If we set the probability of failure below some level δ and solve for m, we get the bound sketched below; however, it is often too loose a bound due to the |H| term
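The equation elided on the slide is presumably the standard finite-hypothesis-space bound (Mitchell, 1997): requiring |H|e^{-εm} ≤ δ and solving for m gives

$$ m \;\ge\; \frac{1}{\epsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr). $$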

12 Agnostic Learning What if the concept c ∉ H? An Agnostic Learner simply finds the h with minimum training error. Find an upper bound on m s.t., with probability at least (1-δ), error_D(h_best) ≤ error_E(h_best) + ε, where h_best = the h with lowest training error

13 Upper bound on sufficient number of training examples when the training error error_E(h_best) need not be 0. From the Chernoff (Hoeffding) bounds we obtain the chain of steps sketched below
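The elided equations are presumably the standard Hoeffding-bound argument (Mitchell, 1997): for a single hypothesis,

$$ \Pr\bigl[\mathrm{error}_D(h) > \mathrm{error}_E(h) + \epsilon\bigr] \;\le\; e^{-2m\epsilon^2}, $$

then taking a union bound over all |H| hypotheses and requiring the failure probability to be at most δ gives

$$ m \;\ge\; \frac{1}{2\epsilon^2}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr). $$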

14 Example: Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1, 95% of the time? |H| = ? ε = ? δ = ?

15 Example: Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1, 95% of the time? |H| = 3^10, ε = .1, δ = .05
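The slide stops at the parameter values; plugging them into the finite-|H| bound above gives (a worked step added here, not on the original slide):

$$ m \;\ge\; \frac{1}{0.1}\Bigl(\ln 3^{10} + \ln\tfrac{1}{0.05}\Bigr) \;=\; 10\,(10.99 + 3.00) \;\approx\; 140 \text{ training examples.} $$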

16 Sample Complexity for Infinite Hypothesis Spaces Consider a subset of instances S ⊆ X and an h ∈ H; h imposes a dichotomy on S, i.e. partitions S into 2 subsets: {x ∈ S | h(x)=1} and {x ∈ S | h(x)=0}. Thus for any instance set S there are 2^{|S|} possible dichotomies. Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some h ∈ H consistent with that dichotomy (a small sketch of checking this for a simple hypothesis class follows below)
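To make the definition concrete, here is a minimal sketch (not part of the original slides) that enumerates all 2^{|S|} dichotomies of a 1-D point set and checks whether each is realized by the hypothesis class of closed intervals; the function name and the choice of intervals as the hypothesis class are my own illustration.

```python
from itertools import product

def shattered_by_intervals(points):
    """Return True if every dichotomy of `points` is realized by some
    closed interval [a, b], where h(x) = 1 iff a <= x <= b."""
    points = sorted(points)
    for labels in product([0, 1], repeat=len(points)):
        positives = [x for x, y in zip(points, labels) if y == 1]
        if not positives:
            continue  # all-negative labeling: realized by an interval containing none of the points
        a, b = min(positives), max(positives)
        # Any interval covering all positives must contain [a, b], so the labeling
        # is realizable iff no negatively-labeled point falls inside [a, b].
        if any(a <= x <= b for x, y in zip(points, labels) if y == 0):
            return False
    return True

print(shattered_by_intervals([0.0, 1.0]))       # True:  any 2 points are shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False: the labeling (1, 0, 1) fails
```

This also previews the VC-dimension discussion that follows: intervals on the line shatter any 2 points but no set of 3.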

17 3 Instances Shattered by 8 Hypotheses [Figure: instance space X with three instances; each of the 2^3 = 8 dichotomies is realized by one of 8 hypotheses]

18 Vapnik-Chervonenkis Dimension Definition: VC(H) is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞. For any finite H, VC(H) ≤ log_2 |H|, since shattering d instances requires at least 2^d distinct hypotheses.

19 Example of VC Dimension Along a line … In a plane …

20 VC Dimension Example 2

21 VC dimension in R^n Theorem: Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent. So what is the VC dimension of the set of oriented hyperplanes in R^10?
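Answering the question the slide leaves open (this follows directly from the theorem, as in Burges, 1998): at most n vectors in R^n can be linearly independent, so at most n + 1 points can be shattered, i.e.

$$ \mathrm{VC}\bigl(\text{oriented hyperplanes in } \mathbb{R}^n\bigr) = n + 1, $$

so for R^10 the answer is 11.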

22 Bounds on m with VC Dimension VC(H) ≤ log_2 |H|; upper and lower bounds on m in terms of the VC dimension are sketched below
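The bounds elided on the slide are presumably the standard ones from Mitchell (1997). A number of examples sufficient for PAC learning (upper bound) is

$$ m \;\ge\; \frac{1}{\epsilon}\Bigl(4\log_2\tfrac{2}{\delta} + 8\,\mathrm{VC}(H)\log_2\tfrac{13}{\epsilon}\Bigr), $$

and a lower bound on the number necessary is

$$ m \;\ge\; \max\Bigl[\frac{1}{\epsilon}\log\tfrac{1}{\delta},\; \frac{\mathrm{VC}(C)-1}{32\epsilon}\Bigr]. $$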

23 Mistake Bound Model of Learning “How many mistakes will the learner make in its predictions before it learns the target concept?” The best algorithm, in the worst-case scenario (hardest target concept, hardest training sequence), will make Opt(C) mistakes, where Opt(C) is defined below
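The definition cut off on the slide is presumably Mitchell's (1997): writing M_A(c) for the maximum number of mistakes algorithm A makes over all possible training sequences for concept c,

$$ \mathrm{Opt}(C) \;=\; \min_{A \,\in\, \text{learning algorithms}} \;\; \max_{c \in C} \; M_A(c), $$

and these quantities are related by VC(C) ≤ Opt(C) ≤ M_{Halving}(C) ≤ log_2 |C|.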

24 Linear Support Vector Machines Consider a binary classification problem: Training data: {x_i, y_i}, i = 1, …, ℓ; y_i ∈ {-1, +1}; x_i ∈ R^d Points x lying on the separating hyperplane satisfy w·x + b = 0, where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w

25 Linear Support Vector Machine, Definitions Let d_+ (d_-) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Margin of a separating hyperplane = d_+ + d_- = 1/||w|| + 1/||w|| = 2/||w|| Constraints: (sketched below)
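The constraints elided from the slide are presumably the canonical ones (Burges, 1998): all training points must lie on or outside the margin hyperplanes H_1 and H_2,

$$ x_i \cdot w + b \ge +1 \ \text{for } y_i = +1, \qquad x_i \cdot w + b \le -1 \ \text{for } y_i = -1, $$

which combine into the single set of inequalities y_i (x_i · w + b) - 1 ≥ 0 for all i.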

26 Linear Separating Hyperplane for the Separable Case

27 Problem of Maximizing the Margin H_1 and H_2 are parallel, with no training points between them. Thus we reformulate the problem as: maximize the margin by minimizing ||w||^2 s.t. y_i(x_i·w + b) - 1 ≥ 0 for all i

28 Ties to Least Squares [Figure: least-squares fit of y against x with intercept b] Loss Function: the squared-error loss, L = Σ_i (y_i - (w·x_i + b))^2

29 Lagrangian Formulation 1. Transform the constraints into Lagrange multipliers 2. The training data will then appear only in the form of dot products Let α_i ≥ 0, i = 1, …, ℓ, be positive Lagrange multipliers. We then have the Lagrangian shown below
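The Lagrangian itself was an image on the slide; it is presumably the standard primal Lagrangian of the hard-margin SVM (Burges, 1998):

$$ L_P \;=\; \tfrac{1}{2}\|w\|^2 \;-\; \sum_{i=1}^{\ell} \alpha_i\, y_i (x_i \cdot w + b) \;+\; \sum_{i=1}^{\ell} \alpha_i . $$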

30 Transform the convex quadratic programming problem Observations: we minimize L_P w.r.t. w and b, while simultaneously requiring that the derivatives of L_P w.r.t. all the α_i vanish, subject to the constraints α_i ≥ 0. This is a convex quadratic programming problem that can be more easily solved in its Dual form

31 Transform the convex quadratic programming problem – the Dual L_P's Dual: maximize L_P subject to the constraints that the gradients of L_P w.r.t. w and b vanish, and that α_i ≥ 0
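Again the formulas were images on the slide; the standard derivation (Burges, 1998) is: the vanishing gradients give

$$ w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, $$

and substituting these back into L_P yields the dual objective

$$ L_D \;=\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j), $$

to be maximized subject to α_i ≥ 0 and Σ_i α_i y_i = 0. Note that the training data enter only through the dot products x_i · x_j.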

32 Observations about the Dual There is a Lagrange multiplier α_i for every training point In the solution, points for which α_i > 0 are called “support vectors”; they lie on either H_1 or H_2 Support vectors are the critical elements of the training set: they lie closest to the decision boundary If all other points are removed or moved around (without crossing H_1 or H_2), the same separating hyperplane would be found

33 Prediction Solving the SVM problem is equivalent to finding a solution of the Karush-Kuhn-Tucker (KKT) conditions (for a convex problem such as this one, the KKT conditions are necessary and sufficient for a solution). Once we have solved for w and b, we predict the class of x to be sign(w·x + b)
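For reference, the KKT conditions for this primal problem (elided on the slide, but standard, e.g. Burges, 1998) are:

$$ \frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0, \qquad \frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0, $$

$$ y_i (x_i \cdot w + b) - 1 \ge 0, \qquad \alpha_i \ge 0, \qquad \alpha_i\bigl(y_i (x_i \cdot w + b) - 1\bigr) = 0 \quad \forall i. $$

The complementarity condition also gives a way to compute b: for any support vector x_s (with α_s > 0), b = y_s - w · x_s.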

34 Linear SVM: The Non-Separable Case We account for outliers by introducing slack variables into the constraints, and we penalize outliers by changing the cost function, as sketched below
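The elided constraints and objective are presumably the standard soft-margin formulation (Burges, 1998), with slack variables ξ_i ≥ 0:

$$ x_i \cdot w + b \ge +1 - \xi_i \ \text{for } y_i = +1, \qquad x_i \cdot w + b \le -1 + \xi_i \ \text{for } y_i = -1, $$

and the objective becomes

$$ \min_{w,\, b,\, \xi} \;\; \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i , $$

where C controls the trade-off between a large margin and few margin violations.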

35 Example of Linear SVM with slacks

36 Linear SVM Classification Examples [Figures: a linearly separable case and a linearly non-separable case]

37 Nonlinear SVM Observation: the data appear only as dot products in the training problem. So we can use a mapping function Φ to map the data into a high-dimensional feature space where the points are linearly separable. To make things easier, we define a kernel function K s.t. K(x_i, x_j) = Φ(x_i)·Φ(x_j)

38 Nonlinear SVM (cont.) Kernel functions compute dot products in the high-dimensional space without explicitly working with Φ (an example mapping is shown on the next slide). Rather than computing w, we make the prediction on x via f(x) = sign( Σ_i α_i y_i K(s_i, x) + b ), where the sum runs over the support vectors s_i
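As an illustration (not part of the original slides), here is a minimal scikit-learn sketch showing that a fitted RBF-kernel classifier's decision function really is the kernel expansion above; the toy data, the gamma value, and the variable names are my own choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Hypothetical toy data: two Gaussian blobs, one per class
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Reproduce f(x) = sum_i alpha_i y_i K(s_i, x) + b using only the support
# vectors s_i (sklearn's dual_coef_ stores the products alpha_i * y_i).
x_new = np.array([[0.5, -0.3]])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]

print(np.sign(f_manual), clf.predict(x_new))                 # same class label
print(np.allclose(f_manual, clf.decision_function(x_new)))   # True
```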

39 Example of Φ mapping [Figure: the image, in feature space, of the square [-1,1]×[-1,1] ⊂ R^2 under the mapping Φ]

40 Example Kernel Functions Kernel functions must satisfy Mercer's condition; put simply, the kernel (Gram) matrix, and hence the Hessian of the dual problem, must be positive semidefinite (non-negative eigenvalues). Example Kernels: (listed below)
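The example kernels were an image on the slide; they are presumably the usual ones (e.g. Burges, 1998):

$$ K(x, y) = (x \cdot y + 1)^p \ \text{(polynomial)}, \qquad K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2} \ \text{(Gaussian RBF)}, $$

$$ K(x, y) = \tanh(\kappa\, x \cdot y - \delta) \ \text{(sigmoid, which satisfies Mercer's condition only for some values of } \kappa \text{ and } \delta). $$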

41 Nonlinear SVM Classification Examples (Degree 3 Polynomial Kernel) [Figures: a linearly separable case and a linearly non-separable case]

42 Multi-Class SVM 1. One-against-all 2. One-against-one (majority vote) 3. One-against-one (DAGSVM) (a scikit-learn sketch of the first two strategies follows)
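A minimal sketch of the first two strategies (my own illustration, using the iris dataset and scikit-learn; DAGSVM has no scikit-learn implementation, so it is not shown):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-against-all: one binary SVM per class; predict the class with the largest score.
ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

# One-against-one with majority vote: one binary SVM per pair of classes.
# (SVC itself also uses one-against-one internally for multi-class problems.)
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

print(len(ova.estimators_))  # 3 classifiers: one per class
print(len(ovo.estimators_))  # 3 classifiers: 3*2/2 pairs for 3 classes
```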

43 Global Solution and Uniqueness Every local solution is also global (property of any convex programming problem) Solution is guaranteed unique if the objective function is strictly convex (Hessian matrix is positive definite)

44 Complexity and Scalability Curse of dimensionality: 1. The proliferation of parameters causes intractable complexity 2. The proliferation of parameters causes overfitting SVMs circumvent these via the use of 1. Kernel functions (the kernel trick), which compute dot products in the feature space at a cost of roughly O(d_L), the dimension of the (low-dimensional) input space 2. Support vectors, which focus the solution on the “boundary”

45 Structural Risk Minimization Empirical Risk vs. Expected Risk (both sketched below)
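The definitions were images on the slide; they are presumably the standard ones from Burges (1998). For ℓ training examples and a family of decision functions f(x, α),

$$ R_{\mathrm{emp}}(\alpha) = \frac{1}{2\ell}\sum_{i=1}^{\ell} \bigl| y_i - f(x_i, \alpha) \bigr|, \qquad R(\alpha) = \int \tfrac{1}{2}\bigl| y - f(x, \alpha) \bigr| \, dP(x, y), $$

and with probability 1 - η the VC bound holds:

$$ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\bigl(\log(2\ell/h) + 1\bigr) - \log(\eta/4)}{\ell}}, $$

where h is the VC dimension of the function family.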

46 Structural Risk Minimization Nested subsets of functions, ordered by VC dimension; SRM trains a machine within each subset and chooses the one that minimizes the bound on the expected risk (empirical risk plus VC confidence term)