Lecture notes for Stat 231: Pattern Recognition and Machine Learning. A.L. Yuille. Fall 2004. PAC Learning and Generalizability. Margin Errors. Structural Risk Minimization.

Slide 2: Induction: History. Francis Bacon described empiricism: formulate hypotheses and test them by experiment. English Empiricist School of Philosophy. David Hume (Scottish): scepticism. "Why should the Sun rise tomorrow just because it always has?" Karl Popper, The Logic of Scientific Discovery: the falsifiability principle. "A hypothesis is useless unless it can be disproven."

Slide 3: Risk and Empirical Risk. Risk: $R(\alpha) = \int L(y, f(x;\alpha))\, dP(x,y)$. Specialize to two classes (M = 2), with the loss function counting misclassifications, i.e. $L(y, f(x;\alpha)) = \tfrac{1}{2}|y - f(x;\alpha)|$ for labels $y \in \{-1,+1\}$. Empirical risk on the dataset $\{(x_i, y_i)\}_{i=1}^{n}$: $R_{\mathrm{emp}}(\alpha) = \tfrac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i;\alpha))$. Here $\{f(\cdot;\alpha) : \alpha \in \Lambda\}$ is the set of learning machines (e.g. all thresholded hyperplanes).
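
As a concrete illustration (my own minimal sketch, not from the notes), the 0-1 empirical risk of a thresholded hyperplane can be computed directly; the toy data and weight vector below are invented for the example.

```python
import numpy as np

# Toy dataset: rows of X are the points x_i, y holds labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.3]])
y = np.array([1, 1, -1, -1])

# A thresholded hyperplane classifier f(x; w, b) = sign(w.x + b).
w = np.array([1.0, 0.5])
b = -0.1

def empirical_risk(X, y, w, b):
    """0-1 empirical risk: the fraction of misclassified training points."""
    preds = np.sign(X @ w + b)
    return float(np.mean(preds != y))

print(empirical_risk(X, y, w, b))  # 0.0 for this separable toy set
```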

Slide 4: Risk and Empirical Risk. Key concept: the Vapnik-Chervonenkis (VC) dimension h. The VC dimension is a function of the set of classifiers only; it is independent of the distribution P(x,y) of the data. It is a measure of the "degrees of freedom" of the set of classifiers. Intuitively, the size of the dataset n must be larger than the VC dimension before you can learn reliably. E.g. Cover's theorem: hyperplanes in d dimensions need at least about 2(d+1) samples, otherwise a dichotomy of the data is likely to be linearly separable purely by chance.
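
Cover's counting argument can be checked numerically. The sketch below (my own illustration, not part of the notes) uses Cover's function $C(n, D) = 2\sum_{k=0}^{D-1}\binom{n-1}{k}$ for the number of dichotomies of n points in general position realizable by a hyperplane through the origin in D dimensions; for affine hyperplanes in d dimensions take D = d + 1. At n = 2(d+1) exactly half of all $2^n$ dichotomies are separable.

```python
from math import comb

def cover_fraction(n, d):
    """Fraction of the 2^n dichotomies of n points in general position in R^d
    that are realizable by an affine hyperplane (effective dimension d + 1)."""
    c = 2 * sum(comb(n - 1, k) for k in range(d + 1))  # Cover's C(n, d + 1)
    return c / 2 ** n

d = 2
for n in [3, 4, 6, 8, 12, 20]:
    print(n, cover_fraction(n, d))  # falls through 0.5 exactly at n = 2(d+1) = 6
```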

Slide 5: PAC. Probably Approximately Correct (PAC). If h < n, where h is the VC dimension of the classifier set, then with probability at least $1 - \eta$,
$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\,(\log(2n/h) + 1) - \log(\eta/4)}{n}}$.
For hyperplanes, h = d + 1.
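
The capacity term can be evaluated numerically. The short sketch below (my own illustration, assuming the Vapnik form of the bound quoted above) shows how it shrinks as n grows for fixed h and confidence parameter eta.

```python
import numpy as np

def vc_confidence(n, h, eta=0.05):
    """Capacity term sqrt((h*(log(2n/h) + 1) - log(eta/4)) / n) of the VC bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

h = 3  # e.g. thresholded hyperplanes in d = 2 dimensions: h = d + 1
for n in [10, 100, 1000, 10000, 100000]:
    print(n, round(vc_confidence(n, h), 3))  # decays roughly like sqrt(h*log(n)/n)
```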

Slide 6: PAC. Generalizability: small empirical risk implies, with high probability, small risk, provided the capacity term is small. Only "Probably Approximately Correct", because we can never be completely sure that we haven't been misled by rare samples. In practice, we require h/n to be small together with a small confidence parameter $\eta$.

Slide 7: PAC. This is the basic machine learning result; there are a number of variants. The VC dimension is one measure of the capacity of the set of classifiers. Other measures give tighter bounds but are harder to compute: the annealed VC entropy and the growth function. The VC dimension is d + 1 for thresholded hyperplanes, and it can also be bounded nicely for kernel classifiers via the margin (later this lecture). A forthcoming lecture will sketch the derivation of the PAC bound; it makes use of the probability of rare events (e.g. Cramér's theorem, Sanov's theorem).

Slide 8: VC for Margins. The VC dimension is the largest number of data points that can be shattered by the classifier set. Shattered means that every possible dichotomy of those points can be realized by some classifier in the set (cf. Cover's hyperplane counting). The VC dimension is d + 1 for thresholded hyperplanes in d dimensions. But we can get tighter VC bounds by considering margins, and these bounds extend directly to kernel hyperplanes.
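
Shattering can be checked by brute force on small examples. The sketch below (my own illustration; the linear-programming feasibility test via scipy is an implementation choice, not something from the notes) verifies that three non-collinear points in the plane can be shattered by affine hyperplanes while four generic points cannot, consistent with h = d + 1 = 3.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Labels y in {-1,+1} on points X are realizable by an affine hyperplane
    iff the linear program  y_i * (w.x_i + b) >= 1  for all i  is feasible."""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])  # variables z = (w, b)
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=(None, None))
    return res.success

def shattered(X):
    """True if every dichotomy of the points in X is linearly separable."""
    n = X.shape[0]
    return all(separable(X, np.array(s, dtype=float))
               for s in itertools.product([-1, 1], repeat=n))

print(shattered(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))             # True
print(shattered(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))) # False
```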

Slide 9: VC Margin Hyperplanes. Consider hyperplane classifiers $f(x) = \mathrm{sgn}(w \cdot x + b)$ in canonical form with respect to the data, i.e. normalized so that $\min_i |w \cdot x_i + b| = 1$. Then the set of classifiers satisfying $\|w\| \le \Lambda$ has VC dimension satisfying $h \le \min(R^2 \Lambda^2, d) + 1$, where R is the radius of the smallest sphere containing the datapoints. Recall that $1/\|w\|$ is the margin, so the margin is at least $1/\Lambda$. Enforcing a large margin effectively limits the VC dimension.
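
A quick numeric reading of the bound (my own sketch; the numbers are invented): with data radius R = 1 and margin 0.1 (so Λ = 10), the margin bound caps h at about 101 even in a 256-dimensional input space, far below d + 1 = 257.

```python
def margin_vc_bound(R, Lam, d):
    """Margin-based VC bound h <= min(R^2 * Lam^2, d) + 1 for canonical
    hyperplanes with ||w|| <= Lam (margin >= 1/Lam)."""
    return min(R ** 2 * Lam ** 2, d) + 1

print(margin_vc_bound(R=1.0, Lam=10.0, d=256))  # 101.0, vs. d + 1 = 257
```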

Slide 10: VC Margin: Kernels. The same technique applies to kernels. Claim: finding the minimum sphere of radius R that encloses the data depends on the feature vectors only through the kernel (the kernel trick). Primal: minimize $R^2$ over the centre c and radius R subject to $\|\Phi(x_i) - c\|^2 \le R^2$ for all i, introducing Lagrange multipliers $\beta_i \ge 0$. Dual: maximize $\sum_i \beta_i K(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$ subject to $\beta_i \ge 0$ and $\sum_i \beta_i = 1$. This depends on dot products (kernel values) only!
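
The dual is a small quadratic program and can be solved numerically. The sketch below is my own illustration (the toy data and polynomial kernel degrees are invented), using scipy's SLSQP solver for convenience; the optimal value of the dual equals $R^2$ in feature space.

```python
import numpy as np
from scipy.optimize import minimize

def min_enclosing_radius2(K):
    """Solve  max_beta  sum_i beta_i K_ii - beta^T K beta  s.t. beta >= 0,
    sum(beta) = 1.  The optimal value is R^2, the squared radius of the
    smallest sphere enclosing the data in the kernel's feature space."""
    n = K.shape[0]
    diag = np.diag(K)

    def neg_dual(b):
        return -(diag @ b - b @ K @ b)

    def neg_grad(b):
        return -(diag - 2 * K @ b)

    res = minimize(neg_dual, np.full(n, 1.0 / n), jac=neg_grad, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    return -res.fun

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                 # toy data standing in for features
for p in [1, 2, 3, 4]:
    K = (1.0 + X @ X.T) ** p                 # polynomial kernel (1 + x.x')^p
    print(p, round(min_enclosing_radius2(K), 2))
```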

Slide 11: Generalizability for Kernels. The capacity term is a monotonic function of h, so use the margin VC bound to decide which kernel will do best for learning the US Post Office handwritten digit dataset. For each kernel choice, solve the dual problem above to estimate R. Assume the empirical risk is negligible, because the digits can be classified correctly using kernels (but not with a linear classifier). This predicts that the fourth-order polynomial kernel has the best generalization, which compares nicely with the results when the classifiers are tested.
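
The model-selection computation can be sketched end to end. The example below is my own illustration on invented two-class data (the USPS digits are not bundled here); it trains a polynomial-kernel SVM with scikit-learn for each degree, computes $\|w\|^2$ in feature space from the dual coefficients, approximates $R^2$ by the sphere centred at the feature-space mean (an upper bound on the minimum enclosing radius, good enough for ranking), and compares the resulting margin bounds.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                      # invented stand-in data
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

def radius2_feature_space(K):
    """Squared radius of the sphere centred at the feature-space mean; this
    upper-bounds the minimum enclosing radius and suffices for ranking kernels."""
    return np.max(np.diag(K) - 2 * K.mean(axis=1) + K.mean())

for p in [1, 2, 3, 4, 5]:
    clf = SVC(kernel="poly", degree=p, gamma=1.0, coef0=1.0, C=1e3).fit(X, y)
    K = (1.0 + X @ X.T) ** p                       # same kernel as the SVC uses
    sv, alpha_y = clf.support_, clf.dual_coef_[0]  # alpha_i * y_i on support vectors
    w_norm2 = alpha_y @ K[np.ix_(sv, sv)] @ alpha_y    # ||w||^2 in feature space
    bound = radius2_feature_space(K) * w_norm2 + 1     # margin-based VC bound
    train_err = np.mean(clf.predict(X) != y)
    print(f"degree {p}: bound ~ {bound:9.1f}, training error {train_err:.3f}")
# Among degrees with negligible training error, the margin argument predicts that
# the one with the smallest bound will generalize best.
```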

Slide 12: Generalization for Kernels.

Slide 13: Structural Risk Minimization. Standard learning says: pick the classifier that minimizes the empirical risk, $\hat\alpha = \arg\min_\alpha R_{\mathrm{emp}}(\alpha)$. The traditional approach: use cross-validation to determine whether the chosen classifier is generalizing. VC theory says: evaluate the bound, and ensure there are sufficiently many samples that the capacity term is small. Alternative: Structural Risk Minimization. Divide the set of classifiers into a nested hierarchy of sets $S_1 \subset S_2 \subset \dots \subset S_p \subset \dots$ with corresponding VC dimensions $h_1 \le h_2 \le \dots \le h_p \le \dots$

Slide 14: Structural Risk Minimization. Select the classifier (and the class $S_p$) to minimize: Empirical Risk + Capacity Term. The capacity term determines the "generalizability" of the classifier. Increasing the amount of training data allows you to increase p and use a richer class of classifiers. Is the bound tight enough in practice?
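
A toy sketch of the SRM selection rule (my own illustration, assuming the VC bound quoted on slide 5; the hierarchy's VC dimensions and empirical risks are invented): neither the simplest nor the richest class wins, but the one that balances fit against capacity.

```python
import numpy as np

def vc_confidence(n, h, eta=0.05):
    """Capacity term from the VC bound quoted on slide 5."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

n = 1000                                    # number of training samples
vc_dims = [3, 10, 30, 100, 300]             # invented VC dimensions h_1 <= ... <= h_5
emp_risks = [0.30, 0.10, 0.05, 0.04, 0.04]  # invented empirical risks per class S_p

bounds = [r + vc_confidence(n, h) for r, h in zip(emp_risks, vc_dims)]
for p, (h, r, b) in enumerate(zip(vc_dims, emp_risks, bounds), start=1):
    print(f"S_{p}: h = {h:3d}, R_emp = {r:.2f}, risk bound = {b:.3f}")
print("SRM selects S_%d" % (int(np.argmin(bounds)) + 1))
```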

Slide 15: Structural Risk Minimization.

Slide 16: Summary. PAC learning and the VC dimension. The VC dimension is a measure of the capacity of the set of classifiers. The risk is bounded by the empirical risk plus a capacity term. The VC dimension can be bounded for linear and kernel classifiers via the margin; this predicts which kernels are best able to generalize. Structural Risk Minimization: penalize classifiers that have poor generalization bounds.