T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory. Presented by Michael Pfeiffer, 25.11.2004.

Motivation
- Supervised learning: learn functional relationships from a finite set of labelled training examples.
- Generalization: how well does the learned function perform on unseen test examples? This is the central question in supervised learning.

What you will hear
- New idea: stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much.
- Conditions for generalization are placed on the learning map rather than on the hypothesis space, in contrast to VC-analysis.

Agenda: Introduction, Problem Definition, Classical Results, Stability Criteria, Conclusion

Some Definitions 1/2
- Training data: S = {z_1 = (x_1, y_1), ..., z_n = (x_n, y_n)}, z_i ∈ Z = X × Y, drawn from an unknown distribution μ(x, y).
- Hypothesis space H; hypothesis f_S ∈ H, f_S: X → Y.
- Learning algorithm: regression if f_S is real-valued, classification if f_S is binary; the algorithm is symmetric (the ordering of the training examples is irrelevant).

Some Definitions 2/2
- Loss function V(f, z), e.g. the square loss V(f, z) = (f(x) - y)^2; assume that V is bounded.
- Empirical error (training error): the average loss of f on the training set.
- Expected error (true error): the expected loss of f under the distribution μ.
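The two error definitions appeared as formula images on the original slide; written out in the notation of Poggio et al. (a reconstruction, not a verbatim copy), for a training set S of size n they read
\[
I_S[f] \;=\; \frac{1}{n}\sum_{i=1}^{n} V(f, z_i),
\qquad
I[f] \;=\; \mathbb{E}_{z \sim \mu}\bigl[V(f, z)\bigr] \;=\; \int_Z V(f, z)\, d\mu(z).
\]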

Generalization and Consistency (both defined via convergence in probability)
- Generalization: performance on the training examples must be a good indicator of performance on future examples.
- Consistency: the expected error converges to that of the most accurate hypothesis in H.
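In symbols (a reconstruction of the formulas shown on the slide, using the error notation above): for every ε > 0,
\[
\text{generalization:}\quad \lim_{n\to\infty} \Pr\bigl\{\, \lvert I_S[f_S] - I[f_S] \rvert > \varepsilon \,\bigr\} = 0,
\qquad
\text{consistency:}\quad \lim_{n\to\infty} \Pr\bigl\{\, I[f_S] - \inf_{f \in H} I[f] > \varepsilon \,\bigr\} = 0.
\]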

Agenda: Introduction, Problem Definition, Classical Results, Stability Criteria, Conclusion

Empirical Risk Minimization (ERM)
- The focus of classical learning theory research (exact and almost ERM).
- Minimize the training error over H: take the best hypothesis on the training data.
- For ERM: generalization ⇔ consistency.
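As a concrete illustration that is not part of the original slides, the sketch below performs exact ERM over a tiny finite hypothesis space of 1D threshold classifiers; the function names, the candidate thresholds, and the toy data are all choices made here for illustration only.

```python
import numpy as np

def erm_threshold(x, y, thresholds):
    """Exact ERM over a finite hypothesis space of 1D threshold classifiers.

    Each hypothesis f_t predicts +1 if x >= t and -1 otherwise; we return the
    threshold with the smallest empirical (0-1) training error I_S[f_t].
    """
    best_t, best_err = None, np.inf
    for t in thresholds:
        preds = np.where(x >= t, 1, -1)
        err = np.mean(preds != y)          # empirical error of f_t on S
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# toy data: points at or above 0.5 are labelled +1, the rest -1
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.where(x >= 0.5, 1, -1)
t_hat, train_err = erm_threshold(x, y, thresholds=np.linspace(0, 1, 21))
print(f"chosen threshold: {t_hat:.2f}, training error: {train_err:.2f}")
```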

What algorithms are ERM?
- ERM algorithms include: least squares regression, decision trees, ANN backpropagation (?), ...
- Are all learning algorithms ERM? NO! Non-ERM examples: support vector machines, k-nearest neighbour, bagging, boosting, regularization, ...

Vapnik asked: What property must the hypothesis space H have to ensure good generalization of ERM?

Classical Results for ERM [1]
Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class: the empirical mean converges to the true expected value, uniformly in probability, for the loss functions induced by H and V.
[1] e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
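Spelled out (a reconstruction consistent with the definitions above, in the distribution-free form used by the paper): H is a uGC class if, for every ε > 0,
\[
\lim_{n \to \infty} \; \sup_{\mu} \; \Pr\Bigl\{ \sup_{f \in H} \bigl\lvert I_S[f] - I[f] \bigr\rvert > \varepsilon \Bigr\} = 0 .
\]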

VC-Dimension
- Binary functions f: X → {0, 1}.
- VC-dim(H) = size of the largest finite set in X that can be shattered by H; e.g. linear separation in 2D yields VC-dim = 3.
- Theorem: Let H be a class of binary-valued hypotheses; then H is a uGC class if and only if VC-dim(H) is finite [1].
[1] Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
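The 2D example can be checked by brute force. The sketch below (illustrative only, not from the slides; it assumes numpy and scipy are available, and all names are chosen here) tests every +/-1 labeling of a point set for linear separability via a feasibility LP: three points in general position are shattered, four points on a square are not.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is the labeling in {-1,+1} realizable by a linear separator w.x + b?

    Feasibility LP: find (w, b) with y_i * (w.x_i + b) >= 1 for all i."""
    A = np.hstack([-labels[:, None] * points, -labels[:, None]])
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False (XOR labeling)
```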

Achievements of Classical Learning Theory
- Complete characterization of the necessary and sufficient conditions for generalization and consistency of ERM.
- Remaining questions: What about non-ERM algorithms? Can we establish criteria not only for the hypothesis space?

Agenda: Introduction, Problem Definition, Classical Results, Stability Criteria, Conclusion

Poggio et al. asked:
- What property must the learning map L have for good generalization of general algorithms?
- Can a new theory subsume the classical results for ERM?

Stability
- Small perturbations of the training set should not change the hypothesis much; in particular, deleting one training example: S^i = S \ {z_i}.
- How can this be defined mathematically?
(Figure: the learning map sends the original training set S and the perturbed training set S^i to nearby hypotheses in the hypothesis space.)

Uniform Stability [1]
- A learning algorithm L is uniformly stable if, after deleting one training example, the change in loss is small at all points z ∈ Z.
- Uniform stability implies generalization.
- The requirement is too strong: most algorithms (e.g. ERM) are not uniformly stable.
[1] Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001
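The formal condition appeared as a formula image on the slide; reconstructed along the lines of Bousquet and Elisseeff's β-stability (a hedged reconstruction, not a verbatim copy): L is uniformly stable with rate β_n if, for every training set S of size n and every i,
\[
\sup_{z \in Z} \bigl\lvert V(f_S, z) - V(f_{S^i}, z) \bigr\rvert \;\le\; \beta_n ,
\qquad \beta_n \to 0 \ \text{as } n \to \infty .
\]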

CVloo Stability [1]
- Cross-validation leave-one-out stability considers only the error at the removed training point: remove z_i, look at the error at x_i.
- Strictly weaker than uniform stability.
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT, 2003
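In symbols (a simplified reconstruction of the definition in Mukherjee et al.; the slide's formula image is not reproduced verbatim): for every ε > 0 and every i,
\[
\lim_{n \to \infty} \Pr_S \bigl\{\, \bigl\lvert V(f_{S^i}, z_i) - V(f_S, z_i) \bigr\rvert > \varepsilon \,\bigr\} = 0 ,
\]
with the convergence required uniformly over distributions μ (distribution-independent CVloo stability).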

Equivalence for ERM [1]
Theorem: For good loss functions the following statements are equivalent for ERM:
- L is distribution-independent CVloo stable,
- ERM generalizes and is universally consistent,
- H is a uGC class.
Question: Does CVloo stability ensure generalization for all learning algorithms?
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT, 2003

CVloo Counterexample [1]
- Let X be uniform on [0, 1], Y = {-1, +1}, target f*(x) = 1.
- Learning algorithm L: (the defining rule was given graphically on the original slide)
- There is no change at the removed training point ⇒ L is CVloo stable, yet the algorithm does not generalize at all!
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT, 2003

Additional Stability Criteria
- Error (Eloo) stability and Empirical Error (EEloo) stability.
- Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM).
- Not sufficient for generalization on their own.
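The two definitions also appeared as formula images on the slide; stated as convergence in probability in the notation used here (a reconstruction following Mukherjee et al.), for every ε > 0:
\[
\text{E}_{loo}\ \text{stability:}\quad \lim_{n\to\infty}\Pr_S\bigl\{\, \lvert I[f_S] - I[f_{S^i}] \rvert > \varepsilon \,\bigr\} = 0,
\qquad
\text{EE}_{loo}\ \text{stability:}\quad \lim_{n\to\infty}\Pr_S\bigl\{\, \lvert I_S[f_S] - I_{S^i}[f_{S^i}] \rvert > \varepsilon \,\bigr\} = 0 .
\]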

CVEEEloo Stability
- The learning map L is CVEEEloo stable if it is CVloo stable, Eloo stable, and EEloo stable.
- Question: Does this imply generalization for all L?

CVEEEloo implies Generalization [1]
Theorem: If L is CVEEEloo stable and the loss function is bounded, then f_S generalizes.
Remarks:
- Neither condition (CVloo, Eloo, EEloo) is sufficient by itself.
- Eloo and EEloo stability are not sufficient.
- For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency.
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT, 2003

Consistency
- CVEEEloo stability in general does NOT guarantee consistency.
- Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance.

CVEEEloo Stable Algorithms
- Support vector machines and regularization.
- k-nearest neighbour (with k increasing with n).
- Bagging (with the number of regressors increasing with n).
- More results to come (e.g. AdaBoost).
- For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN).
- For all of these algorithms generalization is guaranteed by the theorems above (an empirical sketch follows below).
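As a rough empirical sketch (not part of the original slides; the toy data, the choice k = sqrt(n), and all function names are assumptions made here), the code below estimates the leave-one-out change in loss at the removed point for k-NN regression. This is the quantity CVloo stability controls, and it shrinks as n and k grow.

```python
import numpy as np

def knn_predict(x_train, y_train, x, k):
    """Plain k-NN regression prediction at a single query point x."""
    idx = np.argsort(np.abs(x_train - x))[:k]
    return np.mean(y_train[idx])

def avg_cvloo_change(n, k, rng, trials=200):
    """Average |V(f_{S^i}, z_i) - V(f_S, z_i)| over randomly chosen points i,
    using the square loss; a crude proxy for the CVloo stability rate."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
    diffs = []
    for _ in range(trials):
        i = rng.integers(n)
        full = (knn_predict(x, y, x[i], k) - y[i]) ** 2             # V(f_S, z_i)
        mask = np.arange(n) != i
        loo = (knn_predict(x[mask], y[mask], x[i], k) - y[i]) ** 2  # V(f_{S^i}, z_i)
        diffs.append(abs(loo - full))
    return np.mean(diffs)

rng = np.random.default_rng(0)
for n in [50, 200, 800, 3200]:
    k = int(np.sqrt(n))   # k growing with n, as on the slide
    print(f"n={n:5d}, k={k:3d}, avg CVloo change: {avg_cvloo_change(n, k, rng):.4f}")
```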

Agenda: Introduction, Problem Definition, Classical Results, Stability Criteria, Conclusion

Implications
- Classical "VC-style" conditions correspond to Occam's Razor: prefer simple hypotheses.
- CVloo stability corresponds to incremental change, as in online algorithms.
- Inverse problems: stability corresponds to well-posedness; condition numbers characterize stability.
- Stability-based learning may have more direct connections with the brain's learning mechanisms, being a condition on the learning machinery.

Language Learning
- Goal: learn grammars from sentences. Hypothesis space: the class of all learnable grammars.
- What is easier to characterize and gives more insight into real language learning: the language learning algorithm, or the class of all learnable grammars?
- A focus on algorithms shifts the focus to stability.

Conclusion
- Stability implies generalization: an intuitive criterion (CVloo) plus technical criteria (Eloo, EEloo).
- The theory subsumes the classical ERM results.
- Generalization criteria also exist for non-ERM algorithms.
- Restrictions are placed on the learning map rather than on the hypothesis space.
- A new approach for designing learning algorithms.

Open Questions
- Easier or alternative necessary and sufficient conditions for generalization.
- Conditions for general consistency.
- Tight bounds on sample complexity.
- Applications of the theory to new algorithms.
- Stability proofs for existing algorithms.

Thank you!

Sources
- T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature 428, pp. 419-422, 2004
- S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo, MIT, 2003
- T. Mitchell: Machine Learning, McGraw-Hill, 1997
- C. Tomasi: Past performance and future results, Nature 428, p. 378, 2004
- N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997