T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory. Michael Pfeiffer
Motivation: Supervised learning: learn functional relationships from a finite set of labelled training examples. Generalization: how well does the learned function perform on unseen test examples? This is the central question in supervised learning.
What you will hear: New idea: stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much. Conditions for generalization are placed on the learning map rather than on the hypothesis space, in contrast to VC analysis.
Agenda: Introduction, Problem Definition, Classical Results, Stability Criteria, Conclusion
Some Definitions 1/2: Training data: $S = \{z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n)\}$ with $z_i \in Z = X \times Y$, drawn from an unknown distribution $\mu(x, y)$. Hypothesis space: $H$. Hypothesis: $f_S \in H$, $f_S: X \to Y$. Learning algorithm: $L: S \mapsto f_S$. Regression: $f_S$ is real-valued; classification: $f_S$ is binary. The learning algorithm is symmetric (the ordering of the training examples is irrelevant).
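Some Definitions 2/2: Loss function: $V(f, z)$, e.g. $V(f, z) = (f(x) - y)^2$. Assume that $V$ is bounded. Empirical error (training error): $I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$. Expected error (true error): $I[f] = \mathbb{E}_z[V(f, z)]$, the expectation over the unknown distribution.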
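The two error notions on this slide can be made concrete with a small sketch. The data distribution below (x uniform on [0, 1], y = 2x plus bounded noise) and all function names are assumptions chosen for illustration only; the expected error is approximated by averaging the loss over a large fresh sample.

```python
import random

def square_loss(f, z):
    """V(f, z) = (f(x) - y)^2 for an example z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_error(f, sample):
    """Average loss of f over a finite sample (training error)."""
    return sum(square_loss(f, z) for z in sample) / len(sample)

def draw_sample(n, rng):
    """Hypothetical distribution: x ~ U[0, 1], y = 2x + noise in [-0.1, 0.1]."""
    sample = []
    for _ in range(n):
        x = rng.random()
        sample.append((x, 2.0 * x + rng.uniform(-0.1, 0.1)))
    return sample

def estimate_expected_error(f, rng, n_test=20000):
    """Monte Carlo stand-in for the expected (true) error of f."""
    return empirical_error(f, draw_sample(n_test, rng))

if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda x: 2.0 * x          # a fixed hypothesis
    train = draw_sample(30, rng)
    print("empirical error:", empirical_error(f, train))
    print("estimated expected error:", estimate_expected_error(f, rng))
```

Because the hypothesis here is fixed in advance rather than learned from the sample, both numbers estimate the same quantity; the interesting case in the talk is when f depends on the training set.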
Generalization and Consistency: Both are statements about convergence in probability. Generalization: performance on the training examples must be a good indicator of performance on future examples. Consistency: the expected error converges to that of the most accurate hypothesis in H.
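Writing $I_S[f]$ for the empirical error and $I[f]$ for the expected error of a hypothesis $f$ (notation assumed here, following the cited papers), the two notions are usually stated roughly as follows. This is a sketch of the common formulation, not a transcription of the slide.

```latex
% Generalization: the training error of the learned hypothesis tracks its expected error
\lim_{n \to \infty} \left| I_S[f_S] - I[f_S] \right| = 0 \quad \text{in probability}

% Consistency: the expected error approaches the best achievable in H
\forall \varepsilon > 0: \quad
\lim_{n \to \infty} \mathbb{P}\left\{ I[f_S] > \inf_{f \in H} I[f] + \varepsilon \right\} = 0
```

Universal consistency asks for the second statement to hold for every underlying distribution.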
Empirical Risk Minimization (ERM): The focus of classical learning theory research (exact and almost ERM). Minimize the training error over H, i.e. take the best hypothesis on the training data. For ERM, generalization and consistency are equivalent.
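A minimal sketch of exact ERM over a finite hypothesis class may help fix the idea. The class of threshold classifiers and the 0-1 loss are assumptions made for this illustration; the function names are hypothetical.

```python
def zero_one_loss(f, z):
    """0-1 loss for a labelled example z = (x, y) with y in {-1, +1}."""
    x, y = z
    return 0.0 if f(x) == y else 1.0

def empirical_error(f, sample):
    """Training error: average loss over the sample."""
    return sum(zero_one_loss(f, z) for z in sample) / len(sample)

def make_threshold(t):
    """Hypothesis h_t(x) = +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

def erm_threshold(sample, thresholds):
    """Exact ERM: return the threshold with the smallest training error.

    Ties are broken in favour of the first candidate in `thresholds`.
    """
    best_t = min(thresholds,
                 key=lambda t: empirical_error(make_threshold(t), sample))
    return best_t, empirical_error(make_threshold(best_t), sample)

if __name__ == "__main__":
    sample = [(0.1, -1), (0.2, -1), (0.6, 1), (0.9, 1)]
    print(erm_threshold(sample, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

The hypothesis space enters only through the list of candidates, which is exactly where the classical conditions (uGC, VC dimension) are imposed.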
What algorithms are ERM? All of these belong to the class of ERM algorithms: least squares regression, decision trees, ANN backpropagation (?), ... Are all learning algorithms ERM? No: support vector machines, k-nearest neighbour, bagging, boosting, regularization, ...
Vapnik asked: What property must the hypothesis space H have to ensure good generalization of ERM?
Classical Results for ERM [1]: Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class. uGC means convergence of the empirical mean to the true expected value, holding uniformly (in probability) over the loss functions induced by H and V. [1] e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of ACM 44, 1997
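For reference, the uGC property is commonly written as below. The notation ($\mu$ for the distribution, $\mathbb{P}_S$ for the probability over training sets of size $n$) is assumed here and is not taken from the slide.

```latex
\forall \varepsilon > 0: \quad
\lim_{n \to \infty} \sup_{\mu} \;
\mathbb{P}_S \left\{ \sup_{f \in H}
\left| \frac{1}{n} \sum_{i=1}^{n} V(f, z_i) - \mathbb{E}_{z \sim \mu} V(f, z) \right|
> \varepsilon \right\} = 0
```

The inner supremum over $f \in H$ is what makes the convergence uniform over the hypothesis space; the outer supremum over $\mu$ makes it distribution-independent.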
VC-Dimension: Binary functions $f: X \to \{0, 1\}$. VC-dim(H) = size of the largest finite set in X that can be shattered by H, e.g. linear separation in 2D yields VC-dim = 3. Theorem [1]: Let H be a class of binary-valued hypotheses; then H is a uGC class if and only if VC-dim(H) is finite. [1] Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of ACM 44, 1997
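Shattering can be checked by brute force when both the domain and the hypothesis class are finite. The sketch below is an illustration under that assumption: it computes the VC dimension restricted to a small finite domain, using one-dimensional thresholds and intervals as hypothetical example classes (not the 2D linear separators mentioned on the slide).

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True if the hypotheses realise all 2^k labelings of the k points."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Largest k such that some k-subset of `domain` is shattered.

    Stops at the first k with no shattered subset: every subset of a
    shattered set is shattered, so no larger set can be shattered either.
    """
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(domain, k)):
            d = k
        else:
            break
    return d

def make_threshold(t):
    return lambda x: 1 if x >= t else 0

def make_interval(a, b):
    return lambda x: 1 if a <= x <= b else 0

if __name__ == "__main__":
    domain = [1, 2, 3, 4]
    grid = [0.5, 1.5, 2.5, 3.5, 4.5]
    thresholds = [make_threshold(t) for t in grid]
    intervals = [make_interval(a, b) for a in grid for b in grid]
    print("thresholds:", vc_dimension(domain, thresholds))
    print("intervals:", vc_dimension(domain, intervals))
```

The search is exponential in the subset size, so this is only useful for toy classes.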
Achievements of Classical Learning Theory: Complete characterization of necessary and sufficient conditions for generalization and consistency of ERM. Remaining questions: What about non-ERM algorithms? Can we establish criteria that are not only conditions on the hypothesis space?
Poggio et al. asked: What property must the learning map L have for good generalization of general algorithms? Can a new theory subsume the classical results for ERM?
Stability: Small perturbations of the training set should not change the hypothesis much, especially deleting one training example: $S^i = S \setminus \{z_i\}$. How can this be defined mathematically? [Figure: the learning map takes the original training set $S$ and the perturbed training set $S^i$ into the hypothesis space.]
Uniform Stability [1]: A learning algorithm L is uniformly stable if, after deleting one training sample, the change in the loss is small at all points $z \in Z$. Uniform stability implies generalization, but the requirement is too strong: most algorithms (e.g. ERM) are not uniformly stable. [1] Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001
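The condition described in words above is usually written as follows. This is a sketch in the notation of the surrounding slides ($S^i$ is the training set with $z_i$ removed); the stability constant $\beta$ is a symbol assumed here.

```latex
\forall S \in Z^n, \; \forall i \in \{1, \ldots, n\}: \quad
\sup_{z \in Z} \left| V(f_S, z) - V(f_{S^i}, z) \right| \le \beta
```

Here $\beta$ must shrink to zero as $n$ grows, and generalization bounds typically ask for a rate of order $1/n$. The supremum over all $z \in Z$ and the quantifier over all training sets are what make the requirement so strong.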
CVloo stability [1]: Cross-validation leave-one-out stability considers only the errors at the removed training points and is strictly weaker than uniform stability. [Figure: remove $z_i$, then compare the error at $x_i$.] [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
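The quantity that CVloo stability controls, $|V(f_{S^i}, z_i) - V(f_S, z_i)|$, can be measured empirically. The sketch below does this for one-dimensional regularized least squares through the origin; the algorithm, the data distribution, and the regularization parameter `lam` are all assumptions picked for illustration, not taken from the talk.

```python
import random

def fit_ridge_1d(sxy, sxx, n, lam):
    """Minimiser of (1/n) * sum (w*x - y)^2 + lam * w^2, from sufficient statistics."""
    return sxy / (sxx + lam * n)

def square_loss(w, z):
    x, y = z
    return (w * x - y) ** 2

def max_cvloo_change(sample, lam=0.1):
    """Largest |V(f_{S^i}, z_i) - V(f_S, z_i)| over all removed points z_i.

    Requires at least two training points.
    """
    n = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    w_full = fit_ridge_1d(sxy, sxx, n, lam)
    worst = 0.0
    for x, y in sample:
        # Refit without (x, y) by subtracting its contribution to the sums.
        w_loo = fit_ridge_1d(sxy - x * y, sxx - x * x, n - 1, lam)
        change = abs(square_loss(w_loo, (x, y)) - square_loss(w_full, (x, y)))
        worst = max(worst, change)
    return worst

def draw_sample(n, rng):
    """Hypothetical data: x ~ U[0, 1], y = x + noise in [-0.1, 0.1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, x + rng.uniform(-0.1, 0.1)) for x in xs]

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (20, 200, 2000):
        print(n, max_cvloo_change(draw_sample(n, rng)))
```

For this regularized algorithm the measured change should shrink as n grows, which is the behaviour the definition asks for (in probability, as n tends to infinity).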
Equivalence for ERM [1]: Theorem: For good loss functions the following statements are equivalent for ERM: (a) L is distribution-independent CVloo stable; (b) ERM generalizes and is universally consistent; (c) H is a uGC class. Question: Does CVloo stability ensure generalization for all learning algorithms? [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
CVloo Counterexample [1]: Let $X$ be uniform on $[0, 1]$, $Y \in \{-1, +1\}$, and the target be $f^*(x) = 1$. The learning algorithm L is constructed so that the hypothesis does not change at the removed training point, hence L is CVloo stable. Nevertheless, the algorithm does not generalize at all! [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
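The slide's formula for the learning algorithm did not survive extraction. One construction that matches the description (an assumption that may differ in detail from the original) lets the hypothesis depend on the parity of n: it outputs $(-1)^n$ on training points and $(-1)^{n+1}$ everywhere else. The sketch below simulates this rule with the 0-1 loss; all function names are hypothetical.

```python
import random

def learn(sample):
    """Assumed rule: (-1)^n on training inputs, (-1)^(n+1) elsewhere."""
    n = len(sample)
    train_xs = {x for x, _ in sample}
    on_train = 1 if n % 2 == 0 else -1
    return lambda x: on_train if x in train_xs else -on_train

def zero_one_loss(f, z):
    x, y = z
    return 0.0 if f(x) == y else 1.0

def empirical_error(f, sample):
    return sum(zero_one_loss(f, z) for z in sample) / len(sample)

def draw_sample(n, rng):
    """x ~ U[0, 1], target f*(x) = 1, so every label is +1."""
    return [(rng.random(), 1) for _ in range(n)]

def max_cvloo_change(sample):
    """Largest change in loss at a removed point z_i after deleting it."""
    f_full = learn(sample)
    worst = 0.0
    for i, z in enumerate(sample):
        f_loo = learn(sample[:i] + sample[i + 1:])
        worst = max(worst, abs(zero_one_loss(f_loo, z) - zero_one_loss(f_full, z)))
    return worst

if __name__ == "__main__":
    rng = random.Random(1)
    sample = draw_sample(10, rng)
    f = learn(sample)
    print("CVloo change:", max_cvloo_change(sample))
    print("training error:", empirical_error(f, sample))
    # Fresh points stand in for the expected error.
    print("test error:", empirical_error(f, draw_sample(5000, rng)))
```

Removing $z_i$ flips the parity and also turns $x_i$ into a non-training point, so the two sign changes cancel and the prediction at $x_i$ stays the same. Training and test error, however, always differ by 1.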
Additional Stability Criteria: Error (Eloo) stability and empirical error (EEloo) stability. These are weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM), but they are not sufficient for generalization.
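As a sketch of what these two criteria ask for, in the notation assumed here ($I[f]$ for the expected error, $I_S[f]$ for the empirical error on $S$, and $S^i$ for $S$ with $z_i$ removed), they can be written as limits in probability. The exact form in the cited papers uses explicit constants $\beta$ and $\delta$ and may differ in detail.

```latex
% E_loo stability: the expected error barely moves when one point is removed
\lim_{n \to \infty} \sup_{i \in \{1, \ldots, n\}}
\left| I[f_S] - I[f_{S^i}] \right| = 0 \quad \text{in probability}

% EE_loo stability: the empirical error barely moves when one point is removed
\lim_{n \to \infty} \sup_{i \in \{1, \ldots, n\}}
\left| I_S[f_S] - I_{S^i}[f_{S^i}] \right| = 0 \quad \text{in probability}
```

Both conditions compare a global error of the full hypothesis with that of the leave-one-out hypothesis, whereas CVloo stability compares the two losses at the single removed point.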
CVEEEloo Stability: A learning map L is CVEEEloo stable if it is CVloo stable, Eloo stable, and EEloo stable. Question: Does this imply generalization for all L?
CVEEEloo implies Generalization [1]: Theorem: If L is CVEEEloo stable and the loss function is bounded, then $f_S$ generalizes. Remarks: None of the conditions (CV, E, EE) is sufficient by itself. Eloo and EEloo stability together are not sufficient either. For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency. [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
Consistency: CVEEEloo stability in general does NOT guarantee consistency. Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance.
CVEEEloo stable algorithms: Support vector machines and regularization; k-nearest neighbour (k increasing with n); bagging (number of regressors increasing with n). More results to come (e.g. AdaBoost). For some of these algorithms a "VC-style" analysis is impossible (e.g. k-NN). For all of these algorithms generalization is guaranteed by the theorems shown!
Implications: Classical "VC-style" conditions correspond to Occam's razor: prefer simple hypotheses. CVloo stability corresponds to incremental change (online algorithms). Inverse problems: stability relates to well-posedness, and condition numbers characterize stability. Stability-based learning may have more direct connections with the brain's learning mechanisms, since it is a condition on the learning machinery.
Language Learning: Goal: learn grammars from sentences. Hypothesis space: the class of all learnable grammars. What is easier to characterize and gives more insight into real language learning: the language learning algorithm, or the class of all learnable grammars? A focus on algorithms shifts the focus to stability.
Conclusion: Stability implies generalization, via intuitive (CVloo) and technical (Eloo, EEloo) criteria. The theory subsumes the classical ERM results. Generalization criteria are also available for non-ERM algorithms. Restrictions are placed on the learning map rather than on the hypothesis space. This is a new approach for designing learning algorithms.
Open Questions: Easier or other necessary and sufficient conditions for generalization. Conditions for general consistency. Tight bounds for sample complexity. Applications of the theory to new algorithms. Stability proofs for existing algorithms.
Thank you!
Sources:
T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory, Nature, Vol. 428, 2004
S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization, AI Memo, MIT, 2003
T. Mitchell: Machine Learning, McGraw-Hill, 1997
C. Tomasi: Past Performance and Future Results, Nature, Vol. 428, p. 378, 2004
N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of ACM 44(4), 1997