
The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge Shai Ben-David Technion Haifa, Israel

Introduction
The complexity of learning is measured mainly along two axes: information and computation. Information complexity is concerned with the generalization performance of learning: how many training examples are needed, and what is the convergence rate of a learner's estimate to the true population parameters? The computational complexity of learning is concerned with the computation applied to the data in order to deduce from it the learner's hypothesis. It seems that when an algorithm improves with respect to one of these measures, it deteriorates with respect to the other.

Outline of this Talk
1. Some background.
2. A survey of recent pessimistic computational hardness results.
3. A discussion of three different directions for solutions:
   a. The Support Vector Machines approach.
   b. The Boosting approach (an agnostic learning variant).
   c. Algorithms that are efficient for 'well behaved' inputs.

The Label Prediction Problem
Formal definition: given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution; for the next point x, predict its label.
Example: data is extracted from grant applications, and the question is whether the current application should be funded. Applications in the sample are labeled by the success/failure of the resulting projects.

Two Basic Competing Models
PAC framework: sample labels are consistent with some h in H; the learner's hypothesis is required to meet an absolute upper bound on its error.
Agnostic framework: no prior restriction on the sample labels; the required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).
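To make the two requirements concrete, here is one standard way to write them (a sketch in the usual notation, not taken verbatim from the slides: Er_P denotes the true error under the data distribution P, A(S) the learner's output, and ε, δ the accuracy and confidence parameters):

```latex
% PAC framework: labels are consistent with some target in H,
% and the learner must meet an absolute error bound.
\Pr_{S \sim P^m}\bigl[\, \mathrm{Er}_P(A(S)) \le \varepsilon \,\bigr] \;\ge\; 1 - \delta

% Agnostic framework: no assumption on the labels;
% the learner only has to compete with the best hypothesis in H.
\Pr_{S \sim P^m}\bigl[\, \mathrm{Er}_P(A(S)) \le \min_{h \in H} \mathrm{Er}_P(h) + \varepsilon \,\bigr] \;\ge\; 1 - \delta
```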

Basic Agnostic Learning Paradigm
- Choose a hypothesis class H of subsets of X.
- For an input sample S, find some h in H that fits S well.
- For a new point x, predict a label according to its membership in h.
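As an illustration only (not from the talk), here is a minimal empirical-risk-minimization sketch of this paradigm in Python. It assumes a small finite hypothesis class of threshold functions over the reals; the class, the toy data, and the helper names are all hypothetical:

```python
import random

# Hypothetical finite hypothesis class: threshold functions h_t(x) = 1 iff x >= t.
THRESHOLDS = [t / 10 for t in range(-50, 51)]

def threshold_hypothesis(t):
    return lambda x: 1 if x >= t else 0

def empirical_error(h, sample):
    """Fraction of labeled points (x, y) in the sample that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(sample):
    """Return the hypothesis in the class with the smallest empirical error on the sample."""
    best_t = min(THRESHOLDS, key=lambda t: empirical_error(threshold_hypothesis(t), sample))
    return threshold_hypothesis(best_t)

# Toy usage: noisy labels generated around a true threshold of 1.0.
random.seed(0)
sample = [(x, int(x >= 1.0) if random.random() > 0.1 else 1 - int(x >= 1.0))
          for x in [random.uniform(-5, 5) for _ in range(200)]]
h = erm(sample)
print("training error:", empirical_error(h, sample))
```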

The Mathematical Justification
Assume both the training sample and the test point are generated by the same distribution over X × {0,1}. Then, if H is not too rich (e.g., has small VC-dimension), for every h in H the agreement ratio of h on the sample S is a good estimate of its probability of success on a new point x.
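One standard quantitative form of this statement (not spelled out on the slide, and stated only up to an unspecified constant c) is the uniform convergence bound in terms of the VC-dimension d of H and the sample size m = |S|:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size m,
% simultaneously for every h in H:
\bigl|\, \mathrm{Er}_P(h) - \mathrm{Er}_S(h) \,\bigr|
  \;\le\; c \,\sqrt{\frac{d \,\ln(2m/d) + \ln(4/\delta)}{m}}
```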

The Computational Problem
Input: a finite set S of {0,1}-labeled points in R^n.
Output: some 'hypothesis' function h in H that maximizes the number of correctly classified points of S.

We Shall Focus on the Class of Linear Half-spaces
- Find the best hyperplane for separable S: feasible (Perceptron algorithms).
- Find the best hyperplane for arbitrary samples S: NP-hard.
- Find a hyperplane approximating the optimal one for arbitrary S: ?
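For the separable case above, here is a minimal sketch of the classical Perceptron algorithm in Python (an illustration under the assumption that the input really is linearly separable; the function names and toy data are not from the talk):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Classical Perceptron: X is an (m, n) array of points, y is in {-1, +1}^m.
    Returns a weight vector w and bias b with sign(w.x + b) = y on all points,
    provided the data are linearly separable and max_epochs is large enough."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (X[i] @ w + b) <= 0:   # point i is misclassified (or on the boundary)
                w += y[i] * X[i]             # standard additive update
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # converged: all points correctly classified
            return w, b
    return w, b                              # may not separate if the data are not separable

# Toy usage on separable data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1, -1)
w, b = perceptron(X, y)
print("training mistakes:", int(np.sum(np.sign(X @ w + b) != y)))
```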

Hardness-of-Approximation Results
For each of the following classes, approximating the best agreement rate of h in H (on a given input sample S) up to some constant ratio is NP-hard:
Monomials, Monotone Monomials, Balls, Axis-aligned Rectangles, Half-spaces, Threshold NNs with constant first-layer width.
[BD-Eiron-Long; Bartlett-BD]

Gaps in Our Knowledge
The additive constants in the hardness-of-approximation results are 1%-2%. They do not rule out efficient algorithms achieving, say, 90% (of the optimal success rate). However, currently there are no efficient algorithms performing significantly above 50% (of the optimal success rate).

We shall discuss three solution paradigms:
- Kernel-based methods (including Support Vector Machines).
- Boosting (adapted to the agnostic setting).
- Data-Dependent Success Approximation algorithms.

The Types of Errors to be Considered
[Diagram] The output of the learning algorithm is compared to the best regressor for the distribution D; the gap decomposes into the approximation error of the class H, the estimation error, and the computational error.
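One common way to write this decomposition explicitly (my reading of the diagram, not a formula from the slides), with h_A the algorithm's output, ĥ the hypothesis an idealized (computationally unbounded) learner would pick from H given S, h* the best hypothesis in H, and f* the best regressor for D:

```latex
\mathrm{Er}_D(h_A) - \mathrm{Er}_D(f^{*})
\;=\;
\underbrace{\mathrm{Er}_D(h_A) - \mathrm{Er}_D(\hat{h})}_{\text{computational error}}
\;+\;
\underbrace{\mathrm{Er}_D(\hat{h}) - \mathrm{Er}_D(h^{*})}_{\text{estimation error}}
\;+\;
\underbrace{\mathrm{Er}_D(h^{*}) - \mathrm{Er}_D(f^{*})}_{\text{approximation error}}
```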

The Boosting Solution: Basic Idea
"Extend the concept class as much as can be done without hurting generalizability."

The Boosting Idea
Given a hypothesis class H and a labeled sample S, rather than searching for a good hypothesis in H, search in a larger class Co(H) (the convex hull of H: weighted majority votes over hypotheses from H). Important gains:
1) A fine approximation can be found in Co(H) in time polynomial in the time of finding a coarse approximation in H.
2) The generalization bounds do not deteriorate when moving from H to Co(H).
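Formally (in my notation, consistent with the gloss above; for classification one thresholds the combined value):

```latex
\mathrm{Co}(H) \;=\;
\Bigl\{\, x \mapsto \sum_{i=1}^{k} \alpha_i\, h_i(x)
   \;:\; k \in \mathbb{N},\; h_i \in H,\; \alpha_i \ge 0,\; \sum_{i=1}^{k} \alpha_i = 1 \Bigr\}
```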

Boosting Solution: Weak Learners
An algorithm is a γ-weak learner for a class H if, on every H-labeled weighted sample S, it outputs some h in H so that Er_S(h) < 1/2 - γ.

Boosting Solution: the Basic Result
Theorem [Schapire '89, Freund '90]: There is an algorithm that, having access to an efficient γ-weak learner, for a P-random H-sample S and parameters ε, δ, finds some h in Co(H) so that Er_P(h) < ε with probability ≥ 1 - δ, in time polynomial in 1/ε and 1/δ (and |S|).
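As a concrete, purely illustrative instance of such an algorithm (not from the talk), here is a compact AdaBoost sketch in Python. It assumes access to a `weak_learner(X, y, weights)` callback that returns a hypothesis in H with weighted error below 1/2; all names are placeholders:

```python
import numpy as np

def adaboost(X, y, weak_learner, rounds=50):
    """AdaBoost (Freund & Schapire). Labels must be in {-1, +1}.
    weak_learner(X, y, w) should return a callable h with weighted error < 1/2."""
    y = np.asarray(y)
    m = len(y)
    weights = np.full(m, 1.0 / m)                  # distribution over the sample
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, weights)
        preds = np.array([h(x) for x in X])
        err = float(np.sum(weights[preds != y]))   # weighted training error of h
        if err >= 0.5:                             # weak-learning assumption violated; stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        weights = weights * np.exp(-alpha * y * preds)   # up-weight points h got wrong
        weights /= weights.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def majority_vote(x):
        # A thresholded convex combination of hypotheses, i.e. an element of Co(H).
        return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1

    return majority_vote
```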

The Boosting Solution in Practice
The boosting approach was embraced by practitioners of Machine Learning and applied, quite successfully, to a wide variety of real-life problems.

Theoretical Problems with the Boosting Solution
The boosting results assume that the input sample labeling is consistent with some function in H (the PAC framework assumption); in practice this is never the case. The boosting algorithm's success is based on having access to an efficient weak learner; no such learner exists.

Boosting Theory: Attempts to Recover
Can one settle for weaker, realistic assumptions? Agnostic weak learners: an algorithm is a β-weak agnostic learner for H if, for every labeled sample S, it finds h in H s.t. Er_S(h) < Er_S(Opt(H)) + β.

Revised Boosting Solution
Theorem [B-D, Long, Mansour]: There is an algorithm that, having access to a β-weak agnostic learner, computes an h s.t. Er_P(h) < c · Er_P(Opt(H))^{c'} (where c and c' are constants depending on β, and h is in Co(H)).

Problems with the Boosting Solution
Only for a restricted family of classes are there known efficient agnostic weak learners. The generalization bound we currently have contains an annoying exponentiation of the optimal error. Can this be improved?

The SVM Solution
"Extend the hypothesis class to guarantee computational feasibility." Rather than bothering with non-separable data, make the data separable by embedding it into some high-dimensional R^n.

The SVM Paradigm
- Choose an embedding of the domain X into some high-dimensional Euclidean space, so that the data sample becomes (almost) linearly separable.
- Find a large-margin data-separating hyperplane in this image space, and use it for prediction.
Important gain: when the data is separable, finding such a hyperplane is computationally feasible.
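For illustration only (not part of the talk), the kernel trick lets one work in the high-dimensional image space implicitly. A minimal sketch using scikit-learn's SVC with an RBF kernel on hypothetical toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space:
# label is +1 inside a disk, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)

# The RBF kernel corresponds to an implicit embedding into a high-dimensional
# space in which a large-margin separating hyperplane is sought; C controls the
# trade-off between margin size and training errors.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```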

The SVM Solution in Practice
The SVM approach is embraced by practitioners of Machine Learning and applied, very successfully, to a wide variety of real-life problems.

A Potential Problem: Generalization
- VC-dimension bounds: the VC-dimension of the class of half-spaces in R^n is n+1. Can we guarantee low dimension of the embedding's range?
- Margin bounds: regardless of the Euclidean dimension, generalization can be bounded as a function of the margins of the hypothesis hyperplane. Can one guarantee the existence of a large-margin separation?
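A typical margin bound of this kind, stated loosely and up to constants (this particular form is my paraphrase, not taken from the slides): if the data lie in a ball of radius R and the hyperplane h classifies the whole sample S of size m correctly with margin at least γ, then with probability at least 1 - δ,

```latex
\mathrm{Er}_P(h) \;\le\;
O\!\left( \sqrt{ \frac{ (R/\gamma)^2 \,\log^2 m \;+\; \log(1/\delta) }{ m } } \right)
```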

An Inherent Limitation of SVMs
- In "most" cases the data cannot be made separable unless the mapping is into dimension Ω(|X|). This happens even for classes of small VC-dimension.
- For "most" classes, no mapping for which concept-classified data becomes separable has large margins.
In both cases the generalization bounds are lost!

A Third Proposal for a Solution: Data-Dependent Success Approximations
- Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H.
- We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

Data-Dependent Success Approximations
- While Boosting, as well as kernel-based methods, extend the class from which the algorithm picks its hypothesis, there is a natural alternative for circumventing the hardness-of-approximation results: shrinking the comparison class.
- Our DDSA algorithms do it by imposing margins on the comparison class hypotheses.

Data-Dependent Success Definition for Half-spaces
A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0,1},
|{(x,y) ∈ S : A(S)(x) = y}| ≥ |{(x,y) ∈ S : h(x) = y and d(h,x) > μ}|
for every half-space h. (That is, A must correctly classify at least as many sample points as any half-space classifies correctly with margin exceeding μ.)

Some Intuition
- If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator.
- On the other hand, if every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

The Positive Result
For every positive μ there is a μ-margin algorithm whose running time is polynomial in |S| and n.
A Complementing Hardness Result
Unless P = NP, no algorithm can do this in time polynomial in 1/μ (as well as in |S| and n).

Some Obvious Open Questions
- Is there a parameter that can be used to ensure good generalization for kernel-based (SVM-like) methods?
- Are there efficient agnostic weak learners for potent hypothesis classes?
- Is there an inherent trade-off between the generalization ability and the computational complexity of algorithms?

THE END

"Old" Work
- Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node neural network. Similar hardness-of-optimization results for other classes followed. But learning can settle for less than optimization.
- Efficient algorithms: known Perceptron algorithms are efficient for linearly separable input data (or the image of such data under "tamed" noise). But natural data sets are usually not separable.

A μ-margin Perceptron Algorithm
- On input S, consider all k-size sub-samples.
- For each such sub-sample, find its largest-margin separating hyperplane.
- Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.
(The choice of k is a function of the desired margin μ; k ~ 1/μ².) A sketch of this procedure is given below.
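Here is an illustrative (and deliberately naive) Python sketch of the procedure above, using scikit-learn's SVC with a large C as a stand-in for the "largest-margin separating hyperplane" step on each sub-sample; the function names, the toy data, and the choice of k are placeholders, not the talk's actual implementation:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def mu_margin_perceptron(X, y, k):
    """Exhaustive sketch: for every k-point sub-sample, fit a (near) hard-margin
    linear separator, then keep the hyperplane that classifies the most points
    of the full sample S correctly."""
    best_clf, best_correct = None, -1
    for idx in combinations(range(len(X)), k):
        Xs, ys = X[list(idx)], y[list(idx)]
        if len(set(ys)) < 2:               # SVC needs both labels present
            continue
        clf = SVC(kernel="linear", C=1e6)  # large C ~ hard-margin separator of the sub-sample
        clf.fit(Xs, ys)
        correct = int(np.sum(clf.predict(X) == y))
        if correct > best_correct:
            best_clf, best_correct = clf, correct
    return best_clf, best_correct

# Toy usage (tiny sample and k, since the loop enumerates ~|S|^k sub-samples).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))
y = np.where(X[:, 0] - X[:, 1] > 0.1, 1, -1)
clf, correct = mu_margin_perceptron(X, y, k=3)
print("points classified correctly:", correct, "of", len(y))
```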

Other μ-margin Algorithms
Each of the following algorithms can replace the "find the largest-margin separating hyperplane" step:
- The usual "Perceptron Algorithm".
- "Find a point of equal distance from x_1, ..., x_k".
- Phil Long's ROMMA algorithm.
These are all very fast online algorithms.