

2 Credentials Our understanding of this topic is based on the work of many researchers. In particular: Rosa Arriaga, Peter Bartlett, Avrim Blum, Bhaskar DasGupta, Nadav Eiron, Barbara Hammer, David Haussler, Klaus Höffgen, Lee Jones, Michael Kearns, Christian Kuhlman, Phil Long, Ron Rivest, Hava Siegelmann, Hans Ulrich Simon, Eduardo Sontag, Leslie Valiant, Kevin Van Horn, Santosh Vempala, Van Vu.

3 Introduction Neural Nets are the most popular, effective, practical learning tools. Yet, after almost 40 years of research, there are no efficient algorithms for learning with NN's. WHY?

4 Outline of this Talk 1. Some background. 2. Survey of recent strong hardness results. 3. New efficient learning algorithms for some basic NN architectures.

5 The Label Prediction Problem
Formal definition: given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution. For the next point x, predict its label.
Example: data files of drivers. Drivers in the sample are labeled according to whether they filed an insurance claim. Will the customer you interview file a claim?

6 The Agnostic Learning Paradigm
- Choose a hypothesis class H of subsets of X.
- For an input sample S, find some h in H that fits S well.
- For a new point x, predict a label according to its membership in h.

7 The Mathematical Justification If H is not too rich (has small VC-dimension) then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.

8 The Mathematical Justification - Formally If S is sampled i.i.d. by some distribution D over X × {0, 1}, then with probability > 1 − δ, for every h in H the agreement ratio of h on S is close to its probability of success on a new example drawn from D.
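The bound referred to here is of the standard VC uniform-convergence type; one common form (the exact constants vary from source to source) is:

```latex
\Pr_{S \sim D^m}\left[\; \forall h \in H:\;
  \big|\mathrm{Err}_D(h) - \mathrm{Err}_S(h)\big|
  \le \sqrt{\frac{c\left(\mathrm{VCdim}(H)\,\ln m + \ln(1/\delta)\right)}{m}}
\;\right] \ge 1 - \delta
```

Here Err_S(h) is one minus the agreement ratio on the sample, Err_D(h) is the probability of error on a fresh example, m = |S|, and c is an absolute constant.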

9 A Comparison to ‘Classic’ PAC
PAC framework: sample labels are consistent with some h in H; the learner's hypothesis is required to meet an absolute upper bound on its error.
Agnostic framework: no prior restriction on the sample labels; the required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).

10 The Model Selection Issue [Diagram: the class H, the best regressor for P, and the output of the learning algorithm, with the gaps between them labeled approximation error, estimation error, and computational error.]

11 The Big Question Are there hypothesis classes that are: 1. Expressive (small approximation error). 2. Of small VC-dimension (small generalization error). 3. Equipped with efficient good-approximation algorithms. NN's are quite successful as approximators (property 1). If they are small (relative to the data size) then they also satisfy property 2. We investigate property 3 for such NN's.

12 The Computational Problem For some class H of domain subsets: Input: a finite set S of {0, 1}-labeled points in R^n. Output: some h in H that maximizes the number of correctly classified points of S.
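As a toy illustration (not part of the talk), the following Python sketch brute-forces this agreement-maximization objective over a deliberately tiny hypothesis class, single-coordinate threshold functions; for rich classes such as half-spaces or small NN's, this exhaustive optimization is exactly what the hardness results below show to be intractable.

```python
import numpy as np

def best_axis_threshold(S_points, S_labels):
    """Brute-force agreement maximization over a tiny hypothesis class:
    h(x) = 1 iff x[axis] >= threshold (or the reversed rule).
    Illustrates the objective only; not an efficient method for rich classes."""
    n_points, n_dims = S_points.shape
    best = (None, -1)  # (hypothesis description, number of correctly classified points)
    for axis in range(n_dims):
        # Candidate thresholds: the observed coordinate values on this axis.
        for thr in np.unique(S_points[:, axis]):
            for sign in (+1, -1):
                preds = (sign * (S_points[:, axis] - thr) >= 0).astype(int)
                agreement = int((preds == S_labels).sum())
                if agreement > best[1]:
                    best = ((axis, float(thr), sign), agreement)
    return best

# Example usage on a small random sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (X[:, 1] > 0.2).astype(int)
print(best_axis_threshold(X, y))
```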

13 “Old” Work Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN. Similar hardness-of-optimization results for other classes followed. But learning can settle for less than optimization. Efficient algorithms: known perceptron algorithms are efficient for linearly separable input data (or the image of such data under ‘tamed’ noise). But natural data sets are usually not separable.

14 The Focus of this Tutorial The results mentioned above (Blum and Rivest, etc.) show that for many “natural” NN's, finding such an S-optimal h in H is NP-hard. Are there efficient algorithms that output good approximations to the S-optimal hypothesis?

15 Hardness-of-Approximation Results [BD-Eiron-Long, Bartlett-BD] For each of the following classes there exists some constant such that approximating the best agreement rate for h in H (on a given input sample S) up to this constant ratio is NP-hard: monomials, constant-width monotone monomials, half-spaces, balls, axis-aligned rectangles, and threshold NN's with constant 1st-layer width.

16 How Significant are Such Hardness Results? All the above results are proved via reductions from some known-to-be-hard problem.

17 Relevant Questions 1. Samples that are hard for one H are easy for another (a model selection issue). 2. Where do ‘naturally generated’ samples fall?

18 Data-Dependent Success Note that the definition of success for agnostic learning is data-dependent; the success rate of the learner on S is compared to that of the best h in H. We extend this approach to a data-dependent success definition for approximations; the required success rate is a function of the input data.

19 A New Success Criterion A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0, 1}, |{(x, y) ∈ S : A(S)(x) = y}| > |{(x, y) ∈ S : h(x) = y and d(h, x) > μ}| for every h ∈ H.

20 Some Intuition If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator. On the other hand, if every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

21 First Virtue of the New Measure The μ-margin requirement is a rigorous performance guarantee that can be achieved by efficient algorithms (unlike the common approximate-optimization requirement).

22 Another Appealing Feature of the New Criterion It turns out that for each of the three classes analysed so far (half-spaces, balls and hyper-rectangles), there exists a critical margin value such that μ-margin learnability is NP-hard for every μ below it, while, on the other hand, for every μ above it there exists a poly-time μ-margin learning algorithm.

23 A New Positive Result [B-D, Simon] For every positive μ there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > μ.

24 A Complementing Hardness Result The positive result: for every positive μ there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > μ. The complementing hardness result: unless P = NP, no algorithm can do this in time polynomial in 1/μ (as well as in |S| and n).

25 Proof of the Positive Result (Outline) We apply the following chain of reductions: Best Separating Hyperplane → Best Separating Homogeneous Hyperplane → Densest Hemisphere (unsupervised input) → Densest Open Ball.

26 The Densest Open Ball Problem Input: a finite set P of points on the unit sphere S^{n-1}. Output: an open ball B of radius 1 so that |B ∩ P| is maximized.

27 Algorithms for the Densest Open Ball Problem Alg. 1: for every x_1, …, x_n ∈ P, find the center Z(x_1, …, x_n) of their minimal enclosing ball and check |B[Z(x_1, …, x_n), 1] ∩ P|. Output the ball with maximum intersection with P. Running time: ~|P|^n, exponential in n!

28 Another Algorithm (for the Densest Open Ball Problem) Alg. 2: fix a parameter k << n and apply Alg. 1 only to subsets of size at most k, i.e., for every x_1, …, x_k ∈ P, find the center Z(x_1, …, x_k) of their minimal enclosing ball and check |B[Z(x_1, …, x_k), 1] ∩ P|. Output the ball with maximum intersection with P. Running time: ~|P|^k. But does it output a good hypothesis?
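A minimal Python sketch of Alg. 2 (setting k = |P| degenerates it into Alg. 1). The exact minimal-enclosing-ball computation is replaced here by the simple Bădoiu–Clarkson iteration, used only as an illustrative stand-in for an exact solver:

```python
import itertools
import numpy as np

def approx_meb_center(points, iters=100):
    """Approximate center of the minimal enclosing ball of `points`
    via the Badoiu-Clarkson iteration (a stand-in for an exact solver)."""
    c = points[0].astype(float)
    for i in range(1, iters + 1):
        # Move a shrinking step toward the currently farthest point.
        far = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (far - c) / (i + 1)
    return c

def densest_open_ball_k(P, k, radius=1.0):
    """Alg. 2 sketch: enumerate subsets of size <= k, take the (approximate)
    minimal-enclosing-ball center of each, and keep the open unit ball around
    it that covers the most points of P.  Running time ~|P|^k per-subset work."""
    best_center, best_count = None, -1
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(len(P)), size):
            c = approx_meb_center(P[list(subset)])
            count = int((np.linalg.norm(P - c, axis=1) < radius).sum())
            if count > best_count:
                best_center, best_count = c, count
    return best_center, best_count

# Example usage on points of the unit sphere S^{n-1}.
rng = np.random.default_rng(1)
P = rng.normal(size=(30, 4))
P /= np.linalg.norm(P, axis=1, keepdims=True)
print(densest_open_ball_k(P, k=2))
```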

29 Our Core Mathematical Result The following is a local approximation result. It shows that computations from local data (k-size subsets) can approximate global computations, with a precision guarantee depending only on the local parameter k. Theorem (informally): for every k < n and x_1, …, x_n on the unit sphere S^{n-1}, there exists a k-size subset for which the locally computed solution approximates the global one, with an error bound that depends only on k.

30 The Resulting Perceptron Algorithm On input S, consider all k-size sub-samples. For each such sub-sample, find its largest-margin separating hyperplane. Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S. (The choice of k is a function of the desired margin μ.)
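A rough Python sketch of this scheme, not the authors' implementation: scikit-learn's linear SVC with a large C stands in for the exact largest-margin separator on each sub-sample, and the value of k in the usage example is an arbitrary illustrative choice rather than the margin-dependent one the analysis prescribes.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def k_subsample_perceptron(X, y, k):
    """For every k-size sub-sample containing both labels, fit a (nearly)
    hard-margin linear separator on it, then keep the separator that
    classifies the most points of the full sample."""
    best_clf, best_agreement = None, -1
    for idx in itertools.combinations(range(len(X)), k):
        idx = list(idx)
        if len(set(y[idx])) < 2:          # SVC needs both classes present
            continue
        clf = SVC(kernel="linear", C=1e6).fit(X[idx], y[idx])
        agreement = int((clf.predict(X) == y).sum())
        if agreement > best_agreement:
            best_clf, best_agreement = clf, agreement
    return best_clf, best_agreement

# Example usage with an arbitrary, illustrative choice of k.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = (X @ np.array([1.0, -0.5, 0.2]) > 0).astype(int)
clf, agreement = k_subsample_perceptron(X, y, k=3)
print(agreement, "of", len(X), "points classified correctly")
```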

31 A Different, Randomized, Algorithm Avrim Blum noticed that the ‘randomized projection’ algorithm of Rosa Arriaga and Santosh Vempala ’99 achieves, with high probability, a performance similar to that of our algorithm.
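For intuition only (the Arriaga–Vempala algorithm itself involves more than this), the random-projection step it is built on can be sketched as follows: multiply the data by a random Gaussian matrix whose target dimension depends only on the desired margin; this approximately preserves inner products, and hence margins, with high probability.

```python
import numpy as np

def random_projection(X, target_dim, seed=0):
    """Project rows of X from R^n to R^{target_dim} using a random Gaussian
    matrix, scaled so that inner products (and hence margins) are roughly
    preserved with high probability."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R

# A margin-based learning problem in high dimension can then be attacked in
# the low-dimensional image, whose dimension depends on the margin, not on n.
X = np.random.default_rng(3).normal(size=(100, 500))
X_low = random_projection(X, target_dim=50)
print(X_low.shape)
```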

32 Directions for Further Research
- Can similar efficient algorithms be derived for more complex NN architectures?
- How well do the new algorithms perform on real data sets?
- Can the ‘local approximation’ results be extended to more geometric functions?

