Vapnik-Chervonenkis Dimension: Definition and Lower Bound. Adapted from Yishai Mansour.

PAC Learning model There exists a distribution D over a domain X. Examples: –use c for the target function (rather than c_t). Goal: –with high probability (1-δ) –find h in H such that –error(h,c) < ε –with ε, δ arbitrarily small.

VC: Motivation Handle infinite classes. VC-dim “replaces” finite class size. Previous lecture (on PAC): –specific examples –rectangle. –interval. Goal: develop a general methodology.

The VC Dimension C: a collection of subsets of a universe U. VC(C) = VC dimension of C: the size of the largest subset T ⊆ U shattered by C. T is shattered if every subset T' ⊆ T is expressible as T ∩ (an element of C). Example: C = {{a}, {a, c}, {a, b, c}, {b, c}, {b}}; VC(C) = 2, and {b, c} is shattered by C. Plays an important role in learning theory, finite automata, comparability theory, computational geometry.

Definitions: Projection Given a concept c over X –associate it with a set (its positive examples). Projection (sets): –For a concept class C and a subset S ⊆ X –Π_C(S) = { c ∩ S | c ∈ C }. Projection (vectors): –For a concept class C and S = {x_1, …, x_m} –Π_C(S) = { (c(x_1), …, c(x_m)) | c ∈ C }.

Definition: VC-dim Clearly |Π_C(S)| ≤ 2^m. C shatters S if |Π_C(S)| = 2^m (S is shattered by C). VC dimension of a class C: –the size d of the largest set S that is shattered by C. –Can be infinite. For a finite class C: –VC-dim(C) ≤ log_2 |C|.
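To make these definitions concrete, here is a minimal brute-force sketch (not from the lecture; Python and the small example class from the previous slide are assumed) that computes the projection Π_C(S) and the VC dimension. It is exponential in |U| and only meant for toy examples.

```python
from itertools import combinations

# Example class from the earlier slide, as sets over U = {a, b, c}.
C = [{"a"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
U = {"a", "b", "c"}

def projection(C, S):
    """Pi_C(S): the distinct intersections c & S over all concepts c in C."""
    return {frozenset(c & S) for c in C}

def is_shattered(C, S):
    """S is shattered iff |Pi_C(S)| = 2^|S|, i.e. every subset of S is realized."""
    return len(projection(C, S)) == 2 ** len(S)

def vc_dim(C, U):
    """Largest |S|, S a subset of U, shattered by C (brute force, exponential in |U|)."""
    for k in range(len(U), -1, -1):
        if any(is_shattered(C, set(S)) for S in combinations(U, k)):
            return k

print(vc_dim(C, U))                      # 2
print(is_shattered(C, {"b", "c"}))       # True: {b, c} is shattered
print(is_shattered(C, {"a", "b", "c"}))  # False, consistent with VC-dim <= log2|C| ~ 2.32
```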

Example: S is shattered by C. The VC dimension: a combinatorial measure of the complexity of a function class.

Calculating the VC dimension The VC dimension is at least d if there exists some sample S with |S| = d that is shattered by C. This does not mean that all samples of size d are shattered by C (e.g., three points on a single line in 2D). Conversely, to show that the VC dimension is at most d, one must show that no sample of size d+1 is shattered. Naturally, proving an upper bound is harder than proving a lower bound on the VC dimension.

Example 1: Interval C_1 = { c_z | z ∈ [0,1] }, c_z(x) = 1 ⟺ x ≤ z (points at or below the threshold z are labeled 1, points above it 0).
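A quick sanity check of this example, assuming the convention c_z(x) = 1 iff x ≤ z stated above (a hypothetical brute-force script of my own; the grid search is a heuristic, but it is exact for these sample points): one point is shattered while two points are not, so VC-dim(C_1) = 1.

```python
from itertools import product

def threshold_labels(points, z):
    """Labels of c_z on the sample: c_z(x) = 1 iff x <= z."""
    return tuple(1 if x <= z else 0 for x in points)

def shattered_by_thresholds(points, grid_steps=1000):
    """Brute-force check over a grid of thresholds z in [0, 1]."""
    realized = {threshold_labels(points, k / grid_steps) for k in range(grid_steps + 1)}
    return realized == set(product([0, 1], repeat=len(points)))

print(shattered_by_thresholds([0.4]))       # True: a single point is shattered
print(shattered_by_thresholds([0.3, 0.7]))  # False: labeling (0, 1) needs z < 0.3 and z >= 0.7
```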

Example 2: Line C_2 = { c_w | w = (a,b,c) }, c_w(x,y) = 1 ⟺ ax + by ≥ c.

Line: Hyperplane VC dim ≥ 3 (three points in general position can be shattered).

VC dim < 4: four points cannot be shattered.
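A sketch of how one might verify these two slides numerically (my own illustration, assuming NumPy and SciPy are available): for each labeling, a linear-programming feasibility test checks whether some half-plane ax + by ≥ c realizes it. Three points in general position are shattered; the four corners of a square are not, because the XOR labeling fails.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility of y_i * (w . x_i + b) >= 1 for some (w, b), via an LP with zero objective."""
    pts = np.asarray(points, dtype=float)
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    # Variables: (w_1, w_2, b).  Constraint: -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    b_ub = -np.ones(len(pts))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered_by_halfplanes(points):
    return all(separable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0, 1)]          # general position: shattered
square   = [(0, 0), (1, 0), (0, 1), (1, 1)]  # XOR labeling fails: not shattered
print(shattered_by_halfplanes(triangle))     # True  -> VC dim of half-planes in R^2 >= 3
print(shattered_by_halfplanes(square))       # False -> consistent with VC dim = 3
```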

Example 3: Axis-parallel rectangles

VC Dim of Rectangles
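The figures for this slide are not reproduced in the transcript. As a rough stand-in (my own brute-force check, not the lecture's argument), note that a labeling is realizable by an axis-parallel rectangle iff the bounding box of the positive points contains no negative point. The four "diamond" points are shattered, but adding the center point breaks shattering, consistent with VC dim = 4.

```python
from itertools import product

def rectangle_realizable(points, labels):
    """Realizable iff the bounding box of the positives excludes every negative."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == 0]
    if not pos:
        return True  # an empty/degenerate rectangle labels everything 0
    xs, ys = zip(*pos)
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return not any(lo_x <= x <= hi_x and lo_y <= y <= hi_y for (x, y) in neg)

def shattered_by_rectangles(points):
    return all(rectangle_realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(shattered_by_rectangles(diamond))             # True  -> VC dim >= 4
print(shattered_by_rectangles(diamond + [(0, 0)]))  # False -> the 5th point cannot be cut out
```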

Example 4: Finite unions of intervals Any finite set of points, with any labeling, can be realized (cover each positive point with its own small interval). Thus VC dim = ∞.

Example 5: Parity n Boolean input variables; T ⊆ {1, …, n}; f_T(x) = ⊕_{i ∈ T} x_i (the XOR of the coordinates indexed by T). Lower bound: the n unit vectors are shattered. Upper bound: –number of concepts –linear dependency.
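A small sanity check of the lower bound (my own sketch, shown for n = 4): the parity functions realize every labeling of the n unit vectors, since f_T(e_i) = 1 exactly when i ∈ T. The upper bound is immediate from |C| = 2^n and VC-dim(C) ≤ log_2|C| = n.

```python
from itertools import combinations

n = 4
unit_vectors = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]

def parity(T, x):
    """f_T(x): the XOR (sum mod 2) of the coordinates x_i with i in T."""
    return sum(x[i] for i in T) % 2

# All labelings of the unit vectors realized by some parity function f_T.
realized = {tuple(parity(T, e) for e in unit_vectors)
            for k in range(n + 1) for T in combinations(range(n), k)}

print(len(realized) == 2 ** n)  # True: the n unit vectors are shattered, so VC dim >= n
# Upper bound: |C| = 2^n parity functions, so VC-dim(C) <= log2|C| = n, hence VC dim = n.
```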

Example 6: OR n Boolean input variables; P and N subsets of {1, …, n}; f_{P,N}(x) = (∨_{i ∈ P} x_i) ∨ (∨_{i ∈ N} ¬x_i). Lower bound: the n unit vectors are shattered. Upper bound: –trivial: 2n –use ELIM (get n+1) –show the second vector removes 2 (get n).

Example 7: Convex polygons

Example 8: Hyper-plane C_8 = { c_{w,c} | w ∈ R^d, c ∈ R }, c_{w,c}(x) = 1 ⟺ ⟨w, x⟩ ≥ c. VC-dim(C_8) = d+1. Lower bound: –the d unit vectors and the zero vector are shattered. Upper bound: proved below via Radon's theorem.

Complexity Questions Given C explicitly, compute VC(C): –since VC(C) ≤ log |C|, it can be computed in n^{O(log n)} time (Linial-Mansour-Rivest 88) –probably can't do better: the problem is LogNP-complete (Papadimitriou-Yannakakis 96). Often C has a small implicit representation: C(i, x) is a polynomial-size circuit such that C(i, x) = 1 iff x belongs to set i –the implicit version is Σ_3^p-complete (Schaefer 99) (as hard as deciding ∃a ∀b ∃c φ(a, b, c) for a CNF formula φ).

Sampling Lemma Lemma: Let W ⊆ X be a subset such that |W| ≥ ε|X|. A set of m = O((1/ε) ln(1/δ)) points sampled independently and uniformly at random from X intersects W with probability at least 1-δ. Proof: Each sample x is in W with probability at least ε. Thus, the probability that all m samples miss W is at most (1-ε)^m ≤ e^{-εm} ≤ δ once m ≥ (1/ε) ln(1/δ).
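An optional Monte Carlo illustration of the lemma (not part of the lecture; the concrete X, ε and δ are my own choices): with m = ⌈(1/ε) ln(1/δ)⌉ uniform samples, the empirical probability of missing W stays at or below roughly (1-ε)^m ≤ δ.

```python
import math
import random

random.seed(0)
eps, delta = 0.1, 0.05
X = range(1000)
W = set(range(int(eps * len(X))))         # |W| = eps * |X|
m = math.ceil(math.log(1 / delta) / eps)  # sample size from the lemma

def misses_W(trials=20000):
    """Fraction of trials in which all m uniform samples from X miss W."""
    count = 0
    for _ in range(trials):
        sample = [random.choice(X) for _ in range(m)]
        if not any(x in W for x in sample):
            count += 1
    return count / trials

print(m, misses_W())   # miss rate is roughly (1 - eps)^m <= delta = 0.05
```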

ε-Net Theorem Theorem: Let the VC-dimension of (X,C) be d ≥ 2 and let 0 < ε ≤ ½. There exists an ε-net for (X,C) of size at most O((d/ε) ln(1/ε)). If we choose O((d/ε) ln(d/ε) + (1/ε) ln(1/δ)) points at random from X, then the resulting set N is an ε-net with probability at least 1-δ. Exercise 3, submission next week. This gives a polynomial bound on the sample size for PAC learning.

Radon's Theorem Definitions: –convex set –convex hull: conv(S). Theorem: –Let T be a set of d+2 points in R^d. –There exists a subset S of T such that –conv(S) ∩ conv(T \ S) ≠ ∅. Proof!
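A small numeric sketch of the theorem (my own illustration, assuming NumPy): for d+2 points in R^2 there is an affine dependence Σλ_i p_i = 0 with Σλ_i = 0, found from the null space of the lifted point matrix. Splitting the points by the sign of λ_i gives the Radon partition, and normalizing the positive part, Σ_{λ_i>0} λ_i p_i / Σ_{λ_i>0} λ_i, is a point common to both convex hulls.

```python
import numpy as np

# d + 2 = 4 points in R^2; Radon: some split S, T \ S has intersecting convex hulls.
P = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])

# Affine dependence: find lambda != 0 with sum(lambda) = 0 and sum(lambda_i * p_i) = 0.
A = np.vstack([P.T, np.ones(len(P))])  # 3 x 4 matrix, so it has a nontrivial null space
_, _, Vt = np.linalg.svd(A)
lam = Vt[-1]                           # null-space vector (last right singular vector)

pos, neg = lam > 0, lam < 0
S, complement = P[pos], P[neg]
# Common point of conv(S) and conv(T \ S):
point = (lam[pos] @ P[pos]) / lam[pos].sum()
print(S, complement, point, sep="\n")  # here the common point is (0.5, 0.5)
```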

Hyper-plane: Finishing the proof Assume a set T of d+2 points can be shattered. Use Radon's Theorem to find S such that –conv(S) ∩ conv(T \ S) ≠ ∅. Assign the points in S label 1 –and the points not in S label 0. Since T is shattered, there is a separating hyper-plane. How would it label a point in conv(S) ∩ conv(T \ S)?

Lower bounds: Setting Static learning algorithm: –asks for a sample S of size m(ε, δ) –based on S, selects a hypothesis.

Lower bounds: Setting Theorem: –if VC-dim(C) = ∞ then C is not learnable. Proof: –Let m = m(0.1, 0.1). –Find 2m points which are shattered (call this set T). –Let D be the uniform distribution on T. –Set c_t(x_i) = 1 independently with probability ½. The expected error is ¼. Finish the proof!

Lower Bound: Feasible Theorem: –if VC-dim(C) = d+1, then m(ε, δ) = Ω(d/ε). Proof: –Let T = {z_0, z_1, …, z_d} be a set of d+1 points which is shattered. –D samples: z_0 with prob. 1-8ε, and each z_i with prob. 8ε/d.

Continued: –Set c_t(z_0) = 1 and each c_t(z_i) = 1 independently with probability ½. –Expected error 2ε. –Bound the confidence δ for accuracy ε.

Lower Bound: Non-Feasible Theorem: –Even for two hypotheses, m(ε, δ) = Ω((log(1/δ)) / ε²). Proof: –Let H = {h_0, h_1}, where h_b(x) = b. –Two distributions over the label: –D_0: the label is 1 with prob. ½ - ε and 0 with prob. ½ + ε. –D_1: the label is 1 with prob. ½ + ε and 0 with prob. ½ - ε.