Published by Sharon Kimberly Hamilton. Modified over 5 years ago.
1
CSCI B609: "Foundations of Data Science"
Lecture 11/12: VC-Dimension and VC-Theorem Slides at Grigory Yaroslavtsev
2
Intro to ML
Classification problem
Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors)
Classification: come up with a mapping $X \to \{0,1\}$
Formalization:
Assume there is a probability distribution $D$ over $X$
$c^*$ = "target concept" (set $c^* \subseteq X$ of positive instances)
Given labeled i.i.d. samples from $D$ produce $h \subseteq X$
Goal: have $h$ agree with $c^*$ over distribution $D$
Minimize: $\mathrm{err}_D(h) = \Pr_D[h \,\Delta\, c^*]$
$\mathrm{err}_D(h)$ = "true" or "generalization" error
3
Intro to ML
Training error
$S$ = labeled sample (pairs $(x, l)$, $x \in X$, $l \in \{0,1\}$)
Training error: $\mathrm{err}_S(h) = \frac{|S \cap (h \,\Delta\, c^*)|}{|S|}$
"Overfitting": low training error, high true error
Hypothesis classes:
$H$: collection of subsets of $X$ called hypotheses
If $X = \mathbb{R}$, could be all intervals $[a,b]$, $a \le b$
If $X = \mathbb{R}^d$, could be linear separators: $\{\{x \in \mathbb{R}^d : x \cdot w \ge w_0\} \mid w \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
If $S$ is large enough (compared to some property of $H$) then overfitting doesn't occur
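The gap between training error and true error can be made concrete for the interval class. The following is a minimal sketch under assumptions that are not in the slides: $X = [0,1]$, $D$ uniform, and hypothetical choices of target and hypothesis intervals.

```python
import random

def in_interval(x, interval):
    a, b = interval
    return a <= x <= b

def training_error(sample, h, target):
    """Fraction of sample points on which h and the target concept disagree."""
    mistakes = sum(1 for x in sample if in_interval(x, h) != in_interval(x, target))
    return mistakes / len(sample)

def true_error_uniform(h, target):
    """Length of the symmetric difference of two intervals (D uniform on [0, 1])."""
    (a, b), (c, d) = h, target
    overlap = max(0.0, min(b, d) - max(a, c))
    return (b - a) + (d - c) - 2 * overlap

# Hypothetical target concept and hypothesis, chosen only for illustration.
target = (0.3, 0.7)
h = (0.25, 0.7)

rng = random.Random(0)
for n in (10, 100, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, training_error(sample, h, target), true_error_uniform(h, target))
```

With small samples the training error can be far from the true error of this fixed $h$; the rest of the lecture quantifies how large $|S|$ must be for the two to agree for every hypothesis at once.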
4
Overfitting and Uniform Convergence
PAC learning (agnostic): For $\varepsilon, \delta > 0$, if $|S| \ge \frac{1}{2\varepsilon^2}\left(\ln|H| + \ln(2/\delta)\right)$ then with probability $1-\delta$: $\forall h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \le \varepsilon$
Size of the class of hypotheses can be very large
Can also be infinite, how to give a bound then?
We will see ways around this today
5
VC-dimension
VC-dim($H$) $\le \log_2 |H|$
Consider database age vs. salary
Query: fraction of the overall population with ages 35-45 and salary $(50-70)K
How big a database can answer with $\pm\varepsilon$ error?
100 ages $\times$ 1000 salaries $\Rightarrow$ at most $10^{10}$ rectangles
$\frac{1}{2\varepsilon^2}\left(10 \ln 10 + \ln(2/\delta)\right)$ samples suffice
What if we don't want to discretize?
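To get a feel for the numbers in the database example, the finite-class bound $\frac{1}{2\varepsilon^2}(\ln|H| + \ln(2/\delta))$ can be evaluated directly. This is a small sketch; the function name and the particular values of $\varepsilon$ and $\delta$ are assumptions for illustration, while $|H| \le 10^{10}$ comes from the slide.

```python
import math

def finite_class_sample_size(num_hypotheses, eps, delta):
    """Smallest integer n with n >= (ln|H| + ln(2/delta)) / (2 * eps^2)."""
    return math.ceil((math.log(num_hypotheses) + math.log(2 / delta)) / (2 * eps ** 2))

# 100 ages x 1000 salaries: at most 10^10 discretized rectangles.
num_rectangles = 10 ** 10

# eps and delta below are illustrative choices, not values from the lecture.
for eps in (0.1, 0.01):
    n = finite_class_sample_size(num_rectangles, eps=eps, delta=0.05)
    print(f"eps={eps}: {n} samples suffice")
```

The dependence on $|H|$ is only logarithmic, which is why the $10^{10}$ rectangles contribute just the $10 \ln 10$ term.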
6
VC-dimension
Def. Concept class $H$ shatters a set $S$ if $\forall A \subseteq S$ there is $h \in H$ labeling $A$ positive and $S \setminus A$ negative
Def. VC-dim($H$) = size of the largest shattered set
Example: axis-parallel rectangles on the plane
4-point diamond is shattered
No 5-point set can be shattered
VC-dim(axis-parallel rectangles) = 4
Def. $H[S] = \{h \cap S : h \in H\}$ = set of labelings of the points in $S$ by functions in $H$
Def. Growth function $H(n) = \max_{|S| = n} |H[S]|$
Example: growth function of a-p. rectangles is $O(n^4)$
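The shattering definition can be checked by brute force for axis-parallel rectangles. The sketch below relies on one observation: a subset can be labeled positive by some rectangle exactly when its bounding box contains no other point of the set. The point sets and function names are illustrative choices, not from the slides.

```python
from itertools import combinations

def rectangle_can_label(points, positive):
    """True if some axis-parallel rectangle contains exactly `positive` among `points`."""
    if not positive:
        return True  # a rectangle placed away from all points labels everything negative
    xs = [p[0] for p in positive]
    ys = [p[1] for p in positive]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    # The bounding box is the smallest rectangle containing all positives,
    # so the labeling is realizable iff no negative point falls inside it.
    return all(
        not (lo_x <= x <= hi_x and lo_y <= y <= hi_y)
        for (x, y) in points
        if (x, y) not in positive
    )

def is_shattered(points):
    """Check all 2^|S| labelings of the point set."""
    pts = list(points)
    return all(
        rectangle_can_label(pts, set(subset))
        for r in range(len(pts) + 1)
        for subset in combinations(pts, r)
    )

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(is_shattered(diamond))             # the 4-point diamond
print(is_shattered(diamond + [(0, 0)]))  # one particular 5-point set
```

The check on a single 5-point set is only an illustration; the claim that no 5-point set is shattered needs the general argument (the bounding box of any 5 points is determined by at most 4 of them).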
7
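Growth function & uniform convergence
PAC learning via growth function: For $\varepsilon, \delta > 0$, if $|S| = n \ge \frac{8}{\varepsilon^2}\left(\ln(2H(2n)) + \ln(1/\delta)\right)$ then with probability $1-\delta$: $\forall h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \le \varepsilon$
Thm (Sauer's lemma). If VC-dim($H$) $= d$ then: $H(n) \le \sum_{i=0}^{d} \binom{n}{i} \le \left(\frac{en}{d}\right)^d$
For half-planes, VC-dim $= 3$, $H(n) = O(n^2)$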
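A quick numerical comparison shows how much smaller the Sauer bound is than the trivial $2^n$ count of labelings. This is a sketch with hypothetical function names; the closed form $(en/d)^d$ is only compared for $n \ge d$.

```python
import math

def sauer_sum(n, d):
    """Sum of binomial coefficients C(n, 0) + ... + C(n, d)."""
    return sum(math.comb(n, i) for i in range(d + 1))

def sauer_closed_form(n, d):
    """The weaker closed-form bound (e * n / d)^d, used here for n >= d."""
    return (math.e * n / d) ** d

d = 3  # e.g. half-planes
for n in (3, 10, 100, 1000):
    print(n, 2 ** n, sauer_sum(n, d), round(sauer_closed_form(n, d)))
```

For $n \le d$ the sum equals $2^n$, so the bound only starts to bite once the sample is larger than the VC-dimension.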
8
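Sauer's Lemma Proof
Let $d$ = VC-dim($H$); we'll show that if $|S| = n$:
$|H[S]| \le \binom{n}{\le d} = \sum_{i=0}^{d} \binom{n}{i}$
$\binom{n}{\le d} = \binom{n-1}{\le d} + \binom{n-1}{\le d-1}$
Proof (induction by set size):
$S \setminus \{x\}$: by induction $|H[S \setminus \{x\}]| \le \binom{n-1}{\le d}$
$|H[S]| - |H[S \setminus \{x\}]| \le \binom{n-1}{\le d-1}$ ?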
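The recurrence for $\binom{n}{\le d}$ used in this induction follows from Pascal's rule $\binom{n}{i} = \binom{n-1}{i} + \binom{n-1}{i-1}$. A short derivation, assuming $n \ge 1$ and $d \ge 1$:

```latex
\begin{align*}
\binom{n}{\le d}
  &= \binom{n}{0} + \sum_{i=1}^{d} \binom{n}{i} \\
  &= \binom{n-1}{0} + \sum_{i=1}^{d} \left[ \binom{n-1}{i} + \binom{n-1}{i-1} \right] \\
  &= \sum_{i=0}^{d} \binom{n-1}{i} + \sum_{j=0}^{d-1} \binom{n-1}{j} \\
  &= \binom{n-1}{\le d} + \binom{n-1}{\le d-1}.
\end{align*}
```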
9
$|H[S]| - |H[S \setminus \{x\}]| \le \binom{n-1}{\le d-1}$
If $|H[S]| > |H[S \setminus \{x\}]|$ then it is because of the sets that differ only on $x$, so let's pair them up
For $h \in H[S]$ containing $x$ let $twin(h) = h \setminus \{x\}$
$T = \{h \in H[S] : x \in h \text{ and } twin(h) \in H[S]\}$
Note: $|H[S]| - |H[S \setminus \{x\}]| = |T|$
What is the VC-dimension of $T$?
If VC-dim($T$) $= d'$ then some $R \subseteq S \setminus \{x\}$ of size $d'$ is shattered
All $2^{d'}$ subsets of $R$ are 0/1 extendable on $x$
$d \ge d' + 1 \Rightarrow$ VC-dim($T$) $\le d - 1 \Rightarrow$ apply induction
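The counting identity $|H[S]| - |H[S \setminus \{x\}]| = |T|$ can be sanity-checked on a small class. The sketch below uses closed intervals on the line as $H$ and a hypothetical 5-point set $S$; it illustrates the pairing argument and does not replace the proof.

```python
def interval_labelings(points):
    """H[S] for intervals on the line: subsets of `points` cut out by some [a, b]."""
    pts = sorted(points)
    labelings = {frozenset()}  # an interval that misses every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            labelings.add(frozenset(pts[i:j + 1]))
    return labelings

def twin_set(labelings, x):
    """T = {h in H[S] : x in h and h minus {x} is also in H[S]}."""
    return {h for h in labelings if x in h and (h - {x}) in labelings}

S = [1, 2, 3, 4, 5]  # illustrative point set
H_S = interval_labelings(S)
for x in S:
    H_S_minus_x = interval_labelings([p for p in S if p != x])
    T = twin_set(H_S, x)
    print(x, len(H_S), len(H_S_minus_x), len(T))
    assert len(H_S) - len(H_S_minus_x) == len(T)
```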
10
Examples
Intervals of the reals:
Shatter 2 points, don't shatter 3 $\Rightarrow$ VC-dim $= 2$
Pairs of intervals of the reals:
Shatter 4 points, don't shatter 5 $\Rightarrow$ VC-dim $= 4$
Convex polygons:
Shatter any $n$ points on a circle $\Rightarrow$ VC-dim $= \infty$
Linear separators in $d$ dimensions:
Shatter $d+1$ points (unit vectors + origin)
Take subset $S$ and set $w_i = 0$ if $i \in S$: separator $w^T x \le 0$
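The two interval examples can be verified by enumeration. On the line only the order of the points matters, and a labeling is realizable by a union of at most $k$ intervals exactly when its positive points form at most $k$ maximal runs. The sketch below uses that criterion; the function names and the search limit `max_n` are assumptions for illustration.

```python
from itertools import product

def runs_of_ones(labels):
    """Number of maximal blocks of consecutive 1s in a 0/1 sequence."""
    return sum(
        1 for i, b in enumerate(labels)
        if b == 1 and (i == 0 or labels[i - 1] == 0)
    )

def shatters_n_points(n, k):
    """Do unions of at most k intervals shatter n distinct points on the line?"""
    return all(runs_of_ones(labels) <= k for labels in product((0, 1), repeat=n))

def vc_dim_k_intervals(k, max_n=10):
    """Largest n <= max_n such that n points are shattered."""
    d = 0
    for n in range(1, max_n + 1):
        if shatters_n_points(n, k):
            d = n
    return d

print(vc_dim_k_intervals(1))  # single intervals
print(vc_dim_k_intervals(2))  # pairs of intervals
```

The alternating labeling 1, 0, 1, 0, ... is the obstruction: on 3 points it needs two intervals, on 5 points it needs three.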
11
VC-dimension of linear separators
No set of $d+2$ points can be shattered
Thm (Radon). Any set $S \subseteq \mathbb{R}^d$ with $|S| = d+2$ can be partitioned into two subsets $A, B$ s.t.: Convex($A$) $\cap$ Convex($B$) $\ne \emptyset$
Form $d \times (d+2)$ matrix $A$, columns = points in $S$
Add extra all-1 row $\Rightarrow$ matrix $B$
$x = (x_1, x_2, \dots, x_{d+2})$, non-zero vector: $Bx = 0$
Reordering: $x_1, x_2, \dots, x_s \ge 0$, $x_{s+1}, \dots, x_{d+2} < 0$
Normalize: $\sum_{i=1}^{s} |x_i| = 1$
12
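Radon's Theorem (cont.)
$b_i, a_i$ = $i$-th columns of $B$ and $A$
$\sum_{i=1}^{s} |x_i| b_i = \sum_{i=s+1}^{d+2} |x_i| b_i$
$\sum_{i=1}^{s} |x_i| a_i = \sum_{i=s+1}^{d+2} |x_i| a_i$
$\sum_{i=1}^{s} |x_i| = \sum_{i=s+1}^{d+2} |x_i| = 1$
Convex combinations of two subsets intersect
Contradiction: a linear separator labeling one subset positive and the other negative would separate their convex hulls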
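The construction in the proof is effectively an algorithm: find a non-zero solution of $Bx = 0$ and split the points by the sign of $x_i$. The following sketch implements it with exact rational arithmetic. The helper names are hypothetical, the Gaussian elimination is a generic routine rather than anything prescribed by the slides, and the square used as input is an illustrative choice.

```python
from fractions import Fraction

def null_vector(rows):
    """Return a non-zero x with rows @ x = 0 (matrix has more columns than rows)."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0])
    pivots = []  # pivot column of each pivot row, in order
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        m[r] = [v / m[r][c] for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    free = next(c for c in range(n_cols) if c not in pivots)
    x = [Fraction(0)] * n_cols
    x[free] = Fraction(1)          # set one free variable to 1
    for row, c in zip(m, pivots):  # back out the pivot variables
        x[c] = -row[free]
    return x

def radon_partition(points):
    """Split d+2 points in R^d by the sign of x and return a common hull point."""
    d = len(points[0])
    assert len(points) == d + 2
    # Rows of B: the d coordinate rows (matrix A), then the extra all-ones row.
    B = [[p[k] for p in points] for k in range(d)] + [[1] * len(points)]
    x = null_vector(B)
    pos = [i for i, v in enumerate(x) if v >= 0]
    neg = [i for i, v in enumerate(x) if v < 0]
    total = sum(x[i] for i in pos)  # equals the sum of |x_i| over neg
    common = tuple(sum(x[i] * points[i][k] for i in pos) / total for k in range(d))
    other = tuple(sum(-x[i] * points[i][k] for i in neg) / total for k in range(d))
    assert common == other  # the two convex combinations coincide
    return [points[i] for i in pos], [points[i] for i in neg], common

print(radon_partition([(0, 0), (1, 0), (0, 1), (1, 1)]))
```

Dividing by `total` is the normalization step from the slide: it turns both sides into convex combinations with weights summing to 1.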
13
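Growth function & uniform convergence
PAC learning via growth function: For $\varepsilon, \delta > 0$, if $|S| = n \ge \frac{8}{\varepsilon^2}\left(\ln(2H(2n)) + \ln(1/\delta)\right)$ then with probability $1-\delta$: $\forall h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \le \varepsilon$
Assume event $A$: $\exists h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \varepsilon$
Draw $S'$ of size $n$, event $B$: $\exists h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \varepsilon$ and $|\mathrm{err}_{S'}(h) - \mathrm{err}_D(h)| < \varepsilon/2$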
14
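$\Pr[B] \ge \Pr[A]/2$
Lem. If $n = \Omega(1/\varepsilon^2)$ then $\Pr[B] \ge \Pr[A]/2$. Proof:
$\Pr[B] \ge \Pr[A, B] = \Pr[A] \Pr[B \mid A]$
Suppose $A$ occurs: $\exists h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \varepsilon$
When we draw $S'$: $\mathbb{E}_{S'}[\mathrm{err}_{S'}(h)] = \mathrm{err}_D(h)$
By Chernoff: $\Pr_{S'}\left[|\mathrm{err}_{S'}(h) - \mathrm{err}_D(h)| < \varepsilon/2\right] \ge \frac{1}{2}$
$\Pr[B] \ge \Pr[A] \times 1/2$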
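The slide does not fix the constant hidden in $n = \Omega(1/\varepsilon^2)$. As one possible instantiation, assuming the Hoeffding form of the Chernoff bound, $\Pr[|\mathrm{err}_{S'}(h) - \mathrm{err}_D(h)| \ge t] \le 2e^{-2nt^2}$ with $t = \varepsilon/2$, the failure probability drops to $1/2$ once $n \ge 2\ln 4/\varepsilon^2$. The function names below are hypothetical.

```python
import math

def hoeffding_failure_bound(n, eps):
    """Hoeffding upper bound on Pr[|err_S'(h) - err_D(h)| >= eps / 2] for n samples."""
    return 2 * math.exp(-2 * n * (eps / 2) ** 2)

def samples_for_half(eps):
    """Smallest n for which the Hoeffding bound above is at most 1/2."""
    return math.ceil(2 * math.log(4) / eps ** 2)

for eps in (0.2, 0.1, 0.05):
    n = samples_for_half(eps)
    print(eps, n, hoeffding_failure_bound(n, eps))
```

Other versions of the Chernoff bound give different constants; only the $1/\varepsilon^2$ scaling matters for the lemma.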
15
VC-theorem Proof
Suffices to show that $\Pr[B] \le \delta/2$
Consider drawing $2n$ samples $S''$ and then randomly partitioning into $S'$ and $S$
$B^*$: same as $B$ for such $(S', S)$ $\Rightarrow \Pr[B^*] = \Pr[B]$
Will show: $\forall$ fixed $S''$: $\Pr_{S, S'}[B^* \mid S'']$ is small
Key observation: once $S''$ is fixed there are only $|H[S'']| \le H(2n)$ events to care about
Suffices: for every fixed $h \in H[S'']$: $\Pr_{S, S'}[B^* \text{ occurs for } h \mid S''] \le \frac{\delta}{2H(2n)}$
16
VC-theorem Proof (cont.)
Randomly pair points in $S''$ into $(a_i, b_i)$ pairs
With prob. 1/2: $a_i \to S$, $b_i \to S'$ or $a_i \to S'$, $b_i \to S$
Diff. between $\mathrm{err}_S(h)$ and $\mathrm{err}_{S'}(h)$ for $i = 1, \dots, n$
Only changes if mistake on only one of $(a_i, b_i)$
With prob. 1/2 difference changes by $\pm 1$
By Chernoff: $\Pr\left[|\mathrm{err}_S(h) - \mathrm{err}_{S'}(h)| > \frac{\varepsilon n}{4}\right] = e^{-\Omega(\varepsilon^2 n)}$
$e^{-\Omega(\varepsilon^2 n)} \le \frac{\delta}{2H(2n)}$ for $n$ from the Thm. statement
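The sample size in the theorem statement is defined implicitly, since $n$ appears on both sides of $n \ge \frac{8}{\varepsilon^2}(\ln(2H(2n)) + \ln(1/\delta))$. One way to get a concrete number is to replace $H(2n)$ by the Sauer bound and search for an $n$ that satisfies the inequality. The sketch below does this by doubling followed by bisection; the function names and the parameter values are assumptions for illustration.

```python
import math

def growth_bound(m, d):
    """Sauer's lemma bound on H(m) for a class of VC-dimension d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def vc_condition_holds(n, d, eps, delta):
    """Check n >= 8 / eps^2 * (ln(2 H(2n)) + ln(1 / delta)) with H bounded via Sauer."""
    rhs = 8 / eps ** 2 * (math.log(2 * growth_bound(2 * n, d)) + math.log(1 / delta))
    return n >= rhs

def vc_sample_size(d, eps, delta):
    """Find n where the condition holds at n and fails at n - 1."""
    hi = 1
    while not vc_condition_holds(hi, d, eps, delta):
        hi *= 2
    lo = hi // 2  # the condition failed here during doubling (or lo == 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if vc_condition_holds(mid, d, eps, delta):
            hi = mid
        else:
            lo = mid
    return hi

# d = 4 corresponds to axis-parallel rectangles; eps and delta are illustrative.
print(vc_sample_size(d=4, eps=0.1, delta=0.05))
```

Unlike the discretized database bound, this estimate depends only on the VC-dimension, so it applies to rectangles with arbitrary real-valued corners.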