CSCI B609: β€œFoundations of Data Science”




1 CSCI B609: “Foundations of Data Science”
Lecture 11/12: VC-Dimension and VC-Theorem
Slides at
Grigory Yaroslavtsev

2 Intro to ML
Classification problem:
Instance space X: {0,1}^d or ℝ^d (feature vectors)
Classification: come up with a mapping X → {0,1}
Formalization:
Assume there is a probability distribution D over X
c* = “target concept” (set c* ⊆ X of positive instances)
Given labeled i.i.d. samples from D, produce h ⊆ X
Goal: have h agree with c* over distribution D
Minimize: err_D(h) = Pr_D[h Δ c*]
err_D(h) = “true” or “generalization” error
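As a quick illustration of the formalization (all names here are illustrative: a 100-point instance space with the uniform distribution D), the true error of a hypothesis is just the probability mass of the symmetric difference h Δ c*:

```python
# Toy instance space X = {0, ..., 99}; target concept c* = {x : x < 50}.
# Under the uniform distribution D, err_D(h) = Pr_D[h Δ c*] = |h Δ c*| / |X|.
X = set(range(100))
c_star = {x for x in X if x < 50}      # target concept c*
h = {x for x in X if x < 60}           # a candidate hypothesis

def true_error(h, c_star, X):
    """err_D(h) = Pr_D[h Δ c*] under the uniform distribution over X."""
    return len(h ^ c_star) / len(X)    # ^ is symmetric difference on sets

print(true_error(h, c_star, X))        # 10 of 100 points disagree -> 0.1
```

Here h mislabels exactly the points 50..59, so its generalization error is 0.1.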

3 Intro to ML
Training error:
S = labeled sample (pairs (x, l), x ∈ X, l ∈ {0,1})
Training error: err_S(h) = |S ∩ (h Δ c*)| / |S|
“Overfitting”: low training error, high true error
Hypothesis classes:
H: collection of subsets of X called hypotheses
If X = ℝ, H could be all intervals [a, b], a ≤ b
If X = ℝ^d, H could be linear separators: {x ∈ ℝ^d : w ⋅ x ≥ w_0}, w ∈ ℝ^d, w_0 ∈ ℝ
If S is large enough (compared to some property of H), overfitting doesn’t occur
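Continuing the same toy setup (uniform D over {0,…,99}, c* = {x < 50}; names illustrative), the training error is the fraction of labeled examples the hypothesis gets wrong:

```python
import random

random.seed(0)

# err_S(h) = |S ∩ (h Δ c*)| / |S|, computed from a labeled sample S of (x, l)
# pairs with l = [x ∈ c*].
c_star = set(range(50))
h = set(range(60))
S = [(x, int(x in c_star)) for x in (random.randrange(100) for _ in range(200))]

def training_error(h, S):
    """Fraction of labeled examples (x, l) on which h disagrees with l."""
    return sum(int(x in h) != l for x, l in S) / len(S)

print(training_error(h, S))
```

Since h disagrees with c* exactly on 50..59, the training error equals the fraction of sampled points that landed in that range, and it fluctuates around the true error 0.1.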

4 Overfitting and Uniform Convergence
PAC learning (agnostic): For ε, δ > 0, if |S| ≥ (1/(2ε²))(ln|H| + ln(2/δ)), then with probability 1−δ:
∀h ∈ H: |err_S(h) − err_D(h)| ≤ ε
The size of the class of hypotheses can be very large
It can even be infinite; how do we give a bound then?
We will see ways around this today

5 VC-dimension
VC-dim(H) ≤ log₂|H| (a shattered set of size d gives 2^d distinct labelings, so 2^d ≤ |H|)
Consider a database of age vs. salary
Query: fraction of the overall population with age 35−45 and salary $(50−70)K
How big a database can answer with ±ε error?
100 ages × 1000 salaries ⇒ at most (100 ⋅ 1000)² = 10¹⁰ rectangle queries, so ln|H| ≤ 10 ln 10
(1/(2ε²))(10 ln 10 + ln(2/δ)) samples suffice
What if we don’t want to discretize?
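Plugging the discretized rectangle class into the PAC bound gives a concrete database size (the function name is my own; ln|H| ≤ 10 ln 10 comes from the count of at most 10¹⁰ rectangles):

```python
from math import ceil, log

def rectangle_sample_size(eps, delta):
    """Samples sufficing for +-eps answers to all 100x1000-grid rectangle queries:
    (1/(2 eps^2)) (10 ln 10 + ln(2/delta))."""
    return ceil((10 * log(10) + log(2 / delta)) / (2 * eps ** 2))

print(rectangle_sample_size(0.05, 0.01))  # -> 5665
```

So a sample of a few thousand people already answers every rectangle query on the 100 × 1000 grid to within ±5%.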

6 VC-dimension
Def. Concept class H shatters a set S if ∀A ⊆ S there is h ∈ H labeling A positive and S ∖ A negative
Def. VC-dim(H) = size of the largest shattered set
Example: axis-parallel rectangles in the plane:
A 4-point diamond is shattered
No 5-point set can be shattered
VC-dim(axis-parallel rectangles) = 4
Def. H_S = {h ∩ S : h ∈ H} = set of labelings of the points of S by functions in H
Def. Growth function H[n] = max_{|S|=n} |H_S|
Example: the growth function of axis-parallel rectangles is O(n⁴)
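Shattering by axis-parallel rectangles can be checked by brute force, because a subset A is realizable iff the bounding box of A contains no other point of the set. A sketch (helper name my own):

```python
from itertools import combinations

def rect_shatters(points):
    """Check whether axis-parallel rectangles shatter a planar point set:
    test, for every subset A, that A's bounding box contains no other point."""
    pts = list(points)
    for r in range(len(pts) + 1):
        for A in combinations(pts, r):
            if not A:
                continue  # the empty set is realized by a degenerate rectangle
            xs = [p[0] for p in A]
            ys = [p[1] for p in A]
            box = (min(xs), max(xs), min(ys), max(ys))
            inside = {p for p in pts
                      if box[0] <= p[0] <= box[1] and box[2] <= p[1] <= box[3]}
            if inside != set(A):
                return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # the 4-point diamond
print(rect_shatters(diamond))                  # -> True
```

Adding the center point (0, 0) to the diamond breaks shattering: it falls inside the bounding box of the other four, matching the claim that no 5-point set is shattered.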

7 Growth function & uniform convergence
PAC learning via growth function: For ε, δ > 0, if |S| = n ≥ (8/ε²)(ln(2H[2n]) + ln(1/δ)), then with probability 1−δ:
∀h ∈ H: |err_S(h) − err_D(h)| ≤ ε
Thm (Sauer’s lemma). If VC-dim(H) = d then: H[n] ≤ Σ_{i=0}^{d} C(n, i) ≤ (en/d)^d
For half-planes, VC-dim = 3 and H[n] = O(n²)
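Both sides of Sauer's lemma's bound are easy to compute, and the polynomial closed form (en/d)^d can be checked numerically against the binomial sum:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sum_{i=0}^{d} C(n, i): the Sauer's-lemma bound on the growth function H[n]."""
    return sum(comb(n, i) for i in range(d + 1))

# The closed form (en/d)^d upper-bounds the sum for all 1 <= d <= n:
for n in range(4, 50):
    for d in range(1, n + 1):
        assert sauer_bound(n, d) <= (e * n / d) ** d

print(sauer_bound(100, 4))  # bound on H[100] for a class of VC-dimension 4
```

For d = 4 (e.g. axis-parallel rectangles) the bound grows like n⁴: polynomial in n, even though the number of hypotheses is infinite.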

8 Sauer’s Lemma Proof
Let d = VC-dim(H); we’ll show that if |S| = n:
|H_S| ≤ C(n, ≤d) := Σ_{i=0}^{d} C(n, i)
Identity: C(n, ≤d) = C(n−1, ≤d) + C(n−1, ≤d−1)
Proof (induction on the set size): fix a point x ∈ S
S ∖ {x}: by induction, |H_{S∖{x}}| ≤ C(n−1, ≤d)
Remains to show: |H_S| − |H_{S∖{x}}| ≤ C(n−1, ≤d−1)?
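The counting identity behind the induction is a Pascal's-rule sum, which can be verified numerically (here C(n, ≤d) denotes Σ_{i=0}^{d} C(n, i)):

```python
from math import comb

def c_le(n, d):
    """C(n, <=d) = Sum_{i=0}^{d} C(n, i): number of subsets of size at most d."""
    return sum(comb(n, i) for i in range(d + 1))

# Identity used in the induction: C(n, <=d) = C(n-1, <=d) + C(n-1, <=d-1)
for n in range(1, 30):
    for d in range(1, n):
        assert c_le(n, d) == c_le(n - 1, d) + c_le(n - 1, d - 1)

print(c_le(10, 3))  # 1 + 10 + 45 + 120 = 176
```

It follows by summing Pascal's rule C(n, i) = C(n−1, i) + C(n−1, i−1) over i = 0, …, d.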

9 |H_S| − |H_{S∖{x}}| ≤ C(n−1, ≤d−1)
If |H_S| > |H_{S∖{x}}|, it is because of sets that differ only on x, so let’s pair them up
For h ∈ H_S containing x, let twin(h) = h ∖ {x}
T = {h ∈ H_S : x ∈ h and twin(h) ∈ H_S}
Note: |H_S| − |H_{S∖{x}}| = |T|
What is the VC-dimension of T?
If VC-dim(T) = d′, then some R ⊆ S ∖ {x} of size d′ is shattered by T
All 2^{d′} subsets of R are 0/1-extendable on x, so R ∪ {x} is shattered by H
d ≥ d′ + 1 ⇒ VC-dim(T) ≤ d − 1 ⇒ apply induction

10 Examples
Intervals of the reals:
Shatter 2 points, don’t shatter 3 ⇒ VC-dim = 2
Pairs of intervals of the reals:
Shatter 4 points, don’t shatter 5 ⇒ VC-dim = 4
Convex polygons:
Shatter any n points on a circle ⇒ VC-dim = ∞
Linear separators in d dimensions:
Shatter d + 1 points (unit vectors + origin)
For a target subset A: set w_i = 1 if e_i ∈ A and w_i = −1 otherwise, and take threshold w_0 = −1/2 if the origin ∈ A and w_0 = 1/2 otherwise; then w ⋅ x ≥ w_0 labels exactly A positive
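The weight construction for the d + 1 points (one standard choice of weights, filled in here) can be verified exhaustively over all subsets:

```python
from itertools import chain, combinations

def check_shattering(d):
    """Verify that {w . x >= w_0} shatters the unit vectors e_1..e_d plus the
    origin, using w_i = +1 if e_i is in the target subset A else -1, and
    w_0 = -1/2 if the origin is in A else +1/2."""
    points = [tuple(int(i == j) for j in range(d)) for i in range(d)]
    points.append(tuple(0 for _ in range(d)))  # the origin
    for A in chain.from_iterable(combinations(points, r)
                                 for r in range(len(points) + 1)):
        A = set(A)
        w = [1 if points[i] in A else -1 for i in range(d)]
        w0 = -0.5 if points[-1] in A else 0.5
        labeled_pos = {p for p in points
                       if sum(wi * xi for wi, xi in zip(w, p)) >= w0}
        if labeled_pos != A:
            return False
    return True

print(check_shattering(3))  # 4 points shattered by half-spaces in R^3 -> True
```

The check works because w ⋅ e_i = w_i = ±1 straddles both thresholds ±1/2, while w ⋅ 0 = 0 lies between them, so the origin's label is controlled by the threshold alone.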

11 VC-dimension of linear separators
No set of d + 2 points can be shattered
Thm (Radon). Any set S ⊆ ℝ^d with |S| = d + 2 can be partitioned into two subsets A, B s.t.: Convex(A) ∩ Convex(B) ≠ ∅
Proof:
Form the d × (d+2) matrix A whose columns are the points of S
Add an extra all-1 row ⇒ (d+1) × (d+2) matrix B
B has more columns than rows, so there is a non-zero vector x = (x_1, x_2, …, x_{d+2}) with Bx = 0
Reorder so that x_1, x_2, …, x_s ≥ 0 and x_{s+1}, …, x_{d+2} < 0
Normalize so that Σ_{i=1}^{s} x_i = 1

12 Radon’s Theorem (cont.)
Let b_i, a_i = i-th columns of B and A
Bx = 0 ⇒ Σ_{i=1}^{s} |x_i| b_i = Σ_{i=s+1}^{d+2} |x_i| b_i
Top d rows: Σ_{i=1}^{s} |x_i| a_i = Σ_{i=s+1}^{d+2} |x_i| a_i
All-1 row: Σ_{i=1}^{s} |x_i| = Σ_{i=s+1}^{d+2} |x_i| = 1
So the same point is a convex combination of each of the two subsets of points: the convex hulls intersect
If d + 2 points were shattered, some separator would label one subset positive and the other negative, but no half-space separates two sets whose convex hulls intersect. Contradiction

13 Growth function & uniform convergence
PAC learning via growth function: For ε, δ > 0, if |S| = n ≥ (8/ε²)(ln(2H[2n]) + ln(1/δ)), then with probability 1−δ:
∀h ∈ H: |err_S(h) − err_D(h)| ≤ ε
Proof setup:
Event A: ∃h ∈ H: |err_S(h) − err_D(h)| > ε
Draw a second sample S′ of size n; event B: ∃h ∈ H: |err_S(h) − err_D(h)| > ε and |err_{S′}(h) − err_D(h)| < ε/2

14 π‘ƒπ‘Ÿ 𝐡 β‰₯Pr⁑[𝐴]/2 Lem. If 𝑛=Ξ©(1/ πœ– 2 ) then π‘ƒπ‘Ÿ 𝐡 β‰₯Pr⁑[𝐴]/2. Proof:
π‘ƒπ‘Ÿ 𝐡 β‰₯ Pr 𝐴,𝐡 = Pr 𝐴 Pr⁑[𝐡|𝐴] Suppose 𝐴 occurs: βˆƒπ’‰βˆˆH: π‘’π‘Ÿ π‘Ÿ 𝑆 𝒉 βˆ’π‘’π‘Ÿ π‘Ÿ 𝐷 𝒉 >πœ– When we draw 𝑆 β€² : 𝔼 𝑆 β€² π‘’π‘Ÿ π‘Ÿ 𝑆 β€² 𝒉 = π‘’π‘Ÿ π‘Ÿ 𝐷 𝒉 By Chernoff: π‘ƒπ‘Ÿ 𝑆 β€² π‘’π‘Ÿ π‘Ÿ 𝑆 β€² 𝒉 βˆ’π‘’π‘Ÿ π‘Ÿ 𝐷 𝒉 <πœ–/2 β‰₯ 1 2 π‘ƒπ‘Ÿ 𝐡 β‰₯ Pr 𝐴 Γ—1/2

15 VC-theorem Proof
Suffices to show that Pr[B] ≤ δ/2
Consider drawing 2n samples S″ and then randomly partitioning them into S and S′
B*: same event as B for such (S, S′) ⇒ Pr[B*] = Pr[B]
Will show: for every fixed S″, Pr_{S,S′}[B* | S″] is small
Key observation: once S″ is fixed, there are only |H_{S″}| ≤ H[2n] labelings to care about
Suffices: for every fixed h ∈ H_{S″}: Pr_{S,S′}[B* occurs for h | S″] ≤ δ/(2H[2n]); a union bound over the ≤ H[2n] labelings then gives Pr[B* | S″] ≤ δ/2

16 VC-theorem Proof (cont.)
Randomly pair the points of S″ into pairs (a_i, b_i), i = 1, …, n
With prob. ½: a_i → S, b_i → S′; otherwise a_i → S′, b_i → S
Track the difference between the number of mistakes h makes on S and on S′
The i-th pair only changes it if h makes a mistake on exactly one of (a_i, b_i)
In that case, with prob. ½, the difference changes by ±1
By Chernoff over these n coin flips: Pr[|mistakes on S − mistakes on S′| > εn/4] = e^{−Ω(ε²n)}
e^{−Ω(ε²n)} ≤ δ/(2H[2n]) for n as in the theorem statement

