CSCI B609: “Foundations of Data Science”


CSCI B609: “Foundations of Data Science”
Lecture 11/12: VC-Dimension and VC-Theorem
Slides at http://grigory.us/data-science-class.html
Grigory Yaroslavtsev, http://grigory.us

Intro to ML
Classification problem, formalized:
- Instance space 𝑋: {0,1}^𝑑 or ℝ^𝑑 (feature vectors)
- Classification: come up with a mapping 𝑋 → {0,1}
- Assume there is a probability distribution 𝐷 over 𝑋
- 𝒄* = “target concept” (the set 𝒄* ⊆ 𝑋 of positive instances)
- Given labeled i.i.d. samples from 𝐷, produce a hypothesis 𝒉 ⊆ 𝑋
- Goal: have 𝒉 agree with 𝒄* over the distribution 𝐷
- Minimize err_D(𝒉) = Pr_D[𝒉 Δ 𝒄*], the “true” or “generalization” error

Intro to ML: training error
- 𝑆 = labeled sample (pairs (𝑥, 𝑙) with 𝑥 ∈ 𝑋, 𝑙 ∈ {0,1})
- Training error: err_S(𝒉) = |𝑆 ∩ (𝒉 Δ 𝒄*)| / |𝑆|
- “Overfitting”: low training error, high true error
Hypothesis classes:
- H: a collection of subsets of 𝑋 called hypotheses
- If 𝑋 = ℝ, H could be all intervals [𝑎, 𝑏], 𝑎 ≤ 𝑏
- If 𝑋 = ℝ^𝑑, H could be the linear separators {𝒙 ∈ ℝ^𝑑 : 𝒘⋅𝒙 ≥ 𝑤_0}, 𝒘 ∈ ℝ^𝑑, 𝑤_0 ∈ ℝ
- If 𝑆 is large enough (compared to some property of H), overfitting doesn’t occur
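To make the definitions concrete, here is a minimal sketch (our own illustration, assuming hypotheses are represented as Python predicates and the labeled sample as a list of (x, label) pairs; the helper names are ours):

```python
import random

def training_error(h, sample):
    """Fraction of labeled examples (x, label) that hypothesis h misclassifies."""
    mistakes = sum(1 for x, label in sample if h(x) != label)
    return mistakes / len(sample)

# Illustrative setup: X = [0, 1], target concept c* = [0.3, 0.7], D = uniform.
target = lambda x: 0.3 <= x <= 0.7

def draw_sample(n):
    xs = [random.random() for _ in range(n)]
    return [(x, int(target(x))) for x in xs]

S = draw_sample(50)
h = lambda x: 0.35 <= x <= 0.7          # a candidate interval hypothesis
print("training error:", training_error(h, S))
print("estimated true error:", training_error(h, draw_sample(100000)))
```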

Overfitting and Uniform Convergence
PAC learning (agnostic): for 𝜖, 𝛿 > 0, if |𝑆| ≥ (1/(2𝜖²))(ln|𝐻| + ln(2/𝛿)) then with probability 1−𝛿:
  ∀𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| ≤ 𝜖
- The size of the class of hypotheses can be very large
- It can also be infinite; how to give a bound then?
- We will see ways around this today
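The bound translates directly into a sample-size calculator; a small sketch (the function name `pac_sample_size` is ours, not from the slides):

```python
import math

def pac_sample_size(num_hypotheses, eps, delta):
    """Samples sufficient so that, w.p. >= 1 - delta, every h in H has
    |err_S(h) - err_D(h)| <= eps (agnostic PAC bound for a finite class H)."""
    return math.ceil((math.log(num_hypotheses) + math.log(2 / delta)) / (2 * eps ** 2))

print(pac_sample_size(num_hypotheses=1000, eps=0.05, delta=0.01))  # ≈ 2442
```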

VC-dimension
VC-dim(𝐻) ≤ log₂|𝐻| for finite 𝐻
Consider a database of age vs. salary:
- Query: fraction of the overall population with ages 35−45 and salary $(50−70)K
- How big a database can answer with ±𝜖 error?
- 100 ages × 1000 salaries ⇒ ~10^10 rectangles
- (1/(2𝜖²))(10 ln 10 + ln(2/𝛿)) samples suffice
- What if we don’t want to discretize?
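A back-of-the-envelope check of the discretized example (our own arithmetic; the specific 𝜖 and 𝛿 are illustrative choices):

```python
import math

eps, delta = 0.05, 0.01
num_rectangles = 10 ** 10     # ~10^10 rectangles from 100 ages x 1000 salaries
n = (math.log(num_rectangles) + math.log(2 / delta)) / (2 * eps ** 2)
print(round(n))               # ~5,665 samples suffice
```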

VC-dimension
Def. A concept class 𝐻 shatters a set 𝑆 if ∀𝐴 ⊆ 𝑆 there is ℎ ∈ 𝐻 labeling 𝐴 positive and 𝑆 ∖ 𝐴 negative
Def. VC-dim(𝐻) = size of the largest shattered set
Example: axis-parallel rectangles on the plane
- The 4-point diamond is shattered (a brute-force check follows below)
- No 5-point set can be shattered
- VC-dim(axis-parallel rectangles) = 4
Def. 𝐻[𝑆] = {ℎ ∩ 𝑆 : ℎ ∈ 𝐻} = set of labelings of the points in 𝑆 by functions in 𝐻
Def. Growth function 𝐻[𝑛] = max_{|𝑆|=𝑛} |𝐻[𝑆]|
Example: the growth function of axis-parallel rectangles is 𝑂(𝑛⁴)
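A brute-force shattering check for axis-parallel rectangles (our own illustration): a subset 𝐴 is realizable iff the bounding box of 𝐴 contains no point outside 𝐴, which verifies that the 4-point diamond is shattered.

```python
from itertools import combinations

def rect_shatters(points):
    """Check whether axis-parallel rectangles shatter the given point set:
    every subset A must be cut out exactly by its own bounding box."""
    for k in range(len(points) + 1):
        for A in combinations(points, k):
            if not A:
                continue  # the empty labeling is realizable by a rectangle avoiding all points
            xs, ys = [p[0] for p in A], [p[1] for p in A]
            box = lambda p: min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
            if any(box(p) for p in points if p not in A):
                return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rect_shatters(diamond))  # True: VC-dim of axis-parallel rectangles is >= 4
```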

Growth function & uniform convergence
PAC learning via the growth function: for 𝜖, 𝛿 > 0, if |𝑆| = 𝑛 ≥ (8/𝜖²)(ln(2𝐻[2𝑛]) + ln(1/𝛿)) then with probability 1−𝛿:
  ∀𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| ≤ 𝜖
Thm (Sauer’s lemma). If VC-dim(𝐻) = 𝑑 then 𝐻[𝑛] ≤ Σ_{𝑖=0}^{𝑑} C(𝑛, 𝑖) ≤ (𝑒𝑛/𝑑)^𝑑
For half-planes, VC-dim = 3 and 𝐻[𝑛] = 𝑂(𝑛²)
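A quick numeric comparison of the quantities in Sauer’s lemma (our own illustration): the exact sum, the (𝑒𝑛/𝑑)^𝑑 upper bound, and the trivial 2^𝑛 count of all labelings.

```python
import math

def sauer_bound(n, d):
    """Sauer's lemma: H[n] <= sum_{i=0}^{d} C(n, i) when VC-dim(H) = d."""
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 3
print(sauer_bound(n, d))        # 166,751
print((math.e * n / d) ** d)    # (en/d)^d ≈ 7.4e5
print(2 ** n)                   # 2^100 ≈ 1.27e30, astronomically larger
```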

Sauer’s Lemma Proof
Notation: C(𝑛, ≤𝑑) = Σ_{𝑖=0}^{𝑑} C(𝑛, 𝑖); note that C(𝑛, ≤𝑑) = C(𝑛−1, ≤𝑑) + C(𝑛−1, ≤𝑑−1)
Let 𝑑 = VC-dim(𝐻); we’ll show that if |𝑆| = 𝑛 then |𝐻[𝑆]| ≤ C(𝑛, ≤𝑑)
Proof (induction on the set size): pick 𝑥 ∈ 𝑆
- 𝑆 ∖ {𝑥}: by induction, |𝐻[𝑆 ∖ {𝑥}]| ≤ C(𝑛−1, ≤𝑑)
- |𝐻[𝑆]| − |𝐻[𝑆 ∖ {𝑥}]| ≤ C(𝑛−1, ≤𝑑−1)?

Claim: |𝐻[𝑆]| − |𝐻[𝑆 ∖ {𝑥}]| ≤ C(𝑛−1, ≤𝑑−1)
- If |𝐻[𝑆]| > |𝐻[𝑆 ∖ {𝑥}]|, it is because of sets that differ only on 𝑥, so let’s pair them up
- For ℎ ∈ 𝐻[𝑆] containing 𝑥, let twin(ℎ) = ℎ ∖ {𝑥}
- 𝑇 = {ℎ ∈ 𝐻[𝑆] : 𝑥 ∈ ℎ and twin(ℎ) ∈ 𝐻[𝑆]}
- Note: |𝐻[𝑆]| − |𝐻[𝑆 ∖ {𝑥}]| = |𝑇|
What is the VC-dimension of 𝑇?
- If VC-dim(𝑇) = 𝑑′ then some 𝑹 ⊆ 𝑆 ∖ {𝑥} of size 𝑑′ is shattered by 𝑇
- All 2^{𝑑′} subsets of 𝑹 are 0/1-extendable on 𝑥, so 𝑹 ∪ {𝑥} is shattered by 𝐻
- Hence 𝑑 ≥ 𝑑′ + 1, i.e. VC-dim(𝑇) ≤ 𝑑 − 1 ⇒ apply induction to 𝑇
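A small empirical check of the statement being proved (our own illustration, not part of the proof): for single intervals on the line (VC-dim 2), the number of distinct labelings of 𝑛 points never exceeds C(𝑛, ≤2); in this case the bound is actually attained.

```python
import math, random

def interval_labelings(points):
    """Distinct labelings of the given points by intervals [a, b] (plus the empty labeling)."""
    pts = sorted(points)
    labelings = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            labelings.add(frozenset(pts[i:j + 1]))   # contiguous blocks of the sorted points
    return labelings

random.seed(0)
pts = random.sample(range(1000), 20)
n, d = len(pts), 2
print(len(interval_labelings(pts)))                  # |H[S]| = 211
print(sum(math.comb(n, i) for i in range(d + 1)))    # Sauer bound C(n, <=2) = 211
```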

Examples
Intervals of the reals:
- Shatter 2 points, don’t shatter 3 ⇒ VC-dim = 2
Pairs of intervals of the reals:
- Shatter 4 points, don’t shatter 5 ⇒ VC-dim = 4
Convex polygons:
- Shatter any 𝑛 points on a circle ⇒ VC-dim = ∞
Linear separators in 𝑑 dimensions:
- Shatter 𝑑+1 points (unit vectors + origin); a numeric check follows below
- Take a subset 𝑆 and set 𝑤_𝑖 = 0 for 𝑖 ∈ 𝑆: separator 𝑤^𝑇 𝑥 ≤ 0
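A numeric verification of the last example (our own construction; it may differ from the one intended on the slide): for any subset 𝐴 of {origin, 𝑒_1, …, 𝑒_𝑑}, weights 𝑤_𝑖 = +1 for 𝑒_𝑖 ∈ 𝐴 and 𝑤_𝑖 = −1 otherwise, with threshold −1/2 or +1/2 depending on whether the origin is in 𝐴, label exactly 𝐴 positive.

```python
import numpy as np
from itertools import combinations

d = 4
points = [np.zeros(d)] + [e for e in np.eye(d)]      # origin + unit vectors: d+1 points

def shattered_by_halfspaces(points):
    """Check that every subset A of the points is realized by some separator w.x >= w0."""
    dim = len(points[0])
    for k in range(len(points) + 1):
        for A in combinations(range(len(points)), k):
            # coordinate i corresponds to the unit vector e_{i+1}; point index 0 is the origin
            w = np.array([1.0 if (i + 1) in A else -1.0 for i in range(dim)])
            w0 = -0.5 if 0 in A else 0.5
            labels = [w @ p >= w0 for p in points]
            if any(labels[i] != (i in A) for i in range(len(points))):
                return False
    return True

print(shattered_by_halfspaces(points))   # True: VC-dim of linear separators is >= d+1
```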

VC-dimension of linear separators
No set of 𝑑+2 points can be shattered
Thm (Radon). Any set 𝑆 ⊆ ℝ^𝑑 with |𝑆| = 𝑑+2 can be partitioned into two subsets 𝐴, 𝐵 such that Convex(𝐴) ∩ Convex(𝐵) ≠ ∅
Proof:
- Form the 𝑑 × (𝑑+2) matrix 𝐴 whose columns are the points of 𝑆
- Add an extra all-ones row ⇒ (𝑑+1) × (𝑑+2) matrix 𝐵
- 𝐵 has more columns than rows, so there is a non-zero vector 𝒙 = (𝑥_1, 𝑥_2, …, 𝑥_{𝑑+2}) with 𝐵𝒙 = 0
- Reorder so that 𝑥_1, …, 𝑥_𝑠 ≥ 0 and 𝑥_{𝑠+1}, …, 𝑥_{𝑑+2} < 0
- Normalize so that Σ_{𝑖=1}^{𝑠} 𝑥_𝑖 = 1

Radon’s Theorem (cont.)
- Let 𝒃_𝑖, 𝒂_𝑖 be the 𝑖-th columns of 𝐵 and 𝐴
- 𝐵𝒙 = 0 gives Σ_{𝑖=1}^{𝑠} |𝑥_𝑖| 𝒃_𝑖 = Σ_{𝑖=𝑠+1}^{𝑑+2} |𝑥_𝑖| 𝒃_𝑖
- Hence Σ_{𝑖=1}^{𝑠} |𝑥_𝑖| 𝒂_𝑖 = Σ_{𝑖=𝑠+1}^{𝑑+2} |𝑥_𝑖| 𝒂_𝑖 and Σ_{𝑖=1}^{𝑠} |𝑥_𝑖| = Σ_{𝑖=𝑠+1}^{𝑑+2} |𝑥_𝑖| = 1
- So these convex combinations of the two subsets give a common point of Convex(𝐴) and Convex(𝐵)
- A linear separator therefore cannot put 𝐴 on the positive side and 𝐵 on the negative side, contradicting the assumption that 𝑑+2 points are shattered
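A numeric sketch of the proof (assuming NumPy; the helper name `radon_partition` is ours): build the (𝑑+1)×(𝑑+2) matrix 𝐵, take a vector 𝒙 in its null space, and split 𝑆 by the sign of 𝒙; the |𝑥_𝑖|-weighted average of either side is a common point of the two convex hulls.

```python
import numpy as np

def radon_partition(points):
    """Split d+2 points in R^d into two sets whose convex hulls intersect,
    following the null-space argument above."""
    S = np.array(points, dtype=float)            # shape (d+2, d)
    d = S.shape[1]
    assert S.shape[0] == d + 2
    B = np.vstack([S.T, np.ones(d + 2)])         # (d+1) x (d+2): points as columns, ones appended
    x = np.linalg.svd(B)[2][-1]                  # non-zero vector with B @ x ~= 0
    pos, neg = x > 0, x <= 0
    w = np.abs(x)
    common = (w[pos] @ S[pos]) / w[pos].sum()    # convex combination of the positive side
    return S[pos], S[neg], common

A_side, B_side, p = radon_partition([[0, 0], [4, 0], [0, 4], [3, 3]])
print(A_side, B_side, p)
```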

Growth function & uniform convergence
PAC learning via the growth function: for 𝜖, 𝛿 > 0, if |𝑆| = 𝑛 ≥ (8/𝜖²)(ln(2𝐻[2𝑛]) + ln(1/𝛿)) then with probability 1−𝛿:
  ∀𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| ≤ 𝜖
Proof setup:
- Event A: ∃𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| > 𝜖
- Draw a second sample 𝑆′ of size 𝑛
- Event B: ∃𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| > 𝜖 and |err_{S′}(𝒉) − err_D(𝒉)| < 𝜖/2

Pr[B] ≥ Pr[A]/2
Lem. If 𝑛 = Ω(1/𝜖²) then Pr[B] ≥ Pr[A]/2
Proof:
- Pr[B] ≥ Pr[A, B] = Pr[A] · Pr[B | A]
- Suppose A occurs: ∃𝒉 ∈ H: |err_S(𝒉) − err_D(𝒉)| > 𝜖
- When we draw 𝑆′: 𝔼_{S′}[err_{S′}(𝒉)] = err_D(𝒉)
- By Chernoff: Pr_{S′}[|err_{S′}(𝒉) − err_D(𝒉)| < 𝜖/2] ≥ 1/2
- Hence Pr[B] ≥ Pr[A] · 1/2
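A small simulation of the Chernoff step (our own illustration, not part of the proof): for a fixed 𝒉, a fresh sample 𝑆′ of size 𝑛 = Ω(1/𝜖²) satisfies |err_{S′}(𝒉) − err_D(𝒉)| < 𝜖/2 with probability well above 1/2; the parameter values below are illustrative.

```python
import random

def ghost_sample_check(true_err=0.30, eps=0.1, n=None, trials=2000):
    """Estimate Pr[ |err_{S'}(h) - err_D(h)| < eps/2 ] for a fresh sample of size n."""
    n = n or int(8 / eps ** 2)                  # n = Omega(1/eps^2)
    hits = 0
    for _ in range(trials):
        emp = sum(random.random() < true_err for _ in range(n)) / n
        hits += abs(emp - true_err) < eps / 2
    return hits / trials

print(ghost_sample_check())   # well above 1/2
```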

VC-theorem Proof
- Suffices to show that Pr[B] ≤ 𝛿/2
- Consider drawing 2𝑛 samples 𝑆″ and then randomly partitioning them into 𝑆′ and 𝑆
- 𝐵*: the same event as 𝐵 for such (𝑆′, 𝑆) ⇒ Pr[𝐵*] = Pr[𝐵]
- Will show: for every fixed 𝑆″, Pr_{𝑆,𝑆′}[𝐵* | 𝑆″] is small
- Key observation: once 𝑆″ is fixed, there are only |𝐻[𝑆″]| ≤ 𝐻[2𝑛] events to care about
- Suffices: for every fixed ℎ ∈ 𝐻[𝑆″]: Pr_{𝑆,𝑆′}[𝐵* occurs for ℎ | 𝑆″] ≤ 𝛿/(2𝐻[2𝑛])

VC-theorem Proof (cont.)
- Randomly pair the points of 𝑆″ into pairs (𝑎_𝑖, 𝑏_𝑖)
- With prob. ½: 𝑎_𝑖 → 𝑆, 𝑏_𝑖 → 𝑆′; otherwise 𝑎_𝑖 → 𝑆′, 𝑏_𝑖 → 𝑆
- Track the difference between the number of mistakes of 𝒉 on 𝑆 and on 𝑆′ over 𝑖 = 1, …, 𝑛
- It only changes if 𝒉 makes a mistake on exactly one of (𝑎_𝑖, 𝑏_𝑖); then with prob. ½ the difference changes by ±1
- By Chernoff: Pr[this difference exceeds 𝜖𝑛/4, i.e. |err_S(𝒉) − err_{S′}(𝒉)| > 𝜖/4] = 𝑒^{−Ω(𝜖²𝑛)}
- 𝑒^{−Ω(𝜖²𝑛)} ≤ 𝛿/(2𝐻[2𝑛]) for 𝑛 as in the theorem statement