In the Name of God
Machine Learning: Computational Learning Theory
Mohammad Ali Keyvanrad
Thanks to: M. Soleymani (Sharif University of Technology) and Tom Mitchell (Carnegie Mellon University)
Fall 1392
Outline
- Computational Learning Theory
- PAC learning theorem
- VC dimension
Computational Learning Theory
We want a theory that relates:
- the number of training examples
- the complexity of the hypothesis space
- the accuracy to which the target function is approximated
- the probability that the learner outputs a successful hypothesis
Learning scenarios
- Learner proposes instances as queries to the teacher: the learner proposes $\boldsymbol{x}$, the teacher provides $c(\boldsymbol{x})$.
- Teacher (who knows $c(\boldsymbol{x})$) proposes training examples: the teacher proposes a sequence $\{(\boldsymbol{x}_1, c(\boldsymbol{x}_1)), \dots, (\boldsymbol{x}_n, c(\boldsymbol{x}_n))\}$.
- Instances are drawn at random according to $P(\boldsymbol{x})$.
Sample Complexity
- How good is the classifier, really?
- How much data do I need to make it "good enough"?
Problem setting
- Set of all instances $X$
- Set of hypotheses $H$
- Set of possible target functions $C = \{c : X \rightarrow \{0, 1\}\}$
- Sequence of $m$ training instances $D = \{(\boldsymbol{x}_i, c(\boldsymbol{x}_i))\}_{i=1}^{m}$, where each $\boldsymbol{x}$ is drawn at random from an unknown distribution $P(\boldsymbol{x})$ and the teacher provides its noise-free label $c(\boldsymbol{x})$
The learner observes a set of training examples $D$ for target function $c$ and outputs a hypothesis $h \in H$ estimating $c$.
Goal: with high probability ("probably"), the selected hypothesis will have low generalization error ("approximately correct").
True error of a hypothesis
The true error of $h$ is the probability that it will misclassify an example drawn at random from $P(\boldsymbol{x})$:
$error_{true}(h) \equiv \Pr_{\boldsymbol{x} \sim P(\boldsymbol{x})}[c(\boldsymbol{x}) \neq h(\boldsymbol{x})]$
Two notions of error
- Training error of $h$: the fraction of the $m$ training examples that $h$ misclassifies; this is what we can measure.
- True error of $h$: the probability that $h$ misclassifies a new example drawn from $P(\boldsymbol{x})$; this is what we actually care about.
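A minimal sketch (not from the slides) contrasting the two notions of error on a toy 1-D problem; the target concept, the hypothesis threshold, and the sample sizes are all invented for illustration, and the true error is approximated by Monte Carlo on a large fresh sample:

    # Training error is computed on the small observed sample; true error is
    # approximated by evaluating h on a very large fresh sample from P(x).
    import numpy as np

    rng = np.random.default_rng(0)

    def c(x):                    # hypothetical target concept: 1[x > 0.5]
        return (x > 0.5).astype(int)

    def h(x):                    # a hypothesis with a slightly wrong threshold
        return (x > 0.45).astype(int)

    x_train = rng.uniform(0, 1, size=20)            # training sample from P(x)
    error_train = np.mean(h(x_train) != c(x_train))

    x_fresh = rng.uniform(0, 1, size=1_000_000)     # approximates P(x)
    error_true = np.mean(h(x_fresh) != c(x_fresh))  # ~ Pr[h(x) != c(x)] = 0.05

    print(f"error_train = {error_train:.3f}, error_true ~ {error_true:.3f}")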
Overfitting
Consider a hypothesis $h$ and its
- error rate over the training data: $error_{train}(h)$
- true error rate over all data: $error_{true}(h)$
We say $h$ overfits the training data if $error_{true}(h) > error_{train}(h)$.
Amount of overfitting: $error_{true}(h) - error_{train}(h)$.
Can we bound $error_{true}(h)$ in terms of $error_{train}(h)$?
Problem setting
- Classification: $D$ contains $m$ i.i.d. labeled data points.
- Finite number of possible hypotheses (e.g., decision trees of depth $d_0$).
- A learner finds a hypothesis $h$ that is consistent with the training data: $error_{train}(h) = 0$.
- What is the probability that the true error of $h$ will be more than $\epsilon$, i.e., $error_{true}(h) \geq \epsilon$?
How likely is a learner to pick a bad hypothesis?
We bound the probability that any consistent learner will output an $h$ with $error_{true}(h) > \epsilon$.
Theorem [Haussler, 1988]: For target concept $c$ and any $0 \leq \epsilon \leq 1$, if $H$ is finite and $D$ contains $m \geq 1$ independent random samples, then
$\Pr[(\exists h \in H)\; error_{train}(h) = 0 \,\wedge\, error_{true}(h) > \epsilon] \leq |H|\, e^{-\epsilon m}$
Proof
Consider a single "bad" hypothesis $h$, i.e., one with $error_{true}(h) > \epsilon$. The probability that $h$ correctly classifies one example drawn from $P(\boldsymbol{x})$ is less than $1 - \epsilon$, so the probability that it is consistent with all $m$ independent examples is less than $(1 - \epsilon)^m$.
Proof (Cont'd)
There are at most $|H|$ bad hypotheses, so by the union bound the probability that at least one bad hypothesis is consistent with all $m$ examples is less than $|H|(1 - \epsilon)^m$. Since $(1 - \epsilon) \leq e^{-\epsilon}$ for $0 \leq \epsilon \leq 1$, this is at most $|H|\, e^{-\epsilon m}$, which proves the theorem.
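A minimal simulation sketch (not part of the original slides) that checks the bound empirically for a finite class of 1-D threshold classifiers; the hypothesis class, the target threshold, and all parameter values are invented for illustration:

    # H = {h_t(x) = 1[x > t]} over a grid of thresholds; under uniform P(x)
    # the true error of h_t is |t - 0.5|. We estimate how often some "bad"
    # hypothesis (true error > eps) stays consistent with m random examples
    # and compare against the Haussler bound |H| e^(-eps m).
    import numpy as np

    rng = np.random.default_rng(0)
    thresholds = np.linspace(0, 1, 101)    # finite hypothesis space H
    c_thresh, eps, m, trials = 0.5, 0.1, 50, 2000

    bad = np.abs(thresholds - c_thresh) > eps  # hypotheses with true error > eps

    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, 1, size=m)
        y = (x > c_thresh).astype(int)
        preds = (x[None, :] > thresholds[:, None]).astype(int)
        consistent = (preds == y).all(axis=1)  # h_t agrees with all m labels
        hits += bool((consistent & bad).any())

    print(f"empirical probability = {hits / trials:.4f}")
    print(f"bound |H| e^(-eps m)  = {len(thresholds) * np.exp(-eps * m):.4f}")

As expected, the empirical probability falls well below the (loose) bound.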
PAC Bound
Theorem [Haussler, 1988]: Consider a finite hypothesis space $H$ and a training set $D$ with $m$ i.i.d. samples, $0 < \epsilon < 1$. For any learned hypothesis $h \in H$ that is consistent on the training set $D$, i.e., $error_{train}(h) = 0$, with probability at least $(1 - \delta)$:
$error_{true}(h) \leq \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
PAC: Probably Approximately Correct.
PAC Bound: Sample Complexity
How many training examples suffice?
- Given $\epsilon$ and $\delta$, the bound yields the sample complexity $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$.
- Given $m$ and $\delta$, it yields the error bound $\epsilon \geq \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right)$.
Both directions are computed in the sketch below.
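A minimal sketch of both computations, assuming only the formulas above; the values of $|H|$, $\epsilon$, and $\delta$ are chosen purely for illustration:

    # Both directions of the PAC bound for a finite H and a consistent learner.
    import math

    def sample_complexity(H_size, eps, delta):
        # m >= (ln|H| + ln(1/delta)) / eps
        return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

    def error_bound(H_size, m, delta):
        # eps = (ln|H| + ln(1/delta)) / m
        return (math.log(H_size) + math.log(1 / delta)) / m

    print(sample_complexity(2**10, eps=0.1, delta=0.05))  # -> 100
    print(error_bound(2**10, m=100, delta=0.05))          # -> ~0.0993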
Example: Conjunction of up to $d$ Boolean literals
Each of the $d$ Boolean variables can appear in the conjunction as a positive literal, appear as a negative literal, or be absent, so $|H| = 3^d$ and $\ln|H| = d \ln 3$. The sample complexity is therefore
$m \geq \frac{1}{\epsilon}\left(d \ln 3 + \ln\frac{1}{\delta}\right)$,
which grows only linearly in the number of variables $d$.
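Plugging $|H| = 3^d$ into the sample-complexity sketch above (parameter values again chosen only for illustration):

    # Conjunctions of up to d Boolean literals: |H| = 3^d, so
    # m >= (1/eps) * (d ln 3 + ln(1/delta)).
    import math

    def m_conjunctions(d, eps, delta):
        return math.ceil((d * math.log(3) + math.log(1 / delta)) / eps)

    print(m_conjunctions(d=10, eps=0.1, delta=0.05))  # -> 140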
Agnostic learning
So far we assumed that the target concept $c$ is in $H$, so a consistent hypothesis exists. In the agnostic setting we drop this assumption: the learner simply outputs the hypothesis $h \in H$ with the smallest training error, and we ask how far $error_{true}(h)$ can be from $error_{train}(h)$.
Hoeffding bounds: Agnostic learning
For a single fixed hypothesis $h$, the Hoeffding bound gives
$\Pr[error_{true}(h) > error_{train}(h) + \epsilon] \leq e^{-2m\epsilon^2}$
PAC bound: Agnostic learning
Applying the union bound over all of $H$: with probability at least $(1 - \delta)$, for every $h \in H$,
$error_{true}(h) \leq error_{train}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}$
Equivalently, the sample complexity is $m \geq \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{1}{\delta}\right)$.
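A minimal sketch of the agnostic bound, assuming the formulas above; note the $1/\epsilon^2$ dependence versus $1/\epsilon$ in the consistent case:

    # Agnostic PAC:
    # error_true(h) <= error_train(h) + sqrt((ln|H| + ln(1/delta)) / (2m))
    import math

    def agnostic_gap(H_size, m, delta):
        return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

    def agnostic_sample_complexity(H_size, eps, delta):
        # m >= (ln|H| + ln(1/delta)) / (2 eps^2)
        return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

    print(agnostic_gap(2**10, m=100, delta=0.05))                  # -> ~0.223
    print(agnostic_sample_complexity(2**10, eps=0.1, delta=0.05))  # -> 497

For the same $|H|$, $\epsilon$, and $\delta$ as in the consistent case, 497 examples are needed instead of 100.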
Limitation of the bounds
The bounds depend on $\ln|H|$, so they are vacuous when $H$ is infinite (e.g., linear classifiers with real-valued weights) and can be very loose even for large finite $H$. We need a measure of the complexity of $H$ other than its size: the VC dimension.
Shattering a set of instances
A set of instances $S \subseteq X$ is shattered by hypothesis space $H$ iff for every dichotomy (binary labeling) of $S$ there exists some hypothesis in $H$ consistent with that dichotomy. A set of $k$ instances has $2^k$ possible dichotomies.
Vapnik-Chervonenkis (VC) dimension
The VC dimension of hypothesis space $H$ defined over instance space $X$, written $VC(H)$, is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite subsets of $X$ can be shattered by $H$, then $VC(H) \equiv \infty$. For small classes, shattering can be checked by brute force, as in the sketch below.
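A minimal brute-force sketch (not from the slides) using 1-D threshold classifiers; the point sets and candidate thresholds are invented for illustration:

    # A set is shattered iff every one of its 2^k labelings is realized by
    # some hypothesis. H here is {h_t(x) = 1[x > t]} for a few thresholds t.
    from itertools import product

    def shatters(points, hypotheses):
        for labeling in product([0, 1], repeat=len(points)):
            if not any(tuple(h(x) for x in points) == labeling
                       for h in hypotheses):
                return False
        return True

    H = [lambda x, t=t: int(x > t) for t in (-1.0, 0.5, 2.0)]
    print(shatters([1.0], H))        # True: a single point is shattered
    print(shatters([0.0, 1.0], H))   # False: labeling (1, 0) is unrealizable

No two points can be shattered by thresholds (the left point can never be labeled 1 while the right one is labeled 0), so this class has VC dimension 1.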
PAC bound using VC
For a consistent learner, the sample complexity can be stated in terms of $VC(H)$ instead of $\ln|H|$ [Blumer et al., 1989]:
$m \geq \frac{1}{\epsilon}\left(4 \log_2\frac{2}{\delta} + 8\, VC(H) \log_2\frac{13}{\epsilon}\right)$
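A minimal sketch computing this bound, assuming the formula above; the parameter values are chosen only for illustration:

    # VC-based sample complexity for a consistent learner:
    # m >= (1/eps) * (4 log2(2/delta) + 8 VC(H) log2(13/eps))
    import math

    def vc_sample_complexity(vc_dim, eps, delta):
        return math.ceil((4 * math.log2(2 / delta)
                          + 8 * vc_dim * math.log2(13 / eps)) / eps)

    # Linear classifiers in 2-D have VC dimension 3 (next slide):
    print(vc_sample_complexity(vc_dim=3, eps=0.1, delta=0.05))  # -> 1899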
VC dimension: linear classifier in a 2-D space
Any 3 non-collinear points in the plane can be shattered by linear classifiers (all $2^3 = 8$ dichotomies are linearly separable), but no set of 4 points can be shattered: for example, the XOR labeling of 4 points in convex position is not linearly separable. Hence the VC dimension of linear classifiers in 2-D is 3; both facts are verified in the sketch below.
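A minimal sketch (not from the slides) that tests linear separability exactly as an LP feasibility problem, find $(\boldsymbol{w}, b)$ with $y_i(\boldsymbol{w} \cdot \boldsymbol{x}_i + b) \geq 1$ for all $i$; the point coordinates are invented for illustration:

    # Uses scipy's LP solver as a feasibility oracle for linear separability.
    from itertools import product
    import numpy as np
    from scipy.optimize import linprog

    def separable(X, y):
        """True iff labels y (in {0,1}) on points X are linearly separable."""
        s = 2 * np.asarray(y) - 1                  # map {0,1} -> {-1,+1}
        X = np.asarray(X, dtype=float)
        # Require s_i (w . x_i + b) >= 1, i.e., -s_i [x_i, 1] . (w, b) <= -1.
        A_ub = -s[:, None] * np.hstack([X, np.ones((len(X), 1))])
        res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                      bounds=[(None, None)] * 3)   # (w1, w2, b) unbounded
        return res.success

    tri = [(0, 0), (1, 0), (0, 1)]                 # 3 non-collinear points
    print(all(separable(tri, y) for y in product([0, 1], repeat=3)))  # True

    xor = [(0, 0), (1, 1), (0, 1), (1, 0)]
    print(separable(xor, [1, 1, 0, 0]))            # False: XOR labeling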
VC dimension: linear classifier
More generally, the VC dimension of linear classifiers in a $d$-dimensional space is $d + 1$.
Summary of PAC bounds
- Finite $H$, consistent learner: with probability at least $1 - \delta$, $error_{true}(h) \leq \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right)$.
- Finite $H$, agnostic learner: with probability at least $1 - \delta$, $error_{true}(h) \leq error_{train}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}$.
- Infinite $H$: replace $\ln|H|$ by a complexity measure based on $VC(H)$.