Computational Learning Theory
In the Name of God
Machine Learning: Computational Learning Theory
Mohammad Ali Keyvanrad
Thanks to: M. Soleymani (Sharif University of Technology) and Tom Mitchell (Carnegie Mellon University)
Outline
Computational Learning Theory
PAC learning theorem
VC dimension
Computational Learning Theory
We want a theory that relates:
Number of training examples
Complexity of the hypothesis space
Accuracy to which the target function is approximated
Probability that the learner outputs a successful hypothesis
Learning scenarios
Learner proposes instances as queries to the teacher?
learner proposes x, teacher provides c(x)
Teacher (who knows c(x)) proposes training examples?
teacher proposes the sequence {(x_1, c(x_1)), ..., (x_m, c(x_m))}
Instances drawn at random according to P(x)
Sample Complexity
How good is the classifier, really?
How much data do I need to make it "good enough"?
Problem settings
Set of all instances X
Set of hypotheses H
Set of possible target functions C = {c : X → {0,1}}
Sequence of m training instances D = {(x_i, c(x_i))}, i = 1, ..., m
x_i drawn at random from unknown distribution P(x)
Teacher provides the noise-free label c(x_i)
Learner observes the set of training examples D for target function c and outputs a hypothesis h ∈ H estimating c
Goal: with high probability ("probably"), the selected hypothesis will have low generalization error ("approximately correct")
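The "probably approximately correct" goal can be stated compactly; the following is the standard formulation (ε is the accuracy parameter and δ the confidence parameter, introduced explicitly on later slides):

```latex
% With probability at least 1 - \delta over the random draw of the m training
% examples, the hypothesis h output by the learner has true error at most \epsilon.
\Pr\big[\, \mathrm{error}_{true}(h) \le \epsilon \,\big] \;\ge\; 1 - \delta
```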
True error of a hypothesis
True error of h: the probability that it will misclassify an example drawn at random from P(x)
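In symbols, this standard definition reads:

```latex
\mathrm{error}_{true}(h) \;=\; \Pr_{x \sim P(x)}\big[\, c(x) \neq h(x) \,\big]
```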
Two notions of error
Overfitting
Consider a hypothesis h and its
Error rate over the training data: error_train(h)
True error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h)
Amount of overfitting: error_true(h) − error_train(h)
Can we bound error_true(h) in terms of error_train(h)?
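A minimal Python sketch (not from the slides; the threshold hypothesis class, the uniform data distribution, and all names are illustrative assumptions) that contrasts the two notions of error: several hypotheses can be consistent with a small training set yet have quite different true error.

```python
import random

random.seed(0)

# Toy setup: instances x ~ Uniform[0, 1], target concept c(x) = 1 iff x >= 0.5,
# finite hypothesis class H = threshold classifiers on a grid of 21 thresholds.
def c(x):
    return 1 if x >= 0.5 else 0

H = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 20 for i in range(21)]]

# A small training set: many hypotheses may have error_train(h) = 0 ...
train = [(x, c(x)) for x in (random.random() for _ in range(10))]

def error_train(h):
    return sum(h(x) != y for x, y in train) / len(train)

def error_true(h, n=100_000):
    # Monte Carlo estimate of Pr_{x ~ P}[c(x) != h(x)]
    xs = [random.random() for _ in range(n)]
    return sum(h(x) != c(x) for x in xs) / n

# ... yet their true errors differ, which is exactly what the PAC bounds control.
for h in H:
    if error_train(h) == 0:
        print(f"consistent hypothesis, estimated true error = {error_true(h):.3f}")
```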
Problem setting
Classification
D: m i.i.d. labeled data points
Finite number of possible hypotheses (e.g., decision trees of bounded depth)
A learner finds a hypothesis h that is consistent with the training data: error_train(h) = 0
What is the probability that the true error of h will be more than ε? error_true(h) ≥ ε
How likely is a learner to pick a bad hypothesis?
Bound on the probability that any consistent learner will output h with error_true(h) > ε
Theorem [Haussler, 1988]: For target concept c, for every 0 ≤ ε ≤ 1, if H is finite and D contains m ≥ 1 independent random examples, then
Pr[(∃h ∈ H) error_train(h) = 0 and error_true(h) > ε] ≤ |H| e^(−εm)
Proof
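The slide body is not reproduced in this text; the standard argument behind Haussler's bound, sketched under that assumption, is:

```latex
% Fix one "bad" hypothesis h with error_true(h) > epsilon.  It agrees with a single
% random example with probability at most 1 - \epsilon, hence with all m independent
% examples with probability at most (1 - \epsilon)^m.  A union bound over the at most
% |H| bad hypotheses, together with (1 - \epsilon)^m \le e^{-\epsilon m}, gives:
\Pr\big[\, \exists h \in H:\ \mathrm{error}_{train}(h) = 0 \,\wedge\, \mathrm{error}_{true}(h) > \epsilon \,\big]
  \;\le\; |H|\,(1 - \epsilon)^m \;\le\; |H|\, e^{-\epsilon m}
```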
Proof (Cont'd)
PAC Bound
Theorem [Haussler '88]: Consider a finite hypothesis space H, a training set D with m i.i.d. samples, and 0 < ε < 1. For any learned hypothesis h ∈ H that is consistent on the training set D, i.e., error_train(h) = 0, with probability at least (1 − δ):
error_true(h) ≤ (1/m)(ln|H| + ln(1/δ))
PAC: Probably Approximately Correct
PAC Bound: Sample Complexity
How many training examples suffice?
Given ε and δ, yields the sample complexity: m ≥ (1/ε)(ln|H| + ln(1/δ))
Given m and δ, yields the error bound: ε ≥ (1/m)(ln|H| + ln(1/δ))
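A small numeric sketch of the two forms of the bound (the function names are mine, not from the slides):

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """Smallest m such that any consistent h over a finite hypothesis space of size
    H_size has error_true(h) <= eps with probability at least 1 - delta."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def error_bound(H_size, m, delta):
    """Error guarantee epsilon implied by m i.i.d. examples."""
    return (log(H_size) + log(1 / delta)) / m

# Example: |H| = 2^100, eps = 0.1, delta = 0.05
print(sample_complexity(2**100, 0.1, 0.05))  # 724 examples
print(error_bound(2**100, 1000, 0.05))       # about 0.072
```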
Example: Conjunction of up to n Boolean literals
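The slide body is missing here; the standard worked example for this hypothesis space is sketched below. Each of the n Boolean variables can appear as a positive literal, as a negated literal, or not at all, so |H| = 3^n, and the sample-complexity bound becomes:

```latex
m \;\ge\; \frac{1}{\epsilon}\Big( \ln|H| + \ln\tfrac{1}{\delta} \Big)
  \;=\; \frac{1}{\epsilon}\Big( n \ln 3 + \ln\tfrac{1}{\delta} \Big)
% e.g. n = 10, \epsilon = 0.1, \delta = 0.05  =>  m \ge 140 examples suffice.
```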
Agnostic learning
Hoeffding bounds: Agnostic learning
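The slide body is missing here; the standard Hoeffding-based statements for the agnostic setting (where no hypothesis in H may be consistent with the data) are:

```latex
% For a single fixed hypothesis h and m i.i.d. examples:
\Pr\big[\, \mathrm{error}_{true}(h) > \mathrm{error}_{train}(h) + \epsilon \,\big] \;\le\; e^{-2 m \epsilon^2}
% Union bound over a finite hypothesis space H:
\Pr\big[\, \exists h \in H:\ \mathrm{error}_{true}(h) > \mathrm{error}_{train}(h) + \epsilon \,\big] \;\le\; |H|\, e^{-2 m \epsilon^2}
```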
PAC bound: Agnostic learning
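Setting the right-hand side of the union bound to δ and solving yields the usual agnostic-case guarantees, reconstructed here because the slide body is not in the text:

```latex
\mathrm{error}_{true}(h) \;\le\; \mathrm{error}_{train}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}
\qquad\text{and}\qquad
m \;\ge\; \frac{1}{2\epsilon^2}\Big( \ln|H| + \ln\tfrac{1}{\delta} \Big)
```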
Limitation of the bounds
Shattering a set of instances
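The slide body is missing here; the standard definition of shattering is:

```latex
% H shatters a set of instances S \subseteq X if every possible labeling (dichotomy)
% of S is realized by some hypothesis in H:
\forall\, b : S \to \{0,1\}\ \ \exists\, h \in H\ \ \forall x \in S:\ h(x) = b(x)
```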
Vapnik-Chervonenkis (VC) dimension
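The standard definition, stated here because the slide body is missing:

```latex
% VC(H) is the size of the largest set of instances that H can shatter
% (defined to be infinite if arbitrarily large sets can be shattered).
VC(H) \;=\; \max\big\{\, |S| \;:\; S \subseteq X,\ H \text{ shatters } S \,\big\}
```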
PAC bound using VC
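The slide body is missing; the bound usually quoted at this point (e.g., in Mitchell's treatment, following Blumer et al.) replaces ln|H| with a term depending on VC(H). The exact constants below are from that standard statement and should be checked against the original slide:

```latex
m \;\ge\; \frac{1}{\epsilon}\Big( 4 \log_2 \tfrac{2}{\delta} \;+\; 8\, VC(H)\, \log_2 \tfrac{13}{\epsilon} \Big)
```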
VC dimension: linear classifier in a 2-D space
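A brute-force Python sketch (illustrative only; the specific points and the parameter grid are my assumptions) showing that three points in general position in 2D are shattered by linear classifiers, while the XOR labeling of four points is not realized, consistent with a VC dimension of 3:

```python
from itertools import product

points3 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def predict(w1, w2, b, x):
    # Linear classifier: label 1 iff w1*x1 + w2*x2 + b > 0
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def realizable(points, labels, grid=(-2, -1, 0, 1, 2)):
    """True if some linear classifier with parameters on a small grid produces `labels`."""
    return any(
        all(predict(w1, w2, b, x) == y for x, y in zip(points, labels))
        for w1, w2, b in product(grid, repeat=3)
    )

# Every one of the 2^3 labelings of the 3 points is realized -> the set is shattered.
print(all(realizable(points3, labels) for labels in product((0, 1), repeat=3)))  # True

# For these 4 points the XOR labeling is not realized by any classifier on the grid
# (and in fact by no linear classifier), so this 4-point set is not shattered.
points4 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(realizable(points4, (0, 1, 1, 0)))  # False
```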
VC dimension: linear classifier
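The standard result for the general case (presumably what the slide states):

```latex
% Halfspaces with a bias term in R^d shatter some set of d + 1 points in general
% position, but no set of d + 2 points, hence:
VC\big(\{\, x \mapsto \mathrm{sign}(w^\top x + b) \;:\; w \in \mathbb{R}^d,\ b \in \mathbb{R} \,\}\big) \;=\; d + 1
```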
Summary of PAC bounds