Measuring Model Complexity (Textbook, Sections 2.1-2.2) CS 410/510 Thurs. April 27, 2007 Given two hypotheses (models) that correctly classify the training data, we should select the less complex one. But how do we measure model complexity in a comparable way for any hypothesis? A general framework: Let X be a space of instances, and let H be a space of hypotheses (e.g., all linear decision surfaces in two dimensions). We will measure the complexity of h ∈ H using VC dimension.

Vapnik-Chervonenkis (VC) Dimension Denoted VC(H). Measures the size of the largest subset S of distinct instances x ∈ X that can be represented using H. "Represented using H" means: for each dichotomy of S, ∃ h ∈ H such that h classifies every x ∈ S according to that dichotomy.

Dichotomies Let S be a subset of instances, S ⊆ X. Each h ∈ H partitions S into two subsets: {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. This partition is called a "dichotomy".

Example Let X = {(0,0), (0,1), (1,0), (1,1)}. Let S = X. What are the dichotomies of S? In general, given S ⊆ X, how many possible dichotomies are there of S?
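To make the counting concrete, here is a small Python sketch (my addition, not part of the original slides) that enumerates the dichotomies of the four-point set above; a set of size |S| has 2^|S| of them:

    from itertools import product

    S = [(0, 0), (0, 1), (1, 0), (1, 1)]

    # A dichotomy is just an assignment of 0/1 to each point of S.
    dichotomies = list(product([0, 1], repeat=len(S)))
    print(len(dichotomies))   # 16 == 2**4
    print(dichotomies[:3])    # first few: (0,0,0,0), (0,0,0,1), (0,0,1,0)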

Shattering a Set of Instances Definition: –H shatters S if and only if H can represent all dichotomies of S. –That is, H shatters S if and only if for every dichotomy of S, ∃ h ∈ H that is consistent with that dichotomy. Example: Let S be as on the previous slide, and let H be the set of all linear decision surfaces in two dimensions. Does H shatter S? Let H be the set of all binary decision trees. Does H shatter S?

Comments on the notion of shattering: –The ability of H to shatter a set of instances S is a measure of H's capacity to represent target concepts defined over these instances. –Intuition: the larger the subset of X that can be shattered, the more expressive H is. This is what VC dimension measures.

VC Dimension Definition: VC(H), defined over instance space X, is the size of the largest finite subset S ⊆ X that is shattered by H. (If arbitrarily large finite subsets of X can be shattered, VC(H) is defined to be infinite.)

Examples of VC Dimension 1. Let X = ℝ (e.g., x = height of some person), and H = the set of intervals on the real number line: H = {a < x < b | a, b ∈ ℝ}. What is VC(H)?

Need to find the largest subset of X that can be shattered by H. Consider a subset of X containing two instances: S = {2.2, 4.5}. Can S be shattered by H? Yes. Write down the different dichotomies of S and the hypotheses that are consistent with them. This shows VC(H) ≥ 2. Is there a subset, S = {x0, x1, x2}, of size 3 that can be shattered? Show that the answer is "no". Thus VC(H) = 2.
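Here is a brute-force check of this example in Python (my own sketch, not from the lecture). For a finite set of sample points, it is enough to consider interval endpoints placed below the smallest point, between consecutive points, and above the largest point, so the search below covers every dichotomy an interval can realize:

    from itertools import product

    def interval_shatters(points):
        """Can the class H = {a < x < b} realize every dichotomy of `points`?"""
        pts = sorted(points)
        # Candidate endpoints: two below the minimum (so an empty interval is available),
        # one between each pair of consecutive points, one above the maximum.
        cuts = ([pts[0] - 2, pts[0] - 1]
                + [(p + q) / 2 for p, q in zip(pts, pts[1:])]
                + [pts[-1] + 1])
        for labels in product([0, 1], repeat=len(pts)):      # every dichotomy
            realizable = any(
                all((a < x < b) == bool(y) for x, y in zip(pts, labels))
                for a in cuts for b in cuts if a < b
            )
            if not realizable:
                return False
        return True

    print(interval_shatters([2.2, 4.5]))       # True  -> VC(H) >= 2
    print(interval_shatters([1.0, 2.0, 3.0]))  # False -> e.g. the dichotomy (1, 0, 1) fails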

2. Let X = {(x, y) | x, y ∈ ℝ}. Let H = the set of all linear decision surfaces in the plane (i.e., lines that separate examples into positive and negative classifications). What is VC(H)?

Show that there is a set of two points that can be shattered by constructing all dichotomies and showing that there is an h ∈ H that represents each one. Is there a set of three points that can be shattered? Yes, if they are not collinear. Is there a set of four points that can be shattered? Why or why not? More generally: the VC dimension of linear decision surfaces in an r-dimensional space (i.e., the VC dimension of a perceptron with r inputs) is r + 1.
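One way to explore these questions numerically is to test each dichotomy for linear separability with a standard linear-programming feasibility check (my own sketch, not part of the course material; the margin-1 formulation y_i(w·x_i + b) ≥ 1 is my choice of encoding). It confirms that three non-collinear points are shattered by lines, while the four corners of the unit square are not (the XOR labeling fails):

    from itertools import product
    import numpy as np
    from scipy.optimize import linprog

    def linearly_separable(X, y):
        """Feasibility LP: does there exist (w, b) with y_i * (w . x_i + b) >= 1 for all i?"""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n, d = X.shape
        # Variables z = [w_1, ..., w_d, b]; constraints -y_i * (w . x_i + b) <= -1.
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        b_ub = -np.ones(n)
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        return res.status == 0  # status 0: feasible, so a separating line exists

    def shattered_by_lines(points):
        """Is every dichotomy of `points` realizable by a linear decision surface?"""
        return all(
            linearly_separable(points, [1 if lab else -1 for lab in labels])
            for labels in product([0, 1], repeat=len(points))
        )

    print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))           # True: 3 non-collinear points
    print(shattered_by_lines([(0, 0), (0, 1), (1, 0), (1, 1)]))   # False: XOR labeling fails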

3. If H is finite, prove that VC(H) ≤ log₂|H|. Proof: Let VC(H) = d. This means that H shatters some subset S of d instances, so H must contain at least one distinct hypothesis for each of the 2^d dichotomies of S. Thus 2^d ≤ |H|, so d = VC(H) ≤ log₂|H|.

Finding VC Dimension General approach: To show that VC(H) ≥ m, show that there exists a subset of m instances that is shattered by H. Then, to show that VC(H) = m, show that no subset of m + 1 instances is shattered by H.
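For a finite hypothesis set over a finite instance space, this recipe can be carried out mechanically. The Python sketch below (my addition) searches for the largest shattered subset; it uses the fact that any subset of a shattered set is itself shattered, so the search can stop at the first size for which no subset is shattered:

    from itertools import combinations

    def shatters(hypotheses, subset):
        """Does the hypothesis set realize all 2**|subset| dichotomies of `subset`?"""
        realized = {tuple(h(x) for x in subset) for h in hypotheses}
        return len(realized) == 2 ** len(subset)

    def vc_dimension(hypotheses, X):
        d = 0
        for size in range(1, len(X) + 1):
            if any(shatters(hypotheses, S) for S in combinations(X, size)):
                d = size
            else:
                break  # no size-k subset shattered => no larger subset can be shattered
        return d

    # Example: threshold classifiers h_t(x) = 1 iff x >= t on a small grid; VC dimension is 1.
    X = [1, 2, 3, 4]
    H = [lambda x, t=t: int(x >= t) for t in [0, 1.5, 2.5, 3.5, 5]]
    print(vc_dimension(H, X))   # 1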

Can we use VC dimension to design better learning algorithms? Vapnik proved: given a target function t and a sample S of m training examples drawn independently according to D, then for any δ between 0 and 1, with probability at least 1 - δ, every h ∈ H satisfies

    error_D(h) ≤ error_S(h) + sqrt( [VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ)] / m )

This gives an upper bound on true error as a function of training error, the number of training examples, the VC dimension of H, and δ. Thus, for a given δ, we can compute the tradeoff among expected true error, number of training examples needed, and VC dimension.
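As a rough numeric illustration (my addition, using the bound exactly as written above), the second, square-root term of the bound can be tabulated for a fixed δ as m and VC(H) vary:

    import math

    def vc_confidence(m, vc, delta=0.05):
        """Square-root term of the bound: sqrt((VC(H)(ln(2m/VC(H)) + 1) + ln(4/delta)) / m)."""
        return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

    for m in (100, 1000, 10000):
        for vc in (3, 10, 100):
            print(f"m = {m:6d}   VC(H) = {vc:4d}   VC confidence = {vc_confidence(m, vc):.3f}")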

The VC dimension typically grows with the number of free parameters in the model (e.g., the number of hidden units in a neural network). We need VC(H) high enough that H can represent the target concept, but not so high that the learner will overfit.

Structural Risk Minimization Principle: a function that fits the training data while minimizing VC dimension will generalize well. In other words, to minimize true error, both the training error and the VC-confidence term (which grows with VC dimension) should be small. Large model complexity can lead to overfitting and high generalization error.

“A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well.” (C. Burges, A tutorial on support vector machines for pattern recognition). “The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning.” (Ibid.)

That "shining peak" is Vapnik's 1979 monograph, Estimation of Dependences Based on Empirical Data (published in Russian). These ideas (and others) were fleshed out more thoroughly in Vapnik's 1995 book, The Nature of Statistical Learning Theory. This is about the time that support vector machines became a very popular research topic in the machine learning community.

In the bound above, the left-hand side is called the "actual risk", the training-error term is the "empirical risk", and the square-root term is the "VC confidence". If we have a set of different hypothesis spaces {H_i}, which one should we choose for learning? We want the one that will minimize the bound on the actual risk. Structural risk minimization principle: choose the H_i that minimizes the right-hand side. This requires a trade-off: we need VC(H) large enough that the empirical risk error_S(h) will be small, but we also need VC(H) small enough to keep the VC-confidence term small.
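To close, here is a small numeric sketch of this selection rule (entirely my own illustration with made-up training errors and VC dimensions, not data from the course): for a sequence of nested hypothesis spaces, a space of intermediate complexity often minimizes the sum of empirical risk and VC confidence.

    import math

    def srm_bound(train_error, vc, m, delta=0.05):
        """Empirical risk plus the VC-confidence term from the bound above."""
        return train_error + math.sqrt(
            (vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m
        )

    m = 1000
    # Hypothetical nested spaces H1 < H2 < H3: richer spaces fit the training
    # data better but pay a larger VC-confidence penalty.
    candidates = [("H1", 0.20, 3), ("H2", 0.08, 10), ("H3", 0.01, 200)]
    for name, err, vc in candidates:
        print(name, round(srm_bound(err, vc, m), 3))
    best = min(candidates, key=lambda c: srm_bound(c[1], c[2], m))
    print("SRM selects:", best[0])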