Nov 30th, 2001. Copyright © 2001, Andrew W. Moore.
PAC-learning
Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University.
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: Comments and corrections gratefully received.

PAC-learning: Slide 2 Copyright © 2001, Andrew W. Moore
Probably Approximately Correct (PAC) Learning
Imagine we're doing classification with categorical inputs. All inputs and outputs are binary. Data is noiseless. There's a machine f(x, h) which has H possible settings (a.k.a. hypotheses), called h_1, h_2, …, h_H.

PAC-learning: Slide 3 Copyright © 2001, Andrew W. Moore
Example of a machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands. Example hypotheses:
  X1 ^ X3 ^ X19
  X3 ^ X18
  X7
  X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
Question: if there are 3 attributes, what is the complete set of hypotheses in f?

PAC-learning: Slide 4 Copyright © 2001, Andrew W. Moore
Example of a machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands. Example hypotheses:
  X1 ^ X3 ^ X19
  X3 ^ X18
  X7
  X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
Question: if there are 3 attributes, what is the complete set of hypotheses in f? Answer (H = 8):
  True      X2          X3          X2 ^ X3
  X1        X1 ^ X2     X1 ^ X3     X1 ^ X2 ^ X3

PAC-learning: Slide 5 Copyright © 2001, Andrew W. Moore
And-Positive-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands. Example hypotheses:
  X1 ^ X3 ^ X19
  X3 ^ X18
  X7
  X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
Question: if there are m attributes, how many hypotheses are in f?

PAC-learning: Slide 6 Copyright © 2001, Andrew W. Moore
And-Positive-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands. Example hypotheses:
  X1 ^ X3 ^ X19
  X3 ^ X18
  X7
  X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
Question: if there are m attributes, how many hypotheses are in f? Answer: H = 2^m.
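
A quick way to see the count, not from the slides: the short Python sketch below enumerates the And-Positive-Literals hypothesis space by brute force and checks that H = 2^m, since each attribute is either included in the conjunction or left out. Function names are illustrative.

from itertools import combinations

def and_positive_literal_hypotheses(m):
    """All conjunctions of positive literals over X1..Xm.
    A hypothesis is a tuple of attribute indices; () is the hypothesis 'True'."""
    return [subset
            for size in range(m + 1)
            for subset in combinations(range(1, m + 1), size)]

if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        H = len(and_positive_literal_hypotheses(m))
        assert H == 2 ** m
        print(f"m={m}: H={H} (= 2^{m})")
    # The m=3 case reproduces the 8 hypotheses listed on Slide 4.
    for h in and_positive_literal_hypotheses(3):
        print("True" if not h else " ^ ".join(f"X{i}" for i in h))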

PAC-learning: Slide 7 Copyright © 2001, Andrew W. Moore
And-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm or their negations that contain only logical ands. Example hypotheses:
  X1 ^ ~X3 ^ X19
  X3 ^ ~X18
  ~X7
  X1 ^ X2 ^ ~X3 ^ … ^ Xm
Question: if there are 2 attributes, what is the complete set of hypotheses in f?

PAC-learning: Slide 8 Copyright © 2001, Andrew W. Moore
And-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm or their negations that contain only logical ands. Example hypotheses:
  X1 ^ ~X3 ^ X19
  X3 ^ ~X18
  ~X7
  X1 ^ X2 ^ ~X3 ^ … ^ Xm
Question: if there are 2 attributes, what is the complete set of hypotheses in f? Answer (H = 9):
  True      X2          ~X2
  X1        X1 ^ X2     X1 ^ ~X2
  ~X1       ~X1 ^ X2    ~X1 ^ ~X2

PAC-learning: Slide 9 Copyright © 2001, Andrew W. Moore
And-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm or their negations that contain only logical ands. Example hypotheses:
  X1 ^ ~X3 ^ X19
  X3 ^ ~X18
  ~X7
  X1 ^ X2 ^ ~X3 ^ … ^ Xm
Question: if there are m attributes, what is the size of the complete set of hypotheses in f?
  True      X2          ~X2
  X1        X1 ^ X2     X1 ^ ~X2
  ~X1       ~X1 ^ X2    ~X1 ^ ~X2

PAC-learning: Slide 10 Copyright © 2001, Andrew W. Moore
And-Literals Machine
f(x, h) consists of all logical sentences about X1, X2, …, Xm or their negations that contain only logical ands. Example hypotheses:
  X1 ^ ~X3 ^ X19
  X3 ^ ~X18
  ~X7
  X1 ^ X2 ^ ~X3 ^ … ^ Xm
Question: if there are m attributes, what is the size of the complete set of hypotheses in f? Answer: H = 3^m.
  True      X2          ~X2
  X1        X1 ^ X2     X1 ^ ~X2
  ~X1       ~X1 ^ X2    ~X1 ^ ~X2
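
As with the previous machine, a brute-force enumeration makes the 3^m count concrete: each attribute independently appears positively, appears negated, or is absent. The sketch below is illustrative and not from the slides.

from itertools import product

def and_literal_hypotheses(m):
    """Conjunctions of literals over X1..Xm.
    A hypothesis is a tuple of per-attribute choices from {'pos', 'neg', 'absent'};
    the all-'absent' tuple is the hypothesis 'True'."""
    return list(product(('pos', 'neg', 'absent'), repeat=m))

def evaluate(hypothesis, x):
    """x is a sequence of booleans (x[0] is X1)."""
    for choice, value in zip(hypothesis, x):
        if choice == 'pos' and not value:
            return False
        if choice == 'neg' and value:
            return False
    return True

if __name__ == "__main__":
    for m in (1, 2, 3):
        H = len(and_literal_hypotheses(m))
        assert H == 3 ** m
        print(f"m={m}: H={H} (= 3^{m})")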

PAC-learning: Slide 11 Copyright © 2001, Andrew W. Moore
Lookup Table Machine
f(x, h) consists of all truth tables mapping combinations of input attributes to true and false. Example hypothesis: a truth table over X1, X2, X3, X4 with output column Y.
Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

PAC-learning: Slide 12 Copyright © 2001, Andrew W. Moore
Lookup Table Machine
f(x, h) consists of all truth tables mapping combinations of input attributes to true and false. Example hypothesis: a truth table over X1, X2, X3, X4 with output column Y.
Question: if there are m attributes, what is the size of the complete set of hypotheses in f? Answer: H = 2^(2^m), since a truth table over m binary attributes has 2^m rows and each row's output can independently be True or False.
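
The count can also be checked by enumeration for small m. A small illustrative Python check, not from the slides:

from itertools import product

def lookup_table_hypotheses(m):
    """Every truth table over m binary attributes: one output bit for each of
    the 2^m input rows, so 2^(2^m) tables in total."""
    rows = list(product((False, True), repeat=m))             # the 2^m input rows
    tables = list(product((False, True), repeat=len(rows)))   # one output per row
    return rows, tables

if __name__ == "__main__":
    for m in (1, 2, 3):
        rows, tables = lookup_table_hypotheses(m)
        assert len(tables) == 2 ** (2 ** m)
        print(f"m={m}: {len(rows)} rows, H={len(tables)} (= 2^(2^{m}))")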

PAC-learning: Slide 13 Copyright © 2001, Andrew W. Moore
A Game
We specify f, the machine. Nature chooses a hidden random hypothesis h*. Nature randomly generates R datapoints.
How is a datapoint generated?
1. A vector of inputs x_k = (x_k1, x_k2, …, x_km) is drawn from a fixed unknown distribution D.
2. The corresponding output is y_k = f(x_k, h*).
We learn an approximation of h* by choosing some h_est for which the training set error is 0.
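
A minimal simulation of this game, not from the slides: it assumes the And-Positive-Literals machine and a uniform input distribution D purely for illustration, and it picks as h_est the first hypothesis found with zero training error.

import random
from itertools import combinations

def sample_hypothesis(m, rng):
    """Nature's hidden h*: a random conjunction of positive literals,
    represented as a frozenset of attribute indices (empty set = 'True')."""
    return frozenset(i for i in range(m) if rng.random() < 0.5)

def f(x, h):
    """The machine: x is a tuple of booleans, h a frozenset of indices."""
    return all(x[i] for i in h)

def generate_data(h_star, m, R, rng):
    """R datapoints: inputs drawn i.i.d. from a fixed distribution D
    (uniform here, as an example), labels given noiselessly by h*."""
    data = []
    for _ in range(R):
        x = tuple(rng.random() < 0.5 for _ in range(m))
        data.append((x, f(x, h_star)))
    return data

def learn(data, m):
    """Return some h_est with zero training error (brute force over all 2^m
    hypotheses, which is fine for small m)."""
    for size in range(m + 1):
        for subset in combinations(range(m), size):
            h = frozenset(subset)
            if all(f(x, h) == y for x, y in data):
                return h
    return None  # cannot happen here: h* itself is always consistent

if __name__ == "__main__":
    rng = random.Random(0)
    m, R = 6, 40
    h_star = sample_hypothesis(m, rng)
    train = generate_data(h_star, m, R, rng)
    h_est = learn(train, m)
    print("h* attributes:", sorted(h_star), " h_est attributes:", sorted(h_est))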

PAC-learning: Slide 14 Copyright © 2001, Andrew W. Moore
Test Error Rate
We specify f, the machine. Nature chooses a hidden random hypothesis h*. Nature randomly generates R datapoints.
How is a datapoint generated?
1. A vector of inputs x_k = (x_k1, x_k2, …, x_km) is drawn from a fixed unknown distribution D.
2. The corresponding output is y_k = f(x_k, h*).
We learn an approximation of h* by choosing some h_est for which the training set error is 0.
For each hypothesis h:
Say h is Correctly Classified (CCd) if h has zero training set error.
Define TESTERR(h) = fraction of test points that h will classify incorrectly = P(h misclassifies a random test point).
Say h is BAD if TESTERR(h) > ε.
Since a BAD hypothesis classifies a random point correctly with probability at most 1 − ε, the chance that a particular BAD hypothesis is CCd on R independent datapoints is at most (1 − ε)^R, and so the chance that at least one of the H hypotheses is both BAD and CCd is at most H(1 − ε)^R ≤ H e^(−Rε).
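
A small illustrative sketch, not from the slides, of estimating TESTERR(h) by Monte Carlo and applying the BAD test. The helper names, the toy target h* = X1 ^ X2, the candidate h_est = X1, and the uniform test distribution are all assumptions made here for the example.

import random

def estimate_testerr(h_predict, target, sample_x, n_test=10000, rng=None):
    """Monte-Carlo estimate of TESTERR(h): the fraction of test points, drawn
    from the same distribution D as the training data, on which the hypothesis
    h_predict disagrees with the target h*."""
    rng = rng or random.Random(0)
    wrong = 0
    for _ in range(n_test):
        x = sample_x(rng)
        if h_predict(x) != target(x):
            wrong += 1
    return wrong / n_test

def is_bad(h_predict, target, sample_x, epsilon, **kw):
    """h is BAD if its (estimated) true error rate exceeds epsilon."""
    return estimate_testerr(h_predict, target, sample_x, **kw) > epsilon

if __name__ == "__main__":
    # Toy check with 3 boolean attributes: h* = X1 ^ X2, h_est = X1.
    sample_x = lambda rng: tuple(rng.random() < 0.5 for _ in range(3))
    h_star = lambda x: x[0] and x[1]
    h_est = lambda x: x[0]
    err = estimate_testerr(h_est, h_star, sample_x)
    print(f"estimated TESTERR(h_est) = {err:.3f}")
    print("BAD at epsilon = 0.1?", is_bad(h_est, h_star, sample_x, 0.1))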

PAC-learning: Slide 17 Copyright © 2001, Andrew W. Moore
PAC Learning
Choose R such that with probability less than δ we'll select a bad h_est (i.e. an h_est which makes mistakes more than a fraction ε of the time): Probably Approximately Correct.
As we just saw, this can be achieved by choosing R such that
  H e^(−Rε) ≤ δ
i.e. R such that
  R ≥ (1/ε) (ln H + ln(1/δ))
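
A small helper, not from the slides, that evaluates this bound numerically. It takes ln H directly so that even the enormous lookup-table hypothesis space stays easy to handle; the example values of m, ε, and δ are arbitrary.

import math

def pac_sample_size(ln_H, epsilon, delta):
    """Smallest integer R with R >= (1/epsilon) * (ln H + ln(1/delta)):
    enough data that, with probability at least 1 - delta, every hypothesis
    with zero training error has true error at most epsilon."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

if __name__ == "__main__":
    m, eps, delta = 10, 0.05, 0.01
    machines = [
        ("And-positive-literals", m * math.log(2)),         # H = 2^m
        ("And-literals",          m * math.log(3)),         # H = 3^m
        ("Lookup table",          (2 ** m) * math.log(2)),  # H = 2^(2^m)
    ]
    for name, ln_H in machines:
        print(f"{name:22s} R >= {pac_sample_size(ln_H, eps, delta)}")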

PAC-learning: Slide 18 Copyright © 2001, Andrew W. Moore
PAC in action

Machine                  Example Hypothesis                        H            R required to PAC-learn
And-positive-literals    X3 ^ X7 ^ X8                              2^m          (1/ε)(m ln 2 + ln(1/δ))
And-literals             X3 ^ ~X7                                  3^m          (1/ε)(m ln 3 + ln(1/δ))
Lookup Table             a truth table over X1..X4 with output Y   2^(2^m)      (1/ε)(2^m ln 2 + ln(1/δ))
And-lits or And-lits     (X1 ^ X5) v (X2 ^ ~X7 ^ X8)

PAC-learning: Slide 19 Copyright © 2001, Andrew W. Moore
PAC for decision trees of depth k
Assume m attributes.
H_k = number of decision trees of depth k
H_0 = 2
H_{k+1} = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees)
        = m × H_k × H_k
Write L_k = log_2 H_k:
L_0 = 1
L_{k+1} = log_2 m + 2 L_k
So L_k = (2^k − 1)(1 + log_2 m) + 1
So to PAC-learn, we need
  R ≥ (1/ε)(ln H_k + ln(1/δ)) = (1/ε)(((2^k − 1)(1 + log_2 m) + 1) ln 2 + ln(1/δ))
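
A short sketch, not from the slides, that checks the closed form for L_k against the recurrence and plugs ln H_k = L_k ln 2 into the sample-size bound. Parameter values in the example are arbitrary.

import math

def log2_num_depth_k_trees(m, k):
    """L_k = log2(H_k) via the recurrence on the slide:
    L_0 = 1, L_{k+1} = log2(m) + 2 * L_k."""
    L = 1.0
    for _ in range(k):
        L = math.log2(m) + 2 * L
    return L

def log2_num_depth_k_trees_closed_form(m, k):
    """The closed form from the slide: L_k = (2^k - 1)(1 + log2 m) + 1."""
    return (2 ** k - 1) * (1 + math.log2(m)) + 1

def pac_sample_size_trees(m, k, epsilon, delta):
    """R >= (1/epsilon) * (ln H_k + ln(1/delta)), with ln H_k = L_k * ln 2."""
    ln_H = log2_num_depth_k_trees_closed_form(m, k) * math.log(2)
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

if __name__ == "__main__":
    m = 10
    for k in range(5):
        assert abs(log2_num_depth_k_trees(m, k)
                   - log2_num_depth_k_trees_closed_form(m, k)) < 1e-9
    print("closed form matches the recurrence for k = 0..4")
    print("depth-3 trees over 10 attributes need R >=",
          pac_sample_size_trees(m, 3, 0.05, 0.01))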

PAC-learning: Slide 20 Copyright © 2001, Andrew W. Moore
What you should know
Be able to understand every step in the math that gets you to
  R ≥ (1/ε)(ln H + ln(1/δ))
Understand that you thus need this many records to PAC-learn a machine with H hypotheses.
Understand examples of deducing H for various machines.