
ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 21: FOUNDATIONS OF MACHINE LEARNING

Objectives: Occam’s Razor, No Free Lunch Theorem, Minimum Description Length, Bias and Variance, Jackknife, Bootstrap

Resources:
WIKI: Occam's Razor
CSCG: MDL On the Web
MS: No Free Lunch
TGD: Bias and Variance
CD: Jackknife Error Estimates
KT: Bootstrap Sampling Tutorial

ECE 8527: Lecture 21, Slide 1
Introduction
With such a wide variety of algorithms to choose from, which one is best? Are there any reasons to prefer one algorithm over another?
Occam’s Razor: if two algorithms perform equally well on the same training data, choose the simpler one. Do simpler or smoother classifiers generalize better?
In some fields, basic laws of physics or nature, such as conservation of energy, provide insight. Are there analogous concepts in machine learning? The Bayes error rate is one such example, but it is mainly of theoretical interest and is rarely, if ever, known in practice. Why?
In this section of the course, we seek mathematical foundations that are independent of any particular classifier or algorithm. We will discuss techniques, such as the jackknife and cross-validation, that can be applied to any algorithm (a brief jackknife sketch follows this slide).
No pattern classification method is inherently superior to any other. It is the type of problem, the prior distributions, and other application-specific information that determine which algorithm is best.
Can we develop techniques or design guidelines to match an algorithm to an application and to predict its performance? Can we combine classifiers to get better performance?
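As a concrete illustration of the kind of algorithm-independent technique mentioned above, here is a minimal jackknife (leave-one-out) error-estimate sketch. It is not from the lecture: the NearestMean classifier, the jackknife_error helper, and the synthetic two-class data are hypothetical placeholders, and any classifier with fit/predict methods could be substituted.

```python
# Minimal jackknife (leave-one-out) error estimate.
# The nearest-mean classifier below is only a stand-in; the jackknife loop
# itself is independent of the learning algorithm being evaluated.
import numpy as np

class NearestMean:
    """Toy two-class classifier: assign each sample to the nearest class mean."""
    def fit(self, X, y):
        self.means_ = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        return self

    def predict(self, X):
        classes = list(self.means_)
        d = np.stack([np.linalg.norm(X - self.means_[c], axis=1) for c in classes])
        return np.array(classes)[d.argmin(axis=0)]

def jackknife_error(make_clf, X, y):
    """Leave each sample out, train on the rest, count misclassifications."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        clf = make_clf().fit(X[mask], y[mask])
        errors += clf.predict(X[i:i + 1])[0] != y[i]
    return errors / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)
    print("Jackknife error estimate:", jackknife_error(NearestMean, X, y))
```

The same loop works for any learner; only the placeholder classifier changes.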

ECE 8527: Lecture 21, Slide 2
Probabilistic Models of Learning
If one algorithm outperforms another, it is a consequence of its fit to the particular problem, not of any general superiority of the algorithm.
Thus far, we have estimated performance using a test set sampled independently from the problem space. This approach has many pitfalls, since it is hard to avoid overlap between the training and test sets. We have also seen methods that can drive the error rate to zero on the training set; hence, using held-out data is important.
Consider a two-category problem where the training set D consists of patterns x_i and associated category labels y_i = ±1, for i = 1, …, n, generated by the unknown target function to be learned, F(x), where y_i = F(x_i).
Let H denote a discrete set of hypotheses, or a possible set of parameters to be learned. Let h(x) denote a particular hypothesis, h(x) ∈ H.
Let P(h) be the prior probability that the algorithm will produce h after training, and let P(h|D) denote the probability that the algorithm will produce h when trained on D.
Let E be the error under a zero-one or other loss function.
Where have we seen this before? (Hint: Relevance Vector Machines)
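To make the notation concrete, the following is a small sketch of this setup on a tiny discrete feature space: a hypothesis set H containing every labeling of the space, a posterior P(h|D) that is uniform over the hypotheses consistent with D, and the zero-one error against the target F. These specific choices (the three-point feature space, the particular F, the uniform posterior) are illustrative assumptions, not part of the lecture.

```python
# Hypothetical illustration of the D, H, P(h|D), F(x) notation on a tiny
# discrete feature space with binary labels y = ±1.
from itertools import product

X_SPACE = [0, 1, 2]                        # the whole (tiny) feature space
H = [dict(zip(X_SPACE, labels))            # H: every possible labeling of X_SPACE
     for labels in product([-1, +1], repeat=len(X_SPACE))]

F = {0: +1, 1: -1, 2: +1}                  # assumed target function F(x)
D = [(0, F[0]), (1, F[1])]                 # training set D = {(x_i, y_i)}, y_i = F(x_i)

def consistent(h, D):
    """True if hypothesis h agrees with every training pair in D."""
    return all(h[x] == y for x, y in D)

# P(h|D): uniform over the hypotheses consistent with D, zero elsewhere.
consistent_h = [h for h in H if consistent(h, D)]
P_h_given_D = {i: (1.0 / len(consistent_h) if consistent(h, D) else 0.0)
               for i, h in enumerate(H)}

def zero_one_error(h, F, points):
    """Average zero-one loss of hypothesis h against target F on the given points."""
    return sum(h[x] != F[x] for x in points) / len(points)

# Expected error under P(h|D), evaluated over the full feature space.
expected_E = sum(P_h_given_D[i] * zero_one_error(h, F, X_SPACE)
                 for i, h in enumerate(H))
print("Hypotheses consistent with D:", len(consistent_h))
print("Expected zero-one error:", expected_E)
```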

ECE 8527: Lecture 21, Slide 3
No Free Lunch Theorem
A natural measure of generalization is the expected value of the error given D:

  E[E|D] = Σ_h Σ_F Σ_{x ∉ D} P(x) [1 − δ(F(x), h(x))] P(h|D) P(F|D)

where δ(·, ·) denotes the Kronecker delta function (1 if its two arguments match, 0 otherwise).

The expected off-training-set classification error, when the true function is F(x) and the kth candidate learning algorithm produces hypotheses with probability P_k(h(x)|D), is:

  E_k(E|F, n) = Σ_{x ∉ D} P(x) [1 − δ(F(x), h(x))] P_k(h(x)|D)

No Free Lunch Theorem: For any two learning algorithms P_1(h|D) and P_2(h|D), the following hold, independent of the sampling distribution P(x) and the number of training points n:
1. Uniformly averaged over all target functions F, E_1(E|F, n) − E_2(E|F, n) = 0.
2. For any fixed training set D, uniformly averaged over F, E_1(E|F, D) − E_2(E|F, D) = 0.
3. Uniformly averaged over all priors P(F), E_1(E|n) − E_2(E|n) = 0.
4. For any fixed training set D, uniformly averaged over P(F), E_1(E|D) − E_2(E|D) = 0.
A small numerical demonstration of property (1) follows this slide.
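The following self-contained sketch (not from the lecture) illustrates property (1) on a tiny discrete feature space: when the off-training-set error is averaged uniformly over every possible target function F, two very different deterministic learners come out exactly equal. The two learners used here, a "copy the majority training label" rule and a "flip the majority training label" rule, and the four-point feature space are arbitrary assumptions chosen only for illustration.

```python
# Numerical illustration of the No Free Lunch theorem on a tiny problem:
# averaged uniformly over all target functions F, two different learners
# have identical expected off-training-set error.
from itertools import product

X_SPACE = [0, 1, 2, 3]                 # entire discrete feature space, uniform P(x)
D_POINTS = [0, 1]                      # x-values that appear in the training set
OFF_D = [x for x in X_SPACE if x not in D_POINTS]

def learner_majority(D):
    """Predict the majority training label everywhere off the training set."""
    vote = sum(y for _, y in D)
    label = +1 if vote >= 0 else -1
    return lambda x: label

def learner_antimajority(D):
    """Deliberately predict the opposite of the majority training label."""
    h = learner_majority(D)
    return lambda x: -h(x)

def off_training_error(h, F):
    """Zero-one error on points outside the training set (uniform P(x))."""
    return sum(h(x) != F[x] for x in OFF_D) / len(OFF_D)

def average_error(learner):
    """Average off-training-set error, uniformly over all target functions F."""
    total, count = 0.0, 0
    for labels in product([-1, +1], repeat=len(X_SPACE)):   # every possible F
        F = dict(zip(X_SPACE, labels))
        D = [(x, F[x]) for x in D_POINTS]                   # training labels come from F
        total += off_training_error(learner(D), F)
        count += 1
    return total / count

print("Majority learner     :", average_error(learner_majority))
print("Anti-majority learner:", average_error(learner_antimajority))
# Both print 0.5 — uniformly averaged over all F, neither learner is better.
```

Changing D_POINTS or swapping in other deterministic learners does not change the outcome: uniformly averaged over all F, the expected off-training-set errors are equal, which is exactly what property (1) asserts.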

ECE 8527: Lecture 21, Slide 4
Summary
Discussed foundations of machine learning that are common to many algorithms, including whether any classifier can be inherently superior to another.
Introduced the No Free Lunch Theorem: uniformly averaged over all possible target functions, no learning algorithm yields a lower expected off-training-set error than any other.