LECTURE 19: FOUNDATIONS OF MACHINE LEARNING


LECTURE 19: FOUNDATIONS OF MACHINE LEARNING
Objectives: Occam's Razor, No Free Lunch Theorem, Minimum Description Length, Bias and Variance, Jackknife, Bootstrap
Resources: WIKI: Occam's Razor; CSCG: MDL on the Web; MS: No Free Lunch; TGD: Bias and Variance; CD: Jackknife Error Estimates; KT: Bootstrap Sampling Tutorial

Introduction
With such a wide variety of algorithms to choose from, which one is best? Are there any reasons to prefer one algorithm over another? Occam's Razor: if two algorithms perform the same on the same training data, choose the simpler one. Do simpler or smoother classifiers generalize better?

In some fields, basic laws of physics or nature, such as conservation of energy, provide insight. Are there analogous concepts in machine learning? The Bayes error rate is one such example, but it is mainly of theoretical interest and is rarely, if ever, known in practice. Why?

In this section of the course, we seek mathematical foundations that are independent of any particular classifier or algorithm. We will discuss techniques, such as jackknifing and cross-validation, that can be applied to any algorithm.

No pattern classification method is inherently superior to any other. It is the type of problem, the prior distributions, and other application-specific information that determine which algorithm is best. Can we develop techniques or design guidelines to match an algorithm to an application and to predict its performance? Can we combine classifiers to get better performance?

Probabilistic Models of Learning
If one algorithm outperforms another, it is a consequence of its fit to the particular problem, not of any general superiority of the algorithm.

Thus far, we have estimated performance using a test data set sampled independently from the problem space. This approach has many pitfalls, since it is hard to avoid overlap between the training and test sets. We have also seen methods that can drive the error rate to zero on the training set; hence, using held-out data is important.

Consider a two-category problem where the training set, D, consists of patterns xi and associated category labels yi = ±1, for i = 1,…,n, generated by the unknown target function to be learned, F(x), where yi = F(xi).

Let H denote a discrete set of hypotheses, or possible sets of parameters to be learned, and let h(x) ∈ H denote a particular hypothesis. Let P(h) be the prior probability that the algorithm will produce h after training, and let P(h|D) denote the probability that the algorithm will produce h when trained on D. Let E be the error under a zero-one or other loss function. Where have we seen this before? (Hint: Relevance Vector Machines.)

No Free Lunch Theorem
A natural measure of generalization is the expected value of the error given D:

E[E|D] = Σ_{h,F} Σ_{x∉D} P(x) [1 − δ(F(x), h(x))] P(h|D) P(F|D),

where δ(·,·) denotes the Kronecker delta function (value of 1 if the two arguments match, zero otherwise). The expected off-training-set classification error when the true function is F(x) and the kth candidate learning algorithm produces hypotheses with probability Pk(h(x)|D) is:

Ek(E|F,n) = Σ_{x∉D} P(x) [1 − δ(F(x), h(x))] Pk(h(x)|D).

No Free Lunch Theorem: for any two learning algorithms P1(h|D) and P2(h|D), the following are true independent of the sampling distribution P(x) and the number of training points n:
1. Uniformly averaged over all target functions F, E1(E|F,n) − E2(E|F,n) = 0.
2. For any fixed training set D, uniformly averaged over F, E1(E|F,D) − E2(E|F,D) = 0.
3. Uniformly averaged over all priors P(F), E1(E|n) − E2(E|n) = 0.
4. For any fixed training set D, uniformly averaged over P(F), E1(E|D) − E2(E|D) = 0.
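The averaging claim can be checked by brute force on a toy problem. The sketch below (not from the lecture; the domain, training set, and the two learners are made up for illustration) enumerates all 16 boolean target functions on a 2-bit input space, fixes a training set of two points, and averages the off-training-set error of two opposite deterministic learners over all targets:

```python
# No Free Lunch sanity check: averaged over ALL target functions, two
# opposite learners have identical off-training-set error (0.5 each).
from itertools import product

domain = list(product([0, 1], repeat=2))   # four possible 2-bit inputs
train_x = domain[:2]                       # fixed training set D (two points)
test_x = domain[2:]                        # off-training-set points

def learner_majority(train):
    """Predict the majority training label everywhere (ties -> 1)."""
    ones = sum(y for _, y in train)
    label = 1 if ones * 2 >= len(train) else 0
    return lambda x: label

def learner_antimajority(train):
    """Predict the opposite of the majority training label."""
    g = learner_majority(train)
    return lambda x: 1 - g(x)

def off_training_error(learner, target):
    h = learner([(x, target[x]) for x in train_x])
    return sum(h(x) != target[x] for x in test_x) / len(test_x)

# All 2^4 = 16 possible target functions F on the domain
targets = [dict(zip(domain, labels)) for labels in product([0, 1], repeat=4)]
avg1 = sum(off_training_error(learner_majority, F) for F in targets) / len(targets)
avg2 = sum(off_training_error(learner_antimajority, F) for F in targets) / len(targets)
print(avg1, avg2)   # both 0.5: neither learner beats the other averaged over F
```

Either learner wins on half the targets and loses on the other half; the average is identical, exactly as the theorem requires.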

Analysis
The first proposition states that, uniformly averaged over all target functions, the expected off-training-set error is the same for all learning algorithms:

Σ_F Σ_D P(D|F) [E1(E|F,n) − E2(E|F,n)] = 0.

Stated more generally, there is no pair i, j such that Ei(E|F,n) > Ej(E|F,n) for all F(x). Further, no matter what algorithm we use, there is at least one target function for which random guessing is better.

The second proposition states that even if we know D, then averaged over all target functions no learning algorithm yields a test-set error superior to any other's:

Σ_F [E1(E|F,D) − E2(E|F,D)] = 0.

The six squares in the accompanying figure represent all possible classification problems. If a learning system performs better than average over some set of problems, it must perform worse than average elsewhere.

Algorithmic Complexity
Can we find some irreducible representation of all members of a category? Algorithmic complexity, also known as Kolmogorov complexity, seeks to measure the inherent complexity of a binary string.

If the sender and receiver agree on a mapping, or compression technique, the pattern x can be transmitted as y and recovered as x = L(y). The cost of transmission is the length of y, |y|. The least such cost is the minimum length, denoted:

min |y| such that x = L(y).

A universal description should be independent of the specification (e.g., the programming language or machine assembly language). The Kolmogorov complexity of a binary string x, denoted K(x), is defined as the size of the shortest program string y that, without additional data, computes x:

K(x) = min_{y : U(y) = x} |y|,

where U represents an abstract universal Turing machine.

Consider a string of n 1s. If our machine is a loop that prints 1s, we only need log2 n bits to specify the number of iterations. Hence, K(x) = O(log2 n).
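K(x) itself is uncomputable, but any real compressor provides an upper bound: the compressed length is the size of one concrete description from which x can be recovered. A small sketch (my illustration, not the lecture's; the strings and the choice of zlib are arbitrary) contrasts a highly regular string with a near-random one:

```python
# Compression length as a computable upper bound on Kolmogorov complexity.
import random
import zlib

def compressed_len(s: bytes) -> int:
    """Length of one concrete description of s (an upper bound on K(s))."""
    return len(zlib.compress(s, level=9))

regular = b"1" * 1000                     # a string of n 1s: K(x) = O(log n)
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # near-random bytes

print(compressed_len(regular))   # tiny: "print 1000 ones" is a short program
print(compressed_len(noisy))     # near (or above) 1000: essentially incompressible
```

The regular string compresses to a few dozen bytes, mirroring the log2 n argument above, while the random string admits no description much shorter than itself.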

Minimum Description Length (MDL)
We seek to design a classifier that minimizes the sum of the model's algorithmic complexity and the description length of the training data, D, with respect to that model:

K(h, D) = K(h) + K(D using h).

Examples of MDL include measuring the complexity of a decision tree in terms of its number of nodes, or the complexity of an HMM in terms of its number of states.

We can view MDL from a Bayesian perspective. The optimal hypothesis, h*, is the one yielding the highest posterior:

h* = argmax_h [log2 P(h) + log2 P(D|h)].

Shannon's optimal coding theorem provides the link between MDL and Bayesian methods by stating that the lower bound on the cost of transmitting a string x is proportional to −log2 P(x).
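A crude two-part code makes the tradeoff concrete. In the sketch below (an illustration under assumptions of my own, not the textbook's construction: a flat cost of 32 bits per coefficient for the model, and residuals coded with a unit-variance Gaussian code), MDL correctly prefers a linear model over a constant one for genuinely linear data:

```python
# Two-part MDL score: bits for the model + bits for the data given the model.
import math
import random

random.seed(1)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x - 1.0 + random.gauss(0, 0.3) for x in xs]   # truly linear data

def gaussian_bits(residuals):
    """Bits to encode residuals under a unit-variance Gaussian code (-log2 density)."""
    return sum((r * r / 2 + 0.5 * math.log(2 * math.pi)) / math.log(2)
               for r in residuals)

def mdl_constant(bits_per_param=32):
    """One parameter (the mean) + residual code length."""
    mu = sum(ys) / len(ys)
    return bits_per_param * 1 + gaussian_bits([y - mu for y in ys])

def mdl_linear(bits_per_param=32):
    """Two parameters (intercept, slope) + residual code length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return bits_per_param * 2 + gaussian_bits([y - (a + b * x)
                                               for x, y in zip(xs, ys)])

print(mdl_constant() > mdl_linear())   # True: paying 32 bits for a slope is
                                       # cheaper than coding large residuals
```

The extra 32 bits spent on the slope coefficient are repaid many times over by the shorter description of the residuals, which is exactly the balance K(h) + K(D using h) expresses.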

Bias and Variance
Two ways to measure how well a learning algorithm matches the classification problem involve bias and variance. Bias measures accuracy in terms of the distance from the true value of a parameter: high bias implies a poor match. Variance measures precision in terms of the squared deviation from the expected estimate: high variance implies a weak match. For mean-square error, bias and variance are related.

Consider these in the context of modeling data using regression analysis. Suppose there is an unknown function F(x) that we seek to estimate based on n samples in a set D drawn from F(x). The regression estimate will be denoted g(x;D). The mean-square error of this estimate is (see Lecture 5, Slide 12):

E_D[(g(x;D) − F(x))²] = (E_D[g(x;D)] − F(x))² + E_D[(g(x;D) − E_D[g(x;D)])²].

The first term is the squared bias and the second term is the variance. This is known as the bias-variance tradeoff, since more flexible classifiers tend to have lower bias but higher variance.
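The decomposition can be verified by Monte Carlo simulation. In this sketch (with a made-up target F(x) = x², noise level, and sample size of my choosing), we draw many training sets D, fit a linear regressor g(x;D), and check that MSE = bias² + variance at a fixed query point:

```python
# Monte-Carlo check of the bias-variance decomposition for regression.
import random

random.seed(2)
F = lambda x: x * x                 # "unknown" target function
x0 = 0.7                            # fixed query point
n, trials = 20, 4000

def fit_linear(data):
    """Ordinary least-squares line through the training pairs."""
    mx = sum(x for x, _ in data) / len(data)
    my = sum(y for _, y in data) / len(data)
    b = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + b * (x - mx)

preds = []
for _ in range(trials):
    D = [(x, F(x) + random.gauss(0, 0.1))
         for x in [random.uniform(0, 1) for _ in range(n)]]
    preds.append(fit_linear(D)(x0))   # g(x0; D) for this training set

mean_pred = sum(preds) / trials
mse = sum((p - F(x0)) ** 2 for p in preds) / trials
bias_sq = (mean_pred - F(x0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / trials
print(mse, bias_sq + variance)       # agree up to floating-point rounding
```

The linear model cannot represent x², so the bias term is nonzero no matter how many samples we draw, while the variance term shrinks as n grows: the tradeoff in miniature.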

Bias and Variance For Classification
Consider our two-category classification problem, and a discriminant function of the form:

y = F(x) + ε,

where ε is a zero-mean random variable, here binomially distributed with variance:

Var[ε|x] = F(x)(1 − F(x)).

The target function can thus be expressed as F(x) = E[y|x]. Our goal is to minimize E_D[(g(x;D) − F(x))²]. Assuming equal priors, the classification error rate can be shown to be:

Pr[g(x;D) ≠ y] = |2F(x) − 1| Pr[g(x;D) ≠ y_B] + Pr[y_B ≠ y],

where y_B is the Bayes discriminant (threshold of 1/2 in the case of equal priors). The key point here is that the classification error is linearly proportional to Pr[g(x;D) ≠ y_B], which can be considered a boundary error in that it represents the incorrect estimation of the optimal (Bayes) boundary.

Bias and Variance For Classification (Cont.)
If we assume p(g(x;D)) is a Gaussian distribution, we can compute this error by integrating the tails of the distribution (see the derivation of P(E) in Chapter 2). We can show:

Pr[g(x;D) ≠ y_B] = Φ( sgn[F(x) − 1/2] · (E_D[g(x;D)] − 1/2) / √(Var_D[g(x;D)]) ),

where Φ(t) = (1/√(2π)) ∫_t^∞ e^(−u²/2) du. The key point here is that the first factor in the argument is the boundary bias and the second is the variance. Hence bias and variance are related in a nonlinear manner; for classification the relationship is multiplicative. Typically, variance dominates bias, and hence classifiers are designed to minimize variance. See Fig. 9.5 in the textbook for an example of how bias and variance interact for a two-category problem.

Resampling For Estimating Statistics
How can we estimate the bias and variance from real data? Suppose we have a set D of n data points, xi for i = 1,…,n. The estimates of the mean and sample variance are:

µ̂ = (1/n) Σ_i xi,    σ̂² = (1/(n−1)) Σ_i (xi − µ̂)².

Suppose we wanted to estimate other statistics, such as the median or mode. There is no straightforward way to measure their error. The jackknife and bootstrap are two of the most popular resampling techniques for estimating such statistics.

Use the "leave-one-out" method:

µ_(i) = (1/(n−1)) Σ_{j≠i} xj = (n µ̂ − xi)/(n−1).

This is just the sample average with the ith point deleted. The jackknife estimate of the mean is defined as:

µ_(·) = (1/n) Σ_i µ_(i).

The variance of this estimate is:

Var[µ̂] = ((n−1)/n) Σ_i (µ_(i) − µ_(·))².

The benefit of this expression is that it can be applied to any statistic.

Jackknife Bias and Variance Estimates
We can write a general estimate of the bias as:

bias = E[θ̂] − θ.

The jackknife method can be used to estimate this bias. The procedure is to delete the points xi one at a time from D and compute the leave-one-out estimates θ_(i), with mean:

θ_(·) = (1/n) Σ_i θ_(i).

The jackknife estimate of the bias is:

bias_jack = (n−1)(θ_(·) − θ̂).

Rearranging terms gives the bias-corrected estimate:

θ̃ = θ̂ − bias_jack = n θ̂ − (n−1) θ_(·).

Recall the traditional variance, Var[θ̂] = E[(θ̂ − E[θ̂])²]. The jackknife estimate of the variance is:

Var_jack[θ̂] = ((n−1)/n) Σ_i (θ_(i) − θ_(·))².

This same strategy can be applied to the estimation of other statistics.
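The bias correction can be seen at work on a statistic whose bias is known analytically: the plug-in variance (dividing by n) has bias −σ²/n, and the jackknife recovers this without being told. A sketch on made-up data:

```python
# Jackknife bias correction of the plug-in (divide-by-n) variance estimator.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

def plugin_var(xs):
    """Biased variance estimate: divides by len(xs), not len(xs) - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

theta_hat = plugin_var(data)                                  # biased estimate
loo = [plugin_var(data[:i] + data[i + 1:]) for i in range(n)] # theta_(i)
theta_dot = sum(loo) / n

bias_jack = (n - 1) * (theta_dot - theta_hat)   # jackknife bias estimate
theta_tilde = theta_hat - bias_jack             # bias-corrected estimate

print(theta_hat, theta_tilde)
```

For this statistic the correction is exact: the corrected value θ̃ equals the unbiased divide-by-(n−1) sample variance.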

Bootstrap
A bootstrap data set is created by randomly selecting n points from the training set D, with replacement. In bootstrap estimation, this selection process is repeated B times to yield B bootstrap data sets, which are treated as independent sets. The bootstrap estimate of a statistic θ, denoted θ̂*(·), is merely the mean of the B estimates on the individual bootstrap data sets:

θ̂*(·) = (1/B) Σ_{b=1}^{B} θ̂*(b).

The bootstrap estimate of the bias is:

bias_boot = θ̂*(·) − θ̂.

The bootstrap estimate of the variance is:

Var_boot = (1/B) Σ_{b=1}^{B} (θ̂*(b) − θ̂*(·))².

The bootstrap estimate of the variance of the mean can be shown to approach the traditional variance of the mean as B → ∞. The larger the number of bootstrap samples, the better the estimate.
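The recipe above translates almost line-for-line into code. This sketch (data values and B chosen arbitrarily for illustration) bootstraps the bias and variance of the median:

```python
# Bootstrap estimates of the bias and variance of the median:
# B resamples of size n drawn with replacement from the data.
import random
import statistics

random.seed(3)
data = [4.2, 1.7, 3.3, 9.1, 2.8, 5.5, 3.9, 6.4, 2.1, 7.0]
n, B = len(data), 2000

theta_hat = statistics.median(data)            # estimate on the original data

# theta*(b): the statistic on each bootstrap data set
boot = [statistics.median(random.choices(data, k=n)) for _ in range(B)]
theta_star = sum(boot) / B                     # theta*(.): mean of B estimates

bias_boot = theta_star - theta_hat
var_boot = sum((t - theta_star) ** 2 for t in boot) / B

print(theta_hat, bias_boot, var_boot)
```

As with the jackknife, swapping `statistics.median` for any other estimator reuses the same machinery; the only cost of a better estimate is a larger B.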

Summary Analyzed the No Free Lunch Theorem. Introduced Minimum Description Length. Discussed bias and variance using regression as an example. Introduced a class of methods based on resampling to estimate statistics. Introduced the Jackknife and Bootstrap methods. Next: Introduce similar techniques for classifier design.