CS668: Pattern Recognition Ch 1: Introduction Daniel Barbará
Patterns. Searching for patterns in data is a fundamental problem with a successful history. Many patterns have led to laws: e.g., astronomical observations led to the laws of planetary motion, and patterns in atomic spectra led to quantum physics. Pattern recognition is the discovery of regularities in data through computer algorithms and the use of those regularities to make decisions (e.g., classifying data into categories).
Example Handwritten Digit Recognition
How can you do it? One approach: develop a series of rules or heuristics describing the shapes of the digits (naïve, brittle). A machine learning approach: characterize each digit as a vector of features x and discover a function y(x) that maps the feature vector to a category in {c1, c2, …, ck}. This is called supervised learning (the ‘teacher’ being the training set).
Other pattern recognition problems. Unsupervised learning (clustering): discover previously unknown groups in the data. Density estimation: discover the distribution of the data. Regression (prediction): like classification, but with real-valued outputs. Reinforcement learning: find suitable actions to take in a given situation in order to maximize a reward.
Polynomial Curve Fitting
Sum-of-Squares Error Function: fit by minimizing an objective function, the error function.
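The equations on this slide did not survive extraction; presumably they are the standard polynomial model and sum-of-squares error, which in the usual notation read:

    y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j,
    \qquad
    E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2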
Choosing the order of the polynomial: 0th Order
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Observations. The 9th-order polynomial achieves zero training error (is this the best?), but with lots of oscillations. How about predicting the future? OVERFITTING! The fit is likely to do poorly on future data.
Over-fitting Root-Mean-Square (RMS) Error:
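The RMS error formula was lost in extraction; in the standard notation it is

    E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^{\star}) / N \,}

which divides by N to allow comparison across data sets of different sizes and takes the square root to put the error on the same scale as the targets t.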
What is going on? The data was generated from a smooth underlying function (sin(2πx) plus noise). A power-series expansion (e.g., Taylor) of that function contains terms of all orders, so we should expect improvement as we increase M! What gives?
Polynomial Coefficients
What is going on? Larger values of M result in coefficients that are increasingly tuned to the noise. Paying too much attention to the training data is not a good thing! The severity of this problem varies with the size of the training set.
Data Set Size: 9th Order Polynomial
Data Set Size: 9th Order Polynomial
Regularization: penalize large coefficient values.
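The regularized error function itself is missing from the extracted slide; the standard form penalizing large coefficients is

    \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

where λ controls the trade-off between fitting the data and keeping the coefficients small.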
Regularization: fits obtained with different values of the regularization coefficient, and RMS error as the regularization coefficient varies (figures).
Polynomial Coefficients
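As a minimal sketch (not part of the original slides) of the experiment behind these coefficient tables, the following Python code fits a 9th-order polynomial with and without an L2 (ridge) penalty on the coefficients; the target function sin(2πx), the noise level, and the value of λ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    N, M, lam = 10, 9, np.exp(-18)          # points, polynomial order, regularization strength (assumed)
    x = np.linspace(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)    # noisy targets

    Phi = np.vander(x, M + 1, increasing=True)  # design matrix with columns 1, x, ..., x^M

    # Plain least squares: w minimizes ||Phi w - t||^2
    w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    # Ridge (regularized) solution: w = (Phi^T Phi + lambda I)^{-1} Phi^T t
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

    print("largest |coefficient| without regularization:", np.abs(w_ls).max())
    print("largest |coefficient| with regularization:   ", np.abs(w_ridge).max())

Running it typically shows the unregularized coefficients growing very large while the regularized ones stay moderate, which is the point of the coefficient table.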
Classification: build a machine that can do fingerprint identification, OCR (Optical Character Recognition), DNA sequence identification.
An Example: “Sorting incoming fish on a conveyor according to species using optical sensing.” The two species are sea bass and salmon.
Problem Analysis. Set up a camera and take some sample images from which to extract features: length, lightness, width, number and shape of fins, position of the mouth, etc. This is the set of all suggested features to explore for use in our classifier!
Preprocessing. Use a segmentation operation to isolate fish from one another and from the background. The information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features. The features are then passed to a classifier.
Classification Select the length of the fish as a possible feature for discrimination
The length is a poor feature alone! Select the lightness as a possible feature.
Task of decision theory. There is a relationship between the threshold (decision boundary) and the cost: move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!). This trade-off is the task of decision theory.
Adopt the lightness and add the width of the fish: each fish is represented by the feature vector x^T = [x1, x2] (lightness, width).
We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such “noisy features.” Ideally, the best decision boundary is the one that provides optimal performance, such as in the following figure:
However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input: the issue of generalization!
Conclusion. The reader may feel overwhelmed by the number, complexity, and magnitude of the sub-problems of pattern recognition. Many of these sub-problems can indeed be solved. Many fascinating unsolved problems still remain.
Probability Theory Apples and Oranges
Probability Theory Marginal Probability Conditional Probability Joint Probability
Probability Theory Sum Rule Product Rule
The Rules of Probability: Sum Rule, Product Rule
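The two rules themselves were lost in extraction; in standard notation they are

    \text{sum rule: } p(X) = \sum_{Y} p(X, Y),
    \qquad
    \text{product rule: } p(X, Y) = p(Y \mid X)\, p(X)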
Bayes’ Theorem: posterior ∝ likelihood × prior
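Written out, Bayes’ theorem combines the two rules above:

    p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},
    \qquad
    p(X) = \sum_{Y} p(X \mid Y)\, p(Y)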
Probability Densities
Transformed Densities
Expectations Conditional Expectation (discrete) Approximate Expectation (discrete and continuous)
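The definitions were lost in extraction; in standard notation the expectation, the conditional expectation, and the sample approximation are

    \mathbb{E}[f] = \sum_{x} p(x) f(x) \;\;\text{(discrete)}, \qquad
    \mathbb{E}[f] = \int p(x) f(x)\, dx \;\;\text{(continuous)},

    \mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y) f(x), \qquad
    \mathbb{E}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)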
Variances and Covariances
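For reference, the standard definitions that presumably appeared on this slide are

    \operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2,

    \operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]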
The Gaussian Distribution
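The density itself is missing from the extracted slide; the univariate Gaussian is

    \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}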
Gaussian Mean and Variance
The Multivariate Gaussian
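In D dimensions, with mean vector μ and covariance matrix Σ, the density presumably shown here is

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
    \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}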
Gaussian Parameter Estimation Likelihood function
Maximum (Log) Likelihood
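The log-likelihood and its maximizers, which the extracted slide no longer shows, are in standard notation

    \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi),

    \mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad
    \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2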
Properties of the maximum likelihood estimates μ_ML and σ²_ML
Curve Fitting Re-visited
Maximum Likelihood: determine w_ML by minimizing the sum-of-squares error E(w).
Predictive Distribution
MAP: A Step towards Bayes. Determine w_MAP by minimizing the regularized sum-of-squares error Ẽ(w).
Bayesian Curve Fitting
Bayesian Predictive Distribution
With all the detail… See MLE&Bayesian.pdf
Lessons. MLE: postulate a (parametric) distribution; form the log-likelihood; maximize it with respect to the parameters (use optimization techniques to find the optimal values). Bayesian: postulate a prior and a likelihood distribution for the parameters (*CAREFUL: use conjugacy so the functional form is preserved*); determine the distribution of the parameter(s) using Bayes’ theorem.
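A minimal Python sketch (not from the slides) contrasting the two recipes on a toy problem: estimating the mean of a Gaussian with known variance. The Gaussian prior is conjugate, so the posterior is again Gaussian; the prior parameters and data settings are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mu, sigma = 1.5, 1.0                  # data-generating mean and (known) std. dev. (assumed)
    x = rng.normal(true_mu, sigma, size=20)    # observed data
    N = len(x)

    # MLE recipe: form the log-likelihood and maximize it.
    # For a Gaussian mean with known variance the maximizer is the sample mean.
    mu_ml = x.mean()

    # Bayesian recipe: conjugate Gaussian prior N(mu | mu0, s0^2) gives a Gaussian posterior.
    mu0, s0 = 0.0, 1.0                         # prior mean and std. dev. (assumed)
    post_var = 1.0 / (1.0 / s0**2 + N / sigma**2)
    post_mean = post_var * (mu0 / s0**2 + x.sum() / sigma**2)

    print(f"MLE estimate:           {mu_ml:.3f}")
    print(f"Posterior mean +/- std: {post_mean:.3f} +/- {np.sqrt(post_var):.3f}")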
Model Selection Cross-Validation
Curse of Dimensionality: the original problem and a grid-based approach (figures).
Curse of Dimensionality
Volume: what fraction of the volume of a hypersphere is captured in the thin shell between r = 1 − ε and r = 1?
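Since the volume of a hypersphere of radius r in D dimensions scales as r^D, the fraction in the shell is

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^{D} \;\longrightarrow\; 1 \quad \text{as } D \to \infty

so in high dimensions nearly all of the volume concentrates in a thin shell near the surface.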
Curse of Dimensionality: polynomial curve fitting with M = 3; Gaussian densities in higher dimensions (figures).
Decision Theory. Inference step: determine either p(x, t) or p(t | x). Decision step: for a given x, determine the optimal t.
Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2), otherwise decide ω2. Use of the class-conditional information: P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
Bayes: posterior, likelihood, evidence. P(ωj | x) = P(x | ωj) P(ωj) / P(x), where in the case of two categories P(x) = Σ_{j=1,2} P(x | ωj) P(ωj). In words: posterior = (likelihood × prior) / evidence.
Minimum Misclassification Rate
Decision given the posterior probabilities. For an observation x: if P(ω1 | x) > P(ω2 | x), the decided state of nature is ω1; if P(ω1 | x) < P(ω2 | x), it is ω2. Therefore, whenever we observe a particular x, the probability of error is P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.
Minimizing the probability of error: decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2. Therefore P(error | x) = min[P(ω1 | x), P(ω2 | x)] (the Bayes decision). We cannot do better than this!
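Averaging over x, the resulting (minimal) probability of error is, in the notation of these slides,

    P(\text{error}) = \int \min\big[\, P(\omega_1 \mid x),\; P(\omega_2 \mid x) \,\big]\, p(x)\, dx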
Minimum Expected Loss. Example: classify medical images as ‘cancer’ or ‘normal’; a loss matrix assigns a cost to each combination of decision and true class.
Minimum Expected Loss: the decision regions are chosen to minimize the expected loss, written out below.
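The expected-loss expression that presumably appeared on this slide is, with loss matrix entries L_kj (the cost of deciding class j when the truth is class k),

    \mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}

which is minimized by assigning each x to the class j that minimizes Σ_k L_kj p(C_k | x).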
Reject Option
Why Separate Inference and Decision? Minimizing risk (the loss matrix may change over time); the reject option; unbalanced class priors; combining models.
Decision Theory for Regression. Inference step: determine p(x, t). Decision step: for a given x, make an optimal prediction y(x) for t. Loss function: L(t, y(x)).
The Squared Loss Function
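With the squared loss, the expected loss and its minimizer (the conditional mean) are, in standard notation,

    \mathbb{E}[L] = \iint \{\, y(\mathbf{x}) - t \,\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt,
    \qquad
    y^{\star}(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt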
Generative vs. Discriminative. Generative approach: model p(x | Ck) and p(Ck) (or the joint p(x, Ck)) and use Bayes’ theorem to obtain p(Ck | x). Discriminative approach: model p(Ck | x) directly.
Entropy: an important quantity in coding theory, statistical physics, and machine learning.
Entropy. Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x? First case: all states equally likely.
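With all 8 states equally likely, the entropy (the equation is missing from the extracted slide) is

    H[x] = -\sum_{x} p(x) \log_2 p(x) = -\,8 \times \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \text{ bits}

so 3 bits are needed to transmit the state of x.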
Entropy
Entropy. In how many ways can N identical objects be allocated to M bins? The resulting entropy is maximized when all bins are equally likely, i.e., p_i = 1/M for all i.
Entropy
Differential Entropy. Put bins of width Δ along the real line. Differential entropy is maximized (for a fixed variance) when p(x) is Gaussian; see the result below.
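In standard notation, the differential entropy and the maximizing result referred to above are

    H[x] = -\int p(x) \ln p(x)\, dx,

maximized for fixed variance σ² by the Gaussian, in which case

    H[x] = \frac{1}{2}\big\{ 1 + \ln(2\pi\sigma^2) \big\}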
Conditional Entropy
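The standard definition presumably shown on this slide is

    H[\mathbf{y} \mid \mathbf{x}] = -\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y}\, d\mathbf{x},
    \qquad
    H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}]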
The Kullback-Leibler Divergence
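The definition, lost in extraction, is in standard notation

    \mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \;\geq\; 0,
    \quad \text{with equality if and only if } p(\mathbf{x}) = q(\mathbf{x})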
Mutual Information
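Mutual information is the KL divergence between the joint distribution and the product of the marginals; equivalently, in terms of entropies,

    I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y}) \big)
    = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}]
    = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}]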