1
CS668: Pattern Recognition Ch 1: Introduction
Daniel Barbará
2
Patterns
Searching for patterns in data is a fundamental problem with a long and successful history. Many patterns have led to laws: astronomical observations led to the laws of planetary motion, and patterns in atomic spectra led to quantum physics. Pattern recognition is the discovery of regularities in data through computer algorithms, and the use of those regularities to make decisions (e.g., classify data into categories).
3
Example: Handwritten Digit Recognition
4
How can you do it?
One option is to develop a series of hand-crafted rules or heuristics describing the shapes of the digits; this approach is naïve and brittle. A machine learning approach instead characterizes each digit as a feature vector x and discovers a function y(x) that maps the feature vector to a category in {c1, c2, …, ck}. This is called supervised learning, where the 'teacher' is the labeled training set.
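As a concrete, minimal sketch of this recipe, the snippet below trains a classifier on scikit-learn's small bundled digits dataset; the library, dataset, and model choice are illustrative assumptions, not part of the original slides.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # each digit is a 64-dimensional feature vector x (8x8 pixel intensities)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Learn the mapping y(x) -> {c1, ..., ck} from the labeled training set (the "teacher")
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))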
5
Other pattern recognition problems
Unsupervised learning (clustering): discover previously unknown groups in the data.
Density estimation: model the probability distribution of the data.
Prediction (regression): like classification, but with real-valued outputs.
Reinforcement learning: find suitable actions to take in a given situation in order to maximize a reward.
6
Polynomial Curve Fitting
7
Sum-of-Squares Error Function
Minimizing an objective function: the error function.
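The equations on this slide are images that did not survive the transcript; in the notation of Bishop's PRML Ch. 1, which these slides follow, the polynomial model and its sum-of-squares error are:

y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j,
\qquad
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2

Minimizing E(w) over the coefficients w gives the fitted curve; because E is quadratic in w, the minimizer has a closed-form solution.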
8
Choosing the order of the polynomial: 0th Order
9
1st Order Polynomial
10
3rd Order Polynomial
11
9th Order Polynomial
12
Observations
The 9th-order polynomial achieves zero error on the training points; is this the best fit? The curve shows large oscillations, so how well will it predict future data? This is overfitting: the model is likely to perform poorly on new data.
13
Over-fitting Root-Mean-Square (RMS) Error:
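The RMS error referred to here (the slide's equation is not in the transcript; this is the standard definition in Bishop's Ch. 1) is

E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^{*}) / N \,}

where w* is the minimizing coefficient vector; dividing by N lets data sets of different sizes be compared, and the square root puts the error on the same scale as the target t.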
14
What is going on? The data was generated from the function sin(2πx) with additive Gaussian noise.
A power-series (e.g., Taylor) expansion of sin(2πx) contains terms of all orders, so we might expect the fit to improve steadily as we increase M. What gives?
15
Polynomial Coefficients
16
What is going on? Larger values of M result in coefficients that are increasingly tuned to the noise. Paying too much attention to the training data is not a good thing! The severity of this problem varies with the size of the training set.
17
Data Set Size: 9th Order Polynomial
18
Data Set Size: 9th Order Polynomial
19
Regularization: penalize large coefficient values.
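The regularized error function this refers to (the equation image is missing from the transcript; this is Bishop's Ch. 1 form) is

\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

where λ controls the trade-off between fitting the data and keeping the coefficients small (known as shrinkage, or ridge regression when the model is linear in the parameters).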
20
Regularization:
21
Regularization:
22
Regularization: E_RMS vs. ln λ
23
Polynomial Coefficients
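The coefficient tables on these slides are images not captured in the transcript. The following numpy sketch compares the largest coefficient magnitude of a 9th-order fit with and without a small quadratic penalty; the data-generating function follows Bishop's running example, and the specific N and λ are illustrative choices, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 10, 9, np.exp(-18)              # sample size, polynomial order, illustrative lambda

x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

Phi = np.vander(x, M + 1, increasing=True)  # design matrix with columns x^0 ... x^M

# Unregularized least squares: w minimizes the sum-of-squares error
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Regularized (ridge) solution: w minimizes E(w) + (lambda/2)||w||^2
A = Phi.T @ Phi + lam * np.eye(M + 1)
w_ridge = np.linalg.solve(A, Phi.T @ t)

print("max |w| without penalty:", np.abs(w_ls).max())
print("max |w| with penalty   :", np.abs(w_ridge).max())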
24
Classification: build a machine that can perform fingerprint identification, OCR (optical character recognition), DNA sequence identification, and so on.
25
An Example: sorting incoming fish on a conveyor belt according to species (sea bass vs. salmon) using optical sensing.
26
Problem Analysis
Set up a camera and take some sample images from which to extract features: length, lightness, width, number and shape of fins, position of the mouth, etc. This is the set of all suggested features to explore for use in our classifier.
27
Preprocessing
Use a segmentation operation to isolate the fish from one another and from the background. Information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain features. These features are passed to a classifier.
29
Classification: select the length of the fish as a possible feature for discrimination.
31
The length is a poor feature alone!
Select the lightness as a possible feature.
33
Threshold decision boundary and cost relationship: move the decision boundary toward smaller values of lightness in order to minimize the cost (i.e., reduce the number of sea bass classified as salmon). Making this trade-off is the task of decision theory.
34
Adopt the lightness and add the width of the fish
Each fish is represented by the feature vector xᵀ = [x₁, x₂], where x₁ is the lightness and x₂ is the width.
36
We might add other features that are not correlated with the ones we already have, but care must be taken not to reduce performance by adding such noisy features. Ideally, the best decision boundary is the one that provides optimal performance, such as in the following figure:
38
However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input: this is the issue of generalization.
40
Conclusion
The reader may feel overwhelmed by the number, complexity, and magnitude of the sub-problems of pattern recognition. Many of these sub-problems can indeed be solved, and many fascinating unsolved problems remain.
41
Probability Theory Apples and Oranges
42
Probability Theory: joint, marginal, and conditional probability.
43
Probability Theory Sum Rule Product Rule
44
The Rules of Probability
Sum Rule and Product Rule
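Written out (the slide equations are not in the transcript), the two rules are

\text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y),
\qquad
\text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X)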
45
Bayes' Theorem: posterior ∝ likelihood × prior
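Explicitly,

p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)},
\qquad
p(X) = \sum_{Y} p(X \mid Y)\, p(Y)

where the denominator p(X) is the normalizing constant that makes the posterior sum to one.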
46
Probability Densities
47
Transformed Densities
48
Expectations Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
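In symbols (reconstructed in Bishop's Ch. 1 notation, since the slide equations are missing):

\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \quad \text{(discrete)},
\qquad
\mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad \text{(continuous)}

\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \quad \text{(conditional expectation)}

\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \quad \text{(approximation from samples } x_n \sim p(x)\text{)}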
49
Variances and Covariances
50
The Gaussian Distribution
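For reference, the univariate Gaussian density is

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}

with mean E[x] = μ and variance var[x] = σ²; the multivariate form on the later slide replaces σ² by a covariance matrix Σ and the squared distance by (x − μ)ᵀΣ⁻¹(x − μ).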
51
Gaussian Mean and Variance
52
The Multivariate Gaussian
53
Gaussian Parameter Estimation
Likelihood function
54
Maximum (Log) Likelihood
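The quantities referred to on this slide (again in Bishop's notation) are the log likelihood of i.i.d. Gaussian data and its maximizers:

\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n,
\qquad
\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2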
55
Properties of μ_ML and σ²_ML
56
Curve Fitting Re-visited
57
Maximum Likelihood: determine w_ML by minimizing the sum-of-squares error E(w).
58
Predictive Distribution
59
MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error:
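In Bishop's notation, with a Gaussian prior of precision α on w and Gaussian noise of precision β, maximizing the posterior is equivalent to minimizing

\frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\alpha}{2}\, \mathbf{w}^{\mathrm{T}} \mathbf{w}

i.e., regularized least squares with λ = α/β.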
60
Bayesian Curve Fitting
61
Bayesian Predictive Distribution
62
With all the detail… See MLE&Bayesian.pdf
63
Lessons
MLE: postulate a (parametric) distribution, form the log likelihood, and maximize it with respect to the parameters (use optimization techniques to find the optimal values).
Bayesian: postulate a prior and a likelihood distribution for the parameters (careful: use conjugacy so the functional form is preserved), then determine the distribution of the parameter(s) using Bayes' theorem.
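A minimal numpy sketch contrasting the two recipes on the simplest case, estimating the mean of a Gaussian with known variance; the prior parameters and data below are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                                  # known noise standard deviation
data = rng.normal(loc=2.0, scale=sigma, size=20)
N = data.size

# MLE recipe: write down the log likelihood and maximize it.
# For a Gaussian with known variance the maximizer is the sample mean.
mu_mle = data.mean()

# Bayesian recipe: conjugate Gaussian prior N(mu | m0, s0^2) on the mean,
# so the posterior is again Gaussian (conjugacy preserves the functional form).
m0, s0 = 0.0, 10.0                           # illustrative prior mean and standard deviation
post_prec = 1.0 / s0**2 + N / sigma**2       # posterior precision
post_var = 1.0 / post_prec
post_mean = post_var * (m0 / s0**2 + data.sum() / sigma**2)

print("MLE estimate of the mean:  ", mu_mle)
print("Posterior mean and std dev:", post_mean, np.sqrt(post_var))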
64
Model Selection Cross-Validation
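As a sketch of how S-fold cross-validation can be used for model selection (the function and parameter names below are illustrative, not from the slides):

import numpy as np

def cross_val_score(X, t, fit, predict, S=5, seed=0):
    """Average held-out squared error over S folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    folds = np.array_split(idx, S)
    errors = []
    for k in range(S):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        model = fit(X[train], t[train])          # train on S-1 folds
        pred = predict(model, X[test])           # evaluate on the held-out fold
        errors.append(np.mean((pred - t[test]) ** 2))
    return float(np.mean(errors))

Here fit and predict stand for whatever training and prediction routines are being compared (e.g., polynomial fits of different order M or different values of λ); the candidate with the lowest cross-validated error is selected.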
65
Curse of Dimensionality
Figure: the original classification problem and a grid-based approach to partitioning the input space (the grid requires a number of cells that grows exponentially with the dimensionality).
66
Curse of Dimensionality
67
Volume: what fraction of the volume of a hypersphere is captured in the thin shell between r = 1 − ε and r = 1?
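Since the volume of a sphere of radius r in D dimensions scales as V_D(r) = K_D r^D, the fraction is

\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

which approaches 1 for large D even when ε is small: most of the volume of a high-dimensional sphere lies in a thin shell near its surface.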
68
Curse of Dimensionality
Polynomial curve fitting with M = 3; Gaussian densities in higher dimensions.
69
Decision Theory
Inference step: determine either p(t | x) or the joint p(x, t).
Decision step: for a given x, determine the optimal t.
70
Decision rule with only the prior information
Decide ω₁ if P(ω₁) > P(ω₂); otherwise decide ω₂.
Use of the class-conditional information: p(x | ω₁) and p(x | ω₂) describe the difference in lightness between the sea bass and salmon populations.
72
Bayes: posterior, likelihood, evidence
P(ωⱼ | x) = p(x | ωⱼ) P(ωⱼ) / p(x), where in the case of two categories p(x) = p(x | ω₁) P(ω₁) + p(x | ω₂) P(ω₂).
Posterior = (Likelihood × Prior) / Evidence
74
Minimum Misclassification Rate
75
Decision given the posterior probabilities
x is an observation for which:
if P(ω₁ | x) > P(ω₂ | x), the true state of nature is taken to be ω₁;
if P(ω₁ | x) < P(ω₂ | x), the true state of nature is taken to be ω₂.
Therefore, whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω₁ | x) if we decide ω₂
P(error | x) = P(ω₂ | x) if we decide ω₁
76
Minimizing the probability of error
Decide ω₁ if P(ω₁ | x) > P(ω₂ | x); otherwise decide ω₂.
Therefore: P(error | x) = min [P(ω₁ | x), P(ω₂ | x)] (the Bayes decision). We cannot do better than this!
77
Minimum Expected Loss
Example: classify medical images as 'cancer' or 'normal'. The loss matrix is indexed by the decision made and the true class, so the two kinds of mistake can be penalized differently.
78
Minimum Expected Loss: the decision regions are chosen to minimize the expected loss.
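In Bishop's notation (the slide equation is not reproduced in the transcript), the expected loss is

\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}

and each x is assigned to the region R_j that minimizes Σ_k L_{kj} p(C_k | x).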
79
Reject Option
80
Why Separate Inference and Decision?
Minimizing risk (the loss matrix may change over time); the reject option; unbalanced class priors; combining models.
81
Decision Theory for Regression
Inference step: determine p(x, t).
Decision step: for a given x, make an optimal prediction y(x) for t.
Loss function: L(t, y(x)).
82
The Squared Loss Function
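For the squared loss, the expected loss and its minimizer (in Bishop's notation) are

\mathbb{E}[L] = \iint \left\{ y(\mathbf{x}) - t \right\}^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt,
\qquad
y^{*}(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt

i.e., the optimal prediction under squared loss is the conditional mean of t given x.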
83
Generative vs Discriminative
Generative approach: model the class-conditional densities p(x | Cₖ) and the priors p(Cₖ) (or the joint p(x, Cₖ)), then use Bayes' theorem to obtain p(Cₖ | x).
Discriminative approach: model the posterior p(Cₖ | x) directly.
84
Entropy: an important quantity in coding theory, statistical physics, and machine learning.
85
Entropy
Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x if all states are equally likely?
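With the entropy defined as

H[x] = -\sum_{x} p(x) \log_2 p(x)

a uniform distribution over 8 states gives H[x] = −8 × (1/8) log₂(1/8) = 3 bits, so 3 bits suffice; non-uniform distributions can be transmitted with fewer bits on average using variable-length codes.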
86
Entropy
87
Entropy: in how many ways can N identical objects be allocated to M bins?
The resulting entropy is maximized when the objects are spread uniformly, i.e., when all pᵢ = 1/M.
88
Entropy
89
Differential Entropy: put bins of width Δ along the real line.
Differential entropy is maximized (for a fixed variance) when p(x) is a Gaussian, in which case it takes the value given below.
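That maximal value (in Bishop's notation) is

H[x] = \frac{1}{2} \left\{ 1 + \ln(2\pi\sigma^2) \right\}

where the differential entropy itself is H[x] = −∫ p(x) ln p(x) dx; note that, unlike the discrete entropy, it can be negative (for small σ²).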
90
Conditional Entropy
91
The Kullback-Leibler Divergence
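The KL divergence between distributions p and q is

\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x}

It satisfies KL(p‖q) ≥ 0 with equality if and only if p = q, and it is not symmetric in p and q.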
92
Mutual Information
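The mutual information between x and y can be written as

I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y}) \big) = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}]

so it measures the reduction in uncertainty about x obtained by observing y (and vice versa); it is zero exactly when x and y are independent.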