1 CS668: Pattern Recognition Ch 1: Introduction
Daniel Barbará

2 Patterns
Searching for patterns in data is a fundamental problem with a successful history. Many patterns have led to laws: for example, astronomical observations led to the laws of planetary motion, and patterns in atomic spectra led to quantum physics. Pattern recognition is the discovery of regularities through computer algorithms and the use of those regularities to make decisions (e.g., to classify data into categories).

3 Example Handwritten Digit Recognition

4 How can you do it?
One approach: develop a series of rules or heuristics describing the shapes of the digits. This is naïve and brittle.
A machine learning approach: characterize each digit as a vector of features x and discover a function y(x) that maps the feature vector to a category in {c1, c2, …, ck}. This is called supervised learning (the 'teacher' is the training set).
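
As an illustration of the idea of learning y(x) from labeled data, here is a minimal sketch (not the course's code; the nearest-centroid rule and all names below are assumptions of this example):

    import numpy as np

    def fit_centroids(X, t):
        """Compute one mean feature vector (centroid) per class label."""
        classes = np.unique(t)
        centroids = np.array([X[t == c].mean(axis=0) for c in classes])
        return classes, centroids

    def y(x, classes, centroids):
        """Map a feature vector x to the category of the nearest centroid."""
        distances = np.linalg.norm(centroids - x, axis=1)
        return classes[np.argmin(distances)]

    # Toy usage: 2-D feature vectors with labels 0 and 1 acting as the 'teacher'.
    X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    t_train = np.array([0, 0, 1, 1])
    classes, centroids = fit_centroids(X_train, t_train)
    print(y(np.array([0.85, 0.95]), classes, centroids))  # prints 1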

5 Other pattern recognition problems
Unsupervised learning (clustering): discover previously unknown groups in the data.
Density estimation: discover the distribution of the data.
Prediction: like classification, but with real-valued outputs.
Reinforcement learning: find suitable actions to take in a given situation in order to maximize a reward.

6 Polynomial Curve Fitting

7 Sum-of-Squares Error Function
Minimizing an objective function: the error function.
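
For reference (the slide's equations appear only as images), the polynomial model and the sum-of-squares error standardly used with this example are

    y(x, w) = \sum_{j=0}^{M} w_j x^j,    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2.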

8 Choosing the order of the polynomial: 0th Order

9 1st Order Polynomial

10 3rd Order Polynomial

11 9th Order Polynomial

12 Observations
The 9th-order polynomial achieves zero error on the training data (is this the best?), but the fit shows lots of oscillations. How about predicting the future? This is OVERFITTING: the model is likely to do poorly on future data.

13 Over-fitting Root-Mean-Square (RMS) Error:
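
The RMS error referred to here (its standard definition; the slide shows it as an image) is

    E_{RMS} = \sqrt{ 2 E(w^\star) / N },

where dividing by N lets data sets of different sizes be compared on an equal footing and the square root puts the error on the same scale as the target t.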

14 What is going on? The data was generated from a smooth underlying function plus noise (in Bishop's running example, sin(2πx)).
A power-series expansion (e.g., a Taylor series) of that function contains terms of all orders, so we should expect improvement as we increase M. What gives?

15 Polynomial Coefficients

16 What is going on? Larger values of M yield coefficients that are increasingly tuned to the noise. Paying too much attention to the training data is not a good thing! The severity of this problem varies with the size of the training set.

17 Data Set Size: 9th Order Polynomial

18 Data Set Size: 9th Order Polynomial

19 Regularization Penalize large coefficient values
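
The regularized error function (standard form; the slide's equation is an image) adds a penalty on the coefficient magnitudes:

    \tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \lVert w \rVert^2,

where λ controls the relative importance of the penalty term.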

20 Regularization:

21 Regularization:

22 Regularization: vs.
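
To make the effect of λ concrete, here is a minimal numerical sketch (illustrative only, not from the slides): the regularized sum-of-squares error has the closed-form minimizer w = (Φ^T Φ + λI)^{-1} Φ^T t, where Φ is the polynomial design matrix.

    import numpy as np

    def fit_polynomial(x, t, M, lam=0.0):
        """Minimize the regularized sum-of-squares error in closed form."""
        Phi = np.vander(x, M + 1, increasing=True)   # columns are x^0, x^1, ..., x^M
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

    # Toy data: noisy samples of a smooth curve (sin(2*pi*x), as in the running example).
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

    print(np.round(fit_polynomial(x, t, M=9), 1))            # unregularized: huge coefficients
    print(np.round(fit_polynomial(x, t, M=9, lam=1e-3), 1))  # regularized: coefficients shrink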

23 Polynomial Coefficients

24 Classification
Build a machine that can do: fingerprint identification, OCR (optical character recognition), DNA sequence identification.

25 An Example: “Sorting incoming fish on a conveyor according to species using optical sensing.” The two species are sea bass and salmon.

26 Problem Analysis
Set up a camera and take some sample images from which to extract features: length, lightness, width, number and shape of fins, position of the mouth, etc. This is the set of all suggested features to explore for use in our classifier!

27 Preprocessing
Use a segmentation operation to isolate the fish from one another and from the background. Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features. The features are then passed to a classifier.

28

29 Classification Select the length of the fish as a possible feature for discrimination

30

31 The length is a poor feature alone!
Select the lightness as a possible feature.

32

33 Task of decision theory
Threshold decision boundary and cost relationship: move the decision boundary toward smaller values of lightness in order to minimize the cost (i.e., reduce the number of sea bass that are classified as salmon!). This is the task of decision theory.

34 Adopt the lightness and add the width of the fish
Each fish is represented by a feature vector x^T = [x1, x2], where x1 is the lightness and x2 is the width.

35

36 We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such “noisy features”. Ideally, the best decision boundary is the one that provides optimal performance, as in the following figure:

37

38 Issue of generalization!
However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input. This is the issue of generalization!

39

40 Conclusion
The reader may feel overwhelmed by the number, complexity, and magnitude of the sub-problems of pattern recognition. Many of these sub-problems can indeed be solved, but many fascinating unsolved problems still remain.

41 Probability Theory Apples and Oranges

42 Probability Theory: marginal, joint, and conditional probability

43 Probability Theory Sum Rule Product Rule

44 The Rules of Probability
Sum Rule and Product Rule
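
Written out (the slide's equations survive only as images; these are the standard forms):

    sum rule:     p(X) = \sum_Y p(X, Y)
    product rule: p(X, Y) = p(Y | X) p(X)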

45 Bayes’ Theorem: posterior ∝ likelihood × prior
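
In symbols, using the sum and product rules above:

    p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)},    where  p(X) = \sum_Y p(X | Y) p(Y).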

46 Probability Densities

47 Transformed Densities

48 Expectations Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
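
Standard forms of these quantities (supplied here since the slide's formulas are images):

    E[f] = \sum_x p(x) f(x),    E_x[f | y] = \sum_x p(x | y) f(x),    E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n),

where the last expression approximates an expectation from a finite sample drawn from p(x).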

49 Variances and Covariances

50 The Gaussian Distribution
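
For reference, the univariate Gaussian density:

    \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}.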

51 Gaussian Mean and Variance

52 The Multivariate Gaussian

53 Gaussian Parameter Estimation
Likelihood function

54 Maximum (Log) Likelihood
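
The standard results behind this run of slides: for N i.i.d. Gaussian observations the log likelihood and its maximizers are

    \ln p(\mathbf{x} | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi),
    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n,    \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2,

and the point of the next slide is that σ²_ML is biased: E[σ²_ML] = (N − 1)σ²/N.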

55 Properties of μ_ML and σ²_ML

56 Curve Fitting Re-visited

57 Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error.

58 Predictive Distribution

59 MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error.

60 Bayesian Curve Fitting

61 Bayesian Predictive Distribution

62 With all the detail… See MLE&Bayesian.pdf

63 Lessons
MLE: postulate a (parametric) distribution, form the log likelihood, and maximize it with respect to the parameters (use optimization techniques to find the optimal values).
Bayesian: postulate a prior and a likelihood distribution for the parameter (careful: use a conjugate prior so the functional form is preserved), then determine the distribution of the parameter(s) using Bayes' theorem. A toy sketch of both recipes follows below.
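
A toy sketch of the two recipes (illustrative only; the coin-flip data and the Beta(2, 2) prior are assumptions of this example, not content from the slides):

    import numpy as np

    # Observed coin flips (1 = heads), modeled with a Bernoulli(mu) likelihood.
    data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
    N, heads = data.size, int(data.sum())

    # MLE recipe: form the log likelihood and maximize it with respect to mu.
    # For Bernoulli data the maximizer is available in closed form:
    mu_ML = heads / N

    # Bayesian recipe: postulate a conjugate Beta(a, b) prior for mu, so that
    # Bayes' theorem yields another Beta distribution as the posterior.
    a, b = 2.0, 2.0                              # prior pseudo-counts (assumed here)
    a_post, b_post = a + heads, b + (N - heads)  # posterior is Beta(a_post, b_post)
    mu_post_mean = a_post / (a_post + b_post)

    print(f"MLE estimate: {mu_ML:.3f}, posterior mean: {mu_post_mean:.3f}")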

64 Model Selection Cross-Validation
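
A minimal sketch of S-fold cross-validation (illustrative; "fit" and "error" stand for any training routine and error measure, and are not names from the slides):

    import numpy as np

    def cross_validation_error(X, t, fit, error, S=5, seed=0):
        """Train on S-1 folds, evaluate on the held-out fold, average over the S folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), S)
        scores = []
        for i in range(S):
            held_out = folds[i]
            train = np.concatenate([folds[j] for j in range(S) if j != i])
            model = fit(X[train], t[train])
            scores.append(error(model, X[held_out], t[held_out]))
        return float(np.mean(scores))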

65 Curse of Dimensionality
Figure: the original problem and the grid approach.

66 Curse of Dimensionality

67 Volume
What fraction of the volume of a hypersphere is captured in a thin shell between r = 1 − ε and r = 1?
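
The answer follows from the volume of a D-dimensional sphere scaling as r^D:

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D \to 1  as  D \to \infty,

so in high dimensions almost all of the volume is concentrated in a thin shell near the surface.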

68 Curse of Dimensionality
Examples: polynomial curve fitting with M = 3; Gaussian densities in higher dimensions.

69 Decision Theory
Inference step: determine either the posterior p(t | x) or the joint distribution p(x, t).
Decision step: for a given x, determine the optimal t.

70 Decision rule with only the prior information
Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2.
Use of the class-conditional information: P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.

71

72 Bayes Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) P(ωj) / P(x), where in the case of two categories the evidence is P(x) = P(x | ω1) P(ω1) + P(x | ω2) P(ω2).
Posterior = (Likelihood × Prior) / Evidence

73

74 Minimum Misclassification Rate

75 Decision given the posterior probabilities
x is an observation for which: if P(ω1 | x) > P(ω2 | x), the true state of nature is ω1; if P(ω1 | x) < P(ω2 | x), the true state of nature is ω2. Therefore, whenever we observe a particular x, the probability of error is: P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.

76 Minimizing the probability of error
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.
Therefore P(error | x) = min[P(ω1 | x), P(ω2 | x)] (the Bayes decision). We cannot do better than this!

77 Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’. The loss matrix assigns a penalty to each combination of decision and truth.

78 Minimum Expected Loss
Decision regions are chosen to minimize the expected loss.
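
The quantity being minimized, in its standard form, where L_{kj} is the loss for deciding class C_j when the truth is C_k and R_j is the region assigned to C_j:

    E[L] = \sum_k \sum_j \int_{R_j} L_{kj} \, p(\mathbf{x}, C_k) \, d\mathbf{x}.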

79 Reject Option

80 Why Separate Inference and Decision?
Minimizing risk (the loss matrix may change over time); the reject option; unbalanced class priors; combining models.

81 Decision Theory for Regression
Inference step: determine the predictive distribution p(t | x).
Decision step: for a given x, make an optimal prediction y(x) for t.
Loss function: L(t, y(x)).

82 The Squared Loss Function
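
For squared loss the standard expressions are

    E[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt,    y^\star(\mathbf{x}) = E_t[ t | \mathbf{x} ],

i.e., the optimal prediction under squared loss is the conditional mean of t given x.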

83 Generative vs Discriminative
Generative approach: model the class-conditional densities p(x | Ck) and the priors p(Ck), then use Bayes’ theorem to obtain the posterior p(Ck | x).
Discriminative approach: model the posterior p(Ck | x) directly.

84 Entropy
An important quantity in coding theory, statistical physics, and machine learning.

85 Entropy
Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x if all states are equally likely?
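
Working out the answer from the definition of entropy:

    H[x] = -\sum_x p(x) \log_2 p(x) = -8 \times \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 bits.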

86 Entropy

87 Entropy
In how many ways can N identical objects be allocated among M bins? The multiplicity of such an allocation leads to the entropy.
Entropy is maximized when all bins are equally probable, i.e., p(xi) = 1/M.

88 Entropy

89 Differential Entropy
Put bins of width Δ along the real line and take the limit Δ → 0.
Differential entropy is maximized (for a fixed variance σ²) when p(x) is Gaussian, in which case H[x] = ½{1 + ln(2πσ²)}.

90 Conditional Entropy

91 The Kullback-Leibler Divergence
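
For reference, the standard definition:

    KL(p \| q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x} \ge 0,

with equality if and only if p = q; it measures the extra information needed when q is used to encode data generated by p.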

92 Mutual Information
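
The standard relations:

    I[\mathbf{x}, \mathbf{y}] = KL\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x}) p(\mathbf{y}) \big) = H[\mathbf{x}] - H[\mathbf{x} | \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} | \mathbf{x}].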

