CS668: Pattern Recognition Ch 1: Introduction


CS668: Pattern Recognition Ch 1: Introduction Daniel Barbará

Patterns. Searching for patterns in data is a fundamental problem with a long and successful history. Many patterns have led to laws: astronomical observations led to the laws of planetary motion, and patterns in atomic spectra led to quantum physics. Pattern recognition is the discovery of regularities in data through computer algorithms, and the use of those regularities to make decisions (e.g., to classify data into categories).

Example Handwritten Digit Recognition

How can you do it? One option: develop a series of rules or heuristics describing the shapes of the digits. This approach is naive and brittle. A machine learning approach: characterize each digit image as a vector of features x, and discover a function y(x) that maps the feature vector to a category in {c1, c2, …, ck}. This is called supervised learning (the 'teacher' is the labeled training set).
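As a minimal sketch of this supervised-learning view (not part of the original slides; it assumes scikit-learn and its bundled 8x8 digits dataset are available), the pipeline of feature vectors x, labels, and a learned function y(x) might look like this:

```python
# Minimal sketch of supervised digit classification (assumes scikit-learn is installed).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()                      # 8x8 images flattened into 64-dimensional feature vectors x
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)   # y(x): maps a feature vector to a category in {0, ..., 9}
model.fit(X_train, y_train)                 # the 'teacher' is the labeled training set

print("accuracy on held-out digits:", model.score(X_test, y_test))
```

The specific classifier is incidental; the point is the division into features, labels, training, and evaluation on novel input.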

Other pattern recognition problems. Unsupervised learning (clustering): discover previously unknown groups in the data. Density estimation: discover the distribution of the data. Regression (prediction): like classification, but with real-valued outputs. Reinforcement learning: find suitable actions to take in a given situation in order to maximize a reward.

Polynomial Curve Fitting
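The model being fitted in this running example is the polynomial of Bishop's Section 1.1; since the slide's formula did not survive transcription, it is reconstructed here:

```latex
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
```

The coefficients are collected in the vector w, and M is the order of the polynomial.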

Sum-of-Squares Error Function. Fit the coefficients by minimizing an objective function: the error function.
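In Bishop's notation the sum-of-squares error function is (reconstructed, since the slide showed it only as an image):

```latex
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2
```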

Choosing the order of the polynomial: 0th Order

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Observations. The 9th order polynomial achieves zero training error (is this the best?), but it oscillates wildly. How well will it predict future data? This is OVERFITTING: the model is likely to do poorly on new data.

Over-fitting Root-Mean-Square (RMS) Error:
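The RMS error used to compare training and test performance across models and data sets is (standard definition, reconstructed):

```latex
E_{\mathrm{RMS}} = \sqrt{\, 2\, E(\mathbf{w}^{\star}) / N \,}
```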

What is going on? The data was generated from a smooth underlying function plus noise (sin(2πx) in Bishop's example). A power-series (e.g., Taylor) expansion of that function contains terms of all orders, so we should expect improvement as we increase M. What gives?

Polynomial Coefficients

What is going on? Larger values of M result in coefficients that are increasingly tuned to the noise. Paying too much attention to the training data is not a good thing! The severity of this problem varies with the size of the training set.

Data Set Size: 9th Order Polynomial

Data Set Size: 9th Order Polynomial

Regularization Penalize large coefficient values
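The regularized error function being minimized is (reconstructed from Bishop; λ controls the relative importance of the penalty term):

```latex
\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n, \mathbf{w}) - t_n \,\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2
```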

Regularization:

Regularization:

Regularization: RMS error vs. ln λ (figure)

Polynomial Coefficients

Classification. Build a machine that can do: fingerprint identification, OCR (optical character recognition), DNA sequence identification.

An Example. "Sorting incoming fish on a conveyor according to species using optical sensing." The two species are sea bass and salmon.

Problem Analysis. Set up a camera and take some sample images to extract features: length, lightness, width, number and shape of fins, position of the mouth, etc. This is the set of all suggested features to explore for use in our classifier!

Preprocessing. Use a segmentation operation to isolate individual fish from one another and from the background. Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features. The features are then passed to a classifier.

Classification Select the length of the fish as a possible feature for discrimination

The length is a poor feature alone! Select the lightness as a possible feature.

Task of decision theory. There is a relationship between the threshold decision boundary and the cost of errors: move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!).

Adopt the lightness and add the width of the fish. Each fish is now represented by a feature vector xT = [x1, x2], where x1 is the lightness and x2 is the width.
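A minimal sketch of this two-feature setup (illustrative only; the lightness/width numbers below are made up and scikit-learn is assumed, since the slides do not specify an implementation):

```python
# Toy two-feature fish classifier: x = [lightness, width] (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature vectors xT = [x1, x2] and labels (0 = sea bass, 1 = salmon).
X = np.array([[7.1, 3.0], [6.8, 3.2], [2.1, 1.9], [2.5, 2.2], [6.5, 2.9], [3.0, 2.0]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)    # learns a linear decision boundary in the 2-D feature space
print(clf.predict([[4.0, 2.5]]))        # classify a new fish from its two features
```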

We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features." Ideally, the best decision boundary is the one that provides optimal performance, such as the curved boundary shown in the accompanying figure.

Issue of generalization! However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input.

Conclusion. The reader may feel overwhelmed by the number, complexity, and magnitude of the sub-problems of pattern recognition. Many of these sub-problems can indeed be solved. Many fascinating unsolved problems still remain.

Probability Theory Apples and Oranges

Probability Theory Marginal Probability Conditional Probability Joint Probability

Probability Theory Sum Rule Product Rule

The Rules of Probability Sum Rule Product Rule
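In symbols, for random variables X and Y (reconstructed; the slide showed these only as images):

```latex
\text{Sum rule:}\quad p(X) = \sum_{Y} p(X, Y), \qquad
\text{Product rule:}\quad p(X, Y) = p(Y \mid X)\, p(X)
```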

Bayes' Theorem: posterior ∝ likelihood × prior
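Written out in the standard form (reconstructed):

```latex
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad
p(X) = \sum_{Y} p(X \mid Y)\, p(Y)
```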

Probability Densities

Transformed Densities

Expectations Conditional Expectation (discrete) Approximate Expectation (discrete and continuous)
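The definitions behind this slide, in standard notation (reconstructed):

```latex
\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \;\;\text{(discrete)}, \qquad
\mathbb{E}[f] = \int p(x)\, f(x)\, dx \;\;\text{(continuous)}
\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x), \qquad
\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)
```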

Variances and Covariances

The Gaussian Distribution
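The univariate Gaussian density referred to on this and the following slides (reconstructed):

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
```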

Gaussian Mean and Variance

The Multivariate Gaussian

Gaussian Parameter Estimation Likelihood function

Maximum (Log) Likelihood
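For N i.i.d. Gaussian observations, the log likelihood and its maximizing parameters are (reconstructed, as in Bishop Section 1.2.4):

```latex
\ln p(\mathbf{x} \mid \mu, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2
    - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad
\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2
```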

Properties of the maximum likelihood estimates μML and σ²ML

Curve Fitting Re-visited

Maximum Likelihood. Determine wML by minimizing the sum-of-squares error E(w).

Predictive Distribution

MAP: A Step towards Bayes. Determine wMAP by minimizing the regularized sum-of-squares error Ẽ(w).

Bayesian Curve Fitting

Bayesian Predictive Distribution

With all the detail… See MLE&Bayesian.pdf

Lessons. MLE: postulate a parametric distribution, form the log likelihood, and maximize it with respect to the parameters (use optimization techniques to find the optimal values). Bayesian: postulate a prior and a likelihood distribution for the parameters (careful: use conjugate priors so the functional form is preserved), then determine the posterior distribution of the parameter(s) using Bayes' theorem.

Model Selection Cross-Validation

Curse of Dimensionality (figure: grid-based approach vs. the original problem)

Curse of Dimensionality

Volume. What fraction of the volume of a hypersphere is captured in a thin shell between r = 1 - ε and r = 1?
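Since the volume of a D-dimensional hypersphere of radius r scales as r^D, the answer is:

```latex
\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D
```

This approaches 1 as D grows, even for small ε: in high dimensions, almost all of the volume sits in a thin shell near the surface.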

Curse of Dimensionality Polynomial curve fitting, M = 3 Gaussian Densities in higher dimensions

Decision Theory. Inference step: determine either the joint distribution p(x, Ck) or the posterior p(Ck | x). Decision step: for a given x, determine the optimal t.

Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. Use of the class-conditional information: P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.

Bayes: posterior, likelihood, evidence. P(ωj | x) = P(x | ωj) P(ωj) / P(x), where in the case of two categories P(x) = Σj=1..2 P(x | ωj) P(ωj). In words: posterior = (likelihood × prior) / evidence.

Minimum Misclassification Rate

Decision given the posterior probabilities. Let x be an observation: if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1; if P(ω1 | x) < P(ω2 | x), decide ω2. Therefore, whenever we observe a particular x, the probability of error is P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.

Minimizing the probability of error: decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2. Therefore P(error | x) = min[P(ω1 | x), P(ω2 | x)] (the Bayes decision). We cannot do better than this!

Minimum Expected Loss. Example: classify medical images as 'cancer' or 'normal', using a loss matrix indexed by the decision and the true class.

Minimum Expected Loss. The decision regions Rj are chosen to minimize the expected loss E[L] = Σk Σj ∫Rj Lkj p(x, Ck) dx.

Reject Option

Why Separate Inference and Decision? Minimizing risk (loss matrix may change over time) Reject option Unbalanced class priors Combining models

Decision Theory for Regression. Inference step: determine p(x, t). Decision step: for a given x, make an optimal prediction y(x) for t. Loss function: E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt.

The Squared Loss Function

Generative vs Discriminative. Generative approach: model the class-conditional densities p(x | Ck) and priors p(Ck) (or the joint p(x, Ck)), then use Bayes' theorem to obtain the posterior p(Ck | x). Discriminative approach: model the posterior p(Ck | x) directly.

Entropy. An important quantity in coding theory, statistical physics, and machine learning.

Entropy Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x? All states equally likely
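Working the example: with 8 equally likely states,

```latex
H[x] = -\sum_{x} p(x) \log_2 p(x) = -\,8 \times \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \ \text{bits}
```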

Entropy

Entropy. In how many ways can N identical objects be allocated among M bins? The entropy is maximized when all bins are equally likely, i.e., p(xi) = 1/M for all i.

Entropy

Differential Entropy. Put bins of width Δ along the real line. The differential entropy is maximized (for fixed variance σ²) when p(x) is Gaussian, in which case H[x] = ½{1 + ln(2πσ²)}.

Conditional Entropy

The Kullback-Leibler Divergence
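The divergence in question, in Bishop's notation (reconstructed):

```latex
\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \;\ge\; 0,
\quad \text{with equality if and only if } p(\mathbf{x}) = q(\mathbf{x})
```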

Mutual Information