EE513 Audio Signals and Systems
Statistical Pattern Classification
Kevin D. Donohue
Electrical and Computer Engineering, University of Kentucky

Interpretation of Auditory Scenes
- Human perception and cognition greatly exceed any computer-based system at abstracting sounds into objects and creating meaningful auditory scenes. This perception of objects (not just detection of acoustic energy) allows interpretation of situations, leading to an appropriate response or further analysis.
- The sensory organs (ears) separate acoustic energy into frequency bands and convert the band energy into neural firings.
- The auditory cortex receives the neural responses and abstracts an auditory scene.

Auditory Scene
- Perception derives a useful representation of reality from sensory input.
- Auditory stream refers to a perceptual unit associated with a single happening (A. S. Bregman, 1990).
[Diagram: Acoustic-to-Neural Conversion → Organize into Auditory Streams → Representation of Reality]

Computer Interpretation
- For a computer algorithm to interpret a scene:
  - Acoustic signals must be converted to numbers using meaningful models.
  - Sets of numbers (patterns) are mapped into events (perception).
  - Events are analyzed together with other events, in relation to the goal of the algorithm, and mapped into a situation (cognition, or deriving meaning).
  - The situation is mapped into an action/response.
- Numbers extracted from the acoustic signal for the purpose of classification (determining the event) are referred to as features.
- Time-based features are extracted from signal transforms such as:
  - Envelope
  - Correlations
- Frequency-based features are extracted from signal transforms such as:
  - Spectrum (cepstrum)
  - Power spectral density

Feature Selection Example
- Consider the problem of discriminating between the spoken words yes and no based on 2 features:
  1. The estimate of the first formant frequency (resonance of the spectral envelope), g1.
  2. The ratio in dB of the amplitude of the second formant frequency over the third formant frequency, g2.
- A fictitious experiment was performed in which these 2 features were computed for 25 recordings of people saying these words. The features were plotted for each class to develop an algorithm that classifies these samples correctly.

Feature Plot
- Define a feature vector G = [g1, g2]^T.
- Plot G with green o's for samples where yes was spoken, and with red x's for samples where no was spoken.
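A minimal sketch of this kind of feature plot. The arrays here are synthetic stand-ins for the 25 fictitious recordings per class; the means and spreads are illustrative assumptions, not values from the course.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fictitious experiment: 25 feature vectors per class.
# Column 0: first formant estimate g1 (Hz); column 1: F2/F3 amplitude ratio g2 (dB).
G_yes = rng.normal([300.0, 6.0], [40.0, 2.0], size=(25, 2))
G_no = rng.normal([450.0, 2.0], [40.0, 2.0], size=(25, 2))

plt.plot(G_yes[:, 0], G_yes[:, 1], "go", label="yes")
plt.plot(G_no[:, 0], G_no[:, 1], "rx", label="no")
plt.xlabel("g1: first formant frequency (Hz)")
plt.ylabel("g2: F2/F3 amplitude ratio (dB)")
plt.legend()
plt.show()
```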

Minimum Distance Approach
- Create a representative vector for the yes features, z_yes, and one for the no features, z_no (e.g., the mean feature vector of each class).
- For a new sample with estimated feature vector g, use the decision rule: decide yes if ||g − z_yes|| < ||g − z_no||, otherwise decide no.
- This results in 3 incorrect decisions.
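A sketch of this nearest-template rule under the assumptions above (class means used as representative vectors). It reuses the same kind of synthetic data as the previous sketch, so the resulting error count is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
G_yes = rng.normal([300.0, 6.0], [40.0, 2.0], size=(25, 2))  # synthetic "yes" features
G_no = rng.normal([450.0, 2.0], [40.0, 2.0], size=(25, 2))   # synthetic "no" features

def nearest_template(x, templates):
    """Return the label of the template vector closest to x in Euclidean distance."""
    labels = list(templates)
    dists = [np.linalg.norm(x - templates[k]) for k in labels]
    return labels[int(np.argmin(dists))]

# Representative (mean) vectors for each class.
templates = {"yes": G_yes.mean(axis=0), "no": G_no.mean(axis=0)}

# Count how many of the synthetic training samples are misclassified.
errors = sum(nearest_template(x, templates) != "yes" for x in G_yes)
errors += sum(nearest_template(x, templates) != "no" for x in G_no)
print("incorrect decisions:", errors)
```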

Normalization With STD
- The frequency features had larger values than the amplitude ratios, and therefore had more influence in the decision process.
- Remove the scale differences by normalizing each feature by its standard deviation computed over all classes.
- Now 4 errors result (why would it change?).
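A sketch of this per-feature normalization, continuing the synthetic two-class setup: divide each feature by its standard deviation over the pooled data before applying the minimum distance rule.

```python
import numpy as np

rng = np.random.default_rng(0)
G_yes = rng.normal([300.0, 6.0], [40.0, 2.0], size=(25, 2))
G_no = rng.normal([450.0, 2.0], [40.0, 2.0], size=(25, 2))

# Standard deviation of each feature over all classes (pooled data).
pooled = np.vstack([G_yes, G_no])
sigma = pooled.std(axis=0)

# Normalize so both features contribute comparably to the Euclidean distance.
G_yes_n = G_yes / sigma
G_no_n = G_no / sigma
templates = {"yes": G_yes_n.mean(axis=0), "no": G_no_n.mean(axis=0)}
print(templates)
```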

Minimum Distance Classifier
- Consider a feature vector x that is to be classified as belonging to one of K exclusive classes.
- The classification decision is based on the distance of the feature vector to the template vectors representing each of the K classes.
- Decision rule: for a given observation x and set of template vectors z_k, one per class, decide class k such that ||x − z_k|| ≤ ||x − z_j|| for all j = 1, …, K (equivalently, k = argmin_j ||x − z_j||).

Minimum Distance Classifier
- If some features need to be weighted more than others in the decision process, or correlation between the features is to be exploited, the distance can be weighted to yield the weighted minimum distance classifier:
  d_k(x) = (x − z_k)^T W (x − z_k),
  where W is a square matrix of weights with dimension equal to the length of x.
- If W is a diagonal matrix, it simply scales each of the features in the decision process. Off-diagonal terms weight the correlation between features.
- If W is the inverse of the covariance matrix of the features in x, and z_k is the mean feature vector of each class, then the above distance is referred to as the Mahalanobis distance.
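A sketch of the weighted distance with W taken as the inverse covariance (the Mahalanobis case). The class means and pooled covariance are estimated from the same synthetic data as before and are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
G_yes = rng.normal([300.0, 6.0], [40.0, 2.0], size=(25, 2))
G_no = rng.normal([450.0, 2.0], [40.0, 2.0], size=(25, 2))

# W = inverse of the feature covariance matrix, pooled over both classes.
W = np.linalg.inv(np.cov(np.vstack([G_yes, G_no]).T))
means = {"yes": G_yes.mean(axis=0), "no": G_no.mean(axis=0)}

def mahalanobis_label(x):
    """Assign x to the class whose mean gives the smallest weighted distance."""
    d = {k: (x - z) @ W @ (x - z) for k, z in means.items()}
    return min(d, key=d.get)

print(mahalanobis_label(np.array([330.0, 5.0])))
```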

Correlation Receiver
- It can be shown that selecting the class based on the minimum distance between the observation vector and the template vector is equivalent to finding the maximum correlation between the observation vector and the template:
  argmin_k ||x − z_k||^2 = argmax_k (x^T z_k − ½ z_k^T z_k),
  or
  argmax_k x^T z_k,
  where the template vectors have been normalized such that z_k^T z_k is the same for every class (e.g., z_k^T z_k = 1).
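A quick numerical check of this equivalence using arbitrary vectors; this is not from the slides, just an illustration that both rules pick the same class when the templates have equal norm.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
templates = rng.normal(size=(3, 4))
templates /= np.linalg.norm(templates, axis=1, keepdims=True)  # z_k^T z_k = 1

k_min_dist = np.argmin([np.linalg.norm(x - z) for z in templates])
k_max_corr = np.argmax(templates @ x)
print(k_min_dist == k_max_corr)  # True: both criteria select the same class
```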

Definitions
- A random variable (RV) is a function that maps events (sets) into a discrete set of real numbers for a discrete RV, or a continuous set of real numbers for a continuous RV.
- A random process (RP) is a collection of RVs indexed by a countable set for a discrete RP, or by an uncountable set for a continuous RP.

Definitions: First-Order PDF
- The likelihood of RV values is described through the probability density function (pdf) f_X(x), with P(a < X ≤ b) = ∫_a^b f_X(x) dx.

Definitions: Joint PDF
- The probabilities describing more than one RV are described by a joint pdf f_XY(x, y), with P((X, Y) ∈ A) = ∬_A f_XY(x, y) dx dy.

Definitions: Conditional PDF
- The probabilities describing an RV, given that another event has already occurred, are described by a conditional pdf: f_X|Y(x | y) = f_XY(x, y) / f_Y(y).
- Closely related to this is Bayes' rule: f_X|Y(x | y) = f_Y|X(y | x) f_X(x) / f_Y(y).

Examples: Gaussian PDF
- A first-order Gaussian RV pdf (scalar x) with mean μ and standard deviation σ is given by:
  f(x) = (1 / (σ√(2π))) exp(−(x − μ)^2 / (2σ^2))
- A higher-order joint Gaussian pdf (column vector x of length N) with mean vector m and covariance matrix Σ is given by:
  f(x) = (1 / ((2π)^(N/2) |Σ|^(1/2))) exp(−½ (x − m)^T Σ^(−1) (x − m))
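A small sketch evaluating these densities numerically with SciPy; the particular mean, standard deviation, and covariance values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Scalar Gaussian pdf with mean mu and standard deviation sigma.
mu, sigma = 1.0, 2.0
print(norm.pdf(0.5, loc=mu, scale=sigma))

# Joint Gaussian pdf with mean vector m and covariance matrix Sigma.
m = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_normal.pdf(np.array([0.0, 0.0]), mean=m, cov=Sigma))
```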

Example: Uncorrelated RVs
- Prove that for an Nth-order sequence of uncorrelated, zero-mean Gaussian RVs the joint pdf can be written as the product of the first-order pdfs:
  f(x) = ∏_{n=1}^{N} (1 / (σ_n√(2π))) exp(−x_n^2 / (2σ_n^2))
- Note that for Gaussian RVs, uncorrelated implies statistical independence.
- Assume the variances are equal for all elements. What would the autocorrelation of this sequence look like?
- How would the above analysis change if the RVs were not zero mean?

Class PDFs
- When features are modeled as RVs, their pdfs can be used to derive distance measures for the classifier, and an optimal decision rule that minimizes classification error can be designed.
- Consider K classes, individually denoted by ω_k. Feature values associated with each class can be described by:
  - The a posteriori probability P(ω_k | x) (likelihood of the class after the observation/data)
  - The a priori probability P(ω_k) (likelihood of the class before the observation/data)
  - The likelihood function p(x | ω_k) (likelihood of the observation/data given the class)

Class PDFs
- The likelihood function can be estimated through empirical studies. Consider 3 speakers whose third formant frequency is distributed according to a class-conditional pdf p(x | ω_k) for each speaker.
[Figure: class-conditional pdfs of the third formant frequency for the 3 speakers, with decision thresholds marked]
- Classifier probabilities can be obtained from Bayes' rule: P(ω_k | x) = p(x | ω_k) P(ω_k) / p(x).

Maximum a posteriori Decision Rule
- For K classes and observed feature vector x, the maximum a posteriori (MAP) decision rule states: decide ω_k such that P(ω_k | x) ≥ P(ω_j | x) for all j,
- or, by applying Bayes' rule: decide ω_k such that p(x | ω_k) P(ω_k) ≥ p(x | ω_j) P(ω_j) for all j.
- For the binary case this reduces to the (log) likelihood ratio test:
  ln[p(x | ω_1) / p(x | ω_2)] ≷ ln[P(ω_2) / P(ω_1)]  (decide ω_1 if greater, ω_2 otherwise).
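A sketch of the MAP rule with Gaussian class-conditional densities. The means, standard deviations, and priors below are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional models: third-formant frequency (Hz) per speaker.
classes = {
    "speaker1": {"mean": 2400.0, "std": 120.0, "prior": 0.5},
    "speaker2": {"mean": 2700.0, "std": 150.0, "prior": 0.3},
    "speaker3": {"mean": 3000.0, "std": 100.0, "prior": 0.2},
}

def map_decision(x):
    """Pick the class maximizing log p(x | class) + log P(class)."""
    scores = {k: norm.logpdf(x, c["mean"], c["std"]) + np.log(c["prior"])
              for k, c in classes.items()}
    return max(scores, key=scores.get)

print(map_decision(2600.0))
```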

Example
- Consider a 2-class problem with Gaussian-distributed feature vectors.
- Derive the log likelihood ratio and describe how the classifier uses distance information to discriminate between the classes.
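A sketch of this derivation for the special case of a common covariance matrix Σ (an assumption made here for simplicity; with unequal covariances a log-determinant term is added). It shows the classifier comparing Mahalanobis-type distances to the two class means.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Two Gaussian classes with means m_1, m_2 and common covariance \Sigma.
\[
\ln \Lambda(\mathbf{x})
  = \ln p(\mathbf{x}\mid\omega_1) - \ln p(\mathbf{x}\mid\omega_2)
  = \tfrac{1}{2}(\mathbf{x}-\mathbf{m}_2)^{T}\Sigma^{-1}(\mathbf{x}-\mathbf{m}_2)
  - \tfrac{1}{2}(\mathbf{x}-\mathbf{m}_1)^{T}\Sigma^{-1}(\mathbf{x}-\mathbf{m}_1)
\]
Decide $\omega_1$ when $\ln \Lambda(\mathbf{x}) > \ln\bigl(P(\omega_2)/P(\omega_1)\bigr)$:
the classifier compares the weighted (Mahalanobis) distances of $\mathbf{x}$ to the two
class means and biases the comparison by the prior probabilities.
\end{document}
```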

Homework
- Consider 2 features for use in a binary classification problem. The features are Gaussian distributed and form the feature vector x = [x1, x2]^T.
- Derive the log likelihood ratio and corresponding classifier for the different cases listed below:
  1)
  2)
  3)
  4)
- Comment on how each classifier computes “distance” and uses it in the classification process.

Classification Error
- The classification error is the fraction of decision statistics that fall on the wrong side of the decision threshold for a given class, weighted by the fraction of times that class occurs (i.e., its prior probability). For the binary case:
  P_e = P(ω_1) ∫_{R_2} p(x | ω_1) dx + P(ω_2) ∫_{R_1} p(x | ω_2) dx,
  where R_k denotes the decision region for class ω_k.
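A numerical sketch of this error computation for two scalar Gaussian classes separated by a single threshold. The means, standard deviations, priors, and threshold are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative binary problem: class 1 ~ N(0, 1), class 2 ~ N(2, 1), equal priors.
p1, p2 = 0.5, 0.5
m1, s1 = 0.0, 1.0
m2, s2 = 2.0, 1.0
threshold = 1.0  # decide class 1 below the threshold, class 2 above it

# Probability mass of each class on the wrong side, weighted by its prior.
p_error = p1 * norm.sf(threshold, m1, s1) + p2 * norm.cdf(threshold, m2, s2)
print("P(error) =", p_error)
```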

Homework
- For the previous example, write an expression for the probability of a correct classification by changing the integrals and limits (i.e., do not simply write P_c = 1 − P_e).

Approximating a Bayes Classifier
If the density functions are not known:
- Determine template vectors that minimize the distances to the feature vectors in each class of the training data (vector quantization).
- Assume a form for the density function and estimate its parameters (directly or iteratively) from the data (parametric estimation or expectation maximization).
- Learn the posterior probabilities directly from the training data and interpolate on test data (neural networks or support vector machines).
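A sketch of the second option above: assume Gaussian class-conditional densities, estimate their parameters from training data, and classify with the MAP rule. It continues the synthetic yes/no feature example, and everything here is an illustrative assumption rather than the course's reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
train = {
    "yes": rng.normal([300.0, 6.0], [40.0, 2.0], size=(25, 2)),
    "no": rng.normal([450.0, 2.0], [40.0, 2.0], size=(25, 2)),
}

# Parametric approximation: fit a mean vector and covariance matrix per class,
# assume equal priors, and apply the MAP rule to new feature vectors.
models = {k: (X.mean(axis=0), np.cov(X.T)) for k, X in train.items()}
log_prior = np.log(0.5)

def classify(x):
    scores = {k: multivariate_normal.logpdf(x, mean=m, cov=C) + log_prior
              for k, (m, C) in models.items()}
    return max(scores, key=scores.get)

print(classify(np.array([320.0, 5.5])))
```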