Speech Recognition Pattern Classification

Pattern Classification
• Introduction
• Parametric classifiers
• Semi-parametric classifiers
• Dimensionality reduction
• Significance testing

Pattern Classification
• Goal: To classify objects (or patterns) into categories (or classes)
• Types of Problems:
  1. Supervised: Classes are known beforehand, and data samples of each class are available
  2. Unsupervised: Classes (and/or the number of classes) are not known beforehand, and must be inferred from the data
[Block diagram: Observations → Feature Extraction → Feature Vectors x → Classifier → Class ω_i]

Probability Basics
• Discrete probability mass function (PMF): P(ω_i)
• Continuous probability density function (PDF): p(x)
• Expected value: E(x)

Kullback-Leibler Distance
• Can be used to compute a distance between two probability mass distributions, P(z_i) and Q(z_i)
• Makes use of the inequality log x ≤ x - 1
• Known as relative entropy in information theory
• The divergence of P(z_i) and Q(z_i) is the symmetric sum of the two directed distances
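The distance and divergence formulas were images on the original slide; the standard definitions, in the slide's notation, are:

$$ D(P \,\|\, Q) = \sum_i P(z_i)\, \log \frac{P(z_i)}{Q(z_i)} \;\ge\; 0 $$

$$ D_s(P, Q) = D(P \,\|\, Q) + D(Q \,\|\, P) = \sum_i \bigl(P(z_i) - Q(z_i)\bigr)\, \log \frac{P(z_i)}{Q(z_i)} $$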

Bayes Theorem
• Define:
  {ω_i}     a set of M mutually exclusive classes
  P(ω_i)    a priori probability for class ω_i
  p(x|ω_i)  PDF for feature vector x in class ω_i
  P(ω_i|x)  a posteriori probability of ω_i given x

Bayes Theorem
• From Bayes rule:
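The rule itself (and the normalizing term the slide's "where" referred to) was an equation image; in standard form it reads:

$$ P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}, \qquad p(x) = \sum_{j=1}^{M} p(x \mid \omega_j)\, P(\omega_j) $$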

Bayes Decision Theory
• The probability of making an error given x is:
  P(error|x) = 1 - P(ω_i|x)   if we decide class ω_i
• To minimize P(error|x) (and P(error)):
  Choose ω_i if P(ω_i|x) > P(ω_j|x) ∀ j≠i

Bayes Decision Theory
• For a two-class problem this decision rule means:
  Choose ω_1 if P(ω_1|x) > P(ω_2|x); else choose ω_2
• This rule can be expressed as a likelihood ratio:
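The ratio test itself was an equation image; in the standard form, decide ω_1 when

$$ \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\; \frac{P(\omega_2)}{P(\omega_1)} $$

and ω_2 otherwise.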

Bayes Risk
• Define cost function λ_ij and conditional risk R(ω_i|x):
  - λ_ij is the cost of classifying x as ω_i when it is really ω_j
  - R(ω_i|x) is the risk of classifying x as class ω_i
• Bayes risk is the minimum risk which can be achieved:
  Choose ω_i if R(ω_i|x) < R(ω_j|x) ∀ j≠i
• Bayes risk corresponds to minimum P(error|x) when:
  - All errors have equal cost (λ_ij = 1, i≠j)
  - There is no cost for being correct (λ_ii = 0)
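The conditional risk the slide defines (its equation was an image) is, in the usual notation:

$$ R(\omega_i \mid x) = \sum_{j=1}^{M} \lambda_{ij}\, P(\omega_j \mid x) $$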

Discriminant Functions
• Alternative formulation of the Bayes decision rule
• Define a discriminant function, g_i(x), for each class ω_i:
  Choose ω_i if g_i(x) > g_j(x) ∀ j≠i
• Functions yielding identical classification results:
  g_i(x) = P(ω_i|x)
  g_i(x) = p(x|ω_i)P(ω_i)
  g_i(x) = log p(x|ω_i) + log P(ω_i)
• Choice of function impacts computation costs
• Discriminant functions partition the feature space into decision regions, separated by decision boundaries.

Density Estimation
• Used to estimate the underlying PDF p(x|ω_i)
• Parametric methods:
  - Assume a specific functional form for the PDF
  - Optimize the PDF parameters to fit the data
• Non-parametric methods:
  - Determine the form of the PDF from the data
  - Grow the parameter set size with the amount of data
• Semi-parametric methods:
  - Use a general class of functional forms for the PDF
  - Can vary the parameter set size independently of the data
  - Use unsupervised methods to estimate parameters

Parametric Classifiers
• Gaussian distributions
• Maximum likelihood (ML) parameter estimation
• Multivariate Gaussians
• Gaussian classifiers

Maximum Likelihood Parameter Estimation

Gaussian Distributions
• Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference value
• Simple estimation procedures exist for the model parameters
• Classification is often reduced to simple distance metrics
• Gaussian distributions are also called Normal distributions

Gaussian Distributions: One Dimension
• A one-dimensional Gaussian PDF can be expressed as shown below
• The PDF is centered around the mean μ
• The spread of the PDF is determined by the variance σ²
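The PDF formula was an image on the slide; the standard one-dimensional form is:

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad E(x) = \mu, \quad \operatorname{var}(x) = \sigma^2 $$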

Maximum-Likelihood Parameter Estimation

Assumptions
• The following assumptions hold:
  1. The samples come from a known number of classes, c.
  2. The prior probabilities P(ω_j) for each class are known, j = 1,…,c.

Maximum Likelihood Parameter Estimation
• Maximum likelihood parameter estimation determines an estimate θ̂ of the parameter θ by maximizing the likelihood L(θ) of observing the data X = {x_1,...,x_n}
• Assuming independent, identically distributed data
• ML solutions can often be obtained via the derivative:
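The likelihood and the derivative condition were equation images; in the standard notation for i.i.d. data:

$$ L(\theta) = p(X \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta), \qquad \hat{\theta} = \arg\max_{\theta} L(\theta), \qquad \frac{\partial}{\partial \theta} \log L(\theta)\Big|_{\theta=\hat{\theta}} = 0 $$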

Maximum Likelihood Parameter Estimation
• For Gaussian distributions, the log-likelihood log L(θ) is easier to maximize

Gaussian ML Estimation: One Dimension
• The maximum likelihood estimate for μ is given by:

Gaussian ML Estimation: One Dimension
• The maximum likelihood estimate for σ is given by:
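The two estimates referred to above were equation images; the standard one-dimensional Gaussian ML estimates are:

$$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 $$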

Gaussian ML Estimation: One Dimension [figure]

ML Estimation: Alternative Distributions [figures]

Gaussian Distributions: Multiple Dimensions (Multivariate)
• A multi-dimensional Gaussian PDF can be expressed as shown below
• d is the number of dimensions
• x = {x_1,…,x_d} is the input vector
• μ = E(x) = {μ_1,...,μ_d} is the mean vector
• Σ = E((x-μ)(x-μ)^t) is the covariance matrix with elements σ_ij, inverse Σ^-1, and determinant |Σ|
• σ_ij = σ_ji = E((x_i - μ_i)(x_j - μ_j)) = E(x_i x_j) - μ_i μ_j
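For reference, the multivariate Gaussian PDF referred to above (the equation was an image) is:

$$ p(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^t\, \Sigma^{-1}\, (x-\mu)\right) $$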

Gaussian Distributions: Multi-Dimensional Properties
• If the i-th and j-th dimensions are statistically or linearly independent, then E(x_i x_j) = E(x_i)E(x_j) and σ_ij = 0
• If all dimensions are statistically or linearly independent, then σ_ij = 0 ∀ i≠j and Σ has non-zero elements only on the diagonal
• If the underlying density is Gaussian and Σ is a diagonal matrix, then the dimensions are statistically independent and the joint PDF factors into a product of one-dimensional Gaussians (see below)
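The factorization referred to in the last bullet, in the slide's notation:

$$ p(x) = \prod_{i=1}^{d} p(x_i), \qquad p(x_i) = N(\mu_i, \sigma_{ii}) $$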

Diagonal Covariance Matrix: Σ = σ²I [figure]

Diagonal Covariance Matrix: σ_ij = 0 ∀ i≠j [figure]

General Covariance Matrix: σ_ij ≠ 0 [figure]

Multivariate ML Estimation
• The ML estimates for parameters θ = {θ_1,...,θ_l} are determined by maximizing the joint likelihood L(θ) of a set of i.i.d. data X = {x_1,...,x_n}
• To find θ̂ we solve ∇_θ L(θ) = 0, or ∇_θ log L(θ) = 0
• The ML estimates of μ and Σ are:
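The estimates themselves were equation images; the standard multivariate Gaussian ML estimates are:

$$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t $$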

Multivariate Gaussian Classifier
• Requires a mean vector μ_i and a covariance matrix Σ_i for each of M classes {ω_1,···,ω_M}
• The minimum-error discriminant functions are of the form shown below
• Classification can be reduced to simple distance metrics in many situations.
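The discriminant form (an equation image on the slide) is, in the usual notation:

$$ g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i) = -\tfrac{1}{2}\,(x-\mu_i)^t\, \Sigma_i^{-1}\, (x-\mu_i) - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) $$

The -(d/2) log 2π term is the same for every class and can be dropped.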

Gaussian Classifier: Σ_i = σ²I
• Each class has the same covariance structure: statistically independent dimensions with variance σ²
• The equivalent discriminant functions are given below
• If each class is equally likely, this is a minimum-distance classifier, a form of template matching
• The discriminant functions can be replaced by a linear expression (also given below)
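The two forms referred to above were equation images; the standard results for this case are (discarding terms that are the same for every class):

$$ g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \log P(\omega_i) $$

and, after expanding the quadratic and dropping the class-independent x^t x term, the linear form

$$ g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^2}, \qquad w_{i0} = -\frac{\mu_i^t \mu_i}{2\sigma^2} + \log P(\omega_i) $$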

Gaussian Classifier: Σ_i = σ²I
• For distributions with a common covariance structure, the decision boundaries are hyper-planes.

Gaussian Classifier: Σ_i = Σ
• Each class has the same covariance structure Σ
• The equivalent discriminant functions are given below
• If each class is equally likely, the minimum-error decision rule is the squared Mahalanobis distance
• The discriminant functions remain linear expressions (also given below)
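For reference, the standard forms for a shared covariance matrix (the slide's equations were images) are:

$$ g_i(x) = -\tfrac{1}{2}\,(x-\mu_i)^t\, \Sigma^{-1}\, (x-\mu_i) + \log P(\omega_i) $$

and, after dropping the class-independent x^t Σ^{-1} x term,

$$ g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\,\mu_i^t\, \Sigma^{-1}\mu_i + \log P(\omega_i) $$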

Gaussian Classifier: Σ_i Arbitrary
• Each class has a different covariance structure Σ_i
• The equivalent discriminant functions are given below
• The discriminant functions are inherently quadratic
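With class-dependent covariances nothing cancels across classes, so (again in the standard notation) the discriminant is quadratic in x:

$$ g_i(x) = x^t W_i\, x + w_i^t x + w_{i0} $$

$$ W_i = -\tfrac{1}{2}\,\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\,\mu_i^t\, \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) $$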

Gaussian Classifier: Σ_i Arbitrary
• For distributions with arbitrary covariance structures, the decision boundaries are general hyper-quadric surfaces (hyper-planes, hyper-spheres, hyper-ellipsoids, hyper-paraboloids, or hyper-hyperboloids).

3-Class Classification (Atal & Rabiner, 1976)
• Distinguish between silence, unvoiced, and voiced sounds
• Use 5 features:
  - Zero crossing count
  - Log energy
  - Normalized first autocorrelation coefficient
  - First predictor coefficient
  - Normalized prediction error
• Multivariate Gaussian classifier, ML estimation
• Decision by squared Mahalanobis distance (a classifier of this kind is sketched below)
• Trained on four speakers (2 sentences/speaker), tested on 2 speakers (1 sentence/speaker)
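The following is a minimal NumPy sketch of the kind of classifier this slide describes (per-class Gaussian models with ML-estimated means and covariances, decision by squared Mahalanobis distance). It is illustrative only: the class and variable names are made up here, feature extraction is omitted, and it is not the actual Atal & Rabiner implementation.

import numpy as np

class GaussianMahalanobisClassifier:
    """ML-trained per-class Gaussians; decide by smallest squared Mahalanobis distance."""

    def fit(self, X, y):
        # X: (n_samples, n_features) feature vectors, y: integer class labels
        self.classes_ = np.unique(y)
        self.means_ = {}
        self.inv_covs_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            self.means_[c] = Xc.mean(axis=0)                # ML estimate of the class mean
            Sigma = np.cov(Xc, rowvar=False, bias=True)     # ML (1/n) covariance estimate
            self.inv_covs_[c] = np.linalg.inv(Sigma)
        return self

    def predict(self, X):
        labels = []
        for x in np.atleast_2d(X):
            # squared Mahalanobis distance to each class; choose the nearest class
            d2 = {c: (x - self.means_[c]) @ self.inv_covs_[c] @ (x - self.means_[c])
                  for c in self.classes_}
            labels.append(min(d2, key=d2.get))
        return np.array(labels)

# Toy usage with synthetic 5-dimensional features and three classes
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 1.0, size=(50, 5)) for m in (-2.0, 0.0, 2.0)])
    y = np.repeat([0, 1, 2], 50)
    clf = GaussianMahalanobisClassifier().fit(X, y)
    print(clf.predict(X[:5]))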

Maximum A Posteriori Parameter Estimation

Maximum A Posteriori Parameter Estimation
• Bayesian estimation approaches assume the form of the PDF p(x|θ) is known, but the value of θ is not
• Knowledge of θ is contained in:
  - An initial a priori PDF p(θ)
  - A set of i.i.d. data X = {x_1,...,x_n}
• The desired PDF for x is of the form shown below
• The value θ̂ that maximizes p(θ|X) is called the maximum a posteriori (MAP) estimate of θ
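The forms referred to above were equation images; the standard Bayesian relations are:

$$ p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta, \qquad p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}, \qquad \hat{\theta} = \arg\max_{\theta}\, p(\theta \mid X) $$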

Gaussian MAP Estimation: One Dimension
• For a Gaussian distribution with unknown mean μ: x ~ N(μ, σ²)
• MAP estimates of μ and x are given below
• As n increases, p(μ|X) concentrates around μ̂, and p(x|X) converges to the ML estimate N(μ̂, σ²)
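Assuming a Gaussian prior p(μ) = N(μ_0, σ_0²) on the mean and a known variance σ² (the prior's parameters were not captured in the transcript), the standard result is:

$$ \hat{\mu}_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k $$

$$ p(x \mid X) = N\!\left(\hat{\mu}_n,\; \sigma^2 + \sigma_n^2\right), \qquad \sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2} $$

As n → ∞, σ_n² → 0 and μ̂_n → the sample mean, i.e. the ML estimate.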

References
• Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
• Duda, Hart, and Stork, Pattern Classification (2nd ed.), John Wiley & Sons, 2001.
• Atal and Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. ASSP, 24(3), 1976.