Speech Processing: Speaker Recognition
Speaker Recognition
Definitions:
- Speaker identification: given a set of models obtained for a number of known speakers, determine which model best characterizes a speaker.
- Speaker verification: decide whether a speaker corresponds to a particular known voice or to some other, unknown voice.
- Claimant: an individual who is correctly posing as one of the known speakers.
- Impostor: an unknown speaker who is posing as a known speaker.
- Error types: false acceptance and false rejection.
Speaker Recognition
Steps in speaker recognition:
- Model building: one model for each target speaker (claimant) and for a number of background (impostor) speakers.
- Speaker-dependent features: oral and nasal tract length and cross-section during different sounds, vocal fold mass and shape, location and size of the false vocal folds. Ideally these would be measured accurately from the speech waveform.
- Training data + model-building procedure ⇨ speaker models.
Speaker Recognition
In practice it is difficult to derive speech-anatomy features from the speech waveform. Instead, conventional feature-extraction methods are used:
- Constant-Q filter bank
- Spectral-based features
Speaker Recognition System
[Block diagram: training speech data → feature extraction → training → target and background speaker models (e.g., Linda, Kay, Joe); testing speech data → feature extraction → recognition → decision (e.g., Tom / Not Tom).]
Spectral Features for Speaker Recognition
Attributes of the human voice:
- High-level (difficult to extract from the speech waveform): clarity, roughness, magnitude, animation, prosody (pitch intonation, articulation rate, and dialect).
- Low-level (easy to extract from the speech waveform): vocal tract spectrum, instantaneous pitch, glottal flow excitation, source event onset times, modulations in formant trajectories.
Spectral Features for Speaker Recognition
We want the feature set to reflect the unique characteristics of a speaker. The short-time Fourier transform (STFT) magnitude captures:
- Vocal tract resonances
- Vocal tract anti-resonances, which are important for speaker identifiability
- The general trend of the STFT-magnitude envelope, influenced by the coarse component of the glottal flow derivative
The fine structure of the STFT is characterized by speaker-dependent features: pitch, glottal flow, and distributed acoustic effects.
Spectral Features for Speaker Recognition
Speaker recognition systems use a smooth representation of the STFT magnitude that captures the vocal tract resonances and the spectral tilt. Auditory-based features have proven superior to conventional features such as the all-pole LPC spectrum, the homomorphically filtered spectrum, and homomorphic prediction.
Mel-Cepstrum (Davis & Mermelstein)
Short-Time Fourier Analysis (Time-Dependent Fourier Transform)
Rectangular Window
Hamming Window
Comparison of Windows
Comparison of Windows (cont'd)
A Wideband Spectrogram
A Narrowband Spectrogram
Discrete Fourier Transform
In general, the number of input points, N, and the number of frequency samples, M, need not be the same:
- If M > N, we must zero-pad the signal.
- If M < N, we must time-alias the signal.
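The two cases can be illustrated with a short numpy sketch (not part of the original slides; the helper name dft_m_points is ours):

```python
import numpy as np

def dft_m_points(x, M):
    """Compute an M-point DFT of the length-N signal x.

    If M > N the signal is zero-padded; if M < N it is time-aliased
    (wrapped modulo M) before the transform, so that the M samples
    still lie on the DTFT X(e^{j omega}) at omega_k = 2*pi*k/M.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    if M >= N:
        # Zero-padding: append M - N zeros.
        xp = np.concatenate([x, np.zeros(M - N)])
    else:
        # Time-aliasing: fold the signal onto itself modulo M.
        xp = np.zeros(M)
        for n in range(N):
            xp[n % M] += x[n]
    return np.fft.fft(xp)

# Example: a length-16 Hamming window analyzed with M = 64 and M = 8.
x = np.hamming(16)
X64 = dft_m_points(x, 64)   # interpolated spectrum (zero-padding)
X8 = dft_m_points(x, 8)     # aliased samples of the same DTFT
```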
Examples of Various Spectral Representations
Cepstral Analysis of Speech
The speech signal is often assumed to be the output of an LTI system; i.e., it is the convolution of an input (source) and an impulse response (filter). If we are interested in characterizing the signal in terms of the parameters of such a model, we must go through a process of deconvolution. Cepstral analysis is a common procedure used for such deconvolution.
Cepstral Analysis
Cepstral analysis for deconvolution is based on the observation that if x[n] = x_1[n] * x_2[n], then X(z) = X_1(z) X_2(z). Taking the complex logarithm of X(z),
  X̂(z) = log{X(z)} = log{X_1(z)} + log{X_2(z)}.
If the complex logarithm is unique, and if X̂(z) is a valid z-transform, then
  x̂[n] = x̂_1[n] + x̂_2[n],
i.e., the two convolved signals become additive in this new, cepstral domain. If we restrict ourselves to the unit circle, z = e^{jω}, then
  X̂(e^{jω}) = log|X(e^{jω})| + j arg{X(e^{jω})}.
It can be shown that one approach to dealing with the problem of uniqueness is to require that arg{X(e^{jω})} be a continuous, odd, periodic function of ω.
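As a rough illustration of the deconvolution idea, the real cepstrum (the inverse DFT of the log magnitude spectrum, i.e., the even part of x̂[n]) can be computed as in the sketch below; the function name and frame setup are illustrative, not from the slides:

```python
import numpy as np

def real_cepstrum(x, n_fft=512):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

# Low quefrencies capture the smooth spectral envelope (vocal tract);
# high quefrencies capture excitation fine structure (e.g., pitch).
x = np.random.randn(400) * np.hamming(400)       # stand-in for a windowed speech frame
c = real_cepstrum(x)
envelope_coeffs = c[:20]                         # low-quefrency, envelope-related part
```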
Cepstral Analysis (cont'd)
To the extent that X̂(z) = log{X(z)} is valid, it can easily be shown that c[n] is the even part of x̂[n]. If x̂[n] is real and causal, then x̂[n] can be recovered from c[n]. This is known as the minimum-phase condition.
Mel-Frequency Cepstral Representation (Davis & Mermelstein, 1980)
Some recognition systems use mel-scale cepstral coefficients to mimic auditory processing. (The mel frequency scale is approximately linear up to 1000 Hz and logarithmic thereafter.) This is done by multiplying the magnitude (or log magnitude) of S(e^{jω}) by a set of triangular filter weights spaced on the mel scale; a sketch of this computation follows below.
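A minimal sketch of the mel-cepstrum computation: triangular mel-spaced filters applied to the power spectrum, followed by a log and a DCT. The filter count, FFT size, and helper names are assumptions for illustration, not values from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters with center frequencies equally spaced on the mel scale."""
    f_high = f_high if f_high is not None else sample_rate / 2.0
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, sample_rate=16000, n_fft=512, n_filters=24, n_ceps=13):
    """Windowed frame -> log mel-filterbank energies -> DCT -> cepstral coefficients."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ (spectrum ** 2)
    log_e = np.log(energies + 1e-12)
    # DCT-II of the log filterbank energies gives the mel cepstrum.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_e

frame = np.random.randn(400)          # stand-in for 25 ms of 16 kHz speech
coeffs = mfcc(frame)                  # 13 mel-frequency cepstral coefficients
```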
References
1. Tohkura, Y., "A Weighted Cepstral Distance Measure for Speech Recognition," IEEE Trans. ASSP, ASSP-35(10), 1414-1422, 1987.
2. Davis, S. and Mermelstein, P., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. ASSP, ASSP-28(4), 357-366, 1980.
3. Meng, H., The Use of Distinctive Features for Automatic Speech Recognition, SM Thesis, MIT EECS, 1991.
4. Leung, H., Chigier, B., and Glass, J., "A Comparative Study of Signal Representation and Classification Techniques for Speech Recognition," Proc. ICASSP, Vol. II, 680-683, 1993.
Pattern Classification
Goal: to classify objects (or patterns) into categories (or classes).
Types of problems:
1. Supervised: classes are known beforehand, and data samples of each class are available.
2. Unsupervised: classes (and/or the number of classes) are not known beforehand and must be inferred from the data.
[Block diagram: observations → feature extraction → feature vectors x → classifier → class ω_i]
Probability Basics
- Discrete probability mass function (PMF): P(ω_i)
- Continuous probability density function (PDF): p(x)
- Expected value: E(x)
Kullback-Leibler Distance
The KL distance can be used to compute a distance between two probability mass functions, P(z_i) and Q(z_i):
  D(P‖Q) = Σ_i P(z_i) log [P(z_i) / Q(z_i)] ≥ 0.
It makes use of the inequality log x ≤ x - 1 and is known as relative entropy in information theory. The divergence of P(z_i) and Q(z_i) is the symmetric sum D(P‖Q) + D(Q‖P).
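A small sketch of the KL distance and the symmetric divergence for discrete PMFs, assuming Q(z_i) > 0 wherever P(z_i) > 0 (helper names are ours):

```python
import numpy as np

def kl_distance(p, q):
    """D(P||Q) = sum_i P(z_i) * log(P(z_i)/Q(z_i)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def divergence(p, q):
    """Symmetric divergence D(P||Q) + D(Q||P)."""
    return kl_distance(p, q) + kl_distance(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_distance(p, q), divergence(p, q))
```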
Bayes Theorem
Define:
- {ω_i}: a set of M mutually exclusive classes
- P(ω_i): a priori probability of class ω_i
- p(x|ω_i): PDF of feature vector x in class ω_i
- P(ω_i|x): a posteriori probability of ω_i given x
Bayes Theorem
From Bayes' rule:
  P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),   where   p(x) = Σ_{j=1}^{M} p(x|ω_j) P(ω_j).
Bayes Decision Theory
The probability of making an error given x is
  P(error|x) = 1 - P(ω_i|x)   if we decide class ω_i.
To minimize P(error|x) (and hence P(error)): choose ω_i if P(ω_i|x) > P(ω_j|x) ∀ j ≠ i.
For a two-class problem this decision rule means: choose ω_1 if p(x|ω_1)P(ω_1) > p(x|ω_2)P(ω_2), else choose ω_2.
This rule can be expressed as a likelihood ratio test: choose ω_1 if p(x|ω_1)/p(x|ω_2) > P(ω_2)/P(ω_1).
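A minimal sketch of the two-class likelihood-ratio test for one-dimensional Gaussian class PDFs; the class parameters below are illustrative, not from the slides:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2 * pi * var)

def bayes_decide(x, mu1, var1, p1, mu2, var2, p2):
    """Two-class Bayes decision via the likelihood ratio test."""
    ratio = gaussian_pdf(x, mu1, var1) / gaussian_pdf(x, mu2, var2)
    return 1 if ratio > p2 / p1 else 2

# Example: class 1 ~ N(0, 1) with prior 0.7, class 2 ~ N(2, 1) with prior 0.3.
print(bayes_decide(0.5, 0.0, 1.0, 0.7, 2.0, 1.0, 0.3))   # -> 1
```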
Bayes Risk
Define a cost function λ_ij and the conditional risk R(ω_i|x):
- λ_ij is the cost of classifying x as ω_i when it is really ω_j.
- R(ω_i|x) = Σ_j λ_ij P(ω_j|x) is the risk of classifying x as class ω_i.
The Bayes risk is the minimum risk that can be achieved: choose ω_i if R(ω_i|x) < R(ω_j|x) ∀ j ≠ i.
The Bayes risk corresponds to minimum P(error|x) when all errors have equal cost (λ_ij = 1, i ≠ j) and there is no cost for being correct (λ_ii = 0).
Discriminant Functions
An alternative formulation of the Bayes decision rule: define a discriminant function g_i(x) for each class ω_i and choose ω_i if g_i(x) > g_j(x) ∀ j ≠ i.
Functions yielding identical classification results include
  g_i(x) = P(ω_i|x),   g_i(x) = p(x|ω_i)P(ω_i),   g_i(x) = log p(x|ω_i) + log P(ω_i).
The choice of function affects computational cost. Discriminant functions partition the feature space into decision regions, separated by decision boundaries.
Density Estimation
Density estimation is used to estimate the underlying PDF p(x|ω_i).
- Parametric methods: assume a specific functional form for the PDF and optimize its parameters to fit the data.
- Non-parametric methods: determine the form of the PDF from the data; the parameter set grows with the amount of data.
- Semi-parametric methods: use a general class of functional forms for the PDF; the parameter set can be varied independently of the data; unsupervised methods are used to estimate the parameters.
Parametric Classifiers
- Gaussian distributions
- Maximum likelihood (ML) parameter estimation
- Multivariate Gaussians
- Gaussian classifiers
Gaussian Distributions
- Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference value.
- Estimation procedures for the model parameters are simple.
- Classification often reduces to simple distance metrics.
- Gaussian distributions are also called Normal distributions.
Gaussian Distributions: One Dimension
A one-dimensional Gaussian PDF can be expressed as
  p(x) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)).
The PDF is centered around the mean μ; its spread is determined by the variance σ².
Maximum Likelihood Parameter Estimation
Maximum likelihood parameter estimation determines an estimate θ̂ of a parameter θ by maximizing the likelihood L(θ) of observing the data X = {x_1, ..., x_n}:
  L(θ) = p(X|θ) = Π_{i=1}^{n} p(x_i|θ)   (assuming independent, identically distributed data).
ML solutions can often be obtained by setting the derivative of L(θ) with respect to θ to zero.
Maximum Likelihood Parameter Estimation
For Gaussian distributions it is easier to work with the log-likelihood log L(θ), since the logarithm turns the product of exponentials into a sum.
Gaussian ML Estimation: One Dimension
The maximum likelihood estimate of μ is the sample mean:
  μ̂ = (1/n) Σ_{i=1}^{n} x_i.
Gaussian ML Estimation: One Dimension
The maximum likelihood estimate of σ² is the sample variance about μ̂:
  σ̂² = (1/n) Σ_{i=1}^{n} (x_i - μ̂)².
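Both estimates follow directly from the data; a short sketch (note the division by n rather than n - 1, which is what the ML derivation gives):

```python
import numpy as np

def gaussian_ml_estimate(x):
    """ML estimates for a 1-D Gaussian: sample mean and (biased) sample variance."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    var_hat = np.mean((x - mu_hat) ** 2)   # divide by n, not n-1, for the ML estimate
    return mu_hat, var_hat

samples = np.random.normal(loc=1.5, scale=2.0, size=10000)
mu_hat, var_hat = gaussian_ml_estimate(samples)
print(mu_hat, var_hat)   # close to 1.5 and 4.0 for large n
```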
Gaussian ML Estimation: One Dimension
ML Estimation: Alternative Distributions
ML Estimation: Alternative Distributions (cont'd)
Gaussian Distributions: Multiple Dimensions (Multivariate)
A multi-dimensional Gaussian PDF can be expressed as
  p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(-½ (x - μ)ᵗ Σ⁻¹ (x - μ)),
where:
- d is the number of dimensions,
- x = {x_1, ..., x_d} is the input vector,
- μ = E(x) = {μ_1, ..., μ_d} is the mean vector,
- Σ = E((x - μ)(x - μ)ᵗ) is the covariance matrix with elements σ_ij, inverse Σ⁻¹, and determinant |Σ|,
- σ_ij = σ_ji = E((x_i - μ_i)(x_j - μ_j)) = E(x_i x_j) - μ_i μ_j.
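A direct sketch of evaluating this PDF with numpy; the example mean and covariance are illustrative:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, sigma):
    """Evaluate the d-dimensional Gaussian PDF at x."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_gaussian_pdf([0.5, 1.2], mu, sigma))
```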
Gaussian Distributions: Multi-Dimensional Properties
- If the i-th and j-th dimensions are statistically or linearly independent, then E(x_i x_j) = E(x_i)E(x_j) and σ_ij = 0.
- If all dimensions are statistically or linearly independent, then σ_ij = 0 ∀ i ≠ j and Σ has non-zero elements only on the diagonal.
- If the underlying density is Gaussian and Σ is a diagonal matrix, then the dimensions are statistically independent and p(x) = Π_i p(x_i), with p(x_i) ~ N(μ_i, σ_ii).
Diagonal Covariance Matrix: Σ = σ²I
Diagonal Covariance Matrix: σ_ij = 0 ∀ i ≠ j
General Covariance Matrix: σ_ij ≠ 0
Multivariate ML Estimation
The ML estimates for parameters θ = {θ_1, ..., θ_l} are determined by maximizing the joint likelihood L(θ) of a set of i.i.d. data X = {x_1, ..., x_n}:
  L(θ) = p(X|θ) = Π_{i=1}^{n} p(x_i|θ).
To find θ̂ we solve ∇_θ L(θ) = 0, or equivalently ∇_θ log L(θ) = 0. The ML estimates of μ and Σ are
  μ̂ = (1/n) Σ_i x_i,   Σ̂ = (1/n) Σ_i (x_i - μ̂)(x_i - μ̂)ᵗ.
Multivariate Gaussian Classifier
A multivariate Gaussian classifier requires a mean vector μ_i and a covariance matrix Σ_i for each of the M classes {ω_1, ..., ω_M}. The minimum-error discriminant functions are of the form
  g_i(x) = log p(x|ω_i) + log P(ω_i) = -½ (x - μ_i)ᵗ Σ_i⁻¹ (x - μ_i) - ½ log|Σ_i| + log P(ω_i) + const.
Classification can be reduced to simple distance metrics in many situations.
Gaussian Classifier: Σ_i = σ²I
Each class has the same covariance structure: statistically independent dimensions with variance σ². The equivalent discriminant functions are
  g_i(x) = -‖x - μ_i‖² / (2σ²) + log P(ω_i).
If each class is equally likely, this is a minimum-distance classifier, a form of template matching. The discriminant functions can be replaced by the linear expression
  g_i(x) = w_iᵗ x + w_i0,   where   w_i = μ_i / σ²   and   w_i0 = -μ_iᵗμ_i / (2σ²) + log P(ω_i).
Gaussian Classifier: Σ_i = σ²I
For distributions with a common covariance structure, the decision regions are separated by hyper-planes.
Gaussian Classifier: Σ_i = Σ
Each class has the same covariance structure Σ. The equivalent discriminant functions are
  g_i(x) = -½ (x - μ_i)ᵗ Σ⁻¹ (x - μ_i) + log P(ω_i).
If each class is equally likely, the minimum-error decision rule reduces to the squared Mahalanobis distance. The discriminant functions remain linear expressions:
  g_i(x) = w_iᵗ x + w_i0,   where   w_i = Σ⁻¹ μ_i   and   w_i0 = -½ μ_iᵗ Σ⁻¹ μ_i + log P(ω_i).
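A sketch of a shared-covariance Gaussian classifier that decides by the smallest squared Mahalanobis distance (equal priors assumed); the function names and synthetic data are illustrative:

```python
import numpy as np

def fit_shared_covariance_classifier(X, y):
    """Per-class means plus one pooled (shared) ML covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    pooled = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c) for c in classes)
    pooled /= len(X)
    return means, np.linalg.inv(pooled)

def classify(x, means, inv_sigma):
    """Choose the class with the smallest squared Mahalanobis distance."""
    def d2(mu):
        diff = x - mu
        return diff @ inv_sigma @ diff
    return min(means, key=lambda c: d2(means[c]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
means, inv_sigma = fit_shared_covariance_classifier(X, y)
print(classify(np.array([2.5, 2.0]), means, inv_sigma))   # -> 1
```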
Gaussian Classifier: Σ_i Arbitrary
Each class has a different covariance structure Σ_i. The equivalent discriminant functions are
  g_i(x) = -½ (x - μ_i)ᵗ Σ_i⁻¹ (x - μ_i) - ½ log|Σ_i| + log P(ω_i),
which are inherently quadratic in x.
Gaussian Classifier: Σ_i Arbitrary
For distributions with arbitrary covariance structures, the decision regions are bounded by general quadratic surfaces (hyperquadrics).
3-Class Classification (Atal & Rabiner, 1976)
Goal: distinguish between silence, unvoiced, and voiced sounds using five features:
- Zero-crossing count
- Log energy
- Normalized first autocorrelation coefficient
- First predictor coefficient
- Normalized prediction error
A multivariate Gaussian classifier with ML estimation was used, with decisions made by the squared Mahalanobis distance. The system was trained on four speakers (2 sentences per speaker) and tested on 2 speakers (1 sentence per speaker).
Maximum A Posteriori Parameter Estimation
Bayesian estimation approaches assume the form of the PDF p(x|θ) is known, but the value of θ is not. Knowledge of θ is contained in:
- an initial a priori PDF p(θ), and
- a set of i.i.d. data X = {x_1, ..., x_n}.
The desired PDF for x is of the form
  p(x|X) = ∫ p(x|θ) p(θ|X) dθ.
The value θ̂ that maximizes p(θ|X) is called the maximum a posteriori (MAP) estimate of θ.
Gaussian MAP Estimation: One Dimension
For a Gaussian distribution with unknown mean μ (and known variance σ²), assume a Gaussian prior p(μ) ~ N(μ_0, σ_0²). The posterior p(μ|X) is then Gaussian, and the MAP estimate of μ is a weighted combination of the prior mean and the sample mean:
  μ̂ = (nσ_0² x̄ + σ² μ_0) / (nσ_0² + σ²).
As n increases, p(μ|X) concentrates around the sample mean and p(x|X) converges to the ML estimate ~ N(μ̂, σ²).
References
Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons, 2001.
Atal and Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. ASSP, 24(3), 1976.
Speaker Recognition Algorithm
Minimum-distance classifier
Vector Quantization
Class-dependent distance
Gaussian Mixture Model (GMM)
Speech production is not deterministic: phones are never produced by a speaker with exactly the same vocal tract shape and glottal flow, due to variations in context, coarticulation, anatomy, and fluid dynamics. One of the best ways to represent this variability is through multi-dimensional Gaussian PDFs; in general, a mixture of Gaussians is used to represent a class PDF.
Mixture Densities
The PDF is composed of a mixture of m component densities {ω_1, ..., ω_m}:
  p(x) = Σ_{j=1}^{m} p(x|ω_j) P(ω_j).
The component PDF parameters and mixture weights P(ω_j) are typically unknown, making parameter estimation a form of unsupervised learning. Gaussian mixtures assume Normal components:
  p(x|ω_j) ~ N(μ_j, Σ_j).
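A sketch of evaluating a one-dimensional Gaussian mixture PDF; the component weights, means, and variances below are illustrative (they are not taken from the example slide that follows):

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_j P(omega_j) * N(x; mu_j, sigma_j^2) for a 1-D Gaussian mixture."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    components = [
        w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    return np.sum(components, axis=0)

# Two-component mixture with illustrative parameters.
print(gmm_pdf([0.0, 1.5], weights=[0.6, 0.4], means=[-1.0, 1.5], variances=[1.0, 1.0]))
```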
Gaussian Mixture Example: One Dimension
p(x) = 0.6 p_1(x) + 0.4 p_2(x), with p_1(x) ~ N(μ_1, σ²) (negative mean) and p_2(x) ~ N(1.5, σ²).
Gaussian Example: first 9 MFCCs from [s], modeled with a Gaussian PDF
Independent Mixtures: [s], 2 Gaussian mixture components per dimension
Mixture Components: [s], 2 Gaussian mixture components per dimension
ML Parameter Estimation: 1D Gaussian Mixture Means
Gaussian Mixtures: ML Parameter Estimation
The maximum likelihood solutions are of the form:
  P̂(ω_k) = (1/n) Σ_i P(ω_k|x_i)
  μ̂_k = Σ_i P(ω_k|x_i) x_i / Σ_i P(ω_k|x_i)
  Σ̂_k = Σ_i P(ω_k|x_i)(x_i - μ̂_k)(x_i - μ̂_k)ᵗ / Σ_i P(ω_k|x_i)
Gaussian Mixtures: ML Parameter Estimation
The ML solutions are typically obtained iteratively:
- Select a set of initial estimates for P̂(ω_k), μ̂_k, Σ̂_k.
- Use the set of n samples to re-estimate the mixture parameters until some convergence criterion is met.
Clustering procedures, similar to the K-means clustering procedure, are often used to provide the initial parameter estimates.
Example: 4 Samples, 2 Densities
1. Data: X = {x_1, x_2, x_3, x_4} = {2, 1, -1, -2}
2. Initialization: p(x|ω_1) ~ N(1, 1), p(x|ω_2) ~ N(-1, 1), P(ω_i) = 0.5
3. Estimate the posteriors:
             x_1    x_2    x_3    x_4
   P(ω_1|x)  0.98   0.88   0.12   0.02
   P(ω_2|x)  0.02   0.12   0.88   0.98
   p(X) = (e^{-0.5} + e^{-4.5})(e^{0} + e^{-2})(e^{0} + e^{-2})(e^{-0.5} + e^{-4.5}) · 0.5⁴
4. Recompute the mixture parameters (shown only for ω_1).
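A sketch that reproduces the first E-step posteriors of this example with a small EM loop for a two-component, unit-variance mixture (re-estimating means and weights only); the helper names are ours:

```python
import numpy as np

def gaussian(x, mu, var=1.0):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, mu, weights, n_iter=10):
    """EM for a 2-component, unit-variance 1-D Gaussian mixture (means and weights only)."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior P(omega_k | x_i) for each sample and component.
        joint = np.stack([w * gaussian(x, m) for w, m in zip(weights, mu)])
        post = joint / joint.sum(axis=0)
        # M-step: re-estimate means and mixture weights.
        mu = (post * x).sum(axis=1) / post.sum(axis=1)
        weights = post.mean(axis=1)
    return mu, weights, post

x = [2, 1, -1, -2]
mu, weights, post = em_two_gaussians(
    x, mu=np.array([1.0, -1.0]), weights=np.array([0.5, 0.5]), n_iter=1)
print(np.round(post, 2))   # [[0.98 0.88 0.12 0.02], [0.02 0.12 0.88 0.98]]
```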
Example: 4 Samples, 2 Densities (cont'd)
Repeat steps 3 and 4 until convergence.
[s] Duration: 2 Densities
Gaussian Mixture Example: Two Dimensions
Two-Dimensional Mixtures
Two-Dimensional Components
Mixture of Gaussians: Implementation Variations
- Diagonal Gaussians are often used instead of full-covariance Gaussians: they reduce the number of parameters and can potentially model the underlying PDF just as well if enough components are used.
- Mixture parameters are often constrained to be the same in order to reduce the number of parameters that need to be estimated:
  - Richter Gaussians share the same mean in order to better model the PDF tails.
  - Tied mixtures share the same Gaussian parameters across all classes; only the mixture weights P̂(ω_i) are class-specific (also known as semi-continuous).
Richter Gaussian Mixtures: [s] log duration, 2 Richter Gaussians
Expectation-Maximization (EM)
- EM is used to determine the parameters θ for incomplete data X = {x_i} (i.e., unsupervised learning problems).
- It introduces hidden variables Z = {z_j} that make the data complete, so that θ could be solved for using conventional ML techniques.
- In reality, z_j can only be estimated through P(z_j|x_i, θ), so we can only compute the expectation of log L(θ).
- EM solutions are computed iteratively until convergence:
  1. Compute the expectation of log L(θ).
  2. Compute the parameter values θ that maximize this expectation.
EM Parameter Estimation: 1D Gaussian Mixture Means
Let z_i be the component id {ω_j} to which x_i belongs. Converting to mixture-component notation and differentiating the expected log-likelihood with respect to μ_k (and setting the result to zero) yields the weighted-mean update
  μ̂_k = Σ_i P(ω_k|x_i) x_i / Σ_i P(ω_k|x_i).
EM Properties
Each iteration of EM increases the likelihood of X. This can be shown using Bayes' rule and the Kullback-Leibler distance metric.
EM Properties
Since θ' is determined to maximize E(log L(θ)), combining these two properties gives p(X|θ') ≥ p(X|θ).
Dimensionality Reduction
- For a given training set, PDF parameter estimation becomes less robust as dimensionality increases.
- Increasing dimensionality can make it more difficult to obtain insight into any underlying structure.
- Analytical techniques exist that can transform a sample space to a different set of dimensions:
  - If the original dimensions are correlated, the same information may require fewer dimensions.
  - The transformed space will often have a more Normal distribution than the original space.
  - If the new dimensions are orthogonal, it can be easier to model the transformed space.
Principal Components Analysis
PCA linearly transforms a d-dimensional vector x into a d'-dimensional vector y via orthonormal vectors W:
  y = Wᵗ x,   W = {w_1, ..., w_d'},   Wᵗ W = I.
If d' < d, x can only be partially reconstructed from y:
  x̂ = W y.
Principal Components Analysis
The principal components W minimize the distortion D between x and x̂ on the training data X = {x_1, ..., x_n}:
  D = Σ_i ‖x_i - x̂_i‖².
PCA is also known as the Karhunen-Loève (K-L) expansion (the w_i are sinusoids for some stochastic processes).
PCA Computation
W corresponds to the first d' eigenvectors of the covariance matrix Σ of the original space:
  Σ = P Λ Pᵗ,   P = {e_1, ..., e_d},   w_i = e_i.
The full covariance structure Σ of the original space is transformed to a diagonal covariance structure Λ in the new space; the eigenvalues {λ_1, ..., λ_d'} are the variances along the new dimensions.
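A sketch of the eigenvector computation on the sample covariance matrix; the synthetic correlated data are illustrative:

```python
import numpy as np

def pca(X, d_prime):
    """Return the top d' eigenvectors of the sample covariance and the projected data."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d_prime]     # keep the largest-variance directions
    W = eigvecs[:, order]                           # columns are w_1, ..., w_d'
    Y = Xc @ W                                      # y = W^t (x - mu)
    return W, Y, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2-D data
W, Y, variances = pca(X, d_prime=1)
print(variances)   # variance captured by the first principal component
```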
PCA Computation
The axes in the d'-space contain the maximum amount of variance.
PCA Example
- Original feature vector: mean-rate response (d = 40).
- Data obtained from 100 speakers in the TIMIT corpus.
- The first 10 components explain 98% of the total variance.
PCA Example (cont'd)
PCA for Boundary Classification
- Eight non-uniform averages computed from 14 MFCCs.
- The first 50 dimensions were used for classification.
PCA Issues
- PCA can be performed using either the covariance matrix Σ or the correlation-coefficient matrix P̂.
- P̂ is usually preferred when the input dimensions have significantly different ranges.
- PCA can be used to normalize or whiten the original d-dimensional space (transforming Σ to the identity I) to simplify subsequent processing; the whitening operation can be done in one step, z = Vᵗ x.
Significance Testing
- To properly compare results from different classifier algorithms, A1 and A2, it is necessary to perform significance tests: large differences can be insignificant for small test sets, and small differences can be significant for large test sets.
- General significance tests evaluate the hypothesis that the probability of being correct, p_i, is the same for both algorithms.
- The most powerful comparisons use common training and test corpora and a common evaluation criterion, so that results reflect differences in the algorithms rather than accidental differences in the test sets.
- Significance tests can be more precise when identical data are used, since they can focus on tokens misclassified by only one algorithm rather than on all tokens.
McNemar's Significance Test
When algorithms A1 and A2 are tested on identical data, the results can be collapsed into a 2x2 matrix of counts:

  A1 \ A2      Correct    Incorrect
  Correct      n_00       n_01
  Incorrect    n_10       n_11

To compare the algorithms, we test the null hypothesis H_0 that p_1 = p_2, i.e., that n_01 = n_10 in expectation (each asymmetrically classified token is equally likely to come from either algorithm).
McNemar's Significance Test
Given H_0, the probability of observing k tokens asymmetrically classified out of n = n_01 + n_10 has a Binomial PMF:
  P(k) = C(n, k) (1/2)ⁿ.
McNemar's test measures the probability P of all cases that meet or exceed the observed asymmetric distribution, and tests whether P < α.
McNemar's Significance Test
The probability P is computed by summing the tails of the PMF. For large n, a Normal approximation is often used.
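A sketch of the exact (two-tailed) McNemar test using the Binomial PMF under H_0; the counts n_01 and n_10 below are hypothetical:

```python
from math import comb

def mcnemar_exact_p(n01, n10):
    """Two-sided exact McNemar test: probability of an asymmetry at least this extreme under H0."""
    n = n01 + n10
    k = min(n01, n10)
    # Under H0 each of the n disagreements is a fair coin flip between the two algorithms.
    tail = sum(comb(n, i) for i in range(0, k + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts: 30 tokens classified correctly only by A2, 15 only by A1.
p = mcnemar_exact_p(n01=30, n10=15)
print(p)   # reject H0 at significance level alpha if p < alpha
```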
Significance Test Example (Gillick and Cox, 1989)
- Common test set of 1400 tokens.
- Algorithms A1 and A2 make 72 and 62 errors, respectively.
- Are the differences significant? (The answer depends on the asymmetric counts n_01 and n_10, not just on the error totals.)
References
Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons, 2001.
Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
Bishop, Neural Networks for Pattern Recognition, Clarendon Press, 1995.
Gillick and Cox, "Some Statistical Issues in the Comparison of Speech Recognition Algorithms," Proc. ICASSP, 1989.