1
HIWIRE Progress Report
Technical University of Crete, Speech Processing and Dialog Systems Group
Presenters: Alex Potamianos, Vasilis Diakoloukas
2
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
4
Blind Speech Separation (BSS) problem (the "cocktail party" problem): recover a set of unobserved speech signals from a set of observed mixtures, with no prior information about the mixing system.
5
Data Model – Problem Statement
Convolutive speech mixtures: in the real world we have to deal with

x(t) = Σ_{τ=0..L} A(τ) s(t-τ) + n(t)    (4)

where A(τ) is the mixing impulse-response matrix, its i-th column a_i(τ) is the spatial signature of the i-th speaker for lag τ, n(t) is the additive noise vector, and L is the channel order.

Objective: estimate the inverse-channel impulse-response matrix W(τ) from the observed signals, such that

ŝ(t) = Σ_τ W(τ) x(t-τ) ≈ s(t)    (5)
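To make the data model concrete, here is a minimal Python sketch of equation (4); the dimensions are made up, random filters stand in for real room responses, and white noise stands in for speech, so this is purely illustrative, not the experimental setup of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, L, T = 2, 2, 8, 4000                    # speakers, microphones, channel order, samples
s = rng.standard_normal((I, T))               # stand-in for the speech sources
A = 0.5 * rng.standard_normal((L + 1, J, I))  # mixing impulse responses A(tau)

# x(t) = sum_{tau=0}^{L} A(tau) s(t - tau) + n(t)   -- equation (4)
x = np.zeros((J, T))
for tau in range(L + 1):
    x[:, tau:] += A[tau] @ s[:, :T - tau]
x += 0.01 * rng.standard_normal((J, T))       # additive noise vector n(t)
```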
6
Frequency-domain approach
Estimate {A(f), for all f} from the estimated autocorrelation data via PARAFAC, under the assumption rank(A(f)) = I, and obtain the unmixing system from IDFT{Â(f)}.
Problem: a frequency-dependent permutation and scaling ambiguity remains. (Diagram: at bins f0, f1, f2, f3 the two sources, labeled L for Lefteris and K for Kleanthis, appear with different orderings and scalings, e.g. [1.5L, 3K]^T, [12L, 0.5K]^T, [30K, 0.1L]^T, [2K, 2L]^T.)
7
Frequency-dependent permutation and scaling ambiguities
Resolution of the permutation ambiguity, Criterion 1:
- G_{j,i}(f): frequency response of the acoustic channel between the j-th microphone and the i-th speaker.
- H_j(f): j-th microphone's frequency response.
- τ_{j,i}: delay due to the time the sound needs to cover the distance between the i-th speaker and the j-th microphone.
Assumptions:
- Non-reverberant (anechoic) environment: G_{j,i}(f) ≈ n_{j,i} G(f), where n_{j,i} is an attenuation coefficient and n_{j,i} ~ 1/d_{j,i}.
- The microphones are identical: H_j(f) ≈ H for all j = 1,…,J.
8
Frequency-dependent permutation and scaling ambiguities (cont.)
The elements of the mixing matrices A(f), f = 0,…,T-1, may be decomposed into products of the form

A_{j,i}(f) = G_{j,i}(f) H_j(f) e^{-j2πfτ_{j,i}}    (15)

Normalize the columns of the output matrices of the estimation step w.r.t. the first microphone, and consider the vector whose elements are the magnitudes of the respective elements of the normalized column associated with the i-th speaker:

c_i(f) = [ |A_{1,i}(f)|, …, |A_{J,i}(f)| ]^T / |A_{1,i}(f)| ≈ [ 1, n_{2,i}/n_{1,i}, …, n_{J,i}/n_{1,i} ]^T    (16)

Under the above assumptions, (16) depends only on i, not on f.
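Turning (16) into code is mechanical; the following sketch assumes the estimation step has produced an array `A_hat` of per-bin mixing-matrix estimates (the array layout and the function name are our own, not the authors'):

```python
import numpy as np

def column_signatures(A_hat):
    """Eq. (16): for every frequency bin f and speaker i, the magnitudes of the
    i-th column of A_hat(f) after normalization w.r.t. the first microphone.

    A_hat: complex array of shape (F, J, I), one estimated mixing matrix per bin.
    Returns a real array of shape (F, I, J); under the anechoic assumptions
    each (f, i) vector depends only on i, so the vectors cluster by speaker."""
    normalized = A_hat / A_hat[:, :1, :]          # divide each column by its mic-1 entry
    return np.abs(normalized).transpose(0, 2, 1)  # one J-vector per (bin, speaker)
```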
9
Frequency-dependent permutation and scaling ambiguities (cont.)
Hopefully, even for reverberant environments, the collection of vectors can be divided into I separate clusters, each cluster associated with a specific speaker (illustrated in the slide by two clusters with centroids c_1 and c_2).
10
Frequency-dependent permutation and scaling ambiguities (cont.)
Cluster separability. (Diagram: microphone 1 and speakers Sp. 1, Sp. 2, with speaker-microphone distances d_11, d_12, d_21, d_22; candidate speaker positions A, B, C on a circle of radius r; VQ clustering.) For fixed r with d_ii = r: position A gives no separability, B medium separability, and C maximum separability, so separability depends on the distances d_ii relative to r. We also showed that the cluster variance shrinks (and separability grows) when the speaker-microphone distances are comparable to the room dimensions.
11
- Apply the VQ clustering procedure over all the available vectors.
- Determine the I centroids {c_1,…,c_I} of the I clusters and construct C = [c_1,…,c_I].
- The frequency-dependent permutation problem then boils down to an Integer Least Squares (ILS) minimization: for each f, choose among the possible permutation matrices Π_i, i = 1,…,I!, the one that brings A_n(f) closest to the centroid matrix C (written via an indicator function over the permutation choice), where A_n(f) is the matrix constructed from the normalized vectors of A_{s,p}(f).
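A minimal sketch of this step: cluster all eq.-(16) vectors with VQ (k-means here) and, per bin, solve the permutation problem by enumerating the I! permutations, which is feasible for small I. The names and the use of scipy's `kmeans2` are our choices, not the authors'.

```python
import numpy as np
from itertools import permutations
from scipy.cluster.vq import kmeans2

def fix_permutations(signatures):
    """signatures: array (F, I, J) of eq.-(16) vectors, one per (bin, speaker).
    Clusters them into I groups and, for each bin, returns the speaker
    permutation closest to the centroids (the ILS minimization of the slide)."""
    F, I, J = signatures.shape
    centroids, _ = kmeans2(signatures.reshape(F * I, J), I, seed=0)  # {c_1,...,c_I}
    perms = []
    for f in range(F):
        cost = lambda p: sum(np.sum((signatures[f, p[i]] - centroids[i]) ** 2)
                             for i in range(I))
        perms.append(min(permutations(range(I)), key=cost))  # enumerate the I! options
    return perms, centroids
```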
12
Frequency-dependent permutation and scaling ambiguities (cont.)
Criterion 2. For adjacent frequency bins f_1 and f_2 = f_1 + 1, A(f_1) ≈ A(f_2); this holds for dense FFT grids. Based on this fact, we formulate our second ILS minimization criterion (18).
Note: (18) fixes the permutation ambiguity globally across all frequency bins, which is more robust than a sequential approach, where the wrong detection of a single permutation can cause wrong permutations over a large frequency block.
13
Frequency-dependent permutation and scaling ambiguities (cont.)
Combined Criterion. Combine the two previous criteria into one overall ILS minimization criterion (19), with λ weighing Criterion 2 against Criterion 1. As λ grows (practically, for λ > 1), (19) coincides with Criterion 2.
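A sketch of how (19) could look in code, under the assumption that the combined criterion is a λ-weighted sum of a centroid-matching term (Criterion 1) and an adjacent-bin smoothness term (Criterion 2); the exact weighting in the paper may differ.

```python
import numpy as np

def combined_cost(sig_f, centroids, A_prev, A_curr, perm, lam=1.0):
    """Per-bin cost of a candidate permutation `perm`, eq.-(19) style:
    Criterion 1: distance of the bin's eq.-(16) vectors to their centroids;
    Criterion 2: change of the permuted mixing matrix w.r.t. the previous bin.
    Summed over all bins and minimized jointly, this fixes the permutations
    globally; lam > 1 makes Criterion 2 dominate."""
    I = len(perm)
    c1 = sum(np.sum((sig_f[perm[i]] - centroids[i]) ** 2) for i in range(I))
    c2 = np.sum(np.abs(A_curr[:, list(perm)] - A_prev) ** 2)
    return c1 + lam * c2
```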
14
Experimental Results
Measurement setup.
17
Comparison with previous methods
- Parra and Spence (2000): convergence in question; limited speed and identifiability potential.
- Mitianoudis and Davies (2003): convergence in question; limited identifiability potential; sequential approach to permutation-problem resolution.
- Pham et al. (2003): limited identifiability potential; sequential approach to permutation-problem resolution.
- Rahbar and Reilly (2005): high complexity; limited identifiability potential.
18
Conclusions
- Improved performance relative to Parra.
- Complexity reduced by 1-2 orders of magnitude relative to Parra, bringing execution times within range for on-line use, at least for teleconferencing and possibly also cellular-telephony applications.
- Guaranteed convergence of the overall algorithm.
- Interpretable criteria for resolving the permutation/scaling ambiguity, leading to a well-known ILS problem for which many good approximate solutions exist.
- Our approach reveals the much broader identifiability potential of joint-diagonalization-based BSS methods, which went unrecognized in the past.
- Our permutation-correction scheme prevents the catastrophic errors that can result from a wrongly adjusted permutation.
19
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
20
Motivation
- Combining classifiers/information sources is an important problem in machine-learning applications.
- A simple yet powerful way to combine classifiers is the "multi-stream" approach, which assumes independent information sources.
- Unsupervised stream-weight computation for multi-stream classifiers is an open problem.
21
Problem Definition
Compute "optimal" exponent weights for each stream s [HMM Gaussian-mixture formulation; similar expressions hold for MM, naïve Bayes, and Euclidean/Mahalanobis classifiers]. Optimality is in the sense of minimizing the "total classification error".
22
Goals
- Obtain analytical expressions for the total error in Bayes classification.
- Compute optimal stream weights that minimize the total error (Bayes + estimation + model error).
- Propose estimators of the optimal stream weights that can be computed in an "unsupervised" way.
23
Multi-Stream Classification
- Two-class problem: w_1, w_2.
- The feature vector x is broken up into two independent streams x_1 and x_2.
- Stream weights s_1 and s_2 are used to "equalize" the "probabilities".
24
Multi-Stream Classification (cont.)
- Bayes classification decision: choose w_1 if s_1 log p(x_1|w_1) + s_2 log p(x_2|w_1) > s_1 log p(x_1|w_2) + s_2 log p(x_2|w_2).
- Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease; stream weights can therefore decrease the total error.
- "Optimal" weights minimize the estimation-error variance σ_z².
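A toy version of the weighted decision rule, assuming Gaussian stream models; the class models, dimensions, and parameter values are all illustrative, not taken from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_decide(x1, x2, models, s1=1.0, s2=1.0):
    """Two-stream Bayes decision with exponent stream weights: pick the class
    maximizing s1*log p(x1|w) + s2*log p(x2|w)."""
    scores = {w: s1 * d1.logpdf(x1) + s2 * d2.logpdf(x2)
              for w, (d1, d2) in models.items()}
    return max(scores, key=scores.get)

# illustrative class models: a 2-d stream and a 1-d stream per class (unit covariance)
models = {"w1": (multivariate_normal([0, 0]), multivariate_normal([0])),
          "w2": (multivariate_normal([1, 1]), multivariate_normal([2]))}
print(multistream_decide(np.array([0.1, -0.2]), np.array([1.7]), models, s1=0.7, s2=0.3))
```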
25
Multi-Stream Classification (cont.)
Estimate the variance of z close to the decision boundary:

σ_z² = s_1² S_1² + s_2² S_2²

where S_1² and S_2² are the total stream estimation-error variances for streams 1 and 2 (summed over all classes).
26
Optimal Stream Weights (cont.)
Equal error rate in the single-stream classifiers ⇒ the optimal stream weights are inversely proportional to the total stream estimation-error variances (s_i ∝ 1/S_i²).
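This rule follows from minimizing σ_z² = s_1²S_1² + s_2²S_2² subject to s_1 + s_2 = 1, and takes a few lines to implement; the normalization to unit sum is our convention.

```python
def optimal_stream_weights(S1_sq, S2_sq):
    """Stream weights inversely proportional to the total stream estimation
    error variances S1^2, S2^2, normalized so that s1 + s2 = 1."""
    s1, s2 = 1.0 / S1_sq, 1.0 / S2_sq
    return s1 / (s1 + s2), s2 / (s1 + s2)

print(optimal_stream_weights(0.5, 1.0))  # the noisier stream gets the smaller weight
```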
27
Optimal Stream Weights (cont.)
Simulation for equal error rate in the single-stream classifiers: the estimation/modeling error is obtained using the histogram and a given Gaussian. (Figure omitted.)
28
Optimal Stream Weights (cont.)
Simulation: optimal vs. computed stream weights. (Figure omitted.)
29
Optimal Stream Weights (cont.)
Equal estimation-error variance in each stream ⇒ the optimal weights are approximately inversely proportional to the single-stream classification errors.
30
Experimental Results
- Test case: audio-visual continuous digit recognition task.
- Differences from the ideal two-class case: multi-class problem; recognition instead of classification.
- Two feature streams: audio (d = 39), video (d = 105).
- Multiple experiments: clean video stream; noise-corrupted audio streams at various SNRs.
31
Experimental Results (cont.)
- Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker.
- Training set: 1500 digits (30x5x10). Test set: 300 digits (6x5x10).
- Features: audio, 39 features (MFCC_D_A); visual, 105 features (ROIDCT_D_A).
- Multi-stream HMM models, middle integration: 8-state, left-to-right, whole-digit HMMs; single Gaussian mixture; the AV-HMM uses separate audio and video feature streams.
32
Optimal Stream Weights: Results
Assumption: S_V² / S_A² = 2, independent of SNR. Correlation between theoretical and experimentally computed weights: 0.96.
33
AV-ASR Results (Matched Scenario)
34
AV-ASR Results (Mismatched Scenario)
35
Conclusions
- Towards unsupervised estimation of stream weights for multi-stream classification.
- Analytical results for two-class, two-stream classification.
- Stream weights should be inversely proportional to the stream estimation-error variance and to the single-stream classification error.
- For AV-ASR, we show good correlation between theoretical and experimental results.
36
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
37
Dynamical System Segment Model
- Based on a linear dynamical system.
- The system parameters should guarantee identifiability, controllability, observability, and stability.
- Until now, the parameter structures considered have been very restricted; we investigated more generalized parameter structures.
38
Generalized forms of parameter structures
Linear dynamical system with state control:

x_{k+1} = F x_k + B u_k + w_k,  w_k ~ N(0, P)
y_k     = H x_k + v_k,          v_k ~ N(0, R)

The parameters F, B, H have canonical forms (Ljung, "System Identification").
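A minimal simulation of such a system, with arbitrary dimensions and noise levels; the canonical structures of F, B, H are not enforced here, the matrices are simply random, so this only illustrates the state/observation equations.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, p, T = 4, 1, 2, 50                 # state, control, observation dims; segment length
F = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))  # state transition (near-stable)
B = rng.standard_normal((n, m))          # control-input matrix
H = rng.standard_normal((p, n))          # observation matrix

x, Y = np.zeros(n), []
for k in range(T):
    u = np.ones(m)                                    # constant control input
    x = F @ x + B @ u + 0.1 * rng.standard_normal(n)  # state noise w_k ~ N(0, P)
    Y.append(H @ x + 0.1 * rng.standard_normal(p))    # observation noise v_k ~ N(0, R)
Y = np.asarray(Y)                        # the observed feature-vector segment
```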
39
Parameter Estimation
- Use of the EM algorithm to estimate the parameters F, B, P, R.
- We propose a new element-wise parameter estimation algorithm.
- For the forward-backward recursions, Kalman smoother recursions are used.
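For reference, a compact sketch of the Kalman (Rauch-Tung-Striebel) smoother recursions that supply the E-step statistics; the control input is omitted for brevity, and treating (x0, V0) as the prior on the initial state is our convention.

```python
import numpy as np

def rts_smoother(Y, F, H, P, R, x0, V0):
    """Forward Kalman filter + RTS backward pass; the smoothed means and
    covariances are the sufficient statistics ("counts") for the EM E-step."""
    T = len(Y)
    xp, Vp, xf, Vf = [], [], [], []
    x, V = x0, V0                                    # prior on the initial state
    for y in Y:                                      # forward filtering
        x_pred, V_pred = F @ x, F @ V @ F.T + P
        K = V_pred @ H.T @ np.linalg.inv(H @ V_pred @ H.T + R)  # Kalman gain
        x = x_pred + K @ (y - H @ x_pred)
        V = V_pred - K @ H @ V_pred
        xp.append(x_pred); Vp.append(V_pred); xf.append(x); Vf.append(V)
    xs, Vs = list(xf), list(Vf)                      # backward smoothing
    for k in range(T - 2, -1, -1):
        Jk = Vf[k] @ F.T @ np.linalg.inv(Vp[k + 1])
        xs[k] = xf[k] + Jk @ (xs[k + 1] - xp[k + 1])
        Vs[k] = Vf[k] + Jk @ (Vs[k + 1] - Vp[k + 1]) @ Jk.T
    return np.asarray(xs), Vs
```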
40
Experiments with artificial data
Experiment description:
1. Randomly set the system parameters based on the canonical forms.
2. Check that the system is controllable, observable, and stable.
3. Generate artificial data from the system.
4. From these artificial data, estimate the system parameters using the Kalman smoother counts and the element-wise re-estimation process.
Two criteria for evaluating the system: the log likelihood of the observations, and the distance between the actual and the estimated parameter values.
41
Experiments with artificial data
Dimension of F: 9x9; observation vector size: 3x1; rows with free parameters: 3; number of samples: 1000.
42
Without state control
Dimension of F: 9x9; observation vector size: 3x1; rows with free parameters: 3; number of samples: 1000.
43
Conclusions and remarks on the experiments with artificial data
- Initialization of the parameters significantly influences parameter convergence.
- The likelihood always increases, but EM might converge to a local optimum.
- State control may be helpful in some cases.
- The covariances of the Gaussian noises could be full matrices.
- Increasing the number of samples beyond 1000 does not alter the system's behavior.
- The system shows slow convergence, especially when the state dimension increases.
44
Model Training on Speech Data
- Aurora 2 database, 77 training sentences.
- Word models with different numbers of states, based on the phonetic transcription.
- State alignments produced using HTK.

States  Models
2       oh
4       two, eight
6       one, three, four, five, nine
8       zero, six
10      seven
45
Speech Data Modeling
46
Training on Speech Data (log likelihood vs. EM iteration)
47
Classification process
- Produce alignments for each test sentence based on every trained model; classification is based on the likelihood.
- Test set: clean Aurora 2 sentences (single-word transcriptions), 100 test sentences.
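The decision rule itself is a maximum-likelihood pick over the trained word models; a sketch, where `loglik` stands in for the segment-model alignment likelihood of the slides.

```python
def classify(sentence, models, loglik):
    """Align `sentence` against every trained word model and return the word
    whose alignment yields the highest log likelihood."""
    return max(models, key=lambda word: loglik(sentence, models[word]))
```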
48
Classification results Work in progress
49
Segment Models – Workplan
- Further investigation of the state space on speech data:
  - State of the dynamical system: use formants and other articulatory features to initialize the state vectors.
  - Dimension of the state: examine different dimensions of the state vector.
- Investigate more canonical forms of parameters that guarantee system identifiability.
- Extension to a non-linear dynamical system: use of the extended Kalman filter; derivation of the EM re-estimation formulae for the non-linear case (extending the EM re-estimation formulae presented in Digalakis et al.).
50
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2: Non-Linear Features (collaboration with ICCS-NTUA)
- Task 4: Segment models for ASR
- Task 5: Feature combination
Work package 2
- Task 2: VTLN
51
Optimal Bayes' Adaptation
- Acoustic model adaptation for DMHMMs: update the output probabilities.
- Bayes-optimal classification: θ is a set of N-dimensional discrete distributions, with N = 2^B.
52
Optimal Bayes' Adaptation (cont.)
Bayes-optimal approximation: Θ is a discrete finite set. Adaptation computes the posteriors p(θ | adaptation data) ∝ p(adaptation data | θ) p(θ) for each θ in Θ.
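A sketch of this computation for a single discrete output distribution, under the assumption of multinomial adaptation counts; the array layout and names are ours.

```python
import numpy as np

def bayes_optimal_update(candidates, prior, counts):
    """candidates: (M, N) array, each row one discrete distribution theta in Theta
    (entries assumed strictly positive).  prior: (M,) prior over Theta.
    counts: (N,) adaptation counts per symbol.  Returns the posterior over
    Theta and the Bayes-optimal output distribution sum_theta p(x|theta) p(theta|data)."""
    log_post = counts @ np.log(candidates).T + np.log(prior)  # multinomial log lik + log prior
    log_post -= log_post.max()                                # stabilize the exponentiation
    post = np.exp(log_post)
    post /= post.sum()
    return post, post @ candidates
```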
53
Genone Clustering
Clustering is based on the model's central phone. (Diagram: genones 1, 2, …, each with mixture components 1…M; per-subvector parameter sets Θ(subvector 1), …, Θ(subvector L); the numbers of centroids, subvectors, and mixture components are encoded with K, N, and R bits, respectively.)
54
Computations
- Prior computation.
- Likelihood probabilities: based on the adaptation counts.
- Consider smoothing techniques when the amount of adaptation data is insufficient.
55
Adaptation Results
SI performance and Bayes-optimal adaptation performance. (Results table omitted.)