1
HIWIRE Progress Report
Technical University of Crete, Speech Processing and Dialog Systems Group
Presenters: Alex Potamianos, Vasilis Diakoloukas
2
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
4
Blind Speech Separation (BSS) problem (the "cocktail party" problem): recover a set of unobserved speech signals from a set of observed mixtures, with no prior information about the mixing system.
5
Data Model – Problem Statement
Convolutive speech mixtures: in the real world we have to deal with

x(t) = Σ_{τ=0..L} A(τ) s(t-τ) + n(t)    (4)

where A(τ) is the mixing impulse-response matrix, its i-th column a_i(τ) is the spatial signature of the i-th speaker for lag τ, n(t) is the additive noise vector, and L is the channel order.

Objective: estimate the inverse-channel impulse-response matrix W(τ) from the observed signals, such that

ŝ(t) = Σ_τ W(τ) x(t-τ) ≈ s(t)    (5)
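To make the data model concrete, here is a minimal Python sketch of equation (4); the dimensions are made up, random filters stand in for real room responses, and white noise stands in for speech, so this is purely illustrative, not the experimental setup of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, L, T = 2, 2, 8, 4000                    # speakers, microphones, channel order, samples
s = rng.standard_normal((I, T))               # stand-in for the speech sources
A = 0.5 * rng.standard_normal((L + 1, J, I))  # mixing impulse responses A(tau)

# x(t) = sum_{tau=0}^{L} A(tau) s(t - tau) + n(t)   -- equation (4)
x = np.zeros((J, T))
for tau in range(L + 1):
    x[:, tau:] += A[tau] @ s[:, :T - tau]
x += 0.01 * rng.standard_normal((J, T))       # additive noise vector n(t)
```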
6
Frequency-domain approach
Estimate {A(f), for all f} from the estimated autocorrelation data via PARAFAC, under the assumption rank(A(f)) = I, and obtain the unmixing system from IDFT{Â(f)}.
Problem: a frequency-dependent permutation and scaling ambiguity remains. (Diagram: at bins f0, f1, f2, f3 the two sources, labeled L for Lefteris and K for Kleanthis, appear with different orderings and scalings, e.g. [1.5L, 3K]^T, [12L, 0.5K]^T, [30K, 0.1L]^T, [2K, 2L]^T.)
7
Frequency-dependent permutation and scaling ambiguities
Resolution of the permutation ambiguity, Criterion 1:
- G_{j,i}(f): frequency response of the acoustic channel between the j-th microphone and the i-th speaker.
- H_j(f): j-th microphone's frequency response.
- τ_{j,i}: delay due to the time the sound needs to cover the distance between the i-th speaker and the j-th microphone.
Assumptions:
- Non-reverberant (anechoic) environment: G_{j,i}(f) ≈ n_{j,i} G(f), where n_{j,i} is an attenuation coefficient and n_{j,i} ~ 1/d_{j,i}.
- The microphones are identical: H_j(f) ≈ H for all j = 1,…,J.
8
Frequency-dependent permutation and scaling ambiguities (cont.)
The elements of the mixing matrices A(f), f = 0,…,T-1, may be decomposed into products of the form

A_{j,i}(f) = G_{j,i}(f) H_j(f) e^{-j2πfτ_{j,i}}    (15)

Normalize the columns of the output matrices of the estimation step w.r.t. the first microphone, and consider the vector whose elements are the magnitudes of the respective elements of the normalized column associated with the i-th speaker:

c_i(f) = [ |A_{1,i}(f)|, …, |A_{J,i}(f)| ]^T / |A_{1,i}(f)| ≈ [ 1, n_{2,i}/n_{1,i}, …, n_{J,i}/n_{1,i} ]^T    (16)

Under the above assumptions, (16) depends only on i, not on f.
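Turning (16) into code is mechanical; the following sketch assumes the estimation step has produced an array `A_hat` of per-bin mixing-matrix estimates (the array layout and the function name are our own, not the authors'):

```python
import numpy as np

def column_signatures(A_hat):
    """Eq. (16): for every frequency bin f and speaker i, the magnitudes of the
    i-th column of A_hat(f) after normalization w.r.t. the first microphone.

    A_hat: complex array of shape (F, J, I), one estimated mixing matrix per bin.
    Returns a real array of shape (F, I, J); under the anechoic assumptions
    each (f, i) vector depends only on i, so the vectors cluster by speaker."""
    normalized = A_hat / A_hat[:, :1, :]          # divide each column by its mic-1 entry
    return np.abs(normalized).transpose(0, 2, 1)  # one J-vector per (bin, speaker)
```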
9
Frequency-dependent permutation and scaling ambiguities (cont.)
Hopefully, even for reverberant environments, the collection of vectors can be divided into I separate clusters, each cluster associated with a specific speaker (illustrated in the slide by two clusters with centroids c_1 and c_2).
10
Frequency-dependent permutation and scaling ambiguities (cont.)
Cluster separability. (Diagram: microphone 1 and speakers Sp. 1, Sp. 2, with speaker-microphone distances d_11, d_12, d_21, d_22; candidate speaker positions A, B, C on a circle of radius r; VQ clustering.) For fixed r with d_ii = r: position A gives no separability, B medium separability, and C maximum separability, so separability depends on the distances d_ii relative to r. We also showed that the cluster variance shrinks (and separability grows) when the speaker-microphone distances are comparable to the room dimensions.
11
- Apply the VQ clustering procedure over all the available vectors.
- Determine the I centroids {c_1,…,c_I} of the I clusters and construct C = [c_1,…,c_I].
- The frequency-dependent permutation problem then boils down to an Integer Least Squares (ILS) minimization: for each f, choose among the possible permutation matrices Π_i, i = 1,…,I!, the one that brings A_n(f) closest to the centroid matrix C (written via an indicator function over the permutation choice), where A_n(f) is the matrix constructed from the normalized vectors of A_{s,p}(f).
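A minimal sketch of this step: cluster all eq.-(16) vectors with VQ (k-means here) and, per bin, solve the permutation problem by enumerating the I! permutations, which is feasible for small I. The names and the use of scipy's `kmeans2` are our choices, not the authors'.

```python
import numpy as np
from itertools import permutations
from scipy.cluster.vq import kmeans2

def fix_permutations(signatures):
    """signatures: array (F, I, J) of eq.-(16) vectors, one per (bin, speaker).
    Clusters them into I groups and, for each bin, returns the speaker
    permutation closest to the centroids (the ILS minimization of the slide)."""
    F, I, J = signatures.shape
    centroids, _ = kmeans2(signatures.reshape(F * I, J), I, seed=0)  # {c_1,...,c_I}
    perms = []
    for f in range(F):
        cost = lambda p: sum(np.sum((signatures[f, p[i]] - centroids[i]) ** 2)
                             for i in range(I))
        perms.append(min(permutations(range(I)), key=cost))  # enumerate the I! options
    return perms, centroids
```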
12
Frequency-dependent permutation and scaling ambiguities (cont.)
Criterion 2. For adjacent frequency bins f_1 and f_2 = f_1 + 1, A(f_1) ≈ A(f_2); this holds for dense FFT grids. Based on this fact, we formulate our second ILS minimization criterion (18).
Note: (18) fixes the permutation ambiguity globally across all frequency bins, which is more robust than a sequential approach, where the wrong detection of a single permutation can cause wrong permutations over a large frequency block.
13
Frequency-dependent permutation and scaling ambiguities (cont.)
Combined Criterion. Combine the two previous criteria into one overall ILS minimization criterion (19), with λ weighing Criterion 2 against Criterion 1. As λ grows (practically, for λ > 1), (19) coincides with Criterion 2.
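A sketch of how (19) could look in code, under the assumption that the combined criterion is a λ-weighted sum of a centroid-matching term (Criterion 1) and an adjacent-bin smoothness term (Criterion 2); the exact weighting in the paper may differ.

```python
import numpy as np

def combined_cost(sig_f, centroids, A_prev, A_curr, perm, lam=1.0):
    """Per-bin cost of a candidate permutation `perm`, eq.-(19) style:
    Criterion 1: distance of the bin's eq.-(16) vectors to their centroids;
    Criterion 2: change of the permuted mixing matrix w.r.t. the previous bin.
    Summed over all bins and minimized jointly, this fixes the permutations
    globally; lam > 1 makes Criterion 2 dominate."""
    I = len(perm)
    c1 = sum(np.sum((sig_f[perm[i]] - centroids[i]) ** 2) for i in range(I))
    c2 = np.sum(np.abs(A_curr[:, list(perm)] - A_prev) ** 2)
    return c1 + lam * c2
```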
14
Experimental Results
Measurement setup.
17
Comparison with previous methods
- Parra and Spence (2000): convergence in question; limited speed and identifiability potential.
- Mitianoudis and Davies (2003): convergence in question; limited identifiability potential; sequential approach to permutation-problem resolution.
- Pham et al. (2003): limited identifiability potential; sequential approach to permutation-problem resolution.
- Rahbar and Reilly (2005): high complexity; limited identifiability potential.
18
Conclusions
- Improved performance relative to Parra.
- Complexity reduced by 1-2 orders of magnitude relative to Parra, bringing execution times within range for on-line use, at least for teleconferencing and possibly also cellular-telephony applications.
- Guaranteed convergence of the overall algorithm.
- Interpretable criteria for resolving the permutation/scaling ambiguity, leading to a well-known ILS problem for which many good approximate solutions exist.
- Our approach reveals the much broader identifiability potential of joint-diagonalization-based BSS methods, which went unrecognized in the past.
- Our permutation-correction scheme prevents the catastrophic errors that can result from a wrongly adjusted permutation.
19
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
20
Motivation
- Combining classifiers/information sources is an important problem in machine-learning applications.
- A simple yet powerful way to combine classifiers is the "multi-stream" approach, which assumes independent information sources.
- Unsupervised stream-weight computation for multi-stream classifiers is an open problem.
21
Problem Definition
Compute "optimal" exponent weights for each stream s [HMM Gaussian-mixture formulation; similar expressions hold for MM, naïve Bayes, and Euclidean/Mahalanobis classifiers]. Optimality is in the sense of minimizing the "total classification error".
22
Goals
- Obtain analytical expressions for the total error in Bayes classification.
- Compute optimal stream weights that minimize the total error (Bayes + estimation + model error).
- Propose estimators of the optimal stream weights that can be computed in an "unsupervised" way.
23
Multi-Stream Classification
- Two-class problem: w_1, w_2.
- The feature vector x is broken up into two independent streams x_1 and x_2.
- Stream weights s_1 and s_2 are used to "equalize" the "probabilities".
24
Multi-Stream Classification (cont.)
- Bayes classification decision: choose w_1 if s_1 log p(x_1|w_1) + s_2 log p(x_2|w_1) > s_1 log p(x_1|w_2) + s_2 log p(x_2|w_2).
- Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease; stream weights can therefore decrease the total error.
- "Optimal" weights minimize the estimation-error variance σ_z².
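A toy version of the weighted decision rule, assuming Gaussian stream models; the class models, dimensions, and parameter values are all illustrative, not taken from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_decide(x1, x2, models, s1=1.0, s2=1.0):
    """Two-stream Bayes decision with exponent stream weights: pick the class
    maximizing s1*log p(x1|w) + s2*log p(x2|w)."""
    scores = {w: s1 * d1.logpdf(x1) + s2 * d2.logpdf(x2)
              for w, (d1, d2) in models.items()}
    return max(scores, key=scores.get)

# illustrative class models: a 2-d stream and a 1-d stream per class (unit covariance)
models = {"w1": (multivariate_normal([0, 0]), multivariate_normal([0])),
          "w2": (multivariate_normal([1, 1]), multivariate_normal([2]))}
print(multistream_decide(np.array([0.1, -0.2]), np.array([1.7]), models, s1=0.7, s2=0.3))
```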
25
Multi-Stream Classification (cont.)
Estimate the variance of z close to the decision boundary:

σ_z² = s_1² S_1² + s_2² S_2²

where S_1² and S_2² are the total stream estimation-error variances for streams 1 and 2 (summed over all classes).
26
Optimal Stream Weights (cont.)
Equal error rate in the single-stream classifiers ⇒ the optimal stream weights are inversely proportional to the total stream estimation-error variances (s_i ∝ 1/S_i²).
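This rule follows from minimizing σ_z² = s_1²S_1² + s_2²S_2² subject to s_1 + s_2 = 1, and takes a few lines to implement; the normalization to unit sum is our convention.

```python
def optimal_stream_weights(S1_sq, S2_sq):
    """Stream weights inversely proportional to the total stream estimation
    error variances S1^2, S2^2, normalized so that s1 + s2 = 1."""
    s1, s2 = 1.0 / S1_sq, 1.0 / S2_sq
    return s1 / (s1 + s2), s2 / (s1 + s2)

print(optimal_stream_weights(0.5, 1.0))  # the noisier stream gets the smaller weight
```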
27
Optimal Stream Weights (cont.)
Simulation for equal error rate in the single-stream classifiers: the estimation/modeling error is obtained using the histogram and a given Gaussian. (Figure omitted.)
28
Optimal Stream Weights (cont.)
Simulation: optimal vs. computed stream weights. (Figure omitted.)
29
Optimal Stream Weights (cont.)
Equal estimation-error variance in each stream ⇒ the optimal weights are approximately inversely proportional to the single-stream classification errors.
30
Experimental Results
- Test case: audio-visual continuous digit recognition task.
- Differences from the ideal two-class case: multi-class problem; recognition instead of classification.
- Two feature streams: audio (d = 39), video (d = 105).
- Multiple experiments: clean video stream; noise-corrupted audio streams at various SNRs.
31
Experimental Results (cont.)
- Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker.
- Training set: 1500 digits (30x5x10). Test set: 300 digits (6x5x10).
- Features: audio, 39 features (MFCC_D_A); visual, 105 features (ROIDCT_D_A).
- Multi-stream HMM models, middle integration: 8-state, left-to-right, whole-digit HMMs; single Gaussian mixture; the AV-HMM uses separate audio and video feature streams.
32
Optimal Stream Weights: Results
Assumption: S_V² / S_A² = 2, independent of SNR. Correlation between theoretical and experimentally computed weights: 0.96.
33
AV-ASR Results (Matched Scenario)
34
AV-ASR Results (Mismatched Scenario)
35
Conclusions
- Towards unsupervised estimation of stream weights for multi-stream classification.
- Analytical results for two-class, two-stream classification.
- Stream weights should be inversely proportional to the stream estimation-error variance and to the single-stream classification error.
- For AV-ASR, we show good correlation between theoretical and experimental results.
36
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2/5: Feature combination
- Task 4: Segment models for ASR
Work package 2
- Task 2: VTLN
37
Dynamical System Segment Model
- Based on a linear dynamical system.
- The system parameters should guarantee identifiability, controllability, observability, and stability.
- Until now, the parameter structures considered have been very restricted; we investigated more generalized parameter structures.
38
Generalized forms of parameter structures
Linear dynamical system with state control:

x_{k+1} = F x_k + B u_k + w_k,  w_k ~ N(0, P)
y_k     = H x_k + v_k,          v_k ~ N(0, R)

The parameters F, B, H have canonical forms (Ljung, "System Identification").
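A minimal simulation of such a system, with arbitrary dimensions and noise levels; the canonical structures of F, B, H are not enforced here, the matrices are simply random, so this only illustrates the state/observation equations.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, p, T = 4, 1, 2, 50                 # state, control, observation dims; segment length
F = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))  # state transition (near-stable)
B = rng.standard_normal((n, m))          # control-input matrix
H = rng.standard_normal((p, n))          # observation matrix

x, Y = np.zeros(n), []
for k in range(T):
    u = np.ones(m)                                    # constant control input
    x = F @ x + B @ u + 0.1 * rng.standard_normal(n)  # state noise w_k ~ N(0, P)
    Y.append(H @ x + 0.1 * rng.standard_normal(p))    # observation noise v_k ~ N(0, R)
Y = np.asarray(Y)                        # the observed feature-vector segment
```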
39
Parameter Estimation
- Use of the EM algorithm to estimate the parameters F, B, P, R.
- We propose a new element-wise parameter estimation algorithm.
- For the forward-backward recursions, Kalman smoother recursions are used.
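For reference, a compact sketch of the Kalman (Rauch-Tung-Striebel) smoother recursions that supply the E-step statistics; the control input is omitted for brevity, and treating (x0, V0) as the prior on the initial state is our convention.

```python
import numpy as np

def rts_smoother(Y, F, H, P, R, x0, V0):
    """Forward Kalman filter + RTS backward pass; the smoothed means and
    covariances are the sufficient statistics ("counts") for the EM E-step."""
    T = len(Y)
    xp, Vp, xf, Vf = [], [], [], []
    x, V = x0, V0                                    # prior on the initial state
    for y in Y:                                      # forward filtering
        x_pred, V_pred = F @ x, F @ V @ F.T + P
        K = V_pred @ H.T @ np.linalg.inv(H @ V_pred @ H.T + R)  # Kalman gain
        x = x_pred + K @ (y - H @ x_pred)
        V = V_pred - K @ H @ V_pred
        xp.append(x_pred); Vp.append(V_pred); xf.append(x); Vf.append(V)
    xs, Vs = list(xf), list(Vf)                      # backward smoothing
    for k in range(T - 2, -1, -1):
        Jk = Vf[k] @ F.T @ np.linalg.inv(Vp[k + 1])
        xs[k] = xf[k] + Jk @ (xs[k + 1] - xp[k + 1])
        Vs[k] = Vf[k] + Jk @ (Vs[k + 1] - Vp[k + 1]) @ Jk.T
    return np.asarray(xs), Vs
```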
40
Experiments with artificial data
Experiment description:
1. Randomly set the system parameters based on the canonical forms.
2. Check that the system is controllable, observable, and stable.
3. Generate artificial data from the system.
4. From these artificial data, estimate the system parameters using the Kalman smoother counts and the element-wise re-estimation process.
Two criteria for evaluating the system: the log likelihood of the observations, and the distance between the actual and the estimated parameter values.
41
Experiments with artificial data
Dimension of F: 9x9; observation vector size: 3x1; rows with free parameters: 3; number of samples: 1000.
42
Without state control
Dimension of F: 9x9; observation vector size: 3x1; rows with free parameters: 3; number of samples: 1000.
43
Conclusions and remarks on the experiments with artificial data
- Initialization of the parameters significantly influences parameter convergence.
- The likelihood always increases, but EM might converge to a local optimum.
- State control may be helpful in some cases.
- The covariances of the Gaussian noises could be full matrices.
- Increasing the number of samples beyond 1000 does not alter the system's behavior.
- The system shows slow convergence, especially when the state dimension increases.
44
Model Training on Speech Data
- Aurora 2 database, 77 training sentences.
- Word models with different numbers of states, based on the phonetic transcription.
- State alignments produced using HTK.

States  Models
2       oh
4       two, eight
6       one, three, four, five, nine
8       zero, six
10      seven
45
Speech Data Modeling
46
Training on Speech Data (log likelihood vs. EM iteration)
47
Classification process
- Produce alignments for each test sentence based on every trained model; classification is based on the likelihood.
- Test set: clean Aurora 2 sentences (single-word transcriptions), 100 test sentences.
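The decision rule itself is a maximum-likelihood pick over the trained word models; a sketch, where `loglik` stands in for the segment-model alignment likelihood of the slides.

```python
def classify(sentence, models, loglik):
    """Align `sentence` against every trained word model and return the word
    whose alignment yields the highest log likelihood."""
    return max(models, key=lambda word: loglik(sentence, models[word]))
```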
48
Classification results Work in progress
49
Segment Models – Workplan
- Further investigation of the state space on speech data:
  - State of the dynamical system: use formants and other articulatory features to initialize the state vectors.
  - Dimension of the state: examine different dimensions of the state vector.
- Investigate more canonical forms of parameters that guarantee system identifiability.
- Extension to a non-linear dynamical system: use of the extended Kalman filter; derivation of the EM re-estimation formulae for the non-linear case (extending the EM re-estimation formulae presented in Digalakis et al.).
50
Outline
Work package 1
- Task 1: Blind Source Separation for multi-microphone ASR
- Task 2: Non-Linear Features (collaboration with ICCS-NTUA)
- Task 4: Segment models for ASR
- Task 5: Feature combination
Work package 2
- Task 2: VTLN
51
Optimal Bayes' Adaptation
- Acoustic model adaptation for DMHMMs: update the output probabilities.
- Bayes-optimal classification: θ is a set of N-dimensional discrete distributions, with N = 2^B.
52
Optimal Bayes' Adaptation (cont.)
Bayes-optimal approximation: Θ is a discrete finite set. Adaptation computes the posteriors p(θ | adaptation data) ∝ p(adaptation data | θ) p(θ) for each θ in Θ.
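A sketch of this computation for a single discrete output distribution, under the assumption of multinomial adaptation counts; the array layout and names are ours.

```python
import numpy as np

def bayes_optimal_update(candidates, prior, counts):
    """candidates: (M, N) array, each row one discrete distribution theta in Theta
    (entries assumed strictly positive).  prior: (M,) prior over Theta.
    counts: (N,) adaptation counts per symbol.  Returns the posterior over
    Theta and the Bayes-optimal output distribution sum_theta p(x|theta) p(theta|data)."""
    log_post = counts @ np.log(candidates).T + np.log(prior)  # multinomial log lik + log prior
    log_post -= log_post.max()                                # stabilize the exponentiation
    post = np.exp(log_post)
    post /= post.sum()
    return post, post @ candidates
```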
53
Genone Clustering
Clustering is based on the model's central phone. (Diagram: genones 1, 2, …, each with mixture components 1…M; per-subvector parameter sets Θ(subvector 1), …, Θ(subvector L); the numbers of centroids, subvectors, and mixture components are encoded with K, N, and R bits, respectively.)
54
Computations
- Prior computation.
- Likelihood probabilities: based on the adaptation counts.
- Consider smoothing techniques when the amount of adaptation data is insufficient.
55
Adaptation Results
SI performance and Bayes-optimal adaptation performance. (Results table omitted.)