HIWIRE Progress Report, Trento, January 2007. Presenter: Prof. Alex Potamianos, Technical University of Crete.
Outline
- Long Term Research: Audio-Visual Processing (WP1), Segment Models (WP1), Bayes’ Optimal Adaptation (WP2)
- Research for the Platforms: New Features and Fusion
- Integration on Year 2 Platforms: Mobile Platform, Fixed Platform
Stream Weights: Motivation
- ASR performance is low at low SNR, which motivates combining several sources of information.
- The sources of information are not equally reliable across environments and noise conditions.
- Training and test conditions are mismatched.
- Unsupervised stream-weight computation for multi-stream classifiers is an open problem.
Problem Definition
- Compute "optimal" exponent weights s_i for each stream.
- Optimality is understood as minimizing the total classification error.
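For concreteness, the usual exponent-weighted multi-stream score that this setup refers to, sketched for two streams (the sum-to-one constraint is a common convention, assumed here rather than taken from the slides):

```latex
% Exponent-weighted (log-linear) combination of the two stream likelihoods.
% s_1, s_2 are the stream weights to be optimized.
\[
  \log p(\mathbf{x} \mid w) \;=\; s_1 \log p(\mathbf{x}_1 \mid w)
                              \;+\; s_2 \log p(\mathbf{x}_2 \mid w),
  \qquad s_1 + s_2 = 1, \quad s_i \ge 0 .
\]
```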
Total Error Computation
- Two-class problem w_1, w_2 for the feature vector x.
- Feature pdfs: p(x | w_1) and p(x | w_2).
- The estimation/modeling error is assumed to be a normal random variable z_i.
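The slide's derivation is not preserved in this transcript; one reading consistent with the assumptions above is that each estimated single-stream log-likelihood carries additive Gaussian noise:

```latex
% Estimated log-likelihood = true log-likelihood + Gaussian estimation error.
\[
  \log \hat{p}(\mathbf{x}_i \mid w) \;=\; \log p(\mathbf{x}_i \mid w) + z_i,
  \qquad z_i \sim \mathcal{N}(0, \sigma_{s_i}^2),
\]
% so the weighted score of the previous slide accumulates the combined
% error s_1 z_1 + s_2 z_2, whose variance is minimized next.
```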
Optimal Stream Weights (1)
- Minimize the total error variance σ² with respect to the stream weights.
- Two interesting cases:
  - Equal error rate in the single-stream classifiers: p(x_1 | w_1) = p(x_2 | w_1) in the decision region.
  - Equal estimation-error variance in each stream: σ_{s1}² = σ_{s2}².
Optimal Stream Weights (2)
- Case 1: equal error rate in the single-stream classifiers.
- Case 2: equal estimation-error variance in each stream.
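The closed-form weights shown on this slide did not survive extraction; a minimal sketch of the variance-minimization step, assuming independent stream errors that combine linearly under the constraint s_1 + s_2 = 1:

```latex
% Variance of the combined estimation error:
\[
  \sigma^2(s_1) \;=\; s_1^2 \sigma_{s_1}^2 + (1 - s_1)^2 \sigma_{s_2}^2 .
\]
% Setting the derivative to zero yields inverse-variance weighting:
\[
  s_1 = \frac{\sigma_{s_2}^2}{\sigma_{s_1}^2 + \sigma_{s_2}^2}, \qquad
  s_2 = \frac{\sigma_{s_1}^2}{\sigma_{s_1}^2 + \sigma_{s_2}^2},
  \qquad \text{i.e.} \quad s_i \propto 1/\sigma_{s_i}^2 ,
\]
% which reduces to s_1 = s_2 = 1/2 in the equal-variance case above.
```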
Antimodels, Inter- and Intra-Class Distances
- The multi-class problem is reposed as (multiple) two-class classification problems.
- If p(x | w) follows a Gaussian distribution N(μ, σ²), the Bayes error is a function of D = |μ_1 − μ_2| / σ.
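For reference, the standard two-class result behind this statement (equal priors and equal variances assumed), which makes the dependence on D explicit:

```latex
% Two classes N(mu_1, sigma^2) and N(mu_2, sigma^2) with equal priors:
% the optimal boundary is the midpoint, and the Bayes error is
\[
  P_e \;=\; Q\!\left(\frac{D}{2}\right)
        \;=\; \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{D}{2\sqrt{2}}\right),
  \qquad D = \frac{|\mu_1 - \mu_2|}{\sigma},
\]
% so a larger normalized inter-class distance D means a lower error.
```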
Experimental Results (1)
- Test case: audio-visual continuous digit recognition.
- Differences from the ideal two-class case: a multi-class problem, and recognition instead of classification.
- Multiple experiments: clean video stream; noise-corrupted audio streams at various SNRs.
Experimental Results (2)
- Subset of the CUAVE database: 36 speakers (30 training, 6 testing), 5 sequences of 10 connected digits per speaker.
  - Training set: 1500 digits (30 × 5 × 10); test set: 300 digits (6 × 5 × 10).
- Features:
  - Audio: 39 features (MFCC_D_A).
  - Visual: 39 features (ROIDCT_D_A, odd columns).
- Multi-stream HMM models:
  - 8-state, left-to-right, whole-digit HMMs with single-Gaussian mixtures.
  - The AV-HMM uses separate audio and video feature streams.
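To make the stream combination concrete, a minimal sketch of how one state's audio and video emission scores would be fused with exponent weights; the Gaussians here are toy stand-ins, not the trained models from the experiments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def av_log_likelihood(x_audio, x_video, audio_pdf, video_pdf, s_audio, s_video):
    """Exponent-weighted fusion of per-stream emission log-likelihoods."""
    return (s_audio * audio_pdf.logpdf(x_audio)
            + s_video * video_pdf.logpdf(x_video))

# Toy usage with 39-dimensional streams, matching the feature sizes above.
dim = 39
rng = np.random.default_rng(0)
audio_pdf = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))
video_pdf = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))
score = av_log_likelihood(rng.normal(size=dim), rng.normal(size=dim),
                          audio_pdf, video_pdf, s_audio=0.7, s_video=0.3)
```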
Weights’ distribution
Results (classification)
Inter-/Intra-Class Distances and Recognition
- In each stream, a total inter-/intra-class distance is computed.
Results (recognition)
Conclusions
- We proposed a stream-weight computation method for a multi-class classification task, based on theoretical results for the two-class problem and on an anti-model technique.
- Only the test utterance and the information contained in the trained models are used.
- The results are of interest for unsupervised estimation of stream weights in multi-stream classification and recognition problems.
Dynamical System Segment Model
- Segment models directly model the time evolution of the speech parameters.
- Based on a linear dynamical system.
- The system parameters should guarantee identifiability, controllability, observability, and stability.
- Only simple matrix topologies have been studied so far.
Linear Dynamical System with State Control
- Parameters F, B, H have canonical forms (Ljung, "System Identification").
- Generalized forms of the parameter structures are considered.
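The slide's equations were lost in extraction; the standard state-space form with a control input, consistent with the parameter names F, B, H above and the noise covariances P, R estimated on the next slide, is:

```latex
% State equation: control input u_t, process noise w_t ~ N(0, P).
\[
  \mathbf{x}_{t+1} \;=\; F\,\mathbf{x}_t + B\,\mathbf{u}_t + \mathbf{w}_t ,
\]
% Observation equation: measurement noise v_t ~ N(0, R).
\[
  \mathbf{y}_t \;=\; H\,\mathbf{x}_t + \mathbf{v}_t .
\]
```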
Parameter Estimation
- The EM algorithm is used to estimate the parameters F, B, P, R.
- We propose a new element-wise parameter estimation algorithm.
- For the forward-backward recursions, Kalman smoother recursions are used (see the sketch below).
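For reference, a minimal NumPy implementation of the Kalman filter plus Rauch-Tung-Striebel smoother that serves as the E-step workhorse; this is a generic textbook sketch, not the element-wise estimator proposed here. The slides' process-noise covariance P appears as Q below to avoid clashing with the filter covariances:

```python
import numpy as np

def rts_smoother(y, F, H, Q, R, x0, P0, Bu=None):
    """Kalman filter + RTS smoother for
    x_{t+1} = F x_t + B u_t + w_t,  y_t = H x_t + v_t.
    Bu: optional (T, n) array of precomputed control terms B @ u_t."""
    T, n = len(y), len(x0)
    Bu = np.zeros((T, n)) if Bu is None else Bu
    xf, Pf = np.zeros((T, n)), np.zeros((T, n, n))   # filtered estimates
    xp, Pp = np.zeros((T, n)), np.zeros((T, n, n))   # one-step predictions
    x_pred, P_pred = x0, P0
    for t in range(T):
        xp[t], Pp[t] = x_pred, P_pred
        # Measurement update.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        xf[t] = x_pred + K @ (y[t] - H @ x_pred)
        Pf[t] = (np.eye(n) - K @ H) @ P_pred
        # Time update (prediction for t + 1).
        x_pred = F @ xf[t] + Bu[t]
        P_pred = F @ Pf[t] @ F.T + Q
    # Backward (smoothing) pass.
    xs, Ps = xf.copy(), Pf.copy()
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps
```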
Experiments with Artificial Data
- Experiment description (see the sketch below):
  - Select random system parameters (using a canonical matrix topology).
  - Generate artificial data from the system.
  - Estimate the parameters from the artificial data.
- Evaluation criteria:
  - The log-likelihood of the observations increases per EM iteration.
  - The parameter estimation error decreases per EM iteration.
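A minimal sketch of the data-generation step, with arbitrary stable toy dynamics standing in for the canonical topology (dimensions chosen to match the configuration on the next slide):

```python
import numpy as np

def simulate_lds(F, B, H, Q, R, u, x0, rng):
    """Sample observations from x_{t+1} = F x_t + B u_t + w_t,
    y_t = H x_t + v_t, with w_t ~ N(0, Q) and v_t ~ N(0, R)."""
    n, m = F.shape[0], H.shape[0]
    x, ys = x0, []
    for u_t in u:
        ys.append(H @ x + rng.multivariate_normal(np.zeros(m), R))
        x = F @ x + B @ u_t + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(ys)

rng = np.random.default_rng(0)
n = 3                                  # state dimension, as on the next slide
F = np.diag(rng.uniform(0.2, 0.9, n))  # stable diagonal dynamics (assumption)
B, H = np.eye(n), np.eye(n)
Q, R = 0.01 * np.eye(n), 0.1 * np.eye(n)
u = rng.normal(size=(1000, n))         # 1000 samples, as on the next slide
y = simulate_lds(F, B, H, Q, R, u, np.zeros(n), rng)
```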
Without State Control
- Dimension of F: 3 × 3; observation vector size: 3 × 1.
- Number of rows with free parameters: 3.
- Number of samples: 1000.
Model Training on Speech Data
- Aurora 2 database, 77 training sentences.
- Word models with different numbers of states, based on the phonetic transcription.
- State alignments produced using HTK.

  Segments | Models
  2        | oh
  4        | two, eight
  6        | one, three, four, five, six, nine, zero
  8        | seven
Speech Segment Modeling
Classification Process
- Keep the true word boundaries fixed: digit-level alignments produced by an HMM.
- Apply a suboptimal search-and-pruning algorithm: keep the 11 most probable word histories for each word in the sentence (see the sketch below).
- Classification is based on maximizing the likelihood.
- Test set: Aurora 2, test set A, subway sentences; 1000 test sentences; different noise levels (clean and SNR 20, 15, 10, 5 dB).
- Front-end: standard HTK front-end extracting 14-dimensional static features.
- Two feature configurations:
  - 12 cepstral coefficients + C0 + energy.
  - The above plus first- and second-order derivatives (δ, δδ).
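A schematic of the history-pruning step; the data layout and scores are illustrative only:

```python
import heapq

def prune_histories(histories, beam=11):
    """Keep the `beam` most probable word histories.
    `histories` is a list of (log_likelihood, word_sequence) pairs."""
    return heapq.nlargest(beam, histories, key=lambda h: h[0])

# Toy usage: competing histories after the current word position.
hyps = [(-120.5, ["one", "two"]), (-118.2, ["one", "oh"]),
        (-130.0, ["nine", "two"])]
best = prune_histories(hyps)  # best[0] is the most probable history
```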
Classification Results
- Comparison of segment-model and HTK HMM classification (% accuracy).
- Same front-end configuration and same alignments; both models trained on clean data.

  AURORA   | HMM (HTK)          | Segment Models
  Subway   | MFCC,E  | +δ,δδ    | MFCC,E  | +δ,δδ
  Clean    | 97.19%  | 97.57%   | 97.53%  | 97.61%
  SNR 20   | 90.91%  | 95.71%   | 93.23%  | 95.12%
  SNR 15   | 80.09%  | 91.76%   | 87.91%  | 91.13%
  SNR 10   | 57.68%  | 81.93%   | 76.29%  | 82.69%
  SNR 5    | 36.01%  | 64.24%   | 54.87%  | 63.56%
Conclusions and Future Work
- Without derivatives, segment models significantly outperform HMMs, particularly under highly noisy conditions.
- When derivatives are used, the two models perform similarly.
- Future work:
  - Use formants and other articulatory features to initialize the state vectors.
  - Examine different dimensions of the state vector.
  - Extend to a non-linear dynamical system: use an extended Kalman filter and derive the EM re-estimation formulae for the non-linear case.
MAP versus Bayes Optimal
- MAP adaptation techniques derive from Bayes optimal classification under the assumption that the posterior is peaked around the most probable model; this is not optimal.
- Bayes optimal adaptation is instead based on a weighted average over the posterior:
  - Better performance with less training data.
  - But: computationally expensive, analytical solutions are hard to find, and approximations must be considered.
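The standard contrast in symbols, supplied here because the slide formulas were not preserved (X denotes the adaptation data):

```latex
% MAP plug-in: keep only the single most probable parameter value.
\[
  \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid X),
  \qquad p(x \mid X) \approx p(x \mid \hat{\theta}_{\mathrm{MAP}}) .
\]
% Bayes optimal: average the predictions over the full posterior.
\[
  p(x \mid X) \;=\; \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta .
\]
```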
Bayes Optimal Adaptation
- Bayes optimal classification is based on averaging over the model posterior (the slide's formula was lost in extraction).
- Assuming θ denotes a Gaussian component, the integral becomes a sum over Θ, a subset of Gaussians.
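A hedged reconstruction of the discrete form implied above, once θ ranges over a finite set of Gaussian components:

```latex
% Restriction of the posterior average to the selected subset of Gaussians.
\[
  p(x \mid X) \;\approx\; \sum_{\theta \in \Theta} p(x \mid \theta)\, P(\theta \mid X),
  \qquad \Theta = \{\theta_1, \dots, \theta_N\} .
\]
```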
Our Approach
To obtain the N Gaussians of Θ:
- Step 1: Cluster the Gaussian mixtures associated with context-dependent models that share a common central phone.
- Step 2: From the resulting extended Gaussian mixture, choose the N least distant Gaussians from each Gaussian component of the speaker-independent (SI) mixture.
Bayes optimal classification then reduces to the sum over this subset Θ.
[Figure: Mixture 1 and Mixture 2, each with mixture components 1, 2, ..., M.]
- For example, based on the entropy-based distance between the Gaussians, the least distant Gaussians (shown in gray) are clustered together.
- The clustering can be performed on an element or sub-vector basis, thus increasing the degrees of freedom.
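A minimal sketch of the component-selection step; the slides name an entropy-based distance, for which the symmetrized KL divergence below is a stand-in choice (diagonal covariances assumed for brevity):

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL( N(m1, diag v1) || N(m2, diag v2) ) for diagonal covariances."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def n_closest(si_mean, si_var, pool, n=5):
    """Indices of the n pool Gaussians least distant from one SI component.
    `pool` is a list of (mean, var) pairs from the extended mixture."""
    dist = [kl_diag_gauss(si_mean, si_var, m, v)
            + kl_diag_gauss(m, v, si_mean, si_var) for m, v in pool]
    return sorted(range(len(pool)), key=lambda i: dist[i])[:n]

# Toy usage: pick the 2 of 4 pool components closest to one SI Gaussian.
rng = np.random.default_rng(0)
pool = [(rng.normal(size=3), np.ones(3)) for _ in range(4)]
chosen = n_closest(np.zeros(3), np.ones(3), pool, n=2)
```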
Adaptation Configuration
- Baseline trained on the WSJ database.
- Adaptation data: WSJ Spoke 3 task (non-native speakers), 5 male and 5 female speakers.
- 20 adaptation sentences and 40 test sentences per speaker.
- Experiments were performed for different numbers of associated mixtures ("associations").
Adaptation Results (% WER)

  Speaker        | Baseline | 5 Associations | 6 Associations
  Male (4n0)     | 51.52%   | 47.65%         | 59.28%
  Male (4n3)     | 43.27%   | 41.98%         | 51.72%
  Male (4n5)     | 33.13%   | 31.48%         | 36.30%
  Male (4n9)     | 34.48%   | 33.43%         | 28.96%
  Male (4na)     | 26.66%   | 26.22%         | 28.72%
  Total male     | 37.87%   | 36.15%         | 40.99%
  Female (4n1)   | 74.96%   | 74.47%         | 81.01%
  Female (4n4)   | 58.18%   | 58.18%         | 60.12%
  Female (4n8)   | 34.16%   | 35.99%         | 30.85%
  Female (4nb)   | 40.31%   | 39.38%         | 39.06%
  Female (4nc)   | 40.23%   | 41.68%         | 42.97%
  Total female   | 49.56%   | 49.94%         | 50.80%
Total Results and Conclusions

  Adaptation   | Baseline | 5 Associations | 6 Associations
  Total % WER  | 43.71%   | 43.04%         | 45.89%

- Small improvements can be obtained over the baseline.
- The number of associations significantly influences adaptation performance.
- The optimum number of associations depends on the baseline models and the adaptation data, so the associations should be chosen dynamically.