HIWIRE Progress Report, Trento, January 2007. Presenter: Prof. Alex Potamianos, Technical University of Crete.
Outline
- Long Term Research: Audio-Visual Processing (WP1), Segment Models (WP1), Bayes’ Optimal Adaptation (WP2)
- Research for the Platforms: New Features and Fusion
- Integration on Year 2 Platforms: Mobile Platform, Fixed Platform
Stream Weights: Motivation
- ASR performance is low at low SNR, which motivates combining several sources of information.
- The sources of information are not equally reliable across environments and noise conditions.
- Training and test conditions are mismatched.
- Unsupervised stream-weight computation for multi-stream classifiers is an open problem.
Problem Definition
- Compute "optimal" exponent weights s_i for each stream.
- Optimality is understood as minimizing the total classification error.
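For concreteness, the usual exponent-weighted multi-stream score that this setup refers to, sketched for two streams (the sum-to-one constraint is a common convention, assumed here rather than taken from the slides):

```latex
% Exponent-weighted (log-linear) combination of the two stream likelihoods.
% s_1, s_2 are the stream weights to be optimized.
\[
  \log p(\mathbf{x} \mid w) \;=\; s_1 \log p(\mathbf{x}_1 \mid w)
                              \;+\; s_2 \log p(\mathbf{x}_2 \mid w),
  \qquad s_1 + s_2 = 1, \quad s_i \ge 0 .
\]
```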
Total Error Computation
- Two-class problem w_1, w_2 for the feature vector x.
- Feature pdfs: p(x | w_1) and p(x | w_2).
- The estimation/modeling error is assumed to be a normal random variable z_i.
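The slide's derivation is not preserved in this transcript; one reading consistent with the assumptions above is that each estimated single-stream log-likelihood carries additive Gaussian noise:

```latex
% Estimated log-likelihood = true log-likelihood + Gaussian estimation error.
\[
  \log \hat{p}(\mathbf{x}_i \mid w) \;=\; \log p(\mathbf{x}_i \mid w) + z_i,
  \qquad z_i \sim \mathcal{N}(0, \sigma_{s_i}^2),
\]
% so the weighted score of the previous slide accumulates the combined
% error s_1 z_1 + s_2 z_2, whose variance is minimized next.
```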
Optimal Stream Weights (1)
- Minimize the total error variance σ² with respect to the stream weights.
- Two interesting cases:
  - Equal error rate in the single-stream classifiers: p(x_1 | w_1) = p(x_2 | w_1) in the decision region.
  - Equal estimation-error variance in each stream: σ_{s1}² = σ_{s2}².
Optimal Stream Weights (2)
- Case 1: equal error rate in the single-stream classifiers.
- Case 2: equal estimation-error variance in each stream.
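The closed-form weights shown on this slide did not survive extraction; a minimal sketch of the variance-minimization step, assuming independent stream errors that combine linearly under the constraint s_1 + s_2 = 1:

```latex
% Variance of the combined estimation error:
\[
  \sigma^2(s_1) \;=\; s_1^2 \sigma_{s_1}^2 + (1 - s_1)^2 \sigma_{s_2}^2 .
\]
% Setting the derivative to zero yields inverse-variance weighting:
\[
  s_1 = \frac{\sigma_{s_2}^2}{\sigma_{s_1}^2 + \sigma_{s_2}^2}, \qquad
  s_2 = \frac{\sigma_{s_1}^2}{\sigma_{s_1}^2 + \sigma_{s_2}^2},
  \qquad \text{i.e.} \quad s_i \propto 1/\sigma_{s_i}^2 ,
\]
% which reduces to s_1 = s_2 = 1/2 in the equal-variance case above.
```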
Antimodels, Inter- and Intra-Class Distances
- The multi-class problem is reposed as (multiple) two-class classification problems.
- If p(x | w) follows a Gaussian distribution N(μ, σ²), the Bayes error is a function of D = |μ_1 − μ_2| / σ.
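For reference, the standard two-class result behind this statement (equal priors and equal variances assumed), which makes the dependence on D explicit:

```latex
% Two classes N(mu_1, sigma^2) and N(mu_2, sigma^2) with equal priors:
% the optimal boundary is the midpoint, and the Bayes error is
\[
  P_e \;=\; Q\!\left(\frac{D}{2}\right)
        \;=\; \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{D}{2\sqrt{2}}\right),
  \qquad D = \frac{|\mu_1 - \mu_2|}{\sigma},
\]
% so a larger normalized inter-class distance D means a lower error.
```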
Experimental Results (1)
- Test case: audio-visual continuous digit recognition.
- Differences from the ideal two-class case: a multi-class problem, and recognition instead of classification.
- Multiple experiments: clean video stream; noise-corrupted audio streams at various SNRs.
Experimental Results (2)
- Subset of the CUAVE database: 36 speakers (30 training, 6 testing), 5 sequences of 10 connected digits per speaker.
  - Training set: 1500 digits (30 × 5 × 10); test set: 300 digits (6 × 5 × 10).
- Features:
  - Audio: 39 features (MFCC_D_A).
  - Visual: 39 features (ROIDCT_D_A, odd columns).
- Multi-stream HMM models:
  - 8-state, left-to-right, whole-digit HMMs with single-Gaussian mixtures.
  - The AV-HMM uses separate audio and video feature streams.
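To make the stream combination concrete, a minimal sketch of how one state's audio and video emission scores would be fused with exponent weights; the Gaussians here are toy stand-ins, not the trained models from the experiments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def av_log_likelihood(x_audio, x_video, audio_pdf, video_pdf, s_audio, s_video):
    """Exponent-weighted fusion of per-stream emission log-likelihoods."""
    return (s_audio * audio_pdf.logpdf(x_audio)
            + s_video * video_pdf.logpdf(x_video))

# Toy usage with 39-dimensional streams, matching the feature sizes above.
dim = 39
rng = np.random.default_rng(0)
audio_pdf = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))
video_pdf = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))
score = av_log_likelihood(rng.normal(size=dim), rng.normal(size=dim),
                          audio_pdf, video_pdf, s_audio=0.7, s_video=0.3)
```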
Weights’ distribution
Results (classification)
Inter-/Intra-Class Distances and Recognition
- In each stream, a total inter-/intra-class distance is computed.
Results (recognition)
Conclusions
- We proposed a stream-weight computation method for a multi-class classification task, based on theoretical results for the two-class problem and on an anti-model technique.
- Only the test utterance and the information contained in the trained models are used.
- The results are of interest for unsupervised estimation of stream weights in multi-stream classification and recognition problems.
Dynamical System Segment Model
- Segment models directly model the time evolution of the speech parameters.
- Based on a linear dynamical system.
- The system parameters should guarantee identifiability, controllability, observability, and stability.
- Only simple matrix topologies have been studied so far.
Linear Dynamical System with State Control
- Parameters F, B, H have canonical forms (Ljung, "System Identification").
- Generalized forms of the parameter structures are considered.
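The slide's equations were lost in extraction; the standard state-space form with a control input, consistent with the parameter names F, B, H above and the noise covariances P, R estimated on the next slide, is:

```latex
% State equation: control input u_t, process noise w_t ~ N(0, P).
\[
  \mathbf{x}_{t+1} \;=\; F\,\mathbf{x}_t + B\,\mathbf{u}_t + \mathbf{w}_t ,
\]
% Observation equation: measurement noise v_t ~ N(0, R).
\[
  \mathbf{y}_t \;=\; H\,\mathbf{x}_t + \mathbf{v}_t .
\]
```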
Parameter Estimation
- The EM algorithm is used to estimate the parameters F, B, P, R.
- We propose a new element-wise parameter estimation algorithm.
- For the forward-backward recursions, Kalman smoother recursions are used (see the sketch below).
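For reference, a minimal NumPy implementation of the Kalman filter plus Rauch-Tung-Striebel smoother that serves as the E-step workhorse; this is a generic textbook sketch, not the element-wise estimator proposed here. The slides' process-noise covariance P appears as Q below to avoid clashing with the filter covariances:

```python
import numpy as np

def rts_smoother(y, F, H, Q, R, x0, P0, Bu=None):
    """Kalman filter + RTS smoother for
    x_{t+1} = F x_t + B u_t + w_t,  y_t = H x_t + v_t.
    Bu: optional (T, n) array of precomputed control terms B @ u_t."""
    T, n = len(y), len(x0)
    Bu = np.zeros((T, n)) if Bu is None else Bu
    xf, Pf = np.zeros((T, n)), np.zeros((T, n, n))   # filtered estimates
    xp, Pp = np.zeros((T, n)), np.zeros((T, n, n))   # one-step predictions
    x_pred, P_pred = x0, P0
    for t in range(T):
        xp[t], Pp[t] = x_pred, P_pred
        # Measurement update.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        xf[t] = x_pred + K @ (y[t] - H @ x_pred)
        Pf[t] = (np.eye(n) - K @ H) @ P_pred
        # Time update (prediction for t + 1).
        x_pred = F @ xf[t] + Bu[t]
        P_pred = F @ Pf[t] @ F.T + Q
    # Backward (smoothing) pass.
    xs, Ps = xf.copy(), Pf.copy()
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps
```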
Experiments with Artificial Data
- Experiment description (see the sketch below):
  - Select random system parameters (using a canonical matrix topology).
  - Generate artificial data from the system.
  - Estimate the parameters from the artificial data.
- Evaluation criteria:
  - The log-likelihood of the observations increases per EM iteration.
  - The parameter estimation error decreases per EM iteration.
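A minimal sketch of the data-generation step, with arbitrary stable toy dynamics standing in for the canonical topology (dimensions chosen to match the configuration on the next slide):

```python
import numpy as np

def simulate_lds(F, B, H, Q, R, u, x0, rng):
    """Sample observations from x_{t+1} = F x_t + B u_t + w_t,
    y_t = H x_t + v_t, with w_t ~ N(0, Q) and v_t ~ N(0, R)."""
    n, m = F.shape[0], H.shape[0]
    x, ys = x0, []
    for u_t in u:
        ys.append(H @ x + rng.multivariate_normal(np.zeros(m), R))
        x = F @ x + B @ u_t + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(ys)

rng = np.random.default_rng(0)
n = 3                                  # state dimension, as on the next slide
F = np.diag(rng.uniform(0.2, 0.9, n))  # stable diagonal dynamics (assumption)
B, H = np.eye(n), np.eye(n)
Q, R = 0.01 * np.eye(n), 0.1 * np.eye(n)
u = rng.normal(size=(1000, n))         # 1000 samples, as on the next slide
y = simulate_lds(F, B, H, Q, R, u, np.zeros(n), rng)
```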
Without State Control
- Dimension of F: 3 × 3; observation vector size: 3 × 1.
- Number of rows with free parameters: 3.
- Number of samples: 1000.
Model Training on Speech Data
- Aurora 2 database, 77 training sentences.
- Word models with different numbers of states, based on the phonetic transcription.
- State alignments produced using HTK.

  Segments | Models
  2        | oh
  4        | two, eight
  6        | one, three, four, five, six, nine, zero
  8        | seven
Speech Segment Modeling
Classification Process
- Keep the true word boundaries fixed: digit-level alignments produced by an HMM.
- Apply a suboptimal search-and-pruning algorithm: keep the 11 most probable word histories for each word in the sentence (see the sketch below).
- Classification is based on maximizing the likelihood.
- Test set: Aurora 2, test set A, subway sentences; 1000 test sentences; different noise levels (clean and SNR 20, 15, 10, 5 dB).
- Front-end: standard HTK front-end extracting 14-dimensional static features.
- Two feature configurations:
  - 12 cepstral coefficients + C0 + energy.
  - The above plus first- and second-order derivatives (δ, δδ).
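A schematic of the history-pruning step; the data layout and scores are illustrative only:

```python
import heapq

def prune_histories(histories, beam=11):
    """Keep the `beam` most probable word histories.
    `histories` is a list of (log_likelihood, word_sequence) pairs."""
    return heapq.nlargest(beam, histories, key=lambda h: h[0])

# Toy usage: competing histories after the current word position.
hyps = [(-120.5, ["one", "two"]), (-118.2, ["one", "oh"]),
        (-130.0, ["nine", "two"])]
best = prune_histories(hyps)  # best[0] is the most probable history
```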
Classification Results
- Comparison of segment-model and HTK HMM classification (% accuracy).
- Same front-end configuration and same alignments; both models trained on clean data.

  AURORA   | HMM (HTK)          | Segment Models
  Subway   | MFCC,E  | +δ,δδ    | MFCC,E  | +δ,δδ
  Clean    | 97.19%  | 97.57%   | 97.53%  | 97.61%
  SNR 20   | 90.91%  | 95.71%   | 93.23%  | 95.12%
  SNR 15   | 80.09%  | 91.76%   | 87.91%  | 91.13%
  SNR 10   | 57.68%  | 81.93%   | 76.29%  | 82.69%
  SNR 5    | 36.01%  | 64.24%   | 54.87%  | 63.56%
Conclusions and Future Work
- Without derivatives, segment models significantly outperform HMMs, particularly under highly noisy conditions.
- When derivatives are used, the two models perform similarly.
- Future work:
  - Use formants and other articulatory features to initialize the state vectors.
  - Examine different dimensions of the state vector.
  - Extend to a non-linear dynamical system: use an extended Kalman filter and derive the EM re-estimation formulae for the non-linear case.
MAP versus Bayes Optimal
- MAP adaptation techniques derive from Bayes optimal classification under the assumption that the posterior is peaked around the most probable model; this is not optimal.
- Bayes optimal adaptation is instead based on a weighted average over the posterior:
  - Better performance with less training data.
  - But: computationally expensive, analytical solutions are hard to find, and approximations must be considered.
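The standard contrast in symbols, supplied here because the slide formulas were not preserved (X denotes the adaptation data):

```latex
% MAP plug-in: keep only the single most probable parameter value.
\[
  \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid X),
  \qquad p(x \mid X) \approx p(x \mid \hat{\theta}_{\mathrm{MAP}}) .
\]
% Bayes optimal: average the predictions over the full posterior.
\[
  p(x \mid X) \;=\; \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta .
\]
```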
Bayes Optimal Adaptation
- Bayes optimal classification is based on averaging over the model posterior (the slide's formula was lost in extraction).
- Assuming θ denotes a Gaussian component, the integral becomes a sum over Θ, a subset of Gaussians.
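A hedged reconstruction of the discrete form implied above, once θ ranges over a finite set of Gaussian components:

```latex
% Restriction of the posterior average to the selected subset of Gaussians.
\[
  p(x \mid X) \;\approx\; \sum_{\theta \in \Theta} p(x \mid \theta)\, P(\theta \mid X),
  \qquad \Theta = \{\theta_1, \dots, \theta_N\} .
\]
```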
Our Approach
To obtain the N Gaussians of Θ:
- Step 1: Cluster the Gaussian mixtures associated with context-dependent models that share a common central phone.
- Step 2: From the resulting extended Gaussian mixture, choose the N least distant Gaussians from each Gaussian component of the speaker-independent (SI) mixture.
Bayes optimal classification then reduces to the sum over this subset Θ.
[Figure: Mixture 1 and Mixture 2, each with mixture components 1, 2, ..., M.]
- For example, based on the entropy-based distance between the Gaussians, the least distant Gaussians (shown in gray) are clustered together.
- The clustering can be performed on an element or sub-vector basis, thus increasing the degrees of freedom.
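A minimal sketch of the component-selection step; the slides name an entropy-based distance, for which the symmetrized KL divergence below is a stand-in choice (diagonal covariances assumed for brevity):

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL( N(m1, diag v1) || N(m2, diag v2) ) for diagonal covariances."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def n_closest(si_mean, si_var, pool, n=5):
    """Indices of the n pool Gaussians least distant from one SI component.
    `pool` is a list of (mean, var) pairs from the extended mixture."""
    dist = [kl_diag_gauss(si_mean, si_var, m, v)
            + kl_diag_gauss(m, v, si_mean, si_var) for m, v in pool]
    return sorted(range(len(pool)), key=lambda i: dist[i])[:n]

# Toy usage: pick the 2 of 4 pool components closest to one SI Gaussian.
rng = np.random.default_rng(0)
pool = [(rng.normal(size=3), np.ones(3)) for _ in range(4)]
chosen = n_closest(np.zeros(3), np.ones(3), pool, n=2)
```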
Adaptation Configuration
- Baseline trained on the WSJ database.
- Adaptation data: WSJ Spoke 3 task (non-native speakers), 5 male and 5 female speakers.
- 20 adaptation sentences and 40 test sentences per speaker.
- Experiments were performed for different numbers of associated mixtures ("associations").
Adaptation Results (% WER)

  Speaker        | Baseline | 5 Associations | 6 Associations
  Male (4n0)     | 51.52%   | 47.65%         | 59.28%
  Male (4n3)     | 43.27%   | 41.98%         | 51.72%
  Male (4n5)     | 33.13%   | 31.48%         | 36.30%
  Male (4n9)     | 34.48%   | 33.43%         | 28.96%
  Male (4na)     | 26.66%   | 26.22%         | 28.72%
  Total male     | 37.87%   | 36.15%         | 40.99%
  Female (4n1)   | 74.96%   | 74.47%         | 81.01%
  Female (4n4)   | 58.18%   | 58.18%         | 60.12%
  Female (4n8)   | 34.16%   | 35.99%         | 30.85%
  Female (4nb)   | 40.31%   | 39.38%         | 39.06%
  Female (4nc)   | 40.23%   | 41.68%         | 42.97%
  Total female   | 49.56%   | 49.94%         | 50.80%
Total Results and Conclusions

  Adaptation   | Baseline | 5 Associations | 6 Associations
  Total % WER  | 43.71%   | 43.04%         | 45.89%

- Small improvements can be obtained over the baseline.
- The number of associations significantly influences adaptation performance.
- The optimum number of associations depends on the baseline models and the adaptation data, so the associations should be chosen dynamically.