Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition
Thilo Stadelmann, Bernd Freisleben, Ralph Ewerth
University of Marburg, Germany
International Conference on Pattern Recognition, Istanbul, Turkey, August 24, 2010

Slide 2: Content
1. Introduction
2. Related Work
3. Idea and Justification
4. Implementation
5. Results
6. Conclusions

Slide 3: Introduction – Why speaker recognition?
• Speaker recognition is useful for:
  – User authentication (e.g., telephone services)
  – Video indexing by person (e.g., movies)
  – Preprocessing for automatic speech recognition (e.g., speaker adaptation)
• These scenarios have in common that additional training and testing data is either:
  – Unavailable (e.g., in movies, typical speaker turns last only 1-2 s), or
  – Costly (e.g., in access control and speaker adaptation, the user has to provide enrollment data but just wants to proceed with his/her actual task)

Slide 4: Introduction – And why short utterances?
• Typical speaker recognition systems need:
  – 30-100 s of speech on average for training (evaluation: 10 s)
  – 7-10 s as a minimum for training in specialized forensic software
• => Furui ["40 Years of...", 2009]: "The most pressing issues [...] for speaker recognition are rooted in [...] insufficient data."

Slide 5: Related Work – How is this normally dealt with?
• Use of additional data, assumptions, or modalities:
  – Anchor models, phonetic structure [Merlin et al., 1999]
  – Speech content, word dependencies [Larcher et al., 2008]
  – Video in multimodal data streams [Larcher et al., 2008]
  – Subspace models, confidence intervals [Vogt et al., 2008/2009]
• These are combined with the typical Gaussian Mixture Model (GMM) approach

Slide 6: Idea and Justification – How is it dealt with here?
• The typical approach to speaker recognition is to use a statistical voice model
• If a similar model formulation with fewer parameters can be found:
  – => less data is necessary for reliable estimates
  – => side effect: improved runtime due to the compact model
• The typical (almost omnipresent) statistical voice model is the GMM
• => optimize the GMM to use fewer parameters

Slide 7: Idea and Justification – Idea
Observations:
• Some dimensions (e.g., 0, 1, 4, 7) are multimodal/skewed => they need a Gaussian mixture to be modeled accurately
• Some dimensions (e.g., 6, 11, 13, 18) look Gaussian themselves => why spend the parameters of 31 additional mixtures on them?
[Figure: per-dimension plot of a 32-mixture GMM modeling 19-dimensional MFCCs; upper blue curve: joint density]

Slide 8: Idea and Justification – Justification
• Idea: model each dimension individually with the optimal number of mixtures for its marginal density
• Promising:
  – LPCCs are similar to MFCCs in the (non-)Gaussianity of individual dimensions
  – LSPs are Gaussian-like in every dimension
  – Pitch is quite non-Gaussian
  – Combinations of such features are common => the method is generally applicable
• Permissible:
  – The standard GMM already treats dimensions as decorrelated via its diagonal covariance matrices
  – Chances are that additionally treating them as independent does not discard information important for speaker recognition (see the factorization sketch below)
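To make the independence assumption concrete, here is the factorization it implies (a sketch in standard notation, not taken verbatim from the paper): a diagonal-covariance GMM models the joint density as a mixture of products of 1-D Gaussians, whereas the dimension-decoupled model is a product of per-dimension 1-D mixtures with individual mixture counts M_d:

```latex
% Standard diagonal-covariance GMM over D dimensions with M mixtures:
p(\mathbf{x}) = \sum_{m=1}^{M} w_m \prod_{d=1}^{D} \mathcal{N}\!\left(x_d;\, \mu_{m,d},\, \sigma^2_{m,d}\right)

% Dimension-decoupled GMM: one 1-D mixture per dimension, M_d chosen per dimension:
p(\mathbf{x}) = \prod_{d=1}^{D} \sum_{m=1}^{M_d} w_{d,m}\, \mathcal{N}\!\left(x_d;\, \mu_{d,m},\, \sigma^2_{d,m}\right)
```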

Slide 9: Implementation – How is it realized?
• A wrapper around an existing GMM implementation:
  – Build a single GMM per dimension of the feature set
  – Optimize the number of mixtures in each dimension via the Bayesian Information Criterion (BIC)
  – Apply an orthogonal transform prior to training/testing to further decorrelate the data
• => The Dimension-Decoupled GMM (DD-GMM) is a tuple (#mixtures, GMM) per dimension plus a transformation matrix
• => easy to integrate with existing GMM implementations (see the sketch below)
• => combinable with the other short-utterance schemes from related work
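A minimal sketch of such a wrapper, assuming scikit-learn's GaussianMixture as the underlying GMM implementation and PCA as the orthogonal transform (the slide does not name the implementation; the class and method names here are illustrative, not the authors' code):

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


class DDGMM:
    """Dimension-decoupled GMM: one 1-D mixture per decorrelated dimension."""

    def __init__(self, max_mixtures=32):
        self.max_mixtures = max_mixtures

    def fit(self, X):
        # Orthogonal transform (here: PCA) to further decorrelate the data.
        self.transform = PCA().fit(X)
        Z = self.transform.transform(X)
        # Per dimension: keep the 1-D GMM whose mixture count minimizes BIC.
        self.gmms = []
        for d in range(Z.shape[1]):
            col = Z[:, d:d + 1]
            candidates = [GaussianMixture(n_components=m).fit(col)
                          for m in range(1, self.max_mixtures + 1)]
            self.gmms.append(min(candidates, key=lambda g: g.bic(col)))
        return self

    def log_likelihood(self, X):
        # Independence assumption: the total log-likelihood is the sum
        # of the per-dimension log-likelihoods.
        Z = self.transform.transform(X)
        return sum(g.score_samples(Z[:, d:d + 1]).sum()
                   for d, g in enumerate(self.gmms))
```

A speaker would then be enrolled by fitting one such model on his/her training features, and a test utterance scored against every enrolled model, with the highest log-likelihood deciding the identity. Note that trying every mixture count up to the maximum in each dimension is consistent with the slower training reported on slide 12, while testing stays fast because each dimension typically retains only a few mixtures.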

Slide 10: Results – Speaker recognition performance
• Competitors in the 630-speaker identification experiment:
  – BIC-GMM: GMM with the number of multimodal mixtures optimized via BIC
  – 32-GMM: multimodal GMM with a fixed 32 mixtures
  – DD-GMM: the new dimension-decoupled GMM
• Up to 45% data removal: nearly no difference
• Beyond 50% removal: DD-GMM is 7.56% (avg.) better than the best competitor with the same amount of data
• Beyond 50% removal: DD-GMM is as good as the best competitor with 4.17% (avg.) less data
• The effect is stronger when only training data is removed
• The effect is still visible when only test data is removed
[Figure: speaker identification rate on TIMIT vs. % of training/test data removed (100% corresponds to ca. 20 s / 5 s)]

Slide 11: Results – Evolution of parameter count
• DD-GMM uses 90.98% (avg.) fewer parameters than BIC-GMM
• Best reduction in the literature [Liu and He, 1999]: 75%, achieved without a better identification rate
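For intuition, a back-of-the-envelope parameter count (my illustration, not from the paper; it assumes 19-dimensional features, diagonal covariances, and a hypothetical average of about 2.3 mixtures per dimension, chosen so that it roughly reproduces the quoted reduction):

```latex
% Fixed 32-mixture diagonal GMM over D = 19 dimensions:
% 32 means and 32 variances per dimension plus 31 free weights
32 \cdot 2D + 31 = 32 \cdot 38 + 31 = 1247 \text{ parameters}

% DD-GMM with M_d mixtures in dimension d (each contributing a mean,
% a variance, and a weight, with one weight constrained to sum to 1):
\sum_{d=1}^{D} (3 M_d - 1) \approx 19 \cdot (3 \cdot 2.3 - 1) \approx 112
\text{ parameters} \quad (\approx 91\% \text{ fewer})
```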

Slide 12: Results – Run time
• DD-GMM training time: 2.5 times longer than 32-GMM (the fastest), but still 3.5 times faster than real time
• DD-GMM test time: 2.1 times faster than BIC-GMM (the fastest), i.e., 54.5% of real time
• The test phase is the practically more relevant one (it occurs more frequently)

Slide 13: Conclusions – What remains
• DD-GMM gives more reliable speaker recognition results when data is scarce
• DD-GMM is computationally more efficient when data is plentiful
• DD-GMM performs speaker recognition where standard GMM approaches are no longer usable:
  – >80% identification rate with <5.5 s / 1.3 s of training/test data
• DD-GMM is easy to integrate with other systems:
  – The wrapper comprises effectively 80 lines of code around an existing GMM
  – The approach is complementary to other short-utterance schemes
• Future work: apply and test on other features and data sets beyond speaker recognition