A Baseline System for Speaker Recognition
C. Mokbel, H. Greige, R. Zantout, H. Abi Akl, A. Ghaoui, J. Chalhoub, R. Bayeh
University of Balamand - ELISA

C. Mokbel - UOB - NIST 2002

Outline
Introduction
Baseline speaker recognition system
NIST 2002 evaluation
Conclusion and perspective

Introduction
A baseline system was built and used in the NIST 2002 speaker recognition evaluation:
– GMM-based system
– Normalization using z-norm
– Adaptation technique used to estimate the speaker model starting from the world model

Baseline Speaker Recognition System
Feature extraction:
– Speech-recognition-based feature vectors: 13 MFCC coefficients, including the energy on a logarithmic scale, plus first- and second-order derivatives
– Leading to 39 feature parameters
Preprocessing using cepstral mean normalization
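The feature pipeline above can be sketched as follows, assuming the 13 static MFCC coefficients per frame are already available. The derivative here is a simple central difference with repeated edge frames; the slides do not state the regression window, so that choice and all function names are assumptions made for illustration.

```python
from typing import List

Frame = List[float]

def cepstral_mean_normalize(frames: List[Frame]) -> List[Frame]:
    """Subtract the per-utterance mean of each coefficient."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def deltas(frames: List[Frame]) -> List[Frame]:
    """Central difference (x[t+1] - x[t-1]) / 2; edge frames are repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def build_features(mfcc: List[Frame]) -> List[Frame]:
    """Concatenate normalized statics with first and second derivatives."""
    static = cepstral_mean_normalize(mfcc)
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients per frame, `build_features` returns 39 values per frame. Because the derivatives are unaffected by a constant offset, applying the mean normalization before or after computing them gives the same derivative values.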

Baseline Speaker Recognition System
GMM modeling for both hypotheses: speaker and non-speaker (world)
– EM algorithm (Baum-Welch) to train the world model, initialized using LBG VQ
– Speaker model: mean vectors adapted from the world model, using an approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, Vol. 9, No. 4, May 2001)

Baseline Speaker Recognition System
Speaker adaptation:
– The Gaussian distributions of the world model are grouped in a binary tree
– The Gaussian classes are determined from the speaker data
– MLLR is applied based on these classes; only the means of the Gaussian distributions are adapted
– MAP is applied to the Gaussian distributions at the leaves

Baseline Speaker Recognition System
Building the Gaussian tree bottom-up:
– The closest Gaussian distributions are grouped two by two
– The distance between two Gaussian distributions is the loss in likelihood of the associated data if the two Gaussians are merged into a single Gaussian
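One way to compute such a merge distance is sketched below. It assumes diagonal covariances and that each Gaussian is the maximum-likelihood fit to its own data with occupancy counts `n1` and `n2`; under those assumptions the likelihood loss reduces to a difference of log-variance terms. The slides do not give the exact formula, so this is an illustration, not the authors' implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]

def merge(n1: float, mu1: Vector, var1: Vector,
          n2: float, mu2: Vector, var2: Vector) -> Tuple[float, Vector, Vector]:
    """Moment-matched merge of two diagonal Gaussians with occupancy counts."""
    n = n1 + n2
    mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]
    var = [(n1 * (v1 + (a - m) ** 2) + n2 * (v2 + (b - m) ** 2)) / n
           for a, b, v1, v2, m in zip(mu1, mu2, var1, var2, mu)]
    return n, mu, var

def merge_loss(n1: float, mu1: Vector, var1: Vector,
               n2: float, mu2: Vector, var2: Vector) -> float:
    """Drop in data log-likelihood when the two Gaussians are replaced by one."""
    n, _, var = merge(n1, mu1, var1, n2, mu2, var2)
    log_det = sum(math.log(v) for v in var)
    log_det1 = sum(math.log(v) for v in var1)
    log_det2 = sum(math.log(v) for v in var2)
    return 0.5 * (n * log_det - n1 * log_det1 - n2 * log_det2)
```

A bottom-up tree builder would evaluate `merge_loss` for every remaining pair, merge the pair with the smallest loss, and repeat until a single root remains.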

Baseline Speaker Recognition System
After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated up through the tree to the root
Going from the root to the leaves, nodes are selected whenever one of their two children has a weight below a threshold
– This defines a partition that is used in the MLLR algorithm
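The propagation and top-down selection can be sketched as below. The `Node` class and function names are hypothetical, leaf weights are taken to be the occupation weights accumulated in the E-step, and selecting a leaf when the descent reaches it is an assumption the slides do not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    weight: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def propagate(node: Node) -> float:
    """Set each internal node's weight to the sum of its children's weights."""
    if node.left is not None and node.right is not None:
        node.weight = propagate(node.left) + propagate(node.right)
    return node.weight

def select_classes(node: Node, threshold: float) -> List[Node]:
    """Descend from the root; stop at a node when a child is too light."""
    if node.left is None or node.right is None:
        return [node]  # leaf reached: it forms its own class (assumption)
    if node.left.weight < threshold or node.right.weight < threshold:
        return [node]  # one child has too little data: keep the parent
    return (select_classes(node.left, threshold)
            + select_classes(node.right, threshold))

# Tiny example: four leaves with E-step weights 5, 4, 2 and 1.5.
a, b, c, d = Node("a", 5.0), Node("b", 4.0), Node("c", 2.0), Node("d", 1.5)
root = Node("root", left=Node("ab", left=a, right=b),
            right=Node("cd", left=c, right=d))
propagate(root)
classes = [n.name for n in select_classes(root, threshold=3.0)]
```

In this example the leaves `a` and `b` each carry enough weight to get their own class, while `c` and `d` share the class rooted at `cd`.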

Baseline Speaker Recognition System
MAP algorithm:
– The Gaussian mean parameters estimated at the leaves are smoothed, using a fixed weight, with the parameters of the world Gaussians

Baseline Speaker Recognition System
Given a target speaker model $\lambda_s$, the world model $\lambda_w$ and a test utterance $X$, the score for this utterance is computed as the log-likelihood ratio:
$s = \log \frac{p(X \mid \lambda_s)}{p(X \mid \lambda_w)}$
This score should be normalized because the world model is imprecise
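The score above can be sketched for diagonal-covariance GMMs as follows. Representing a model as a `(weights, means, variances)` tuple, treating frames as independent, and using diagonal covariances are assumptions made for the example; the slides do not describe the covariance structure.

```python
import math
from typing import List, Tuple

Vector = List[float]
GMM = Tuple[List[float], List[Vector], List[Vector]]  # weights, means, variances

def log_gauss_diag(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_loglik(frames: List[Vector], gmm: GMM) -> float:
    """Sum over frames of log sum_k w_k N(x; mu_k, var_k), via log-sum-exp."""
    weights, means, variances = gmm
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def llr_score(frames: List[Vector], speaker: GMM, world: GMM) -> float:
    """Log-likelihood ratio between the speaker and world models."""
    return gmm_loglik(frames, speaker) - gmm_loglik(frames, world)
```

Many systems also divide this score by the number of frames so that utterances of different lengths are comparable; the slide does not say whether that was done here, so the sketch leaves the ratio unscaled.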

Baseline Speaker Recognition System
Normalization using the z-norm:
– A few impostor utterances are used
– A score is computed for every utterance
– The resulting scores define a distribution per target speaker
– The target speakers' distributions should be similar so that a single threshold can be used for the decision
Center and scale (standardize) the distribution: $ns = a \cdot s + b$
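A minimal sketch of this standardization, assuming that a and b are derived from the mean μ and standard deviation σ of the impostor scores obtained against one target speaker model, so that a = 1/σ and b = -μ/σ. The slide gives only the affine form, so this mapping and the use of the population standard deviation are assumptions.

```python
import statistics
from typing import List, Tuple

def znorm_params(impostor_scores: List[float]) -> Tuple[float, float]:
    """Return (a, b) such that a * s + b has zero mean and unit variance
    over the impostor scores of one target speaker model."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)  # population std (assumption)
    return 1.0 / sigma, -mu / sigma

def znorm(score: float, a: float, b: float) -> float:
    """Apply the affine normalization ns = a * s + b."""
    return a * score + b
```

Since `a` and `b` are estimated separately for each target speaker model, the normalized scores of different speakers become comparable against one global threshold.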

Baseline Speaker Recognition System
Based on the data from the 2001 evaluation, a DET curve can be plotted
– Find the decision threshold that minimizes the cost defined for NIST 2002, i.e.:
$C_{det} = C_{miss} \cdot \Pr(\text{miss} \mid \text{target}) \cdot \Pr(\text{target}) + C_{FalseAlarm} \cdot \Pr(\text{FalseAlarm} \mid \text{NonTarget}) \cdot (1 - \Pr(\text{target}))$
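The threshold search can be sketched as an exhaustive sweep over the observed scores. The default cost parameters (`c_miss=10`, `c_fa=1`, `p_target=0.01`) are the values commonly quoted for NIST evaluations of that period; the slide does not state them, so treat them as assumptions, along with the function names and the convention that a trial is accepted when its score is at or above the threshold.

```python
import math
from typing import List, Tuple

def detection_cost(p_miss: float, p_fa: float,
                   c_miss: float = 10.0, c_fa: float = 1.0,
                   p_target: float = 0.01) -> float:
    """C_det for given miss and false-alarm rates (parameter values assumed)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_threshold(target_scores: List[float],
                   nontarget_scores: List[float]) -> Tuple[float, float]:
    """Return (minimum cost, threshold) over all candidate thresholds."""
    # Every observed score is a candidate; +inf covers "reject everything".
    candidates = sorted(set(target_scores) | set(nontarget_scores)) + [math.inf]
    best_cost, best_thr = math.inf, math.inf
    for thr in candidates:
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        cost = detection_cost(p_miss, p_fa)
        if cost < best_cost:
            best_cost, best_thr = cost, thr
    return best_cost, best_thr
```

Sweeping the same candidate thresholds and recording each (false-alarm rate, miss rate) pair yields the points of the DET curve.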

NIST 2002 evaluation
Feature vector: 13 MFCCs + 13 Δ + 13 Δ²
Cepstral mean normalization
Gender-dependent GMMs with 256 Gaussian components for the world model
– Trained on a subset of the cellular data of the NIST 2001 evaluation

NIST 2002 evaluation
Target speaker model adapted from the world model
– At every iteration, after the E-step:
    – Threshold (cumulative probability = 3.0) to select tree nodes
    – MLLR used to update the Gaussian means
    – Approximated MAP to smooth the MLLR-estimated parameters: linear combination of the MLLR-estimated mean (0.8) and the world (a priori) mean (0.2)
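The mean update and the smoothing step can be sketched as below. The affine form `A @ mean + b` is the usual MLLR mean transform; the transform itself is taken as given, since its estimation is not described on the slides. The 0.8 / 0.2 weights come from the slide, while the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def apply_mllr(mean: Vector, A: Matrix, b: Vector) -> Vector:
    """Affine MLLR update of one Gaussian mean: A @ mean + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]

def smooth_with_prior(mllr_mean: Vector, world_mean: Vector,
                      mllr_weight: float = 0.8) -> Vector:
    """Approximated MAP: fixed-weight interpolation with the world mean."""
    return [mllr_weight * m + (1.0 - mllr_weight) * w
            for m, w in zip(mllr_mean, world_mean)]

# Example with a hypothetical transform shared by one Gaussian class.
world_mean = [1.0, 2.0]
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
speaker_mean = smooth_with_prior(apply_mllr(world_mean, A, b), world_mean)
```

A fixed interpolation weight keeps part of the world mean regardless of how much adaptation data a Gaussian received, whereas a full MAP estimate would let the weight depend on that amount.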

NIST 2002 evaluation
16 male and 21 female speakers (NIST 2001) used as impostors (~8 test files from each)
– The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker
Global threshold estimated on NIST 2001 data in order to minimize the cost

NIST 2002 evaluation
System characteristics:
– CPU time on a Pentium III 800 MHz:
    – 2.1 ms per frame and per speaker for speaker model adaptation
    – 0.92 ms per frame for the test
– Memory usage: ~360 Kbytes per test

NIST 2002 evaluation
Results:
– C det =
– Min C det =
DET curve:

Conclusions and perspectives
A new baseline system has been developed and evaluated
A lot of work remains to be done, mainly:
– Optimize the feature extraction module
– Implement the complete Unified Adaptation approach
– Investigate new normalization strategies
– Integrate automatic labeling of speech segments