The CRIM Systems for the NIST 2008 SRE
Patrick Kenny, Najim Dehak and Pierre Ouellet
Centre de recherche informatique de Montreal (CRIM)
All rights reserved © 2005 CRIM

Systems
– CRIM_2 was the primary system for all but the core condition
  – Large stand-alone joint factor analysis (JFA) system trained on pre-2006 data
– CRIM_1 was the primary system for the core condition
  – CRIM_1 = CRIM_2 + 3 other JFA systems with different feature sets
– CRIM_3 = CRIM_2 + 2006 SRE data

Overview
Tasks involving multiple enrollment recordings:
– 8conv-short3, 3conv-short3
Tasks involving 10 sec test recordings:
– 10sec-10sec, short2-10sec, 8conv-10sec
Najim Dehak will talk about:
– JFA with unconventional features
– Post-eval experiments on the interview data (following LPT and I4U)

Factor Analysis Configuration
– 2K Gaussians, 60-dimensional features
  – 20 Gaussianized MFCCs + first and second derivatives
– 300 speaker factors
– 100 channel factors for telephone speech
– An additional 100 channel factors for microphone speech

Speaker Variability
Prior distribution on speaker supervectors:
  s = m + vy + dz
– m is the speaker-independent supervector
– v is rectangular, low rank (eigenvoices)
– d is diagonal
– y, z are standard normal random vectors (speaker factors)

Channel Variability
Each supervector M is assumed to be the sum of a speaker supervector and a channel supervector:
  M = s + c
Prior distribution on channel supervectors:
  c = ux
– u is rectangular, low rank (eigenchannels)
– x is a standard normal random vector
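The generative model on these two slides is compact enough to state in a few lines of code. The following is a toy sketch with made-up dimensions (the real configuration is 2K Gaussians × 60 features, 300 speaker factors, 100 channel factors), not CRIM's implementation:

```python
import numpy as np

# Toy dimensions; the real supervectors are 2048 x 60 = 122880-dimensional.
D = 120          # supervector dimension
R_v, R_u = 5, 3  # ranks of v (eigenvoices) and u (eigenchannels)

rng = np.random.default_rng(0)
m = rng.normal(size=D)                      # speaker-independent supervector
v = rng.normal(size=(D, R_v))               # low-rank eigenvoice matrix
d = np.diag(rng.uniform(0.1, 1.0, size=D))  # diagonal residual term
u = rng.normal(size=(D, R_u))               # low-rank eigenchannel matrix

# Speaker supervector: s = m + vy + dz, with y and z standard normal
y, z = rng.normal(size=R_v), rng.normal(size=D)
s = m + v @ y + d @ z

# Channel supervector c = ux, and the utterance supervector M = s + c
x = rng.normal(size=R_u)
M = s + u @ x
```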

Enrollment: single utterance
– The supervector for the utterance is m + dz + vy + ux
– Calculate the MAP estimates of x, y and z
– The speaker supervector is then m + dz + vy
– The full posterior distribution of s can be calculated in closed form (but this is messy unless d is 0)
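If the utterance is idealized as a single noisy observation F of its supervector, the joint MAP estimate of the standard-normal factors is the posterior mean of a linear-Gaussian model. A minimal sketch under that simplification (the actual estimator works from per-Gaussian Baum-Welch statistics, which this glosses over):

```python
import numpy as np

def map_factors(F, m, v, d, u, noise_var=1.0):
    """Joint MAP estimate of (y, z, x) when F ~ N(m + A w, noise_var * I),
    with A = [v | d | u] and w standard normal a priori.  A simplified
    stand-in for the statistics-based JFA estimator."""
    A = np.hstack([v, d, u])
    # Posterior mean of w: (I + A^T A / noise_var)^-1 A^T (F - m) / noise_var
    L = np.eye(A.shape[1]) + A.T @ A / noise_var
    w = np.linalg.solve(L, A.T @ (F - m) / noise_var)
    R_v, D = v.shape[1], d.shape[0]
    return w[:R_v], w[R_v:R_v + D], w[R_v + D:]   # y, z, x

# Usage: y, z, x = map_factors(M_observed, m, v, d, u)
# Enrolled speaker supervector from the point estimates: m + v @ y + d @ z
```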

Enrollment: 8conv case
– Again the joint posterior distribution of the hidden variables can be calculated in closed form; unless d is 0, this is very messy
– Trick: pool the utterances together and ignore the fact that the x's are different

10 second test conditions
– Many labs have reported difficulty in getting channel factors or NAP to work under these conditions
– The problem may be that it is unrealistic to attempt to produce point estimates (ML or MAP) of channel factors from 10 second test utterances
– The rules of probability say you should integrate over the channel factors instead
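Because the model is linear-Gaussian in x, the channel factors can be integrated out analytically: the test supervector is then Gaussian with covariance uuᵀ plus the residual noise. A hedged sketch of that marginalized log-likelihood (illustrative only, not the statistics-based computation used in the actual system):

```python
import numpy as np

def loglik_integrated(F, s_hat, u, noise_var=1.0):
    """Log-likelihood of an observed supervector F given an enrolled
    speaker supervector s_hat, with the channel factors x integrated out:
    F ~ N(s_hat, u u^T + noise_var * I)."""
    D = len(F)
    cov = u @ u.T + noise_var * np.eye(D)
    resid = F - s_hat
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```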

Why is this not an issue for long test utterances?
If the test utterance is long, the posterior distribution of the channel factors will be sharply peaked in the neighbourhood of the point estimate (MAP or ML), so the point estimate is a good stand-in for the integral.
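The peaking can be made quantitative: with T frames of data, the posterior covariance of x behaves like (I + T uᵀΣ⁻¹u)⁻¹, so it shrinks roughly as 1/T. A toy check of that scaling (my illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=(60, 5))      # toy eigenchannel matrix
noise_var = 1.0

for T in (10, 100, 1000):         # ~10 s of speech vs. a long utterance
    # Posterior covariance of x given T frames: (I + T u^T u / noise_var)^-1
    post_cov = np.linalg.inv(np.eye(5) + T * u.T @ u / noise_var)
    print(T, np.trace(post_cov))  # total posterior variance shrinks ~ 1/T
```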

Research Problem
How should factor analysis likelihoods and posteriors be evaluated so as to take account of all of the relevant uncertainties?
– Uncertainty in the speaker factors
– Uncertainty in the channel factors
– Uncertainty in the assignment of observations to mixture components

Current Solution
– Use point estimates of the speaker factors
  – A Bayesian approach (using the full posterior) doesn't seem to help
– Integrate over the channel factors
– Use the UBM to align frames with mixture components
  – A tractable posterior plus Jensen's inequality gives a lower bound on the likelihood (Niko Brummer)
  – Very fast if combined with the LPT assumption
– Paradoxical results if speaker/channel-dependent GMMs are used in place of the UBM

Ideal Solution: Integrate over all hidden variables
– Robbie Vogt (Odyssey 2004) did this for a diagonal factor analysis model
  – No speaker or channel factors
  – Exact dynamic programming solution
– Variational Bayes offers an approximate solution in the general case (see the sketch after this list)
  – Assume that the posterior distribution factorizes into 3 terms (speaker factors, channel factors, assignments of frames to mixture components)
  – Cycle through the factors to update them (like EM)
  – Jensen's inequality gives a lower bound on the likelihood which increases on successive iterations
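A skeleton of that coordinate-ascent scheme, shown for a stripped-down model with only speaker and channel factors (the mixture-assignment factor of the full model is omitted); an assumption-laden sketch, not CRIM's code:

```python
import numpy as np

def vb_factors(F, m, v, u, noise_var=1.0, n_iter=20):
    """Mean-field VB for F ~ N(m + v y + u x, noise_var * I) with
    standard-normal priors on y and x.  Alternating the two Gaussian
    factors q(y) and q(x) monotonically increases the evidence lower
    bound given by Jensen's inequality."""
    Ey = np.zeros(v.shape[1])
    Ex = np.zeros(u.shape[1])
    Ly = np.eye(v.shape[1]) + v.T @ v / noise_var  # precision of q(y)
    Lx = np.eye(u.shape[1]) + u.T @ u / noise_var  # precision of q(x)
    for _ in range(n_iter):
        # Update each factor given the current mean of the other (like EM)
        Ey = np.linalg.solve(Ly, v.T @ (F - m - u @ Ex) / noise_var)
        Ex = np.linalg.solve(Lx, u.T @ (F - m - v @ Ey) / noise_var)
    return Ey, Ex
```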

Fusion
– Fusing long-term and short-term features
– Unsupervised pseudo-syllable segmentation of prosodic and MFCC contours
– Six Legendre polynomial coefficients for each contour (see the sketch below)
– JFA without the common factor (d = 0)
– Logistic regression fusion (FoCal)
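Contour modelling of this kind is usually done by fitting a low-order Legendre expansion to each segment; a sketch of how the six coefficients per contour might be computed (my illustration of the standard recipe, not CRIM's code):

```python
import numpy as np

def legendre_coeffs(contour, order=5):
    """Fit a Legendre expansion to one contour (e.g. the pitch track of a
    pseudo-syllable) and return its order+1 coefficients.  order=5 gives
    the six coefficients per contour mentioned on the slide."""
    t = np.linspace(-1.0, 1.0, num=len(contour))  # natural Legendre domain
    return np.polynomial.legendre.legfit(t, contour, deg=order)

# Example: a toy rising-falling pitch contour over 30 frames
pitch = 120 + 20 * np.sin(np.linspace(0, np.pi, 30))
print(legendre_coeffs(pitch))  # 6 numbers summarizing the contour shape
```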

Pseudo-syllable segmentation

Long term features
Three long-term systems:
– 512 Gaussians; features: pitch + energy + duration (13 dimensions)
– 1024 Gaussians; features: 12 MFCC contours + energy + duration (79 dimensions)
– 1024 Gaussians; features: 12 MFCC contours + pitch + energy + duration (85 dimensions)

Short2-short3: Tel-Tel det7

Short2-short3: Tel-Tel det8

How to deal with interview data?
– Interview eigenchannels trained on interview development data (as at LPT and I4U)
– Small configuration of the factor analysis:
  – Features: 20 Gaussianized MFCCs + first derivatives
  – 300 speaker factors, d = 0 (no common factor), 100 telephone channel factors
– We carried out two experiments:
  – 50 Tel-Mic channel factors
  – 50 Tel-Mic channel factors + 50 interview channel factors

NIST 2008: Interview data – det1

NIST 2008: Interview data – det1

System                                      EER (%)   MinDCF
Without interview eigenchannels             8.9
Interview speaker utterances as means       5.5
Interview channel_2 utterance as means      5.7
Interview & microphone eigenchannels        5.7

References
– A Study of Inter-Speaker Variability in Speaker Verification
– Modeling prosodic features with joint factor analysis for speaker verification