Intra-Class Variability Modeling for Speech Processing
Dr. Hagai Aronowitz, IBM Haifa Research Lab
H. Aronowitz (IBM), June 08
Sections: Introduction, Mapping, Modeling, Speaker Diarization, Summary
Presentation is available online.

Speech Classification
Task: given labeled training segments from class + and class –, classify unlabeled test segments.
Proposed classification framework
1. Represent speech segments in segment-space
2. Learn a classifier in segment-space: SVMs, NNs, Bayesian classifiers, …

Outline
1. Introduction to GMM based classification
2. Mapping speech segments into segment space
3. Intra-class variability modeling
4. Speaker diarization
5. Summary

Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
GMM based speaker recognition
1. Train a universal background model (UBM) GMM using EM
2. For every target speaker S: train a GMM G_S by applying MAP-adaptation
Estimate Pr(y_t | S). Assuming frame independence: $\Pr(Y \mid S) = \prod_t \Pr(y_t \mid S)$
[Figure: the UBM and the adapted GMMs Q1 (speaker #1) and Q2 (speaker #2), with component means μ1, μ2, μ3, in the R^26 MFCC feature space]
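
The scoring scheme on this slide can be sketched as follows. This is a minimal illustration, not the system used in the talk: it assumes diagonal covariances, mean-only MAP adaptation, and a relevance factor of 16, none of which are stated on the slide, and all function names are hypothetical.

```python
import numpy as np

def log_components(frames, weights, means, variances):
    """log(w_g * N(y_t; mu_g, diag(var_g))) for every frame t and component g."""
    diff = frames[:, None, :] - means[None, :, :]                      # (T, G, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (G,)
    return np.log(weights) + log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)

def frame_loglik(frames, weights, means, variances):
    """log Pr(y_t | GMM) per frame, via log-sum-exp over components."""
    lc = log_components(frames, weights, means, variances)
    top = lc.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(lc - top).sum(axis=1))

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM toward one speaker's frames.

    The relevance factor is an assumed value, not taken from the slides.
    """
    lc = log_components(frames, weights, means, variances)
    post = np.exp(lc - lc.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)           # responsibilities, (T, G)
    counts = post.sum(axis=0)                         # soft frame counts, (G,)
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]  # data vs. prior weight
    return alpha * data_means + (1.0 - alpha) * means

def session_score(frames, speaker_means, weights, ubm_means, variances):
    """Average per-frame log-likelihood ratio of the speaker GMM against the UBM.

    Averaging over frames is where the frame independence assumption enters.
    """
    return float(np.mean(frame_loglik(frames, weights, speaker_means, variances)
                         - frame_loglik(frames, weights, ubm_means, variances)))
```

Note that `session_score` touches every frame of the test session, which is the linear-time cost that the rest of the talk sets out to avoid.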

GMM Based Algorithm - Analysis
1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
2. GMM scoring is inefficient: linear in the length of the audio
3. GMM scoring does not support indexing

Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4
Definitions
- X: training session for target speaker
- Y: test session
- Q: GMM trained for X
- P: GMM trained for Y
Goal
Compute Pr(Y|Q) using GMMs P and Q only
Motivation
1. Efficient speaker recognition and indexing
2. More accurate modeling

Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4
Negative cross entropy: equation (1)
Approximating the cross entropy between two GMMs
1. Matching based lower bound [Aronowitz 2004]
2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]
3. Other options in [Hershey 2007]

Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4
Matching based approximation: equation (2)
Assuming weights and covariance matrices are speaker independent (+ some approximations): equation (3)
Mapping T is induced: equation (4)
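
The equations on this slide did not survive extraction, so the sketch below rests on an assumption: that with shared weights w_g and shared diagonal covariances, the approximated score is, up to a constant, minus half the weighted Mahalanobis distance between matched component means. Under that assumption the induced mapping stacks the scaled means sqrt(w_g) * mu_g / sigma_g into one vector, and the score becomes a squared Euclidean distance. The function names are hypothetical.

```python
import numpy as np

def supervector(means, weights, variances):
    """Assumed form of T: stack sqrt(w_g) * mu_g / sigma_g over all components."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def matched_score(means_p, means_q, weights, variances):
    """-1/2 * sum_g w_g * (mu_g^P - mu_g^Q)^T Sigma_g^{-1} (mu_g^P - mu_g^Q)."""
    return -0.5 * float(np.sum(weights[:, None] * (means_p - means_q) ** 2 / variances))

def supervector_score(means_p, means_q, weights, variances):
    """The same quantity, computed as -1/2 * ||T(P) - T(Q)||^2."""
    d = supervector(means_p, weights, variances) - supervector(means_q, weights, variances)
    return -0.5 * float(d @ d)
```

Once every session is reduced to a fixed-length vector, scoring no longer depends on the session length, and standard vector indexing techniques become applicable.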

Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4
Results
Figure and Table taken from: H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September 2007.

Other Mapping Techniques
1. Anchor modeling projection [Sturim 2001]: efficient but inaccurate
2. MLLR transforms [Stolcke 2005]: accurate but inefficient
3. Kernel-PCA-based mapping [Aronowitz 2007c]: accurate & efficient
   Given
   - a set of objects
   - a kernel function (a dot product between each pair of objects)
   it finds a mapping of the objects into R^n which preserves the kernel function.

Kernel-PCA Based Mapping
[Figure: the kernel-induced function f maps sessions x, y from session space into feature space as f(x), f(y); K-PCA over the anchor sessions yields the projections Tx, Ty in the common speaker subspace (R^n) and the residuals u_x, u_y in the speaker unique subspace]

Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- channel, noise
- language
- stress, emotion, aging
The frame independence assumption does not hold in these cases! Equation (1)
Instead, we can use a more relaxed assumption: equation (2)
which leads to: equation (3)

Old vs. New Generative Models
Old model: a speaker is a GMM, and the frame sequence is generated independently from it.
New model: a speaker is a PDF over GMM space. Each session draws a session GMM from that PDF, and the frame sequence is generated independently from the session GMM.
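
A toy sampler can make the difference between the two generative models concrete. This is a sketch under assumptions that are not on the slide: the speaker-level PDF is taken to be an isotropic Gaussian over the component means only, and weights and frame-level variances are held fixed. `sample_session` and its parameters are hypothetical names.

```python
import numpy as np

def sample_session(rng, speaker_means, session_std, weights, frame_std, n_frames):
    """Draw one session: first a session GMM, then frames independently from it.

    session_std = 0 recovers the old model, where every session of a speaker
    shares exactly the same GMM.
    """
    # Session-level draw: perturb the speaker's component means.
    session_means = speaker_means + session_std * rng.normal(size=speaker_means.shape)
    # Frame-level draws: pick a component per frame, then add frame noise.
    comps = rng.choice(len(weights), size=n_frames, p=weights)
    noise = frame_std * rng.normal(size=(n_frames, speaker_means.shape[1]))
    return session_means, session_means[comps] + noise
```

Frames within one session share the same session-level draw, so they are dependent once the session GMM is marginalized out, even though they are independent given it.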

Session-GMM Space
[Figure: session-GMM space with separate regions for speaker #1, speaker #2, and speaker #3; each point is the GMM of one session, e.g. the GMM for session A of speaker #1 and the GMM for session B of speaker #1]

Modeling in Session-GMM Space 1/2
Recall mapping T induced by the GMM approximation analysis: the image of a session GMM under T is called a supervector.
A speaker is modeled by a multivariate normal distribution in supervector space: equation (3)
A typical dimension of the covariance matrix is 50,000 × 50,000.
The covariance matrix is estimated robustly using PCA + regularization: it is assumed to be a low rank matrix with an additional non-zero (noise) diagonal.
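
A covariance of the form "low rank plus diagonal" never has to be stored or inverted as a full matrix. The sketch below evaluates the Gaussian log-density with the Woodbury identity, assuming the covariance is V V^T + noise_var * I with an isotropic noise term; the slide only says the diagonal is non-zero, so the isotropic choice and the function name are assumptions.

```python
import numpy as np

def lowrank_gaussian_logpdf(x, mean, V, noise_var):
    """log N(x; mean, V V^T + noise_var * I) without forming the d x d covariance.

    V has shape (d, k) with k much smaller than d.
    """
    d, k = V.shape
    r = x - mean
    M = noise_var * np.eye(k) + V.T @ V        # small (k, k) system
    Vr = V.T @ r
    # Woodbury: r^T Sigma^{-1} r = (r^T r - (V^T r)^T M^{-1} (V^T r)) / noise_var
    quad = (r @ r - Vr @ np.linalg.solve(M, Vr)) / noise_var
    # Determinant lemma: log|Sigma| = (d - k) * log(noise_var) + log|M|
    _, logdet_M = np.linalg.slogdet(M)
    logdet = (d - k) * np.log(noise_var) + logdet_M
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
```

With d around 50,000 and a rank of a few hundred, the cost per evaluation is dominated by the product V^T r rather than by anything quadratic in d.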

Modeling in Session-GMM Space 2/2: Estimating covariance matrix
[Figure: sessions of speaker #1, speaker #2, and speaker #3 shown in supervector space, and the same sessions shown in delta supervector space]
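
One plausible reading of the figure is that each session supervector has its speaker's mean subtracted, and the pooled differences (delta supervectors) are used to estimate a shared intra-speaker covariance. The sketch below follows that reading and swaps in a probabilistic-PCA style estimate for the "PCA + regularization" step; both the reading and the estimator are assumptions, and the function names are hypothetical.

```python
import numpy as np

def delta_supervectors(supervectors, speaker_ids):
    """Subtract each speaker's mean supervector from that speaker's sessions."""
    supervectors = np.asarray(supervectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    deltas = np.empty_like(supervectors)
    for spk in np.unique(speaker_ids):
        rows = speaker_ids == spk
        deltas[rows] = supervectors[rows] - supervectors[rows].mean(axis=0)
    return deltas

def lowrank_plus_noise(deltas, rank):
    """Return (V, noise_var) such that the covariance is modeled as V V^T + noise_var * I.

    Uses an SVD of the data matrix, so the d x d covariance is never formed.
    Requires rank < d. The divisor n gives the (biased) maximum likelihood scale.
    """
    n, d = deltas.shape
    _, s, vt = np.linalg.svd(deltas / np.sqrt(n), full_matrices=False)
    eigvals = s ** 2                                   # nonzero covariance eigenvalues
    # Noise floor: average of all discarded eigenvalues, including exact zeros.
    noise_var = eigvals[rank:].sum() / (d - rank)
    V = vt[:rank].T * np.sqrt(np.maximum(eigvals[:rank] - noise_var, 0.0))
    return V, noise_var
```

The estimate preserves the total variance of the pooled deltas: the trace of V V^T plus d times `noise_var` equals the trace of the sample covariance.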

Experimental Setup
Datasets
- The covariance matrix is estimated from the NIST-2006-SRE corpus
- Evaluation is done on the NIST-2004-SRE corpus
System description
- ETSI MFCC (13-cep + 13-delta-cep)
- Energy based voice activity detector
- Feature warping
- 2048 Gaussians
- Target models are adapted from GI-UBM
- ZT-norm score normalization

Results
38% reduction in EER

Other Modeling Techniques
- NAP+SVMs [Campbell 2006]
- Factor Analysis [Kenny 2005]
- Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm
- Model each supervector by two components: s ∈ S (common speaker subspace) and u ∈ U (speaker unique subspace)
- S is spanned by a set of development supervectors (700 speakers)
- U is the orthogonal complement of S in supervector space
- Intra-speaker variability is modeled separately in S and in U
- U was found to be more discriminative than S
- EER was reduced by 44% compared to baseline GMM

Kernel-PCA Based Modeling
[Figure: the kernel-induced function f maps sessions x, y from session space into feature space as f(x), f(y); K-PCA over the anchor sessions yields the projections Tx, Ty in the common speaker subspace (R^n) and the residuals u_x, u_y in the speaker unique subspace]

Trainable Speaker Diarization [Aronowitz 2007d]
Goals
- Detect speaker changes: "speaker segmentation"
- Cluster speaker segments: "speaker clustering"
Motivation for new method
- Current algorithms do not exploit available training data (besides tuning thresholds, etc.)
Method
- Explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric used by the change-detection / clustering algorithms.

Speaker recognition on pairs of 3s segments
Dev data: BNAD05 (5hr), Arabic, broadcast news
Eval data: BNAT05, Arabic, broadcast news (207 target models, 6756 test segments)

System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                        8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling   7.4

Speaker Diarization System & Experiments
Speaker change detection
- 2 adjacent sliding windows (3s each)
- Speaker verification scoring + normalization
Speaker clustering
- Speaker verification scoring + normalization
- Bottom-up clustering
Speaker Error Rate (SER) on BNAT05
- Anchor modeling (baseline): 12.9%
- Kernel-PCA based method: 7.9%
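
The two stages of the system can be sketched in a few lines. This is an illustration only: the Euclidean distance between window means stands in for the speaker verification scoring and normalization used in the talk, and the window length, thresholds, and function names are assumptions.

```python
import numpy as np

def change_scores(frames, win):
    """Distance between the means of two adjacent windows at every boundary t."""
    n = len(frames)
    scores = np.full(n, -np.inf)
    for t in range(win, n - win + 1):
        left = frames[t - win:t].mean(axis=0)
        right = frames[t:t + win].mean(axis=0)
        scores[t] = np.linalg.norm(left - right)
    return scores

def detect_changes(frames, win, threshold):
    """Boundaries whose score exceeds threshold and is maximal within +-win frames."""
    scores = change_scores(frames, win)
    changes = []
    for t in range(win, len(frames) - win + 1):
        lo, hi = max(0, t - win), min(len(frames), t + win + 1)
        if scores[t] > threshold and scores[t] == scores[lo:hi].max():
            changes.append(t)
    return changes

def bottom_up_cluster(segment_vectors, stop_distance):
    """Merge the closest pair of clusters until no pair is closer than stop_distance."""
    vecs = np.asarray(segment_vectors, dtype=float)
    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > 1:
        cents = [vecs[c].mean(axis=0) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = np.linalg.norm(cents[a] - cents[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        if best[0] > stop_distance:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Replacing the Euclidean distance with a trained metric, for example one derived from an intra-speaker covariance, changes only the two `np.linalg.norm` calls; the control flow stays the same.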

Summary 1/2
- A method for mapping speech segments into a GMM supervector space was described
- Intra-speaker inter-session variability is modeled in GMM supervector space
Speaker recognition
- EER was reduced by 38% on the NIST-2004 SRE
- A corresponding kernel-PCA based approach reduces EER by 44%
Speaker diarization
- SER was reduced by 39%
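
The diarization figure can be checked against the SER values reported for BNAT05 in this deck (12.9% for the anchor modeling baseline, 7.9% for the kernel-PCA based method). The helper below is a hypothetical name; it only computes the usual relative reduction.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

# SER on BNAT05: anchor modeling baseline vs. kernel-PCA based method.
ser_reduction = relative_reduction(12.9, 7.9)
print(round(ser_reduction))  # 39
```

The EER reductions of 38% and 44% cannot be re-derived the same way, because the absolute EER values for those experiments are not given in the extracted slides.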

Summary 2/2: Algorithms based on the proposed framework
- Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
- Speaker diarization ("who spoke when") [Aronowitz 2007d]
- VAD (voice activity detection) [Aronowitz 2007a]
- Language identification [Noor & Aronowitz 2006]
- Gender identification [Bocklet 2008]
- Age detection [Bocklet 2008]
- Channel/bandwidth classification [Aronowitz 2007d]

Bibliography 1/2
[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communications, 17.
[2] D. E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP.
[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP.
[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP.
[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP.
[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech.
[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech.
[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech.

Bibliography 2/2
[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech.
[10] E. Noor, H. Aronowitz, "Efficient Language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop.
[11] W. M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP.
[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP.
[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP.
[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September.
[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech.
[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech.
[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP.

Thanks!

Backup slides

Kernel-PCA Based Mapping 2/5
Goals:
- Map sessions into feature space
- Model in feature space
[Figure: the kernel trick: the function f() maps the sessions x, y and the anchor sessions from session space into a dot-product feature space, giving f(x) and f(y)]

Kernel-PCA Based Mapping 3/5
Given
- kernel K
- n anchor sessions
Find an orthonormal basis for the subspace spanned by the anchor sessions in feature space.
Method
1) Compute eigenvectors of the centralized kernel-matrix k_{i,j} = K(A_i, A_j).
2) Normalize eigenvectors by square-roots of corresponding eigenvalues → {v_i}
3) The linear combinations of the mapped anchor sessions f(A_j), with coefficients taken from each v_i, form the requested basis.
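
The three steps above correspond to standard kernel PCA and can be sketched as follows. The sketch assumes the kernel values are already available as a matrix over the anchor sessions and as a vector between a new session and the anchors; how K itself is computed from two sessions is outside its scope, and the function names are hypothetical.

```python
import numpy as np

def fit_kernel_pca(K_anchor, tol=1e-10):
    """Steps 1 and 2: eigen-decompose the centered anchor kernel matrix."""
    n = K_anchor.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    Kc = H @ K_anchor @ H                            # centralized kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    keep = eigvals > tol * eigvals.max()             # drop numerically null directions
    # Dividing by sqrt(eigenvalue) gives the implied feature-space basis unit norm.
    coeffs = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return {"coeffs": coeffs,
            "col_mean": K_anchor.mean(axis=0),
            "total_mean": K_anchor.mean()}

def project(k_x, model):
    """Step 3 applied to one session x, given the 1-D vector k_x[j] = K(x, A_j).

    Returns the coordinates of f(x) in the orthonormal basis of the anchor span.
    """
    centered = k_x - k_x.mean() - model["col_mean"] + model["total_mean"]
    return centered @ model["coeffs"]
```

For the anchor sessions themselves, dot products between projected vectors reproduce the centered kernel matrix exactly, which is the sense in which the mapping preserves the kernel on the anchor span.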

Kernel-PCA Based Mapping 4/5
T is a mapping x → R^n with the property: [equation not preserved in this transcript]
Given sessions x, y, their images may be uniquely represented as: [equation not preserved in this transcript]
- Common speaker subspace component
- Speaker unique subspace component

Kernel-PCA Based Mapping 5/5
[Figure: the kernel-induced function f maps sessions x, y from session space into feature space as f(x), f(y); K-PCA over the anchor sessions yields the projections Tx, Ty in the common speaker subspace (R^n) and the residuals u_x, u_y in the speaker unique subspace]

Modeling in Segment-GMM Supervector Space
[Figure: the frame sequences of segment #1, segment #2, …, segment #n are mapped into segment-GMM supervector space, where they fall into music, speech, and silence regions]

Segmental Modeling for Audio Segmentation
Goal
Segment audio accurately and robustly into speech / silence / music segments.
Novel idea
- Acoustic modeling is usually done on a frame-basis.
- Segmentation/classification is usually done on a segment-basis (using smoothing).
- Why not explicitly model whole segments?
Note: speaker, noise, music-context, channel (etc.) are constant during a segment.

Speech / Silence Segmentation – Results 1/2

                  FR=0.5%   FA=1%   FR=0.25%
GMM baseline      2.9%      7.9%    29.6%
Segmental         1.7%      5.1%    2.7%
Error reduction   41%       35%     91%

Speech / Silence Segmentation – Results 2/2

                  FR=0.5%   FA=1%   FR=0.25%
GMM baseline      1.43%     3.4%    3.2%
Segmental         1.27%     2.0%    1.9%
Error reduction   11%       41%

LID in Session Space
[Figure: training sessions for English, Arabic, and French form separate groups in session space; a test session is placed among them]

LID in Session Space - Algorithm
1. Front end: shifted delta cepstrum (SDC).
2. Represent every train/test session by a GMM super-vector.
3. Train a linear SVM to classify GMM super-vectors.
Results
EER=4.1% on the NIST-03 Eval (30sec sessions).

Anchor Modeling Projection
Given: anchor models λ_1,…,λ_n and session X = x_1,…,x_F
Each anchor model scores the session by its average normalized log-likelihood.
Projection: the session is represented by the vector of these n scores.
Applications
- Speaker indexing [Sturim et al., 2001]
- Intersession variability modeling in projected space [Collet et al., 2005]
- Speaker clustering [Reynolds et al., 2004]
- Speaker segmentation [Collet et al., 2006]
- Language identification [Noor and Aronowitz, 2006]
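
The projection can be sketched as follows. Two simplifications are assumptions of this sketch rather than details from the slide: each anchor model is a single diagonal Gaussian instead of a full GMM, and "normalized" is taken to mean a log-likelihood ratio against a background model. The function names are hypothetical.

```python
import numpy as np

def diag_gaussian_loglik(frames, mean, var):
    """Per-frame log-density under one diagonal Gaussian (a stand-in for a GMM)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (frames - mean) ** 2 / var, axis=1)

def anchor_projection(frames, anchors, background):
    """Map a session to R^n: one average log-likelihood ratio per anchor model.

    anchors is a list of (mean, var) pairs; background is a single (mean, var) pair.
    """
    bg = diag_gaussian_loglik(frames, *background)
    return np.array([np.mean(diag_gaussian_loglik(frames, mean, var) - bg)
                     for mean, var in anchors])
```

Every anchor model must be evaluated on every frame, so the cost grows with both the number of anchors and the session length; the resulting vector, however, has a fixed size n regardless of how long the session is.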

Intra-Class Variability Modeling: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- Noise
- Channel
- Language
- Changing speaker characteristics: stress, emotion, aging
The frame independence assumption does not hold in these cases! Equation (1)
Instead, we get: equation (2)