Published by Kelly Short. Modified over 9 years ago.
1
Intra-Class Variability Modeling for Speech Processing
Dr. Hagai Aronowitz, IBM Haifa Research Lab
June 08
Sections: Introduction, Mapping, Modeling, Speaker Diarization, Summary
Presentation is available online at: http://aronowitzh.googlepages.com/
2
Speech Classification: Proposed framework
Given labeled training segments from class + and class –, classify unlabeled test segments.
Classification framework
1. Represent speech segments in segment-space
2. Learn a classifier in segment-space (SVMs, NNs, Bayesian classifiers, …)
3
Outline
1 Introduction to GMM based classification
2 Mapping speech segments into segment space
3 Intra-class variability modeling
4 Speaker diarization
5 Summary
4
Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
GMM based speaker recognition
1. Train a universal background model (UBM) GMM using EM
2. For every target speaker S: train a GMM $G_S$ by applying MAP-adaptation
Estimate $\Pr(y_t \mid S)$
Assuming frame independence: $\Pr(Y \mid S) = \prod_t \Pr(y_t \mid S)$
Figure: MFCC feature space ($\mathbb{R}^{26}$) showing the UBM with means μ1, μ2, μ3 and the adapted models Q1 (speaker #1) and Q2 (speaker #2).
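The two training steps and the frame-independence scoring on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the system from the talk: diagonal covariances, means-only MAP adaptation and the relevance factor of 16 are assumptions, and all function names are hypothetical.

```python
import numpy as np

def log_joint(frames, weights, means, variances):
    """log(w_g * N(y_t; mu_g, diag(var_g))) for every frame t and Gaussian g."""
    diff = frames[:, None, :] - means[None, :, :]                 # (T, G, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    return np.log(weights) + log_gauss                            # (T, G)

def frame_loglik(frames, weights, means, variances):
    """log Pr(y_t | model) per frame, via a stable log-sum-exp over Gaussians."""
    lj = log_joint(frames, weights, means, variances)
    peak = lj.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(lj - peak).sum(axis=1))

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Means-only MAP adaptation of the UBM towards one speaker's frames."""
    lj = log_joint(frames, weights, means, variances)
    post = np.exp(lj - lj.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                       # responsibilities
    counts = post.sum(axis=0)                                     # soft count per Gaussian
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * means

def verification_score(frames, ubm, speaker_means):
    """Average per-frame log-likelihood ratio; frames are treated as independent."""
    weights, means, variances = ubm
    target = frame_loglik(frames, weights, speaker_means, variances)
    background = frame_loglik(frames, weights, means, variances)
    return float(np.mean(target - background))
```

Because the session log-likelihood is a sum of per-frame terms, the cost of `verification_score` grows linearly with the number of frames.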
5
GMM Based Algorithm - Analysis
1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
2. GMM scoring is inefficient: linear in the length of the audio
3. GMM scoring does not support indexing
7
Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4
Definitions
X: training session for target speaker
Y: test session
Q: GMM trained for X
P: GMM trained for Y
Goal
Compute Pr(Y | Q) using GMMs P and Q only
Motivation
1. Efficient speaker recognition and indexing
2. More accurate modeling
8
Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4
(1) Negative cross entropy [equation not preserved in the transcript]
Approximating the cross entropy between two GMMs
1. Matching based lower bound [Aronowitz 2004]
2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]
3. Other options in [Hershey 2007]
9
Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4
Matching based approximation: (2) [equation not preserved in the transcript]
Assuming weights and covariance matrices are speaker independent (+ some approximations): (3) [equation not preserved]
Mapping T is induced: (4) [equation not preserved]
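Since equations (2) to (4) did not survive in the transcript, the following is only a sketch of the standard form such a result takes, not necessarily the exact expressions from the slide. Assuming diagonal covariances and weights shared by P and Q, and matching Gaussian g of P with Gaussian g of Q, the approximate cross entropy equals a constant minus half the squared Euclidean distance between scaled, stacked mean vectors. The names `supervector` and `matched_cross_entropy` are hypothetical.

```python
import numpy as np

def supervector(means, weights, variances):
    """Stack the means, each scaled by sqrt(w_g) / sigma_g, into one vector."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def matched_cross_entropy(means_p, means_q, weights, variances):
    """Gaussian-to-Gaussian matched approximation of the integral of P log Q."""
    dim = means_p.shape[1]
    const = np.sum(weights * (np.log(weights)
                              - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                              - 0.5 * dim))
    maha = np.sum(weights * np.sum((means_p - means_q) ** 2 / variances, axis=1))
    return const - 0.5 * maha
```

Under these assumptions, scoring a test GMM against a target GMM reduces to a distance between two fixed-length vectors, which is what makes indexing possible.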
10
Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4
Results
Figure and table taken from: H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
11
Other Mapping Techniques
1. Anchor modeling projection [Sturim 2001]: efficient but inaccurate
2. MLLR transforms [Stolcke 2005]: accurate but inefficient
3. Kernel-PCA-based mapping [Aronowitz 2007c]: accurate & efficient
Kernel-PCA: given a set of objects and a kernel function (a dot product between each pair of objects), it finds a mapping of the objects into $\mathbb{R}^n$ which preserves the kernel function.
12
Kernel-PCA Based Mapping
Figure: sessions x and y in session space are mapped by the kernel-induced function f to f(x) and f(y) in feature space. K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace ($\mathbb{R}^n$); the remainders u_x and u_y lie in the speaker unique subspace.
14
Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- channel, noise
- language
- stress, emotion, aging
The frame independence assumption does not hold in these cases! (1)
Instead, we can use a more relaxed assumption: (2)
which leads to: (3)
[Equations (1) to (3) are not preserved in the transcript.]
15
Old vs. New Generative Models
Old model: Speaker (a GMM) → frame sequence (generated independently)
New model: Speaker (a PDF over GMM space) → Session GMM (a GMM) → frame sequence (generated independently)
16
Session-GMM Space
Figure: session-GMM space with regions for speaker #1, speaker #2 and speaker #3; two points of speaker #1 are labeled as the GMM for session A and the GMM for session B.
17
Modeling in Session-GMM space 1/2
Recall mapping T induced by the GMM approximation analysis: the image of a session GMM under T is called a supervector.
A speaker is modeled by a multivariate normal distribution in supervector space: (3) [equation not preserved in the transcript]
A typical dimension of the covariance matrix is 50,000 × 50,000.
The covariance matrix is estimated robustly using PCA + regularization: it is assumed to be a low rank matrix with an additional non-zero (noise) diagonal.
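A covariance of that size cannot be stored or inverted directly, which is why the low-rank-plus-diagonal structure matters. The sketch below shows one way to fit and evaluate such a model without ever forming the full matrix. It is an illustration under assumptions, not the estimator from the talk: an isotropic noise term, the `noise_floor` regularizer, and the function names are all hypothetical, and `rank` must not exceed the number of rows or columns of `deltas`.

```python
import numpy as np

def fit_lowrank_plus_diag(deltas, rank, noise_floor=1e-3):
    """Fit covariance = B^T diag(lam) B + noise * I from the rows of `deltas`.

    deltas: (N, D) session supervectors with their speaker mean removed.
    """
    n, dim = deltas.shape
    # SVD of the data gives the PCA axes without forming the D x D covariance.
    _, sing, vt = np.linalg.svd(deltas, full_matrices=False)
    basis = vt[:rank]                         # (rank, D), orthonormal rows
    lam = sing[:rank] ** 2 / n                # variance captured per axis
    residual = np.sum(deltas ** 2) / n - lam.sum()
    noise = max(residual / max(dim - rank, 1), noise_floor)
    return basis, lam, noise

def log_density(x, mean, basis, lam, noise):
    """log N(x; mean, B^T diag(lam) B + noise * I) in O(rank * D) time."""
    dim, rank = x.shape[0], basis.shape[0]
    r = x - mean
    z = basis @ r                             # coordinates inside the subspace
    inside = np.sum(z ** 2 / (lam + noise))
    outside = (r @ r - z @ z) / noise
    log_det = np.sum(np.log(lam + noise)) + (dim - rank) * np.log(noise)
    return -0.5 * (dim * np.log(2 * np.pi) + log_det + inside + outside)
```

Because the rows of `basis` are orthonormal, the covariance has eigenvalues `lam + noise` inside the subspace and `noise` elsewhere, so both the determinant and the quadratic form split into two cheap parts.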
18
Modeling in Session-GMM Space 2/2: Estimating covariance matrix
Figure: supervector space with sessions 1 and 2 marked for each of speaker #1, speaker #2 and speaker #3, and the corresponding delta supervector space.
19
Experimental Setup
Datasets
- The covariance matrix is estimated from the NIST-2006-SRE corpus
- Evaluation is done on the NIST-2004-SRE corpus
System description
- ETSI MFCC (13-cep + 13-delta-cep)
- Energy based voice activity detector
- Feature warping
- 2048 Gaussians
- Target models are adapted from GI-UBM
- ZT-norm score normalization
20
Results
38% reduction in EER
21
Other Modeling Techniques
- NAP+SVMs [Campbell 2006]
- Factor Analysis [Kenny 2005]
- Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm
- Model each supervector as s + u, with s ∈ S (common speaker subspace) and u ∈ U (speaker unique subspace)
- S is spanned by a set of development supervectors (700 speakers)
- U is the orthogonal complement of S in supervector space
- Intra-speaker variability is modeled separately in S and in U
- U was found to be more discriminative than S
- EER was reduced by 44% compared to baseline GMM
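The split of a supervector into a part inside S and a part in its orthogonal complement U is an ordinary orthogonal projection. The sketch below works on explicit vectors for clarity; in the talk the decomposition is carried out through kernel-PCA rather than on raw coordinates, so this is only an illustration, and `split_common_unique` is a hypothetical name.

```python
import numpy as np

def split_common_unique(x, dev_supervectors):
    """Split x into s (in the span of the development rows) and u (orthogonal to it).

    Assumes the development supervectors are linearly independent.
    """
    q, _ = np.linalg.qr(dev_supervectors.T)   # orthonormal basis of S, shape (D, n)
    s = q @ (q.T @ x)                         # component in the common subspace
    u = x - s                                 # component in the unique subspace
    return s, u
```

A development supervector itself has a zero unique component, while a new speaker generally leaves a non-zero u.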
22
Kernel-PCA Based Modeling
Figure: sessions x and y in session space are mapped by the kernel-induced function f to f(x) and f(y) in feature space. K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace ($\mathbb{R}^n$); the remainders u_x and u_y lie in the speaker unique subspace.
24
Trainable Speaker Diarization [Aronowitz 2007d]
Goals
- Detect speaker changes: “speaker segmentation”
- Cluster speaker segments: “speaker clustering”
Motivation for new method
Current algorithms do not exploit available training data (besides tuning thresholds, etc.)!
Method
Explicitly model inter-segment intra-speaker variability from labeled training data, and use it to define the metric used by change-detection / clustering algorithms.
25
Speaker recognition on pairs of 3s segments
Dev data: BNAD05 (5hr), Arabic, broadcast news
Eval data: BNAT05, Arabic, broadcast news (207 target models, 6756 test segments)

System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                        8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling   7.4
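To put the table in the same terms as the relative reductions quoted elsewhere in the talk, the gains over the anchor-modeling baseline can be worked out directly from the EER values above. The helper name is hypothetical; the numbers are the ones in the table.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

baseline_eer = 15.1
systems = {
    "Anchor modeling - Kernel based scoring": 10.8,
    "Kernel-PCA projection (CSS)": 8.8,
    "Kernel-PCA projection (CSS) + inter-segment variability modeling": 7.4,
}
for name, eer in systems.items():
    print(f"{name}: {relative_reduction(baseline_eer, eer):.1f}% relative reduction")
```

This gives 28.5%, 41.7% and 51.0% respectively, so the full system roughly halves the baseline EER on this task.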
26
Speaker Diarization System & Experiments
Speaker change detection
- 2 adjacent sliding windows (3s each)
- Speaker verification scoring + normalization
Speaker clustering
- Speaker verification scoring + normalization
- Bottom-up clustering
Speaker Error Rate (SER) on BNAT05
- Anchor modeling (baseline): 12.9%
- Kernel-PCA based method: 7.9%
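The change-detection stage can be sketched as follows. The talk scores the two adjacent windows with a trained speaker-verification metric; the sketch swaps in a much simpler stand-in (distance between window means, z-normalized over the file) purely to show the sliding-window mechanics. The window length in frames, the threshold and the function names are assumptions.

```python
import numpy as np

def change_scores(frames, win):
    """Z-normalized distance between the two windows that meet at each frame."""
    total = len(frames)
    scores = np.full(total, -np.inf)          # -inf where a full window does not fit
    for t in range(win, total - win + 1):
        left = frames[t - win:t].mean(axis=0)
        right = frames[t:t + win].mean(axis=0)
        scores[t] = np.linalg.norm(left - right)
    valid = np.isfinite(scores)
    scores[valid] = (scores[valid] - scores[valid].mean()) / scores[valid].std()
    return scores

def detect_changes(scores, threshold, min_gap):
    """Greedy peak picking: strongest scores first, at least min_gap frames apart."""
    changes = []
    for t in np.argsort(scores)[::-1]:
        if scores[t] < threshold:
            break
        if all(abs(int(t) - c) >= min_gap for c in changes):
            changes.append(int(t))
    return sorted(changes)
```

With a 3 s window at a typical 100 frames per second, `win` would be 300; the detected boundaries then define the segments passed to bottom-up clustering.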
28
Summary 1/2
- A method for mapping speech segments into a GMM supervector space was described
- Intra-speaker inter-session variability is modeled in GMM supervector space
- Speaker recognition: EER was reduced by 38% on the NIST-2004 SRE; a corresponding kernel-PCA based approach reduces EER by 44%
- Speaker diarization: SER was reduced by 39%
29
Summary 2/2: Algorithms based on the proposed framework
- Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
- Speaker diarization (“who spoke when”) [Aronowitz 2007d]
- VAD (voice activity detection) [Aronowitz 2007a]
- Language identification [Noor & Aronowitz 2006]
- Gender identification [Bocklet 2008]
- Age detection [Bocklet 2008]
- Channel/bandwidth classification [Aronowitz 2007d]
30
Bibliography 1/2
[1] D. A. Reynolds et al., “Speaker identification and verification using Gaussian mixture speaker models”, Speech Communication, 17, 91-108, 1995.
[2] D. E. Sturim et al., “Speaker indexing in large audio databases using anchor models”, in Proc. ICASSP, 2001.
[3] H. Aronowitz, D. Burshtein, A. Amir, “Speaker indexing in audio archives using test utterance Gaussian mixture modeling”, in Proc. ICSLP, 2004.
[4] H. Aronowitz, D. Burshtein, A. Amir, “A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification”, in Proc. ICASSP, 2005.
[5] P. Kenny et al., “Factor Analysis Simplified”, in Proc. ICASSP, 2005.
[6] H. Aronowitz, D. Irony, D. Burshtein, “Modeling Intra-Speaker Variability for Speaker Recognition”, in Proc. Interspeech, 2005.
[7] J. Goldberger and H. Aronowitz, “A distance measure between GMMs based on the unscented transform and its application to speaker recognition”, in Proc. Interspeech, 2005.
[8] H. Aronowitz, D. Burshtein, “Efficient Speaker Identification and Retrieval”, in Proc. Interspeech, 2005.
31
Bibliography 2/2
[9] A. Stolcke et al., “MLLR Transforms as Features in Speaker Recognition”, in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, “Efficient Language Identification using Anchor Models and Support Vector Machines”, in Proc. ISCA Odyssey Workshop, 2006.
[11] W. M. Campbell et al., “SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation”, in Proc. ICASSP, 2006.
[12] H. Aronowitz, “Segmental modeling for audio segmentation”, in Proc. ICASSP, 2007.
[13] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models”, in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
[15] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[16] H. Aronowitz, “Trainable Speaker Diarization”, in Proc. Interspeech, 2007.
[17] T. Bocklet et al., “Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines”, in Proc. ICASSP, 2008.
32
Thanks!
Presentation is available online at: http://aronowitzh.googlepages.com/
33
Backup slides
34
Kernel-PCA Based Mapping 2/5
Goals:
- Map sessions into feature space
- Model in feature space
Figure: sessions x and y in session space are mapped by f() to f(x) and f(y) in a dot-product feature space via the kernel trick; the anchor sessions are marked.
35
Kernel-PCA Based Mapping 3/5
Given
- kernel K
- n anchor sessions
Find an orthonormal basis for the span of the anchor sessions in feature space [symbol not preserved in the transcript]
Method
1) Compute eigenvectors of the centralized kernel-matrix $k_{i,j} = K(A_i, A_j)$.
2) Normalize eigenvectors by square-roots of corresponding eigenvalues → $\{v_i\}$
3) The resulting set of vectors is the requested basis [formula and index range not preserved]
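The method above follows the usual kernel-PCA recipe, which can be sketched in NumPy as below. This is a generic illustration rather than the exact formulation of the talk: the eigenvalue tolerance `tol` and the function names are assumptions, and the projection formula is the textbook one, since step 3 of the slide is not preserved.

```python
import numpy as np

def fit_kernel_pca(kernel_matrix, tol=1e-10):
    """Steps 1 and 2: centralize the anchor kernel matrix, then scale eigenvectors."""
    n = kernel_matrix.shape[0]
    one = np.full((n, n), 1.0 / n)
    centered = (kernel_matrix - one @ kernel_matrix - kernel_matrix @ one
                + one @ kernel_matrix @ one)
    eigvals, eigvecs = np.linalg.eigh(centered)
    keep = eigvals > tol * eigvals.max()      # drop numerically null directions
    alphas = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return alphas, kernel_matrix.mean(axis=0), kernel_matrix.mean()

def project(k_x, alphas, col_means, total_mean):
    """Map a session to R^n from its kernel values k_x[j] = K(x, A_j)."""
    centered = k_x - k_x.mean() - col_means + total_mean
    return alphas.T @ centered
```

For the anchor sessions themselves, dot products between the projected vectors reproduce the centralized kernel matrix, which is the kernel-preserving property the mapping is built for.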
36
Kernel-PCA Based Mapping 4/5
T is a mapping x → $\mathbb{R}^n$ with the property: [equation not preserved in the transcript]
Given sessions x, y, their images in feature space may be uniquely represented as: [equation not preserved]
- Common speaker subspace [symbol not preserved]
- Speaker unique subspace [symbol not preserved]
37
Kernel-PCA Based Mapping 5/5
Figure: sessions x and y in session space are mapped to f(x) and f(y) in feature space. K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace ($\mathbb{R}^n$); the remainders u_x and u_y lie in the speaker unique subspace.
38
Modeling in Segment-GMM Supervector Space
Figure: frame sequences of segment #1, segment #2, …, segment #n are mapped to points in segment-GMM supervector space, grouped into music, speech and silence.
39
Segmental Modeling for Audio Segmentation
Goal
Segment audio accurately and robustly into speech / silence / music segments.
Novel idea
Acoustic modeling is usually done on a frame-basis. Segmentation/classification is usually done on a segment-basis (using smoothing). Why not explicitly model whole segments?
Note: speaker, noise, music-context, channel (etc.) are constant during a segment.
40
Speech / Silence Segmentation – Results 1/2

System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=24.2% @ FR=0.25%
GMM baseline      2.9%    7.9%           29.6%
Segmental         1.7%    5.1%           2.7%
Error reduction   41%     35%            91%
41
Speech / Silence Segmentation – Results 2/2

System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=69% @ FR=0.25%
GMM baseline      1.43%   3.4%           3.2%
Segmental         1.27%   2.0%           1.9%
Error reduction   11%     41%
42
LID in Session Space
Figure: session space with training sessions grouped by language (English, Arabic, French) and a test session.
43
LID in Session Space - Algorithm
1. Front end: shifted delta cepstrum (SDC).
2. Represent every train/test session by a GMM super-vector.
3. Train a linear SVM to classify GMM super-vectors.
Results
EER=4.1% on the NIST-03 Eval (30sec sessions).
44
Anchor Modeling Projection
- Speaker indexing [Sturim et al., 2001]
- Intersession variability modeling in projected space [Collet et al., 2005]
- Speaker clustering [Reynolds et al., 2004]
- Speaker segmentation [Collet et al., 2006]
- Language identification [Noor and Aronowitz, 2006]
Given: anchor models $\lambda_1, \ldots, \lambda_n$ and session $X = x_1, \ldots, x_F$
Score of X against an anchor model [symbol not preserved in the transcript] = average normalized log-likelihood
Projection: [equation not preserved]
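Since the projection formula is not preserved, the sketch below shows one common reading of it: the session is represented by the vector of its average per-frame log-likelihoods under each anchor model, normalized by a background model. Normalizing against a UBM, the diagonal-covariance GMMs and the function names are assumptions made for illustration.

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    diff = frames[:, None, :] - means[None, :, :]                 # (F, G, D)
    log_joint = np.log(weights) - 0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2 * np.pi * variances), axis=1))
    peak = log_joint.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_joint - peak).sum(axis=1))

def anchor_projection(frames, anchors, ubm):
    """Map a session to R^n: one UBM-normalized average log-likelihood per anchor.

    anchors: list of (weights, means, variances) tuples; ubm: one such tuple.
    """
    background = gmm_frame_loglik(frames, *ubm)
    return np.array([np.mean(gmm_frame_loglik(frames, *anchor) - background)
                     for anchor in anchors])
```

Every session, whatever its length, ends up as a vector with one coordinate per anchor model, which is what makes the representation efficient for indexing.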
45
Intra-Class Variability Modeling: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- Noise
- Channel
- Language
- Changing speaker characteristics: stress, emotion, aging
The frame independence assumption does not hold in these cases! (1)
Instead, we get: (2)
[Equations (1) and (2) are not preserved in the transcript.]