Institute of Information Science, Academia Sinica, Taiwan Introduction to Speaker Diarization Date: 2007/08/16 Speaker: Shih-Sian Cheng.

2 Outline
Speaker diarization
  - Problem formulation
  - A prototypical speaker diarization system
Speaker segmentation
  - Problem formulation
  - Speaker segmentation using a fixed-size analysis window
  - Speaker segmentation using a variable-size analysis window
      Bottom-up segmentation using BIC
      Top-down segmentation using BIC
Speaker clustering
  - Problem formulation
  - Hierarchical agglomerative clustering
  - Optimization-oriented approaches
Two leading speaker diarization systems
  - LIMSI’s system
  - Cambridge’s system

3 Speaker diarization (Problem formulation)
Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT03 Spring Eval.)
[Figure: speaker segmentation cuts the audio stream at speaker changes; speaker clustering groups the resulting segments into Speaker 1, Speaker 2, and Speaker 3.]

4 Speaker diarization (Problem formulation)
Applications (an example follows on the next slide)
Performance measure of the speaker diarization task (C. Barras et al., 2006; NIST RT03 Spring Eval.): find the mapping between reference speakers and hypothesis speakers such that their overlap in time is largest. In the example shown, S1->A and S3->B.
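The speaker mapping used by the performance measure can be illustrated with a small sketch. It is not the NIST scoring tool: the segment format `(start, end, label)`, the function names, and the brute-force search over one-to-one mappings are assumptions made for illustration, and the timings are invented so that the result matches the S1->A, S3->B example.

```python
from itertools import permutations

def overlap_table(ref, hyp):
    """Total overlap time for every (reference speaker, hypothesis speaker) pair."""
    table = {}
    for r_start, r_end, r in ref:
        for h_start, h_end, h in hyp:
            dur = min(r_end, h_end) - max(r_start, h_start)
            if dur > 0:
                table[(r, h)] = table.get((r, h), 0.0) + dur
    return table

def best_mapping(ref, hyp):
    """One-to-one speaker mapping with the largest total overlap (brute force)."""
    table = overlap_table(ref, hyp)
    refs = sorted({r for _, _, r in ref})
    hyps = sorted({h for _, _, h in hyp})
    # Pad with None so that a reference speaker may stay unmapped.
    candidates = hyps + [None] * len(refs)
    best, best_total = {}, -1.0
    for perm in permutations(candidates, len(refs)):
        total = sum(table.get((r, h), 0.0) for r, h in zip(refs, perm))
        if total > best_total:
            best_total = total
            best = {r: h for r, h in zip(refs, perm)
                    if table.get((r, h), 0.0) > 0}
    return best, best_total

# Hypothetical timings (seconds): three reference speakers, two hypothesis speakers.
ref = [(0.0, 10.0, "S1"), (10.0, 14.0, "S2"), (14.0, 30.0, "S3")]
hyp = [(0.0, 12.0, "A"), (12.0, 30.0, "B")]
mapping, total = best_mapping(ref, hyp)
print(mapping, total)  # {'S1': 'A', 'S3': 'B'} 26.0
```

The brute-force search is only practical for a handful of speakers; an assignment algorithm would be the usual choice for larger problems.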

5 Speaker diarization (Problem formulation)
Example: automatic transcription for a broadcast news show
[Figure labels: by speaker recognition; speaker adaptation + speech recognition]

6 Speaker diarization (A prototypical system)
The prototypical speaker diarization system (S. E. Tranter & D. A. Reynolds, 2006)
  - Speech detection: to filter out non-speech data
  - Speaker segmentation (usually, over-segmentation)
  - Speaker clustering
  - Change boundary refinement

7 Speaker segmentation (Problem formulation)
Problem formulation: detect the speaker change boundaries
Performance measure: the hypothesized changes are compared with the target changes
  - Error types: miss detection (a target change with no hypothesized change) and false alarm (a hypothesized change with no target change)
  - Performance metrics: ROC curve; F-score, $F = \frac{2PR}{P + R}$, where P is the precision rate and R is the recall rate
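The error types and the F-score can be computed as in the sketch below. The matching rule is an assumption, since the slide does not state one: each target change is matched to at most one hypothesized change within a hypothetical tolerance `tol` (in seconds), and the function name and example times are illustrative.

```python
def score_changes(targets, hyps, tol=0.5):
    """Count misses and false alarms, then derive precision, recall and F-score."""
    hyps = sorted(hyps)
    used = [False] * len(hyps)
    hits = 0
    for t in sorted(targets):
        # Closest unused hypothesized change within the tolerance, if any.
        best = None
        for k, h in enumerate(hyps):
            if not used[k] and abs(h - t) <= tol:
                if best is None or abs(h - t) < abs(hyps[best] - t):
                    best = k
        if best is not None:
            used[best] = True
            hits += 1
    precision = hits / len(hyps) if hyps else 0.0
    recall = hits / len(targets) if targets else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "miss": len(targets) - hits,        # target changes never detected
        "false_alarm": len(hyps) - hits,    # hypothesized changes with no target
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

result = score_changes([10.0, 25.0, 40.0], [10.2, 18.0, 39.7, 55.0])
print(result["miss"], result["false_alarm"])  # 1 2
```

Sweeping the detection threshold of a segmenter and recording the miss and false-alarm counts at each setting gives the points of the ROC curve.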

8 Speaker segmentation (Fixed-size analysis window approach)
Speaker segmentation using a fixed-size analysis window (Siegler et al., 1997): two adjacent fixed-size windows slide along the data stream, a distance is computed between them at each position, and speaker changes are sought at the peaks of the resulting distance curve.
Distance measure of two segments: Kullback-Leibler (KL) distance (Siegler et al., 1997)
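A minimal sketch of the sliding-window distance curve follows. Several choices are assumptions not stated on the slide: each window is modeled by a single Gaussian with diagonal covariance, the KL distance is taken in its symmetric form, and the function names and the synthetic one-dimensional stream are illustrative only.

```python
def gaussian_stats(frames, floor=1e-6):
    """Per-dimension mean and (floored) variance of a list of feature vectors."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    var = [max(sum((f[k] - mean[k]) ** 2 for f in frames) / n, floor)
           for k in range(d)]
    return mean, var

def symmetric_kl(x, y):
    """KL(x||y) + KL(y||x) between diagonal Gaussians fitted to x and y."""
    (mx, vx), (my, vy) = gaussian_stats(x), gaussian_stats(y)
    return sum(0.5 * (a / b + b / a - 2.0 + (m1 - m2) ** 2 * (1.0 / a + 1.0 / b))
               for m1, a, m2, b in zip(mx, vx, my, vy))

def distance_curve(frames, win, step=1):
    """Distance between the windows just before and just after each position t."""
    curve = []
    for t in range(win, len(frames) - win + 1, step):
        curve.append((t, symmetric_kl(frames[t - win:t], frames[t:t + win])))
    return curve

# Synthetic stream: 40 frames around 0.2, then 40 frames around 5.2.
stream = ([[0.1 * ((i * 7) % 5)] for i in range(40)]
          + [[5.0 + 0.1 * ((i * 7) % 5)] for i in range(40)])
curve = distance_curve(stream, win=10)
peak_position = max(curve, key=lambda p: p[1])[0]
print(peak_position)  # 40
```

On real data the curve is noisy, so a peak-picking rule with a threshold would replace the single `max` used here.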

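9 Speaker segmentation (Fixed-size analysis window approach)
Distance measure of two segments: SVM training error (王駿發 et al., 2005)
An SVM is trained to separate the feature vectors of segment X from those of segment Y. The more the two sets overlap, the larger the training error; a smaller training error therefore indicates a larger distance (less similarity) between X and Y.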

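10 Speaker segmentation (Fixed-size analysis window approach)
ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2000)
Bayesian information criterion (BIC) for model selection:
  - Data set: $X = \{x_1, x_2, \dots, x_N\}$
  - Candidate models: $M_1, M_2, \dots, M_K$
  - $\mathrm{BIC}(M_i) = \log L(X, M_i) - \frac{\lambda}{2}\,\#(M_i)\,\log N$, where $L(X, M_i)$ is the maximum likelihood of X for model $M_i$ and $\#(M_i)$ is the number of parameters of $M_i$
  - Model selection by BIC: $M^{*} = \arg\max_{M_i} \mathrm{BIC}(M_i)$
  - λ = 1 in the BIC theory, but it is usually tuned to trade off between the error types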

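11 Speaker segmentation (Fixed-size analysis window approach)
Use BIC as an inter-segment distance
Given two audio segments represented by feature vectors $X = \{x_1, \dots, x_{N_X}\}$ and $Y = \{y_1, \dots, y_{N_Y}\}$, these two segments can be judged as under the same or different acoustic conditions via the following hypothesis test, with each segment modeled by a Gaussian:
  - $H_0$: $Z = X \cup Y \sim N(\mu_Z, \Sigma_Z)$ (same acoustic condition)
  - $H_1$: $X \sim N(\mu_X, \Sigma_X)$, $Y \sim N(\mu_Y, \Sigma_Y)$ (different acoustic conditions)
  - $\Delta\mathrm{BIC} = \mathrm{BIC}(H_1) - \mathrm{BIC}(H_0) = \frac{N}{2}\log|\Sigma_Z| - \frac{N_X}{2}\log|\Sigma_X| - \frac{N_Y}{2}\log|\Sigma_Y| - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N$, where $N = N_X + N_Y$ and d is the feature dimension
X and Y are judged as from the same acoustic condition if ΔBIC < 0.
Ex: if X and Y are from different acoustic conditions, ΔBIC >= 0; if X and Y are from the same acoustic condition, ΔBIC < 0.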
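The ΔBIC test can be sketched in a few lines. To stay within the standard library, this version assumes diagonal-covariance Gaussians, so the penalty counts 2d extra parameters (d means and d variances) rather than the full-covariance count; the function names and the synthetic segments are illustrative assumptions.

```python
import math

def log_var_sum(frames, floor=1e-6):
    """Sum over dimensions of the log ML variance (log-determinant of a diagonal covariance)."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for k in range(d):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, floor))
    return total

def delta_bic(x, y, lam=1.0):
    """BIC(two Gaussians) - BIC(one Gaussian); >= 0 suggests different conditions."""
    z = x + y
    n, d = len(z), len(z[0])
    gain = 0.5 * (n * log_var_sum(z)
                  - len(x) * log_var_sum(x)
                  - len(y) * log_var_sum(y))
    penalty = 0.5 * lam * (2 * d) * math.log(n)  # 2d extra parameters (diagonal case)
    return gain - penalty

seg_a = [[0.1 * ((i * 7) % 5)] for i in range(20)]   # frames around 0.2
seg_b = [[5.0 + v[0]] for v in seg_a]                # same shape, shifted by 5
print(delta_bic(seg_a, seg_b) > 0)                   # True: different conditions
print(delta_bic(seg_a, [list(v) for v in seg_a]) < 0)  # True: same condition
```

Raising `lam` above 1 makes the test more reluctant to declare a change, which is the trade-off between error types mentioned for λ.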

12 Speaker segmentation (Variable-size analysis window approach)
Speaker segmentation using a variable-size analysis window
Bottom-up detection using BIC (S. Chen and P. Gopalakrishnan, 1998; M. Cettolo et al., 2005)
The bottom-up detection process on an audio stream: one-change-point detection is applied along the stream, locating one change point at a time, until the stream is divided into its segments (Seg 1 to Seg 4 in the figure).

13 Speaker segmentation (Variable-size analysis window approach)
One-change-point detection using BIC: at each feature vector of the analysis window, split the window into a left part X and a right part Y and calculate ΔBIC between them. The position with the largest ΔBIC is the candidate change point.
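One-change-point detection can be sketched as a scan over all split positions. The sketch assumes scalar (one-dimensional) features with one Gaussian per part, a hypothetical `margin` that keeps both parts long enough to estimate a variance, and the rule that a change is accepted only when the best ΔBIC is non-negative.

```python
import math

def _log_var(values, floor=1e-6):
    """Log of the ML variance of a list of scalars."""
    mean = sum(values) / len(values)
    return math.log(max(sum((v - mean) ** 2 for v in values) / len(values), floor))

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for scalar features; each Gaussian has 2 parameters (mean, variance)."""
    z = x + y
    gain = 0.5 * (len(z) * _log_var(z) - len(x) * _log_var(x) - len(y) * _log_var(y))
    return gain - 0.5 * lam * 2 * math.log(len(z))

def detect_one_change(frames, margin=5, lam=1.0):
    """Return the split with the largest Delta-BIC, or None if it is negative."""
    best_t, best_score = None, -math.inf
    for t in range(margin, len(frames) - margin + 1):
        score = delta_bic(frames[:t], frames[t:], lam)
        if score > best_score:
            best_t, best_score = t, score
    return best_t if best_score >= 0 else None

calm = [0.1 * ((i * 7) % 5) for i in range(30)]
print(detect_one_change(calm + [5.0 + v for v in calm]))  # 30
print(detect_one_change(calm + calm))                     # None
```

In a variable-size scheme the window would grow when `None` is returned and restart at the detected position otherwise; that outer loop is left out here.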

14 Speaker segmentation (Variable-size analysis window approach)
Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)
The top-down detection process for an audio stream: multiple-change detection considers the whole stream at once and locates all of its change points (Seg 1 to Seg 4 in the figure).

15 Speaker segmentation (Variable-size analysis window approach)
Multiple-change detection using BIC
Assumption: different segments arise from different Gaussian processes.
$H_0$, $H_1$, $H_2$, $H_3$: candidate segmentations of the audio stream X (shown in the figure)
Intuitively, the likelihoods satisfy pr(X|$H_0$) < pr(X|$H_1$) < pr(X|$H_2$) < pr(X|$H_3$), but the BIC values, which also penalize the number of parameters, satisfy BIC(X|$H_2$) > BIC(X|$H_3$) > BIC(X|$H_1$) > BIC(X|$H_0$).
Multiple-change detection: search for the H that has the largest BIC value in the solution space.
  - Exhaustive search

16 Speaker segmentation (Variable-size analysis window approach)
Top-down, hierarchical search (C. H. Wu and C. H. Hsieh, 2006)
  - A sub-optimal search
  - The stream X is split in successive passes (Pass 1, Pass 2, ...) until the search terminates
Dynamic programming (M. Cettolo et al., 2005)
  - An optimal search
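The dynamic-programming idea can be illustrated with a small sketch that finds the segmentation with the largest BIC. It is not the algorithm of Cettolo et al. as published: scalar features, one Gaussian per segment, a penalty of two parameters per segment, and the hypothetical `min_len` constraint are all simplifying assumptions.

```python
import math

def dp_segmentation(frames, min_len=5, lam=1.0, floor=1e-6):
    """Change points of the segmentation that maximizes the total BIC."""
    n = len(frames)
    # Prefix sums give the mean and variance of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in frames:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def segment_bic(a, b):
        m = b - a
        mean = (s1[b] - s1[a]) / m
        var = max((s2[b] - s2[a]) / m - mean * mean, floor)
        loglik = -0.5 * m * (math.log(2.0 * math.pi * var) + 1.0)
        return loglik - lam * math.log(n)  # 0.5 * lam * 2 parameters * log n

    best = [-math.inf] * (n + 1)  # best[b]: best BIC of frames[0:b]
    back = [0] * (n + 1)          # back[b]: start of the last segment
    best[0] = 0.0
    for b in range(min_len, n + 1):
        for a in range(0, b - min_len + 1):
            if best[a] == -math.inf:
                continue
            cand = best[a] + segment_bic(a, b)
            if cand > best[b]:
                best[b], back[b] = cand, a
    # Walk back through the stored segment starts to recover the change points.
    changes, b = [], n
    while back[b] > 0:
        b = back[b]
        changes.append(b)
    return sorted(changes)

base = [0.1 * ((i * 7) % 5) for i in range(20)]
stream = base + [5.0 + v for v in base] + base
print(dp_segmentation(stream))  # [20, 40]
```

The double loop costs O(N^2) segment evaluations, compared with the exponential number of hypotheses an exhaustive search would visit.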

17 Speaker clustering (Problem formulation) Problem formulation  given N speech utterances from P unknown speakers, partition these utterances into M clusters, such that M = P and each cluster consists exclusively of utterances from only one speaker

18 Speaker clustering (Problem formulation)
Cluster purity: the probability that, if we pick an utterance from a cluster twice at random, with replacement, both of the selected utterances are from the same speaker.
  - Purity of the m-th cluster: $\rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}$
  - Average purity: $\bar{\rho} = \frac{1}{N} \sum_{m=1}^{M} n_{m*}\,\rho_m$
  - P: total no. of speakers involved; M: total no. of clusters; N: total no. of utterances; $\rho_m$: purity of the m-th cluster; $n_{m*}$: no. of utterances in the m-th cluster; $n_{*p}$: no. of utterances from the p-th speaker; $n_{mp}$: no. of utterances in the m-th cluster that are from the p-th speaker
  - The purity increases as the number of clusters increases.
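The purity definition translates directly into a few lines of code. The function name and the toy labels are illustrative; the inputs are one cluster index and one true speaker label per utterance.

```python
from collections import Counter, defaultdict

def cluster_purities(cluster_ids, speaker_ids):
    """Per-cluster purity and the utterance-weighted average purity."""
    counts = defaultdict(Counter)            # counts[m][p] = n_mp
    for m, p in zip(cluster_ids, speaker_ids):
        counts[m][p] += 1
    purities, weighted = {}, 0.0
    for m, row in counts.items():
        n_m = sum(row.values())              # n_m*: size of cluster m
        purities[m] = sum(c * c for c in row.values()) / (n_m * n_m)
        weighted += n_m * purities[m]
    return purities, weighted / len(cluster_ids)

clusters = [0, 0, 0, 0, 1, 1]
speakers = ["A", "A", "A", "B", "B", "B"]
purities, average = cluster_purities(clusters, speakers)
print(purities, average)  # {0: 0.625, 1: 1.0} 0.75
```

Putting every utterance in its own cluster gives an average purity of 1.0, which shows why purity alone favors large numbers of clusters.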

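19 Speaker clustering (Problem formulation)
Rand index
Utterance counts by speaker (rows) and cluster (columns):

  speaker    cluster 1   cluster 2   …   cluster M   Sum
  1          n_11        n_21        …   n_M1        n_*1
  2          n_12        n_22        …   n_M2        n_*2
  …          …           …           …   …           …
  P          n_1P        n_2P        …   n_MP        n_*P
  Sum        n_1*        n_2*        …   n_M*        N

Two error types:
  - I: the number of utterance pairs (with replacement) in the same cluster but from different speakers
  - II: the number of utterance pairs (with replacement) from the same speaker but in different clusters
Type I error: $\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$, i.e., the number of utterance pairs in the same cluster minus the number of utterance pairs that are in the same cluster and from the same speaker.
Type II error: $\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$, i.e., the number of utterance pairs from the same speaker minus the number of utterance pairs from the same speaker that are in the same cluster.
Rand index: $R = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2\sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$ (Type I error + Type II error)
Reaches its minimum only when M = P.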
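The two error types and their sum can be computed from the same counts. This is a small sketch with an illustrative function name; pairs are counted with replacement, as in the definition above.

```python
from collections import Counter

def rand_index(cluster_ids, speaker_ids):
    """Return (type I error, type II error, Rand index) from utterance labels."""
    n_mp = Counter(zip(cluster_ids, speaker_ids))   # cell counts n_mp
    n_m = Counter(cluster_ids)                      # cluster sizes n_m*
    n_p = Counter(speaker_ids)                      # speaker totals n_*p
    same_both = sum(c * c for c in n_mp.values())   # same cluster and same speaker
    type1 = sum(c * c for c in n_m.values()) - same_both
    type2 = sum(c * c for c in n_p.values()) - same_both
    return type1, type2, type1 + type2

speakers = ["A", "A", "A", "B", "B", "B"]
print(rand_index([0, 0, 0, 0, 1, 1], speakers))  # (6, 4, 10)
print(rand_index([0, 0, 0, 1, 1, 1], speakers))  # (0, 0, 0)
```

Unlike purity, the index penalizes over-clustering through the type II term, so splitting a speaker across clusters raises the score.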

20 Speaker clustering (Hierarchical agglomerative clustering)
Hierarchical agglomerative clustering (S. Chen and P. Gopalakrishnan, 1998; Barras et al., 2006): starting from one cluster per utterance $X_1, X_2, \dots, X_N$, the two closest clusters are merged repeatedly.
  - Distance of two clusters: ΔBIC
  - Stopping criteria: local BIC, global BIC
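A minimal sketch of agglomerative clustering with ΔBIC as the inter-cluster distance follows. It assumes scalar features, one Gaussian per cluster, and a local-BIC style stop (merging ends when the closest pair has ΔBIC >= 0); the function names and toy segments are illustrative.

```python
import math

def _log_var(values, floor=1e-6):
    """Log of the ML variance of a list of scalars."""
    mean = sum(values) / len(values)
    return math.log(max(sum((v - mean) ** 2 for v in values) / len(values), floor))

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two clusters of scalar features (penalty uses their joint size)."""
    z = x + y
    gain = 0.5 * (len(z) * _log_var(z) - len(x) * _log_var(x) - len(y) * _log_var(y))
    return gain - 0.5 * lam * 2 * math.log(len(z))

def agglomerate(segments, lam=1.0):
    """Merge the closest pair until no pair has a negative Delta-BIC."""
    clusters = [([i], list(seg)) for i, seg in enumerate(segments)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = delta_bic(clusters[a][1], clusters[b][1], lam)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= 0:  # local stop: even the closest pair looks like two speakers
            break
        merged = (clusters[a][0] + clusters[b][0], clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(ids) for ids, _ in clusters)

low = [0.1 * ((i * 7) % 5) for i in range(10)]
high = [5.0 + v for v in low]
print(agglomerate([low, high, list(low), list(high)]))  # [[0, 2], [1, 3]]
```

A global-BIC stop would instead track the BIC of the whole partition after each merge and keep the partition with the largest value.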

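21 Speaker clustering (Optimization-oriented approaches)
Optimization-oriented approaches
Maximum purity clustering (W. H. Tsai et al., IEEE Trans. ASLP, 2007)
For a given number of clusters M and a set of cluster indices $H = [h_1, h_2, \dots, h_N]$ for N utterances $X_1, X_2, \dots, X_N$, the average cluster purity is
$\bar{\rho}(H) = \frac{1}{N} \sum_{m=1}^{M} \frac{1}{n_m} \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i, m)\,\delta(h_j, m)\,\delta(o_i, o_j)$
where $n_m$ is the number of utterances in the m-th cluster, $o_i$ is the true speaker index of utterance $X_i$ ($1 \le o_i \le P$), and $\delta(\cdot,\cdot)$ is the Kronecker delta.
  - $\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated.
  - $\delta(o_i, o_j)$ is approximated from the inter-utterance similarities, using the following quantities:
      $S(X_i, X_j)$: similarity between utterances $X_i$ and $X_j$
      $R[S(X_i, X_j)]$: rank of inter-utterance similarity $S(X_i, X_j)$ among $S(X_i, X_1), S(X_i, X_2), \dots, S(X_i, X_N)$ in descending order
      $\vartheta_i$: index of the utterance most similar to $X_i$, i.e., $R[S(X_i, X_{\vartheta_i})] = 2$ (rank 1 is $X_i$ itself)
[Figure: example for the m-th cluster with $n_m = 4$]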

22 Speaker clustering (Optimization-oriented approaches)
Maximum purity clustering (continued)
  - Let $\hat{\rho}(H)$ denote the estimated purity. Use a genetic algorithm to find H* such that $H^{*} = \arg\max_{H} \hat{\rho}(H)$.
  - Use BIC to determine the cluster number.
Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007)
  - The grouping of utterances and the determination of the number of groups are performed within a single optimization process.
  - $\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated.

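23 Speaker clustering (Optimization-oriented approaches)
Use a genetic algorithm to find H* such that $H^{*} = \arg\min_{H^{(M)}} \hat{R}(H^{(M)})$, where
$\hat{R}(H^{(M)}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)})\,\hat{\delta}(o_i, o_j)$
  - $\delta(o_i, o_j)$ is approximated by $\hat{\delta}(o_i, o_j)$, a normalized inter-utterance similarity.
  - $S_{max}$ is the maximum among the similarities $S(X_i, X_j)$, $\forall i \ne j$.
  - $S(X_i, X_j)$ is based on the generalized likelihood ratio.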

24 Two leading systems
LIMSI’s system (Barras et al., 2006)
Stages and annotations from the system diagram:
  - Speech detection: to remove only long regions without speech, such as silence, music, and noise, using GMMs
  - Fixed-size sliding window segmentation
  - Boundary refinement
  - BIC clustering: use ΔBIC to measure the inter-cluster similarity
  - SID clustering: use the cross-likelihood ratio to measure the inter-cluster similarity; $M_i$ is a MAP-adapted GMM
  - Silence filtering: to filter out short-duration silence segments that were not removed in the initial speech detection step
  - Boundary refinement: align the change boundaries to silence portions

25 Two leading systems
Cambridge’s system (Sinha et al., 2005)
  - SD: speech detection
  - CPD: change point detection
  - IAC: iterative agglomerative clustering
  - Speaker identification (SID) clustering: MAP adaptation (mean-only) is applied to each cluster from the appropriate gender/bandwidth UBM, and the cross likelihood ratio (CLR) is computed between any two given clusters

26 References
C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Trans. on Audio, Speech, and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.
NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/
R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” in Proc. INTERSPEECH, 2005.
S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Trans. on Audio, Speech, and Language Processing, Special Issue on Rich Transcription, 2006.
S. Chen and P. Gopalakrishnan, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Trans. on Audio, Speech, and Language Processing, 2006.
M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based Algorithms for Audio Segmentation,” Computer Speech and Language, 2005.
M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.
P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.
王駿發, 林博川, 王家慶, 宋豪靜, “A Novel Speaker Change Detection Algorithm Based on Support Vector Machines” (in Chinese), in Proc. ROCLING, 2005.

27 References
W.-H. Tsai, S.-S. Cheng, and H.-M. Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.
W.-H. Tsai and H.-M. Wang, “Speaker Clustering Based on Minimum Rand Index,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2007), April 2007.

28 Thank You
