Cross-channel Text-independent and Text-dependent Speaker Verification
Speaker: 張文杰
National Taipei University of Technology / National Central University
Outline
– Introduction
  – Speaker verification (SV)
– Resources and speaker recognition evaluation
  – Resources
  – NIST-SRE and ISCSLP'2006-SRE
  – Channel/handset/session mismatch problem
– Channel/handset mismatch compensation approaches
  – Robust features
  – Feature and score normalization
  – Feature and model compensation
– State-of-the-art systems
  – MIT Lincoln Laboratory
  – SRI International's Speech Technology and Research (STAR) Laboratory
  – National Taipei University of Technology
Speaker Verification (1/2)

Speaker Verification (2/2)
Text-independent and Text-dependent SVs
– Text-independent SV (TI-SV)
  – Does not rely on a specific text being spoken
  – More flexible
– Text-dependent SV (TD-SV)
  – A fixed word, or text-constrained speech
  – The system has prior knowledge of the text to be spoken
Software
– Hidden Markov Model Toolkit (HTK)
  – http://htk.eng.cam.ac.uk/
– System fusion by LNKnet (from MIT)
  – http://www.ll.mit.edu/SST/lnknet/
Corpora (1/2)
– Linguistic Data Consortium (LDC)
  – YOHO; KING; Switchboard I & II and NIST Evaluation Subsets; TIMIT; etc.
– European Language Resources Association (ELRA)
  – SIVA; PolyVar; POLYCOST
– Oregon Graduate Institute (OGI)
  – Speaker Recognition Corpus
Corpora (2/2)
– Oregon Graduate Institute (OGI) Speaker Recognition Corpus
  – Approximately 100 speakers (a future release may contain 600 speakers) calling from different telephone environments and at different times
  – Each speaker calls OGI's system 12 times over a 2-year period
  – Several different types of data were requested from each speaker to provide a corpus useful for vocabulary-dependent and vocabulary-independent speaker identification and verification systems
NIST-SRE (1/3)
1996 ~ 1999 Speaker Recognition Evaluation:
– Training
  – "One-session" training
  – "Two-session" training
  – "Two-handset" training
  – "Two-session-full" training
– Test
  – Test segment duration
  – Same/different handset
– 1999
  – One-speaker detection
  – Multi-speaker detection
  – Speaker tracking
NIST-SRE (2/3)
2000 ~ 2003 Speaker Recognition Evaluation:
– Speaker segmentation
– One-speaker detection
  – A test using Spanish-language data
  – A test using cellular telephone data
  – A test to explore idiolectal characteristics
  – Multi-modal data
– Speaker segmentation over various data sources
  – Telephone conversations
  – Broadcast news recordings
  – Recordings of meetings
NIST-SRE (3/3)
2004 ~ 2006 Speaker Recognition Evaluation:
– Speaker detection
– Several distinct training conditions
  – Single-channel excerpts
  – Single-channel conversation sides
  – Two-channel conversations
  – Summed-channel conversations
– Several distinct test conditions
  – Single-channel excerpts
  – Single-channel conversation sides
  – Two-channel conversations
  – Summed-channel conversations
  – Auxiliary microphone conversations
ISCSLP'2006-SRE
Chinese Corpus Consortium (CCC)
– TI-SV
  – Development data set: 300 male speakers' data
  – Evaluation data set: 800 registered speakers
  – Land-line and cellular-phone channels
  – 11,800 trials
– TD-SV
  – Development data set: 5 male and 5 female speakers' data
  – Evaluation data set: 591 registered speakers
  – Three microphone channels
  – 11,181 trials
Channel/handset/session mismatch problem
– Different channels impose different characteristics on the acoustic signal
– They distort the short-term distribution of the speech features
– Basic spectral features therefore become corrupted under mismatched conditions
Robust features (1/2)
Robust features (2/2)
Speaker-specific glottal and prosodic information for speaker verification
– Everyone has his own speaking style
  – Martin Luther King
  – John F. Kennedy
– Weakly sensitive to channel/handset
  – Martin Luther King again (channel distortion)
– However, it is affected by many non-speaker-specific factors
  – Phonetic content
  – Syntactic structure of the text
  – Speaking style (read/spontaneous speech; fast/normal/slow; etc.)
  – Emotion (neutral, happy, angry, etc.)
  – John F. Kennedy again (beginning of his speech)
– How can only the speaker-specific factors be reliably extracted for real-life speaker verification?
  – Usually, only limited training and test data are available
Feature-based normalization (1/5)
– Furui (IEEE Trans. ASSP, 1981)
  – Cepstral mean subtraction (CMS)
    – Subtracts the long-term average of handset-distorted cepstral features
    – The system must wait until the end of the speech
    – An accurate cepstral mean cannot be estimated when the utterance is short
    – May also remove the speaker's characteristics
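A stationary convolutive channel appears as a near-constant additive bias in the cepstral domain, so CMS can be sketched in a few lines (a minimal NumPy version; function and variable names are illustrative):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance long-term average from each cepstral
    dimension; a stationary convolutive channel shows up as an additive
    bias in the cepstral domain, so this cancels much of its effect.

    cepstra: (num_frames, num_coeffs) array of cepstral features.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Because the mean is taken over the whole utterance, all frames must be buffered first, which is the latency drawback noted above.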
Feature-based normalization (2/5)
– Hermansky (IEEE Trans. SAP, 1994)
  – RelAtive SpecTrA (RASTA)
    – Suppresses spectral components that change more slowly or more quickly than the typical rate of change of speech
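The commonly cited RASTA transfer function is H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - 0.98*z^-1), applied along time to each feature trajectory. A direct-form sketch (real front ends use optimized filtering routines; this loop is for illustration):

```python
import numpy as np

def rasta_filter(features):
    """Apply the classic RASTA band-pass filter along the time axis.

    features: (num_frames, num_dims) log-spectral or cepstral stream.
    The FIR numerator coefficients sum to zero, so near-constant
    (channel) components are suppressed, while the differencing also
    attenuates very fast, non-speech-like modulations.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator
    a = 0.98                                          # IIR pole
    x = np.asarray(features, dtype=float)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        fir = sum(b[k] * x[t - k] for k in range(5) if t - k >= 0)
        y[t] = fir + (a * y[t - 1] if t > 0 else 0.0)
    return y
```

Feeding in a constant stream shows the channel-suppression property: after the initial transient, the output decays toward zero.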
Feature-based normalization (3/5)
– Pelecanos (Speaker Odyssey Conf., 2001)
  – Feature warping
– B. Xing (ICASSP, 2002)
  – Short-time Gaussianization
– Yanlu (ICASSP, 2006)
  – Kurtosis normalization
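Feature warping maps each feature stream so that its empirical distribution matches a standard normal. A simplified full-utterance sketch (the published method warps over a sliding window of a few seconds; names here are illustrative):

```python
import numpy as np
from statistics import NormalDist

def feature_warp(stream):
    """Warp a 1-D feature stream to a standard-normal target.

    Each value is replaced by the normal quantile of its rank, so the
    warped stream's empirical CDF matches N(0, 1) while the ordering
    of the original values is preserved.
    """
    stream = np.asarray(stream, dtype=float)
    n = stream.size
    ranks = np.argsort(np.argsort(stream))      # 0 .. n-1
    nd = NormalDist()
    return np.array([nd.inv_cdf((r + 0.5) / n) for r in ranks])
```

Because only ranks matter, the mapping is insensitive to monotonic channel distortions of the feature values.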
Feature-based normalization (4/5)
– Torre (IEEE Trans. SAP, 2005)
  – Histogram equalization (HEQ)
Feature-based normalization (5/5)
– Chen (UWEETR, 2003)
  – MVA
    – M: mean subtraction eliminates the bias term due to noise
    – V: variance normalization counters the magnitude shrinkage
    – A: ARMA filtering is shown to further reduce the Euclidean distance between clean and noisy feature streams
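The three MVA steps chain naturally; a minimal sketch (the ARMA order and edge handling here are simplifications of the report):

```python
import numpy as np

def mva(features, order=2):
    """MVA post-processing: Mean subtraction, Variance normalization,
    then ARMA smoothing along the time axis.

    features: (num_frames, num_dims) feature stream.
    """
    c = features - features.mean(axis=0)         # M: remove additive bias
    c = c / (c.std(axis=0) + 1e-8)               # V: undo magnitude shrink
    out = c.copy()                               # A: ARMA low-pass filter
    for t in range(order, len(c) - order):
        past = out[t - order:t].sum(axis=0)      # already-filtered frames
        future = c[t:t + order + 1].sum(axis=0)  # raw current + lookahead
        out[t] = (past + future) / (2 * order + 1)
    return out
```

With order=0 the ARMA step is the identity, which makes the M and V steps easy to verify in isolation.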
Score-based normalization (1/2)
– Reynolds (EuroSpeech, 1997)
  – Zero normalization (Znorm)
    – Computes scores from a set of impostor speech segments to normalize a speaker model
– Reynolds (Digital Sign. Proc., 2000)
  – Handset normalization (Hnorm)
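In score space, Znorm reduces to a per-model standardization using impostor scores collected offline; a sketch (hypothetical helper name):

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Zero normalization: standardize a model's raw score using the
    mean and standard deviation of that model's scores on a set of
    impostor segments, estimated offline at enrollment time."""
    mu = float(np.mean(impostor_scores))
    sigma = float(np.std(impostor_scores))
    return (raw_score - mu) / sigma
```

Tnorm on the next slide applies the same standardization, but with the statistics taken over cohort-model scores on the test segment instead of over impostor segments.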
Score-based normalization (2/2)
– Auckenthaler (Digital Sign. Proc., 2000)
  – Test normalization (Tnorm)
    – Normalizes the claimed model's score using scores from a set of cohort models on the same test segment
– Sturim (ICASSP, 2005)
  – Adaptive Tnorm (ATnorm)
Feature-based compensation (1/2)
– Douglas (ICASSP, 2003)
  – Feature mapping
    – Maps feature vectors from different channels into a channel-independent feature space
– Zhonghua (ICSP, 2004)
  – Extended feature mapping (EFM)
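The full method learns per-Gaussian transforms between channel-dependent GMMs adapted from a channel-independent root model; a single-Gaussian sketch of the core mapping (illustrative names, heavily simplified):

```python
import numpy as np

def map_feature(x, mu_ch, sigma_ch, mu_ci, sigma_ci):
    """Map a feature vector from channel-dependent statistics
    (mu_ch, sigma_ch) into the channel-independent space
    (mu_ci, sigma_ci).  The real method picks the top-scoring
    Gaussian of a channel GMM and applies that component's
    shift-and-scale; here a single Gaussian stands in for it."""
    return (np.asarray(x) - mu_ch) / sigma_ch * sigma_ci + mu_ci
```

The mapping whitens the vector under the channel model and re-colors it under the channel-independent one, so features from different channels land in a common space.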
Feature-based compensation (2/2)
– Yang (ICSLP, 2004)
  – Maximum Likelihood A Priori Knowledge Interpolation (ML-AKI)
Model-based compensation (1/2)
– Teunen (ICSLP, 2000)
  – Speaker model synthesis (SMS)
    – Learns how model parameters change between different channels and applies this transform to synthesize speaker models for unseen channels
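At its simplest, the learned channel transform can be a mean shift between channel-dependent background models; a sketch of that special case (the published method also transforms mixture weights and variances):

```python
import numpy as np

def synthesize_means(spk_means_a, ubm_means_a, ubm_means_b):
    """Speaker model synthesis, mean-shift special case: move a
    speaker's Gaussian means trained on channel A by the offset
    observed between the channel-A and channel-B background models,
    yielding a synthetic channel-B speaker model."""
    return np.asarray(spk_means_a) + (np.asarray(ubm_means_b)
                                      - np.asarray(ubm_means_a))
```

The speaker's deviation from the background model is preserved; only the channel-dependent offset is replaced.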
Model-based compensation (2/2)
– Yang (ICSLP, 2004)
  – MLLR (ML-AKI)
MIT Lincoln Laboratory
– Spectral-based
  – RASTA, feature mapping, mean and variance normalization
  – GMM-UBM, Support Vector Machine (SVM)
  – Tnorm, ATnorm
– Prosodic-based
  – Pitch and energy GMM
  – Slope and duration N-gram
– Phonetic-based
  – Phone N-gram
  – Phone SVM
– Idiolectal-based
  – Word N-gram
– System fusion
  – LNKnet
SRI International's Speech Technology and Research (STAR) Laboratory
– Acoustic-feature-based systems
  – Cepstral GMM system
  – Cepstral SVM system
  – MLLR-transform SVM system
– Stylistic-feature-based systems
  – Word N-gram SVM system
  – SNERF system (uses a set of prosodic features)
  – Duration systems
    – State level
    – Word level
– System combination methods
  – Neural network combiner
  – SVM combiner
  – Class-dependent combiner
National Taipei University of Technology
Thank You!