Published by Rhoda Reeves
1
ALISP based improvement of GMM's for Text-independent Speaker Verification
Dijana Petrovska-Delacrétaz (1), Asmaa el Hannani (1), Gérard Chollet (2)
1: DIVA Group, University of Fribourg
2: GET-ENST, CNRS-LTCI, Paris
3-4 December 2003, Biometrics Tutorials, Uni. Fribourg
http://diuf.unifr.ch/diva
2
Biometrics, 3-4 Dec. 2003, Fribourg
Overview
1. Why segmental speaker verification systems?
2. Speech segmentation problems
3. Proposed segmental system based on DTW distance measure
4. Experimental setup
5. Results
6. Conclusions and perspectives
3
1 Why segmental speaker verification systems?
Current reference speaker verification systems are based on Gaussian Mixture Models (GMMs), in which each speech frame is treated independently.
Speech is composed of different sounds.
Phonemes have different discriminant characteristics for speaker verification (see Eatock et al. '94, J. Olsen '97, Petrovska et al. '98, 2000…).
Nasals and vowels convey more speaker characteristics than other speech classes; we would like to exploit this fact.
We need an automatic speech segmentation tool!
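To make the "each frame is treated independently" remark concrete, here is a minimal sketch of GMM scoring, assuming diagonal covariances and an average per-frame log-likelihood as the utterance score. The function name and the parameter layout are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gmm_avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    Each frame is scored on its own, so frame order has no effect.
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    log_weighted = np.log(weights) + log_comp

    # Numerically stable log-sum-exp over the mixture components.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(frame_ll.mean())
```

Because the score is a plain average over frames, shuffling the frames leaves it unchanged. Temporal structure is therefore invisible to such a model, which is the gap a segmental system tries to fill.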
4
1.1 Advantages and disadvantages of speech segmentation
Problems:
  A speech segmentation tool is needed.
  Speaker modeling per speech class requires more data.
  The systems become more complicated.
Advantages:
  It can be combined with dialogue-based systems, for which a speech segmentation is already done.
  It makes it possible to introduce text-prompted speaker verification, designed to include a maximum number of speaker-specific units.
5
2 Speech Segmentation
Large Vocabulary Continuous Speech Recognition (LVCSR) system:
  good results for a small set of languages
  needs a huge amount of annotated speech data
  language (and task) dependent
  we do not have such a system for American English
6
2.1 ALISP Speech Segmentation
Data-driven speech segmentation:
  not yet usable for speech recognition purposes
  no annotated databases needed
  language and task independent
  we could use it to segment the speech data for a text-independent speaker verification task
We will use the data-driven speech segmentation method ALISP (Automatic Language Independent Speech Processing).
7
2.2 ALISP principles
8
3 Proposed speaker verification system: ALISP segments and DTW
3.1 Segmentation problem
The speech data are segmented with N ALISP HMM models, N = 64 speech classes.
(Untranscribed) speech data are needed to train the 64 ALISP HMM models.
With so many speech classes there is not enough data for GMM adaptation, so the speaker modeling method has to change: use Dynamic Time Warping (DTW).
9
3.2 DTW distance measure for speaker verification
Dynamic Time Warping (DTW) has already been used for speaker verification in text-dependent mode (Rosenberg '76, Rabiner and Schafer '76, Furui '81, Pandit and Kittler '98…).
The DTW distance measure between two speech segments conveys speaker-specific characteristics.
Originality: DTW is used in text-independent mode.
We first segment the speech data into ALISP classes.
We then measure the "distance" between speaker and non-speaker segments.
Speaker-specific information is extracted from the ALISP-based speech segments:
  client speaker: ALISP-based speech segments => Client Dictionary
  non-speakers (world speakers): ALISP-based speech segments => World Dictionary
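The DTW distance between two segments can be sketched as below. The slides do not specify the local distance, the path constraints, or the normalization, so the Euclidean frame distance, the three symmetric steps, and the division by the summed lengths are all assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences of shape (T_a, D) and (T_b, D).

    Assumed setup: Euclidean local distance, steps (i-1, j), (i, j-1), (i-1, j-1),
    and normalization by T_a + T_b so that segments of different lengths compare.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Pairwise frame-to-frame distances, shape (n, m).
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] = cost of the best warping path aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_previous
    return float(acc[n, m] / (n + m))
```

A segment and a time-stretched copy of itself get distance zero under this definition, which is the property that makes DTW suitable for comparing realizations of the same ALISP class that differ in duration.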
10
3.3 Searching in the client and world speech dictionaries for speaker verification purposes
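Only the title of this slide survives in the transcript, so the following is one plausible reading of a dictionary search rather than the authors' procedure. It assumes that each test segment is compared only with dictionary segments of the same ALISP class, that the nearest neighbour is kept in each dictionary, and that the score is the mean difference between the world and client distances. All names are hypothetical, and the segment-level distance (for example a DTW distance) is passed in as a parameter.

```python
import numpy as np

def nearest_distance(segment, candidates, distance):
    """Smallest distance between one segment and a list of reference segments."""
    return min(distance(segment, reference) for reference in candidates)

def verification_score(test_segments, client_dict, world_dict, distance):
    """Score a test utterance against a client and a world dictionary.

    test_segments: list of (alisp_label, features) pairs.
    client_dict, world_dict: map an ALISP label to a list of reference segments.
    A larger score means the test segments lie closer to the client than to the world.
    """
    scores = []
    for label, features in test_segments:
        if label not in client_dict or label not in world_dict:
            continue  # assumption: skip classes without references on both sides
        d_client = nearest_distance(features, client_dict[label], distance)
        d_world = nearest_distance(features, world_dict[label], distance)
        scores.append(d_world - d_client)
    if not scores:
        raise ValueError("no test segment has references in both dictionaries")
    return float(np.mean(scores))
```

The score would then be compared with a decision threshold tuned on development data.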
11
4 Evaluation of the proposed system: experimental setup
Development data: one subset from the NIST 2002 cellular data (American English).
World speakers (60 female + 59 male): used to train the ALISP speech segmenter and to model the non-speakers (world speakers).
Evaluation: another subset from NIST 2002 (111 + 79 male speakers).
12
4.1 Speech segmentation example
Two other occurrences of the English phone ay; the corresponding ALISP sequences: HX-Hf and (HM)-Hf-Ha.
Previous slide: (Hf)-Ha or (HM)-HZ-Ha.
13
4.2 Results: GMM, ALISP-DTW systems and their fusion
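As an illustration of score-level fusion, the sketch below normalizes the scores of two systems and combines them linearly. The slides do not say which normalization or weights were used, so the zero-mean, unit-variance normalization against a reference score set and the `weight` parameter are assumptions. The LR fusion variant is not sketched here.

```python
import numpy as np

def z_normalize(scores, reference_scores):
    """Shift and scale scores using the mean and standard deviation of a reference set."""
    reference = np.asarray(reference_scores, dtype=float)
    return (np.asarray(scores, dtype=float) - reference.mean()) / reference.std()

def linear_fusion(scores_a, scores_b, weight=0.5):
    """Weighted sum of two score arrays; weight applies to system A, 1 - weight to system B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return weight * a + (1.0 - weight) * b
```

Normalizing first puts both systems on a comparable scale, so a single weight controls their relative influence. Both score sets must share the same orientation (larger means more client-like) before they are summed.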
14
4.3 Results: EER comparison
System                                   EER %
ALISP-DTW                                22.7
GMM                                      17.4
Linear fusion (no score normalization)   18.9
LR fusion (no score normalization)       13
LR fusion (normalized scores)            12.6
Linear fusion (normalized scores)        12.2
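For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it by sweeping the decision threshold over the observed scores. It assumes that larger scores mean "more likely the client", and it is not the evaluation tool used for the figures above.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER as a fraction in [0, 1] from client and impostor trial scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    best_gap, eer = np.inf, None
    # Every observed score is tried as a decision threshold (accept if score >= threshold).
    for threshold in np.unique(np.concatenate([target, impostor])):
        false_rejection = np.mean(target < threshold)
        false_acceptance = np.mean(impostor >= threshold)
        gap = abs(false_acceptance - false_rejection)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest and average them.
            best_gap, eer = gap, (false_acceptance + false_rejection) / 2.0
    return float(eer)
```

Multiply the returned fraction by 100 to compare it with the percentages in the table.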
15
4.4 Importance of fusion (33% improvement)
16
4.5 Using only the GMM scores on segments => segmental GMM system
17
5. Conclusions
State-of-the-art NIST 2002 results for EER: best 8% to worst 28%.
Fusion of a classical system with a segmental system: big improvements.
Why: the higher-level information present in the segmental system usefully complements the short-term frequency information present in the GMM system.