1
1 Speech course, 9 March 2005. Lecturers: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet. Speaker Recognition 1. Introduction, history, application domains 2. Cues to identity in speech 3. Speaker verification 3.1 Decision theory 3.2 Text-dependent / text-independent 4. Voice imposture 5. Audio-visual identity verification 6. Evaluations 7. Conclusions
2
2 Why should a computer recognize who is speaking? Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...) Limited access (secured areas, databases) Personalization (only respond to its master’s voice) Locate a particular person in an audio-visual document (information retrieval) Who is speaking in a meeting? Is a suspect the criminal? (forensic applications)
3
3 Tasks in Automatic Speaker Recognition Speaker verification (Voice Biometrics): Are you really who you claim to be? Identification (Speaker ID): Is this speech segment coming from a known speaker? How large is the set of speakers (population of the world)? Speaker detection, segmentation, indexing, retrieval, tracking: Looking for recordings of a particular speaker Combining Speech and Speaker Recognition Adaptation to a new speaker, speaker typology Personalization in dialogue systems
4
4 Applications Access Control Physical facilities, Computer networks, Websites Transaction Authentication Telephone banking, e-Commerce Speech data Management Voice messaging, Search engines Law Enforcement Forensics, Home incarceration
5
5 Voice Biometrics Advantages: Often the only modality over the telephone, Low cost (microphone, A/D), Ubiquity, Possible integration on a smart (SIM) card, Natural bimodal fusion: speaking face. Disadvantages: Lack of discretion, Possibility of imitation and electronic imposture, Lack of robustness to noise, distortion, ..., Temporal drift
6
6 Speaker Identity in Speech Differences in vocal tract shapes and muscular control Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child) Glottal waveform Phonotactics Lexical usage The difference between the voices of twins is a limiting case Voices can also be imitated or disguised
7
7 Speaker Identity [Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes f and A] Segmental factors (~30 ms): glottal excitation (fundamental frequency, amplitude, voice quality, e.g. breathiness); vocal tract, characterized by its transfer function and represented by MFCCs (Mel-Frequency Cepstral Coefficients). Suprasegmental factors: speaking speed (timing and rhythm of speech units); intonation patterns; dialect, accent, pronunciation habits
8
8 What are the sources of difficulty? Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, ...) Recording conditions (filtering, noise, ...) Channel mismatch between enrolment and testing Temporal drift Intentional imposture Voice disguise
9
9 Acoustic features Short term spectral analysis
10
10 Intra- and Inter-speaker variability
11
11 Speaker Verification Typology of approaches (EAGLES Handbook) Text dependent Public password Private password Customized password Text prompted Text independent Incremental enrolment Evaluation
12
12 History of Speaker Recognition
13
13 Current approaches
14
14 Dynamic Time Warping (DTW) Best path “Bonjour” test speaker Y “Bonjour” speaker X “Bonjour” speaker 1 “Bonjour” speaker 2 “Bonjour” speaker n DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
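The template-matching idea on this slide can be illustrated with a minimal sketch of DTW, assuming each utterance is a list of feature vectors (for example MFCC frames) and using a Euclidean frame distance with the three classical step directions. The function names and the length normalization are illustrative choices, not taken from the cited systems.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def dtw_distance(test, reference):
    """Cumulative cost of the best warping path between two frame sequences."""
    n, m = len(test), len(reference)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost aligning test[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], reference[j - 1])
            # Allowed predecessors: vertical, horizontal and diagonal steps
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def normalized_dtw(test, reference):
    """DTW cost divided by the total number of frames (one possible normalization)."""
    return dtw_distance(test, reference) / (len(test) + len(reference))
```

In a verification setting, the normalized cost between the test utterance and the claimed speaker's template would be compared with a threshold; in identification, the template with the lowest cost would be selected.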
15
15 Vector Quantization (VQ) Best quantization Speaker 1 codebook Speaker 2 codebook Speaker n codebook “Bonjour” test speaker Y Speaker X codebook SOONG, ROSENBERG 1987
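As a rough sketch of the codebook approach, assume each speaker is represented by a small list of codewords (centroids) and that a test utterance is scored by its average quantization distortion against each codebook. Codebook training (for example by k-means) is left out, and all names here are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_distortion(frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    total = 0.0
    for frame in frames:
        total += min(squared_distance(frame, codeword) for codeword in codebook)
    return total / len(frames)


def identify_speaker(frames, codebooks):
    """Return the speaker whose codebook quantizes the frames with least distortion.

    codebooks maps a speaker label to a list of codewords.
    """
    return min(codebooks, key=lambda spk: vq_distortion(frames, codebooks[spk]))
```

For verification rather than identification, the distortion against the claimed speaker's codebook would be compared with a threshold instead of taking the minimum over all speakers.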
16
16 Hidden Markov Models (HMM) Best path “Bonjour” speaker 1 “Bonjour” speaker 2 “Bonjour” speaker n “Bonjour” test speaker Y “Bonjour” speaker X ROSENBERG 1990, TSENG 1992
17
17 Ergodic HMM Best path Speaker 1 HMM Speaker 2 HMM Speaker n HMM “Bonjour” test speaker Y Speaker X HMM PORITZ 1982, SAVIC 1990
18
18 Gaussian Mixture Models (GMM) REYNOLDS 1995
19
19 HMM structure depends on the application
20
20 Some issues in Text-dependent Speaker Verification Systems : The CAVE and PICASSO projects Sequences of digits Speaker independent HMM of each digit Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment) EER of less than 1 % can be achieved Customized password The client chooses his password using some feedback from the system Deliberate imposture
21
21 Gaussian Mixture Model Parametric representation of the probability distribution of observations:
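The equation on this slide did not survive extraction. For reference, the standard form of a Gaussian mixture density is given below; the symbols ($M$ components, weights $w_i$, means $\mu_i$, covariances $\Sigma_i$, feature dimension $D$, model $\lambda$) are conventional notation assumed here, not recovered from the slide.

```latex
\[
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0
\]
\[
\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
\]
```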
22
22 Gaussian Mixture Models 8 Gaussians per mixture
23
23 GMM speaker modeling [Diagram: front-end -> GMM modeling -> world GMM model; front-end -> GMM model adaptation -> target GMM model]
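The slide does not say which adaptation scheme derives the target model from the world model. One common choice in the literature is MAP adaptation of the component means only, controlled by a relevance factor. The sketch below assumes that scheme, diagonal covariances, and a GMM stored as a list of `(weight, mean, variance)` tuples; all of these are assumptions for illustration.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, gmm):
    """Posterior probability of each mixture component given frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(world, frames, relevance=16.0):
    """Return a target GMM whose means are MAP-adapted from the world GMM.

    Weights and variances are copied unchanged (assumed, mean-only adaptation).
    """
    dim = len(world[0][1])
    counts = [0.0] * len(world)
    sums = [[0.0] * dim for _ in world]
    for x in frames:
        for i, gamma in enumerate(responsibilities(x, world)):
            counts[i] += gamma
            for d in range(dim):
                sums[i][d] += gamma * x[d]
    target = []
    for (w, mean, var), n, s in zip(world, counts, sums):
        # Equivalent to alpha * E[x] + (1 - alpha) * mean with alpha = n / (n + relevance)
        new_mean = [(s[d] + relevance * mean[d]) / (n + relevance) for d in range(dim)]
        target.append((w, new_mean, list(var)))
    return target
```

With little enrolment data the adapted means stay close to the world model, and they move toward the client's data as more frames are assigned to a component.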
24
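24 Baseline GMM method [Diagram: test speech -> front-end -> hypothesized target GMM model and world GMM model -> LLR score] LLR score = $\log p(X \mid \lambda_{\text{target}}) - \log p(X \mid \lambda_{\text{world}})$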
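A minimal sketch of the log-likelihood ratio (LLR) score, assuming diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples and a score averaged over the test frames. The averaging and the data layout are assumptions, since the slide's equation was not recovered.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm), computed with the log-sum-exp trick."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio of target model versus world model."""
    total = sum(
        gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
        for x in frames
    )
    return total / len(frames)
```

A positive score means the test speech is better explained by the hypothesized target model than by the world model; the accept/reject decision then compares the score with a threshold.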
25
25 Decision theory for identity verification Two types of errors: false rejection (a client is rejected) and false acceptance (an impostor is accepted). Decision theory: given an observation $O$ and a claimed identity, hypothesis $H_0$: it comes from an impostor; hypothesis $H_1$: it comes from our client. $H_1$ is chosen if and only if $P(H_1 \mid O) > P(H_0 \mid O)$, which can be rewritten (using Bayes' law) as $\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$
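By Bayes' law, comparing the two posteriors amounts to comparing the likelihood ratio with the ratio of the priors, so in the log domain the client is accepted when the log-likelihood ratio exceeds log(P(H0) / P(H1)). The sketch below shows that rule; the function names and the prior parameter are illustrative assumptions, and unequal error costs are ignored.

```python
import math


def bayes_threshold(p_client):
    """Log-domain threshold log(P(H0) / P(H1)) for a prior client probability."""
    return math.log((1.0 - p_client) / p_client)


def accept(llr, p_client=0.5):
    """Accept the identity claim when the log-likelihood ratio exceeds the threshold."""
    return llr > bayes_threshold(p_client)
```

With equal priors the threshold is zero; when impostor attempts are assumed to be much more frequent than client accesses, the threshold rises and a stronger score is needed to accept the claim.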
26
26 Signal detection theory
27
27 Decision
28
28 Distribution of scores
29
29 Detection Error Tradeoff (DET) Curve
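A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold sweeps over the score range, usually on normal-deviate axes. The sketch below computes the underlying operating points and an approximate equal error rate (EER) from two lists of scores. It assumes that higher scores mean "more likely the client" and that a claim is accepted when the score is at or above the threshold.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    false_reject = sum(s < threshold for s in client_scores) / len(client_scores)
    false_accept = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return false_accept, false_reject


def det_points(client_scores, impostor_scores):
    """Operating points (threshold, FA, FR) at every observed score value."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [
        (t,) + error_rates(client_scores, impostor_scores, t) for t in thresholds
    ]


def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: the point where FA and FR are closest, and its threshold."""
    t, fa, fr = min(
        det_points(client_scores, impostor_scores), key=lambda p: abs(p[1] - p[2])
    )
    return (fa + fr) / 2.0, t
```

With few trials the FA and FR curves may not cross exactly at an observed score, which is why the function returns the midpoint of the closest pair rather than an interpolated value.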
30
30 Evaluation Decision cost (FA, FR, priors, costs, ...) Receiver Operating Characteristic Curve Reference systems (open software) Evaluations (algorithms, field trials, ergonomics, ...)
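The decision cost named on this slide combines the two error rates with the prior probability of a client access and the cost of each error type. A minimal sketch of the usual weighted form follows; the default cost values are illustrative placeholders, not figures taken from the slides.

```python
def detection_cost(p_fr, p_fa, p_target, c_fr=10.0, c_fa=1.0):
    """Expected cost of a system at one operating point.

    p_fr: false rejection rate, p_fa: false acceptance rate,
    p_target: prior probability that a trial comes from the true client,
    c_fr / c_fa: costs of a false rejection / false acceptance (assumed values).
    """
    return c_fr * p_fr * p_target + c_fa * p_fa * (1.0 - p_target)
```

For example, with 10 % false rejections, 2 % false acceptances and a 1 % client prior, `detection_cost(0.1, 0.02, 0.01)` evaluates to about 0.0298; sweeping the threshold and keeping the lowest value gives the minimum cost for that system.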
31
31 NIST Speaker Verification Evaluations A reference standard to compare algorithms and stimulate new developments. Distribution (via LDC) of development and test databases with: increasing difficulty (from land line to mobile); several hundred speakers (2 min of training data per client); several thousand test accesses (5 to 50 s per access). Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ, ELISA consortium, ...). Annual workshop, special issues in journals, ...
32
32 National Institute of Standards & Technology (NIST) Speaker Verification Evaluations Annual evaluation since 1995 Common paradigm for comparing technologies
33
33 Speaker Verification (text independent) The ELISA consortium ENST, LIA, IRISA,... http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html BECARS : Balamand-ENST CEDRE Automatic Recognition of Speakers NIST evaluations http://www.nist.gov/speech/tests/spk/index.htm
34
34 NIST evaluations : Results
35
35 Evaluations: NIST 2004
36
36 Combining Speech Recognition and Speaker Verification. Speaker independent phone HMMs Selection of segments or segment classes which are speaker specific Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
37
37 ALISP : Automatic Language Independent Speech Processing Data-driven speech segmentation
38
38 Searching in client and world speech dictionaries for speaker verification purposes
39
39 Fusion
40
40 Fusion results
41
41 Voice Transformations and Forgery (occasional, dedicated). Isolated individuals with few resources, or “professional impostors” with a dedicated budget, can threaten the security of speaker recognition systems. Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available. Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures. Prevention by predicting many different forgery scenarios.
42
42 Voice Forgery using ALISP [Diagram: impostor (the same words or not) -> transformation -> client (the same words or not)] Transformation: a modification of a source speaker's speech to imitate a target speaker
43
43 Conversion system: ALISP encoder Speech MFCC analysis HNM HMM recognition Harmonic envelope Symbol index - Representative index - DTW path Choice of the best representative unit Prosody (energy+pitch) MFCC + delta Database of HNM Representatives HMM models Noise envelope
44
44 Conversion system: ALISP Decoder Concatenation of HNM parameters for each representative HNM Synthesis Speech signal Symbol index Pitch, energy, timing Representative index DTW path
45
45 Preliminary results: DET curves False acceptance (FA) rate before forgery: 16 ± 2.0 % (1700 files) False acceptance (FA) rate after forgery: 26 ± 2.0 % (1700 files)
46
46 Preliminary results True distributions
47
47 Multimodal Identity Verification M2VTS (face and speech) front view and profile pseudo-3D with coherent light BIOMET: (face, speech, fingerprint, signature, hand shape) data collection reuse of the M2VTS and DAVID data bases experiments on the fusion of modalities
48
48 Speaking Faces: Motivations In many situations a video sequence is acquired Fusion of face and speech increases robustness Forgery is more difficult
49
49 Talking Face Recognition (hybrid verification)
50
50 Lip features Tracking lip movements
51
51 A talking face model Using Hidden Markov Models (HMMs) Acoustic parameters Visual parameters
52
52 Imposture Model
53
53 Cloning
54
54 Conclusions, Perspectives Deliberate imposture is a challenge for speech-only systems Verification of identity based on features extracted from talking faces should be developed Common databases and evaluation protocols are necessary Free access to reference systems will facilitate future developments
55
55 BioSecure Residential Workshop Aug. 1st - 26th, 2005 at ENST, Paris Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, ... Comparative evaluations on large databases (BIOMET, BANCA, FVC, ...) Fusion of modalities http://www.biosecure.info