Speech course of 9 March 2005. Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet. Speaker Recognition

1 Speech course of 9 March 2005
Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet
Speaker Recognition
1. Introduction, history, application domains
2. Cues to identity in speech
3. Speaker verification
   1. Decision theory
   2. Text-dependent / text-independent
4. Voice imposture
5. Audio-visual identity verification
6. Evaluations
7. Conclusions

2 Why should a computer recognize who is speaking?
- Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...)
- Limited access (secured areas, databases)
- Personalization (only respond to its master's voice)
- Locate a particular person in an audio-visual document (information retrieval)
- Who is speaking in a meeting?
- Is a suspect the criminal? (forensic applications)

3 Tasks in Automatic Speaker Recognition
- Speaker verification (voice biometrics)
  - Are you really who you claim to be?
- Identification (speaker ID)
  - Is this speech segment coming from a known speaker?
  - How large is the set of speakers (population of the world)?
- Speaker detection, segmentation, indexing, retrieval, tracking
  - Looking for recordings of a particular speaker
- Combining speech and speaker recognition
  - Adaptation to a new speaker, speaker typology
  - Personalization in dialogue systems

4 Applications
- Access control
  - Physical facilities, computer networks, websites
- Transaction authentication
  - Telephone banking, e-commerce
- Speech data management
  - Voice messaging, search engines
- Law enforcement
  - Forensics, home incarceration

5 Voice Biometric
- Advantages
  - Often the only modality available over the telephone
  - Low cost (microphone, A/D), ubiquity
  - Possible integration on a smart (SIM) card
  - Natural bimodal fusion: speaking face
- Disadvantages
  - Lack of discretion
  - Possibility of imitation and electronic imposture
  - Lack of robustness to noise, distortion, ...
  - Temporal drift

6 Speaker Identity in Speech
- Differences in
  - Vocal tract shapes and muscular control
  - Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  - Glottal waveform
  - Phonotactics
  - Lexical usage
- The difference between the voices of twins is a limiting case
- Voices can also be imitated or disguised

7 Speaker Identity
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes labeled f and A.]
- Segmental factors (~30 ms)
  - Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  - Vocal tract: characterized by its transfer function and represented by MFCCs (Mel-frequency cepstral coefficients)
- Suprasegmental factors
  - Speaking speed (timing and rhythm of speech units)
  - Intonation patterns
  - Dialect, accent, pronunciation habits

8 What are the sources of difficulty?
- Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, ...)
- Recording conditions (filtering, noise, ...)
- Channel mismatch between enrolment and testing
- Temporal drift
- Intentional imposture
- Voice disguise

9 Acoustic features
Short-term spectral analysis

10 Intra- and inter-speaker variability

11 Speaker Verification
- Typology of approaches (EAGLES Handbook)
  - Text dependent
    - Public password
    - Private password
    - Customized password
  - Text prompted
  - Text independent
- Incremental enrolment
- Evaluation

12 History of Speaker Recognition

13 Current approaches

14 Dynamic Time Warping (DTW)
[Figure: best-path alignment of the test utterance "Bonjour" (test speaker Y) against reference utterances of "Bonjour" from speakers 1, 2, ..., n, X.]
References: Doddington 1974, Rosenberg 1976, Furui 1981, etc.
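The best-path idea can be sketched in a few lines. This is a minimal illustration, not the formulation used in the cited papers: the Euclidean local distance, the three allowed steps, the length normalisation, and the names `dtw_distance` and `identify` are all assumptions made for the example.

```python
import numpy as np

def dtw_distance(test, ref):
    """Length-normalised DTW cost between two feature sequences (frames x dims)."""
    test = np.asarray(test, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n, m = len(test), len(ref)
    # Local cost: Euclidean distance between every test frame and every reference frame.
    local = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
    # acc[i, j] = cost of the best path aligning the first i test frames
    # with the first j reference frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_previous
    return acc[n, m] / (n + m)

def identify(test, references):
    """Return the name of the reference utterance closest to the test utterance."""
    return min(references, key=lambda name: dtw_distance(test, references[name]))
```

For verification rather than identification, the same distance to the claimed speaker's template would be compared with a threshold.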

15 Vector Quantization (VQ)
[Figure: the test utterance "Bonjour" (test speaker Y) is quantized with the codebooks of speakers 1, 2, ..., n, X; the best quantization is retained.]
Reference: Soong, Rosenberg 1987
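A rough sketch of codebook-based scoring follows. It assumes plain k-means training and average Euclidean distortion as the score; the cited work may use a different training procedure (LBG splitting is a common choice), and the helper names `train_codebook` and `avg_distortion` are hypothetical.

```python
import numpy as np

def train_codebook(frames, size, iters=20, seed=0):
    """Build a speaker codebook with plain k-means (an assumption of this sketch)."""
    frames = np.asarray(frames, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the codewords with randomly chosen training frames.
    codebook = frames[rng.choice(len(frames), size=size, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(size):
            members = frames[nearest == k]
            if len(members):  # keep the old codeword if its cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def avg_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword (lower is a better match)."""
    frames = np.asarray(frames, dtype=float)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```

The test utterance is attributed to the speaker whose codebook yields the lowest average distortion; no time alignment is needed, which is why the approach also suits text-independent operation.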

16 Hidden Markov Models (HMM)
[Figure: best-path scoring of the test utterance "Bonjour" (test speaker Y) against "Bonjour" models of speakers 1, 2, ..., n, X.]
References: Rosenberg 1990, Tseng 1992

17 Ergodic HMM
[Figure: best-path scoring of the test utterance "Bonjour" (test speaker Y) against the HMMs of speakers 1, 2, ..., n, X.]
References: Poritz 1982, Savic 1990

18 Gaussian Mixture Models (GMM)
Reference: Reynolds 1995

19 HMM structure depends on the application

20 Some issues in Text-dependent Speaker Verification Systems: the CAVE and PICASSO projects
- Sequences of digits
  - Speaker-independent HMM of each digit
  - Adaptation of these HMMs to the client's voice (during enrolment and incremental enrolment)
  - An EER of less than 1 % can be achieved
- Customized password
  - The client chooses his password using some feedback from the system
- Deliberate imposture

21 Gaussian Mixture Model
Parametric representation of the probability distribution of observations:
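The formula on this slide did not survive extraction. The standard form of a Gaussian mixture density is given here for reference; the symbols (M components, weights w_i, means \mu_i, covariances \Sigma_i, feature dimension D) are the usual textbook notation and are assumed rather than taken from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,

\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\Big( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \Big)
```

The speaker model is the parameter set \lambda = \{ w_i, \mu_i, \Sigma_i \}_{i=1}^{M}.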

22 Gaussian Mixture Models
8 Gaussians per mixture

23 GMM speaker modeling
[Figure: two pipelines. Front-end -> GMM modeling -> world GMM model; front-end -> GMM model adaptation -> target GMM model.]

24 Baseline GMM method
[Figure: block diagram. Test speech -> front-end -> scoring against the hypothesized target GMM model and the world GMM model -> LLR score.]
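The expression for the LLR score is missing from the extracted slide. A common definition, assumed here, is the average per-frame log-likelihood under the target model minus the same quantity under the world model. The sketch below uses diagonal covariances for simplicity, and `gmm_loglik` and `llr_score` are hypothetical names.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    x = np.asarray(frames, dtype=float)[:, None, :]        # (T, 1, D)
    mu = np.asarray(means, dtype=float)[None, :, :]        # (1, M, D)
    var = np.asarray(variances, dtype=float)[None, :, :]   # (1, M, D)
    # Log-density of each frame under each Gaussian component: shape (T, M).
    log_comp = -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=2)
    log_comp = log_comp + np.log(np.asarray(weights, dtype=float))[None, :]
    # Log-sum-exp over components, computed stably.
    top = log_comp.max(axis=1, keepdims=True)
    per_frame = top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))
    return per_frame.mean()

def llr_score(frames, target, world):
    """LLR between the target and world models; each model is (weights, means, variances)."""
    return gmm_loglik(frames, *target) - gmm_loglik(frames, *world)
```

The claimed identity would be accepted when the score exceeds a threshold tuned on development data.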

25 Decision theory for identity verification
- Two types of errors:
  - False rejection (a client is rejected)
  - False acceptance (an impostor is accepted)
- Decision theory: given an observation O and a claimed identity
  - H0 hypothesis: it comes from an impostor
  - H1 hypothesis: it comes from our client
- H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as a threshold test on the likelihood ratio P(O|H1) / P(O|H0)
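The rewritten form of the rule is not present in the extracted text. The derivation below follows directly from Bayes' law, using the slide's own symbols O, H_0 and H_1; only the layout is new.

```latex
P(H_1 \mid O) > P(H_0 \mid O)
\;\Longleftrightarrow\;
\frac{P(O \mid H_1)\, P(H_1)}{P(O)} > \frac{P(O \mid H_0)\, P(H_0)}{P(O)}
\;\Longleftrightarrow\;
\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}
```

The left-hand side is the likelihood ratio; the right-hand side is a threshold set by the prior probabilities of an impostor and of a genuine client. When the two error types carry different costs, the threshold is scaled by the ratio of those costs.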

26 Signal detection theory

27 Decision

28 Distribution of scores

29 Detection Error Tradeoff (DET) Curve

30 Evaluation
- Decision cost (FA, FR, priors, costs, ...)
- Receiver Operating Characteristic curve
- Reference systems (open software)
- Evaluations (algorithms, field trials, ergonomy, ...)
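The error rates and the decision cost can be computed from client and impostor scores as sketched below. The cost is the usual weighted sum of the two error rates; the default prior and cost values are illustrative assumptions, not figures from the slides, and the function names are hypothetical.

```python
import numpy as np

def error_rates(client_scores, impostor_scores, threshold):
    """Return (FA, FR) when accepting every score greater than or equal to the threshold."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    fr = float(np.mean(client < threshold))      # clients wrongly rejected
    fa = float(np.mean(impostor >= threshold))   # impostors wrongly accepted
    return fa, fr

def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER at the observed score where FA and FR are closest."""
    thresholds = np.unique(np.concatenate([client_scores, impostor_scores]))
    rates = [error_rates(client_scores, impostor_scores, t) for t in thresholds]
    fa, fr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (fa + fr) / 2.0

def detection_cost(fa, fr, p_target=0.01, c_fa=1.0, c_fr=10.0):
    """Expected cost of a decision; priors and costs here are placeholder values."""
    return c_fr * p_target * fr + c_fa * (1.0 - p_target) * fa
```

Sweeping the threshold over all observed scores and plotting FR against FA gives the points of a DET curve (conventionally drawn on normal-deviate axes) or, with acceptance rate instead of FR, of a ROC curve.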

31 NIST Speaker Verification Evaluations
- A reference standard to compare algorithms and stimulate new developments
- Distribution (via LDC) of development and test databases with:
  - Increasing difficulty (from land line to mobile)
  - Several hundred speakers (2 min of training data per client)
  - Several thousand test accesses (5 to 50 s per access)
- Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ, ELISA consortium, ...)
- Annual workshop, special issues in journals, ...

32 National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
- Annual evaluation since 1995
- Common paradigm for comparing technologies

33 Speaker Verification (text independent)
- The ELISA consortium
  - ENST, LIA, IRISA, ...
  - http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
- BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers
- NIST evaluations
  - http://www.nist.gov/speech/tests/spk/index.htm

34 NIST evaluations: Results

35 Evaluations: NIST 2004

36 Combining Speech Recognition and Speaker Verification
- Speaker-independent phone HMMs
- Selection of segments or segment classes which are speaker specific
- Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

37 ALISP: Automatic Language Independent Speech Processing
Data-driven speech segmentation

38 Searching in client and world speech dictionaries for speaker verification purposes

39 Fusion

40 Fusion results

41 Voice Transformations and Forgery (occasional, dedicated)
- Isolated individuals with few resources, or "professional impostors" with a dedicated budget, can threaten the security of speaker recognition systems
- Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
- Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
  - Prevention by predicting many different forgery scenarios

42 Voice Forgery using ALISP
[Figure: impostor speech (the same words or not) and client speech (the same words or not) feed a transformation block.]
A modification of a source speaker's speech to imitate a target speaker

43 Conversion system: ALISP encoder
[Figure: block diagram. Labels: speech; MFCC analysis (MFCC + delta); HMM recognition; HMM models; HNM; harmonic envelope; noise envelope; prosody (energy + pitch); database of HNM representatives; choice of the best representative unit. Outputs: symbol index, representative index, DTW path.]

44 Conversion system: ALISP decoder
[Figure: block diagram. Inputs: symbol index; representative index; DTW path; pitch, energy, timing. Blocks: concatenation of HNM parameters for each representative; HNM synthesis. Output: speech signal.]

45 Preliminary results: DET curves
- False acceptance (FA) before forgery: 16 ± 2.0 % (1700 files)
- False acceptance (FA) after forgery: 26 ± 2.0 % (1700 files)
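The slides do not say how the ± 2.0 % margins were obtained. As a plausibility check only, the sketch below computes the half-width of a 95 % normal-approximation interval for a proportion measured on 1700 trials; treating each file as an independent trial and using z = 1.96 are assumptions of this example.

```python
import math

def binomial_half_width(rate, n_trials, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(rate * (1.0 - rate) / n_trials)

before = binomial_half_width(0.16, 1700)  # FA rate before forgery
after = binomial_half_width(0.26, 1700)   # FA rate after forgery
print(f"before forgery: +/- {100 * before:.1f} points")
print(f"after forgery:  +/- {100 * after:.1f} points")
```

Both half-widths come out close to two percentage points, so under these assumptions the rise from 16 % to 26 % is well outside the sampling uncertainty.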

46 Preliminary results
True distributions

47 Multimodal Identity Verification
- M2VTS (face and speech)
  - Front view and profile
  - Pseudo-3D with coherent light
- BIOMET (face, speech, fingerprint, signature, hand shape)
  - Data collection
  - Reuse of the M2VTS and DAVID databases
  - Experiments on the fusion of modalities

48 Speaking Faces: Motivations
- In many situations a video sequence is acquired
- Fusion of face and speech increases robustness
- Forgery is more difficult

49 Talking Face Recognition (hybrid verification)

50 Lip features
Tracking lip movements

51 A talking face model
Using Hidden Markov Models (HMMs): acoustic parameters, visual parameters

52 Imposture Model

53 Cloning

54 Conclusions, Perspectives
- Deliberate imposture is a challenge for speech-only systems
- Verification of identity based on features extracted from talking faces should be developed
- Common databases and evaluation protocols are necessary
- Free access to reference systems will facilitate future developments

55 BioSecure Residential Workshop
- August 1-26, 2005 at ENST, Paris
- Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, ...
- Comparative evaluations on large databases (BIOMET, BANCA, FVC, ...)
- Fusion of modalities
- http://www.biosecure.info

