A Talking Elevator, WS2006 UdS, Speaker Recognition 1.



2 A Talking Elevator: An introduction to the main concepts of speaker recognition. © Jacques Koreman, NTNU

3 Q: What is the problem? A: There are two types of biometrics:
◘ behavioral
◘ physical

4 Q: What causes the problem? A: Variability:
◘ repetitions
◘ sessions
◘ channel
◘ background noise
(Variability across speakers is good; variability within speakers is not.)

5 Q: How does variability affect
◘ speech recognition?
◘ speaker recognition?
A: The structure of the acoustic space enhances decoding of the linguistic content of a message.
[Figure: schematic representation of the distribution of phones (fill colors) and speakers (border colors) in the acoustic space]

6 Q: What is the difference between
◘ speaker identification and
◘ speaker verification?
A: (Closed-set) speaker identification selects the most likely speaker from a given set. Speaker verification is concerned with ascertaining a claimed identity. Open-set speaker identification combines the two.
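The two tasks can be sketched in a few lines. This is a minimal illustration, assuming each enrolled speaker's model has already produced a log-likelihood score for the test utterance; the speaker names, scores and threshold are made up.

```python
# Closed-set identification vs. verification, given per-speaker scores.
# Scores and threshold are illustrative, not from a real system.

def identify(scores):
    """Closed-set identification: pick the most likely enrolled speaker."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold=-5.0):
    """Verification: accept the claimed identity if its score clears a threshold."""
    return scores[claimed_id] >= threshold

scores = {"alice": -4.2, "bob": -6.9, "carol": -5.1}
print(identify(scores))        # → alice (highest score)
print(verify(scores, "bob"))   # → False (below the threshold)
```

Open-set identification would first run `identify` and then `verify` the winner against the threshold.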

7 Q: What is the difference between
◘ text-independent (TI) and
◘ text-dependent (TD)
speaker recognition?
A: In TI recognition, the speaker can produce any speech, while in TD recognition the speaker must pronounce a fixed or prompted phrase.
(Dis)advantages? User-friendliness, variability, cooperativeness.

8 Q: What is the best way to select a prompt in TD recognition? A: The prompt can be
◘ fixed … good for finding consistent speaker differences, but impostors know the prompt too.
◘ self-selected … more secret, but users may choose a short, easy-to-guess prompt.
◘ variable … but you need a lot of enrollment data to model all possible contexts.

9 Q: How are the training data selected? A:
◘ Training data should reflect test (= operation) conditions to prevent training-test mismatch.
◘ More data are needed for TI than for TD models.
◘ More training data give better speaker models, but are less user-friendly.

10 Q: How can we deal with noise in the recordings? A: Two ways:
◘ Pre-processing: normalize the signals, e.g. by cepstral mean subtraction (CMS).
◘ Modelling: create multi-condition speaker models based on signals recorded in different environments.
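Cepstral mean subtraction is simple enough to sketch directly: subtracting the per-coefficient mean over time removes a constant channel offset from the features. The matrix shape and values below are illustrative.

```python
import numpy as np

# Sketch of cepstral mean subtraction (CMS) on a (frames x coefficients)
# feature matrix; a constant channel offset is simulated and removed.

def cepstral_mean_subtraction(cepstra):
    """Subtract the mean of each cepstral coefficient over all frames."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.normal(size=(100, 13)) + 3.0   # simulated channel offset of +3
normalized = cepstral_mean_subtraction(cepstra)
print(np.allclose(normalized.mean(axis=0), 0.0))  # → True: offset removed
```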

11 Q: How does a speaker (speech) recognizer work? A: Two parts:
◘ enrollment/training
◘ testing
Steps:
◘ microphone recording
◘ preprocessing
◘ modeling/testing

12 Q: What are these speaker models? A: Statistical models of the enrollment data:
◘ hidden Markov models (HMMs)
◘ Gaussian mixture models (GMMs)

13 Q: Why use statistical models? A: Because of variation in the signal (a behavioral biometric), which is
◘ often not noticed by human listeners, but
◘ detrimental to computer performance if not modelled appropriately.

14 Q: What is an HMM? A: This question needs several slides to answer. Let's start with a simple Markov model, which represents a sequence of observations (feature vectors from preprocessing) by states and transitions.

15 Q: What is a Markov model (MM)?
◘ Stochastic model of a sequence of events.
◘ Start at container (state) S, which is empty.
◘ Go to container (state) 1 (with p = 1) and take out a black ball (observation).
[Diagram: left-to-right model with states S, 1, 2, 3, E and transition probabilities]

16 Q: What is a MM?
◘ Go to state 2 (with p = 0.4) and take a red ball, or
◘ stay in state 1 and take another black ball out of the container.

17 Q: What is a MM?
◘ … and so on, until you get to state E and have a row of colored balls (cf. feature vectors obtained from the speech signal).
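The ball-and-container walk above can be sketched as a small sampler. Since the exact transition probabilities in the slide diagram are not fully recoverable from the transcript, the values below are assumptions chosen to match a left-to-right model; each state deterministically emits one color, as in a plain (non-hidden) Markov model.

```python
import random

# Sketch of the ball-and-container Markov model from the slides.
# Transition probabilities are illustrative; each state emits one color.

TRANSITIONS = {            # state -> [(next_state, prob), (other_state, prob)]
    "1": [("1", 0.6), ("2", 0.4)],
    "2": [("2", 0.5), ("3", 0.5)],
    "3": [("3", 0.7), ("E", 0.3)],
}
EMITS = {"1": "black", "2": "red", "3": "yellow"}  # one color per state

def sample_sequence(rng):
    state, balls = "1", []        # from S we always move to state 1 (p = 1)
    while state != "E":
        balls.append(EMITS[state])
        (a, pa), (b, _) = TRANSITIONS[state]
        state = a if rng.random() < pa else b
    return balls

print(sample_sequence(random.Random(42)))  # a run of blacks, reds, then yellows
```

Because the model is left-to-right, the sampled colors always appear in the order black, red, yellow; only the run lengths vary between samples.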

18 Q: What is an HMM? A: The only difference from a MM: the same observations (colored balls) can be emitted by different states (containers). In the example, all containers contain balls of different colors; the different percentages of each color are modeled by their emission probabilities.

19 Q: What is an HMM?
◘ Start at state S, which is empty.
◘ Go to state 1 (with p = 1) and take out a ball, which can be black, red or yellow.

20 Q: What is an HMM?
◘ Go to state 2 (with p = 0.4) and take out a ball, or
◘ stay in state 1 and take another ball out,
◘ until you get to state E.

21 Q: What is an HMM?
◘ … and so on, until you get to state E and have a sequence of colored balls.
◘ Notice the left-to-right nature: the order of sounds in a word is fixed.

22 Q: What is an HMM?
◘ You now have a sequence of colored balls,
◘ but you cannot tell from the sequence of balls which containers they were taken from (unlike in a MM): the states are "hidden".
(Several state sequences could have produced the same balls, e.g. 1112222333, 111222223333, etc.)

23 Q: What is an HMM?
◘ HMM for speech:
◘ state = (part of a) phone
◘ colored ball = feature vector representing the frequency spectrum of the speech signal.
◘ Task of the HMM (Viterbi algorithm): find the most likely state sequence to model the observation sequence.
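The Viterbi search over state sequences can be sketched for the ball-color HMM. The transition and emission probabilities below are assumptions for illustration (the slide diagram's exact values are not recoverable); the algorithm itself is standard dynamic programming over log-probabilities.

```python
import math

# Viterbi sketch for a small left-to-right HMM with discrete emissions
# (colors). All probabilities are illustrative.

TRANS = {("1", "1"): 0.6, ("1", "2"): 0.4,
         ("2", "2"): 0.5, ("2", "3"): 0.5, ("3", "3"): 1.0}
EMIT = {
    "1": {"black": 0.7, "red": 0.2, "yellow": 0.1},
    "2": {"black": 0.2, "red": 0.6, "yellow": 0.2},
    "3": {"black": 0.1, "red": 0.2, "yellow": 0.7},
}

def viterbi(obs):
    """Return the most likely state sequence for the observations `obs`."""
    best = {"1": math.log(EMIT["1"][obs[0]])}   # we always start in state 1
    paths = {"1": ["1"]}
    for o in obs[1:]:
        new_best, new_paths = {}, {}
        for (prev, nxt), p in TRANS.items():
            if prev not in best:
                continue
            score = best[prev] + math.log(p) + math.log(EMIT[nxt][o])
            if nxt not in new_best or score > new_best[nxt]:
                new_best[nxt] = score
                new_paths[nxt] = paths[prev] + [nxt]
        best, paths = new_best, new_paths
    return paths[max(best, key=best.get)]

print(viterbi(["black", "black", "red", "yellow"]))  # → ['1', '1', '2', '3']
```

Note how the best path follows the colors through the states whose emission probabilities favor them, in left-to-right order.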

24 Q: What is an HMM?
◘ In the example, the observations were discrete (colors).
◘ Usually, Gaussian mixtures (normal distributions) are used to describe the (continuous) observations.

25 Q: What is a GMM, and when is it used instead of an HMM? A: An HMM that consists of only one state. It can be used if we do not need any information about the linguistic content (time structure) of the speech.
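A GMM scores a feature value by a weighted sum of Gaussian densities. The sketch below uses one-dimensional features and hand-picked weights, means and variances purely for illustration; a real system would fit these to enrollment data (typically with EM) over multi-dimensional feature vectors.

```python
import math

# Sketch of a GMM as a one-state model: the likelihood of a feature value
# is a weighted sum of Gaussian densities. Parameters are illustrative.

COMPONENTS = [  # (weight, mean, variance)
    (0.5, 0.0, 1.0),
    (0.3, 3.0, 0.5),
    (0.2, -2.0, 2.0),
]

def gmm_log_likelihood(x):
    """Log of the mixture density at x."""
    density = sum(
        w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        for w, m, v in COMPONENTS
    )
    return math.log(density)

# Values near a component mean score higher than outliers:
print(gmm_log_likelihood(0.0) > gmm_log_likelihood(10.0))  # → True
```

In testing, the per-frame log-likelihoods of an utterance are summed and compared across speaker models.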

26 Q: How much enrollment data is needed to train an HMM/GMM? A: A balance between
◘ representativeness of the speaker and operating conditions (as much data as possible)
◘ user-friendliness (as little data as possible)
Adaptation from a universal background model (UBM) helps.

27 Q: How is a UBM used? A: Two ways:
◘ in training: to initialize the client speaker models (cf. previous slide)
◘ in testing (normally only in verification, not identification): to compare the likelihood of the client model with that of the UBM (normalization by taking the likelihood ratio)
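The likelihood-ratio normalization in testing reduces to one comparison. The scores and threshold below are made-up numbers for illustration.

```python
# Sketch of UBM-based score normalization in verification: accept when the
# log-likelihood ratio between the claimed client model and the UBM
# exceeds a threshold. Scores and threshold are illustrative.

def verify_with_ubm(client_loglik, ubm_loglik, threshold=0.0):
    """Accept if log p(X|client) - log p(X|UBM) >= threshold."""
    return (client_loglik - ubm_loglik) >= threshold

print(verify_with_ubm(-120.0, -135.0))  # genuine trial: client fits better → True
print(verify_with_ubm(-140.0, -132.0))  # impostor trial: UBM fits better → False
```

Dividing out the UBM likelihood makes scores comparable across utterances of different length, content and quality.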

28 Q: How good is a system? A: Evaluation on test data:
◘ for identification: percentage of correct identifications
◘ for verification: comparison of the number of false acceptances (FA) with false rejections (FR)
◘ DET instead of ROC curve
◘ The selected operating point depends on the required security level: r = 1 (EER), but also r = 0.1 or 10.
HTER = ½(%FA + %FR) gives equal weight to client and impostor accesses, to overcome a possible imbalance in the training data.
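The FA/FR trade-off and HTER can be illustrated by sweeping a decision threshold over a handful of made-up verification scores:

```python
# Sketch: sweep a decision threshold over illustrative verification scores
# to trade off false acceptances (FA) against false rejections (FR), and
# report the half total error rate HTER = (FA% + FR%) / 2.

def rates(client_scores, impostor_scores, threshold):
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

clients = [2.1, 1.7, 0.4, 2.8, 1.2]        # genuine-trial scores (made up)
impostors = [-1.3, 0.6, -0.2, -2.0, -0.9]  # impostor-trial scores (made up)

for thr in (-1.0, 0.5, 2.0):
    fa, fr = rates(clients, impostors, thr)
    print(f"thr={thr:+.1f}  FA={fa:.0%}  FR={fr:.0%}  HTER={(fa + fr) / 2:.0%}")
```

Raising the threshold lowers FA and raises FR; the operating point where the two (cost-weighted) rates meet is the equal error rate (EER) for r = 1.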

29 Q: How good is a system?
[Figure: ROC curve (receiver operating characteristic) vs. DET curve (detection error tradeoff), with the HTER operating point]
Alvin Martin et al. (1997). The DET curve in assessment of detection task performance. www.nist.gov/speech/publications/

30 Summary
This lecture has familiarized you with
◘ the main concepts in speaker recognition
◘ speaker modeling at a conceptual level
We should now
◘ take a closer look at the signal which is modelled in speaker recognition

31 A Talking Elevator: An introduction to the main concepts of speaker recognition. Jacques Koreman, NTNU


