SPEAKER RECOGNITION A PRESENTATION BY SHAMALEE DESHPANDE
INTRODUCTION
Speaker Recognition
* Automatically recognizing who is speaking
* Uses individual information contained in the speaker’s speech waves
INTRODUCTION
Two Approaches
Text-Dependent Recognition
* Uses keywords or sentences with the same text for both the templates and the recognition
Text-Independent Recognition
* Does not rely on a specific text being spoken
INTRODUCTION
* Classes of sound: voiced, unvoiced, plosive
* Production of pitch frequency and formants
* Glottal waveform
BLOCK DIAGRAM OF A SPEAKER RECOGNITION SYSTEM
DESIRABLE ATTRIBUTES OF A SPEAKER RECOGNITION SYSTEM
A feature should:
* Occur naturally and frequently in speech
* Be easily measurable
* Not change over time or be affected by the speaker’s health
* Not be affected by background noise
* Not be subject to mimicry
SOURCES OF VARIABILITY IN SPEECH
* Phonetic identity: two samples may correspond to different phonetic segments, e.g. a vowel and a fricative
* Pitch: pitch and other features such as breathiness and amplitude can be varied
* Speaker: differences due to source physiology and emotions
* Microphone
* Environment
Possible Acoustic Parameters
* Formant Frequencies
* LPC
* Pitch
* Nasal Coarticulation
* Gain
COMMON SPEAKER RECOGNITION TECHNIQUES
* Discrete Fourier Transform
* Linear Predictive Coding
* Cepstral Analysis
* Dynamic Time Warping
* Hidden Markov Models
DISCRETE / FAST FOURIER TRANSFORM
* Changes time-domain signals into frequency-domain signal representations
* The FFT enables reduced computational complexity for the processor
* Read L speech samples from input
* Append N-L zeros to the input data (zero-padding to the N-point transform length)
* Calculation of DFT
* Windowing
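The steps on this slide can be sketched in Python with NumPy. This is a minimal illustration rather than the presenter's implementation: the function name `frame_spectrum`, the choice of a Hamming window, and applying the window before zero-padding are all assumptions made for the example.

```python
import numpy as np

def frame_spectrum(samples, n_fft):
    """Magnitude spectrum of one speech frame (hypothetical helper).

    samples: the speech samples read from the input
    n_fft:   transform length; the frame is zero-padded up to this size
    """
    frame = np.asarray(samples, dtype=float)
    length = len(frame)
    if length > n_fft:
        raise ValueError("frame is longer than the transform length")

    # Windowing: taper the frame edges to reduce spectral leakage
    # (Hamming window is an assumption, not taken from the slides).
    windowed = frame * np.hamming(length)

    # Append zeros so the frame has exactly n_fft samples.
    padded = np.concatenate([windowed, np.zeros(n_fft - length)])

    # FFT of a real signal; returns n_fft // 2 + 1 frequency bins.
    return np.abs(np.fft.rfft(padded))
```

With a sampling rate `fs`, bin `k` of the result corresponds to the frequency `k * fs / n_fft`.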
LINEAR PREDICTIVE CODING
LPC model of the speech-producing organs of the body
* BUZZER: glottal excitation, characterized by intensity and pitch
* TUBE: vocal tract, characterized by formants
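One common way to estimate the "tube" part of this model is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method was used, so the sketch below is an assumption. The function name `lpc` and the sign convention (A(z) = 1 + a1 z^-1 + ... + ap z^-p, so that x[n] is predicted as minus the weighted sum of past samples) are also choices made for this example.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients by the autocorrelation method (illustrative sketch).

    Returns (a, err): a[0] == 1 and a[1:] are the predictor polynomial
    coefficients; err is the residual (excitation) energy.
    """
    x = np.asarray(frame, dtype=float)

    # Autocorrelation values r[0] .. r[order].
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]

    # Levinson-Durbin recursion: grow the predictor one order at a time.
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)

    return a, err
```

The roots of the polynomial with coefficients `a` give resonance estimates that relate to the formants, while `err` relates to the intensity of the excitation.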
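CEPSTRAL ANALYSIS
* A disadvantage of the DFT/FFT is that the formant frequencies may shift the pitch or overlap it in the spectrum
* In cepstral analysis, the formant (vocal tract) contribution is separated from the pitch (excitation) contribution
* Defined as the inverse Fourier transform of the log of the spectrum
* Speech is the excitation convolved with the vocal tract response: $s(n) = p(n) * v(n)$
* Windowed speech: $x(n) = w(n)\, s(n)$
* After the Fourier transform, the convolution becomes a product: $S'(\omega) = p'(\omega)\, v'(\omega)$
* Taking the log turns the product into a sum: $\log S'(\omega) = \log p'(\omega) + \log v'(\omega)$
* Transforming back: $C(q) = \mathcal{F}^{-1}\{\log S'(\omega)\} = \mathcal{F}^{-1}\{\log p'(\omega)\} + \mathcal{F}^{-1}\{\log v'(\omega)\}$
* $q$: quefrency, $C(q)$: complex cepstrum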
CEPSTRAL ANALYSIS
Speech -> Window -> DFT -> LOG -> IDFT -> Cepstrum
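The processing chain above can be sketched in a few lines of NumPy. This example computes the real cepstrum (log of the magnitude spectrum), which is a simplification assumed here. The function name `real_cepstrum`, the default Hamming window, and the small constant `eps` guarding against log(0) are likewise illustrative choices.

```python
import numpy as np

def real_cepstrum(frame, window=None, eps=1e-12):
    """Window -> DFT -> log magnitude -> inverse DFT (illustrative sketch)."""
    x = np.asarray(frame, dtype=float)
    n = len(x)

    # Window: default to a Hamming window (an assumption of this sketch).
    if window is None:
        window = np.hamming(n)

    # DFT of the windowed frame.
    spectrum = np.fft.rfft(x * window)

    # LOG of the magnitude; eps avoids taking the log of zero.
    log_mag = np.log(np.abs(spectrum) + eps)

    # IDFT: index q of the result is the quefrency in samples.
    return np.fft.irfft(log_mag, n=n)
```

For voiced speech, low quefrencies mostly describe the vocal tract envelope, and a peak at higher quefrency typically appears near the pitch period.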
DYNAMIC TIME WARPING
* Incoming speech is usually compared frame by frame with a stored template
* Achieved via a pairwise comparison of feature vectors from each sequence
* Disadvantage of direct frame-by-frame comparison: the length of corresponding phonemes varies
* DTW takes into account the nonlinear relation between the lengths of the two signals
* Used as a matching algorithm
* Example DTW grid
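The grid idea can be illustrated with a basic dynamic-programming implementation. This is a minimal sketch under stated assumptions: Euclidean distance between feature vectors, the three standard step directions (diagonal, vertical, horizontal), and no path constraints or normalization. The name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated cost of the best warping path between two sequences.

    a, b: sequences of feature vectors (or scalars), possibly of
    different lengths.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]

    # Pairwise distances between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    n, m = cost.shape
    grid = np.full((n + 1, m + 1), np.inf)
    grid[0, 0] = 0.0

    # Each cell extends the cheapest of its three predecessor cells.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            grid[i, j] = cost[i - 1, j - 1] + min(
                grid[i - 1, j - 1],  # both sequences advance
                grid[i - 1, j],      # only a advances
                grid[i, j - 1],      # only b advances
            )
    return grid[n, m]
```

In a template-matching setup, the incoming utterance would be compared against each stored template and the template with the lowest accumulated cost selected.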
HIDDEN MARKOV MODELS
* Speech signal is identified during the search process rather than explicitly
* Comprises a hidden Markov chain representing temporal variability and an observable process representing spectral variability
* Portrayed as a stochastic pair (X, Y)
* An HMM is a finite state machine in which a probability density function p(x|s) is associated with each state s
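To make the role of p(x|s) concrete, the sketch below scores an observation sequence with the forward algorithm in the log domain. The one-dimensional Gaussian state densities are an assumption chosen for brevity (real systems typically use multivariate features), and the names `log_gaussian` and `forward_log_likelihood` are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log of a 1-D Gaussian density, used here as p(x|s)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_log_likelihood(obs, initial, transition, means, variances):
    """Log-likelihood of obs under an HMM (forward algorithm sketch).

    initial:    initial state probabilities, shape (S,)
    transition: transition[i, j] = P(next state j | current state i)
    means, variances: parameters of each state's Gaussian density
    """
    obs = np.asarray(obs, dtype=float)
    log_pi = np.log(np.asarray(initial, dtype=float))
    log_a = np.log(np.asarray(transition, dtype=float))
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # log_b[t, s] = log p(obs[t] | state s)
    log_b = log_gaussian(obs[:, None], means[None, :], variances[None, :])

    # alpha[s] = log-probability of the observations so far, ending in s.
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_a, axis=0) + log_b[t]

    # Sum over all possible final states.
    return np.logaddexp.reduce(alpha)
```

For speaker identification, one plausible use is to train one model per speaker and pick the speaker whose model gives the highest log-likelihood for the incoming utterance.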
FUTURE RESEARCH
To extract and apply all levels of information from the speech signal that convey speaker identity
* Acoustic: use spectral features conveying vocal tract information
* Prosodic: use features derived from pitch and energy tracks to classify information
* Phonetic: use phone sequences to characterize speaker-specific pronunciations
* Idiolect: use words to characterize user-specific word patterns
* Linguistic: use linguistic patterns to characterize speaker-specific conversation style
APPLICATIONS
* Access control: physical facilities, computer networks and websites
* PC login and password reset
* Secured transactions: remote banking and online credit card purchase authentication
* Time and attendance: workplaces
* Law enforcement: forensics, parole