A Text-Independent Speaker Recognition System

A Text-Independent Speaker Recognition System Catie Schwartz Advisor: Dr. Ramani Duraiswami Mid-Year Progress Report

Speaker Recognition System ENROLLMENT PHASE – TRAINING (OFFLINE) VERIFICATION PHASE – TESTING (ONLINE)

Schedule/Milestones, Fall 2011
October 4: Have a good general understanding of the full project and have the proposal completed. Marks completion of Phase I.
November 4: GMM UBM EM algorithm implemented; GMM speaker model MAP adaptation implemented; test using the log likelihood ratio as the classifier. Marks completion of Phase II.
December 19: Total variability space training via BCDM implemented; i-vector extraction algorithm implemented; test using the cosine distance score as the classifier; reduced subspace LDA implemented; LDA-reduced i-vector extraction algorithm implemented. Marks completion of Phase III.

Algorithm Flow Chart: Background Training [flow chart: Background Speakers; Feature Extraction (MFCCs + VAD); GMM UBM (EM); Factor Analysis Total Variability Space (BCDM); Reduced Subspace (LDA)]

Algorithm Flow Chart: GMM Speaker Models [flow chart: Reference Speakers; Test Speaker; Feature Extraction (MFCCs + VAD); GMM Speaker Models (MAP Adaptation); Log Likelihood Ratio (Classifier)]

Feature Extraction [flow chart: Background Speakers; Feature Extraction (MFCCs + VAD); GMM UBM (EM); Factor Analysis Total Variability Space (BCDM); Reduced Subspace (LDA)]

MFCC Algorithm
Input: utterance; sample rate
Output: matrix of MFCCs by frame
Parameters: window size = 20 ms; step size = 10 ms; nBins = 40; d = 13 (nCeps)
Step I: Compute FFT power spectrum
Step II: Compute mel-frequency m-channel filterbank
Step III: Convert to cepstra via DCT (0th cepstral coefficient represents "energy")
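The three MFCC steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the Ellis-derived Matlab code the slides refer to: the Hamming window, the mel formula, the triangular filter shape, the unnormalized DCT-II, and the function name `mfcc` are all assumptions made for the example. Only the window size, step size, nBins, and nCeps come from the slide.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate, win_ms=20, step_ms=10, n_bins=40, n_ceps=13):
    """Return a (n_frames, n_ceps) matrix of MFCCs for a 1-D NumPy signal."""
    win = int(sample_rate * win_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_fft = 1 << (win - 1).bit_length()          # next power of two >= win
    n_frames = 1 + (len(signal) - win) // step

    # Step I: FFT power spectrum of each windowed frame.
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2

    # Step II: triangular filters spaced evenly on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bins + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_bins, freqs.size))
    for b in range(n_bins):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_energy = np.log(power @ fbank.T + 1e-10)  # small floor avoids log(0)

    # Step III: DCT-II of the log filterbank energies; row 0 sums the
    # log energies, which is why coefficient 0 behaves like "energy".
    n = np.arange(n_bins)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (n + 0.5) / n_bins)
    return log_energy @ dct.T
```

For a one-second signal sampled at 16 kHz, `mfcc(x, 16000)` yields one row of 13 coefficients per 10 ms step.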

MFCC Validation: Code modified from the tool set created by Dan Ellis (Columbia University). Results of the modified code were compared to those of the original code for validation. Ellis, Daniel P. W. PLP and RASTA (and MFCC, and Inversion) in Matlab. Vers. Ellis05-rastamat. 2005. Web. 1 Oct. 2011. <http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/>.

VAD Algorithm
Input: utterance; sample rate
Output: indicator of silent frames
Parameters: window size = 20 ms; step size = 10 ms
Step I: Segment utterance into frames
Step II: Find the energy of each frame
Step III: Determine the maximum energy
Step IV: Remove any frame whose energy is either (a) more than 30 dB below the maximum energy or (b) less than -55 dB overall
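The four VAD steps above map directly onto a short energy-based detector. The sketch below is illustrative only: the function name `vad`, the use of mean-square frame energy, and the convention that 0 dB corresponds to a full-scale amplitude of 1.0 are assumptions, since the slide does not state the dB reference.

```python
import numpy as np

def vad(signal, sample_rate, win_ms=20, step_ms=10, rel_db=30.0, abs_db=-55.0):
    """Return a boolean array with True for frames judged silent."""
    win = int(sample_rate * win_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - win) // step

    # Step I: segment the utterance into overlapping frames.
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx]

    # Step II: frame energy in dB (assumed reference: amplitude 1.0).
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)

    # Step III: maximum frame energy.
    max_db = energy_db.max()

    # Step IV: silent if too far below the maximum or below the absolute floor.
    return (energy_db < max_db - rel_db) | (energy_db < abs_db)
```

The returned mask can be used to drop the matching rows of the MFCC matrix, provided both use the same window and step sizes.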

VAD Validation: Visual inspection of speech along with detected speech segments [figure: original signal with silent and speech segments marked]

Gaussian Mixture Models (GMM) as Speaker Models: Represent each speaker by a finite mixture of multivariate Gaussians. The UBM, or average speaker model, is trained using an expectation-maximization (EM) algorithm. Speaker models are learned using a maximum a posteriori (MAP) adaptation algorithm.

EM for GMM Algorithm [flow chart: Background Speakers; Feature Extraction (MFCCs + VAD); GMM UBM (EM); Factor Analysis Total Variability Space (BCDM); Reduced Subspace (LDA)]

EM for GMM Algorithm (1 of 2)
Input: concatenation of the MFCCs of all background utterances
Output: UBM parameters (mixture weights, means, covariances)
Parameters: K = 512 (nComponents); nReps = 10
Step I: Initialize the parameters randomly
Step II (Expectation Step): Obtain the conditional distribution of component c

EM for GMM Algorithm (2 of 2)
Step III (Maximization Step): Update the mixture weight, mean, and covariance of each component
Step IV: Repeat Steps II and III until the relative change in maximum likelihood is less than 0.01
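Steps I through IV can be sketched as follows. Several choices here are assumptions rather than details from the slides: diagonal covariances, initialization of the means from randomly chosen data points, a small variance floor, and a stopping rule based on the relative change in total log-likelihood. The names `log_gauss_diag` and `em_gmm` are hypothetical. The (N, K, d) intermediate array keeps the code short but would need chunking for K = 512 on a large background set.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each row of X under each diagonal Gaussian, shape (N, K)."""
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + np.log(variances).sum(axis=1)[None, :]
                   + (diff ** 2 / variances[None, :, :]).sum(axis=2))

def em_gmm(X, K, tol=0.01, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step I: random initialization (means drawn from the data).
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, K, replace=False)]
    variances = np.tile(X.var(axis=0), (K, 1))
    log_liks = []
    for _ in range(max_iter):
        # Step II (E-step): posterior probability of each component per frame.
        log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        log_liks.append(log_norm.sum())
        # Step III (M-step): update weights, means, and variances.
        n_k = resp.sum(axis=0) + 1e-10
        weights = n_k / N
        means = (resp.T @ X) / n_k[:, None]
        variances = (resp.T @ X ** 2) / n_k[:, None] - means ** 2 + 1e-6
        # Step IV: stop when the relative log-likelihood change is small.
        if len(log_liks) > 1:
            rel = abs(log_liks[-1] - log_liks[-2]) / abs(log_liks[-2])
            if rel < tol:
                break
    return weights, means, variances, log_liks
```

The returned `log_liks` list supports the validation check described in the slides: the log-likelihood should not decrease from one iteration to the next.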

EM for GMM Validation (1 of 9): Ensure the maximum log likelihood is increasing at each step. Create example data to visually and numerically validate the EM algorithm results.

EM for GMM Validation (2 of 9) Example Set A: 3 Gaussian Components

EM for GMM Validation (3 of 9) Example Set A: 3 Gaussian Components Tested with K = 3

EM for GMM Validation (4 of 9) Example Set A: 3 Gaussian Components Tested with K = 3

EM for GMM Validation (5 of 9) Example Set A: 3 Gaussian Components Tested with K = 2

EM for GMM Validation (6 of 9) Example Set A: 3 Gaussian Components Tested with K = 4

EM for GMM Validation (7 of 9) Example Set A: 3 Gaussian Components Tested with K = 7

EM for GMM Validation (8 of 9) Example Set B: 128 Gaussian Components

EM for GMM Validation (9 of 9) Example Set B: 128 Gaussian Components

Algorithm Flow Chart: GMM Speaker Models [flow chart: Reference Speakers; Test Speaker; Feature Extraction (MFCCs + VAD); GMM Speaker Models (MAP Adaptation); Log Likelihood Ratio (Classifier)]

MAP Adaptation Algorithm
Input: MFCCs of the utterance for the speaker; the UBM parameters
Output: speaker model with adapted means
Parameters: K = 512 (nComponents); r = 16
Step I: Obtain the component statistics via Steps II and III of the EM for GMM algorithm
Step II: Calculate the adapted means
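The slide's formulas did not survive in the transcript, so the sketch below uses the mean-only update from Reynolds (2000), which the deck cites: each adapted mean is alpha_k * E_k[x] + (1 - alpha_k) * mu_k with alpha_k = n_k / (n_k + r). Treat the exact form as an assumption about what the slide showed. Diagonal covariances and the function name `map_adapt_means` are also assumptions.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """Adapt UBM means toward speaker data X; weights and variances are kept."""
    # Step I: posteriors of each UBM component for every frame (E-step),
    # then the soft counts n_k and first-order statistics E_k[x].
    diff = X[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                      + (diff ** 2 / variances[None, :, :]).sum(axis=2)))
    resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])
    n_k = resp.sum(axis=0)
    first = (resp.T @ X) / np.maximum(n_k, 1e-10)[:, None]

    # Step II: interpolate between the data mean and the UBM mean.
    # Components that saw little data (small n_k) stay close to the UBM.
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * first + (1.0 - alpha) * means
```

With r = 16, a component needs a soft count of 16 frames before the speaker data and the UBM prior carry equal weight.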

MAP Adaptation Validation (1 of 3): Use example data to visually validate the MAP Adaptation algorithm results

MAP Adaptation Validation (2 of 3) Example Set A: 3 Gaussian Components

MAP Adaptation Validation (3 of 3) Example Set B: 128 Gaussian Components

Log Likelihood Ratio [flow chart: Reference Speakers; Test Speaker; Feature Extraction (MFCCs + VAD); GMM Speaker Models (MAP Adaptation); Log Likelihood Ratio (Classifier)]

Classifier: Log-likelihood test. Compare a speech sample to a hypothesized speaker using the log-likelihood ratio between the speaker model and the UBM: a ratio above the decision threshold leads to verification of the hypothesized speaker, and a ratio below it leads to rejection. Reynolds, D. "Speaker Verification Using Adapted Gaussian Mixture Models." Digital Signal Processing 10.1-3 (2000): 19-41. Print.
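A minimal sketch of the log-likelihood ratio test follows. It assumes diagonal-covariance GMMs stored as `(weights, means, variances)` tuples and averages the per-frame log-likelihoods, as in Reynolds (2000). The function names and the default threshold of 0 are hypothetical; in practice the threshold is tuned on held-out trials.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)[None, :]
             - 0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                      + (diff ** 2 / variances[None, :, :]).sum(axis=2)))
    return np.logaddexp.reduce(log_p, axis=1).mean()

def llr_score(X, speaker, ubm):
    # Log-likelihood ratio: speaker model versus universal background model.
    return gmm_log_likelihood(X, *speaker) - gmm_log_likelihood(X, *ubm)

def verify(X, speaker, ubm, threshold=0.0):
    # Accept the claimed identity when the score reaches the threshold.
    return llr_score(X, speaker, ubm) >= threshold
```

Here `speaker` would typically reuse the UBM weights and variances with MAP-adapted means.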

Preliminary Results Using TIMIT Dataset

Dialect Region (dr)   #Male       #Female     Total
-------------------   ---------   ---------   ----------
1                     31 (63%)    18 (37%)    49 (8%)
2                     71 (70%)    31 (30%)    102 (16%)
3                     79 (77%)    23 (23%)    102 (16%)
4                     69 (69%)    31 (31%)    100 (16%)
5                     62 (63%)    36 (37%)    98 (16%)
6                     30 (65%)    16 (35%)    46 (7%)
7                     74 (74%)    26 (26%)    100 (16%)
8                     22 (67%)    11 (33%)    33 (5%)
-------------------   ---------   ---------   ----------
8                     438 (70%)   192 (30%)   630 (100%)

GMM Speaker Models DET Curve and EER
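The equal error rate (EER) is the operating point on the DET curve where the false-rejection and false-acceptance rates coincide. The sketch below estimates it by sweeping the threshold over the observed scores. The function name and the tie-handling convention (accept when the score is at or above the threshold) are assumptions, and no scores from the project are reproduced here.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return (EER estimate, threshold) from target and impostor trial scores."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # False rejection: target trials scoring below the threshold.
    fr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance: impostor trials scoring at or above the threshold.
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[i] + fa[i]), thresholds[i]
```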

Conclusions
MFCC validated
VAD validated
EM for GMM validated
MAP Adaptation validated
Preliminary test results show acceptable performance
Next steps: Validate FA algorithms and LDA algorithm. Conduct analysis tests using the TIMIT and SRE databases.

Questions?

Bibliography
[1] Biometrics.gov - Home. Web. 02 Oct. 2011. <http://www.biometrics.gov/>.
[2] Kinnunen, Tomi, and Haizhou Li. "An Overview of Text-independent Speaker Recognition: From Features to Supervectors." Speech Communication 52.1 (2010): 12-40. Print.
[3] Ellis, Daniel. "An introduction to signal processing for speech." The Handbook of Phonetic Science, ed. Hardcastle and Laver, 2nd ed., 2009.
[4] Reynolds, D. "Speaker Verification Using Adapted Gaussian Mixture Models." Digital Signal Processing 10.1-3 (2000): 19-41. Print.
[5] Reynolds, Douglas A., and Richard C. Rose. "Robust Text-independent Speaker Identification Using Gaussian Mixture Speaker Models." IEEE Transactions on Speech and Audio Processing 3.1 (1995): 72-83. Print.
[6] "Factor Analysis." Wikipedia, the Free Encyclopedia. Web. 03 Oct. 2011. <http://en.wikipedia.org/wiki/Factor_analysis>.
[7] Dehak, Najim, and Dehak, Reda. "Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification." Interspeech 2009 Brighton. 1559-1562.
[8] Kenny, Patrick, Pierre Ouellet, Najim Dehak, Vishwa Gupta, and Pierre Dumouchel. "A Study of Interspeaker Variability in Speaker Verification." IEEE Transactions on Audio, Speech, and Language Processing 16.5 (2008): 980-88. Print.
[9] Lei, Howard. "Joint Factor Analysis (JFA) and i-vector Tutorial." ICSI. Web. 02 Oct. 2011. <http://www.icsi.berkeley.edu/Speech/presentations/AFRL_ICSI_visit2_JFA_tutorial_icsitalk.pdf>.
[10] Kenny, P., G. Boulianne, and P. Dumouchel. "Eigenvoice Modeling with Sparse Training Data." IEEE Transactions on Speech and Audio Processing 13.3 (2005): 345-54. Print.
[11] Bishop, Christopher M. "4.1.6 Fisher's Discriminant for Multiple Classes." Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.
[12] Ellis, Daniel P. W. PLP and RASTA (and MFCC, and Inversion) in Matlab. Vers. Ellis05-rastamat. 2005. Web. 1 Oct. 2011. <http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/>.

Milestones
Fall 2011
October 4: Have a good general understanding of the full project and have the proposal completed. Present proposal in class by this date. Marks completion of Phase I.
November 4: Validation of system based on supervectors generated by the EM and MAP algorithms. Marks completion of Phase II.
December 19: Validation of system based on extracted i-vectors. Validation of system based on nuisance-compensated i-vectors from LDA. Mid-Year Project Progress Report completed. Present in class by this date. Marks completion of Phase III.
Spring 2012
Feb. 25: Testing algorithms from Phase II and Phase III will be completed and compared against results of vetted system. Will be familiar with vetted Speaker Recognition System by this time. Marks completion of Phase IV.
March 18: Decision made on next step in project. Schedule updated and present status update in class by this date.
April 20: Completion of all tasks for project. Marks completion of Phase V.
May 10: Final Report completed. Present in class by this date. Marks completion of Phase VI.

Spring Schedule/Milestones

Algorithm Flow Chart (2 of 7): GMM Speaker Models, Enrollment Phase [flow chart: Reference Speakers; Feature Extraction (MFCCs + VAD); GMM Speaker Models (MAP Adaptation)]

Algorithm Flow Chart (3 of 7): GMM Speaker Models, Verification Phase [flow chart: Test Speaker; Feature Extraction (MFCCs + VAD); GMM Speaker Models (MAP Adaptation); Log Likelihood Ratio (Classifier)]

Algorithm Flow Chart (4 of 7): i-vector Speaker Models, Enrollment Phase [flow chart: Reference Speakers; Feature Extraction (MFCCs + VAD); GMM Speaker Models; i-vector Speaker Models]

Algorithm Flow Chart (5 of 7): i-vector Speaker Models, Verification Phase [flow chart: Test Speaker; Feature Extraction (MFCCs + VAD); GMM Speaker Models; i-vector Speaker Models; Cosine Distance Score (Classifier)]

Algorithm Flow Chart (6 of 7): LDA Reduced i-vector Speaker Models, Enrollment Phase [flow chart: Reference Speakers; Feature Extraction (MFCCs + VAD); LDA Reduced i-vector Speaker Models]

Algorithm Flow Chart (7 of 7): LDA Reduced i-vector Speaker Models, Verification Phase [flow chart: Test Speaker; Feature Extraction (MFCCs + VAD); LDA Reduced i-vector Speaker Models; Cosine Distance Score (Classifier)]
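The cosine distance score used as the classifier in the i-vector charts is the cosine of the angle between the enrolled speaker's i-vector and the test i-vector. A minimal sketch, assuming both i-vectors have already been extracted (and, for the LDA variant, already projected); the function name is hypothetical:

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors; higher means more alike."""
    num = float(np.dot(w_target, w_test))
    den = float(np.linalg.norm(w_target) * np.linalg.norm(w_test))
    return num / den
```

As with the log likelihood ratio, the score is compared against a decision threshold to accept or reject the claimed identity.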

Feature Extraction: Mel-frequency cepstral coefficients (MFCCs) are used as the features. A Voice Activity Detector (VAD) is used to remove silent frames.

Mel-Frequency Cepstral Coefficients: MFCCs relate to physiological aspects of speech. Mel-frequency scale: humans differentiate sound best at low frequencies. Cepstra: removes related timing information between different frequencies and drastically alters the balance between intense and weak components. Ellis, Daniel. "An introduction to signal processing for speech." The Handbook of Phonetic Science, ed. Hardcastle and Laver, 2nd ed., 2009.

Voice Activity Detection: Detects silent frames and removes them from the speech utterance.

GMM for Universal Background Model: Using a large set of training data representing a set of universal speakers, the GMM UBM is modeled as a finite mixture of multivariate Gaussians. This represents a speaker-independent distribution of feature vectors. The Expectation-Maximization (EM) algorithm is used to determine the UBM parameters.
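The mixture density for the UBM can be written in standard notation as below. The symbols are assumptions following common usage (for example Reynolds, 2000) and are not necessarily those of the original slide: K is the number of components, and each component k has a weight, a mean vector, and a covariance matrix.

```latex
p(x \mid \lambda_{\mathrm{UBM}})
  = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
\lambda_{\mathrm{UBM}} = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K},
\qquad
\sum_{k=1}^{K} \pi_k = 1 .
```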

GMM for Speaker Models: Represent each speaker by a finite mixture of multivariate Gaussians. Utilize the UBM, which represents speech data in general. Maximum a posteriori (MAP) adaptation is used to create the speaker model. Note: only the means are adjusted; the weights and covariances of the UBM are used for each speaker.