Speaker Detection Without Models
Dan Gillick
July 27, 2004

Motivation
Want to develop a speaker ID algorithm that:
– captures sequential information
– takes advantage of extended data
– combines well with existing baseline systems

The Algorithm
Rather than building models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test data frames to training data frames. We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture. The comparisons are guided by token-level alignments extracted from a speech recognizer.

Front-End
Using 40 MFCC features per 10 ms frame:
– 19 cepstra and energy (C0)
– their deltas
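The delta features can be computed from the static cepstra with, for example, a symmetric two-frame difference. This is a minimal sketch: the exact delta window used by the front end is not stated on the slide, and the function name is illustrative.

```python
import numpy as np

def add_deltas(cepstra):
    """Append first-order deltas to a (num_frames, 20) cepstral matrix.

    cepstra: rows are frames, columns are C0 (energy) plus 19 cepstra.
    Deltas here use a symmetric two-frame difference with edge padding;
    this is an illustrative choice, not necessarily the slide's front end.
    """
    padded = np.pad(cepstra, ((1, 1), (0, 0)), mode="edge")
    deltas = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([cepstra, deltas])  # (num_frames, 40) features
```

Stacking the deltas next to the static coefficients yields the 40-dimensional frame vectors the distance computations below operate on.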

The Algorithm: Overview
Cut the test and target data into tokens:
– use word- or phone-level time alignments from the SRI recognizer
– note that these alignments have lots of errors (both word errors and alignment errors)

The Algorithm: Overview
Compare test and target data:
1. Take the first test token
2. Find every instance of this token in the target data
3. Measure the distance between the test token and each target instance
4. Move on to the next test token
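The four steps above can be sketched as a matching loop. This is a hedged sketch: the names and the dictionary layout of target instances are illustrative assumptions (the real system reads recognizer alignments), and the sequence-distance function is passed in.

```python
def score_test_tokens(test_tokens, target_tokens, distance):
    """Score each test token against a target speaker's data.

    test_tokens: list of (label, frames) pairs from the test side.
    target_tokens: dict mapping a token label to the list of frame
        sequences with that label in the target training data.
    distance: function comparing two frame sequences (e.g. DTW).
    Returns {label: smallest distance found}, keeping only the 1-best
    match per token, as the later slides describe.
    """
    token_scores = {}
    for label, test_frames in test_tokens:
        instances = target_tokens.get(label, [])
        if not instances:
            continue  # token never appears in the target data
        best = min(distance(test_frames, inst) for inst in instances)
        token_scores[label] = min(best, token_scores.get(label, best))
    return token_scores
```

The loop deliberately discards all but the best-matching target instance per token, matching the 1-best decision explained a few slides on.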

The Algorithm
[diagram: test data alongside training data]

The Algorithm
“Take the first test token”: grab the sequence of frames corresponding to this token according to the recognizer output.
[diagram: the token “Hello” highlighted in the test data, alongside the training data]

The Algorithm
“Find every instance of this token in the target data.”
[diagram: the test token “Hello” matched to three target instances: Hello (1), Hello (2), Hello (3)]

The Algorithm
“Measure the distance between the test token and each target instance”: distance = sum of the (Euclidean) distances between frames of the test and target instances.
[diagram, built up over three slides: the Euclidean distance function scores the test token against Hello (1): distance = 25; Hello (2): distance = 40; Hello (3): distance = 18]

The Algorithm: Distance Function
But these instances have different lengths. How do we line up the frames? Here are some possibilities:
1. Line up the first frames and cut off the longer at the shorter
2. Use a sliding window: slide the shorter through the longer, taking the best (smallest) total distance
3. Use dynamic time warping (DTW)
[diagram: Hello (test) vs. Hello (3), Euclidean distance function, distance = 18]
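Of the three options, DTW finds the best monotonic alignment between two frame sequences of different lengths. Here is a standard textbook DTW over per-frame Euclidean distances; it is an illustrative sketch, not the system's actual code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Total cost of the best monotonic alignment of two frame sequences.

    seq_a, seq_b: lists of feature vectors (e.g. 40-dim MFCC frames).
    Classic dynamic-programming recursion: each cell extends the best of
    the three neighboring alignments (insert, delete, match).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

The cutoff and sliding-window options trade this flexibility for speed; DTW instead pays O(n·m) time per token pair.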

The Algorithm: Take the 1-Best
Now what do we do with these scores? There are a number of options, but we keep only the 1-best score. One motivation for this decision is that we are mainly interested in positive information.
[diagram: Hello (1): 25, Hello (2): 40, Hello (3): 18 → token score = 18]

The Algorithm: Scoring
So we accumulate a score for each token (Hello: 18, my: 16.5, name: 21, etc.). What do we do with these? Some options:
1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score)
2. Focus on some subset of the scores:
   a. Positive evidence (Hit score): ∑ #frames / k^score
   b. Negative evidence: ∑ (#frames × target count) / k^(M − score)
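The Hit score rewards tokens whose 1-best distance is small, weighting longer tokens more heavily. A minimal sketch, assuming k > 1 is a tunable base (the slide does not give its value):

```python
def hit_score(token_records, k=1.1):
    """Positive-evidence score: sum of #frames / k**score over tokens.

    token_records: list of (num_frames, score) pairs, one per test
    token, where score is that token's 1-best distance. Smaller
    distances contribute more, and longer tokens count more heavily.
    The base k is a free parameter here; its value is an assumption.
    """
    return sum(frames / (k ** score) for frames, score in token_records)
```

Because k^score grows quickly, tokens with large distances contribute almost nothing, which is why this variant captures mainly positive evidence.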

Normalization
Most systems use a UBM (universal background model) to center the test scores.
– Since this system has no model, we create a background by lumping together speech from a number of held-out speakers and running the algorithm with this group as the training data.
ZNorm to center the “models”.
– Find the mean score for each “model” (training set) by running a number of held-out impostors against it.
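ZNorm itself can be sketched as centering (and scaling) each trial score by the impostor score distribution for the same target. The impostor scores are assumed to come from the held-out speakers mentioned above; this is a generic sketch, not the exact normalization code used.

```python
import statistics

def znorm(raw_score, impostor_scores):
    """Normalize a trial score by the impostor score distribution
    for the same target, so scores against different targets become
    comparable on a common scale.
    """
    mean = statistics.mean(impostor_scores)
    stdev = statistics.pstdev(impostor_scores)
    if stdev == 0:
        return raw_score - mean  # degenerate impostor set: center only
    return (raw_score - mean) / stdev
```

With every target centered at zero against impostors, a single detection threshold can be shared across targets.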

Results
Results reported on split 1 (of 6) of Switchboard I (1,624 test-vs-target scores).

Results
[table, built up over four slides; numeric cells lost in transcription]

TOKEN            STYLE     BKG   ZNORM   BSCR EER   HS EER   COMB EER   COMB DCF
word unigrams    sw, dtw   14    none
word bigrams     sw, dtw
phone unigrams   dtw
phone bigrams    dtw
phone trigrams   dtw

For reference: GMM performance on the same data set: 0.67% EER.
Style: sw = sliding window; Bkg: number of speakers in the background set; Znorm: number of speakers in the znorm set.

Results
How do positive and negative evidence compare?
Word bigrams + bkg (positive evidence): 3.16% EER
Word bigrams + bkg (negative evidence): 26.5% EER

Results
How is the system affected by errorful recognizer transcripts?
Word bigrams + bkg + znorm (recognized transcripts): 1.83% EER
Word bigrams + bkg + znorm (true transcripts): 1.16% EER

Results
How does the system combine with the GMM? This experiment was done on the first half (splits 1, 2, 3) of Switchboard I.
[table; numeric cells lost in transcription]

SYSTEM                     EER   DCF
SRI GMM system
Best phone-bigram system
GMM + phone-bigrams

Future Stuff
– Try a larger background population and a larger znorm set
– Try other, non-Euclidean distance functions
– Change the front-end features (feature mapping)
– Run the system on Switchboard II and the 2004 eval data
– Dynamic token selection: while the system works well already, perhaps its real strength has not yet been exploited. Since there are no models, we might dynamically select the longest available frame sequences in the test and target data for scoring.

Thanks
Steve (wrote all the DTW code, versions 1 through 5…)
Barry (tried to make my slides fancy)
Barbara
Everyone else in the Speaker ID group