The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
Rio Akasaka '09, Youngmoo Kim, Ph.D.*
Department of Linguistics/Engineering, Swarthmore College; *Drexel University

Introduction
Mel-frequency cepstral coefficients (MFCCs) are quantitative representations of speech and are commonly used to characterize sound files. They are derived by taking the Fourier transform of a signal and mapping the result onto the mel scale, an auditory perception-based scale of pitch differences. With these representations, the similarity between two files can be measured by the Kullback-Leibler (KL) distance between their feature distributions, and, given a training set on which to base decisions, the corresponding speaker can be identified. The goal of this research is to test the robustness of MFCCs in speaker recognition by varying the testing and training parameters in three ways: 1) using segments of a whole speech file, 2) varying the number of speech files used, and 3) splicing together the vowel phones of a speech segment.
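As an illustration of the pipeline just described, the following Python sketch extracts MFCCs, summarizes each speaker's frames with a single diagonal-covariance Gaussian, and identifies a test file by the smallest symmetric KL distance to a training model. This is not the authors' original code: the use of librosa, the single-Gaussian model, and all function names are assumptions made for illustration.

```python
import numpy as np
import librosa  # assumed available; any MFCC extractor would do


def mfcc_frames(wav_path, n_mfcc=13):
    """Load a 16 kHz speech file and return an (n_frames, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T


def fit_gaussian(frames):
    """Summarize a set of MFCC frames as a diagonal-covariance Gaussian (mean, variance)."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-8


def sym_kl(model_a, model_b):
    """Symmetric KL distance between two diagonal Gaussians."""
    (mu_a, var_a), (mu_b, var_b) = model_a, model_b
    kl_ab = 0.5 * np.sum(var_a / var_b + (mu_b - mu_a) ** 2 / var_b - 1.0 + np.log(var_b / var_a))
    kl_ba = 0.5 * np.sum(var_b / var_a + (mu_a - mu_b) ** 2 / var_a - 1.0 + np.log(var_a / var_b))
    return kl_ab + kl_ba


def identify(test_wavs, speaker_models):
    """Pool the MFCC frames of the test files and return the closest training speaker."""
    frames = np.vstack([mfcc_frames(w) for w in test_wavs])
    test_model = fit_gaussian(frames)
    return min(speaker_models, key=lambda spk: sym_kl(speaker_models[spk], test_model))
```

Training would simply build speaker_models[spk] = fit_gaussian(...) from each speaker's training files; the same routine applies unchanged whether those files are full utterances (F), half utterances (H), or vowel-spliced files (V).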
The TIMIT Corpus
The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and provides time-aligned orthographic, word, and phonetic transcriptions for each of its 16-bit, 16 kHz speech files. 630 speakers from the 8 major dialect regions of American English each read 10 'phonetically rich' texts, 2 of which are common across all speakers. Each utterance is transcribed on three tiers; for the word "your", for example:
ORTHOGRAPHIC: She had your dark suit in greasy wash water all year.
WORD: your
PHONETIC: y axr
To investigate the distribution of the phonemes in TIMIT, a plot of phone frequencies was generated. The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.

Nomenclature
The following nomenclature is adopted in this poster:
F: full (complete) speech file
H: speech file segmented at the middle (half file)
V: file consisting of vowel phones only
Conditions are written as, for example, "Train: 5F, Test: 3F": five full files per speaker for training and three full files per speaker for testing.

Results
Control
Train: 5F, Test: 5F (not including SA); Evaluated: 570; Correct: 493

Difference in number
Reducing the number of training and testing files so that the two match still yields a near-optimal (~84.2%) success rate, but only down to 3 files.
Train: 3F, Test: 5F; Evaluated: 570; Correct: 440
Train: 5F, Test: 3F; Evaluated: 342; Correct: 292
Train: 3F, Test: 3F; Evaluated: 342; Correct: 290

Difference in size
Performance is considerably better when full speech files are used for testing, regardless of which half of the file is used for training.
Train: 5F, Test: 5H; Evaluated: 570; Correct: 327
Train: 5H, Test: 5F; Evaluated: 570; Correct: 442
Train: 5H, Test: 3F; Evaluated: 342; Correct: 277

Vowels only
Individual phonemes are exceedingly difficult to use when predicting a speaker from entire wav files.
Train: 5V, Test: 3F; Evaluated: 342; Correct: 242
Train: 5F, Test: 5V; Evaluated: 570; Correct: 53

Vowels only, cont.
If vowel-only files are used for both training and testing, the results are impressive.
Train: 5V, Test: 5V; Evaluated: 570; Correct: 533
Train: 3V, Test: 3V; Evaluated: 342; Correct: 278

Further examination
To extract more information about the role that individual phones play in speaker recognition, the same algorithm was applied to recognition based on individual phonemes extracted from each speaker: the training set consists of files containing only the segments for a particular phoneme, and these are then tested individually.

Figures
Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 speech files.
Figure 2. Speaker prediction based on individual phonemes. While speaker recognition based on a single phoneme is considerably low (μ = 3.60%, σ = 2.34), the diagonal shows slightly higher recognition rates, as would be expected.
Figure 3. The predicted speaker plotted against the actual speaker (by speaker ID), generated to test whether one speaker is consistently retrieved as the ideal candidate for a particular phoneme. Speaker 183 is selected most often in this scenario.

Conclusions
While optimal performance in speaker recognition is expected with a larger training set, the amount of testing material did not appear to affect performance provided at least three files were used and the number of training files was equal to or greater than the number of testing files. Although this might be expected to extend to the length of the wav files, that was not the case: testing on half a file consistently produced poor results. Most importantly, testing and training with vowel phones only yielded impressive recognition rates of approximately 93%, meriting further study.
With regard to individual phone contributions to recognition, a single phoneme did not predict a speaker more effectively when the same phoneme was used for training than when any other phoneme was used. However, two phonemes consistently outperformed the others in predicting a speaker: 'ae' and 'ay'. Of the five trials, 'ae' was ranked most highly recognized three times, 'ay' was highest twice, and both were among the top two in four of the trials. More tests are being done to obtain a statistically significant conclusion.

Acknowledgments
Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.

Literature cited
Cole, Ronald A., et al. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech.
Van Heerden, C. J., E. Bernard. Speaker-specific Variability of Phoneme Durations.
Fattah, Mohamed, Ren Fuji, Shingo Kuroiwa. Phoneme Based Speaker Modeling to Improve Speaker Identification.

For further information
Please contact the author; further details about the methodology may be read online.
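The vowel-spliced (V) files and the per-phoneme experiments above both depend on TIMIT's time-aligned phonetic transcriptions. The sketch below shows one plausible way to build a vowel-only file from a TIMIT .WAV/.PHN pair; it is not the authors' code, the vowel inventory and file paths are illustrative, and the soundfile package is assumed to be able to read the audio.

```python
import numpy as np
import soundfile as sf  # assumed available for reading and writing audio

# Illustrative subset of TIMIT vowel symbols; the poster does not list the exact set used.
VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw", "er",
          "ey", "ay", "oy", "aw", "ow", "ax", "ix", "axr", "ux"}


def read_phn(phn_path):
    """Parse a TIMIT .PHN file into (start_sample, end_sample, phone) tuples."""
    with open(phn_path) as f:
        return [(int(s), int(e), p) for s, e, p in (line.split() for line in f)]


def splice_vowels(wav_path, phn_path):
    """Concatenate all vowel segments of one utterance into a single waveform."""
    audio, sr = sf.read(wav_path)
    pieces = [audio[start:end] for start, end, phone in read_phn(phn_path) if phone in VOWELS]
    spliced = np.concatenate(pieces) if pieces else np.zeros(0, dtype=audio.dtype)
    return spliced, sr


# Example (paths are illustrative): build the vowel-only counterpart of one utterance.
# vowels, sr = splice_vowels("SA1.WAV", "SA1.PHN")
# sf.write("SA1_vowels.wav", vowels, sr)
```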