ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION
Amir Harati and Joseph Picone
Institute for Signal and Information Processing, Temple University


Introduction
Searching audio, unlike searching text, is approximate: it is typically based on a likelihood computed by some form of pattern recognition system. Performance depends on the acoustic channel, speech rate, accent, language, and the confusability of the search terms. Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system.
Goal: develop a tool, analogous to a password strength checker, that predicts the reliability or strength of a search term.

Spoken Term Detection (STD)
The goal of an STD system is "to rapidly detect the presence of a term in a large audio corpus of heterogeneous speech material." An STD system operates in two phases: 1. indexing the audio; 2. searching through the indexed data. There are two error types: 1. false alarms; 2. missed detections.

Search Term Strength Prediction
Our approach is to use machine learning algorithms to learn the relationship between a phonetic representation of a word and its reliability (word error rate, or WER). Algorithms include linear regression, a feed-forward neural network, a regression tree, and K-nearest neighbors (KNN) in phonetic space. Other algorithms, including random forests and KNN in acoustic space, were applied in an extension of this work (see Further Results Using BBN Data below). Features include duration, the numbers of syllables, consonants, and vowels, the number of occurrences in the language model (count), and monophone, broad phonetic class (BPC), and consonant-vowel-consonant (CVC) frequencies, as well as biphone frequencies, 2-grams of the BPC and CVC frequencies, and 3-grams of the CVC frequencies. A sketch of this feature-based setup is given below.
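For illustration only (this is not the authors' code; the terms, feature values, and the `phonetic_features` helper are hypothetical), a minimal sketch of the feature-based setup in scikit-learn: each search term is mapped to a small feature vector, a regression tree and a KNN regressor are trained to predict term strength (1 - WER), and predictions are scored with cross-validated MSE and correlation (R), as in the poster.

```python
# Illustrative sketch (not the authors' implementation): predict search term
# strength (1 - WER) from simple word-level features with a regression tree
# and K-nearest neighbors, scored by MSE and correlation (R).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

def phonetic_features(word, duration, n_syllables, n_consonants, n_vowels, lm_count):
    """Hypothetical feature vector: duration, syllable/consonant/vowel counts,
    and language-model count. In a real system these would be derived from the
    word's phonetic transcription; the poster also uses monophone, BPC, CVC,
    and biphone frequency features, omitted here for brevity."""
    return [duration, n_syllables, n_consonants, n_vowels, lm_count]

# Toy data: one row per search term; the target is strength = 1 - WER.
X = np.array([
    phonetic_features("cat",            0.25, 1, 2, 1, 300),
    phonetic_features("baghdad",        0.55, 2, 4, 2, 120),
    phonetic_features("revolution",     0.72, 4, 5, 5,  80),
    phonetic_features("temperature",    0.68, 4, 6, 5,  60),
    phonetic_features("go",             0.18, 1, 1, 1, 500),
    phonetic_features("representative", 0.85, 5, 7, 6,  40),
])
y = np.array([0.40, 0.71, 0.85, 0.82, 0.30, 0.88])   # hypothetical 1 - WER values

for name, model in [("Regression tree", DecisionTreeRegressor(max_depth=3)),
                    ("KNN (phonetic space)", KNeighborsRegressor(n_neighbors=2))]:
    pred = cross_val_predict(model, X, y, cv=2)       # cross-validation, as in the poster
    mse = mean_squared_error(y, pred)
    r = np.corrcoef(y, pred)[0, 1]                    # correlation R; R**2 = variance explained
    print(f"{name}: MSE={mse:.3f}, R={r:.3f}")
```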
Experimentation
Data: NIST Spoken Term Detection 2006 evaluation results; cross-validation was used for training. The individual features are correlated with word strength (1 - WER), but the variance is high. Correlation (R) and mean squared error (MSE) are used to assess prediction quality, and duration is shown to be the most important feature. A duration model based on an N-gram representation was developed and trained on the TIMIT dataset.

Results
The correlation between the prediction and the reference is not satisfactory, owing to an insufficient amount of data and to training data that is not based on clean speech.

Further Results Using BBN Data†
KNN in acoustic space reaches R = 0.6; adding a new feature (count) and a random forest raises this to R = 0.7. Particle swarm optimization (PSO) was then used to combine the different machines; more than 70 machines of different types were trained (Table 3). A sketch of this PSO-based combination is given below.
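The PSO-based combination can be illustrated with the minimal sketch below (an illustration under assumed data, not the authors' implementation): PSO searches for non-negative weights over the per-machine predictions so that the weighted combination minimizes MSE against the reference strength. The per-machine predictions and reference values are synthetic placeholders.

```python
# Illustrative sketch (not the authors' implementation): combine the outputs of
# several trained "machines" with weights found by particle swarm optimization,
# minimizing MSE between the weighted prediction and the reference strength.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical held-out predictions from three machines (rows) for ten terms,
# plus the reference strengths (1 - WER).
preds = np.array([
    0.6 + 0.10 * rng.standard_normal(10),
    0.6 + 0.20 * rng.standard_normal(10),
    0.6 + 0.15 * rng.standard_normal(10),
])
reference = 0.6 + 0.10 * rng.standard_normal(10)

def mse_of_weights(w):
    """Objective: MSE of the weighted combination (weights normalized to sum to 1)."""
    w = np.abs(w)
    w = w / (w.sum() + 1e-12)
    combined = w @ preds
    return float(np.mean((combined - reference) ** 2))

# Minimal PSO: a swarm of candidate weight vectors with standard velocity updates.
n_particles, n_dims, n_iters = 20, preds.shape[0], 100
x = rng.random((n_particles, n_dims))                 # particle positions (weights)
v = np.zeros_like(x)                                  # particle velocities
pbest = x.copy()                                      # per-particle best positions
pbest_cost = np.array([mse_of_weights(p) for p in x])
gbest = pbest[np.argmin(pbest_cost)].copy()           # global best position

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_dims))
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
    x = x + v
    cost = np.array([mse_of_weights(p) for p in x])
    improved = cost < pbest_cost
    pbest[improved] = x[improved]
    pbest_cost[improved] = cost[improved]
    gbest = pbest[np.argmin(pbest_cost)].copy()

best_weights = np.abs(gbest) / np.abs(gbest).sum()
print("combination weights:", np.round(best_weights, 3))
print("combined MSE:", round(mse_of_weights(gbest), 4))
```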
Observations
Prediction accuracy for the NIST 2006 results is relatively poor: the best correlation obtained between the prediction and the reference is around 0.46, which means the predictor explains only about 21% of the variance in the data (R² ≈ 0.46² ≈ 0.21). Using more data, better algorithms (random forests and KNN in acoustic space), a new feature (count), and a combination of the different approaches (via the PSO optimizer), the predictions improve significantly: the correlation between the prediction and the reference reaches 0.76, so the predictor explains around 57% of the variance. This predictor can therefore be used in practice to help users who frequently need to search speech data. Part of the error rate is related to factors beyond the “structure” of the word itself; for example, speech rate and the acoustic channel strongly affect the error rate associated with a word. Because the data used in this research is not restricted to acoustically clean speech with a standard accent and speech rate, the trained models have some intrinsic inaccuracy.

Future Work
Train on data generated carefully from acoustically clean speech with proper speech rate and accent. Find features with low correlation to the existing feature set (“count” was such a feature); a confusability score is one candidate. Use more sophisticated models, such as nonparametric Bayesian models (e.g., Gaussian processes), for regression; an illustrative sketch is given after the tables below.

Key References
J. G. Fiscus, et al., “Results of the 2006 Spoken Term Detection Evaluation,” Proc. of the Workshop on Searching Spontaneous Conversational Speech, pp. 45–50, Amsterdam, NL, July 2007.
D. Miller, et al., “Rapid and Accurate Spoken Term Detection,” Proceedings of INTERSPEECH, Antwerp, Belgium, September 2007.

Figure 1. A screenshot of our demonstration software tool that assesses voice keyword search term strength and displays a confidence measure.
Figure 2. Spoken term detection can be partitioned into two tasks: indexing and search. One common approach to indexing is to use a speech-to-text system (after Fiscus et al., 2007).
Figure 3. An overview of our approach to search term strength prediction, based on decomposing terms into features.
Figure 4. The relationship between duration and error rate shows that longer words generally result in better performance.
Figure 5. The predicted error rate is plotted against the reference error rate, demonstrating good correlation between the two.

Table 1. Results for the feature-based method on the NIST data: MSE and R for regression, NN, and DT on the train and eval sets, over feature sets built from duration, the numbers of syllables and consonants, length/duration ratios, and CVC frequencies.
Table 2. Results for KNN in phonetic space on the BBN dataset: MSE and R on the train and eval sets as a function of K.
Table 3. Best results after combining different machines using PSO: MSE, R, and the relative contributions of the acoustic, phonetic, and feature-based machines (the phonetic and feature-based contributions are 10.5% and 48.3% for the combination of all machines, and 15.7% and 39.5% for NN + RF).
† The results of this section will be published in the “Journal of Speech Technology” shortly.
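As referenced under Future Work, the Gaussian process regression mentioned there could look like the following minimal sketch (an illustration with hypothetical data, not the authors' model): scikit-learn's GaussianProcessRegressor with an RBF kernel predicts term strength from a single duration feature, and the predictive standard deviation provides a natural confidence measure of the kind displayed in Figure 1.

```python
# Illustrative sketch (not the authors' implementation): Gaussian process
# regression, one of the nonparametric Bayesian models mentioned under
# "Future Work", applied to hypothetical term-strength data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(1)

# Hypothetical training data: a single feature (term duration in seconds)
# and the corresponding strength (1 - WER).
durations = np.array([[0.2], [0.3], [0.45], [0.6], [0.8], [1.0]])
strength = np.array([0.35, 0.45, 0.60, 0.70, 0.80, 0.85])
strength_noisy = strength + 0.02 * rng.standard_normal(strength.shape)

# An RBF kernel with a learned amplitude; alpha models observation noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=0.3)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)
gpr.fit(durations, strength_noisy)

# Predict strength (with uncertainty) for unseen durations; the predictive
# standard deviation could drive a confidence display.
test = np.array([[0.25], [0.5], [0.9]])
mean, std = gpr.predict(test, return_std=True)
for d, m, s in zip(test.ravel(), mean, std):
    print(f"duration={d:.2f}s -> predicted strength={m:.2f} (+/-{s:.2f})")
```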