ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.

Slides:

Advertisements

Similar presentations

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:

Advertisements

A Brief Overview of Neural Networks By Rohit Dua, Samuel A. Mulder, Steve E. Watkins, and Donald C. Wunsch.

Building an ASR using HTK CS4706

15.0 Utterance Verification and Keyword/Key Phrase Spotting References: 1. “Speech Recognition and Utterance Verification Based on a Generalized Confidence.

Adaption Adjusting Model’s parameters for a new speaker. Adjusting all parameters need a huge amount of data (impractical). The solution is to cluster.

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.

Phoneme Alignment. Slide 1 Phoneme Alignment based on Discriminative Learning Shai Shalev-Shwartz The Hebrew University, Jerusalem Joint work with Joseph.

On-line Learning with Passive-Aggressive Algorithms Joseph Keshet The Hebrew University Learning Seminar,2004.

4/25/2001ECE566 Philip Felber1 Speech Recognition A report of an Isolated Word experiment. By Philip Felber Illinois Institute of Technology April 25,

Statistical techniques in NLP Vasileios Hatzivassiloglou University of Texas at Dallas.

1 An Introduction to Nonparametric Regression Ning Li March 15 th, 2004 Biostatistics 277.

Text-To-Speech System for Marathi Miss. Deepa V. Kadam Indian Institute of Technology, Bombay.

DIVINES – Speech Rec. and Intrinsic Variation W.S.May 20, 2006 Richard Rose DIVINES SRIV Workshop The Influence of Word Detection Variability on IR Performance.

Structure of Spoken Language

Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.

Introduction Mel- Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are.

Word-subword based keyword spotting with implications in OOV detection Jan “Honza” Černocký, Igor Szöke, Mirko Hannemann, Stefan Kombrink Brno University.

Rapid and Accurate Spoken Term Detection Owen Kimball BBN Technologies 15 December 2006.

Speech Signal Processing

English vs. Mandarin: A Phonetic Comparison Experimental Setup Abstract The focus of this work is to assess the performance of three new variational inference.

Structure of Spoken Language

Old Dominion University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone, PhD Department of Electrical and Computer.

Machine Learning in Spoken Language Processing Lecture 21 Spoken Language Processing Prof. Andrew Rosenberg.

A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.

Hierarchical Dirichlet Process (HDP) A Dirichlet process (DP) is a discrete distribution that is composed of a weighted sum of impulse functions. Weights.

1 Phonetics and Phonemics. 2 Phonetics and Phonemics : Phonetics The principle goal of Phonetics is to provide an exact description of every known speech.

CS 551/652: Structure of Spoken Language Lecture 2: Spectrogram Reading and Introductory Phonetics John-Paul Hosom Fall 2010.

Temple University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone, PhD Department of Electrical and Computer.

LINEAR DYNAMIC MODEL FOR CONTINUOUS SPEECH RECOGNITION Ph.D. Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing.

Temple University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone, PhD Department of Electrical and Computer.

Kishore Prahallad IIIT-Hyderabad 1 Unit Selection Synthesis in Indian Languages (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)

Experimentation Duration is the most significant feature with around 40% correlation. Experimentation Duration is the most significant feature with around.

Daniel May Department of Electrical and Computer Engineering Mississippi State University Analysis of Correlation Dimension Across Phones.

LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.

Rapid and Accurate Spoken Term Detection Michael Kleber BBN Technologies 15 December 2006.

Quantitative and qualitative differences in understanding sentences interrupted with noise by young normal-hearing and elderly hearing-impaired listeners.

Experimentation Duration is the most significant feature with around 40% correlation. Experimentation Duration is the most significant feature with around.

A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation Roy Wallace, Robbie Vogt and Sridha Sridharan Speech and Audio Research Laboratory,

Word and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio Beth Logan Pedro J. Moreno Om Deshmukh Cambridge Research.

Temple University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Department of Electrical and Computer Engineering.

TUH EEG Corpus Data Analysis 38,437 files from the Corpus were analyzed. 3,738 of these EEGs do not contain the proper channel assignments specified in.

Introduction to Speech Neal Snider, For LIN110, April 12 th, 2005 (adapted from slides by Florian Jaeger)

Robust speaking rate estimation using broad phonetic class recognition Jiahong Yuan and Mark Liberman University of Pennsylvania Mar. 16, 2010.

New Acoustic-Phonetic Correlates Sorin Dusan and Larry Rabiner Center for Advanced Information Processing Rutgers University Piscataway,

Conditional Random Fields for ASR Jeremy Morris July 25, 2006.

PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic.

Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.

Combining Speech Attributes for Speech Recognition Jeremy Morris November 9, 2006.

Adaption Def: To adjust model parameters for new speakers. Adjusting all parameters requires an impractical amount of data. Solution: Create clusters and.

Speech Lab, ECE, State University of New York at Binghamton  Classification accuracies of neural network (left) and MXL (right) classifiers with various.

Experimentation Duration is the most significant feature with around 40% correlation. Experimentation Duration is the most significant feature with around.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-25: Vowels cntd and a “grand” assignment.

A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.

STD Approach Two general approaches: word-based and phonetics-based Goal is to rapidly detect the presence of a term in a large audio corpus of heterogeneous.

English vs. Mandarin: A Phonetic Comparison The Data & Setup Abstract The focus of this work is to assess the performance of new variational inference.

APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION Amir Harati and Joseph PiconeMarc Sobel Institute for Signal and Information Processing,

Author :K. Thambiratnam and S. Sridharan DYNAMIC MATCH PHONE-LATTICE SEARCHES FOR VERY FAST AND ACCURATE UNRESTRICTED VOCABULARY KEYWORD SPOTTING Reporter.

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.

A NONPARAMETRIC BAYESIAN APPROACH FOR

An Efficient Online Algorithm for Hierarchical Phoneme Classification

Structure of Spoken Language

Structure of Spoken Language

College of Engineering

Conditional Random Fields for ASR

Structure of Spoken Language

EEG Recognition Using The Kaldi Speech Recognition Toolkit

Phonetics and Phonemics

Segment-Based Speech Recognition

Word Embedding Word2Vec.

Phonetics and Phonemics

Presentation transcript:

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University Experimentation Correlation (R) and mean square error (MSE) are used to assess the prediction quality. Duration is the most significant feature with around 40% correlation. Experimentation Correlation (R) and mean square error (MSE) are used to assess the prediction quality. Duration is the most significant feature with around 40% correlation. Features Most features can be derived from some basic representations (Table 3). A duration model based on N-gram phonetic representation developed and trained using TIMIT dataset. Features Most features can be derived from some basic representations (Table 3). A duration model based on N-gram phonetic representation developed and trained using TIMIT dataset. Machine Learning Algorithms Using machine learning algorithms to learn the relationship between a phonetic representation of a word and its word error rate (WER). The score is defined based on average WER predicted for a word. Strength Score =1-WER Algorithms: Linear Regression, Feed-Forward Neural Network, Regression Tree and K-nearest neighbors (KNN) in the phonetic space. Preprocessing includes whitening using singular value decomposition (SVD). Neural Network has two layers with 30 neurons and trained by back-propagation algorithm. Machine Learning Algorithms Using machine learning algorithms to learn the relationship between a phonetic representation of a word and its word error rate (WER). The score is defined based on average WER predicted for a word. Strength Score =1-WER Algorithms: Linear Regression, Feed-Forward Neural Network, Regression Tree and K-nearest neighbors (KNN) in the phonetic space. Preprocessing includes whitening using singular value decomposition (SVD). Neural Network has two layers with 30 neurons and trained by back-propagation algorithm. Figure 4. The relationship between duration and error rate shows that longer words generally result in better performance. Results Table 5- Results for feature based method over NIST Table 6- Results for KNN in Phonetic space for BBN dataset. Introduction Searching audio, unlike text data, is approximate and is based on likelihoods. Performance depends on acoustic channel, speech rate, accent, language and confusability. Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system. Goal: Develop a tool similar to how password checkers assess the strength of a password. Introduction Searching audio, unlike text data, is approximate and is based on likelihoods. Performance depends on acoustic channel, speech rate, accent, language and confusability. Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system. Goal: Develop a tool similar to how password checkers assess the strength of a password. Figure 1. A screenshot of our demonstration software: Spoken Term Detection (STD) STD Goal: “…detect the presence of a term in large audio corpus of heterogeneous speech…” STD Phases: 1.Indexing the audio file. 2.Searching through the indexed data. Error types: 1. False alarms. 2.Missed detections. Spoken Term Detection (STD) STD Goal: “…detect the presence of a term in large audio corpus of heterogeneous speech…” STD Phases: 1.Indexing the audio file. 2.Searching through the indexed data. Error types: 1. False alarms. 2.Missed detections. Figure 2. A common approach in STD is to use a speech to text system to index the speech signal (J. G. Fiscus, et al., 2007). Features TrainEval RegressionNNDTRegressionNNDT MSER R R R RMSRR Duration Duration + No. Syllables Duration + No. Consonants Duration + No. Syllables + No. Consonants Duration + Length + No. Syllables /Duration Duration +#Consonants + Length/Duration + #Syllables / Duration +CVC K TrainEval MSERMSRR Figure 5. The predicted error rate is plotted against the reference error rate, demonstrating good correlation between the two. Figure 3. An overview of our approach to search term strength prediction that is based on decomposing terms into features. Wordtsunami Phonemes t s uh n aa m iy Vowels uh aa iy Consonantst s n m SyllablesTsoo nah mee BPCS F V N V N V CVC C C V C V C V ClassPhone Stops (S) b p d t g k Fricative (F) jh ch s sh z zh f th v dh hh Nasals (N) m n ng en Liquids (L) l el r w y Vowels (V) iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er Table 2- Broad Phonetic Classes (BPC) Table 3- Example of converting word into basic feature representations Observations Maximum correlation around 46% means just around 21% of the variance in the data is explained by the predictor. Features used in this work are highly correlated. Part of the error rate related to factors beyond the “structure” of the word itself. For example, speech rate or acoustic channel are greatly effect the error rate associated with a word. KNN on phonetic space shows a better prediction capability. Trained models have some intrinsic inaccuracy because of the conditional diversity of the Data. Feature Set DurationLength No. of Syllables No. Consonants No. of Vowels phoneme Freq. BPC Freq. CVC Freq. 2-Grams of Phoneme, BPC, CVC 2-Gram Of BPC 2-Grmas of CVC 3-Grams of CVC No. Syllables/ Duration Length/ Duration No. Vowels/ No Conso nants Start- End Phone Table1- Features used to predict the keyword strength. Data SetNIST Spoken Term Detection 2006 Evaluation results SitesBBNIBMSRI Sources Broadcast News (3hrs) Conversational Telephone Speech (3 hrs) Conference Meetings (2 hrs). Table 4- Results for feature based method over NIST Future Works Use data generated carefully from acoustically clean speech with proper speech rate and accent for training. Finding features with small correlation to the existed set of features. Among the candidates are confusability score and expected number of occurrences of a word in the language model. Combining the outputs of several machines using optimization techniques such as particle swarm optimization (PSO). Using more complicated models such as nonparametric Bayesian models (e.g. Gaussian process.) for regression. We have developed algorithms based on some of these suggestions which improves the correlation to around 76% which corresponds to explaining 58% of the variance of the observed data. The results will be published in the near future.