Temple University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Department of Electrical and Computer Engineering.

Presentation transcript:

Temple University QUALITY ASSESSMENT OF SEARCH TERMS IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Department of Electrical and Computer Engineering Temple University URL:

Temple University: Slide 1 Abstract
Spoken term detection is an extension of text-based searching that allows users to type keywords and search audio files containing spoken language for their existence. Performance is dependent on many external factors such as the acoustic channel, language and the confusability of the search term. Unlike text-based searches, the quality of the search term plays a significant role in the overall perception of the usability of the system. In this presentation we will review conventional approaches to keyword search.
Goal: Develop a tool similar to the way password checking tools currently work.
Approach: Develop models that predict the quality of a search term based on its spelling (and underlying phonetic context).

Temple University: Slide 2 Motivation
1) What makes machine understanding of human language so difficult?
- “In any natural history of the human species, language would stand out as the preeminent trait.”
- “For you and I belong to a species with a remarkable trait: we can shape events in each other’s brains with exquisite precision.”
S. Pinker, The Language Instinct: How the Mind Creates Language,
2) According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. (J. Gray, )
3) Hundreds of linguistic phenomena must be taken into account to understand written language.
- Each cannot always be perfectly identified (e.g., Microsoft Word).
- 95% x 95% x … x … x … x … x … = a small number
Keyword search becomes a viable alternative to speech-to-text transcription, especially if it can be done quickly.

Temple University: Slide 3 Maybe We Don’t Need to Understand Language?
See ISIP Phonetic Units to run a demo of the influence of phonetic units on different speaking styles.

Temple University: Slide 4 The World’s Languages
- There are over 6,000 known languages in the world.
- The dominance of English is being challenged by growth in Asian and Arabic languages.
- Common languages are used to facilitate communication; native languages are often used for covert communications.
[Figure: U.S. Census, Non-English Languages]

Temple University: Slide 5 The “Needle in a Haystack” Problem
- Detection Error Tradeoff (DET) curves, a variant of ROC curves, are a common way to characterize system performance.
- Intelligence applications often demand very low false alarm rates AND low miss probabilities.
- Consider a 0.1% false alarm rate applied to 1M phone calls per day. This yields 1,000 calls per day that must be reviewed, which is too many.
- The reality is that current HLT does not operate reliably at such extremes.
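The review workload quoted on this slide is simple arithmetic, and a short sketch makes it easy to see how quickly it scales with the false alarm rate. The function name and the two lower rates are illustrative assumptions, and the calculation treats every call as a non-target call.

```python
def daily_false_alarms(calls_per_day: int, false_alarm_rate: float) -> float:
    """Expected number of non-target calls flagged for review each day."""
    return calls_per_day * false_alarm_rate


calls = 1_000_000  # 1M phone calls per day, as on the slide
for rate in (0.001, 0.0001, 0.00001):  # 0.1% is the slide's figure; the others are hypothetical
    flagged = daily_false_alarms(calls, rate)
    print(f"false alarm rate {rate:.5f}: about {flagged:,.0f} calls per day to review")
```

Even a tenfold improvement in the false alarm rate still leaves a reviewer with a substantial daily queue at this call volume.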

Temple University: Slide 6 Speech Recognition Architectures
Core components of modern speech recognition systems:
- Transduction: conversion of an electrical or acoustic signal to a digital signal;
- Feature Extraction: conversion of samples to vectors containing the salient information;
- Acoustic Model: statistical representation of basic sound patterns (e.g., hidden Markov models);
- Language Model: statistical model of common words or phrases (e.g., N-grams);
- Search: finding the best hypothesis for the data using an optimization procedure.
[Figure: block diagram with Input Speech, Acoustic Front-end, Acoustic Models P(A|W), Language Model P(W), Search, Recognized Utterance]

Temple University: Slide 7 Statistical Approach: Noisy Communication Channel Model
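The slide's own figure is not available here, but the noisy channel view is conventionally written as a Bayes decision rule. In the standard form below, W is a candidate word sequence and A is the acoustic observation sequence, matching the acoustic model P(A|W) and language model P(W) named on Slide 6.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

The last step drops P(A) because it does not depend on W, so the search only needs the acoustic and language model scores.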

Temple University: Slide 8 Top Down vs. Bottom Up
- Speech recognition systems typically work either in a top-down or bottom-up mode, trading speed for accuracy.
- The top-down approach exploits linguistic context through the use of a word-based language model.
- The bottom-up approach spots N-grams of phones and favors speed over accuracy.
- The general approach is to precompute a permuted database of phone indices (10 to 50 times faster than real time, xfRT).
- This database can be quickly searched for words or word combinations (~1000 xfRT).
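The bottom-up idea of precomputing an index of phone N-grams and searching it later can be sketched as follows. This is a toy illustration under stated assumptions, not the indexing scheme of any particular system: it uses exact trigram matching on a single best phone string, whereas real systems tolerate recognition errors. All function names and the two example utterances are hypothetical.

```python
from collections import defaultdict


def phone_ngrams(phones, n=3):
    """All consecutive phone n-grams in a phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(transcripts, n=3):
    """Map each phone n-gram to the (utterance id, position) pairs where it occurs."""
    index = defaultdict(set)
    for utt_id, phones in transcripts.items():
        for pos, gram in enumerate(phone_ngrams(phones, n)):
            index[gram].add((utt_id, pos))
    return index


def search(index, query_phones, n=3):
    """Return (utterance id, start) pairs where the query's n-grams occur back to back."""
    grams = phone_ngrams(query_phones, n)
    if not grams:
        return set()
    candidates = set(index.get(grams[0], set()))
    for offset, gram in enumerate(grams[1:], start=1):
        hits = index.get(gram, set())
        candidates = {(u, p) for (u, p) in candidates if (u, p + offset) in hits}
    return candidates


# Hypothetical phonetic transcripts produced offline by a phone recognizer.
transcripts = {
    "utt1": "sil s ah m th ih ng sil".split(),
    "utt2": "sil k ae t sil".split(),
}
index = build_index(transcripts)
print(search(index, "s ah m th ih ng".split()))
```

The expensive step (phone recognition and indexing) runs once per recording, while each query only touches the index, which is where the large speed difference between indexing and searching comes from.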

Temple University: Slide 9 A Typical Word-Based STD System
[Figure: block diagram. Stages: indexing, searching. Blocks: Byblos STT, indexer, detector, decider. Data: audio, lattices, phonetic transcripts, index, search terms, scored detection lists, ATWV cost parameters, final output with YES/NO decisions.]
From Miller, et al., “Rapid and Accurate Spoken Term Detection”

Temple University: Slide 10 Predicting Search Term Performance
Data: 2006 STD data was a mix of Broadcast News (3 hrs), Conversational Telephone Speech (3 hrs) and Conference Meetings (2 hrs).
- 1100 unique reference terms; 14,421 occurrences (skewed by frequency)
- 475 unique terms after removing multi-word terms and terms that occurred less than three times.
Evaluation Paradigm:
- Closed-Loop: All 475 search terms used in one run.
- Open-Loop: Data randomly partitioned into train (80%) and eval (20%) for 100 iterations. Results are averaged across all runs.
Three Machine Learning Approaches:
- Multiple Linear Regression (regress): preprocessed data using SVD and then fit the data using least squares.
- Neural Network (newff): a simple 2 layer network that used backpropagation for training and SVD for feature decorrelation.
- Decision Tree (treefit): a binary tree with a twoing splitting rule.
Goal: Predict error rate as a function of feature combinations including linguistic content (e.g., phones, phonetic class, syllables) and duration.
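The closed-loop and open-loop paradigms described on this slide can be sketched with a single feature. This is a minimal illustration, not the authors' code: it fits ordinary least squares on one feature (so no SVD step is needed), and the durations and error rates are synthetic stand-ins generated in the script. All function and variable names are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y ~ a + b * x with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def mse(pred, ref):
    """Mean squared error between predictions and reference values."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def closed_loop(x, y):
    """Train and evaluate on the same terms."""
    a, b = fit_line(x, y)
    return mse([a + b * xi for xi in x], y)


def open_loop(x, y, iterations=100, train_frac=0.8, seed=0):
    """Average held-out MSE over repeated random 80/20 train/eval partitions."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    cut = int(train_frac * len(idx))
    scores = []
    for _ in range(iterations):
        rng.shuffle(idx)
        train, held_out = idx[:cut], idx[cut:]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [a + b * x[i] for i in held_out]
        scores.append(mse(pred, [y[i] for i in held_out]))
    return sum(scores) / len(scores)


# Synthetic data only: 475 terms whose error rate falls with duration, plus noise.
rng = random.Random(1)
durations = [rng.uniform(0.2, 1.2) for _ in range(475)]
error_rates = [0.6 - 0.3 * d + rng.gauss(0, 0.05) for d in durations]

print("closed-loop MSE:", closed_loop(durations, error_rates))
print("open-loop MSE:  ", open_loop(durations, error_rates))
```

The open-loop figure is the more honest estimate of how a predictor would behave on unseen search terms, since the closed-loop run evaluates on the same terms it was fit to.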

Temple University: Slide 11 NIST 2006 Spoken Term Detection Evaluation
Weighted Performance Measure
- Maximum TWV is 1.0
Approach: Measure error rates using a manually transcribed reference corpus.
Data: Use a mixture of languages and sources:
- High Quality: Broadcast News
- Medium Quality: Telephone Speech
- Low Quality: Conference Meetings
Error Counting
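The slide's equations did not survive, so the sketch below uses the Term-Weighted Value as it is commonly stated for the NIST 2006 evaluation: one minus the average over terms of P_miss + beta * P_FA, with beta = 999.9 and one non-target trial per second of speech. Treat those constants and the per-term counting as assumptions to check against the evaluation plan. The function name and the example counts are hypothetical.

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """Term-Weighted Value from per-term detection counts.

    counts maps each term to (n_true, n_correct, n_spurious):
    reference occurrences, correct detections, and false alarms.
    """
    penalties = []
    for n_true, n_correct, n_spurious in counts.values():
        if n_true == 0:
            continue  # miss probability is undefined for terms with no reference occurrence
        p_miss = 1.0 - n_correct / n_true
        # Assumed trial model: one non-target trial per second of speech.
        p_false_alarm = n_spurious / (speech_seconds - n_true)
        penalties.append(p_miss + beta * p_false_alarm)
    return 1.0 - sum(penalties) / len(penalties)


# Hypothetical counts for two terms in 3 hours (10,800 s) of speech.
counts = {"something": (20, 15, 2), "cat": (5, 2, 8)}
print(term_weighted_value(counts, speech_seconds=3 * 3600))
```

Under this definition a perfect system scores 1.0, which agrees with the maximum stated on the slide, and a system that returns nothing scores 0.0. Because beta is large, a handful of false alarms on a rare term can push the value negative.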

Temple University: Slide 12 NIST 2006 Spoken Term Detection Evaluation
[Figure panels: Phonetic-Based Approaches; Word-Based Approaches]

Temple University: Slide 13 Search Term Error Rates
- Search term error rates typically vary with the duration of the word.
- Monosyllabic words tend to have a high error rate.
- Polysyllabic words occur less frequently and are harder to estimate.
- Multi-word sequences are common (e.g., Google search).
- Query length is not the whole story.
- Alternate measures, such as TWV, model the localization of the search hit. These have produced unpredictable results in our work.
[Figure: Average error rate (misses and false alarms) as a function of the number of syllables shows a clear correlation.]

Temple University: Slide 14 Feature Generation
word:          something
phones         sil        s             ah           m             th           ih           ng             sil
CVC            sil        C             V            C             C            V            C              sil
CVC bigrams    sil+C (0)  C+V (7)       V+C (2)      C+C (4)       C+V (1)      V+C (2)      C+sil (4)      N/A
CVC trigrams   N/A        sil-C+V (20)  C-V+C (4)    V-C+C (10)    C-C+V (2)    C-V+C (4)    V-C+sil (12)   N/A
BPC            sil (0)    fricative (2) vowel (5)    nasal (3)     fricative (2) vowel (5)   nasal (3)      sil (0)
BPC bigrams    sil+f (3)  f+v (18)      v+n (34)     n+f (21)      f+v (18)     v+n (34)     n+sil (19)     N/A
Processing pipeline: Input Search Term -> Feature Generation -> Post-Processing -> Preprocessing and SVD -> Machine Learning -> Final Score
Features are decorrelated using Singular Value Decomposition (SVD): the goal is to statistically normalize features that have significantly different ranges, means and variances.
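One way to realize the SVD-based decorrelation mentioned on this slide is to center the feature matrix and rotate and scale it along its singular vectors, which yields features with zero mean, unit variance and zero correlation. This is a sketch of that general technique with NumPy, not necessarily the exact preprocessing the authors used. The feature values are synthetic and the function name is hypothetical.

```python
import numpy as np


def decorrelate(features, tol=1e-10):
    """Center the columns, then rotate and scale them to unit, uncorrelated variance."""
    centered = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    keep = s > tol * s.max()  # drop directions that carry no variance
    scale = s[keep] / np.sqrt(len(features) - 1)
    return centered @ vt[keep].T / scale


# Synthetic features with very different ranges: duration (s), syllables, characters.
rng = np.random.default_rng(0)
duration = rng.uniform(0.2, 1.2, size=200)
syllables = np.round(duration * 4 + rng.normal(0, 0.5, size=200))
characters = syllables * 3 + rng.normal(0, 1, size=200)
features = np.column_stack([duration, syllables, characters])

normalized = decorrelate(features)
print(np.round(np.cov(normalized, rowvar=False), 6))
```

After this step the sample covariance of the transformed features is the identity matrix, so a least-squares fit or neural network no longer has to cope with inputs on mismatched scales.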

Temple University: Slide 15 Machine Learning Approaches
- Multivariate Linear Regression (regress):
- Multilayer Perceptron Neural Network (newff):
- Classification and Regression Decision Tree (treefit):
Processing pipeline: Input Search Term -> Feature Generation -> Post-Processing -> Preprocessing and SVD -> Machine Learning -> Final Score

Temple University: Slide 16 Baseline Experiments - Duration
[Table: MSE and R for Regression, NN and DT under Closed-Loop and Open-Loop evaluation. Feature rows: Duration; No. Syllables; No. Phones; No. Vowels; No. Consonants; No. Characters. Numeric entries are missing.]
- Duration is the average word duration based on all word tokens.
- Duration has long been known to be an important cue in speech processing.
- The “length” of a search term, as measured in duration, number of syllables, or number of phones, has been observed to be significant “operationally.”
- Number of phones (or number of characters) is slightly better than the number of syllables.
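The tables on these slides report two figures of merit per experiment, MSE and the correlation R between predicted and measured error rates. A minimal sketch of both follows, assuming R is the Pearson correlation coefficient. The function names and the three-point example are illustrative only.

```python
import math


def mean_squared_error(pred, ref):
    """Average squared difference between predicted and reference error rates."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def pearson_r(pred, ref):
    """Pearson correlation between predicted and reference error rates."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


# Toy example: predictions that rank terms perfectly but are off in scale.
predicted = [0.1, 0.2, 0.3]
measured = [0.2, 0.4, 0.6]
print("MSE:", mean_squared_error(predicted, measured))
print("R:  ", pearson_r(predicted, measured))
```

The toy case shows why the two numbers can disagree: the predictions are perfectly correlated with the measurements yet still have a nonzero squared error, so a predictor can rank search terms well while misjudging their absolute error rates.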

Temple University: Slide 17 Baseline Experiments - Phone Type
[Table: MSE and R for Regression, NN and DT under Closed-Loop and Open-Loop evaluation. Feature rows: Duration; Init. Phone Type; Final Phone Type; No. Vowels / No. Consonants; CVC; BPC. Numeric entries are missing.]
Consonant Vowel Consonant (CVC): “Cat” -> C V C
Broad Phonetic Class (BPC):
Class       Phones
Stops       b p d t g k
Fricative   jh ch s sh z zh f th v dh hh
Nasals      m n ng en
Liquids     l el r w y
Vowels      iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er
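The class table above is enough to reproduce the symbolic part of the feature generation shown on Slide 14 for the word "something". The sketch below maps a phone string to its BPC and CVC sequences and forms bigrams and trigrams in the slide's notation. The numeric codes shown in parentheses on Slide 14 are not reproduced, since their assignment is not given, and all function names are hypothetical.

```python
# Broad phonetic classes, copied from the class table on this slide.
BPC = {
    "stop": "b p d t g k",
    "fricative": "jh ch s sh z zh f th v dh hh",
    "nasal": "m n ng en",
    "liquid": "l el r w y",
    "vowel": "iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er",
}
PHONE_TO_BPC = {p: cls for cls, phones in BPC.items() for p in phones.split()}
PHONE_TO_BPC["sil"] = "sil"  # silence keeps its own label


def bpc_sequence(phones):
    """Broad phonetic class of each phone."""
    return [PHONE_TO_BPC[p] for p in phones]


def cvc_sequence(phones):
    """Consonant/vowel label of each phone; silence stays 'sil'."""
    labels = []
    for cls in bpc_sequence(phones):
        labels.append("sil" if cls == "sil" else "V" if cls == "vowel" else "C")
    return labels


def bigrams(seq):
    """Adjacent pairs written as left+right."""
    return [f"{a}+{b}" for a, b in zip(seq, seq[1:])]


def trigrams(seq):
    """Adjacent triples written as left-center+right."""
    return [f"{a}-{b}+{c}" for a, b, c in zip(seq, seq[1:], seq[2:])]


phones = "sil s ah m th ih ng sil".split()  # "something", as on Slide 14
print(cvc_sequence(phones))
print(bigrams(cvc_sequence(phones)))
print(trigrams(cvc_sequence(phones)))
print(bpc_sequence(phones))
```

Counting or one-hot encoding these symbols would then give the numeric feature vectors that feed the SVD and machine learning stages.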

Temple University: Slide 18 CVC and BPC N-grams
[Table: MSE and R for Regression, NN and DT under Closed-Loop and Open-Loop evaluation. Feature rows: Duration; CVC; BPC; BPC Bigrams; CVC Bigrams; CVC Trigrams. Numeric entries are missing.]
- Insufficient amount of training data to support phone N-grams.
- Explored many different ways to select the most influential N-grams (e.g., most common N-grams in the most accurate and least accurate words) with no improvement in performance.
- Also explored the relationship of the position in the word with little effect.

Temple University: Slide 19 Feature Combinations
[Table: MSE and R for Regression, NN and DT under Closed-Loop and Open-Loop evaluation. Feature rows: Duration; Duration + No. Syllables; Duration + No. Consonants; Duration + No. Syllables + No. Consonants; Duration + Length + No. Syllables / Duration; Duration + No. Consonants + Length / Duration + No. Syllables / Duration + CVC. Numeric entries are missing.]

Temple University: Slide 20 Future Directions
How do we get better?
- We need more data and are in the process of acquiring 10x more data from both word and phonetic search engines.
- Need more data from both clean and noisy conditions.
- More data will provide better estimates of search term accuracy and also allow us to build more complex prediction functions.
- More data will let us explore more sophisticated features, such as phone N-grams.
How can we improve performance with the current data?
- Combining multiple prediction functions is an obvious way to improve performance.
- We are not convinced MSE or R are the proper metrics for performance. We have explored postprocessing the error functions to limit the effects of outliers, but this has not resulted in better overall performance.
What are the limits of performance?
- Predicting error rates only from spellings ignores a number of important factors that contribute to recognition performance, such as speaking rate.
- Correlating metadata with keyword search results can be powerful.

Temple University: Slide 21 Brief Bibliography of Related Research
- S. Pinker, The Language Instinct: How the Mind Creates Language, William Morrow and Company, New York, New York, USA,
- “The NIST 2006 Spoken Term Detection Evaluation,” available at
- F. Juang and L.R. Rabiner, “Automatic Speech Recognition - A Brief History of the Technology,” Elsevier Encyclopedia of Language and Linguistics, 2nd Edition,
- P. Yu, K. Chen, C. Ma and F. Seide, “Vocabulary-Independent Indexing of Spontaneous Speech,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp , Sept (doi: /TSA ).
- R. Wallace, R. Vogt and S. Sridharan, “Spoken Term Detection Using Fast Phonetic Decoding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp , April 2009 (doi: /ICASSP ).

Temple University: Slide 22 Biography Amir H Harati Nejad Torbati is a PhD student in the Department of Electrical and Computer Engineering at Temple University. He graduated from University of Tabriz with a BS in Electrical Engineering. He received his MS in Electrical Engineering, Communication System Major, from K.N. Toosi University of Technology Tehran-Iran in He is a student member of IEEE. His interests include signal and speech processing. He is currently pursuing research on new statistical model approaches in speech recognition.