
Presentation transcript:

ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION
Amir Harati and Joseph Picone
Institute for Signal and Information Processing, Temple University

Introduction
Searching audio, unlike text data, is approximate and is based on likelihoods. Performance depends on the acoustic channel, speech rate, accent, language, and confusability. Unlike text-based search, the quality of the search term plays a significant role in the overall perception of the usability of the system.
Goal: Develop a tool to assess the strength of a search term, similar to how password checkers assess the strength of a password.
Figure 1. A screenshot of our demonstration software.

Spoken Term Detection (STD)
STD goal: “…detect the presence of a term in a large audio corpus of heterogeneous speech…”
STD phases: 1. Indexing the audio file. 2. Searching through the indexed data.
Error types: 1. False alarms. 2. Missed detections.
Figure 2. A common approach in STD is to use a speech-to-text system to index the speech signal (J. G. Fiscus, et al., 2007).

Data Set
NIST Spoken Term Detection 2006 Evaluation results.
Sites: BBN, IBM, SRI.
Sources: Broadcast News (3 hrs), Conversational Telephone Speech (3 hrs), Conference Meetings (2 hrs).

Core Features
 Duration
 Length
 No. of Syllables
 No. of Vowels
 No. of Consonants
 Phoneme Frequency
 BPC and CVC Frequency
 Length/Duration
 No. Syllables/Duration
 No. Vowels/No. Consonants
 Start-End Phoneme
 2-Grams of Phonemes
 2-Grams of BPC
 2- and 3-Grams of CVCs

Example decomposition of the word “tsunami”:
Word: tsunami
Phonemes: t s uh n aa m iy
Vowels: uh aa iy
Consonants: t s n m
Syllables: Tsoo nah mee
BPC: S F V N V N V
CVC: C C V C V C V

Broad phonetic classes (BPC):
Stops (S): b p d t g k
Fricatives (F): jh ch s sh z zh f th v dh hh
Nasals (N): m n ng en
Liquids (L): l el r w y
Vowels (V): iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er

Figure 3. An overview of our approach to search term strength prediction, which is based on decomposing terms into features.

Machine Learning Algorithms
We use machine learning algorithms to learn the relationship between a phonetic representation of a word and its word error rate (WER). The score is defined from the average WER predicted for a word: Strength Score = 1 − WER.
Algorithms: linear regression, a feed-forward neural network, a regression tree, and K-nearest neighbors (KNN) in the phonetic feature space. Preprocessing includes whitening using singular value decomposition (SVD). The neural network is a two-layer, 30-neuron network trained with back-propagation. (Illustrative sketches of the feature decomposition and of the KNN-based scoring follow below.)
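As an illustration of the decomposition above, the following minimal Python sketch maps a phoneme string to its BPC and CVC labels and computes a few of the count- and ratio-based core features. It is not the authors' code: the class tables follow the BPC definitions listed above, while the duration and syllable count are assumed to be supplied externally (e.g., from a forced alignment or a lexicon) and the example values are hypothetical.

```python
# Minimal sketch of phonetic feature extraction (illustrative only).
# The class tables follow the BPC definitions on the poster; CVC marks consonant/vowel.

BPC = {
    "S": "b p d t g k".split(),
    "F": "jh ch s sh z zh f th v dh hh".split(),
    "N": "m n ng en".split(),
    "L": "l el r w y".split(),
    "V": "iy ih eh ey ae aa aw ay ah ao ax oy ow uh uw er".split(),
}
PHONE_TO_BPC = {p: c for c, phones in BPC.items() for p in phones}

def decompose(phonemes):
    """Map a phoneme sequence to its BPC and CVC label sequences."""
    bpc = [PHONE_TO_BPC[p] for p in phonemes]
    cvc = ["V" if c == "V" else "C" for c in bpc]
    return bpc, cvc

def core_features(phonemes, duration_sec, n_syllables):
    """A few of the count/ratio features listed under 'Core Features'."""
    bpc, cvc = decompose(phonemes)
    n_vowels = cvc.count("V")
    n_consonants = cvc.count("C")
    length = len(phonemes)
    return {
        "duration": duration_sec,
        "length": length,
        "n_syllables": n_syllables,
        "n_vowels": n_vowels,
        "n_consonants": n_consonants,
        "length_per_duration": length / duration_sec,
        "syllables_per_duration": n_syllables / duration_sec,
        "vowel_consonant_ratio": n_vowels / max(n_consonants, 1),
    }

# Example: the word "tsunami" from the table above (duration is hypothetical).
phones = "t s uh n aa m iy".split()
print(decompose(phones))   # (['S','F','V','N','V','N','V'], ['C','C','V','C','V','C','V'])
print(core_features(phones, duration_sec=0.55, n_syllables=3))
```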
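The KNN-based scoring itself could be sketched as follows. This is a hypothetical reconstruction, not the published implementation: the feature matrices X_train/X_eval and the reference WER vectors are random stand-ins, scikit-learn's KNeighborsRegressor replaces whatever KNN code was actually used, and whitening is done explicitly with an SVD of the centered training features. The last line computes the correlation between predicted and reference WER, which is the quantity reported in Table 1 and plotted in Figure 5 below.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical inputs: rows are search terms, columns are core features;
# y holds the reference WER of each term (values in [0, 1]).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 8)), rng.random(200)
X_eval, y_eval = rng.random((50, 8)), rng.random(50)

# Whitening via SVD of the centered training features.
mean = X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

def whiten(X):
    """Project onto the SVD basis and scale each direction to unit variance."""
    return (X - mean) @ Vt.T / s * np.sqrt(len(X_train) - 1)

# KNN regression in the whitened phonetic feature space.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(whiten(X_train), y_train)
wer_pred = knn.predict(whiten(X_eval))

# Strength score as defined on the poster: 1 - predicted WER.
strength_score = 1.0 - wer_pred

# Correlation between predicted and reference WER on the evaluation subset.
corr = np.corrcoef(wer_pred, y_eval)[0, 1]
print(f"correlation on eval subset: {corr:.2f}")
```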
Experimentation
Duration is the most significant feature, with around 40% correlation (see Figure 4).
Figure 4. The relationship between duration and error rate shows that longer words generally result in better performance.

Results
Table 1. The correlation between the hypothesis and the reference WER for both the training and evaluation subsets. Rows are feature combinations (Duration; Duration + No. Syllables; Duration + No. Consonants; Duration + No. Syllables + No. Consonants; Duration + Length + No. Syllables/Duration; Duration + No. Consonants + CVC 2-grams + Length/Duration + No. Syllables/Duration); columns report linear regression (Reg), neural network (NN), and regression tree (DT) results on the Train and Eval subsets, with a separate block giving KNN results on Train and Eval as a function of K.
Table 2. KNN’s predictions show a relatively good correlation with the reference WER.
Figure 5. Correlation between the predicted and reference error rates.
The maximum correlation is 46%, which explains only 21% of the variance. Many of the core features are highly correlated with one another. KNN demonstrates the most promising prediction capability. The data set is not balanced: data points with low error rates greatly outnumber points with high error rates, which reduces the accuracy of the predictor. A significant portion of the error rate is related to factors beyond the spelling of the search term, such as speech rate.

Summary
The overall correlation between the reference and the predictions is not yet large enough. One serious limitation of the current work was the size and quality of the data set. Despite these problems, the system works and can help users choose better keywords.

Future Work
Use data generated from acoustically clean speech with appropriate speech rate and accent. Find features that have low correlation with the existing feature set. Use more sophisticated models, such as nonparametric Bayesian models (e.g., Gaussian processes), for regression; a sketch of this direction is given below. An extension of this work is currently under development, with promising results.
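As a rough sketch of the nonparametric-Bayesian direction mentioned under Future Work, a Gaussian process regressor could replace the KNN predictor along the following lines. This again uses placeholder feature matrices and scikit-learn rather than anything from the original work; one appeal of the GP is that it also returns a predictive standard deviation, i.e., a confidence estimate for each strength score.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder (whitened) feature matrices and reference WERs -- stand-ins only.
rng = np.random.default_rng(1)
X_train, y_train = rng.random((200, 8)), rng.random(200)
X_eval = rng.random((50, 8))

# RBF kernel plus a noise term; kernel hyperparameters are tuned by maximizing
# the marginal likelihood inside GaussianProcessRegressor.fit().
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# Predictive mean and standard deviation of WER; the standard deviation gives a
# confidence estimate for the strength score, which KNN cannot provide.
wer_mean, wer_std = gp.predict(X_eval, return_std=True)
strength_score = 1.0 - wer_mean
print(strength_score[:5], wer_std[:5])
```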