Underspecified feature models for pronunciation variation in ASR Eric Fosler-Lussier The Ohio State University Speech & Language Technologies Lab ITRW.

Underspecified feature models for pronunciation variation in ASR Eric Fosler-Lussier The Ohio State University Speech & Language Technologies Lab ITRW - Speech Recognition & Intrinsic Variation 20 May 2006

Fosler-Lussier / Underspecified Feature Models ITRW Speech Recognition and Intrinsic Variation
Fill in the blanks
- 3, 6, __, 12, 15, __, 21, 24
- A B C __ E F __ H
- You’re going to Toulouse? Drink a bottle of _____ for me!
- What’s the red object?
We’re very good at filling in the blanks when we have context!
Introduction | Why features? | Role of transcription | Approaches | Vision

Filling in the blanks: missing data
- Missing data approaches have been used to integrate over noisy acoustics
Cited: Wang & Hu 06

Decode this! (brackets indicate options)
- s iy n y {ah,ax,axr,er}
- {l,r} {eh,ih,iy} s er ch
- {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}

Decode this! (brackets indicate options)
- s iy n y {ah,ax,axr,er}: senior
- {l,r} {eh,ih,iy} s er ch: research
- {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}: associate
Legend: dictionary pronunciation; as marked by transcribers (Buckeye Corpus of Speech)
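
The decoding exercise above can be sketched as a lookup in which each slot of the transcription holds a set of allowed phones, and a word matches if its pronunciation is consistent with every slot. This is only a minimal illustration: the mini-lexicon `LEXICON`, its phone strings, and the helper names `consistent` and `decode` are assumptions made for this sketch, not part of the talk.

```python
# Hypothetical mini-lexicon; the phone strings are assumed for illustration,
# not taken from any real pronunciation dictionary.
LEXICON = {
    "senior": ["s", "iy", "n", "y", "er"],
    "sooner": ["s", "uw", "n", "er"],
    "research": ["r", "iy", "s", "er", "ch"],
    "associate": ["ax", "s", "ow", "s", "iy", "ey", "t"],
}

# Underspecified transcriptions from the slide: one set of options per slot.
PUZZLES = [
    [{"s"}, {"iy"}, {"n"}, {"y"}, {"ah", "ax", "axr", "er"}],
    [{"l", "r"}, {"eh", "ih", "iy"}, {"s"}, {"er"}, {"ch"}],
    [{"ah", "ax"}, {"s"}, {"ow"}, {"s", "sh", "z", "zh"},
     {"eh", "ih", "iy"}, {"eh", "ey"}, {"t", "d"}],
]


def consistent(slots, phones):
    """True if every phone is one of the options allowed in its slot."""
    return len(slots) == len(phones) and all(p in s for s, p in zip(slots, phones))


def decode(slots, lexicon=LEXICON):
    """Return all words whose pronunciation is consistent with the slots."""
    return sorted(w for w, phones in lexicon.items() if consistent(slots, phones))


decoded = [decode(slots) for slots in PUZZLES]
print(decoded)
```

With this toy lexicon each puzzle is consistent with exactly one word, even though several slots are left open. That is the sense in which the missing detail does not hurt decoding.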

What do these tasks have in common?
- Recovering from erroneous information?
  - Context plays a big role in helping “clean up”
- Recovering from incomplete information!
  - We should be treating pronunciation variation as a missing data problem
    - Integrate over “missing” phonological features
  - How much information do you need to decode words?
    - Particularly taking into account the context of the word, syllabic context of phones, etc.
    - Information theory problem

Outline
- Problems with phonetic representations of variation
  - Potential advantages of phonological features
- Re-examining the role of phonetic transcription
- Phonological feature approaches to ASR
  - Feature attribute detection
  - Feature combination methods
  - Learning to (dis-)trust features
- A challenge for the future

“The Case Against The Phoneme”
Homage to Ostendorf (ASRU 99)
Four major indications that phonetic modeling of variation is not appropriate:
- Lack of progress on spontaneous speech WER
  - McAllaster et al (98): 50% improvement possible
  - Finke & Waibel (97): 6% WER reduction
- Independence of decisions in phone-based models (Riley et al 98)
  - When pronunciation variation is modeled on a phone-by-phone level, unusual baseforms are often created
  - Word-based learning fails to generalize across words
- Lack of granularity (Jurafsky et al 01; Saraçlar et al 00)
  - Triphone contexts mean a symbolic change in one phone can affect 9 HMM states (the changed phone and both neighbors, three states each; min 90 msec)
  - Much variation is already handled by triphone context
- Difficulty in transcription
  - Phonetic transcription is expensive and time consuming
  - Many decisions are difficult for transcribers to make

Using phonological features
- Finer granularity
  - Some phonological changes don’t result in canonical phones for a language
    - English: uw can sometimes be fronted (toot)
    - Common enough: TIMIT introduced a special phone (ux)
    - Symbol change loses all commonality between phones (uw->ux)
  - Handling odd phonological effects
    - Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments
    - E.g. vowel nasalization with nasal deletion
- Features may provide basis for cross-lingual recognition
  - International Phonetic Alphabet

Issues with phonological features
- Interlingua: “high vowels in English are not the same as high vowels in Japanese”
  - Richard Wright, lunch Wednesday, ICASSP 2006
- Concept of “independent directions” false
  - Correlation of feature values
  - Distances no longer Euclidean among feature dimensions
- Dealing with feature spreading
- Even more difficulty in transcription
  - (but: Karen Livescu’s group, JHU workshop 2006)
- Articulatory vs. acoustic features
  - No two definitions are exactly the same (see Richard’s talk)

Phonetic transcription
- There have been a number of efforts to transcribe speech phonetically
  - American English
    - TIMIT (4 hr read speech)
    - Switchboard (4 hr spontaneous speech)
    - Buckeye Corpus (40 hr spontaneous speech)
- ASR researchers have found it difficult to utilize phonetic transcriptions directly
Cited: Riley et al 99

ASR & Phonetic Transcription
- Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y], over all pairs x:y
  - Compared means of x:y to x:x, y:y
  - Data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x
- Another view: data from Buckeye Corpus
  - /ae/ is sometimes transcribed as [eh]
  - Examined 80 vowels from one speaker
    - Formant frequencies from center of vowel

[Figure: formant plot for the Buckeye /ae/ vs. [eh] vowel tokens. Region annotations: “higher than eh”, “opposite side of ae from eh”, “mixed ae/eh”, “ae territory”.]

Can you trust transcription?
- Perceptual marking ≠ acoustic measurement
  - Can’t take transcription at face value
- What are the transcribers trying to tell us?
  - This phone doesn’t sound like a canonical phone
  - Perhaps we can look at commonalities across canonical/transcribed phone
    - ae:eh -> front vowel (& not high?)
- Phonological features may help us represent transcription differences.

Variation in single-phone changes
- Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard, Buckeye
  - Differences in manner, place, voicing counted
[Table: columns Manner, Place, Voicing, SWB %, BCS %; cell values not preserved in this transcript]
- Single dimension common
- Manner, voicing variants more common than place
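
The counting described above can be sketched as follows: for each (canonical, transcribed) consonant pair, record which of manner, place, and voicing differ, then tally the patterns. The feature table `FEATURES`, the example pairs, and the function names are assumptions for illustration only, and no corpus figures are reproduced here.

```python
from collections import Counter

# Assumed (manner, place, voicing) values for a few consonants; illustrative only.
FEATURES = {
    "t": ("stop", "alveolar", "voiceless"),
    "d": ("stop", "alveolar", "voiced"),
    "s": ("fricative", "alveolar", "voiceless"),
    "z": ("fricative", "alveolar", "voiced"),
    "k": ("stop", "velar", "voiceless"),
    "n": ("nasal", "alveolar", "voiced"),
}
DIMS = ("manner", "place", "voicing")


def changed_dims(canonical, transcribed):
    """Return the feature dimensions on which the two consonants differ."""
    a, b = FEATURES[canonical], FEATURES[transcribed]
    return tuple(d for d, x, y in zip(DIMS, a, b) if x != y)


def tally(pairs):
    """Count how often each pattern of changed dimensions occurs."""
    return Counter(changed_dims(c, t) for c, t in pairs)


# Hypothetical substitution pairs (canonical, transcribed).
counts = tally([("t", "d"), ("s", "z"), ("t", "k"), ("t", "z")])
print(counts)
```

A pattern with a single entry, such as `("voicing",)`, corresponds to a single-dimension change in the sense of the slide.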

Recent approaches to feature modeling in ASR
- Since the 90’s there has been increased interest in phonological feature modeling
  - Deng et al (92 ff), Kirchhoff (96 ff)
- Current directions of research
  - Approaches for detecting phonological features from data
  - Methods of combining phonological features
  - Knowing when to ignore information

Feature detection methods
- Frame-level decisions
  - Most common: artificial neural network methods
    - Input: various flavors of spectral/cepstral representations
    - Output: estimating posterior P(feature|acoustics) on a per-frame level
  - Recent competitor: support vector machines
    - Typically used for binary decision problems
- Segmental-level decisions: integrate over time
  - HMM detectors
  - Hybrid ANN/Dynamic Bayesian Network

Binary vs. n-ary features
- Features can be described as either binary or n-ary if they can contrast
  - Binary: /t/ : +stop -fricative …
  - N-ary: /t/ : manner=stop
- No real conclusion on which is better
  - Binary more matched to SVM learning
  - N-ary allows for discrimination among classes
    - Should a segment be allowed to be +stop +fricative?
  - Anecdotally (our lab) we find n-ary features slightly better
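
One way to picture the “+stop +fricative” question is to compare independent per-class detectors with a single detector that distributes probability over the manner classes. The sketch below uses a sigmoid per class as a stand-in for binary detectors and a softmax as a stand-in for an n-ary detector. The scores are made up, and the talk does not say the detectors were built this way.

```python
import math


def sigmoid(z):
    """Independent binary detector: each class is scored on its own."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """N-ary detector: classes compete and the outputs sum to one."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical raw detector scores for one frame.
SCORES = {"stop": 2.0, "fricative": 1.5, "nasal": -3.0}

binary = {k: sigmoid(v) for k, v in SCORES.items()}
nary = softmax(SCORES)

# Binary detectors can switch on several classes at once.
binary_on = sorted(k for k, p in binary.items() if p > 0.5)
# The n-ary detector commits to a single most likely class.
nary_choice = max(nary, key=nary.get)
print(binary_on, nary_choice)
```

In this toy frame the binary view marks the segment as both a stop and a fricative, while the n-ary view has to choose between them.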

Hierarchical representations
- Phonological features are not truly independent
  - Chang et al (01): Place prediction improves if manner is known
    - ANN predicts P(place=x|manner=y,X) vs P(place=x|X)
    - Suggests need for hierarchical detectors
  - Rajamanohar & Fosler-Lussier (05): Cascading errors make chained decisions worse
    - Better to jointly model P(place=x,manner=y|X), or even derive P(place=x|X) from phone probabilities
  - Frankel et al (04): Hierarchy can be integrated as additional dependencies in DBN
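
Deriving P(place=x|X) from phone probabilities amounts to summing the phone posteriors over all phones that share a place of articulation. A minimal sketch, assuming a small hypothetical phone-to-place table `PHONE_PLACE` and made-up posteriors for one frame:

```python
# Assumed phone-to-place mapping for a handful of stops; illustrative only.
PHONE_PLACE = {
    "p": "labial", "b": "labial",
    "t": "alveolar", "d": "alveolar",
    "k": "velar", "g": "velar",
}


def place_posterior(phone_posterior):
    """Marginalize P(phone | X) into P(place | X) by summing over phones."""
    out = {}
    for phone, prob in phone_posterior.items():
        place = PHONE_PLACE[phone]
        out[place] = out.get(place, 0.0) + prob
    return out


# Hypothetical phone posteriors for a single frame (they sum to one).
frame = {"p": 0.1, "b": 0.1, "t": 0.5, "d": 0.2, "k": 0.05, "g": 0.05}
place = place_posterior(frame)
print(place)
```

Because the place estimate is read off a single phone distribution, there is no chain of separate decisions in which an early error can propagate.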

Combining features into higher-level structures
- Once you have (frame-level) estimates of phonological features, need to combine
  - Temporal integration: Markov structures
  - Phonetic spatial integration: combining into higher-level units (phones, syllables, words)
- Differences in methodologies:
  - spatial first, then temporal
  - joint/factored spatio-temporal integration
  - phone-level temporal integration with spatial rescoring

Combining features into higher-level structures
- Tandem ANN/HMM Systems
  - ANN feature posterior estimates are used as replacements for MFCCs for Mixture of Gaussians HMM system
  - We find decorrelation of features (via PCA) necessary to keep models well conditioned
- Lattice rescoring with Landmarks
  - Maximum entropy models for local word discrimination
  - SVMs used as local features for MaxEnt model
- Dynamic Bayesian Models
  - Model asynchrony as a hidden variable
  - SVM outputs used as observations of features
Cited: Launay et al 02; Hasegawa-Johnson et al 05; Livescu 05
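
The PCA decorrelation step in the Tandem bullet can be illustrated in two dimensions, where the PCA rotation has a closed form. The sketch below takes the log of two correlated feature posteriors per frame and rotates them onto their principal axes, so that the outputs have zero covariance. The posterior values, the use of log posteriors, and the restriction to two features are assumptions for this example. A real system would apply PCA to the full posterior vector.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pca_rotate_2d(xs, ys):
    """Center the data and rotate it onto its principal axes (2-D PCA)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    xc = [x - mx for x in xs]
    yc = [y - my for y in ys]
    cxx, cyy, cxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    us = [c * x + s * y for x, y in zip(xc, yc)]
    vs = [-s * x + c * y for x, y in zip(xc, yc)]
    return us, vs


# Hypothetical per-frame posteriors from two correlated feature detectors.
p_voiced = [0.9, 0.2, 0.7, 0.1, 0.6]
p_sonorant = [0.8, 0.1, 0.9, 0.3, 0.5]

log_voiced = [math.log(p) for p in p_voiced]
log_sonorant = [math.log(p) for p in p_sonorant]

before = covariance(log_voiced, log_sonorant)
us, vs = pca_rotate_2d(log_voiced, log_sonorant)
after = covariance(us, vs)
print(before, after)
```

The rotated features are uncorrelated, which suits Gaussian mixtures with diagonal covariances better than the raw, correlated posteriors.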

Combining features into higher-level structures
- Conditional random fields
  - CRFs jointly model spatio-temporal integration
  - Probability expressed in terms of indicator functions s (state), t (transition)
    - Usually binary in NLP applications
  - Frame-level ANN posteriors are bounded
    - Probabilities can serve as observation feature functions
    - s_stop(/t/, x, i) = P(manner=stop | x_i)
Cited: Morris & Fosler-Lussier 06
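
A toy linear-chain CRF makes the role of the state and transition functions concrete. In this sketch the state feature for a (phone label, manner) pair is the frame-level posterior P(manner | x_i), each such pair has its own weight, and a label sequence is scored by summing weighted state and transition terms before normalizing over all sequences by brute force. The labels, posteriors, and weights `MU` and `LAM` are made-up values for illustration, not trained parameters from the cited work.

```python
import itertools
import math

LABELS = ("t", "s")
MANNERS = ("stop", "fricative")

# Hypothetical frame-level posteriors P(manner | x_i) for three frames.
FRAMES = [
    {"stop": 0.9, "fricative": 0.1},
    {"stop": 0.6, "fricative": 0.4},
    {"stop": 0.2, "fricative": 0.8},
]

# Assumed state weights, one per (label, manner) pair.
MU = {
    ("t", "stop"): 2.0, ("t", "fricative"): -1.0,
    ("s", "stop"): -1.0, ("s", "fricative"): 2.0,
}
# Assumed transition weights, one per (previous label, label) pair.
LAM = {("t", "t"): 0.5, ("s", "s"): 0.5, ("t", "s"): 0.0, ("s", "t"): 0.0}


def score(labels, frames):
    """Unnormalized log score of a label sequence."""
    total = 0.0
    for i, y in enumerate(labels):
        # State functions: posterior of each manner, weighted for label y.
        total += sum(MU[(y, m)] * frames[i][m] for m in MANNERS)
        if i > 0:
            # Transition function: indicator on the label pair.
            total += LAM[(labels[i - 1], y)]
    return total


def sequence_posteriors(frames):
    """P(label sequence | x) by explicit enumeration (fine for toy sizes)."""
    seqs = list(itertools.product(LABELS, repeat=len(frames)))
    weights = [math.exp(score(seq, frames)) for seq in seqs]
    z = sum(weights)
    return {seq: w / z for seq, w in zip(seqs, weights)}


posteriors = sequence_posteriors(FRAMES)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Enumeration is used only to keep the example short. A practical implementation would compute the normalizer with the forward algorithm.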

Conditional Random Fields
[+] CRFs make no independence assumptions about input
  - Posteriors can be used directly without decorrelation
  - Can combine features, phones, …
  - No assumption of temporal independence
[+] Entire label sequence is modeled jointly
  - Monophone feature CRF phone recog. similar to triphone HMM
[+] Learning parameters (λ, μ) determines importance of feature/phone relationships
  - Implicit model of partial phonological underspecification
[-] Slow to train

Underspecification
- All of these models learn what phonological information is important in higher-level processing
  - Ignoring “canonical” feature definitions for phone is a form of underspecification
  - Traditional underspecification: some features are undefined for a particular phone
  - Weighted models: partial underspecification
- When can you ignore phonetic information?
  - Crucially, when it doesn’t help you disambiguate between word hypotheses

Underspecification
- Example: unstressed syllables tend to show more phonetic variation than stressed syllables
  - Experiment: reduce phonetic representation for unstressed syllables to manner class
  - Allowing recognizer to choose best representation (phone/manner) during training (WSJ0):
    - Minor degradation for clean speech (9.9 vs. 9.1 WER)
    - Larger improvement in 10dB car noise (15.8 vs 13.0 WER)
- Moral: we don’t need to have exact phonetic representation to decode words
  - But we may need to integrate more higher-level knowledge
Cited: Fosler-Lussier et al 05
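
The reduction used in the experiment can be sketched as a mapping applied only to unstressed syllables: phones in stressed syllables are kept, and phones in unstressed syllables are replaced by their manner class. The manner table `MANNER_CLASS`, the class names, and the syllabification and stress marking of the example word are assumptions for illustration and do not reproduce the recognizer setup from the cited work.

```python
# Assumed phone-to-manner-class table; illustrative only.
MANNER_CLASS = {
    "t": "STOP", "d": "STOP", "g": "STOP",
    "sh": "FRIC", "s": "FRIC",
    "n": "NASAL", "ng": "NASAL",
    "w": "GLIDE",
    "aa": "VOWEL", "ax": "VOWEL", "ix": "VOWEL", "ow": "VOWEL", "uw": "VOWEL",
}


def underspecify(syllables):
    """Keep phones in stressed syllables; reduce the rest to manner classes.

    `syllables` is a list of (phones, stressed) pairs.
    """
    out = []
    for phones, stressed in syllables:
        if stressed:
            out.extend(phones)
        else:
            out.extend(MANNER_CLASS[p] for p in phones)
    return out


# Hypothetical syllabification and stress marking for "washington".
washington = [
    (["w", "aa"], True),
    (["sh", "ix", "ng"], False),
    (["t", "ax", "n"], False),
]
reduced = underspecify(washington)
print(reduced)
```

The reduced form keeps full phonetic detail only where the slide suggests it is most reliable, in the stressed syllable.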

Vision for the Future
- Acoustic-phonetic variation is difficult
  - Still a significant cause of errors in ASR
- Underspecified models give a new way of looking at the problem
  - Rather than the “change x to y” model
- Challenge for the field:
  - Current techniques for accent modeling and intrinsic pronunciation variation are separate
  - Can we build a model that handles both?

Conclusions
- We have come quite a distance since 1999
  - New methods for phonological feature detection
  - New methods for feature integration
  - New ways of thinking about variation: underspecification
- Still have a long way to go
  - Integrating more knowledge sources
    - Stress, prosody, word confusability
  - Solving the pronunciation adaptation problem in a general way

Fin

An example feature grid
Words: go to washington
Phones: g ow t uw w aa sh ix ng t ax n
Tiers: CLASS, VOICED, CMANNER, CPLACE, VHEIGHT, VFRONTNESS, VROUND, VTENSE
CLASS: OBS VOW OBS VOW SON VOW OBS VOW SON OBS VOW SON
VOICED: VCD VLS VCD VLS VCD VLS VCD
[Remaining tier cells, column alignment not preserved: SP- -AT-FE-NLSP-NL VR-AR-LB-PL-VRAR- -MD-HH-LW-HH-MD- -BK- - -CL- - -RD- -ND TE- - -LX- -]