Recent Work on Acoustic Modeling for CTS at ISL Florian Metze, Hagen Soltau, Christian Fügen, Hua Yu Interactive Systems Laboratories Universität Karlsruhe, Carnegie Mellon University

EARS Workshop, December 2003, St. Thomas

Overview
– ISL's RT-03 system revisited
– System combination of Tree-150 & Tree-6
– Richer acoustic modeling
  – Across-phone clustering
  – Gaussian transition modeling
  – Modalities
  – Articulatory features

Decoding Strategy
System combination:
– Combine Tree-150 and Tree-6 systems; 8 ms and 10 ms output
– Confusion networks over multiple lattices, plus ROVER
– Confidences computed from the combined confusion networks
– Best single output (Tree-150): 25.4
– CNC + ROVER: 24.9
Results on eval03:
– Tree-150 single system: 24.2
– CNC + ROVER: 23.4
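The CNC + ROVER step above amounts to word-level voting over aligned system outputs. The following is a minimal sketch only: real ROVER first aligns the hypotheses into a word transition network (omitted here, the inputs are assumed pre-aligned), and weighting the votes by per-system or per-word confidences is one common choice. All names are illustrative.

```python
from collections import Counter

def rover_vote(aligned_hyps, confidences=None):
    """Toy ROVER-style voting: given word-aligned hypotheses from several
    systems (one word, or None for a deletion, per system per slot), pick
    the word with the highest total (confidence-weighted) vote per slot."""
    result = []
    for i, slot in enumerate(zip(*aligned_hyps)):
        votes = Counter()
        for sys_idx, word in enumerate(slot):
            if word is not None:
                # Weight each system's vote by its confidence, default 1.0.
                votes[word] += confidences[sys_idx] if confidences else 1.0
        if votes:
            result.append(votes.most_common(1)[0][0])
    return result
```

With equal weights the majority word wins each slot; confidences can flip the decision toward a more trusted system.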

Vocabulary
– Vocabulary size: 41k, selected from SWB, BN, and CNN
– Pronunciation variants: 95k entries generated by a rule-based approach
– Pronunciation probabilities: from frequencies (forced alignment of the training data)
  – Viterbi decoding: penalties (e.g. max = 1)
  – Confusion networks: real probabilities (e.g. sum = 1)
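The two normalizations on this slide can be sketched as follows; the function and argument names are illustrative, not from the ISL system. Both start from the same forced-alignment counts and differ only in the normalizer.

```python
def variant_scores(counts, mode="viterbi"):
    """Turn per-word pronunciation-variant counts (e.g. from a forced
    alignment of the training data) into scores.

    mode="viterbi": penalties scaled so the best variant gets 1.0 (max = 1).
    mode="cn":      true probabilities over the variants (sum = 1).
    """
    total = sum(counts.values())
    best = max(counts.values())
    if mode == "viterbi":
        return {v: c / best for v, c in counts.items()}   # max over variants = 1
    return {v: c / total for v, c in counts.items()}      # sums to 1 over variants
```

The max = 1 form leaves the best variant unpenalized in Viterbi decoding, while the sum = 1 form gives proper probabilities for confusion-network scoring.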

Clustering
Entropy-based divisive clustering
– Standard way: grow a tree for each context-independent HMM state; 50 phones × 3 states = 150 trees
– Alternative: clustering across phones
  – Global tree → parameter sharing across phones
  – Computationally expensive to cluster → 6 trees (begin, middle, end for vowels and consonants)
  – Quint-phone context
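The core step of entropy-based divisive clustering is picking, at each node, the phonetic question whose yes/no split most reduces the weighted entropy of the data. A minimal sketch of that selection step, with hypothetical data structures (the real tree-builder operates on Gaussian statistics, not simple class counts):

```python
import math

def entropy(counts):
    """Entropy (bits) of a discrete distribution given as a count dict."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def merge(counters):
    """Pool several count dicts into one."""
    out = {}
    for c in counters:
        for k, v in c.items():
            out[k] = out.get(k, 0) + v
    return out

def best_question(items, questions):
    """items: list of (context, class_counts) pairs seen in training.
    questions: dict mapping question name -> predicate(context) -> bool.
    Returns the question with the largest weighted-entropy reduction."""
    parent = merge(cc for _, cc in items)
    n = sum(parent.values())
    best, best_gain = None, 0.0
    for name, q in questions.items():
        yes = [cc for ctx, cc in items if q(ctx)]
        no = [cc for ctx, cc in items if not q(ctx)]
        if not yes or not no:        # question does not split this node
            continue
        ny = sum(merge(yes).values())
        gain = (entropy(parent)
                - (ny / n) * entropy(merge(yes))
                - ((n - ny) / n) * entropy(merge(no)))
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```

Growing the global tree repeats this greedily at every node; the across-phone variant simply includes center-phone and sub-state questions in `questions`.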

Motivation for Alternative Clustering
– Pronunciation modeling is important for recognizing conversational speech
– Adding pronunciation variants often gives only marginal improvements due to increased confusability
– Case study: flapping of /T/
  BETTER     B EH T AXR
  BETTER(2)  B EH DX AXR
→ The dictionary then contains only a single pronunciation, and the phonetic decision tree chooses whether or not to flap /T/

Clustering Across Phones: Tree Construction
How to grow a single tree? We expand the question set to allow questions about the sub-state identity and the center phone identity.
→ Computationally expensive on 600k SWB quint-phones
Two dictionaries:
– Conventional dictionary with 2.2 variants per word
– (Almost) single-pronunciation dictionary with 1.1 variants per word
A simple procedure is used to reduce the number of pronunciation variants: variants with a relative frequency below 20% are removed; for unobserved words, only the baseform is kept.
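The variant-reduction procedure above can be sketched directly from its description. Function names, the lexicon layout, and the example pronunciations are illustrative only; the 20% threshold and the baseform fallback come from the slide.

```python
def prune_variants(lexicon, counts, threshold=0.2):
    """Reduce a multi-pronunciation lexicon toward a single-pronunciation
    one: drop variants whose relative frequency in the forced-alignment
    counts falls below `threshold`; for words never observed in training,
    keep only the first (baseform) variant."""
    pruned = {}
    for word, variants in lexicon.items():
        word_counts = counts.get(word)
        if not word_counts:
            pruned[word] = [variants[0]]          # unobserved: baseform only
            continue
        total = sum(word_counts.get(v, 0) for v in variants)
        kept = [v for v in variants
                if word_counts.get(v, 0) / total >= threshold]
        pruned[word] = kept or [variants[0]]      # never leave a word empty
    return pruned
```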

Clustering Across Phones
– Allows better parameter tying (tying now possible across phones and sub-states)
– Alleviates lexical problems: over-specification and inconsistencies
  → no need for an optimal phone set; preferable for multilingual / non-native speech recognition
– Implicitly models subtle reductions in sloppy speech
(Example tree fragment: questions such as 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, 0=end-state? leading to leaves shared across phones, e.g. AX-b, IX-m, AX-m.)

Clustering Across Phones: Experiments
– Cross-substate clustering doesn't make any difference
– Cross-phone clustering with 6 trees: {vowel|consonant}-{b|m|e}
– Single-pronunciation lexicon has 1.1 variants per word (instead of 2.2 variants per word)

  Dictionary            Clustering    WER, 66 hr training   WER, 180 hr training
  multi-pronunciation   traditional   –                     –
  multi-pronunciation   cross-phone   33.9                  –
  single pronunciation  traditional   34.1                  –
  single pronunciation  cross-phone   –                     –

Results are based on first-pass decoding on dev01.

Analysis
– Flexible tying works better with the single-pronunciation lexicon
  → higher consistency, data-driven approach
– Significant cross-phone sharing: ~30% of the leaf nodes are shared by multiple phones
– Commonly tied vowels: AXR & ER, AE & EH, AH & AX
– Commonly tied consonants: DX & HH, L & W, N & NG
(Example tree fragment for Vowel-b, with questions such as -1=voiced?, -1=consonant?, 0=high-vowel?, 1=front-vowel?, -1=obstruent?, 0=L|R|W?)

Gaussian Transition Modeling
A linear sequence of GMMs may contain a mix of different model sequences. To further distinguish these paths, we can model transitions between Gaussians in adjacent states.

Frame-Independence Assumption
HMMs assume each speech frame to be conditionally independent of the others given the hidden state sequence.
(Figure: HMM as a generative model, with a sequence of frames emitted by a sequence of state models.)
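Under this assumption, the likelihood of a frame sequence factorizes along the state path: the emission term becomes a per-frame sum in the log domain. A minimal sketch, where `emission_logprob` and `trans_logprob` are hypothetical stand-ins for the acoustic model's scoring functions:

```python
def sequence_log_likelihood(frames, states, emission_logprob, trans_logprob):
    """Log-likelihood of a frame sequence given a fixed HMM state path.
    Because each frame depends only on its own state (frame independence),
    the emission score is a simple sum over (state, frame) pairs, plus the
    state-transition scores along the path."""
    ll = sum(emission_logprob(s, x) for s, x in zip(states, frames))
    ll += sum(trans_logprob(s1, s2) for s1, s2 in zip(states, states[1:]))
    return ll
```

GTM (next slide) weakens exactly this assumption by letting the choice of Gaussian at one frame depend on the Gaussian chosen at the previous frame.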

Gaussian Transition Modeling
GTM models transition probabilities between Gaussians.

GTM for Modeling Sloppy Speech
– Partial reduction / realization may be better modeled at the sub-phoneme level
– GTM can be thought of as a pronunciation network at the Gaussian level
– GTM can handle a large number of trajectories
– Advantages over parallel-path HMMs / segmental HMMs, where:
  – the number of paths is very limited
  – it is hard to determine the right number of paths

Experiments
– GTM can be readily trained using the Baum-Welch algorithm
– Data sufficiency is an issue, since we are modeling a first-order variable
– Pruning transitions is important (backing off)
(Table: pruning threshold vs. average number of transitions per Gaussian and WER (%), including the baseline; WERs on Switchboard hub5e-01.)
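The transition pruning referred to above can be sketched as thresholding the Baum-Welch transition estimates. This is a simplified, hypothetical version: the slide's "backing off" suggests pruned mass falls back to a lower-order model, whereas here we only renormalize the survivors.

```python
def prune_transitions(counts, threshold):
    """Estimate p(g2 | g1) from Baum-Welch transition counts between
    Gaussians, drop transitions whose probability falls below `threshold`,
    and renormalize the survivors."""
    pruned = {}
    for g1, nexts in counts.items():
        total = sum(nexts.values())
        kept = {g2: c / total for g2, c in nexts.items()
                if c / total >= threshold}
        if not kept:  # never prune everything: keep the most likely successor
            g2 = max(nexts, key=nexts.get)
            kept = {g2: 1.0}
        norm = sum(kept.values())
        pruned[g1] = {g2: p / norm for g2, p in kept.items()}
    return pruned
```

Raising the threshold shrinks the average number of transitions per Gaussian, trading model size against coverage of rare (but possibly genuine) trajectories.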

Experiments II
– GTM offers better discrimination between trajectories, while all trajectories are nonetheless still allowed
– Pruning away unlikely transitions leads to a more compact and prudent model; however, we need to be careful not to prune away unseen trajectories due to the limited training set
– Using a first-order acoustic model in decoding requires maintaining the left history, which is expensive at word boundaries; a Viterbi approximation is used in the current implementation
– Log-likelihood improved during Baum-Welch training

Modalities
We would like to include additional information in the divisive clustering, e.g.:
– Gender
– Signal-to-noise ratio
– Speaking rate
– Speaking style (normal vs. hyper-articulated)
– Dialect
– Show type, data type (CNN, NBC, ...)
Data-driven approach: sharing is still possible.

Modalities II
Suitable for different corpora? Example:
– German dialects
– Male / female
(Example tree fragment mixing phonetic questions such as -1=vowel?, -1=obstruent?, -1=syllabic? with modality questions such as 0=bavarian?, 0=suabian?, 0=female?)

Modalities III
– Tested on German Verbmobil data
– Not enough time to test on SWB / RT-03
– Proved beneficial in several applications
  – Labeled data needed
  – Our tests were not done on highly optimized systems (VTLN)
  – Hyper-articulation: -1.7% for hyper-articulated speech, +0.3% for normal speech

Modalities Results

Articulatory Features
– Idea: combine very specific sub-phone models with generic models
– Articulatory features are linguistically motivated: /F/ = UNVOICED, FRICATIVE, LAB-DNT, ...
– Introduce new degrees of freedom for modeling and adaptation
– Integrate into the existing architecture; use existing training techniques (GMMs) for the feature detectors
– Articulatory (voicing) features in the front-end did not help

Articulatory Features
Output from the feature detectors: p(FEAT) - p(NON_FEAT) + p0

Articulatory Features
Asymmetric stream setup: ~4k models
– ~4k GMMs in stream 0
– 2 GMMs in each of streams 1...N ("feature streams")
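The detector output formula and the asymmetric stream setup can be sketched together. This is an assumption-laden sketch: the slide gives the detector output as p(FEAT) - p(NON_FEAT) + p0 but does not define p0 (treated here as a tunable offset), nor the exact stream-combination rule (a weighted log-domain sum is assumed, which is a common choice for multi-stream setups).

```python
def detector_output(log_p_feat, log_p_non_feat, p0=0.0):
    """Per-frame output of one articulatory-feature detector, as on the
    slide: the difference of the FEAT and NON_FEAT model scores plus an
    offset p0 (here treated as a tunable prior)."""
    return log_p_feat - log_p_non_feat + p0

def combined_score(hmm_state_score, detector_outputs, stream_weights):
    """Asymmetric multi-stream combination: stream 0 is the standard
    score from the ~4k context-dependent GMMs; each feature stream
    (with only 2 GMMs: FEAT vs. NON_FEAT) contributes a weighted
    detector output on top."""
    score = hmm_state_score
    for out, weight in zip(detector_outputs, stream_weights):
        score += weight * out
    return score
```

Because the feature streams carry only two models each, the added cost over the baseline system is small, while the stream weights give the new degrees of freedom for modeling and adaptation mentioned above.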

Articulatory Features: Results I
– Test on read speech (BN-F0): 13.4% → 11.6% with articulatory features
– Test on multilingual data: 13.1% → 11.5% (English with ML detectors)
– Significant improvements also seen on hyper-articulated speech and on spontaneous, clean speech (ESST)

Articulatory Features: Results II
Test on Switchboard (RT-03 devset)
(Table: Sub / Del / Ins / WER for the baseline vs. the feature system.)
Result:
– Substitutions and insertions decreased; deletions increased
– No overall improvement yet → will work on the setup

Thank You, ... the ISL team!

Related Work
– D. Jurafsky et al.: What kind of pronunciation variation is hard for triphones to model? ICASSP 2001
– T. Hain: Implicit pronunciation modeling in ASR. ISCA Pronunciation Modeling Workshop, 2002
– M. Saraclar et al.: Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language, Apr. 2000

Related Work
– R. Iyer et al.: Hidden Markov models for trajectory modeling. ICSLP 1998
– M. Ostendorf et al.: From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 1996

Publications
– F. Metze and A. Waibel: A Flexible Stream Architecture for ASR Using Articulatory Features. ICSLP 2002, Denver, CO
– C. Fügen and I. Rogina: Integrating Dynamic Speech Modalities into Context Decision Trees. ICASSP 2000, Istanbul, Turkey
– H. Yu and T. Schultz: Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition. Eurospeech 2003, Geneva
– H. Soltau, H. Yu, F. Metze, C. Fügen, Q. Jin, and S. Jou: The ISL Transcription System for Conversational Telephony Speech. Submitted to ICASSP 2004, Vancouver
ISL web page: