1 Recent Work on Acoustic Modeling for CTS at ISL
Florian Metze, Hagen Soltau, Christian Fügen, Hua Yu
Interactive Systems Laboratories
Universität Karlsruhe, Carnegie Mellon University

2 EARS Workshop, December 2003, St. Thomas
Overview
ISL's RT-03 system revisited
System combination of Tree-150 & Tree-6
Richer Acoustic Modeling
– Across-phone Clustering
– Gaussian Transition Modeling
– Modalities
– Articulatory Features

3 Decoding Strategy
System Combination
– Combine tree-150, tree-6; 8ms, 10ms output
– Confusion networks over multiple lattices and Rover
– Confidences computed from combined CNs
– Best single output (Tree-150): 25.4
– CNC + Rover: 24.9
Results on eval03
– Tree-150 single system: 24.2
– CNC + Rover: 23.4
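
The voting step of a Rover-style combination can be sketched as follows. This is a minimal illustration, not the ISL implementation: it assumes the system outputs have already been aligned into slots (real Rover builds that alignment by dynamic programming), and the function name `vote_aligned`, the slot layout, and the confidence values are all hypothetical.

```python
from collections import defaultdict


def vote_aligned(slots):
    """Pick one word per slot by summed confidence.

    slots: list of slots; each slot is a list of (word, confidence) pairs,
    one pair per system. The empty string "" stands for "no word here".
    """
    output = []
    for slot in slots:
        scores = defaultdict(float)
        for word, confidence in slot:
            scores[word] += confidence
        # Sort first so that ties are broken deterministically (alphabetically).
        best = max(sorted(scores), key=scores.get)
        if best:  # skip the slot when the null word wins
            output.append(best)
    return output


# Hypothetical aligned outputs of three systems (illustrative values only).
slots = [
    [("i", 0.9), ("i", 0.8), ("a", 0.5)],
    [("see", 0.4), ("sea", 0.3), ("sea", 0.3)],
    [("", 0.6), ("it", 0.5), ("", 0.2)],
]
print(vote_aligned(slots))
```

Summing confidences per word lets two moderately confident systems outvote one system, which is the basic reason combination can beat the best single output.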

4 Vocabulary
Vocabulary Size: 41k vocabulary selected from SWB, BN, CNN
Pronunciation Variants: 95k entries generated by rule-based approach
Pronunciation Probabilities: from frequencies (forced alignment of training data)
– Viterbi decoding: penalties (e.g. max = 1)
– Confusion networks: real probabilities (e.g. sum = 1)
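
The two normalizations of pronunciation frequencies mentioned above (max = 1 versus sum = 1) can be illustrated with a short sketch. The function name `pron_scores` and the alignment counts are made up for illustration; only the two normalization rules come from the slide.

```python
def pron_scores(counts):
    """Turn forced-alignment counts of one word's variants into two scores.

    Returns (sum_norm, max_norm):
      sum_norm: real probabilities, summing to 1 over the variants
      max_norm: penalties scaled so that the most frequent variant gets 1
    """
    total = sum(counts.values())
    top = max(counts.values())
    sum_norm = {variant: c / total for variant, c in counts.items()}
    max_norm = {variant: c / top for variant, c in counts.items()}
    return sum_norm, max_norm


# Hypothetical counts for the two variants of one word.
counts = {"B EH T AXR": 30, "B EH DX AXR": 70}
sum_norm, max_norm = pron_scores(counts)
print(sum_norm)
print(max_norm)
```

With max-normalization the most frequent variant carries no penalty, so adding variants to a word does not lower the score of its dominant pronunciation during Viterbi decoding.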

5 Clustering
Entropy-based Divisive Clustering
Standard way:
– Grow tree for each context-independent HMM state
– 50 phones, 3 states: 150 trees
Alternative: clustering across phones
– Global tree → parameter sharing across phones
– Computationally expensive to cluster → 6 trees (begin, middle, end for vowels and consonants)
– Quint-phone context
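
One step of entropy-based divisive clustering can be sketched as below: each candidate question splits a node into a "yes" and a "no" part, and the question with the largest count-weighted entropy reduction is chosen. This is a simplified sketch that assumes each node is summarized by a vector of discrete counts; the function names, the question labels, and the counts are hypothetical, and the actual system may compute entropy over different statistics.

```python
import math


def entropy(counts):
    """Shannon entropy (in nats) of a count vector."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def split_gain(yes, no):
    """Count-weighted entropy reduction when a node is split into yes/no."""
    parent = [a + b for a, b in zip(yes, no)]
    return (sum(parent) * entropy(parent)
            - sum(yes) * entropy(yes)
            - sum(no) * entropy(no))


def best_question(candidates):
    """candidates: dict mapping question name to (yes_counts, no_counts)."""
    return max(candidates, key=lambda q: split_gain(*candidates[q]))


# Hypothetical questions and counts, for illustration only.
candidates = {
    "-1=obstruent?": ([10, 0], [0, 10]),  # separates the data perfectly
    "0=vowel?": ([5, 5], [5, 5]),         # separates nothing
}
print(best_question(candidates))
```

The tree is grown by applying this selection recursively to the resulting child nodes until a stopping criterion (for example a minimum count or a target number of leaves) is met.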

6 Motivation for Alternative Clustering
Pronunciation modeling is important for recognizing conversational speech
Adding pronunciation variants often gives marginal improvements due to increased confusability
Case study: Flapping of /T/
BETTER     B EH T AXR
BETTER(2)  B EH DX AXR
→ Dictionary only contains a single pronunciation and the phonetic decision tree chooses whether or not to flap /T/

7 Clustering Across Phones: Tree construction
How to grow a single tree? We expand the question set to allow questions regarding the substate identity and center phone identity.
→ Computationally expensive on 600k SWB quint-phones
Two dictionaries:
– conventional dictionary with 2.2 variants per word
– (almost) single pronunciation dictionary with 1.1 variants per word
A simple procedure is used to reduce the number of pronunciation variants: variants with a relative frequency of <20% are removed. For unobserved words, only the baseform is kept.
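
The variant-reduction procedure described above can be sketched in a few lines. The 20% threshold and the baseform fallback come from the slide; the function name `prune_variants`, the convention that the baseform is listed first, and the example counts are assumptions made for illustration.

```python
def prune_variants(variants, counts, min_rel_freq=0.2):
    """Reduce the pronunciation variants of one word.

    variants: list of pronunciations, baseform first (assumed convention)
    counts:   dict mapping pronunciation to its count in the aligned data
    """
    total = sum(counts.get(p, 0) for p in variants)
    if total == 0:
        # Word never observed in training: keep only the baseform.
        return variants[:1]
    # Drop variants whose relative frequency is below the threshold.
    return [p for p in variants if counts.get(p, 0) / total >= min_rel_freq]


# Hypothetical variants and counts.
variants = ["B EH T AXR", "B EH DX AXR", "B EH AXR"]
counts = {"B EH T AXR": 10, "B EH DX AXR": 85, "B EH AXR": 5}
print(prune_variants(variants, counts))
print(prune_variants(variants, {}))
```

Note that with this rule the surviving pronunciation need not be the original baseform: in the example only the flapped variant is frequent enough to be kept.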

8 Clustering Across Phones
Allows better parameter tying (tying now possible across phones and sub-states)
Alleviates lexical problems: over-specification and inconsistencies
→ no need for an optimal phone set, preferable for multi-lingual / non-native speech recognition
Implicitly models subtle reduction in sloppy speech
[Figure: decision tree with questions 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, -1=obstruent?, 0=end-state? and leaves AX-b, IX-m, AX-m]

9 Clustering Across Phones: Experiments
Cross-substate clustering doesn't make any difference
Cross-phone clustering with 6 trees: {vowel|consonant}-{b|m|e}
Single pronunciation lexicon has 1.1 variants per word (instead of 2.2 variants per word)
Dictionary             Clustering    WER, 66hr training set   WER, 180hr training set
multi-pronunciation    traditional   34.4                     33.4
multi-pronunciation    cross-phone   33.9                     -
single pronunciation   traditional   34.1                     -
single pronunciation   cross-phone   33.1                     31.6
Results are based on first pass decoding on dev01

10 Analysis
Flexible tying works better with single pronunciation lexicon:
→ Higher consistency, data-driven approach
Significant cross-phone sharing: ~30% of the leaf nodes are shared by multiple phones
Commonly tied vowels: AXR & ER, AE & EH, AH & AX
Commonly tied consonants: DX & HH, L & W, N & NG
[Figure: Vowel-b tree with questions -1=voiced?, -1=consonant?, 0=high-vowel?, 1=front-vowel?, 0=high-vowel?, -1=obstruent?, 0=L | R | W?]

11 Gaussian Transition Modeling
A linear sequence of GMMs may contain a mix of different model sequences.
To further distinguish these paths, we can model transitions between Gaussians in adjacent states.

12 Frame-independence Assumption
HMM assumes each speech frame to be conditionally independent given the hidden state sequence
[Figure: HMM as a generative model, with models emitting frames]

13 Gaussian Transition Modeling
GTM models transition probabilities between Gaussians
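
One way to write this down, as a sketch under assumptions rather than the exact formulation used in the system: in a conventional HMM the mixture weights of a state do not depend on the past, whereas a Gaussian transition model makes the weight of Gaussian j depend on the Gaussian i that was active in the preceding frame. The symbols w, a, g, and s below are notation introduced here for illustration.

```latex
% Conventional GMM emission: weights w_{s_t j} are independent of the history
p(x_t \mid s_t) = \sum_{j} w_{s_t j}\, \mathcal{N}(x_t;\ \mu_{s_t j}, \Sigma_{s_t j})

% Gaussian transition model (sketch): first-order dependence on the previous Gaussian
P(g_t = j \mid g_{t-1} = i) = a_{ij}, \qquad \sum_{j} a_{ij} = 1

% Emission given the previously active Gaussian i
p(x_t \mid s_t,\ g_{t-1} = i) = \sum_{j} a_{ij}\, \mathcal{N}(x_t;\ \mu_{s_t j}, \Sigma_{s_t j})
```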

14 GTM for Modeling Sloppy Speech
Partial reduction / realization may be better modeled at sub-phoneme level
GTM can be thought of as a pronunciation network at the Gaussian level
GTM can handle a large number of trajectories
Advantages over Parallel Path HMMs / Segmental HMMs, where
– the number of paths is very limited
– it is hard to determine the right number of paths

15 Experiments
GTM can be readily trained using the Baum-Welch algorithm
Data sufficiency is an issue since we are modeling a 1st-order variable
Pruning transitions is important (backing-off)
Pruning Threshold   Avg. #transitions per Gaussian   WER (%)
Baseline            14.4                             34.1
1e-5                9.7                              33.7
1e-3                6.6                              33.7
0.01                4.6                              33.6
0.05                2.7                              33.9
WERs on Switchboard (hub5e-01)
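
Threshold pruning of the outgoing transitions of one Gaussian can be sketched as follows. The slide only states that transitions are pruned with a back-off; this sketch substitutes simple renormalization of the surviving transitions for whatever back-off scheme the system uses, and the function name and probabilities are hypothetical.

```python
def prune_transitions(row, threshold):
    """Prune the outgoing transition probabilities of one Gaussian.

    row: dict mapping target Gaussian index to transition probability.
    Transitions below the threshold are dropped and the rest renormalized
    (an assumption standing in for the unspecified back-off scheme).
    """
    kept = {j: p for j, p in row.items() if p >= threshold}
    if not kept:
        # Never remove every transition: keep the single most likely one.
        best = max(row, key=row.get)
        kept = {best: row[best]}
    total = sum(kept.values())
    return {j: p / total for j, p in kept.items()}


# Hypothetical transition row with four target Gaussians.
row = {0: 0.6, 1: 0.3, 2: 0.09, 3: 0.01}
print(prune_transitions(row, 0.05))
```

Raising the threshold lowers the average number of transitions per Gaussian, which is the quantity reported in the second column of the table.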

16 Experiments II
GTM offers better discrimination between trajectories
– All trajectories are nonetheless still allowed.
– Pruning away unlikely transitions leads to a more compact and prudent model.
– However, we need to be careful not to prune away unseen trajectories due to a limited training set.
Using a first-order acoustic model in decoding requires maintaining the left history, which is expensive at word boundaries. A Viterbi approximation is used in the current implementation.
Log-likelihood improvement during Baum-Welch training: -50.67 to -49.18

17 Modalities
Would like to include additional information in divisive clustering, e.g.:
– Gender
– Signal-to-noise ratio
– Speaking rate
– Speaking style (normal vs hyper-articulated)
– Dialect
– Show-type, Data-type (CNN, NBC, ...)
Data-driven approach: sharing still possible

18 Modalities II
Suitable for different corpora?
Example:
– German Dialects
– Male / Female
[Figure: decision tree with questions -1=vowel?, -1=obstruent?, 0=bavarian?, -1=syllabic?, 0=suabian?, -1=obstruent?, 0=female?]

19 Modalities III
Tested on German Verbmobil data
Not enough time to test on SWB / RT-03
Proved beneficial in several applications
– Labeled data needed
– Our tests were not done on highly optimized systems (VTLN)
– Hyperarticulation: -1.7% for Hyper, +0.3% for Normal

20 Modalities Results

21 Articulatory Features
Idea: combine very specific sub-phone models with generic models
Articulatory Features: Linguistically Motivated
/F/ = UNVOICED, FRICATIVE, LAB-DNT, ...
Introduce new Degrees of Freedom for
– Modeling
– Adaptation
Integrate into existing architecture, use existing training techniques (GMMs) for feature detectors
Articulatory (Voicing) Features in Front-end did not help

22 Articulatory Features
Output from Feature Detectors: p(FEAT)-p(NON_FEAT)+p0

23 Articulatory Features
Asymmetric Stream Setup: ~4k models
– ~4k GMMs in stream 0
– 2 GMMs in stream 1...N ("Feature Streams")
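
A multi-stream score of this shape can be sketched as below. The slide gives only the stream layout (one large phone-model stream plus feature streams with two GMMs each); the weighted sum of log-likelihoods, the function name `combined_score`, the stream weights, and all numeric values are assumptions chosen for illustration.

```python
def combined_score(phone_loglik, feature_logliks, has_feature, weights):
    """Combine stream 0 with N two-model feature streams.

    phone_loglik:    log-likelihood from the context-dependent model (stream 0)
    feature_logliks: list of (loglik_feature, loglik_non_feature), one per stream
    has_feature:     list of bools, whether the modeled phone carries each feature
    weights:         list of 1 + N stream weights (assumed log-linear combination)
    """
    score = weights[0] * phone_loglik
    streams = zip(weights[1:], feature_logliks, has_feature)
    for weight, (present, absent), expected in streams:
        # Each feature stream contributes the score of the detector model
        # that matches the phone's linguistic description.
        score += weight * (present if expected else absent)
    return score


# Hypothetical values: a phone that is UNVOICED (stream 1) and not a vowel (stream 2).
score = combined_score(
    phone_loglik=-10.0,
    feature_logliks=[(-1.0, -3.0), (-2.0, -0.5)],
    has_feature=[True, False],
    weights=[0.8, 0.1, 0.1],
)
print(score)
```

Because every feature stream holds only two GMMs, the feature streams add few parameters compared with the roughly 4k models of stream 0.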

24 Articulatory Features Results I
Test on Read Speech (BN-F0): 13.4% → 11.6% with Articulatory Features
Test on Multilingual Data: 13.1% → 11.5% (English with ML detectors)
Significant improvements also seen on
– Hyper-Articulated Speech
– Spontaneous, Clean Speech (ESST)

25 Articulatory Features Results II
Test on Switchboard (RT-03 devset)
           Corr   Sub    Del    Ins   WER    S.Err
Baseline   72.5   20.0   7.5    4.4   31.9   67.2
Features   68.3   18.3   13.4   2.2   33.9   68.4
Result:
– Substitutions, Insertions ↓
– Deletions ↑
No overall improvement yet → will work on setup
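
The figures in the two result rows are internally consistent, which can be checked with the usual definitions: word error rate is the sum of the substitution, deletion, and insertion rates, and percent correct is 100 minus substitutions and deletions. The function names below are illustrative.

```python
def word_error_rate(sub, dele, ins):
    """WER in percent from substitution, deletion and insertion rates."""
    return sub + dele + ins


def percent_correct(sub, dele):
    """Percent of reference words recognized correctly."""
    return 100.0 - sub - dele


# Rates taken from the two rows above.
print(round(word_error_rate(20.0, 7.5, 4.4), 1))   # Baseline
print(round(word_error_rate(18.3, 13.4, 2.2), 1))  # Features
print(round(percent_correct(18.3, 13.4), 1))
```

The feature system trades 1.7 points of substitutions and 2.2 points of insertions for 5.9 additional points of deletions, which explains the net increase from 31.9 to 33.9.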

26 Thank You, ... the ISL team!

27 Related Work
D. Jurafsky, et al.: What kind of pronunciation variation is hard for triphones to model? ICASSP'01
T. Hain: Implicit pronunciation modeling in ASR. ISCA Pronunciation Modeling Workshop, 2002
M. Saraclar, et al.: Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language, Apr. 2000

28 Related Work
R. Iyer, et al.: Hidden Markov models for trajectory modeling. ICSLP'98
M. Ostendorf, et al.: From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Trans. SAP, 1996

29 Publications
F. Metze and A. Waibel: A Flexible Stream Architecture for ASR using Articulatory Features; ICSLP 2002; Denver, CO
C. Fügen and I. Rogina: Integrating Dynamic Speech Modalities into Context Decision Trees; ICASSP 2000; Istanbul, Turkey
H. Yu and T. Schultz: Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition; Eurospeech 2003; Geneva
H. Soltau, H. Yu, F. Metze, C. Fügen, Q. Jin, and S. Jou: The ISL transcription system for conversational telephony speech; submitted to ICASSP 2004; Vancouver
ISL web page:
