WEB-DATA AUGMENTED LANGUAGE MODEL FOR MANDARIN SPEECH RECOGNITION

Tim Ng 1,2, Mari Ostendorf 2, Mei-Yuh Hwang 2, Manhung Siu 1, Ivan Bulyko 2, Xin Lei 2
1. HLTC, Department of Electrical & Electronic Engineering, The Hong Kong University of Science & Technology, Hong Kong
2. SSLI-LAB, Department of Electrical Engineering, University of Washington, Seattle, WA 98195

Introduction

Problems
– N-gram LMs require large quantities of data that are matched to the target recognition task in both style and topic.
– Recent words may be sparsely represented in the training data.
– Collecting data for conversational speech recognition is costly, especially for Mandarin.

Solution
– Gather text from the web, filtering for topic and style.
– Apply topic models to the web data.

Getting the right data from the Web

– General data collection: query the web with conversation-like n-grams to retrieve conversation-like data.
  Examples: "对对对" (right, right, right), "呃我觉得" (uh, I think), "呃怎么说呢" (uh, how to say).
– Topic-dependent data collection: query the web with topic-related, conversation-like n-grams to retrieve topic-dependent data.
  Examples: "我觉得吃饭" (I think, about dinner), "抽烟我觉得" (smoking, I think), "他他老婆" (his- his wife).

Cleaning up web data

Raw web data is turned into usable web data by:
– Parsing the HTML files
– Ignoring documents that contain corrupted characters
– Text normalization (written form → spoken form)
– Automatic word segmentation
– Perplexity-based filtering of web pages (see the filtering sketch after the Conclusions)

Experiment setup

Available Corpora
– CallHome + CallFriend (CC): 479K words
– Train04 (in-domain data): 398K words
– Conversational-style web data: 100M words
– Topic-based web data: 244M words

Language Model Construction

Static General Models
– Train trigram LMs on the individual data sources: Pcc, P04, Pconv and Ptopic.
– Mixture LM: interpolate the individual LMs with mixture weights optimized on a held-out set (see the interpolation sketch below).
  a) 3-component general LM: Pcc + P04 + Pconv
  b) 4-component general LM: Pcc + P04 + Pconv + Ptopic

Topic-Based Models

Static Topic Model
– Cluster the 40 topics in Train04 into 10 topic clusters.
– Train one LM for each cluster.
– Interpolate all 13 LMs (the 10 cluster LMs and the three general components) with weights optimized for likelihood on a held-out set; this gives the static topic model.

Dynamic Topic Model
– The dynamic mixture weights for topic clusters T1...T10, assuming equally likely topics, are the topic posteriors computed over the discriminative vocabulary:
  λ_t(h) ∝ Π_{w ∈ h ∩ V*} P(w | θ_t),  normalized so that Σ_t λ_t(h) = 1,
  where V* is the set of discriminative vocabulary items and h is the hypothesized word history (see the topic-weighting sketch below).
– Model combination: the dynamically weighted mixture over T1...T10 takes the place of the static topic component and is interpolated with the general components using the weights 0.64, 0.04, 0.16 and 0.16 that were optimized for the static general models.

Marginal Adaptation Model
– Topic identification from the ASR hypothesis: t* = argmax_t P(h | θ_t), where h is the hypothesis and θ_t is the unigram LM of topic cluster t (t = 1...10).
– Maximum entropy adaptation: the 4-component general LM is adapted toward the target topic LM P_t*, yielding the marginally adapted LM (see the adaptation sketch below).

Baseline Recognition System
[Figure: SRI 5xRT multi-pass system — decoding/rescoring steps with non-crossword and crossword MFC and PLP acoustic models, thin lattices, lattice generation/use, N-best lists, hypotheses for MLLR, and confusion network combination.]

Experiment Results

Data sources             Perplexity   Character Error Rate
CC + Train04             269.3        38.8%
3-comp general LM        202.2        36.4%
4-comp general LM        192.6        36.1%
Static topic mixture     196.6        36.3%
Dynamic topic mixture    -            36.2%
Marginal adaptation      -            36.4%

Conclusions
– Conversational web data helped: a 28.5% reduction in word perplexity and a 7% relative reduction in CER.
– Topic-based web data gave a marginal gain (0.3% absolute) in CER, but only when pooled (not in explicit topic modeling).
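The perplexity-based filtering step under "Cleaning up web data" is not spelled out on the poster. Below is a minimal sketch of the usual approach, assuming an in-domain LM scores each candidate web document and only low-perplexity (conversation-like) documents are kept; the add-one-smoothed unigram LM, the toy documents and the threshold are placeholders, not the poster's actual models or settings.

```python
import math
from collections import Counter

def train_unigram_lm(tokens):
    """Add-one smoothed unigram LM trained on in-domain (conversational) text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return counts, total, vocab

def perplexity(tokens, lm):
    """Per-word perplexity of a document under the in-domain LM."""
    counts, total, vocab = lm
    log_prob = 0.0
    for w in tokens:
        p = (counts.get(w, 0) + 1) / (total + vocab)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens), 1))

def filter_documents(documents, lm, threshold):
    """Keep only documents whose perplexity under the in-domain LM is low."""
    return [doc for doc in documents if perplexity(doc, lm) <= threshold]

if __name__ == "__main__":
    # Toy in-domain data and candidate web documents (already word-segmented).
    in_domain = "我 觉得 吃饭 很 好 呃 我 觉得 对 对 对".split()
    web_docs = [
        "呃 我 觉得 吃饭 很 好".split(),        # conversation-like page
        "本 公司 季度 财务 报告 如下".split(),  # written-style, off-target page
    ]
    lm = train_unigram_lm(in_domain)
    kept = filter_documents(web_docs, lm, threshold=10.0)
    print(f"kept {len(kept)} of {len(web_docs)} documents")
```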
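The mixture-LM construction ("interpolate the individual LMs with mixture weights optimized on a held-out set") can be sketched as EM re-estimation of linear interpolation weights. The component distributions and held-out text below are toy stand-ins for Pcc, P04, Pconv and Ptopic; the real components are trigram LMs.

```python
def em_mixture_weights(components, held_out, iterations=20):
    """EM estimation of interpolation weights; components map word -> P(word)."""
    k = len(components)
    weights = [1.0 / k] * k  # start from uniform weights
    floor = 1e-12            # guard against zero probabilities
    for _ in range(iterations):
        expected = [0.0] * k
        for w in held_out:
            mix = [weights[i] * components[i].get(w, floor) for i in range(k)]
            total = sum(mix)
            for i in range(k):
                expected[i] += mix[i] / total  # posterior of component i given w
        weights = [e / len(held_out) for e in expected]
    return weights

def mixture_prob(word, components, weights):
    """Interpolated probability of a word under the mixture LM."""
    return sum(w * lm.get(word, 1e-12) for w, lm in zip(weights, components))

if __name__ == "__main__":
    # Toy stand-ins for two component LMs and a held-out set.
    p_conv = {"呃": 0.3, "我": 0.3, "觉得": 0.3, "报告": 0.1}
    p_topic = {"抽烟": 0.4, "吃饭": 0.4, "我": 0.1, "觉得": 0.1}
    held_out = ["呃", "我", "觉得", "吃饭", "我", "觉得"]
    weights = em_mixture_weights([p_conv, p_topic], held_out)
    print("optimized weights:", [round(w, 3) for w in weights])
    print("P(吃饭) under mixture:",
          round(mixture_prob("吃饭", [p_conv, p_topic], weights), 3))
```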
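A sketch of the dynamic topic weights and the topic-identification rule t* = argmax_t P(h | θ_t), assuming equal topic priors and unigram topic LMs evaluated only on the discriminative vocabulary V*. The two toy topic LMs and the V* set are illustrative assumptions, not the poster's 10 cluster models.

```python
import math

def topic_log_likelihood(hypothesis, topic_lm, v_star, floor=1e-6):
    """log P(h | theta_t), using only words in the discriminative vocabulary."""
    return sum(math.log(topic_lm.get(w, floor))
               for w in hypothesis if w in v_star)

def dynamic_weights(hypothesis, topic_lms, v_star):
    """Posterior topic weights lambda_t(h), assuming equally likely topics."""
    log_liks = [topic_log_likelihood(hypothesis, lm, v_star) for lm in topic_lms]
    m = max(log_liks)
    unnorm = [math.exp(x - m) for x in log_liks]  # numerically stable normalization
    z = sum(unnorm)
    return [u / z for u in unnorm]

if __name__ == "__main__":
    topic_lms = [
        {"吃饭": 0.2, "菜": 0.1},   # toy "food" cluster
        {"抽烟": 0.2, "烟": 0.1},   # toy "smoking" cluster
    ]
    v_star = {"吃饭", "菜", "抽烟", "烟"}
    hyp = "呃 我 觉得 吃饭 很 好".split()
    lambdas = dynamic_weights(hyp, topic_lms, v_star)
    t_star = max(range(len(lambdas)), key=lambda t: lambdas[t])  # hard topic decision
    print("dynamic weights:", [round(x, 3) for x in lambdas], "t* =", t_star)
```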
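The poster does not give the maximum-entropy adaptation equations. The sketch below shows one standard form of marginal (minimum discrimination information) adaptation, in which the general LM is scaled by the ratio of target-topic to general unigram marginals raised to a power beta and then renormalized; beta and the toy distributions are assumptions, and the formulation actually used may differ in detail.

```python
def marginally_adapt(p_general, p_topic_unigram, p_general_unigram, beta=0.5):
    """Scale the general LM toward the target topic LM's unigram marginals."""
    scaled = {w: p * (p_topic_unigram.get(w, 1e-9) /
                      p_general_unigram.get(w, 1e-9)) ** beta
              for w, p in p_general.items()}
    z = sum(scaled.values())            # renormalize over the vocabulary
    return {w: p / z for w, p in scaled.items()}

if __name__ == "__main__":
    # Toy conditional distribution from the general LM and toy unigram marginals.
    p_general = {"吃饭": 0.2, "抽烟": 0.2, "觉得": 0.6}
    p_topic_uni = {"吃饭": 0.5, "抽烟": 0.05, "觉得": 0.45}
    p_general_uni = {"吃饭": 0.2, "抽烟": 0.2, "觉得": 0.6}
    adapted = marginally_adapt(p_general, p_topic_uni, p_general_uni)
    print({w: round(p, 3) for w, p in adapted.items()})
```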