
Maximum Entropy Language Modeling with Semantic, Syntactic and Collocational Dependencies Jun Wu Department of Computer Science and Center for Language and Speech Processing Johns Hopkins University, Baltimore, MD 21218 August 20, 2002

Outline: Motivation; Semantic (topic) dependencies; Syntactic dependencies; Maximum entropy (ME) models with topic and syntactic dependencies; Training ME models efficiently: hierarchical training (N-gram), generalized hierarchical training (syntactic model), divide-and-conquer (topic-dependent model); Conclusion and future work.

Exploiting Semantic and Syntactic Dependencies. Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange." N-gram models take only local correlations between words into account. Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency. We need a model that exploits topic and syntax.

Training a Topic-Sensitive Model. Cluster the training data by topic: TF-IDF vectors (excluding stop words), cosine similarity, K-means clustering (K ≈ 70 for SWBD, ≈ 100 for BN). Select topic-dependent words by the criterion

$$f_t(w)\,\log\frac{f_t(w)}{f(w)} \;\ge\; \text{threshold},$$

where $f_t(w)$ is the relative frequency of $w$ within topic $t$ and $f(w)$ its corpus-wide relative frequency. Estimate an ME model with N-gram and topic-unigram constraints:

$$P(w_i \mid w_{i-2}, w_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(\text{topic},\, w_i)}}{Z(w_{i-2}, w_{i-1}, \text{topic})},$$

where the topic-unigram constraint is

$$\sum_{w_{i-2},\, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid \text{topic}) = \frac{\#[\text{topic},\, w_i]}{\#[\text{topic}]}.$$
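The clustering and word-selection steps above can be sketched as follows. This is a minimal illustration, not the original tooling: scikit-learn's TfidfVectorizer and KMeans stand in for whatever TF-IDF and K-means implementation was actually used, the whitespace tokenization and the threshold value are assumptions, and L2-normalizing the vectors makes Euclidean K-means behave like cosine-similarity clustering.

```python
import math
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def cluster_by_topic(documents, k=70, seed=0):
    """Cluster documents (one string per conversation/story) into k topics
    using TF-IDF vectors with English stop words removed."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    vectors = normalize(vectors)   # unit length: Euclidean k-means ~ cosine k-means
    return KMeans(n_clusters=k, random_state=seed).fit_predict(vectors)

def topic_dependent_words(topic_docs, all_docs, threshold=1e-4):
    """Select words w with f_t(w) * log(f_t(w) / f(w)) >= threshold, where
    f_t is the relative frequency within the topic and f the corpus-wide one."""
    topic_counts = Counter(w for d in topic_docs for w in d.split())
    corpus_counts = Counter(w for d in all_docs for w in d.split())
    n_t, n = sum(topic_counts.values()), sum(corpus_counts.values())
    selected = []
    for w, c in topic_counts.items():
        f_t, f = c / n_t, corpus_counts[w] / n
        if f_t * math.log(f_t / f) >= threshold:
            selected.append(w)
    return selected
```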

Experimental Setup. Switchboard WS97 dev-test set. Vocabulary: 22K (closed). LM training set: 1,100 conversations, 2.1M words. AM training set: 60 hours of speech data. Acoustic model: state-clustered cross-word triphone model. Front end: 13 MF-PLP + Δ + ΔΔ, per-conversation-side CMS. Test set: 19 conversations (2 hours), 18K words. No speaker adaptation. The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

Experimental Setup (Cont.) Broadcast News Hub-4 1996 eval set. Vocabulary: 64K. LM training set: 125K stories, 130M words. AM training set: 72 hours of speech data. Acoustic model: state-clustered cross-word triphone model. Front end: 13 MFCC + Δ + ΔΔ. Test set: 2 hours, 22K words. No speaker adaptation. The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

Experimental Results (Switchboard). Baseline trigram model: PPL 79, WER 38.5%. Using N-best hypotheses (rather than reference transcripts) for topic assignment causes little degradation. Topic assignment based on individual utterances gives slightly better results than assignment based on whole conversations. Topic dependencies reduce perplexity by 7% and WER by 0.7% absolute.

Experimental Results (Broadcast News). Utterance-level topic detection based on 10-best lists. The ME trigram model duplicates the performance of the corresponding back-off model. Topic dependencies reduce perplexity by 10% and WER by 0.6% absolute.

Exploiting Syntactic Dependencies contract NP ended VP The h with a loss of 7 cents after w DT NN VBD IN CD NNS i-2 i-1 i nt i-1 nt i-2 All sentences in the training set are parsed by a left-to-right parser. A stack of parse trees for each sentence prefix is generated.

Exploiting Syntactic Dependencies (Cont.) A probability is assigned to each word as

$$P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_1^{i-1}, T_i)\,\rho(T_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})\,\rho(T_i \mid w_1^{i-1}),$$

where $S_i$ is the stack of partial parses $T_i$ of the prefix $w_1^{i-1}$. [Figure: the same partial parse as above, showing the exposed head words $h_{i-2}$ = contract, $h_{i-1}$ = ended and non-terminals $nt_{i-2}$ = NP, $nt_{i-1}$ = VP used to predict $w_i$.]
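A sketch of this marginalization over the parse stack, with assumed interfaces (the stack-entry layout, the parse weights rho, and the conditional model are stand-ins; the real system's data structures are not shown in the slides):

```python
from typing import Callable, List, Tuple

# One stack entry per partial parse T of the prefix w_1^{i-1}:
# (rho, (w2, w1, h2, h1, nt2, nt1)) -- the parse weight and the exposed context.
StackEntry = Tuple[float, Tuple[str, str, str, str, str, str]]

def word_probability(word: str,
                     stack: List[StackEntry],
                     cond_prob: Callable[..., float]) -> float:
    """P(w_i | w_1^{i-1}) = sum over parses T of rho(T) * P(w_i | exposed context)."""
    total = sum(rho for rho, _ in stack)
    return sum((rho / total) * cond_prob(word, *ctx) for rho, ctx in stack)
```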

Training a Syntactic ME Model. Estimate an ME model with syntactic constraints (see Khudanpur & Wu, Computer Speech and Language 2000, and Chelba & Jelinek, ACL 1998, for details):

$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(h_{i-1}, w_i)}\, e^{\lambda(h_{i-2}, h_{i-1}, w_i)}\, e^{\lambda(nt_{i-1}, w_i)}\, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})},$$

where the marginal constraints are

$$\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]},$$

$$\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]},$$

$$\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}.$$
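The conditional probability above can be sketched as a product of exponentiated feature weights normalized over the vocabulary. The feature-key names ("ug", "bg", ...) and the weight dictionary are illustrative assumptions, not the original implementation; note that the naive normalization below sums over the whole vocabulary, which is exactly the cost addressed by the training speed-ups later in the talk.

```python
import math
from typing import Dict, Iterable, Tuple

class SyntacticMaxEnt:
    """Conditional ME model: product of exp(feature weights), normalized over V."""

    def __init__(self, weights: Dict[Tuple, float], vocab: Iterable[str]):
        self.w = weights          # feature tuple -> lambda, e.g. ("tg", w2, w1, wi)
        self.vocab = list(vocab)

    def _score(self, wi, w2, w1, h2, h1, nt2, nt1):
        feats = [("ug", wi), ("bg", w1, wi), ("tg", w2, w1, wi),
                 ("hbg", h1, wi), ("htg", h2, h1, wi),
                 ("ntbg", nt1, wi), ("nttg", nt2, nt1, wi)]
        return math.exp(sum(self.w.get(f, 0.0) for f in feats))

    def prob(self, wi, w2, w1, h2, h1, nt2, nt1):
        z = sum(self._score(v, w2, w1, h2, h1, nt2, nt1) for v in self.vocab)
        return self._score(wi, w2, w1, h2, h1, nt2, nt1) / z
```

The `prob` method matches the `cond_prob` signature assumed in the previous sketch, so the two can be composed directly.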

Experimental Results for Switchboard. Baseline Katz back-off trigram model: PPL 79, WER 38.5%. Interpolated model (Chelba & Jelinek): perplexity reduced by 4% and WER by 0.6% absolute. Non-terminal and syntactic (head-word) constraints together reduce perplexity by 6.3% and WER by 1.0% absolute. The ME model achieves better performance than the interpolated model.

Experimental Results for Switchboard. Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute. Head-word N-gram constraints result in similar improvements. Non-terminal and head-word constraints together reduce perplexity by 6.3% and WER by 1.0% absolute. The ME model achieves better performance than the interpolated model (Chelba & Jelinek).

Experimental Results for Broadcast News.* Baseline trigram model: PPL 214, WER 35.3%. Interpolated model: perplexity reduced by 7% and WER by 0.6% absolute. (*14M words of data.)

Experimental Results for Broadcast News.* Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.4% absolute. Head-word N-gram constraints result in similar improvements. Non-terminal and head-word constraints together reduce perplexity by 7% and WER by 0.7% absolute. The ME model achieves slightly better performance than the interpolated model. (*14M words of data.)

Combining Topic, Syntactic and N-gram Dependencies in an ME Framework. Probabilities are assigned as

$$P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})\,\rho(T_i \mid w_1^{i-1}).$$

The ME composite model is trained as

$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(h_{i-1}, w_i)}\, e^{\lambda(h_{i-2}, h_{i-1}, w_i)}\, e^{\lambda(nt_{i-1}, w_i)}\, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}\, e^{\lambda(\text{topic},\, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})}.$$

Only marginal, trigram-like constraints are necessary.

Overall Experimental Results for SWBD. Baseline trigram WER is 38.5%. Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute. Syntactic constraints (heads and non-terminals) result in a 6% reduction in perplexity and a 1.0% absolute reduction in WER. Topic-dependent and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.

Overall Experimental Results for BN. Similar improvements are achieved on the 14M-word Broadcast News task: topic-dependent constraints and syntactic constraints individually reduce WER by 0.6% and 0.7% absolute, respectively; together they reduce WER by 1.2% absolute. The resulting WER is lower than that of the trigram model trained on 130M words. The gains from topic and syntactic dependencies are nearly additive.

Advantages and Disadvantages of the Maximum Entropy Method. Advantages: it creates a "smooth" model that satisfies all empirical constraints, and it incorporates various sources of information (e.g., topic and syntax) in a unified language model. Disadvantages: high computational complexity of the parameter estimation procedure, and a heavy computational load when using ME models during recognition.

Estimating Model Parameters Using GIS. Trigram model:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})}, \quad \text{where } Z(w_{i-2}, w_{i-1}) = \sum_{w} e^{\lambda(w)}\, e^{\lambda(w_{i-1}, w)}\, e^{\lambda(w_{i-2}, w_{i-1}, w)}.$$

Generalized Iterative Scaling (GIS) can be used to compute the $\lambda$'s. Estimating the model expectation of each unigram, bigram and trigram feature needs $O(|V|^2)$, $O(|V|)$ and $O(1)$ operations respectively, so an iteration of the straightforward implementation costs roughly $O(|V|^3 + B\,|V| + T)$, where $B$ and $T$ are the numbers of bigram and trigram features. The first term dominates the computation; e.g., in Switchboard it amounts to roughly $8 \times 10^{12}$ operations per iteration.
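A toy sketch of the GIS-style update for this trigram ME model, under stated simplifications: the correction feature is omitted, so this is the commonly used damped update with step size 1/C rather than textbook GIS, and expectations are accumulated only over histories seen in training (the implementation discussed on the next slide). It is meant to show the shape of the computation, not to be efficient.

```python
import math
from collections import Counter, defaultdict

def train_me_trigram(tokens, vocab, iters=50):
    C = 3.0  # max number of active features per event (unigram + bigram + trigram)
    events = list(zip(tokens, tokens[1:], tokens[2:]))
    hist_count = Counter((w2, w1) for w2, w1, _ in events)

    def feats(w2, w1, w):
        return [("u", w), ("b", w1, w), ("t", w2, w1, w)]

    # Empirical feature counts: the constraints the model must match.
    emp = Counter(f for e in events for f in feats(*e))
    lam = defaultdict(float)

    for _ in range(iters):
        exp = defaultdict(float)
        # Model expectations, summing over observed histories only.
        for (w2, w1), n_h in hist_count.items():
            scores = {w: math.exp(sum(lam[f] for f in feats(w2, w1, w) if f in emp))
                      for w in vocab}
            z = sum(scores.values())
            for w in vocab:
                p = n_h * scores[w] / z
                for f in feats(w2, w1, w):
                    if f in emp:
                        exp[f] += p
        # Damped iterative-scaling update.
        for f in emp:
            lam[f] += math.log(emp[f] / max(exp[f], 1e-12)) / C
    return lam
```

The per-iteration cost of this sketch is (number of observed histories) × |V|, which is exactly the quantity the following slides set out to reduce.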

The Computation of Denominators. For each unigram feature $g_1(w_i)$, we need to compute

$$E^{(n)}[g_1(w_i)] = \sum_{w_{i-2},\, w_{i-1}} P(w_{i-2}, w_{i-1})\, \frac{\alpha_1(w_i)^{g_1(w_i)}\, \alpha_2(w_{i-1}, w_i)^{g_2(w_{i-1}, w_i)}\, \alpha_3(w_{i-2}, w_{i-1}, w_i)^{g_3(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})}.$$

Computing the denominator $Z(w_{i-2}, w_{i-1})$ for all histories takes $O(|V|^3)$ time; computing the expectations of all unigram features needs the same amount of time. $Z(w_{i-2}, w_{i-1})$ requires a sum over all $w_i$ for a given history; $E_{w_{i-2}, w_{i-1}}[\cdot]$ requires a sum over all histories for a given $w_i$. Any simplification made in the calculation of denominators can therefore also be applied to the feature expectations, so we focus on the computation of denominators.

State-of-the-Art Implementation. Della Pietra et al. suggest computing the expectations by summing only over histories observed in the training data, whereas the straightforward implementation sums over all $|V|^2$ histories: SWBD, roughly 6 billion vs. 8 trillion operations (~1300-fold). Moreover, for a given history, only a few words have conditional (bigram or trigram) features activated.

State-of-the-Art Implementation (cont'd): Unigram Caching. Unigram caching (Della Pietra et al.): the unigram-only part of each denominator is computed once and cached; for a given history, only the few words with active bigram or trigram features need to be summed explicitly. In practice this is far cheaper than a full sum over the vocabulary; e.g., SWBD: 120 million vs. 6 billion operations (~50-fold).
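A minimal sketch of the unigram-caching idea, with assumed data structures (a feature-weight dictionary and per-history lists of words that carry bigram or trigram features): the vocabulary-wide sum of unigram factors is computed once, and each history's denominator only corrects the few terms whose conditional features are active.

```python
import math

def cached_denominators(vocab, lam, bigram_feats, trigram_feats, histories):
    """lam maps feature tuples to weights; bigram_feats[w1] and
    trigram_feats[(w2, w1)] list the words with active conditional features."""
    a_uni = {w: math.exp(lam.get(("u", w), 0.0)) for w in vocab}
    uni_sum = sum(a_uni.values())          # computed once, O(|V|)

    Z = {}
    for (w2, w1) in histories:             # only a few corrections per history
        corrected = set(bigram_feats.get(w1, ())) | set(trigram_feats.get((w2, w1), ()))
        z = uni_sum
        for w in corrected:
            full = a_uni[w] * math.exp(lam.get(("b", w1, w), 0.0)
                                       + lam.get(("t", w2, w1, w), 0.0))
            z += full - a_uni[w]           # replace the cached unigram-only term
        Z[(w2, w1)] = z
    return Z
```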

Hierarchical Training. Unigram caching is still too "slow" for large corpora, e.g. BN. The computational complexity of hierarchical training is the same as that of training a back-off trigram model. The method can be extended to general N-gram models, with the training time per iteration exactly the same as that of the empirical (relative-frequency) estimation.
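The transcript does not spell out the hierarchical algorithm, so the following is only a sketch of the nesting idea consistent with the slides: lower-order partial sums are computed once and shared by all histories with the same suffix, so the per-iteration work is on the order of |V| plus the number of bigram and trigram features, i.e., the same order as collecting counts for a back-off trigram model.

```python
import math

def hierarchical_denominators(vocab, lam, bigram_feats, trigram_feats):
    """Denominators for histories that carry trigram features; histories with
    no trigram feature fall back to bi_sum[w1] (or uni_sum)."""
    a_uni = {w: math.exp(lam.get(("u", w), 0.0)) for w in vocab}
    uni_sum = sum(a_uni.values())                       # level 0: once, O(|V|)

    # Level 1: one corrected sum per bigram suffix w1, shared by all w2.
    bi_sum = {}
    for w1, words in bigram_feats.items():
        s = uni_sum
        for w in words:
            s += a_uni[w] * (math.exp(lam.get(("b", w1, w), 0.0)) - 1.0)
        bi_sum[w1] = s

    # Level 2: correct only the trigram-featured words per full history (w2, w1).
    Z = {}
    for (w2, w1), words in trigram_feats.items():
        z = bi_sum.get(w1, uni_sum)
        for w in words:
            a_bi = a_uni[w] * math.exp(lam.get(("b", w1, w), 0.0))
            z += a_bi * (math.exp(lam.get(("t", w2, w1, w), 0.0)) - 1.0)
        Z[(w2, w1)] = z
    return Z
```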

Speed-up of the Hierarchical Training Method. Baseline: unigram caching (Della Pietra et al.). The hierarchical training method achieves a nominal speed-up of two orders of magnitude for Switchboard and three orders of magnitude for Broadcast News, and a real speed-up of 30-fold for SWBD and 85-fold for BN.

Feature Hierarchy. [Diagram: the nested feature hierarchies of the N-gram model and the syntactic model.] Writing $\alpha = e^{\lambda}$ and letting each $g_k$ be the indicator of the corresponding feature, the syntactic model factors as

$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z}\, \alpha_1(w_i)^{g_1}\, \alpha_2(w_{i-1}, w_i)^{g_2}\, \alpha_3(h_{i-1}, w_i)^{g_3}\, \alpha_4(nt_{i-1}, w_i)^{g_4}\, \alpha_5(w_{i-2}, w_{i-1}, w_i)^{g_5}\, \alpha_6(h_{i-2}, h_{i-1}, w_i)^{g_6}\, \alpha_7(nt_{i-2}, nt_{i-1}, w_i)^{g_7}.$$

Generalized Hierarchical Training. The hierarchical scheme is generalized to the syntactic model, whose features form parallel hierarchies over words, head words and non-terminals:

$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z}\, \alpha_1(w_i)^{g_1}\, \alpha_2(w_{i-1}, w_i)^{g_2}\, \alpha_3(h_{i-1}, w_i)^{g_3}\, \alpha_4(nt_{i-1}, w_i)^{g_4}\, \alpha_5(w_{i-2}, w_{i-1}, w_i)^{g_5}\, \alpha_6(h_{i-2}, h_{i-1}, w_i)^{g_6}\, \alpha_7(nt_{i-2}, nt_{i-1}, w_i)^{g_7}.$$

Speed-up of the Generalized Hierarchical Training. Evaluation is based on the syntactic model. The generalized hierarchical training (GHT) method achieves a nominal speed-up of about two orders of magnitude for both Switchboard and Broadcast News, and a real speed-up of 17-fold on Switchboard. Training the syntactic model for the Broadcast News subset is infeasible without GHT, and even with the speed-up it needs 480 CPU-hours per iteration.

Training Topic Models and Composite Models by Hierarchical Training and Divide-and-Conquer. Topic-dependent models can be trained by divide-and-conquer: partition the training data into parts, train on each part, and then collect the partial feature expectations. Divide-and-conquer can be used together with hierarchical training. Real training time for topic-dependent models: SWBD, from 21 CPU-hours to 0.5 CPU-hours; BN, from ~85 CPU-hours to 2.3 CPU-hours.
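A minimal sketch of the divide-and-conquer step, with a hypothetical model.expected_features interface (not the original code): feature expectations are additive over any partition of the training data, so each part can be processed independently and the partial expectations summed before the parameter update.

```python
from collections import Counter

def partial_expectations(part, model):
    """Feature expectations accumulated over one part of the training data
    (e.g. the conversations assigned to one topic). Each call is independent,
    so the parts can be processed on separate machines and the results merged."""
    exp = Counter()
    for history, count in part.items():
        # model.expected_features(history) is a hypothetical interface returning
        # {feature: expected feature value | history} under the current parameters.
        for feat, p in model.expected_features(history).items():
            exp[feat] += count * p
    return exp

def total_expectations(parts, model):
    total = Counter()
    for part in parts:               # embarrassingly parallel in practice
        total.update(partial_expectations(part, model))
    return total
```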

Simplifying the Computation of Probabilities When Using ME Models. Converting ME N-gram models to the ARPA back-off format: speed-up of 1000+ fold. Approximating the denominator for topic-dependent models: speed-up of 400+ fold, with almost the same speech recognition accuracy. Caching the most recently accessed histories: speed-up of 5-fold.
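A small sketch of the history-caching idea at rescoring time, with a hypothetical model interface (score and vocab, not the original API): N-best rescoring queries many words under the same few histories, so an LRU cache of denominators avoids recomputing Z for each query.

```python
from functools import lru_cache

class CachedMaxEntLM:
    def __init__(self, model):
        self.model = model            # assumed to expose score(w, *history) and vocab
        self._z = lru_cache(maxsize=4096)(self._denominator)

    def _denominator(self, history):
        # history must be a hashable tuple, e.g. (w2, w1, h2, h1, nt2, nt1, topic)
        return sum(self.model.score(w, *history) for w in self.model.vocab)

    def prob(self, word, history):
        return self.model.score(word, *history) / self._z(history)
```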

Concluding Remarks. Accuracy: ME models with non-local dependencies achieve lower WERs in speech recognition than the baseline back-off models, and are more accurate than the corresponding back-off or interpolated models with the same constraints. Flexibility: the ME method can combine constraints from different information sources in a single well-founded model, and ME models can be further extended by adding new features. Efficiency: the training and use of ME models can be done efficiently, and ME models are more compact than their interpolated counterparts.

Concluding Remarks (Cont.) Speech recognition performance has been improved by using ME models. The computational requirements for estimating a large class of maximum entropy models have been vastly reduced. The computation involved in using ME models is also reduced by three means: pre-normalization, denominator approximation and history caching (not discussed). Summary (speed-ups are nominal/real relative to unigram caching; for the trigram rows, PPL and WER are absolute baseline values, while the other rows show relative reductions):

Switchboard:
  Model      Nominal speed-up   Real speed-up   PPL     WER
  Trigram    170                30              79      38.5%
  Topic      400                50+             -7%     -0.7%
  Syntax     88                 17              -6%     -1.0%
  Composite  -                  -               -13%    -1.5%

Broadcast News:
  Model      Nominal speed-up   Real speed-up   PPL     WER
  Trigram    560                85              174     34.6%
  Topic      1400               -               -10%    -0.6%
  Syntax     150                -               -7%*    -0.7%*
  Composite  -                  -               -18%*   -1.2%*
  (*14M words of BN data.)

Future Work. Multi-lingual topic-dependent models. Syntactic parsing. Other applications: question answering, automatic summarization, information retrieval.

Acknowledgements. I thank my advisor Sanjeev Khudanpur for his valuable advice and generous help whenever it was needed; my dissertation work could not have been finished without it. I thank David Yarowsky and Fred Jelinek for their help during my Ph.D. program. I thank Radu Florian and David Yarowsky for their help with topic detection and data clustering; Ciprian Chelba, Frederick Jelinek and Peng Xu for providing the syntactic parser; Shankar Kumar, Veera Venkataramani, Dimitra Vergyri and Vlasios Doumpiotis for their help in generating N-best lists; and Andreas Stolcke for providing the SRI LM tools and the WS97 baseline training setup. This work was partially sponsored by an NSF STIMULATE grant.

Publications. Wu & Khudanpur. Building a Topic-Dependent Maximum Entropy Language Model for Very Large Corpora. ICASSP 2002. Wu & Khudanpur. Efficient Training Methods for Maximum Entropy Language Modeling. ICSLP 2000. Khudanpur & Wu. Maximum Entropy Techniques for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Modeling. Computer Speech and Language, 2000. Wu & Khudanpur. Syntactic Heads in Statistical Language Modeling. ICASSP 2000. Wu & Khudanpur. Combining Nonlocal, Syntactic and N-Gram Dependencies in Language Modeling. Eurospeech 1999. Khudanpur & Wu. A Maximum Entropy Language Model Integrating N-Grams and Topic Dependencies for Conversational Speech Recognition. ICASSP 1999. Brill & Wu. Classifier Combination for Improved Lexical Disambiguation. COLING-ACL 1998. Kim, Khudanpur & Wu. Smoothing Issues in the Structured Language Model. Eurospeech 2001.