Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies
Jun Wu (Advisor: Sanjeev Khudanpur)
Department of Computer Science and Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD. May 2001.
NSF STIMULATE Grant No. IRI

Slide 2: Outline
- Motivation
- Semantic (topic) dependencies in natural language
- Syntactic dependencies in natural language
- ME models with topic and syntactic dependencies
- Training ME models in an efficient way (1 hour)
- Conclusion and future work

Slide 3: Outline
- Motivation
- Semantic (topic) dependencies in natural language
- Syntactic dependencies in natural language
- ME models with topic and syntactic dependencies
- Training ME models in an efficient way (5 mins)
- Conclusion and future work

Slide 4: Exploiting Semantic and Syntactic Dependencies
- N-gram models only take local correlation between words into account.
- Several dependencies in natural language with longer, sentence-structure-dependent spans may compensate for this deficiency.
- Need a model that exploits topic and syntax.
Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."

Slide 7: The Maximum Entropy Principle
- The maximum entropy (ME) principle: when we make inferences based on incomplete information, we should choose the probability distribution which has the maximum entropy permitted by the information we do have.
- Example (dice): let $p_i$ be the probability that the facet with $i$ dots faces up. Seek the model $p = (p_1, \dots, p_6)$ that maximizes the entropy $H(p)$ subject to $\sum_{i=1}^{6} p_i = 1$. From the Lagrangian, all $p_i$ must be equal, so choose $p_i = 1/6$.
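
A short worked version of the dice example, filling in the Lagrangian step the slide alludes to (the multiplier symbol $\mu$ is my notation, not the slide's):

\max_{p_1,\dots,p_6} \; H(p) = -\sum_{i=1}^{6} p_i \log p_i
\quad \text{s.t.} \quad \sum_{i=1}^{6} p_i = 1 .

% Lagrangian and stationarity condition:
\mathcal{L}(p,\mu) = -\sum_i p_i \log p_i + \mu\Big(\sum_i p_i - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \mu = 0
\;\Rightarrow\; p_i = e^{\mu - 1}.

% All p_i are equal; normalization then forces the uniform distribution p_i = 1/6.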

Slide 8: The Maximum Entropy Principle (Cont.)
- Example 2: seek a probability distribution $p$ that satisfies empirical constraints ($\tilde{p}$ is the empirical distribution).
- A feature $f$ is constrained to match its empirical expectation: $\sum_x p(x) f(x) = \sum_x \tilde{p}(x) f(x)$.
- Maximize $H(p)$ subject to this constraint; the solution has the exponential form $p(x) \propto e^{\lambda f(x)}$.
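
For completeness, a sketch of the constrained case in the same spirit (again, $\lambda$ and $\mu$ are my labels for the Lagrange multipliers):

\mathcal{L}(p,\lambda,\mu) = -\sum_x p(x)\log p(x)
  + \lambda\Big(\sum_x p(x) f(x) - \sum_x \tilde{p}(x) f(x)\Big)
  + \mu\Big(\sum_x p(x) - 1\Big)

% Setting \partial\mathcal{L}/\partial p(x) = 0 gives the exponential form:
p(x) = \frac{e^{\lambda f(x)}}{\sum_{x'} e^{\lambda f(x')}},
\qquad \text{with } \lambda \text{ chosen so that } \sum_x p(x) f(x) = \sum_x \tilde{p}(x) f(x).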

Slide 9: Maximum Entropy Language Modeling
- Model: $P(w \mid h) = \frac{1}{Z(h)} \exp\big(\sum_j \lambda_j f_j(h, w)\big)$, where $Z(h)$ normalizes over the vocabulary.
- For the chosen dependencies, define a collection of binary features $f_j(h, w)$.
- Obtain their target expectations from the training data.
- Find the $\lambda_j$ so that the model's feature expectations match these targets.
- It can be shown that this constrained ME solution is also the maximum likelihood model within the exponential family defined by the features.
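
For concreteness, a minimal sketch of how such a model assigns probabilities over a toy vocabulary; the feature templates, words and weights below are purely illustrative, not values from the thesis:

import math

VOCAB = ["the", "contract", "ended", "exchange"]

# Illustrative weights for binary features; keys are (feature_type, context_value, predicted_word).
LAMBDA = {
    ("unigram", None, "the"): 0.5,
    ("bigram", "the", "contract"): 1.2,
    ("topic", "FINANCE", "exchange"): 0.8,
}

def active_features(prev_word, topic, w):
    """Binary features that fire for this (history, word) pair."""
    return [("unigram", None, w), ("bigram", prev_word, w), ("topic", topic, w)]

def p_word(prev_word, topic, w):
    """P(w | history) = exp(sum of active lambdas) / Z(history)."""
    def score(v):
        return math.exp(sum(LAMBDA.get(f, 0.0) for f in active_features(prev_word, topic, v)))
    z = sum(score(v) for v in VOCAB)
    return score(w) / z

print(p_word("the", "FINANCE", "contract"))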

Slide 10: Advantages and Disadvantage of Maximum Entropy Language Modeling
- Advantages:
  - Creates a "smooth" model that satisfies all empirical constraints.
  - Incorporates various sources of information in a unified language model.
- Disadvantage:
  - Computational complexity of the parameter estimation procedure (solved!).

Slide 11: Training a Topic-Sensitive Model
- Cluster the training data by topic:
  - TF-IDF vectors (excluding stop words),
  - cosine similarity,
  - K-means clustering.
- Select topic-dependent words: $f_t(w) \log \frac{f_t(w)}{f(w)} > \text{threshold}$.
- Estimate an ME model with topic unigram constraints:
  $P(w_i \mid w_{i-1}, w_{i-2}, \text{topic}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-1}, w_{i-2}, \text{topic})}$,
  where $\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid \text{topic}) = \frac{\#[\text{topic}, w_i]}{\#[\text{topic}]}$.
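
A rough sketch of the clustering step, assuming scikit-learn is available; the library choice is mine, and n_topics=70 simply follows the "+70*" factors in the model-size comparison on the ME vs. interpolation slide below:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def cluster_by_topic(documents, n_topics=70):
    """Cluster training conversations into topics: TF-IDF (stop words excluded) + cosine-style K-means."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    # L2-normalizing the vectors makes Euclidean K-means behave like clustering
    # under cosine similarity (a spherical K-means approximation).
    tfidf = normalize(tfidf)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    labels = km.fit_predict(tfidf)
    return labels, km.cluster_centers_, vectorizer

# labels, centroids, vec = cluster_by_topic(training_conversations)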

Slide 14: Recognition Using a Topic-Sensitive Model
- Detect the current topic from:
  - the recognizer's N-best hypotheses vs. the reference transcriptions;
    - using N-best hypotheses causes little degradation (in perplexity and WER).
- Assign a new topic for each:
  - conversation vs. utterance;
    - topic assignment for each utterance is better than topic assignment for the whole conversation.
- See Khudanpur and Wu (ICASSP'99) and Florian and Yarowsky (ACL'99) for details.
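
One plausible way to implement the topic-assignment step, reusing the TF-IDF vectorizer and centroids from the clustering sketch above; the fallback threshold is an illustrative parameter, not a value from the thesis:

import numpy as np
from sklearn.preprocessing import normalize

def assign_topic(text, vectorizer, centroids, min_similarity=0.05):
    """Pick the closest topic centroid for an utterance (or its N-best hypothesis text);
    return None to fall back to the topic-independent model."""
    v = normalize(vectorizer.transform([text]))
    sims = np.asarray(v.dot(normalize(centroids).T)).ravel()  # cosine similarities
    best = int(np.argmax(sims))
    return best if sims[best] >= min_similarity else None

# topic = assign_topic(" ".join(nbest_hypotheses), vec, centroids)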

Slide 17: Experimental Setup for Switchboard
- The experiments are based on the WS97 dev test set:
  - Vocabulary: 22K (closed),
  - LM training set: 1100 conversations, 2.1M words,
  - AM training set: 60 hours of speech data,
  - Acoustic model: state-clustered cross-word triphone model,
  - Front end: 13 MF-PLP coefficients plus first and second differences, per-conversation-side CMS,
  - Test set: 19 conversations (2 hours), 18K words,
  - No speaker adaptation.
- The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

Slide 18: Topic Assignment During Testing: Reference Transcriptions vs. Hypotheses
- Even with a WER of over 38%, there is only a small loss in perplexity and a negligible loss in WER when topic assignment is based on recognizer hypotheses instead of the correct transcriptions.
- Comparisons with the oracle indicate that there is little room for further improvement.

Slide 22: Topic Assignment During Testing: Conversation Level vs. Utterance Level
- Topic assignment based on utterances gives slightly better results than assignment based on whole conversations.
- Most utterances prefer the topic-independent model.
- Less than half of the remaining utterances prefer a topic other than the one assigned at the conversation level.

Slide 24: ME Method vs. Interpolation
- The ME model with only topic-dependent unigram constraints outperforms the interpolated topic-dependent trigram model.
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

  Model: 3-gram | +topic 1-gram | +topic 2-gram | +topic 3-gram | ME
  Size:  499K   | +70*11K       | +70*26K       | +70*55K       | +16K

Slide 25: Topic Model vs. Cache-Based Model
- The cache-based model reduces perplexity but increases WER.
- The cache-based model introduces 0.6% more repeated errors than the trigram model does.
- The cache model may not be practical when the baseline WER is high.
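
For reference, a minimal sketch of the kind of unigram cache model being compared against here, interpolated with a base LM; the interpolation weight and the base-model interface (base_lm.prob) are illustrative assumptions, not the setup used in the experiments:

from collections import Counter

class CacheLM:
    """Unigram cache: boosts words already seen in the current conversation."""
    def __init__(self, base_lm, weight=0.1):
        self.base_lm = base_lm      # any object with prob(word, history) -> float
        self.weight = weight        # interpolation weight for the cache component
        self.cache = Counter()

    def observe(self, word):
        """Add a recognized (possibly erroneous) word to the cache."""
        self.cache[word] += 1

    def prob(self, word, history):
        cache_total = sum(self.cache.values())
        p_cache = self.cache[word] / cache_total if cache_total else 0.0
        return (1 - self.weight) * self.base_lm.prob(word, history) + self.weight * p_cache

Because the cache is filled with the recognizer's own (possibly wrong) output, misrecognized words get boosted and tend to recur, which is the repeated-error effect noted above.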

Slide 27: Summary of Topic-Dependent Language Modeling
- We significantly reduce both perplexity (7%) and WER (0.7% absolute) by incorporating a small number of topic constraints with N-grams using the ME method.
- Using N-best hypotheses causes little degradation (in perplexity and WER).
- Topic assignment at the utterance level is better than at the conversation level.
- The ME method is more effective than linear interpolation at combining topic dependencies with N-grams.
- The topic-dependent model is better than the cache-based model at reducing WER when the baseline is poor.

Slide 28: A Syntactic Parse and Syntactic Heads
[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", with POS tags DT NN VBD IN DT NN IN CD NNS, constituents NP, PP, NP, VP, S', and head words (cents, of, loss, with, ended, contract) percolated up the tree.]

Slide 29: Exploiting Syntactic Dependencies
- A stack of parse trees $T_i$ is generated for each sentence prefix.
- All sentences in the training set are parsed by a left-to-right parser.
[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", showing the two preceding head words $h_{i-2}$ = contract (NP) and $h_{i-1}$ = ended (VP), the two preceding words $w_{i-2}, w_{i-1}$, the word being predicted $w_i$, and the exposed non-terminals $nt_{i-2}, nt_{i-1}$.]

Slide 30: Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as:
  $P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_1^{i-1}, T_i)\, \rho(T_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})\, \rho(T_i \mid W_1^{i-1})$
[Figure: the same partial parse as above, highlighting $w_{i-2}, w_{i-1}$, $h_{i-2}, h_{i-1}$ and $nt_{i-2}, nt_{i-1}$.]

Slide 31: Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as:
  $P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_1^{i-1}, T_i)\, \rho(T_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})\, \rho(T_i \mid W_1^{i-1})$
- It is assumed that most of the useful information is embedded in the two preceding words and the two preceding heads.
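
A toy sketch of this weighted sum over the parser's stack of partial parses, assuming each stack entry carries its conditioning context and weight rho (the names and interface are illustrative):

def word_prob(word, parse_stack, cond_model):
    """P(w_i | W_1^{i-1}) as a rho-weighted mixture over partial parses T_i in the stack.

    parse_stack: list of (context, rho) pairs, where context packs
                 (w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})
                 and the rho values sum to 1 over the stack.
    cond_model:  callable giving P(word | context), e.g. the ME model above.
    """
    return sum(rho * cond_model(word, context) for context, rho in parse_stack)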

Slide 32: Training a Syntactic ME Model
- Estimate an ME model with syntactic (head-word and non-terminal) constraints:
  $P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(h_{i-1}, w_i)}\, e^{\lambda(h_{i-2}, h_{i-1}, w_i)}\, e^{\lambda(nt_{i-1}, w_i)}\, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})}$,
  where each $\lambda$ is chosen so that the corresponding marginal (N-gram, head-word N-gram, or non-terminal N-gram) matches its count in the parsed training data.
- See Chelba and Jelinek (ACL'98) and Wu and Khudanpur (ICASSP'00) for details.

Slide 33: Experimental Results of Syntactic LMs
- Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.
- Head-word N-gram constraints result in a reduction of 0.6% in perplexity and 0.8% absolute in WER.
- Non-terminal and head-word constraints together reduce perplexity by 6.3% and WER by 1.0% absolute.

Slide 35: Experimental Results of Syntactic LMs
- Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.
- Head-word N-gram constraints result in a 6% reduction in perplexity and 0.8% absolute in WER.
- Non-terminal and head-word constraints together reduce perplexity by 6.3% and WER by 1.0% absolute.

Slide 36: ME vs. Interpolation
- The ME model is more effective at using syntactic dependencies than the interpolation model.

Slide 37: Head Words inside vs. outside Trigram Range
[Figure: two partial parses of "The contract ended with a loss of 7 cents ...". In the first, the head words (contract, ended) lie outside the trigram window of the predicted word; in the second, the head words coincide with the two preceding words and therefore fall inside trigram range.]

Slide 38: Syntactic Heads inside vs. outside Trigram Range
- The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.
- Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.5%) than when they are within it.
- However, non-terminal N-gram constraints help almost evenly in both cases.
  - Can this gain be obtained from a POS class model too?
- The WER reduction for the model with both head-word and non-terminal constraints (1.4%) is larger than its overall reduction (1.0%) when head words are beyond trigram range.

Slide 41: Contrasting the Smoothing Effect of an NT Class LM vs. a POS Class LM
- An ME model with part-of-speech (POS) N-gram constraints is built analogously to the NT model:
  $P(w_i \mid w_{i-2}, w_{i-1}, pos_{i-2}, pos_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(pos_{i-1}, w_i)}\, e^{\lambda(pos_{i-2}, pos_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, pos_{i-2}, pos_{i-1})}$
- The POS model reduces PPL by 4% and WER by 0.5%.
- The overall gains from POS N-gram constraints are smaller than those from NT N-gram constraints.
- Syntactic analysis seems to perform better than just using the two previous word positions.

Slide 42: POS Class LM vs. NT Class LM
- When the syntactic heads are beyond trigram range, trigram coverage in the test set is relatively low.
- The back-off effect of the POS N-gram constraints is effective in reducing WER in this case.
- NT N-gram constraints work in a similar manner; overall they are more effective, perhaps because they are linguistically more meaningful.
- Performance improves further when lexical head words are applied on top of the non-terminals.

Slide 43: Summary of Syntactic Language Modeling
- Syntactic heads in the language model are complementary to N-grams: the model improves significantly when the syntactic heads are beyond N-gram range.
- Head-word constraints provide syntactic information; non-terminals mainly provide a smoothing effect.
- Non-terminals are linguistically more meaningful predictors than POS tags, and are therefore more effective at supplementing N-grams.
- The syntactic model reduces perplexity by 6.3% and WER by 1.0% (absolute).

Slide 44: Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
- Probabilities are assigned as:
  $P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})\, \rho(T_i \mid W_1^{i-1})$
- Only marginal constraints are necessary.
- The ME composite model is trained as:
  $P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(h_{i-1}, w_i)}\, e^{\lambda(h_{i-2}, h_{i-1}, w_i)}\, e^{\lambda(nt_{i-1}, w_i)}\, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}\, e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})}$
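
A sketch of the corresponding feature templates, to make explicit which marginals the composite model constrains; this is a plain illustration of the formula above, not code from the thesis:

def composite_features(ctx, w):
    """Active binary features for the composite ME model.

    ctx is a dict with keys w1, w2 (previous two words), h1, h2 (previous two
    head words), nt1, nt2 (exposed non-terminals), and topic.
    Each returned tuple names one feature whose marginal count is constrained.
    """
    return [
        ("w", w),
        ("w_bigram", ctx["w1"], w),
        ("w_trigram", ctx["w2"], ctx["w1"], w),
        ("h_bigram", ctx["h1"], w),
        ("h_trigram", ctx["h2"], ctx["h1"], w),
        ("nt_bigram", ctx["nt1"], w),
        ("nt_trigram", ctx["nt2"], ctx["nt1"], w),
        ("topic_unigram", ctx["topic"], w),
    ]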

Slide 46: Overall Experimental Results
- Baseline trigram WER is 38.5%.
- Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
- Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
- Topic-dependent and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.

Slide 49: Content Words vs. Stop Words
- The topic-sensitive model reduces WER by 1.4% on content words, twice the overall improvement (0.7%).
- The syntactic model improves WER more on stop words than on content words. Why?
  - Many content words do not have syntactic constraints.
- The composite model has the advantages of both models and reduces WER on content words more significantly (2.1%).

Slide 50: Head Words inside vs. outside Trigram Range
- The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
- The topic model helps when the trigram is inappropriate.
- The WER reduction for the syntactic model (1.4%) exceeds its overall reduction (1.0%) when head words are outside trigram range.
- The WER reduction for the composite model (2.2%) exceeds its overall reduction (1.5%) when head words are inside trigram range.

Slide 51: Further Insight into the Performance
- The composite model reduces the WER of content words by 2.6% absolute when the syntactic predicting information is beyond trigram range.

Slide 52: Training an ME Model
- Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
- Della Pietra et al. 1996: unigram caching and Improved Iterative Scaling (IIS).
- Wu and Khudanpur 2000: hierarchical training methods.
  - For N-gram models and many other models, the training time per iteration is strictly bounded by a quantity of the same order as the cost of training a back-off model.
  - A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
  - See Wu and Khudanpur (ICSLP 2000) for details.
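
To make the baseline algorithm concrete, here is a toy Generalized Iterative Scaling loop for a conditional ME model over a small event space; this is my own illustration of GIS, not the hierarchical method of the thesis, and the data and feature set are invented for the example:

import math
from collections import defaultdict

def features(h, w):
    """Binary feature names that fire for (history word h, predicted word w)."""
    return [("unigram", w), ("bigram", h, w)]

def train_gis(data, vocab, iterations=200):
    """Toy Generalized Iterative Scaling for a conditional ME model P(w | h).

    data:  list of observed (h, w) pairs.
    vocab: list of candidate words.
    GIS needs a constant number of active features per event; here every
    event fires exactly two features, so the GIS constant is C = 2 and no
    slack feature is needed.
    """
    C = 2
    lam = defaultdict(float)

    # Empirical feature counts over the training data.
    emp = defaultdict(float)
    for h, w in data:
        for f in features(h, w):
            emp[f] += 1.0

    def p_cond(h):
        scores = {w: math.exp(sum(lam[f] for f in features(h, w))) for w in vocab}
        z = sum(scores.values())
        return {w: s / z for w, s in scores.items()}

    for _ in range(iterations):
        # Model feature expectations, with histories weighted by their
        # empirical frequency (each training token contributes its history).
        model = defaultdict(float)
        for h, _ in data:
            p = p_cond(h)
            for w in vocab:
                for f in features(h, w):
                    model[f] += p[w]
        # Multiplicative GIS update, applied only to features seen in training.
        for f in emp:
            lam[f] += (1.0 / C) * math.log(emp[f] / model[f])
    return lam

lam = train_gis([("the", "contract"), ("the", "exchange"), ("a", "loss")],
                vocab=["contract", "exchange", "loss"])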

Slide 53: Experimental Setup for Broadcast News
- American English television broadcasts:
  - Vocabulary: open (>100K),
  - LM training set: 125K stories, 130M words,
  - AM training set: 70 hours of speech data,
  - Acoustic model: state-clustered cross-word triphone model,
  - Trigram model: T>1, B>2, 9.1M constraints,
  - Front end: 13 MFCC coefficients plus first and second differences,
  - No speaker adaptation,
  - Test set: Hub-4 96 dev-test set, 21K words.
- The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

Slide 54: Number of Operations and Nominal Speed-up
- Baseline: IIS with unigram caching (Della Pietra et al.).
- Nominal speed-up: the reduction in the number of operations per training iteration relative to the baseline.
- The hierarchical training methods achieve a nominal speed-up of:
  - two orders of magnitude for Switchboard, and
  - three orders of magnitude for Broadcast News.

Slide 55: Real Running Time
- The real speed-up is 15- to 30-fold for the Switchboard task:
  - 30x for the trigram model,
  - 25x for the topic model,
  - 15x for the composite model.
- This simplification of the training procedure makes it practical to implement ME models for large corpora:
  - 40 minutes for the trigram model,
  - 2.3 hours for the topic model.

Slide 56: More Experimental Results: Topic-Dependent Models for Broadcast News
- ME models are created for the Broadcast News corpus (130M words).
- The topic-dependent model reduces perplexity by 10% and WER by 0.6% absolute.
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

  Model: 3-gram | +topic 1-gram | +topic 2-gram | +topic 3-gram | ME
  Size:  9.1M   | +100*64K      | +100*400K     | +100*600K     | +250K

Slide 58: Concluding Remarks
- Non-local and syntactic dependencies have been successfully integrated with N-grams, and their benefit has been demonstrated in speech recognition applications:
  - Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER (Eurospeech '99 best student paper award),
  - Broadcast News: 10% reduction in PPL, 0.6% in WER (topic constraints only; syntactic constraints in progress).
- The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models:
  - Nominal speed-up: two to three orders of magnitude,
  - "Real" speed-up: 15+ fold.
- A general-purpose toolkit for ME models is being developed for public release.

Slide 60: Publications
- Wu and Khudanpur, "Efficient Training Methods for Maximum Entropy Language Modeling," ICSLP 2000.
- Khudanpur and Wu, "Maximum Entropy Techniques for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Modeling," Computer Speech and Language, 2000.
- Wu and Khudanpur, "Syntactic Heads in Statistical Language Modeling," ICASSP 2000.
- Wu and Khudanpur, "Combining Nonlocal, Syntactic and N-Gram Dependencies in Language Modeling," Eurospeech 1999.
- Khudanpur and Wu, "A Maximum Entropy Language Model to Integrate N-Grams and Topic Dependencies for Conversational Speech Recognition," ICASSP 1999.

Slide 61: Other Published and Unpublished Papers
- Brill and Wu, "Classifier Combination for Improved Lexical Disambiguation," ACL.
- Kim, Khudanpur and Wu, "Smoothing Issues in the Structured Language Model," Eurospeech.
- Wu and Khudanpur, "Building the Topic-Dependent Maximum Entropy Model for Very Large Corpora," ICASSP.
- Wu and Khudanpur, "Efficient Parameter Estimation Methods for Maximum Entropy Models," CSL or IEEE Transactions.

Slide 62: Acknowledgement
- I thank my advisor Sanjeev Khudanpur for leading me to this field and giving me valuable advice and help whenever necessary, and David Yarowsky for his generous help during my Ph.D. program.
- I thank Radu Florian and David Yarowsky for their help with topic detection and data clustering, Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the SWBD experimental results reported here, and Shankar Kumar and Vlasios Doumpiotis for their help in generating N-best lists for the BN experiments.
- I thank everyone in the NLP lab and CLSP for their assistance with my thesis work.
- This work is supported by the National Science Foundation through a STIMULATE grant (IRI ).