Slide 1: Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies
Jun Wu
Advisor: Sanjeev Khudanpur
Department of Computer Science and Center for Language and Speech Processing
Johns Hopkins University, Baltimore, MD 21218
May 2001
NSF STIMULATE Grant No. IRI-9618874
Slide 2: Outline
- Motivation
- Semantic (topic) dependencies in natural language
- Syntactic dependencies in natural language
- ME models with topic and syntactic dependencies
- Training ME models in an efficient way (1 hour)
- Conclusion and future work
Slide 3: Outline
- Motivation
- Semantic (topic) dependencies in natural language
- Syntactic dependencies in natural language
- ME models with topic and syntactic dependencies
- Training ME models in an efficient way (5 mins)
- Conclusion and future work
Slides 4-6: Exploiting Semantic and Syntactic Dependencies
- N-gram models take only local correlations between words into account.
- Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency.
- We need a model that exploits topic and syntax.
Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."
Slide 7: The Maximum Entropy Principle
- The maximum entropy (ME) principle: when we make inferences based on incomplete information, we should choose the probability distribution that has the maximum entropy permitted by the information we do have.
- Example (dice): let p_i be the probability that the facet with i dots faces up. Seek the model p = (p_1, ..., p_6) that maximizes H(p) = -Σ_i p_i log p_i subject to Σ_i p_i = 1. From the Lagrangian, all p_i must be equal, so choose p_i = 1/6 (a worked version of this step follows this slide).
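The Lagrangian step is only gestured at on the slide; below is a minimal worked version of the standard derivation (reconstructed, not copied from the original deck):

```latex
% Maximize H(p) = -\sum_{i=1}^{6} p_i \log p_i  subject to  \sum_{i=1}^{6} p_i = 1.
\mathcal{L}(p,\mu) = -\sum_{i=1}^{6} p_i \log p_i \;+\; \mu\Big(\sum_{i=1}^{6} p_i - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \mu = 0
\;\Rightarrow\; p_i = e^{\mu-1}.
% The same value for every i, so the normalization constraint forces
\sum_{i=1}^{6} e^{\mu-1} = 1 \;\Rightarrow\; p_i = \tfrac{1}{6}.
```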
Slide 8: The Maximum Entropy Principle (Cont.)
- Example 2: seek the probability distribution p with maximum entropy, subject to constraints estimated from data (p~ denotes the empirical distribution).
- A feature: a binary function f(x) of the outcome.
- Empirical expectation: E_p~[f] = Σ_x p~(x) f(x).
- Maximize H(p) subject to E_p[f] = E_p~[f] (and Σ_x p(x) = 1).
- So the solution has the exponential form p(x) = e^{λ f(x)} / Z(λ), with λ chosen to satisfy the constraint.
Slide 9: Maximum Entropy Language Modeling
- Model P(w_i | w_{i-2}, w_{i-1}), where w_{i-2}, w_{i-1} is the N-gram history.
- For events seen in training, define a collection of binary features f_j(w_{i-2}, w_{i-1}, w_i).
- Obtain their target expectations from the training data.
- Find the model with maximum entropy among all models satisfying those expectation constraints.
- It can be shown that the solution has the exponential (log-linear) form
  P(w_i | w_{i-2}, w_{i-1}) = e^{Σ_j λ_j f_j(w_{i-2}, w_{i-1}, w_i)} / Z(w_{i-2}, w_{i-1})
  (a small numeric sketch of this form follows this slide).
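A minimal sketch of the exponential form above. The vocabulary, feature set and λ values are illustrative placeholders, not trained values; in practice the weights would come from GIS/IIS training as discussed later in the deck.

```python
import math

# Tiny illustrative vocabulary (placeholder).
VOCAB = ["the", "contract", "ended", "exchange"]

def features(w2, w1, w):
    """Binary features fired by the event (w2, w1, w):
    unigram, bigram and trigram indicators, as in the ME N-gram model."""
    return [("uni", w), ("bi", w1, w), ("tri", w2, w1, w)]

# Hypothetical weights lambda_j for a few features; unseen features get 0.
LAMBDA = {
    ("uni", "contract"): 0.4,
    ("bi", "the", "contract"): 1.1,
    ("tri", "ended", "the", "contract"): 0.3,
}

def p(w, w2, w1):
    """P(w | w2, w1) = exp(sum_j lambda_j f_j) / Z(w2, w1)."""
    def score(v):
        return math.exp(sum(LAMBDA.get(f, 0.0) for f in features(w2, w1, v)))
    z = sum(score(v) for v in VOCAB)   # normalization over the vocabulary
    return score(w) / z

print(p("contract", "ended", "the"))
```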
Slide 10: Advantages and Disadvantages of Maximum Entropy Language Modeling
- Advantages:
  - Creates a "smooth" model that satisfies all empirical constraints.
  - Incorporates various sources of information in a unified language model.
- Disadvantage:
  - Computational complexity of the parameter estimation procedure. (Solved!)
Slides 11-13: Training a Topic-Sensitive Model
- Cluster the training data by topic (a rough sketch follows this slide):
  - TF-IDF vectors (excluding stop words),
  - cosine similarity,
  - K-means clustering.
- Select topic-dependent words by comparing topic-conditional and overall frequencies:
  f_t(w) log [ f_t(w) / f(w) ] > threshold.
- Estimate an ME model with topic unigram constraints:
  P(w_i | w_{i-2}, w_{i-1}, topic) = e^{λ(w_i)} e^{λ(w_{i-1}, w_i)} e^{λ(w_{i-2}, w_{i-1}, w_i)} e^{λ(topic, w_i)} / Z(w_{i-2}, w_{i-1}, topic),
  where the topic unigram constraints are
  Σ_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i | topic) = #[w_i, topic] / #[topic].
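A rough sketch of the clustering and topic-word selection steps, assuming scikit-learn is available. The number of topics (70, matching the Switchboard table later in the deck), the threshold value, and the whitespace tokenization are placeholders, not the exact settings used in the thesis.

```python
import math
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def cluster_by_topic(docs, k=70):
    """Cluster training conversations into k topics.
    TfidfVectorizer L2-normalizes rows, so Euclidean K-means on these vectors
    approximates cosine-similarity clustering."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_, vectorizer


def topic_words(docs, labels, t, threshold=1e-4):
    """Select topic-dependent words for topic t by the slide's criterion:
    f_t(w) * log(f_t(w) / f(w)) > threshold."""
    all_counts = Counter(w for d in docs for w in d.split())
    t_counts = Counter(w for d, lbl in zip(docs, labels) if lbl == t for w in d.split())
    n_all, n_t = sum(all_counts.values()), sum(t_counts.values())
    return [w for w, c in t_counts.items()
            if (c / n_t) * math.log((c / n_t) / (all_counts[w] / n_all)) > threshold]
```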
Slides 14-16: Recognition Using a Topic-Sensitive Model
- Detect the current topic from the recognizer's N-best hypotheses vs. the reference transcriptions (a simple assignment scheme is sketched after this slide).
  - Using N-best hypotheses causes little degradation (in perplexity and WER).
- Assign a new topic for each conversation vs. each utterance.
  - Topic assignment for each utterance is better than topic assignment for the whole conversation.
- See Khudanpur and Wu (ICASSP '99) and Florian and Yarowsky (ACL '99) for details.
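One simple way to realize the topic-assignment step, offered as an illustration rather than the exact procedure of the thesis: build a TF-IDF vector from the concatenated N-best hypotheses of an utterance (or conversation) and pick the closest topic centroid (e.g., from the clustering sketch above) by cosine similarity, falling back to the topic-independent model when nothing is close. The minimum-similarity cutoff is a hypothetical knob.

```python
import numpy as np

def assign_topic(hyp_tfidf, centroids, min_sim=0.05):
    """hyp_tfidf: 1-D TF-IDF vector for the concatenated N-best hypotheses.
    centroids:   (num_topics, dim) matrix of topic-cluster centroids.
    Returns a topic id, or -1 for the topic-independent model when no centroid
    is similar enough (the slides note most utterances prefer that model)."""
    h = hyp_tfidf / (np.linalg.norm(hyp_tfidf) + 1e-12)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    sims = c @ h                      # cosine similarities to each topic
    best = int(np.argmax(sims))
    return best if sims[best] >= min_sim else -1
```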
Slide 17: Experimental Setup for Switchboard
- The experiments are based on the WS97 dev test set:
  - Vocabulary: 22K (closed).
  - LM training set: 1,100 conversations, 2.1M words.
  - AM training set: 60 hours of speech data.
  - Acoustic model: state-clustered cross-word triphone model.
  - Front end: 13 MF-PLP coefficients plus first and second differences (Δ, ΔΔ), with per-conversation-side cepstral mean subtraction (CMS).
  - Test set: 19 conversations (2 hours), 18K words.
  - No speaker adaptation.
- The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.
Slides 18-21: Topic Assignment During Testing: Reference Transcriptions vs. Hypotheses
- Even with a WER of over 38%, there is only a small loss in perplexity and a negligible loss in WER when topic assignment is based on recognizer hypotheses instead of the correct transcriptions.
- Comparisons with the oracle indicate that there is little room for further improvement.
Slides 22-23: Topic Assignment During Testing: Conversation Level vs. Utterance Level
- Topic assignment based on utterances gives slightly better results than assignment based on whole conversations.
- Most utterances prefer the topic-independent model.
- Fewer than half of the remaining utterances prefer a topic other than the one assigned at the conversation level.
Slide 24: ME Method vs. Interpolation
- The ME model with only topic-dependent unigram constraints outperforms the interpolated topic-dependent trigram model (a linear-interpolation baseline is sketched below for contrast).
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

Model sizes (number of parameters/constraints):
  3-gram:           499K
  + topic 1-grams:  +70 x 11K
  + topic 2-grams:  +70 x 26K
  + topic 3-grams:  +70 x 55K
  ME:               +16K
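For contrast with the ME combination, the kind of linear-interpolation baseline compared on this slide can be sketched as follows; the interpolation weight and the component model callables are placeholders, not values or interfaces from the thesis.

```python
def interpolated_prob(w, history, topic, p_trigram, p_topic_trigram, lam=0.3):
    """Topic-dependent trigram linearly interpolated with the topic-independent
    trigram: P(w | h, t) = lam * P_t(w | h) + (1 - lam) * P(w | h).
    The ME approach instead folds topic information into a single exponential
    model as extra marginal constraints, which the slide reports works better."""
    return lam * p_topic_trigram(w, history, topic) + (1 - lam) * p_trigram(w, history)
```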
Slides 25-26: Topic Model vs. Cache-Based Model
- The cache-based model reduces perplexity but increases WER.
- The cache-based model produces 0.6% more repeated errors than the trigram model does.
- The cache model may not be practical when the baseline WER is high.
Slide 27: Summary of Topic-Dependent Language Modeling
- We significantly reduce both perplexity (7%) and WER (0.7% absolute) by incorporating a small number of topic constraints with N-grams using the ME method.
- Using N-best hypotheses causes little degradation (in perplexity and WER).
- Topic assignment at the utterance level is better than at the conversation level.
- The ME method is more efficient than linear interpolation in combining topic dependencies with N-grams.
- The topic-dependent model is better than the cache-based model at reducing WER when the baseline is poor.
Slide 28: A Syntactic Parse and Syntactic Heads
[Figure: parse tree for "The contract ended with a loss of 7 cents after ...", with POS tags (DT NN VBD IN DT NN IN CD NNS ...), constituents (NP, PP, VP, S'), and the head word of each constituent ("contract", "ended", "with", "loss", "of", "cents") percolated up the tree.]
Slide 29: Exploiting Syntactic Dependencies
- All sentences in the training set are parsed by a left-to-right parser.
- A stack of parse trees T_i is generated for each sentence prefix.
[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", showing the two preceding words w_{i-2}, w_{i-1} and the two preceding head words h_{i-2} = (contract, NP), h_{i-1} = (ended, VP) with their non-terminal labels nt_{i-2}, nt_{i-1}.]
Slide 30: Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as:
  P(w_i | W_1^{i-1}) = Σ_{T_i ∈ S_i} P(w_i | W_1^{i-1}, T_i) · ρ(T_i | W_1^{i-1})
                     = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) · ρ(T_i | W_1^{i-1}),
  where S_i is the stack of parses for the prefix W_1^{i-1} and ρ(T_i | W_1^{i-1}) is the parser's weight for parse T_i.
[Figure: the same partial parse as on the previous slide.]
Slide 31: Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as on the previous slide (sketched below).
- It is assumed that most of the useful information is embedded in the two preceding words and the two preceding syntactic heads.
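A schematic of the probability assignment above. It assumes the parser exposes, for each prefix, a stack of candidate partial parses with weights ρ(T | W) and their exposed head words and non-terminal labels; the tuple layout and the `cond_model` callable are hypothetical interfaces, not the thesis code.

```python
def syntactic_word_prob(w, prefix_words, parse_stack, cond_model):
    """P(w | W_1^{i-1}) = sum over parses T of
       rho(T | W) * P(w | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}).
    parse_stack: list of (rho, (h2, h1), (nt2, nt1)) tuples from a left-to-right parser.
    cond_model:  callable implementing the conditional ME model."""
    w2, w1 = prefix_words[-2], prefix_words[-1]
    total = 0.0
    for rho, (h2, h1), (nt2, nt1) in parse_stack:
        total += rho * cond_model(w, w2, w1, h2, h1, nt2, nt1)
    return total
```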
Slide 32: Training a Syntactic ME Model
- Estimate an ME model with syntactic constraints:
  P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})
    = e^{λ(w_i)} e^{λ(w_{i-1}, w_i)} e^{λ(w_{i-2}, w_{i-1}, w_i)} e^{λ(h_{i-1}, w_i)} e^{λ(h_{i-2}, h_{i-1}, w_i)} e^{λ(nt_{i-1}, w_i)} e^{λ(nt_{i-2}, nt_{i-1}, w_i)} / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}),
  where the λ's are chosen so that the model matches the word, head-word and non-terminal N-gram marginal frequencies observed in the training data.
- See Chelba and Jelinek (ACL '98) and Wu and Khudanpur (ICASSP '00) for details.
Slides 33-35: Experimental Results of Syntactic LMs
- Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.
- Head word N-gram constraints result in a 6% reduction in perplexity and 0.8% absolute in WER.
- Non-terminal and head word constraints together reduce perplexity by 6.3% and WER by 1.0% absolute.
Slide 36: ME vs. Interpolation
- The ME model is more effective in using syntactic dependencies than the interpolated model.
Slide 37: Head Words inside vs. outside Trigram Range
[Figure: two example partial parses of "The contract ended with a loss of 7 cents after ...", contrasting a position where the preceding head words (e.g., "contract", "ended") fall outside the trigram window of the predicted word with a position where they coincide with words inside the trigram range.]
Slides 38-40: Syntactic Heads inside vs. outside Trigram Range
- The WER of the baseline trigram model is relatively high when the syntactic heads are beyond trigram range.
- Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.5%) than when they are within trigram range.
- However, non-terminal N-gram constraints help almost evenly in both cases.
  - Can this gain be obtained from a POS class model too?
- The WER reduction for the model with both head word and non-terminal constraints (1.4%) is larger than the overall reduction (1.0%) when head words are beyond trigram range.
Slide 41: Contrasting the Smoothing Effect of the NT Class LM vs. the POS Class LM
- An ME model with part-of-speech (POS) N-gram constraints is built analogously to the NT model, with the POS tags of the two preceding words in place of the non-terminal labels of the two preceding heads.
- The POS model reduces perplexity by 4% and WER by 0.5%.
- The overall gains from POS N-gram constraints are smaller than those from NT N-gram constraints.
- Syntactic analysis seems to perform better than just using the two previous word positions.
Slide 42: POS Class LM vs. NT Class LM
- When the syntactic heads are beyond trigram range, the trigram coverage of the test set is relatively low.
- The back-off effect of the POS N-gram constraints is effective in reducing WER in this case.
- NT N-gram constraints work in a similar manner. Overall, they are more effective, perhaps because they are linguistically more meaningful.
- Performance improves further when lexical head words are applied on top of the non-terminals.
Slide 43: Summary of Syntactic Language Modeling
- Syntactic heads in the language model are complementary to N-grams: the model improves significantly when the syntactic heads are beyond N-gram range.
- Head word constraints provide syntactic information; non-terminals mainly provide a smoothing effect.
- Non-terminals are linguistically more meaningful predictors than POS tags, and are therefore more effective in supplementing N-grams.
- The syntactic model reduces perplexity by 6.3% and WER by 1.0% (absolute).
Slides 44-45: Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
- Probabilities are assigned as:
  P(w_i | W_1^{i-1}) = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) · ρ(T_i | W_1^{i-1}).
- Only marginal constraints are necessary.
- The ME composite model is trained as (a compact sketch follows):
  P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)
    = e^{λ(w_i)} e^{λ(w_{i-1}, w_i)} e^{λ(w_{i-2}, w_{i-1}, w_i)} e^{λ(h_{i-1}, w_i)} e^{λ(h_{i-2}, h_{i-1}, w_i)} e^{λ(nt_{i-1}, w_i)} e^{λ(nt_{i-2}, nt_{i-1}, w_i)} e^{λ(topic, w_i)} / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic).
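A compact sketch of the composite conditional model above, mirroring the earlier N-gram example but with head-word, non-terminal and topic features added. Feature names and the weight dictionary are illustrative; trained λ values would come from the ME estimation described later.

```python
import math

def composite_prob(w, w2, w1, h2, h1, nt2, nt1, topic, vocab, lam):
    """Exponential model over the union of N-gram, head-word, non-terminal and
    topic features; lam maps feature tuples to (hypothetical) trained weights."""
    def feats(v):
        return [("uni", v), ("bi", w1, v), ("tri", w2, w1, v),
                ("hw1", h1, v), ("hw2", h2, h1, v),
                ("nt1", nt1, v), ("nt2", nt2, nt1, v),
                ("topic", topic, v)]
    def score(v):
        return math.exp(sum(lam.get(f, 0.0) for f in feats(v)))
    z = sum(score(v) for v in vocab)   # Z(w2, w1, h2, h1, nt2, nt1, topic)
    return score(w) / z
```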
Slides 46-48: Overall Experimental Results
- The baseline trigram WER is 38.5%.
- Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
- Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
- Topic-dependent and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.
Slide 49: Content Words vs. Stop Words
- The topic-sensitive model reduces WER by 1.4% on content words, which is twice the overall improvement (0.7%).
- The syntactic model improves WER more on stop words than on content words. Why?
  - Many content words do not have syntactic constraints.
- The composite model has the advantages of both models and reduces WER on content words more significantly (2.1%).
Slide 50: Head Words inside vs. outside Trigram Range
- The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
- The topic model helps when the trigram is inappropriate.
- The WER reduction for the syntactic model (1.4%) is larger than the overall reduction (1.0%) when head words are outside trigram range.
- The WER reduction for the composite model (2.2%) is larger than the overall reduction (1.5%) when head words are inside trigram range.
Slide 51: Further Insight into the Performance
- The composite model reduces the WER of content words by 2.6% absolute when the syntactic predicting information is beyond trigram range.
Slide 52: Training an ME Model
- Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
- Della Pietra et al. 1996: unigram caching and Improved Iterative Scaling (IIS).
- Wu and Khudanpur 2000: hierarchical training methods (a generic GIS update is sketched below for reference).
  - For N-gram models and many other models, the training time per iteration is strictly bounded by the same quantity that bounds the training of a back-off model.
  - A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
  - See Wu and Khudanpur (ICSLP 2000) for details.
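For orientation, here is a generic GIS update for a conditional ME model. This is the textbook algorithm of Darroch and Ratcliff, not the hierarchical speed-up described on this slide, and the dense loops over histories and the vocabulary are exactly the cost that the hierarchical methods avoid.

```python
import math

def gis_step(events, vocab, feats, lam, C):
    """One Generalized Iterative Scaling update for a conditional ME model.
    events: list of (history, word, count) tuples from training data; every
            training word must appear in vocab.
    feats:  feats(history, word) -> list of active binary feature names;
            GIS assumes every event activates exactly C features.
    lam:    dict feature -> current weight, updated in place."""
    observed, expected, hist_counts = {}, {}, {}
    for h, w, c in events:
        hist_counts[h] = hist_counts.get(h, 0.0) + c
        for f in feats(h, w):
            observed[f] = observed.get(f, 0.0) + c
    # Model expectations: sum over histories of count(h) * P(v | h) * f(h, v).
    for h, n in hist_counts.items():
        scores = {v: math.exp(sum(lam.get(f, 0.0) for f in feats(h, v))) for v in vocab}
        z = sum(scores.values())
        for v, s in scores.items():
            p = s / z
            for f in feats(h, v):
                expected[f] = expected.get(f, 0.0) + n * p
    # GIS update: lambda_f += (1/C) * log(observed_f / expected_f).
    for f, obs in observed.items():
        lam[f] = lam.get(f, 0.0) + (1.0 / C) * math.log(obs / expected[f])
```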
Slide 53: Experimental Setup for Broadcast News
- American English television broadcasts:
  - Vocabulary: open (>100K).
  - LM training set: 125K stories, 130M words.
  - AM training set: 70 hours of speech data.
  - Acoustic model: state-clustered cross-word triphone model.
  - Trigram model: T > 1, B > 2, 9.1M constraints.
  - Front end: 13 MFCC coefficients plus first and second differences (Δ, ΔΔ).
  - No speaker adaptation.
  - Test set: Hub-4 '96 dev-test set, 21K words.
- The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.
Slide 54: Number of Operations and Nominal Speed-Up
- Baseline: IIS with unigram caching (Della Pietra et al.).
- Nominal speed-up: the ratio of the number of operations per training iteration of the baseline to that of the hierarchical method.
- The hierarchical training methods achieve a nominal speed-up of:
  - two orders of magnitude for Switchboard, and
  - three orders of magnitude for Broadcast News.
Slide 55: Real Running Time
- The real speed-up is 15- to 30-fold for the Switchboard task:
  - 30x for the trigram model,
  - 25x for the topic model,
  - 15x for the composite model.
- This simplification of the training procedure makes it practical to implement ME models for large corpora:
  - 40 minutes for the trigram model,
  - 2.3 hours for the topic model.
Slides 56-57: More Experimental Results: Topic-Dependent Models for Broadcast News
- ME models are created for the Broadcast News corpus (130M words).
- The topic-dependent model reduces perplexity by 10% and WER by 0.6% absolute.
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

Model sizes (number of constraints):
  3-gram:           9.1M
  + topic 1-grams:  +100 x 64K
  + topic 2-grams:  +100 x 400K
  + topic 3-grams:  +100 x 600K
  ME:               +250K
Slides 58-59: Concluding Remarks
- Non-local and syntactic dependencies have been successfully integrated with N-grams, and their benefit has been demonstrated in speech recognition applications:
  - Switchboard: 13% reduction in perplexity, 1.5% (absolute) in WER. (Eurospeech '99 best student paper award.)
  - Broadcast News: 10% reduction in perplexity, 0.6% in WER. (Topic constraints only; syntactic constraints in progress.)
- The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models:
  - Nominal speed-up: 100-1000 fold.
  - "Real" speed-up: 15+ fold.
- A general-purpose toolkit for ME models is being developed for public release.
Slide 60: Publications
- Wu and Khudanpur, "Efficient Training Methods for Maximum Entropy Language Modeling," ICSLP 2000.
- Khudanpur and Wu, "Maximum Entropy Techniques for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Modeling," Computer Speech and Language, 2000.
- Wu and Khudanpur, "Syntactic Heads in Statistical Language Modeling," ICASSP 2000.
- Wu and Khudanpur, "Combining Nonlocal, Syntactic and N-Gram Dependencies in Language Modeling," Eurospeech 1999.
- Khudanpur and Wu, "A Maximum Entropy Language Model to Integrate N-Grams and Topic Dependencies for Conversational Speech Recognition," ICASSP 1999.
Slide 61: Other Published and Unpublished Papers
- Brill and Wu, "Classifier Combination for Improved Lexical Disambiguation," ACL 1998.
- Kim, Khudanpur and Wu, "Smoothing Issues in the Structured Language Model," Eurospeech 2001.
- Wu and Khudanpur, "Building the Topic-Dependent Maximum Entropy Model for Very Large Corpora," ICASSP 2002.
- Wu and Khudanpur, "Efficient Parameter Estimation Methods for Maximum Entropy Models," Computer Speech and Language or IEEE Transactions.
Slide 62: Acknowledgements
- I thank my advisor, Sanjeev Khudanpur, for leading me into this field and giving me valuable advice and help whenever necessary, and David Yarowsky for his generous help during my Ph.D. program.
- I thank Radu Florian and David Yarowsky for their help with topic detection and data clustering; Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) used in the Switchboard experiments reported here; and Shankar Kumar and Vlasios Doumpiotis for their help in generating N-best lists for the Broadcast News experiments.
- I thank all the people in the NLP lab and CLSP for their assistance in my thesis work.
- This work is supported by the National Science Foundation through STIMULATE grant IRI-9618874.