
Slide 1: Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Jun Wu
Advisor: Sanjeev Khudanpur
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218
April 2001
NSF STIMULATE Grant No. IRI-9618874
Center for Language and Speech Processing, The Johns Hopkins University

Slide 2: Outline

- Language modeling in speech recognition
- The maximum entropy (ME) principle
- Semantic (topic) dependencies in natural language
- Syntactic dependencies in natural language
- ME models with topic and syntactic dependencies
- Conclusion and future work
- Topic assignment during test (15 min)
- Role of the syntactic head (15 min)
- Training ME models in an efficient way (1 hour)


Slide 5: Motivation

Example: "A research team led by two Johns Hopkins scientists ___ found the strongest evidence yet that a virus may ..."
- have
- has
- his

A trigram model conditions only on the immediately preceding words ("Hopkins scientists ___"), yet the correct choice, "has", is governed by the distant subject head "team".

Slide 6: Language Models in Speech Recognition

- Role of language models
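The equation behind this bullet did not survive the transcript. As a sketch, the standard decomposition that such a slide typically illustrates, with the language model supplying the prior P(W) over word strings W given acoustic evidence A:

  \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W),

where P(A \mid W) is the acoustic model and P(W) is the language model.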

Slide 7: Language Modeling in Speech Recognition

- N-gram models
- In practice, N = 1, 2, 3, or 4. Even these values of N pose a data-sparseness problem: a trigram model has a vast number of free parameters, and there are millions of unseen bigrams and billions of unseen trigrams for which we need an estimate of the probability.
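The N-gram equations on this slide were lost in extraction. As a sketch, the standard approximation, with an illustrative parameter count based on the 22K Switchboard vocabulary quoted later (the exact figure on the original slide is not recoverable):

  P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1}).

For N = 3 and |V| = 22{,}000, a trigram model has on the order of |V|^3 \approx 10^{13} free parameters.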

Slide 8: Smoothing Techniques

- Relative frequency estimates
- Deleted interpolation: Jelinek et al., 1980
- Back-off: Katz 1987; Witten-Bell 1990; Ney et al., 1994
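A minimal sketch of deleted interpolation for a trigram model, assuming precomputed count tables and fixed interpolation weights; in practice the weights are estimated on held-out data, and all names below are illustrative rather than taken from the thesis.

```python
from collections import Counter

def interpolated_trigram_prob(w1, w2, w3, unigrams, bigrams, trigrams,
                              vocab_size, lambdas=(0.6, 0.3, 0.1)):
    """Deleted-interpolation estimate of P(w3 | w1, w2).

    Mixes relative-frequency trigram, bigram, and unigram estimates with
    fixed weights; real systems tune the weights on held-out data.
    """
    l3, l2, l1 = lambdas
    # Relative-frequency (maximum-likelihood) estimates at each order.
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    total = sum(unigrams.values())
    p1 = unigrams[w3] / total if total else 1.0 / vocab_size
    return l3 * p3 + l2 * p2 + l1 * p1

# Toy usage: counts gathered from a tiny corpus.
corpus = "the contract ended with a loss of seven cents".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
print(interpolated_trigram_prob("contract", "ended", "with",
                                unigrams, bigrams, trigrams,
                                vocab_size=len(unigrams)))
```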

Slide 9: Measuring the Quality of Language Models

- Word error rate (WER):
  Reference:  The contract ended with a loss of *** seven cents.
  Hypothesis: A contract ended with * loss of some even cents.
  Scores:     S C C C D C C I S C
- Perplexity:
  Perplexity measures the average number of words that can follow a given history under a language model:
  H(P_L) = - \sum_W P(W) \log_2 P_L(W), \qquad PPL = 2^{H(P_L)}
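A minimal sketch of how perplexity is computed from per-word model probabilities on a test set, assuming some conditional model `lm_prob(history, word)`; the helper names are illustrative.

```python
import math

def perplexity(test_words, lm_prob):
    """Perplexity = 2 ** (average negative log2 probability per word).

    `lm_prob(history, word)` is any conditional language model; here the
    history is simply the list of preceding words in the test sequence.
    """
    log2_sum = 0.0
    for i, w in enumerate(test_words):
        p = lm_prob(test_words[:i], w)
        log2_sum += -math.log2(p)
    return 2.0 ** (log2_sum / len(test_words))

# Toy example: a uniform model over a 79-word vocabulary gives PPL = 79.
uniform = lambda history, word: 1.0 / 79
print(perplexity("the contract ended with a loss".split(), uniform))  # -> 79.0
```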


Slide 11: Experimental Setup for Switchboard

- American English conversations over the telephone.
  - Vocabulary: 22K (closed).
  - LM training set: 1,100 conversations, 2.1M words.
- Test set: WS97 dev-test set.
  - 19 conversations (2 hours), 18K words.
  - PPL = 79 (back-off trigram model).
  - State-of-the-art systems: 30-35% WER.
- Evaluation: 100-best list rescoring.
  [Diagram: Speech -> Recognizer (baseline LM) -> 100-best hypotheses -> Rescoring (new LM) -> 1 hypothesis]
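A minimal sketch of the 100-best rescoring step shown in the diagram, assuming each hypothesis carries a first-pass acoustic log-score and the new LM supplies a log-probability; the function names, LM scale and insertion penalty are illustrative, not taken from the thesis.

```python
def rescore_nbest(nbest, new_lm_logprob, lm_weight=12.0, wip=0.0):
    """Pick the best hypothesis after replacing the first-pass LM score.

    nbest: list of (words, acoustic_logprob) pairs from the recognizer.
    new_lm_logprob: function mapping a word list to its log-probability
                    under the new (e.g. maximum-entropy) language model.
    lm_weight, wip: usual language-model scale and word-insertion penalty.
    """
    def total_score(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * new_lm_logprob(words) + wip * len(words)
    return max(nbest, key=total_score)

# Toy usage with a stand-in LM that simply prefers shorter hypotheses.
toy_lm = lambda words: -1.5 * len(words)
nbest = [("the contract ended with a loss".split(), -120.0),
         ("a contract ended with loss of some even cents".split(), -118.0)]
best_words, _ = rescore_nbest(nbest, toy_lm)
print(" ".join(best_words))
```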

Slide 12: Experimental Setup for Broadcast News

- American English television broadcasts.
  - Vocabulary: open (>100K).
  - LM training set: 125K stories, 130M words.
- Test set: Hub-4 96 dev-test set.
  - 21K words.
  - PPL = 174 (back-off trigram model).
  - State-of-the-art systems: 25% WER.
- Evaluation: rescoring 100-best lists from the first-pass speech recognizer.

Slide 13: The Maximum Entropy Principle

- The maximum entropy (ME) principle: when we make inferences based on incomplete information, we should choose the probability distribution which has the maximum entropy permitted by the information we do have.
- Example (dice): let p_i be the probability that the facet with i dots faces up. Seek the model p that maximizes H(p) subject to the available constraints (here, only normalization). Solving the Lagrangian gives the uniform assignment, so choose p_i = 1/6.
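The formulas on this slide were lost in the transcript; the standard derivation for the dice example, with only the normalization constraint, runs as follows:

  Maximize H(p) = -\sum_{i=1}^{6} p_i \log p_i subject to \sum_{i=1}^{6} p_i = 1.
  Lagrangian: L(p, \mu) = -\sum_i p_i \log p_i + \mu\big(\sum_i p_i - 1\big), \qquad \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \mu = 0 \;\Rightarrow\; p_i = e^{\mu - 1}, the same for every i.
  Imposing normalization gives p_i = 1/6 for i = 1, \dots, 6.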

Slide 14: The Maximum Entropy Principle (Cont.)

- Example 2: seek the probability distribution p subject to constraints on feature expectations (\tilde{p} is the empirical distribution).
- The feature: a binary function f of the outcome.
- Empirical expectation: E_{\tilde{p}}[f] = \sum_x \tilde{p}(x) f(x).
- Maximize H(p) subject to E_p[f] = E_{\tilde{p}}[f], which yields an exponential-form solution.
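The slide's own equations did not survive; a sketch of the general statement from the standard maximum entropy argument, which forces the exponential form used on the following slides:

  maximize H(p) = -\sum_x p(x)\log p(x) \quad subject to \quad \sum_x p(x) f(x) = \sum_x \tilde{p}(x) f(x), \;\; \sum_x p(x) = 1
  \Rightarrow \quad p(x) = \frac{1}{Z(\lambda)}\, e^{\lambda f(x)}, \qquad Z(\lambda) = \sum_x e^{\lambda f(x)},

with \lambda chosen so that the expectation constraint is satisfied.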

Slide 15: Maximum Entropy Language Modeling

- Use short-hand notation for the predicted word w_i and its history (w_{i-2}, w_{i-1}).
- For words u, v, w, define a collection of binary features, e.g. f_{u,v,w}(w_{i-2}, w_{i-1}, w_i) = 1 if (w_{i-2}, w_{i-1}, w_i) = (u, v, w), and 0 otherwise.
- Obtain their target expectations from the training data.
- Find the model of maximum entropy among all models that satisfy these expectation constraints.
- It can be shown that the solution has the exponential form
  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{e^{\lambda(w_i)} \, e^{\lambda(w_{i-1}, w_i)} \, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})}.
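A minimal sketch of evaluating such a conditional ME model once the feature weights are known, assuming dictionaries of unigram, bigram and trigram weights in which absent features contribute zero; all names are illustrative.

```python
import math

def me_trigram_prob(w2, w1, w, vocab, lam1, lam2, lam3):
    """P(w | w2, w1) = exp(sum of active feature weights) / Z(w2, w1).

    lam1[w], lam2[(w1, w)], lam3[(w2, w1, w)] hold the lambdas for
    unigram, bigram, and trigram features; absent features count as 0.
    """
    def score(v):
        return (lam1.get(v, 0.0)
                + lam2.get((w1, v), 0.0)
                + lam3.get((w2, w1, v), 0.0))
    z = sum(math.exp(score(v)) for v in vocab)   # normalization over the vocabulary
    return math.exp(score(w)) / z

# Toy usage with a three-word vocabulary and hand-picked weights.
vocab = ["have", "has", "his"]
lam1 = {"has": 0.2}
lam2 = {("team", "has"): 1.0}
lam3 = {}
print(me_trigram_prob("research", "team", "has", vocab, lam1, lam2, lam3))
```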

Slide 16: Advantages and Disadvantage of Maximum Entropy Language Modeling

- Advantages:
  - Creates a "smooth" model that satisfies all empirical constraints.
  - Incorporates various sources of information in a unified language model.
- Disadvantage:
  - Computational complexity of the model parameter estimation procedure.

Slide 17: Training an ME Model

- Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
- Della Pietra et al. 1996: unigram caching and Improved Iterative Scaling (IIS).
- Wu and Khudanpur 2000: hierarchical training methods.
  - For N-gram models and many other models, the training time per iteration is strictly bounded by the same quantity as that of training a back-off model.
  - A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
  - See Wu and Khudanpur, ICSLP 2000, for details.
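A minimal sketch of Generalized Iterative Scaling on a toy unconditional distribution, with a slack feature added so that every outcome has the same total feature count (a GIS requirement). This only illustrates the update rule; the hierarchical methods above concern computing the model expectations efficiently at language-model scale.

```python
import math

def gis(points, features, empirical, iterations=200):
    """Fit p(x) proportional to exp(sum_i lambda_i * f_i(x)) so that E_p[f_i]
    matches empirical[i].

    points: list of outcomes; features: list of functions f_i(x) in {0, 1};
    empirical: target expectations. A slack feature is appended so the
    feature counts sum to the same constant C for every outcome.
    """
    C = max(sum(f(x) for f in features) for x in points)
    feats = features + [lambda x: C - sum(f(x) for f in features)]
    targets = empirical + [C - sum(empirical)]
    lam = [0.0] * len(feats)

    for _ in range(iterations):
        # Current model distribution.
        weights = [math.exp(sum(l * f(x) for l, f in zip(lam, feats))) for x in points]
        z = sum(weights)
        p = [w / z for w in weights]
        # Model expectations of each feature.
        expect = [sum(p[j] * f(x) for j, x in enumerate(points)) for f in feats]
        # GIS update: lambda_i += (1/C) * log(target_i / model_expectation_i).
        lam = [l + (1.0 / C) * math.log(t / e) if t > 0 and e > 0 else l
               for l, t, e in zip(lam, targets, expect)]
    return lam

# Toy usage: a die whose probability of showing an even face must be 0.7.
points = [1, 2, 3, 4, 5, 6]
is_even = lambda x: 1 if x % 2 == 0 else 0
lam = gis(points, [is_even], [0.7])
print(lam)
```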

Slide 18: Motivation for Exploiting Semantic and Syntactic Dependencies

- N-gram models only take local correlation between words into account.
- Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency.
- We need a model that exploits topic and syntax.

Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."


Slide 20: Training a Topic-Sensitive Model

- Cluster the training data by topic.
  - TF-IDF vectors (excluding stop words).
  - Cosine similarity.
  - K-means clustering.
- Select topic-dependent words: \log \frac{f_t(w)}{f(w)} > \text{threshold}.
- Estimate an ME model with topic unigram constraints:
  P(w_i \mid w_{i-2}, w_{i-1}, topic) = \frac{e^{\lambda(w_i)} \, e^{\lambda(w_{i-1}, w_i)} \, e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \, e^{\lambda(topic, w_i)}}{Z(w_{i-2}, w_{i-1}, topic)},
  where the topic unigram constraints are
  \sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid topic) = \frac{\#[w_i, topic]}{\#[topic]}.
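A minimal sketch of the clustering step using scikit-learn as a modern stand-in (the thesis predates this library; the parameter values are illustrative). TfidfVectorizer L2-normalizes the document vectors, so Euclidean K-means on them behaves like cosine-similarity clustering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_by_topic(documents, n_topics=100):
    """Represent each training document as a TF-IDF vector (stop words
    removed), then group the documents into topics with K-means."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)          # sparse doc-term TF-IDF matrix
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    labels = km.fit_predict(X)                       # topic id for each document
    return labels, km, vectorizer

# Toy usage on three tiny "documents".
docs = ["the futures exchange contract revival",
        "virus evidence research scientists",
        "futures contract exchange officials"]
labels, km, vec = cluster_by_topic(docs, n_topics=2)
print(labels)
```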

Slide 21: Recognition Using a Topic-Sensitive Model

- Detect the current topic from the recognizer's N-best hypotheses vs. the reference transcriptions.
  - Using N-best hypotheses causes little degradation (in perplexity and WER).
- Assign a new topic per conversation vs. per utterance.
  - Topic assignment for each utterance is better than topic assignment for the whole conversation.
- See Khudanpur and Wu, ICASSP'99, and Florian and Yarowsky, ACL'99, for details.
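A minimal sketch of assigning a topic to a test utterance from its N-best hypotheses, reusing the hypothetical `vec` and `km` objects from the clustering sketch above: pool the hypothesis words, vectorize them, and pick the most cosine-similar topic centroid.

```python
from sklearn.metrics.pairwise import cosine_similarity

def assign_topic(nbest_hypotheses, vectorizer, kmeans):
    """Pick the topic whose centroid is most cosine-similar to the bag of
    words collected from the N-best hypotheses for one utterance."""
    text = " ".join(" ".join(words) for words in nbest_hypotheses)
    v = vectorizer.transform([text])                 # TF-IDF vector of the hypotheses
    sims = cosine_similarity(v, kmeans.cluster_centers_)
    return int(sims.argmax())

# Toy usage with the objects from the clustering sketch above.
topic = assign_topic([["futures", "exchange", "contract"]], vec, km)
print(topic)
```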

Slide 22: Performance of the Topic Model

- The ME model with only N-gram constraints duplicates the performance of the corresponding back-off model.
- The topic-dependent ME model reduces the perplexity by 7% and the WER by 0.7% absolute.

Slide 23: Content Words vs. Stop Words

- 1/5 of the tokens in the test data are content-bearing words.
- The WER of the baseline trigram model is relatively high for content words.
- Topic dependencies are much more helpful in reducing the WER of content words (1.4%) than that of stop words (0.6%).

Slide 24: A Syntactic Parse and Syntactic Heads

[Figure: syntactic parse of "The contract ended with a loss of 7 cents after ...", with part-of-speech tags (DT NN VBD IN DT NN IN CD NNS) and head words percolated up through the NP, PP, VP and S' constituents (e.g. "contract" heads the subject NP, "ended" heads the VP).]

Slide 25: Exploiting Syntactic Dependencies

- All sentences in the training set are parsed by a left-to-right parser.
- A stack of parse trees T_i is generated for each sentence prefix.

[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", showing the exposed head words h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP) alongside the two preceding words w_{i-2}, w_{i-1} used to predict w_i.]

Slide 26: Exploiting Syntactic Dependencies (Cont.)

- A probability is assigned to each word as:
  P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_1^{i-1}, T_i) \, \rho(T_i \mid W_1^{i-1})
                        = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \, \rho(T_i \mid W_1^{i-1}),
  where S_i is the stack of partial parses of the prefix W_1^{i-1}, h_{i-2}, h_{i-1} are the two preceding head words, and nt_{i-2}, nt_{i-1} are their non-terminal labels.

Slide 27: Exploiting Syntactic Dependencies (Cont.)

- A probability is assigned to each word by summing over the parse stack, as on the previous slide.
- It is assumed that most of the useful information is embedded in the two preceding words and the two preceding head words.
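A minimal sketch of the mixture over the parse stack defined above, assuming each surviving partial parse carries its probability \rho(T_i | W_1^{i-1}) together with the exposed head words and non-terminal labels; all names are illustrative.

```python
def syntactic_word_prob(word, w2, w1, parse_stack, cond_prob):
    """P(w_i | W_1^{i-1}) as a rho-weighted mixture over partial parses.

    parse_stack: list of (rho, h2, h1, nt2, nt1) tuples for the surviving
                 partial parses of the sentence prefix; rho values sum to 1.
    cond_prob:   the conditional model's probability of `word` given the two
                 preceding words, head words, and non-terminal labels.
    """
    return sum(rho * cond_prob(word, w2, w1, h2, h1, nt2, nt1)
               for rho, h2, h1, nt2, nt1 in parse_stack)

# Toy usage with a two-parse stack and a stand-in conditional model.
stand_in = lambda word, *context: 0.01 if word == "has" else 0.001
stack = [(0.8, "team", "led", "NP", "VP"), (0.2, "scientists", "led", "NP", "VP")]
print(syntactic_word_prob("has", "Hopkins", "scientists", stack, stand_in))
```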

Slide 28: Training a Syntactic ME Model

- Estimate an ME model with syntactic constraints:
  P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)} \, e^{\lambda(w_{i-1}, w_i)} \, e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \, e^{\lambda(h_{i-1}, w_i)} \, e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})},
  where Z(\cdot) normalizes the distribution over the vocabulary and the \lambda's are set to match the N-gram, head-word and non-terminal marginals observed in training.
- See Chelba and Jelinek, ACL'98, and Wu and Khudanpur, ICASSP 2000, for details.

Slide 29: Experimental Results of Syntactic LMs

- Non-terminal constraints and syntactic constraints together reduce the perplexity by 6.3% and the WER by 1.0% absolute compared with the trigram baseline.

Slide 30: Head Words inside vs. outside Trigram Range

[Figure: two snapshots of the partial parse of "The contract ended with a loss of 7 cents after ...", contrasting a prediction point where the head words h_{i-2}, h_{i-1} fall outside the trigram window around w_i with one where they fall inside it.]

Slide 31: Syntactic Heads inside vs. outside Trigram Range

- 1/4 of syntactic heads are outside trigram range.
- The WER of the baseline trigram model is relatively high when the syntactic heads are beyond trigram range.
- Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.4%) than when they are within trigram range.

Slide 32: Combining Topic, Syntactic and N-gram Dependencies in an ME Framework

- Probabilities are assigned as:
  P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) \, \rho(T_i \mid W_1^{i-1})
- Only marginal constraints are necessary.
- The ME composite model is trained:
  P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) = \frac{e^{\lambda(w_i)} \, e^{\lambda(w_{i-1}, w_i)} \, e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \, e^{\lambda(h_{i-1}, w_i)} \, e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \, e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} \, e^{\lambda(topic, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)}

Slide 33: Overall Experimental Results

- The baseline trigram WER is 38.5%.
- Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
- Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
- Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and the WER by 1.5% absolute; the gains from topic and syntactic dependencies are nearly additive.


Slide 36: Content Words vs. Stop Words

- The topic-sensitive model reduces WER by 1.4% on content words, twice the overall improvement (0.7%).
- The syntactic model improves WER on content words and stop words evenly.
- The composite model has the advantages of both models and reduces the WER on content words more significantly (2.1%).

Slide 37: Head Words inside vs. outside Trigram Range

- The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
- The topic model helps where the trigram is inappropriate.
- The WER reduction from the syntactic model (1.4%) exceeds its overall reduction (1.0%) when head words are outside trigram range.
- The WER reduction from the composite model (2.2%) exceeds its overall reduction (1.5%) when head words are outside trigram range.

Slide 38: Further Insight Into the Performance

- The composite model reduces the WER of content words by 2.6% absolute when the syntactic predicting information is beyond trigram range.

Slide 39: Nominal Speed-up

- The hierarchical training methods achieve a nominal speed-up of
  - two orders of magnitude for Switchboard, and
  - three orders of magnitude for Broadcast News.

Slide 40: Real Speed-up

- The real speed-up is 15-30-fold for the Switchboard task:
  - 30-fold for the trigram model,
  - 25-fold for the topic model,
  - 15-fold for the composite model.
- This simplification of the training procedure makes it practical to implement ME models for large corpora:
  - 40 minutes for the trigram model,
  - 2.3 hours for the topic model.

Slide 41: More Experimental Results: Topic-Dependent Models for Broadcast News

- ME models are created for the Broadcast News corpus (130M words).
- The topic-dependent model reduces the perplexity by 10% and the WER by 0.6% absolute.
- The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

  Model:  3-gram | +topic 1-gram | +topic 2-gram | +topic 3-gram | ME
  Size:   9.1M   | +100*64K      | +100*400K     | +100*600K     | +250K

Slide 42: Concluding Remarks

- Non-local and syntactic dependencies have been successfully integrated with N-grams, and their benefit has been demonstrated in speech recognition.
  - Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER (Eurospeech '99 best student paper award).
  - Broadcast News: 10% reduction in PPL, 0.6% in WER (topic constraints only; syntactic constraints in progress).
- The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models.
  - Nominal speed-up: 100-1000-fold.
  - "Real" speed-up: 15+-fold.
- A general-purpose toolkit for ME models is being developed for public release.


Slide 44: Acknowledgement

- I thank my advisor, Sanjeev Khudanpur, who led me to this field and always gave me wise advice and help when needed, and David Yarowsky, who gave generous help throughout my Ph.D. program.
- I thank Radu Florian and David Yarowsky for their help with topic detection and data clustering; Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the Switchboard experimental results reported here; and Shankar Kumar and Vlasios Doumpiotis for their help in generating N-best lists for the Broadcast News experiments.
- I thank all the people in the NLP lab and CLSP for their assistance in my thesis work.
- This work is supported by the National Science Foundation through STIMULATE grant IRI-9618874.

