Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling
Sanjeev Khudanpur, Jun Wu
Center for Language and Speech Processing, JHU
Abstract
Combine two sources of long-range statistical dependence
– the syntactic structure
– the topic of a sentence
These dependencies are integrated using the maximum entropy technique
Topic information
Word frequencies are used to construct a separate N-gram model for each individual topic
– Problem: fragmentation of the training text by topic
– Remedy: interpolate each topic-specific N-gram model with a topic-independent model, which is constructed using all the available data
Latent semantic analysis (Bellegarda, 1998)
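The interpolation remedy can be illustrated with a minimal sketch. It assumes each model is available as a next-word distribution for one fixed history; the function name, the toy probabilities and the weight of 0.5 are hypothetical and not taken from the slides.

```python
def interpolate_models(p_topic, p_general, weight):
    """Return weight * p_topic + (1 - weight) * p_general for every word."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vocab = set(p_topic) | set(p_general)
    return {
        w: weight * p_topic.get(w, 0.0) + (1.0 - weight) * p_general.get(w, 0.0)
        for w in vocab
    }

# Toy next-word distributions for one fixed history (made-up numbers).
p_topic = {"stock": 0.6, "market": 0.4}                # sparse topic-specific model
p_general = {"stock": 0.1, "market": 0.2, "the": 0.7}  # topic-independent model
p_mix = interpolate_models(p_topic, p_general, weight=0.5)
```

Words unseen in the sparse topic-specific model still receive probability mass from the topic-independent model, which is the point of the remedy.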
Syntactic information
Chelba and Jelinek (1998, 1999) used a left-to-right parser to extract syntactic heads
Example: "Financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange"
Head words: "officials consider" (head-word pair) vs. "colony consider" (adjacent-word pair)
Combining topic and N-gram dependencies
– Topic-dependent features: unigram frequencies collected from all documents on a specific topic t in the training corpus
– Topic-independent features: overall N-gram frequencies in the training corpus
Model: P(w_i | w_{i-1}, w_{i-2}, t_i)
ME parameters
The long-range history w_1, …, w_{i-1} is used to assign a topic t_i = t(w_1, …, w_{i-1}) to a conversation
h_i = [w_{i-1}, w_{i-2}, t_i]
Empirical distribution constraints
ME parameters
Z is a suitable normalization constant
The first three numerator terms correspond to standard N-gram constraints, while the fourth is a topic-unigram parameter determined by word frequencies in a particular topic
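A model form consistent with this description can be sketched as follows. The λ symbols (one parameter per unigram, bigram, trigram and topic-unigram feature) and the exact arrangement of arguments are assumptions made for illustration.

```latex
P(w_i \mid w_{i-1}, w_{i-2}, t_i) =
  \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot
        e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(t_i, w_i)}}
       {Z(w_{i-1}, w_{i-2}, t_i)}
```

Here Z sums the numerator over all words in the vocabulary, so that the probabilities for a given history add up to one.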
Computational issues
GIS (generalized iterative scaling) algorithm
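The GIS update can be illustrated on a toy problem. The sketch below assumes a small unconditional model with non-negative features, which is far simpler than the conditional language model of the slides; the function names, outcomes, features and target distribution are all hypothetical.

```python
import math

def model_distribution(rows, lam):
    """p(x) proportional to exp(sum_k lam[k] * f_k(x)) for each row of feature values."""
    scores = [math.exp(sum(l * v for l, v in zip(lam, row))) for row in rows]
    z = sum(scores)  # normalization constant
    return [s / z for s in scores]

def gis(outcomes, features, empirical, iterations=200):
    """Generalized iterative scaling for a small unconditional ME model."""
    rows = [[f(x) for f in features] for x in outcomes]
    # GIS needs the features of every outcome to sum to the same constant c,
    # so a slack feature is appended to make up the difference.
    c = max(sum(row) for row in rows)
    rows = [row + [c - sum(row)] for row in rows]
    n = len(rows[0])
    # Target feature expectations under the empirical distribution.
    target = [sum(empirical[x] * row[k] for x, row in zip(outcomes, rows))
              for k in range(n)]
    lam = [0.0] * n
    for _ in range(iterations):
        p = model_distribution(rows, lam)
        expected = [sum(px * row[k] for px, row in zip(p, rows)) for k in range(n)]
        for k in range(n):
            # Features with a zero expectation are skipped to avoid log(0).
            if target[k] > 0 and expected[k] > 0:
                lam[k] += math.log(target[k] / expected[k]) / c
    return dict(zip(outcomes, model_distribution(rows, lam)))

# Toy problem (made-up numbers): three outcomes, two binary features.
outcomes = ["a", "b", "c"]
features = [lambda x: 1 if x == "a" else 0,
            lambda x: 1 if x in ("a", "b") else 0]
empirical = {"a": 0.5, "b": 0.3, "c": 0.2}
fitted = gis(outcomes, features, empirical)
```

In this toy case the two constraints determine the distribution completely, so the fitted model should approach the empirical one. With a large vocabulary and many histories, the expensive step is recomputing the expectations on every iteration.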
Topic assignment for test utterances
Topic assignment must be based on recognizer hypotheses
The topic of a conversation may change as the conversation progresses
Assign a topic to
– an entire test conversation
– each utterance
– parts of an utterance
Experimental setup
|V| = 22K words
Training corpus
– nearly 1200 Switchboard conversations, 2.1M words
– each conversation is annotated with one of about 70 topics
Testing corpus
– 19 conversations (38 conversation sides), 18K words, over 2400 utterances
100-best hypotheses are generated by the HTK recognizer using a back-off bigram LM
Reported metrics: WER after rescoring these hypotheses, and perplexity of the transcriptions
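Perplexity, one of the two metrics used here, can be computed as in the following minimal sketch. It assumes the model probability of every word in the transcription is already available; the function name and the toy probabilities are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a text given the model probability of each of its words."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    # Exponential of the negative average log-probability per word.
    return math.exp(-log_sum / len(word_probs))

# A model that gives every word probability 1/4 has perplexity close to 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model assigns higher probability to the reference transcriptions.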
Baseline experiments
When only N-gram constraints are used, the ME model essentially replicates the performance of the corresponding back-off N-gram model
Estimation of topic-conditional models
Each conversation side in the training corpus is processed to obtain a representative vector of weighted frequencies of vocabulary terms, excluding stop words, where a stop word is any of a list of about 700 words
These vectors are then clustered using a K-means procedure (K ≈ 70)
f_t: relative word frequency in cluster t
Choose the words that are related to topic t: about 16K words of the 22K vocabulary, which constitute about 8% of the 2.1M training tokens
Topic assignment during testing
Hard decision
Cosine similarity measure
Four options for assignment
– Manual assignment
– Reference transcriptions
– 10-best hypotheses
– Assignment by an oracle to minimize perplexity (or WER)
A null topic, which defaults to the topic-independent baseline model, is available as one of the choices to the topic classifier
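A hard topic decision by cosine similarity can be sketched as below. The sketch assumes raw term counts rather than the weighted frequencies mentioned earlier, a tiny stop list in place of the roughly 700-word list, and hand-made topic centroids; all names and values are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "and", "i"}  # hypothetical tiny stop list

def term_vector(words):
    """Count the non-stop words of a hypothesis or transcription."""
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_topic(words, centroids):
    """Hard decision: pick the topic whose centroid is most similar."""
    vec = term_vector(words)
    return max(centroids, key=lambda t: cosine(vec, centroids[t]))

# Made-up topic centroids and a made-up recognizer hypothesis.
centroids = {
    "finance": {"stock": 3, "market": 2, "bank": 1},
    "sports": {"game": 3, "team": 2, "score": 1},
}
hypothesis = "i watched the game and the team won".split()
topic = assign_topic(hypothesis, centroids)
```

To base the decision on several hypotheses, as with the 10-best option, the words of all hypotheses could be pooled into a single list before calling `assign_topic`.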
Topic assignment during testing (conversation level)
There is only a small loss in perplexity and a negligible loss in WER when the topic assignment is based on recognizer hypotheses instead of the correct transcriptions
Topic assignment during testing (utterance level)
Best recognition performance is achieved by assigning a topic to each utterance based on the 10-best hypotheses of the current and the three preceding utterances
Absolute WER reduction of 0.7%, relative perplexity reduction of 7%
Topic of two levels
8 out of 10 utterances prefer the topic-independent model (the null topic)
Analysis of recognition performance
Divide the vocabulary into two sets:
– Words which have topic-conditional unigram constraints for any of the topics
– The others
About 7% of test set tokens have topic-dependent constraints
Analysis of recognition performance
Divide the vocabulary simply into content-bearing words and stop words
– 25% of the test set tokens are content-bearing words
ME vs. interpolated topic N-gram
Interpolation weight
– Chosen to minimize the perplexity of the test set under each interpolated model
ME vs. cache-based models
N-best hypotheses for the preceding utterances in a conversation side are considered in estimating the cache model
– N = 100
– interpolation weight = 0.1
The cache model also caches recognition errors, which leads to repeated errors
The cache-based model has about a 0.6% higher rate of repeated errors than the baseline trigram model
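The cache interpolation can be sketched as follows. The sketch assumes a unigram cache built from the hypotheses of preceding utterances, which the slide does not specify; only the interpolation weight of 0.1 comes from the slide, and the function names and toy data are hypothetical.

```python
from collections import Counter

def cache_unigram(previous_hypotheses):
    """Relative word frequencies over the hypotheses of preceding utterances."""
    counts = Counter(w for hyp in previous_hypotheses for w in hyp)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def cached_probability(word, p_static, cache, weight=0.1):
    """Interpolate the cache estimate with the static model probability."""
    return weight * cache.get(word, 0.0) + (1.0 - weight) * p_static

# Made-up hypotheses for two preceding utterances.
previous = [["the", "market", "fell"], ["the", "market", "rose"]]
cache = cache_unigram(previous)
p_market = cached_probability("market", p_static=0.01, cache=cache)
```

Because the cache is filled from recognizer output, a misrecognized word also gets boosted, which is one way to read the repeated-error effect noted above.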
Combining syntactic and N-gram dependencies
All sentences in the training data are parsed by the left-to-right parser presented by Chelba and Jelinek (1998)
At position i, this parser generates a stack S_i of candidate parse trees T_{ij} for the sentence prefix W_1^{i-1} = w_1, w_2, …, w_{i-1}
It also assigns a probability P(w_i | W_1^{i-1}, T_{ij}) to each possible following word w_i given the j-th partial parse T_{ij}, and a likelihood ρ(W_1^{i-1}, T_{ij}) to the j-th partial parse
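The parser's word prediction can then be written as a weighted sum over the stack. The form below is a sketch which assumes that ρ is normalized to sum to one over the parses in S_i; that normalization is an assumption made here for illustration.

```latex
P(w_i \mid W_1^{i-1}) =
  \sum_{T_{ij} \in S_i} P(w_i \mid W_1^{i-1}, T_{ij}) \cdot \rho(W_1^{i-1}, T_{ij})
```

Each partial parse contributes its own next-word prediction, weighted by how likely that parse is given the words seen so far.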
Head-word
Assume that the immediate history (w_{i-2}, w_{i-1}) and the last two head-words h_{i-2}, h_{i-1} of the partial parse T_{ij} carry most of the useful information
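Under this assumption, the parse-conditional word probability would depend only on those four items. The approximation below is a sketch of that reading, not a formula copied from the slide.

```latex
P(w_i \mid W_1^{i-1}, T_{ij}) \approx
  P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1})
```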
ME parameters
Constraints: the first two kinds involve regular N-gram counts, and the last two involve head-word N-gram counts
ME parameters
Z is a normalization constant
Again, the first three terms in the numerator correspond to standard N-gram constraints
The last two terms represent head-word N-gram constraints
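A model form consistent with this description can be sketched as follows. The λ symbols and the pairing of head words with w_i in the last two terms are assumptions made for illustration.

```latex
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}) =
  \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot
        e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot
        e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)}}
       {Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1})}
```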
Recognition performance of the syntactic model
Comparison of the topic and syntactic models
Analysis of recognition performance
Divide all histories in the test sentences into two categories:
– (h_{i-2}, h_{i-1}) = (w_{i-2}, w_{i-1})
– h_{i-2} ≠ w_{i-2} or h_{i-1} ≠ w_{i-1}
About 75% of the histories belong to the former category
ME vs. interpolated syntactic models
The maximum entropy technique slightly but consistently outperforms interpolation
Combining topic, syntactic and N-gram dependencies
Z is a normalization constant, and the parameters λ are computed to satisfy constraints on the marginal probability of N-grams, head-word N-grams and topic-conditional unigrams
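Evaluating such a combined model can be sketched as below. The sketch assumes a product-of-exponentials form with one λ per active N-gram, head-word N-gram and topic-unigram feature; the function name, the feature keys, the toy vocabulary and the weights are hypothetical.

```python
import math

def me_probability(word, w2, w1, h2, h1, topic, lam, vocab):
    """P(word | w2, w1, h2, h1, topic) under a log-linear model with weights lam."""
    def score(w):
        # Features that fire for candidate word w in this history.
        keys = [
            ("uni", w), ("bi", w1, w), ("tri", w2, w1, w),   # N-gram features
            ("head-bi", h1, w), ("head-tri", h2, h1, w),     # head-word features
            ("topic", topic, w),                             # topic-unigram feature
        ]
        return math.exp(sum(lam.get(k, 0.0) for k in keys))
    z = sum(score(w) for w in vocab)  # normalization constant Z for this history
    return score(word) / z

# Made-up vocabulary and weights; features without a weight contribute nothing.
vocab = ["consider", "considers", "the"]
lam = {("uni", "the"): 1.0, ("head-bi", "officials", "consider"): 2.0}
p = me_probability("consider", "british", "colony", "<s>", "officials",
                   "finance", lam, vocab)
```

Z has to be recomputed for every distinct history, which is where most of the computational cost of such a model comes from.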
Analysis of recognition performance
The topic-dependent model improved prediction of content-bearing words
The syntactic model improved prediction when the two immediately preceding head words were not within trigram range
Analysis of recognition performance
These two kinds of information are independent
Conclusion
ME is used to combine two diverse sources of long-range dependence with N-gram models
– Topic information: content words
– Syntactic information: information outside trigram range
These two kinds of information are independent