Combining Non-local, Syntactic and N-gram Dependencies in Language Modeling
Jun Wu and Sanjeev Khudanpur
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218
September 9, 1999
NSF STIMULATE Grant No. IRI
Motivation
Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."
- N-gram models capture only local correlations between words.
- Dependencies in natural language with longer, sentence-structure-dependent spans may compensate for this deficiency.
- We need a model that exploits both topic and syntax.
Training a Topic-Sensitive Model
- Cluster the training data by topic:
  - TF-IDF vectors (excluding stop words).
  - Cosine similarity.
  - K-means clustering.
- Select topic-dependent words: f_t(w) · log( f_t(w) / f(w) ) > threshold.
- Estimate an ME model with topic unigram constraints:
  P(w_i | w_{i-2}, w_{i-1}, topic) = e^{λ(w_i)} · e^{λ(w_{i-1}, w_i)} · e^{λ(w_{i-2}, w_{i-1}, w_i)} · e^{λ(topic, w_i)} / Z(w_{i-2}, w_{i-1}, topic),
  where the topic unigram constraints tie the model's expected count of (topic, w_i) to the observed count #[topic, w_i].
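To make the clustering step concrete, here is a minimal sketch, assuming scikit-learn's TfidfVectorizer and KMeans as stand-ins for the TF-IDF/K-means pipeline described above; the number of topics, the threshold value, and the helper names are illustrative and not taken from the original system.

```python
# Sketch of topic clustering and topic-word selection (illustrative, not the
# original toolkit).  Assumes scikit-learn and numpy are available.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_by_topic(documents, n_topics=50):
    """Cluster training documents (e.g. conversation transcripts) by topic.

    TfidfVectorizer L2-normalizes each document vector, so Euclidean K-means
    on these vectors behaves like clustering by cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)               # doc x term TF-IDF matrix
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    return labels, vectorizer, km

def select_topic_words(documents, labels, topic, threshold=1.0):
    """Keep word w for topic t if f_t(w) * log(f_t(w) / f(w)) > threshold,
    where f_t and f are relative frequencies within the topic and overall."""
    def rel_freq(docs):
        counts, total = {}, 0
        for doc in docs:
            for w in doc.split():
                counts[w] = counts.get(w, 0) + 1
                total += 1
        return {w: c / total for w, c in counts.items()}

    f_all = rel_freq(documents)
    f_top = rel_freq([d for d, l in zip(documents, labels) if l == topic])
    return {w for w, ft in f_top.items() if ft * np.log(ft / f_all[w]) > threshold}
```

The selected words are the ones whose topic unigram constraints enter the ME model above.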
Recognition Using a Topic-Sensitive Model
- Detect the current topic from the recognizer's N-best hypotheses.
  - Using N-best hypotheses causes little degradation (in perplexity and WER).
- Assign a new topic to each utterance.
  - Per-utterance topic assignment is better than a single topic assignment for the whole conversation.
- Recognize lattices using the topic-sensitive model.
- See Khudanpur and Wu, ICASSP'99, for details.
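As a rough illustration of per-utterance topic assignment, the sketch below picks the topic whose centroid (from the clustering sketch above) is most cosine-similar to a TF-IDF vector built from the N-best hypotheses; whether the original system used exactly this similarity-to-centroid rule is an assumption of the sketch.

```python
# Sketch of assigning a topic to one utterance from its N-best hypotheses.
# `vectorizer` and `km` are the illustrative objects returned by cluster_by_topic.
import numpy as np

def assign_topic(nbest_hypotheses, vectorizer, km):
    """Return the index of the topic whose centroid is most cosine-similar
    to the pooled N-best word strings of the current utterance."""
    text = " ".join(nbest_hypotheses)                  # pool all hypotheses
    v = vectorizer.transform([text]).toarray()[0]      # L2-normalized TF-IDF vector
    centroids = km.cluster_centers_
    norms = np.linalg.norm(centroids, axis=1) + 1e-12  # centroids are not unit length
    sims = centroids @ v / norms                       # cosine similarity per topic
    return int(np.argmax(sims))
```

The chosen topic then determines which topic-unigram features fire when the topic-sensitive model rescores that utterance's lattice.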
Exploiting Syntactic Dependencies
[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", with POS tags DT NN VBD IN CD NNS and exposed head words h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP) used to predict w_i = "after".]
- All sentences in the training set are parsed by a left-to-right parser.
- A stack S_i of partial parse trees T_i is generated for each sentence prefix.
Exploiting Syntactic Dependencies (Cont.)
- A probability is assigned to each word as:
  P(w_i | W_{i-1}) = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}) · ρ(T_i | W_{i-1}),
  where ρ(T_i | W_{i-1}) is the weight of partial parse T_i in the stack S_i.
- It is assumed that most of the useful information is embedded in the two preceding words and the two preceding head words.
- See Chelba and Jelinek, Eurospeech'99, for details.
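The sum above is just a weighted mixture over the parse stack; the short sketch below spells it out, with the stack entries, their weights ρ, and the conditional word model all treated as given (their exact representations here are assumptions of the sketch).

```python
# Sketch: P(w_i | W_{i-1}) as a weighted sum over the partial parses in S_i.
# parse_stack: list of (rho, h_im2, h_im1) tuples with the rho weights summing to 1.
# word_model(w_i, w_im2, w_im1, h_im2, h_im1): a conditional word model such as
# the ME model with syntactic constraints on the next slide.

def word_probability(w_i, w_im2, w_im1, parse_stack, word_model):
    total = 0.0
    for rho, h_im2, h_im1 in parse_stack:
        total += rho * word_model(w_i, w_im2, w_im1, h_im2, h_im1)
    return total
```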
Training a Syntactic ME Model
Estimate an ME model with syntactic constraints:
P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}) = e^{λ(w_i)} · e^{λ(w_{i-1}, w_i)} · e^{λ(w_{i-2}, w_{i-1}, w_i)} · e^{λ(h_{i-1}, w_i)} · e^{λ(h_{i-2}, h_{i-1}, w_i)} / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}),
where the head-word constraints tie the model's expected counts of (h_{i-1}, w_i) and (h_{i-2}, h_{i-1}, w_i) to the observed counts, e.g. #[h_{i-1}, w_i].
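A minimal sketch of evaluating this model once the λ weights have been trained (e.g. by iterative scaling); the dictionary-of-features representation and the implicit weight of 0 for unseen features are choices of the sketch, not a description of the original implementation.

```python
# Sketch: evaluating P(w | w_-2, w_-1, h_-2, h_-1) for a trained syntactic ME model.
# `lambdas` maps feature-type names to dicts of trained weights; absent features
# contribute 0.  `vocab` is the word list used for the normalization Z.
import math

def me_syntactic_prob(w, w2, w1, h2, h1, lambdas, vocab):
    def score(word):
        s  = lambdas['uni'].get((word,), 0.0)
        s += lambdas['bi'].get((w1, word), 0.0)
        s += lambdas['tri'].get((w2, w1, word), 0.0)
        s += lambdas['head_bi'].get((h1, word), 0.0)
        s += lambdas['head_tri'].get((h2, h1, word), 0.0)
        return math.exp(s)

    z = sum(score(v) for v in vocab)        # Z(w_-2, w_-1, h_-2, h_-1)
    return score(w) / z
```

Computing Z requires summing over the whole vocabulary for every distinct history, which is the main cost of ME language models.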
Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
- Probabilities are assigned as:
  P(w_i | W_{i-1}, topic) = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, topic) · ρ(T_i | W_{i-1}).
- The ME composite model is trained as:
  P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, topic) = e^{λ(w_i)} · e^{λ(w_{i-1}, w_i)} · e^{λ(w_{i-2}, w_{i-1}, w_i)} · e^{λ(h_{i-1}, w_i)} · e^{λ(h_{i-2}, h_{i-1}, w_i)} · e^{λ(topic, w_i)} / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, topic).
- Only marginal counts are required to constrain the model.
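Extending the previous sketch, the composite model simply adds one more factor e^{λ(topic, w_i)} and is again averaged over the parse stack; as before, the data layout and helper names are illustrative.

```python
# Sketch: the composite ME model = syntactic ME model + a topic-unigram factor,
# interpolated over the parse stack as on the earlier slide.
import math

def me_composite_prob(w, w2, w1, h2, h1, topic, lambdas, vocab):
    def score(word):
        s  = lambdas['uni'].get((word,), 0.0)
        s += lambdas['bi'].get((w1, word), 0.0)
        s += lambdas['tri'].get((w2, w1, word), 0.0)
        s += lambdas['head_bi'].get((h1, word), 0.0)
        s += lambdas['head_tri'].get((h2, h1, word), 0.0)
        s += lambdas['topic'].get((topic, word), 0.0)    # the extra topic feature
        return math.exp(s)
    z = sum(score(v) for v in vocab)   # Z(w_-2, w_-1, h_-2, h_-1, topic)
    return score(w) / z

def composite_word_probability(w_i, w_im2, w_im1, topic, parse_stack, lambdas, vocab):
    # P(w_i | W_{i-1}, topic) = sum over parses of rho * P(w_i | words, heads, topic)
    return sum(rho * me_composite_prob(w_i, w_im2, w_im1, h2, h1, topic, lambdas, vocab)
               for rho, h2, h1 in parse_stack)
```

Each exponential factor corresponds to a marginal (N-gram, head-word, or topic-word) count constraint, so no joint counts over all conditioning variables are needed.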
Experimental Setup (Switchboard)
- American English speakers.
- Conversational (human-to-human) telephone speech.
- 22K-word vocabulary.
- 2-hour test set (18K words).
- State-of-the-art speaker-independent systems: 30-35% WER.
- Results presented here do not use speaker adaptation.
Experimental Results
- Baseline trigram WER is 38.5%.
- Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
- Head-word constraints alone reduce perplexity by 7% and WER by 0.8% absolute.
- Topic-dependent and syntactic constraints together reduce perplexity by 12% and WER by 1.3% absolute.
- The gains from topic and syntactic dependencies are nearly additive.
Content Words vs. Stop Words
- One fifth of the test tokens are content-bearing words.
- The topic-sensitive model reduces WER on content words by 1.4% absolute, twice the overall improvement (0.7%).
- The syntactic model improves WER evenly on content words and stop words.
- The composite model combines the advantages of both and reduces WER on content words the most (1.8% absolute).
Head Words Inside vs. Outside Trigram Range
- The baseline trigram model's WER is relatively high when head words are beyond trigram range.
- The topic model helps when the trigram context is inadequate.
- For the syntactic model, the WER reduction when head words are outside trigram range (1.5%) exceeds the overall reduction (0.8%).
- For the composite model, the WER reduction when head words are outside trigram range (2.3%) exceeds the overall reduction (1.3%).
Further Insight Into the Performance
The composite model reduces the WER of content words by 3.5% absolute when the syntactic predictors (head words) are beyond trigram range.
Concluding Remarks
- The topic LM reduces PPL by 7% and WER by 0.7% (absolute).
- The syntactic LM reduces PPL by 7% and WER by 0.8% (absolute).
- The composite LM reduces PPL by 12% and WER by 1.3% (absolute).
- The non-local dependencies are complementary, and their gains are almost additive.
- WER on content words drops by 1.8%, most of it due to the topic dependencies.
- WER when head words are beyond trigram range drops by 2.3%, most of it due to the syntactic dependencies.
Ongoing and Future Work
- Further improve the model by using non-terminal labels in the partial parse.
- Apply this model to lattice rescoring.
- Apply this method to other tasks (Broadcast News).