1 Maximum Entropy Language Modeling with Semantic, Syntactic and Collocational Dependencies
Jun Wu, Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218. August 20, 2002.

2 Outline
Motivation
Semantic (topic) dependencies
Syntactic dependencies
Maximum entropy (ME) models with topic and syntactic dependencies
Training ME models efficiently:
  Hierarchical training (N-gram)
  Generalized hierarchical training (syntactic model)
  Divide-and-conquer (topic-dependent model)
Conclusion and future work

4 Exploiting Semantic and Syntactic Dependencies
Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange. N-gram models only take local correlations between words into account. Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency. We need a model that exploits topic and syntax.

7 Training a Topic-Sensitive Model
Cluster the training data by topic: TF-IDF vectors (excluding stop words), cosine similarity, K-means clustering (K ≈ 70 for SWBD, ≈ 100 for BN).
Select topic-dependent words using the criterion f_t(w) \log \frac{f_t(w)}{f(w)}.
Estimate an ME model with N-gram and topic-unigram constraints:
P(w_i \mid w_{i-2}, w_{i-1}, \mathrm{topic}) = \frac{e^{\lambda_1(w_i)} \cdot e^{\lambda_2(w_{i-1}, w_i)} \cdot e^{\lambda_3(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda_4(\mathrm{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, \mathrm{topic})},
where the topic-unigram constraint matches the empirical marginal
\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1} \mid \mathrm{topic}) \, P(w_i \mid w_{i-2}, w_{i-1}, \mathrm{topic}) = \frac{\#[\mathrm{topic}, w_i]}{\#[\mathrm{topic}]}.
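To make the clustering and word-selection steps concrete, here is a minimal Python sketch using scikit-learn. It is illustrative only: the corpus, the value of K, and the selection threshold are placeholders, not the settings used in the thesis, and L2-normalized K-means is used as a stand-in for cosine-similarity clustering.

```python
# Minimal sketch of topic clustering and topic-word selection (illustrative, not the thesis code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_by_topic(documents, k=70):
    """Cluster documents (conversations/stories) into k topics."""
    tfidf = TfidfVectorizer(stop_words="english")      # TF-IDF vectors, stop words removed
    X = normalize(tfidf.fit_transform(documents))      # L2-normalize so Euclidean K-means
                                                       # approximates cosine-similarity clustering
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return labels, tfidf.get_feature_names_out()

def select_topic_words(topic_counts, corpus_counts, threshold=1e-4):
    """Rank words by f_t(w) * log(f_t(w) / f(w)) and keep those above a threshold."""
    f_t = topic_counts / topic_counts.sum()            # relative frequency within the topic
    f = corpus_counts / corpus_counts.sum()            # relative frequency in the whole corpus
    score = np.where(f_t > 0,
                     f_t * np.log(np.maximum(f_t, 1e-12) / np.maximum(f, 1e-12)),
                     0.0)
    return np.nonzero(score > threshold)[0]            # indices of topic-dependent words
```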

10 Experimental Setup
Switchboard WS97 dev-test set.
Vocabulary: 22K (closed).
LM training set: 1,100 conversations, 2.1M words.
AM training set: 60 hours of speech data.
Acoustic model: state-clustered cross-word triphone model.
Front end: 13 MF-PLP + Δ + ΔΔ, per-conversation-side CMS.
Test set: 19 conversations (2 hours), 18K words; no speaker adaptation.
Evaluation: rescoring 100-best lists from the first recognition pass.

11 Experimental Setup (Cont.)
Broadcast News Hub-4 96 eval set.
Vocabulary: 64K.
LM training set: 125K stories, 130M words.
AM training set: 72 hours of speech data.
Acoustic model: state-clustered cross-word triphone model.
Front end: 13 MFCC + Δ + ΔΔ.
Test set: 2 hours, 22K words; no speaker adaptation.
Evaluation: rescoring 100-best lists from the first recognition pass.

12 Experimental Results (Switchboard)
Baseline trigram model: PPL 79, WER 38.5%. Using N-best hypotheses for topic assignment causes little degradation. Utterance-level topic assignment gives slightly better results than assignment based on whole conversations. Topic dependencies reduce perplexity by 7% and WER by 0.7% absolute.

15 Experimental Results (Broadcast News)
Utterance-level topic detection is based on 10-best lists. The ME trigram model matches the performance of the corresponding back-off model. Topic dependencies reduce perplexity by 10% and WER by 0.6% (absolute).

16 Exploiting Syntactic Dependencies
[Figure: partial parse of "The contract ended with a loss of 7 cents after ...", showing the two exposed head words (h_{i-2} = contract, under NP; h_{i-1} = ended, under VP) and their non-terminal labels nt_{i-2}, nt_{i-1} preceding the word w_i being predicted.]
All sentences in the training set are parsed by a left-to-right parser, and a stack of parse trees is generated for each sentence prefix.

17 Exploiting Syntactic Dependencies (Cont.)
A probability is assigned to each word as:
P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_1^{i-1}, T_i) \cdot \rho(T_i \mid w_1^{i-1})
                      = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \cdot \rho(T_i \mid w_1^{i-1}),
where S_i is the stack of partial parses T_i for the prefix w_1^{i-1}, and h and nt denote the exposed head words and their non-terminal labels.
[Figure: the same partial parse, with h_{i-2} = contract (NP) and h_{i-1} = ended (VP) shown for the prediction of w_i.]
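As a small hedged illustration of this mixture over the parser's stack (not the thesis code), assuming some per-parse conditional model is supplied by the caller:

```python
# Sketch: mix the per-parse conditional probabilities, weighted by rho(T | prefix).
def word_prob(w, stack, me_prob_fn):
    """stack: list of (rho, context) pairs whose rho values sum to 1 over the stack;
    me_prob_fn(w, context) returns P(w | w2, w1, h2, h1, nt2, nt1) for one parse."""
    return sum(rho * me_prob_fn(w, context) for rho, context in stack)
```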

18 Training a Syntactic ME Model
Estimate an ME model with syntactic constraints:
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})},
with marginal constraints such as
\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]},
\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]},
\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}.
See Khudanpur and Wu (CSL 2000) and Chelba and Jelinek (ACL 1998) for details.
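The product-of-factors form above can be evaluated directly once the λ's are known. The following Python sketch is illustrative only: the feature naming scheme, the tiny vocabulary, and the hand-set weights are made up, and a real system would precompute or cache the normalizer rather than summing over the vocabulary per query.

```python
# Sketch: evaluate a syntactic ME model as a product of e^lambda factors over Z.
import math

def me_prob(w, context, lambdas, vocab):
    """context = (w2, w1, h2, h1, nt2, nt1); lambdas maps feature tuples to weights."""
    w2, w1, h2, h1, nt2, nt1 = context

    def score(word):
        feats = [("u", word), ("b", w1, word), ("t", w2, w1, word),
                 ("h1", h1, word), ("h2", h2, h1, word),
                 ("n1", nt1, word), ("n2", nt2, nt1, word)]
        return math.exp(sum(lambdas.get(f, 0.0) for f in feats))  # inactive features add 0

    z = sum(score(v) for v in vocab)        # Z(w2, w1, h2, h1, nt2, nt1)
    return score(w) / z

# Toy example with an invented vocabulary and two hand-set weights.
vocab = ["after", "the", "loss"]
lambdas = {("u", "after"): 0.2, ("h1", "ended", "after"): 1.0}
print(me_prob("after", ("7", "cents", "contract", "ended", "NP", "VP"), lambdas, vocab))
```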

19 Experimental Results for Switchboard
Baseline Katz back-off trigram model: PPL 79, WER 38.5%. Interpolated model: PPL reduced by 4%, WER by 0.6% absolute. Non-terminal and syntactic constraints together reduce perplexity by 6.3% and WER by 1.0% absolute. The ME model achieves better performance than the interpolated model (Chelba & Jelinek).

20 Experimental Results for Switchboard
Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute. Head-word N-gram constraints yield a similar improvement. Non-terminal and syntactic constraints together reduce perplexity by 6.3% and WER by 1.0% absolute. The ME model achieves better performance than the interpolated model (Chelba & Jelinek).

24 Experimental Results for Broadcast News*
Baseline trigram model: PPL 214, WER 35.3%. Interpolated model: PPL reduced by 7%, WER by 0.6% absolute. *14M words of data.

25 Experimental Results for Broadcast News
Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.4% absolute. Head-word N-gram constraints yield a similar improvement. Non-terminal and syntactic constraints together reduce perplexity by 7% and WER by 0.7% absolute. The ME model achieves slightly better performance than the interpolated model. *14M words of data.

27 Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
Probabilities are assigned as:
P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \mathrm{topic}) \cdot \rho(T_i \mid w_1^{i-1}).
The ME composite model is trained as:
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \mathrm{topic}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} \cdot e^{\lambda(\mathrm{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \mathrm{topic})}.
Only marginal, trigram-like constraints are necessary.

29 Overall Experimental Results for SWBD
Baseline trigram WER is 38.5%. Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute. Syntactic constraints reduce perplexity by 6% and WER by 1.0% absolute. Topic-dependent and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.

30 Overall Experimental Results for BN
Similar improvements are achieved on the 14M-word Broadcast News task: topic-dependent and syntactic constraints individually reduce WER by 0.6%-0.7% absolute, and together they reduce WER by 1.2% absolute. This WER is lower than that of the trigram model trained with 130M words. The gains from topic and syntactic dependencies are nearly additive.

31 Advantages and Disadvantages of the Maximum Entropy Method
Advantages:
Creates a "smooth" model that satisfies all empirical constraints.
Incorporates various sources of information (e.g. topic and syntax) in a unified language model.
Disadvantages:
High computational complexity of the parameter estimation procedure.
Heavy computational load when using ME models during recognition.

33 Estimating Model Parameters Using GIS
Trigram model:
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{e^{\lambda_1(w_i)} \cdot e^{\lambda_2(w_{i-1}, w_i)} \cdot e^{\lambda_3(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})},
where Z(w_{i-2}, w_{i-1}) = \sum_{w} e^{\lambda_1(w)} \, e^{\lambda_2(w_{i-1}, w)} \, e^{\lambda_3(w_{i-2}, w_{i-1}, w)}.
Generalized Iterative Scaling (GIS) can be used to compute the λ's.
Estimating the unigram, bigram and trigram feature parameters needs on the order of |V|^3, |V| · #bigrams and #trigrams operations per iteration, respectively, so the total complexity is O(|V|^3 + |V| · #bigrams + #trigrams). The first term dominates the computation; e.g., in Switchboard it amounts to about 8 trillion operations per iteration.
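For concreteness, here is a small, self-contained GIS implementation on a toy distribution (not the Switchboard model). It shows the standard update λ_j ← λ_j + (1/C) log(Ẽ[f_j] / E[f_j]), with a slack feature added so that every event has the same total feature count C; the feature matrix and empirical distribution are made up for illustration.

```python
# Toy GIS for a maximum entropy distribution over a small event space with binary features.
import numpy as np

def gis(F, p_emp, iterations=200):
    """F: (n_events, n_features) binary feature matrix; p_emp: empirical distribution."""
    n_events, n_features = F.shape
    # GIS needs a constant total feature count C per event; pad with a slack feature.
    C = F.sum(axis=1).max()
    slack = (C - F.sum(axis=1)).reshape(-1, 1)
    Fc = np.hstack([F, slack])                      # now every row sums to C
    lam = np.zeros(Fc.shape[1])
    target = p_emp @ Fc                             # empirical feature expectations
    for _ in range(iterations):
        logits = Fc @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()                                # model distribution (normalize by Z)
        expect = p @ Fc                             # model feature expectations
        lam += np.log(np.maximum(target, 1e-12) / np.maximum(expect, 1e-12)) / C
    return lam[:n_features], p

# Example: 3 events, 2 overlapping binary features.
F = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
p_emp = np.array([0.5, 0.3, 0.2])
lam, p_model = gis(F, p_emp)
print(p_model)   # approaches the ME distribution matching the empirical feature expectations
```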

36 The Computation of Denominators
For each history (w_{i-2}, w_{i-1}) we need to compute the denominator
Z(w_{i-2}, w_{i-1}) = \sum_{w_i} \alpha_1(w_i)^{g_1(w_i)} \, \alpha_2(w_{i-1}, w_i)^{g_2(w_{i-1}, w_i)} \, \alpha_3(w_{i-2}, w_{i-1}, w_i)^{g_3(w_{i-2}, w_{i-1}, w_i)},
where \alpha_k = e^{\lambda_k}. Computing the denominator for all histories takes O(|V|^3) time. Computing the expectations of all unigram features,
E^{(n)}[g_1(w_i)] = \sum_{w_{i-2}, w_{i-1}} \frac{P(w_{i-2}, w_{i-1})}{Z(w_{i-2}, w_{i-1})} \, \alpha_1(w_i)^{g_1(w_i)} \, \alpha_2(w_{i-1}, w_i)^{g_2(w_{i-1}, w_i)} \, \alpha_3(w_{i-2}, w_{i-1}, w_i)^{g_3(w_{i-2}, w_{i-1}, w_i)},
needs the same amount of time: Z(w_{i-2}, w_{i-1}) requires a sum over all w_i for a given history, while E[\cdot] requires a sum over all histories for a given w_i. Any simplification made in the calculation of the denominators can also be applied to the feature expectations, so we focus on the computation of the denominators.

39 State-of-the-Art Implementation
Della Pietra et al. suggest summing only over histories actually observed in the training data, which needs O(#{observed histories} · |V|) operations, whereas the straightforward implementation needs O(|V|^3). SWBD: 6 billion vs. 8 trillion operations (~1300-fold reduction). Moreover, for a given history, only a few words have conditional (bigram or trigram) features activated.

41 State-of-the-Art Implementation (cont’d) Unigram-Caching
Unigram caching (Della Pietra et al.): cache the history-independent sum of unigram factors, \sum_{w} \alpha_1(w), once per iteration; for each history, only the words with active bigram or trigram features need to be summed explicitly. Complexity: O(|V| + \sum_{\text{histories}} \#\{w : \text{conditional features active}\}). In practice this is far smaller than #{observed histories} · |V|; e.g. SWBD: 120 million vs. 6 billion operations (~50-fold reduction).
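A Python sketch of the unigram-caching decomposition, assuming (as is typical for count-derived feature sets) that every word with a trigram feature for a history also has the corresponding bigram feature; the data structures are illustrative, not the original implementation.

```python
# Sketch: Z(h) = sum_w alpha1 + corrections for the few words with conditional features.
from collections import defaultdict

def make_normalizer(alpha1, alpha2, alpha3):
    """alpha1: {w: a} for every vocab word (1.0 if no unigram feature);
    alpha2: {(w1, w): a}; alpha3: {(w2, w1, w): a}."""
    unigram_sum = sum(alpha1.values())          # cached once per GIS iteration
    bigram_words = defaultdict(list)            # w1 -> words w with a bigram feature
    for (w1, w) in alpha2:
        bigram_words[w1].append(w)
    trigram_words = defaultdict(list)           # (w2, w1) -> words w with a trigram feature
    for (w2, w1, w) in alpha3:
        trigram_words[(w2, w1)].append(w)

    def Z(w2, w1):
        # Start from the cached unigram sum, then correct only the words whose
        # bigram/trigram features fire for this particular history.
        z = unigram_sum
        for w in bigram_words[w1]:
            z += alpha1[w] * (alpha2[(w1, w)] - 1.0)
        for w in trigram_words[(w2, w1)]:
            # assumes a matching bigram feature exists for every trigram feature
            z += alpha1[w] * alpha2[(w1, w)] * (alpha3[(w2, w1, w)] - 1.0)
        return z
    return Z
```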

42 Hierarchical Training
Unigram caching is still too "slow" for large corpora, e.g. Broadcast News. Hierarchical training reorganizes the computation over the N-gram hierarchy: the unigram sum is computed once, each bigram correction is computed once per observed bigram and shared by all histories ending in that word, and only the trigram corrections are history-specific. The computational complexity is O(#{observed N-grams}), which is the same as that of training a back-off trigram model. This method can be extended to general N-gram models, with the training time per iteration essentially the same as that of the empirical (count-based) estimation. A sketch follows below.
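The following sketch reflects my reading of the hierarchical idea from these slides, not the original code: per-history normalizers are assembled from cached sums at the unigram, bigram and trigram levels, so one iteration touches each observed N-gram once. Names and data structures are illustrative.

```python
# Sketch: hierarchical computation of the normalizers, O(#observed n-grams) per iteration.
from collections import defaultdict

def hierarchical_normalizers(alpha1, alpha2, alpha3, histories):
    """Return {(w2, w1): Z} for the observed histories."""
    unigram_sum = sum(alpha1.values())                     # one pass over the vocabulary

    bigram_corr = defaultdict(float)                       # one pass over bigram features
    for (w1, w), a2 in alpha2.items():
        bigram_corr[w1] += alpha1[w] * (a2 - 1.0)

    trigram_corr = defaultdict(float)                      # one pass over trigram features
    for (w2, w1, w), a3 in alpha3.items():
        # assumes, as before, that every trigram feature has a matching bigram feature
        trigram_corr[(w2, w1)] += alpha1[w] * alpha2[(w1, w)] * (a3 - 1.0)

    # Assemble Z for each observed history from the three cached levels.
    return {(w2, w1): unigram_sum + bigram_corr[w1] + trigram_corr[(w2, w1)]
            for (w2, w1) in histories}
```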

45 Speed-up of the Hierarchical Training Method
Baseline: unigram caching (Della Pietra et al.). The hierarchical training method achieves a nominal speed-up of two orders of magnitude for Switchboard and three orders of magnitude for Broadcast News, and a real speed-up of 30-fold for SWBD and 85-fold for BN.

46 Feature Hierarchy
[Figure: the feature hierarchy, showing the N-gram model features nested inside the syntactic model features.]
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z} \, \alpha_1(w_i)^{g_1(w_i)} \, \alpha_2(w_{i-1}, w_i)^{g_2(w_{i-1}, w_i)} \, \alpha_3(h_{i-1}, w_i)^{g_3(h_{i-1}, w_i)} \, \alpha_4(nt_{i-1}, w_i)^{g_4(nt_{i-1}, w_i)} \, \alpha_5(w_{i-2}, w_{i-1}, w_i)^{g_5(w_{i-2}, w_{i-1}, w_i)} \, \alpha_6(h_{i-2}, h_{i-1}, w_i)^{g_6(h_{i-2}, h_{i-1}, w_i)} \, \alpha_7(nt_{i-2}, nt_{i-1}, w_i)^{g_7(nt_{i-2}, nt_{i-1}, w_i)}

47 Generalized Hierarchical Training
Syntactic model:
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z} \prod_{k=1}^{7} \alpha_k(\cdot)^{g_k(\cdot)},
with unigram, bigram and trigram features over words, head words and non-terminals; generalized hierarchical training applies the same level-by-level caching to each of these feature hierarchies.

48 Speed-up of the Generalized Hierarchical Training
Evaluation is based on the syntactic model. The generalized hierarchical training method achieves a nominal speed-up of about two orders of magnitude for both Switchboard and Broadcast News, and a real speed-up of 17-fold on Switchboard. Training the syntactic model even for the 14M-word subset of Broadcast News is infeasible without GHT, and it still needs 480 CPU-hours per iteration after the speed-up.

49 Training Topic Models and Composite Models by Hierarchical Training and Divide-and-Conquer
Topic-dependent models can be trained by divide-and-conquer: partition the training data into parts, train on each part, and then collect the partial feature expectations (a sketch follows below).
Divide-and-conquer can be used together with hierarchical training.
Real training time for topic-dependent models: SWBD: 21 CPU-hours; BN: ~85 CPU-hours.
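A minimal sketch of the divide-and-conquer bookkeeping, assuming each partition contributes additive partial feature expectations that are merged afterwards; the record format and feature names are illustrative choices, not the thesis data structures.

```python
# Sketch: accumulate feature expectations per partition, then merge the partial sums.
from collections import Counter

def partial_expectations(part):
    """Accumulate feature expectations over one partition of the training data.
    `part` is a list of (active_features, weight) records, e.g. the data for one topic."""
    acc = Counter()
    for active_features, weight in part:
        for f in active_features:
            acc[f] += weight
    return acc

def combined_expectations(parts):
    """Process each part independently (possibly on different CPUs), then merge."""
    total = Counter()
    for part in parts:                 # each call is independent, so it parallelizes trivially
        total.update(partial_expectations(part))   # expectations are additive across parts
    return total

# Toy example: two "topics", features named by arbitrary tuples.
parts = [
    [([("u", "market"), ("b", "stock", "market")], 0.4)],
    [([("u", "game")], 0.3)],
]
print(combined_expectations(parts))
```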

50 Simplify the Computation in Calculating Probabilities Using ME Models
Converting ME N-gram models to ARPA back-off format (pre-normalization).
Approximating the denominator for topic-dependent models: 400+ fold speed-up, with almost the same speech recognition accuracy.
Caching the most recently accessed histories: 5-fold speed-up.
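As a toy illustration of the history-caching idea at recognition time (the exact normalizer computation is model-specific and therefore passed in as a parameter here):

```python
# Sketch: memoize the normalizer for recently seen histories so repeated
# N-best rescoring queries for the same history reuse the cached value.
from functools import lru_cache

def make_cached_normalizer(compute_z):
    """Wrap an expensive normalizer computation with an LRU cache keyed by history."""
    @lru_cache(maxsize=100_000)
    def z(w2, w1):
        return compute_z(w2, w1)
    return z

# Usage sketch: z = make_cached_normalizer(lambda w2, w1: some_expensive_sum(w2, w1))
```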

51 Concluding Remarks
Accuracy
ME models with non-local dependencies achieve lower WERs in speech recognition than the baseline back-off models.
ME models are more accurate than the corresponding back-off or interpolated models with the same constraints.
Flexibility
The ME method can combine constraints from different information sources in a single, well-founded model.
ME models can be extended further by adding new features.
Efficiency
The training and use of ME models can be done efficiently.
ME models are more compact than their interpolated counterparts.

54 Concluding Remarks (Cont.)
Speech recognition performance has been improved by using ME models.
The computational requirements for estimating a large class of maximum entropy models have been vastly simplified.
The computation in using ME models is also reduced by three means: pre-normalization, denominator approximation and history caching (not discussed).

Switchboard:
  Model      Nominal speed-up   Real speed-up   PPL     WER
  Trigram    170                30              79      38.5%
  Topic      400                50+             -7%     -0.7%
  Syntax     88                 17              -6%     -1.0%
  Composite  -                  -               -13%    -1.5%

Broadcast News:
  Model      Nominal speed-up   Real speed-up   PPL     WER
  Trigram    560                85              174     34.6%
  Topic      1400               -               -10%    -0.6%
  Syntax     150                -               -7%*    -0.7%*
  Composite  -                  -               -18%*   -1.2%*

*14M words of BN data.

55 Future Work
Multilingual topic-dependent models.
Syntactic parsing.
Other applications: question answering, automatic summarization, information retrieval.

56 Acknowledgement
I thank my advisor Sanjeev Khudanpur for giving me valuable advice and generous help whenever necessary; my dissertation work could not have been finished without his help. I thank David Yarowsky and Fred Jelinek for their help during my Ph.D. program. I thank Radu Florian and David Yarowsky for their help on topic detection and data clustering; Ciprian Chelba, Frederick Jelinek and Peng Xu for providing the syntactic parser; Shankar Kumar, Veera Venkataramani, Dimitra Vergyri and Vlasios Doumpiotis for their help on generating N-best lists; and Andreas Stolcke for providing the SRI LM tools and the WS97 baseline training setup. This work was partially sponsored by an NSF STIMULATE grant.

57 Publications
Wu & Khudanpur. Building a Topic-Dependent Maximum Entropy Language Model for Very Large Corpora. ICASSP 2002.
Wu & Khudanpur. Efficient Training Methods for Maximum Entropy Language Modeling. ICSLP 2000.
Khudanpur & Wu. Maximum Entropy Techniques for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Modeling. Computer Speech and Language, 2000.
Wu & Khudanpur. Syntactic Heads in Statistical Language Modeling. ICASSP 2000.
Wu & Khudanpur. Combining Nonlocal, Syntactic and N-Gram Dependencies in Language Modeling. Eurospeech 1999.
Khudanpur & Wu. A Maximum Entropy Language Model Integrating N-Grams and Topic Dependencies for Conversational Speech Recognition. ICASSP 1999.
Brill & Wu. Classifier Combination for Improved Lexical Disambiguation. COLING-ACL 1998.
Kim, Khudanpur & Wu. Smoothing Issues in the Structured Language Model. Eurospeech 2001.

