
Slide 1: A Study of Some Factors Impacting SuperARV Language Modeling
Wen Wang (1), Andreas Stolcke (1), Mary P. Harper (2)
(1) Speech Technology & Research Laboratory, SRI International
(2) School of Electrical and Computer Engineering, Purdue University
EARS STT Workshop, March 24, 2005

Slide 2: Motivation
- The RT-03 SuperARV LM gave excellent results using a backoff N-gram approximation [ICASSP'04 paper]
- The N-gram backoff approximation of the RT-04 SuperARV LM did not generalize to the RT-04 evaluation test set:
  – Dev04: 1.0% absolute WER reduction over the baseline LM
  – Eval04: no WER gain (in fact, a small loss)
- The RT-04 SARV LM was developed under considerable time pressure:
  – The training procedure is very time consuming (weeks to months) because the training data must be syntactically parsed
  – There was no time to examine all design choices in combination
- Goal: reexamine all design decisions in detail

Slide 3: What Changed?
RT-04 SARV training differed from RT-03 in two respects:
- Retrained the Charniak parser on a combination of the Switchboard Penn Treebank and the Wall Street Journal (WSJ) Penn Treebank
  – The 2003 parser was trained on the WSJ Treebank only
- Built a SuperARV LM with additional modifiee lexical feature constraints ("Standard+" model)
  – The 2003 LM was a SuperARV LM without these additional constraints ("Standard" model)
Both changes had given improvements at various points, but were never tested in complete systems on the new Fisher data.

Slide 4: Plan of Attack
- Revisit changes to the training procedure
  – Check their effect on old and new data sets and systems
- Revisit the backoff N-gram approximation
  – Did we just get lucky in 2003?
  – Evaluate the full SuperARV LM in N-best rescoring
  – Find better approximations
- Start by going back to the 2003 LM, then move to the current system
- Validate the training software (and document and release it)
- Work in progress: we are holding out on eval04 testing to avoid implicit tuning

Slide 5: Perplexity of RT-03 LMs
All LMs were trained on the RT-03 LM training data. LM types tested:
- "Word": word backoff 4-gram, Kneser-Ney smoothed
- "SARV N-gram": N-gram approximation to the standard SuperARV LM
- "SARV Standard": full SuperARV LM (without the additional constraints)

Test set    Word    SARV N-gram   SARV Standard
dev2001     64.34   53.74         52.70
eval2003    70.80   56.25         54.18
dev2004     63.45   62.87         56.97

The full model's gains are smaller on dev04, and the N-gram approximation breaks down there.
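
For reference, the numbers above are the standard per-word perplexity. A minimal sketch of how such numbers are computed from a model's per-word log probabilities (illustrative code, not tied to the toolkit actually used):

def perplexity(logprobs_base10):
    """Per-word perplexity from base-10 log probabilities:
    PPL = 10 ** (-(sum of log10 p) / N). Lower is better."""
    n = len(logprobs_base10)
    return 10 ** (-sum(logprobs_base10) / n)

print(perplexity([-1.2, -0.8, -2.1, -1.5]))  # ~25.1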

Slide 6: N-best Rescoring with the Full SuperARV LM
- Evaluated the full Standard SARV LM in final N-best rescoring (the rescoring step is sketched below)
- Based on the PLP subsystem of the RT-03 CTS system
- Full SARV rescoring is expensive, so we tried increasingly longer N-best lists:
  – Top-50
  – Top-500
  – Top-2000 (the maximum used in the eval system)
- Early passes (including MLLR) use the baseline LM, so gains will be limited
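
A minimal sketch of N-best rescoring, assuming a hypothetical lm_logprob function that scores a word sequence under the full SuperARV LM; the score combination (LM weight, word insertion penalty) follows common practice, and the weight values here are placeholders, not those of the actual system:

def rescore_nbest(nbest, lm_logprob, lm_weight=8.0, wip=0.0):
    """Rerank an N-best list. Each hypothesis is a (words, acoustic_logprob)
    pair; the hypothesis with the best combined score is returned."""
    def combined(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * lm_logprob(words) + wip * len(words)
    return max(nbest, key=combined)

# e.g. best_words, _ = rescore_nbest(top500_list, sarv_logprob)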

Slide 7: RT-03 LM N-best Rescoring Results (WER, %)
- Standard SuperARV reduces WER on eval02 and eval03
- No gain on dev04
- Identical gains on eval03-SWB and eval03-Fisher
- The SuperARV gain increases with a larger hypothesis space

            Top-50            Top-500           Top-2000
Test set    Word   SARV Std   Word   SARV Std   Word   SARV Std
eval2002    26.7   26.1       26.6   25.8       26.3   25.6
eval2003    ---    ---        26.4   26.1       ---    ---
dev2004     18.2   18.2       18.1   18.1       ---    ---

Slide 8: Adding Modifiee Constraints
- The constraints enforced by a Constraint Dependency Grammar (CDG), on which SuperARV is based, can be strengthened by using modifiee information in unary and binary constraints
- We expected this information to improve the SuperARV LM
- In RT-04 development, we explored using only the modifiee's lexical category in the LM, adding it to the SuperARV tag structure (see the sketch below)
- This reduced perplexity and WER in early experiments
- But: the additional tag constraints could have hurt LM generalization!
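
Purely as an illustration of the change (the real SuperARV tag inventory is much richer than shown here, and these field names are invented), the Standard+ model amounts to extending each tag with the modifiee's lexical category:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SuperARVTag:
    """Illustrative only: real SuperARV tags encode CDG role values,
    positional constraints, and lexical features in more detail."""
    lex_cat: str                        # lexical category of the word
    role_values: Tuple[str, ...]        # abstracted dependency-role values
    modifiee_cat: Optional[str] = None  # the field added in Standard+

standard      = SuperARVTag("noun", ("np",))                       # Standard
standard_plus = SuperARVTag("noun", ("np",), modifiee_cat="verb")  # Standard+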

Slide 9: Perplexity with Modifiee Constraints
- Trained a SuperARV LM augmented with modifiee lexical features on the RT-03 LM data ("Standard+" model)
- The Standard+ model reduces perplexity on the dev2001 and eval2003 test sets (relative to Standard)
- But not on the Fisher (dev04) test set!

Test set    Word N-gram   SARV N-gram   SARV Standard   SARV Standard+
dev2001     64.34         53.74         52.70           51.35
eval2003    70.80         56.25         54.18           53.09
dev2004     63.45         62.87         56.97           57.53

Slide 10: N-best Rescoring with Modifiee Constraints (WER, %)
- WER reductions are consistent with the perplexity results
- No improvement on dev04

            Top-50                          Top-500
Test set    Word   SARV Std   SARV Std+    Word   SARV Std   SARV Std+
eval2002    26.7   26.1       26.0         26.6   25.8       25.6
eval2003    ---    ---        ---          26.4   26.1       25.8
dev2004     18.2   18.2       18.2         18.1   18.1       18.1

Slide 11: In-domain vs. Out-of-domain Parser Training
- SuperARVs are collected from CDG parses, which are obtained by transforming CFG parses
- The CFG parses are generated using existing state-of-the-art parsers
- In 2003: CTS data was parsed with a parser trained on the Wall Street Journal Treebank (out-of-domain parser)
- In 2004: obtained a trainable version of the Charniak parser and retrained it on a combination of the Switchboard Treebank and the WSJ Treebank (in-domain parser)
  – Expected improved consistency and accuracy of the parse structures
  – However, there were bugs in that retraining; these are fixed for the current experiment

Slide 12: Rescoring Results with the In-domain Parser
- Reparsed the RT-03 LM training data with the in-domain parser
- Retrained the Standard SuperARV model ("Standard-retrained")
- Same N-best rescoring setup as before
- In-domain parsing helps
- Also: the number of distinct SuperARV tags was reduced by retraining (improved parser consistency)

Top-500 rescoring WER (%) on eval2002:
Word N-gram   SARV Standard   SARV Standard+   SARV Standard-retrained   SARV Standard-retrained+
26.6          25.8            25.6             25.6                      25.4

Slide 13: Summary So Far
- Prior design decisions have been validated:
  – Adding modifiee constraints helps the LM on matched data
  – Reparsing with the retrained in-domain parser improves LM quality
- Next: reexamine the approximation used in decoding (work in progress)

Slide 14: N-best Rescoring with the RT-04 Full Standard+ Model
- The RT-04 model is the "Standard+" model (includes modifiee constraints)
- RT-04 had been built with the in-domain parser
- Caveat: the old parser runs were affected by some (not catastrophic) bugs; we still need to reparse the RT-04 LM training data (significantly more than the RT-03 data)
- WER improved, but with smaller gains than on the older test sets
- Gains improve with more hypotheses
- This suggests the need for a better approximation, to enable use of the SuperARV LM in search

dev2004 WER (%):
            Top-50            Top-500
Test set    Word   SARV Std+  Word   SARV Std+
dev2004     18.0   17.8       17.9   17.6

Slide 15: Original N-gram Approximation Algorithm
Algorithm description (a rough sketch in code follows below):
1. For each N-gram observed in the training data (whose SuperARV tags are known), compute its probability using the Standard or Standard+ SuperARV LM, generating a new LM after renormalization.
2. For each such N-gram w1...wn, with tags t1...tn:
   a. Extract the short-SuperARV sequence (a subset of the components of each SuperARV) from t1...tn, denoted st1...stn.
   b. Using the lexicon constructed during training, find the word sequences that share the short-SuperARV sequence st1...stn.
   c. From these, select N-grams that do not occur in the training data and that, when added, reduce perplexity on a held-out set (or increase it by less than a threshold).
3. The resulting LM can be pruned to make its size comparable to a word-based LM.
Problems:
- If the held-out set is small, the algorithm overfits.
- If the held-out set is large, the algorithm is slow.
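
A rough sketch of the above; all arguments are hypothetical interfaces standing in for the real training code:

def approximate_lm(train_ngrams, sarv_lm_logprob, short_superarv,
                   lexicon, heldout_ppl, slack):
    """Sketch of the original approximation. Hypothetical interfaces:
      train_ngrams     -- observed (words, tags) n-gram pairs
      sarv_lm_logprob  -- log probability of a word n-gram under the
                          full SuperARV LM
      short_superarv   -- reduces a SuperARV tag to its short form
      lexicon          -- maps a short-SuperARV sequence to the word
                          sequences sharing it
      heldout_ppl      -- perplexity of a candidate LM on the held-out set
    """
    # Step 1: score every observed n-gram with the full SuperARV LM.
    lm = {words: sarv_lm_logprob(words) for words, tags in train_ngrams}
    # Step 2: propose unseen n-grams that share a short-SuperARV sequence
    # with an observed one; keep a candidate only if held-out perplexity
    # drops, or rises by less than the slack threshold.
    for words, tags in train_ngrams:
        short_seq = tuple(short_superarv(t) for t in tags)
        for cand in lexicon.get(short_seq, ()):
            if cand in lm:
                continue
            trial = dict(lm)
            trial[cand] = sarv_lm_logprob(cand)
            if heldout_ppl(trial) <= heldout_ppl(lm) + slack:
                lm = trial
    # Step 3 (renormalization, and pruning to word-LM size) is left to
    # the caller in this sketch.
    return lm

The repeated held-out perplexity evaluations in the inner loop make the small-vs-large held-out-set tradeoff noted above directly visible.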

Slide 16: Revised N-gram Approximation for SuperARV LMs
Idea: build a test-set-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri], including all N-grams that "matter" to the decoder.
Method (sketched in code below):
- Step 1: run first-pass decoding on the test set with a word-based LM and generate HTK lattices
- Step 2: extract N-grams from the HTK lattices; prune based on posterior counts
- Step 3: compute conditional probabilities for these N-grams using the standard SuperARV LM
- Step 4: compute backoff weights based on the conditional probabilities
- Step 5: apply the resulting N-gram LM in all subsequent decoding passes (using standard tools)
Some approximation remains:
  – due to the pruning in Step 2
  – from using only the N-gram context, not the full sentence prefix
Drawback: Step 3 takes significant compute time (currently ~10x real time, not yet optimized for speed)
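
A sketch of Steps 2-4, with the same caveat: the interfaces shown are hypothetical stand-ins, and the backoff-weight computation (Step 4) is described only schematically:

def lattice_ngram_lm(ngram_posteriors, cond_logprob, min_posterior=1e-3):
    """Sketch of the revised approximation. Hypothetical interfaces:
      ngram_posteriors -- iterable of (ngram, posterior) pairs pooled
                          from the first-pass HTK lattices (ngram is a
                          tuple of words)
      cond_logprob     -- cond_logprob(word, context) under the full
                          SuperARV LM
    """
    # Step 2: accumulate the total posterior count of each n-gram, then
    # prune n-grams whose posterior mass is too small to matter.
    counts = {}
    for ngram, posterior in ngram_posteriors:
        counts[ngram] = counts.get(ngram, 0.0) + posterior
    kept = [g for g, c in counts.items() if c >= min_posterior]
    # Step 3: conditional probabilities from the full SuperARV LM (the
    # expensive step; roughly 10x real time in the current setup).
    probs = {g: cond_logprob(g[-1], g[:-1]) for g in kept}
    # Step 4: backoff weights would now be chosen so that, for each
    # context, the explicit probabilities plus the backed-off mass sum
    # to one, yielding an ordinary backoff LM that standard decoding
    # tools can load. That computation is omitted from this sketch.
    return probs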

Slide 17: Lattice N-gram Approximation Experiment
- Based on the RT-03 Standard SuperARV LM
- Extracted N-grams from first-pass HTK lattices
- Pruned N-grams with total posterior count < 10^-3
- Left with 3.6M N-grams on a 6-hour test set
- RT-02/03 experiment:
  – uses 2003 acoustic models
  – 2000-best rescoring (of the first pass)
- Dev-04 experiment:
  – uses 2004 acoustic models
  – lattice rescoring (of the first pass)

Slide 18: Lattice N-gram Approximation Results (WER, %)
- 1.2% absolute gain on the old (matched) test sets
- Small 0.2% gain on the Fisher (mismatched) test set
- Recall: there was previously no Fisher gain with N-best rescoring
- Better exploitation of the full hypothesis space pays off

Test set    Word N-gram   SARV Lattice N-gram
eval2002    32.1          30.9
eval2003    32.1          30.9
dev2004     20.7          20.5

Slide 19: Conclusions and Future Work
- There is a tradeoff between the generality and the selectivity of a SuperARV model, much as we observed in our past CDG grammar induction experiments:
  – Making a model more constrained may reduce its generality
  – Modifiee lexical features help strengthen constraints for word prediction, but they may need more, or better matched, training data
  – We need a better understanding of the interaction between this knowledge source and the characteristics of the training data, e.g., the Fisher domain
- For a structured model like the SuperARV model, it pays to improve the quality of the training syntactic structures, e.g., by making them less errorful and more consistent:
  – We observed an LM win from better parses (using the retrained parser)
  – Further gains can be expected from advances in parse accuracy

Slide 20: Conclusions and Future Work (cont.)
- The old N-gram approximation was flawed
- The new N-gram approximation looks promising but needs more work:
  – tests using the full system
  – the rescoring algorithm needs speeding up
- Still to do: reparse the current CTS LM training set
- Longer term: investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework

