Using Neural Network Language Models for LVCSR
Holger Schwenk and Jean-Luc Gauvain
Presented by Erin Fitzgerald, CLSP Reading Group, December 10, 2004

Introduction
- Build and use neural networks to estimate LM posterior probabilities for ASR tasks
- Idea: project word indices onto a continuous space
- The resulting smooth probability functions of the word representations generalize better to unseen n-grams
- Still an n-gram approach, but posteriors are interpolated for any possible context; no backing off
- Result: significant WER reduction at small computational cost

Architecture (network diagram)
- Standard fully connected multilayer perceptron
- Input: the context words w_{j-n+1}, w_{j-n+2}, ..., w_{j-1}, each an index into a vocabulary of N = 51k words
- Projection layer c: P = 50
- Hidden layer d: H ≈ 1k, with weights M and biases b
- Output layer o: N units, with weights V and biases k
- Outputs: p_i = P(w_j = i | h_j) for i = 1, ..., N

Architecture: layer equations
- Hidden layer: d = tanh(M*c + b)
- Output layer: o = V*d + k
- Outputs: p_i = P(w_j = i | h_j) for i = 1, ..., N, obtained by normalizing o with a softmax
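As a rough illustration of the architecture, here is a minimal pure-Python sketch of one forward pass. It assumes a linear output layer followed by a softmax; the projection table `R`, the helper `rand_matrix`, and the toy sizes are hypothetical choices for the example (the slides give N = 51k, P = 50, H ≈ 1k).

```python
import math
import random

def forward(context, R, M, b, V, k):
    """Return P(w_j = i | h_j) for every word i, given the context word indices."""
    # Projection layer: concatenate the P-dimensional vectors of the context words.
    c = [x for w in context for x in R[w]]
    # Hidden layer: d = tanh(M*c + b)
    d = [math.tanh(sum(m * x for m, x in zip(row, c)) + bj) for row, bj in zip(M, b)]
    # Output layer: o = V*d + k
    o = [sum(v * x for v, x in zip(row, d)) + ki for row, ki in zip(V, k)]
    # Softmax normalization (shifted by the max for numerical stability).
    top = max(o)
    e = [math.exp(x - top) for x in o]
    z = sum(e)
    return [x / z for x in e]

def rand_matrix(rows, cols, rng):
    """Small random weights; a hypothetical initialization for the example."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N, P, H, n = 10, 4, 6, 4             # toy sizes, not the sizes from the slides
R = rand_matrix(N, P, rng)           # projection table: one P-dim vector per word
M = rand_matrix(H, (n - 1) * P, rng) # projection -> hidden weights
b = [0.0] * H
V = rand_matrix(N, H, rng)           # hidden -> output weights
k = [0.0] * N

p = forward([3, 7, 1], R, M, b, V, k)  # three context words for a 4-gram model
print(len(p), sum(p))
```

The returned list has one probability per vocabulary word, which is why the output layer dominates the cost when N is large.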

Training
- Train with the standard back-propagation algorithm
- Error function: cross entropy, with weight decay regularization
- Targets are set to 1 for the next word w_j and to 0 otherwise
- With these targets, the outputs have been shown to converge to posterior probabilities
- Back-propagation continues through the projection layer, so the NN learns the projection of words onto the continuous space that is best for the probability estimation task
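The training criterion can be sketched in a few lines. This example assumes a softmax output layer, for which the cross-entropy gradient with respect to the output activations is simply p - t; weight decay and the rest of back-propagation are left out, and all names are illustrative.

```python
import math

def softmax(o):
    """Normalize output activations into probabilities."""
    top = max(o)
    e = [math.exp(x - top) for x in o]
    z = sum(e)
    return [x / z for x in e]

def cross_entropy(p, target):
    """E = -sum_i t_i * log(p_i); with 1-of-N targets this is -log(p_target)."""
    return -math.log(p[target])

def output_gradient(p, target):
    """dE/do_i = p_i - t_i for softmax outputs with cross-entropy error."""
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

o = [0.5, -1.0, 2.0, 0.0]   # toy output activations
target = 2                   # index of the word that actually followed the context
p = softmax(o)
loss = cross_entropy(p, target)
grad = output_gradient(p, target)
print(loss, grad)
```

The gradient is negative only for the target word, so each update pushes probability mass toward the observed word and away from all others.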
Optimizations

Fast Recognition Techniques
1) Lattice rescoring
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization

Fast Recognition Techniques: 1) Lattice rescoring
- Decode with a standard backoff LM to build lattices

Fast Recognition Techniques: 2) Shortlists
- The NN predicts only a high-frequency subset of the vocabulary
- The NN redistributes the probability mass of the shortlist words

Shortlist optimization (network diagram)
- Same network, with the output layer reduced to the S shortlist words
- Outputs: p_i = P(w_j = i | h_j) for i = 1, ..., S
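One way to read "redistributes the probability mass of shortlist words" is sketched below. It is an assumption about the combination rule, not a quote from the slides: the NN distribution over the shortlist is scaled by the mass the backoff LM assigns to the shortlist in the same context, and all other words keep their backoff probability. The toy vocabulary and numbers are made up.

```python
def combined_prob(word, p_nn, p_backoff, shortlist):
    """P(word | h): NN estimate for shortlist words, backoff LM estimate otherwise."""
    # Mass that the backoff LM gives to all shortlist words in this context.
    shortlist_mass = sum(p_backoff[w] for w in shortlist)
    if word in shortlist:
        # The NN distribution sums to 1 over the shortlist, so rescale it.
        return p_nn[word] * shortlist_mass
    return p_backoff[word]

# Toy distributions for a single context h (hypothetical numbers).
p_backoff = {"the": 0.5, "cat": 0.3, "sat": 0.15, "zyzzyva": 0.05}
shortlist = {"the", "cat"}
p_nn = {"the": 0.7, "cat": 0.3}

total = sum(combined_prob(w, p_nn, p_backoff, shortlist) for w in p_backoff)
print(total)
```

Under this rule the combined distribution still sums to 1 over the full vocabulary, because the NN only reshuffles the mass that the backoff LM had already given to the shortlist.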

Fast Recognition Techniques: 3) Regrouping (an optimization of lattice rescoring)
- Collect and sort the LM probability requests
- All probability requests with the same context h_t need only one forward pass
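The regrouping idea can be sketched as follows: requests are sorted and grouped by context, and the network is evaluated once per distinct context. `toy_forward` is a hypothetical stand-in for the NN forward pass, and the request format (context tuple, word) is an assumption made for the example.

```python
from collections import defaultdict

def regroup(requests):
    """Group (context, word) requests so that each distinct context appears once."""
    by_context = defaultdict(list)
    for context, word in sorted(requests):
        by_context[context].append(word)
    return by_context

def score(requests, forward):
    """Answer every request with a single forward pass per distinct context."""
    probs = {}
    for context, words in regroup(requests).items():
        dist = forward(context)  # one forward pass gives the whole distribution
        for w in words:
            probs[(context, w)] = dist[w]
    return probs

calls = []

def toy_forward(context):
    """Hypothetical stand-in for the NN; records how often it is called."""
    calls.append(context)
    return {"a": 0.6, "b": 0.4}

requests = [
    (("x", "y", "z"), "a"),
    (("x", "y", "z"), "b"),
    (("p", "q", "r"), "a"),
    (("x", "y", "z"), "a"),  # repeated request, answered from the same pass
]
probs = score(requests, toy_forward)
print(len(requests), "requests,", len(calls), "forward passes")
```

Because one forward pass yields probabilities for every predicted word, the number of passes depends on the number of distinct contexts rather than the number of requests.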

Fast Recognition Techniques: 4) Block mode
- Several examples are propagated through the NN at once
- Takes advantage of faster matrix operations

Block mode calculations
- One example at a time (vectors c, d, o): d = tanh(M*c + b), o = V*d + k
- A block of examples at once (matrices C, D, O, one example per column): D = tanh(M*C + B), O = V*D + K
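A small sketch shows that the block form computes exactly the same hidden activations as the per-example form. It uses plain Python lists, so it demonstrates only the equivalence; the speed gain comes from handing the matrix-matrix product to an optimized linear algebra routine. The helper names and toy values are illustrative.

```python
import math

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def hidden_single(M, c, b):
    """d = tanh(M*c + b) for one example vector c."""
    return [math.tanh(sum(m * x for m, x in zip(row, c)) + bj) for row, bj in zip(M, b)]

def hidden_block(M, C, b):
    """D = tanh(M*C + B), where C holds one example per column and
    B repeats the bias vector b in every column."""
    MC = matmul(M, C)
    return [[math.tanh(v + bj) for v in row] for row, bj in zip(MC, b)]

M = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]  # toy weights (2 hidden units, 3 inputs)
b = [0.05, -0.05]
examples = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0]]

C = [list(col) for col in zip(*examples)]  # stack the examples as columns
D = hidden_block(M, C, b)
D_cols = [list(col) for col in zip(*D)]    # back to one row per example
singles = [hidden_single(M, c, b) for c in examples]
print(D_cols == singles)
```

The same stacking applies to the output layer, O = V*D + K.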

Fast Recognition: Test Results
1) Lattice rescoring: 511 nodes on average
2) Shortlists (2000 words): 90% prediction coverage; 3.8M 4-grams requested, 3.4M processed by the NN
3) Regrouping: only 1M forward passes required
4) Block mode: bunch size = 128
5) CPU optimization
- Total processing time: < 9 min (0.03xRT)
- Without optimizations: 10x slower

Fast Training Techniques
1) Parallel implementations
  - Full connections require low latency; very costly
2) Resampling techniques
  - Optimized floating point operations work best with contiguous memory locations

Fast Training Techniques
1) Floating point precision: 1.5x faster
2) Suppress internal calculations: 1.3x faster
3) Bunch mode: 10+x faster
  - Forward + back propagation for many examples at once
4) Multiprocessing: 1.5x faster
- Training time: 47 hours → 1h27m with bunch size 128
Application to CTS and BN LVCSR

Application to ASR
- Neural net LM techniques focus on CTS because:
  - Far less in-domain training data is available, which leads to data sparsity
  - The NN can only handle a small amount of training data
- New Fisher CTS data: 20M words (vs. 7M)
- BN data: 500M words

Application to CTS
- Baseline: train standard backoff LMs for each domain, then interpolate them
- Expt #1: interpolate the CTS neural net LM with the in-domain backoff LM
- Expt #2: interpolate the CTS neural net LM with the full-data backoff LM
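The interpolation used in these experiments can be sketched as a per-word linear mixture, evaluated by perplexity. The interpolation weight `lam` and the probabilities below are made-up toy values; the slides do not give the weights that were actually used.

```python
import math

def interpolate(p_nn, p_backoff, lam):
    """Linear interpolation of the NN LM and backoff LM probability of one word."""
    return lam * p_nn + (1.0 - lam) * p_backoff

def perplexity(probs):
    """Perplexity of a word sequence, given the probability assigned to each word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-word probabilities on a four-word held-out text.
p_nn = [0.20, 0.05, 0.10, 0.30]
p_bo = [0.10, 0.08, 0.02, 0.25]
lam = 0.5  # assumed weight; in practice it would be tuned on held-out data

mixed = [interpolate(a, b, lam) for a, b in zip(p_nn, p_bo)]
print(perplexity(p_bo), perplexity(p_nn), perplexity(mixed))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is the quantity the PPL comparisons report.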

Application to CTS: PPL
- Baseline (standard backoff LMs): in-domain PPL 50.1; full-data PPL 47.5
- Expt #1 (CTS neural net LM + in-domain backoff LM): in-domain PPL 45.5
- Expt #2 (CTS neural net LM + full-data backoff LM): full-data PPL 44.2

Application to CTS: WER
- Baseline (standard backoff LMs): in-domain WER 19.9; full-data WER 19.3
- Expt #1 (CTS neural net LM + in-domain backoff LM): in-domain WER 19.1
- Expt #2 (CTS neural net LM + full-data backoff LM): full-data WER 18.8
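To put the CTS numbers side by side, the relative reductions can be computed directly from the reported values. The pairing of each experiment with its matching baseline (in-domain with in-domain, full data with full data) is an assumption about how the comparison is meant to be read.

```python
def relative_reduction(baseline, new):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (baseline - new) / baseline

# (baseline, with neural net LM), taken from the PPL and WER slides.
results = {
    "PPL, in-domain (Expt #1)": (50.1, 45.5),
    "PPL, full data (Expt #2)": (47.5, 44.2),
    "WER, in-domain (Expt #1)": (19.9, 19.1),
    "WER, full data (Expt #2)": (19.3, 18.8),
}

for name, (base, new) in results.items():
    print(f"{name}: {relative_reduction(base, new):.1f}% relative reduction")
```

This gives roughly 9.2% and 6.9% relative PPL reduction, and roughly 4.0% and 2.6% relative WER reduction (0.8 and 0.5 absolute).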

Application to BN
- Only a 27M-word subset of the 500M available words could be used for training
- Still useful:
  - NN LM gave a 12% PPL gain over the backoff LM on the small 27M-word set
  - NN LM gave a 4% PPL gain over the backoff LM on the full 500M-word training set
- Overall WER reduction of 0.3% absolute

Conclusion
- Neural net LMs provide significant improvements in PPL and WER
- Optimizations speed up NN training by 20x and allow lattice rescoring in less than 0.05xRT
- While the NN LM was developed for and works best with CTS, gains were found in the BN task too