Sequence Student-Teacher Training of Deep Neural Networks Jeremy H. M. Wong and Mark J. F. Gales Department of Engineering, University of Cambridge Trumpington Street, CB2 1PZ Cambridge, England
Outline Ensemble methods Student-teacher training Integrating sequence training Experimental results Conclusions
Ensemble methods
Ensemble combinations can give significant gains over single systems when training data is limited.
Computational demand of decoding scales linearly with the ensemble size.
Diversity
More diverse ensembles tend to give better combination gains: members correct each other's errors and cover a wider space of possible models.
Classical ways of introducing diversity include bagging, random decision trees and AdaBoost.
Here, diversity is introduced through different random DNN initialisations.
Combinations
Frame level: linear average of frame posteriors,
$P(s_{rt} \mid o_{rt}, \Phi) = \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$
Hypothesis level: MBR combination decoding,
$h_r^* = \arg\min_{h_r'} \sum_{h_r} \mathcal{L}(h_r, h_r') \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$
Frame combination only requires processing of a single lattice.
Hypothesis combination does not require synchronous states.
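As an illustration of the two combination rules above, a minimal numpy sketch follows. The array layouts, the helper names `frame_combination` and `mbr_combination`, and the explicit loss callback are assumptions for illustration, not the paper's implementation (which operates on lattices rather than in-memory arrays).

```python
# Minimal sketch of the two combination schemes (illustrative only).
import numpy as np

def frame_combination(frame_posteriors, alphas):
    """Linear average of per-frame state posteriors.

    frame_posteriors: (M, T, S) array -- M ensemble members, T frames, S states.
    alphas: (M,) interpolation weights.
    """
    alphas = np.asarray(alphas).reshape(-1, 1, 1)
    return (alphas * np.asarray(frame_posteriors)).sum(axis=0)

def mbr_combination(hypotheses, hyp_posteriors, betas, loss):
    """MBR combination decoding over a common hypothesis list.

    hyp_posteriors: (M, H) posteriors of H hypotheses under each member.
    loss(h, h_prime): hypothesis-level loss, e.g. Levenshtein distance.
    Returns the hypothesis with minimum expected loss under the
    beta-weighted ensemble posterior.
    """
    combined = np.einsum('m,mh->h', np.asarray(betas), np.asarray(hyp_posteriors))
    expected = [sum(p * loss(h, hp) for hp, p in zip(hypotheses, combined))
                for h in hypotheses]
    return hypotheses[int(np.argmin(expected))]
```

Note how the frame-level rule only needs a single averaged posterior per frame, whereas the MBR rule works purely on hypothesis posteriors, so the ensemble members never need to share synchronous state alignments.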
Student-teacher training
General framework, often used to compress a large model into a smaller one.
Here, an ensemble of teachers is used, and only the single student model needs to be decoded.
We consider the nature of the information propagated to the student.
Frame-level student-teacher training
The standard method propagates frame posterior information.
Minimise the KL-divergence between frame posteriors,
$\mathcal{C}_{CE} = -\sum_r \sum_t \sum_{s_{rt}} P^*_{CE}(s_{rt}) \log P(s_{rt} \mid o_{rt}, \Theta)$
The target distribution is
$P^*_{CE}(s_{rt}) = (1-\lambda)\,\delta(s_{rt}, s^*_{rt}) + \lambda \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$
$\lambda = 0$ reduces to the cross-entropy criterion.
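A minimal sketch of this criterion, assuming in-memory posterior arrays and hypothetical helper names (`ce_student_teacher_targets`, `kl_frame_loss`). Minimising the KL divergence is equivalent to minimising the cross-entropy term below, since the entropy of the target distribution does not depend on the student parameters $\Theta$.

```python
# Sketch of the frame-level student-teacher criterion (assumed tensor
# layouts; not the authors' code). The targets interpolate the hard
# cross-entropy labels with the averaged teacher posteriors.
import numpy as np

def ce_student_teacher_targets(hard_labels, teacher_posteriors, alphas, lam):
    """hard_labels: (T,) state indices; teacher_posteriors: (M, T, S)."""
    tp = np.asarray(teacher_posteriors)
    M, T, S = tp.shape
    one_hot = np.eye(S)[hard_labels]                       # delta(s_rt, s*_rt)
    avg = np.einsum('m,mts->ts', np.asarray(alphas), tp)   # teacher average
    return (1.0 - lam) * one_hot + lam * avg               # P*_CE

def kl_frame_loss(targets, student_log_posteriors):
    """Cross-entropy part of the KL divergence; the target entropy is
    constant with respect to the student parameters."""
    return -np.sum(targets * student_log_posteriors)
```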
Integrating Sequence Training
Sequence training outperforms cross-entropy training.
We want to integrate sequence training into student-teacher training.
Possible ways:
Ensemble training: sequence train the teachers.
Student refinement: sequence train the student.
Information transfer: propagate sequence-level information.
Hypothesis-level student-teacher training
Propagate hypothesis posterior information.
Minimise the KL-divergence between hypothesis posteriors,
$\mathcal{C}_{MMI} = -\sum_r \sum_{h_r} P^*_{MMI}(h_r) \log P(h_r \mid O_r, \Theta)$
The target distribution is
$P^*_{MMI}(h_r) = (1-\eta)\,\delta(h_r, h_r^*) + \eta \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$
$\eta = 0$ reduces to the MMI criterion.
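A sketch of the interpolated hypothesis-level targets over a shared N-best list. The N-best list and the teacher hypothesis posteriors are assumed to be computed elsewhere, and the function name is hypothetical.

```python
# Sketch of the hypothesis-level target distribution P*_MMI over a
# shared N-best list (illustrative helper, not the paper's code).
import numpy as np

def mmi_student_teacher_targets(nbest, best_hyp, teacher_hyp_posteriors,
                                betas, eta):
    """nbest: list of H hypotheses; best_hyp: the reference hypothesis h*_r;
    teacher_hyp_posteriors: (M, H) hypothesis posteriors per teacher."""
    delta = np.array([1.0 if h == best_hyp else 0.0 for h in nbest])
    combined = np.einsum('m,mh->h', np.asarray(betas),
                         np.asarray(teacher_hyp_posteriors))
    return (1.0 - eta) * delta + eta * combined             # P*_MMI
```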
Computing the gradient
SGD gradient:
$\frac{\partial \mathcal{C}_{MMI}}{\partial a_{s_{rt}}} = \gamma \Big[ P(s_{rt} \mid O_r, \Theta) - (1-\eta) P(s_{rt} \mid h_r^*, O_r, \Theta) - \eta \sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta) \Big]$
Computation of $\sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta)$:
Use N-best lists.
If the student lattice is determinised and not regenerated, then $P(s_{rt} \mid h_r, O_r, \Theta)$ is a $\delta$-function that does not change with training iterations.
It can therefore be pre-computed once and stored.
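A sketch of this pre-computation, assuming the fixed N-best list has already been force-aligned so that each hypothesis maps to a one-hot state sequence per frame; the function name, alignment format and tensor layouts are illustrative assumptions.

```python
# Sketch of pre-computing the last term of the gradient over an N-best
# list. With a determinised, fixed student N-best list, the state
# alignment of each hypothesis is a delta-function, so the per-frame
# posteriors P(s_rt | h_r, O_r, Theta) reduce to fixed one-hot vectors
# that can be cached once (the alignment code is assumed to exist).
import numpy as np

def precompute_sequence_targets(alignments, teacher_hyp_posteriors, betas,
                                num_states):
    """alignments: list of length H, each a (T,) array of state indices
    for one hypothesis; teacher_hyp_posteriors: (M, H)."""
    combined = np.einsum('m,mh->h', np.asarray(betas),
                         np.asarray(teacher_hyp_posteriors))
    T = len(alignments[0])
    targets = np.zeros((T, num_states))
    for w, ali in zip(combined, alignments):
        targets[np.arange(T), ali] += w   # sum_h of beta-weighted one-hots
    return targets                         # fixed across training iterations
```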
Datasets
IARPA Babel Tok Pisin (IARPA-babel207b-v1.0e): 3 hour VLLP training set, 10 hour development set.
WSJ: 14 hour si-84 training set, 64K-word open-vocabulary eval-92 test set.
Experimental Setup
Ensemble size: 10 (Tok Pisin), 4 (WSJ).
Acoustic model: DNN-HMM hybrid, 1000 nodes × 4 layers for Tok Pisin, 2000 nodes × 6 layers for WSJ.
The student and teacher models have the same architecture.
Combination of sequence-trained teachers
(Table: Tok Pisin WERs, %)
Training the teachers with sequence criteria improves the combined ensemble performance.
Ensemble training
(Table: Tok Pisin WERs, %)
Gains from sequence training of the teachers can be transferred to the student models.
Student refinement
Further sequence training of the student can bring additional gains.
Information transfer
Hypothesis level: gains from sequence training of the teachers can be transferred to the student models by propagating hypothesis posteriors.
Conclusions
Summary:
Investigated integrating sequence training into student-teacher training.
Proposed a new method to propagate hypothesis posteriors.
All three methods of integrating sequence training are complementary to student-teacher training.
Future work:
Investigate other forms of information to propagate, e.g. expected loss.