Sequence Student-Teacher Training of Deep Neural Networks


1 Sequence Student-Teacher Training of Deep Neural Networks
Jeremy H. M. Wong and Mark J. F. Gales, Department of Engineering, University of Cambridge, Trumpington Street, CB2 1PZ Cambridge, England

2 Outline
- Ensemble methods
- Student-teacher training
- Integrating sequence training
- Experimental results
- Conclusions

3 Ensemble methods
Ensemble combinations can give significant gains over single systems when training data is limited.
However, the computational demand of decoding scales linearly with the ensemble size.

4 Diversity
More diverse ensembles tend to give better combination gains:
- Members correct each other's errors
- A wider space of possible models is covered
Common approaches: bagging, random decision trees and AdaBoost.
Here, diversity is introduced through different random DNN initialisations (see the sketch below).
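To make the last point concrete, a minimal Python sketch (not the authors' code; layer sizes and helper names are purely illustrative) of building an ensemble whose members differ only in their random initialisation seed:

```python
import numpy as np

def init_dnn(seed, layer_sizes=(440, 1000, 1000, 1000, 1000)):
    """Randomly initialise one DNN's weight matrices from a given seed.

    Layer sizes are placeholders; only the differing seeds matter here.
    """
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.01, size=(n_in, n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# An ensemble of M = 10 members that differ only in their random initialisation.
ensemble = [init_dnn(seed=m) for m in range(10)]
```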

5 Combinations
Frame level: linear average of frame posteriors,
$$P(s_{rt} \mid o_{rt}, \Phi) = \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$$
Hypothesis level: MBR combination decoding,
$$h_r^* = \arg\min_{h_r'} \sum_{h_r} \mathcal{L}(h_r, h_r') \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$$
Frame combination only requires processing of a single lattice.
Hypothesis combination does not require synchronous states.

6 Student-teacher training
A general framework, commonly used to compress a large model into a smaller one. Here, an ensemble of teachers is used, and only the single student model needs to be decoded. Consider the nature of the information propagated to the student.

7 Frame-level student-teacher training
The standard method propagates frame posterior information: minimise the KL-divergence between frame posteriors,
$$\mathcal{C}_{CE} = - \sum_r \sum_t \sum_{s_{rt}} P^*_{CE}(s_{rt}) \log P(s_{rt} \mid o_{rt}, \Theta)$$
The target distribution is
$$P^*_{CE}(s_{rt}) = (1 - \lambda)\, \delta(s_{rt}, s^*_{rt}) + \lambda \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$$
λ = 0 reduces to the cross-entropy criterion.
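A small numpy sketch of the interpolated target P*_CE and the resulting criterion for one utterance (shapes and function names are hypothetical; λ, the α_m weights and the forced-alignment labels s*_rt follow the definitions above):

```python
import numpy as np

def frame_st_targets(hard_labels, teacher_posteriors, alphas, lam):
    """P*_CE: interpolate one-hot alignment targets with averaged teacher posteriors.

    hard_labels: forced-alignment state indices s*_rt, shape (T,).
    teacher_posteriors: shape (M, T, S).
    alphas: teacher weights, shape (M,); lam: interpolation weight (lam=0 -> plain CE).
    """
    M, T, S = teacher_posteriors.shape
    one_hot = np.eye(S)[hard_labels]                          # delta(s_rt, s*_rt)
    avg = np.tensordot(alphas, teacher_posteriors, axes=1)    # sum_m alpha_m P(s_rt | o_rt, Phi_m)
    return (1.0 - lam) * one_hot + lam * avg

def frame_st_loss(student_log_posteriors, targets):
    """C_CE = -sum_t sum_s P*_CE(s_rt) log P(s_rt | o_rt, Theta), for one utterance."""
    return -(targets * student_log_posteriors).sum()
```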

8 Integrating Sequence Training
Sequence training outperforms cross-entropy training, so we want to integrate sequence training into student-teacher training. Possible ways:
- Ensemble training: sequence train the teachers
- Student refinement: sequence train the student
- Information transfer: propagate sequence information

9 Integrating Sequence Training

10 Hypothesis-level student-teacher training
Propagate hypothesis posterior information: minimise the KL-divergence between hypothesis posteriors,
$$\mathcal{C}_{MMI} = - \sum_r \sum_{h_r} P^*_{MMI}(h_r) \log P(h_r \mid O_r, \Theta)$$
The target distribution is
$$P^*_{MMI}(h_r) = (1 - \eta)\, \delta(h_r, h^*_r) + \eta \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$$
η = 0 reduces to the MMI criterion.
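The same idea at the hypothesis level, sketched over an N-best list (a simplification: the paper works with lattice posteriors; all names and shapes here are illustrative):

```python
import numpy as np

def hyp_st_targets(num_hyps, ref_index, teacher_hyp_posteriors, betas, eta):
    """P*_MMI over an N-best list: interpolate the reference delta with teacher posteriors.

    num_hyps: number N of hypotheses in the list for utterance r.
    ref_index: position of the reference hypothesis h*_r in the list.
    teacher_hyp_posteriors: shape (M, N), each row P(h_r | O_r, Phi_m) over the N-best list.
    betas: teacher weights, shape (M,); eta=0 reduces to standard MMI targets.
    """
    delta = np.zeros(num_hyps)
    delta[ref_index] = 1.0                                     # delta(h_r, h*_r)
    avg = np.tensordot(betas, teacher_hyp_posteriors, axes=1)  # sum_m beta_m P(h_r | O_r, Phi_m)
    return (1.0 - eta) * delta + eta * avg

def hyp_st_loss(student_log_hyp_posteriors, targets):
    """C_MMI = -sum_{h_r} P*_MMI(h_r) log P(h_r | O_r, Theta), for one utterance."""
    return -(targets * student_log_hyp_posteriors).sum()
```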

11 Computing the gradient
SGD gradient:
$$\frac{\partial \mathcal{C}_{MMI}}{\partial a_{s_{rt}}} = \gamma \Big[ P(s_{rt} \mid O_r, \Theta) - (1 - \eta)\, P(s_{rt} \mid h^*_r, O_r, \Theta) - \eta \sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)\, P(s_{rt} \mid h_r, O_r, \Theta) \Big]$$
Computation of $\sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta)$:
- Use N-best lists.
- If the student lattice is determinised and not regenerated, then $P(s_{rt} \mid h_r, O_r, \Theta)$ is a δ-function that does not change with training iterations.
- It can therefore be pre-computed once and stored (see the sketch below).
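A sketch of that pre-computation: under the stated assumption that P(s_rt | h_r, O_r, Θ) is a fixed δ-function alignment over a determinised student lattice, the whole teacher-weighted term can be computed once and cached (data layout and names are hypothetical):

```python
import numpy as np

def precompute_teacher_term(nbest_alignments, teacher_hyp_posteriors, betas, num_states):
    """Pre-compute sum_{h_r} sum_m beta_m P(h_r|O_r,Phi_m) P(s_rt|h_r,O_r,Theta) once.

    nbest_alignments: array (N, T) of state indices -- the delta-function alignments
                      of each N-best hypothesis against the determinised student lattice.
    teacher_hyp_posteriors: (M, N) hypothesis posteriors from the teachers.
    Returns an array (T, S): the teacher-weighted state occupancy for every frame.
    """
    N, T = nbest_alignments.shape
    hyp_weights = np.tensordot(betas, teacher_hyp_posteriors, axes=1)   # (N,)
    term = np.zeros((T, num_states))
    for n in range(N):
        # delta-function alignment: each hypothesis puts all its mass on one state per frame
        term[np.arange(T), nbest_alignments[n]] += hyp_weights[n]
    return term   # stored once, reused unchanged at every training iteration
```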

12 Datasets
IARPA Babel Tok Pisin (IARPA-babel207b-v1.0e):
- 3 hour VLLP training set
- 10 hour development set
WSJ:
- 14 hour si-84 training set
- 64K word open-vocabulary eval-92 test set

13 Experimental Setup
Ensemble size: 10 (Tok Pisin), 4 (WSJ)
Acoustic model: DNN-HMM hybrid
- 1000 nodes × 4 layers for Tok Pisin
- 2000 nodes × 6 layers for WSJ
The student and teacher models have the same architecture.

14 Combination of sequence-trained teachers
[Table: Tok Pisin WERs (%)]
Training teachers with sequence criteria improves the combined ensemble performance.

15 Ensemble training
[Table: Tok Pisin WERs (%)]
Gains from sequence training of the teachers can be transferred to the student models.

16 Student refinement
Further sequence training of the student can bring additional gains.

17 Information transfer: hypothesis level
Gains from sequence training of the teachers can be transferred to the student models.

18 Conclusions
Summary:
- Investigated integrating sequence training into student-teacher training.
- Proposed a new method to propagate hypothesis posteriors.
- All three methods of integrating sequence training are complementary to student-teacher training.
Future work:
- Investigate other forms of information to propagate, e.g. expected loss.

