Sequence Student-Teacher Training of Deep Neural Networks


1 Sequence Student-Teacher Training of Deep Neural Networks
Jeremy H. M. Wong and Mark J. F. Gales, Department of Engineering, University of Cambridge, Trumpington Street, CB2 1PZ Cambridge, England

2 Outline
- Ensemble methods
- Student-teacher training
- Integrating sequence training
- Experimental results
- Conclusions

3 Ensemble methods
Ensemble combinations can give significant gains over single systems when training data is limited.
However, the computational demand of decoding scales linearly with the ensemble size.

4 Diversity
More diverse ensembles tend to give better combination gains:
- Members correct each other's errors
- A wider space of possible models is covered
Common approaches: bagging, random decision trees and AdaBoost.
Here, diversity is introduced through different random DNN initialisations (see the sketch below).
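To make the last point concrete, a minimal Python sketch (not the authors' code; layer sizes and helper names are purely illustrative) of building an ensemble whose members differ only in their random initialisation seed:

```python
import numpy as np

def init_dnn(seed, layer_sizes=(440, 1000, 1000, 1000, 1000)):
    """Randomly initialise one DNN's weight matrices from a given seed.

    Layer sizes are placeholders; only the differing seeds matter here.
    """
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.01, size=(n_in, n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# An ensemble of M = 10 members that differ only in their random initialisation.
ensemble = [init_dnn(seed=m) for m in range(10)]
```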

5 Combinations
Frame level: linear average of frame posteriors,
$$P(s_{rt} \mid o_{rt}, \Phi) = \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$$
Hypothesis level: MBR combination decoding,
$$h_r^* = \arg\min_{h_r'} \sum_{h_r} \mathcal{L}(h_r, h_r') \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$$
Frame combination only requires processing of a single lattice.
Hypothesis combination does not require synchronous states.

6 Student-teacher training
A general framework, commonly used to compress a large model into a smaller one. Here, an ensemble of teachers is used, and only the single student model needs to be decoded. Consider the nature of the information propagated to the student.

7 Frame-level student-teacher training
The standard method propagates frame posterior information: minimise the KL-divergence between frame posteriors,
$$\mathcal{C}_{CE} = - \sum_r \sum_t \sum_{s_{rt}} P^*_{CE}(s_{rt}) \log P(s_{rt} \mid o_{rt}, \Theta)$$
The target distribution is
$$P^*_{CE}(s_{rt}) = (1 - \lambda)\, \delta(s_{rt}, s^*_{rt}) + \lambda \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$$
λ = 0 reduces to the cross-entropy criterion.
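A small numpy sketch of the interpolated target P*_CE and the resulting criterion for one utterance (shapes and function names are hypothetical; λ, the α_m weights and the forced-alignment labels s*_rt follow the definitions above):

```python
import numpy as np

def frame_st_targets(hard_labels, teacher_posteriors, alphas, lam):
    """P*_CE: interpolate one-hot alignment targets with averaged teacher posteriors.

    hard_labels: forced-alignment state indices s*_rt, shape (T,).
    teacher_posteriors: shape (M, T, S).
    alphas: teacher weights, shape (M,); lam: interpolation weight (lam=0 -> plain CE).
    """
    M, T, S = teacher_posteriors.shape
    one_hot = np.eye(S)[hard_labels]                          # delta(s_rt, s*_rt)
    avg = np.tensordot(alphas, teacher_posteriors, axes=1)    # sum_m alpha_m P(s_rt | o_rt, Phi_m)
    return (1.0 - lam) * one_hot + lam * avg

def frame_st_loss(student_log_posteriors, targets):
    """C_CE = -sum_t sum_s P*_CE(s_rt) log P(s_rt | o_rt, Theta), for one utterance."""
    return -(targets * student_log_posteriors).sum()
```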

8 Integrating Sequence Training
Sequence training outperforms cross-entropy training, so we want to integrate sequence training into student-teacher training. Possible ways:
- Ensemble training: sequence train the teachers
- Student refinement: sequence train the student
- Information transfer: propagate sequence information

9 Integrating Sequence Training

10 Hypothesis-level student-teacher training
Propagate hypothesis posterior information: minimise the KL-divergence between hypothesis posteriors,
$$\mathcal{C}_{MMI} = - \sum_r \sum_{h_r} P^*_{MMI}(h_r) \log P(h_r \mid O_r, \Theta)$$
The target distribution is
$$P^*_{MMI}(h_r) = (1 - \eta)\, \delta(h_r, h^*_r) + \eta \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$$
η = 0 reduces to the MMI criterion.
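The same idea at the hypothesis level, sketched over an N-best list (a simplification: the paper works with lattice posteriors; all names and shapes here are illustrative):

```python
import numpy as np

def hyp_st_targets(num_hyps, ref_index, teacher_hyp_posteriors, betas, eta):
    """P*_MMI over an N-best list: interpolate the reference delta with teacher posteriors.

    num_hyps: number N of hypotheses in the list for utterance r.
    ref_index: position of the reference hypothesis h*_r in the list.
    teacher_hyp_posteriors: shape (M, N), each row P(h_r | O_r, Phi_m) over the N-best list.
    betas: teacher weights, shape (M,); eta=0 reduces to standard MMI targets.
    """
    delta = np.zeros(num_hyps)
    delta[ref_index] = 1.0                                     # delta(h_r, h*_r)
    avg = np.tensordot(betas, teacher_hyp_posteriors, axes=1)  # sum_m beta_m P(h_r | O_r, Phi_m)
    return (1.0 - eta) * delta + eta * avg

def hyp_st_loss(student_log_hyp_posteriors, targets):
    """C_MMI = -sum_{h_r} P*_MMI(h_r) log P(h_r | O_r, Theta), for one utterance."""
    return -(targets * student_log_hyp_posteriors).sum()
```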

11 Computing the gradient
SGD gradient:
$$\frac{\partial \mathcal{C}_{MMI}}{\partial a_{s_{rt}}} = \gamma \Big[ P(s_{rt} \mid O_r, \Theta) - (1 - \eta)\, P(s_{rt} \mid h^*_r, O_r, \Theta) - \eta \sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)\, P(s_{rt} \mid h_r, O_r, \Theta) \Big]$$
Computation of $\sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta)$:
- Use N-best lists.
- If the student lattice is determinised and not regenerated, then $P(s_{rt} \mid h_r, O_r, \Theta)$ is a δ-function that does not change with training iterations.
- It can therefore be pre-computed once and stored (see the sketch below).
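A sketch of that pre-computation: under the stated assumption that P(s_rt | h_r, O_r, Θ) is a fixed δ-function alignment over a determinised student lattice, the whole teacher-weighted term can be computed once and cached (data layout and names are hypothetical):

```python
import numpy as np

def precompute_teacher_term(nbest_alignments, teacher_hyp_posteriors, betas, num_states):
    """Pre-compute sum_{h_r} sum_m beta_m P(h_r|O_r,Phi_m) P(s_rt|h_r,O_r,Theta) once.

    nbest_alignments: array (N, T) of state indices -- the delta-function alignments
                      of each N-best hypothesis against the determinised student lattice.
    teacher_hyp_posteriors: (M, N) hypothesis posteriors from the teachers.
    Returns an array (T, S): the teacher-weighted state occupancy for every frame.
    """
    N, T = nbest_alignments.shape
    hyp_weights = np.tensordot(betas, teacher_hyp_posteriors, axes=1)   # (N,)
    term = np.zeros((T, num_states))
    for n in range(N):
        # delta-function alignment: each hypothesis puts all its mass on one state per frame
        term[np.arange(T), nbest_alignments[n]] += hyp_weights[n]
    return term   # stored once, reused unchanged at every training iteration
```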

12 Datasets
IARPA Babel Tok Pisin (IARPA-babel207b-v1.0e):
- 3 hour VLLP training set
- 10 hour development set
WSJ:
- 14 hour si-84 training set
- 64K word open-vocabulary eval-92 test set

13 Experimental Setup
Ensemble size: 10 (Tok Pisin), 4 (WSJ)
Acoustic model: DNN-HMM hybrid
- 1000 nodes × 4 layers for Tok Pisin
- 2000 nodes × 6 layers for WSJ
The student and teacher models have the same architecture.

14 Combination of sequence-trained teachers
[Table: Tok Pisin WERs (%)]
Training teachers with sequence criteria improves the combined ensemble performance.

15 Ensemble training
[Table: Tok Pisin WERs (%)]
Gains from sequence training of the teachers can be transferred to the student models.

16 Student refinement
Further sequence training of the student can bring additional gains.

17 Information transfer: hypothesis level
Gains from sequence training of the teachers can be transferred to the student models.

18 Conclusions
Summary:
- Investigated integrating sequence training into student-teacher training.
- Proposed a new method to propagate hypothesis posteriors.
- All three methods of integrating sequence training are complementary to student-teacher training.
Future work:
- Investigate other forms of information to propagate, e.g. expected loss.

