1
Sequence Student-Teacher Training of Deep Neural Networks
Jeremy H. M. Wong and Mark J. F. Gales Department of Engineering, University of Cambridge Trumpington Street, CB2 1PZ Cambridge, England
2
Outline
Ensemble methods
Student-teacher training
Integrating sequence training
Experimental results
Conclusions
3
Ensemble methods
Ensemble combinations can give significant gains over single systems, especially when training data is limited.
However, the computational demand of decoding scales linearly with the ensemble size.
4
Diversity
More diverse ensembles tend to give better combination gains:
Members correct each other's errors.
A wider space of possible models is explored, e.g. bagging, random decision trees, and AdaBoost.
Here, diversity is introduced through different random DNN initialisations.
5
Combinations
Frame level: linear average of frame posteriors,
$$P(s_{ut} \mid o_{ut}, \Phi) = \sum_{m=1}^{M} \alpha_m P(s_{ut} \mid o_{ut}, \Phi_m)$$
Hypothesis level: MBR combination decoding,
$$h_u^* = \arg\min_{h_u'} \sum_{h_u} \mathcal{L}(h_u, h_u') \sum_{m=1}^{M} \beta_m P(h_u \mid O_u, \Phi_m)$$
Frame combination only requires processing of a single lattice.
Hypothesis combination does not require synchronous states.
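Below is a minimal numpy sketch of the two combination schemes, assuming each teacher provides per-frame state posteriors of shape (T, S) and that all teachers score a shared N-best list; the function names and the default uniform interpolation weights are hypothetical, not from the paper.

```python
# A minimal numpy sketch of the two combination schemes, assuming each
# teacher provides per-frame state posteriors of shape (T, S) and that
# all teachers score a shared N-best list. Function names and the
# default uniform interpolation weights are hypothetical.
import numpy as np

def combine_frame_posteriors(teacher_posts, alpha=None):
    """Frame level: P(s_t | o_t, Phi) = sum_m alpha_m P(s_t | o_t, Phi_m)."""
    M = len(teacher_posts)
    alpha = np.full(M, 1.0 / M) if alpha is None else np.asarray(alpha)
    return np.tensordot(alpha, np.stack(teacher_posts), axes=1)  # (T, S)

def mbr_combine(nbest, teacher_hyp_posts, beta, loss):
    """Hypothesis level: h* = argmin_h' sum_h L(h, h') sum_m beta_m P(h | O, Phi_m)."""
    p = np.tensordot(np.asarray(beta), np.stack(teacher_hyp_posts), axes=1)  # (N,)
    risks = [sum(p[i] * loss(h, h_prime) for i, h in enumerate(nbest))
             for h_prime in nbest]
    return nbest[int(np.argmin(risks))]
```

With `loss` taken as, for example, a word-level Levenshtein distance between hypotheses, `mbr_combine` implements the MBR decoding rule above over the shared N-best list.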
6
Student-teacher training
A general framework, originally used to compress a large model.
Here, an ensemble of teachers is used instead.
Only the single student model needs to be decoded.
Consider the nature of the information propagated to the student.
7
Frame-level student-teacher training
The standard method propagates frame posterior information.
Minimise the KL-divergence between frame posteriors,
$$\mathcal{F}_{CE} = -\sum_{u} \sum_{t} \sum_{s_{ut}} P^*_{CE}(s_{ut}) \log P(s_{ut} \mid o_{ut}, \Theta)$$
The target distribution is
$$P^*_{CE}(s_{ut}) = (1 - \lambda)\,\delta(s_{ut}, s^*_{ut}) + \lambda \sum_{m=1}^{M} \alpha_m P(s_{ut} \mid o_{ut}, \Phi_m)$$
λ = 0 reduces to the cross-entropy criterion.
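A minimal numpy sketch of this criterion for one utterance, assuming the combined teacher posteriors come from the earlier combination sketch; argument names are hypothetical.

```python
# A minimal numpy sketch of the frame-level student-teacher criterion for
# one utterance, assuming the combined teacher posteriors come from the
# earlier combination sketch. Argument names are hypothetical.
import numpy as np

def frame_st_loss(student_log_post, teacher_post, hard_labels, lam):
    """F_CE = -sum_t sum_s P*_CE(s) log P(s | o_t, Theta).

    student_log_post: (T, S) student log frame posteriors
    teacher_post:     (T, S) combined teacher posteriors
    hard_labels:      (T,)   forced-alignment state indices s*_t
    lam:              interpolation weight; lam = 0 gives standard CE
    """
    T, S = student_log_post.shape
    one_hot = np.zeros((T, S))
    one_hot[np.arange(T), hard_labels] = 1.0
    target = (1.0 - lam) * one_hot + lam * teacher_post
    return -np.sum(target * student_log_post)
```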
8
Integrating Sequence Training
Sequence training outperforms cross-entropy training.
We want to integrate sequence training into student-teacher training. Possible ways:
Ensemble training: sequence train the teachers.
Student refinement: sequence train the student.
Information transfer: propagate sequence-level information.
9
Integrating Sequence Training
10
Hypothesis-level student-teacher training
Propagate hypothesis posterior information.
Minimise the KL-divergence between hypothesis posteriors,
$$\mathcal{F}_{MMI} = -\sum_{u} \sum_{h_u} P^*_{MMI}(h_u) \log P(h_u \mid O_u, \Theta)$$
The target distribution is
$$P^*_{MMI}(h_u) = (1 - \lambda)\,\delta(h_u, h_u^*) + \lambda \sum_{m=1}^{M} \beta_m P(h_u \mid O_u, \Phi_m)$$
λ = 0 reduces to the MMI criterion.
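A minimal numpy sketch of the hypothesis-level criterion for one utterance over an N-best list, assuming the student and teacher hypothesis posteriors are both normalised over the same list; names are hypothetical.

```python
# A minimal numpy sketch of the hypothesis-level criterion for one
# utterance, over an N-best list, assuming student and teacher hypothesis
# posteriors are both normalised over the same list. Names are hypothetical.
import numpy as np

def hyp_st_loss(student_log_post, teacher_post, ref_index, lam):
    """F_MMI = -sum_h P*_MMI(h) log P(h | O, Theta).

    student_log_post: (N,) student log hypothesis posteriors
    teacher_post:     (N,) combined posteriors sum_m beta_m P(h | O, Phi_m)
    ref_index:        position of the reference hypothesis h* in the list
    lam:              interpolation weight; lam = 0 gives standard MMI
    """
    target = lam * np.asarray(teacher_post, dtype=float)
    target[ref_index] += 1.0 - lam
    return -np.sum(target * student_log_post)
```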
11
Computing the gradient
SGD gradient:
$$\frac{\partial \mathcal{F}_{MMI}}{\partial a_{s_{ut}}} = \gamma(s_{ut} \mid O_u, \Theta) - (1 - \lambda) P(s_{ut} \mid h_u^*, O_u, \Theta) - \lambda \sum_{h_u} \sum_{m=1}^{M} \beta_m P(h_u \mid O_u, \Phi_m) P(s_{ut} \mid h_u, O_u, \Theta)$$
Computing $\sum_{h_u} \sum_{m=1}^{M} \beta_m P(h_u \mid O_u, \Phi_m) P(s_{ut} \mid h_u, O_u, \Theta)$: use N-best lists.
If the student lattice is determinised and not regenerated, then $P(s_{ut} \mid h_u, O_u, \Theta)$ is a δ-function that does not change with training iterations.
It can be pre-computed once and stored.
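A sketch of the gradient's target occupancies under the simplification on this slide: with a fixed, determinised N-best list, each hypothesis maps to a single cached state sequence, so the δ-function alignments are computed once and reused every iteration. All names here are hypothetical.

```python
# A sketch of the gradient's target occupancies under the simplification
# on this slide: with a fixed, determinised N-best list, each hypothesis
# maps to a single cached state sequence, so P(s_t | h, O, Theta) is a
# delta-function computed once and stored. All names are hypothetical.
import numpy as np

def mmi_st_grad(gamma, ref_states, nbest, teacher_post, state_seqs, lam):
    """Return the per-frame, per-state gradient: gamma minus the target occupancy.

    gamma:        (T, S) student lattice occupancies gamma(s_t | O, Theta)
    ref_states:   (T,)   cached delta-alignment of the reference h*
    nbest:        list of N hypotheses
    teacher_post: (N,)   combined posteriors sum_m beta_m P(h | O, Phi_m)
    state_seqs:   dict: hypothesis -> (T,) alignment, precomputed once
    """
    T, S = gamma.shape
    target = np.zeros((T, S))
    target[np.arange(T), ref_states] += 1.0 - lam
    for h, p in zip(nbest, teacher_post):
        target[np.arange(T), state_seqs[h]] += lam * p
    return gamma - target
```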
12
Datasets
IARPA Babel Tok Pisin (IARPA-babel207b-v1.0e):
3 hour VLLP training set
10 hour development set
WSJ:
14 hour si-84 training set
64K-word open-vocabulary eval-92 test set
13
Experimental Setup
Ensemble size = 10 (Tok Pisin), 4 (WSJ)
Acoustic model = DNN-HMM hybrid:
1000 nodes × 4 layers for Tok Pisin
2000 nodes × 6 layers for WSJ
The student and teacher models have the same architecture.
14
Combination of sequence-trained teachers
[Table: Tok Pisin WERs (%)]
Training the teachers with sequence criteria improves the combined ensemble performance.
15
Ensemble training
[Table: Tok Pisin WERs (%)]
Gains from sequence training of the teachers can be transferred to the student models
16
Student refinement Further sequence training can bring additional gains
17
Information transfer Hypothesis level
Propagating hypothesis posteriors transfers the teachers' sequence-training gains to the student model.
18
Conclusions
Summary:
Investigated integrating sequence training into student-teacher training.
Proposed a new method to propagate hypothesis posteriors.
All three methods of integrating sequence training are complementary to student-teacher training.
Future work:
Investigate other forms of information to propagate, e.g. expected loss.