Sequence Student-Teacher Training of Deep Neural Networks Jeremy H. M. Wong and Mark J. F. Gales Department of Engineering, University of Cambridge Trumpington Street, CB2 1PZ Cambridge, England
Outline Ensemble methods Student-teacher training Integrating sequence training Experimental results Conclusions
Ensemble methods
Ensemble combinations can give significant gains over single systems when training data is limited.
Computational demand of decoding scales linearly with the ensemble size.
Diversity
More diverse ensembles tend to give better combination gains: members correct each other's errors and cover a wider space of possible models.
Classical ways of introducing diversity include bagging, random decision trees and AdaBoost.
Here, diversity is introduced through different random DNN initialisations.
Combinations
Frame level: linear average of frame posteriors,
$P(s_{rt} \mid o_{rt}, \Phi) = \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$
Hypothesis level: MBR combination decoding,
$h_r^* = \arg\min_{h_r'} \sum_{h_r} \mathcal{L}(h_r, h_r') \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$
Frame combination only requires processing of a single lattice.
Hypothesis combination does not require synchronous states.
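As an illustration of the two combination rules above, a minimal numpy sketch follows. The array layouts, the helper names `frame_combination` and `mbr_combination`, and the explicit loss callback are assumptions for illustration, not the paper's implementation (which operates on lattices rather than in-memory arrays).

```python
# Minimal sketch of the two combination schemes (illustrative only).
import numpy as np

def frame_combination(frame_posteriors, alphas):
    """Linear average of per-frame state posteriors.

    frame_posteriors: (M, T, S) array -- M ensemble members, T frames, S states.
    alphas: (M,) interpolation weights.
    """
    alphas = np.asarray(alphas).reshape(-1, 1, 1)
    return (alphas * np.asarray(frame_posteriors)).sum(axis=0)

def mbr_combination(hypotheses, hyp_posteriors, betas, loss):
    """MBR combination decoding over a common hypothesis list.

    hyp_posteriors: (M, H) posteriors of H hypotheses under each member.
    loss(h, h_prime): hypothesis-level loss, e.g. Levenshtein distance.
    Returns the hypothesis with minimum expected loss under the
    beta-weighted ensemble posterior.
    """
    combined = np.einsum('m,mh->h', np.asarray(betas), np.asarray(hyp_posteriors))
    expected = [sum(p * loss(h, hp) for hp, p in zip(hypotheses, combined))
                for h in hypotheses]
    return hypotheses[int(np.argmin(expected))]
```

Note how the frame-level rule only needs a single averaged posterior per frame, whereas the MBR rule works purely on hypothesis posteriors, so the ensemble members never need to share synchronous state alignments.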
Student-teacher training
General framework, often used to compress a large model into a smaller one.
Here, an ensemble of teachers is used, and only the single student model needs to be decoded.
We consider the nature of the information propagated to the student.
Frame-level student-teacher training
The standard method propagates frame posterior information.
Minimise the KL-divergence between frame posteriors,
$\mathcal{C}_{CE} = -\sum_r \sum_t \sum_{s_{rt}} P^*_{CE}(s_{rt}) \log P(s_{rt} \mid o_{rt}, \Theta)$
The target distribution is
$P^*_{CE}(s_{rt}) = (1-\lambda)\,\delta(s_{rt}, s^*_{rt}) + \lambda \sum_{m=1}^{M} \alpha_m P(s_{rt} \mid o_{rt}, \Phi_m)$
$\lambda = 0$ reduces to the cross-entropy criterion.
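A minimal sketch of this criterion, assuming in-memory posterior arrays and hypothetical helper names (`ce_student_teacher_targets`, `kl_frame_loss`). Minimising the KL divergence is equivalent to minimising the cross-entropy term below, since the entropy of the target distribution does not depend on the student parameters $\Theta$.

```python
# Sketch of the frame-level student-teacher criterion (assumed tensor
# layouts; not the authors' code). The targets interpolate the hard
# cross-entropy labels with the averaged teacher posteriors.
import numpy as np

def ce_student_teacher_targets(hard_labels, teacher_posteriors, alphas, lam):
    """hard_labels: (T,) state indices; teacher_posteriors: (M, T, S)."""
    tp = np.asarray(teacher_posteriors)
    M, T, S = tp.shape
    one_hot = np.eye(S)[hard_labels]                       # delta(s_rt, s*_rt)
    avg = np.einsum('m,mts->ts', np.asarray(alphas), tp)   # teacher average
    return (1.0 - lam) * one_hot + lam * avg               # P*_CE

def kl_frame_loss(targets, student_log_posteriors):
    """Cross-entropy part of the KL divergence; the target entropy is
    constant with respect to the student parameters."""
    return -np.sum(targets * student_log_posteriors)
```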
Integrating Sequence Training
Sequence training outperforms cross-entropy training.
We want to integrate sequence training into student-teacher training.
Possible ways:
Ensemble training: sequence train the teachers.
Student refinement: sequence train the student.
Information transfer: propagate sequence-level information.
Hypothesis-level student-teacher training
Propagate hypothesis posterior information.
Minimise the KL-divergence between hypothesis posteriors,
$\mathcal{C}_{MMI} = -\sum_r \sum_{h_r} P^*_{MMI}(h_r) \log P(h_r \mid O_r, \Theta)$
The target distribution is
$P^*_{MMI}(h_r) = (1-\eta)\,\delta(h_r, h_r^*) + \eta \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m)$
$\eta = 0$ reduces to the MMI criterion.
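A sketch of the interpolated hypothesis-level targets over a shared N-best list. The N-best list and the teacher hypothesis posteriors are assumed to be computed elsewhere, and the function name is hypothetical.

```python
# Sketch of the hypothesis-level target distribution P*_MMI over a
# shared N-best list (illustrative helper, not the paper's code).
import numpy as np

def mmi_student_teacher_targets(nbest, best_hyp, teacher_hyp_posteriors,
                                betas, eta):
    """nbest: list of H hypotheses; best_hyp: the reference hypothesis h*_r;
    teacher_hyp_posteriors: (M, H) hypothesis posteriors per teacher."""
    delta = np.array([1.0 if h == best_hyp else 0.0 for h in nbest])
    combined = np.einsum('m,mh->h', np.asarray(betas),
                         np.asarray(teacher_hyp_posteriors))
    return (1.0 - eta) * delta + eta * combined             # P*_MMI
```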
Computing the gradient
SGD gradient:
$\frac{\partial \mathcal{C}_{MMI}}{\partial a_{s_{rt}}} = \gamma \Big[ P(s_{rt} \mid O_r, \Theta) - (1-\eta) P(s_{rt} \mid h_r^*, O_r, \Theta) - \eta \sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta) \Big]$
Computation of $\sum_{h_r} \sum_{m=1}^{M} \beta_m P(h_r \mid O_r, \Phi_m) P(s_{rt} \mid h_r, O_r, \Theta)$:
Use N-best lists.
If the student lattice is determinised and not regenerated, then $P(s_{rt} \mid h_r, O_r, \Theta)$ is a $\delta$-function that does not change with training iterations.
It can therefore be pre-computed once and stored.
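A sketch of this pre-computation, assuming the fixed N-best list has already been force-aligned so that each hypothesis maps to a one-hot state sequence per frame; the function name, alignment format and tensor layouts are illustrative assumptions.

```python
# Sketch of pre-computing the last term of the gradient over an N-best
# list. With a determinised, fixed student N-best list, the state
# alignment of each hypothesis is a delta-function, so the per-frame
# posteriors P(s_rt | h_r, O_r, Theta) reduce to fixed one-hot vectors
# that can be cached once (the alignment code is assumed to exist).
import numpy as np

def precompute_sequence_targets(alignments, teacher_hyp_posteriors, betas,
                                num_states):
    """alignments: list of length H, each a (T,) array of state indices
    for one hypothesis; teacher_hyp_posteriors: (M, H)."""
    combined = np.einsum('m,mh->h', np.asarray(betas),
                         np.asarray(teacher_hyp_posteriors))
    T = len(alignments[0])
    targets = np.zeros((T, num_states))
    for w, ali in zip(combined, alignments):
        targets[np.arange(T), ali] += w   # sum_h of beta-weighted one-hots
    return targets                         # fixed across training iterations
```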
Datasets
IARPA Babel Tok Pisin (IARPA-babel207b-v1.0e): 3 hour VLLP training set, 10 hour development set.
WSJ: 14 hour si-84 training set, 64K-word open-vocabulary eval-92 test set.
Experimental Setup
Ensemble size: 10 (Tok Pisin), 4 (WSJ).
Acoustic model: DNN-HMM hybrid, 1000 nodes × 4 layers for Tok Pisin, 2000 nodes × 6 layers for WSJ.
The student and teacher models have the same architecture.
Combination of sequence-trained teachers
(Table: Tok Pisin WERs, %)
Training the teachers with sequence criteria improves the combined ensemble performance.
Ensemble training
(Table: Tok Pisin WERs, %)
Gains from sequence training of the teachers can be transferred to the student models.
Student refinement
Further sequence training of the student can bring additional gains.
Information transfer
Hypothesis level: gains from sequence training of the teachers can be transferred to the student models by propagating hypothesis posteriors.
Conclusions
Summary:
Investigated integrating sequence training into student-teacher training.
Proposed a new method to propagate hypothesis posteriors.
All three methods of integrating sequence training are complementary to student-teacher training.
Future work:
Investigate other forms of information to propagate, e.g. expected loss.