Adversarial Teacher-Student Learning for Unsupervised Adaptation
Zhong Meng (1,2), Jinyu Li (1), Yifan Gong (1), Biing-Hwang (Fred) Juang (2)
(1) Microsoft AI and Research, USA; (2) Georgia Institute of Technology, USA

1. Introduction
Problem: ASR performance degrades significantly when the domains of the training and test data mismatch.
Solution: purely unsupervised adaptation. Adapt a well-trained source-domain acoustic model to data from the target domain, with no alignments or decoding lattices available for the target-domain adaptation data.
Teacher-student (T/S) learning [Li et al., 2014]: parallel data from the source and target domains is required; the student mimics the behavior of a well-trained source-domain teacher model.
Adversarial learning (GRL, DSN) [Ganin et al., 2015]: no parallel data from the source and target domains is required; condition variability in the speech signal is explicitly suppressed. A gradient reversal layer (GRL) multiplies the gradient by a negative number (−α) in the backward pass.
[Figure: AT/S architecture. The student acoustic model is split into a feature extractor M_f, which maps the student input feature x_i^S to a deep feature f_i, followed by a senone classifier M_y that produces the student senone posterior. A condition classifier M_c is attached to f_i through a GRL and produces the condition posterior. The senone loss L_y compares the student senone posterior with the teacher senone posterior (computed by the teacher acoustic model from the teacher input feature x_i^T); the condition loss L_c compares the condition posterior with the condition label c_i.]
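The gradient reversal layer described above can be sketched in a few lines. A minimal NumPy toy with a hand-rolled forward/backward interface; the class name, interface, and α value are illustrative, not the paper's implementation:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the gradient by -alpha backward."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # The condition-loss gradient is flipped (and scaled) before it reaches
        # the feature extractor, so the extractor is pushed to *increase* the
        # condition loss, i.e. to confuse the condition classifier.
        return -self.alpha * grad_output

grl = GradientReversal(alpha=0.5)
f = np.array([1.0, -2.0, 3.0])          # toy deep feature f_i
out = grl.forward(f)                    # identical to f
grad = grl.backward(np.ones_like(f))    # [-0.5, -0.5, -0.5]
```

In frameworks with autograd, the same effect is typically obtained by defining a custom op whose forward is the identity and whose backward negates and scales the incoming gradient.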
2. Teacher-Student (T/S) Adaptation [Li et al., 2017]
The student input x_i^S is parallel to the teacher input x_i^T, i.e., frame-by-frame synchronized.
Minimize the KL divergence between the output distributions of the teacher and student models:
KL(p_T || p_S) = Σ_i Σ_{q∈Q} p_T(q | x_i^T; θ_T) log [ p_T(q | x_i^T; θ_T) / p_S(q | x_i^S; θ_S) ]
Since the teacher parameters θ_T are fixed, this amounts to minimizing the T/S loss, with the teacher senone posteriors used in lieu of hard labels to train the student model:
L_TS(θ_S) = − Σ_i Σ_{q∈Q} p_T(q | x_i^T; θ_T) log p_S(q | x_i^S; θ_S),
where q is one of the senones in the senone set Q.

3. Adversarial Teacher-Student Learning (AT/S)
Advance T/S learning with adversarial learning to achieve condition-robust unsupervised adaptation.
Goal: learn a condition-invariant and senone-discriminative deep feature f_i.
Senone classifier: M_y(M_f(x_i^S)) = p_y(q | x_i^S; θ_f, θ_y), q ∈ Q
Condition classifier: M_c(M_f(x_i^S)) = p_c(c | x_i^S; θ_f, θ_c), c ∈ C
T/S senone loss: L_y(θ_f, θ_y) = − Σ_i Σ_{q∈Q} p_T(q | x_i^T; θ_T) log p_y(q | x_i^S; θ_f, θ_y)
Condition loss: L_c(θ_f, θ_c) = − Σ_{i=1}^N log p_c(c_i | x_i^S; θ_f, θ_c)
Adversarial multi-task learning (with the gradient reversal layer):
max_{θ_c} min_{θ_y, θ_f} [ L_y(θ_f, θ_y) − α L_c(θ_f, θ_c) ]

4. Experiments
Source-domain teacher acoustic model: an LSTM trained on 375 hours of Microsoft Cortana voice-assistant data.
Adaptation data (CHiME-3): 9137 parallel clean and noisy utterances.
Multi-factorial (MFA) AT/S: simultaneously suppresses multiple factors (e.g., speaker and environment) that cause the condition variability.

WER (%) on the CHiME-3 test environments (BUS, CAF, PED, STR); WERR = relative WER reduction (%) over the T/S baseline:

System            Conditions        BUS    CAF    PED    STR    Avg.   WERR
Un-adapted        -                 27.93  24.93  18.53  21.38  23.16  -
T/S (baseline)    -                 15.96  14.32  11.00  13.04  13.56  -
Adversarial T/S   2 env.            15.24  13.95  10.71  12.76  13.15  3.02
Adversarial T/S   6 env.            15.58  13.23  10.65  13.10  13.12  3.24
Adversarial T/S   87 spk.           14.97  13.63  10.84  12.24  12.90  4.87
MFA T/S           6 env., 87 spk.   15.38  13.08  10.47  12.45  12.83  5.38
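The T/S senone loss L_y and condition loss L_c can be illustrated numerically. A hedged NumPy sketch on random toy posteriors; the shapes, function names, and α value are illustrative, not from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ts_senone_loss(p_teacher, p_student, eps=1e-12):
    """L_y: cross-entropy of student senone posteriors against the teacher's
    posteriors (soft labels), averaged over frames."""
    return -np.mean(np.sum(p_teacher * np.log(p_student + eps), axis=-1))

def condition_loss(p_condition, labels, eps=1e-12):
    """L_c: cross-entropy of condition posteriors against hard condition labels."""
    return -np.mean(np.log(p_condition[np.arange(len(labels)), labels] + eps))

rng = np.random.default_rng(0)
p_T = softmax(rng.normal(size=(4, 10)))   # teacher senone posteriors: 4 frames, 10 senones
p_S = softmax(rng.normal(size=(4, 10)))   # student senone posteriors
p_c = softmax(rng.normal(size=(4, 3)))    # condition posteriors (e.g. 3 environments)
c = np.array([0, 2, 1, 0])                # frame-level condition labels

alpha = 0.5
objective = ts_senone_loss(p_T, p_S) - alpha * condition_loss(p_c, c)
```

By Gibbs' inequality the senone loss is smallest when the student posteriors match the teacher's, which is what T/S training drives toward; the −α L_c term is what the GRL realizes for the feature extractor.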
5. Conclusions
AT/S achieves 3.24%, 4.87%, and 5.38% relative WER reductions over T/S by suppressing environment, speaker, and multi-factor variability, respectively. AT/S for speaker-robust unsupervised adaptation is more effective than the environment-robust variant. MFA T/S further improves ASR performance over single-factor AT/S.
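The relative reductions quoted above follow directly from the average WERs in the results table; a small arithmetic check (the helper name is chosen here for illustration):

```python
def werr(wer_baseline, wer_system):
    """Relative WER reduction (%) of a system over a baseline."""
    return round(100.0 * (wer_baseline - wer_system) / wer_baseline, 2)

ts_avg = 13.56                    # T/S baseline average WER (%)
print(werr(ts_avg, 13.12))        # 6-env. AT/S   -> 3.24
print(werr(ts_avg, 12.90))        # 87-spk. AT/S  -> 4.87
print(werr(ts_avg, 12.83))        # MFA T/S       -> 5.38
```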