
1 Personalized Acoustic Modeling by Weakly Supervised Multi-task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data
Cheng-Kuan Wei1, Cheng-Tao Chung1, Hung-Yi Lee2 and Lin-Shan Lee2
1Graduate Institute of Electrical Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University

2 Outline
Introduction
Proposed Approach (PTDNN)
Experiments
Conclusions

3 Introduction
Weakly supervised: collecting a large set of audio data for each user is not difficult, but transcribing it is.
High degree of similarity between the HMM states of acoustic token models and those of phoneme models.

4 Introduction
Feature discriminant linear regression (fDLR): additional linear hidden layers.
Speaker code: DNNs are also trained with a set of automatically obtained speaker-specific features.
Lightly supervised adaptation.

5 Introduction
Multi-task learning.
The relationship between the automatically discovered acoustic tokens and the phoneme labels is unknown and likely to be noisy.

6 Automatic Discovery of Acoustic Tokens from Unlabeled Data
Initial token label sequence W_0: segmentation, then the mean MFCC vector of each segment clustered by K-means.
W_0 = initialization(O)
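The initialization step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: segment lengths, the number of tokens, and the plain K-means loop are all illustrative assumptions.

```python
import numpy as np

def initialize_tokens(mfcc, seg_len=10, n_tokens=4, n_iter=20, seed=0):
    """Hedged sketch of W_0 = initialization(O): cut the utterance into
    fixed-length segments, represent each segment by its mean MFCC
    vector, and cluster the segment means with K-means."""
    rng = np.random.default_rng(seed)
    # Mean MFCC vector per fixed-length segment
    n_seg = len(mfcc) // seg_len
    means = np.array([mfcc[i * seg_len:(i + 1) * seg_len].mean(axis=0)
                      for i in range(n_seg)])
    # Plain K-means on the segment means
    centers = means[rng.choice(n_seg, n_tokens, replace=False)]
    for _ in range(n_iter):
        d = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(n_tokens):
            if (labels == k).any():
                centers[k] = means[labels == k].mean(axis=0)
    return labels  # initial token label sequence W_0
```

In the paper the segmentation itself would also be derived from the signal; fixed-length segments here just keep the sketch short.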

7 Automatic Discovery of Acoustic Tokens from Unlabeled Data
Train the HMM model for each token: θ_i = arg max_θ P(O | θ, W_{i-1})
Decode to obtain the new label sequence: W_i = arg max_W P(O | θ_i, W)
Iterate until convergence.
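The alternating optimization above can be sketched with a single Gaussian per token standing in for the HMMs (an assumption made purely to keep the example self-contained; the paper trains full HMM token models):

```python
import numpy as np

def discover_tokens(segments, labels0, n_tokens=4, max_iter=10):
    """Hedged sketch of the alternation
        theta_i = argmax_theta P(O | theta, W_{i-1})  (train token models)
        W_i     = argmax_W     P(O | theta_i, W)      (re-decode labels)
    with a single mean vector per token as a stand-in for an HMM."""
    labels = labels0.copy()
    for _ in range(max_iter):
        # "Training" step: fit each token model to its current segments
        mu = np.array([segments[labels == k].mean(axis=0)
                       if (labels == k).any() else segments.mean(axis=0)
                       for k in range(n_tokens)])
        # "Decoding" step: relabel each segment with its best-scoring token
        d = ((segments[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        new = d.argmin(axis=1)
        if (new == labels).all():  # convergence: labels stop changing
            break
        labels = new
    return labels
```

The convergence test mirrors the slide: iteration stops once re-decoding no longer changes the label sequence.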

8 PTDNN
Top left: phoneme state probabilities.
Top right: token state probabilities.

9 Speaker Adaptation
Initialization from the speaker-independent (SI) DNN-HMM acoustic model.
The fDLR transformation is initialized as an identity matrix.
The hidden layers and the fDLR transform are shared between the two tasks.
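The identity initialization can be shown in a few lines. A minimal sketch, assuming fDLR is a linear transform y = Ax + b applied to the input features (the dimension is illustrative):

```python
import numpy as np

# Hedged sketch: fDLR prepends a linear transform y = A x + b to the
# network input. Initializing A as the identity (and b as zeros) means
# the adapted network starts out exactly equal to the SI network.
dim = 13                          # e.g. MFCC dimension (illustrative)
A = np.eye(dim)                   # fDLR weight matrix, identity at start
b = np.zeros(dim)                 # fDLR bias, zero at start

x = np.arange(dim, dtype=float)   # one input frame
y = A @ x + b                     # before any adaptation, y == x
```

Adaptation then updates A and b on the user's data while the SI network weights provide the starting point.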

10 Speaker Adaptation
Jointly optimize the phoneme state output network and the token state output network.
Objective function: f = W_p · f_p + W_t · f_t
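The weighted objective can be sketched as a sum of two cross-entropy head losses. The weight values and the per-frame formulation here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, target):
    """Negative log-probability of the target state."""
    return -np.log(softmax(logits)[target])

def joint_objective(phone_logits, phone_label,
                    token_logits, token_label, w_p=1.0, w_t=0.5):
    """Hedged sketch of f = W_p * f_p + W_t * f_t: a weighted sum of
    the phoneme-head loss f_p and the token-head loss f_t."""
    f_p = cross_entropy(phone_logits, phone_label)
    f_t = cross_entropy(token_logits, token_label)
    return w_p * f_p + w_t * f_t
```

Setting W_t = 0 recovers a purely phoneme-supervised objective, which makes the role of each weight explicit.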

11 Speaker Adaptation
Train the acoustic token state output network on the unlabeled data.
Objective function: f = W_t · f_t

12 Speaker Adaptation
Transferring back: fine-tune the model by optimizing the phoneme state output network.
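One way to picture the transfer-back step is a gradient update that touches only the phoneme head while the shared layers stay fixed. This is a toy NumPy sketch under that assumption; the shapes, data, loss, and learning rate are all illustrative:

```python
import numpy as np

# Hedged sketch: after token-level training, only the phoneme output
# head is fine-tuned; the shared hidden layer is held frozen.
rng = np.random.default_rng(0)
shared_W = rng.normal(size=(13, 8))       # shared hidden layer (frozen)
phone_W = rng.normal(size=(8, 5))         # phoneme output head (trainable)

def phone_loss(x, target):
    """Cross-entropy of the phoneme head for one frame."""
    h = np.tanh(x @ shared_W)             # shared representation
    logits = h @ phone_W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target]), h, p

x, target = rng.normal(size=13), 2
loss_before, h, p = phone_loss(x, target)
grad = p.copy()
grad[target] -= 1.0                       # d f_p / d logits
phone_W -= 0.1 * np.outer(h, grad)        # update the phoneme head only
loss_after, _, _ = phone_loss(x, target)  # shared_W was never touched
```

The point of the sketch is the parameter partition: the update rule writes only to the phoneme head's weights.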

13 Experiments
Facebook post corpus: 1000 utterances (6.6 hr).
Train/Dev/Test split: 500/250/250 utterances; labeled data: 50/500.
4.1% of the words were in English.
Five male and five female speakers.

14 Experiments
Speaker-independent (SI) model trained on:
ASTMIC corpus (read speech in Mandarin, 31.8 hours)
EATMIC corpus (read speech in English produced by Taiwanese speakers, 29.7 hours)
Language model (trigram): data crawled from the PTT bulletin board system (BBS).
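A count-based trigram model of the kind that could be trained on such crawled text can be sketched briefly. The padding symbols and add-one smoothing are illustrative choices, not details from the paper:

```python
from collections import Counter

def train_trigram(sentences):
    """Minimal sketch of a count-based trigram LM with sentence-boundary
    padding and add-one smoothing (toy data, illustrative smoothing)."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1   # trigram counts
            bi[tuple(toks[i - 2:i])] += 1        # history counts
    V = len(vocab)
    def prob(w, h1, h2):
        """P(w | h1 h2) with add-one smoothing."""
        return (tri[(h1, h2, w)] + 1) / (bi[(h1, h2)] + V)
    return prob

p = train_trigram(["a b c", "a b d"])
```

In practice a toolkit with proper back-off smoothing would be used; this sketch only shows the counting structure behind a trigram LM.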

15 Speaker Adaptation

16 Granularity parameter sets ψ = (m, n)

17 Transcribed Data and Token Sets

18 Conclusions
We propose a weakly supervised multi-task deep learning framework for personalized acoustic modeling.
It exploits a large set of unlabeled data in the proposed personalized-recognizer scenario.
Very encouraging initial experimental results were obtained.

