
1 Personalized Acoustic Modeling by Weakly Supervised Multi-task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data
Cheng-Kuan Wei1, Cheng-Tao Chung1, Hung-Yi Lee2 and Lin-Shan Lee2
1Graduate Institute of Electrical Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University

2 Outline
Introduction
Proposed Approach (PTDNN)
Experiments
Conclusions

3 Introduction
Weakly supervised: collecting a large set of audio data for each user is not difficult, but transcribing it is.
High degree of similarity between the HMM states of acoustic token models and those of phoneme models.

4 Introduction
Feature discriminant linear regression (fDLR): additional linear hidden layers.
Speaker code: DNNs are also trained with a set of automatically obtained speaker-specific features.
Lightly supervised adaptation.

5 Introduction
Multi-task learning.
The relationship between the automatically discovered acoustic tokens and the phoneme labels is unknown and likely to be noisy.

6 Automatic Discovery of Acoustic Tokens from Unlabeled Data
Initial token label sequence W_0: segmentation, then the mean MFCC vector of each segment clustered by K-means.
W_0 = initialization(O)
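The initialization step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: segment lengths, the number of tokens, and the plain K-means loop are all illustrative assumptions.

```python
import numpy as np

def initialize_tokens(mfcc, seg_len=10, n_tokens=4, n_iter=20, seed=0):
    """Hedged sketch of W_0 = initialization(O): cut the utterance into
    fixed-length segments, represent each segment by its mean MFCC
    vector, and cluster the segment means with K-means."""
    rng = np.random.default_rng(seed)
    # Mean MFCC vector per fixed-length segment
    n_seg = len(mfcc) // seg_len
    means = np.array([mfcc[i * seg_len:(i + 1) * seg_len].mean(axis=0)
                      for i in range(n_seg)])
    # Plain K-means on the segment means
    centers = means[rng.choice(n_seg, n_tokens, replace=False)]
    for _ in range(n_iter):
        d = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(n_tokens):
            if (labels == k).any():
                centers[k] = means[labels == k].mean(axis=0)
    return labels  # initial token label sequence W_0
```

In the paper the segmentation itself would also be derived from the signal; fixed-length segments here just keep the sketch short.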

7 Automatic Discovery of Acoustic Tokens from Unlabeled Data
Train the HMM model for each token: θ_i = arg max_θ P(O | θ, W_{i-1})
Decode to obtain the new label sequence: W_i = arg max_W P(O | θ_i, W)
Iterate until convergence.
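The alternating optimization above can be sketched with a single Gaussian per token standing in for the HMMs (an assumption made purely to keep the example self-contained; the paper trains full HMM token models):

```python
import numpy as np

def discover_tokens(segments, labels0, n_tokens=4, max_iter=10):
    """Hedged sketch of the alternation
        theta_i = argmax_theta P(O | theta, W_{i-1})  (train token models)
        W_i     = argmax_W     P(O | theta_i, W)      (re-decode labels)
    with a single mean vector per token as a stand-in for an HMM."""
    labels = labels0.copy()
    for _ in range(max_iter):
        # "Training" step: fit each token model to its current segments
        mu = np.array([segments[labels == k].mean(axis=0)
                       if (labels == k).any() else segments.mean(axis=0)
                       for k in range(n_tokens)])
        # "Decoding" step: relabel each segment with its best-scoring token
        d = ((segments[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        new = d.argmin(axis=1)
        if (new == labels).all():  # convergence: labels stop changing
            break
        labels = new
    return labels
```

The convergence test mirrors the slide: iteration stops once re-decoding no longer changes the label sequence.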

8 PTDNN
Top left: phoneme state probabilities.
Top right: token state probabilities.

9 Speaker Adaptation
Initialization from the speaker-independent (SI) DNN-HMM acoustic model.
The fDLR transformation is initialized as an identity matrix.
The hidden layers and the fDLR transform are shared between the two tasks.
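The identity initialization can be shown in a few lines. A minimal sketch, assuming fDLR is a linear transform y = Ax + b applied to the input features (the dimension is illustrative):

```python
import numpy as np

# Hedged sketch: fDLR prepends a linear transform y = A x + b to the
# network input. Initializing A as the identity (and b as zeros) means
# the adapted network starts out exactly equal to the SI network.
dim = 13                          # e.g. MFCC dimension (illustrative)
A = np.eye(dim)                   # fDLR weight matrix, identity at start
b = np.zeros(dim)                 # fDLR bias, zero at start

x = np.arange(dim, dtype=float)   # one input frame
y = A @ x + b                     # before any adaptation, y == x
```

Adaptation then updates A and b on the user's data while the SI network weights provide the starting point.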

10 Speaker Adaptation
Jointly optimize the phoneme state output network and the token state output network.
Objective function: f = W_p · f_p + W_t · f_t
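The weighted objective can be sketched as a sum of two cross-entropy head losses. The weight values and the per-frame formulation here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, target):
    """Negative log-probability of the target state."""
    return -np.log(softmax(logits)[target])

def joint_objective(phone_logits, phone_label,
                    token_logits, token_label, w_p=1.0, w_t=0.5):
    """Hedged sketch of f = W_p * f_p + W_t * f_t: a weighted sum of
    the phoneme-head loss f_p and the token-head loss f_t."""
    f_p = cross_entropy(phone_logits, phone_label)
    f_t = cross_entropy(token_logits, token_label)
    return w_p * f_p + w_t * f_t
```

Setting W_t = 0 recovers a purely phoneme-supervised objective, which makes the role of each weight explicit.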

11 Speaker Adaptation
Train the acoustic token state output network on the unlabeled data.
Objective function: f = W_t · f_t

12 Speaker Adaptation
Transferring back: fine-tune the model by optimizing the phoneme state output network.
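One way to picture the transfer-back step is a gradient update that touches only the phoneme head while the shared layers stay fixed. This is a toy NumPy sketch under that assumption; the shapes, data, loss, and learning rate are all illustrative:

```python
import numpy as np

# Hedged sketch: after token-level training, only the phoneme output
# head is fine-tuned; the shared hidden layer is held frozen.
rng = np.random.default_rng(0)
shared_W = rng.normal(size=(13, 8))       # shared hidden layer (frozen)
phone_W = rng.normal(size=(8, 5))         # phoneme output head (trainable)

def phone_loss(x, target):
    """Cross-entropy of the phoneme head for one frame."""
    h = np.tanh(x @ shared_W)             # shared representation
    logits = h @ phone_W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target]), h, p

x, target = rng.normal(size=13), 2
loss_before, h, p = phone_loss(x, target)
grad = p.copy()
grad[target] -= 1.0                       # d f_p / d logits
phone_W -= 0.1 * np.outer(h, grad)        # update the phoneme head only
loss_after, _, _ = phone_loss(x, target)  # shared_W was never touched
```

The point of the sketch is the parameter partition: the update rule writes only to the phoneme head's weights.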

13 Experiments
Facebook post corpus: 1000 utterances (6.6 hr).
Train/Dev/Test split: 500/250/250 utterances; labeled data: 50/500.
4.1% of the words were in English.
Five male and five female speakers.

14 Experiments
Speaker-independent (SI) model trained on:
ASTMIC corpus (read speech in Mandarin, 31.8 hours)
EATMIC corpus (read speech in English produced by Taiwanese speakers, 29.7 hours)
Language model (trigram): data crawled from the PTT bulletin board system (BBS).
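A count-based trigram model of the kind that could be trained on such crawled text can be sketched briefly. The padding symbols and add-one smoothing are illustrative choices, not details from the paper:

```python
from collections import Counter

def train_trigram(sentences):
    """Minimal sketch of a count-based trigram LM with sentence-boundary
    padding and add-one smoothing (toy data, illustrative smoothing)."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1   # trigram counts
            bi[tuple(toks[i - 2:i])] += 1        # history counts
    V = len(vocab)
    def prob(w, h1, h2):
        """P(w | h1 h2) with add-one smoothing."""
        return (tri[(h1, h2, w)] + 1) / (bi[(h1, h2)] + V)
    return prob

p = train_trigram(["a b c", "a b d"])
```

In practice a toolkit with proper back-off smoothing would be used; this sketch only shows the counting structure behind a trigram LM.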

15 Speaker Adaptation

16 Granularity parameter sets ψ = (m, n)

17 Transcribed Data and Token Sets

18 Conclusions
We propose a weakly supervised multi-task deep learning framework for personalized acoustic modeling.
It exploits a large set of unlabeled data in the proposed personalized-recognizer scenario.
Very encouraging initial experimental results were obtained.

