Personalized Acoustic Modeling by Weakly Supervised Multi-task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data
Cheng-Kuan Wei1, Cheng-Tao Chung1, Hung-Yi Lee2 and Lin-Shan Lee2
1Graduate Institute of Electrical Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University
Outline
- Introduction
- Proposed Approach (PTDNN)
- Experiments
- Conclusions
Introduction
- Weakly supervised setting
  - Collecting a large set of audio data for each user is not difficult
  - Transcribing that data is difficult
- High degree of similarity between the HMM states of acoustic token models and those of phoneme models
Introduction
- Feature-space discriminative linear regression (fDLR): additional linear hidden layers inserted into the DNN
- Speaker code: the DNN is also trained with a set of automatically obtained speaker-specific features
- Lightly supervised adaptation
Introduction
- Multi-task learning
- The relationship between the automatically discovered acoustic tokens and the phoneme labels is unknown and likely to be noisy
Acoustic Tokens from Unlabeled Data
- Automatic discovery of acoustic tokens from unlabeled data
- Initial token label sequence $W^0$: segment the corpus, then cluster the mean MFCC vector of each segment with K-means (see the sketch below)
- $W^0 = \text{initialization}(O)$
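A minimal sketch of this initialization in Python with scikit-learn; it assumes the segments have already been produced by a segmentation pass, and the function name and arguments are illustrative rather than the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize_token_labels(mfcc_segments, n_tokens):
    """W0: assign an initial token label to every segment by clustering
    the mean MFCC vector of each segment with K-means.

    mfcc_segments: list of (T_i, D) arrays, one per hypothesized segment.
    n_tokens: number of distinct acoustic tokens (the n in psi = (m, n)).
    """
    # Collapse each variable-length segment to one fixed-length vector:
    # the frame-wise mean of its MFCCs.
    segment_means = np.stack([seg.mean(axis=0) for seg in mfcc_segments])
    # Each cluster id serves as the initial token label for its segment.
    return KMeans(n_clusters=n_tokens, n_init=10).fit_predict(segment_means)
```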
Acoustic Tokens from Unlabeled Data
- Train an HMM for each token: $\theta_i = \arg\max_{\theta} P(O \mid \theta, W_{i-1})$
- Decode to obtain the new label sequences: $W_i = \arg\max_{W} P(O \mid \theta_i, W)$
- Iterate until convergence (sketched below)
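The alternating optimization can be sketched with hmmlearn's GaussianHMM; as a simplification (and an assumption on my part), the segmentation is held fixed and each segment is relabeled independently by its most likely token HMM rather than jointly re-decoded:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def discover_tokens(segments, labels, n_tokens, m_states, n_iters=10):
    """Alternate HMM re-estimation and relabeling until W stops changing.

    segments: list of (T_i, D) MFCC arrays, one per segment.
    labels:   initial token label per segment (W0 from the K-means step).
    """
    labels = list(labels)
    for _ in range(n_iters):
        # theta_i = argmax_theta P(O | theta, W_{i-1}):
        # re-train one m-state HMM per token on the segments it owns.
        hmms = []
        for k in range(n_tokens):
            X = [s for s, l in zip(segments, labels) if l == k]
            h = GaussianHMM(n_components=m_states)
            h.fit(np.vstack(X), lengths=[len(s) for s in X])
            hmms.append(h)
        # W_i = argmax_W P(O | theta_i, W):
        # relabel each segment with its highest-scoring token HMM.
        new = [int(np.argmax([h.score(s) for h in hmms])) for s in segments]
        if new == labels:  # converged: the labels no longer change
            break
        labels = new
    return hmms, labels
```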
PTDNN
[Figure: PTDNN architecture. Top left: phoneme state probabilities; top right: token state probabilities. The two output layers sit on shared hidden layers; a code sketch follows.]
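To make the figure concrete, a hedged PyTorch sketch of the two-headed network; the layer sizes, the number of shared layers, and the sigmoid nonlinearity are placeholders, not the paper's exact configuration:

```python
import torch.nn as nn

class PTDNN(nn.Module):
    """Shared hidden layers feeding two task-specific output layers:
    phoneme state probabilities and acoustic token state probabilities."""

    def __init__(self, feat_dim, hidden_dim, n_phone_states, n_token_states):
        super().__init__()
        # fDLR: a linear transformation applied to the input features.
        self.fdlr = nn.Linear(feat_dim, feat_dim, bias=False)
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        self.phone_head = nn.Linear(hidden_dim, n_phone_states)  # top left
        self.token_head = nn.Linear(hidden_dim, n_token_states)  # top right

    def forward(self, x):
        h = self.shared(self.fdlr(x))
        return self.phone_head(h), self.token_head(h)  # two sets of logits
```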
Speaker Adaptation
- Initialization (see the sketch below)
  - Start from the SI DNN-HMM acoustic model
  - fDLR transformation initialized as the identity matrix
  - The hidden layers and the fDLR layer are shared between the two tasks
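Continuing the PTDNN sketch above, initialization might look as follows; si_model is an assumed pre-trained speaker-independent network with matching layer shapes:

```python
import torch

def init_for_adaptation(model, si_model):
    """Start adaptation from the SI DNN-HMM model with an identity fDLR."""
    # Copy the speaker-independent shared layers and phoneme output layer.
    model.shared.load_state_dict(si_model.shared.state_dict())
    model.phone_head.load_state_dict(si_model.phone_head.state_dict())
    # The fDLR transformation starts as the identity matrix, so the
    # adapted network initially behaves exactly like the SI model.
    with torch.no_grad():
        model.fdlr.weight.copy_(torch.eye(model.fdlr.in_features))
```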
Speaker Adaptation
- Jointly optimize the phoneme state output network and the token state output network
- Objective function: $f = W_p \cdot f_p + W_t \cdot f_t$ (see the sketch below)
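A minimal sketch of the joint objective, assuming cross-entropy for both the phoneme-state term $f_p$ and the token-state term $f_t$; the default weight values are placeholders:

```python
import torch.nn.functional as F

def joint_loss(phone_logits, token_logits, phone_targets, token_targets,
               w_p=1.0, w_t=1.0):
    """f = W_p * f_p + W_t * f_t: a weighted sum of the phoneme-state
    and token-state objectives, here both frame-level cross-entropy."""
    f_p = F.cross_entropy(phone_logits, phone_targets)
    f_t = F.cross_entropy(token_logits, token_targets)
    return w_p * f_p + w_t * f_t
```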
Speaker Adaptation
- Train the acoustic token state output network on the unlabeled data
- Objective function: $f = W_t \cdot f_t$
Speaker Adaptation
- Transferring back
- Fine-tune the model by optimizing the phoneme state objective (the full schedule is sketched below)
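Putting the three stages together, a sketch of the adaptation schedule that reuses joint_loss from above; the batch formats, the optimizer, and the single-epoch staging are assumptions rather than the authors' exact training recipe:

```python
import torch.nn.functional as F

def adapt(model, labeled_batches, unlabeled_batches, opt):
    # Stage 1: jointly optimize both output networks on the small
    # transcribed set (f = W_p * f_p + W_t * f_t).
    for x, y_phone, y_token in labeled_batches:
        p, t = model(x)
        loss = joint_loss(p, t, y_phone, y_token)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: train the token state network alone (f = W_t * f_t) on the
    # large unlabeled set, using the discovered token labels as targets.
    for x, y_token in unlabeled_batches:
        _, t = model(x)
        loss = F.cross_entropy(t, y_token)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 3: transfer back and fine-tune on the phoneme state objective.
    for x, y_phone, _ in labeled_batches:
        p, _ = model(x)
        loss = F.cross_entropy(p, y_phone)
        opt.zero_grad()
        loss.backward()
        opt.step()
```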
Experiments
- Facebook post corpus: 1000 utterances (6.6 hr)
- Train/Dev/Test split: 500/250/250 utterances, with 50 to 500 of the training utterances transcribed
- 4.1% of the words were in English
- Five male and five female speakers
Experiments
- Speaker-independent (SI) model trained on:
  - ASTMIC corpus (read speech in Mandarin, 31.8 hours)
  - EATMIC corpus (read speech in English produced by Taiwanese speakers, 29.7 hours)
- Language model (trigram): data crawled from the PTT bulletin board system (BBS)
Speaker Adaptation: Results
[Results figure not preserved in the extraction]
Granularity parameter sets $\psi = (m, n)$, where $m$ is the number of HMM states per token and $n$ is the number of distinct tokens
Transcribed Data and Token Sets
[Results table not preserved in the extraction]
Conclusions
- We proposed a weakly supervised multi-task deep learning framework for personalized acoustic modeling
- The framework exploits the large set of unlabeled data available in the proposed personalized recognizer scenario
- Very encouraging initial experimental results were obtained