Cheng-Kuan Wei1 , Cheng-Tao Chung1 , Hung-Yi Lee2 and Lin-Shan Lee2


Personalized Acoustic Modeling by Weakly Supervised Multi-task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data
Cheng-Kuan Wei1, Cheng-Tao Chung1, Hung-Yi Lee2 and Lin-Shan Lee2
1Graduate Institute of Electrical Engineering, National Taiwan University
2Graduate Institute of Communication Engineering, National Taiwan University

Outline
Introduction
Proposed Approach (PTDNN)
Experiments
Conclusions

Introduction
Weakly supervised scenario: collecting a large set of audio data for each user is not difficult, but transcribing it is.
There is a high degree of similarity between the HMM states of automatically discovered acoustic token models and those of phoneme models.

Introduction: prior adaptation approaches
Feature discriminative linear regression (fDLR): additional linear hidden layers inserted at the input of the DNN.
Speaker code: the DNN is also trained with a set of automatically obtained speaker-specific features.
Lightly supervised adaptation.

Introduction
Multi-task learning: the relationship between the automatically discovered acoustic tokens and the phoneme labels is unknown and likely to be noisy.

Automatic Discovery of Acoustic Tokens from Unlabeled Data
Initial token label sequence W_0: segment the utterances, take the mean MFCC vector of each segment, and cluster the segments with K-means:
W_0 = initialization(O)

Automatic Discovery of Acoustic Tokens from Unlabeled Data
Train an HMM for each token: θ_i = argmax_θ P(O | θ, W_{i-1})
Decode to obtain the new label sequence: W_i = argmax_W P(O | θ_i, W)
Iterate until convergence.
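The alternating train/decode loop above can be sketched in a deliberately simplified form. In this toy version each "token HMM" is collapsed to a single Gaussian mean over segment-level features, so training becomes a mean update and decoding a nearest-mean assignment; the function name and the evenly-spaced seeding are illustrative choices, not from the paper, which trains multi-state HMMs and decodes with Viterbi.

```python
import numpy as np

def discover_tokens(segment_feats, n_tokens, n_iter=20):
    """Toy sketch of iterative acoustic token discovery.

    segment_feats: (num_segments, feat_dim) array of per-segment
    mean MFCC vectors. Each token model is reduced to one mean vector.
    """
    # W_0: initial labels via assignment to evenly spaced seed segments
    # (a stand-in for the MFCC-mean + K-means initialization).
    seeds = np.linspace(0, len(segment_feats) - 1, n_tokens).astype(int)
    means = segment_feats[seeds]
    labels = np.full(len(segment_feats), -1)
    for _ in range(n_iter):
        # "Decode": W_i = argmax_W P(O | theta_i, W)
        # -> here, nearest-mean assignment per segment.
        dists = ((segment_feats[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
        # "Train": theta_i = argmax_theta P(O | theta, W_{i-1})
        # -> here, update each token's mean from its segments.
        means = np.array([segment_feats[labels == k].mean(axis=0)
                          if np.any(labels == k) else means[k]
                          for k in range(n_tokens)])
    return labels, means
```

With proper HMMs in place of the means, the same two-step loop (re-estimate models given labels, re-decode labels given models) is the procedure on the slide.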

PTDNN
Two output layers sit on top of shared hidden layers: the top left of the figure produces the phoneme state probabilities, the top right the token state probabilities.

Speaker Adaptation: Initialization
Initialize from the speaker-independent (SI) DNN-HMM acoustic model; the fDLR transformation is initialized to the identity matrix.
The hidden layers and the fDLR layer are shared between the phoneme and token tasks.
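A minimal sketch of the fDLR initialization described above: a speaker-specific linear transform on the input features that starts as the identity, so the adapted network initially behaves exactly like the SI model. The function names here are hypothetical.

```python
import numpy as np

def init_fdlr(feat_dim):
    """fDLR as a linear input transform y = W x + b.

    Initializing W to the identity and b to zero means the transform
    is a no-op before adaptation; speaker adaptation then updates
    only W and b on the target speaker's data.
    """
    W = np.eye(feat_dim)     # identity matrix, as on the slide
    b = np.zeros(feat_dim)
    return W, b

def apply_fdlr(x, W, b):
    """Apply the fDLR transform to a batch of feature vectors."""
    return x @ W.T + b
```

Because the transform is identity at initialization, any adaptation that diverges can at worst be rolled back to SI performance, which is part of the appeal of fDLR for small adaptation sets.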

Speaker Adaptation: Joint Optimization
Jointly optimize the phoneme state output network and the token state output network.
Objective function: f = W_p · f_p + W_t · f_t
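The weighted multi-task objective f = W_p · f_p + W_t · f_t can be sketched as follows, assuming (as is standard for DNN acoustic models, though not stated on the slide) that f_p and f_t are per-task cross-entropy objectives; the weight values are illustrative, not from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch; labels are class indices."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multitask_objective(phone_logits, phone_labels,
                        token_logits, token_labels,
                        w_p=1.0, w_t=0.5):
    """f = W_p * f_p + W_t * f_t.

    f_p: objective of the phoneme state output network.
    f_t: objective of the token state output network.
    """
    f_p = cross_entropy(phone_logits, phone_labels)
    f_t = cross_entropy(token_logits, token_labels)
    return w_p * f_p + w_t * f_t
```

Note that setting w_p = 0 recovers the token-only objective f = W_t · f_t used when training on unlabeled data, where phoneme labels are unavailable.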

Speaker Adaptation: Unlabeled Data
Train the acoustic token state output network on the unlabeled data.
Objective function: f = W_t · f_t

Speaker Adaptation: Transferring Back
Fine-tune the model by optimizing the phoneme state output network.

Experiments: Facebook Post Corpus
1000 utterances (6.6 hr) from five male and five female speakers.
Train/Dev/Test split: 500/250/250 utterances; labeled subsets of 50/500.
4.1% of the words were in English.

Experiments: Baseline Models
Speaker-independent (SI) acoustic model trained on:
ASTMIC corpus (read speech in Mandarin, 31.8 hours)
EATMIC corpus (read speech in English produced by Taiwanese speakers, 29.7 hours)
Language model (trigram): data crawled from the PTT bulletin board system (BBS).

Experimental Results: Speaker Adaptation

Experimental Results: Granularity Parameter Sets ψ = (m, n)
Each token set is characterized by ψ = (m, n): m is the number of states per token HMM and n is the number of distinct tokens.

Experimental Results: Transcribed Data and Token Sets

Conclusions
We propose a weakly supervised multi-task deep learning framework for personalized acoustic modeling.
The framework exploits the large set of unlabeled data available in the proposed personalized recognizer scenario.
Very encouraging initial experimental results were obtained.