A Scalable Approach to Using DNN-Derived Features in GMM-HMM Based Acoustic Modeling For LVCSR
Zhijie Yan, Qiang Huo and Jian Xu
Microsoft Research Asia
InterSpeech-2013, Aug. 26, Lyon, France
Research Background
- Deep learning (especially DNN-HMM) has become the new state of the art in speech recognition
  - Good performance improvement (10% - 30% relative WER reduction)
  - Service deployment by many companies
- Research problems
  - What are the main contributing factors to DNN-HMM?
  - What are the implications for GMM-HMM?
  - Is GMM-HMM out of date, or even dead?
Parallel Study of DNN-HMM and GMM-HMM
- Factors contributing to the success of DNN-HMM for LVCSR
  - Long-span input features
  - Discriminative training of tied states of HMMs
  - Deep hierarchical nonlinear feature mapping
Parallel Study of DNN-HMM and GMM-HMM
- Factors contributing to the success of DNN-HMM for LVCSR
  - Long-span input features
  - Discriminative training of tied states of HMMs
  - Deep hierarchical nonlinear feature mapping
- The first two can also be applied to IVN transform learning in the GMM-HMM framework
  - Z.-J. Yan, Q. Huo, J. Xu, and Y. Zhang, "Tied-state based discriminative training of context-expanded region-dependent feature transforms for LVCSR," Proc. ICASSP-2013
Parallel Study of DNN-HMM and GMM-HMM
- Factors contributing to the success of DNN-HMM for LVCSR
  - Long-span input features
  - Discriminative training of tied states of HMMs
  - Deep hierarchical nonlinear feature mapping
- The first two can also be applied to IVN transform learning in the GMM-HMM framework
  - Z.-J. Yan, Q. Huo, J. Xu, and Y. Zhang, "Tied-state based discriminative training of context-expanded region-dependent feature transforms for LVCSR," Proc. ICASSP-2013
  - The best GMM-HMM achieves 19.7% WER using spectral features
  - DNN-HMM can easily achieve 16.4% WER with CE training
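One of the ingredients above, long-span input features, is simply the current frame spliced together with its neighboring frames. Below is a minimal sketch of such context expansion; the ±5-frame window and numpy implementation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Stack each frame with its +/- context to form a long-span input vector.

    features: (num_frames, feat_dim) per-frame acoustic features.
    Returns:  (num_frames, (left + 1 + right) * feat_dim) spliced features.
    """
    num_frames, _ = features.shape
    # Repeat the first / last frame so edge frames still get a full context window.
    padded = np.concatenate(
        [np.repeat(features[:1], left, axis=0),
         features,
         np.repeat(features[-1:], right, axis=0)],
        axis=0)
    return np.stack(
        [padded[t:t + left + 1 + right].reshape(-1) for t in range(num_frames)],
        axis=0)
```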
Combining the Best of Both Worlds
- DNN-GMM-HMM
  - DNN as hierarchical nonlinear feature extractor
  - GMM-HMM as acoustic model
Why DNN-GMM-HMM
- Leverage the power of deep learning
  - Train the DNN feature extractor using a subset of the training data
  - Mitigate the scalability issue of DNN training
- Leverage GMM-HMM technologies
  - Train GMM-HMMs on the full set of training data
  - Well-established training algorithms, e.g., ML / tied-state based feature-space DT / sequence-based model-space DT
  - Scalable training tools leveraging big data
  - Practical unsupervised adaptation / personalization methods, e.g., CMLLR
Prior Art: TANDEM Features
- (Deep) TANDEM features
  - H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," Proc. ICASSP-2000
  - Z. Tuske, M. Sundermeyer, R. Schluter, and H. Ney, "Context-dependent MLPs for LVCSR: Tandem, hybrid or both?" Proc. InterSpeech-2012
[Figure: DNN diagram labeled with input layer, hidden layers, and output layer]
Prior Art: Bottleneck Features
- (Deep) bottleneck features
  - F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," Proc. ICASSP-2007
  - D. Yu and M. L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," Proc. InterSpeech-2011
[Figure: DNN diagram labeled with input layer, hidden layers, and output layer]
Proposed: DNN-Derived Features
- All hidden layers: feature extractor
- Softmax output layer: log-linear model
[Figure: DNN diagram labeled with input layer, hidden layers, and output layer]
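A minimal numpy sketch of where the different feature types are tapped in the network; the sigmoid hidden units and weight shapes are illustrative assumptions, not the paper's exact topology.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, hidden_weights, hidden_biases, out_weight, out_bias):
    """Forward pass through a sigmoid DNN, returning all hidden activations
    and the softmax tied-state posteriors.

    x: (T, input_dim) spliced input frames.
    hidden_weights[i]: (dim_i, dim_{i+1}); out_weight: (last_hidden_dim, num_states).
    """
    activations = []
    h = x
    for W, b in zip(hidden_weights, hidden_biases):
        h = sigmoid(h @ W + b)
        activations.append(h)
    posteriors = softmax(h @ out_weight + out_bias)
    return activations, posteriors

# Where each feature type is tapped:
#   TANDEM features      -> posteriors (softmax outputs), typically log + PCA
#   bottleneck features  -> activations at a deliberately narrow hidden layer
#   DNN-derived features -> activations[-1], the last hidden layer feeding the
#                           softmax, used here as input to the GMM-HMM system
```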
DNN-Derived Features
- Advantages
  - Keep as much discriminative information as possible (different from bottleneck features)
  - Shared DNN topology with the full-size DNN-HMM (different from TANDEM features)
- More could be done
  - Language-independent DNN feature extractor …
- Combined with GMM-HMM modeling
  + Discriminative training (e.g., RDLT+MMI, as shown later)
  + Adaptation / personalization
  + Adaptive training
Combined With Best GMM-HMM Techniques
- Processing pipeline: DNN-derived features -> PCA -> HLDA -> tied-state WE-RDLT -> MMI sequence training -> CMLLR unsupervised adaptation
- GMM-HMM modeling of DNN-derived features
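The last-hidden-layer activations are high-dimensional and correlated, so the pipeline first decorrelates and reduces them with PCA (followed by HLDA) before GMM-HMM training. A minimal PCA sketch follows; the target dimensionality of 39 is an illustrative assumption, and the HLDA / RDLT / MMI / CMLLR stages are standard GMM-HMM tools not reproduced here.

```python
import numpy as np

def estimate_pca(features, target_dim=39):
    """Estimate a PCA projection on pooled DNN-derived features.

    features: (num_frames, dnn_dim) last-hidden-layer activations collected
    over (a sample of) the training data.
    Returns (mean, projection) so the same transform can be applied to test data.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:target_dim]   # keep the top components
    return mean, eigvecs[:, order]                   # projection: (dnn_dim, target_dim)

def apply_pca(features, mean, projection):
    return (features - mean) @ projection
```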
Experimental Setup
- Training data
  - 309hr Switchboard-1 conversational telephone speech
  - 2,000hr Switchboard+Fisher conversational telephone speech
- Training combinations
  - 309hr DNN + 309hr GMM-HMM
  - 309hr DNN + 2,000hr GMM-HMM
  - 2,000hr DNN + 2,000hr GMM-HMM
- Testing data
  - NIST 2000 Hub5 testing set
Experimental Results: 309hr DNN + 309hr GMM-HMM
- RDLT: tied-state based region-dependent linear transform (refer to our ICASSP-2013 paper)
- MMI: lattice-based sequence training
- UA: CMLLR unsupervised adaptation
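For the UA step, CMLLR estimates one affine feature transform per speaker from unsupervised first-pass transcripts and re-decodes with the transformed features. The sketch below shows only how an already-estimated transform is applied; the estimation itself is the standard EM / row-by-row update and is omitted.

```python
import numpy as np

def apply_cmllr(features, A, b):
    """Apply a constrained MLLR (CMLLR / fMLLR) transform x' = A x + b frame by frame.

    features: (T, d) DNN-derived (post-PCA/HLDA) features for one speaker;
    A: (d, d) and b: (d,) are estimated per speaker from first-pass hypotheses.
    """
    return features @ A.T + b
```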
Experimental Results: 309hr DNN + 309hr GMM-HMM
- Deep hierarchical nonlinear feature mapping is the key
Experimental Results: 309hr DNN + 309hr GMM-HMM
- DNN-derived features vs. bottleneck features
Experimental Results: 309hr DNN + 2,000hr GMM-HMM
Experimental Results: 309hr DNN + 2,000hr GMM-HMM
- 0.5% absolute (or 3.6% relative) gain, at the cost of significantly increased DNN training time
Conclusion
- Used a new way of deriving features from a DNN
  - DNN-derived features from the last hidden layer
- Combined with the best techniques in GMM-HMM
  - Tied-state based RDLT training
  - Sequence-based MMI training
  - CMLLR unsupervised adaptation
- Achieved promising results with DNN-GMM-HMM
  - Scalable training + practical unsupervised adaptation
- Similar results using CNNs have been reported by IBM researchers (refer to their ICASSP-2013 paper)
Thanks! Q&A