Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models
Navdeep Jaitly, Vincent Vanhoucke, Geoffrey Hinton
Presenter: Ming-Han Yang, NTNU Speech and Machine Intelligence Laboratory, 2016/05/31

Outline
- Abstract
- Introduction
- Methods
- Experiments and Discussion
- Conclusions

Abstract
We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame from the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network, without convolutional architectures or dropout training.
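As a hedged illustration of "a separate softmax unit for each of the frames", the sketch below shows one way such a network could be written (it is not the authors' code; the class name, layer sizes, window length, and state count are assumptions chosen for illustration):

```python
# Minimal sketch of a DNN with one softmax head per frame in the prediction
# window. All sizes below (input_dim, hidden_dim, num_states, num_frames)
# are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class MultiFrameDNN(nn.Module):
    def __init__(self, input_dim=440, hidden_dim=2048, num_layers=4,
                 num_states=183, num_frames=11):   # num_frames = 2K+1
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.trunk = nn.Sequential(*layers)
        # One independent linear + softmax per predicted frame.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_states) for _ in range(num_frames)])

    def forward(self, x):                  # x: (batch, input_dim)
        h = self.trunk(x)
        # One log-probability distribution per frame in the prediction
        # window; each head is trained against that frame's
        # forced-alignment state with its own cross-entropy loss.
        return [torch.log_softmax(head(h), dim=-1) for head in self.heads]
```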

Introduction (1/3)
The use of forced alignments from a GMM-HMM for training NNs suffers from several drawbacks (e.g. GMM-HMM quality, GMM assumptions). In Figure 1 we present results of an experiment showing that forced alignments may not provide the best data to train neural networks with.
- We generated forced alignments from a tri-state monophone GMM-HMM system trained on TIMIT.
- For each segment corresponding to a phoneme we re-segmented the internal state boundaries by distributing them equally within the three internal states.
- Thus each segment between the start frame and the end frame assigned to a phoneme was split into three sub-segments, and these were assigned the start state, the middle state and the end state of the three-state HMM.
- The effect of this is to generate an alignment that is smoothed out.
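The re-segmentation in the bullets above amounts to splitting each phoneme's frames evenly across its three states. A small hypothetical helper (my own sketch; the function name and (start, end, phone) segment format are assumptions) makes the step concrete:

```python
# Hedged sketch of the "smoothed" alignment: each phoneme segment is split
# into three equal sub-segments assigned to its start/middle/end states.
def smooth_alignment(segments):
    """segments: list of (start_frame, end_frame, phone), end exclusive.
    Returns one (phone, state) label per frame, with state in {0, 1, 2}."""
    frame_labels = []
    for start, end, phone in segments:
        n = end - start
        for i in range(n):
            # first third of the segment -> state 0, middle -> 1, last -> 2
            state = 3 * i // n
            frame_labels.append((phone, state))
    return frame_labels
```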

Introduction (2/3)
(Figure 1: results of the forced-alignment experiment described above.)

Introduction (3/3)
In this paper we present a method that attempts to incorporate these insights into neural network training from forced alignments.
- Training time: we train a neural network to predict the phone states of all the frames within a context window of a central frame, using the acoustic data around the same central frame with the same (or larger) context window as input.
- Test time: we take a geometric average (product model) of the predictions for each frame from all the acoustic contexts that model the state of that frame in their output layer.
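A hedged sketch of the test-time combination follows: for each frame, the log-probabilities predicted for it by every context window that covers it are averaged, which is a geometric average in the probability domain. The array layout and variable names are assumptions for illustration, not from the paper.

```python
import numpy as np

def combine_predictions(log_probs, K):
    """Geometric (product-of-experts) averaging of multi-frame predictions.

    log_probs: array of shape (T, 2K+1, S), where log_probs[t, k] is the
    log-probability over S states that the network centred on frame t
    predicts for frame t + (k - K).
    Returns combined per-frame log-probabilities of shape (T, S)."""
    T, W, S = log_probs.shape
    combined = np.zeros((T, S))
    counts = np.zeros((T, 1))
    for t in range(T):
        for k in range(W):
            target = t + k - K
            if 0 <= target < T:
                combined[target] += log_probs[t, k]  # sum of logs = product of probs
                counts[target] += 1
    combined /= counts                               # geometric average, in log domain
    # renormalise (stable log-sum-exp) so each frame forms a distribution
    m = combined.max(axis=1, keepdims=True)
    combined -= m + np.log(np.exp(combined - m).sum(axis=1, keepdims=True))
    return combined
```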

Methods (1/5)

Methods (2/5)

Methods (3/5)

Methods (4/5)

Geometric Averaging vs. Arithmetic Averaging

Methods (5/5)

Experiments and Discussion (1)

Experiments and Discussion (2) - TIMIT

Experiments and Discussion (3) - TIMIT

Experiments and Discussion (4) - TIMIT
Impact of Depth of DNNs
Note that we used fully connected deep neural network (DNN) models for this and achieved accuracies significantly better than those reported for simple CNN-DNN-HMM systems, and comparable to a carefully crafted CNN-DNN-HMM model with heterogeneous pooling that was trained with dropout.
It is our expectation that the gains are complementary, and that similar gains would be produced when these ideas are applied to convolutional and other discriminative models.

Experiments and Discussion (5) - WSJ
Geometric Averaging Compared to Arithmetic Averaging
A possible explanation for why geometric averaging outperforms arithmetic averaging is that geometric averaging acts like a set of constraints: solutions that sharply violate any one of the predictions are discouraged under this model. Arithmetic averaging, on the other hand, accepts a solution as long as one of the models is quite happy with it; it is thus susceptible to the bad decision boundaries of models that have overfit significantly.
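A small made-up numerical example illustrates the intuition: if three contexts assign probabilities 0.9, 0.8 and 0.01 to a state, the arithmetic mean is still about 0.57, while the geometric mean drops to about 0.19, so a single sharply violated prediction effectively vetoes the state under the product model.

```python
import numpy as np

# Illustrative (made-up) per-context probabilities for one state of one frame.
p = np.array([0.9, 0.8, 0.01])         # one context sharply disagrees

arithmetic = p.mean()                   # ~0.57: the outlier barely matters
geometric = np.exp(np.log(p).mean())    # ~0.19: the outlier acts as a veto

print(f"arithmetic mean = {arithmetic:.2f}, geometric mean = {geometric:.2f}")
```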

Conclusions
We have shown that using an autoregressive product of a DNN-HMM system trained to predict the phone labels of multiple frames can improve speech recognition accuracy.
- The autoregressive model bears a resemblance to RNNs because it attempts to predict states over a range of frames; these connections need to be further explored.
- Model combination approaches frequently benefit from using weighted combinations; in the future we will explore these avenues further.
- Lastly, it is interesting to note that geometric averaging outperforms arithmetic averaging here; it will be interesting to see if this observation can be applied to training ensembles of models for speech recognition in new ways.