Survey on state-of-the-art approaches: Neural Network Trends in Speech Recognition


Survey on state-of-the-art approaches: Neural Network Trends in Speech Recognition
Presented by Ming-Han Yang (楊明翰)

Outline
Speech Processing
◦ Neural Network Trends in Speech Recognition
 EXPLORING MULTIDIMENSIONAL LSTM FOR LARGE VOCABULARY ASR (Microsoft Corporation)
 END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION (Yoshua Bengio, Université de Montréal, Canada)
 DEEP CONVOLUTIONAL ACOUSTIC WORD EMBEDDINGS USING WORD-PAIR SIDE INFORMATION (Toyota Technological Institute at Chicago, United States)
 VERY DEEP MULTILINGUAL CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR (IBM, United States; Yann LeCun, New York University, United States)
 LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION (Google Inc., United States)
 A DEEP SCATTERING SPECTRUM - DEEP SIAMESE NETWORK PIPELINE FOR UNSUPERVISED ACOUSTIC MODELING (Facebook A.I. Research, France)

Introduction
Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is their use of time recurrence, combined with a gating architecture that lets them track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM), which first performs 1-D recurrence over the frequency axis and then 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses its output activations as the input to a time LSTM (T-LSTM).
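The joint time-frequency scan can be sketched as a 2-D recurrence in which the state at grid point (t, f) depends on its time neighbor (t-1, f) and its frequency neighbor (t, f-1). A minimal NumPy sketch, assuming illustrative weight matrices and a plain tanh update in place of the full gated LSTM cell, so it shows only the scan order:

```python
import numpy as np

def tf_scan(spectrogram, W_in, W_time, W_freq):
    """Simplified 2-D time-frequency recurrence (sketch only: a real
    TF-LSTM uses gated LSTM cells, not this plain tanh update).

    spectrogram: (T, F, D) array, D features per time-frequency bin.
    The state at (t, f) depends on the states at (t-1, f) and (t, f-1),
    so the scan jointly covers the time and frequency axes."""
    T, F, _ = spectrogram.shape
    H = W_in.shape[1]
    h = np.zeros((T, F, H))
    for t in range(T):
        for f in range(F):
            h_time = h[t - 1, f] if t > 0 else np.zeros(H)  # time neighbor
            h_freq = h[t, f - 1] if f > 0 else np.zeros(H)  # frequency neighbor
            h[t, f] = np.tanh(spectrogram[t, f] @ W_in
                              + h_time @ W_time
                              + h_freq @ W_freq)
    return h
```

In the full model, the activations `h[t]` produced by this scan become the per-frame input to the T-LSTM stack.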

THE LSTM-RNN

TF-LSTM processing

Corpora Description & Experiments
Microsoft Windows phone short message dictation task
◦ Training data: 375 hr
◦ Test set: 125k words
Features
◦ 87-dimensional log-filter-bank features (29 dimensions × 3)
5976 tied-triphone states (senones)
DNN settings:
◦ 5 layers × 2048 units; splice=5
LSTM settings:
◦ T-LSTM: 1024 neurons per layer
◦ each layer followed by a linear projection layer to 512 units
◦ BPTT step=20; output delay=5 frames, etc.
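The DNN baseline's splice=5 setting means each 87-dimensional frame is stacked with 5 neighbors on each side, giving an 87 × (2·5+1) = 957-dimensional network input. A minimal pure-Python sketch (edge frames are handled here by repeating the first/last frame, one common convention; the paper does not state its padding scheme):

```python
def splice(frames, context=5):
    """Stack each frame with `context` neighbors on each side.
    Out-of-range indices are clamped, i.e. the first/last frame is
    repeated at the edges (an assumed convention for this sketch)."""
    T = len(frames)
    out = []
    for t in range(T):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), T - 1)])
        out.append(window)
    return out
```

With 87-dimensional frames and context=5, each spliced vector has 957 components, matching the DNN's input layer width.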

Introduction
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters, without pronunciation models, HMMs, or other components of traditional speech recognizers. LAS consists of two sub-modules: the listener and the speller.
◦ The listener is an acoustic model encoder that performs an operation called Listen.
◦ The speller is an attention-based character decoder that performs an operation we call AttendAndSpell.
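At each output step, the speller scores every listener state against its current decoder state, normalizes the scores into an attention distribution, and takes the weighted sum of listener states as a context vector for predicting the next character. A minimal NumPy sketch of one such step, assuming a dot-product score in place of the learned MLP energy function used in LAS:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """One AttendAndSpell-style attention step (sketch; LAS scores with
    small MLPs over both states rather than a raw dot product).

    encoder_states: (U, H) listener outputs; decoder_state: (H,)."""
    scores = encoder_states @ decoder_state   # (U,) alignment energies
    alpha = softmax(scores)                   # attention over input frames
    context = alpha @ encoder_states          # (H,) context vector
    return context, alpha
```

The context vector is then concatenated with the decoder state to produce the character distribution for this step.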

Introduction (cont.)

Listen

Attend and Spell

Attend and Spell (cont.)

Corpora Description & Experiments
Google Voice Search Task
◦ 2000 hours, 3 million utterances
◦ Test set: 16 hours
Features
◦ 40-dimensional log-mel filter bank
All utterances were padded with the start-of-sentence and the end-of-sentence tokens.
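The sentence-token padding amounts to wrapping each target character sequence with boundary symbols; a trivial sketch, where the token names `<sos>` and `<eos>` are illustrative rather than the paper's exact symbols:

```python
def add_sentence_tokens(chars, sos="<sos>", eos="<eos>"):
    """Wrap a character sequence with start/end-of-sentence tokens,
    as done for every LAS training utterance (token names assumed)."""
    return [sos] + list(chars) + [eos]
```

The end-of-sentence token also gives the decoder a way to terminate transcription at inference time.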

Introduction
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters. To our knowledge, all these approaches relied on Connectionist Temporal Classification (CTC) [3] modules. We start from the system proposed in [11] for phoneme recognition and make the following contributions:
◦ reduce the total training complexity from quadratic to linear
◦ introduce a recurrent architecture that successively reduces the source sequence length by pooling frames neighboring in time
◦ combine a character-level ARSG with an n-gram word-level language model through a WFST framework
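The pooling of neighboring frames can be sketched as concatenating each pair of adjacent hidden vectors between recurrent layers, which halves the sequence length the next layer must process. A minimal NumPy sketch (the paper's exact pooling variant may differ; odd-length tails are simply dropped here):

```python
import numpy as np

def pool_pairs(h):
    """Halve the sequence length by concatenating neighboring frames.

    h: (T, D) hidden states. Returns (T//2, 2*D): row i holds frames
    2i and 2i+1 side by side, so each higher layer sees half as many
    steps as the layer below it."""
    T, D = h.shape
    T2 = T // 2
    return h[: 2 * T2].reshape(T2, 2 * D)
```

Stacking k such layers shortens the effective input by a factor of 2^k, which is what brings the attention cost down on long utterances.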

Introduction (cont.)

Corpora Description & Experiments