
1 Survey on state-of-the-art approaches: Neural Network Trends in Speech Recognition. Presented by Ming-Han Yang (楊明翰)

2 Outline
Speech Processing
◦ Neural Network Trends in Speech Recognition
 EXPLORING MULTIDIMENSIONAL LSTM FOR LARGE VOCABULARY ASR (Microsoft Corporation)
 END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION (Yoshua Bengio, Université de Montréal, Canada)
 DEEP CONVOLUTIONAL ACOUSTIC WORD EMBEDDINGS USING WORD-PAIR SIDE INFORMATION (Toyota Technological Institute at Chicago, United States)
 VERY DEEP MULTILINGUAL CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR (IBM, United States; Yann LeCun, New York University, United States)
 LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION (Google Inc., United States)
 A DEEP SCATTERING SPECTRUM - DEEP SIAMESE NETWORK PIPELINE FOR UNSUPERVISED ACOUSTIC MODELING (Facebook A.I. Research, France)

3

4 Introduction
Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is the use of time recurrence, combined with a gating architecture that lets them track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM), which first performs 1-D recurrence over the frequency axis and then feeds the result to a 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM).
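A minimal PyTorch sketch of the F-LSTM-then-T-LSTM pipeline described above; the chunk size, hop, and layer widths are illustrative assumptions, not the paper's settings:

import torch
import torch.nn as nn

class FLSTMThenTLSTM(nn.Module):
    """F-LSTM -> T-LSTM sketch: recurrence over frequency, then time."""
    def __init__(self, n_bins=87, chunk=8, hop=4, f_hidden=64, t_hidden=512):
        super().__init__()
        self.chunk, self.hop = chunk, hop
        self.f_lstm = nn.LSTM(chunk, f_hidden, batch_first=True)
        n_chunks = (n_bins - chunk) // hop + 1
        self.t_lstm = nn.LSTM(n_chunks * f_hidden, t_hidden, batch_first=True)

    def forward(self, x):                      # x: (batch, time, n_bins)
        B, T, F = x.shape
        # Slide a window along the frequency axis: (B*T, n_chunks, chunk)
        chunks = x.reshape(B * T, F).unfold(1, self.chunk, self.hop)
        f_out, _ = self.f_lstm(chunks)         # 1-D recurrence over frequency
        f_out = f_out.reshape(B, T, -1)        # concatenate per-frame outputs
        t_out, _ = self.t_lstm(f_out)          # 1-D recurrence over time
        return t_out

y = FLSTMThenTLSTM()(torch.randn(2, 100, 87))  # -> (2, 100, 512)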

5 THE LSTM-RNN
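This slide is figure-only in the transcript. For reference, the standard LSTM recurrence the deck builds on is, in LaTeX notation:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)

(ASR LSTMs often add peephole connections and a linear projection of h_t; both are omitted here.)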

6 TF-LSTM processing
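This slide is likewise figure-only. In the spirit of multidimensional LSTMs, the joint time-frequency scan at frame t and frequency chunk k takes recurrent input from both the previous time step and the previous frequency step, with one forget gate per axis; the exact parameterization in the paper may differ:

i_{t,k} = \sigma(W_{xi} x_{t,k} + W_{Ti} h_{t-1,k} + W_{Fi} h_{t,k-1} + b_i)
f^{(T)}_{t,k}, f^{(F)}_{t,k}: analogous sigmoid gates for the time and frequency axes
c_{t,k} = f^{(T)}_{t,k} \odot c_{t-1,k} + f^{(F)}_{t,k} \odot c_{t,k-1} + i_{t,k} \odot \tanh(W_{xc} x_{t,k} + W_{Tc} h_{t-1,k} + W_{Fc} h_{t,k-1} + b_c)
h_{t,k} = o_{t,k} \odot \tanh(c_{t,k})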

7 Corpora Description & Experiments
Microsoft Windows phone short message dictation task
◦ Training data: 375 hours
◦ Test set: 125k words
Features
◦ 87-dimensional log-filter-bank features (29 dimensions × 3)
◦ 5976 tied-triphone states (senones)
DNN settings:
◦ 5 layers × 2048 units; splice = 5
LSTM settings:
◦ T-LSTM: 1024 memory cells per layer
◦ each layer followed by a linear projection layer down to 512
◦ BPTT steps = 20; output delay = 5 frames, etc.
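A quick sanity check on the dimensions above, interpreting splice = 5 as five context frames on each side (an assumption):

static = 29                     # 29 log-filter-bank bins
feat_dim = static * 3           # x3 -> the 87-dimensional features
context = 2 * 5 + 1             # splice = 5: 5 left + current + 5 right
dnn_input = feat_dim * context  # 87 * 11 = 957 inputs to the 5x2048 DNN
print(feat_dim, dnn_input)      # 87 957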

8

9 Introduction
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters, without pronunciation models, HMMs, or other components of traditional speech recognizers. LAS consists of two sub-modules: the listener and the speller.
◦ The listener is an acoustic-model encoder that performs an operation called Listen.
◦ The speller is an attention-based character decoder that performs an operation we call AttendAndSpell.
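A minimal PyTorch sketch of the Listen operation as the LAS paper describes it: a pyramidal BLSTM that concatenates adjacent frames before each layer, halving the time resolution; the layer count and widths here are illustrative assumptions:

import torch
import torch.nn as nn

class Listener(nn.Module):
    """Pyramidal BLSTM: each layer halves the time resolution."""
    def __init__(self, in_dim=40, hidden=256, n_pyramid=3):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = in_dim
        for _ in range(n_pyramid):
            self.layers.append(nn.LSTM(dim * 2, hidden, batch_first=True,
                                       bidirectional=True))
            dim = hidden * 2                  # BLSTM output feeds next layer

    def forward(self, x):                     # x: (batch, time, in_dim)
        for lstm in self.layers:
            B, T, D = x.shape
            if T % 2:                         # drop an odd trailing frame
                x, T = x[:, :-1], T - 1
            x = x.reshape(B, T // 2, 2 * D)   # concatenate adjacent frames
            x, _ = lstm(x)
        return x                              # time reduced by 2**n_pyramid

h = Listener()(torch.randn(2, 800, 40))       # -> (2, 100, 512)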

10 Introduction (cont.)

11 Listen

12 Attend and Spell

13 Attend and Spell (cont.)
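Slides 12-13 are figure-only in this transcript. The AttendAndSpell operation can be summarized from the LAS paper as follows, where s_i is the decoder state, h_1, ..., h_U are the listener outputs, and \phi, \psi are small MLPs:

s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1})
e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle
\alpha_{i,u} = \exp(e_{i,u}) / \sum_{u'} \exp(e_{i,u'})
c_i = \sum_u \alpha_{i,u} h_u
P(y_i \mid x, y_{<i}) = \mathrm{CharacterDistribution}(s_i, c_i)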

14 Corpora Description & Experiments
Google Voice Search Task
◦ 2000 hours, 3 million utterances
◦ Test set: 16 hours
Features
◦ 40-dimensional log-mel filter bank
All utterances were padded with start-of-sentence and end-of-sentence tokens.
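A trivial illustration of the target construction mentioned above (the token spellings are assumptions):

SOS, EOS = "<sos>", "<eos>"

def make_target(transcript):
    # Character-level targets framed by sentence-boundary tokens
    return [SOS] + list(transcript) + [EOS]

print(make_target("hello"))  # ['<sos>', 'h', 'e', 'l', 'l', 'o', '<eos>']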

15

16 Introduction
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures are trained to model sequences of characters. To our knowledge, all of these approaches relied on Connectionist Temporal Classification (CTC) [3] modules. We start from the system proposed in [11] for phoneme recognition and make the following contributions (the second is sketched in code after this slide):
◦ reduce the total training complexity from quadratic to linear
◦ introduce a recurrent architecture that successively reduces the source sequence length by pooling neighboring frames in time
◦ combine a character-level ARSG (attention-based recurrent sequence generator) with an n-gram word-level language model through a WFST
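A minimal PyTorch sketch of the pooling idea from the second bullet: each recurrent layer is followed by a max-pool over neighboring time steps, so the attention mechanism scans a sequence that shrinks layer by layer (layer types and sizes are illustrative assumptions):

import torch
import torch.nn as nn

class PooledEncoder(nn.Module):
    """Halves the sequence length after each RNN layer by pooling in time."""
    def __init__(self, in_dim=40, hidden=256, n_layers=2):
        super().__init__()
        self.rnns = nn.ModuleList(
            nn.GRU(in_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(n_layers))
        self.pool = nn.MaxPool1d(kernel_size=2)      # pools along time

    def forward(self, x):                            # x: (batch, time, dim)
        for rnn in self.rnns:
            x, _ = rnn(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # halve time
        return x

h = PooledEncoder()(torch.randn(2, 400, 40))         # -> (2, 100, 256)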

17 Introduction (cont.)

18

19

20 Corpora Description & Experiments

