Listen, Attend and Spell – a brief introduction
Dr Ning Ma
Speech and Hearing Group, University of Sheffield
Classical speech recognition architecture
- Pipeline: Speech → Front-end → Acoustic Models → Pronunciation Models → Language Models → W = "there is a cat"
- Front-end: classical signal processing, producing feature vectors
- Acoustic models: Gaussian mixture models
- Pronunciation models: pronunciation tables, e.g. there: /ðɛː/, is: /ɪz/, a: /ə/, cat: /kat/
- Language models: N-gram models
The neural network revolution
Each stage of the classical pipeline is replaced by a neural counterpart:
- Front-end: classical signal processing → CNNs, auto-encoders
- Acoustic models: Gaussian mixture models → DNN-HMMs, LSTM-HMMs
- Pronunciation models: pronunciation tables → RNN-based pronunciation models
- Language models: N-gram models → neural language models
End-to-end speech recognition
- X is the audio (feature vectors) and Y is a text sequence (the transcript)
- Classical: X → features → Acoustic Models → Pronunciation Models → Language Models → Y
- End-to-end: X → features → a single probabilistic model → Y
- Perform speech recognition by directly learning the probabilistic model p(Y|X)
- Two main approaches:
  - Connectionist Temporal Classification (CTC)
  - Sequence-to-sequence models with attention (seq2seq)
Connectionist Temporal Classification (CTC)
- A bi-directional RNN reads the input frames x1 … x8 and produces log probabilities for the token classes at each time frame
- The per-frame softmax is over the vocabulary plus an extra blank token "_"
Connectionist Temporal Classification (CTC)
- Only transitions from a symbol to itself or to the blank "_" are allowed
- Repeated symbols are merged and blanks are removed (see the sketch below):
  - cc_aa_t_ maps to cat
  - ccc__a_t_ maps to cat
  - cccc_aaa_ttt_ maps to cat
- Dynamic programming allows efficient calculation of the log probability p(Y|X) and its gradient, which can be back-propagated to learn the RNN parameters
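A minimal NumPy sketch of both ideas, the collapsing rule and the dynamic-programming (forward) pass; the function names and blank conventions are illustrative, not from the slides:

```python
import numpy as np

def ctc_collapse(path, blank="_"):
    """Collapsing rule from the slide: merge repeated symbols, drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# All three frame-level paths from the slide collapse to "cat":
assert ctc_collapse("cc_aa_t_") == "cat"
assert ctc_collapse("ccc__a_t_") == "cat"
assert ctc_collapse("cccc_aaa_ttt_") == "cat"

def ctc_log_prob(log_probs, target, blank=0):
    """Forward (dynamic-programming) pass: log p(target | X).
    log_probs: (T, V) per-frame log probabilities from the bi-directional RNN.
    target: non-empty list of label indices, without blanks."""
    ext = [blank]                          # interleave blanks: c a t -> _ c _ a _ t _
    for label in target:
        ext += [label, blank]
    T, S = log_probs.shape[0], len(ext)
    alpha = np.full((T, S), -np.inf)       # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]                       # stay on same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])           # move from previous symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])           # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # Valid endings: the final label or the trailing blank
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])

# Toy check: 4 frames, vocabulary {_, c, a, t} with uniform frame posteriors
logits = np.log(np.full((4, 4), 0.25))
print(ctc_log_prob(logits, target=[1, 2, 3]))
```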
Limitations of CTC
- CTC outputs often lack correct spelling and grammar, e.g. "A Kat sat on the desk" for "A cat sat on the desk", or "Kat said hello" for "Cat said hello"
- A language model is therefore required for rescoring
- CTC makes label predictions for each frame based on the audio data alone: p(Y|X)
- It assumes label predictions are conditionally independent of each other (written out below)
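Written out, the factorisation makes this independence assumption explicit: each frame's label depends on X alone, never on neighbouring labels. Notation assumed here: π is a frame-level path and B the collapsing map from the previous slide.

```latex
p(\pi \mid X) = \prod_{t=1}^{T} p(\pi_t \mid X),
\qquad
p(Y \mid X) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi \mid X)
```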
Sequence-to-sequence models (seq2seq)
- The input frames x1 … x8 are encoded into a representation f(X)
- A decoder/transducer predicts the next token from the tokens emitted so far: p(y_{t+1} | y_{1…t}, X)
- The transcript is produced one token at a time (a decoding sketch follows)
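A minimal sketch of that autoregressive loop with greedy decoding; the step_fn interface and the sos/eos tokens are assumptions for illustration, not an API from the slides:

```python
import numpy as np

def greedy_decode(step_fn, encoded, sos=0, eos=1, max_len=50):
    """Autoregressive decoding: every prediction conditions on the encoded
    input f(X) and on all previously emitted tokens y_1..t.
    step_fn(encoded, prefix) -> probability vector over the vocabulary."""
    y = [sos]
    for _ in range(max_len):
        probs = step_fn(encoded, y)    # p(y_{t+1} | y_1..t, X)
        nxt = int(np.argmax(probs))
        if nxt == eos:
            break
        y.append(nxt)
    return y[1:]                       # strip the start-of-sequence token

# Toy usage with a dummy step function: emits token 2 twice, then eos (=1)
def dummy_step(encoded, prefix):
    p = np.full(5, 0.1)
    p[2 if len(prefix) < 3 else 1] = 1.0
    return p

print(greedy_decode(dummy_step, encoded=None))  # [2, 2]
```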
Attention models
Attention example
- Each prediction is derived from "attending" to a segment of the input
- The attention vector shows where the model thinks the relevant information is to be found (a sketch follows)
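A minimal sketch of attention with dot-product scoring; dot-product is one common choice, the slides do not fix a particular score function:

```python
import numpy as np

def attend(s, h):
    """s: (d,) current decoder state; h: (T, d) encoder hidden states.
    Returns the attention weights and the context (weighted sum)."""
    scores = h @ s                        # similarity of s to every input frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax -> attention vector
    context = alpha @ h                   # weighted sum of the attended frames
    return alpha, context

# Example: 8 encoder frames of dimension 4
h = np.random.randn(8, 4)
s = np.random.randn(4)
alpha, context = attend(s, h)
print(alpha.round(2), context.shape)      # weights sum to 1; context is (4,)
```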
Listen, Attend and Spell (LAS)
- Encoder (RNN), named the listener: maps the low-level signals x1 … x8 to high-level features f(X)
- Decoder/transducer (RNN), named the speller: generates the transcript, predicting y_{t+1} from y_{1…t} and f(X)
Listen, Attend and Spell (LAS)
- h: hidden state sequence from the encoder; s: state vector from the decoder
- Attention vector: softmax{ f([h_t, s]) } over the encoder time steps (sketched below)
- A hierarchical encoder reduces the time resolution (sketched below)
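A sketch of the attention vector computation, with f assumed to be a small one-layer MLP (a common choice; the slides only name f, and W, b, v here stand for its learned parameters):

```python
import numpy as np

def las_attention(s, H, W, b, v):
    """Energies e_t = f([h_t, s]) for each encoder step, then softmax.
    s: (ds,) decoder state; H: (T, dh) encoder hidden state sequence."""
    e = np.array([v @ np.tanh(W @ np.concatenate([h_t, s]) + b) for h_t in H])
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()            # attention vector over encoder steps

# Shapes: ds=3 decoder state, dh=4 encoder states, hidden width 5, T=8 frames
ds, dh, hid, T = 3, 4, 5, 8
rng = np.random.default_rng(0)
W, b, v = rng.normal(size=(hid, dh + ds)), rng.normal(size=hid), rng.normal(size=hid)
alpha = las_attention(rng.normal(size=ds), rng.normal(size=(T, dh)), W, b, v)
print(alpha.sum())                        # 1.0
```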
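And a sketch of one level of the hierarchical encoder's time reduction, done here by concatenating consecutive frame pairs, consistent with LAS's pyramidal listener:

```python
import numpy as np

def pyramid_step(H):
    """Concatenate consecutive pairs of frames so the next RNN layer sees
    half as many time steps. A minimal sketch; LAS stacks several levels."""
    T, d = H.shape
    H = H[: T - T % 2]                    # drop an odd trailing frame
    return H.reshape(-1, 2 * d)           # (T // 2, 2 * d)

H = np.random.randn(8, 16)                # 8 frames of 16-d features
print(pyramid_step(H).shape)              # (4, 32)
print(pyramid_step(pyramid_step(H)).shape)  # (2, 64) after two levels
```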
Limitations of LAS (seq2seq)
- Not an online model: all input must be received before transcripts can be produced
- Attention is a computational bottleneck
- The length of the input has a large impact on accuracy