Listen, Attend and Spell – a brief introduction
Dr Ning Ma
Speech and Hearing Group, University of Sheffield
Classical speech recognition architecture
- Pipeline: Speech → Front-end → Acoustic Models → Pronunciation Models → Language Models → W = "there is a cat"
- Front-end: classical signal processing, producing feature vectors
- Acoustic models: Gaussian mixture models
- Pronunciation models: pronunciation tables, e.g. there: /ðɛː/, is: /ɪz/, a: /ə/, cat: /kat/
- Language models: N-gram models
The neural network revolution
Each stage of the classical pipeline is replaced by a neural counterpart:
- Front-end: classical signal processing → CNNs, auto-encoders
- Acoustic models: Gaussian mixture models → DNN-HMMs, LSTM-HMMs
- Pronunciation models: pronunciation tables → RNN-based pronunciation models
- Language models: N-gram models → neural language models
End-to-end speech recognition
- X is the audio (feature vectors) and Y is a text sequence (the transcript)
- Classical: X → features → Acoustic Models → Pronunciation Models → Language Models → Y
- End-to-end: X → features → a single probabilistic model → Y
- Perform speech recognition by directly learning the probabilistic model p(Y|X)
- Two main approaches:
  - Connectionist Temporal Classification (CTC)
  - Sequence-to-sequence models with attention (seq2seq)
Connectionist Temporal Classification (CTC)
- A bi-directional RNN reads the input frames x1 … x8 and produces log probabilities for the token classes at each time frame
- The per-frame softmax is over the vocabulary plus an extra blank token "_"
Connectionist Temporal Classification (CTC)
- Only transitions from a symbol to itself or to the blank "_" are allowed
- Repeated symbols are merged and blanks are removed (see the sketch below):
  - cc_aa_t_ maps to cat
  - ccc__a_t_ maps to cat
  - cccc_aaa_ttt_ maps to cat
- Dynamic programming allows efficient calculation of the log probability p(Y|X) and its gradient, which can be back-propagated to learn the RNN parameters
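A minimal NumPy sketch of both ideas, the collapsing rule and the dynamic-programming (forward) pass; the function names and blank conventions are illustrative, not from the slides:

```python
import numpy as np

def ctc_collapse(path, blank="_"):
    """Collapsing rule from the slide: merge repeated symbols, drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# All three frame-level paths from the slide collapse to "cat":
assert ctc_collapse("cc_aa_t_") == "cat"
assert ctc_collapse("ccc__a_t_") == "cat"
assert ctc_collapse("cccc_aaa_ttt_") == "cat"

def ctc_log_prob(log_probs, target, blank=0):
    """Forward (dynamic-programming) pass: log p(target | X).
    log_probs: (T, V) per-frame log probabilities from the bi-directional RNN.
    target: non-empty list of label indices, without blanks."""
    ext = [blank]                          # interleave blanks: c a t -> _ c _ a _ t _
    for label in target:
        ext += [label, blank]
    T, S = log_probs.shape[0], len(ext)
    alpha = np.full((T, S), -np.inf)       # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]                       # stay on same symbol
            if s > 0:
                cand.append(alpha[t - 1, s - 1])           # move from previous symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])           # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    # Valid endings: the final label or the trailing blank
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])

# Toy check: 4 frames, vocabulary {_, c, a, t} with uniform frame posteriors
logits = np.log(np.full((4, 4), 0.25))
print(ctc_log_prob(logits, target=[1, 2, 3]))
```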
Limitations of CTC
- CTC outputs often lack correct spelling and grammar, e.g. "A Kat sat on the desk" for "A cat sat on the desk", or "Kat said hello" for "Cat said hello"
- A language model is therefore required for rescoring
- CTC makes label predictions for each frame based on the audio data alone: p(Y|X)
- It assumes label predictions are conditionally independent of each other (written out below)
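Written out, the factorisation makes this independence assumption explicit: each frame's label depends on X alone, never on neighbouring labels. Notation assumed here: π is a frame-level path and B the collapsing map from the previous slide.

```latex
p(\pi \mid X) = \prod_{t=1}^{T} p(\pi_t \mid X),
\qquad
p(Y \mid X) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi \mid X)
```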
Sequence-to-sequence models (seq2seq)
- The input frames x1 … x8 are encoded into a representation f(X)
- A decoder/transducer predicts the next token from the tokens emitted so far: p(y_{t+1} | y_{1…t}, X)
- The transcript is produced one token at a time (a decoding sketch follows)
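A minimal sketch of that autoregressive loop with greedy decoding; the step_fn interface and the sos/eos tokens are assumptions for illustration, not an API from the slides:

```python
import numpy as np

def greedy_decode(step_fn, encoded, sos=0, eos=1, max_len=50):
    """Autoregressive decoding: every prediction conditions on the encoded
    input f(X) and on all previously emitted tokens y_1..t.
    step_fn(encoded, prefix) -> probability vector over the vocabulary."""
    y = [sos]
    for _ in range(max_len):
        probs = step_fn(encoded, y)    # p(y_{t+1} | y_1..t, X)
        nxt = int(np.argmax(probs))
        if nxt == eos:
            break
        y.append(nxt)
    return y[1:]                       # strip the start-of-sequence token

# Toy usage with a dummy step function: emits token 2 twice, then eos (=1)
def dummy_step(encoded, prefix):
    p = np.full(5, 0.1)
    p[2 if len(prefix) < 3 else 1] = 1.0
    return p

print(greedy_decode(dummy_step, encoded=None))  # [2, 2]
```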
Attention models
Attention example
- Each prediction is derived from "attending" to a segment of the input
- The attention vector shows where the model thinks the relevant information is to be found (a sketch follows)
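A minimal sketch of attention with dot-product scoring; dot-product is one common choice, the slides do not fix a particular score function:

```python
import numpy as np

def attend(s, h):
    """s: (d,) current decoder state; h: (T, d) encoder hidden states.
    Returns the attention weights and the context (weighted sum)."""
    scores = h @ s                        # similarity of s to every input frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax -> attention vector
    context = alpha @ h                   # weighted sum of the attended frames
    return alpha, context

# Example: 8 encoder frames of dimension 4
h = np.random.randn(8, 4)
s = np.random.randn(4)
alpha, context = attend(s, h)
print(alpha.round(2), context.shape)      # weights sum to 1; context is (4,)
```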
Listen, Attend and Spell (LAS)
- Encoder (RNN), named the listener: maps the low-level signals x1 … x8 to high-level features f(X)
- Decoder/transducer (RNN), named the speller: generates the transcript, predicting y_{t+1} from y_{1…t} and f(X)
Listen, Attend and Spell (LAS)
- h: hidden state sequence from the encoder; s: state vector from the decoder
- Attention vector: softmax{ f([h_t, s]) } over the encoder time steps (sketched below)
- A hierarchical encoder reduces the time resolution (sketched below)
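A sketch of the attention vector computation, with f assumed to be a small one-layer MLP (a common choice; the slides only name f, and W, b, v here stand for its learned parameters):

```python
import numpy as np

def las_attention(s, H, W, b, v):
    """Energies e_t = f([h_t, s]) for each encoder step, then softmax.
    s: (ds,) decoder state; H: (T, dh) encoder hidden state sequence."""
    e = np.array([v @ np.tanh(W @ np.concatenate([h_t, s]) + b) for h_t in H])
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()            # attention vector over encoder steps

# Shapes: ds=3 decoder state, dh=4 encoder states, hidden width 5, T=8 frames
ds, dh, hid, T = 3, 4, 5, 8
rng = np.random.default_rng(0)
W, b, v = rng.normal(size=(hid, dh + ds)), rng.normal(size=hid), rng.normal(size=hid)
alpha = las_attention(rng.normal(size=ds), rng.normal(size=(T, dh)), W, b, v)
print(alpha.sum())                        # 1.0
```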
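And a sketch of one level of the hierarchical encoder's time reduction, done here by concatenating consecutive frame pairs, consistent with LAS's pyramidal listener:

```python
import numpy as np

def pyramid_step(H):
    """Concatenate consecutive pairs of frames so the next RNN layer sees
    half as many time steps. A minimal sketch; LAS stacks several levels."""
    T, d = H.shape
    H = H[: T - T % 2]                    # drop an odd trailing frame
    return H.reshape(-1, 2 * d)           # (T // 2, 2 * d)

H = np.random.randn(8, 16)                # 8 frames of 16-d features
print(pyramid_step(H).shape)              # (4, 32)
print(pyramid_step(pyramid_step(H)).shape)  # (2, 64) after two levels
```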
Limitations of LAS (seq2seq)
- Not an online model: all input must be received before transcripts can be produced
- Attention is a computational bottleneck
- The length of the input has a large impact on accuracy