1
End-to-end speech recognition system using RNNs and the CTC loss function
Yannis Flet-Berliac, Tengyu Zhou, Maciej Korzepa, Gandalf Saxe
2
Presentation overview
Problem and history; Model; Implementation; Results
3
Why Speech Recognition?
Corti's goal: extract critical information about a patient's condition, such as a heart attack, from calls to emergency numbers. Many other applications: home automation, court reporting, live translation, etc.
4
1950s and 1960s : The dream starts
In 1952, Bell Laboratories designed the "Audrey" system, which recognized digits spoken by a single voice. Ten years later, at the 1962 World's Fair, IBM demonstrated its "Shoebox" machine, which could understand 16 words spoken in English.
5
1970s: The dream takes off. The first speech recognition company, Threshold Technology, was founded. The U.S. Department of Defense's DARPA ran its Speech Understanding Research program. Bell Laboratories introduced a system that could interpret multiple people's voices. Carnegie Mellon's "Harpy" could recognize just over 1,000 words.
6
1980s and 1990s: Wilder dreams
Statistical methods (HMMs) pushed vocabularies from a few hundred words to several thousand, with the potential of an unlimited number of words. Other new improvements: highly natural concatenative speech synthesis systems, machine learning, and mixed-initiative dialog systems.
7
Improvements over years
Huang X., Baker J., Reddy R. A historical perspective of speech recognition. Communications of the ACM, 2014, 57(1).
8
Deep Learning: The Future Trend?
9
Speech Recognition New Fashion
HMM, FNN and GMM systems: steady incremental improvements through the 2000s. Deep learning (LSTM): decreased the word error rate by 30%; around 2007, CTC-trained LSTMs started to outperform traditional approaches.
10
First Trend: Larger Vocabulary
Vocabulary size grew from the 100~1000 word range to 1000+ words.
11
Second Trend: From Isolated to Continuous
1960s: isolated words. 1970s: isolated words and connected digits. 1980s: connected words. 1990s: continuous speech.
12
ASR New Fashion (DNN) and the Importance of Data
13
Deep Speech 2: Model I/O, Accuracy Measurement, Model Structure, CTC
14
Input and output: end-to-end transcription
Input X: raw audio (WAV, 16 kHz, 16-bit), a 1D vector. Output Y: a sequence of words.
15
2. Accuracy Measurement: Word Error Rate (WER). Align the recognized and reference text (using dynamic string alignment) and count the edits needed to transform the recognized text into the reference text: WER = (S + D + I) / N, where S = number of substitutions, D = number of deletions, I = number of insertions, and N = number of words in the reference (a minimal sketch follows below).
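As a concrete illustration (not the presenters' code), a minimal Python sketch of WER via word-level edit distance, where S, D and I come out of the dynamic-programming alignment:

```python
# Minimal WER sketch: word-level edit distance between a recognized
# transcript and the reference, normalized by the reference length N.

def wer(reference: str, recognized: str) -> float:
    ref, hyp = reference.split(), recognized.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j recognized words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution / match
            dele = dp[i - 1][j] + 1                              # deletion
            ins = dp[i][j - 1] + 1                               # insertion
            dp[i][j] = min(sub, dele, ins)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```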
16
Layers in the Deep Speech 2 model
3. Model Structure: raw audio → spectrogram → CNN layers → RNN layers → fully connected layer → CTC cost function.
17
Spectrogram (pre-processing)
3. Model Structure. Apply an FFT over a short time window (20 ms).
18
Spectrogram (pre-processing)
3. Model Structure. Concatenate the windows from adjacent frames to get the spectrogram (a sketch follows below).
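A rough NumPy sketch of this pre-processing step, assuming 16 kHz mono audio; the 20 ms window and 10 ms hop are illustrative values, not necessarily the ones Deep Speech 2 uses:

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, win_ms=20, hop_ms=10):
    win = int(sample_rate * win_ms / 1000)    # samples per 20 ms window
    hop = int(sample_rate * hop_ms / 1000)    # step between adjacent windows
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        # magnitude of the FFT of one window = one column of the spectrogram
        frames.append(np.abs(np.fft.rfft(frame)))
    # concatenate windows from adjacent frames: (time, frequency) matrix
    return np.log1p(np.stack(frames))
```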
19
Speech engine: train from labeled pairs (x, y*); intermediate output c
3. Model Structure. Speech engine: train from labeled pairs (x, y*); intermediate output: c; extract the transcription from c.
20
Speech engine: main issue is segmentation, length(x) != length(y)
3. Model Structure. Speech engine: train from labeled pairs (x, y*); intermediate output: c; extract the transcription from c. Main issue: segmentation, since length(x) != length(y). (Phonemes are the perceptually distinct units of sound.)
21
Speech engine: main issue is segmentation, length(x) != length(y)
3. Model Structure. Speech engine, continued. Main issue: segmentation, since length(x) != length(y). Solution: align phonemes (the perceptually distinct units of sound) with the audio manually?
22
Speech engine: main issue is segmentation, length(x) != length(y)
3. Model Structure. Speech engine, continued. Main issue: segmentation, since length(x) != length(y). Solution: CTC.
23
Connectionist Temporal Classification
4. Connectionist Temporal Classification (CTC). The term was coined in "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks" (Graves et al., 2006).
24
Connectionist Temporal Classification
4. Connectionist Temporal Classification (CTC). The term comes from Graves et al. (2006): Temporal Classification means labelling unsegmented data sequences; Connectionist Temporal Classification means using RNNs for this purpose.
25
4. Connectionist Temporal Classification (CTC)
CTC step 1 of 3. The RNN output neurons c encode a distribution over symbols: c ∈ {A, B, C, …, Z, blank, space}.
26
4. Connectionist Temporal Classification (CTC)
CTC step 1 of 3. The RNN output neurons c encode a distribution over symbols: c ∈ {A, B, C, …, Z, blank, space}. Note: length(c) == length(x).
27
4. Connectionist Temporal Classification (CTC)
CTC step 1 of 3. The RNN output neurons c encode a distribution over symbols: c ∈ {A, B, C, …, Z, blank, space}. Note: length(c) == length(x). Note: c is a (length(x) × 28) matrix, i.e. one 28-way distribution per input frame.
28
4. Connectionist Temporal Classification (CTC)
CTC step 1 of 3. The RNN output neurons c encode a distribution over symbols: c ∈ {A, B, C, …, Z, blank, space}. The output neurons define a distribution over whole character sequences c, assuming independence across frames: P(c|x) = Π_t P(c_t|x).
29
4. Connectionist Temporal Classification (CTC)
CTC step 1 of 3. The output neurons define a distribution over whole character sequences c, assuming independence across frames: P(c|x) = Π_t P(c_t|x), i.e. the probability of a character sequence that is the same length as the audio.
30
CTC step 2 of 3 2. Define a mapping β(c) → y
4. Connectionist Temporal Classification (CTC). CTC step 2 of 3: define a mapping β(c) → y, where β(c) deletes duplicates and blanks, e.g. y = β(c) = β(HHH_E__LL_LO___) = “HELLO” (sketched below).
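A small Python sketch of the β(c) collapse, using '_' as an illustrative blank symbol:

```python
import itertools

def beta(c: str, blank: str = "_") -> str:
    collapsed = (ch for ch, _ in itertools.groupby(c))       # delete duplicates
    return "".join(ch for ch in collapsed if ch != blank)    # delete blanks

print(beta("HHH_E__LL_LO___"))  # -> "HELLO"
```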
31
4. Connectionist Temporal Classification (CTC)
CTC step 2 of 3: define a mapping β(c) → y. The mapping implies a distribution over all possible transcriptions y. Probability of a specific transcription: P(y|x) = Σ_{c : β(c) = y} P(c|x), i.e. sum up / marginalize over all the different alignments.
32
4. Connectionist Temporal Classification (CTC)
CTC step 2 of 3: define a mapping β(c) → y. The mapping implies a distribution over all possible transcriptions y: P(y|x) = Σ_{c : β(c) = y} P(c|x), summing / marginalizing over all the different alignments. This gives the likelihood function: P(y|x,θ) = L(θ|x,y).
33
4. Connectionist Temporal Classification (CTC)
CTC step 3 of 3: update the network parameters θ to maximize the likelihood of the correct label y*: θ* = argmax_θ Σ_{(x, y*)} log P(y*|x, θ), i.e. maximize the probability of the correct transcription given the audio.
34
CTC training: audio spectrogram → neural network
4. Connectionist Temporal Classification (CTC). CTC training pipeline: audio spectrogram → neural network → output bank of softmax neurons → compute CTC(c, y*) and its gradient → gradient descent (a sketch follows below).
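A hedged PyTorch sketch of this training loop, using torch.nn.CTCLoss as a stand-in for Baidu's warp-ctc; TinySpeechModel below is an illustrative placeholder with made-up sizes, not the Deep Speech 2 architecture:

```python
import torch
import torch.nn as nn

N_SYMBOLS = 28  # blank + 26 letters + space

class TinySpeechModel(nn.Module):
    """Illustrative stand-in for the CNN+RNN stack."""
    def __init__(self, n_features=161, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden)        # input: (T, batch, n_features)
        self.fc = nn.Linear(hidden, N_SYMBOLS)

    def forward(self, spectrograms):
        out, _ = self.rnn(spectrograms)
        return self.fc(out)                          # (T, batch, N_SYMBOLS)

model = TinySpeechModel()
ctc_loss = nn.CTCLoss(blank=0)                       # symbol 0 is the CTC blank
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def train_step(spectrograms, targets, input_lengths, target_lengths):
    # output bank of softmax neurons -> per-frame log-probabilities
    log_probs = model(spectrograms).log_softmax(dim=-1)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()                                  # gradient of CTC(c, y*)
    optimizer.step()                                 # gradient descent update
    return loss.item()
```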
36
4. Connectionist Temporal Classification (CTC)
Decoding. The network outputs P(c|x). How do we get the most likely transcription from P(y|x)?
37
4. Connectionist Temporal Classification (CTC)
Decoding. The network outputs P(c|x). How do we get the most likely transcription from P(y|x)? Simple (approximate) solution: take the most likely symbol at each frame, c*_t = argmax P(c_t|x), and output y ≈ β(c*).
38
4. Connectionist Temporal Classification (CTC)
Decoding. The network outputs P(c|x). Simple (approximate) solution: take the most likely symbol at each frame and output y ≈ β(argmax_c P(c|x)) (sketched below). Optimal solution: find the transcription maximizing P(y|x) = Σ_{c : β(c) = y} P(c|x). Hard problem!
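A minimal sketch of the simple (approximate) decoding: take the per-frame argmax and apply the β collapse. The symbol ordering in ALPHABET is an assumption:

```python
import itertools
import numpy as np

ALPHABET = ["_"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" "]  # 28 symbols, blank first

def greedy_decode(probs: np.ndarray) -> str:
    # probs: (T, 28) matrix of per-frame symbol probabilities P(c_t | x)
    best = [ALPHABET[i] for i in probs.argmax(axis=1)]      # most likely symbol per frame
    collapsed = (ch for ch, _ in itertools.groupby(best))   # merge repeats (β step 1)
    return "".join(ch for ch in collapsed if ch != "_")     # drop blanks (β step 2)
```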
39
Deep Speech 2 Implementation
Tips and Tricks; System Optimizations; Training data used by Baidu; Their results
40
BatchNorm: why? BatchNorm accelerates the training of DNNs
Tips and Tricks: BatchNorm. Why? To efficiently scale the model as the training set is scaled: increasing the depth of the network leads to optimization issues, and BatchNorm accelerates the training of deep networks. Basic formulation: B(x) = γ (x − E[x]) / √(Var[x] + ε) + β.
41
Tips and Tricks: BatchNorm. The special case of bidirectional RNNs. The standard recurrent operation is h_t^l = f(W^l h_t^(l−1) + U^l h_(t−1)^l + b^l).
42
Tips and Tricks: BatchNorm. The special case of bidirectional RNNs, continued. A naive option is to normalize the whole recurrent pre-activation: h_t^l = f(B(W^l h_t^(l−1) + U^l h_(t−1)^l)).
43
Tips and Tricks: BatchNorm. Instead, batch normalization is applied only to the vertical connections (i.e. from one layer to another) and not to the horizontal connections (i.e. within the recurrent layer): h_t^l = f(B(W^l h_t^(l−1)) + U^l h_(t−1)^l). This gives a +12% performance difference for the deepest network (a sketch follows below).
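A rough PyTorch sketch of the idea (an illustration, not the Deep Speech 2 code): batch normalization is applied to the layer-to-layer term W·h_t^(l−1) only, while the recurrent term is left untouched. For brevity this version uses per-step statistics rather than the sequence-wise statistics described in the paper:

```python
import torch
import torch.nn as nn

class BNRecurrentLayer(nn.Module):
    """One forward-direction recurrent layer where BatchNorm acts only on the
    vertical (layer-to-layer) connection, not on the recurrent connection."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size, bias=False)   # vertical
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)  # horizontal
        self.bn = nn.BatchNorm1d(hidden_size)                     # on W·h only

    def forward(self, x):                          # x: (T, batch, input_size)
        T, batch, _ = x.shape
        h = x.new_zeros(batch, self.U.in_features)
        outputs = []
        for t in range(T):
            vertical = self.bn(self.W(x[t]))       # normalize the layer input only
            h = torch.relu(vertical + self.U(h))   # recurrence left unnormalized
            outputs.append(h)
        return torch.stack(outputs)                # (T, batch, hidden_size)
```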
44
Tips and Tricks: SortaGrad. Training on examples of varying length poses some algorithmic challenges. Think of how a child learns: longer examples tend to be more challenging. 1st epoch: iterate through the training set in increasing order of the length of the longest utterance in each minibatch. Other epochs: training reverts to a random order over minibatches (see the sketch below).
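A minimal sketch of the SortaGrad ordering, assuming each example exposes a duration attribute:

```python
import random

def epoch_minibatches(dataset, batch_size, epoch):
    """dataset: list of examples with a `.duration` attribute (assumed)."""
    # 1st epoch: examples sorted by length, so minibatches come in increasing
    # order of their longest utterance; later epochs: random order.
    examples = sorted(dataset, key=lambda ex: ex.duration) if epoch == 0 \
        else random.sample(dataset, len(dataset))
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]
    if epoch > 0:
        random.shuffle(batches)        # random order over minibatches
    return batches
```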
45
Language Model: 5-gram models, parameters tuned on a development set
Tips and Tricks: Language Model. 5-gram language models, with parameters tuned on a development set. Additionally, they used beam search to find the best-scoring transcription (a scoring sketch follows below).
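A hedged sketch of how a beam-search hypothesis might be scored by combining the acoustic model with the 5-gram language model; lm_logprob is a hypothetical helper wrapping the n-gram model, and alpha, beta stand for the development-set-tuned parameters:

```python
def hypothesis_score(acoustic_logprob, transcript, alpha, beta, lm_logprob):
    """Combine acoustic and language-model scores for one hypothesis."""
    word_count = len(transcript.split())
    return (acoustic_logprob
            + alpha * lm_logprob(transcript)   # language-model weight
            + beta * word_count)               # word-count bonus
```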
46
GPU implementation: memory allocation issues lead to parallel SGD
2. System Optimizations. GPU implementation: the main issue is memory allocation, since most of the memory goes to the per-layer activations kept for backpropagation. With 70M parameters, the weights take "only" 280 MB of memory, but the activations for a batch of 64 seven-second utterances take 1.5 GB. This leads to parallel SGD across GPUs, and to also allocating CPU memory that is accessible by the GPU.
47
Data: English, 11,940 hours of labeled speech
3. Training data used by Baidu. English: 11,940 hours of labeled speech; Mandarin: 9,400 hours of labeled speech. Dataset augmentation by adding noise increases the effective size of the training data and improves the robustness of the model to noisy speech (a sketch follows below).
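A simple NumPy sketch of noise augmentation, mixing a noise clip into a clean utterance at a chosen signal-to-noise ratio (the SNR value is illustrative, not taken from the paper):

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise clip into a clean utterance at the given SNR (dB)."""
    noise = np.resize(noise, speech.shape)                 # match lengths
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # scale the noise so that 10*log10(speech_power / noise_power') = snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. augmented = add_noise(clean_utterance, street_noise, snr_db=10)
```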
48
Training: 20 epochs, 9-layer model (2 CNN, 7 RNN) with 68M parameters
4. Their results. Training: 20 epochs; 9-layer model (2 CNN layers, 7 RNN layers) with 68M parameters; gradient max-norm = 400; learning rate = 10^(-4), annealed by 1.2 after each epoch; beam size = 500 (a sketch of this schedule follows below).
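A hedged PyTorch sketch of this schedule (model and train_loader are assumed, e.g. the CTC training sketch above): learning rate 10^(-4) annealed by 1.2 per epoch, gradient norm clipped at 400:

```python
import torch

# `model` maps spectrograms to per-frame symbol scores; `train_loader` yields
# (spectrograms, targets, input_lengths, target_lengths) batches (assumed).
ctc_loss = torch.nn.CTCLoss(blank=0)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(20):                                    # 20 epochs
    for spectrograms, targets, in_lens, tgt_lens in train_loader:
        log_probs = model(spectrograms).log_softmax(dim=-1)
        loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)
        optimizer.zero_grad()
        loss.backward()
        # clip the gradient norm to the slide's maxNorm = 400
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=400)
        optimizer.step()
    for group in optimizer.param_groups:
        group["lr"] /= 1.2                                 # anneal by 1.2 per epoch
```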
49
4. Their results: compared results (figure). Data drives accuracy.
50
Initial research. CTC loss implementations: TensorFlow, Theano, and Warp-CTC in C++ by Baidu (which outperforms the rest). Deep Speech implementations: Theano/Python (DS1) and Torch/Lua (DS2).
51
Datasets. AN4: personal information (names, addresses, telephone numbers, birthdates, etc.); 948 training utterances, 130 test utterances, ~50 minutes of recordings.
52
Datasets. LibriSpeech: dev-clean 5.4h, dev-other 5.3h
Training subsets train-clean-100, train-clean-360 and train-other-500: ~1000h of training speech in total; test-clean and test-other: 10.5h in total.
53
Initial tests and setup
AWS: g2.x8 instance, AN4 dataset, 9 hours of training; moved to a p2 instance, but already out of credits. DTU HPC: setting up Linux dependencies, sorting out CUDA errors, memory errors, etc.; 2 x Nvidia Tesla K80 (48 GB of GPU memory in total).
54
Scaling up LibriSpeech - clean 100h
Large variance in WER depending on batch size: batch size = 75 (max): WER = 58%; batch size = 40: WER = 52%; batch size = 12: WER = 42%. Tradeoff: faster training vs. lower WER. Explanation: noisy gradients.
55
Scaling up LibriSpeech - clean ~1000h (full training dataset)
Audio files: 60 GB, almost 300k utterances; input data to the network: 212 GB; batch size 64; one epoch took ~8h; whole training took over a week; WER ~12% (Baidu got 5.33%).
56
Training on small amounts of data
Dev-clean + dev-other = ~11 hours of speech. Deep learning does not work without big data. Batch size vs. WER: batch size 10: WER 90%; batch size 6: 87.5%; batch size 4: 83.1%; batch size 2: 81.2%.
57
Audio sample (reference --- recognized): I THINK HE WAS PERHAPS MORE APPRECIATIVE THAN I WAS OF THE DISCIPLINE OF THE EDISON CONSTRUCTION DEPARTMENT AND THOUGHT IT WOULD BE WELL FOR US TO WAIT UNTIL THE MORNING OF THE FOURTH BEFORE WE STARTED UP --- i think he was perhaps more appreciative that i was of the discipline of the edison construction department and thought it would be well for us to wait until the morning of the fourth before we started up
58
Audio sample (reference --- recognized): SHE WAS THE MOST AGREEABLE WOMAN IVE EVER KNOWN IN HER POSITION SHE WOULD HAVE BEEN WORTHY OF ANY WHATEVER --- she was the most agreeable woman i have ever known in her position she would have been worth of any whatever
59
Audio sample (reference --- recognized): STEPHANOS DEDALOS --- stephanos der loss
60
Thank you for your attention!
61
Development of the Technology
Filter-bank analysis; Time-normalization; Dynamic programming; Pattern recognition; LPC analysis; Clustering algorithms; Level building; Hidden Markov models; Stochastic language modeling; Finite-state machines; Statistical learning; Concatenative synthesis; Machine learning; Mixed-initiative dialog.