1
Deep Learning for Speech Recognition
Hung-yi Lee
2
Outline Conventional Speech Recognition
How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New acoustic features Convolutional Neural Network (CNN) Applications in Acoustic Signal Processing
3
Conventional Speech Recognition
4
Machine Learning helps
Speech Recognition: input X (speech), output W (word sequence, e.g. 天氣真好, "the weather is really nice"). This is a structured learning problem.
Evaluation function: F(X, W) = P(W|X)
Inference: W* = arg max_W F(X, W) = arg max_W P(W|X) = arg max_W P(X|W) P(W) / P(X) = arg max_W P(X|W) P(W)
P(X|W): Acoustic Model; P(W): Language Model
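The decoding rule W* = arg max_W P(X|W) P(W) can be sketched over a toy candidate list in the log domain; the candidate sentences and scores below are made up purely for illustration:

```python
def decode(candidates, am_loglik, lm_logprob):
    """arg max_W P(X|W) P(W), in log domain: log P(X|W) + log P(W).

    P(X) is dropped because it is the same for every candidate W,
    so it does not affect the arg max."""
    return max(candidates, key=lambda w: am_loglik[w] + lm_logprob[w])

# Toy scores (assumed values, not from any real system)
cands = ["recognize speech", "wreck a nice beach"]
am = {"recognize speech": -10.0, "wreck a nice beach": -9.5}   # log P(X|W)
lm = {"recognize speech": -2.0, "wreck a nice beach": -4.0}    # log P(W)
best = decode(cands, am, lm)
```

Here the language model overrules the slightly better acoustic score of the wrong hypothesis.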
5
Input Representation Audio is represented by a vector sequence X: X:
x1, x2, x3, …: 39-dim Mel-scale Frequency Cepstral Coefficients (MFCC)
6
Input Representation - Splice
Splice: to consider some temporal information, each MFCC frame is concatenated with its neighbouring frames.
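A minimal sketch of splicing, assuming a context of ±4 frames and edge padding by repeating the first/last frame (both are common conventions, not fixed by the slide):

```python
def splice(frames, context=4):
    """Concatenate each frame with its +/- context neighbouring frames.

    Edge frames are padded by repeating the first/last frame
    (one common convention; an assumption here)."""
    T = len(frames)
    spliced = []
    for t in range(T):
        window = []
        for k in range(t - context, t + context + 1):
            k = min(max(k, 0), T - 1)   # clamp indices at the edges
            window.extend(frames[k])
        spliced.append(window)
    return spliced

# 3 frames of 39-dim MFCC -> each spliced frame has (2*4+1)*39 = 351 dims
frames = [[float(t)] * 39 for t in range(3)]
out = splice(frames)
```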
7
Phoneme Phoneme: basic unit
Each word corresponds to a sequence of phonemes, given by a lexicon.
Lexicon: what do you think → hh w aa t / d uw / y uw / th ih ng k
Different words can correspond to the same phonemes.
8
State: each phoneme corresponds to a sequence of states.
what do you think
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: … t-d+uw d-uw+y uw-y+uw y-uw+th …
State: … t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 …
9
Gaussian Mixture Model (GMM)
State: each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM), e.g. P(x|"t-d+uw1") and P(x|"d-uw+y3").
10
Tied-state: each state has a stationary distribution for acoustic features, and some states share the same distribution. Their model parameters are pointers to the same address, e.g. P(x|"t-d+uw1") and P(x|"d-uw+y3") can point to the same GMM.
11
Acoustic Model: W* = arg max_W P(X|W) P(W), with P(X|W) = P(X|S).
W: what do you think → S: a sequence of states s1 s2 s3 s4 s5 …
Assume we also know the alignment s_1 ⋯ s_T.
X: x1 x2 x3 x4 x5 …
P(X|S) = ∏_{t=1}^{T} P(s_t | s_{t-1}) P(x_t | s_t)
(transition × emission)
12
Acoustic Model: W* = arg max_W P(X|W) P(W), with P(X|W) = P(X|S).
W: what do you think; S: s1 s2 s3 s4 s5 …; X: x1 x2 x3 x4 x5 …
Actually, we don't know the alignment, so:
P(X|W) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t | s_{t-1}) P(x_t | s_t)   (Viterbi algorithm)
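The max over state sequences can be found by dynamic programming. A minimal Viterbi sketch in the log domain; the two-state transition and emission numbers are toy values for illustration only:

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state sequence under an HMM.

    obs_loglik[t][s] = log P(x_t | s); log_trans[i][j] = log P(j | i);
    log_init[s] = log P(s_1 = s)."""
    T, S = len(obs_loglik), len(log_init)
    delta = [[0.0] * S for _ in range(T)]   # best log-score ending in state s
    back = [[0] * S for _ in range(T)]      # backpointers
    for s in range(S):
        delta[0][s] = log_init[s] + obs_loglik[0][s]
    for t in range(1, T):
        for j in range(S):
            best_i = max(range(S), key=lambda i: delta[t-1][i] + log_trans[i][j])
            back[t][j] = best_i
            delta[t][j] = delta[t-1][best_i] + log_trans[best_i][j] + obs_loglik[t][j]
    path = [max(range(S), key=lambda s: delta[T-1][s])]
    for t in range(T - 1, 0, -1):           # backtrace
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy 2-state, 3-frame example (assumed numbers)
log = math.log
log_init = [log(0.9), log(0.1)]
log_trans = [[log(0.7), log(0.3)], [log(0.3), log(0.7)]]
obs_loglik = [[log(0.9), log(0.1)], [log(0.8), log(0.2)], [log(0.1), log(0.9)]]
path = viterbi(obs_loglik, log_trans, log_init)
```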
13
How to use Deep Learning?
14
People imagine: a whole utterance of acoustic features goes into the DNN, and "Hello" comes out. This cannot be true! A DNN can only take fixed-length vectors as input.
15
What DNN can do:
DNN input: one acoustic feature x_i
DNN output: the probability of each state, P(a|x_i), P(b|x_i), P(c|x_i), …
Size of output layer = number of states, i.e. senones (tied tri-phone states)
An RNN may be preferred, but it still cannot address the problem: the dynamics should be considered. There are three ways of doing that.
16
Low rank approximation
The output-layer weight matrix W is M × N: N is the size of the last hidden layer, and M is the size of the output layer (the number of states). M can be large if the outputs are the states of tri-phones.
17
Low rank approximation
Approximate W by the product of two smaller matrices: W ≈ U V, with U of size M × K and V of size K × N, where K < M, N. This is equivalent to inserting a linear layer of size K before the output layer, and it needs fewer parameters.
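A sketch of the factorization using truncated SVD, one common way to obtain U and V; the matrix sizes below are toy values:

```python
import numpy as np

# Toy output weight matrix W (M x N), built to have rank K so the
# truncated SVD recovers it exactly.
M, N, K = 6, 4, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((M, K)) @ rng.standard_normal((K, N))

# Keep only the top-K singular directions: W ≈ U V
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :K] * s[:K]          # M x K (singular values folded into U)
V = Vt[:K, :]                  # K x N

params_full = M * N            # parameters of the original layer
params_low = M * K + K * N     # parameters after the factorization
err = np.abs(W - U @ V).max()  # reconstruction error
```

With K small enough, M*K + K*N is well below M*N, which is where the savings come from.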
18
How we use deep learning
There are three ways to use DNN for acoustic modeling: Way 1, Tandem; Way 2, DNN-HMM hybrid; Way 3, End-to-end, in order of increasing effort for exploiting deep learning.
19
How to use Deep Learning?
Way 1: Tandem
20
Way 1: Tandem ("one behind the other") system
DNN input: one acoustic feature x_i; DNN output: P(a|x_i), P(b|x_i), P(c|x_i), … (size of output layer = number of states).
The DNN output is used as a new feature x_i′, the input of your original speech recognition system.
The last hidden layer or a bottleneck layer can also be used as the new feature.
21
How to use Deep Learning?
Way 2: DNN-HMM hybrid
22
Way 2: DNN-HMM Hybrid 𝑊 =𝑎𝑟𝑔 max 𝑊 𝑃 𝑊|𝑋 =𝑎𝑟𝑔 max 𝑊 𝑃 𝑋|𝑊 𝑃 𝑊
From the DNN we get P(s_t | x_t). By Bayes' rule:
P(x_t | s_t) = P(x_t, s_t) / P(s_t) = P(s_t | x_t) P(x_t) / P(s_t)
where P(s_t) is counted from the training data, and P(x_t) can be dropped because it does not depend on the state.
23
Way 2: DNN-HMM Hybrid
P(X|W) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t | s_{t-1}) P(x_t | s_t)
Substituting the DNN output:
P(X|W; θ) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t | s_{t-1}) P(s_t | x_t) / P(s_t)
P(s_t | s_{t-1}): from the original HMM; P(s_t | x_t): from the DNN with parameters θ; P(s_t): counted from the training data.
This assembled vehicle works!
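The conversion from DNN posteriors to scaled likelihoods, P(x_t | s_t) ∝ P(s_t | x_t) / P(s_t), can be sketched in the log domain; the posteriors and priors below are toy numbers:

```python
import math

def scaled_loglik(log_posterior, log_prior):
    """log P(x_t | s) up to a constant: log P(s | x_t) - log P(s).

    P(x_t) is dropped since it is the same for every state, so it does
    not affect the arg max over word sequences."""
    return [lp - lq for lp, lq in zip(log_posterior, log_prior)]

# Toy: DNN posteriors over 3 states; priors counted from training alignments
post = [0.7, 0.2, 0.1]
prior = [0.5, 0.3, 0.2]
ll = scaled_loglik([math.log(p) for p in post],
                   [math.log(q) for q in prior])
```

Dividing by the prior matters: a state that is simply frequent in training gets its posterior discounted accordingly.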
24
Way 2: DNN-HMM Hybrid Sequential Training 𝑊 =𝑎𝑟𝑔 max 𝑊 𝑃 𝑋|𝑊;𝜃 𝑃 𝑊
Given training data (X^1, W^1), (X^2, W^2), ⋯, (X^r, W^r), ⋯
Fine-tune the DNN parameters θ such that
P(X^r | W^r; θ) P(W^r) increases, and
P(X^r | W; θ) P(W) decreases, where W is any word sequence different from W^r.
25
How to use Deep Learning?
Way 3: End-to-end
26
Way 3: End-to-end - Character
Input: acoustic features (spectrograms); output: characters (and space) + null (~). No phoneme and no lexicon, so no OOV problem. As the Deep Speech authors (Baidu Research, Silicon Valley AI Lab) put it: "We do not need a phoneme dictionary, nor even the concept of a 'phoneme'."
CTC: Connectionist Temporal Classification. A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," ICML 2006, pages 369–376.
The combination of bidirectional LSTM and CTC had been applied to character-level speech recognition before (Eyben et al., 2009), but the relatively shallow architecture used in that work did not deliver compelling results (the best character error rate was almost 20%).
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv, 2014.
27
Way 3: End-to-end - Character
(Figure: frame-level network outputs, mostly nulls "~", with spikes at the characters of "HIS FRIEND'S".)
Graves, Alex, and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," ICML 2014.
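The CTC collapsing rule (first merge repeated symbols, then delete nulls) can be sketched as:

```python
def ctc_collapse(frame_labels, blank="~"):
    """Map a frame-level CTC labelling to the output string:
    merge consecutive repeats, then drop the blank/null symbol."""
    out = []
    prev = None
    for c in frame_labels:
        if c != prev:
            if c != blank:
                out.append(c)
            prev = c
    return "".join(out)

# e.g. a toy frame-level output for "HIS", with blanks and repeats
seq = list("~HH~II~~S~")
```

Note that a blank between two identical characters separates them, so "H~H" collapses to "HH" while "HH" collapses to "H".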
28
Way 3: End-to-end – Word?
DNN output layer: one unit per word (apple, bog, cat, …), with ~50k words in the lexicon.
DNN input layer: the acoustic features of a whole word (a fixed input of 200 acoustic-feature frames, padded with zeros); other systems are used to get the word boundaries. How about our homework?
Ref: Bengio, Samy, and Georg Heigold, "Word embeddings for speech recognition," Interspeech 2014.
29
Why Deep Learning? Mainly based on the DNN-HMM hybrid.
30
Deeper is better?
(Figure: word error rate (WER) on SWB, multiple layers vs. 1 hidden layer: deeper is better.)
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.
31
Deeper is better?
(Figure: word error rate (WER) on SWB, multiple layers vs. 1 hidden layer.)
For a fixed number of parameters, a deep model is clearly better than a shallow one.
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.
32
What does DNN do?
(Figure: acoustic features (MFCC) of different speakers at the input vs. at the 1st hidden layer.)
Speaker normalization is automatically done in the DNN; the lower layers act as speaker adaptation. This fits with human speech recognition, which appears to use many layers of feature extractors and event detectors.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.
33
What does DNN do?
(Figure: acoustic features (MFCC) of different speakers at the input vs. at the 8-th hidden layer.)
Speaker normalization is automatically done in the DNN; the lower layers act as speaker adaptation.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.
34
What does DNN do? In ordinary acoustic models, all the states are modeled independently. This is not an effective way to model the human voice: the sound of a vowel is only controlled by a few factors (e.g. high/low and front/back tongue position).
35
What does DNN do?
(Figure: the vowels /i/ /u/ /e/ /o/ /a/ arranged by tongue height (high/low) and position (front/back); the output of a hidden layer, reduced to two dimensions, shows the same arrangement.)
The lower layers detect the manner of articulation (this is actually the bottleneck (BN) layer; for a rounded vowel, for example, the lips form a circular opening). All the states share the results from the same set of detectors, so the parameters are used effectively.
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz, "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR," Interspeech 2014.
36
Speaker Adaptation
37
Speaker adaptation: use different models to recognize the speech of different speakers. Collect the audio data of each speaker and build a DNN model for each speaker.
Challenge: limited data for training. There is not enough data to directly train a DNN model, and not enough data even to just fine-tune a speaker-independent DNN model.
Supervised adaptation: with labels; unsupervised adaptation: without labels.
38
Categories of Methods
Conservative training: re-train the whole DNN with some constraints.
Transformation methods: only train the parameters of one layer.
Speaker-aware training: do not really change the DNN parameters at all.
(From top to bottom, the methods need less and less training data.)
39
Conservative Training
Initialize from a speaker-independent DNN trained on audio data of many speakers, then fine-tune with a little data from the target speaker. Just fine-tuning cannot work; an extra constraint is needed: keep the outputs (or the parameters) of the adapted DNN close to those of the original one. (To evaluate how the size of the adaptation set affects the result, the number of adaptation utterances can be varied, e.g. from 5 (32 s) to 200 (22 min).)
40
Transformation methods
Add an extra layer: insert an extra linear layer Wa between layer i and layer i+1, fix all the other parameters, and train only Wa on a little data from the target speaker. (Where should the extra layer be added? There is no clear answer.)
41
Transformation methods
Add the extra layer between the input and the first layer. With splicing, the same Wa can be applied to each frame of the spliced input (weight sharing), keeping Wa small. Use a larger Wa when there is more adaptation data, and a smaller Wa when there is less.
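A sketch of the idea: the extra layer Wa is initialised to the identity, so before any adaptation the network behaves exactly like the speaker-independent one; only Wa is then trained. The identity initialisation is a common convention, assumed here:

```python
def make_adaptation_layer(dim):
    """Extra linear layer Wa, initialised to the identity matrix so
    adaptation starts from the unchanged speaker-independent network."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def apply_layer(Wa, x):
    """y = Wa x (plain matrix-vector product, no bias)."""
    return [sum(Wa[i][j] * x[j] for j in range(len(x))) for i in range(len(Wa))]

Wa = make_adaptation_layer(3)
y = apply_layer(Wa, [0.5, -1.0, 2.0])   # identity: output equals input
```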
42
Transformation methods
SVD bottleneck adaptation: with the low-rank factorization W ≈ U V (a linear bottleneck of size K before the output layer, U of size M × K and V of size K × N), insert and train only a small K × K matrix Wa at the bottleneck; K is usually small.
43
Speaker-aware Training
Extract a fixed-length, low-dimensional vector of speaker information from each speaker's data (speaker 1, speaker 2, speaker 3, …), even from lots of mismatched data; text transcription is not needed for extracting the vectors. The i-vector was originally used in speaker identification. The same idea can also give noise-aware or device-aware training.
44
Speaker-aware Training
Training data: the acoustic features of each training speaker are appended with speaker-information features (e.g. an i-vector, originally used in speaker identification).
Testing data: the test speaker's acoustic features are appended in the same way.
All the speakers use the same DNN model; different speakers are simply augmented with different features.
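A minimal sketch of the feature augmentation, with a toy 3-dimensional stand-in for the i-vector (real i-vectors are much higher-dimensional):

```python
def augment(acoustic_frames, speaker_vector):
    """Append the same fixed-length speaker vector to every frame.
    All speakers share one DNN; only the appended vector differs."""
    return [frame + speaker_vector for frame in acoustic_frames]

frames = [[0.1, 0.2], [0.3, 0.4]]   # toy 2-dim acoustic features
ivec = [1.0, -1.0, 0.5]             # toy 3-dim "i-vector"
aug = augment(frames, ivec)
```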
45
Multi-task Learning
46
Multitask Learning: the multi-layer structure makes DNN suitable for multitask learning. Two variants: (1) task A and task B share the lower layers, with one common input feature; (2) task A and task B share the upper layers, with separate input features for task A and for task B.
47
Multitask Learning - Multilingual
The states of French, German, Spanish, Italian, Mandarin, … are predicted by separate output layers on top of hidden layers shared across languages, all from the same acoustic features: human languages share some common characteristics.
48
Multitask Learning - Multilingual
(Figure: Mandarin character error rate vs. hours of Mandarin training data; training with European languages in addition beats Mandarin only.)
Huang, Jui-Ting, et al., "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," ICASSP 2013.
49
Multitask Learning - Different units
Train two tasks A and B on the same acoustic features, e.g. A = state, B = phoneme; A = state, B = gender; A = state, B = grapheme (character).
Dongpeng Chen, B. Mak, Cheung-Chi Leung, S. Sivadas, "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," ICASSP 2014.
50
Deep Learning for Acoustic Modeling
New acoustic features
51
MFCC: Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN.
The DNN is so strong that parts of this cascade can be dropped.
52
Filter-bank output: Waveform → DFT → spectrogram → filter bank → log → input of DNN (no DCT).
Kind of standard now.
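The pipeline above can be sketched as follows; the random non-negative matrix here is only a stand-in for the real triangular mel-spaced filters, just to show the shape of the computation:

```python
import numpy as np

def log_fbank(frame, n_filters=4, rng_seed=0):
    """Waveform frame -> |DFT| -> filter bank -> log.

    A random non-negative matrix stands in for the mel filter bank
    (an assumption for illustration; real systems use triangular
    mel-spaced filters)."""
    spec = np.abs(np.fft.rfft(frame))            # magnitude spectrum
    rng = np.random.default_rng(rng_seed)
    fbank = rng.random((n_filters, spec.size))   # stand-in filters
    return np.log(fbank @ spec + 1e-10)          # log filter-bank energies

frame = np.sin(2 * np.pi * np.arange(64) / 8.0)  # toy 64-sample frame
feats = log_fbank(frame)
```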
53
Spectrogram: Waveform → DFT → spectrogram → input of DNN. Common today; about 5% relative improvement over filter-bank output.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU 2013.
54
Spectrogram
55
Waveform? Input of DNN: the raw waveform. If this succeeds, there is no need for Signals & Systems!
People have tried, but it is not better than the spectrogram yet, so you still need to take Signals & Systems…
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH 2014.
56
Waveform?
57
Convolutional Neural Network (CNN)
58
CNN: Speech can be treated as images. The spectrogram, with a frequency axis and a time axis, is also called a spectral waterfall, voiceprint, or voicegram. People can even do spectrogram reading, so why not let a network look at the spectrogram as an image?
59
CNN: replace the DNN by a CNN. The spectrogram image goes into the CNN, which outputs the probabilities of the states. Why does it work?
60
CNN with maxout: max pooling over groups of units, e.g. taking max(a1, b1) and max(a2, b2).
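The maxout/max-pooling step over groups of two units can be sketched as:

```python
def maxout(z, group_size=2):
    """Maxout activation: partition the units into groups of
    group_size and keep only the maximum of each group."""
    return [max(z[i:i + group_size]) for i in range(0, len(z), group_size)]

z = [0.3, -1.2, 4.0, 2.5]   # a1, b1, a2, b2
h = maxout(z)               # -> [0.3, 4.0]
```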
61
Tóth, László, "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech 2014 (network in network; MTA-SZTE Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged, Hungary). These are state-of-the-art results.
62
Applications in Acoustic Signal Processing
Deep Learning for Regression
63
DNN for Speech Enhancement
Noisy speech → DNN → clean speech, for mobile communication or speech recognition.
Demo for speech enhancement:
64
DNN for Voice Conversion
Female → DNN → Male.
Demo for Voice Conversion:
65
Concluding Remarks (and probably future directions?)
66
Concluding Remarks Conventional Speech Recognition
How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New acoustic features Convolutional Neural Network (CNN) Applications in Acoustic Signal Processing
67
Thank you for your attention!
68
More Research related to Speech
Speech Recognition ("你好", i.e. "Hello"; the core techniques), Spoken Content Retrieval (e.g. find the lectures related to "deep learning" in lecture recordings), Speech Summarization (summaries of recordings), Computer Assisted Language Learning, Dialogue (e.g. "I would like to leave Taipei on November 2nd", "Hi" / "Hello"), Information Extraction.