1
Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng, Stanford University
2
I'm going to play a video of a person speaking. Watch the video carefully and note what you hear, and then I'd like you to close your eyes and just listen to the audio.
What happened for most of you is that when you watched the video, you perceived the person saying /da/. Conversely, when you only listened to the clip, you probably heard /ba/. This is known as the McGurk effect, and it shows that speech perception involves a complex integration of video and audio signals in the brain. In particular, the video gave us information about the place of articulation and the mouth motions, and that changed how we perceived the sound.
3
McGurk Effect
4
Audio-Visual Speech Recognition
In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a short speech segment with video of a person saying letters, can we determine which letter was said? We have images of the speaker's lips and the audio signal; how do we integrate these two sources of data?
5
Feature Challenge [Diagram: inputs → features → classifier (e.g., SVM)]
So how do we solve this problem? A common machine learning pipeline goes like this: we take the inputs, extract some features, and then feed them into our standard ML toolbox, e.g., a classifier such as an SVM. The hardest part is really the features: how we represent the audio and video data for use in the classifier. For audio, the speech community has developed many features, such as MFCCs, that work really well; it is much less obvious what features we should use for the lips.
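To make this pipeline concrete, here is a minimal sketch in Python with scikit-learn. It is only an illustration: the clip format, the two pooling-based feature extractors, and the synthetic data are placeholders, not the features or setup used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_audio_features(clip):
    # Placeholder: e.g., MFCC/spectrogram statistics pooled over the clip.
    return clip["audio"].mean(axis=0)

def extract_video_features(clip):
    # Placeholder: e.g., pooled lip-region pixels or learned lip features.
    return clip["video"].mean(axis=0)

def featurize(clips):
    # One fixed-length feature vector per clip: audio and video features side by side.
    return np.stack([np.concatenate([extract_audio_features(c),
                                     extract_video_features(c)]) for c in clips])

# Synthetic stand-in data: 40 clips, 26 letter classes.
rng = np.random.RandomState(0)
clips = [{"audio": rng.randn(20, 39),          # 20 frames of 39-dim audio features
          "video": rng.randn(10, 60 * 80)}     # 10 frames of 60x80 lip pixels
         for _ in range(40)]
labels = rng.randint(0, 26, size=40)

clf = LinearSVC(max_iter=5000).fit(featurize(clips), labels)
print("training accuracy:", clf.score(featurize(clips), labels))
```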
6
Representing Lips: Can we learn better representations for audio/visual speech recognition? How can multimodal data (multiple sources of input) be used to find better features?
So what do state-of-the-art features look like? Engineering these features took a long time. To address this, we ask two questions in this work. [click] Furthermore, what is interesting about this problem is the deep question: audio and video features are only related at a deep level.
7
Unsupervised Feature Learning
Concretely, our task is to convert a sequence of lip images into a vector of numbers (e.g., [5, 1.1, ..., 10]), and similarly for the audio (e.g., [9, 1.67, ..., 3]).
8
Unsupervised Feature Learning
Now that we have multimodal data, one easy option is to simply concatenate the two feature vectors. However, simply concatenating the features like this fails to model the interactions between the modalities. This is a very limited view of multimodal features; instead, what we would like to do [click] is to
9
Multimodal Features: find better ways to relate the audio and visual inputs, and obtain features that arise out of relating them together.
10
Cross-Modality Feature Learning
Next, I'm going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you have no audio at test time? (Lip reading on its own is not a well-defined problem.) But there are more settings to consider: if our task is only lip reading, i.e., visual speech recognition, an interesting question to ask is whether we can improve our lip-reading features if we had audio data during training.
11
Feature Learning Models
12
Feature Learning with Autoencoders
[Diagram: a separate autoencoder for each modality: Audio Input → Audio Reconstruction, and Video Input → Video Reconstruction.]
13
Bimodal Autoencoder [Diagram: Audio Input and Video Input feed a shared Hidden Representation, which produces both an Audio Reconstruction and a Video Reconstruction.]
Let's step back and take a similar but related approach to the problem: what if we learn a single autoencoder over both modalities?
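A minimal PyTorch sketch of a bimodal autoencoder of this form, assuming placeholder input and hidden sizes (an illustration of the idea, not the paper's implementation): a single shared hidden layer sees the concatenated audio and video inputs and must reconstruct both modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, VIDEO_DIM, HIDDEN_DIM = 100, 4800, 512   # placeholder sizes

class ShallowBimodalAE(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared hidden layer over the concatenated inputs.
        self.encoder = nn.Sequential(nn.Linear(AUDIO_DIM + VIDEO_DIM, HIDDEN_DIM),
                                     nn.Sigmoid())
        self.decode_audio = nn.Linear(HIDDEN_DIM, AUDIO_DIM)
        self.decode_video = nn.Linear(HIDDEN_DIM, VIDEO_DIM)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.decode_audio(h), self.decode_video(h)

model = ShallowBimodalAE()
audio = torch.randn(8, AUDIO_DIM)    # stand-in for spectrogram features
video = torch.randn(8, VIDEO_DIM)    # stand-in for flattened lip frames
audio_hat, video_hat = model(audio, video)
loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
print(loss.item())
```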
14
Bimodal Autoencoder [Diagram: the same bimodal autoencoder as the previous slide.]
But this still has a problem, as we will see next.
15
Shallow Learning: mostly unimodal features learned. [Diagram: the learned hidden units and their connectivity to the Video Input and Audio Input.]
So there are different versions of these shallow models, and if you train a model of this form, this is what you usually get: if you look at the hidden units, it turns out that most of them respond to only one modality, so the model learns mostly unimodal features; the figure shows the connectivity. So why doesn't this work? We think there are two possible reasons. First, the model has no incentive to relate the modalities. Second, we are trying to relate pixel values directly to values in the audio spectrogram, and this is really difficult; for example, we do not expect a change in a single pixel value to tell us how the audio pitch is changing. Instead, what we expect is for mid-level video features, such as mouth motions, to inform us about the audio content. Thus, the relations across the modalities are deep, and we really need a deep model to capture them. To review: 1) no incentive, and 2) the relations are deep.
16
Bimodal Autoencoder [Diagram: the bimodal autoencoder again.]
But wait: now we can do something interesting.
17
Bimodal Autoencoder: Cross-modality Learning
[Diagram: only the Video Input is fed into the Hidden Representation, which must still produce both the Audio Reconstruction and the Video Reconstruction.] This model is trained on clips that have both audio and video, but only the video is given as input. Cross-modality learning: learn better video features by using audio as a cue.
18
Cross-modality Deep Autoencoder
[Diagram: Video Input → mid-level layers → Learned Representation → Audio Reconstruction and Video Reconstruction.] However, the connections between audio and video are (arguably) deep rather than shallow, so ideally we want to extract mid-level features before trying to connect the modalities. Since audio is really good for speech recognition, the model is going to learn representations that can reconstruct the audio, and thus hopefully be good for speech recognition as well.
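A minimal PyTorch sketch of the cross-modality deep autoencoder idea, with placeholder layer sizes and depths: only video is given as input, but the network must reconstruct both audio and video, so the learned video representation is pushed to carry audio-relevant information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, VIDEO_DIM, MID_DIM, REP_DIM = 100, 4800, 1024, 256   # placeholder sizes

class CrossModalityDeepAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Sequential(
            nn.Linear(VIDEO_DIM, MID_DIM), nn.Sigmoid(),   # mid-level video features
            nn.Linear(MID_DIM, REP_DIM), nn.Sigmoid())     # learned representation
        self.audio_decoder = nn.Sequential(
            nn.Linear(REP_DIM, MID_DIM), nn.Sigmoid(), nn.Linear(MID_DIM, AUDIO_DIM))
        self.video_decoder = nn.Sequential(
            nn.Linear(REP_DIM, MID_DIM), nn.Sigmoid(), nn.Linear(MID_DIM, VIDEO_DIM))

    def forward(self, video):
        rep = self.video_encoder(video)                    # video is the only input
        return rep, self.audio_decoder(rep), self.video_decoder(rep)

model = CrossModalityDeepAE()
video = torch.randn(8, VIDEO_DIM)
audio_target = torch.randn(8, AUDIO_DIM)   # paired audio, used only as a reconstruction target
rep, audio_hat, video_hat = model(video)
loss = F.mse_loss(audio_hat, audio_target) + F.mse_loss(video_hat, video)
```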
19
Cross-modality Deep Autoencoder
[Diagram: Audio Input → Learned Representation → Audio Reconstruction and Video Reconstruction.] But what we would like is not to have to train many separate versions of this model; it turns out that you can unify the separate models.
20
Bimodal Deep Autoencoders
[Diagram: Audio Input ("phonemes") and Video Input ("visemes", i.e., mouth shapes) feed a Shared Representation, which produces both an Audio Reconstruction and a Video Reconstruction.] [pause] The second model we present is the bimodal deep autoencoder. What we want this bimodal deep autoencoder to do is learn representations that relate the audio and the video data. Concretely, we want it to learn representations that are robust to the input modality.
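A minimal PyTorch sketch of the bimodal deep autoencoder, again with placeholder sizes (an illustration, not the paper's architecture): each modality is first encoded into mid-level features, the two codes are fused into a shared representation, and that shared representation must reconstruct both modalities.

```python
import torch
import torch.nn as nn

AUDIO_DIM, VIDEO_DIM, MID_DIM, SHARED_DIM = 100, 4800, 512, 256   # placeholder sizes

class BimodalDeepAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Modality-specific encoders to mid-level features.
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, MID_DIM), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(VIDEO_DIM, MID_DIM), nn.Sigmoid())
        # Shared representation over both mid-level codes.
        self.shared = nn.Sequential(nn.Linear(2 * MID_DIM, SHARED_DIM), nn.Sigmoid())
        # Decoders back to each modality.
        self.audio_dec = nn.Sequential(nn.Linear(SHARED_DIM, MID_DIM), nn.Sigmoid(),
                                       nn.Linear(MID_DIM, AUDIO_DIM))
        self.video_dec = nn.Sequential(nn.Linear(SHARED_DIM, MID_DIM), nn.Sigmoid(),
                                       nn.Linear(MID_DIM, VIDEO_DIM))

    def forward(self, audio, video):
        mid = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1)
        shared = self.shared(mid)
        return shared, self.audio_dec(shared), self.video_dec(shared)
```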
21
Bimodal Deep Autoencoders
[Diagram: Video Input only ("visemes", i.e., mouth shapes) → Audio Reconstruction and Video Reconstruction.]
22
Bimodal Deep Autoencoders
[Diagram: Audio Input only ("phonemes") → Audio Reconstruction and Video Reconstruction.]
23
Bimodal Deep Autoencoders
[Diagram: Audio Input ("phonemes") and Video Input ("visemes", i.e., mouth shapes) → Shared Representation → Audio Reconstruction and Video Reconstruction.]
24
Training Bimodal Deep Autoencoder
[Diagram: three training configurations of the same network: (1) Audio Input and Video Input → Shared Representation → both reconstructions; (2) Audio Input only → Shared Representation → both reconstructions; (3) Video Input only → Shared Representation → both reconstructions.] Train a single model to perform all 3 tasks. Similar in spirit to denoising autoencoders (Vincent et al., 2008).
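A sketch of this three-task training scheme, in the spirit of denoising autoencoders: for each batch one of the three input configurations is chosen at random, the hidden modality is zeroed out, and the targets are always the clean audio and video. It assumes a model like the BimodalDeepAE sketched above; the masking-by-zeroing and the optimizer are illustrative choices, not necessarily the paper's.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, optimizer, audio, video):
    # Pick one of the three tasks: both modalities, audio only, or video only.
    task = random.choice(["both", "audio_only", "video_only"])
    audio_in = audio if task != "video_only" else torch.zeros_like(audio)
    video_in = video if task != "audio_only" else torch.zeros_like(video)

    _, audio_hat, video_hat = model(audio_in, video_in)
    # Targets are always the clean audio and video, whichever input was hidden.
    loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with the model sketched earlier (illustrative):
# model = BimodalDeepAE()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = training_step(model, optimizer, audio_batch, video_batch)
```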
25
Evaluations
26
Visualizations of Learned Features
[Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.] The features correspond to mouth motions and are paired up with the audio spectrogram. The features are generic and are not speaker specific.
27
Lip-reading with AVLetters
26-way letter classification, 10 speakers, 60x80-pixel lip regions. Cross-modality learning. [Diagram: Video Input → Learned Representation → Audio Reconstruction and Video Reconstruction.] Protocol: feature learning on audio + video; supervised learning on video; testing on video.
28
Lip-reading with AVLetters
Feature Representation (Classification Accuracy):
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
29
Lip-reading with AVLetters
Feature Representation (Classification Accuracy):
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single Modality Learning): 54.2%
30
Lip-reading with AVLetters
Feature Representation (Classification Accuracy):
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single Modality Learning): 54.2%
Our Features (Cross-Modality Learning): 64.4%
31
Lip-reading with CUAVE
10-way digit classification, 36 speakers. Cross-modality learning. [Diagram: Video Input → Learned Representation → Audio Reconstruction and Video Reconstruction.] Protocol: feature learning on audio + video; supervised learning on video; testing on video.
32
Lip-reading with CUAVE
Feature Representation (Classification Accuracy):
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%
33
Lip-reading with CUAVE
Feature Representation (Classification Accuracy):
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%
34
Lip-reading with CUAVE
Feature Representation (Classification Accuracy):
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009): 64.0%
Visemic AAM (Papandreou et al., 2009): 83.0%
35
Multimodal Recognition
[Diagram: Audio Input and Video Input → Shared Representation → Audio Reconstruction and Video Reconstruction.] CUAVE: 10-way digit classification, 36 speakers. We evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Protocol: audio + video used for feature learning, supervised learning, and testing.
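The noisy-audio results on the next slides are reported at 0 dB SNR. As an illustration of what that means, here is a small sketch that mixes additive noise into a speech signal at a requested SNR; the use of white noise here is an assumption, since the slides do not say which noise was used.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then mix."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.RandomState(0)
clean = rng.randn(16000)                                   # stand-in for 1 s of 16 kHz speech
noisy = mix_at_snr(clean, rng.randn(16000), snr_db=0.0)    # 0 dB: speech and noise have equal power
```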
36
Multimodal Recognition
Feature Representation (Classification Accuracy, Noisy Audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
37
Multimodal Recognition
Feature Representation (Classification Accuracy, Noisy Audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%
38
Multimodal Recognition
Feature Representation (Classification Accuracy, Noisy Audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%
Bimodal Deep Autoencoder + Audio Features (RBM): 82.2%
39
Shared Representation Evaluation
Protocol: feature learning with audio + video; supervised learning on audio; testing on video. [Diagram: a linear classifier is trained on the Shared Representation computed from the audio, and at test time is applied to the Shared Representation computed from the video.]
40
Shared Representation Evaluation
Method: Learned Features + Canonical Correlation Analysis.
Feature learning: audio + video; supervised learning: audio; testing: video. Accuracy: 57.3% / 91.7%.
[Diagram: as on the previous slide, the linear classifier is trained on the audio-derived shared representation and tested on the video-derived one.] Explain in phases!
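A sketch of one plausible reading of this evaluation, using scikit-learn's CCA: pair the learned audio and video representations with CCA, train a linear classifier on the audio-side projections, and test it on the video-side projections of held-out clips. The function name, the number of components, and the choice of logistic regression as the linear classifier are assumptions, not details from the slides.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

def cross_modal_accuracy(H_audio_tr, H_video_tr, y_tr, H_video_te, y_te, n_components=20):
    # Pair up the two learned representations with CCA on training data.
    cca = CCA(n_components=n_components).fit(H_audio_tr, H_video_tr)
    audio_scores_tr, _ = cca.transform(H_audio_tr, H_video_tr)

    # Train a linear classifier on the audio-side projections...
    clf = LogisticRegression(max_iter=2000).fit(audio_scores_tr, y_tr)

    # ...and test it on the video-side projections of held-out clips.
    # transform() always expects an X argument; the returned video scores
    # depend only on the video features, so zeros are a harmless filler.
    dummy_audio = np.zeros((len(H_video_te), H_audio_tr.shape[1]))
    _, video_scores_te = cca.transform(dummy_audio, H_video_te)
    return clf.score(video_scores_te, y_te)
```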
41
McGurk Effect: a visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Model predictions (/ga/, /ba/, /da/):
Audio /ga/ + visual /ga/: 82.6%, 2.2%, 15.2%
Audio /ba/ + visual /ba/: 4.4%, 89.1%, 6.5%
Explain in phases.
42
McGurk Effect: a visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Model predictions (/ga/, /ba/, /da/):
Audio /ga/ + visual /ga/: 82.6%, 2.2%, 15.2%
Audio /ba/ + visual /ba/: 4.4%, 89.1%, 6.5%
Audio /ba/ + visual /ga/: 28.3%, 13.0%, 58.7%
Explain in phases.
43
Conclusion: We applied deep autoencoders to discover features in multimodal data. Cross-modality learning: we obtained better video features (for lip reading) by using audio as a cue. Multimodal feature learning: we learned representations that relate audio and video data. [Diagrams: the cross-modality deep autoencoder (Video Input → Learned Representation → Audio and Video Reconstructions) and the bimodal deep autoencoder (Audio and Video Inputs → Shared Representation → Audio and Video Reconstructions).]
46
Bimodal Learning with RBMs
[Diagram: a layer of Hidden Units connected to both the Audio Input and the Video Input.] One simple approach is to concatenate the inputs; now each hidden unit sees both the audio and visual inputs simultaneously. So we tried this, and let's see what we get.
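A minimal sketch of this concatenated-input model using scikit-learn's BernoulliRBM as a stand-in for the RBM training used in the paper; the feature dimensions and hyperparameters are placeholders, and the inputs are assumed to be scaled to [0, 1].

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
audio = rng.rand(500, 100)     # placeholder audio features, scaled to [0, 1]
video = rng.rand(500, 1200)    # placeholder flattened lip frames, scaled to [0, 1]

# Concatenate the modalities, so every hidden unit connects to both.
X = np.hstack([audio, video])
rbm = BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10, random_state=0)
hidden = rbm.fit_transform(X)  # hidden-unit activations used as features
print(hidden.shape)            # (500, 256)
```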