1
Meta Learning (Part 2): Gradient Descent as LSTM
Hung-yi Lee
2
Can we learn more than initialization parameters?
Gradient descent as a recurrent process: initialize $\theta^0$, compute the gradient on the training data, and update $\theta^0 \rightarrow \theta^1 \rightarrow \theta^2 \rightarrow \cdots$ with learning rate $\eta$. The network structure is fixed; the learning algorithm is a function $F$ with its own parameters $\phi$ (initialization, learning rate, ...). The learning algorithm looks like an RNN.
3
Recurrent Neural Network
Given a function $f$: $h', y = f(h, x)$, where $h$ and $h'$ are vectors with the same dimension. Unrolled over a sequence: $h^0$ and $x^1$ give $h^1, y^1$; $h^1$ and $x^2$ give $h^2, y^2$; and so on. No matter how long the input/output sequence is, we only need one function $f$.
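To make the picture concrete, here is a tiny sketch of an RNN as a single function $f$ applied step by step; the linear form of $f$ and all dimensions are made up for illustration.

```python
import numpy as np

# A made-up f: h', y = f(h, x), with h and h' of the same dimension.
rng = np.random.default_rng(0)
Wh, Wx, Wy = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

def f(h, x):
    h_new = np.tanh(Wh @ h + Wx @ x)   # next hidden state h'
    y = Wy @ h_new                     # output at this step
    return h_new, y

h = np.zeros(4)                                   # h^0
for x in [rng.normal(size=3) for _ in range(5)]:  # any sequence length, same single f
    h, y = f(h, x)
```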
4
LSTM
A naïve RNN cell maps $h^{t-1}, x^t$ to $h^t$ (and output $y^t$). An LSTM cell additionally carries a cell state: it maps $c^{t-1}, h^{t-1}, x^t$ to $c^t, h^t$ (and $y^t$).
$c$ changes slowly: $c^t$ is $c^{t-1}$ plus something. $h$ changes faster: $h^t$ and $h^{t-1}$ can be very different.
5
Review: LSTM
From the input $x^t$ and the previous hidden state $h^{t-1}$, four vectors are computed:
$z = \tanh(W \cdot [x^t, h^{t-1}])$
$z^i = \sigma(W^i \cdot [x^t, h^{t-1}])$ (input gate)
$z^f = \sigma(W^f \cdot [x^t, h^{t-1}])$ (forget gate)
$z^o = \sigma(W^o \cdot [x^t, h^{t-1}])$ (output gate)
6
Review: LSTM
$c^t = z^f \odot c^{t-1} + z^i \odot z$
$h^t = z^o \odot \tanh(c^t)$
$y^t = \sigma(W' h^t)$
($\odot$ is element-wise multiplication; $c$ is a vector.)
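A minimal NumPy sketch of one LSTM step following exactly these equations (biases are omitted and the weight shapes are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(x, h_prev, c_prev, W, Wi, Wf, Wo, Wy):
    """One LSTM step; all weight matrices act on the concatenation [x, h_prev]."""
    xh = np.concatenate([x, h_prev])
    z  = np.tanh(W  @ xh)   # block input
    zi = sigmoid(Wi @ xh)   # input gate
    zf = sigmoid(Wf @ xh)   # forget gate
    zo = sigmoid(Wo @ xh)   # output gate
    c  = zf * c_prev + zi * z    # c^t = z^f . c^{t-1} + z^i . z
    h  = zo * np.tanh(c)         # h^t = z^o . tanh(c^t)
    y  = sigmoid(Wy @ h)         # y^t = sigma(W' h^t)
    return y, h, c

# tiny usage example with random weights (x in R^3, h and c in R^4, y in R^2)
rng = np.random.default_rng(0)
W, Wi, Wf, Wo = (rng.normal(size=(4, 7)) for _ in range(4))
Wy = rng.normal(size=(2, 4))
y, h, c = lstm_cell(rng.normal(size=3), np.zeros(4), np.zeros(4), W, Wi, Wf, Wo, Wy)
```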
7
Review: LSTM
The same cell is unrolled over time: at each step, $c^{t-1}$ and $h^{t-1}$ enter together with $x^t$, the gates $z^f, z^i, z^o$ and the input $z$ are computed, and the cell produces $c^t$, $h^t$, $y^t$, which feed into step $t+1$.
8
Similar to gradient descent based algorithm
Gradient descent: $\theta^t = \theta^{t-1} - \eta \nabla_\theta l$. Compare with the LSTM cell update $c^t = z^f \odot c^{t-1} + z^i \odot z$, $h^t = z^o \odot \tanh(c^t)$, $y^t = \sigma(W' h^t)$: replace the cell state $c$ with the parameters $\theta$, feed in $z = -\nabla_\theta l$, set the forget gate $z^f$ to all "1" and the input gate $z^i$ to all $\eta$. Then the LSTM update is exactly one step of gradient descent.
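A quick numerical check of this correspondence, with made-up numbers: the cell update with $z^f$ fixed to 1, $z^i$ fixed to $\eta$ and $z = -\nabla_\theta l$ reproduces the gradient descent step.

```python
import numpy as np

# Gradient descent as a special case of the LSTM cell update
# c^t = z^f . c^{t-1} + z^i . z   with   c <-> theta, z = -grad, z^f = 1, z^i = eta
theta_prev = np.array([0.5, -1.2, 2.0])
grad       = np.array([0.1,  0.3, -0.4])   # stand-in for grad_theta l
eta = 0.01

theta_gd = theta_prev - eta * grad          # ordinary gradient descent step

z_f = np.ones_like(theta_prev)              # forget gate: all 1
z_i = eta * np.ones_like(theta_prev)        # input gate: all eta
z   = -grad                                 # cell input: -grad
theta_lstm = z_f * theta_prev + z_i * z     # the same step as an LSTM cell update

assert np.allclose(theta_gd, theta_lstm)
```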
9
Something like regularization
Write gradient descent $\theta^t = \theta^{t-1} - \eta \nabla_\theta l$ as the cell update $\theta^t = z^f \odot \theta^{t-1} + z^i \odot (-\nabla_\theta l)$. If the forget gate $z^f$ is not fixed to all "1", shrinking $\theta^{t-1}$ acts like regularization (weight decay). If the input gate $z^i$ is not fixed to all $\eta$, it acts as a dynamic learning rate. What if the machine learns to determine $z^f$ and $z^i$ from $-\nabla_\theta l$ and other information?
10
LSTM for Gradient Descent
Typical LSTM: $c^t = z^f \odot c^{t-1} + z^i \odot z$. LSTM for gradient descent: $\theta^t = z^f \odot \theta^{t-1} + z^i \odot (-\nabla_\theta l)$, where $z^f$ and $z^i$ are now learnable. Unroll the "LSTM" for, e.g., 3 training steps: at each step a batch from the training set gives $-\nabla_\theta l$, and the cell updates $\theta^0 \rightarrow \theta^1 \rightarrow \theta^2 \rightarrow \theta^3$. Learn to minimize the loss $l(\theta^3)$ on the testing data. (Independence assumption: when backpropagating through the unrolled steps, $-\nabla_\theta l$ is treated as independent of $\theta$.)
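A toy sketch of this idea, assuming a linear regression model; the names gate_net, unrolled_training and train_batches are illustrative, and a small linear network stands in for the paper's LSTM that produces $z^f$ and $z^i$.

```python
import torch

torch.manual_seed(0)
dim = 5
gate_net = torch.nn.Linear(1, 2)   # per coordinate: gradient entry -> (z^f, z^i) logits

def unrolled_training(theta0, train_batches):
    theta = theta0
    for x, y in train_batches:                                # 3 training steps
        loss = ((x @ theta - y) ** 2).mean()
        grad = torch.autograd.grad(loss, theta)[0].detach()   # independence assumption
        gates = gate_net(grad.unsqueeze(-1))
        z_f, z_i = torch.sigmoid(gates[:, 0]), torch.sigmoid(gates[:, 1])
        theta = z_f * theta + z_i * (-grad)                   # learned "LSTM" update
    return theta

theta0 = torch.zeros(dim, requires_grad=True)        # the initialization can be learned too
train_batches = [(torch.randn(8, dim), torch.randn(8)) for _ in range(3)]
test_x, test_y = torch.randn(8, dim), torch.randn(8)

meta_opt = torch.optim.Adam(list(gate_net.parameters()) + [theta0], lr=1e-2)
meta_opt.zero_grad()
theta3 = unrolled_training(theta0, train_batches)
meta_loss = ((test_x @ theta3 - test_y) ** 2).mean()  # l(theta^3) on the testing data
meta_loss.backward()                                  # backprop through the 3 unrolled steps
meta_opt.step()
```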
11
Real Implementation
The LSTM used has only one cell, shared across all parameters: the same cell processes each parameter coordinate-wise, taking $-\nabla_{\theta_p} l$ for parameter $\theta_p$ and producing $\theta_p^0 \rightarrow \theta_p^1 \rightarrow \theta_p^2 \rightarrow \theta_p^3$. This keeps the model size reasonable, and it matches typical gradient descent, where all parameters use the same update rule. Because the optimizer is shared, the model architectures used at meta-training time and at meta-testing time can be different.
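A sketch of the coordinate-wise trick in PyTorch: one nn.LSTMCell is shared by every parameter by treating each scalar parameter as an element of the batch dimension; the hidden size and the output head are illustrative choices.

```python
import torch

cell = torch.nn.LSTMCell(input_size=1, hidden_size=20)   # one cell for all parameters
head = torch.nn.Linear(20, 1)                             # hidden state -> per-coordinate update

def learned_update(flat_grad, state):
    """flat_grad: (num_params,) gradient of all parameters, flattened."""
    g = flat_grad.unsqueeze(-1)      # (num_params, 1): each coordinate is a "batch" item
    h, c = cell(g, state)            # same cell weights applied to every coordinate
    delta = head(h).squeeze(-1)      # per-coordinate update
    return delta, (h, c)

num_params = 1000
state = (torch.zeros(num_params, 20), torch.zeros(num_params, 20))
delta, state = learned_update(torch.randn(num_params), state)
theta_new = torch.randn(num_params) + delta   # theta^t = theta^{t-1} + learned update
```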
12
Experimental Results
$c^t = z^f \odot c^{t-1} + z^i \odot z$, with $c \leftrightarrow \theta$ and $z = -\nabla_\theta l$ as above.
13
The parameter update depends not only on the current gradient, but also on previous gradients.
14
LSTM for Gradient Descent (v2)
As before, unroll for 3 training steps, but add a second layer of "LSTM" cells whose hidden state $m^0 \rightarrow m^1 \rightarrow m^2$ is passed from step to step: $m$ can store previous gradients (like momentum), so the update of $\theta$ depends on the history of $-\nabla_\theta l$, not just the current gradient. Each $-\nabla_\theta l$ is computed on a batch from the training set, and the objective is still to minimize $l(\theta^3)$ on the testing data.
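A hand-written special case of what the extra memory could learn, assuming it simply accumulates past gradients like momentum (λ and η are illustrative; in the real method the combination is learned):

```python
import numpy as np

def step_with_memory(theta, m, grad, lam=0.9, eta=0.01):
    m_new = lam * m + (1 - lam) * grad   # m^t stores a summary of past gradients
    theta_new = theta - eta * m_new      # update uses the memory, not just the current gradient
    return theta_new, m_new

theta, m = np.zeros(3), np.zeros(3)
for grad in [np.array([0.1, -0.2, 0.3])] * 3:   # 3 training steps, dummy gradients
    theta, m = step_with_memory(theta, m, grad)
```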
15
https://arxiv.org/abs/1606.04474
Experimental Results
16
Meta Learning (Part 3) Hung-yi Lee
17
Even more crazy idea … learn one network that does Learning + Prediction in a single step.
Input: the training data and their labels, plus the testing data. Output: the predicted label of the testing data (e.g., given labeled "cat" and "dog" images and a test image, the network directly outputs "cat").
18
Face Verification
A few-shot learning problem: unlock your phone by face. Training = registration (collecting the training data); testing = recognizing the face that tries to unlock the phone. In each task there is a training part and a testing part.
19
Meta Learning
The same approach also works for speaker verification. Training tasks: each task contains a Train part (a registered face) and a Test part (a query face), with ground truth Yes (same person) or No (different person). Testing tasks: the network takes the Train face and the Test face of a new task and outputs Yes or No.
20
Siamese Network
Two CNNs with shared parameters map the two faces to embeddings, and a score measures the similarity of the embeddings. Train: a pair of images of the same person should get a large score. Test: a large score means Yes (same person), a small score means No. (The name "Siamese" comes from "Siamese twins", i.e., conjoined twins, which is why the two networks share their weights.)
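A minimal PyTorch sketch of a Siamese network; the small CNN and the cosine-similarity score are illustrative choices, not the lecture's exact setup.

```python
import torch

class SiameseNet(torch.nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.cnn = torch.nn.Sequential(          # shared embedding network
            torch.nn.Conv2d(3, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(32, emb_dim),
        )

    def forward(self, face1, face2):
        e1 = self.cnn(face1)                     # same module used twice -> shared weights
        e2 = self.cnn(face2)
        return torch.nn.functional.cosine_similarity(e1, e2)  # large score = same person

net = SiameseNet()
score = net(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))  # (4,) similarity scores
```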
21
Siamese Network
The same architecture on a "Different" training pair: when the two faces come from different people, the similarity score should be small.
22
Siamese Network - Intuitive Explanation
Binary classification problem: "Are they the same?" The network takes Face 1 and Face 2 and answers Same or Different. The training set contains face pairs labeled same or different; the testing set is the new pair the network must answer.
23
Siamese Network - Intuitive Explanation
Learning an embedding for faces: the CNN maps faces of the same person as close as possible and faces of different people far away (e.g., it learns to ignore the background). The distance between the two embeddings gives the score.
24
To learn more โฆ What kind of distance should we use?
SphereFace: Deep Hypersphere Embedding for Face Recognition
Additive Margin Softmax for Face Verification
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Triplet loss: Deep Metric Learning using Triplet Network; FaceNet: A Unified Embedding for Face Recognition and Clustering
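As one concrete example from the list above, the triplet loss pulls an anchor and a positive (same person) embedding together and pushes a negative (different person) at least a margin further away; a minimal sketch with random stand-in embeddings (the margin value is illustrative):

```python
import torch

triplet = torch.nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(8, 64, requires_grad=True)   # embeddings from the CNN
positive = torch.randn(8, 64)                        # same person as the anchor
negative = torch.randn(8, 64)                        # different person
loss = triplet(anchor, positive, negative)
loss.backward()
```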
25
N-way Few/One-shot Learning
Example: 5-way 1-shot. The training data contain exactly one image per class (five classes in total); the network takes these labeled images together with the testing data and does learning + prediction in one pass. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", NIPS, 2016.
26
Prototypical Network
Each training image is passed through a CNN to get an embedding; the testing image is embedded by the same CNN and compared with each class embedding, giving similarities $s_1, s_2, s_3, s_4, s_5$; the most similar class is predicted. (With more than one shot per class, each class is represented by the mean of its embeddings.) Jake Snell, Kevin Swersky, Richard Zemel, "Prototypical Networks for Few-shot Learning", NIPS 2017.
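A minimal sketch of the prototypical-network prediction for 5-way 1-shot; random vectors stand in for the CNN embeddings, and negative squared distance is used as the similarity (one common choice for this method):

```python
import torch

def proto_predict(support_emb, support_labels, query_emb, n_way=5):
    # prototype of each class = mean embedding of its support examples
    protos = torch.stack([support_emb[support_labels == c].mean(0) for c in range(n_way)])
    # similarity s_1..s_5 = -||query - prototype||^2, prediction = argmax
    sims = -((query_emb.unsqueeze(0) - protos) ** 2).sum(-1)
    return sims, sims.argmax()

support_emb = torch.randn(5, 64)        # 5 classes x 1 shot, already embedded by the CNN
support_labels = torch.arange(5)
query_emb = torch.randn(64)
sims, pred = proto_predict(support_emb, support_labels, query_emb)
```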
27
Matching Network
Unlike the prototypical network, it considers the relationship among the training examples: their CNN embeddings are processed by a bidirectional LSTM, so each embedding depends on the whole training set, and the testing image is compared with these contextualized embeddings to get the similarities $s_1, \ldots, s_5$. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", NIPS, 2016.
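A sketch of the matching-network idea: a bidirectional LSTM runs over the support-set embeddings so each one can depend on the others; the sizes and the cosine score are illustrative, not the paper's full context-embedding scheme.

```python
import torch

emb_dim = 64
bilstm = torch.nn.LSTM(input_size=emb_dim, hidden_size=emb_dim,
                       bidirectional=True, batch_first=True)
proj = torch.nn.Linear(2 * emb_dim, emb_dim)

support = torch.randn(1, 5, emb_dim)     # (batch, 5 support images, CNN embedding)
query = torch.randn(1, emb_dim)

ctx, _ = bilstm(support)                 # (1, 5, 2*emb_dim): each example sees the others
ctx = proj(ctx)                          # back to (1, 5, emb_dim)
sims = torch.nn.functional.cosine_similarity(ctx, query.unsqueeze(1), dim=-1)  # (1, 5)
pred = sims.argmax(dim=-1)
```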
28
Relation Network
Instead of a fixed similarity measure, the feature map of the testing image is concatenated with the feature map of each training class (concatenation of feature maps in depth), and a learned relation module outputs the similarity score. https://arxiv.org/abs/1711.06025 (CVPR 2018)
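A sketch of the relation-network scoring step: the query and support feature maps are concatenated along the channel (depth) dimension and a small learned relation module outputs the score; all layer sizes are illustrative.

```python
import torch

relation = torch.nn.Sequential(               # learned similarity ("relation") module
    torch.nn.Conv2d(2 * 64, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),  # relation score in [0, 1]
)

support_feat = torch.randn(5, 64, 8, 8)        # feature maps of the 5 support classes
query_feat = torch.randn(1, 64, 8, 8)          # feature map of the query image

# depth concatenation: (5, 128, 8, 8), one pair per class
pairs = torch.cat([query_feat.expand(5, -1, -1, -1), support_feat], dim=1)
scores = relation(pairs).squeeze(-1)           # one learned similarity per class
pred = scores.argmax()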
29
Few-shot learning for imaginary data
30
Few-shot learning for imaginary data