Meta Learning (Part 2): Gradient Descent as LSTM

Presentation on theme: "Meta Learning (Part 2): Gradient Descent as LSTM" - Presentation transcript:

1 Meta Learning (Part 2): Gradient Descent as LSTM
Hung-yi Lee

2 Can we learn more than initialization parameters?
The gradient descent loop (initialize $\theta^0$, compute the gradient $\nabla l$ on the training data, update to get $\theta^1$, then $\theta^2$, ..., ending at $\hat{\theta}$) is itself a function $F$ with learnable components $\phi$: not only the initialization, but potentially also the network structure and the update rule. The learning algorithm looks like an RNN.

3 Recurrent Neural Network
Given a function $f$: $(h', y) = f(h, x)$, where $h$ and $h'$ are vectors with the same dimension. Unrolled over a sequence: $(h^1, y^1) = f(h^0, x^1)$, $(h^2, y^2) = f(h^1, x^2)$, $(h^3, y^3) = f(h^2, x^3)$, and so on. No matter how long the input/output sequence is, we only need the one function $f$.
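A minimal sketch of this idea in Python/NumPy; the tanh cell and the weight names W_h, W_x, W_y are illustrative assumptions, not from the slides:

```python
import numpy as np

def f(h, x, params):
    """One RNN step: (h', y) = f(h, x). The same f is reused at every time step."""
    W_h, W_x, W_y = params
    h_new = np.tanh(W_h @ h + W_x @ x)   # next hidden state (same dimension as h)
    y = W_y @ h_new                      # output at this time step
    return h_new, y

def run_rnn(xs, h0, params):
    """Unroll over an input sequence x^1, x^2, ... using only the single function f."""
    h, ys = h0, []
    for x in xs:
        h, y = f(h, x, params)
        ys.append(y)
    return ys
```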

4 LSTM
A naive RNN maps $(h^{t-1}, x^t)$ to $(h^t, y^t)$. An LSTM carries an extra cell state: it maps $(c^{t-1}, h^{t-1}, x^t)$ to $(c^t, h^t, y^t)$.
$c$ changes slowly: $c^t$ is $c^{t-1}$ with something added. $h$ changes faster: $h^t$ and $h^{t-1}$ can be very different.

5 Review: LSTM
From the input $x^t$ and the previous hidden state $h^{t-1}$, four vectors are computed:
$z = \tanh(W [x^t; h^{t-1}])$ (block input)
$z^i = \sigma(W^i [x^t; h^{t-1}])$ (input gate)
$z^f = \sigma(W^f [x^t; h^{t-1}])$ (forget gate)
$z^o = \sigma(W^o [x^t; h^{t-1}])$ (output gate)

6 Review: LSTM
$c^t = z^f \odot c^{t-1} + z^i \odot z$
$h^t = z^o \odot \tanh(c^t)$
$y^t = \sigma(W' h^t)$
($\odot$ denotes element-wise multiplication; $c$ is a vector.)
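A sketch of one LSTM step implementing exactly these equations in NumPy; the weight matrices act on the concatenation $[x^t; h^{t-1}]$, and the parameter names are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo, W_out):
    xh = np.concatenate([x, h_prev])   # [x^t ; h^{t-1}]
    z  = np.tanh(W  @ xh)              # block input
    zi = sigmoid(Wi @ xh)              # input gate
    zf = sigmoid(Wf @ xh)              # forget gate
    zo = sigmoid(Wo @ xh)              # output gate
    c  = zf * c_prev + zi * z          # c^t = z^f ⊙ c^{t-1} + z^i ⊙ z
    h  = zo * np.tanh(c)               # h^t = z^o ⊙ tanh(c^t)
    y  = sigmoid(W_out @ h)            # y^t = σ(W' h^t)
    return c, h, y
```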

7 Review: LSTM
(Figure: the same cell unrolled over two consecutive time steps. At step $t$ the gates $z^f, z^i, z, z^o$ are computed from $x^t$ and $h^{t-1}$, the cell state flows from $c^{t-1}$ to $c^t$, and $h^t, y^t$ are produced; the identical computation repeats at step $t+1$ with $x^{t+1}$ to give $c^{t+1}, h^{t+1}, y^{t+1}$.)

8 Similar to gradient descent based algorithm
Gradient descent: $\theta^t = \theta^{t-1} - \eta \nabla_\theta l$.
Compare with the cell update $c^t = z^f \odot c^{t-1} + z^i \odot z$. Let the cell state hold the parameters ($c^{t-1} = \theta^{t-1}$), feed the negative gradient as the input ($z = -\nabla_\theta l$), fix the forget gate to all 1s and the input gate to all $\eta$. Then $c^t = \mathbf{1} \odot \theta^{t-1} + \eta \odot (-\nabla_\theta l) = \theta^t$: one LSTM step is exactly one gradient descent step.
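A quick numeric check of this correspondence (a sketch with toy parameter and gradient values):

```python
import numpy as np

theta_prev = np.array([0.5, -1.2, 3.0])   # θ^{t-1}, stored in the cell state c^{t-1}
grad       = np.array([0.1, -0.4, 0.2])   # ∇_θ l (toy values)
eta        = 0.01                          # learning rate

# LSTM cell update with fixed gates: z^f = 1, z^i = η, z = -∇_θ l
zf = np.ones_like(theta_prev)
zi = eta * np.ones_like(theta_prev)
z  = -grad
c_t = zf * theta_prev + zi * z

# Ordinary gradient descent step
theta_t = theta_prev - eta * grad

assert np.allclose(c_t, theta_t)   # the two updates coincide
```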

9 Something like regularization
Similar to a gradient descent based algorithm: $\theta^t = \theta^{t-1} - \eta \nabla_\theta l$, i.e. $c^t = z^f \odot c^{t-1} + z^i \odot z$ with $z = -\nabla_\theta l$, $z^f$ all 1s and $z^i$ all $\eta$. If $z^f$ is learned rather than fixed to 1, it scales down $\theta^{t-1}$, which is something like regularization (weight decay). If $z^i$ is learned rather than fixed to $\eta$, it acts as a dynamic learning rate. How about letting the machine learn to determine $z^f$ and $z^i$ from $-\nabla_\theta l$ and other information (e.g. $h^{t-1}$)?

10 LSTM for Gradient Descent
Typical LSTM: $c^t = z^f \odot c^{t-1} + z^i \odot z$. LSTM for gradient descent: the cell state is the parameter vector ($\theta^{t-1} \to \theta^t$), the input is $-\nabla_\theta l$, and the gates are learnable. The "LSTM" is unrolled for 3 training steps, $\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$, where the input at each step is $-\nabla_\theta l$ computed on a batch from the training data. It is trained to minimize the loss $l(\theta^3)$ on the testing data. (Independence assumption: $-\nabla_\theta l$ in fact depends on $\theta^{t-1}$, but this dependency is ignored and the gradient is treated as an external input.)
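A rough sketch of this meta-training loop in PyTorch. The names (meta_lstm, out_layer, learner_loss), the toy linear-regression learner, and the cell size are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Learned update rule: one small LSTM cell maps a coordinate's gradient to its update.
meta_lstm = nn.LSTMCell(input_size=1, hidden_size=20)
out_layer = nn.Linear(20, 1)
meta_opt  = torch.optim.Adam(
    list(meta_lstm.parameters()) + list(out_layer.parameters()), lr=1e-3)

def learner_loss(theta, batch):
    x, y = batch
    return ((x @ theta - y) ** 2).mean()          # toy learner: linear regression

for episode in range(200):                         # each episode is one sampled task
    x = torch.randn(64, 5)
    w = torch.randn(5)
    train_batch, test_batch = (x[:32], x[:32] @ w), (x[32:], x[32:] @ w)

    theta = torch.zeros(5, requires_grad=True)     # learner parameters θ^0
    h = torch.zeros(5, 20)
    c = torch.zeros(5, 20)
    for step in range(3):                          # 3 unrolled training steps
        loss = learner_loss(theta, train_batch)
        grad, = torch.autograd.grad(loss, theta)
        # Independence assumption: treat the gradient as an external input.
        g = grad.detach().reshape(-1, 1)
        h, c = meta_lstm(-g, (h, c))
        theta = theta + out_layer(h).reshape(-1)   # learned update replaces -η ∇_θ l

    meta_loss = learner_loss(theta, test_batch)    # l(θ^3) on the held-out data
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```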

11 Real Implementation
The LSTM used has only one cell, shared across all parameters: the same "LSTM" processes $\theta_1$ (input $-\nabla_{\theta_1} l$, producing $\theta_1^1, \theta_1^2, \theta_1^3$) and, in parallel, $\theta_2$ (input $-\nabla_{\theta_2} l$, producing $\theta_2^1, \theta_2^2, \theta_2^3$), and so on for every parameter. This keeps the model size reasonable. It also mirrors typical gradient descent, where all the parameters use the same update rule, and it means the model architectures used at training and testing time can be different.
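A sketch of this coordinate-wise sharing (an illustrative assumption: every learner parameter becomes one batch element of a single shared LSTM cell, so the same tiny cell serves any number of parameters):

```python
import torch
import torch.nn as nn

num_params  = 10_000                                       # learner parameters, flattened
shared_cell = nn.LSTMCell(input_size=1, hidden_size=20)    # one small cell shared by all of them
to_update   = nn.Linear(20, 1)

grads = torch.randn(num_params, 1)     # ∇_θ l for every coordinate (toy values)
h = torch.zeros(num_params, 20)        # separate hidden state per coordinate...
c = torch.zeros(num_params, 20)        # ...but identical weights for every coordinate
h, c = shared_cell(-grads, (h, c))
updates = to_update(h).squeeze(-1)     # one update per parameter, same rule everywhere
```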

12 Experimental Results
(Figure: experimental results for the learned update $c^t = z^f \odot c^{t-1} + z^i \odot z$, i.e. how $\theta^t$ is produced from $\theta^{t-1}$ and $-\nabla_\theta l$.)

13 Parameter updates depend not only on the current gradient, but also on previous gradients.

14 LSTM for Gradient Descent (v2)
As before, 3 training steps are unrolled ($\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$, each driven by $-\nabla_\theta l$ from a batch of training data) and the objective is the loss $l(\theta^3)$ on the testing data. A second layer of "LSTM" cells is added, whose state $m$ ($m^0 \to m^1 \to m^2$) can store previous gradients, so the update of $\theta$ can depend on the gradient history (reminiscent of momentum), not only on the current gradient.
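For intuition, a minimal sketch of an update that keeps a memory m of past gradients. This is plain momentum, not the learned v2 cell; it only illustrates the kind of information m could store:

```python
import numpy as np

def momentum_step(theta, grad, m, eta=0.01, beta=0.9):
    m_new = beta * m + grad            # m accumulates current and previous gradients
    theta_new = theta - eta * m_new    # the update depends on the gradient history
    return theta_new, m_new
```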

15 Experimental Results
https://arxiv.org/abs/1606.04474

16 Meta Learning (Part 3)
Hung-yi Lee

17 Even more crazy idea ...
Treat learning + prediction as a single function. Input: the training data and their labels (e.g. images of cats and dogs), plus the testing data. Output: the predicted label of the testing data (e.g. "cat").

18 Face Verification
This is few-shot learning. In each task: training corresponds to registration (collecting the training data, i.e. a few images of your face), and testing corresponds to unlocking your phone by face.

19 Meta Learning
(The same approach applies to speaker verification.) Training tasks: each task contains a Train set and a Test set; the network takes both and outputs Yes if the Test face belongs to the registered person and No otherwise. Testing task: given a new person's Train (registration) face and Test (query) face, the network answers Yes or No.

20 Siamese Network
The Train face and the Test face each go through a CNN; the two CNNs share parameters, and each produces an embedding. The network outputs a similarity score between the two embeddings: a large score means Yes (same person), a small score means No. Here the two faces are the same person, so the score should be large. (In English, "Siamese" means twin or conjoined, hence the name for two copies of one shared network.)

21 Siamese Network
Same architecture, but here the Train face and the Test face are different people, so the similarity score should be small (answer: No).
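A minimal sketch of a Siamese forward pass in PyTorch; the CNN architecture and the choice of cosine similarity are illustrative assumptions, not the lecture's exact network:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared CNN: both faces are embedded with the same weights.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 64),
        )

    def forward(self, face_a, face_b):
        emb_a, emb_b = self.cnn(face_a), self.cnn(face_b)
        # Large score -> same person (Yes); small score -> different person (No).
        return torch.cosine_similarity(emb_a, emb_b, dim=1)

net = SiameseNet()
score = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```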

22 Siamese Network - Intuitive Explanation
One view: a binary classification problem, "Are they the same person?". The training set contains pairs of faces labelled same or different; at testing time the network is given Face 1 and Face 2 and predicts same or different.

23 Siamese Network - Intuitive Explanation
Another view: learning an embedding for faces. The CNN maps faces of the same person to embeddings that are as close as possible, and faces of different people to embeddings that are far away (e.g. it learns to ignore the background).

24 To learn more ... What kind of distance should we use?
SphereFace: Deep Hypersphere Embedding for Face Recognition
Additive Margin Softmax for Face Verification
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Triplet loss (see the sketch below):
Deep Metric Learning using Triplet Network
FaceNet: A Unified Embedding for Face Recognition and Clustering
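For reference, a sketch of the standard triplet loss these papers build on; the margin value and function names are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor towards the positive (same person) and push it away
    # from the negative (different person) by at least `margin`.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```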

25 N-way Few/One-shot Learning
Example: 5-way 1-shot. The training data contains exactly one image per class (five classes in total), and the network (learning + prediction) must predict which of the five classes the testing image belongs to.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", NIPS 2016.

26 Prototypical Network
Each training image is embedded by a CNN (with more than one shot per class, the class prototype is the mean of that class's embeddings). The testing image is embedded by the same CNN, and the similarities $s_1, s_2, s_3, s_4, s_5$ between its embedding and the five class prototypes determine the prediction.
Jake Snell, Kevin Swersky, Richard Zemel, "Prototypical Networks for Few-shot Learning", NIPS 2017.
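A sketch of the prototypical classification step in PyTorch; `embed` is an assumed CNN embedding function, and negative Euclidean distance stands in for the similarity score:

```python
import torch
import torch.nn.functional as F

def prototypical_predict(embed, support_images, support_labels, query_image, n_way=5):
    """Classify a query by similarity to per-class prototypes (mean support embeddings)."""
    support_emb = embed(support_images)                       # (n_support, d)
    query_emb   = embed(query_image.unsqueeze(0))             # (1, d)
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0)          # prototype = mean embedding of class c
        for c in range(n_way)
    ])                                                        # (n_way, d)
    scores = -torch.cdist(query_emb, prototypes).squeeze(0)   # s_1..s_n: negative distance as similarity
    return F.softmax(scores, dim=0)                           # distribution over the n classes
```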

27 Matching Network
Unlike the prototypical network, the matching network considers the relationship among the training examples: the support-set embeddings are processed jointly by a bidirectional LSTM before the similarities $s_1, ..., s_5$ to the testing embedding are computed.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", NIPS 2016.

28 Relation Network
https://arxiv.org/abs/1711.06025 (CVPR 2018)
Rather than using a fixed similarity measure, the feature maps of the testing image and of each training image are concatenated in depth (along the channel dimension) and fed to a further relation module that outputs the relation score.
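A sketch of that depth concatenation in PyTorch; the feature-map sizes and the relation module architecture are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Feature maps from the embedding CNN for one query image and one support image.
query_feat   = torch.randn(1, 64, 5, 5)
support_feat = torch.randn(1, 64, 5, 5)

# "Concatenation of feature maps in depth": stack along the channel dimension.
pair = torch.cat([support_feat, query_feat], dim=1)          # (1, 128, 5, 5)

# A small relation module maps the concatenated pair to a relation score.
relation_module = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
score = relation_module(pair)                                 # relation score in [0, 1]
```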

29 Few-shot learning for imaginary data

30 Few-shot learning for imaginary data

