Meta Learning (Part 2): Gradient Descent as LSTM

Presentation transcript:

Meta Learning (Part 2): Gradient Descent as LSTM. Hung-yi Lee. A good summary of meta learning: https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html

Can we learn more than the initialization parameters? Gradient descent repeats the same steps: start from the initialization θ^0, compute the gradient ∇l on a batch of training data, update to get θ^1, then θ^2, and so on until the final θ. The network structure and the learning algorithm together define a function F, whose learnable components are φ. The learning algorithm looks like an RNN.
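Below is a minimal sketch (not from the lecture; the toy loss and batch data are made up) of gradient descent written as a recurrent update, to make the RNN analogy concrete: the parameters θ act as the recurrent state, the negative gradient is the input at each step, and the same update function is applied at every iteration.

```python
import torch

def gd_step(theta, neg_grad, lr=0.1):
    # One "RNN step": the same update function is applied at every iteration,
    # with theta as the recurrent state and the negative gradient as the input.
    return theta + lr * neg_grad

theta = torch.zeros(10)                          # theta^0, the initialization
batches = [(torch.randn(10), torch.randn(())) for _ in range(3)]
for x, y in batches:                             # three update steps, one batch each
    theta = theta.detach().requires_grad_()
    loss = ((theta * x).sum() - y) ** 2          # stand-in for l(theta) on this batch
    grad, = torch.autograd.grad(loss, theta)
    theta = gd_step(theta, -grad)                # theta^t = f(theta^{t-1}, -grad), like h' = f(h, x)
```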

Recurrent Neural Network. Given a function f: (h′, y) = f(h, x), where h and h′ are vectors with the same dimension. Unrolled over a sequence: h^0 and x^1 give h^1 and y^1, then h^1 and x^2 give h^2 and y^2, then h^2 and x^3 give h^3 and y^3, and so on. No matter how long the input/output sequence is, we only need the one function f.

LSTM vs. naïve RNN. A naïve RNN passes only a hidden state h^t from step to step. An LSTM additionally carries a cell state c^t: c changes slowly (c^t is c^{t-1} plus something), while h changes quickly (h^t and h^{t-1} can be very different).

Review: LSTM. From the input x^t and the previous hidden state h^{t-1} (concatenated), the LSTM computes z = tanh(W [x^t; h^{t-1}]) (block input), z^i = σ(W^i [x^t; h^{t-1}]) (input gate), z^f = σ(W^f [x^t; h^{t-1}]) (forget gate), and z^o = σ(W^o [x^t; h^{t-1}]) (output gate). The cell state c is a vector.

Review: LSTM. The cell and hidden state are updated as c^t = z^f ⊙ c^{t-1} + z^i ⊙ z and h^t = z^o ⊙ tanh(c^t), and the output is y^t = σ(W′ h^t).
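Here is a minimal sketch of these equations as code (not the lecture's code; W, Wi, Wf, Wo and W_out are hypothetical weight matrices acting on the concatenation [x^t; h^{t-1}]):

```python
import torch

def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo, W_out):
    xh = torch.cat([x, h_prev])            # [x^t; h^{t-1}]
    z  = torch.tanh(W @ xh)                # block input
    zi = torch.sigmoid(Wi @ xh)            # input gate
    zf = torch.sigmoid(Wf @ xh)            # forget gate
    zo = torch.sigmoid(Wo @ xh)            # output gate
    c  = zf * c_prev + zi * z              # c^t = z^f ⊙ c^{t-1} + z^i ⊙ z
    h  = zo * torch.tanh(c)                # h^t = z^o ⊙ tanh(c^t)
    y  = torch.sigmoid(W_out @ h)          # y^t = σ(W' h^t)
    return h, c, y
```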

Review: LSTM. The same cell is simply unrolled over time: (c^{t-1}, h^{t-1}, x^t) produce (c^t, h^t, y^t), which together with x^{t+1} produce (c^{t+1}, h^{t+1}, y^{t+1}), and so on. (Related reading on gated recurrent units: https://arxiv.org/pdf/1412.3555.pdf)

Similar to a gradient-descent-based algorithm. Gradient descent updates θ^t = θ^{t-1} − η ∇_θ l. Compare with the LSTM cell update c^t = z^f ⊙ c^{t-1} + z^i ⊙ z: if the cell state c plays the role of θ, the input is z = −∇_θ l, the forget gate z^f is fixed to all 1s, and the input gate z^i is fixed to all η, then the LSTM update is exactly gradient descent.

Similar to a gradient-descent-based algorithm (continued): θ^t = θ^{t-1} − η ∇_θ l, i.e. c^t = z^f ⊙ c^{t-1} + z^i ⊙ z with c = θ, z = −∇_θ l, z^f all 1s, and z^i all η. How about letting the machine learn to determine z^f and z^i from −∇_θ l and other information? Then z^i becomes a dynamic learning rate, and z^f, which can shrink θ^{t-1}, does something like regularization.
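A minimal sketch of this idea (my own illustration, not the lecture's code): the parameter vector is the cell state and the negative gradient is the input; with fixed gates the update reduces to plain gradient descent, while a learnable gate_net (hypothetical name) would produce a dynamic learning rate and a regularization-like forget gate.

```python
import torch
import torch.nn as nn

class LearnedUpdateCell(nn.Module):
    def __init__(self, in_dim=2, eta=0.1):
        super().__init__()
        self.eta = eta
        self.gate_net = nn.Linear(in_dim, 2)          # hypothetical: predicts z_f and z_i per parameter

    def forward(self, theta, neg_grad, learned=False):
        if not learned:                               # plain gradient descent as a special case
            z_f = torch.ones_like(theta)              # forget gate fixed to all 1s
            z_i = self.eta * torch.ones_like(theta)   # input gate fixed to all eta
        else:                                         # gates determined from the gradient and theta
            feats = torch.stack([neg_grad, theta], dim=-1)
            z_f, z_i = torch.sigmoid(self.gate_net(feats)).unbind(-1)
        return z_f * theta + z_i * neg_grad           # theta^t = z_f ⊙ theta^{t-1} + z_i ⊙ (−∇l)
```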

LSTM for Gradient Descent. Typical LSTM: c^t = z^f ⊙ c^{t-1} + z^i ⊙ z. LSTM for gradient descent: the cell state is θ and the input is −∇_θ l. Unroll the learnable “LSTM” for 3 training steps, θ^0 → θ^1 → θ^2 → θ^3, where the −∇_θ l at each step is computed on a batch from the training data, and learn the “LSTM” to minimize the loss l(θ^3) on the testing data. (An independence assumption is made: the gradient fed in at each step is treated as an input that does not itself depend on the “LSTM” parameters.)

Real Implementation. The “LSTM” used has only one cell, shared across all parameters: the same cell updates θ_1^0 → θ_1^1 → θ_1^2 → θ_1^3 from −∇_{θ_1} l and θ_2^0 → θ_2^1 → θ_2^2 → θ_2^3 from −∇_{θ_2} l. This keeps the model size reasonable, it matches typical gradient descent, where all parameters use the same update rule, and it allows the training and testing model architectures to be different.
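A minimal sketch of this coordinatewise sharing (an assumption about the setup, not the actual implementation): every scalar parameter is treated as a separate batch element of one small shared LSTM cell, so the same cell can drive models of any size.

```python
import torch
import torch.nn as nn

class CoordinatewiseOptimizer(nn.Module):
    def __init__(self, hidden=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)     # one cell, shared across all parameters
        self.out = nn.Linear(hidden, 1)        # maps the hidden state to an update

    def forward(self, grads, state=None):
        g = grads.reshape(-1, 1)               # (num_params, 1): one coordinate per row
        if state is None:
            h = torch.zeros(g.size(0), self.cell.hidden_size)
            state = (h, torch.zeros_like(h))
        h, c = self.cell(g, state)
        update = self.out(h).reshape(grads.shape)
        return update, (h, c)                  # caller applies theta^t = theta^{t-1} + update
```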

Experimental Results. The learned update has the form c^t = z^f ⊙ c^{t-1} + z^i ⊙ z with c = θ and z = −∇_θ l. (Figures: https://openreview.net/forum?id=rJY0-Kcll&noteId=ryq49XyLg)

The parameter update depends not only on the current gradient, but also on previous gradients.

LSTM for Gradient Descent (v2). Over the 3 training steps, a second layer of “LSTM” keeps its own state m^0 → m^1 → m^2, and m can store previous gradients (similar to momentum). At each step, −∇_θ l computed on a batch from the training data updates m, which in turn drives the update θ^0 → θ^1 → θ^2 → θ^3; the whole thing is again learned to minimize l(θ^3) on the testing data.
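As a minimal sketch of that extra memory (my own illustration; the fixed coefficients below are placeholders for what would be learned), one recurrent state m accumulates past gradients before the parameter update, so the update depends on the gradient history:

```python
import torch

def v2_step(theta, m, neg_grad, beta=0.9, eta=0.1):
    m_new = beta * m + (1.0 - beta) * neg_grad   # m^t stores/summarizes previous gradients
    theta_new = theta + eta * m_new              # update driven by m^t instead of the raw gradient
    return theta_new, m_new
```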

Experimental Results (from https://arxiv.org/abs/1606.04474).

Meta Learning (Part 3). Hung-yi Lee. Related reading: Ke Li's blog post on learning to optimize with RL, https://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/, and a good summary of meta learning, https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html

An even crazier idea: do learning and prediction with a single network. Input: the training data with their labels (e.g. images labeled “cat” and “dog”) together with the testing data; output: the predicted label of the testing data (e.g. “cat”).

Face Verification as few-shot learning. In each task there is a training phase and a testing phase. Example: unlocking your phone by face. Registration collects the training data (https://support.apple.com/zh-tw/HT208109); each later unlock attempt is testing.

Meta Learning for verification (the same approach also applies to speaker verification). Training tasks: each task contains a train (registration) face and a test face, and the network must output Yes when they are the same person and No otherwise. Testing tasks: given an unseen person's registration face and a test face, the network outputs Yes or No.

Siamese Network. Two CNNs with shared weights embed the registration face and the test face, and a similarity score is computed between the two embeddings. Training: when the pair shows the same person, the network should produce a large score. Test: a large score means Yes (same person) and a small score means No. (On the name: “Siamese” literally refers to Siam/Thailand, 暹罗 in Chinese, and in English also means “twin” or “conjoined”, which is why shared-weight twin networks are called Siamese.)

Siamese Network (continued). Training: when the pair shows two different people, the network should produce a small similarity score.

Siamese Network, intuitive explanation 1: a binary classification problem, “Are they the same?” The training set consists of face pairs labeled same or different; the network takes Face 1 and Face 2 and outputs Same or Different; the testing set contains new pairs to be judged.

Siamese Network, intuitive explanation 2: learning an embedding for faces. The CNN maps faces of the same person as close as possible in the embedding space and faces of different people far away, e.g. it learns to ignore the background. (How the distance between embeddings is computed also matters; see the sketch below and the references on the next slide.)
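A minimal sketch of a Siamese network (my own illustration; the toy CNN, the Euclidean distance, and the contrastive-style loss are assumptions, not choices fixed by the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Siamese(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        # the same CNN (shared weights) embeds both faces
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim),
        )

    def forward(self, face1, face2):
        e1, e2 = self.cnn(face1), self.cnn(face2)
        return F.pairwise_distance(e1, e2)       # small distance means same person

def contrastive_loss(dist, same, margin=1.0):
    # pull same-person pairs together, push different-person pairs at least `margin` apart
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()
```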

To learn more: what kind of distance/loss should we use? Angular-margin softmax losses: SphereFace: Deep Hypersphere Embedding for Face Recognition; Additive Margin Softmax for Face Verification; ArcFace: Additive Angular Margin Loss for Deep Face Recognition. Triplet loss (a sketch is given below): Deep Metric Learning using Triplet Network; FaceNet: A Unified Embedding for Face Recognition and Clustering.
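Here is a minimal sketch of the triplet loss mentioned above (assuming Euclidean distance between L2-normalized embeddings; the margin value is illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive: embeddings of the same person; negative: a different person
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    d_ap = (a - p).pow(2).sum(-1)                # squared distance to the positive
    d_an = (a - n).pow(2).sum(-1)                # squared distance to the negative
    return F.relu(d_ap - d_an + margin).mean()   # positive must be closer by at least `margin`
```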

N-way Few/One-shot Learning. Example: 5-way 1-shot. The training data contain one image per class (five characters: 一花, 二乃, 三玖, 四葉, 五月), and the network performs learning + prediction in a single pass to classify the testing image (here 三玖). Reference: Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, “Matching Networks for One Shot Learning”, NIPS 2016.

Prototypical Network (Prototypical Networks for Few-shot Learning, NIPS 2017, https://arxiv.org/abs/1703.05175). A shared CNN embeds each training image; the testing image is embedded by the same CNN and classified according to its similarity s_1, s_2, s_3, s_4, s_5 to each class embedding (the class prototype, i.e. the mean embedding when there is more than one shot per class).
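A minimal sketch of prototypical-network classification (assuming Euclidean distance and a generic embed() CNN; the function and variable names are hypothetical):

```python
import torch

def prototypical_logits(embed, support_x, support_y, query_x, n_way):
    # support_x: (N*K, C, H, W), support_y: (N*K,) class ids, query_x: (Q, C, H, W)
    z_support = embed(support_x)                       # (N*K, D)
    z_query = embed(query_x)                           # (Q, D)
    prototypes = torch.stack([                         # mean embedding per class
        z_support[support_y == c].mean(0) for c in range(n_way)
    ])                                                 # (N, D)
    dists = torch.cdist(z_query, prototypes)           # (Q, N)
    return -dists                                      # logits: higher = more similar
```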

Matching Network (Matching Networks for One Shot Learning, https://arxiv.org/abs/1606.04080). Unlike the prototypical network, the training examples are embedded by a CNN followed by a bidirectional LSTM, so each embedding considers the relationship among all the training examples; the testing image is then classified by its similarity s_1, ..., s_5 to these embeddings.

Relation Network (CVPR 2018, https://arxiv.org/abs/1711.06025). Instead of using a fixed similarity measure, the feature maps of the testing image and each training image are concatenated in depth and fed to a learned relation module, which outputs the relation score for that class.
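A minimal sketch of the relation module (my own illustration with toy layer sizes; the embedding CNN that produces the feature maps is assumed and not shown):

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(                     # learned relation module
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid(),     # relation score in [0, 1]
        )

    def forward(self, query_feat, support_feat):
        # concatenate the two feature maps along the channel (depth) dimension
        pair = torch.cat([query_feat, support_feat], dim=1)
        return self.net(pair)
```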

Few-shot learning with imaginary data (https://arxiv.org/abs/1801.05401): a hallucinator network generates additional, imaginary training examples to augment the few real shots before learning.
