Meta Learning (Part 2): Gradient Descent as LSTM
Hung-yi Lee
Good summarization: https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html
Can we learn more than the initialization parameters?
The gradient descent loop (init $\theta^0$, compute the gradient $\nabla_\theta l$ on a batch of training data, update to get $\theta^1$, $\theta^2$, ..., and finally $\hat{\theta}$), together with the network structure, defines the learning algorithm: a function $F$ whose learnable components are $\phi$.
The learning algorithm looks like an RNN.
Recurrent Neural Network
Given a function $f$: $h', y = f(h, x)$, where $h$ and $h'$ are vectors with the same dimension.
The outputs are produced by applying $f$ step by step: $h^1, y^1 = f(h^0, x^1)$, $h^2, y^2 = f(h^1, x^2)$, $h^3, y^3 = f(h^2, x^3)$, ...
No matter how long the input/output sequence is, we only need one function $f$.
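As a minimal sketch (NumPy, with made-up dimensions and weight names), the same $f$ is reused at every time step:

```python
import numpy as np

def f(h, x, W_h, W_x, W_y):
    """One RNN step: new hidden state h' and output y from (h, x)."""
    h_new = np.tanh(W_h @ h + W_x @ x)   # h and h' have the same dimension
    y = W_y @ h_new
    return h_new, y

# The same f is applied at every step, no matter how long the sequence is.
rng = np.random.default_rng(0)
dim_h, dim_x, dim_y = 4, 3, 2
W_h = rng.normal(size=(dim_h, dim_h))
W_x = rng.normal(size=(dim_h, dim_x))
W_y = rng.normal(size=(dim_y, dim_h))

h = np.zeros(dim_h)                       # h^0
for x in rng.normal(size=(5, dim_x)):     # x^1 ... x^5
    h, y = f(h, x, W_h, W_x, W_y)
```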
LSTM
A naïve RNN carries only the hidden state $h^t$; an LSTM additionally carries a cell state $c^t$.
$c$ changes slowly: $c^t$ is $c^{t-1}$ plus something.
$h$ changes faster: $h^t$ and $h^{t-1}$ can be very different.
Review: LSTM
Given the input $x^t$ and the previous output $h^{t-1}$ (concatenated into one vector):
$z = \tanh(W [x^t; h^{t-1}])$ (block input)
$z^i = \sigma(W^i [x^t; h^{t-1}])$ (input gate)
$z^f = \sigma(W^f [x^t; h^{t-1}])$ (forget gate)
$z^o = \sigma(W^o [x^t; h^{t-1}])$ (output gate)
$c$ is a vector, and the gates act on it element-wise.
Review: LSTM
$c^t = z^f \odot c^{t-1} + z^i \odot z$
$h^t = z^o \odot \tanh(c^t)$
$y^t = \sigma(W' h^t)$
($\odot$ is element-wise multiplication; $c$ is a vector.)
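A minimal NumPy sketch of the cell update above; the weight names mirror the equations and are otherwise illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, W_i, W_f, W_o, W_out):
    """One LSTM step following the slide's equations."""
    u = np.concatenate([x_t, h_prev])   # [x^t; h^{t-1}]
    z  = np.tanh(W @ u)                 # block input
    zi = sigmoid(W_i @ u)               # input gate
    zf = sigmoid(W_f @ u)               # forget gate
    zo = sigmoid(W_o @ u)               # output gate
    c_t = zf * c_prev + zi * z          # c changes slowly: c^{t-1} plus something
    h_t = zo * np.tanh(c_t)             # h can change a lot between steps
    y_t = sigmoid(W_out @ h_t)
    return h_t, c_t, y_t
```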
Review: LSTM
The same cell is repeated over time: $(c^{t-1}, h^{t-1}, x^t) \to (c^t, h^t, y^t) \to (c^{t+1}, h^{t+1}, y^{t+1}) \to \dots$
(See also GRU and other variants: https://arxiv.org/pdf/1412.3555.pdf)
Similar to gradient-descent-based algorithms
Gradient descent: $\theta^t = \theta^{t-1} - \eta \nabla_\theta l$
LSTM cell update: $c^t = z^f \odot c^{t-1} + z^i \odot z$
Replace the cell state $c$ with the parameters $\theta$: if $z^f$ is all 1's, $z^i$ is all $\eta$'s, and the input $z$ is $-\nabla_\theta l$, the LSTM cell update becomes exactly the gradient descent update.
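To see the correspondence concretely, here is a sketch in which the gates are fixed rather than learned; with $z^f$ all 1's, $z^i$ all $\eta$'s, and $z = -\nabla_\theta l$, the cell update is one step of vanilla gradient descent:

```python
import numpy as np

def lstm_cell_as_gd(theta_prev, grad, eta=0.1):
    """c^t = z^f ⊙ c^{t-1} + z^i ⊙ z, with c replaced by theta."""
    z_f = np.ones_like(theta_prev)         # forget gate fixed to all 1's
    z_i = eta * np.ones_like(theta_prev)   # input gate fixed to the learning rate
    z = -grad                              # cell input is the negative gradient
    return z_f * theta_prev + z_i * z      # = theta^{t-1} - eta * grad
```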
Similar to gradient-descent-based algorithms
$\theta^t = \theta^{t-1} - \eta \nabla_\theta l$ vs. $c^t = z^f \odot c^{t-1} + z^i \odot z$ (with $c \leftrightarrow \theta$ and $z = -\nabla_\theta l$)
How about letting the machine learn to determine $z^f$ and $z^i$ from $-\nabla_\theta l$ and other information?
A learned $z^f$ acts like regularization (it can shrink the previous parameters); a learned $z^i$ acts like a dynamic learning rate.
LSTM for Gradient Descent
Typical LSTM: $c^t = z^f \odot c^{t-1} + z^i \odot z$. LSTM for gradient descent: replace $c$ with $\theta$ and feed in $-\nabla_\theta l$.
Unroll the "LSTM" for 3 training steps: starting from a learnable $\theta^0$, each step takes $-\nabla_\theta l$ computed on a batch from the training data and produces $\theta^1$, $\theta^2$, $\theta^3$.
Learn to minimize the loss $l(\theta^3)$ on the testing data.
(Independence assumption: when back-propagating through the unrolled steps, $-\nabla_\theta l$ is treated as independent of the "LSTM" parameters.)
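A minimal PyTorch sketch of this meta-training setup. The task_loss(theta, batch) function, the train_batches/test_batch names, and the toy one-layer update rule standing in for the real "LSTM" are all illustrative assumptions, not the paper's exact model:

```python
import torch

class LearnedOptimizer(torch.nn.Module):
    """Toy learned update rule: from -grad it produces stand-ins for z^f, z^i, z."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(1, 3)     # coordinate-wise: one scalar in, three values out

    def step(self, theta, grad):
        out = self.net((-grad).reshape(-1, 1))
        z_f = torch.sigmoid(out[:, 0])       # learned forget gate ("something like regularization")
        z_i = torch.sigmoid(out[:, 1])       # learned input gate ("dynamic learning rate")
        z = torch.tanh(out[:, 2])            # learned update direction
        return (z_f * theta.reshape(-1) + z_i * z).reshape(theta.shape)

def meta_loss(learned_opt, task_loss, theta0, train_batches, test_batch, steps=3):
    """Unroll the learned optimizer for a few steps, then measure l(theta^3) on test data."""
    theta = theta0
    for t in range(steps):
        # Independence assumption: the inner gradient is taken on a detached copy of theta,
        # so we do not back-propagate through nabla_theta l itself.
        theta_leaf = theta.detach().requires_grad_(True)
        grad, = torch.autograd.grad(task_loss(theta_leaf, train_batches[t]), theta_leaf)
        theta = learned_opt.step(theta, grad)
    return task_loss(theta, test_batch)      # minimized w.r.t. the LearnedOptimizer's parameters
```

Meta-training would then repeatedly sample a task, compute meta_loss, call .backward(), and update the LearnedOptimizer's parameters with an ordinary optimizer.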
Real Implementation
The "LSTM" used has only one cell, shared across all parameters: the same update rule is applied to every parameter, each with its own unrolled sequence ($\theta_1^0 \to \theta_1^1 \to \theta_1^2 \to \theta_1^3$ driven by $-\nabla_{\theta_1} l$, and likewise $\theta_2^0 \to \theta_2^1 \to \theta_2^2 \to \theta_2^3$).
This keeps the model size reasonable.
In typical gradient descent, all the parameters also use the same update rule.
Because the cell is shared, the training and testing model architectures can be different (they can have different numbers of parameters).
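A sketch of the coordinate-wise sharing, assuming PyTorch's LSTMCell as the single shared cell and an illustrative readout layer; every parameter coordinate is treated as a separate batch element with its own hidden state:

```python
import torch

# One small LSTM cell shared by every parameter coordinate: only the per-coordinate
# hidden/cell states differ, so the model stays small and the number of parameters
# being optimized can change between training and testing.
cell = torch.nn.LSTMCell(input_size=1, hidden_size=8)
readout = torch.nn.Linear(8, 1)

def coordinatewise_update(theta, grad, state=None):
    g = (-grad).reshape(-1, 1)                 # one input per coordinate
    if state is None:
        state = (torch.zeros(g.shape[0], 8), torch.zeros(g.shape[0], 8))
    h, c = cell(g, state)
    delta = readout(h).reshape(theta.shape)    # proposed update for each coordinate
    return theta + delta, (h, c)
```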
Experimental Results
$c^t = z^f \odot c^{t-1} + z^i \odot z$, with $c \leftrightarrow \theta$ and $z = -\nabla_\theta l$.
https://openreview.net/forum?id=rJY0-Kcll&noteId=ryq49XyLg
The parameter update depends not only on the current gradient, but also on previous gradients.
LSTM for Gradient Descent (v2)
Add a second "LSTM" track carrying $m^0, m^1, m^2, \dots$; $m$ can store previous gradients.
Unroll for 3 training steps as before: each step takes $-\nabla_\theta l$ from a batch of training data, updates both $\theta$ and $m$, and the loss $l(\theta^3)$ on the testing data is minimized.
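The $m$ track is reminiscent of momentum; the following sketch only makes that analogy concrete (plain momentum with assumed coefficients, not the paper's learned architecture):

```python
import numpy as np

def gd_with_memory(theta, grad, m, eta=0.1, beta=0.9):
    """Update that depends on past gradients through a memory m, analogous to
    the extra m^t track on the slide (here realized as plain momentum)."""
    m = beta * m + (1 - beta) * grad   # m stores (a summary of) previous gradients
    theta = theta - eta * m            # the update uses current AND past gradients
    return theta, m

theta, m = np.zeros(3), np.zeros(3)
for grad in np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]):
    theta, m = gd_with_memory(theta, grad, m)
```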
Experimental Results
https://arxiv.org/abs/1606.04474
Meta Learning (Part 3)
Hung-yi Lee
Ke Li's blog is good: https://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/
Good summarization: https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html
Even more crazy idea ...
Learn a single network that does learning + prediction in one shot.
Input: the training data and their labels, plus the testing data. Output: the predicted label of the testing data (e.g., given labelled cat/dog images and a test image, directly output "cat").
Face Verification as Few-shot Learning
Unlocking your phone by face: registration (collecting the training data) is the training phase of a task; each unlock attempt is the testing phase.
In each task there is a (very small) training set and a test. https://support.apple.com/zh-tw/HT208109
Meta Learning for Face Verification (the same approach works for speaker verification)
Training tasks: each task contains a training (registration) face and test faces labelled Yes (same person) or No (different person); the network learns from many such tasks.
Testing task: given a new person's registration face and a test face, the network outputs Yes or No.
Siamese Network
Train: the two faces go through two CNNs that share parameters; each CNN produces an embedding, and a similarity score is computed. For a pair from the same person, the score should be large.
Test: large score → Yes (same person); small score → No.
(Note: "Siamese" refers to Siam, i.e. Thailand; in English it also means "twin" or "conjoined", hence the name for twin networks with shared weights. https://zhuanlan.zhihu.com/p/35040994)
Siamese Network
Train: for a pair of faces from different people, the similarity score should be small.
Test: large score → Yes (same person); small score → No.
Siamese Network - Intuitive Explanation
Treat it as a binary classification problem: "Are they the same person?"
Training set: pairs of faces labelled same / different; the network takes Face 1 and Face 2 and outputs same or different.
Testing set: a new pair → same or different?
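A minimal PyTorch sketch of a Siamese network with shared weights; the CNN architecture and the cosine-similarity score are illustrative choices:

```python
import torch

class Siamese(torch.nn.Module):
    """Both inputs go through the SAME CNN (shared weights); the score says
    whether they show the same person. Layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.cnn = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(4), torch.nn.Flatten(),
            torch.nn.Linear(16 * 4 * 4, 64),
        )

    def forward(self, face1, face2):
        e1 = self.cnn(face1)                      # embedding of face 1
        e2 = self.cnn(face2)                      # same CNN, shared weights
        return torch.cosine_similarity(e1, e2)    # large → same, small → different

# Training pairs: same person → label 1, different people → label 0.
# Testing: compare the registered face with the new photo and threshold the score.
```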
Siamese Network - Intuitive Explanation
The CNN learns an embedding for faces: embeddings of the same person are pulled as close as possible, while embeddings of different people are pushed far away (e.g., the network learns to ignore the background).
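One common way to train such an embedding is a contrastive-style loss. This sketch assumes Euclidean distance and a margin, which the slides leave open ("what kind of distance should we use?" is discussed next):

```python
import torch

def contrastive_loss(e1, e2, same, margin=1.0):
    """Pull embeddings of the same person together, push different people
    at least `margin` apart. `same` is 1 for same-person pairs, 0 otherwise."""
    d = torch.norm(e1 - e2, dim=1)                                   # Euclidean distance
    loss_same = same * d.pow(2)                                      # as close as possible
    loss_diff = (1 - same) * torch.clamp(margin - d, min=0).pow(2)   # far away
    return (loss_same + loss_diff).mean()
```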
To learn more ...
What kind of distance should we use?
- SphereFace: Deep Hypersphere Embedding for Face Recognition
- Additive Margin Softmax for Face Verification
- ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Triplet loss:
- Deep Metric Learning using Triplet Network
- FaceNet: A Unified Embedding for Face Recognition and Clustering
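For reference, a minimal sketch of the triplet loss mentioned in the list above (PyTorch also provides torch.nn.TripletMarginLoss for the same idea); the margin value is illustrative:

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """The anchor should be closer to an embedding of the same person (positive)
    than to an embedding of anyone else (negative), by at least a margin."""
    d_pos = torch.norm(anchor - positive, dim=1)
    d_neg = torch.norm(anchor - negative, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```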
N-way Few/One-shot Learning
Example: 5-way 1-shot. Training data: one image per class (一花, 二乃, 三玖, 四葉, 五月). Testing data: one image to classify (e.g., 三玖).
The network does learning + prediction in a single pass.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", NIPS 2016.
Prototypical Network
Each training image is embedded by a CNN; the testing image is embedded as well, and similarities $s_1, s_2, s_3, s_4, s_5$ to the per-class training embeddings are computed; the class with the largest similarity is predicted.
Prototypical Networks for Few-shot Learning, NIPS 2017: https://arxiv.org/abs/1703.05175
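A sketch of the prototypical classification step, assuming a generic encoder and Euclidean distance; in the 1-shot case each prototype is simply the single support example's embedding:

```python
import torch

def prototypical_predict(encoder, support_x, support_y, query_x, n_way):
    """Prototype of each class = mean embedding of its support examples;
    the query is classified by a softmax over negative distances."""
    support_e = encoder(support_x)                       # (N_support, d)
    query_e = encoder(query_x)                           # (N_query, d)
    protos = torch.stack([support_e[support_y == c].mean(0) for c in range(n_way)])
    dists = torch.cdist(query_e, protos)                 # (N_query, n_way)
    return (-dists).softmax(dim=-1)                      # similarities s_1 ... s_n
```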
Matching Network
Unlike the Prototypical Network, the embeddings of the training examples are computed jointly (a bidirectional LSTM runs over the training set after the CNN), so the relationship among the training examples is considered; similarities $s_1, \dots, s_5$ to the testing embedding are then computed.
Matching Networks for One Shot Learning, NIPS 2016: https://arxiv.org/abs/1606.04080
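A sketch of the "consider the other training examples" idea, assuming a generic encoder; the bidirectional LSTM wiring and sizes here are illustrative rather than the paper's exact full-context embedding:

```python
import torch

def full_context_embeddings(encoder, support_x, hidden=64):
    """After the CNN, a bidirectional LSTM runs over the whole support set,
    so each training example's embedding also reflects the other examples."""
    e = encoder(support_x)                        # (N_support, d) CNN embeddings
    lstm = torch.nn.LSTM(e.shape[-1], hidden // 2, bidirectional=True)
    ctx, _ = lstm(e.unsqueeze(1))                 # treat the support set as a sequence
    return ctx.squeeze(1)                         # (N_support, hidden) context-aware embeddings
```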
Relation Network
Instead of computing similarity with a fixed metric, the feature maps of the testing example and each training example are concatenated in depth (along the channel dimension) and fed into another learned network that outputs the relation (similarity) score.
https://arxiv.org/abs/1711.06025 (CVPR 2018)
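A sketch of the learned relation score, with illustrative channel counts; the key point is the depth-wise concatenation followed by a small learned network instead of a fixed metric:

```python
import torch

class RelationModule(torch.nn.Module):
    """Concatenate the query's feature map with a support example's feature map
    along the channel (depth) dimension; a small network outputs the score."""
    def __init__(self, channels=64):
        super().__init__()
        self.g = torch.nn.Sequential(
            torch.nn.Conv2d(2 * channels, channels, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(channels, 1), torch.nn.Sigmoid(),
        )

    def forward(self, query_fmap, support_fmap):
        pair = torch.cat([query_fmap, support_fmap], dim=1)   # depth-wise concatenation
        return self.g(pair)                                   # learned similarity score
```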
Few-shot learning with imaginary data: augment the small training set with additional examples produced by a learned generator.
https://arxiv.org/abs/1801.05401