
Life Long Learning Hung-yi Lee 李宏毅 https://www.pearsonlearned.com/lifelong-learning-will-help-workers-navigate-future-work/

Can we train one and the same neural network for every homework assignment? You use the same brain to learn every lecture of this machine learning course, but for each homework you train a different neural network. Could we train the same network for all of them instead? https://www.forbes.com/sites/kpmg/2018/04/23/the-changing-nature-of-work-why-lifelong-learning-matters-more-than-ever/#4e04e90e1e95

Life Long Learning (LLL), also known as Continuous Learning, Never Ending Learning, or Incremental Learning. Learning Task 1: "I can solve task 1." Learning Task 2: "I can solve tasks 1&2." Learning Task 3: "I can solve tasks 1&2&3." … https://vlomonaco.github.io/core50/benchmarks.html

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
(Intransigence, n.: refusal to compromise or yield; here, refusing to learn the new task.)
Constraint: cannot store all the previous data

Example – Image classification. The network has 3 layers with 50 neurons each, and is trained on Task 1 and then Task 2 (both digit recognition, e.g. "This is '0'"). [Figure: after learning Task 1, accuracy is 90% on Task 1 and 96% on Task 2; after then learning Task 2, Task 2 reaches 97% but Task 1 drops to 80%. Forget!!!]

Learning Task 1 then Task 2 (sequential training): Task 1 drops from 90% to 80% (Forget!!!) while Task 2 reaches 97%. Learning Task 1 and Task 2 jointly instead: 89% on Task 1 and 98% on Task 2. The network is clearly able to learn both tasks well, so why does sequential training end up like this?!

Example – Question Answering (bAbI, https://arxiv.org/pdf/1502.05698.pdf). Given a document, answer a question based on that document. There are 20 QA tasks in the bAbI corpus; train a single QA model on the 20 tasks, one after another.

Example – Question Answering. [Figure: accuracy on each of the 20 tasks when they are trained sequentially versus jointly. Sequential training forgets the earlier tasks; joint training learns them all.] It is a case of not doing it, not of being unable to do it. Thanks to 何振豪 for providing the experimental results.

Catastrophic Forgetting

Wait a minute …… Multi-task training can solve the problem! Instead of learning Task 1 and then Task 2, use all the data (the training data of Task 1 and Task 2 together) for training. Multi-task training can be considered as the upper bound of LLL. But we would always have to keep all the data (storage issue), and re-training on the training data of Task 1, …, Task 999, Task 1000 whenever a new task arrives is a computation issue.

Elastic Weight Consolidation (EWC)
Basic idea: some parameters in the model are important to the previous tasks; only change the unimportant parameters.
$\theta^b$ is the model learned from the previous tasks. Each parameter $\theta_i^b$ has a "guard" $b_i$ that indicates how important the parameter is.
$L'(\theta) = L(\theta) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$
where $L'(\theta)$ is the loss to be optimized, $L(\theta)$ is the loss for the current task, $\theta_i$ are the parameters to be learned, and $\theta_i^b$ are the parameters learned from the previous tasks.
(Elastic: flexible, springy. Consolidation: the act of making something stronger.)

Elastic Weight Consolidation (EWC)
Basic idea: some parameters in the model are important to the previous tasks; only change the unimportant parameters.
$\theta^b$ is the model learned from the previous tasks. Each parameter $\theta_i^b$ has a "guard" $b_i$.
$L'(\theta) = L(\theta) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$
This is one kind of regularization: $\theta_i$ should stay close to $\theta_i^b$ in certain directions.
If $b_i = 0$, there is no constraint on $\theta_i$.
If $b_i = \infty$, $\theta_i$ will always be equal to $\theta_i^b$.
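To make the penalty concrete, here is a minimal PyTorch-style sketch of the regularized loss above (an illustration, not the authors' implementation); `prev_params` and `guards` are hypothetical dictionaries holding $\theta^b$ and the guards $b_i$, keyed by parameter name.

```python
import torch

def ewc_regularized_loss(task_loss, model, prev_params, guards, lam):
    """L'(theta) = L(theta) + lambda * sum_i b_i * (theta_i - theta_i^b)^2."""
    penalty = torch.zeros((), device=task_loss.device)
    for name, p in model.named_parameters():
        # elementwise squared distance to the old parameters, weighted by the guard
        penalty = penalty + (guards[name] * (p - prev_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```

Here `task_loss` is the ordinary loss $L(\theta)$ on the current task's batch, and `prev_params` / `guards` would be detached copies saved right after finishing the previous task.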

Elastic Weight Consolidation (EWC). [Figure: the error surfaces of tasks 1 & 2 in the ($\theta_1$, $\theta_2$) plane (darker = smaller loss). Plain training on task 1 moves the parameters from $\theta^0$ to $\theta^b$; training on task 2 afterwards moves them from $\theta^b$ to $\theta^*$, which has small loss on task 2 but large loss on task 1. Forget!]

Elastic Weight Consolidation (EWC). Each parameter has a "guard" $b_i$. [Figure: the task-1 loss plotted along each parameter direction around $\theta^b$. Along $\theta_1$ the second derivative is small, so $b_1$ is small: moving $\theta_1$ does little harm. Along $\theta_2$ the second derivative is large, so $b_2$ is large: moving $\theta_2$ breaks task 1.]
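The slides motivate the guard $b_i$ with the curvature (second derivative) of the task-1 loss; the original EWC paper estimates it with the diagonal Fisher information, i.e. the average squared gradient of the log-likelihood on data from the previous task. A rough sketch under that assumption (accumulating batch gradients is a simplification; per-example gradients give a closer estimate):

```python
import torch
import torch.nn.functional as F

def estimate_guards(model, data_loader):
    """Rough diagonal-Fisher estimate of the guards b_i on the previous task."""
    guards = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        # negative log-likelihood of the observed labels under the current model
        nll = F.nll_loss(F.log_softmax(model(x), dim=-1), y)
        nll.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                guards[n] += p.grad.detach() ** 2   # squared gradient ~ curvature/importance
        n_batches += 1
    return {n: g / n_batches for n, g in guards.items()}
```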

Elastic Weight Consolidation (EWC). [Figure: with the EWC penalty, training on task 2 moves the parameters from $\theta^b$ to $\theta^*$ mostly along the $\theta_1$ direction, because $b_1$ is small while $b_2$ is large ($\theta_1$ may be changed, but $\theta_2$ should be changed as little as possible). The solution $\theta^*$ is good for task 2 without leaving the low-loss region of task 1. Do not forget!]

Elastic Weight Consolidation (EWC): intransigence. [Figure: results on the MNIST permutation tasks, from the original EWC paper.]

Elastic Weight Consolidation (EWC) and related methods:
- EWC: http://www.citeulike.org/group/15400/article/14311063 (paper PDF: https://pdfs.semanticscholar.org/ee89/738fcc69212b36fb990389d817b5fb7e486b.pdf ; PyTorch demo: https://github.com/moskomule/ewc.pytorch/blob/master/demo.ipynb)
- Synaptic Intelligence (SI): https://arxiv.org/abs/1703.04200
- Memory Aware Synapses (MAS): https://arxiv.org/abs/1711.09601 (special part: does not need labelled data)
(Synaptic: relating to a synapse.)

Generating Data (https://arxiv.org/abs/1705.08690, https://arxiv.org/abs/1711.10563): conduct multi-task learning by generating pseudo-data of the previous tasks with a generative model. [Figure: after solving task 1, a generator learns to generate task-1 data; while learning task 2, the real task-2 data and the generated task-1 data are used together for multi-task learning, and the generator is updated to generate task 1&2 data.]
Shin, H., Lee, J. K., Kim, J. & Kim, J. (2017), Continual learning with deep generative replay, NIPS'17, Long Beach, CA.
Kemker, R. & Kanan, C. (2018), FearNet: Brain-inspired model for incremental learning, ICLR'18, Vancouver, Canada.
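A highly simplified sketch of that replay loop, just to make the data flow concrete; `train_solver`, `train_generator`, `generator.sample`, and `old_solver.predict` are assumed interfaces for this illustration, not anything defined in the cited papers.

```python
import copy

def learn_with_generative_replay(solver, generator, tasks, train_solver, train_generator):
    """tasks: list of datasets, each a list of (x, y) pairs, seen one at a time."""
    old_solver = None
    for task_data in tasks:
        if old_solver is None:
            mixed = list(task_data)
        else:
            # pseudo-rehearsal: sample old-task-like inputs from the generator
            # and label them with the previous solver
            pseudo_x = generator.sample(len(task_data))
            mixed = list(task_data) + [(x, old_solver.predict(x)) for x in pseudo_x]
        train_solver(solver, mixed)                         # multi-task training on real + pseudo data
        train_generator(generator, [x for x, _ in mixed])   # generator now covers all tasks so far
        old_solver = copy.deepcopy(solver)
    return solver, generator
```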

Adding New Classes Learning without forgetting (LwF) https://arxiv.org/abs/1606.09282 iCaRL: Incremental Classifier and Representation Learning https://arxiv.org/abs/1611.07725 Active Long Term Memory Networks (A-LTM) https://arxiv.org/abs/1606.02355

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Wait a minute …… Why not simply train a separate model for each task? Because knowledge cannot transfer across different tasks, and eventually we cannot store all the models …

Life-Long vs. Transfer. Transfer learning: "I can do task 2 because I have learned task 1" (fine-tune the task-1 model on task 2; we don't care whether the machine can still do task 1). Life-long learning: "Even though I have learned task 2, I do not forget task 1."

Evaluation
R_{i,j}: accuracy on task j after training on task i.

Test on:         Task 1     Task 2     …   Task T
Rand Init.       R_{0,1}    R_{0,2}    …   R_{0,T}
After Task 1     R_{1,1}    R_{1,2}    …   R_{1,T}
After Task 2     R_{2,1}    R_{2,2}    …   R_{2,T}
…
After Task T-1   R_{T-1,1}  R_{T-1,2}  …   R_{T-1,T}
After Task T     R_{T,1}    R_{T,2}    …   R_{T,T}

If i > j: R_{i,j} tells how much task j has been forgotten after training on task i.
If i < j: R_{i,j} tells whether the skill of task i transfers to task j.
Accuracy $= \frac{1}{T}\sum_{i=1}^{T} R_{T,i}$
Backward Transfer $= \frac{1}{T-1}\sum_{i=1}^{T-1} \left(R_{T,i} - R_{i,i}\right)$ (it is usually negative)
Forward Transfer $= \frac{1}{T-1}\sum_{i=2}^{T} \left(R_{i-1,i} - R_{0,i}\right)$
(These metrics are used in GEM.)
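As a sanity check on the indexing, a small numpy sketch that computes these three numbers from an accuracy matrix laid out exactly like the table above (row 0 is the randomly initialized model, row i is the model after training on task i); the function name is mine.

```python
import numpy as np

def lll_metrics(R):
    """R has shape (T+1, T): R[i, j] = accuracy on task j+1 after training task i
    (row 0 = random initialization)."""
    T = R.shape[1]
    accuracy = R[T].mean()                                             # (1/T) sum_i R_{T,i}
    backward = np.mean([R[T, j] - R[j + 1, j] for j in range(T - 1)])  # R_{T,i} - R_{i,i}
    forward  = np.mean([R[j, j] - R[0, j] for j in range(1, T)])       # R_{i-1,i} - R_{0,i}
    return accuracy, backward, forward
```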

Gradient Episodic Memory (GEM) (A-GEM: https://arxiv.org/abs/1812.00420). Constrain the gradient update so that it does not hurt, and can even improve, the previous tasks. $g$: the negative gradient of the current task (the proposed update direction). $g^1$, $g^2$: the negative gradients of the previous tasks, computed on a small amount of stored data from those tasks. Find an update direction $g'$ as close as possible to $g$ such that $g' \cdot g^1 \ge 0$ and $g' \cdot g^2 \ge 0$. [Figure: if $g$ conflicts with $g^2$, it is modified into $g'$ so that both inner products are non-negative.] Note: this needs (a small amount of) data from the previous tasks. (Episodic: made up of separate episodes.)
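GEM itself finds $g'$ by solving a small quadratic program over the stored task gradients; the A-GEM variant linked above keeps a single averaged reference gradient $g_{ref}$ from the episodic memory and, when the proposed update conflicts with it, projects the update back onto the constraint $g' \cdot g_{ref} \ge 0$. A sketch of that projection (function and variable names are mine):

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM-style projection: g and g_ref are flattened (negative) gradients.

    If g already agrees with the reference gradient computed on stored samples
    from previous tasks, keep it; otherwise project it onto the half-space
    g' . g_ref >= 0."""
    dot = np.dot(g, g_ref)
    if dot >= 0:
        return g
    return g - (dot / np.dot(g_ref, g_ref)) * g_ref
```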

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Progressive Neural Networks (https://arxiv.org/abs/1606.04671, https://arxiv.org/pdf/1606.04671v3.pdf). [Figure: a separate column of layers is added for each new task (Task 1, Task 2, Task 3), each with its own input; the columns for later tasks receive lateral connections from the hidden layers of the earlier, frozen columns.]
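For illustration only, a minimal sketch of one new column with lateral connections from the frozen earlier columns; the layer sizes and names are made up, and the architecture in the paper additionally uses adapter layers and convolutional columns.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One new column of a (much simplified) progressive network."""

    def __init__(self, in_dim, hidden_dim, out_dim, n_prev_columns):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        # one lateral connection per frozen previous column's first hidden layer
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_prev_columns)])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, prev_h1_list):
        """prev_h1_list: first-hidden-layer activations of the frozen old columns."""
        h1 = torch.relu(self.l1(x))
        h2 = self.l2(h1)
        for lateral, prev_h1 in zip(self.laterals, prev_h1_list):
            h2 = h2 + lateral(prev_h1)   # reuse features learned on earlier tasks
        return self.out(torch.relu(h2))
```

Only the new column's parameters are trained on the new task; the old columns stay frozen, so what was learned before cannot be forgotten.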

Expert Gate (https://arxiv.org/abs/1611.06194). Aljundi, R., Chakravarty, P., Tuytelaars, T.: Expert Gate: Lifelong learning with a network of experts. In: CVPR (2017).

Net2Net (https://arxiv.org/abs/1511.05641): grow the network by copying existing neurons and adding some small noise, so the wider network starts out computing (almost) the same function. Expand the network only when the training accuracy on the current task is not good enough (https://arxiv.org/abs/1811.07017).
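A small numpy sketch of the Net2Net "wider" operation for one fully-connected hidden layer: new units are copies of randomly chosen existing units, their outgoing weights are divided by the number of copies so the function is preserved, and a little noise is added so the copies can diverge during training. Shapes and names here are illustrative.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, noise=1e-3, seed=0):
    """Widen one hidden layer from h to new_width units (new_width >= h).

    W1: (in_dim, h) incoming weights, b1: (h,) biases, W2: (h, out_dim)
    outgoing weights of the layer being widened."""
    rng = np.random.default_rng(seed)
    h = W1.shape[1]
    # each new unit replicates a randomly chosen existing unit
    mapping = np.concatenate([np.arange(h), rng.integers(0, h, new_width - h)])
    counts = np.bincount(mapping, minlength=h)          # how many copies of each unit

    W1_new = W1[:, mapping] + rng.normal(0.0, noise, (W1.shape[0], new_width))
    b1_new = b1[mapping]
    W2_new = W2[mapping, :] / counts[mapping][:, None]  # split outgoing weights
    return W1_new, b1_new, W2_new
```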

Concluding Remarks
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Curriculum Learning: what is the proper learning order? [Example: learning Task 1 then Task 2, Task 1 drops from 90% to 80% (Forget!!!) while Task 2 reaches 97% (it was already at 96% after Task 1). Learning Task 2 then Task 1, Task 1 reaches 97% (from 62% before it is trained) while Task 2 only drops from 97% to 90%.]

Taskonomy = task + taxonomy. http://taskonomy.stanford.edu/#abstract