
Life Long Learning Hung-yi Lee 李宏毅 https://www.pearsonlearned.com/lifelong-learning-will-help-workers-navigate-future-work/

Can we train one and the same neural network for every homework assignment? You use the same brain to learn every lecture of this machine learning course, but for each homework you train a different neural network. Could we train the same network for all of them instead? https://www.forbes.com/sites/kpmg/2018/04/23/the-changing-nature-of-work-why-lifelong-learning-matters-more-than-ever/#4e04e90e1e95

Life Long Learning (LLL), also known as Continuous Learning, Never Ending Learning, or Incremental Learning. Learning Task 1: "I can solve task 1." Learning Task 2: "I can solve tasks 1&2." Learning Task 3: "I can solve tasks 1&2&3." … https://vlomonaco.github.io/core50/benchmarks.html

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
(Intransigence, n.: refusal to compromise or yield; here, refusing to learn the new task.)
Constraint: cannot store all the previous data

Example – Image classification. The network has 3 layers with 50 neurons each, and is trained on Task 1 and then Task 2 (both digit recognition, e.g. "This is '0'"). [Figure: after learning Task 1, accuracy is 90% on Task 1 and 96% on Task 2; after then learning Task 2, Task 2 reaches 97% but Task 1 drops to 80%. Forget!!!]

Learning Task 1 then Task 2 (sequential training): Task 1 drops from 90% to 80% (Forget!!!) while Task 2 reaches 97%. Learning Task 1 and Task 2 jointly instead: 89% on Task 1 and 98% on Task 2. The network is clearly able to learn both tasks well, so why does sequential training end up like this?!

Example – Question Answering (bAbI, https://arxiv.org/pdf/1502.05698.pdf). Given a document, answer a question based on that document. There are 20 QA tasks in the bAbI corpus; train a single QA model on the 20 tasks, one after another.

Example – Question Answering. [Figure: accuracy on each of the 20 tasks when they are trained sequentially versus jointly. Sequential training forgets the earlier tasks; joint training learns them all.] It is a case of not doing it, not of being unable to do it. Thanks to 何振豪 for providing the experimental results.

Catastrophic Forgetting

Wait a minute …… Multi-task training can solve the problem! Instead of learning Task 1 and then Task 2, use all the data (the training data of Task 1 and Task 2 together) for training. Multi-task training can be considered as the upper bound of LLL. But we would always have to keep all the data (storage issue), and re-training on the training data of Task 1, …, Task 999, Task 1000 whenever a new task arrives is a computation issue.

Elastic Weight Consolidation (EWC)
Basic idea: some parameters in the model are important to the previous tasks; only change the unimportant parameters.
$\theta^b$ is the model learned from the previous tasks. Each parameter $\theta_i^b$ has a "guard" $b_i$ that indicates how important the parameter is.
$L'(\theta) = L(\theta) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$
where $L'(\theta)$ is the loss to be optimized, $L(\theta)$ is the loss for the current task, $\theta_i$ are the parameters to be learned, and $\theta_i^b$ are the parameters learned from the previous tasks.
(Elastic: flexible, springy. Consolidation: the act of making something stronger.)

Elastic Weight Consolidation (EWC)
Basic idea: some parameters in the model are important to the previous tasks; only change the unimportant parameters.
$\theta^b$ is the model learned from the previous tasks. Each parameter $\theta_i^b$ has a "guard" $b_i$.
$L'(\theta) = L(\theta) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$
This is one kind of regularization: $\theta_i$ should stay close to $\theta_i^b$ in certain directions.
If $b_i = 0$, there is no constraint on $\theta_i$.
If $b_i = \infty$, $\theta_i$ will always be equal to $\theta_i^b$.
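To make the penalty concrete, here is a minimal PyTorch-style sketch of the regularized loss above (an illustration, not the authors' implementation); `prev_params` and `guards` are hypothetical dictionaries holding $\theta^b$ and the guards $b_i$, keyed by parameter name.

```python
import torch

def ewc_regularized_loss(task_loss, model, prev_params, guards, lam):
    """L'(theta) = L(theta) + lambda * sum_i b_i * (theta_i - theta_i^b)^2."""
    penalty = torch.zeros((), device=task_loss.device)
    for name, p in model.named_parameters():
        # elementwise squared distance to the old parameters, weighted by the guard
        penalty = penalty + (guards[name] * (p - prev_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```

Here `task_loss` is the ordinary loss $L(\theta)$ on the current task's batch, and `prev_params` / `guards` would be detached copies saved right after finishing the previous task.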

Elastic Weight Consolidation (EWC). [Figure: the error surfaces of tasks 1 & 2 in the ($\theta_1$, $\theta_2$) plane (darker = smaller loss). Plain training on task 1 moves the parameters from $\theta^0$ to $\theta^b$; training on task 2 afterwards moves them from $\theta^b$ to $\theta^*$, which has small loss on task 2 but large loss on task 1. Forget!]

Elastic Weight Consolidation (EWC). Each parameter has a "guard" $b_i$. [Figure: the task-1 loss plotted along each parameter direction around $\theta^b$. Along $\theta_1$ the second derivative is small, so $b_1$ is small: moving $\theta_1$ does little harm. Along $\theta_2$ the second derivative is large, so $b_2$ is large: moving $\theta_2$ breaks task 1.]
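The slides motivate the guard $b_i$ with the curvature (second derivative) of the task-1 loss; the original EWC paper estimates it with the diagonal Fisher information, i.e. the average squared gradient of the log-likelihood on data from the previous task. A rough sketch under that assumption (accumulating batch gradients is a simplification; per-example gradients give a closer estimate):

```python
import torch
import torch.nn.functional as F

def estimate_guards(model, data_loader):
    """Rough diagonal-Fisher estimate of the guards b_i on the previous task."""
    guards = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        # negative log-likelihood of the observed labels under the current model
        nll = F.nll_loss(F.log_softmax(model(x), dim=-1), y)
        nll.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                guards[n] += p.grad.detach() ** 2   # squared gradient ~ curvature/importance
        n_batches += 1
    return {n: g / n_batches for n, g in guards.items()}
```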

Elastic Weight Consolidation (EWC). [Figure: with the EWC penalty, training on task 2 moves the parameters from $\theta^b$ to $\theta^*$ mostly along the $\theta_1$ direction, because $b_1$ is small while $b_2$ is large ($\theta_1$ may be changed, but $\theta_2$ should be changed as little as possible). The solution $\theta^*$ is good for task 2 without leaving the low-loss region of task 1. Do not forget!]

Elastic Weight Consolidation (EWC): intransigence. [Figure: results on the MNIST permutation tasks, from the original EWC paper.]

Elastic Weight Consolidation (EWC) and related methods:
- EWC: http://www.citeulike.org/group/15400/article/14311063 (paper PDF: https://pdfs.semanticscholar.org/ee89/738fcc69212b36fb990389d817b5fb7e486b.pdf ; PyTorch demo: https://github.com/moskomule/ewc.pytorch/blob/master/demo.ipynb)
- Synaptic Intelligence (SI): https://arxiv.org/abs/1703.04200
- Memory Aware Synapses (MAS): https://arxiv.org/abs/1711.09601 (special part: does not need labelled data)
(Synaptic: relating to a synapse.)

Generating Data (https://arxiv.org/abs/1705.08690, https://arxiv.org/abs/1711.10563): conduct multi-task learning by generating pseudo-data of the previous tasks with a generative model. [Figure: after solving task 1, a generator learns to generate task-1 data; while learning task 2, the real task-2 data and the generated task-1 data are used together for multi-task learning, and the generator is updated to generate task 1&2 data.]
Shin, H., Lee, J. K., Kim, J. & Kim, J. (2017), Continual learning with deep generative replay, NIPS'17, Long Beach, CA.
Kemker, R. & Kanan, C. (2018), FearNet: Brain-inspired model for incremental learning, ICLR'18, Vancouver, Canada.
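A highly simplified sketch of that replay loop, just to make the data flow concrete; `train_solver`, `train_generator`, `generator.sample`, and `old_solver.predict` are assumed interfaces for this illustration, not anything defined in the cited papers.

```python
import copy

def learn_with_generative_replay(solver, generator, tasks, train_solver, train_generator):
    """tasks: list of datasets, each a list of (x, y) pairs, seen one at a time."""
    old_solver = None
    for task_data in tasks:
        if old_solver is None:
            mixed = list(task_data)
        else:
            # pseudo-rehearsal: sample old-task-like inputs from the generator
            # and label them with the previous solver
            pseudo_x = generator.sample(len(task_data))
            mixed = list(task_data) + [(x, old_solver.predict(x)) for x in pseudo_x]
        train_solver(solver, mixed)                         # multi-task training on real + pseudo data
        train_generator(generator, [x for x, _ in mixed])   # generator now covers all tasks so far
        old_solver = copy.deepcopy(solver)
    return solver, generator
```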

Adding New Classes Learning without forgetting (LwF) https://arxiv.org/abs/1606.09282 iCaRL: Incremental Classifier and Representation Learning https://arxiv.org/abs/1611.07725 Active Long Term Memory Networks (A-LTM) https://arxiv.org/abs/1606.02355

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Wait a minute …… Why not simply train a separate model for each task? Because knowledge cannot transfer across different tasks, and eventually we cannot store all the models …

Life-Long vs. Transfer. Transfer learning: "I can do task 2 because I have learned task 1" (fine-tune the task-1 model on task 2; we don't care whether the machine can still do task 1). Life-long learning: "Even though I have learned task 2, I do not forget task 1."

Evaluation
R_{i,j}: accuracy on task j after training on task i.

Test on:         Task 1     Task 2     …   Task T
Rand Init.       R_{0,1}    R_{0,2}    …   R_{0,T}
After Task 1     R_{1,1}    R_{1,2}    …   R_{1,T}
After Task 2     R_{2,1}    R_{2,2}    …   R_{2,T}
…
After Task T-1   R_{T-1,1}  R_{T-1,2}  …   R_{T-1,T}
After Task T     R_{T,1}    R_{T,2}    …   R_{T,T}

If i > j: R_{i,j} tells how much task j has been forgotten after training on task i.
If i < j: R_{i,j} tells whether the skill of task i transfers to task j.
Accuracy $= \frac{1}{T}\sum_{i=1}^{T} R_{T,i}$
Backward Transfer $= \frac{1}{T-1}\sum_{i=1}^{T-1} \left(R_{T,i} - R_{i,i}\right)$ (it is usually negative)
Forward Transfer $= \frac{1}{T-1}\sum_{i=2}^{T} \left(R_{i-1,i} - R_{0,i}\right)$
(These metrics are used in GEM.)
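As a sanity check on the indexing, a small numpy sketch that computes these three numbers from an accuracy matrix laid out exactly like the table above (row 0 is the randomly initialized model, row i is the model after training on task i); the function name is mine.

```python
import numpy as np

def lll_metrics(R):
    """R has shape (T+1, T): R[i, j] = accuracy on task j+1 after training task i
    (row 0 = random initialization)."""
    T = R.shape[1]
    accuracy = R[T].mean()                                             # (1/T) sum_i R_{T,i}
    backward = np.mean([R[T, j] - R[j + 1, j] for j in range(T - 1)])  # R_{T,i} - R_{i,i}
    forward  = np.mean([R[j, j] - R[0, j] for j in range(1, T)])       # R_{i-1,i} - R_{0,i}
    return accuracy, backward, forward
```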

Gradient Episodic Memory (GEM) (A-GEM: https://arxiv.org/abs/1812.00420). Constrain the gradient update so that it does not hurt, and can even improve, the previous tasks. $g$: the negative gradient of the current task (the proposed update direction). $g^1$, $g^2$: the negative gradients of the previous tasks, computed on a small amount of stored data from those tasks. Find an update direction $g'$ as close as possible to $g$ such that $g' \cdot g^1 \ge 0$ and $g' \cdot g^2 \ge 0$. [Figure: if $g$ conflicts with $g^2$, it is modified into $g'$ so that both inner products are non-negative.] Note: this needs (a small amount of) data from the previous tasks. (Episodic: made up of separate episodes.)
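GEM itself finds $g'$ by solving a small quadratic program over the stored task gradients; the A-GEM variant linked above keeps a single averaged reference gradient $g_{ref}$ from the episodic memory and, when the proposed update conflicts with it, projects the update back onto the constraint $g' \cdot g_{ref} \ge 0$. A sketch of that projection (function and variable names are mine):

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM-style projection: g and g_ref are flattened (negative) gradients.

    If g already agrees with the reference gradient computed on stored samples
    from previous tasks, keep it; otherwise project it onto the half-space
    g' . g_ref >= 0."""
    dot = np.dot(g, g_ref)
    if dot >= 0:
        return g
    return g - (dot / np.dot(g_ref, g_ref)) * g_ref
```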

Life-long Learning
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Progressive Neural Networks (https://arxiv.org/abs/1606.04671, https://arxiv.org/pdf/1606.04671v3.pdf). [Figure: a separate column of layers is added for each new task (Task 1, Task 2, Task 3), each with its own input; the columns for later tasks receive lateral connections from the hidden layers of the earlier, frozen columns.]
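For illustration only, a minimal sketch of one new column with lateral connections from the frozen earlier columns; the layer sizes and names are made up, and the architecture in the paper additionally uses adapter layers and convolutional columns.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One new column of a (much simplified) progressive network."""

    def __init__(self, in_dim, hidden_dim, out_dim, n_prev_columns):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        # one lateral connection per frozen previous column's first hidden layer
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_prev_columns)])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, prev_h1_list):
        """prev_h1_list: first-hidden-layer activations of the frozen old columns."""
        h1 = torch.relu(self.l1(x))
        h2 = self.l2(h1)
        for lateral, prev_h1 in zip(self.laterals, prev_h1_list):
            h2 = h2 + lateral(prev_h1)   # reuse features learned on earlier tasks
        return self.out(torch.relu(h2))
```

Only the new column's parameters are trained on the new task; the old columns stay frozen, so what was learned before cannot be forgotten.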

Expert Gate (https://arxiv.org/abs/1611.06194). Aljundi, R., Chakravarty, P., Tuytelaars, T.: Expert Gate: Lifelong learning with a network of experts. In: CVPR (2017).

Net2Net (https://arxiv.org/abs/1511.05641): grow the network by copying existing neurons and adding some small noise, so the wider network starts out computing (almost) the same function. Expand the network only when the training accuracy on the current task is not good enough (https://arxiv.org/abs/1811.07017).
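A small numpy sketch of the Net2Net "wider" operation for one fully-connected hidden layer: new units are copies of randomly chosen existing units, their outgoing weights are divided by the number of copies so the function is preserved, and a little noise is added so the copies can diverge during training. Shapes and names here are illustrative.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, noise=1e-3, seed=0):
    """Widen one hidden layer from h to new_width units (new_width >= h).

    W1: (in_dim, h) incoming weights, b1: (h,) biases, W2: (h, out_dim)
    outgoing weights of the layer being widened."""
    rng = np.random.default_rng(seed)
    h = W1.shape[1]
    # each new unit replicates a randomly chosen existing unit
    mapping = np.concatenate([np.arange(h), rng.integers(0, h, new_width - h)])
    counts = np.bincount(mapping, minlength=h)          # how many copies of each unit

    W1_new = W1[:, mapping] + rng.normal(0.0, noise, (W1.shape[0], new_width))
    b1_new = b1[mapping]
    W2_new = W2[mapping, :] / counts[mapping][:, None]  # split outgoing weights
    return W1_new, b1_new, W2_new
```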

Concluding Remarks
- Knowledge Retention: but NOT Intransigence
- Knowledge Transfer
- Model Expansion: but Parameter Efficiency
Constraint: cannot store all the previous data

Curriculum Learning: what is the proper learning order? [Example: learning Task 1 then Task 2, Task 1 drops from 90% to 80% (Forget!!!) while Task 2 reaches 97% (it was already at 96% after Task 1). Learning Task 2 then Task 1, Task 1 reaches 97% (from 62% before it is trained) while Task 2 only drops from 97% to 90%.]

Taskonomy = task + taxonomy. http://taskonomy.stanford.edu/#abstract