Incremental Training of Deep Convolutional Neural Networks

Presentation transcript:

Incremental Training of Deep Convolutional Neural Networks. R. Istrate, A. C. I. Malossi, C. Bekas, and D. Nikolopoulos. arXiv:1803.10232v1, 27 Mar 2018.

Depth trade-off The depth of a deep neural network determines its capacity, so how should we choose the depth of our network? A deep network has high capacity but requires a lot of resources; a shallow network has limited capacity but converges quickly. Existing solution: grid search. Disadvantage: only late in the process do we learn whether a network is well suited for the dataset.

Methodology Consider a generic CNN 𝒩 composed of n layers. Partition 𝒩 into K sub-networks S_k, k = 1, …, K, with K ≤ n. Each sub-network must contain learnable parameters; no sub-network consists only of pooling and dropout layers.
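As a rough illustration (not the authors' code), the partition can be expressed as splitting an ordered layer list into blocks while checking that every block has trainable parameters; the layer list, split points, and function name below are assumptions of this sketch.

```python
import torch.nn as nn

def partition_network(layers, boundaries):
    """Split an ordered layer list into sub-networks S_1..S_K.

    `boundaries` gives the index of the last layer of each sub-network;
    every sub-network must contain at least one learnable layer, i.e.
    none may consist only of pooling/dropout.
    """
    subnets, start = [], 0
    for end in boundaries:
        block = nn.Sequential(*layers[start:end + 1])
        assert any(p.requires_grad for p in block.parameters()), \
            "sub-network has no learnable parameters"
        subnets.append(block)
        start = end + 1
    return subnets

# Illustrative VGG-like layer list split into K = 2 sub-networks.
layers = [
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
]
S = partition_network(layers, boundaries=[2, 5])
```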

Methodology Training starts with sub-network S_1. To decide when it is time to insert the second sub-network S_2 between S_1 and the classifier, we compute the improvement in validation accuracy every window-size (ws) epochs. When the observed improvement falls below a fixed threshold, we stop training the current configuration and increase the network depth by adding the next sub-network.

Methodology

Criterion for ending training Every ws epochs, compute the angle α between the x-axis and the linear approximation of the last ws validation-accuracy points. Training of the current configuration stops when α_i ≤ γ·α_{i−1}, where γ is a predefined threshold and α_i is the angle characterizing the accuracy in the i-th window.
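A minimal sketch of this criterion, assuming the angle comes from a least-squares line fitted to the last ws accuracy points (the helper names are mine, not the paper's):

```python
import numpy as np

def window_angle(acc_history, ws):
    """Angle (radians) between the x-axis and the least-squares line
    fitted to the last `ws` validation-accuracy points."""
    y = np.asarray(acc_history[-ws:], dtype=float)
    x = np.arange(len(y))
    slope, _ = np.polyfit(x, y, deg=1)
    return float(np.arctan(slope))

def should_add_subnetwork(acc_history, ws, gamma, prev_angle):
    """Stop the current configuration when alpha_i <= gamma * alpha_{i-1}."""
    angle = window_angle(acc_history, ws)
    stop = prev_angle is not None and angle <= gamma * prev_angle
    return stop, angle
```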

Look-ahead initialization When a new sub-network S_{k+1} is inserted into the current architecture, its weights need to be initialized. Random initialization performs poorly in practice. Look-ahead initialization: fix the earlier sub-networks S_1, …, S_k and train only S_{k+1} for a few epochs. Because the look-ahead network is much shallower than the final network, this extra training is not considered expensive.
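One way this look-ahead step could be realized in PyTorch is sketched below: freeze S_1, …, S_k, train only the new sub-network and the classifier for a few epochs, then unfreeze. Names, optimizer, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def lookahead_init(frozen_subnets, new_subnet, classifier, loader,
                   loss_fn, epochs=2, lr=1e-3):
    """Warm up a newly inserted sub-network while S_1..S_k stay fixed."""
    for block in frozen_subnets:
        for p in block.parameters():
            p.requires_grad = False            # earlier sub-networks are frozen

    params = list(new_subnet.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)

    for _ in range(epochs):                    # only a few epochs are needed
        for x, y in loader:
            with torch.no_grad():              # frozen feature extractor
                h = x
                for block in frozen_subnets:
                    h = block(h)
            loss = loss_fn(classifier(new_subnet(h)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    for block in frozen_subnets:               # unfreeze for joint training
        for p in block.parameters():
            p.requires_grad = True
```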

Experiments Dataset: CIFAR-10. Base networks: VGGNet and ResNet.

Experiments Look-ahead initialization reduces the drop in validation accuracy when a new sub-network is added. For the same resource budget, incremental training outperforms the baseline model.

Experiments

Experiments

Conclusion Incremental learning in this work is not learning over a streaming dataset but learning the depth of the network; it can, however, be applied easily to the online-learning setting. The maximum depth of the model does not need to be fixed before training starts. Using equivalent blocks of ResNet or VGGNet, we can keep attaching new sub-networks until the model converges.

Online Deep Learning: Learning Deep Neural Networks on the Fly. D. Sahoo, Q. Pham, J. Lu, S. C. H. Hoi. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). arXiv:1711.03705, 10 Nov 2017.

Online deep learning In many applications data arrives sequentially as a stream and may be too large to store in memory; moreover, the data may exhibit concept drift. Online learning: the class of algorithms that optimize predictive models over a stream of data instances, one instance at a time.

Online deep learning for deep networks Previous online-learning methods focus on linear and kernel (two-layer) models. The key point is that in online learning the data is small at first and grows gradually, which raises the depth trade-off again: how should we choose the depth? Existing explicit and implicit methods make predictions only from the last hidden layer, which hinders learning of the lower layers' weights.

Proposed model Model parameters: W, Θ, α. The prediction is the α-weighted sum of classifiers attached to each hidden layer.
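A toy PyTorch module illustrating this structure: every hidden layer gets its own classifier head, and the final prediction is the α-weighted sum of the per-layer class probabilities. The module, its names, and the use of L (rather than L+1) heads are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HedgedMLP(nn.Module):
    """L hidden layers, each with its own classifier head; the prediction
    is the alpha-weighted sum of the per-layer class probabilities."""
    def __init__(self, in_dim, hidden_dim, n_classes, L):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * L
        self.hidden = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(L))
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, n_classes) for _ in range(L))
        self.register_buffer("alpha", torch.full((L,), 1.0 / L))

    def forward(self, x):
        h, per_layer = x, []
        for layer, head in zip(self.hidden, self.heads):
            h = torch.relu(layer(h))                       # W: feature layers
            per_layer.append(F.softmax(head(h), dim=-1))   # Theta: classifiers
        weighted = sum(a * p for a, p in zip(self.alpha, per_layer))
        return weighted, per_layer
```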

Hedge Algorithm Algorithm for learning α (Freund and Schapire, 1997), as used in AdaBoost. Loss function: ℒ(F(x), y) = Σ_{l=0}^{L} α^(l) ℒ(f^(l)(x), y). Initialize α_0^(l) uniformly, i.e. α_0^(l) = 1/(L+1). At every iteration t, update α^(l) as α_{t+1}^(l) ← α_t^(l) · β^{ℒ(f^(l)(x), y)}, where β ∈ (0, 1) is the discount-rate parameter and ℒ(f^(l)(x), y) ∈ (0, 1). Finally, normalize so that Σ_{l=0}^{L} α_{t+1}^(l) = 1.
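The α update itself, written out in numpy; `losses[l]` is assumed to already be ℒ(f^(l)(x), y) clipped into (0, 1):

```python
import numpy as np

def hedge_update(alpha, losses, beta=0.99):
    """One Hedge step: discount each classifier by beta**loss, then renormalize."""
    alpha = alpha * beta ** np.asarray(losses, dtype=float)
    return alpha / alpha.sum()

# Example with L + 1 = 4 classifiers, uniformly initialized.
alpha = np.full(4, 1 / 4)
alpha = hedge_update(alpha, losses=[0.9, 0.6, 0.4, 0.8])
```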

Hedge Algorithm Hedge enjoys a regret of R_T ≤ √(T ln N), where N is the number of experts (Freund and Schapire, 1999), which in this case is the network depth. Since shallow models tend to converge faster than deep ones, a plain hedging strategy would drive the α weights of the deeper classifiers down to very small values. To alleviate this, a smoothing parameter s ∈ (0, 1) sets a minimum weight for each classifier: α_t^(l) ← max(α_t^(l), s/L).
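Continuing the numpy sketch, the smoothing step could look like this (renormalizing afterwards so the weights still sum to 1 is my assumption):

```python
def smooth_alpha(alpha, s, L):
    """Keep each classifier's weight at least s / L, then renormalize."""
    alpha = np.maximum(alpha, s / L)
    return alpha / alpha.sum()
```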

Online Deep Learning using HBP W and Θ are learned with standard backpropagation, driven by the α-weighted loss defined above.
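Putting the pieces together, one possible HBP-style online update on the HedgedMLP sketch above: backpropagate the α-weighted sum of the per-classifier losses to update W and Θ, then apply the Hedge and smoothing updates to α. This is a sketch under the same assumptions as the earlier snippets, not the authors' code.

```python
def hbp_step(model, optimizer, x, y, beta=0.99, s=0.2):
    """One online update for a single mini-batch or instance (x, y)."""
    weighted_pred, per_layer = model(x)
    losses = [F.nll_loss(torch.log(p + 1e-12), y) for p in per_layer]

    # Backpropagation for W and Theta: alpha-weighted sum of classifier losses.
    total = sum(a * l for a, l in zip(model.alpha, losses))
    optimizer.zero_grad()
    total.backward()
    optimizer.step()

    # Hedge + smoothing update for alpha (losses clipped into (0, 1)).
    with torch.no_grad():
        clipped = torch.clamp(torch.stack(losses), 0.0, 1.0)
        alpha = model.alpha * beta ** clipped
        alpha = torch.clamp(alpha, min=s / len(losses))
        model.alpha.copy_(alpha / alpha.sum())
    return weighted_pred, float(total)
```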

Contributions Dynamic objective: a dynamically adaptive objective function mitigates the impact of vanishing gradients and helps escape saddle points and local minima. Also discussed: student-teacher learning, ensembles, concept drift, and convolutional networks.

Experiments - Datasets

Experiments – Traditional Online BP

Experiments – Comparison

Experiments – Convergence speed

Experiments – Evolution of weight 𝜶

Experiments – Robust to the Base Net