Presentation transcript:

Adam: A Method For Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017

Outline
- What is Adam
- The optimization algorithm
  - Bias correction
  - Bounded update
- Relations with other approaches: RMSProp, AdaGrad
- Why use Adam
- Applications / Experiments
- References

What is Adam
Adaptive moment estimation: it maintains estimates of the 1st and 2nd moments of the gradient. Cited 4172 times since 2014. Before the algorithm itself, a brief review of momentum.

Momentum

Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
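A minimal NumPy sketch of SGD with classical momentum as described above (the function name, hyperparameter values, and toy quadratic objective are illustrative assumptions):

import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    # velocity is an exponentially decaying accumulation of past gradients:
    # opposing gradients cancel out (damping oscillations),
    # consistent gradients add up (building speed)
    velocity = gamma * velocity - lr * grad
    return theta + velocity, velocity

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta, velocity = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=theta)
print(theta)   # approaches the minimum at [0, 0]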

The optimization algorithm
Our goal is to minimize the cost function and solve for the parameters.

The optimization algorithm-cont.
Input: $\alpha$ (stepsize); $\beta_1, \beta_2 \in [0, 1)$ (exponential decay rates); $\epsilon$ (small constant)
Initialize: $m_0 \leftarrow 0$ (1st moment vector); $v_0 \leftarrow 0$ (2nd moment vector); $\theta_0$ (initial parameters)
Update, for $t = 1, 2, \dots$:
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (bias correction)
$\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
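A minimal NumPy sketch of this update loop (the function names and the toy quadratic objective are illustrative assumptions, not from the paper):

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # 1st moment vector
    v = np.zeros_like(theta)   # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad_fn(theta)                        # g_t = gradient at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g           # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g       # biased 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)              # bias correction
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta
print(adam(lambda th: 2.0 * th, [1.0, -2.0], steps=5000))   # ends up near [0, 0]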

The optimization algorithm-cont.
Bias correction. Why do we need the correction? Because the moment vectors are initialized at zero ($m_0 \leftarrow 0$, $v_0 \leftarrow 0$), the estimates $m_t$ and $v_t$ are biased towards zero during the first steps, while what we really want are estimates of the expected values $E[g_t]$ and $E[g_t^2]$.

The optimization algorithm-cont.
Bias correction: derivation of the bias term for the 2nd moment. Unrolling the recursion gives $v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$, so $E[v_t] = E[g_t^2] \cdot (1 - \beta_2^t) + \zeta$, where $\zeta$ is a small value (zero when $E[g_i^2]$ is stationary). Dividing by $(1 - \beta_2^t)$ removes this initialization bias; the same argument applies to the 1st moment.
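A small numeric illustration of the bias (the constant-gradient setup is an assumed example): with a constant gradient $g$, the raw estimate $v_t$ badly underestimates $g^2$ during the first steps, and dividing by $(1 - \beta_2^t)$ recovers it exactly.

beta2, g = 0.999, 3.0
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g * g
    print(t, v, v / (1 - beta2 ** t))   # raw v_t vs. bias-corrected value (= 9.0)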

The optimization algorithm-cont.
Bounded update. The effective step is $\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. It behaves like a trust region: since $\hat{m}_t / \sqrt{\hat{v}_t} \approx E[g_t] / \sqrt{E[g_t^2]} \le 1$, the magnitude of each update is bounded by roughly the stepsize $\alpha$.
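A quick sanity check of this bound under an assumed constant-gradient toy setting (not from the slides): the bias-corrected step never exceeds $\alpha$ in magnitude.

import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, g = 0.0, 0.0, 5.0                   # constant gradient g
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    delta = alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    assert abs(delta) <= alpha            # effective step bounded by the stepsize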

Relations to other adaptive approaches
RMSProp: $E[g_t^2] = \gamma E[g_{t-1}^2] + (1 - \gamma) g_t^2$, $\quad \theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{E[g_t^2]} + \epsilon} \, g_t$
RMSProp lacks a bias-correction term, which can lead to very large stepsizes early in training and will probably cause divergence. RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient.
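For comparison, a minimal NumPy sketch of the plain RMSProp update above (function and variable names are assumptions):

import numpy as np

def rmsprop_step(theta, sq_avg, grad, alpha=0.001, gamma=0.9, eps=1e-8):
    # exponential moving average of squared gradients, with no bias correction
    sq_avg = gamma * sq_avg + (1 - gamma) * grad * grad
    theta = theta - alpha * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg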

Relations to other adaptive approaches-cont.
AdaGrad: $\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\textstyle\sum_{i=1}^{t} g_i^2}$
In Adam, $\hat{v}_t = \dfrac{1 - \beta_2}{1 - \beta_2^t} \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$, and $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \sum_{i=1}^{t} g_i^2$.
With $\beta_1 = 0$, $\hat{m}_t = \dfrac{1 - \beta_1}{1 - \beta_1^t} \sum_{i=1}^{t} \beta_1^{\,t-i} g_i = g_t$.
So with $\beta_1 = 0$, $\beta_2 \to 1$, and an annealed stepsize $\alpha \cdot t^{-1/2}$, the Adam update becomes $\theta_{t+1} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t / \sqrt{t^{-1} \sum_{i=1}^{t} g_i^2} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$, i.e. exactly AdaGrad.
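And a matching AdaGrad sketch, which accumulates the full sum of squared gradients instead of an exponential average (names are assumptions):

import numpy as np

def adagrad_step(theta, sq_sum, grad, alpha=0.01, eps=1e-8):
    sq_sum = sq_sum + grad * grad                        # running sum of g_i^2
    theta = theta - alpha * grad / (np.sqrt(sq_sum) + eps)
    return theta, sq_sum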

Why use Adam
- Works well with sparse gradients
- Hyperparameters have intuitive interpretations and typically require little tuning
- Used quite a lot in large and complex networks

Adam in TensorFlow
# Adam (TensorFlow 1.x API)
adam_opt = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
Default values: learning_rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
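A minimal usage sketch with the same optimizer (the toy quadratic loss and variable names are illustrative assumptions):

import tensorflow as tf

w = tf.Variable([1.0, -2.0])
loss = tf.reduce_sum(tf.square(w))         # toy quadratic loss
adam_opt = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
train_op = adam_opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)
    print(sess.run(w))                     # w driven towards [0, 0]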

Experiments – Logistic Regression

Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activation.
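A hedged Keras sketch of a comparable model trained with Adam (the MNIST-style 784-dimensional input, 10-class output, and omission of dropout are assumptions, not the authors' exact setup):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_shape=(784,)),   # hidden layer 1
    tf.keras.layers.Dense(1000, activation='relu'),                       # hidden layer 2
    tf.keras.layers.Dense(10, activation='softmax'),                      # assumed 10-class output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])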

Experiments – CNN
CNN architecture: CIFAR-10 with a c64-c64-c128-1000 architecture: three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
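A hedged Keras sketch of that c64-c64-c128-1000 architecture (padding choices and the 10-class softmax output are assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='sparse_categorical_crossentropy')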

References
Kingma D., Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer