Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017
Outline
What is Adam
The optimization algorithm
Bias correction
Bounded update
Relations to other approaches: RMSProp, AdaGrad
Why use Adam
Applications/Experiments
References
What is Adam
Adaptive moment estimation: adaptive learning rates computed from estimates of the 1st and 2nd moments of the gradients.
Cited more than 4172 times since 2014.
First, a brief review of momentum.
Momentum
Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
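A minimal NumPy sketch of this behavior, using the classical momentum update (the toy quadratic objective, stepsize, and momentum coefficient are illustrative assumptions, not from the slides):

import numpy as np

# Toy objective: f(theta) = 0.5 * theta^T A theta with an ill-conditioned A,
# so plain gradient descent oscillates along the steep direction.
A = np.diag([10.0, 1.0])          # high curvature in dim 0, gentle but consistent in dim 1
grad = lambda theta: A @ theta    # gradient of the quadratic

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
lr, mu = 0.05, 0.9                # stepsize and momentum coefficient

for t in range(100):
    g = grad(theta)
    velocity = mu * velocity - lr * g   # decaying accumulation of past gradients
    theta = theta + velocity            # oscillating components cancel, consistent ones add up

print(theta)  # close to the minimum at [0, 0]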
The optimization algorithm
Our goal is to minimize the cost function with respect to the parameters.
The optimization algorithm-cont.
Input: $\alpha$ (stepsize), $\beta_1, \beta_2 \in [0,1)$ (exponential decay rates), $\epsilon$ (small constant)
Initialize: $m_0 \leftarrow 0$ (1st moment vector), $v_0 \leftarrow 0$ (2nd moment vector), $\theta_0$ (initial parameters)
Update, for $t = 1, 2, \dots$:
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$   (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$   (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
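A compact NumPy sketch of these update rules (a paraphrase of the pseudocode above, not the authors' reference implementation; the toy gradient function and stepsize in the example call are assumptions):

import numpy as np

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run Adam on grad_fn starting from the NumPy array theta."""
    m = np.zeros_like(theta)   # 1st moment vector
    v = np.zeros_like(theta)   # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g**2     # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)             # bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
print(adam(lambda th: 2 * th, np.array([1.0, -2.0]), alpha=0.05, steps=500))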
The optimization algorithm-cont.
Bias correction: why do we need it?
Because the moment vectors are initialized at zero ($m_0 \leftarrow 0$, $v_0 \leftarrow 0$), the estimates $m_t$ and $v_t$ are biased toward zero during the first steps, especially when the decay rates are close to 1.
What we really want are estimates of the expected values $\mathbb{E}[g_t]$ and $\mathbb{E}[g_t^2]$.
The optimization algorithm-cont.
Bias correction: derivation for the 2nd moment.
Unrolling the recursion gives $v_t = (1-\beta_2)\sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$, so
$\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\cdot(1-\beta_2^t) + \zeta$,
where $\zeta$ is a small value (zero if the second moment is stationary). Dividing by $(1-\beta_2^t)$ removes the initialization bias; the same argument with $\beta_1$ gives the correction $\hat{m}_t = m_t / (1-\beta_1^t)$.
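A small numerical illustration of why the correction matters (a hypothetical constant-gradient setting, not from the slides): with $\beta_2 = 0.999$ the raw estimate $v_t$ badly underestimates $\mathbb{E}[g_t^2]$ in the first steps, while the corrected $\hat{v}_t$ does not.

beta2 = 0.999
g = 1.0              # assume a constant gradient, so E[g^2] = 1
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = v / (1 - beta2**t)          # bias-corrected estimate
    print(t, round(v, 5), round(v_hat, 5))
# v stays near 0.001*t (biased toward the zero initialization),
# while v_hat equals 1.0, the true second moment.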
The optimization algorithm-cont.
Bounded update: the effective step is
$\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Since $\hat{m}_t / \sqrt{\hat{v}_t} \approx \mathbb{E}[g_t]/\sqrt{\mathbb{E}[g_t^2]} \le 1$, the step magnitude is roughly bounded by the stepsize $\alpha$, which establishes a trust region around the current parameter value.
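A quick empirical check of this bound (illustrative only; the heavy-tailed gradient model and default hyperparameters are assumptions): even with extreme gradient noise, the per-step change stays within a small constant factor of $\alpha$.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, max_step = 0.0, 0.0, 0.0
for t in range(1, 10001):
    g = rng.standard_cauchy()                 # heavy-tailed noisy gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    step = alpha * m_hat / (np.sqrt(v_hat) + eps)
    max_step = max(max_step, abs(step))
# Bounded by roughly alpha*(1-beta1)/sqrt(1-beta2) ~ 3.2*alpha for these defaults,
# whereas a raw SGD step alpha*g is unbounded under Cauchy noise.
print(max_step)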
Relations to other adaptive approaches
RMSProp:
$\mathbb{E}[g^2]_t = \gamma\, \mathbb{E}[g^2]_{t-1} + (1-\gamma)\, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{\mathbb{E}[g^2]_t} + \epsilon}\, g_t$
RMSProp lacks a bias-correction term, which (for $\gamma$ close to 1) can lead to very large stepsizes early in training and often to divergence.
RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam's updates come directly from bias-corrected running averages of the first and second moments of the gradient.
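For comparison, a minimal NumPy sketch of the basic RMSProp update written above (variable names are illustrative; note the absence of any bias-correction step):

import numpy as np

def rmsprop(grad_fn, theta, alpha=0.001, gamma=0.9, eps=1e-8, steps=1000):
    Eg2 = np.zeros_like(theta)                  # running average of squared gradients
    for t in range(steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g**2  # no division by (1 - gamma**t): biased toward 0 early on
        theta = theta - alpha * g / (np.sqrt(Eg2) + eps)   # early steps can greatly exceed alpha when gamma is near 1
    return theta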
Relations to other adaptive approaches-cont.
AdaGrad: $\theta_{t+1} = \theta_t - \alpha \cdot \dfrac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}$
Adam's bias-corrected moment estimates can be written as weighted sums:
$\hat{v}_t = \dfrac{1-\beta_2}{1-\beta_2^{\,t}} \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$, with $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1}\sum_{i=1}^{t} g_i^2$
$\hat{m}_t = \dfrac{1-\beta_1}{1-\beta_1^{\,t}} \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$, which for $\beta_1 = 0$ reduces to $\hat{m}_t = g_t$
With $\beta_1 = 0$, $\beta_2 \to 1$ and an annealed stepsize $\alpha_t = \alpha \cdot t^{-1/2}$, Adam's update matches AdaGrad's:
$\theta_{t+1} = \theta_t - \alpha \cdot t^{-1/2} \cdot \dfrac{g_t}{\sqrt{t^{-1}\sum_{i=1}^{t} g_i^2}} = \theta_t - \alpha \cdot \dfrac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}$
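A small numeric check of the limit above (illustrative assumption: a fixed random gradient sequence): with $\beta_2$ very close to 1, the bias-corrected $\hat{v}_t$ approaches the AdaGrad-style average $t^{-1}\sum_{i=1}^{t} g_i^2$.

import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1000)                  # a fixed sequence of scalar gradients
beta2 = 1 - 1e-8                           # beta2 -> 1

v = 0.0
for gt in g:
    v = beta2 * v + (1 - beta2) * gt**2
v_hat = v / (1 - beta2**len(g))            # bias-corrected 2nd moment estimate

print(v_hat, np.mean(g**2))                # nearly identical values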
Why use Adam
Works well with sparse gradients.
Hyperparameters have intuitive interpretations and typically require little tuning.
Used quite a lot in large and complex networks.
Adam in TensorFlow
# Adam
adam_opt = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
Default values: learning_rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
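A minimal end-to-end usage sketch with the TensorFlow 1.x API shown above (the quadratic toy loss, stepsize, and variable names are illustrative assumptions):

import tensorflow as tf   # TensorFlow 1.x style API

theta = tf.Variable([1.0, -2.0])
loss = tf.reduce_sum(tf.square(theta))                 # toy objective ||theta||^2

adam_opt = tf.train.AdamOptimizer(learning_rate=0.1)   # beta1/beta2/epsilon keep their defaults
train_op = adam_opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op)
    print(sess.run(theta))   # close to [0, 0]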
Experiments – Logistic Regression
Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activations.
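A hedged tf.keras sketch of a model matching this description (layer sizes from the slide; the use of tf.keras, the MNIST-style input/output sizes, and the compile settings are assumptions, not the paper's code):

import tensorflow as tf

# Two fully connected hidden layers, 1000 ReLU units each;
# 784-dim inputs and 10 output classes assume MNIST-style data.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
mlp.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])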
Experiments – CNN
CNN architecture (CIFAR-10, c64-c64-c128-1000): three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
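A hedged tf.keras sketch of the c64-c64-c128-1000 architecture described above (the padding choices, final softmax layer, and use of tf.keras are assumptions, not the paper's implementation):

import tensorflow as tf

# Three stages of 5x5 convolutions with 3x3 max pooling (stride 2),
# then a 1000-unit fully connected ReLU layer, on 32x32x3 CIFAR-10 inputs.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Conv2D(128, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer=tf.keras.optimizers.Adam(),
            loss="sparse_categorical_crossentropy")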
References
Kingma, D. P., and Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder, S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer