1 Adam: A Method For Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017

2 Outline
What is Adam
The optimization algorithm
  Bias correction
  Bounded update
Relations with other approaches
  RMSProp
  AdaGrad
Why use Adam
Applications/Experiments
References

3 What is Adam
Adam: adaptive moment estimation (it uses estimates of the 1st and 2nd moments of the gradient).
Cited 4172 times since 2014.
Before the algorithm itself, a brief review of momentum.

4 Momentum

5 Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
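For reference, the classical momentum update in a standard form (this equation is my addition, not from the slide; $\mu$ is the momentum coefficient and $\alpha$ the learning rate):
$$v_t = \mu\, v_{t-1} - \alpha\, \nabla_\theta f(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} + v_t$$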

6 The optimization algorithm
Our goal is to minimize the cost function and solve for the parameters.

7 The optimization algorithm-cont.
Input: $\alpha$ – stepsize; $\beta_1, \beta_2 \in [0,1)$ – exponential decay rates for the moment estimates; $\epsilon$ – small constant
Initialize: $m_0 \leftarrow 0$ (1st moment vector), $v_0 \leftarrow 0$ (2nd moment vector), $\theta_0$ – initial parameter vector
Update, for $t = 1, 2, \dots$:
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
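For illustration, a minimal NumPy sketch of the update rule above; the function and variable names are my own, not from the slides or the paper.

import numpy as np

def adam_update(theta, grad_fn, num_steps,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # theta: parameter vector (NumPy array); grad_fn(theta) returns the gradient g_t
    m = np.zeros_like(theta)  # 1st moment vector, m_0 <- 0
    v = np.zeros_like(theta)  # 2nd moment vector, v_0 <- 0
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                     # g_t
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # biased 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)           # bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

For example, with grad_fn = lambda th: 2.0 * (th - 3.0) (the gradient of (theta - 3)^2), repeated updates move theta towards 3.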

8 The optimization algorithm-cont.
Bias correction: why do we need it? It comes from the initialization: $m_0$ and $v_0$ are set to zero, so the running averages $m_t$ and $v_t$ are biased towards zero during the first steps. What we really want are estimates of the expected values $E[g_t]$ and $E[g_t^2]$.
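A one-step example of the bias (my own illustration): at $t = 1$,
$$m_1 = \beta_1 \cdot m_0 + (1-\beta_1)\, g_1 = (1-\beta_1)\, g_1,$$
which underestimates $g_1$ by the factor $(1-\beta_1)$ (only 0.1 of $g_1$ for the default $\beta_1 = 0.9$); dividing by $(1-\beta_1^1) = (1-\beta_1)$ gives back $\hat{m}_1 = g_1$.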

9 The optimization algorithm-cont.
π’Ž 𝒕 ← π’Ž 𝒕 (πŸβˆ’ 𝜷 𝟏 𝒕 ) (bias correction) 𝒗 𝒕 ← 𝒗 𝒕 πŸβˆ’ 𝜷 𝟐 𝒕 (bias correction) Bias correction Derivation about the bias term (2nd momentum) … … Small Value

10 The optimization algorithm-cont.
Bounded update: $\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.
This acts like a trust region, since $\hat{m}_t / \sqrt{\hat{v}_t} \approx E[g_t] / \sqrt{E[g_t^2]} \le 1$.
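The resulting bound on the magnitude of the effective step, as given in the paper:
$$|\Delta_t| \le \alpha \cdot \frac{1-\beta_1}{\sqrt{1-\beta_2}} \ \text{ if } (1-\beta_1) > \sqrt{1-\beta_2}, \qquad |\Delta_t| \le \alpha \ \text{ otherwise},$$
so in the common case the update stays within a region of roughly radius $\alpha$ around the current parameters.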

11 Relations to other adaptive approaches
RMSProp lacks the bias-correction term; with $\beta_2$ (here $\gamma$) close to 1, as needed for sparse gradients, this can lead to very large stepsizes and divergence. RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam's updates are estimated directly from running averages of the first and second moments of the gradient.
$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$
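For comparison, a minimal NumPy sketch of the plain RMSProp update above (names are my own); note there is no bias correction on the running average, which starts at zero.

import numpy as np

def rmsprop_update(theta, grad_fn, num_steps, alpha=0.001, gamma=0.9, eps=1e-8):
    avg_sq = np.zeros_like(theta)  # running average E[g^2]_t
    for _ in range(num_steps):
        g = grad_fn(theta)
        avg_sq = gamma * avg_sq + (1 - gamma) * g * g
        theta = theta - alpha * g / (np.sqrt(avg_sq) + eps)
    return theta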

12 Relations to other adaptive approaches-cont.
AdaGrad: $\theta_{t+1} = \theta_t - \alpha \cdot g_t \big/ \sqrt{\sum_{i=1}^{t} g_i^2}$
Adam's bias-corrected moments can be written as
$\hat{v}_t = \dfrac{1-\beta_2}{1-\beta_2^t} \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$, with $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \sum_{i=1}^{t} g_i^2$
$\hat{m}_t = \dfrac{1-\beta_1}{1-\beta_1^t} \sum_{i=1}^{t} \beta_1^{\,t-i} g_i$, which for $\beta_1 = 0$ gives $\hat{m}_t = g_t$
So with $\beta_1 = 0$, $\beta_2 \to 1$, and the stepsize annealed as $\alpha_t = \alpha \cdot t^{-1/2}$, the Adam update
$\theta_{t+1} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t \big/ \sqrt{t^{-1} \sum_{i=1}^{t} g_i^2}$
reduces to the AdaGrad update.
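And a matching NumPy sketch of AdaGrad (names are my own; the eps term is added for numerical stability and is not in the formula above), which accumulates all past squared gradients instead of an exponential moving average.

import numpy as np

def adagrad_update(theta, grad_fn, num_steps, alpha=0.01, eps=1e-8):
    sum_sq = np.zeros_like(theta)  # sum_{i=1}^{t} g_i^2
    for _ in range(num_steps):
        g = grad_fn(theta)
        sum_sq = sum_sq + g * g
        theta = theta - alpha * g / (np.sqrt(sum_sq) + eps)
    return theta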

13 Why use Adam
Works well with sparse gradients.
The hyperparameters have intuitive interpretations and typically require little tuning.
Used widely in large and complex networks.

14 Adam in TensorFlow
# Adam optimizer (TF 1.x API), with the default values written out
import tensorflow as tf

adam_opt = tf.train.AdamOptimizer(learning_rate=0.001,  # default learning_rate
                                  beta1=0.9,            # default beta1
                                  beta2=0.999,          # default beta2
                                  epsilon=1e-8)         # default epsilon
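A hedged usage sketch continuing the snippet above, with the TF 1.x graph API; loss is assumed to be a scalar tensor already defined elsewhere.

train_op = adam_opt.minimize(loss)   # loss: assumed pre-existing scalar tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)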

15 Experiments – Logistic Regression

16 Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activation.
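A sketch of such a model in tf.keras, for illustration; the 784-dimensional input and the 10-class softmax output assume MNIST, which the slide does not state.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_shape=(784,)),  # hidden layer 1
    tf.keras.layers.Dense(1000, activation='relu'),                      # hidden layer 2
    tf.keras.layers.Dense(10, activation='softmax'),                     # assumed output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(),  # Adam with default hyperparameters
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])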

17 Experiments – CNN
CNN architecture (CIFAR-10, c64-c64-c): three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
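A rough tf.keras reconstruction of that CNN, for illustration only; the filter count of the third stage and the 10-way softmax output are my assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu',
                           input_shape=(32, 32, 3)),                        # stage 1: 5x5 conv, 64 filters
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),              # 3x3 max pool, stride 2
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),  # stage 2: 5x5 conv, 64 filters
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    tf.keras.layers.Conv2D(128, (5, 5), padding='same', activation='relu'), # stage 3 (128 filters assumed)
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='relu'),                         # fully connected ReLU layer
    tf.keras.layers.Dense(10, activation='softmax'),                        # CIFAR-10 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy')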

18 References
Kingma D. P., Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.

