Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017
Outline
What is Adam
The optimization algorithm
Bias correction
Bounded update
Relations to other approaches: RMSProp, AdaGrad
Why use Adam
Applications/Experiments
References
What is Adam
Adaptive moment estimation: adaptive learning rates computed from estimates of the 1st and 2nd moments of the gradients.
Cited more than 4172 times since 2014.
First, a brief review of momentum.
Momentum
Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
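A minimal NumPy sketch of this behavior, using the classical momentum update (the toy quadratic objective, stepsize, and momentum coefficient are illustrative assumptions, not from the slides):

import numpy as np

# Toy objective: f(theta) = 0.5 * theta^T A theta with an ill-conditioned A,
# so plain gradient descent oscillates along the steep direction.
A = np.diag([10.0, 1.0])          # high curvature in dim 0, gentle but consistent in dim 1
grad = lambda theta: A @ theta    # gradient of the quadratic

theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
lr, mu = 0.05, 0.9                # stepsize and momentum coefficient

for t in range(100):
    g = grad(theta)
    velocity = mu * velocity - lr * g   # decaying accumulation of past gradients
    theta = theta + velocity            # oscillating components cancel, consistent ones add up

print(theta)  # close to the minimum at [0, 0]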
The optimization algorithm
Our goal is to minimize the cost function with respect to the parameters.
The optimization algorithm-cont.
Input: $\alpha$ (stepsize), $\beta_1, \beta_2 \in [0,1)$ (exponential decay rates), $\epsilon$ (small constant)
Initialize: $m_0 \leftarrow 0$ (1st moment vector), $v_0 \leftarrow 0$ (2nd moment vector), $\theta_0$ (initial parameters)
Update, for $t = 1, 2, \dots$:
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$   (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$   (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
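A compact NumPy sketch of these update rules (a paraphrase of the pseudocode above, not the authors' reference implementation; the toy gradient function and stepsize in the example call are assumptions):

import numpy as np

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run Adam on grad_fn starting from the NumPy array theta."""
    m = np.zeros_like(theta)   # 1st moment vector
    v = np.zeros_like(theta)   # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g**2     # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)             # bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
print(adam(lambda th: 2 * th, np.array([1.0, -2.0]), alpha=0.05, steps=500))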
The optimization algorithm-cont.
Bias correction: why do we need it?
Because the moment vectors are initialized at zero ($m_0 \leftarrow 0$, $v_0 \leftarrow 0$), the estimates $m_t$ and $v_t$ are biased toward zero during the first steps, especially when the decay rates are close to 1.
What we really want are estimates of the expected values $\mathbb{E}[g_t]$ and $\mathbb{E}[g_t^2]$.
The optimization algorithm-cont.
Bias correction: derivation for the 2nd moment.
Unrolling the recursion gives $v_t = (1-\beta_2)\sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$, so
$\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\cdot(1-\beta_2^t) + \zeta$,
where $\zeta$ is a small value (zero if the second moment is stationary). Dividing by $(1-\beta_2^t)$ removes the initialization bias; the same argument with $\beta_1$ gives the correction $\hat{m}_t = m_t / (1-\beta_1^t)$.
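A small numerical illustration of why the correction matters (a hypothetical constant-gradient setting, not from the slides): with $\beta_2 = 0.999$ the raw estimate $v_t$ badly underestimates $\mathbb{E}[g_t^2]$ in the first steps, while the corrected $\hat{v}_t$ does not.

beta2 = 0.999
g = 1.0              # assume a constant gradient, so E[g^2] = 1
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = v / (1 - beta2**t)          # bias-corrected estimate
    print(t, round(v, 5), round(v_hat, 5))
# v stays near 0.001*t (biased toward the zero initialization),
# while v_hat equals 1.0, the true second moment.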
The optimization algorithm-cont.
Bounded update: the effective step is
$\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Since $\hat{m}_t / \sqrt{\hat{v}_t} \approx \mathbb{E}[g_t]/\sqrt{\mathbb{E}[g_t^2]} \le 1$, the step magnitude is roughly bounded by the stepsize $\alpha$, which establishes a trust region around the current parameter value.
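A quick empirical check of this bound (illustrative only; the heavy-tailed gradient model and default hyperparameters are assumptions): even with extreme gradient noise, the per-step change stays within a small constant factor of $\alpha$.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, max_step = 0.0, 0.0, 0.0
for t in range(1, 10001):
    g = rng.standard_cauchy()                 # heavy-tailed noisy gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    step = alpha * m_hat / (np.sqrt(v_hat) + eps)
    max_step = max(max_step, abs(step))
# Bounded by roughly alpha*(1-beta1)/sqrt(1-beta2) ~ 3.2*alpha for these defaults,
# whereas a raw SGD step alpha*g is unbounded under Cauchy noise.
print(max_step)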
Relations to other adaptive approaches
RMSProp:
$\mathbb{E}[g^2]_t = \gamma\, \mathbb{E}[g^2]_{t-1} + (1-\gamma)\, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{\mathbb{E}[g^2]_t} + \epsilon}\, g_t$
RMSProp lacks a bias-correction term, which (for $\gamma$ close to 1) can lead to very large stepsizes early in training and often to divergence.
RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam's updates come directly from bias-corrected running averages of the first and second moments of the gradient.
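For comparison, a minimal NumPy sketch of the basic RMSProp update written above (variable names are illustrative; note the absence of any bias-correction step):

import numpy as np

def rmsprop(grad_fn, theta, alpha=0.001, gamma=0.9, eps=1e-8, steps=1000):
    Eg2 = np.zeros_like(theta)                  # running average of squared gradients
    for t in range(steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g**2  # no division by (1 - gamma**t): biased toward 0 early on
        theta = theta - alpha * g / (np.sqrt(Eg2) + eps)   # early steps can greatly exceed alpha when gamma is near 1
    return theta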
Relations to other adaptive approaches-cont.
AdaGrad: $\theta_{t+1} = \theta_t - \alpha \cdot \dfrac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}$
Adam's bias-corrected moment estimates can be written as weighted sums:
$\hat{v}_t = \dfrac{1-\beta_2}{1-\beta_2^{\,t}} \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$, with $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1}\sum_{i=1}^{t} g_i^2$
$\hat{m}_t = \dfrac{1-\beta_1}{1-\beta_1^{\,t}} \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$, which for $\beta_1 = 0$ reduces to $\hat{m}_t = g_t$
With $\beta_1 = 0$, $\beta_2 \to 1$ and an annealed stepsize $\alpha_t = \alpha \cdot t^{-1/2}$, Adam's update matches AdaGrad's:
$\theta_{t+1} = \theta_t - \alpha \cdot t^{-1/2} \cdot \dfrac{g_t}{\sqrt{t^{-1}\sum_{i=1}^{t} g_i^2}} = \theta_t - \alpha \cdot \dfrac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}$
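A small numeric check of the limit above (illustrative assumption: a fixed random gradient sequence): with $\beta_2$ very close to 1, the bias-corrected $\hat{v}_t$ approaches the AdaGrad-style average $t^{-1}\sum_{i=1}^{t} g_i^2$.

import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1000)                  # a fixed sequence of scalar gradients
beta2 = 1 - 1e-8                           # beta2 -> 1

v = 0.0
for gt in g:
    v = beta2 * v + (1 - beta2) * gt**2
v_hat = v / (1 - beta2**len(g))            # bias-corrected 2nd moment estimate

print(v_hat, np.mean(g**2))                # nearly identical values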
Why use Adam
Works well with sparse gradients.
Hyperparameters have intuitive interpretations and typically require little tuning.
Used quite a lot in large and complex networks.
Adam in TensorFlow
# Adam
adam_opt = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
Default values: learning_rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
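A minimal end-to-end usage sketch with the TensorFlow 1.x API shown above (the quadratic toy loss, stepsize, and variable names are illustrative assumptions):

import tensorflow as tf   # TensorFlow 1.x style API

theta = tf.Variable([1.0, -2.0])
loss = tf.reduce_sum(tf.square(theta))                 # toy objective ||theta||^2

adam_opt = tf.train.AdamOptimizer(learning_rate=0.1)   # beta1/beta2/epsilon keep their defaults
train_op = adam_opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op)
    print(sess.run(theta))   # close to [0, 0]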
Experiments – Logistic Regression
Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activations.
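A hedged tf.keras sketch of a model matching this description (layer sizes from the slide; the use of tf.keras, the MNIST-style input/output sizes, and the compile settings are assumptions, not the paper's code):

import tensorflow as tf

# Two fully connected hidden layers, 1000 ReLU units each;
# 784-dim inputs and 10 output classes assume MNIST-style data.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
mlp.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])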
Experiments – CNN
CNN architecture (CIFAR-10, c64-c64-c128-1000): three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
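A hedged tf.keras sketch of the c64-c64-c128-1000 architecture described above (the padding choices, final softmax layer, and use of tf.keras are assumptions, not the paper's implementation):

import tensorflow as tf

# Three stages of 5x5 convolutions with 3x3 max pooling (stride 2),
# then a 1000-unit fully connected ReLU layer, on 32x32x3 CIFAR-10 inputs.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Conv2D(128, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer=tf.keras.optimizers.Adam(),
            loss="sparse_categorical_crossentropy")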
References
Kingma, D. P., and Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder, S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer