1 Adam: A Method For Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017

2 Outline
What is Adam
The optimization algorithm
  Bias correction
  Bounded update
Relations with other approaches
  RMSProp
  AdaGrad
Why use Adam
Applications/Experiments
References

3 What is Adam
Adam: adaptive moment estimation (it uses estimates of the 1st and 2nd moments of the gradient).
Cited 4172 times since 2014.
Before the algorithm itself, a brief review of momentum.

4 Momentum

5 Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
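For reference, the classical momentum update in a standard form (this equation is my addition, not from the slide; $\mu$ is the momentum coefficient and $\alpha$ the learning rate):
$$v_t = \mu\, v_{t-1} - \alpha\, \nabla_\theta f(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} + v_t$$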

6 The optimization algorithm
Our goal is to minimize the cost function and solve for the parameters.

7 The optimization algorithm-cont.
Input: $\alpha$ – stepsize; $\beta_1, \beta_2 \in [0,1)$ – exponential decay rates for the moment estimates; $\epsilon$ – small constant
Initialize: $m_0 \leftarrow 0$ (1st moment vector), $v_0 \leftarrow 0$ (2nd moment vector), $\theta_0$ – initial parameter vector
Update, for $t = 1, 2, \dots$:
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
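For illustration, a minimal NumPy sketch of the update rule above; the function and variable names are my own, not from the slides or the paper.

import numpy as np

def adam_update(theta, grad_fn, num_steps,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # theta: parameter vector (NumPy array); grad_fn(theta) returns the gradient g_t
    m = np.zeros_like(theta)  # 1st moment vector, m_0 <- 0
    v = np.zeros_like(theta)  # 2nd moment vector, v_0 <- 0
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                     # g_t
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # biased 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)           # bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

For example, with grad_fn = lambda th: 2.0 * (th - 3.0) (the gradient of (theta - 3)^2), repeated updates move theta towards 3.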

8 The optimization algorithm-cont.
Bias correction: why do we need it? It comes from the initialization: $m_0$ and $v_0$ are set to zero, so the running averages $m_t$ and $v_t$ are biased towards zero during the first steps. What we really want are estimates of the expected values $E[g_t]$ and $E[g_t^2]$.
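A one-step example of the bias (my own illustration): at $t = 1$,
$$m_1 = \beta_1 \cdot m_0 + (1-\beta_1)\, g_1 = (1-\beta_1)\, g_1,$$
which underestimates $g_1$ by the factor $(1-\beta_1)$ (only 0.1 of $g_1$ for the default $\beta_1 = 0.9$); dividing by $(1-\beta_1^1) = (1-\beta_1)$ gives back $\hat{m}_1 = g_1$.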

9 The optimization algorithm-cont.
π’Ž 𝒕 ← π’Ž 𝒕 (πŸβˆ’ 𝜷 𝟏 𝒕 ) (bias correction) 𝒗 𝒕 ← 𝒗 𝒕 πŸβˆ’ 𝜷 𝟐 𝒕 (bias correction) Bias correction Derivation about the bias term (2nd momentum) … … Small Value

10 The optimization algorithm-cont.
Bounded update: $\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.
This acts like a trust region, since $\hat{m}_t / \sqrt{\hat{v}_t} \approx E[g_t] / \sqrt{E[g_t^2]} \le 1$.
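The resulting bound on the magnitude of the effective step, as given in the paper:
$$|\Delta_t| \le \alpha \cdot \frac{1-\beta_1}{\sqrt{1-\beta_2}} \ \text{ if } (1-\beta_1) > \sqrt{1-\beta_2}, \qquad |\Delta_t| \le \alpha \ \text{ otherwise},$$
so in the common case the update stays within a region of roughly radius $\alpha$ around the current parameters.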

11 Relations to other adaptive approaches
RMSProp lacks the bias-correction term; with $\beta_2$ (here $\gamma$) close to 1, as needed for sparse gradients, this can lead to very large stepsizes and divergence. RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam's updates are estimated directly from running averages of the first and second moments of the gradient.
$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$
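For comparison, a minimal NumPy sketch of the plain RMSProp update above (names are my own); note there is no bias correction on the running average, which starts at zero.

import numpy as np

def rmsprop_update(theta, grad_fn, num_steps, alpha=0.001, gamma=0.9, eps=1e-8):
    avg_sq = np.zeros_like(theta)  # running average E[g^2]_t
    for _ in range(num_steps):
        g = grad_fn(theta)
        avg_sq = gamma * avg_sq + (1 - gamma) * g * g
        theta = theta - alpha * g / (np.sqrt(avg_sq) + eps)
    return theta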

12 Relations to other adaptive approaches-cont.
AdaGrad: $\theta_{t+1} = \theta_t - \alpha \cdot g_t \big/ \sqrt{\sum_{i=1}^{t} g_i^2}$
Adam's bias-corrected moments can be written as
$\hat{v}_t = \dfrac{1-\beta_2}{1-\beta_2^t} \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$, with $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \sum_{i=1}^{t} g_i^2$
$\hat{m}_t = \dfrac{1-\beta_1}{1-\beta_1^t} \sum_{i=1}^{t} \beta_1^{\,t-i} g_i$, which for $\beta_1 = 0$ gives $\hat{m}_t = g_t$
So with $\beta_1 = 0$, $\beta_2 \to 1$, and the stepsize annealed as $\alpha_t = \alpha \cdot t^{-1/2}$, the Adam update
$\theta_{t+1} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t \big/ \sqrt{t^{-1} \sum_{i=1}^{t} g_i^2}$
reduces to the AdaGrad update.
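And a matching NumPy sketch of AdaGrad (names are my own; the eps term is added for numerical stability and is not in the formula above), which accumulates all past squared gradients instead of an exponential moving average.

import numpy as np

def adagrad_update(theta, grad_fn, num_steps, alpha=0.01, eps=1e-8):
    sum_sq = np.zeros_like(theta)  # sum_{i=1}^{t} g_i^2
    for _ in range(num_steps):
        g = grad_fn(theta)
        sum_sq = sum_sq + g * g
        theta = theta - alpha * g / (np.sqrt(sum_sq) + eps)
    return theta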

13 Why use Adam
Works well with sparse gradients.
The hyperparameters have intuitive interpretations and typically require little tuning.
Used widely in large and complex networks.

14 Adam in TensorFlow
# Adam optimizer (TF 1.x API), with the default values written out
import tensorflow as tf

adam_opt = tf.train.AdamOptimizer(learning_rate=0.001,  # default learning_rate
                                  beta1=0.9,            # default beta1
                                  beta2=0.999,          # default beta2
                                  epsilon=1e-8)         # default epsilon
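A hedged usage sketch continuing the snippet above, with the TF 1.x graph API; loss is assumed to be a scalar tensor already defined elsewhere.

train_op = adam_opt.minimize(loss)   # loss: assumed pre-existing scalar tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)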

15 Experiments – Logistic Regression

16 Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activation.
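A sketch of such a model in tf.keras, for illustration; the 784-dimensional input and the 10-class softmax output assume MNIST, which the slide does not state.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_shape=(784,)),  # hidden layer 1
    tf.keras.layers.Dense(1000, activation='relu'),                      # hidden layer 2
    tf.keras.layers.Dense(10, activation='softmax'),                     # assumed output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(),  # Adam with default hyperparameters
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])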

17 Experiments – CNN
CNN architecture (CIFAR-10, c64-c64-c): three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
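A rough tf.keras reconstruction of that CNN, for illustration only; the filter count of the third stage and the 10-way softmax output are my assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu',
                           input_shape=(32, 32, 3)),                        # stage 1: 5x5 conv, 64 filters
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),              # 3x3 max pool, stride 2
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),  # stage 2: 5x5 conv, 64 filters
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    tf.keras.layers.Conv2D(128, (5, 5), padding='same', activation='relu'), # stage 3 (128 filters assumed)
    tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='relu'),                         # fully connected ReLU layer
    tf.keras.layers.Dense(10, activation='softmax'),                        # CIFAR-10 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy')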

18 References
Kingma D. P., Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.

