1
Adam: A Method For Stochastic Optimization
Diederik P. Kingma, Jimmy Lei Ba
Presented by Xinxin Zuo, 10/20/2017
2
Outline
What is Adam
The optimization algorithm
  Bias correction
  Bounded update
Relations with other approaches
  RMSProp
  AdaGrad
Why use Adam
Applications/Experiments
References
3
What is Adam
Adaptive moment estimation (estimates of the 1st and 2nd moments of the gradient)
Cited 4172 times since 2014
Next: a review of momentum
4
Momentum
5
Momentum-cont. It damps oscillations in directions of high curvature by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.
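A minimal sketch of the classical momentum update described above (illustrative only; the function name, learning rate, and the toy quadratic objective are my own choices, not from the slides):

import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    # Accumulate a velocity from past gradients, then move along it.
    # Opposing gradients cancel in the velocity (damping oscillations),
    # while consistent gradients add up (building speed).
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

# Toy usage on f(theta) = 0.5 * theta**2, whose gradient is theta.
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, velocity = sgd_momentum_step(theta, velocity, grad=theta)
print(theta)  # approaches 0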
6
The optimization algorithm
Our goal is to minimize the cost function and solve for the parameters.
7
The optimization algorithm-cont.
Input: $\alpha$ – stepsize; $\beta_1, \beta_2 \in [0,1)$ – exponential decay rates; $\epsilon$ – constant
Initialize: $m_0 \leftarrow 0$ (1st moment vector), $v_0 \leftarrow 0$ (2nd moment vector), $\theta_0$ – initial parameters
Update (while $\theta_t$ not converged):
$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
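A NumPy sketch of one iteration of the update above (function and variable names are my own; the defaults follow the paper's recommended values):

import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update at timestep t (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad        # biased 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected 2nd moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(theta) = 0.5 * theta**2, whose gradient is theta.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, m, v, grad=theta, t=t)  # moves towards 0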
8
The optimization algorithm-cont.
Bias correction
Why do we need the correction? It comes from the initialization: since $m_0 \leftarrow 0$ (1st moment vector) and $v_0 \leftarrow 0$ (2nd moment vector), the moment estimates are biased towards zero during the first steps. What we really want are the expected values $E[g_t]$ and $E[g_t^2]$.
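A small numeric illustration of the initialization bias (assumed setup: the gradient is constantly 1, so the true expected value $E[g_t]$ is 1):

beta1, m = 0.9, 0.0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * 1.0          # gradient is constantly 1
    print(t, round(m, 3), round(m / (1 - beta1 ** t), 3))
# The raw estimate m starts at 0.1, far below the true value 1 (bias towards zero);
# the corrected estimate m / (1 - beta1**t) equals 1 at every step.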
9
The optimization algorithm-cont.
Bias correction
$\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ (bias correction)
$\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
Derivation of the bias term (2nd moment): unrolling the update gives
$v_t = (1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$
Taking expectations, $E[v_t] = E[g_t^2] \cdot (1-\beta_2^t) + \zeta$, where $\zeta$ is a small value (zero when $E[g_i^2]$ is stationary). Dividing by $(1-\beta_2^t)$ removes this initialization bias.
10
The optimization algorithm-cont.
Bounded update
$\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Trust-region-like behavior: $\hat{m}_t / \sqrt{\hat{v}_t} \approx E[g_t] / \sqrt{E[g_t^2]} \le 1$, so the magnitude of each update is roughly bounded by the stepsize $\alpha$.
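A quick numeric check of this bounded-update behavior (assumed setup: noisy gradients with a consistent mean, drawn from a normal distribution of my choosing):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, steps = 0.0, 0.0, []
for t in range(1, 10001):
    g = rng.normal(2.0, 1.0)                   # noisy but consistent gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    step = alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    steps.append(abs(step))
print(np.mean(steps) / alpha, max(steps) / alpha)
# The typical step is a bit below alpha, and even the largest stays within a
# small multiple of alpha: the trust-region-like bound on the effective step.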
11
Relations to other adaptive approaches
RMSProp lacks a bias-correction term, which can lead to very large stepsizes and possible divergence. RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient. The basic RMSProp update is:
$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1-\rho) \, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$
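A sketch of the plain RMSProp update above, for comparison with Adam (names and defaults are my own assumptions):

import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    # Rescale the gradient by a running average of its square;
    # note there is no bias correction on avg_sq.
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)
    return theta, avg_sq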
12
Relations to other adaptive approaches-cont.
AdaGrad
$\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$
In Adam, $\hat{v}_t = \dfrac{1-\beta_2}{1-\beta_2^t} \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$, and $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \sum_{i=1}^{t} g_i^2$.
With $\beta_1 = 0$ we have $\hat{m}_t = g_t$, giving $\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{t^{-1} \sum_{i=1}^{t} g_i^2}$.
Using an annealed stepsize $\alpha_t = \alpha \cdot t^{-1/2}$, this becomes $\theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}$, i.e. AdaGrad.
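And a sketch of the AdaGrad update itself (again, names and defaults are my own):

import numpy as np

def adagrad_step(theta, sum_sq, grad, lr=0.01, eps=1e-8):
    # Divide by the square root of the sum of all past squared gradients,
    # so frequently-updated parameters receive smaller steps.
    sum_sq = sum_sq + grad ** 2
    theta = theta - lr * grad / (np.sqrt(sum_sq) + eps)
    return theta, sum_sq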
13
Why use Adam
Works well with sparse gradients
Hyperparameters have intuitive interpretations and typically require little tuning
Widely used in large and complex networks
14
Adam in TensorFlow
# Adam
adam_opt = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
Default values: learning_rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
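A hedged usage sketch in the TensorFlow 1.x graph style (the toy variable and loss below are placeholders of my own, not from the slides):

import tensorflow as tf

w = tf.Variable([1.0, 2.0])
loss = tf.reduce_mean(tf.square(w))            # toy placeholder loss
adam_opt = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9,
                                  beta2=0.999, epsilon=1e-8)
train_op = adam_opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)                     # repeatedly apply the Adam update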
15
Experiments – Logistic Regression
16
Experiments – Multi-layer NN
Neural network model: two fully connected hidden layers with 1000 hidden units each and ReLU activation.
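A Keras sketch of such a model trained with Adam (the output layer, loss, and metric are my assumptions; only the two 1000-unit ReLU hidden layers come from the slide):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu'),   # hidden layer 1
    tf.keras.layers.Dense(1000, activation='relu'),   # hidden layer 2
    tf.keras.layers.Dense(10, activation='softmax'),  # assumed 10-class output
])
model.compile(optimizer=tf.keras.optimizers.Adam(),   # Adam with default settings
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])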
17
Experiments – CNN
CNN architecture: CIFAR-10 with a c64-c64-c architecture: three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs).
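A Keras sketch of the described CNN (the channel count of the third convolution stage, the classifier head, and the loss are my assumptions; the 5x5 convolutions, 3x3/stride-2 pooling, and 1000-unit ReLU layer come from the slide):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu',
                           input_shape=(32, 32, 3)),              # CIFAR-10 images
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu'),  # assumed width
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy')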
18
References
Kingma D., Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747, 2016.