
1 Steepest Descent Algorithm: Step 1. $w^{(t+1)} = w^{(t)} - \eta \cdot \frac{\partial L}{\partial w}\left(w^{(t)}\right)$
Very simple, just one step. Convergence is poor if the function surface is much steeper in one dimension than in another. Let's see an example!
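As a rough illustration (not part of the original slides), here is a one-line NumPy sketch of the update above; the names steepest_descent_step, grad_L, and eta are mine.

```python
import numpy as np

def steepest_descent_step(w, grad_L, eta):
    """One steepest-descent update: w <- w - eta * dL/dw(w)."""
    return w - eta * grad_L(w)

# e.g. one step on the scalar objective L(w) = w^2 (gradient 2w):
print(steepest_descent_step(np.array([1.0]), lambda w: 2 * w, eta=0.1))  # [0.8]
```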

2 Steepest Descent Objective: $L(w) = w_1^2 + 25\, w_2^2$
Gradient: $\nabla L = \begin{pmatrix} 2 w_1 \\ 50 w_2 \end{pmatrix}$. Small in the $w_1$ direction, large in the $w_2$ direction!
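To make the imbalance concrete, a small NumPy check of the gradient at $(-1, -1)$ (my own illustration; the names L and grad_L are assumed):

```python
import numpy as np

def L(w):                # objective from the slide: w1^2 + 25*w2^2
    return w[0]**2 + 25 * w[1]**2

def grad_L(w):           # its gradient: (2*w1, 50*w2)
    return np.array([2 * w[0], 50 * w[1]])

print(grad_L(np.array([-1.0, -1.0])))   # [ -2. -50.]: 25x larger along w2
```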

3 Steepest Descent Zigzag behavior!
It moves very slowly in the flat direction and oscillates in the steep direction.
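A minimal sketch of that zigzag, again on $L(w) = w_1^2 + 25 w_2^2$; the step size eta = 0.035 is my own choice, picked so the steep $w_2$ direction overshoots:

```python
import numpy as np

def grad_L(w):                       # gradient of L(w) = w1^2 + 25*w2^2
    return np.array([2 * w[0], 50 * w[1]])

w = np.array([-1.0, -1.0])
eta = 0.035                          # illustrative; large enough to overshoot in w2
for t in range(5):
    w = w - eta * grad_L(w)
    print(t, w)
# w1 creeps toward 0 while w2 flips sign every iteration: the zigzag
```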

4 Momentum Accelerate convergence in directions where the gradient keeps the same sign over the past few iterations. If $\mu$ is large, momentum overshoots the target, but overall convergence is still faster than steepest descent in practice. As $\mu$ gets smaller, momentum behaves more and more like steepest descent.
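The update rule itself is not written out in this transcript, so the sketch below uses the classical (heavy-ball) momentum form that is consistent with the Nesterov slide further down; the values eta = 0.01 and mu = 0.9 are illustrative, not from the slides.

```python
import numpy as np

def grad_L(w):                       # same quadratic example: L(w) = w1^2 + 25*w2^2
    return np.array([2 * w[0], 50 * w[1]])

w, v = np.array([-1.0, -1.0]), np.zeros(2)
eta, mu = 0.01, 0.9                  # illustrative values
for _ in range(100):
    v = mu * v - eta * grad_L(w)     # velocity grows where gradient signs agree
    w = w + v
print(w)                             # near the minimum at (0, 0)
```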

5

6 Nesterov Accelerated Gradient
Algorithm: Step 1. $v^{(t+1)} = \mu v^{(t)} - \eta \cdot \frac{\partial L}{\partial w}\left(w^{(t)} + \mu v^{(t)}\right)$ Step 2. $w^{(t+1)} = w^{(t)} + v^{(t+1)}$ Nesterov Accelerated Gradient performs better than momentum [Sutskever et al., 2013].
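A minimal sketch of Steps 1-2, assuming NumPy and a grad_L callable (names are mine); the only difference from plain momentum is that the gradient is evaluated at the look-ahead point $w^{(t)} + \mu v^{(t)}$:

```python
import numpy as np

def nag_step(w, v, grad_L, eta, mu):
    """One Nesterov accelerated gradient update (Steps 1-2 above)."""
    v = mu * v - eta * grad_L(w + mu * v)   # Step 1: gradient at the look-ahead point
    w = w + v                               # Step 2
    return w, v
```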

7

8 AdaGrad (Adaptive Gradient)
Consider the previous example: $L(w) = w_1^2 + 25\, w_2^2$ with gradient $\nabla L = (2 w_1,\ 50 w_2)$. The initial $w^{(0)}$ is $(-1, -1)$. $w^{(1)} = \begin{pmatrix} -1 \\ -1 \end{pmatrix} - \begin{pmatrix} \frac{\eta}{\sqrt{(-2)^2 + 10^{-7}}} \cdot (-2) \\ \frac{\eta}{\sqrt{(-50)^2 + 10^{-7}}} \cdot (-50) \end{pmatrix} \approx \begin{pmatrix} -1 \\ -1 \end{pmatrix} + \begin{pmatrix} \eta \\ \eta \end{pmatrix}$ It will follow the diagonal direction!
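The AdaGrad rule itself is not spelled out in this transcript, so the sketch below uses the standard form (accumulate squared gradients in G, scale the step by $1/\sqrt{G + 10^{-7}}$) to reproduce the first step of the worked example; eta = 0.1 is an illustrative value.

```python
import numpy as np

def grad_L(w):                           # gradient of L(w) = w1^2 + 25*w2^2
    return np.array([2 * w[0], 50 * w[1]])

w = np.array([-1.0, -1.0])
G = np.zeros(2)
eta = 0.1                                # illustrative value
g = grad_L(w)                            # [-2, -50]
G = G + g**2                             # accumulated squared gradients: [4, 2500]
w = w - eta / np.sqrt(G + 1e-7) * g      # per-coordinate steps of +eta each
print(w)                                 # approx [-0.9, -0.9]: the diagonal direction
```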

9

10 RMSProp (Root Mean Square Propagation)
Algorithm: Step 1. $G^{(t+1)} = \gamma G^{(t)} + (1-\gamma) \cdot \left(\frac{\partial L}{\partial w}\left(w^{(t)}\right)\right)^2$ ($G$ is an exponential moving average of the squared gradient.) Step 2. $w^{(t+1)} = w^{(t)} - \frac{\eta}{\sqrt{G^{(t+1)} + 10^{-7}}} \cdot \frac{\partial L}{\partial w}\left(w^{(t)}\right)$ Scale the learning rate by $1/\sqrt{G}$ and follow the negative gradient direction. Default settings: $\gamma = 0.9$, $\eta = 0.001$.
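A minimal sketch of the two steps, assuming NumPy and a grad_L callable (function and variable names are mine):

```python
import numpy as np

def rmsprop_step(w, G, grad_L, eta=0.001, gamma=0.9, eps=1e-7):
    """One RMSProp update: G is an exponential moving average of squared gradients."""
    g = grad_L(w)
    G = gamma * G + (1 - gamma) * g**2      # Step 1
    w = w - eta / np.sqrt(G + eps) * g      # Step 2: learning rate scaled by 1/sqrt(G)
    return w, G
```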

11

12 Another example (saddle point)
Now consider minimizing $L(w) = -w_1^2 + w_2^2$. The problem is unbounded below and $(0, 0)$ is a saddle point. At $(-0.01, -1)$ we have $\nabla L = (0.02, -2)$: tiny in the $w_1$ direction that escapes the saddle, large in $w_2$. How will these methods behave if we start at this point?
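A small numerical sketch of plain steepest descent near this saddle, under the assumption that the (partly garbled) objective is $L(w) = -w_1^2 + w_2^2$; the step size and iteration count are my own choices.

```python
import numpy as np

def grad_L(w):                   # gradient of the assumed saddle objective -w1^2 + w2^2
    return np.array([-2 * w[0], 2 * w[1]])

w = np.array([-0.01, -1.0])
eta = 0.01
for _ in range(100):
    w = w - eta * grad_L(w)
print(w)   # roughly [-0.07, -0.13]: w2 shrinks fast, but escape along w1 is slow
```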

13

14 Adam (Adaptive Moment Estimation)
Algorithm [Complete Version]: Step 1. $m^{(t+1)} = \beta_1 m^{(t)} + (1-\beta_1) \cdot \frac{\partial L}{\partial w}\left(w^{(t)}\right)$ Step 2. $v^{(t+1)} = \beta_2 v^{(t)} + (1-\beta_2) \cdot \left(\frac{\partial L}{\partial w}\left(w^{(t)}\right)\right)^2$ Step 3. $\hat{m}^{(t+1)} = \frac{1}{1-\beta_1^{t+1}} \cdot m^{(t+1)}$ Step 4. $\hat{v}^{(t+1)} = \frac{1}{1-\beta_2^{t+1}} \cdot v^{(t+1)}$ Step 5. $w^{(t+1)} = w^{(t)} - \frac{\eta}{\sqrt{\hat{v}^{(t+1)}} + 10^{-8}} \cdot \hat{m}^{(t+1)}$ Steps 3-4 are the bias correction (needed because we initialize $m^{(0)}$ and $v^{(0)}$ to 0).
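A minimal sketch of Steps 1-5 in NumPy, with t as a 1-based iteration counter; the defaults beta1 = 0.9, beta2 = 0.999, eta = 0.001 are the usual Adam values and are an assumption, since the slide does not list them.

```python
import numpy as np

def adam_step(w, m, v, t, grad_L, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Steps 1-5); t is the 1-based iteration index."""
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g               # Step 1: first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2            # Step 2: second-moment EMA
    m_hat = m / (1 - beta1**t)                    # Step 3: bias correction
    v_hat = v / (1 - beta2**t)                    # Step 4: bias correction
    w = w - eta / (np.sqrt(v_hat) + eps) * m_hat  # Step 5
    return w, m, v
```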

15

16

