Tight Analyses for Non-Smooth Stochastic Gradient Descent



Presentation on theme: "Tight Analyses for Non-Smooth Stochastic Gradient Descent"— Presentation transcript:

1 Tight Analyses for Non-Smooth Stochastic Gradient Descent
Nick Harvey Chris Liaw Yaniv Plan Sikander Randhawa University of British Columbia

2 Gradient Descent is everywhere
UW CSE446: Machine Learning

3 Gradient Descent is everywhere
MIT 6.006: Undergraduate Algorithms (for non-theorists)

4 Gradient Descent in a Nutshell
Given the current iterate $x_t$, produce $x_{t+1} \leftarrow x_t - \eta_t \nabla f(x_t)$. (UW CSE546: Machine Learning)
Standard convergence rates:
- Smooth and strongly convex: $f(x_t) - OPT = O(\exp(-t))$
- Smooth: $f(x_t) - OPT = O(1/t)$
- Non-smooth* and strongly convex: $f\big(\frac{1}{t}\sum_{i=1}^t x_i\big) - OPT = O(\log(t)/t)$
- Non-smooth* and Lipschitz: $f\big(\frac{1}{t}\sum_{i=1}^t x_i\big) - OPT = O(1/\sqrt{t})$
* Use a subgradient instead of $\nabla f(x_t)$.

5 Example: Geometric Median
e.g. [Cohen, Lee, Miller, Pachocki, Sidford '16]. Given $n$ points in $d$ dimensions, $\{z_1, z_2, z_3, \ldots, z_n\} \subset \mathbb{R}^d$.
Goal: minimize $f(x) = \sum_{i=1}^n \|x - z_i\|_2$. The minimizer is the "geometric median".
Problem: $n$ is huge, so even evaluating $f$ once is prohibitive. Note that $f$ is $n$-Lipschitz and non-differentiable.
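The objective above and one full (sub)gradient evaluation can be sketched in a few lines of Python (illustrative code, not from the talk; the random point set and all names are made up):

```python
# Illustrative Python for the geometric median objective f(x) = sum_i ||x - z_i||_2
# and one full (sub)gradient evaluation.  One exact call touches all n points,
# which is the cost the talk wants to avoid.
import math, random

def f(x, zs):
    """Objective value: sum of Euclidean distances from x to every data point."""
    return sum(math.dist(x, z) for z in zs)

def subgradient(x, zs):
    """A subgradient: sum of unit vectors (x - z_i)/||x - z_i||, using 0 for
    any data point that coincides with x (where f is non-differentiable)."""
    g = [0.0] * len(x)
    for z in zs:
        r = math.dist(x, z)
        if r > 0:
            for j in range(len(x)):
                g[j] += (x[j] - z[j]) / r
    return g

random.seed(0)
zs = [(random.random(), random.random()) for _ in range(1000)]
x = (0.5, 0.5)
value = f(x, zs)         # one exact evaluation touches all n = 1000 points
```

Since every unit vector has norm 1, the subgradient's norm is at most $n$, which is exactly the $n$-Lipschitz claim on the slide.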

6 Gradient Descent in a Nutshell
Given the current iterate $x_t$, produce $x_{t+1} \leftarrow x_t - \eta_t \nabla f(x_t)$.
"Last iterate" results:
- Smooth and strongly convex: $f(x_t) - OPT = O(\exp(-t))$
- Smooth: $f(x_t) - OPT = O(1/t)$
"Average iterate" results:
- Non-smooth and strongly convex: $f\big(\frac{1}{t}\sum_{i=1}^t x_i\big) - OPT = O(\log(t)/t)$
- Non-smooth and Lipschitz: $f\big(\frac{1}{t}\sum_{i=1}^t x_i\big) - OPT = O(1/\sqrt{t})$
Key question: Why analyze the average iterate? Isn't $f(x_t)$ better? Or at least almost as good?

7 Formal setup
Input: convex set $X \subseteq \mathbb{R}^n$, convex function $f : X \to \mathbb{R}$.
Algorithm: standard gradient descent (with projections).
Assumptions: $f$ is $L$-Lipschitz; $\mathrm{diam}(X) \le R$.
Non-assumptions: $f$ need not be differentiable (or smooth).
Key questions: Can we analyze $f(x_t) - OPT$? What about the stochastic setting?

8 Lipschitz functions
$f$ is $L$-Lipschitz if $|f(x) - f(y)| \le L\|x - y\|_2$ for all $x, y$. Equivalently: every (sub-)gradient has norm at most $L$.
Example (geometric median): $f(x) = \sum_{i=1}^n \|x - z_i\|_2$ is $n$-Lipschitz.

9 Strongly Convex functions
$f$ is $\alpha$-strongly convex if $f(x) - \frac{\alpha}{2}\|x\|_2^2$ is convex. If $f$ is twice continuously differentiable, this is the same as $\nabla^2 f(x) \succeq \alpha I$.
Canonical example: regularization in ML, $\min_w \sum_{i=1}^n \ell(w; z_i) + \lambda \|w\|_2^2$ (empirical loss plus regularizer).
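The definition can be sanity-checked numerically on a toy regularized objective (a Python sketch; the squared loss here is an illustrative stand-in for the generic loss $\ell$ on the slide):

```python
# Numeric sanity check of the definition: f is alpha-strongly convex iff
# g(w) = f(w) - (alpha/2) * w**2 is convex.  Toy 1-D regularized objective;
# the squared loss stands in for the generic empirical loss.
lam = 0.1
zs = [0.0, 1.0, 2.0]

def f(w):
    # empirical loss + lambda * ||w||^2
    return sum((w - z) ** 2 for z in zs) + lam * w ** 2

alpha = 2 * len(zs) + 2 * lam          # f''(w) is constant here: 2n + 2*lambda

def g(w):
    return f(w) - 0.5 * alpha * w ** 2

# Midpoint convexity of g on a grid: g((u+v)/2) <= (g(u) + g(v))/2
pts = [i / 10 for i in range(-20, 21)]
convex = all(g((u + v) / 2) <= (g(u) + g(v)) / 2 + 1e-9
             for u in pts for v in pts)
print(convex)   # True
```

With this choice of `alpha`, the quadratic part of `g` cancels exactly, leaving a linear (hence convex) function, so the check passes.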

10 Subgradients
Definition (subgradient): the slope of a hyperplane that lower-bounds $f$ and is tangent to $f$ at $x$.
Notation: $\partial f(x) = \{$ all subgradients of $f$ at $x \}$. If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.
Example: $f(x) = |x|$ is convex and 1-Lipschitz, but non-differentiable at 0, where $\partial f(0)$ is the interval $[-1, 1]$.
Note: $0.5 \in \partial f(0)$, since $g(x) = 0.5x$ satisfies $g(0) = 0$ and $g(x) \le f(x)$ for all $x$.

11 Sub-Gradient Descent
Step sizes: today we fix $\eta_t = 1/\sqrt{t}$.
Input: feasible set $X \subset \mathbb{R}^n$, initial point $x_1 \in \mathbb{R}^n$, step sizes $\eta_1, \eta_2, \ldots$
Repeat:
  Query the subgradient oracle for $g_t \in \partial f(x_t)$.
  $y_{t+1} \leftarrow x_t - \eta_t g_t$
  $x_{t+1} \leftarrow \Pi_X(y_{t+1})$, where $\Pi_X(y)$ is the closest point in $X$ to $y$.
  $t \leftarrow t + 1$
Until output desired. Return: ???
Problem: the method is not monotonic; it can happen that $f(x_{t+1}) > f(x_t)$. What to do?
- Line search? Is computing $f$ even feasible?
- Return $\min_t f(x_t)$? Cumbersome.
- Analyze $f\big(\sum_{i=1}^t x_i / t\big)$ instead. But the average may not be sparse.
- Show that $f(x_t)$ converges to OPT anyway? At what rate?
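The loop above can be sketched as a minimal, runnable toy in Python (illustrative code, not from the talk, run on $f(x) = |x|$ over $X = [-1,1]$; the sign-based oracle is one valid subgradient selection):

```python
# A minimal sketch of the projected subgradient method on f(x) = |x|
# over X = [-1, 1] with eta_t = 1/sqrt(t).
import math

def proj(y, lo=-1.0, hi=1.0):
    """Pi_X(y): the closest point of the interval X = [lo, hi] to y."""
    return max(lo, min(hi, y))

def subgrad_abs(x):
    """A subgradient of f(x) = |x|; any value in [-1, 1] is valid at 0."""
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def subgradient_descent(x1, T):
    xs = [x1]
    for t in range(1, T):
        xs.append(proj(xs[-1] - subgrad_abs(xs[-1]) / math.sqrt(t)))
    return xs

xs = subgradient_descent(0.9, 1000)
fs = [abs(x) for x in xs]
not_monotone = any(fs[t + 1] > fs[t] for t in range(len(fs) - 1))
avg = sum(xs) / len(xs)
print(not_monotone)          # True: f along the iterates is not monotone
```

Even on this one-dimensional toy, the trajectory overshoots 0 and $f(x_t)$ goes up at some steps, while the average iterate stays close to the optimum; this is the non-monotonicity the slide is pointing at.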

12 Main Question 1: Convergence of $f(x_t)$
Does $f(x_t)$ converge to OPT? If so, at what rate?
"Running SGD for T iterations, and returning the last iterate, is a very common heuristic." [Shamir-Zhang '13]
"For all methods known so far, it is possible to guarantee only convergence of the average values of the objective function." [Nesterov-Shikhman '15]
($50) "What is the expected suboptimality of the last iterate returned by SGD, for non-smooth strongly convex problems?" [Shamir '12]

13 Stochastic Sub-Gradient Descent
For the geometric median, $f(x) = \sum_{i=1}^n \|x - z_i\|_2$, computing an element of $\partial f(x_t)$ takes $\Omega(n)$ time, which is prohibitive.
Random sampling: suppose instead we find $g_t$ such that $\mathbb{E}[g_t \mid x_1, \ldots, x_t] \in \partial f(x_t)$. This is SGD: we no longer enforce $g_t \in \partial f(x_t)$.
Input: $X \subset \mathbb{R}^n$, $x_1 \in \mathbb{R}^n$, $\eta_1, \eta_2, \ldots$
For $t = 1, \ldots, T$:
  Query the stochastic gradient oracle to obtain $g_t$.
  $y_{t+1} \leftarrow x_t - \eta_t g_t$
  $x_{t+1} \leftarrow \Pi_X(y_{t+1})$
Does gradient descent still work? (Assume $\|g_t\|$ is a.s. bounded.)
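The sampling idea can be sketched for the geometric median in Python (a 1-D toy instance, so OPT is just the median of the data; all names, constants, and the instance itself are illustrative, not from the talk):

```python
# A minimal SGD sketch for the geometric median in 1-D.  Sampling one point
# gives n * sign(x - z_i), which is a subgradient of f(x) = sum_i |x - z_i|
# in expectation, at O(1) cost per step instead of Omega(n).
import math, random

random.seed(1)
zs = [random.uniform(0.0, 1.0) for _ in range(101)]
n = len(zs)
true_median = sorted(zs)[n // 2]         # minimizer of sum_i |x - z_i| for odd n

def stochastic_subgrad(x):
    z = random.choice(zs)                # touch ONE data point, not all n
    if x == z:
        return 0.0
    return n * (1.0 if x > z else -1.0)  # E[g | x] = sum_i sign(x - z_i)

x, avg = 0.0, 0.0
T = 20000
for t in range(1, T + 1):
    eta = 1.0 / (n * math.sqrt(t))       # 1/(L*sqrt(t)) with L = n (f is n-Lipschitz)
    x = min(1.0, max(0.0, x - eta * stochastic_subgrad(x)))   # project onto X = [0, 1]
    avg += (x - avg) / t                 # running average of the iterates
print(round(avg, 2), round(true_median, 2))
```

The average iterate lands near the true median, which is the classical $O(1/\sqrt{T})$ average-iterate guarantee in action.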

14 Summary of convergence bounds
Lipschitz functions:
- Uniform averaging: deterministic/expected UB $O(1/\sqrt{T})$ [Nemirovski-Yudin '83]; high-probability UB $O(1/\sqrt{T})$ [Azuma]; deterministic LB $\Omega(1/\sqrt{T})$ [Nemirovski-Yudin '83].
- Last iterate: expected UB $O(\log T / \sqrt{T})$ [Shamir-Zhang '13]; tight high-probability UB $O(\log T / \sqrt{T})$ [This work]; tight* deterministic LB $\Omega(\log T / \sqrt{T})$ [This work].
Strongly convex & Lipschitz functions:
- Uniform averaging: expected UB $O(\log T / T)$ [Hazan-Agarwal-Kale '07]; high-probability UB $O(\log T / T)$ [Kakade-Tewari '08]; deterministic LB $\Omega(1/T)$ [Nemirovski-Yudin '83].
- Epoch-based averaging: expected UB $O(1/T)$ [Hazan-Kale '11]; high-probability UB $O(\log\log T / T)$.
- Suffix averaging: expected UB $O(1/T)$, high-probability UB $O(\log\log T / T)$ [Rakhlin-Shamir-Sridharan '12]; tight high-probability UB $O(1/T)$ [This work].
- Last iterate: expected UB $O(\log T / T)$ [Shamir-Zhang '13]; high-probability UB $O(\log T / T)$ [This work]; tight* LB $\Omega(\log T / T)$ [This work].
* dependence on $\log(1/\delta)$ is not completely tight

15 Main Result 1: Lower Bound, Lipschitz case
Theorem: Fix $T \in \mathbb{N}$. There exists a 1-Lipschitz function $f : B_2^T \to \mathbb{R}$ (the Euclidean ball in $\mathbb{R}^T$, with a subgradient oracle for $f$) such that $OPT = f(0) = 0$, and executing GD from $x_1 = 0$ with $\eta_t = 1/\sqrt{t}$ (or $\eta_t = 1/\sqrt{T}$) yields:
- $f(x_{t+1}) \ge f(x_t) + \Omega\big(\frac{1}{\sqrt{T}(T-t)}\big)$ for $t < T$. (Monotonic increase!)
- Thus $f(x_T) - OPT \ge \frac{\log T}{32\sqrt{T}}$. (Suboptimal convergence...)
- Furthermore, for any $k \le T$ and any $x \in \mathrm{conv}\{x_{T-k+1}, \ldots, x_T\}$, $f(x) - OPT \ge \frac{\log(T/k)}{32\sqrt{T}}$. (...for any suffix of size $o(T)$.)

16 Main Result 1: Lower Bound, Lipschitz case
Theorem: Fix $T \in \mathbb{N}$. Then there exists a 1-Lipschitz $f : B_2^T \to \mathbb{R}$ (the Euclidean ball in $\mathbb{R}^T$) such that $OPT = f(0) = 0$ and executing GD from $x_1 = 0$ with $\eta_t = 1/\sqrt{t}$ yields $f(x_{t+1}) \ge f(x_t) + \Omega\big(\frac{1}{\sqrt{T}(T-t)}\big)$ for $t < T$ (monotonic increase!), and thus $f(x_T) - OPT \ge \frac{\log T}{32\sqrt{T}}$ (suboptimal convergence).
Theorem: Fix $\epsilon > 0$. Then there exists a 1-Lipschitz $f : \ell_2 \to \mathbb{R}$ such that $OPT = f(0) = 0$ and executing GD from $x_1 = 0$ yields $f(x_T) - OPT > \frac{\log^{1-\epsilon} T}{\sqrt{T}}$ for infinitely many $T$. In other words, for any $g(T) \in o(\log T / \sqrt{T})$, we have $\limsup_{T\to\infty} \frac{f(x_T)}{g(T)} = \infty$.

17 Why “infinitely often” and “lim sup”?
Can we improve this to $f(x_T) - OPT > \frac{\log^{1-\epsilon} T}{\sqrt{T}}$ for all $T$? No: averaging satisfies $\frac{1}{T}\sum_{i=1}^T f(x_i) - OPT = O(1/\sqrt{T})$, so if the last iterate were a $\log T$ factor above this at every $T$, the average would be too. Contradiction! So the $\log T$ factor can only occur sporadically.


19 Hitting the $\frac{\log T}{\sqrt{T}}$ curve infinitely often
Given $T$, we can construct a function $f_T$ such that $f_T(x_{T+1}) - OPT \ge \frac{\log T}{\sqrt{T}}$.
Now consider $f(x_1, x_2, \ldots) := \sum_{i=1}^{\infty} \frac{1}{i^2} f_i(x_i)$. This function simulates running GD on each $f_i$ in parallel; the combination is compatible with each $f_i$ and maintains Lipschitzness.
At the time step associated with coordinate $i$, we will have $\frac{1}{i^2} f_i(x_i) \ge \frac{1}{i^2} \cdot \frac{\log i}{\sqrt{i}}$, and along a rapidly growing subsequence of time steps the $\log$ term dominates the $1/i^2$ factor, so the iterates hit the $\frac{\log T}{\sqrt{T}}$ curve (up to the $\epsilon$ loss) infinitely often.

20 Some intuition about the lower bound…
Consider the function $f(x) = |x|$ on the interval $[-1, 1]$ with initial point $x_1 = 0$. Recall $x_2 = x_1 - g_1$ with $g_1 \in \partial f(0)$, where $\partial f(0)$ is the set of slopes of lines that lower-bound $f$ and are tangent to $f$ at 0. The line of slope $-1$ lower-bounds $|x|$ and is tangent to it at 0, so if the subgradient oracle selects $g_1 = -1$, then $x_2 = x_1 - (-1) = 1$.
Takeaway: non-differentiable points can increase the function value!
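This first step can be checked numerically (a tiny Python sketch; the oracle's answer $-1$ is the adversarial but valid choice described on the slide):

```python
# Numeric check of the slide's example: f(x) = |x|, x1 = 0, and the oracle
# answers g1 = -1 (a valid subgradient, since g(x) = -x lower-bounds |x|
# and touches it at 0).  With eta_1 = 1, the step moves AWAY from the optimum.
def f(x):
    return abs(x)

x1, g1, eta1 = 0.0, -1.0, 1.0
# g1 is a subgradient at 0: f(y) >= f(x1) + g1 * (y - x1) for all y
assert all(f(y) >= f(x1) + g1 * (y - x1) for y in (-1.0, -0.5, 0.0, 0.5, 1.0))
x2 = x1 - eta1 * g1
print(x2, f(x2))   # 1.0 1.0 -- the step increased the function value
```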

21 How to keep increasing?
We can increase the function value at the first step. Can we repeat this? Sadly, no: if $f$ were non-differentiable at $x_2 = 1$, convexity would force all subgradients there to point back towards the origin. Instead, introduce another dimension! Consider $f(u, v) = \max\{-u,\ 0.5u - v,\ 0.5u + 0.5v\}$. We need the negative-slope piece to keep moving the iterate away from the origin, into the new dimension:
- $x_1 = (0, 0)$, $f(x_1) = 0$
- $x_2 = (1, 0)$, $f(x_2) = 0.5$
- $x_3$, with $f(x_3) = 0.677$
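The construction can be traced numerically (a Python sketch; the subgradient choices below are one valid selection, each being the gradient of a piece that is active at the current point, and are consistent with the slide's values):

```python
# Tracing the 2-D construction: f(u, v) = max(-u, 0.5u - v, 0.5u + 0.5v)
# with eta_t = 1/sqrt(t).  A bad-but-valid subgradient oracle makes f
# increase at two consecutive steps.
import math

def f(u, v):
    return max(-u, 0.5 * u - v, 0.5 * u + 0.5 * v)

x1 = (0.0, 0.0)        # all three pieces active here; (-1, 0) is a valid subgradient
x2 = (x1[0] - 1.0 * (-1.0), x1[1] - 1.0 * 0.0)          # eta_1 = 1
# At x2 = (1, 0), the pieces 0.5u - v and 0.5u + 0.5v are both active (= 0.5),
# so (0.5, -1) is a valid subgradient there.
eta2 = 1.0 / math.sqrt(2.0)
x3 = (x2[0] - eta2 * 0.5, x2[1] - eta2 * (-1.0))

print(f(*x1), f(*x2), round(f(*x3), 3))   # 0.0 0.5 0.677 -- f keeps increasing
```

Note how the second step moves the iterate into the $v$ dimension while $f$ still rises, which is exactly the mechanism the slide describes.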

22 How to keep increasing?
Main takeaway: non-differentiability plus multiple dimensions can force GD to continually step away from the optimal solution into a new dimension!

23 Main Question 2: High-probability bounds on $f(x_t)$
"Another relevant issue is obtaining a bound which holds not just in expectation, but also in arbitrarily high probability $1-\delta$ with logarithmic dependence on $1/\delta$. An extra $20 will be awarded for proving a tight bound on the suboptimality of [the last iterate] which holds in high probability." [Shamir '12]
[Shamir-Zhang '13] showed $\mathbb{E}[f(x_T)] - OPT = O(\log T / \sqrt{T})$.
Main Result 2: High-probability UB, Lipschitz case. Theorem: Let $f : X \to \mathbb{R}$ be convex and 1-Lipschitz with $\mathrm{diam}(X)$ bounded. Let the stochastic gradient oracle return $g_t$ such that $\mathbb{E}[g_t \mid x_1, \ldots, x_t] \in \partial f(x_t)$ and $\|g_t\|$ is a.s. bounded. Then, with probability at least $1-\delta$, $f(x_T) - OPT \le O\big(\frac{\log T \cdot \log(1/\delta)}{\sqrt{T}}\big)$.

24 Setup for the high probability upper bound
Recall the assumptions: $\mathrm{diam}(X) \le R$; $\mathbb{E}[g_t \mid x_1, \ldots, x_t] \in \partial f(x_t)$; $\|g_t\|$ is a.s. bounded.
Decompose $g_t = \hat{g}_t - z_t$, where $\hat{g}_t \in \partial f(x_t)$ is a true subgradient and $\mathbb{E}[z_t \mid x_1, \ldots, x_t] = 0$ is mean-zero noise.
We'll assume $\|z_t\| \le 1$ a.s. (This can be relaxed to $\|z_t\|$ subgaussian.)

25 Sub-Gradient inequality
"Linearizing with a subgradient lower-bounds the function": for all $x$, $z$, and $g \in \partial f(x)$, $f(x) - f(z) \le \langle g, x - z \rangle$.
$g_t$ is a subgradient in expectation: $\mathbb{E}[f(x_t) - f(z)] \le \mathbb{E}[\langle g_t, x_t - z \rangle]$.
Removing the expectation, $g_t$ is not a subgradient, but $g_t + z_t$ is: $f(x_t) - f(z) \le \langle g_t, x_t - z \rangle + \langle z_t, x_t - z \rangle$.

26 Analysis of upper bound in expectation: 3 Steps
This analysis is due to [Shamir-Zhang '13].
Step 1: standard stuff (the uniform averaging analysis).
Step 2: derive a recursive relationship between suffix averages.
Step 3: more standard stuff.

27 Step 1: Standard Stuff (Uniform Averaging Analysis)
Define the suffix averages $S_k := \frac{1}{k+1} \sum_{t=T-k}^{T} f(x_t)$. Note $S_0 = f(x_T)$ and $S_{T-1} = \frac{1}{T}\sum_{t=1}^{T} f(x_t)$.
The following inequality is true for any $k$: $\mathbb{E}\big[S_k - f(x_{T-k})\big] = O(1/\sqrt{T})$.
This requires: the diameter bound on $X$; bounded noise; and the subgradient inequality, applied $k$ times. (That last point is a problem for when we want to remove the expectation!)

28 Step 2 + 3: Recursive Relationship and Standard Stuff
With $S_k := \frac{1}{k+1} \sum_{t=T-k}^{T} f(x_t)$, the definition gives the identity $k S_{k-1} = (k+1) S_k - f(x_{T-k})$, i.e. $k\,\mathbb{E}[S_{k-1}] = k\,\mathbb{E}[S_k] + \mathbb{E}\big[S_k - f(x_{T-k})\big]$.
Since $\mathbb{E}[S_k - f(x_{T-k})] = O(1/\sqrt{T})$ (Step 1), this yields $\mathbb{E}[S_{k-1}] \le \mathbb{E}[S_k] + O\big(\frac{1}{k\sqrt{T}}\big)$.
Unrolling from $k = T-1$ down to $k = 1$: $\mathbb{E}[S_0] \le \mathbb{E}[S_{T-1}] + O\big(\frac{\log T}{\sqrt{T}}\big)$.
Hence $\mathbb{E}[f(x_T) - f(x^*)] \le \mathbb{E}[S_{T-1} - f(x^*)] + O\big(\frac{\log T}{\sqrt{T}}\big)$, and $\mathbb{E}[S_{T-1} - f(x^*)] = O(1/\sqrt{T})$ is standard (uniform averaging).
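The recursion driving Step 2 is pure algebra about suffix averages, so it can be sanity-checked on arbitrary numbers (a Python sketch with made-up values standing in for the $f(x_t)$):

```python
# Sanity check of the algebra behind Step 2: with S_k the average of the
# last k+1 values f(x_{T-k}), ..., f(x_T), the identity
# k * S_{k-1} = (k+1) * S_k - f(x_{T-k}) holds for ANY sequence of values.
import random

random.seed(0)
T = 50
fvals = {t: random.random() for t in range(1, T + 1)}    # stand-ins for f(x_t)

def S(k):
    return sum(fvals[t] for t in range(T - k, T + 1)) / (k + 1)

ok = all(abs(k * S(k - 1) - ((k + 1) * S(k) - fvals[T - k])) < 1e-12
         for k in range(1, T))
print(ok)   # True
```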

29 Removing Expectation: Step 1
Step 1 required: the diameter bound on $X$; bounded noise; and the subgradient inequality, applied $k$ times. Replace each application of the subgradient inequality by its noisy version:
$\mathbb{E}[f(x_t) - f(x_{T-k})] \le \mathbb{E}[\langle g_t, x_t - x_{T-k} \rangle]$ (true in expectation) becomes
$f(x_t) - f(x_{T-k}) \le \langle g_t, x_t - x_{T-k} \rangle + \langle z_t, x_t - x_{T-k} \rangle$.
This yields $S_k - f(x_{T-k}) \le O(1/\sqrt{T}) + Z_k$, where $Z_k := \frac{1}{k+1} \sum_{t=T-k}^{T} \langle z_t, x_t - x_{T-k} \rangle$.

30 Removing Expectation: Step 2
$f(x_T) - f(x^*) \le \big(S_{T-1} - f(x^*)\big) + O\big(\frac{\log T}{\sqrt{T}}\big) + \sum_{k=1}^{T} \frac{1}{k} Z_k$, where $Z_k = \frac{1}{k+1}\sum_{t=T-k}^{T} \langle z_t, x_t - x_{T-k}\rangle$.
The first term is the error of uniform averaging, which is $O(1/\sqrt{T})$ with high probability; the second is the expected error of the final iterate; the third is the error due to noisy subgradients.

31 Martingales
Consider random variables $D_1, D_2, \ldots$ (call these "increments") with $\mathbb{E}[D_i \mid D_1, \ldots, D_{i-1}] = 0$, and let $\mathcal{S}_n = \sum_{i=1}^{n} D_i$. Then $(\mathcal{S}_n)_{n\in\mathbb{N}}$ is a martingale. Think of a gambler repeatedly playing a fair betting game: her expected fortune at step $t+1$ is her current fortune.
Recall $Z_k = \frac{1}{k+1}\sum_{t=T-k}^{T} \langle z_t, x_t - x_{T-k}\rangle$; $Z_k$ is the value of a martingale after $k$ steps. We want to bound $Z := \sum_{k=1}^{T} \frac{1}{k} Z_k$, which is also (the final value of) a martingale.

32 Azuma's Inequality
Can we use standard martingale concentration inequalities to bound $Z$? Natural attempt: for sums of i.i.d. random variables we would use Hoeffding's inequality; its martingale version is Azuma's inequality.
Azuma's inequality: suppose $(X_n)_{n\in\mathbb{N}}$ is a martingale with $|X_k - X_{k-1}| \le c_k$ almost surely. Then $\Pr[X_N - X_0 \ge t] \le \exp\big(-\frac{t^2}{2\sum_{k=1}^{N} c_k^2}\big)$. Equivalently, if the increments satisfy $D_k^2 \le c_k^2$ a.s., then for every $\delta > 0$, with probability at least $1-\delta$, $\mathcal{S}_n = O\big(\sqrt{\sum_{k=1}^{n} c_k^2 \log\frac{1}{\delta}}\big)$.
Note: Azuma needs a bound on the sum of squared increments, i.e. the total variance up to step $n$.
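Azuma's bound can be sanity-checked by Monte Carlo (a Python sketch with $\pm 1$ increments; the horizon, threshold, and trial count are arbitrary illustrative choices):

```python
# Monte Carlo sanity check of Azuma's inequality for bounded increments:
# with |D_k| <= c_k, Pr[S_N >= t] <= exp(-t^2 / (2 * sum c_k^2)).
import math, random

random.seed(0)
N, trials, t = 100, 20000, 20.0
c = [1.0] * N                                   # increments are +-1 coin flips
bound = math.exp(-t ** 2 / (2 * sum(ck ** 2 for ck in c)))

hits = 0
for _ in range(trials):
    s = sum(random.choice((-1.0, 1.0)) for _ in range(N))
    if s >= t:
        hits += 1
empirical = hits / trials
print(empirical <= bound)    # the observed tail respects Azuma's bound
```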

33 Trying to use Azuma to bound Z
We need an almost-sure bound of $O\big(\frac{\log^2 T}{T}\big)$ on the sum of squared increments of $Z$.
$Z = \sum_{k=1}^{T} \frac{1}{k} Z_k = \sum_{k=1}^{T} \frac{1}{k(k+1)} \sum_{t=T-k}^{T} \langle z_t, x_t - x_{T-k} \rangle$. Changing the order of summation, $Z = \sum_{i=1}^{T} \big\langle z_i, \sum_{t=1}^{i-1} \frac{x_i - x_t}{(T-t)^2} \big\rangle$ (up to constants), and the $i$-th summand $D_i$ is an increment of $Z$.
Using the diameter bound alone is too weak: it only yields $\sum_{i=1}^{T} D_i^2 \approx \Theta(1)$, and we need $O\big(\frac{\log^2 T}{T}\big)$.

34 What came first, the martingale or its variance?
We need tighter control of $\|x_t - x_j\|^2$. A careful analysis of $\|x_t - x_j\|^2$ yields, with high probability, $\sum_{i=1}^{T} D_i^2 \le \frac{\log^2 T}{T} + \frac{\log T}{\sqrt{T}}\, Z$.
A chicken-and-egg problem: we want to use Azuma, so we need to bound the variance; to bound the variance, we need to bound the martingale; and to bound the martingale via Azuma, we need to bound the variance.

35 Trying to make our variance bound work …
Let $\alpha = \frac{\log T}{\sqrt{T}}\sqrt{\log(1/\delta)}$ and $\beta = \frac{\log^2 T}{T}\log(1/\delta)$, so that $\sum_i D_i^2 \le \beta + \alpha Z$ w.h.p. Then:
$\Pr[Z \ge \alpha] = \Pr[Z \ge \alpha \wedge \sum_i D_i^2 \le \beta + \alpha Z] + \Pr[Z \ge \alpha \wedge \sum_i D_i^2 > \beta + \alpha Z]$
$\le \Pr[Z \ge \alpha \wedge \sum_i D_i^2 \le \beta + \alpha Z] + \Pr[\sum_i D_i^2 > \beta + \alpha Z]$
$\le \Pr[Z \ge \alpha \wedge \sum_i D_i^2 \le \beta + \alpha Z] + \delta$.
How do we bound $\Pr[Z \ge \alpha \wedge \sum_i D_i^2 \le \beta + \alpha Z]$?

36 Freedman’s Inequality:
Informally: "It is not likely that the martingale is larger than the square root of its total conditional variance." (This exact formulation doesn't appear in Freedman's paper.)
Let $D_1, D_2, \ldots$ be the increments of a martingale $\mathcal{S}_n$, let $V_i = \mathbb{E}[D_i^2 \mid D_1, \ldots, D_{i-1}]$, and let $V_n = \sum_{i=1}^{n} V_i$. Then $\Pr\big[\mathcal{S}_n \ge v\sqrt{\log(1/\delta)} \text{ and } V_n \le v^2\big] \le \delta$.
We'd like to bound $\Pr[Z \ge \alpha \wedge \sum_i D_i^2 \le \beta + \alpha Z]$ (martingale is big, variance is small). Recall $\alpha^2 \approx \beta$, so we are almost able to apply Freedman.

37 親子丼 Oyakodon – a Japanese rice bowl dish which contains chicken and egg.

38 鮭親子丼 Salmon Oyakodon – a Japanese rice bowl dish which contains salmon and roe.

39 The “Oyakodon” Theorem
The Oyakodon Theorem [HLPR 2018] is a martingale concentration inequality useful when the variance is bounded by the martingale itself; it accounts for the "chicken and egg phenomenon":
Let $D_1, D_2, \ldots$ be the increments of a martingale, let $V_i = \mathbb{E}[D_i^2 \mid D_1, \ldots, D_{i-1}]$, and let $\mathcal{S}_n = \sum_{i=1}^{n} D_i$ and $V_n = \sum_{i=1}^{n} V_i$. Then $\Pr\big[\mathcal{S}_n \ge x \text{ and } V_n \le \beta + \alpha \mathcal{S}_n\big] \le \exp\big(-\frac{x^2}{4\alpha x + 8\beta}\big)$.
This turns out to be exactly what we need to complete the high-probability analysis of $Z$!

40 Conclusions
- $f(x_t)$ can be asymptotically worse than $f\big(\sum_{i=1}^{t} x_i / t\big)$, for Lipschitz or strongly convex GD (in pathological cases).
- High-probability bounds for $f(x_t)$ and for suffix averaging.
Open questions:
- Does $\lim_{t\to\infty} x_t$ exist?
- Our UB on $f(x_T)$ is $O(\log T \cdot \log(1/\delta) / \sqrt{T})$ [subexponential], while the LB is $\Omega\big(\frac{\log T + \sqrt{\log(1/\delta)}}{\sqrt{T}}\big)$ [subgaussian]. Not quite tight.
- Are these techniques useful in analyzing game dynamics?

41 Thank you! Questions?

