Markov Chains and Mixing Times by Levin, Peres and Wilmer. Chapter 4: Introduction to Markov Chain Mixing, Sections 4.1–4.4, pp. 47–61. Presented by Dani Dorfman
Planned topics:
- Total Variation Distance
- Coupling
- The Convergence Theorem
- Standardizing Distance from Stationarity
Total Variation Distance
Definition. Given two distributions $\mu,\nu$ on $\Omega$, we define the total variation distance to be:
$$\|\mu-\nu\|_{TV}=\max_{A\subset\Omega}|\mu(A)-\nu(A)|$$
Example: the coin-tossing frog. A frog hops between the east pad $e$ and the west pad $w$: from $e$ it jumps to $w$ with probability $p$, and from $w$ back to $e$ with probability $q$. Ordering the states $(e,w)$:
$$P=\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix},\qquad \pi=\left(\frac{q}{p+q},\,\frac{p}{p+q}\right)$$
Define $\mu_0=(1,0)$ and $\Delta_t=\mu_t(e)-\pi(e)$ $\bigl(=\pi(w)-\mu_t(w)\bigr)$. An easy computation shows:
$$\|\mu_t-\pi\|_{TV}=|\Delta_t|,\qquad \Delta_t=(1-p-q)^t\,\Delta_0$$
"An Easy Computation". Induction on $t$. For $t=0$: $\Delta_0=(1-p-q)^0\Delta_0$. For $t\to t+1$:
$$\Delta_{t+1}=\mu_{t+1}(e)-\pi(e)=(1-p)\mu_t(e)+q\bigl(1-\mu_t(e)\bigr)-\pi(e)=(1-p-q)\mu_t(e)+q-\pi(e)$$
$$=(1-p-q)\mu_t(e)+q-\frac{q}{p+q}=(1-p-q)\mu_t(e)-(1-p-q)\pi(e)=(1-p-q)\Delta_t$$
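As a quick sanity check, here is a minimal Python sketch (not from the book; the parameter values are arbitrary) that iterates $\mu_{t+1}=\mu_t P$ and compares $\|\mu_t-\pi\|_{TV}$ against the closed form $|1-p-q|^t\,|\Delta_0|$.

```python
import numpy as np

p, q = 0.3, 0.2                      # arbitrary transition probabilities
P = np.array([[1 - p, p],
              [q, 1 - q]])           # states ordered (e, w)
pi = np.array([q / (p + q), p / (p + q)])  # stationary distribution

mu = np.array([1.0, 0.0])            # mu_0: the frog starts on pad e
delta0 = mu[0] - pi[0]
for t in range(10):
    tv = 0.5 * np.abs(mu - pi).sum()                 # total variation distance
    closed_form = abs(1 - p - q) ** t * abs(delta0)  # the claimed formula
    assert np.isclose(tv, closed_form)
    mu = mu @ P                                      # one step of the chain
```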
Proposition 4.2. Let $\mu$ and $\nu$ be two probability distributions on $\Omega$. Then:
$$\|\mu-\nu\|_{TV}=\frac{1}{2}\sum_{x\in\Omega}|\mu(x)-\nu(x)|$$
(Figure: the densities of $\mu$ and $\nu$ overlaid, with regions $I$, $II$, $III$, the sets $B$, $B^C$, and the gap $\mu(A)-\nu(A)$ marked.)
Proof. Define $B=\{x\mid\mu(x)\ge\nu(x)\}$ and let $A\subset\Omega$ be an event. Clearly:
$$\mu(A)-\nu(A)\le\mu(A\cap B)-\nu(A\cap B)\le\mu(B)-\nu(B)$$
A parallel argument gives:
$$\nu(A)-\mu(A)\le\nu(A\cap B^C)-\mu(A\cap B^C)\le\nu(B^C)-\mu(B^C)$$
Both upper bounds are equal, since $\mu(B)-\nu(B)=\bigl(1-\mu(B^C)\bigr)-\bigl(1-\nu(B^C)\bigr)=\nu(B^C)-\mu(B^C)$. Taking $A=B$ achieves the upper bounds, therefore:
$$\|\mu-\nu\|_{TV}=\mu(B)-\nu(B)=\frac{1}{2}\bigl[\mu(B)-\nu(B)+\nu(B^C)-\mu(B^C)\bigr]=\frac{1}{2}\sum_{x\in\Omega}|\mu(x)-\nu(x)|$$
Remarks. From the last proof we easily deduce:
$$\|\mu-\nu\|_{TV}=\sum_{x\in\Omega:\,\mu(x)\ge\nu(x)}\bigl[\mu(x)-\nu(x)\bigr]$$
Notice that $\|\cdot\|_{TV}$ is half the $L^1$ norm, and therefore satisfies the triangle inequality:
$$\|\mu-\nu\|_{TV}\le\|\mu-\omega\|_{TV}+\|\omega-\nu\|_{TV}$$
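The three characterizations are easy to check numerically. A minimal sketch (the two distributions are arbitrary illustrations): it computes the TV distance by brute force over all subsets $A\subset\Omega$, by the half-$L^1$ formula of Proposition 4.2, and by the one-sided sum from the remark.

```python
import itertools
import numpy as np

mu = np.array([0.5, 0.3, 0.1, 0.1])      # arbitrary distribution on |Omega| = 4
nu = np.array([0.25, 0.25, 0.25, 0.25])  # uniform distribution

# Definition: max over all events A of |mu(A) - nu(A)|.
tv_def = max(abs(mu[list(A)].sum() - nu[list(A)].sum())
             for r in range(len(mu) + 1)
             for A in itertools.combinations(range(len(mu)), r))

tv_l1 = 0.5 * np.abs(mu - nu).sum()   # Proposition 4.2: half the L1 norm
tv_b = (mu - nu)[mu >= nu].sum()      # remark: sum over B = {x : mu(x) >= nu(x)}

assert np.isclose(tv_def, tv_l1) and np.isclose(tv_def, tv_b)
```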
Proposition 4.5. Let $\mu$ and $\nu$ be two probability distributions on $\Omega$. Then:
$$\|\mu-\nu\|_{TV}=\frac{1}{2}\sup_{f:\,\max_x|f(x)|\le1}\;\sum_{x\in\Omega}\bigl[f(x)\mu(x)-f(x)\nu(x)\bigr]$$
Proof. Since $|f(x)|\le1$, every admissible $f$ satisfies $\frac{1}{2}\sum_x f(x)\bigl[\mu(x)-\nu(x)\bigr]\le\frac{1}{2}\sum_x|\mu(x)-\nu(x)|=\|\mu-\nu\|_{TV}$, and the following function achieves the supremum:
$$f^*(x)=\begin{cases}1 & \mu(x)-\nu(x)\ge0\\ -1 & \mu(x)-\nu(x)<0\end{cases}$$
Indeed:
$$\frac{1}{2}\sum_{x\in\Omega}\bigl[f^*(x)\mu(x)-f^*(x)\nu(x)\bigr]=\frac{1}{2}\sum_{x:\,\mu(x)\ge\nu(x)}\bigl[\mu(x)-\nu(x)\bigr]+\frac{1}{2}\sum_{x:\,\mu(x)<\nu(x)}\bigl[\nu(x)-\mu(x)\bigr]$$
$$=\frac{1}{2}\|\mu-\nu\|_{TV}+\frac{1}{2}\|\mu-\nu\|_{TV}=\|\mu-\nu\|_{TV}$$
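To see Proposition 4.5 in action, a tiny sketch (same arbitrary distributions as before) comparing $f^*$'s value against randomly sampled admissible functions $f$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, 0.3, 0.1, 0.1])
nu = np.array([0.25, 0.25, 0.25, 0.25])
tv = 0.5 * np.abs(mu - nu).sum()

# f* takes value 1 where mu >= nu and -1 elsewhere, and attains the supremum.
f_star = np.where(mu - nu >= 0, 1.0, -1.0)
assert np.isclose(0.5 * (f_star * (mu - nu)).sum(), tv)

# No random f with |f| <= 1 should beat f*.
for _ in range(1000):
    f = rng.uniform(-1, 1, 4)
    assert 0.5 * (f * (mu - nu)).sum() <= tv + 1e-12
```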
Coupling & Total Variation
Definition. A coupling of two probability distributions $\mu,\nu$ is a pair of random variables $(X,Y)$, defined on a common probability space, s.t. $P(X=x)=\mu(x)$ and $P(Y=y)=\nu(y)$. Given a coupling $(X,Y)$ of $\mu,\nu$ one can define $q(x,y)=P(X=x,Y=y)$, the joint distribution of $(X,Y)$. Thus:
$$\mu(x)=\sum_{y\in\Omega}q(x,y),\qquad \nu(y)=\sum_{x\in\Omega}q(x,y)$$
Example. $\mu,\nu$ both represent a fair coin flip. We can build several couplings:
1. $X,Y$ independent: $\forall x,y\;P(X=x,Y=y)=\frac{1}{4}$, i.e. $q=\begin{pmatrix}1/4 & 1/4\\ 1/4 & 1/4\end{pmatrix}$, giving $P(X\ne Y)=\frac{1}{2}$.
2. $X=Y$: $\forall x\;P(X=Y=x)=\frac{1}{2}$, i.e. $q=\begin{pmatrix}1/2 & 0\\ 0 & 1/2\end{pmatrix}$, giving $P(X\ne Y)=0$.
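A small simulation sketch contrasting the two couplings (an illustration only; coin outcomes are encoded as 0/1): both have the correct fair-coin marginals, but only the second makes $P(X\ne Y)=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Coupling 1: X and Y are independent fair coins.
x1 = rng.integers(0, 2, n)
y1 = rng.integers(0, 2, n)

# Coupling 2: flip one fair coin and set X = Y.
x2 = rng.integers(0, 2, n)
y2 = x2

# Both couplings have fair-coin marginals ...
print(x1.mean(), y1.mean(), x2.mean(), y2.mean())  # all close to 0.5
# ... but very different probabilities of disagreement.
print((x1 != y1).mean())  # close to 1/2
print((x2 != y2).mean())  # exactly 0
```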
Proposition 4.7. Let $\mu$ and $\nu$ be two probability distributions on $\Omega$. Then:
$$\|\mu-\nu\|_{TV}=\inf_{(X,Y)}P(X\ne Y)$$
where the infimum is over all couplings $(X,Y)$ of $\mu$ and $\nu$.
Proof. To show $\|\mu-\nu\|_{TV}\le\inf_{(X,Y)}P(X\ne Y)$, note that for every coupling $(X,Y)$ and every $A\subset\Omega$:
$$\mu(A)-\nu(A)=P(X\in A)-P(Y\in A)\le P(X\in A,Y\notin A)\le P(X\ne Y)$$
Thus it suffices to find a coupling $(X,Y)$ s.t. $P(X\ne Y)=\|\mu-\nu\|_{TV}$.
Proof Cont. (Figure: $\mu$ and $\nu$ overlaid; region $III$ is the overlap $\min\{\mu,\nu\}$, while regions $I$ and $II$ are the parts where $\mu>\nu$ and $\nu>\mu$ respectively.)
Proof Cont. Define the coupling $(X,Y)$ as follows. With probability $p=1-\|\mu-\nu\|_{TV}$ take $X=Y$ according to the distribution $\gamma_{III}$. Otherwise take $X$ and $Y$ independently from $B=\{x\mid\mu(x)-\nu(x)>0\}$ and $B^C$ according to the distributions $\gamma_I$ and $\gamma_{II}$ respectively. Since $\gamma_I$ and $\gamma_{II}$ have disjoint supports, clearly:
$$P(X\ne Y)=\|\mu-\nu\|_{TV}$$
Proof Cont. All that is left is to define $\gamma_I,\gamma_{II},\gamma_{III}$:
$$\gamma_I(x)=\begin{cases}\dfrac{\mu(x)-\nu(x)}{\|\mu-\nu\|_{TV}} & \mu(x)-\nu(x)>0\\ 0 & \text{else}\end{cases}\qquad
\gamma_{II}(x)=\begin{cases}\dfrac{\nu(x)-\mu(x)}{\|\mu-\nu\|_{TV}} & \mu(x)-\nu(x)\le0\\ 0 & \text{else}\end{cases}$$
$$\gamma_{III}(x)=\frac{\min\{\mu(x),\nu(x)\}}{1-\|\mu-\nu\|_{TV}}$$
Note that $\mu=p\,\gamma_{III}+(1-p)\gamma_I$ and $\nu=p\,\gamma_{III}+(1-p)\gamma_{II}$, so the marginals are correct.
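A minimal Python sketch of this construction (the distributions are arbitrary illustrations): it builds $\gamma_I,\gamma_{II},\gamma_{III}$, samples from the coupling, and checks empirically that $P(X\ne Y)\approx\|\mu-\nu\|_{TV}$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, 0.3, 0.1, 0.1])
nu = np.array([0.25, 0.25, 0.25, 0.25])

tv = 0.5 * np.abs(mu - nu).sum()
p = 1 - tv                                    # probability of the X = Y branch

gamma3 = np.minimum(mu, nu) / p               # overlap, region III
gamma1 = np.where(mu > nu, mu - nu, 0) / tv   # excess of mu, region I
gamma2 = np.where(mu <= nu, nu - mu, 0) / tv  # excess of nu, region II

n = 200_000
xs, ys = np.empty(n, int), np.empty(n, int)
for i in range(n):
    if rng.random() < p:                      # X = Y ~ gamma_III
        xs[i] = ys[i] = rng.choice(4, p=gamma3)
    else:                                     # X ~ gamma_I, Y ~ gamma_II, independent
        xs[i] = rng.choice(4, p=gamma1)       # disjoint supports, so X != Y here
        ys[i] = rng.choice(4, p=gamma2)

print((xs != ys).mean(), tv)                  # the two numbers should be close
```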
The Convergence Theorem
Theorem 4.9. Suppose that $P$ is irreducible and aperiodic, with stationary distribution $\pi$. Then there exist $\alpha\in(0,1)$ and $C>0$ s.t.:
$$\forall t\quad \max_{x\in\Omega}\|P^t(x,\cdot)-\pi\|_{TV}\le C\alpha^t$$
Lemma (Prop. 1.7). If $P$ is irreducible and aperiodic, then $\exists r>0$ s.t. $\forall x,y\;P^r(x,y)>0$.
Proof: For each $x$ define $\mathcal{T}(x)=\{t\mid P^t(x,x)>0\}$; by aperiodicity $\gcd\mathcal{T}(x)=1$, and $\mathcal{T}(x)$ is closed under addition. From number theory: $\forall x\;\exists r_x$ s.t. $\forall r>r_x$, $r\in\mathcal{T}(x)$. From irreducibility, $\forall x,y\;\exists r_{x,y}<n$ (where $n=|\Omega|$) s.t. $P^{r_{x,y}}(x,y)>0$. Taking $r:=n+\max_{x\in\Omega}r_x$ ends the proof, since $P^r(x,y)\ge P^{r-r_{x,y}}(x,x)\,P^{r_{x,y}}(x,y)>0$ (note that $r-r_{x,y}>r_x$).
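For a concrete chain one can find such an $r$ by direct search. A minimal sketch, assuming an arbitrary 3-state irreducible, aperiodic example (the loop terminates precisely by the lemma):

```python
import numpy as np

# An arbitrary irreducible, aperiodic chain on 3 states
# (it has cycles of lengths 2 and 3, so gcd = 1).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

Pr = P.copy()
r = 1
while not (Pr > 0).all():   # smallest r with P^r(x, y) > 0 for all x, y
    Pr = Pr @ P
    r += 1
print(r)                    # here: r = 4
```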
Proof of Theorem 4.9. The last lemma gives the existence of $r$ s.t. $\forall x,y\;P^r(x,y)>0$. Let $\Pi$ be the $|\Omega|\times|\Omega|$ matrix each of whose rows is $\pi$. Since $P^r$ is entrywise positive and $\Omega$ is finite, $\exists\delta>0$ s.t. $\forall x,y\in\Omega$: $P^r(x,y)\ge\delta\pi(y)=\delta\Pi(x,y)$. Let $Q$ be the stochastic matrix derived from the equation:
$$P^r=(1-\theta)\Pi+\theta Q\qquad[\theta=1-\delta]$$
Clearly $P\Pi=\Pi P=\Pi$ (and likewise $Q\Pi=\Pi$, since $Q$ is stochastic). By induction one can see:
$$\forall k\quad P^{rk}=(1-\theta^k)\Pi+\theta^k Q^k$$
Proof of Induction. Case $k=1$ holds by definition. $k\to k+1$:
$$P^{r(k+1)}=P^{rk}P^r=\bigl[(1-\theta^k)\Pi+\theta^k Q^k\bigr]P^r=(1-\theta^k)\Pi+\theta^k Q^k\bigl[(1-\theta)\Pi+\theta Q\bigr]$$
$$=(1-\theta^k)\Pi+\theta^k(1-\theta)Q^k\Pi+\theta^{k+1}Q^{k+1}=(1-\theta^k)\Pi+\theta^k(1-\theta)\Pi+\theta^{k+1}Q^{k+1}$$
$$=(1-\theta^{k+1})\Pi+\theta^{k+1}Q^{k+1}$$
Proof of Theorem 4.9 Cont. The induction derives:
$$P^{rk+j}=P^{rk}P^j=(1-\theta^k)\Pi+\theta^k Q^kP^j$$
Therefore, $\forall j$:
$$P^{rk+j}-\Pi=\theta^k\bigl(Q^kP^j-\Pi\bigr)$$
Finally, $\forall x$: $\|P^{rk+j}(x,\cdot)-\pi\|_{TV}\le\theta^k$, since the TV distance between two distributions is at most 1. Writing $t=rk+j$ with $0\le j<r$ gives the theorem with $\alpha=\theta^{1/r}$ and $C=1/\theta$.
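A short numeric sketch of the theorem's conclusion, reusing the frog chain from the example (an illustration, not part of the book's proof): $\max_x\|P^t(x,\cdot)-\pi\|_{TV}$ decays geometrically, here at the exact rate $|1-p-q|$.

```python
import numpy as np

p, q = 0.3, 0.2
P = np.array([[1 - p, p], [q, 1 - q]])
pi = np.array([q / (p + q), p / (p + q)])

Pt = np.eye(2)
for t in range(1, 8):
    Pt = Pt @ P   # Pt = P^t
    d_t = max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(2))
    print(t, d_t, d_t / abs(1 - p - q) ** t)   # the last column stays constant
```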
Standardizing Distance from Stationarity
Definitions. Given a stochastic matrix $P$ with its stationary distribution $\pi$, we define:
$$d(t)=\max_{x\in\Omega}\|P^t(x,\cdot)-\pi\|_{TV}$$
$$\bar{d}(t)=\max_{x,y\in\Omega}\|P^t(x,\cdot)-P^t(y,\cdot)\|_{TV}$$
Lemma 4.11. For every stochastic matrix $P$ and its stationary distribution $\pi$:
$$d(t)\le\bar{d}(t)\le 2d(t)$$
Proof: The second inequality is immediate from the triangle inequality. For the first, note that by stationarity $\pi(A)=\sum_{y\in\Omega}\pi(y)P^t(y,A)$.
Proof Cont.
$$\|P^t(x,\cdot)-\pi\|_{TV}=\max_{A\subset\Omega}\bigl|P^t(x,A)-\pi(A)\bigr|=\max_{A\subset\Omega}\Bigl|\sum_{y\in\Omega}\pi(y)\bigl[P^t(x,A)-P^t(y,A)\bigr]\Bigr|$$
$$\le\max_{A\subset\Omega}\sum_{y\in\Omega}\pi(y)\bigl|P^t(x,A)-P^t(y,A)\bigr|\le\sum_{y\in\Omega}\pi(y)\max_{A\subset\Omega}\bigl|P^t(x,A)-P^t(y,A)\bigr|$$
$$=\sum_{y\in\Omega}\pi(y)\|P^t(x,\cdot)-P^t(y,\cdot)\|_{TV}\le\sum_{y\in\Omega}\pi(y)\,\bar{d}(t)=\bar{d}(t)$$
Observations.
$$d(t)=\max_{\mu}\|\mu P^t-\pi\|_{TV},\qquad \bar{d}(t)=\max_{\mu,\nu}\|\mu P^t-\nu P^t\|_{TV}$$
where the maxima are taken over all probability distributions $\mu,\nu$ on $\Omega$.
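A quick numeric check of Lemma 4.11 on an arbitrary example chain (a sketch, not from the book; the stationary distribution is extracted as the left eigenvector for eigenvalue 1):

```python
import numpy as np

def tv(a, b):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(a - b).sum()

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
w, v = np.linalg.eig(P.T)                      # left eigenvectors of P
pi = np.real(v[:, np.argmin(np.abs(w - 1))])   # eigenvector for eigenvalue 1
pi /= pi.sum()                                 # normalize to a distribution

Pt = np.eye(3)
for t in range(1, 10):
    Pt = Pt @ P
    d = max(tv(Pt[x], pi) for x in range(3))
    d_bar = max(tv(Pt[x], Pt[y]) for x in range(3) for y in range(3))
    assert d <= d_bar + 1e-12 <= 2 * d + 1e-12   # d(t) <= d_bar(t) <= 2 d(t)
```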
Lemma 4.12. The $\bar{d}$ function is submultiplicative, i.e. $\forall s,t\;\bar{d}(s+t)\le\bar{d}(s)\,\bar{d}(t)$.
Proof: Fix $x,y\in\Omega$ and let $(X_s,Y_s)$ be the optimal coupling of $P^s(x,\cdot)$ and $P^s(y,\cdot)$, i.e. one achieving the infimum of Proposition 4.7 (it exists since $\Omega$ is finite). Note that:
$$P^{t+s}(x,w)=(P^sP^t)(x,w)=\sum_{z\in\Omega}P^s(x,z)P^t(z,w)=E\bigl[P^t(X_s,w)\bigr]$$
The same argument gives $P^{t+s}(y,w)=E\bigl[P^t(Y_s,w)\bigr]$.
Proof Cont. Note:
$$P^{t+s}(x,w)-P^{t+s}(y,w)=E\bigl[P^t(X_s,w)-P^t(Y_s,w)\bigr]$$
Summing over all $w$ yields:
$$\|P^{t+s}(x,\cdot)-P^{t+s}(y,\cdot)\|_{TV}=\frac{1}{2}\sum_{w\in\Omega}\bigl|E\bigl[P^t(X_s,w)-P^t(Y_s,w)\bigr]\bigr|\le E\Bigl[\frac{1}{2}\sum_{w\in\Omega}\bigl|P^t(X_s,w)-P^t(Y_s,w)\bigr|\Bigr]$$
The inner sum is $\|P^t(X_s,\cdot)-P^t(Y_s,\cdot)\|_{TV}$, which vanishes when $X_s=Y_s$ and is at most $\bar{d}(t)$ otherwise, so:
$$\|P^{t+s}(x,\cdot)-P^{t+s}(y,\cdot)\|_{TV}\le\bar{d}(t)\,P(X_s\ne Y_s)\le\bar{d}(t)\,\bar{d}(s)$$
Remarks. From submultiplicativity (together with $\bar{d}(t)\le1$) we note that $\bar{d}(t)$ is non-increasing. Also, for every integer $c\ge1$:
$$d(ct)\le\bar{d}(ct)\le\bar{d}(t)^c$$
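And a numeric sanity check of Lemma 4.12, on the same arbitrary example chain as in the previous sketch:

```python
import numpy as np

def tv(a, b):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(a - b).sum()

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

def d_bar(t):
    """d_bar(t) = max over x, y of || P^t(x,.) - P^t(y,.) ||_TV."""
    Pt = np.linalg.matrix_power(P, t)
    n = P.shape[0]
    return max(tv(Pt[x], Pt[y]) for x in range(n) for y in range(n))

# Submultiplicativity: d_bar(s + t) <= d_bar(s) * d_bar(t).
for s in range(1, 5):
    for t in range(1, 5):
        assert d_bar(s + t) <= d_bar(s) * d_bar(t) + 1e-12
```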
Thank you for your attention!