COMS E6998-9 F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results

Presentation transcript:

1 COMS E6998-9 F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results

2 Administrivia, Plan
Website moved: sublinear.wikischolars.columbia.edu/main
Piazza: sign-up!
Plan:
Median trick, Chernoff bound (from Tue)
Distinct Elements Count
Impossibility Results

3 Last Lecture
Counting frequency [Figure: table of IPs and their frequencies]
Morris Algorithm: Initialize $X = 0$. On increment, $X = X + 1$ with prob. $1/2^X$. Estimator: $2^X - 1$.
Morris: $\mathrm{Var} = O(n^2)$. Failure prob: 0.1.
Morris+: Average of $k = O(1/\epsilon^2)$ copies, $\mathrm{Var} = O(n^2/k)$, use Chebyshev for $1+\epsilon$ approx.
Morris++: Median of $m = O(\log\frac{1}{\delta})$ copies, use Chernoff. Failure prob: $\delta$.
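The recap above can be sketched in a few lines of Python. This is only an illustration: the names `MorrisCounter` and `morris_plus` are made up here, and Python's `random` module stands in for the algorithm's coin flips.

```python
import random


class MorrisCounter:
    """Approximate counter: stores only X, incremented with probability 1/2^X."""

    def __init__(self, rng=None):
        self.x = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # random() is in [0, 1), so the first increment (X = 0) always succeeds.
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # Estimator from the slide: 2^X - 1.
        return 2 ** self.x - 1


def morris_plus(n, k, seed=0):
    """Morris+: average the estimates of k independent counters after n increments."""
    rng = random.Random(seed)
    counters = [MorrisCounter(rng) for _ in range(k)]
    for _ in range(n):
        for c in counters:
            c.increment()
    return sum(c.estimate() for c in counters) / k
```

Calling `morris_plus(1000, 400)` should return a value near 1000; individual counters fluctuate much more, which is why the averaging step is needed.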

4 "Median trick"
Chernoff/Hoeffding bounds:
$X_1, X_2, \ldots, X_m$ are independent r.v. in $\{0,1\}$, $\mu = E[\sum_i X_i]$, $\epsilon \in [0,1]$:
$\Pr[|\sum_i X_i - \mu| > \epsilon\mu] \le 2e^{-\epsilon^2\mu/3}$
Algorithm $A$: output $\in$ correct range with 90% probability
Algorithm $A^*$: output $\in$ correct range with $1-\delta$ probability
Median trick:
Repeat $A$ for $m = O(\log\frac{1}{\delta})$ times
Take median of the answers

5 Using Chernoff for Median trick
Chernoff: $\Pr[|\sum_i X_i - \mu| > \epsilon\mu] \le 2e^{-\epsilon^2\mu/3}$
Define $X_i = 1$ iff the $i$-th copy of $A$ is correct
$E[X_i] = 0.9$ ($A$ is correct with 90% prob.), so $\mu = 0.9m$
New alg $A^*$ is correct when $\sum_i X_i > 0.5m$
Use Chernoff to bound:
$\Pr[|\sum_i X_i - \mu| > 0.4m] = \Pr[|\sum_i X_i - \mu| > \frac{0.4}{0.9}\mu] \le e^{-c \cdot 0.9m} < \delta$ for some constant $c > 0$ and $m = O(\log\frac{1}{\delta})$
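A minimal sketch of the median trick in Python follows. The constant 16.875 is not from the lecture, which only states $m = O(\log\frac{1}{\delta})$; it comes from plugging $\epsilon = 0.4/0.9$ and $\mu = 0.9m$ into the bound $2e^{-\epsilon^2\mu/3}$ and solving for $m$. The helper `make_noisy_a` is a made-up toy algorithm that lands in the correct range with probability 0.9.

```python
import math
import random
import statistics


def repetitions(delta):
    """Number of copies m with 2*exp(-(4/9)^2 * 0.9*m / 3) <= delta (assumed constant)."""
    return max(1, math.ceil(16.875 * math.log(2.0 / delta)))


def median_trick(run_a, delta):
    """Run the base algorithm m times independently and return the median answer."""
    m = repetitions(delta)
    return statistics.median([run_a() for _ in range(m)])


def make_noisy_a(truth, rng):
    """Toy base algorithm: within truth +/- 1 with prob. 0.9, wildly off otherwise."""
    def run_a():
        if rng.random() < 0.9:
            return truth + rng.uniform(-1.0, 1.0)
        return rng.choice([0.0, 1e9])
    return run_a
```

If more than half of the copies land in the correct range, the median does too, no matter how bad the remaining answers are. That is why the analysis only needs to count correct copies.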

6 Problem: Distinct Elements
Streaming elements from $[n]$
Approximate the number of elements with non-zero freq.
Length of stream = $m$
Space required? $O(n)$ bits, or $O(m \cdot \log n)$ bits
[Figure: table of IPs and their frequencies]

7 Algorithm for approximating DE
Main tool: hash function $h: [n] \to [0,1]$, with $h(i)$ random in $[0,1]$
Algorithm [Flajolet-Martin 1985]:
Init $z = 1$
When see element $i$: $z = \min\{z, h(i)\}$
Estimator: $\frac{1}{z} - 1$
Where from? Will return later...
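The algorithm above fits in a few lines of Python. One assumption to flag: a salted BLAKE2b digest scaled to $[0,1)$ stands in for the idealized random hash $h$; the lecture does not prescribe any concrete hash. The names `h` and `distinct_estimate` are illustrative.

```python
import hashlib


def h(i, salt=0):
    """Stand-in for a random hash [n] -> [0, 1): an 8-byte BLAKE2b digest, scaled."""
    digest = hashlib.blake2b(f"{salt}:{i}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2.0 ** 64


def distinct_estimate(stream, salt=0):
    """Track the minimum hash z over the stream and return 1/z - 1."""
    z = 1.0
    for i in stream:
        z = min(z, h(i, salt))
    # z == 0 would need an all-zero digest; ignored in this sketch.
    return 1.0 / z - 1.0
```

Because repeated elements hash to the same value, duplicates never change `z`, so the state depends only on the set of distinct elements seen.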

8 Analysis
Algorithm DE: Init: $z = 1$. When see element $i$: $z = \min\{z, h(i)\}$. Estimator: $\frac{1}{z} - 1$.
Let $d$ = count of distinct elements.
Claim 1: $E[z] = \frac{1}{d+1}$
Proof:
$z$ = minimum of $d$ random numbers in $[0,1]$
Pick another random number $a \in [0,1]$
What's the probability that $a < z$?
1) Given $z$, it is exactly $z$
2) It is the probability that $a$ is the smallest among $d+1$ reals: $\frac{1}{d+1}$
Hence $E[z] = \Pr[a < z] = \frac{1}{d+1}$.
[Figure: hash values $h(5)$, $h(7)$, $h(2)$ marked on $[0,1]$, with spacing $\approx 1/(d+1)$]
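Claim 1 can also be checked by direct integration, as an alternative to the symmetry argument. This assumes, as the slide does, that the $d$ hash values are independent and uniform on $[0,1]$:

```latex
\begin{align*}
\Pr[z > t] &= \Pr[\text{all } d \text{ hashes exceed } t] = (1-t)^d, \qquad t \in [0,1], \\
E[z] &= \int_0^1 \Pr[z > t]\,dt = \int_0^1 (1-t)^d\,dt = \frac{1}{d+1}.
\end{align*}
```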

9 Analysis 2
Algorithm DE: Init: $z = 1$. When see element $i$: $z = \min\{z, h(i)\}$. Estimator: $\frac{1}{z} - 1$.
Need variance too...
Can prove $\mathrm{var}[z] \le 2/d^2$
How do we get $1+\epsilon$ approximation though?
We can take $z = \frac{1}{k}(z_1 + z_2 + \ldots + z_k)$ for independent $z_1, \ldots, z_k$
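A small simulation illustrates the effect of averaging. This is only a sketch: instead of hashing a stream, it draws each $z_j$ directly as the minimum of $d$ independent uniforms, which is what an idealized random hash would produce. The function names are made up.

```python
import random


def min_of_uniforms(d, rng):
    """z for one idealized hash function: the minimum of d independent uniforms."""
    return min(rng.random() for _ in range(d))


def averaged_estimate(d, k, seed=0):
    """Average k independent copies of z, then apply the estimator 1/z - 1."""
    rng = random.Random(seed)
    z_bar = sum(min_of_uniforms(d, rng) for _ in range(k)) / k
    return 1.0 / z_bar - 1.0
```

With `d = 1000` and `k = 400` the result should land close to 1000, whereas a single copy (`k = 1`) is routinely off by a large constant factor.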

10 Alternative: Bottom-k
Algorithm DE: Init: $z = 1$. When see element $i$: $z = \min\{z, h(i)\}$. Estimator: $\frac{1}{z} - 1$.
Bottom-k alg. [BJKS'02]:
Init $(z_1, z_2, \ldots, z_k) = 1$
Keep the $k$ smallest hashes seen, $z_1 \le z_2 \le \ldots \le z_k$
Estimator: $\hat{d} = k / z_k$
Proof: will prove
Probability that $\hat{d} > (1+\epsilon)d$ is at most 0.05
Probability that $\hat{d} < (1-\epsilon)d$ is at most 0.05
Overall only 0.1 probability that $\hat{d}$ is outside the correct range
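A possible Python rendering of Bottom-k is sketched below. Two choices here are assumptions rather than part of the slide: a salted BLAKE2b digest plays the role of the random hash $h$, and when fewer than $k$ distinct hashes have been seen the sketch returns their exact count instead of $k/z_k$.

```python
import hashlib
import heapq


def h(i, salt=0):
    """Stand-in for a random hash [n] -> [0, 1): an 8-byte BLAKE2b digest, scaled."""
    digest = hashlib.blake2b(f"{salt}:{i}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2.0 ** 64


def bottom_k_estimate(stream, k, salt=0):
    """Keep the k smallest distinct hash values; estimate d as k / z_k."""
    heap = []     # negated hashes, so heap[0] is minus the largest kept hash (z_k)
    kept = set()  # hash values currently kept, used to skip repeated elements
    for i in stream:
        v = h(i, salt)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)  # drops the old largest hash
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    return k / -heap[0]
```

The state is $k$ hash values, so for $k = O(1/\epsilon^2)$ the space is $O(1/\epsilon^2)$ hashes rather than the $k$ independent hash functions needed when averaging minima.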

11 Analysis for Bottom-k
Algorithm Bottom-k: Init: $z_1, \ldots, z_k = 1$. Keep the $k$ smallest hashes seen using $z_1, \ldots, z_k$. Estimator: $\hat{d} = k / z_k$.
Compute: $\Pr[\hat{d} > (1+\epsilon)d]$
Suppose we see $\{1, \ldots, d\}$
Define $X_i = 1$ iff $h(i) < \frac{k}{(1+\epsilon)d}$
Then: $\hat{d} > (1+\epsilon)d$ iff $\sum_i X_i > k$
We have:
$E[X_i] = \frac{k}{(1+\epsilon)d}$ (requires $d > k$)
$E[\sum_i X_i] = d \cdot E[X_i] = \frac{k}{1+\epsilon}$
$\mathrm{var}[\sum_i X_i] = d \cdot \mathrm{var}[X_i] \le d \cdot E[X_1^2] \le \frac{k}{1+\epsilon} \le k$
By Chebyshev:
$\Pr[|\sum_i X_i - \frac{k}{1+\epsilon}| > \sqrt{20k}] \le 0.05$
or: $\Pr[\sum_i X_i > \frac{k}{1+\epsilon} + \sqrt{20k}] \le 0.05$
This event is implied by $\sum_i X_i > k$ for $k = \Omega(1/\epsilon^2)$
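The last step is stated without its arithmetic. One way to fill it in is shown below, assuming $\epsilon \in (0,1]$; the constant 80 is a choice made here for concreteness, since the slide only states $k = \Omega(1/\epsilon^2)$.

```latex
\begin{align*}
k - \frac{k}{1+\epsilon} = \frac{\epsilon k}{1+\epsilon} \ge \frac{\epsilon k}{2}
  &\ge \sqrt{20k}
  \quad\Longleftrightarrow\quad k \ge \frac{80}{\epsilon^2}, \\
\text{so for } k \ge \frac{80}{\epsilon^2}: \qquad
\Pr\Big[\sum_i X_i > k\Big]
  &\le \Pr\Big[\sum_i X_i > \frac{k}{1+\epsilon} + \sqrt{20k}\Big] \le 0.05 .
\end{align*}
```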

12 Hash functions in Streaming
We used $h: [n] \to [0,1]$
Issue 1: reals?
Issue 2: how do we store it?
Issue 1: OK with $h: [n] \to \{0, \frac{1}{M}, \frac{2}{M}, \frac{3}{M}, \ldots, 1\}$ for $M \gg n^3$
Probability that $d \le n$ random numbers collide: at most $1/n$
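The collision bound follows from a union bound over pairs. A short derivation, assuming the $d$ hash values are independent and uniform over the $M+1$ grid points and $M \ge n^3$:

```latex
\begin{align*}
\Pr[\text{some pair collides}]
  \le \binom{d}{2} \cdot \frac{1}{M+1}
  \le \frac{n^2}{2M}
  \le \frac{1}{2n}
  \le \frac{1}{n}.
\end{align*}
```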

13 Issue 2: bounded randomness
Pairwise independent hash functions
Definition: $h: [n] \to \{1, 2, \ldots, M\}$ s.t. for all $i \ne j$ and $a, b \in [M]$:
$\Pr[h(i) = a \wedge h(j) = b] = 1/M^2$ (i.e., like random on pairs)
Such a hash function is enough: variance cares only about pairs!
We defined $X_i = 1$ iff $h(i) < \ldots$
And computed $\mathrm{var}[\sum_i X_i] = E[(\sum_i X_i)^2] - (E[\sum_i X_i])^2 = E[X_1 X_1 + X_1 X_2 + \ldots] - (E[\sum_i X_i])^2$,
which is the same for fully random $h$ and pairwise independent $h$

14 Pairwise-Independent: example
Definition: $h: [n] \to \{0, 1, \ldots, M-1\}$ s.t. for all $i \ne j$ and $a, b \in \{0, 1, \ldots, M-1\}$:
$\Pr[h(i) = a \wedge h(j) = b] = 1/M^2$
(A) construction:
Suppose $M$ is prime
Pick $p, q \in \{0, 1, \ldots, M-1\}$ uniformly at random
$h(i) = pi + q \pmod{M}$
Space: only $O(\log M) = O(\log n)$ bits
Proof of correctness:
$h(i) = a$ and $h(j) = b$: a system of 2 equations in 2 unknowns $(p, q)$
Exactly one pair $(p, q)$ satisfies it
Probability it is chosen: exactly $1/M^2$
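For a small prime the counting argument can be verified exhaustively. The sketch below uses made-up helper names and assumes $i$ and $j$ are distinct modulo $M$, which holds whenever $n \le M$. It enumerates all $M^2$ choices of $(p, q)$ and tallies the resulting pairs $(h(i), h(j))$.

```python
from collections import Counter
from itertools import product


def make_hash(p, q, M):
    """The hash from the slide: h(i) = p*i + q (mod M), with M prime."""
    return lambda i: (p * i + q) % M


def pair_counts(i, j, M):
    """Over all M^2 choices of (p, q), count how often (h(i), h(j)) equals each (a, b)."""
    counts = Counter()
    for p, q in product(range(M), repeat=2):
        hfn = make_hash(p, q, M)
        counts[(hfn(i), hfn(j))] += 1
    return counts
```

For `M = 7` and any `i != j` in `range(7)`, `pair_counts(i, j, 7)` has 49 keys, each with count 1. So each target pair $(a, b)$ is hit by exactly one of the $M^2$ equally likely choices of $(p, q)$.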

15 Impossibility Results
Relaxations:
Approximation
Randomization
Need both for space $\ll \min\{n, m\}$

16 Deterministic Exact Won't Work
Suppose algorithm $A$, estimator $R$ uses space $s \ll n, m$
We build the following stream:
Let vector $x \in \{0,1\}^n$
$i$ in stream iff $x_i = 1$
Run $A$ on it and let $\sigma$ be the memory content
[Figure: starting from $\sigma$, feed element 1: $\hat{d}$ didn't change $\Rightarrow x_1 = 1$; feed element 2: $\hat{d}$ increased $\Rightarrow x_2 = 0$. Example stream: 1 3 5 6 7 8 9 10]

17 Deterministic Exact Won't Work
Using $\sigma$, can recover the entire $x$!
"$\sigma$ = encoding of a string $x$ of length $n$"
But $\sigma$ has only $s \ll n$ bits!
Can think of $A: \{0,1\}^n \to \{0,1\}^s$
[Figure: starting from $\sigma$, feed element 1: $\hat{d}$ didn't change $\Rightarrow x_1 = 1$; feed element 2: $\hat{d}$ increased $\Rightarrow x_2 = 0$]

18 Deterministic Exact Won't Work
Using $\sigma$, can recover the entire $x$
"$\sigma$ = encoding of a string $x$ of length $n$"
But $\sigma$ has only $s \ll n$ bits!
Can think of $A: \{0,1\}^n \to \{0,1\}^s$
Must be injective
Otherwise, suppose $A(x) = A(x') = \sigma$ for some $x \ne x'$
The recovery implies $x = x'$, a contradiction
Hence $s \ge n$
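The recovery step can be made concrete with a toy deterministic exact algorithm. The class below is hypothetical and simply stores the set of seen elements as its memory $\sigma$; it is shown only to illustrate how probing $\sigma$ with each $i \in [n]$ reveals $x$, not as an algorithm from the lecture.

```python
class ExactDistinct:
    """Toy deterministic exact algorithm A: memory sigma is the set of seen elements."""

    def __init__(self, state=frozenset()):
        self.state = state

    def process(self, i):
        # Returns the new memory content; the original sigma is left untouched.
        return ExactDistinct(self.state | {i})

    def estimate(self):
        # Estimator R: the exact number of distinct elements.
        return len(self.state)


def recover(sigma, n):
    """Recover x from sigma alone: x_i = 1 iff feeding i leaves the count unchanged."""
    base = sigma.estimate()
    return [1 if sigma.process(i).estimate() == base else 0 for i in range(1, n + 1)]
```

`recover` never looks inside `sigma.state`; it only uses `process` and `estimate`. The same procedure would therefore work against any deterministic exact algorithm, which is what forces $s \ge n$.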

19 Deterministic Approx Won't Work Either
Similar: use $A$ to compress $x$ from a code
Code: set $T \subset \{0,1\}^n$ s.t. $|y \setminus x| \ge n/6$ for all distinct $x, y \in T$, with $|T| \ge 2^{\Omega(n)}$
Use $A$ to encode an input $x$ into $\sigma = A(x)$, with $\hat{d} = R(\sigma)$
For each $y \in T$ check whether $x = y$:
Append $y$ to get $\sigma' = A(x + y)$ and $\hat{d}' = R(\sigma')$
If $\hat{d}' > 1.01\,\hat{d}$, then $x \ne y$
By injectivity of $A$ on $T$: $2^s \ge |T|$, or $s = \Omega(n)$
[Figure: example streams for $x$ and $y$, with $\sigma = A(x)$, $\hat{d} = R(\sigma)$ and $\sigma' = A(x + y)$, $\hat{d}' = R(\sigma')$]

20 Concluding Remarks
Median trick + Chernoff
Distinct Elements:
Can also store hashes $h(i)$ approximately (store the number of leading zeros): $O(\log\log n)$ bits per hash value
Plus other bells and whistles: HyperLogLog
Impossibility results:
Can also prove randomized, exact won't work

