Lecture 7: Dynamic sampling Dimension Reduction


1 Lecture 7: Dynamic sampling Dimension Reduction

2 Plan
Admin:
  Scribe?
  PSet 2 released later today, due next Wed
  Alex office hours: Tue 2:30-4:30
Plan:
  Dynamic streaming graph algorithms
  Section 2: Dimension Reduction & Sketching

3 Sub-Problem: dynamic sampling
Stream: general updates to a vector x ∈ {-1, 0, 1}^n
Goal: output i with probability |x_i| / Σ_j |x_j|

4 Dynamic Sampling
Goal: output i with probability |x_i| / Σ_j |x_j|
Let D = {i : x_i ≠ 0}
Intuition:
  Suppose |D| = 10. How can we sample i with x_i ≠ 0?
    Each x_i ≠ 0 is a 1/10-heavy hitter
    Use CountSketch ⇒ recover all of them; O(log n) space total
  Suppose |D| = 10√n
    Downsample: pick a random set I ⊂ [n] s.t. Pr[i ∈ I] = 1/√n
    Focus on the substream on i ∈ I only (ignore the rest)
    What is |D ∩ I|? In expectation: 10
    Use CountSketch on the downsampled stream I…
In general: prepare for all levels

5 Basic Sketch
Hash function g: [n] → [n]
Let h(i) = # tail zeros in g(i)
  Pr[h(i) = j] = 2^(-j-1) for j = 0..L-1, where L = log_2 n
Partition the stream into substreams I_0, I_1, …, I_L
  Substream I_j focuses on elements with h(i) = j
  E[|D ∩ I_j|] = |D| · 2^(-j-1)
Sketch: for each j = 0, …, L:
  Store CS_j: CountSketch for φ = 0.01
  Store DC_j: distinct-count sketch for approximation 1.1
    (F_2 would be sufficient here!)
  Both with success probability 1 - 1/n
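The level assignment h(i) = # tail zeros of g(i) can be sketched as follows; `tail_zeros` and the empirical check are illustrative, not from the lecture:

```python
import random

def tail_zeros(x, L):
    """h(i): number of trailing zero bits of g(i), capped at L-1."""
    z = 0
    while x % 2 == 0 and z < L - 1:
        z += 1
        x //= 2
    return z

# Empirically, Pr[h(i) = j] ~ 2^(-j-1): about half the values land in
# level 0, a quarter in level 1, and so on.
rng = random.Random(0)
L = 16
N = 100_000
counts = [0] * L
for _ in range(N):
    counts[tail_zeros(rng.randrange(1, 2**L), L)] += 1
```

With a uniform g, the counts decay geometrically by level, which is exactly what makes some level's substream have Θ(1) surviving elements.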

6 Estimation
Algorithm DynSampleBasic:
  Initialize:
    hash function g: [n] → [n]
    h(i) = # tail zeros in g(i)
    CountSketch sketches CS_j, j ∈ [L]
    DistinctCount sketches DC_j, j ∈ [L]
  Process(int i, real δ_i):
    Let j = h(i)
    Add (i, δ_i) to CS_j and DC_j
  Estimator:
    Find a substream I_j s.t. the DC_j output is in [1, 20]
    If no such j, FAIL
    Recover all i ∈ I_j with x_i ≠ 0 (using CS_j)
    Return one of them uniformly at random
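A toy rendition of DynSampleBasic: to keep it short, each level stores exact coordinates in a dict, standing in for the CS_j/DC_j sketches. It shows the level/recovery logic, not the space bound:

```python
import random
from collections import defaultdict

class DynSampleBasic:
    """Toy DynSampleBasic: exact per-level dictionaries stand in for the
    CountSketch CS_j and distinct-count DC_j sketches on the slide."""

    def __init__(self, n, seed=0):
        self.n = n
        self.L = max(1, n.bit_length())          # L ~ log2 n levels
        self.rng = random.Random(seed)
        self.g = {}                              # lazily materialized g:[n]->[n]
        self.levels = [defaultdict(float) for _ in range(self.L)]

    def _h(self, i):
        """Level = # tail zeros of g(i), so Pr[h(i)=j] ~ 2^(-j-1)."""
        if i not in self.g:
            self.g[i] = self.rng.randrange(1, self.n + 1)
        x, z = self.g[i], 0
        while x % 2 == 0 and z < self.L - 1:
            z += 1
            x //= 2
        return z

    def process(self, i, delta):
        self.levels[self._h(i)][i] += delta

    def sample(self):
        """Return a uniformly random i with x_i != 0, or None (FAIL)."""
        for level in self.levels:
            support = [i for i, v in level.items() if v != 0]
            if 1 <= len(support) <= 20:          # the DC_j in [1,20] test
                return self.rng.choice(support)
        return None
```

Note that an insertion followed by a matching deletion cancels, so the sampler only ever returns coordinates that are currently nonzero.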

7 Analysis
If |D| < 10: then |D ∩ I_j| ∈ [1, 10] for some j, so we do not FAIL
Suppose |D| ≥ 10:
  Let k be such that |D| ∈ [10·2^k, 10·2^(k+1))
  E[|D ∩ I_k|] = |D| · 2^(-k-1) ∈ [5, 10]
  Var[|D ∩ I_k|] ≤ |D| · 2^(-k-1) ≤ 10
  Chebyshev: |D ∩ I_k| deviates from its expectation by ≥ 4 with probability at most Var/16 ≤ 10/16 < 0.7
    (if it deviates by < 4, then |D ∩ I_k| ∈ [1, 14] ⊆ [1, 20], so level k passes the DC test)
  I.e., the probability of FAIL is at most 0.7
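The Chebyshev step above is just this arithmetic:

```python
# Chebyshev: Pr[|X - E[X]| >= t] <= Var[X] / t^2, with Var <= 10 and t = 4.
var_bound = 10
t = 4
p_fail = var_bound / t**2   # = 0.625 < 0.7
assert p_fail < 0.7
```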

8 Analysis (cont)
Let j be such that DC_j ∈ [1, 20]:
  All heavy hitters of substream I_j = D ∩ I_j
  CS_j will recover a heavy hitter, i.e., some i ∈ D ∩ I_j
By symmetry, conditioned on outputting some i, it is uniform over D
Randomness? We only used Chebyshev ⇒ a pairwise-independent g is OK!

9 Dynamic Sampling: overall
DynSampleBasic guarantee:
  FAIL: with probability ≤ 0.7
  Otherwise, output a uniformly random i ∈ D
    (modulo a negligible probability of CS/DC failing)
Reduce FAIL probability? DynSample-Full:
  Take k = O(log n) independent copies of DynSampleBasic
  At least one does not FAIL with probability ≥ 1 - 0.7^k ≥ 1 - 1/n
Space: O(log^4 n) words, from:
  k = O(log n) repetitions
  O(log n) substreams per repetition
  O(log^2 n) for each CS_j, DC_j
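The repetition count in DynSample-Full comes from this arithmetic (n = 1024 is an arbitrary example value):

```python
import math

n = 1024
p_basic = 0.7                                        # FAIL prob. of one copy
k = math.ceil(math.log(n) / math.log(1 / p_basic))   # k = O(log n) copies
assert p_basic ** k <= 1 / n                         # all copies FAIL w.p. <= 1/n
```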

10 Back to Dynamic Graphs
Graph G with edges inserted/deleted
Define node-edge incidence vectors:
  For node v, vector x_v ∈ R^p where p = (n choose 2)
  For j > v: x_v(v,j) = +1 if edge (v,j) exists
  For j < v: x_v(j,v) = -1 if edge (j,v) exists
Idea: use DynSample-Full to sample an edge from each vertex v, then collapse edges. How to iterate?
Property: for a set Q of nodes, consider Σ_{v ∈ Q} x_v
  Claim: it is non-zero in coordinate (i,j) iff edge (i,j) crosses from Q to outside (i.e., |Q ∩ {i,j}| = 1)
  (For an edge (i,j) with both endpoints in Q, the +1 from x_i cancels the -1 from x_j.)
⇒ the sketch suffices to sample, for any set Q, an edge leaving Q!
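The cancellation claim can be checked directly on a small example (names like `incidence` are illustrative, not from the lecture):

```python
# Summing the node-edge incidence vectors x_v over v in Q leaves exactly
# the edges crossing the cut (Q, V \ Q): internal edges cancel.
from collections import defaultdict

def incidence(v, edges):
    """x_v over coordinates (i,j), i < j: +1 if v = i, -1 if v = j."""
    x = defaultdict(int)
    for (i, j) in edges:
        if v == i:
            x[(i, j)] += 1
        elif v == j:
            x[(i, j)] -= 1
    return x

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
Q = {1, 2}
total = defaultdict(int)
for v in Q:
    for coord, val in incidence(v, edges).items():
        total[coord] += val

# Edge (1,2) lies inside Q, so its +1 and -1 cancel; (3,4) never appears;
# only the cut edges (2,3) and (1,3) survive in the support.
support = {c for c, v in total.items() if v != 0}
```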

11 Dynamic Connectivity
Recall: x_v ∈ R^p, p = (n choose 2); for j > v: x_v(v,j) = +1 if edge (v,j) exists; for j < v: x_v(j,v) = -1 if edge (j,v) exists
Sketching algorithm: DynSample-Full for each x_v
Check connectivity:
  Sample an edge from each node v
  Contract all sampled edges ⇒ the graph is partitioned into components Q_1, …, Q_l (each connected)
  Iterate on the components Q_1, …, Q_l
How many iterations? O(log n): each iteration reduces the number of components by a factor ≥ 2
Issue: the iterations are not independent!
  Fix: use a fresh DynSample-Full for each of the O(log n) iterations

12 A little history
[Ahn-Guha-McGregor'12]: the above streaming algorithm
  Overall O(n · log^4 n) space
[Kapron-King-Mountjoy'13]: data structure for maintaining graph connectivity under edge inserts/deletes
  First algorithm with (log n)^O(1) time per update/connectivity query!
  Open since the '80s

13 Section 2: Dimension Reduction & Sketching

14 Why? Application: Nearest Neighbor Search in high dimensions
Preprocess: a set D of points
Query: given a query point q, report a point p ∈ D with the smallest distance to q

15 Motivation
Generic setup:
  Points model objects (e.g., images)
  Distance models (dis)similarity measure
Application areas:
  machine learning: k-NN rule
  speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distance can be: Euclidean, Hamming
[Figure: two binary strings compared coordinate-wise under Hamming distance]

16 Low-dimensional: easy
Compute the Voronoi diagram
Given query q, perform point location
Performance:
  Space: O(n)
  Query time: O(log n)

17 High-dimensional case
All exact algorithms degrade rapidly with the dimension d

Algorithm                  | Query time    | Space
Full indexing              | O(d · log n)  | n^O(d) (Voronoi diagram size)
No indexing (linear scan)  | O(n · d)      | O(n · d)

18 Dimension Reduction
Reduce high dimension?!
  "Flatten" dimension d into dimension k ≪ d
Not possible in general: packing bound
But possible for a fixed subset of R^d
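One standard way to "flatten" a fixed point set, sketched under the assumption that the lecture continues with Johnson-Lindenstrauss-style maps: a random linear map R^d → R^k, which approximately preserves pairwise distances with high probability. The function name is illustrative:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Map each d-dimensional point to k dimensions via a random
    Gaussian matrix scaled by 1/sqrt(k) (JL-style sketch)."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    return [[sum(G[r][c] * p[c] for c in range(d)) for r in range(k)]
            for p in points]
```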

