Lecture 7: Dynamic sampling Dimension Reduction

Plan
  Admin:
    PSet 2 released later today, due next Wed
    Alex office hours: Tue 2:30-4:30
  Plan:
    Dynamic streaming graph algorithms
    S2: Dimension Reduction & Sketching
  Scriber?

Sub-Problem: dynamic sampling
  Stream: general updates to a vector $x \in \{-1, 0, 1\}^n$
  Goal: output $i$ with probability $|x_i| / \sum_j |x_j|$

Dynamic Sampling
  Goal: output $i$ with probability $|x_i| / \sum_j |x_j|$
  Let $D = \{i \text{ s.t. } x_i \neq 0\}$
  Intuition:
    Suppose $|D| = 10$
      How can we sample $i$ with $x_i \neq 0$?
      Each $x_i \neq 0$ is a 1/10-heavy hitter
      Use CountSketch ⇒ recover all of them
      $O(\log n)$ space total
    Suppose $|D| = 10\sqrt{n}$
      Downsample: pick a random set $I \subset [n]$ s.t. $\Pr[i \in I] = 1/\sqrt{n}$
      Focus on the substream on $i \in I$ only (ignore the rest)
      What is $|D \cap I|$? In expectation, 10
      Use CountSketch on the downsampled stream $I$
    In general: prepare for all levels

Basic Sketch
  Hash function $g: [n] \to [n]$
  Let $h(i)$ = # tail zeros in $g(i)$
    $\Pr[h(i) = j] = 2^{-j-1}$ for $j = 0, \ldots, L-1$ and $L = \log_2 n$
  Partition the stream into substreams $I_0, I_1, \ldots, I_L$
    Substream $I_j$ focuses on elements with $h(i) = j$
    $E[|D \cap I_j|] = |D| \cdot 2^{-j-1}$
  Sketch: for each $j = 0, \ldots, L$:
    Store $CS_j$: CountSketch for $\phi = 0.01$
    Store $DC_j$: distinct count sketch for approx = 1.1
      ($F_2$ would be sufficient here!)
    Both for success probability $1 - 1/n$
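The level assignment above can be checked empirically. The sketch below is only an illustration: it assumes $n$ is a power of two and, in place of a real hash function $g$, draws an independent uniform value per element. The names `trailing_zeros` and `level_histogram` are hypothetical helpers, not part of the lecture.

```python
import random
from collections import Counter


def trailing_zeros(value: int, max_level: int) -> int:
    """Number of trailing zero bits of value, capped at max_level."""
    if value == 0:
        return max_level
    # value & -value isolates the lowest set bit.
    return min((value & -value).bit_length() - 1, max_level)


def level_histogram(n: int, seed: int = 0) -> Counter:
    """Count how many of n elements land on each level h(i)."""
    rng = random.Random(seed)
    max_level = n.bit_length() - 1  # L = log2(n), assuming n is a power of two
    # Assumption: a fresh uniform draw stands in for the hash value g(i).
    return Counter(
        trailing_zeros(rng.randrange(n), max_level) for _ in range(n)
    )


n = 1 << 16
hist = level_histogram(n)
for j in sorted(hist):
    # The observed fraction should be close to 2^(-j-1).
    print(j, hist[j] / n, 2.0 ** (-j - 1))
```

Roughly half of the elements land on level 0, a quarter on level 1, and so on, which is what makes some level hold a small, recoverable number of non-zero coordinates.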

Estimation
  Find a substream $I_j$ s.t. the $DC_j$ output $\in [1, 20]$
    If no such stream, then FAIL
  Recover all $i \in I_j$ with $x_i \neq 0$ (using $CS_j$)
  Pick any of them at random

  Algorithm DynSampleBasic:
    Initialize:
      hash function $g: [n] \to [n]$
      $h(i)$ = # tail zeros in $g(i)$
      CountSketch sketches $CS_j$, $j \in [L]$
      DistinctCount sketches $DC_j$, $j \in [L]$
    Process(int $i$, real $\delta_i$):
      Let $j = h(i)$
      Add $(i, \delta_i)$ to $CS_j$ and $DC_j$
    Estimator:
      Let $j$ be s.t. $DC_j \in [1, 20]$
      If no such $j$, FAIL
      $i$ = random heavy hitter from $CS_j$
      Return $i$
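The control flow of DynSampleBasic can be sketched in Python as follows. This is not a small-space implementation. As a loudly stated assumption, an exact per-level dictionary stands in for both $CS_j$ and $DC_j$, so only the level structure and the estimator logic are illustrated. The hash $g$ is a hypothetical linear hash modulo a prime, and $n$ is assumed to be a power of two smaller than that prime.

```python
import random
from collections import defaultdict


class DynSampleBasic:
    """Level structure of DynSampleBasic with exact stand-in 'sketches'."""

    def __init__(self, n: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.n = n
        self.levels = max(1, n.bit_length() - 1)  # L ~ log2(n)
        # Assumption: (a*i + b) mod p mod n stands in for g: [n] -> [n].
        self.p = 2_147_483_647
        self.a = self.rng.randrange(1, self.p)
        self.b = self.rng.randrange(self.p)
        # Assumption: one exact dictionary per level replaces CS_j and DC_j.
        self.sub = [defaultdict(float) for _ in range(self.levels + 1)]

    def h(self, i: int) -> int:
        """Number of tail zeros in g(i), capped at the top level."""
        g = ((self.a * i + self.b) % self.p) % self.n
        if g == 0:
            return self.levels
        return min((g & -g).bit_length() - 1, self.levels)

    def process(self, i: int, delta: float) -> None:
        """Apply the update (i, delta) to the substream that owns i."""
        level = self.sub[self.h(i)]
        level[i] += delta
        if level[i] == 0:
            del level[i]  # coordinate returned to zero

    def sample(self):
        """Return a random non-zero coordinate, or None for FAIL."""
        for level in self.sub:
            if 1 <= len(level) <= 20:  # plays the role of DC_j in [1, 20]
                return self.rng.choice(sorted(level))
        return None


sampler = DynSampleBasic(1024, seed=1)
sampler.process(5, +1)
sampler.process(9, +1)
sampler.process(5, -1)   # deletion: coordinate 5 goes back to zero
print(sampler.sample())  # only coordinate 9 is non-zero
```

Replacing the dictionaries with real CountSketch and distinct-count sketches is what brings the space down to polylogarithmic; the estimator logic stays the same.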

Analysis
  If $|D| < 10$ then $|D \cap I_j| \in [1, 10]$ for some $j$
  Suppose $|D| \geq 10$
    Let $k$ be such that $|D| \in [10 \cdot 2^k, 10 \cdot 2^{k+1}]$
    $E[|D \cap I_k|] = |D| \cdot 2^{-k-1} \in [5, 10]$
    $Var[|D \cap I_k|] \leq |D| \cdot 2^{-k-1} \leq 10$
    Chebyshev: $|D \cap I_k|$ deviates from its expectation by $4 > \sqrt{1.5 \cdot Var}$ with probability at most $1/1.5 < 0.7$
    Otherwise $|D \cap I_k|$ stays within 4 of a value in $[5, 10]$, so it is at least 1 and at most 14, and level $k$ passes the $[1, 20]$ test
    I.e., the probability of FAIL is at most 0.7
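As a worked check of the Chebyshev step, applying the inequality directly with deviation 4 and the variance bound of 10 gives a slightly tighter constant than the one quoted on the slide:

```latex
\[
\Pr\Bigl[\,\bigl|\,|D \cap I_k| - E[|D \cap I_k|]\,\bigr| \ge 4\,\Bigr]
\;\le\; \frac{Var[|D \cap I_k|]}{4^2}
\;\le\; \frac{10}{16} \;=\; 0.625 \;<\; 0.7 .
\]
```

Either way the failure probability of a single DynSampleBasic is bounded by a constant below 1, which is all the later amplification step needs.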

Analysis (cont)
  Let $j$ be a level with $DC_j \in [1, 20]$
  All heavy hitters = $D \cap I_j$
    (the substream has at most about 20 non-zero entries, each of magnitude 1, so every one of them exceeds the $\phi = 0.01$ threshold)
  $CS_j$ will recover a heavy hitter, i.e., $i \in D \cap I_j$
  By symmetry, once we output some $i$, it is random over $D$
  Randomness?
    We only used Chebyshev ⇒ a pairwise independent $g$ is OK!

Dynamic Sampling: overall
  DynSampleBasic guarantee:
    FAIL: with probability $\leq 0.7$
    Otherwise, output a random $i \in D$
    Modulo a negligible probability of $CS$/$DC$ failing
  Reduce FAIL probability?
    DynSample-Full: take $k = O(\log n)$ independent copies of DynSampleBasic
    At least one will not FAIL with probability at least $1 - 0.7^k \geq 1 - 1/n$
  Space: $O(\log^4 n)$ words for:
    $k = O(\log n)$ repetitions
    $O(\log n)$ substreams
    $O(\log^2 n)$ for each $CS_j$, $DC_j$
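To make the $k = O(\log n)$ repetition count concrete, the following small sketch computes the smallest $k$ with $0.7^k \leq 1/n$. The function name `repetitions_needed` is a hypothetical helper introduced for illustration.

```python
import math


def repetitions_needed(n: int, fail_prob: float = 0.7) -> int:
    """Smallest k such that fail_prob**k <= 1/n."""
    return math.ceil(math.log(n) / -math.log(fail_prob))


for n in (10**3, 10**6, 10**9):
    k = repetitions_needed(n)
    # Probability that all k independent copies FAIL.
    print(n, k, 0.7 ** k)
```

The count grows like $\log n / \log(1/0.7) \approx 2.8 \ln n$, so the constant hidden in $O(\log n)$ is small.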

Back to Dynamic Graphs
  Graph $G$ with edges inserted/deleted
  Define node-edge incidence vectors:
    For node $v$, we have the vector $x_v \in R^p$ where $p = \binom{n}{2}$
    For $j > v$: $x_v(v, j) = +1$ if edge $(v, j)$ exists
    For $j < v$: $x_v(j, v) = -1$ if edge $(j, v)$ exists
    [Figure: in the column for edge $(i, j)$, row $i$ holds $+1$ and row $j$ holds $-1$.]
  Idea:
    Use Dynamic-Sample-Full to sample an edge from each vertex $v$
    Collapse edges
    How to iterate?
  Property:
    For a set $Q$ of nodes, consider $\sum_{v \in Q} x_v$
    Claim: it has a non-zero in coordinate $(i, j)$ iff edge $(i, j)$ crosses from $Q$ to outside (i.e., $|Q \cap \{i, j\}| = 1$)
    The sketch is enough: the sketches are linear, so for any set $Q$ we can add up the sketches of $x_v$, $v \in Q$, and sample an edge leaving $Q$!
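The cancellation property can be checked on a toy graph. The sketch below builds the incidence vectors explicitly as sparse dictionaries (no sketching involved) and sums them over a node set $Q$. The helper names and the example edge list are made up for illustration.

```python
from collections import defaultdict


def incidence_vector(v, edges):
    """Sparse x_v: +1 at (v, j) for j > v, -1 at (j, v) for j < v."""
    x = {}
    for a, b in edges:
        lo, hi = min(a, b), max(a, b)
        if v == lo:
            x[(lo, hi)] = +1
        elif v == hi:
            x[(lo, hi)] = -1
    return x


def sum_over(Q, edges):
    """Non-zero coordinates of the sum of x_v over v in Q."""
    total = defaultdict(int)
    for v in Q:
        for coord, value in incidence_vector(v, edges).items():
            total[coord] += value
    return {coord: s for coord, s in total.items() if s != 0}


edges = [(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)]
# Edge (1, 2) lies inside Q, so its +1 and -1 cancel.
print(sum_over({1, 2}, edges))
```

Only the edges with exactly one endpoint in $Q$ survive in the sum, which is why a sampler applied to the summed sketch returns an edge leaving $Q$.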

Dynamic Connectivity
  Vectors: $x_v \in R^p$ where $p = \binom{n}{2}$
    for $j > v$: $x_v(v, j) = +1$ if $\exists (v, j)$
    for $j < v$: $x_v(j, v) = -1$ if $\exists (j, v)$
  Sketching algorithm:
    Dynamic-Sample-Full for each $x_v$
  Check connectivity:
    Sample an edge from each node $v$
    Contract all sampled edges
      ⇒ partitions the graph into a bunch of components $Q_1, \ldots, Q_l$ (each is connected)
    Iterate on the components $Q_1, \ldots, Q_l$
  How many iterations?
    $O(\log n)$: each time we reduce the number of components by a factor $\geq 2$
  Issue: the iterations are not independent!
    Can use a fresh Dynamic-Sample-Full for each of the $O(\log n)$ iterations
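The contraction rounds can be sketched as follows. As a clearly labeled assumption, an exact scan over the edge list stands in for querying a Dynamic-Sample-Full sketch of $\sum_{v \in Q} x_v$, so the code shows only the Boruvka-style iteration, not the small-space data structure. The function name and test graphs are hypothetical.

```python
import random


def connected_components(n, edges, seed=0):
    """Return (number of components, number of contraction rounds)."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    rounds = 0
    while True:
        # Assumption: this exact scan replaces the per-component sampler.
        outgoing = {}
        for a, b in edges:
            ra, rb = find(a), find(b)
            if ra != rb:
                outgoing.setdefault(ra, []).append((a, b))
                outgoing.setdefault(rb, []).append((a, b))
        if not outgoing:
            break  # no component has an edge leaving it
        for crossing in outgoing.values():
            a, b = rng.choice(crossing)   # sample one edge leaving Q
            parent[find(a)] = find(b)     # contract it
        rounds += 1
    return len({find(v) for v in range(n)}), rounds


path = [(i, i + 1) for i in range(7)]
print(connected_components(8, path))
print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Every component that still has a crossing edge merges with at least one other component per round, which is where the $O(\log n)$ bound on the number of rounds comes from.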

A little history
  [Ahn-Guha-McGregor'12]: the above streaming algorithm
    Overall $O(n \cdot \log^4 n)$ space
  [Kapron-King-Mountjoy'13]:
    Data structure for maintaining graph connectivity under edge inserts/deletes
    First algorithm with $(\log n)^{O(1)}$ time for update/connectivity!
    Open since the '80s

Section 2: Dimension Reduction & Sketching

Why?
  Application: Nearest Neighbor Search in high dimensions
    Preprocess: a set $D$ of points
    Query: given a query point $q$, report a point $p \in D$ with the smallest distance to $q$
  [Figure: a point set with the query $q$ and its nearest neighbor $p$.]

Motivation
  Generic setup:
    Points model objects (e.g. images)
    Distance models a (dis)similarity measure
  Application areas:
    machine learning: k-NN rule
    speech/image/video/music recognition, vector quantization, bioinformatics, etc.
  Distance can be:
    Euclidean, Hamming
  [Figure: two points $p$ and $q$ shown as binary strings.]

Low-dimensional: easy
  Compute the Voronoi diagram
  Given a query $q$, perform point location
  Performance:
    Space: $O(n)$
    Query time: $O(\log n)$

High-dimensional case
  All exact algorithms degrade rapidly with the dimension $d$

  Algorithm                    Query time            Space
  Full indexing                $O(\log n \cdot d)$   $n^{O(d)}$ (Voronoi diagram size)
  No indexing (linear scan)    $O(n \cdot d)$        $O(n \cdot d)$
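The "no indexing" row corresponds to the following minimal linear scan, shown here under the Hamming distance on equal-length bit strings. The function names and the sample points are illustrative assumptions, not taken from the slides.

```python
def hamming(p: str, q: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbor(points, q, dist=hamming):
    """Scan all n points and keep the closest: O(n * d) time per query."""
    return min(points, key=lambda p: dist(p, q))


points = ["000000", "011100", "111111"]
print(nearest_neighbor(points, "011110"))
```

Nothing is precomputed, so the space is just the stored points, while every query pays for a full pass over the data.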

Dimension Reduction
  Reduce high dimension?!
    "flatten" dimension $d$ into dimension $k \ll d$
  Not possible in general: packing bound
  But can if: for a fixed subset of $\mathbb{R}^d$