Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.

The second moment

Stream: A,B,A,C,D,D,A,A,E,B,E,E,F,…

x : f(x)
A : 4
B : 2
C : 1
D : 2
E : 3
F : 1

The second moment: F2 = Σx f(x)²
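
For reference, a minimal baseline (not from the slides): computing F2 exactly with a dictionary of counts — exactly the per-item state a streaming algorithm cannot afford.

```python
# Exact F2: keep f(x) for every distinct item, then sum the squares.
from collections import Counter

def exact_f2(stream):
    counts = Counter(stream)                 # f(x) for every distinct x
    return sum(f * f for f in counts.values())

print(exact_f2("ABACDDAAEBEEF"))             # 16+4+1+4+9+1 = 35
```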

Alon, Matias, Szegedy 96 (Gödel Prize 2005)

Draw a random hash function h: [d] → {−1, +1}.

Maintain the single counter Z = Σx h(x)·f(x): when item x arrives, add h(x) to Z.

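A minimal sketch of this update rule, assuming items are integers (or anything hashable) and that h is a ±1-valued function; a concrete construction of h follows below.

```python
# AMS counter: Z = sum_x h(x) * f(x), maintained with one addition per item.
class AMSCounter:
    def __init__(self, h):
        self.h = h                           # h: item -> {-1, +1}
        self.z = 0

    def update(self, x):
        self.z += self.h(x)                  # x arrived: add its sign

    def estimate(self):
        return self.z * self.z               # Z^2 estimates F2 (see below)
```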

AMS Analysis

2-wise independent hash family

Suppose h : [d] → [T].
Fix two values t1 and t2 in the range of h.
Fix two values x1 ≠ x2 in the domain of h.
What is the probability that h(x1) = t1 and h(x2) = t2 ?

2-wise independent hash family

A family H of hash functions is 2-wise independent iff for all x1 ≠ x2 and all t1, t2:

Prh∈H( h(x1) = t1 and h(x2) = t2 ) = 1/T²

2-wise independent hash family

H = { (ax+b) mod T | 0 ≤ a,b < T } is 2-wise independent if T is a prime > d.

H = { 2·((ax+b) mod T mod 2) − 1 | 0 ≤ a,b < T } is approximately 2-wise independent from [d] to {−1, 1}.

We can get exact 2-wise independence from more complicated constructions.
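
A sketch of these constructions, assuming integer items and a fixed prime T; a random degree-1 polynomial mod T gives the (approximately) 2-wise independent sign hash above, and a random degree-3 polynomial gives the 4-wise independence assumed later for the variance bound.

```python
import random

T = 2_147_483_647                            # the prime 2^31 - 1, assumed > d

def make_sign_hash(degree=3):
    # Random polynomial of the given degree mod T, folded to {-1, +1}.
    # As the slide notes, the mod-2 folding is only approximately unbiased.
    coeffs = [random.randrange(T) for _ in range(degree + 1)]
    def h(x):
        v = 0
        for c in coeffs:                     # Horner evaluation mod T
            v = (v * x + c) % T
        return 2 * (v % 2) - 1
    return h
```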

Draw h from a 2-wise independent family

Z² is an unbiased estimator for F2: E[Z²] = F2 !
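
Written out (the standard AMS computation behind the claim):

```latex
\mathbb{E}[Z^2]
  = \mathbb{E}\Big[\Big(\sum_x h(x) f(x)\Big)^{\!2}\Big]
  = \sum_x f(x)^2\, \mathbb{E}[h(x)^2]
    + \sum_{x \neq y} f(x) f(y)\, \mathbb{E}[h(x)h(y)]
  = F_2,
```

since h(x)² = 1 always, and E[h(x)h(y)] = E[h(x)]·E[h(y)] = 0 for x ≠ y by 2-wise independence.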

What is the variance of Z² ? Here we will assume that h is drawn from a 4-wise independent family H.

What is the variance of Z² ?

Var(Z²) = E[Z⁴] − (E[Z²])² ≤ 2F2²
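
The standard computation (4-wise independence makes every term containing an odd power of some h(x) vanish in expectation):

```latex
\mathbb{E}[Z^4] = \sum_x f(x)^4 + 3\sum_{x \neq y} f(x)^2 f(y)^2
               = 3F_2^2 - 2F_4,
\qquad
\mathrm{Var}(Z^2) = \mathbb{E}[Z^4] - F_2^2 = 2F_2^2 - 2F_4 \le 2F_2^2.
```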

Chebyshev’s Inequality

Pr( |Z² − F2| ≥ εF2 ) ≤ Var(Z²) / (εF2)² ≤ 2/ε²

Chebyshev’s Inequality

If ε is small, the bound 2/ε² is larger than 1 and therefore meaningless…
We need to reduce the variance. How?

Averaging

Draw k independent hash functions h1, h2, …, hk.
Use the average Y = (Z1² + … + Zk²) / k.
Then E[Y] = F2 and Var(Y) ≤ 2F2²/k.

Chebyshev’s Inequality

Pr( |Y − F2| ≥ εF2 ) ≤ 2/(kε²)

Pick k = O(1/ε²) to make the failure probability a small constant.
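
A minimal sketch of the averaging step, reusing the AMSCounter and make_sign_hash sketches from above:

```python
# Average of k independent AMS estimates: same expectation, variance / k.
class AMSAverage:
    def __init__(self, k):
        self.counters = [AMSCounter(make_sign_hash()) for _ in range(k)]

    def update(self, x):
        for c in self.counters:
            c.update(x)

    def estimate(self):
        return sum(c.estimate() for c in self.counters) / len(self.counters)
```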

Boosting the confidence – Chernoff bounds

Pick k = 8/ε², so that each averaged estimate is more than εF2 away from F2 with probability at most 1/4.

Boosting the confidence – Chernoff bounds

Now repeat the experiment s = O(log(1/δ)) times.
We get A1, …, As (assume they are sorted).
Return their median.
Why is this good?

Boosting the confidence – Chernoff bounds

Each of A1, …, As is bad (more than εF2 away from F2) with probability ≤ 1/4.
For the median to be bad, more than half of A1, …, As must be bad:
remove the pair consisting of the largest and the smallest and repeat;
if both components of some removed pair are good, then the median lies between them and is good.

A1, A2, …, As−1, As

Boosting the confidence – Chernoff bounds

What is the probability that more than half are bad?
Chernoff: let X = X1 + … + Xs, where each Xi is Bernoulli with p = 1/4. Then

Pr( X > s/2 ) ≤ e^(−c·s)  for some constant c > 0,

so Pr( X > s/2 ) ≤ δ for s = O(log(1/δ)) with a large enough constant.
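
Putting the two steps together, a sketch of the full median-of-averages estimator (reusing AMSAverage from above; eps and delta are the accuracy and failure-probability parameters):

```python
import math
from statistics import median

class AMSMedian:
    def __init__(self, eps, delta):
        k = max(1, math.ceil(8 / eps**2))              # each fails w.p. <= 1/4
        s = max(1, math.ceil(24 * math.log(1/delta)))  # s = O(log(1/delta))
        self.groups = [AMSAverage(k) for _ in range(s)]

    def update(self, x):
        for g in self.groups:
            g.update(x)

    def estimate(self):
        return median(g.estimate() for g in self.groups)
```

The constant 24 is illustrative; any s = c·log(1/δ) with a large enough c works, by the Chernoff bound above.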

Recap

Stacking the k counters, Z = A·f, where f ∈ ℝᵈ is the frequency vector and A is the k × d matrix whose i-th row is ( hi(1), …, hi(d) ).

This is a random projection…

Z = A·f preserves distances in the sense that, with high probability,

(1−ε)·‖f‖² ≤ (1/k)·‖Af‖² ≤ (1+ε)·‖f‖²    (note F2 = ‖f‖²)

Make it look more familiar…

Set B = A/√k. Then B preserves distances in the sense:

(1−ε)·‖f‖² ≤ ‖Bf‖² ≤ (1+ε)·‖f‖²

Dimension reduction

(A: a random orthonormal k × d matrix.) y = Ax: we project x onto a random k-dimensional subspace.

JL: for every ε ∈ [0,1], with high probability

(1−ε)·‖x‖² ≤ (d/k)·‖Ax‖² ≤ (1+ε)·‖x‖²

Johnson-Lindenstrauss

JL: Project the vectors x1, …, xn onto a random k-dimensional subspace with k = O(log(n)/ε²). Then with probability 1 − 1/nᶜ, for all pairs i, j:

(1−ε)·‖xi − xj‖² ≤ (d/k)·‖Axi − Axj‖² ≤ (1+ε)·‖xi − xj‖²
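
A minimal sketch of JL in code, using a Gaussian random matrix as a common stand-in for a random orthonormal projection (its rows are only approximately orthonormal, but the same guarantee holds); the constant 4 in k is illustrative.

```python
import numpy as np

def jl_project(X, eps, rng=None):
    # X: n points as rows of an (n, d) array; returns an (n, k) array.
    rng = rng or np.random.default_rng()
    n, d = X.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))     # k = O(log(n) / eps^2)
    A = rng.normal(size=(k, d)) / np.sqrt(k)     # E ||Ax||^2 = ||x||^2
    return X @ A.T
```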

The proof

(A: a random orthonormal k × d matrix.)

Obs1: By linearity, it’s enough to prove the claim for vectors with ‖x‖₂ = 1.

The proof

Obs2: Instead of projecting a fixed vector onto a random k-dimensional subspace, we can equivalently look at the first k coordinates of a random unit vector: by symmetry, rotating the subspace is the same as rotating the vector.

The case k=1

Let z be a random unit vector in ℝᵈ and look at its first coordinate z1. Since E[z1²] = 1/d, the JL claim for k = 1 is that for ε ∈ [0,1], d·z1² lies in [1−ε, 1+ε] except with small probability.

An application: approximate period

10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,………

Find r such that the sum of squared differences between consecutive windows of length r is minimized — the period for which the sequence is closest to being r-periodic.

An exact algorithm

Find r minimizing the objective above.
For each value of r this takes linear time ⇒ O(m²) overall.

We can sketch/project all windows of length r and compare the sketches… but that is O(m²·k) just for sketching…

Obs1: We can sketch faster…

Each coordinate of a window’s sketch is an inner product with a random unit vector; computing it for every window position is a running inner product — which is a convolution of two vectors.

Convolution

Example (animation): slide the vector (3 2 1) across (1 2 3 4 5), computing one inner product at each shift.

We can compute the convolution in O(m·log(r)) time using the FFT.

Obs1: We can sketch faster

We can compute the first coordinate of all sketches in O(m·log(r)) time ⇒ we can sketch all positions in O(m·log(r)·k).
But we still have many possible values of r…
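
A sketch of this observation, assuming numpy: one FFT-based correlation computes the inner product of every length-r window with one random vector u, giving one coordinate of all m−r+1 sketches at once. (This simple version runs in O(m·log(m)); the slides’ O(m·log(r)) bound comes from doing the FFTs in chunks of length about 2r.)

```python
import numpy as np

def window_sketch_coordinate(seq, u):
    # Inner product of every length-r window of seq with u, via the fact
    # that correlation with u equals convolution with the reversed u.
    seq = np.asarray(seq, dtype=float)
    u = np.asarray(u, dtype=float)
    m, r = len(seq), len(u)
    f = np.fft.rfft(seq, m)
    g = np.fft.rfft(u[::-1], m)                  # reversed -> correlation
    conv = np.fft.irfft(f * g, m)
    return conv[r - 1 : m]                       # entry j = <seq[j:j+r], u>
```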

Obs2: Sketch only in powers of 2

We compute all these sketches in O(log(m)·m·log(r)·k) time.

When r is not a power of 2 ?

Split the window z into pieces x and y whose lengths are powers of 2, and use S(x) + S(y) as S(z).

The algorithm

Compute sketches for all power-of-2 window lengths in O(log(m)·m·log(r)·k) time.
For a fixed r we can then approximate the objective in O((m/r)·k) time; summing over all r gives O(m·log(m)·k).

Total running time: O(m·log³(m)).

Bibliography

Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999).
W. B. Johnson, J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26: 189-206 (1984).
Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2): 142-156 (2008).
Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000: 363-372.