Big Data Lecture 5: Estimating the second moment, dimension reduction, applications
The second moment: For the stream A,B,A,C,D,D,A,A,E,B,E,E,F,… let f(x) be the number of occurrences of item x (for the prefix shown, f(A) = 4, f(B) = 2, f(C) = 1). The second moment is $F_2 = \sum_x f(x)^2$.
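As a baseline, the second moment can be computed exactly by counting. This is a minimal sketch: the helper name `second_moment` is illustrative, and the stream is only the prefix shown on the slide, cut at the ellipsis.

```python
from collections import Counter


def second_moment(stream):
    """Exact F_2: the sum of squared item frequencies."""
    counts = Counter(stream)
    return sum(c * c for c in counts.values())


# Prefix of the example stream (everything before the ellipsis).
prefix = list("ABACDDAAEBEEF")
f2_prefix = second_moment(prefix)
```

Exact counting needs one counter per distinct item, which is what the sketching approach in the rest of the lecture avoids.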
Alon, Matias, Szegedy 96 (Gödel Prize 2005): Pick a random hash function $h$ that maps each item $x$ to $-1$ or $+1$. Maintain: $Z = \sum_x h(x) f(x)$, i.e. add $h(x)$ to a single counter $Z$ whenever item $x$ arrives.
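The counter update can be sketched as follows. This is an illustration under stated assumptions: `make_sign_hash` is a hypothetical helper that stores one independent random sign per item in a table, standing in for a hash function drawn from a small 4-wise independent family, so unlike the real algorithm it uses space proportional to the number of distinct items.

```python
import random


def make_sign_hash(seed):
    """Hypothetical stand-in for h: a lazily filled table of random signs."""
    rng = random.Random(seed)
    table = {}

    def h(x):
        if x not in table:
            table[x] = rng.choice((-1, 1))
        return table[x]

    return h


def ams_counter(stream, h):
    """Maintain Z = sum_x h(x) f(x) by adding h(x) for each arriving item."""
    z = 0
    for x in stream:
        z += h(x)
    return z
```

For a stream such as `list("ABACDDAAEBEEF")`, `ams_counter(stream, h) ** 2` is one (noisy) estimate of the second moment.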
AMS Analysis
2-wise independent hash family: Suppose $h : [d] \to [T]$. Fix 2 values $t_1$ and $t_2$ in the range of $h$, and 2 values $x_1 \neq x_2$ in the domain of $h$. What is the probability that $h(x_1) = t_1$ and $h(x_2) = t_2$?
2-wise independent hash family: $H$, a family of hash functions $h$, is 2-wise independent iff $\forall x_1 \neq x_2,\ \forall t_1, t_2$: $\Pr_{h \in H}(h(x_1) = t_1 \text{ and } h(x_2) = t_2) = 1/|T|^2$, where $|T|$ is the size of the range of $h$.
2-wise independent hash family: $H = \{(ax+b) \bmod T \mid 0 \le a, b < T\}$ is 2-wise independent if $T$ is a prime $> d$. $H = \{2(((ax+b) \bmod T) \bmod 2) - 1 \mid 0 \le a, b < T\}$ is approximately 2-wise independent from $[d]$ to $\{-1, 1\}$. We can get exact 2-wise independence by more complicated constructions.
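For a small prime, both claims can be checked by enumerating the whole family. The sketch below is only a sanity check: the function names and the choice T = 7 are illustrative assumptions, not part of the lecture.

```python
from collections import Counter
from itertools import product


def pair_distribution(T, x1, x2):
    """Count how often (h(x1), h(x2)) takes each value over all h = (ax+b) mod T."""
    counts = Counter()
    for a, b in product(range(T), repeat=2):
        counts[((a * x1 + b) % T, (a * x2 + b) % T)] += 1
    return counts


def mean_sign(T, x):
    """Exact expectation of the sign hash 2(((ax+b) mod T) mod 2) - 1 at x."""
    total = sum(2 * (((a * x + b) % T) % 2) - 1
                for a, b in product(range(T), repeat=2))
    return total / T ** 2


T = 7                          # a small prime, chosen for illustration
dist = pair_distribution(T, 2, 5)
bias = mean_sign(T, 3)
```

Every one of the $T^2$ value pairs occurs for exactly one choice of $(a, b)$, so each has probability $1/T^2$. The sign version has mean $-1/T$ rather than 0, because $\{0, \dots, T-1\}$ contains one more even than odd residue; this is the sense in which it is only approximately 2-wise independent.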
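Draw $h$ from a 2-wise independent family into $\{-1, +1\}$. Then $h(x)^2 = 1$ and $E[h(x)h(y)] = 0$ for $x \neq y$, so $E[Z^2] = E[(\sum_x h(x) f(x))^2] = \sum_{x,y} f(x) f(y) E[h(x)h(y)] = \sum_x f(x)^2 = F_2$. $Z^2$ is an unbiased estimator for $F_2$!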
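What is the variance of $Z^2$? Here we will assume that $h$ is drawn from a 4-wise independent family $H$. Then $E[Z^4] = \sum_x f(x)^4 + 6 \sum_{x<y} f(x)^2 f(y)^2$, so $\mathrm{Var}(Z^2) = E[Z^4] - F_2^2 = 4 \sum_{x<y} f(x)^2 f(y)^2 \le 2 F_2^2$.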
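For a tiny universe, the mean and variance of $Z^2$ can be computed exactly by enumerating every sign assignment. The sketch assumes fully independent signs (which are in particular 4-wise independent), and the frequency vector is an illustrative example rather than data from the lecture.

```python
from itertools import product


def exact_moments(freqs):
    """Mean and variance of Z^2 over all 2^d sign assignments (uniform)."""
    z2_values = []
    for signs in product((-1, 1), repeat=len(freqs)):
        z = sum(s * f for s, f in zip(signs, freqs))
        z2_values.append(z * z)
    mean = sum(z2_values) / len(z2_values)
    var = sum((v - mean) ** 2 for v in z2_values) / len(z2_values)
    return mean, var


freqs = [4, 2, 1, 2, 3, 1]              # illustrative frequencies
f2 = sum(f * f for f in freqs)          # the true second moment
mean, var = exact_moments(freqs)
```

Here `mean` equals `f2`, and `var` equals $2(F_2^2 - \sum_x f(x)^4)$, which is at most $2F_2^2$.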
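Chebyshev’s Inequality: Using $\mathrm{Var}(Z^2) \le 2F_2^2$, $\Pr(|Z^2 - F_2| \ge \epsilon F_2) \le \mathrm{Var}(Z^2)/(\epsilon^2 F_2^2) \le 2/\epsilon^2$. If $\epsilon$ is small this is meaningless… We need to reduce the variance. How?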
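Averaging: Draw $k$ independent hash functions $h_1, h_2, \dots, h_k$ and keep one counter $Z_i$ per function. Use $Y = \frac{1}{k} \sum_{i=1}^{k} Z_i^2$. Then $E[Y] = F_2$ and $\mathrm{Var}(Y) = \mathrm{Var}(Z^2)/k \le 2F_2^2/k$.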
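Chebyshev’s Inequality: For the average $Y$ of the $k$ estimates, $\Pr(|Y - F_2| \ge \epsilon F_2) \le \mathrm{Var}(Y)/(\epsilon^2 F_2^2) \le 2/(k\epsilon^2)$. Pick $k = O(1/\epsilon^2)$; for instance $k \ge 8/\epsilon^2$ makes this at most $1/4$.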
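Boosting the confidence – Chernoff bounds: With $k$ picked this way, the averaged estimate falls outside $(1 \pm \epsilon)F_2$ with probability at most 1/4.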
Boosting the confidence – Chernoff bounds: Now repeat the experiment $s = O(\log(1/\delta))$ times. We get $A_1, \dots, A_s$ (assume they are sorted). Return their median. Why is this good?
Boosting the confidence – Chernoff bounds: Each of $A_1, \dots, A_s$ is bad (outside $(1 \pm \epsilon)F_2$) with probability ≤ ¼. For the median to be bad we need more than ½ of $A_1, \dots, A_s$ to be bad (remove the pair consisting of the largest and smallest and repeat... If both components of some pair are good then the median is good…).
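Boosting the confidence – Chernoff bounds: What is the probability that more than ½ are bad? Chernoff: Let $X = X_1 + \dots + X_s$ where the $X_i$ are independent Bernoulli with $p = ¼$; then $E[X] = s/4$ and $\Pr(X \ge s/2) \le e^{-s/12}$, which is at most $\delta$ for $s = O(\log(1/\delta))$ with a large enough constant.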
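Averaging and the median step combine into a median-of-means estimator, sketched below. As assumptions for illustration, each counter uses its own table of independent random signs (a stand-in for a 4-wise independent hash function), and the name `ams_estimate` and its parameters are not from the lecture.

```python
import random
from collections import Counter
from statistics import median


def ams_estimate(stream, k, s, seed=0):
    """Median of s averages, each over k independent Z^2 estimates."""
    rng = random.Random(seed)
    n_counters = k * s
    signs = [dict() for _ in range(n_counters)]  # one sign table per counter
    z = [0] * n_counters
    for x in stream:
        for i in range(n_counters):
            if x not in signs[i]:
                signs[i][x] = rng.choice((-1, 1))
            z[i] += signs[i][x]
    # Average k squared counters per group, then take the median of the groups.
    means = [sum(v * v for v in z[g * k:(g + 1) * k]) / k for g in range(s)]
    return median(means)
```

A call such as `ams_estimate(stream, k=100, s=5)` can be compared against the exact value `sum(c * c for c in Counter(stream).values())`.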
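Recap: The $k$ counters are a linear function of the frequency vector: $(Z_1, \dots, Z_k)^T = Hf$, where $H$ is the $k \times d$ matrix with entries $H_{i,x} = h_i(x) \in \{-1, +1\}$.
This is a random projection.. It preserves distances in the sense: $\frac{1}{k}\|Hf\|_2^2 \approx \|f\|_2^2 = F_2$, up to a factor $(1 \pm \epsilon)$.
Make it look more familiar.. Scaling by $1/\sqrt{k}$, i.e. taking $A = \frac{1}{\sqrt{k}} H$: $(1-\epsilon)\|f\|_2^2 \le \|Af\|_2^2 \le (1+\epsilon)\|f\|_2^2$.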
Dimension reduction: $y = Ax$, where $A$ is a random orthonormal $k \times d$ matrix, i.e. we project into a random $k$-dim. subspace. JL: for $\epsilon \in [0,1]$ and $k$ large enough, the projection scaled by $\sqrt{d/k}$ changes the squared length of $x$ by at most a factor $(1 \pm \epsilon)$ with high probability.
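Johnson-Lindenstrauss: Project the vectors $x_1, \dots, x_n$ into a random $k$-dimensional subspace for $k = O(\log(n)/\epsilon^2)$; then with probability $1 - 1/n^c$, for all $i, j$: $(1-\epsilon)\|x_i - x_j\|_2^2 \le \frac{d}{k}\|Ax_i - Ax_j\|_2^2 \le (1+\epsilon)\|x_i - x_j\|_2^2$, where $A$ is the $k \times d$ projection matrix.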
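The distance-preservation claim can be tried out numerically. Note the swapped-in technique: instead of a random orthonormal projection, this sketch uses a matrix of independent Gaussian entries scaled by $1/\sqrt{k}$, a common variant of the lemma that needs no orthogonalization. All names and parameters are illustrative assumptions.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Map each point to k dimensions with a Gaussian matrix scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    d = len(points[0])
    rows = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]
    return [[sum(r[j] * p[j] for j in range(d)) for r in rows] for p in points]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def max_distortion(points, projected):
    """Largest relative error of squared pairwise distances after projection."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = (squared_distance(projected[i], projected[j])
                     / squared_distance(points[i], points[j]))
            worst = max(worst, abs(ratio - 1.0))
    return worst
```

For a handful of random points in a few hundred dimensions, `max_distortion(points, random_projection(points, k))` shrinks roughly like $1/\sqrt{k}$ as `k` grows.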
The proof: $y = Ax$ with $A$ a random orthonormal $k \times d$ matrix. Obs1: It's enough to prove the claim for vectors such that $\|x\|_2 = 1$, since $A$ is linear and both sides of the JL inequality scale with $\|x\|_2^2$.
The proof: Obs2: Instead of projecting a fixed unit vector into a random $k$-dim subspace, look at the first $k$ coordinates of a random unit vector; the two have the same distribution.
The case k=1: By Obs2 we look at the first coordinate $z_1$ of a random unit vector $z \in \mathbb{R}^d$. By symmetry all coordinates have the same distribution and $\sum_i z_i^2 = 1$, so $E[z_1^2] = 1/d$. (Figure: the unit sphere, radius 1; $\epsilon \in [0,1]$.)
An application: approximate period. 10,3,20,1,10,3,18,1,11,5,20,2,12,1,19,1,......... Find the period length r for which the sequence is closest to r-periodic, i.e. for which the total distance between its windows of length r is minimized.
An exact algorithm: Find r such that the objective is minimized. Each value of r takes linear time, so trying all values takes $O(m^2)$. We can sketch/project all windows of length r and compare the sketches … but that is $O(m^2 k)$ just for sketching…
Obs1: We can sketch faster.. Each sketch coordinate of the windows is a running inner product of the sequence with a fixed random unit vector. This is similar to a convolution of two vectors.
Convolution: [Figure: one vector sliding along another; values shown: 1 2 3 4 5 3 2 1.] We can compute the convolution in $O(m \log r)$ time using the FFT.
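The running inner product can be sketched with a textbook radix-2 FFT. This is a minimal illustration, not the lecture's implementation: it pads to one large transform and therefore runs in $O(m \log m)$; reaching $O(m \log r)$ would additionally require processing the sequence in blocks of length proportional to $r$, which is omitted here. The function names are assumptions.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for i in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * i / n) * odd[i]
        out[i] = even[i] + w
        out[i + n // 2] = even[i] - w
    return out


def convolve(x, y):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    out_len = len(x) + len(y) - 1
    size = 1
    while size < out_len:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    return [v.real / size for v in prod[:out_len]]  # unscaled inverse, so divide


def sliding_inner_products(x, v):
    """<x[i:i+r], v> for every window start i, where r = len(v)."""
    r = len(v)
    full = convolve(x, list(v)[::-1])  # reversing v turns convolution into correlation
    return full[r - 1:len(x)]
```

With `x = [1, 2, 3, 4, 5]` and `v = [3, 2, 1]` (illustrative values), the result matches the direct computation `[sum(a * b for a, b in zip(x[i:i + 3], v)) for i in range(3)]` up to floating-point error.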
Obs1: We can sketch faster: We can compute the first coordinate of all sketches in $O(m \log r)$ time. We can sketch all positions in $O(m \log(r) \cdot k)$. But we still have many possible values for r…
Obs2: Sketch only in powers of 2: We compute all sketches in $O(\log(m) \cdot m \log(r) \cdot k)$.
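When r is not a power of 2? Cover the window $z$ of length $r$ by two windows $x$ and $y$ whose length is a power of 2, with sketches $S(x)$ and $S(y)$. Use $S(x) + S(y)$ as $S(z)$.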
The algorithm: Compute sketches for window lengths that are powers of 2 in $O(\log(m) \cdot m \log(r) \cdot k)$ time. For a fixed $r$ we can approximate the objective in $O((m/r) \cdot k)$ time. Summing over $r$ we get $O(m \log(m) \cdot k)$.
The algorithm: Total running time is $O(m \log^3 m)$, taking $k = O(\log m)$.
Bibliography
Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999)
W. B. Johnson, J. Lindenstrauss: Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26: 189-206 (1984)
Jiří Matoušek: On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2): 142-156 (2008)
Piotr Indyk, Nick Koudas, S. Muthukrishnan: Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB 2000: 363-372