Join Size
The join size is the number of tuples produced by joining two relations, and hence the space the result requires. Let $t$ be the number of possible values the join attribute can take on. Let $a_i$ be the number of items in relation $A$ with value $i$, and let $b_i$ be defined similarly for $B$. Then the join size is $\sum_{i=1}^{t} a_i b_i$. For a self-join, $a_i = b_i$, so this reduces to $\sum_i a_i a_i = \sum_i a_i^2$.
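As a small illustration of the definition, the sketch below computes the exact join size from the per-value counts. The function name `join_size` and the sample relations are made up for this example; each relation is represented only by the list of join-attribute values of its tuples.

```python
from collections import Counter

def join_size(a_values, b_values):
    """Exact equi-join size: the sum over join values i of a_i * b_i."""
    a_counts = Counter(a_values)  # a_i: number of tuples in A with join value i
    b_counts = Counter(b_values)  # b_i: number of tuples in B with join value i
    # A Counter returns 0 for values that never occur, so unmatched values add nothing.
    return sum(count * b_counts[value] for value, count in a_counts.items())

A = [1, 1, 2, 3]      # join-attribute values of the tuples in A (illustrative data)
B = [1, 2, 2, 2, 4]   # join-attribute values of the tuples in B (illustrative data)
print(join_size(A, B))  # 2*1 + 1*3 + 1*0 = 5
print(join_size(A, A))  # self-join: 2^2 + 1^2 + 1^2 = 6
```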
Motivation
The standard method for estimating the join size uses random samples as “signatures”. Each of the two relations is sampled with probability $p$. The join size of the signatures is then computed and scaled by $p^{-2}$. To get a good estimate with high probability, we need $cn^2/B$ samples, where $B$ is a lower bound on the join size and $c > 3$ is a constant.
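The sampling approach can be sketched as follows, assuming each tuple is kept independently with probability `p` (Bernoulli sampling). The slide does not say how the sample is drawn, so that choice, the function name `sampled_join_estimate`, and the explicit `rng` argument are assumptions of this example. The scaling works because each matching pair of tuples survives in both samples with probability `p * p`.

```python
import random
from collections import Counter

def sampled_join_estimate(a_values, b_values, p, rng):
    """Join two Bernoulli samples and scale the result by p**-2."""
    # Keep each tuple independently with probability p (assumed sampling scheme).
    sample_a = Counter(v for v in a_values if rng.random() < p)
    sample_b = Counter(v for v in b_values if rng.random() < p)
    # Exact join size of the two samples ("signatures").
    sample_join = sum(n * sample_b[v] for v, n in sample_a.items())
    return sample_join / (p * p)

rng = random.Random(0)  # seeded only so that runs are repeatable
estimate = sampled_join_estimate([1, 1, 2, 3], [1, 2, 2, 2, 4], 0.5, rng)
print(estimate)  # a noisy estimate of the true join size, which is 5
```

With `p = 1.0` every tuple is kept and the function returns the exact join size.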
Using Self Join Estimates
Instead, we use a generalization of the “tug-of-war” self-join size estimators that we saw earlier. We compute a tug-of-war estimate for each of the relations and multiply the two together to obtain the join size estimate. Recall that in the self-join algorithm we squared the estimate, because we were joining a relation to itself.
Algorithm
Let $\epsilon_i$ be four-wise independent $\{-1, 1\}$-valued random variables. Let $S(A) = \sum_i \epsilon_i a_i$ and $S(B) = \sum_i \epsilon_i b_i$. Our estimator is then $S(A) \cdot S(B)$. Note that we use the same $\epsilon_i$ for both sums.
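A minimal sketch of one trial of this estimator is given below. As a simplifying assumption it draws fully independent random signs, which are in particular four-wise independent but need one stored sign per join value; a space-efficient implementation would generate the signs from a compact four-wise independent family instead. The function name `tug_of_war_join_estimate` is made up for this example.

```python
import random
from collections import Counter

def tug_of_war_join_estimate(a_values, b_values, rng):
    """One trial of the estimator S(A) * S(B)."""
    a_counts = Counter(a_values)  # a_i
    b_counts = Counter(b_values)  # b_i
    # One random sign per join value, shared by both relations.
    # Assumption: fully independent signs stand in for a four-wise independent family.
    domain = sorted(set(a_counts) | set(b_counts))
    eps = {i: rng.choice((-1, 1)) for i in domain}
    s_a = sum(eps[i] * n for i, n in a_counts.items())  # S(A) = sum_i eps_i * a_i
    s_b = sum(eps[i] * n for i, n in b_counts.items())  # S(B) = sum_i eps_i * b_i
    return s_a * s_b

rng = random.Random(0)  # seeded only so that runs are repeatable
print(tug_of_war_join_estimate([1, 1, 2, 3], [1, 2, 2, 2, 4], rng))
```

A single trial is a noisy value whose expectation is the join size. When both relations contain only one shared join value, the signs cancel and the trial is exact.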
Expectation and Variance
Expected value of the estimate: $E(S(A) \cdot S(B)) =$ the join size of $A$ and $B$. Recall that if we view $A$ as the vector of the $a_i$'s, then the self-join size of the relation is $\|A\|^2$ (the magnitude squared). Variance of the estimate: $\mathrm{Var}(S(A) \cdot S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
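Proof of Expected Value
$E(S(A) \cdot S(B)) = E\left(\sum_i \epsilon_i^2 a_i b_i + \sum_{i \ne j} \epsilon_i \epsilon_j a_i b_j\right)$. Note that if $i \ne j$, then $E(\epsilon_i \epsilon_j) = 0$, and if $i = j$, then $E(\epsilon_i \epsilon_j) = 1$. Thus we get $E(S(A) \cdot S(B)) = E\left(\sum_i a_i b_i\right)$, and this is just $\sum_i a_i b_i$, which is the join size.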
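The expectation can be checked exactly on a tiny example by averaging the estimator over every possible sign vector. This sketch assumes the uniform distribution over all $2^t$ sign vectors (fully independent signs, hence also four-wise independent); the function name `exact_expectation` and the count vectors are illustrative only.

```python
from itertools import product

def exact_expectation(a, b):
    """Average S(A) * S(B) over every sign vector in {-1, 1}^t."""
    t = len(a)
    total = 0
    for eps in product((-1, 1), repeat=t):
        s_a = sum(e * x for e, x in zip(eps, a))
        s_b = sum(e * y for e, y in zip(eps, b))
        total += s_a * s_b
    return total / 2 ** t

a = [2, 1, 1, 0]  # a_i for i = 1..4 (illustrative counts)
b = [1, 3, 0, 1]  # b_i for i = 1..4 (illustrative counts)
print(exact_expectation(a, b))            # 5.0
print(sum(x * y for x, y in zip(a, b)))   # 5, the join size
```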
Proof of Variance
Let $X = S(A)S(B) - E(S(A)S(B)) = \sum_{i \ne j} \epsilon_i \epsilon_j a_i b_j$. Then $E(X^2) = \mathrm{Var}(S(A)S(B))$. Note that when we square the above summation, any term which contains an $\epsilon$ raised to an odd power will have an expectation of 0. Thus these terms will be eliminated. In the other terms, the product of the $\epsilon$'s will always be 1.
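Proof of Variance II
Every term will look like $\epsilon_i \epsilon_j \epsilon_{i'} \epsilon_{j'}\, a_i b_j a_{i'} b_{j'}$ where $i \ne j$ and $i' \ne j'$. In order for all the $\epsilon$-powers to be even, either $i = i'$ and $j = j'$, or $i = j'$ and $j = i'$. Thus, we are left with: $\mathrm{Var}(S(A)S(B)) = \sum_{i \ne j} a_i^2 b_j^2 + \sum_{i \ne j} a_i b_i a_j b_j$. Note that $\sum_{i \ne j} a_i^2 b_j^2 \le \sum_i a_i^2 \sum_j b_j^2$; we just drop the requirement that $i \ne j$. And this is equivalent to $\|A\|^2\,\|B\|^2$.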
Proof of Variance III
Also, $\sum_{i \ne j} a_i b_i a_j b_j \le \sum_i a_i b_i \sum_j a_j b_j = \left(\sum_i a_i b_i\right)^2$. And this equals $\langle A, B \rangle^2$ (the square of the inner product of $A$ and $B$). Also, by the Cauchy-Schwarz inequality, $\langle A, B \rangle^2 \le \|A\|^2\,\|B\|^2$. Thus $\mathrm{Var}(S(A)S(B)) \le 2\,\|A\|^2\,\|B\|^2$.
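The variance argument can be sanity-checked by brute force on a small example. The sketch below again assumes the uniform distribution over all sign vectors (fully independent signs, hence also four-wise independent), and compares the exact variance with the two-sum formula and with the final bound. All function names and count vectors are made up for illustration.

```python
from itertools import product

def exact_variance(a, b):
    """Variance of S(A) * S(B) over every sign vector in {-1, 1}^t."""
    values = []
    for eps in product((-1, 1), repeat=len(a)):
        s_a = sum(e * x for e, x in zip(eps, a))
        s_b = sum(e * y for e, y in zip(eps, b))
        values.append(s_a * s_b)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_formula(a, b):
    """sum_{i != j} a_i^2 b_j^2  +  sum_{i != j} a_i b_i a_j b_j."""
    t = len(a)
    pairs = [(i, j) for i in range(t) for j in range(t) if i != j]
    return sum(a[i] ** 2 * b[j] ** 2 + a[i] * b[i] * a[j] * b[j] for i, j in pairs)

def variance_bound(a, b):
    """2 * ||A||^2 * ||B||^2."""
    return 2 * sum(x * x for x in a) * sum(y * y for y in b)

a = [2, 1, 1, 0]  # a_i (illustrative counts)
b = [1, 3, 0, 1]  # b_i (illustrative counts)
print(exact_variance(a, b))    # 65.0
print(variance_formula(a, b))  # 53 + 12 = 65
print(variance_bound(a, b))    # 2 * 6 * 11 = 132
```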
Repeated trials
If we want constant relative error with high probability, we can repeat the experiment independently multiple times and take the mean of the results. Averaging $k$ trials divides the variance by $k$, so by Chebyshev's inequality it suffices to use $k = c\,\|A\|^2\,\|B\|^2 / b^2$ trials, where $c > 2$ is a constant determined by the desired accuracy and confidence, and $b$ is a lower bound on the join size.
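Averaging repeated trials might look like the sketch below. The number of trials `k` is passed in directly rather than derived from the accuracy and confidence targets, and fully independent random signs are assumed in place of a compact four-wise independent family. The function name `mean_join_estimate` is hypothetical.

```python
import random
from collections import Counter

def mean_join_estimate(a_values, b_values, k, rng):
    """Mean of k independent tug-of-war trials of S(A) * S(B)."""
    a_counts = Counter(a_values)
    b_counts = Counter(b_values)
    domain = sorted(set(a_counts) | set(b_counts))
    total = 0
    for _ in range(k):
        # Fresh signs for every trial, shared by both relations within the trial.
        eps = {i: rng.choice((-1, 1)) for i in domain}
        s_a = sum(eps[i] * n for i, n in a_counts.items())
        s_b = sum(eps[i] * n for i, n in b_counts.items())
        total += s_a * s_b
    return total / k

rng = random.Random(0)  # seeded only so that runs are repeatable
print(mean_join_estimate([1, 1, 2, 3], [1, 2, 2, 2, 4], 2000, rng))  # close to the true join size, 5
```

Because the trials are independent, the variance of the mean shrinks in proportion to `1 / k`, which is what drives the trial count on the slide.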
Future Work
Exploring other approaches. Three-way joins. Experimental results for the “tug-of-war” join scheme.