The Communication Complexity of Distributed Set-Joins

Presentation on theme: "The Communication Complexity of Distributed Set-Joins"— Presentation transcript:

1 The Communication Complexity of Distributed Set-Joins
Dirk Van Gucht (Indiana University), Ryan Williams (Stanford), David Woodruff (IBM Almaden), Qin Zhang (Indiana University)

2 Set-Joins
Given sets A_1, …, A_m from [n] = {1, 2, …, n}
Given sets B_1, …, B_m from [n]
Join = {(i,j) for which P(A_i, B_j)}, where P is a predicate
P might be:
  set-intersection: |A_i ∩ B_j| > 0
  set non-intersection: |A_i ∩ B_j| = 0
  set equality: A_i = B_j
  set-intersection threshold join: |A_i ∩ B_j| > T
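As a point of reference, the join definition above can be evaluated directly when both families of sets sit in one place. The sketch below is a naive centralized evaluation with no communication model; the names `set_join`, `intersects`, `disjoint`, `equal`, and `threshold` are illustrative and do not come from the talk.

```python
from typing import Callable, List, Set, Tuple

def set_join(A: List[Set[int]], B: List[Set[int]],
             P: Callable[[Set[int], Set[int]], bool]) -> List[Tuple[int, int]]:
    """Return all 1-based pairs (i, j) for which P(A_i, B_j) holds."""
    return [(i + 1, j + 1)
            for i, a in enumerate(A)
            for j, b in enumerate(B)
            if P(a, b)]

# The four predicates listed on the slide.
def intersects(a, b):
    return len(a & b) > 0

def disjoint(a, b):
    return len(a & b) == 0

def equal(a, b):
    return a == b

def threshold(T):
    return lambda a, b: len(a & b) > T

A = [{1, 2}, {3}, {4, 5}]
B = [{2, 3}, {4, 5}, {6}]
print(set_join(A, B, intersects))    # [(1, 1), (2, 1), (3, 2)]
print(set_join(A, B, equal))         # [(3, 2)]
print(set_join(A, B, threshold(1)))  # [(3, 2)]
```

This takes time proportional to m^2 set operations; the rest of the talk is about how much of the input two separate parties must exchange to produce the same output.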

3 Communication Complexity
Alice holds A_1, …, A_m; Bob holds B_1, …, B_m
Alice and Bob want to minimize communication so that together they can output the join
The protocol can fail with probability 1/100 (over its randomness)

4 Communication Complexity of Joins
Set-Intersection, Set Non-Intersection, and Set-Intersection Threshold Join all require Ω(mn) communication
Reduction from set-disjointness on a universe of size m*n
Matched by trivial upper bound
Disjointness inputs: {x_1, x_2, …, x_{mn}} and {y_1, y_2, …, y_{mn}}
A_1 = {x_1, x_2, …, x_n}, B_1 = {y_1, y_2, …, y_n}
A_2 = {x_{n+1}, x_{n+2}, …, x_{2n}}, B_2 = {y_{n+1}, y_{n+2}, …, y_{2n}}
…
A_m = {x_{(m-1)n+1}, …, x_{mn}}, B_m = {y_{(m-1)n+1}, …, y_{mn}}
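The block construction in this reduction can be sketched as follows, assuming the disjointness inputs are given as subsets X, Y of {1, …, mn}. The helper names `split_into_blocks` and `disjoint_via_join` are hypothetical. The point is only that X and Y intersect exactly when some diagonal pair (i, i) belongs to the set-intersection join.

```python
from typing import List, Set

def split_into_blocks(X: Set[int], m: int, n: int) -> List[Set[int]]:
    """Map X, a subset of {1..m*n}, to m sets over [n].

    Element e lands in block (e-1)//n at offset (e-1)%n + 1.
    """
    blocks: List[Set[int]] = [set() for _ in range(m)]
    for e in X:
        blocks[(e - 1) // n].add((e - 1) % n + 1)
    return blocks

def disjoint_via_join(X: Set[int], Y: Set[int], m: int, n: int) -> bool:
    A = split_into_blocks(X, m, n)
    B = split_into_blocks(Y, m, n)
    # X and Y share an element iff some diagonal pair (i, i) is in the join.
    return all(not (A[i] & B[i]) for i in range(m))

print(disjoint_via_join({1, 6, 11}, {2, 6, 12}, m=3, n=4))  # False (6 is shared)
print(disjoint_via_join({1, 6}, {2, 7}, m=3, n=4))          # True
```

Any protocol that outputs the join therefore also answers disjointness on a universe of size mn, which is where the Ω(mn) bound comes from.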

5 Communication Complexity of Joins
Set-Equality has communication Θ(m)
Ω(m) bits transferred to the player outputting the join
O(m) upper bound due to a set-intersection protocol of [BCKWY]
Randomly hash the sets A_1, …, A_m and B_1, …, B_m to [m^2], giving x_1, …, x_m in [m^2] and y_1, …, y_m in [m^2]
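The hashing step can be illustrated with a small sketch. It covers only the reduction from sets to values in [m^2]; it is not the [BCKWY] protocol, which is what actually achieves O(m) communication on the hashed values. SHA-256 with a shared seed stands in for the shared random hash function, and the names `hash_set` and `candidate_equality_join` are assumptions made for illustration.

```python
import hashlib

def hash_set(s, m, seed):
    """Hash a set to a value in {0, ..., m^2 - 1}; both parties use the same seed."""
    data = f"{seed}:" + ",".join(str(e) for e in sorted(s))
    digest = hashlib.sha256(data.encode()).digest()
    return int.from_bytes(digest, "big") % (m * m)

def candidate_equality_join(A, B, seed=0):
    """Pairs (i, j), 1-based, whose hashed values agree."""
    m = len(A)
    xs = [hash_set(a, m, seed) for a in A]
    ys = [hash_set(b, m, seed) for b in B]
    positions = {}
    for j, y in enumerate(ys):
        positions.setdefault(y, []).append(j)
    return [(i + 1, j + 1) for i, x in enumerate(xs) for j in positions.get(x, [])]

A = [{1, 2}, {3}, {4, 5}]
B = [{3}, {1, 2}, {7}]
print(candidate_equality_join(A, B))
```

Equal sets always receive equal hash values, so every true pair appears in the output; unequal sets can collide, so the output may contain a few extra pairs.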

6 Communication Complexity of Joins
Can we achieve better communication for joins in practice?
Sparsity
  Output: typically the output size |Join(A_1, …, A_m, B_1, …, B_m)| is at most k << m^2
  Input: typically |A_1|, …, |A_m|, |B_1|, …, |B_m| are each at most s << n
Can get better bounds in terms of k, s!
Focus on set-intersection join (will generalize to natural join)

7 Our Main Results
O(k^{1/2} n) communication protocol for set-intersection join
  More efficient than O(mn) for small k!
  Improves O(kn) communication protocol of [Williams, Yu, SODA ’14]
Matching Ω(k^{1/2} n) lower bound
For input sparsity s, we show Θ(min(ms, (skn)^{1/2}) + n) communication bound
Highlights
  Use 2 rounds of interaction (1 round is impossible!)
  For each (i,j) in the join, returns all k for which k in A_i and k in B_j (natural join)
  New algorithms for input/output sparse matrix multiplication

8 Talk Outline
Our O(k^{1/2} n) communication protocol
Extension to solve the natural join problem
New algorithm for input/output sparse matrix multiplication
Application to Sparse Transitive-Closure
An Ω(k^{1/2} n) communication lower bound
Estimating sizes of set joins

9 A Protocol for Set-Intersection Join
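Alice holds A_1, …, A_m; Bob holds B_1, …, B_m
A = the matrix whose rows are the indicator vectors of A_1, A_2, …, A_m
B = the matrix whose columns are the indicator vectors of B_1, B_2, …, B_m
Set-Intersection Join = {(i,j) for which C_{i,j} > 0, where C = A*B}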
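A minimal sketch of this matrix view, assuming the rows of A and the columns of B are 0/1 indicator vectors of the sets: entry (i, j) of C = A*B then equals |A_i ∩ B_j|, so the join is the set of non-zero positions of C. The helper names are illustrative.

```python
def indicator_rows(sets, n):
    """One 0/1 row per set, over the universe {1, ..., n}."""
    return [[1 if t in s else 0 for t in range(1, n + 1)] for s in sets]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Plain dense product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A_sets = [{1, 2}, {3}, {4, 5}]
B_sets = [{2, 3}, {4, 5}, {6}]
n = 6

A = indicator_rows(A_sets, n)             # m x n
B = transpose(indicator_rows(B_sets, n))  # n x m
C = matmul(A, B)                          # C[i][j] = |A_i ∩ B_j|

join = [(i + 1, j + 1)
        for i, row in enumerate(C)
        for j, c in enumerate(row) if c > 0]
print(C)     # [[1, 0, 0], [1, 0, 0], [0, 2, 0]]
print(join)  # [(1, 1), (2, 1), (3, 2)]
```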

10 A Protocol for Set-Intersection Join
Set-Intersection Join = {(i,j) for which C_{i,j} > 0, where C = A*B}
A is an m x n matrix, B is an n x m matrix
Alice computes S*A for a random k x m matrix S, sends S*A to Bob
[Compressed Sensing] S has the property that if x is a vector containing at most k non-zero entries, then given S*x, one can recover x w.h.p.

11 A Protocol for Set-Intersection Join
A is an m x n matrix, B is an n x m matrix
Alice computes S*A for a random k x m matrix S, sends S*A to Bob
Bob computes (S*A)*B
Each column of A*B has at most k non-zeros
From S*A*B, Bob can recover all non-zero entries of A*B
S can be a random Gaussian matrix with O(log n) bits of precision per entry
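The step that makes this work is associativity: (S*A)*B = S*(A*B), so Bob obtains the sketch of every column of A*B without ever seeing A. The check below uses small random integer matrices instead of Gaussian ones so that the comparison is exact; the sizes and the name `r` for the number of sketch rows are illustrative, and no sparse recovery is performed here.

```python
import random

def matmul(X, Y):
    """Plain dense product of two integer matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

rng = random.Random(0)
m, n, r = 4, 6, 3  # r = number of sketch rows (illustrative)

A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]   # m x n, Alice's input
B = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]   # n x m, Bob's input
S = [[rng.randint(-5, 5) for _ in range(m)] for _ in range(r)]  # r x m, shared sketch

SA = matmul(S, A)                     # what Alice sends: r x n
bob_side = matmul(SA, B)              # Bob computes this locally: r x m
reference = matmul(S, matmul(A, B))   # sketch applied to the true product

print(bob_side == reference)  # True
```

Column j of `bob_side` is S applied to column j of A*B, which is the input a sparse-recovery routine would need.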

12 A Protocol for Set-Intersection Join
A is an m x n matrix, B is an n x m matrix
Problem: sending S*A requires k n log n bits of communication
Idea: instead let S be a k^{1/2} x m random matrix!
From S*A*B, can recover each column of A*B with at most k^{1/2} non-zeros
Number of columns of A*B with more than k^{1/2} non-zeros is at most k^{1/2}
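The last claim is a counting argument. Writing H for the set of columns of A*B with more than k^{1/2} non-zeros (the symbol H is used this way on the next slide) and using that the join, and hence A*B, has at most k non-zero entries in total:

```latex
\[
|H| \cdot k^{1/2}
\;\le\; \sum_{j \in H} \operatorname{nnz}\big((AB)_j\big)
\;\le\; \operatorname{nnz}(AB)
\;\le\; k
\quad\Longrightarrow\quad
|H| \;\le\; k^{1/2}.
\]
```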

13 Overall Protocol for Set-Intersection Join
Alice sends Bob S*A, where S has k^{1/2} rows
Alice also sends Bob T*A, where T has O(log n) rows [KNW]
Bob computes T*A*B and finds which columns of A*B have > k^{1/2} non-zeros
Set H of such columns has size at most k^{1/2}
Bob computes S*A*B and recovers (A*B)_j for each j not in H
Bob sends Alice columns B_j for each j in H
Alice directly computes A*B_j for each j in H
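The control flow of the protocol can be sketched as a simulation that tracks which party produces which part of the output. This is a deliberately simplified stand-in: the sketches S*A and T*A are not implemented, and both the sparsity test and the recovery of light columns are replaced by exact computation. The function name `simulate_protocol` and its return format are assumptions.

```python
import math

def simulate_protocol(A_sets, B_sets, k):
    """Return (heavy columns H, join pairs), both 1-based.

    k is the promised upper bound on the join size.
    """
    m = len(B_sets)
    r = math.isqrt(k)  # integer counts exceed k^{1/2} iff they exceed floor(k^{1/2})

    def column(j):
        # Column j of C = A*B, i.e. |A_i ∩ B_j| for every i.
        return [len(a & B_sets[j]) for a in A_sets]

    # Stand-in for Bob estimating column sparsity from T*A*B.
    heavy = [j for j in range(m) if sum(c > 0 for c in column(j)) > r]

    # Stand-in for Bob decoding the light columns from S*A*B.
    bob_pairs = {(i + 1, j + 1)
                 for j in range(m) if j not in heavy
                 for i, c in enumerate(column(j)) if c > 0}

    # Bob sends B_j for each heavy j; Alice computes A*B_j exactly.
    alice_pairs = {(i + 1, j + 1)
                   for j in heavy
                   for i, a in enumerate(A_sets) if a & B_sets[j]}

    return [j + 1 for j in heavy], bob_pairs | alice_pairs

A_sets = [{1}, {1, 2}, {1, 3}, {4}]
B_sets = [{1}, {2}, {3}, {5}]
H, join = simulate_protocol(A_sets, B_sets, k=5)
print(H)             # [1]
print(sorted(join))  # [(1, 1), (2, 1), (2, 2), (3, 1), (3, 3)]
```

In the real protocol the first message costs about k^{1/2} n words for S*A, and the second round costs at most |H| n ≤ k^{1/2} n bits for the heavy columns of B.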

14 Extension to Solve the Natural Join
Not enough to just figure out whether C_{i,j} > 0; need to find all witnesses
A witness is an index k (an element of [n], not the output size) for which A_{i,k} = B_{k,j} = 1
Can be solved iteratively
  Bob deletes entries of B one-at-a-time and re-computes S*A*B
  Alice just sends S*A once
Since Alice has B_j for columns j in H, she directly computes all witnesses

15 A New Algorithm for Matrix Multiplication
Suppose we choose S to be a CountSketch matrix
The same algorithm works to recover C = A*B!
Time
  nnz(A) to compute S*A
  nnz(B) k^{1/2} to compute S*A*B and recover A*B_j for j not in H
  nnz(A) k^{1/2} to compute A*B_j for j in H
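For readers unfamiliar with CountSketch, here is a minimal sketch of a single CountSketch matrix applied to a sparse vector, assuming the usual form in which each column of S holds exactly one ±1 entry in a random row. Applying S then touches each non-zero of the input once, which is why S*A costs time nnz(A). The function names are illustrative, and the recovery step shown is the naive single-repetition estimate, not the decoder used in the talk.

```python
import random

def make_countsketch(rows, m, seed):
    """Column i of S has one non-zero: sign[i] in row bucket[i]."""
    rng = random.Random(seed)
    bucket = [rng.randrange(rows) for _ in range(m)]
    sign = [rng.choice((-1, 1)) for _ in range(m)]
    return bucket, sign

def apply_sketch(sketch, x_sparse, rows):
    """Compute S*x for x given as {index: value}; time proportional to nnz(x)."""
    bucket, sign = sketch
    out = [0] * rows
    for i, v in x_sparse.items():
        out[bucket[i]] += sign[i] * v
    return out

rows, m = 4, 10
sketch = make_countsketch(rows, m, seed=1)
bucket, sign = sketch

x = {3: 7}  # a 1-sparse vector
sx = apply_sketch(sketch, x, rows)
estimate = sign[3] * sx[bucket[3]]  # exact when nothing else shares the bucket
print(estimate)  # 7
```

With more non-zeros, coordinates can collide in a bucket; practical uses repeat the sketch with independent hash functions and combine the estimates.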

16 Matrix Multiplication Application
Gives O((nnz(A) + nnz(B)) k^{1/2}) time for matrix multiplication
[Sparse Transitive Closure Application]
Given: a directed graph G on n nodes
Transitive-closure is the set of pairs (u,v) of nodes for which v is reachable from u
Promise: transitive-closure has O(n) edges
Goal: find all edges in the transitive-closure
Solve by computing A, A^2, A^4, …, A^n, where A is the adjacency matrix
We obtain an Õ(n^{1.5}) time algorithm. Previous best is n^{1.844} time [BCH]
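The repeated-squaring idea can be sketched with plain Boolean products. The version below squares I + A (adjacency with self-loops added) so that each squaring doubles the path length covered, and it stores each row as a set of reachable nodes. It uses a naive product rather than the sparse multiplication algorithm from the talk, so it does not achieve the stated running time, and it omits pairs (u, u) from the output. The function name is illustrative.

```python
def transitive_closure(n, edges):
    """Return all pairs (u, v), u != v, with v reachable from u; nodes are 0..n-1."""
    # reach[u] = nodes reachable from u by a path of length <= `length` (u included).
    reach = [{u} for u in range(n)]
    for u, v in edges:
        reach[u].add(v)
    length = 1
    while length < n:
        # Boolean squaring: afterwards reach covers paths of length <= 2 * length.
        reach = [set().union(*(reach[w] for w in reach[u])) for u in range(n)]
        length *= 2
    return {(u, v) for u in range(n) for v in reach[u] if u != v}

edges = [(0, 1), (1, 2), (2, 3)]
print(sorted(transitive_closure(5, edges)))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Only about log n squarings are needed, so the cost is dominated by how fast each sparse product can be computed.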

17 Matching Communication Lower Bound
Reduction from set-disjointness on a universe of size k^{1/2} n
Disjointness inputs: {x_1, x_2, …, x_{k^{1/2} n}} and {y_1, y_2, …, y_{k^{1/2} n}}
A_1 = {x_1, x_2, …, x_n}, B_1 = {y_1, y_2, …, y_n}
A_2 = {x_{n+1}, x_{n+2}, …, x_{2n}}, B_2 = {y_{n+1}, y_{n+2}, …, y_{2n}}
…
A_m = {x_{(k^{1/2}-1)n+1}, …, x_{k^{1/2} n}}, B_m = {y_{(k^{1/2}-1)n+1}, …, y_{k^{1/2} n}} (here m = k^{1/2})
A*B is a k^{1/2} x k^{1/2} matrix, so it has at most k non-zeros
Whether the diagonal of A*B has a non-zero decides disjointness

18 Our Other Results
Approximate the size |Join(A_1, …, A_m, B_1, …, B_m)|
  Query strategy planning, measuring similarity, etc.
  Output X = (1±ε) |Join(A_1, …, A_m, B_1, …, B_m)|
Communication of Approximate Set-Intersection Join
  [1-way Communication] Upper and lower bounds of n/ε^2
  [2-way Communication] Lower bound of n/ε^{2/3}
Results for set-intersection threshold join as well (tight for 1-way)

19 Conclusion
Summary
  Separation of communication complexity of different join types
  Optimal communication protocol for Set-Intersection and Natural Joins in terms of sparsity
  New algorithms for sparse input and output matrix multiplication
  Upper and lower communication bounds for approximating join sizes
Open Questions
  Communication bounds not tight for approximating join sizes for most joins
    For T-threshold join, off by a (large) factor of (n/T)^{1/2}
  Multi-player versions and more expressive joins
  Other applications of our matrix multiplication algorithm

