The Communication Complexity of Distributed Set-Joins

The Communication Complexity of Distributed Set-Joins
Dirk Van Gucht (Indiana University), Ryan Williams (Stanford), David Woodruff (IBM Almaden), Qin Zhang (Indiana University)

Set-Joins
- Given sets A_1, …, A_m ⊆ [n] = {1, 2, …, n}
- Given sets B_1, …, B_m ⊆ [n]
- Join = {(i, j) for which P(A_i, B_j)}, where P is a predicate
- P might be:
  - set intersection: |A_i ∩ B_j| > 0
  - set non-intersection: |A_i ∩ B_j| = 0
  - set equality: A_i = B_j
  - set-intersection threshold join: |A_i ∩ B_j| > T
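
To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the paper) of the set-intersection join on sets represented as Python sets:

```python
# A minimal sketch, assuming the sets are given as Python sets of elements
# drawn from [n]; the join just tests the intersection predicate pairwise.
def set_intersection_join(As, Bs):
    """Return all pairs (i, j) with |A_i ∩ B_j| > 0."""
    return [(i, j)
            for i, A in enumerate(As)
            for j, B in enumerate(Bs)
            if A & B]  # swap this predicate for non-intersection, equality, or a threshold

# Example with A_1 = {1, 2}, A_2 = {3}, B_1 = {2}, B_2 = {4}; pairs are 0-indexed
print(set_intersection_join([{1, 2}, {3}], [{2}, {4}]))   # [(0, 0)]
```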

Communication Complexity
- Alice holds A_1, …, A_m; Bob holds B_1, …, B_m
- Alice and Bob want to minimize communication so that together they can output the join
- The protocol can fail with probability 1/100 (over its randomness)

Communication Complexity of Joins
- Set-intersection, set non-intersection, and set-intersection threshold join all require Ω(mn) communication
- Reduction from set-disjointness on a universe of size m·n; matched by the trivial upper bound
- Alice's universe {x_1, …, x_{mn}}: A_1 = {x_1, …, x_n}, A_2 = {x_{n+1}, …, x_{2n}}, …, A_m = {x_{(m-1)n+1}, …, x_{mn}}
- Bob's universe {y_1, …, y_{mn}}: B_1 = {y_1, …, y_n}, B_2 = {y_{n+1}, …, y_{2n}}, …, B_m = {y_{(m-1)n+1}, …, y_{mn}}

Communication Complexity of Joins
- Set-equality has communication Θ(m)
- Ω(m): that many bits must be transferred to the player outputting the join
- O(m) upper bound due to a set-intersection protocol of [BCKWY]
- Randomly hash the sets A_1, …, A_m and B_1, …, B_m to values x_1, …, x_m and y_1, …, y_m in [m^2]
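
The hashing step can be sketched as below. This is only an illustration of the idea on the slide (shared randomness, fingerprints in a range of size about m^2), not the [BCKWY] protocol; the helper name and the small prime are mine:

```python
import random

# A small sketch of the hashing idea: with shared randomness, give every
# universe element a random value mod a prime p >= m^2 and fingerprint each
# set by the sum of its elements' values. Equal sets always agree; unequal
# sets agree with probability at most 1/p.
def fingerprint(S, values, p):
    return sum(values[e] for e in S) % p

n, m, p = 10, 2, 101          # p is a prime >= m^2 (hypothetical small example)
rng = random.Random(0)        # stands in for Alice and Bob's shared randomness
values = [rng.randrange(p) for _ in range(n)]

As = [{1, 2}, {3, 4}]
Bs = [{3, 4}, {5}]
fa = [fingerprint(A, values, p) for A in As]   # Alice sends m fingerprints
fb = [fingerprint(B, values, p) for B in Bs]
print([(i, j) for i in range(m) for j in range(m) if fa[i] == fb[j]])
# [(1, 0)] with high probability
```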

Communication Complexity of Joins
- Can we achieve better communication for joins in practice? Exploit sparsity:
  - Output: typically the output size |Join(A_1, …, A_m, B_1, …, B_m)| is at most k << m^2
  - Input: typically |A_1|, …, |A_m|, |B_1|, …, |B_m| are each at most s << n
- Can get better bounds in terms of k and s!
- Focus on the set-intersection join (generalizes to the natural join)

Our Main Results
- O(k^{1/2} n) communication protocol for the set-intersection join
  - More efficient than O(mn) for small k!
  - Improves the O(kn) communication protocol of [Williams, Yu, SODA ’14]
- Matching Ω(k^{1/2} n) lower bound
- For input sparsity s, we show a Θ(min(ms, (skn)^{1/2}) + n) communication bound
- Highlights:
  - Uses 2 rounds of interaction (1 round is impossible!)
  - For each (i, j) in the join, returns all elements k with k ∈ A_i and k ∈ B_j (natural join)
  - New algorithms for input/output-sparse matrix multiplication

Talk Outline
- Our O(k^{1/2} n) communication protocol
- Extension to solve the natural join problem
- New algorithm for input/output-sparse matrix multiplication
- Application to sparse transitive closure
- An Ω(k^{1/2} n) communication lower bound
- Estimating sizes of set joins

A Protocol for Set-Intersection Join
- Alice stacks her sets as the rows of an m x n 0/1 matrix A = [A_1; A_2; …; A_m]; Bob stacks his as the columns of an n x m 0/1 matrix B = [B_1, B_2, …, B_m]
- Set-Intersection Join = {(i, j) for which C_{i,j} > 0, where C = A*B}
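
A small numerical sketch of this matrix view (the indicator encoding below is my own rendering of the figure):

```python
import numpy as np

# The rows of A are the indicator vectors of A_1, ..., A_m and the columns of
# B are the indicator vectors of B_1, ..., B_m, so (i, j) is in the
# set-intersection join exactly when C[i, j] = (A @ B)[i, j] > 0.
def indicator_rows(sets, n):
    M = np.zeros((len(sets), n), dtype=int)
    for i, S in enumerate(sets):
        M[i, list(S)] = 1
    return M

n = 6
As = [{0, 1}, {2}]
Bs = [{1, 5}, {3}]
A = indicator_rows(As, n)          # m x n
B = indicator_rows(Bs, n).T        # n x m
C = A @ B
print([(int(i), int(j)) for i, j in zip(*np.nonzero(C))])   # [(0, 0)]
```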

A Protocol for Set-Intersection Join
- Set-Intersection Join = {(i, j) for which C_{i,j} > 0, where C = A*B}; A is an m x n matrix, B is an n x m matrix
- Alice computes S*A for a random k x m matrix S and sends S*A to Bob
- [Compressed Sensing] S has the property that if x is a vector containing at most k non-zero entries, then given S*x, one can recover x w.h.p.

A Protocol for Set-Intersection Join
- A is an m x n matrix, B is an n x m matrix
- Alice computes S*A for a random k x m matrix S and sends S*A to Bob
- Bob computes (S*A)*B; each column of A*B has at most k non-zeros
- From S*A*B, Bob can recover all non-zero entries of A*B
- S can be a random Gaussian matrix with O(log n) bits of precision per entry
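
A minimal numerical check of the algebraic fact the protocol relies on: Bob can compute the sketch of A*B without ever seeing A. The sparse-recovery decoder that turns a sketched column back into a k-sparse column is standard compressed sensing and is not shown here.

```python
import numpy as np

# Check that (S*A)*B = S*(A*B): multiplying Alice's sketch by B on the right
# gives exactly the sketch of the product matrix.
rng = np.random.default_rng(0)
m, n, k = 10, 8, 3
A = rng.integers(0, 2, size=(m, n))
B = rng.integers(0, 2, size=(n, m))
S = rng.standard_normal((k, m))              # random Gaussian sketch, k x m

sketch_alice_sends = S @ A                   # k x n: what Alice transmits
sketch_of_product = sketch_alice_sends @ B   # Bob multiplies by his matrix B
assert np.allclose(sketch_of_product, S @ (A @ B))
print(sketch_of_product.shape)               # (3, 10)
```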

A Protocol for Set-Intersection Join
- Problem: sending S*A requires k·n·log n bits of communication
- Idea: instead let S be a k^{1/2} x m random matrix!
- From S*A*B, Bob can recover each column of A*B that has at most k^{1/2} non-zeros
- The number of columns of A*B with more than k^{1/2} non-zeros is at most k^{1/2}

Overall Protocol for Set-Intersection Join
- Alice sends Bob S*A, where S has k^{1/2} rows
- Alice also sends Bob T*A, where T has O(log n) rows [KNW]
- Bob computes T*A*B and finds which columns of A*B have more than k^{1/2} non-zeros
  - The set H of such columns has size at most k^{1/2}
- Bob computes S*A*B and recovers (A*B)_j for each j not in H
- Bob sends Alice the columns B_j for each j in H
- Alice directly computes A*B_j for each j in H
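
The overall structure can be mocked up as below. The [KNW] L0 sketch and the compressed-sensing decoder are replaced by direct computations on A*B, so this is only an illustration of the heavy/light column split and the rough O(k^{1/2}·n) word count, not the actual protocol:

```python
import numpy as np

# A structural mock of the two-round protocol; exact computations on C = A*B
# stand in for the sketches T*A*B and S*A*B.
rng = np.random.default_rng(1)
m, n = 20, 15
A = (rng.random((m, n)) < 0.1).astype(int)
B = (rng.random((n, m)) < 0.1).astype(int)
C = A @ B
k = max(1, np.count_nonzero(C))              # the output-sparsity promise

thresh = int(np.sqrt(k))
col_nnz = np.count_nonzero(C, axis=0)        # stand-in for what T*A*B reveals
H = [j for j in range(m) if col_nnz[j] > thresh]       # heavy: Bob asks for B_j
light = [j for j in range(m) if col_nnz[j] <= thresh]  # light: recovered from S*A*B

assert len(H) <= thresh                      # at most sqrt(k) columns can be heavy
words = thresh * n + len(H) * n              # S*A (sqrt(k) x n) plus the requested B_j
print(f"k={k}, heavy columns={H}, ~{words} words vs. trivial m*n={m*n}")
```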

Extension to Solve the Natural Join
- It is not enough to determine whether C_{i,j} > 0; we need to find all witnesses
  - A witness is an index k for which A_{i,k} = B_{k,j} = 1
- Can be solved iteratively: Bob deletes entries of B one at a time and re-computes S*A*B
  - Alice sends S*A only once
- Since Alice has B_j for the columns j in H, she directly computes all witnesses for those columns (a small sketch follows)
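
A tiny sketch of the witness notion (the iterative deletion trick itself is not shown):

```python
# Once Alice holds B_j (for j in H), the witnesses for a pair (i, j) in the
# join are simply the elements of A_i ∩ B_j.
def witnesses(A_i, B_j):
    return sorted(A_i & B_j)

print(witnesses({1, 3, 5}, {3, 5, 7}))   # [3, 5]
```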

A New Algorithm for Matrix Multiplication
- Suppose we choose S to be a CountSketch matrix (each column has a single ±1 entry in a random row)
- The same algorithm works to recover C = A*B!
- Time:
  - nnz(A) to compute S*A
  - nnz(B)·k^{1/2} to compute S*A*B and recover A*B_j for j not in H
  - nnz(A)·k^{1/2} to compute A*B_j for j in H
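
A small sketch of how such an S can be generated and applied; this is the standard CountSketch construction, assumed here rather than taken from the slide:

```python
import numpy as np

# Each column of S has a single +1 or -1 entry placed in a uniformly random
# row, so S*A costs nnz(A) time when A is stored sparsely (one signed update
# per non-zero of A).
def countsketch(rows, m, rng):
    S = np.zeros((rows, m))
    bucket = rng.integers(0, rows, size=m)   # random target row per column
    sign = rng.choice([-1, 1], size=m)       # random sign per column
    S[bucket, np.arange(m)] = sign
    return S

rng = np.random.default_rng(0)
m = 8
S = countsketch(rows=4, m=m, rng=rng)
A = rng.integers(0, 2, size=(m, 5))
print((S @ A).shape)   # (4, 5): the sketch Alice would send
```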

Matrix Multiplication Application
- Gives O((nnz(A) + nnz(B))·k^{1/2}) time for matrix multiplication
- [Sparse Transitive Closure Application]
  - Given: a directed graph G on n nodes
  - The transitive closure is the set of pairs (u, v) of nodes for which v is reachable from u
  - Promise: the transitive closure has O(n) edges
  - Goal: find all edges in the transitive closure
  - Solve by computing A, A^2, A^4, …, A^n, where A is the adjacency matrix
  - We obtain an Õ(n^{1.5}) time algorithm; the previous best was n^{1.844} time [BCH]
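
A small sketch of the repeated-squaring step, using dense boolean arithmetic for clarity; the paper's speedup comes from doing each product with the sparse-output multiplication above:

```python
import numpy as np

# Start from "reachable in at most one step" and square ceil(log2 n) times;
# each squaring doubles the path length covered.
def transitive_closure(adj):
    n = len(adj)
    R = (np.array(adj) > 0) | np.eye(n, dtype=bool)        # reach in <= 1 step
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        R = (R.astype(int) @ R.astype(int)) > 0             # boolean squaring
    return R

adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
print(transitive_closure(adj).astype(int))
# [[1 1 1]
#  [0 1 1]
#  [0 0 1]]
```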

Matching Communication Lower Bound
- Reduction from set-disjointness on a universe of size k^{1/2}·n
- Alice's universe {x_1, …, x_{√k·n}}: A_1 = {x_1, …, x_n}, A_2 = {x_{n+1}, …, x_{2n}}, …, A_{√k} = {x_{(√k-1)n+1}, …, x_{√k·n}}
- Bob's universe {y_1, …, y_{√k·n}}: B_1 = {y_1, …, y_n}, B_2 = {y_{n+1}, …, y_{2n}}, …, B_{√k} = {y_{(√k-1)n+1}, …, y_{√k·n}}
- A*B is a k^{1/2} x k^{1/2} matrix, so it has at most k non-zeros
- A non-zero on the diagonal decides disjointness

Our Other Results
- Approximate the size |Join(A_1, …, A_m, B_1, …, B_m)|
  - Useful for query strategy planning, measuring similarity, etc.
  - Output X = (1 ± ε)·|Join(A_1, …, A_m, B_1, …, B_m)|
- Communication of the approximate set-intersection join:
  - [1-way communication] Upper and lower bounds of n/ε^2
  - [2-way communication] Lower bound of n/ε^{2/3}
- Results for the set-intersection threshold join as well (tight for 1-way)

Conclusion
- Summary:
  - Separation of the communication complexity of different join types
  - Optimal communication protocols for set-intersection and natural joins in terms of sparsity
  - New algorithms for sparse-input and sparse-output matrix multiplication
  - Upper and lower communication bounds for approximating join sizes
- Open questions:
  - Communication bounds are not tight for approximating join sizes for most joins; for the T-threshold join they are off by a (large) factor of √(n/T)
  - Multi-player versions and more expressive joins
  - Other applications of our matrix multiplication algorithm