
1 Grigory Yaroslavtsev http://grigory.us
Clustering on Clusters 2049: Massively Parallel Algorithms for Clustering Graphs and Vectors

2 Clustering on Clusters: Overview
Algorithm design for massively parallel computing. Blog:
MPC algorithms for graphs: Connectivity; Correlation clustering
MPC algorithms for vectors: K-means; Single-linkage clustering
Open problems and directions

3 Clustering on Clusters 2049: Overview
          Graphs                        Vectors
Basic     Connectivity, Connectivity++  K-means
Advanced  Correlation Clustering        Single-Linkage Clustering, MST

4 Cluster Computation (a la BSP)
Input: size n (e.g. n = billions of edges in a graph), which does not fit in local RAM (S ≪ n)
M machines, S space (RAM) each
Constant overhead in RAM: M · S = O(n)
S = n^(1−ε), e.g. ε = 0.1 or ε = 0.5 (M = S = O(√n))
Output: solution to a problem (often of size O(n))
[Diagram: input of size n ⇒ M machines with S space each ⇒ output]

5 Cluster Computation (a la BSP)
Computation/communication proceeds in R rounds:
Every machine performs a near-linear time computation ⇒ total time O(S^(1+o(1)) · R)
Every machine sends/receives at most S bits of information ⇒ total communication O(n · R)
Goal: minimize R. Ideally: R = constant.
[Diagram: M machines with S space each; per round, ≤ S bits communicated and O(S^(1+o(1))) time per machine]

6 MapReduce-style computations
What I won't discuss today:
PRAMs (shared memory, multiple processors), see e.g. [Karloff, Suri, Vassilvitskii '10]: computing XOR requires Ω(log n) rounds on a CRCW PRAM, but can be done in O(log_S n) rounds of MapReduce
Pregel-style systems, Distributed Hash Tables (see e.g. Ashish Goel's class notes and papers)
Lower-level implementation details (see e.g. the Rajaraman-Leskovec-Ullman book)

7 Models of parallel computation
Bulk-Synchronous Parallel (BSP) model [Valiant '90]
Pro: most general, generalizes all other models
Con: many parameters, hard to design algorithms
Massively Parallel Computation (MPC) [Andoni, Onak, Nikolov, Y. '14] [Feldman, Muthukrishnan, Sidiropoulos, Stein, Svitkina '07; Karloff, Suri, Vassilvitskii '10; Goodrich, Sitchinava, Zhang '11; ...; Beame, Koutris, Suciu '13]
Pros:
Inspired by modern systems (Hadoop, MapReduce, Dryad, Spark, Giraph, ...)
Few parameters, simple to design algorithms
New algorithmic ideas, robust to the exact model specification
# of rounds is an information-theoretic measure ⇒ can prove unconditional results
Con: sometimes not enough to model more complex behavior

8 Business perspective
Pricing: https://cloud.google.com/pricing/
~Linear in space and time usage
100 machines: ~$5K/year
10,000 machines: ~$0.5M/year
You pay a lot more for using provided algorithms

9 Part 1: Clustering Graphs
Applications: community detection, fake account detection, deduplication, storage localization, ...

10 Problem 1: Connectivity
Input: n edges of a graph (arbitrarily partitioned between machines)
Output: is the graph connected? (or: # of connected components)
Question: how many rounds does it take? O(1)? O(log^α n)? O(n^α)? O(2^(αn))? Impossible?

11 Algorithm for Connectivity
Version of Boruvka's algorithm:
All vertices start assigned to different components
Repeat O(log |V|) times:
Each component chooses a neighboring component
All pairs of chosen components get merged
How to avoid chaining? If the graph of components is bipartite and only one side gets to choose, then no chaining occurs.
Randomly assign components to the two sides (a sequential sketch follows below).
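Below is a minimal sequential sketch of this scheme in Python (boruvka_cc is a hypothetical name, not the MPC implementation): each phase randomly 2-colors the current components, lets only color-0 components choose a color-1 neighbor, and merges.

```python
import math, random

def boruvka_cc(n, edges, seed=0):
    """Sequential sketch; in MPC each phase takes O(1) rounds. Vertices are 0..n-1."""
    rng = random.Random(seed)
    comp = list(range(n))                 # every vertex starts in its own component
    for _ in range(4 * max(1, math.ceil(math.log2(max(n, 2))))):   # O(log |V|) phases
        color = {c: rng.randrange(2) for c in set(comp)}
        merge = {}                        # chooser (color 0) -> chosen neighbor (color 1)
        for u, v in edges:
            cu, cv = comp[u], comp[v]
            if cu == cv:
                continue
            if color[cu] == 0 and color[cv] == 1:
                merge.setdefault(cu, cv)
            elif color[cv] == 0 and color[cu] == 1:
                merge.setdefault(cv, cu)
        # chosen (color-1) components never move, so one relabeling pass cannot chain
        comp = [merge.get(c, c) for c in comp]
    return comp                           # comp[u] == comp[v] iff connected (whp)
```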

12 Algorithms for Graphs: Graph Density
Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
Semi-dense: S = Θ(|V|)
Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)

13 Algorithms for Graphs: Graph Density
Dense: S ≫ |V|, e.g. S ≥ |V|^(3/2)
Linear sketching: one round, see [McGregor '14]; workshop at Berkeley tomorrow
"Filtering" [Karloff, Suri, Vassilvitskii, SODA '10; Ene, Im, Moseley, KDD '11; Lattanzi, Moseley, Suri, Vassilvitskii, SPAA '11; Suri, Vassilvitskii, WWW '11], ...

14 Algorithms for Graphs: Graph Density
Semi-dense graphs: S = Θ(|V|) [Avdyukhin, Y.]
Run Boruvka's algorithm for O(log log |V|) rounds: # of vertices reduces down to |V| / log |V|
Repeat O(log |V|) times (a sequential sketch of one repetition follows below):
Compute a spanning tree of the locally stored edges
Put log |V| such trees per machine
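A minimal single-machine sketch of one repetition, assuming each machine's edge set fits in a Python list (spanning_forest and compress_round are hypothetical names):

```python
def spanning_forest(edges):
    """Spanning forest of a locally stored edge set via union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

def compress_round(machine_edges, trees_per_machine):
    """One repetition: keep only a spanning forest per machine, then pack
    trees_per_machine (~log |V|) forests onto each machine."""
    forests = [spanning_forest(E) for E in machine_edges]
    return [sum(forests[i:i + trees_per_machine], [])
            for i in range(0, len(forests), trees_per_machine)]
```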

15 Algorithms for Graphs: Graph Density
Sparse: S ≪ |V|, e.g. S ≤ |V|^(1/2)
Sparse graph problems appear hard
Big open question: connectivity in o(log |V|) rounds?
Probably not: [Roughgarden, Vassilvitskii, Wang '16]
"One Cycle vs. Two Cycles" problem: distinguish one cycle from two cycles in o(log |V|) rounds?

16 Other Connectivity Algorithms
[Rastogi, Machanavajjhala, Chitnis, Das Sarma '13], D = graph diameter:

Algorithm            MR rounds             Communication per round
Hash-Min             D                     O(|V| + |E|)
Hash-to-All          log D                 O(|V|^2 + |E|)
Hash-to-Min          O(log |V|) for paths  Õ(|V| + |E|)
Hash-Greater-to-Min  O(log n)              O(|V| + |E|)

17 Graph-Dependent Connectivity Algs?
Big question: connectivity in O(log D) rounds with Õ(|V| + |E|) communication per round?
[Rastogi et al. '13] conjectured that Hash-to-Min can achieve this
[Avdyukhin, Y. '17]: Hash-to-Min takes Ω(D) rounds
Open problem: better connectivity algorithms if we parametrize by graph expansion?
Other work: [Kiveris et al. '14]

18 What about clustering? ≈ the same ideas work for Single-Linkage Clustering
Connectivity can be used as a primitive to preserve cuts in graphs [Benczur, Karger '98]:
Construct a graph with O(n log n) edges
All cut sizes are preserved up to a factor of 2
This allows running clustering algorithms whose objectives use cuts on this sparse graph

19 Single Linkage Clustering
[Zahn '71] Clustering via the Minimum Spanning Tree:
k clusters: remove the k−1 longest edges from the MST
Maximizes the minimum intercluster distance [Kleinberg, Tardos]
(a single-machine sketch follows below)
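A small single-machine sketch of Zahn's scheme, assuming Euclidean points and small n (single_linkage is a hypothetical name): build the MST with Kruskal, then union only its n−k shortest edges, which is the same as removing the k−1 longest ones.

```python
import itertools, math

def single_linkage(points, k):
    n = len(points)
    pairs = sorted(itertools.combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []                                 # Kruskal: scan pairs by increasing length
    for u, v in pairs:
        if find(u) != find(v):
            parent[find(u)] = find(v)
            mst.append((u, v))
    parent = list(range(n))                  # keep the n-k shortest MST edges,
    for u, v in mst[:n - k]:                 # i.e. drop the k-1 longest
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]       # cluster label per point
```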

20 Part 2: Clustering Vectors
Input: v_1, ..., v_n ∈ ℝ^d
Feature vectors in ML, word embeddings in NLP, etc.
(Implicit) weighted graph of pairwise distances
Applications: same as before + data visualization

21 Large geometric graphs
Graph algorithms: dense graphs vs. sparse graphs
Dense: S ≫ |V|. Sparse: S ≪ |V|.
Our setting: dense graphs, sparsely represented: O(n) space
Output doesn't fit on one machine (S ≪ n)
Today: (1+ε)-approximate MST [Andoni, Onak, Nikolov, Y.]
d = 2 (easy to generalize)
R = log_S n = O(1) rounds (S = n^(Ω(1)))

22 O(log n)-MST in R = O(log n) rounds
Assume points have integer coordinates 0, ..., Δ, where Δ = O(n^2).
Impose an O(log n)-depth quadtree (a sketch of bucketing points by cell follows below)
Bottom-up: for each cell in the quadtree
Compute optimum MSTs in subcells
Use only one representative from each cell on the next level
Wrong representative: O(1)-approximation per level
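A tiny sketch of the bucketing step, assuming 2-d points in [0, Δ]^2 (cells_at_level is a hypothetical name): at depth t the quadtree cells have side 2Δ / 2^t, so grouping points by cell is integer division.

```python
from collections import defaultdict

def cells_at_level(points, level, Delta):
    """Group points by their quadtree cell at the given depth."""
    side = 2 * Delta / 2 ** level            # cell side length at this depth
    buckets = defaultdict(list)
    for x, y in points:
        buckets[(int(x // side), int(y // side))].append((x, y))
    return buckets
```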

23 εL-nets
εL-net for a cell C with side length L:
A collection S of vertices in C such that every vertex is at distance ≤ εL from some vertex in S.
(Fact: an ε-net of size O(1/ε^2) can be computed efficiently; a sketch follows below.)
Bottom-up: for each cell in the quadtree
Compute optimum MSTs in subcells
Use an εL-net from each cell on the next level
Idea: pay only O(εL) for an edge cut by a cell with side L
Randomly shift the quadtree: Pr[cut edge of length ℓ by a cell of side L] ∼ ℓ/L; charge the errors
Wrong representative: O(1)-approximation per level
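A minimal sketch of the size bound, assuming 2-d points (eps_net is a hypothetical name): snap to a grid of pitch εL/√2 and keep one representative per grid box; any two points in a box are within εL, and a cell of side L contains only 2/ε^2 boxes.

```python
def eps_net(cell_points, L, eps):
    pitch = eps * L / 2 ** 0.5           # box diameter = pitch * sqrt(2) = eps * L
    reps = {}
    for x, y in cell_points:
        reps.setdefault((int(x // pitch), int(y // pitch)), (x, y))
    return list(reps.values())           # at most 2 / eps^2 representatives
```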

24 Randomly shifted quadtree
Top cell shifted by a random vector in [0, L]^2
Impose a randomly shifted quadtree (top cell side length 2Δ)
Bottom-up: for each cell in the quadtree
Compute optimum MSTs in subcells
Use an εL-net from each cell on the next level
Pay 5 instead of 4: Pr[Bad Cut] = Ω(1)

25 (1+ε)-MST in R = O(log n) rounds
Idea: only use short edges inside the cells
Impose a randomly shifted quadtree (top cell side length 2Δ/ε)
Bottom-up: for each node (cell) in the quadtree
Compute optimum Minimum Spanning Forests in subcells, using edges of length ≤ εL
Use only an ε^2·L-net from each cell on the next level
Sketch of analysis (T* = optimum MST):
E[extra cost] = E[Σ_{e ∈ T*} Pr[e is cut by a cell with side L] · εL] ≤ ε log n · Σ_{e ∈ T*} d_e = ε log n · cost(T*)
L = Ω(1/ε) ⇒ Pr[Bad Cut] = O(ε)

26 ⇒ (1+ε)-MST in R = O(1) rounds (M = n^(Ω(1)))
O(log n) rounds ⇒ O(log_S n) = O(1) rounds
Flatten the tree: (√M × √M)-grids instead of (2×2)-grids at each level.
Impose a randomly shifted (√M × √M)-tree
Bottom-up: for each node (cell) in the tree
Compute optimum MSTs in subcells via edges of length ≤ εL
Use only an ε^2·L-net from each cell on the next level

27 Single-Linkage Clustering [Y., Vadapalli]
Q: Can we get Single-Linkage Clustering from a (1+ε)-MST?
A: No, a fixed edge can be arbitrarily distorted.
Idea: run O(log n) times and collect all (1+ε)-MST edges
Compute an MST of these edges using Boruvka
Use this MST for k-Single-Linkage Clustering for all k
Overall: O(log n) rounds of MPC instead of O(1)
Q: Is this actually necessary?
A: Most likely yes, assuming sparse connectivity is hard

28 Single-Linkage Clustering [Y., Vadapalli]
Conjecture 1: Sparse Connectivity requires Ω(log |V|) rounds
Conjecture 2: "1 cycle vs. 2 cycles" requires Ω(log |V|) rounds
Hardness of approximation under ℓ_p distances:

Distance  Approximation  Hardness under Conjecture 1*  Hardness under Conjecture 2*
Hamming   Exact          2                              3
ℓ_1       (1+ε)
ℓ_2       (1+ε)          1.41 − ε                       1.84 − ε
ℓ_∞

29 Thanks! Questions? Slides will be available on http://grigory.us
More about algorithms for massive data in the classes I teach.

30 (1+ε)-MST in R = O(1) rounds
Theorem: let n = # of levels in a random tree P. Then E_P[ALG] ≤ (1 + O(εnd)) · OPT.
Proof (sketch):
Δ_P(u, v) = side length of the cell that first partitions (u, v)
New weights: w_P(u, v) = ||u − v||_2 + ε · Δ_P(u, v)
||u − v||_2 ≤ E_P[w_P(u, v)] ≤ (1 + O(εnd)) · ||u − v||_2
Our algorithm implements Kruskal for the weights w_P

31 Technical Details: (1+ε)-MST
"Load balancing": partition the tree into parts of the same size
Almost linear time locally: Approximate Nearest Neighbor data structure [Indyk '99]
Dependence on the dimension d (size of an ε-net is O(d/ε)^d)
Generalizes to bounded doubling dimension

32 Algorithm for Connectivity: Setup
Data: n edges of an undirected graph.
Notation:
π(v) ≡ unique id of v
Γ(S) ≡ set of neighbors of a subset S of vertices
Labels: the algorithm assigns a label ℓ(v) to each vertex v.
L_v ≡ the set of vertices with the label ℓ(v) (invariant: a subset of the connected component containing v)
Active vertices: some vertices will be called active (exactly one per L_v)

33 Algorithm for Connectivity
Mark every vertex as active and let ℓ(v) = π(v).
For phases i = 1, 2, ..., O(log n) do:
Call each active vertex a leader with probability 1/2. If v is a leader, mark all vertices in L_v as leaders.
For every active non-leader vertex w, find the smallest leader (by π) vertex w* in Γ(L_w).
Mark w passive; relabel each vertex with label ℓ(w) by ℓ(w*).
Output: the set of connected components based on ℓ.
(A sequential sketch follows below.)
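A sequential sketch of these phases (leader_election_cc is a hypothetical name; the per-label leader coin is flipped once per phase, which matches marking all of L_v at once):

```python
import math, random

def leader_election_cc(vertices, edges, seed=0):
    rng = random.Random(seed)
    label = {v: v for v in vertices}          # l(v) = pi(v) initially
    phases = 4 * max(1, math.ceil(math.log2(max(len(vertices), 2))))
    for _ in range(phases):                   # O(log n) phases suffice whp
        is_leader = {l: rng.random() < 0.5 for l in set(label.values())}
        best = {}  # non-leader label -> smallest leader label seen across an edge
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                la, lb = label[a], label[b]
                if la != lb and not is_leader[la] and is_leader[lb]:
                    best[la] = min(best.get(la, lb), lb)
        label = {v: best.get(l, l) for v, l in label.items()}
    return label                              # label[u] == label[v] iff same CC (whp)
```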

34 Algorithm for Connectivity: Analysis
If ℓ(u) = ℓ(v) then u and v are in the same CC.
Claim: labels are unique with high probability after O(log n) phases.
In every phase, the # of active vertices in every CC reduces by a constant factor:
Half of the active vertices are declared non-leaders.
Fix an active non-leader vertex v. If there are at least two different labels in the CC of v, then there is an edge (v′, u) such that ℓ(v) = ℓ(v′) and ℓ(v′) ≠ ℓ(u).
u is marked as a leader with probability 1/2 ⇒ half of the active non-leader vertices change their label.
Overall, expect 1/4 of the labels to disappear in each phase.
After O(log n) phases the # of active labels in every connected component drops to one with high probability.

35 Algorithm for Connectivity: Implementation Details
Distributed data structure of size O(|V|) to maintain labels, ids, leader/non-leader status, etc.
O(1) rounds per phase to update the data structure
Edges stored locally with all auxiliary info
Between phases: use the distributed data structure to update the local info on edges
For every active non-leader vertex w, find the smallest leader (w.r.t. π) vertex w* ∈ Γ(L_w):
Each (non-leader, leader) edge sends an update to the distributed data structure
Much faster with a Distributed Hash Table service (DHT) [Kiveris, Lattanzi, Mirrokni, Rastogi, Vassilvitskii '14]

36 Problem 3: K-means
Input: v_1, ..., v_n ∈ ℝ^d
Find k centers c_1, ..., c_k
Minimize the sum of squared distances to the closest center:
Σ_{i=1}^n min_{j=1..k} ||v_i − c_j||_2^2, where ||v_i − c_j||_2^2 = Σ_{t=1}^d (v_{it} − c_{jt})^2
NP-hard

37 K-means++ [Arthur, Vassilvitskii '07]
C = {c_1, ..., c_t} (current collection of centers)
d^2(v, C) = min_{j=1..t} ||v − c_j||_2^2
K-means++ algorithm (gives an O(log k)-approximation):
Pick c_1 uniformly at random from the data
Pick centers c_2, ..., c_k sequentially from the distribution where point v has probability d^2(v, C) / Σ_{i=1}^n d^2(v_i, C)
(a numpy sketch follows below)
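A minimal numpy sketch of the seeding step (kmeans_pp_seed is a hypothetical name; X is an (n, d) array):

```python
import numpy as np

def kmeans_pp_seed(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]               # c_1 uniform over the data
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        # d^2(v, C) only shrinks when a center is added: update incrementally
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # D^2-sampling
    return np.stack(centers)
```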

38 K-means|| [Bahmani et al. '12]
Pick C = {c_1} uniformly at random from the data
Initial cost: ψ = Σ_{i=1}^n d^2(v_i, c_1)
Do O(log ψ) times:
Add O(k) centers from the distribution where point v has probability d^2(v, C) / Σ_{i=1}^n d^2(v_i, C)
Solve k-means for these O(k log ψ) points locally
Thm: if the final step gives an α-approximation ⇒ O(α)-approximation overall
(a numpy sketch follows below)
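A numpy sketch of the oversampling loop, reusing kmeans_pp_seed from above as the final local reduction (all names hypothetical; ell is the per-pass oversampling factor, playing the role of O(k)):

```python
import numpy as np

def kmeans_parallel_seed(X, k, ell, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = X[[rng.integers(n)]]                      # one uniform initial center
    d2 = ((X - C[0]) ** 2).sum(axis=1)
    for _ in range(int(np.ceil(np.log(max(d2.sum(), 2))))):   # O(log psi) passes
        if d2.sum() == 0:                         # every point is already a center
            break
        p = np.minimum(1.0, ell * d2 / d2.sum()) # ~ell centers per pass, D^2-weighted
        C = np.vstack([C, X[rng.random(n) < p]])
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
    return kmeans_pp_seed(C, k, seed)             # reduce O(k log psi) candidates to k
```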

39 Problem 2: Correlation Clustering
Inspired by machine learning:
In practice: [Cohen, McCallum '01; Cohen, Richman '02]
In theory: [Bansal, Blum, Chawla '04]

40 Correlation Clustering: Example
Minimize the # of incorrectly classified pairs: # covered non-edges + # non-covered edges
In the example: 4 incorrectly classified = 1 covered non-edge + 3 non-covered edges
(a small scoring sketch follows below)
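The objective in code, as a one-function sketch (disagreements is a hypothetical name; cluster maps each vertex to a cluster id):

```python
import itertools

def disagreements(n, edges, cluster):
    E = {frozenset(e) for e in edges}
    # covered non-edges (same cluster, no edge) + non-covered edges (edge across clusters)
    return sum((cluster[u] == cluster[v]) != (frozenset((u, v)) in E)
               for u, v in itertools.combinations(range(n), 2))
```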

41 Approximating Correlation Clustering
Minimize the # of incorrectly classified pairs:
≈20000-approximation [Bansal, Blum, Chawla '04]
[Demaine, Emmanuel, Fiat, Immorlica '04], [Charikar, Guruswami, Wirth '05], [Ailon, Charikar, Newman '05], [Williamson, van Zuylen '07], [Ailon, Liberty '08], ...
≈2-approximation [Chawla, Makarychev, Schramm, Y. '15]
Maximize the # of correctly classified pairs:
(1−ε)-approximation [Bansal, Blum, Chawla '04]

42 Correlation Clustering
One of the most successful clustering methods:
Only uses qualitative information about similarities
# of clusters unspecified (selected to best fit the data)
Applications: document/image deduplication (data from crowds or black-box machine learning)
NP-hard [Bansal, Blum, Chawla '04], admits simple approximation algorithms with good provable guarantees

43 Correlation Clustering
More:
Survey [Wirth]
KDD'14 tutorial: "Correlation Clustering: From Theory to Practice" [Bonchi, Garcia-Soriano, Liberty]
Wikipedia article:

44 Data-Based Randomized Pivoting
3-approximation (in expectation) [Ailon, Charikar, Newman]
Algorithm:
Pick a random pivot vertex v
Make a cluster {v} ∪ N(v), where N(v) is the set of neighbors of v
Remove the cluster from the graph and repeat
(a sequential sketch follows below)
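A sequential sketch (pivot_cc is a hypothetical name; adj maps each vertex to its set of '+' neighbors):

```python
import random

def pivot_cc(adj, seed=0):
    rng = random.Random(seed)
    remaining = set(adj)
    clusters = []
    while remaining:
        p = rng.choice(sorted(remaining))        # random pivot
        cluster = ({p} | adj[p]) & remaining     # pivot plus its surviving neighbors
        clusters.append(cluster)
        remaining -= cluster                     # remove the cluster and repeat
    return clusters
```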

45 Data-Based Randomized Pivoting
Pick a random pivot vertex p
Make a cluster {p} ∪ N(p), where N(p) is the set of neighbors of p
Remove the cluster from the graph and repeat
In the example: 8 incorrectly classified = 2 covered non-edges + 6 non-covered edges

46 Parallel Pivot Algorithm
(3+ε)-approximation in O(log^2 n / ε) rounds [Chierichetti, Dalvi, Kumar, KDD'14]
Algorithm: while the graph is not empty:
D = current maximum degree
Activate each node independently with probability ε/D
Deactivate nodes connected to other active nodes; the remaining active nodes are pivots
Create a cluster around each pivot as before
Remove the clusters
(a sketch of one round follows below)
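A sketch of one round (parallel_pivot_round is a hypothetical name); the caller loops until remaining is empty, collecting the clusters:

```python
import random

def parallel_pivot_round(remaining, adj, eps, rng):
    D = max((len(adj[v] & remaining) for v in remaining), default=0)
    if D == 0:                                    # only isolated vertices left
        return [{v} for v in remaining], set()
    active = {v for v in remaining if rng.random() < eps / D}
    pivots = {v for v in active if not (adj[v] & active)}   # drop clashing actives
    owner = {p: p for p in pivots}
    for p in sorted(pivots):                      # cluster neighbors around pivots;
        for u in (adj[p] & remaining) - set(owner):
            owner[u] = p                          # ties broken by smallest pivot id
    clusters = {}
    for v, p in owner.items():
        clusters.setdefault(p, set()).add(v)
    return list(clusters.values()), remaining - set(owner)
```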

47 Parallel Pivot Algorithm: Analysis
Fact: the maximum degree halves after (1/ε) · log n rounds ⇒ the algorithm terminates in O(log^2 n / ε) rounds
Fact: the activation process induces a close-to-uniform marginal distribution over pivots ⇒ an analysis similar to the sequential pivot gives a (3+ε)-approximation

