
1 Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams
Kijung Shin (1), Mohammad Hammoud (1), Euiwoong Lee (1), Jinoh Oh (2), Christos Faloutsos (1)
(1) Carnegie Mellon University, (2) Adobe Systems

2 Triangles in a Graph
Graphs are everywhere: social networks, the web, citation networks.
Triangles are a fundamental primitive: 3 nodes connected to each other.
Counting triangles has many applications: community detection, anomaly detection, query optimization.
Distributed Estimation of Global and Local Triangle Counts in Graph Streams (by Kijung Shin)

3 Application: Anomaly Detection
[Figure: two plots with axes # Incident Triangles and Degree, from [KMF11] and [LJK18]; one annotation reads "Telemarketer"]

4 Remaining Challenges
Counting triangles in real-world graphs, such as online social networks, the Web, citation networks, and call networks.
Real-world graphs are:
Large: not fitting in main memory
Dynamic: growing with new nodes and edges

5 Previous Approaches
Distributed algorithms [SV11] [PC13] [PPK18]
pros: utilize multiple machines
cons: inapplicable to dynamic graphs
Streaming algorithms [DERU16] [Shi17] [LJK18]
pros: applicable to dynamic graphs
cons: limited to a single machine

6 Our Approach and Goal
Can we have the best of both worlds: for dynamic graphs, on multiple machines?
We design a distributed streaming algorithm that is:
Fast and Accurate: outperforming competitors
Scalable: with linear data scalability
Theoretically Sound: with unbiased estimates

7 Road Map
Problem Definition
Algorithm: Tri-Fly
Theoretical Analyses
Experiments
Conclusion

8 Problem Definition
Given: a graph stream (a sequence of new edges in a dynamic graph)
Estimate: counts of global and local triangles
Using: multiple machines with limited memory (up to 𝑘 edges can be stored in each machine)
To Minimize: estimation error

9 Problem Definition (cont.)
Global triangles: all triangles in the graph
Local triangles: the triangles incident to each node
[Figure: small example graph illustrating global and local triangle counts]
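To make the two quantities concrete, here is a minimal exact-counting sketch for a small in-memory graph. It is not part of Tri-Fly, which only estimates these counts under a memory budget; the function name `count_triangles` and the example edge list are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Exact global and local triangle counts of an undirected simple graph.

    Returns (global_count, local_counts), where local_counts[u] is the
    number of triangles incident to node u.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    global_count = 0
    local_counts = defaultdict(int)
    for u in adj:
        # sorted() guarantees v < w, and the check u < v makes u the
        # smallest node, so every triangle is counted exactly once.
        for v, w in combinations(sorted(adj[u]), 2):
            if u < v and w in adj[v]:
                global_count += 1
                for node in (u, v, w):
                    local_counts[node] += 1
    return global_count, dict(local_counts)


# Hypothetical example: two triangles, {1, 2, 3} and {2, 3, 4}.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
g, l = count_triangles(edges)
print(g, l)
```

Nodes 2 and 3 lie on both triangles, so their local counts are higher than those of nodes 1 and 4, while the global count covers the whole graph.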

10 Road Map
Problem Definition
Algorithm: Tri-Fly <<
Theoretical Analyses
Experiments
Conclusion

11 Overview of Tri-Fly
Inputs: new edges streamed from source(s)
Components: source(s), master(s), worker(s), aggregator(s)
Worker(s): discover triangles with limited memory
Aggregator(s): aggregate estimates
Outputs: estimated counts of global and local triangles
Tri-Fly processes each new edge when it arrives and updates the estimated counts for each edge.

12 Overview of Tri-Fly (cont.)
source(s) -> master(s): unicast of each new edge
master(s) -> worker(s): broadcast
worker(s): count new triangles using local memory
worker(s) -> aggregator(s): shuffle counts by ℎ(node)
aggregator(s): aggregate counts & update outputs

13 Challenge: Limited Memory
How should we ‘count’ and ‘aggregate’ for accurate estimation when each machine has limited memory?
Our solution adapts Triest-IMPR [DERU16].

14 Workers in Detail
Each worker runs three steps for each received edge:
(a) Edge arrival
(b) Discovering
(c) Sampling
[Figure: the worker's memory of stored edges across the three steps for a new edge 𝑢−𝑣]

15 Workers in Detail (cont.)
(a) Edge arrival step: receives a new edge 𝑢−𝑣

16 Workers in Detail (cont.)
(b) Discovering step:
discovers new triangles in its local memory
sends updates to the aggregators
𝛿 := 1 / (discovering prob. of the triangle)
Example: the new edge 𝑢−𝑣 discovers the triangle 𝑢−𝑣−𝑥. The worker sends
(𝑢,𝛿) to aggregator ℎ(𝑢)
(𝑣,𝛿) to aggregator ℎ(𝑣)
(𝑥,𝛿) to aggregator ℎ(𝑥)
( ,𝛿) to aggregator ℎ( )

17 Workers in Detail (cont.)
(b) Discovering step:
discovers new triangles in its local memory
sends updates to the aggregators
𝛿 := 1 / (discovering prob. of the triangle)
Example: the new edge 𝑢−𝑣 discovers the triangle 𝑢−𝑣−𝑦. The worker sends
(𝑢,𝛿) to aggregator ℎ(𝑢)
(𝑣,𝛿) to aggregator ℎ(𝑣)
(𝑦,𝛿) to aggregator ℎ(𝑦)
( ,𝛿) to aggregator ℎ( )
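The discovering step might be sketched as follows. This is a hedged illustration, not the authors' implementation: the names `discover`, `memory_adj`, and the `GLOBAL` key (standing in for the global-count symbol that is missing in this transcript) are hypothetical. The probability formula assumes the memory is a uniform reservoir sample of size 𝑘 ≥ 2 over the 𝑡−1 earlier edges, in which case two specific earlier edges are stored together with probability min(1, 𝑘(𝑘−1) / ((𝑡−1)(𝑡−2))).

```python
GLOBAL = "*"  # hypothetical key for the global-count update


def discover(memory_adj, u, v, t, k):
    """Return the (key, delta) updates triggered by the t-th edge u-v.

    memory_adj: dict mapping a node to the set of its neighbors among
                the edges currently stored in this worker's memory.
    t: 1-based position of the edge u-v in the stream.
    k: per-worker memory budget in edges (assumed k >= 2).
    """
    # delta = 1 / (probability that both other edges of the triangle
    # are in memory when u-v arrives), assuming reservoir sampling.
    if t - 1 <= k:
        delta = 1.0  # nothing has been discarded yet
    else:
        delta = ((t - 1) * (t - 2)) / (k * (k - 1))

    updates = []
    # Every common neighbor x of u and v closes a triangle u-v-x.
    common = memory_adj.get(u, set()) & memory_adj.get(v, set())
    for x in sorted(common):
        updates += [(u, delta), (v, delta), (x, delta), (GLOBAL, delta)]
    return updates


# Memory holds u-x, u-y, v-x, v-y, as in the slides' running example.
memory_adj = {"u": {"x", "y"}, "v": {"x", "y"},
              "x": {"u", "v"}, "y": {"u", "v"}}
for update in discover(memory_adj, "u", "v", t=11, k=4):
    print(update)
```

Each discovered triangle yields four updates: one per corner for the local counts and one for the global count.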

18 Workers in Detail (cont.)
(c) Sampling step:
stores or discards the new edge
follows the standard reservoir sampling
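A minimal sketch of the sampling step, assuming the textbook form of reservoir sampling: the first 𝑘 edges are always stored, and the 𝑡-th edge (𝑡 > 𝑘) replaces a uniformly chosen stored edge with probability 𝑘/𝑡. The function name `sample_edge` and the list-based memory are illustrative choices, not taken from the slides.

```python
import random


def sample_edge(memory, edge, t, k, rng=random):
    """Reservoir-sampling step for the t-th edge (1-based) with budget k.

    memory is a list of stored edges and is modified in place.
    Returns True if the edge was stored, False if it was discarded.
    """
    if len(memory) < k:
        memory.append(edge)  # budget not exhausted: always store
        return True
    if rng.random() < k / t:
        # Keep the new edge with probability k / t and evict a stored
        # edge chosen uniformly at random.
        memory[rng.randrange(k)] = edge
        return True
    return False


# Usage: feed a stream of 100 hypothetical edges through a budget of 10.
rng = random.Random(0)
memory = []
stream = [(i, i + 1) for i in range(100)]
for t, edge in enumerate(stream, start=1):
    sample_edge(memory, edge, t, 10, rng)
print(len(memory))
```

After any number of edges, the memory is a uniform random sample of the edges seen so far, which is what makes the discovering probability of a triangle computable.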

19 Aggregators in Detail
Maintain estimates:
𝒈 in ℎ( ) for the global triangle count
𝒍[𝒖] in ℎ(𝑢) for the local triangle count of node 𝑢
Update estimates:
for each update ( ,𝜹), increase 𝒈 by 𝛿 / (number of workers)
for each update (𝒖,𝜹), increase 𝒍[𝒖] by 𝛿 / (number of workers)
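The aggregator logic could look roughly like the sketch below. The class name `Aggregator`, the `GLOBAL` key (a stand-in for the global-count symbol missing in this transcript), and the CRC32-based `route` function used as ℎ are assumptions for illustration; the slides only state that updates are shuffled by ℎ(node) and that each 𝛿 is divided by the number of workers.

```python
import zlib
from collections import defaultdict

GLOBAL = "*"  # hypothetical key for the global-count update


def route(key, num_aggregators):
    """Hypothetical h(key): deterministic index of the responsible aggregator."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_aggregators


class Aggregator:
    """Keeps the estimates for the keys routed to this aggregator."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.g = 0.0                  # global triangle count estimate
        self.l = defaultdict(float)   # local triangle count estimates

    def apply(self, key, delta):
        # Every worker sees the whole stream, so the per-worker
        # contributions are averaged: each adds delta / num_workers.
        share = delta / self.num_workers
        if key == GLOBAL:
            self.g += share
        else:
            self.l[key] += share


# Usage: 4 workers each report the same triangle with delta = 2.0.
aggregators = [Aggregator(num_workers=4) for _ in range(2)]
for _ in range(4):
    for key in ("u", "v", "x", GLOBAL):
        aggregators[route(key, 2)].apply(key, 2.0)
print(sum(a.g for a in aggregators))
```

Because every update for a given key goes to the same aggregator, no two aggregators hold partial estimates of the same node.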

20 Summary of Tri-Fly
source(s) -> master(s): unicast of each new edge
master(s) -> worker(s): broadcast
worker(s): discover triangles with limited memory (count new triangles in local memory)
worker(s) -> aggregator(s): shuffle counts by ℎ(node)
aggregator(s): aggregate counts & update outputs

21 Road Map
Problem Definition
Algorithm: Tri-Fly
Theoretical Analyses <<
Experiments
Conclusion

22 THM1: Unbiasedness
Tri-Fly maintains estimates satisfying the following:
𝑬𝒙𝒑[𝒈] = True global count
For each node 𝑢, 𝑬𝒙𝒑[𝒍[𝒖]] = True local count of 𝑢
[Figure: frequency distribution of the estimates, with the True Count marked]
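A short derivation sketch of why inverse-probability weights give an unbiased global estimate. The notation is assumed for illustration and does not come from the slides: $\mathcal{T}$ is the set of triangles, $n$ the number of workers, $X_{i,\tau}$ indicates that worker $i$ discovers triangle $\tau$, and $p_\tau > 0$ is the discovering probability, so that $\delta_\tau = 1/p_\tau$.

```latex
\begin{align*}
g &= \frac{1}{n} \sum_{i=1}^{n} \sum_{\tau \in \mathcal{T}} \delta_\tau \, X_{i,\tau},
  \qquad \delta_\tau = \frac{1}{p_\tau}, \quad \Pr[X_{i,\tau} = 1] = p_\tau \\
\mathbb{E}[g] &= \frac{1}{n} \sum_{i=1}^{n} \sum_{\tau \in \mathcal{T}}
  \frac{1}{p_\tau} \, \mathbb{E}[X_{i,\tau}]
  = \frac{1}{n} \sum_{i=1}^{n} \sum_{\tau \in \mathcal{T}} \frac{p_\tau}{p_\tau}
  = |\mathcal{T}|
\end{align*}
```

The same argument, restricted to the triangles incident to a node $u$, gives the local statement. It relies only on linearity of expectation, so it needs no independence between triangles or between workers.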

23 THM2: Linear Drop of Variance
Tri-Fly maintains estimates satisfying the following:
𝑽𝒂𝒓[𝒈] ∝ 1 / (Number of workers)
For each node 𝑢, 𝑽𝒂𝒓[𝒍[𝒖]] ∝ 1 / (Number of workers)
[Figure: log(Variance) versus log(# Workers)]
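The inverse-linear drop can be sketched under one assumption that the slides do not state explicitly: the $n$ workers sample independently, so their per-worker estimates $g_1, \dots, g_n$ are independent and identically distributed with variance $\sigma^2$, and the aggregator averages them.

```latex
\begin{align*}
g &= \frac{1}{n} \sum_{i=1}^{n} g_i \\
\mathrm{Var}[g] &= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[g_i]
  = \frac{\sigma^2}{n} \\
\log \mathrm{Var}[g] &= \log \sigma^2 - \log n
\end{align*}
```

On a log-log plot of variance against the number of workers, this is a straight line with slope $-1$.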

24 THM3: Linear Scalability
With a fixed per-worker memory budget 𝑘:
Running time of Tri-Fly ∝ Number of edges in the input stream
[Figure: Running Time versus # Edges]

25 Properties of Tri-Fly
Fast and accurate: outperforming competitors
Scalable: with linear data scalability (THM 3)
Theoretically sound: with unbiased estimates (THM 1)

26 Road Map
Problem Definition
Algorithm: Tri-Fly
Theoretical Analyses
Experiments <<
Conclusion

27 Experimental Settings
Competitors: MASCOT [LJK18] & Triest-IMPR [DERU16], state-of-the-art single-machine streaming algorithms for both global and local triangle counts
Implementations: C++ & MPICH (asynchronous communication); 1 master & 1 aggregator & up to 40 workers
Datasets:
ER Synthetic (100B)
Social (1.8B+)
Social (22M+)
Patent citation (16M+)
Web (6M+)

28 EXP1. Bias Analysis
“Does Tri-Fly give unbiased estimates?” (THM 1)
[Figure legend: True Count, Tri-Fly (10 workers), Tri-Fly (5 workers), Tri-Fly (1 worker)]
- 𝑘: 5% of edges
- Dataset:

29 EXP2. Variance Analysis
“How rapidly does the variance decrease w.r.t. the number of workers?” (THM 2)
[Figure legend: MASCOT, Triest-IMPR, Tri-Fly; annotation: Slope=−1.0]
- 𝑘: 5% of edges
- Dataset:

30 EXP3. Speed and Accuracy
“Does Tri-Fly outperform single-machine baselines?”
[Figure legend: Tri-Fly]
30 workers, 𝑘: {2%,5%,40%} of edges, Dataset:

31 EXP3. Speed and Accuracy
“Does Tri-Fly outperform single-machine baselines?”
[Figure legend: Tri-Fly; axis: Root Mean Square Error]
30 workers, 𝑘: {2%,5%,40%} of edges, Dataset:

32 EXP4. Scalability
“Does Tri-Fly scale linearly with the size of the input stream?” (THM 3)
[Figure: Tri-Fly against a Linear Increase (slope=1) reference; 100B edges (800GB), ER]
30 workers, 𝑘: , Dataset:

33 Properties of Tri-Fly
Fast and accurate: outperforming competitors (EXP 3)
Scalable: with linear data scalability (EXP 4)
Theoretically sound: with unbiased estimates (EXP 1)

34 Road Map
Problem Definition
Algorithm: Tri-Fly
Theoretical Analyses
Experiments
Conclusion <<

35 Conclusion
We propose Tri-Fly, the first distributed streaming algorithm for counting global and local triangles.
Fast & Accurate
Scalable
Theoretically Sound
Code and datasets:

36 References
[SV11] Siddharth Suri, Sergei Vassilvitskii, “Counting Triangles and the Curse of the Last Reducer”, WWW 2011
[KMF11] U Kang, Brendan Meeder, Christos Faloutsos, “Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation”, PAKDD 2011
[PC13] Ha-Myung Park, Chin-Wan Chung, “An Efficient MapReduce Algorithm for Counting Triangles in a Very Large Graph”, CIKM 2013
[DERU16] Lorenzo De Stefani et al., “TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size”, KDD 2016
[Shi17] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
[LJK18] Yongsub Lim, Minsoo Jung, U Kang, “Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams: From Simple to Multigraphs”, TKDD 2018
[PPK18] Ha-Myung Park, Chiwan Park, U Kang, “PegasusN: A Scalable and Versatile Graph Mining System”, AAAI 2018

