Published by Meagan Arnold. Modified over 10 years ago.
1
Efficient Cohesive Subgraph Detection in Parallel
Yingxia Shao, Lei Chen, Bin Cui
School of EECS, Peking University; Hong Kong University of Science and Technology

Hello everyone, I am Yingxia Shao from Peking University. Today I will introduce two of our works on parallel graph processing, both done in collaboration with Prof. Lei Chen of the Hong Kong University of Science and Technology. The first topic is how to efficiently detect cohesive subgraphs in parallel.
2
Outline
Background
Preliminaries
PeTa: parallel and efficient truss detection algorithm
  Basic framework
  Triangle complete subgraph
  Subgraph-oriented model
  Optimization techniques
Evaluation
Conclusion

This is the outline of our first topic. I will first introduce the background of the problem and then present the details of our solution, PeTa. Finally, I will show some representative experimental results.
3
Background: Cohesive Subgraph
A cohesive subgraph
identifies cohesive subgroups within social networks;
helps social network analysts focus on areas of the network that are likely to be fruitful.
E.g., clique, 𝑛-clique, 𝑛-clan, 𝑛-club, 𝑘-plex, 𝑘-core, etc.
Image source: Large Scale Cohesive Subgraph Discovery for Social Network Visual Analysis, VLDB’13

The cohesive subgraph is a primary vehicle for network analysis. It identifies cohesive subgroups within social networks and helps social network analysts focus on areas of the network that are likely to be fruitful. Many different definitions of a cohesive subgraph have been proposed, for example clique, n-clique, and k-core.
4
Background: 𝐾-Truss
A 𝑘-truss is a cohesive subgraph in which the support of each edge is at least 𝑘-2.
The support of an edge is the number of triangles that contain the edge.
A maximal 𝑘-truss is a 𝑘-truss that is not contained in any other 𝑘-truss.
Figure: the subgraph with thick black edges is a 4-truss.
Problem Statement: Given a graph 𝐺=(𝑉,𝐸) and a threshold 𝑘, find the maximal 𝑘-truss in 𝐺.

Recently, Cohen introduced a new cohesive subgraph called the k-truss. In a k-truss, each edge has a support of at least k-2. The support of an edge is defined as the number of triangles that contain the edge. This definition is motivated by a natural observation about social cohesion. A maximal k-truss is one that no other k-truss includes. In the figure on the right, the subgraph with thick black edges is a maximal 4-truss, which means each edge is contained in at least 2 triangles. For example, the edge di is involved in triangles dki and dmi. In this paper, we focus on improving the efficiency and scalability of detecting the maximal k-truss in a graph G.
5
Preliminaries: Fundamental operation
The operation computes the support of an edge, i.e., it counts the triangles around an edge (𝑢,𝑣).
Two solutions:
Classic solution [Cohen ’08, VLDB ’13]: sort the neighbors of each vertex in ascending order, then count triangles in 𝑂(𝑑(𝑢)+𝑑(𝑣)) time.
Index-based solution [Wang ’12]: when processing an edge (𝑢,𝑣), enumerate only the neighbors 𝑤 of the endpoint with the smaller degree (say 𝑢), and test the existence of the third edge (𝑤,𝑣) with the help of a hash table. Time complexity: 𝑂(min{𝑑(𝑢),𝑑(𝑣)}).

Before introducing our solution, PeTa, let’s first review the fundamental operation in serial k-truss detection. This operation computes the support of an edge, which means it counts the number of triangles around a single edge. There are two popular solutions. The first is the classic solution, which sorts the neighbors of each vertex in ascending order and then counts the triangles in time linear in the two degrees. The second is the index-based solution, which enumerates only the neighbors of the endpoint with the smaller degree and tests the existence of the third edge with the help of a hash table. Its time complexity is better because it depends only on the smaller degree.
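
The two ways of computing edge support described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' code: the function names (`build`, `support_sorted`, `support_hashed`) and the dictionary-based adjacency layout are assumptions made for the example, and the graph is assumed to be simple and undirected.

```python
def build(edges):
    """Build both adjacency layouts from an undirected edge list (hypothetical helper)."""
    adj_set = {}
    for u, v in edges:
        adj_set.setdefault(u, set()).add(v)
        adj_set.setdefault(v, set()).add(u)
    adj_sorted = {x: sorted(ns) for x, ns in adj_set.items()}
    return adj_sorted, adj_set


def support_sorted(adj_sorted, u, v):
    """Classic approach: merge two sorted neighbor lists in O(d(u) + d(v))."""
    a, b = adj_sorted[u], adj_sorted[v]
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1  # common neighbor closes a triangle with (u, v)
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def support_hashed(adj_set, u, v):
    """Index-based approach: scan the smaller neighborhood and probe the
    larger one through a hash set, O(min(d(u), d(v))) expected time."""
    if len(adj_set[u]) > len(adj_set[v]):
        u, v = v, u
    return sum(1 for w in adj_set[u] if w in adj_set[v])
```

Both functions return the same number; they differ only in how much of the two neighborhoods they touch, which matters for edges that join a low-degree vertex to a hub.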
6
Preliminaries: A straightforward detection framework
The framework was introduced by J. Cohen.
1. Enumerate triangles.
2. For each edge, record the number of triangles containing that edge.
3. Keep only edges with support of at least 𝑘−2.
4. If step 3 dropped any edges, return to step 1.
5. The remaining graph is the maximal 𝑘-truss.
MapReduce solution [J. Cohen ’09]: two MapReduce jobs, one for steps 1-2 and one for step 3.
Pregel-based solution [L. Quick ’12]: three supersteps, two of which are for steps 1-2.
Both use the classic solution of the fundamental operation.
Inefficiency: high communication cost; large number of iterations.

Here is a straightforward detection framework for parallel solutions. The framework was introduced by Cohen and consists of five steps. First, it enumerates the triangles. Second, each edge records the number of triangles containing that edge. Third, it deletes every edge whose support is smaller than k-2. Fourth, if the graph was modified by the previous three steps, it repeats them; otherwise, the remaining graph is the maximal k-truss. Two kinds of parallel solutions have been built on this framework: a MapReduce solution and a Pregel-based solution. Because of the restrictions of the framework, both suffer from high communication cost and a large number of iterations. In addition, only the classic approach to the fundamental operation can be used, because of the limitations of the parallel processing tools.
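
A serial simulation of the five steps may help make the iteration count concrete. The sketch below is an assumption-laden illustration (single machine, simple undirected graph, hypothetical function name `k_truss_naive`), not the MapReduce or Pregel implementation; every round recounts all triangles from scratch, which is exactly the cost the framework is criticized for.

```python
def k_truss_naive(edges, k):
    """Repeat: count triangles per edge, drop edges with support < k - 2,
    stop when a round removes nothing. Returns (edge set, rounds used)."""
    current = {frozenset(e) for e in edges}
    rounds = 0
    while True:
        rounds += 1
        # Rebuild adjacency for the surviving edges.
        adj = {}
        for e in current:
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # Steps 1-2: support of an edge = number of common neighbors.
        support = {}
        for e in current:
            u, v = tuple(e)
            support[e] = len(adj[u] & adj[v])
        # Step 3: keep only edges whose support is at least k - 2.
        kept = {e for e in current if support[e] >= k - 2}
        # Steps 4-5: stop when nothing was dropped.
        if kept == current:
            return kept, rounds
        current = kept
```

On a 4-clique with a pendant edge and a separate triangle attached, `k_truss_naive(edges, 4)` needs two rounds: one that removes the weak edges and one that confirms the remaining clique is stable.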
7
Contributions
We propose a parallel and efficient truss detection algorithm.
We introduce a subgraph-oriented programming model to efficiently implement the algorithm in popular graph computation systems.
We identify the edge-support law in real-world graphs.
8
Basic framework
PeTa: Parallel and efficient Truss detection algorithm
New detection framework behind PeTa:
1. Each worker constructs a specially designed subgraph.
2. Workers simultaneously detect their local 𝑘-trusses.
3. Workers communicate updates only when unavoidable.
4. Go to step 2 until all local 𝑘-trusses are stable.
Figure: New detection framework behind PeTa. Panels: (a) original graph, (b) special subgraphs, (c) local 𝑘-trusses.

So we introduce a new parallel and efficient truss detection framework, called PeTa. PeTa consists of four steps. After loading the graph into memory, each worker first constructs a specially designed subgraph. Second, the workers detect their local k-trusses in parallel and communicate updates when necessary. When all the local k-trusses are stable, the final global maximal k-truss is simply the union of all the local k-trusses. There are two main challenges. First, what kind of special subgraph helps to detect the local k-truss? Second, how can the local k-truss be detected efficiently on that special subgraph?
9
PeTa: Triangle Complete Subgraph
Triangle Complete Subgraph (TC-subgraph): for internal and cross edges, the TC-subgraph maintains all their triangles locally.
Property: the TC-subgraph preserves sufficient knowledge.
Theorem 1 and Theorem 2 prove the correctness of the new framework in PeTa with the TC-subgraph.

To solve the first challenge, PeTa uses a special subgraph called the Triangle Complete Subgraph. The TC-subgraph maintains locally all the triangles related to internal and cross edges. An internal edge is one whose end-points both belong to this partition, and a cross edge has one end-point in another partition. The figure on the right shows an example of a TC-subgraph. This subgraph is constructed from partition 2, which is the left part of the figure. The subgraph records the triangles dkm and dmi for the cross edge dm by storing the external edges km and mi locally. In the paper, we prove that the TC-subgraph preserves sufficient knowledge to guarantee the correctness of the new framework in PeTa.
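
One way to read the TC-subgraph definition is sketched below. It is an interpretation under stated assumptions, not the construction from the paper: `part` is an assumed vertex-to-partition map, `tc_subgraph` is a hypothetical name, and an external edge is taken to be an edge with both end-points outside the partition that closes a triangle with a local vertex.

```python
def tc_subgraph(edges, part, p):
    """Classify the edges that partition p would store under the reading above.
    Returns three sets of frozenset edges: (internal, cross, external)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    internal, cross, external = set(), set(), set()
    for u, v in edges:
        local_ends = (part[u] == p) + (part[v] == p)
        if local_ends == 2:
            internal.add(frozenset((u, v)))
        elif local_ends == 1:
            cross.add(frozenset((u, v)))
        elif any(part[w] == p for w in adj[u] & adj[v]):
            # Both end-points are remote, but a local vertex w forms the
            # triangle (w, u, v), so the cross edges (w, u) and (w, v)
            # need this edge to see all of their triangles locally.
            external.add(frozenset((u, v)))
    return internal, cross, external
```

With a vertex d in partition 2 and its neighbors k, m, i in partition 1, the edges km and mi come back as external for partition 2, mirroring the example in the talk, while a remote edge that shares no local common neighbor is not stored at all.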
10
PeTa: Subgraph-Oriented Model
The subgraph-oriented model allows the local subgraph to be processed flexibly through properly designed APIs.
In PeTa, we can use the index-based approach to detect the local k-truss.

 | Vertex-centric Model | Subgraph-oriented Model
Accessible data | Vertex and one-hop neighbors | Entire local subgraph
Access pattern | Sequence | Sequence/random
Local updates | Require extra supersteps | By-pass
User defined function | Simple | Fruitful
Expressivity | Good | Better

*Refer to the paper for API design.

To solve the second challenge, which is how to efficiently detect the local k-truss, we introduce a subgraph-oriented model. In the subgraph-oriented model, the whole local partition can be accessed. The table summarizes the comparison between the vertex-centric model and the subgraph-oriented model. In the vertex-centric model, the UDF can only access the vertex data and its one-hop neighbors, vertices are accessed in sequence, and local updates require extra supersteps or iterations. In the subgraph-oriented model, the entire local subgraph can be accessed, both sequential and random access patterns are supported, and local updates can directly modify the local subgraph. With the help of the subgraph-oriented model, we can use the index-based approach to detect the local k-truss.
11
Local subgraph algorithm for PeTa
The algorithm contains two phases.
Initialization phase:
  Constructs the TC-subgraph via a triangle counting routine.
  Requires two supersteps.
Detection phase:
  Applies the index-based solution to compute the support of an edge.
  First detection iteration: scan over internal and cross edges.
  Successive detection iterations: modify the local k-truss based on the removal of external edges.
  Seamless detection!

Finally, we design a local subgraph algorithm for PeTa based on the subgraph-oriented model and the TC-subgraph. The algorithm contains two phases: an initialization phase and a detection phase. In the initialization phase, it constructs the TC-subgraph via a triangle counting approach. In the detection phase, it applies the index-based solution to compute the support of an edge and detects the local k-truss iteratively. During the first detection iteration, it scans over all the internal and cross edges, eliminates the invalid edges, and notifies the other affected subgraphs. In the successive detection iterations, the algorithm does not need to rescan all the local edges; it only modifies the local k-truss based on the removal of external edges. We call this pattern seamless detection.
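
The idea of updating only the edges affected by a removal, instead of rescanning everything, can be illustrated on a single machine with a queue-based peeling loop. This is a generic serial sketch under assumptions (hypothetical name `k_truss_peel`, simple undirected graph, no partitions or messages), not PeTa's local algorithm itself.

```python
from collections import deque


def k_truss_peel(edges, k):
    """Scan all edges once, then propagate removals through a work queue."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(a, b):
        return frozenset((a, b))

    # Initial scan: support via hash-set intersection (index-based style).
    support = {key(u, v): len(adj[u] & adj[v]) for u, v in edges}
    queue = deque(e for e, s in support.items() if s < k - 2)
    queued = set(queue)
    while queue:
        e = queue.popleft()
        u, v = tuple(e)
        # Only the triangles through the removed edge are touched.
        for w in adj[u] & adj[v]:
            for f in (key(u, w), key(v, w)):
                support[f] -= 1
                if support[f] < k - 2 and f not in queued:
                    queued.add(f)
                    queue.append(f)
        adj[u].discard(v)
        adj[v].discard(u)
        del support[e]
    return set(support)
```

After the first scan, the loop never recomputes a support from scratch; each removal costs work proportional to the triangles that contained the removed edge.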
12
Efficiency analysis of PeTa
Computation complexity: the same as that of the best-known serial algorithms, 𝑂(𝑚^1.5).
Communication complexity: the worst case is bounded by 3|Δ|.
Number of iterations: minimal for a given graph partition.
Space complexity: the worst case is bounded by 2𝑚+3|Δ|.
Drawback: the worst-case space cost is achievable in theory (e.g., a clique), so it may be infeasible for large-scale graphs.

A detailed analysis of the framework shows the following advantages. The computation complexity is the same as that of the best-known serial algorithms. The communication cost is bounded by three times the number of triangles in the graph. The number of iterations is minimized when the graph partition is fixed. The worst-case space complexity is close to three times the number of triangles. Unfortunately, the worst case is achievable in theory, for example when the data is a clique and each single vertex is a partition.
13
Optimizations - I: Edge replicating factor (𝜌)
The edge replicating factor 𝜌 is the average number of times an edge is replicated.
A small 𝜌 leads to low space cost, computation cost, and communication cost.
A small 𝜌 implies a small number of iterations.
𝛾 is the edge cut ratio and stands for cross edge replication.
The third term is external edge replication, where 𝜃(𝑒) is the support of an edge.

So we discuss optimization techniques that make PeTa more practical. First, we define the edge replicating factor. It is the average number of times an edge is replicated. A small factor leads to low space, computation, and communication costs. Moreover, a small factor also indicates a small number of iterations. The edge replicating factor can be formalized as an equation with three terms. The first term is the original edges. The second term, gamma, stands for the cross edge replication. The third term is the external edge replication. To achieve a small factor, we need to reduce gamma and study the properties of the graph.
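
For intuition, the three contributions named above can be measured empirically on a small partitioned graph. The sketch below is an assumption, not the equation from the slide: it counts stored edge copies directly, treating a cross edge as stored by both partitions and an external edge as stored by every other partition that holds a common neighbor of its end-points. The name `replication_breakdown` and the `part` map are hypothetical.

```python
def replication_breakdown(edges, part):
    """Return (rho, gamma, external_term) with rho = 1 + gamma + external_term,
    measured by counting edge copies under the assumptions stated above."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = len(edges)
    cut = 0
    external_copies = 0
    for u, v in edges:
        if part[u] != part[v]:
            cut += 1  # a cross edge is held by both partitions: one extra copy
        # Partitions other than the end-points' own that hold a common neighbor
        # each keep one external copy of this edge.
        remote = {part[w] for w in adj[u] & adj[v]} - {part[u], part[v]}
        external_copies += len(remote)
    gamma = cut / m
    external_term = external_copies / m
    return 1 + gamma + external_term, gamma, external_term
```

An edge with low support has few common neighbors and therefore few partitions that could need an external copy, which is why the support distribution of the graph matters for the third term.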
14
Optimizations - II: Edge-Support Law in real-world graphs
The frequency distribution of the initial support of edges follows a power law.

By studying the distribution of the initial support of edges, we identified the edge-support law: in real-world graphs, the frequency distribution of the initial supports of edges follows a power law. Here we list three examples. We can easily see that the majority of initial supports in real-world graphs are small and only a few are large. The power-law distribution indicates that the expected support is relatively small compared to the size of the graph. This guarantees that the edge replicating factor won’t be large.
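
The distribution discussed here can be inspected on any edge list with a few lines of Python. This is a minimal sketch with a hypothetical function name (`support_histogram`); it only tabulates the initial supports and their mean, and does not fit or test a power law.

```python
from collections import Counter


def support_histogram(edges):
    """Return (histogram of initial edge supports, mean support)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Initial support of an edge = number of common neighbors of its end-points.
    supports = [len(adj[u] & adj[v]) for u, v in edges]
    return Counter(supports), sum(supports) / len(supports)
```

Plotting the histogram on log-log axes is the usual way to eyeball whether the counts fall off roughly along a straight line.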
15
Optimizations - III: Edge-balanced partition strategy
Improves the performance of the algorithm further.
Use METIS to generate a “good” partition.
Since METIS is unable to balance the core edges directly, we assign each vertex’s degree as its weight and balance the degrees as an indicator of core edge balance.

Graph | E[θ(e)] | ρ_est | ρ_rand | ρ_metis
livejournal | 20.00 | 20.74 | 8.99 | 1.77
us-patent | 2.36 | 3.25 | 3.13 | 1.19
wikitalk | 5.93 | 7.53 | 4.52 | 3.31
dbpedia | 7.61 | 9.11 | 6.77 | 2.10
Graphs are partitioned into 32 parts.

Furthermore, we discuss the relationship between the partition scheme and the edge replicating factor. We introduce an edge-balanced partition strategy to improve the performance. Currently we use METIS to generate a good partition. The table shows the replicating factor under different partition strategies. We found that even when a random partition is used, the space cost is acceptable; this is a benefit of the edge-support law. When a METIS-based partition is used, the space cost improves further.
16
Evaluation
All experiments are conducted on a cluster with 23 physical nodes.
Baselines: Cohen-MR [J. Cohen ’09], Orig-LQ [L. Quick ’12]

Graph | |V| | |E|
wikitalk | 2.4M | 4.7M
us-patent | 3.8M | 16.5M
livejournal | 4.8M | 42.9M
dbpedia | 17.2M | 117.4M

At last, we show some representative results of our experiments. The experiments are conducted on a cluster …
17
Evaluation: Efficiency
Figure panels: livejournal, us-patent.
On different datasets, PeTa achieves a 5x to 6x speedup compared with the original Pregel-based solutions (i.e., orig-LQ and impr-LQ).
Cohen-MR is at least 10x slower than the best solution, so it is omitted from the figures for clarity.

This experiment shows the overall performance comparison. It clearly reveals that “on different datasets, PeTa ….”
18
Evaluation: Scalability
Figure panels: 10-truss in dbpedia, 40-truss in dbpedia.
The performance of PeTa improves gracefully on both random and METIS-based partition schemes.

These two figures illustrate the scalability of PeTa with respect to the number of workers.
19
Conclusions
We designed an efficient parallel 𝑘-truss detection algorithm, named PeTa, and thoroughly proved its advantages.
The subgraph-oriented model has the potential to improve the performance of other complex graph analysis tasks.
In the future, we will solve other truss-related problems under this framework.

Let’s summarize our work on efficient cohesive subgraph detection.
20
Q&A
21
Backup Expr. – the number of iterations
The number of iterations for k-truss detection on dbpedia

K-truss | Orig-LQ | Cohen-MR | PeTa (Random) | PeTa (METIS)
5-truss | 2212(503) | 1006(503) | 21 | 9
10-truss | 272(68) | 136(68) | 23 | 14
40-truss | 112(28) | 56(28) | 6