Published by Ferdinand Marsh. Modified over 9 years ago.
1
Big Graph Processing on Cloud
Jeffrey Xu Yu (于旭), The Chinese University of Hong Kong
yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu
2
Big Graphs/Networks
3
Graph Systems
There are many graph systems in the literature.
4
Graph Computing on Cloud
Workload Balancing
Auto Approximation
5
Vertex-Centric Computing on BSP
Distributed vertex-centric computing
BSP (Bulk Synchronous Parallel): concurrent computing, communication, barrier synchronization
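The three BSP phases named on this slide can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the function name `bsp_min_label`, the adjacency-dict graph format, and the choice of minimum-label propagation as the vertex program are illustrative and do not come from the talk.

```python
from collections import defaultdict

def bsp_min_label(adj):
    """Simulate vertex-centric BSP on one machine (hypothetical helper).

    Every vertex ends up holding the smallest vertex id in its connected
    component. `adj` maps each vertex to a list of its neighbours.
    """
    value = {v: v for v in adj}

    # Superstep 0: every vertex sends its own id to its neighbours.
    inbox = defaultdict(list)
    for v in adj:
        for u in adj[v]:
            inbox[u].append(value[v])
    supersteps = 1

    # No messages in flight means every vertex has become inactive.
    while inbox:
        outbox = defaultdict(list)
        # Computing: each vertex that received messages runs the same program.
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < value[v]:
                value[v] = best
                # Communication: messages are buffered for the next superstep.
                for u in adj[v]:
                    outbox[u].append(best)
        # Barrier synchronization: buffered messages become visible only now.
        inbox = outbox
        supersteps += 1
    return value, supersteps
```

A vertex that receives no message in a superstep does no work, which is the active/inactive behaviour that the workload-balancing slides rely on.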
6
Workload Balancing
Computing: determined by the slowest worker, so workload balancing is needed
Communication: the volume matters, and it is driven by cross-edges
Computing + communication: balanced partitioning
7
Workload Balancing: Balanced k-way Graph Partitioning
Size-balanced partitions
The minimum possible number of cross-edges
It solves our problem if the graph is static. By static, we mean the vertices are always active during the computation.
However, for graph analytics, the vertices may toggle between active and inactive.
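The two objectives of balanced k-way partitioning can be measured directly. The following is a minimal sketch; the function name `partition_quality`, the edge-list format, and the imbalance ratio (largest partition size divided by the ideal size) are assumptions made for illustration, not definitions from the talk.

```python
from collections import Counter

def partition_quality(edges, part, k):
    """Return (cross_edges, size_imbalance) for a k-way partition.

    edges: list of (u, v) pairs; part: dict mapping vertex -> partition id
    in range(k). An imbalance of 1.0 means perfectly size-balanced.
    """
    sizes = Counter(part.values())
    # A cross-edge connects vertices placed in different partitions.
    cross = sum(1 for u, v in edges if part[u] != part[v])
    largest = max(sizes.get(p, 0) for p in range(k))
    imbalance = largest / (len(part) / k)
    return cross, imbalance
```

On a 4-cycle split into two halves, cutting it into two contiguous pairs gives 2 cross-edges, while alternating the vertices gives 4; both partitions are equally size-balanced, so only the cross-edge count separates them.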
8
Dynamic Workload Balancing
Computing: determined by the slowest worker, so workload balancing is needed
Communication: the volume matters, and it is driven by cross-edges
Dynamic workload balancing: respond to vertices’ status (active/inactive)
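Why a size-balanced static partition is not enough can be seen by counting only the active vertices per worker in each superstep. The sketch below uses breadth-first search as the workload; the helper names `bfs_frontiers` and `active_load`, and the path-graph example, are hypothetical and chosen only for illustration.

```python
def bfs_frontiers(adj, source):
    """Yield the list of active vertices (the BFS frontier) per superstep."""
    seen = {source}
    frontier = [source]
    while frontier:
        yield frontier
        nxt = []
        for v in frontier:
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt

def active_load(part, active, k):
    """Count the active vertices that each of the k workers must process."""
    load = [0] * k
    for v in active:
        load[part[v]] += 1
    return load

# A path 0-1-2-3-4-5 split into two equally sized halves.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
loads = [active_load(part, f, 2) for f in bfs_frontiers(adj, 0)]
```

In this toy run every superstep leaves one worker idle even though both workers hold three vertices, so the superstep time is set entirely by the other worker.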
9
Any General Approach?
We do not know which graph algorithms will be used.
We do not know anything about the graphs themselves.
We cannot request that graphs be ‘well’ partitioned on Cloud.
We cannot assume how graphs are initially partitioned on Cloud.
The approach needs to react to workload imbalance in a timely manner, and rebalancing itself cannot take long.
10
An Example
11
Representative Graph Algorithms
PageRank, Semi-clustering, Graph Coloring, Single Source Shortest Path, Breadth First Search, Random Walk, Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets
12
Category 1: Always Active
The three algorithms: PageRank, Semi-clustering, Graph Coloring
The vertices are always active
Ideal case for static partitioning
Perfectly balanced, as expected
13
Category 2: Traversal
The three algorithms: Single Source Shortest Path, Breadth First Search, Random Walk
Significantly imbalanced
14
Category 3: Multi-Phases
The three algorithms: Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets
Somewhat balanced
15
Predictable?
For Category 1, the algorithms have a stable working window.
For Category 2, predictability cannot be ensured, but most large-scale graphs have the low-diameter property. SSSP has a reasonable hit-rate between supersteps.
For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.
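The slide does not define hit-rate. One plausible reading, assumed here purely for illustration, is the fraction of vertices active in a superstep that were already active in the previous one. Under that assumption it can be computed as follows (the function name `hit_rates` is hypothetical):

```python
def hit_rates(active_sets):
    """Hit-rate between successive supersteps (assumed definition).

    active_sets: one collection of active vertex ids per superstep.
    Returns, for each pair of successive supersteps, the share of
    currently active vertices that were also active one superstep earlier.
    """
    rates = []
    for prev, cur in zip(active_sets, active_sets[1:]):
        cur = set(cur)
        # An empty active set has nothing to mispredict.
        rates.append(len(cur & set(prev)) / len(cur) if cur else 1.0)
    return rates
```

With this reading, an always-active algorithm (Category 1) has a hit-rate of 1.0 in every superstep, which matches the "stable working window" described above.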
16
Our Approach [Shang et al. ICDE’13]
17
Some Basic Ideas
18
Compare with Random Partitioning
19
Graph Computing on Cloud
The factors: memory consumption, communication cost, CPU cost, and the number of rounds.
The classes:
MapReduce Class (MRC) by Karloff et al. in SODA’10.
Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD’13.
Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD’14.
Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB’14.
20
Auto-Approximate Graph Computing [Shang et al. VLDB’15]
Big data and bigger data: Google: 2+ EB; Twitter: hit 8 PB; Yahoo: 400 PB; Facebook: 300 PB
Big data needs to get answers fast
More data beats a cleverer algorithm ("A few useful things to know about machine learning" by P. Domingos in CACM 2012).
21
Why Auto-Approximate?
Working in a distributed environment is hard.
Designing a new algorithm is hard.
A new distributed approximate algorithm? Hard + hard.
The target is a fast answer!
But it is impossible to know the meaning of programs.
22
Auto-Approximate Graph Computing
The idea is to modify the vertex-centric programs (UDFs, user-defined functions).
Traditional computing; approximation computing
23
The Errors
[Figure labels: init value, default UDF, approx. UDF, final results, error term]
24
The Errors
The error comes from two sides:
The “bad” input: error inherited from previous iterations
Wrong calculation: error from the new approx. UDF
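The two sources of error can be written as a single identity. The notation is an assumption introduced here, not taken from the slides: $f$ is the default UDF, $\tilde f$ the approximate UDF, $x_t$ the exact state after iteration $t$, and $\tilde x_t$ the state produced by the approximate computation.

```latex
\begin{aligned}
\tilde x_{t+1} - x_{t+1}
  &= \tilde f(\tilde x_t) - f(x_t) \\
  &= \underbrace{\bigl[\tilde f(\tilde x_t) - f(\tilde x_t)\bigr]}_{\text{wrong calculation (approx.\ UDF)}}
   + \underbrace{\bigl[f(\tilde x_t) - f(x_t)\bigr]}_{\text{``bad'' input (inherited error)}}
\end{aligned}
```

The first bracket vanishes when the approximate UDF agrees with the default one on the current input; the second vanishes when no error has been carried over from earlier iterations.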
25
Approximation
No single approach can approximate all problems, as restricted by Rice’s theorem: any nontrivial property about the language recognized by a Turing machine is undecidable.
Approximation of continuous functions and of discrete functions
"The notions of continuity from mathematical analysis are relevant and interesting even for software" (Chaudhuri et al. in CACM, 2012), e.g., shortest paths, minimum spanning trees
27
An Example
Sampling as an example:
Find opportunities for sampling
Synthesize code
Correct the answer by regression
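The sampling step alone can be sketched as follows. This is not the synthesis procedure from the talk: it assumes a vertex UDF that simply sums its incoming messages and replaces it with Bernoulli sampling plus inverse-probability rescaling. The names `exact_sum_udf` and `sampled_sum_udf` are hypothetical, and the regression-based correction is not shown.

```python
import random

def exact_sum_udf(messages):
    """Default UDF (assumed for illustration): aggregate all messages."""
    return sum(messages)

def sampled_sum_udf(messages, rate, rng):
    """Approximate UDF: keep each message with probability `rate`.

    Dividing by `rate` rescales the partial sum so that the estimate is
    unbiased; rate = 1.0 reproduces the exact result.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    kept = [m for m in messages if rng.random() < rate]
    return sum(kept) / rate

# Usage: lower rates touch fewer messages at the cost of a larger error.
rng = random.Random(0)
estimate = sampled_sum_udf([1] * 10000, 0.5, rng)
```

The sampling rate is the knob behind the error-time tradeoff: a smaller rate means less work per vertex and a noisier estimate.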
28
Error-Time Tradeoff
29
The Sampling Strategies
30
Graph Algorithms
31
Real Datasets
32
PR over twitter-mp (10 iterations)
34
The Eight Graph Algorithms
35
Time/Error Prediction
36
Some Remarks
There are many reported graph systems in the literature.
Dealing with big graphs calls for reconsidering existing approaches and exploring new ones.