
1 Efficient Graph Processing with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
+Institute of Parallel and Distributed Systems, *Department of Computer Science, Shanghai Jiao Tong University
HPDC 2014

2 Big Data Everywhere: 100 hrs of video uploaded every minute, 1.11 billion users, 6 billion photos, 400 million tweets per day. How do we understand and use Big Data?

3 Big Data → Big Learning: machine learning and data mining (e.g., NLP) over the same sources: 100 hrs of video every minute, 1.11 billion users, 6 billion photos, 400 million tweets per day.

4 It’s about the graphs...

5 Example: PageRank, a centrality analysis algorithm to measure the relative rank of each element of a linked set.
Characteristics:
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation
[Figure: a small example graph of linked vertices]
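The rank update implied by the compute() pseudocode on the next slide is the standard damped PageRank rule; written out (notation added here for clarity):

```latex
% PageRank update with damping factor 0.85, as in the next slide's code:
PR(v) \;=\; 0.15 \,+\, 0.85 \sum_{u \,\in\, \mathrm{in}(v)} \frac{PR(u)}{\mathrm{deg}_{\mathrm{out}}(u)}
```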

6 Existing Graph-parallel Systems: the "think as a vertex" philosophy
1. aggregate values of neighbors
2. update its own value
3. activate neighbors

    compute(v):  // PageRank
        double sum = 0
        double value, last = v.get()
        foreach (n in v.in_nbrs)
            sum += n.value / n.nedges
        value = 0.15 + 0.85 * sum
        v.set(value)
        activate(v.out_nbrs)
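To make the pseudocode above concrete, here is a minimal, self-contained Java sketch of the same vertex-centric step; the Vertex and Scheduler types are hypothetical stand-ins, not the actual API of any of these systems:

```java
import java.util.List;

// Hypothetical vertex-centric API, for illustration only.
interface Vertex {
    double getValue();
    void setValue(double v);
    List<Vertex> inNbrs();   // neighbors linking to this vertex
    List<Vertex> outNbrs();  // neighbors this vertex links to
    int numOutEdges();       // out-degree, used to split the rank
}

interface Scheduler {
    void activate(List<Vertex> vs);  // schedule vertices for the next round
}

class PageRank {
    // One "think as a vertex" step: (1) aggregate, (2) update, (3) activate.
    void compute(Vertex v, Scheduler sched) {
        double sum = 0;
        for (Vertex n : v.inNbrs())
            sum += n.getValue() / n.numOutEdges();  // (1) aggregate neighbors
        v.setValue(0.15 + 0.85 * sum);              // (2) update own value
        sched.activate(v.outNbrs());                // (3) activate neighbors
    }
}
```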

7 Existing Graph-parallel Systems: the "think as a vertex" philosophy
1. aggregate values of neighbors
2. update its own value
3. activate neighbors
Execution Engine
□ sync: BSP-like model
□ async: distributed scheduling queues
Communication
□ message passing: push values
□ distributed shared memory: sync & pull values
[Figure: comp./comm. phases; push (message passing) vs. sync & pull (DSM) across the sync barrier]
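A sketch of the synchronous (BSP-like) engine described above, reusing the hypothetical Vertex/Scheduler/PageRank types from the previous sketch; the Worker interface is likewise an assumption, not a real API:

```java
// Hypothetical per-machine worker state.
interface Worker {
    boolean anyActiveVertices();       // anywhere in the cluster
    Iterable<Vertex> activeVertices(); // this worker's active vertices
    Scheduler scheduler();
    void flushMessages();              // push updated values to remote nbrs
    void globalBarrier();              // wait for all workers
    void deliverMessages();            // make received values visible
}

class SyncEngine {
    // BSP-like loop: values pushed in superstep i become visible in i+1.
    void run(Worker w, PageRank program) {
        while (w.anyActiveVertices()) {
            for (Vertex v : w.activeVertices())
                program.compute(v, w.scheduler());
            w.flushMessages();
            w.globalBarrier();    // sync barrier between supersteps
            w.deliverMessages();
        }
    }
}
```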

8 Issues of Existing Systems
Pregel [SIGMOD'10] → Sync engine → Edge-cut + Message Passing
Issues: w/o dynamic computation; high contention
[Figure: edge-cut example, messages from master to replica keep replicas alive; ×1 message per replica]

9 Issues of Existing Systems
Pregel [SIGMOD'10] → Sync engine → Edge-cut + Message Passing
Issues: w/o dynamic computation; high contention
GraphLab [VLDB'12] → Async engine → Edge-cut + DSM (replicas)
Issues: hard to program; duplicated edges; heavy communication cost
[Figure: edge-cut with replicas; duplicated edges (dup) and ×2 messages per replica]

10 Issues of Existing Systems
Pregel [SIGMOD'10] → Sync engine → Edge-cut + Message Passing
Issues: w/o dynamic computation; high contention
GraphLab [VLDB'12] → Async engine → Edge-cut + DSM (replicas)
Issues: hard to program; duplicated edges; heavy communication cost
PowerGraph [OSDI'12] → (A)Sync engine → Vertex-cut + GAS (replicas)
Issues: high contention; heavy communication cost
[Figure: vertex-cut example; one vertex split into ×5 replicas]

11 Contributions
Distributed Immutable View
□ Easy to program/debug
□ Supports dynamic computation
□ Minimized communication cost (×1 message per replica)
□ Immunity to contention (in both computation & communication)
Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improved parallelism and locality

12 Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation

13 General Idea
Observation: for most graph algorithms, a vertex only aggregates neighbors' data in one direction (e.g., along in-edges) and activates neighbors in the other (e.g., along out-edges); e.g., PageRank, SSSP, Community Detection, …
Approach: local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages

14 Graph Organization: partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Create edges in only one direction (e.g., in-edges) → avoids duplicated edges
□ Create read-only replicas for edges spanning machines
[Figure: masters and read-only replicas across machines M1, M2, M3]
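A sketch of the ingress rule this implies, under one set of assumptions (hash-based placement, in-edges only); all names here are hypothetical, not Cyclops' actual interfaces:

```java
import java.util.Map;

// Hypothetical per-machine storage.
interface Machine {
    void addInEdge(long dst, long src);              // local in-edge dst <- src
    void addReadOnlyReplica(long vid, int masterAt); // read-only copy of src
}

class Ingress {
    final int numMachines;
    Ingress(int n) { numMachines = n; }

    // Randomized (hash-based) edge-cut placement.
    int owner(long vid) { return (int) Math.floorMod(vid, (long) numMachines); }

    // Each edge (src -> dst) is stored exactly once, as an in-edge on the
    // machine owning dst; if src is remote, a read-only replica of src is
    // created there, so computation stays local and no edge is duplicated.
    void placeEdge(long src, long dst, Map<Integer, Machine> machines) {
        Machine m = machines.get(owner(dst));
        m.addInEdge(dst, src);
        if (owner(src) != owner(dst))
            m.addReadOnlyReplica(src, owner(src));
    }
}
```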

15 Vertex Computation: local aggregation/update
□ Supports dynamic computation → one-way local semantics
□ Immutable view: read-only access to neighbors → eliminates contention on vertices
[Figure: each machine computes on its local sub-graph; replicas are read-only]

16 Communication: sync. & distributed activation
□ Merge update & activate messages
1. Update the value of replicas
2. Invite replicas to activate their neighbors
Message format: v|m|s (value, message, active status), e.g. 8|4|0
[Figure: master state (rlist: W1, l-act: 1, value: 8, msg: 4) synced to replica state (l-act: 3, value: 6, msg: 3) across M1, M2, M3]
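A sketch of the merged message and how a replica might apply it; the field meanings (v = value, m = algorithm message, s = active status) are my reading of the slide, and all types are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical activation hook on the receiving machine.
interface Activator { void activateByIds(List<Long> vids); }

// Merged update+activate message, master -> replicas: v | m | s.
class SyncMsg {
    long vid;        // which replica to update
    double value;    // v: master's new value
    double msg;      // m: algorithm-specific payload
    boolean active;  // s: should the replica activate its local neighbors?
}

class Replica {
    double value;
    final List<Long> localOutNbrs = new ArrayList<>();

    // One message both refreshes the read-only view and, if s is set,
    // activates the replica's local out-neighbors for the next iteration.
    void apply(SyncMsg m, Activator act) {
        value = m.value;
        if (m.active) act.activateByIds(localOutNbrs);
    }
}
```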

17 Communication: distributed activation
□ Unidirectional message passing
→ A replica will never be activated
→ Messages always flow master → replica
→ Contention immunity

18 Change of Execution Flow
Original execution flow (e.g., Pregel): receiving → parsing → computation → sending
Problems: high overhead (message parsing) and high contention (shared message buffers)
[Figure: threads on M1 buffer and parse messages from M2/M3 before computing on vertices]

19 Change of Execution Flow
Execution flow on Distributed Immutable View: receiving → computation → sending
→ Receiving updates local masters/replicas directly (lock-free); the parsing stage disappears
Benefits: low overhead, no contention
[Figure: threads on M1 apply messages from M2/M3 straight to masters and replicas, with lock-free out-queues]
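Since each replica is written only by its single master's messages, the receiving thread can update the local replica table in place; a sketch reusing the hypothetical SyncMsg/Replica/Activator types from the communication sketch above:

```java
import java.util.Map;

// Hypothetical stream of incoming messages for this receiver thread.
interface MessageSource { SyncMsg next(); /* null when the batch ends */ }

class Receiver {
    // No shared buffer and no parsing stage: each message is applied
    // directly to its replica. A replica has exactly one writer (its
    // master), so these in-place updates need no locks.
    void receiveLoop(MessageSource in, Map<Long, Replica> replicas, Activator act) {
        for (SyncMsg m = in.next(); m != null; m = in.next())
            replicas.get(m.vid).apply(m, act);
    }
}
```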

20 Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation

21 Multicore Support: two challenges
1. Two-level hierarchical organization
→ Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
→ High contention when buffering and parsing messages
→ Poor locality in message parsing

22 Hierarchical Model
Design principle:
□ Three levels: iteration → worker → thread
□ Only the last-level participants (threads) perform actual tasks
□ Parents (i.e., higher-level participants) just wait until all children finish their tasks
[Figure: level-0 iteration loop with a global barrier; level-1 workers with local barriers; level-2 threads run the tasks]
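One way to realize the three-level model in Java: a java.util.concurrent.CyclicBarrier per worker for the thread level, plus a cluster-wide barrier for the iteration level (here an assumed GlobalBarrier interface, e.g. whatever the BSP framework provides):

```java
import java.util.concurrent.CyclicBarrier;

// Hypothetical cluster-wide barrier (e.g., the BSP framework's sync).
interface GlobalBarrier { void await() throws Exception; }

class HierarchicalWorker {
    private final CyclicBarrier localBarrier;  // across this worker's threads
    private final GlobalBarrier globalBarrier; // across all workers

    HierarchicalWorker(int nThreads, GlobalBarrier g) {
        localBarrier = new CyclicBarrier(nThreads);
        globalBarrier = g;
    }

    // Body run by every last-level thread. Only threads do real work;
    // the worker and iteration levels just wait for their children.
    void threadLoop(Runnable task, int iterations) throws Exception {
        for (int i = 0; i < iterations; i++) {
            task.run();                    // this thread's share of the work
            if (localBarrier.await() == 0) // last thread to arrive locally...
                globalBarrier.await();     // ...joins the global barrier
            localBarrier.await();          // release all threads together
        }
    }
}
```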

23 Parallelism Improvement
The original BSP-like model is hard to parallelize.
[Figure: receiving → in-queues → parsing → computation → out-queues → sending on each machine]

24 Parallelism Improvement
The original BSP-like model is hard to parallelize: high contention on shared in-queues and poor locality in message parsing, even with private out-queues.
[Figure: as before, with per-thread private out-queues; the shared in-queues remain a bottleneck]

25 Parallelism Improvement
Distributed immutable view opens an opportunity: receiving threads can apply updates directly to masters and replicas.
[Figure: receiving → computation → sending, with no parsing stage]

26 Parallelism Improvement
Distributed immutable view opens an opportunity: with private out-queues the receiving path becomes lock-free, but locality is still poor.
[Figure: lock-free private out-queues; updates scattered across replicas]

27 Parallelism Improvement
Distributed immutable view opens an opportunity: separating the queues also removes interference between receiving and computation threads.
[Figure: lock-free queues; no interference between threads]

28 Parallelism Improvement
Distributed immutable view opens an opportunity: sorting messages by destination yields good locality on the lock-free receiving path.
[Figure: sorted, lock-free private out-queues → good locality]
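A sketch of the locality trick: messages received in a batch are sorted by destination vertex id before being applied, so replica updates walk the local table near-sequentially (again reusing the hypothetical SyncMsg/Replica/Activator types from the earlier sketches):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class SortedReceiver {
    // Private (single-producer/single-consumer) queues make enqueueing
    // lock-free; sorting each drained batch by vertex id turns the replica
    // updates into a near-sequential scan, improving cache locality.
    void drainBatch(List<SyncMsg> batch, Replica[] table,
                    Map<Long, Integer> indexOf, Activator act) {
        batch.sort(Comparator.comparingLong(m -> m.vid));
        for (SyncMsg m : batch)
            table[indexOf.get(m.vid)].apply(m, act);
    }
}
```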

29 Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Implementation & Experiment

30 Implementation: Cyclops(MT)
□ Based on Apache Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
→ Compatible I/O interface
→ Adds an additional phase to build replicas
□ Fault tolerance
→ Incremental checkpointing
→ Replication-based FT [DSN'14]

31 Experiment Settings
Platform
□ 6 machines, each with a 12-core AMD Opteron, 64GB RAM and a 1GigE NIC
Graph Algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload
□ 7 real-world datasets from SNAP 1
□ 1 synthetic dataset from GraphLab 2
1 http://snap.stanford.edu/data/
2 http://graphlab.org

32 Overall Performance Improvement
[Figure: speedup of Cyclops over Hama (push-mode) on PageRank, ALS, CD and SSSP with 48 workers on 6 machines; speedups range from 2.06X to 8.69X]

33 Performance Scalability
[Figure: scalability with the number of workers and threads on the Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP and RoadCA datasets]

34 Performance Breakdown
[Figure: performance breakdown of Hama, Cyclops and CyclopsMT on PageRank, ALS, CD and SSSP]

35 Comparison with PowerGraph
Preliminary results: a Cyclops-like engine built on the GraphLab 1 platform (C++ & Boost RPC lib.), evaluated on synthetic graphs 2.
1 http://graphlab.org
2 synthetic 10-million-vertex regular (even-edge) and power-law (α = 2.0) graphs

36 Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic computation nature (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source code: http://ipads.se.sjtu.edu.cn/projects/cyclops

37 Questions? Thanks!
Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems

38 What's Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
Power-law: "most vertices have relatively few neighbors while a few have many neighbors"
[Figure: preliminary results comparing PowerLyra (PL), PowerGraph (PG) and Cyclops]
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

39 Generality
Algorithms that aggregate/activate over all neighbors
□ e.g., Community Detection (CD)
□ Transform the graph to an undirected one and duplicate the edges
[Figure: undirected sub-graphs with duplicated edges across M1, M2, M3]

40 Generality
Algorithms that aggregate/activate over all neighbors
□ e.g., Community Detection (CD)
□ Transform the graph to an undirected one and duplicate the edges
□ Still aggregate in one direction (e.g., in-edges) and activate in the other (e.g., out-edges)
□ Preserves all benefits of Cyclops
→ ×1 message per replica & contention immunity & good locality

41 Generality
Differences between Cyclops and GraphLab:
1. How the local sub-graph is constructed
2. How neighbors are aggregated/activated
[Figure: Cyclops vs. GraphLab sub-graphs for the same partition]

42 Improvement of CyclopsMT
[Figure: improvement of CyclopsMT over Cyclops under configurations M×W×T/R, where M = #machines, W = #workers, T = #threads, R = #receivers]

43 Communication Efficiency
message: (id, data)
Hama: Hadoop RPC lib (Java), send + buffer + parse (contention)
PowerGraph: Boost RPC lib (C++), send + update (contention)
Cyclops: Hadoop RPC lib (Java), send + update (no contention)
[Figure: per-worker (W0-W5) communication cost with 5M, 25M and 50M messages; improvement labels include 25.6X, 16.2X and 12.6X]

44 Using Heuristic Edge-cut (i.e., Metis)
[Figure: speedup of Cyclops over Hama on PageRank, ALS, CD and SSSP with 48 workers on 6 machines; speedups range from 5.95X to 23.04X]

45 Memory Consumption
Memory behavior 1 per worker (PageRank with the Wiki dataset) 2
1 measured with jstat
2 GC: Concurrent Mark-Sweep

46 Ingress Time
[Figure: graph ingress time of Cyclops vs. Hama]

47 Selective Activation
Sync. & distributed activation (as on slide 16): merge update & activate messages
1. Update the value of replicas
2. Invite replicas to activate their neighbors
Message format: v|m|s (value, message, active status), e.g. 8|4|0
*Selective Activation (e.g., for ALS): extend the message to v|m|s|l, where l is an optional Activation_List
[Figure: master/replica state as on slide 16]
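A sketch of the extended v|m|s|l message: the optional activation list l restricts which local neighbors the replica activates, which algorithms like ALS need; the field meaning is my reading of the slide, and the types are hypothetical extensions of the earlier communication sketch:

```java
import java.util.List;

// v | m | s | l : as SyncMsg, plus an optional activation list.
class SelectiveMsg extends SyncMsg {
    List<Long> activationList; // null => activate all local out-neighbors
}

class SelectiveReplica extends Replica {
    // Same apply step, but only the listed neighbors are activated
    // when an activation list is present.
    void apply(SelectiveMsg m, Activator act) {
        value = m.value;
        if (m.active)
            act.activateByIds(m.activationList != null
                              ? m.activationList
                              : localOutNbrs);
    }
}
```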

48 Parallelism Improvement
Distributed immutable view opens an opportunity: the sorted, lock-free receiving path (good locality) can also be compared against a separate configuration of computation threads vs. communication threads.
[Figure: as on slide 28, with comp. threads vs. comm. threads in a separate configuration]

49 Cyclops vs. Existing Systems
Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph): w/o dynamic computation, high contention, hard to program, duplicated edges, heavy communication cost.
Cyclops(MT) → Distributed Immutable View: w/ dynamic computation, no contention, easy to program, no duplicated edges, low communication cost (×1 message per replica).

50 What's Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with the default partitioning by up to 17.75X, and saves up to 96% of network traffic
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html


52 Multicore Support: two challenges
1. Two-level hierarchical organization
→ Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
→ High contention when buffering and parsing messages
→ Poor locality in message parsing
→ Asymmetric degrees of parallelism for the CPU and the NIC

