Efficient Graph Processing with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
Institute of Parallel and Distributed Systems+, Department of Computer Science*
Shanghai Jiao Tong University
HPDC 2014

Big Data Everywhere
100 hours of video every minute, 1.11 billion users, 6 billion photos, 400 million tweets/day
How do we understand and use Big Data?

Big Data → Big Learning
Machine learning and data mining (e.g., NLP) over the same data: 100 hours of video every minute, 1.11 billion users, 6 billion photos, 400 million tweets/day

It’s about the graphs...

Example: PageRank
A centrality analysis algorithm to measure the relative rank of each element in a linked set
Characteristics:
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation

Existing Graph-parallel Systems
"Think as a vertex" philosophy:
1. aggregate values of neighbors
2. update its own value
3. activate neighbors

compute(v)   // PageRank
  double sum = 0
  double value, last = v.get()
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges;
  value = 0.15 + 0.85 * sum;
  v.set(value);
  activate(v.out_nbrs);
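To make the slide's pseudocode concrete, here is a minimal, self-contained Java sketch of the same "think as a vertex" PageRank step. The Vertex class, compute() and the explicit activeNext list are illustrative names for this sketch only, not the Cyclops/Hama API; the 0.15/0.85 constants are the usual PageRank damping defaults.

// Illustrative sketch of the slide's "think as a vertex" PageRank; the Vertex
// class and the activeNext list are hypothetical names, not the Cyclops/Hama API.
import java.util.ArrayList;
import java.util.List;

class Vertex {
    double value = 1.0;                              // current rank
    final List<Vertex> inNbrs = new ArrayList<>();   // vertices linking to us
    final List<Vertex> outNbrs = new ArrayList<>();  // vertices we link to

    int nedges() { return outNbrs.size(); }

    // One PageRank step for this vertex: aggregate, update, activate.
    void compute(List<Vertex> activeNext) {
        double sum = 0.0;
        for (Vertex n : inNbrs) {
            sum += n.value / n.nedges();             // 1. aggregate neighbors' values
        }
        value = 0.15 + 0.85 * sum;                   // 2. update own value
        activeNext.addAll(outNbrs);                  // 3. activate out-neighbors
    }
}

public class PageRankSketch {
    public static void main(String[] args) {
        Vertex a = new Vertex(), b = new Vertex(), c = new Vertex();
        a.outNbrs.add(b); b.inNbrs.add(a);           // a -> b
        b.outNbrs.add(c); c.inNbrs.add(b);           // b -> c
        c.outNbrs.add(a); a.inNbrs.add(c);           // c -> a
        List<Vertex> active = new ArrayList<>(List.of(a, b, c));
        for (int iter = 0; iter < 10; iter++) {
            List<Vertex> next = new ArrayList<>();
            for (Vertex v : active) v.compute(next);
            active = next;
        }
        System.out.printf("ranks: %.3f %.3f %.3f%n", a.value, b.value, c.value);
    }
}

Each superstep only walks the active set, so dropping converged vertices from activeNext is what the later slides call dynamic computation.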

Existing Graph-parallel Systems
"Think as a vertex" philosophy: 1. aggregate values of neighbors; 2. update its own value; 3. activate neighbors
Execution Engine
□ sync: BSP-like model
□ async: distributed scheduler queues
Communication
□ message passing: push value
□ distributed shared memory: sync & pull value
[Figure: push-based messaging vs. pull-based DSM, separated by a sync barrier]

Issues of Existing Systems
Pregel [SIGMOD'10] → Sync engine → Edge-cut + Message Passing
→ w/o dynamic computation (converged vertices kept alive), high contention
GraphLab [VLDB'12] → Async engine → Edge-cut + DSM (replicas)
→ hard to program, duplicated edges, heavy communication cost
PowerGraph [OSDI'12] → (A)Sync engine → Vertex-cut + GAS (replicas)
→ high contention, heavy communication cost
[Figure: edge-cut with per-edge messages (Pregel), edge-cut with replicas and duplicated edges (GraphLab), and vertex-cut with multiple replicas per vertex (PowerGraph)]

Contributions
Distributed Immutable View
□ Easy to program/debug
□ Support dynamic computation
□ Minimized communication cost (x1 per replica)
□ Contention (comp. & comm.) immunity
Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improved parallelism and locality

Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation

General Idea
Observation: for most graph algorithms, a vertex only aggregates neighbors' data in one direction (aggregation/update) and activates neighbors in the other direction (activation)
□ e.g. PageRank, SSSP, Community Detection, …
Approach: local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages

Graph Organization
Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Only create edges in one direction (e.g., in-edges) → avoid duplicated edges
□ Create read-only replicas for edges spanning machines
[Figure: masters and read-only replicas across machines M1, M2, and M3]
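The following is a minimal sketch, under the slide's assumptions, of how such a local sub-graph could be built: a hash-based edge-cut assigns each vertex a master machine, every edge is stored once with its destination (in-edges only), and a read-only replica is created for any source vertex mastered elsewhere. Machine, Edge and masterOf are hypothetical names for illustration, not Cyclops code.

// Sketch of the partitioning described above: hash-based edge-cut, in-edges
// only, read-only replicas for source vertices living on other machines.
import java.util.*;

class Edge { final int src, dst; Edge(int s, int d) { src = s; dst = d; } }

class Machine {
    final int id;
    final Map<Integer, Double> masters = new HashMap<>();   // vertex -> value
    final Map<Integer, Double> replicas = new HashMap<>();  // read-only copies
    final List<Edge> inEdges = new ArrayList<>();           // one direction only
    Machine(int id) { this.id = id; }
}

public class PartitionSketch {
    static int masterOf(int vertex, int numMachines) {
        return Math.floorMod(vertex, numMachines);           // randomized edge-cut
    }

    public static void main(String[] args) {
        int numMachines = 3;
        Machine[] machines = new Machine[numMachines];
        for (int i = 0; i < numMachines; i++) machines[i] = new Machine(i);

        List<Edge> graph = List.of(new Edge(1, 2), new Edge(3, 2), new Edge(2, 4));
        for (Edge e : graph) {
            Machine m = machines[masterOf(e.dst, numMachines)]; // edge lives with its dst
            m.inEdges.add(e);                                   // in-edges only: no duplicates
            m.masters.putIfAbsent(e.dst, 0.0);
            if (masterOf(e.src, numMachines) != m.id) {
                m.replicas.putIfAbsent(e.src, 0.0);             // read-only replica of a remote src
            } else {
                m.masters.putIfAbsent(e.src, 0.0);
            }
        }
        for (Machine m : machines)
            System.out.println("M" + m.id + " masters=" + m.masters.keySet()
                               + " replicas=" + m.replicas.keySet());
    }
}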

Vertex Computation
Local aggregation/update
□ Support dynamic computation → one-way local semantics
□ Immutable view: read-only access to neighbors → eliminate contention on vertices
[Figure: vertex computation on M1, M2, and M3 with read-only access to neighbor replicas]
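A small sketch of what this buys during computation, assuming the local view built above: a superstep only touches the active masters (dynamic computation) and reads neighbor values from masters or read-only replicas, so no locking is needed while the view stays frozen. LocalGraph and neighborValue are illustrative names, and the update rule is just a toy PageRank-style formula.

// Sketch of the local computation phase over a read-only (immutable) view.
import java.util.*;

class LocalGraph {
    // Masters owned by this machine and read-only copies of remote neighbors;
    // both are only read during a superstep, so plain maps need no locks here.
    final Map<Integer, Double> masters = new HashMap<>();
    final Map<Integer, Double> replicas = new HashMap<>();
    final Map<Integer, List<Integer>> inNbrs = new HashMap<>();   // vertex -> in-neighbors

    double neighborValue(int v) {
        Double local = masters.get(v);
        return local != null ? local : replicas.getOrDefault(v, 0.0); // read-only access
    }

    // Compute new values only for vertices activated in the previous superstep.
    Map<Integer, Double> superstep(Set<Integer> active) {
        Map<Integer, Double> updated = new HashMap<>();
        for (int v : active) {
            double sum = 0.0;
            for (int n : inNbrs.getOrDefault(v, List.of())) sum += neighborValue(n);
            updated.put(v, 0.15 + 0.85 * sum);    // toy PageRank-style update
        }
        return updated;                           // applied after the superstep
    }
}

public class ImmutableViewSketch {
    public static void main(String[] args) {
        LocalGraph g = new LocalGraph();
        g.masters.put(2, 1.0);                    // local master
        g.replicas.put(1, 1.0);                   // read-only replica of a remote vertex
        g.inNbrs.put(2, List.of(1));
        System.out.println(g.superstep(Set.of(2)));
    }
}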

Communication
Sync. & distributed activation
□ Merge update & activate messages:
1. Update value of replicas
2. Invite replicas to activate neighbors
[Figure: example merged message (format v|m|s: value, msg, activation state) from a master updating and activating its replicas across M1, M2, and M3]
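As an illustration of the merged message, a single record can carry both the new value and the activation flag, roughly matching the slide's v|m|s layout. ReplicaMsg and its fields are assumptions made for this sketch, not the actual Cyclops wire format.

// Sketch of the merged message described above: one message per replica that
// both updates the replica's value and tells it whether to activate its local
// out-neighbors.
public class ReplicaMsg {
    final int vertexId;      // which replica to update
    final double value;      // new value from the master (update)
    final boolean activate;  // should the replica activate its local out-neighbors?

    ReplicaMsg(int vertexId, double value, boolean activate) {
        this.vertexId = vertexId;
        this.value = value;
        this.activate = activate;
    }

    @Override
    public String toString() {
        return vertexId + "|" + value + "|" + (activate ? 1 : 0);
    }

    public static void main(String[] args) {
        // One message both syncs the replica and carries the activation bit.
        System.out.println(new ReplicaMsg(4, 0.8, true));   // prints 4|0.8|1
    }
}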

Communication
Distributed activation
□ Unidirectional message passing
→ A replica will never be activated
→ Messages always flow master → replicas
→ Contention immunity

Change of Execution Flow
Original execution flow (e.g. Pregel): computation → sending → receiving → parsing
→ high overhead, high contention
[Figure: worker threads on M1 exchanging per-vertex messages with M2 and M3 through shared queues]

Change of Execution Flow
Execution flow on distributed immutable view: computation → sending → receiving (lock-free direct updates)
→ low overhead, no contention
[Figure: masters on M1 sending through per-machine out-queues and updating replicas on M2 and M3 directly, with no shared in-queues or message parsing]
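A sketch of that receive path, under the assumption that each replica slot has exactly one master and therefore a single writer per superstep: incoming updates are applied directly to a replica table, with no shared in-queue and no parsing step. The names here (ReceivePathSketch, apply) are illustrative only.

// Sketch of the lock-free receive path on the distributed immutable view.
import java.util.concurrent.atomic.AtomicBoolean;

public class ReceivePathSketch {
    // Dense replica tables indexed by local replica id.
    static final double[] replicaValue = new double[8];
    static final AtomicBoolean[] replicaActive = new AtomicBoolean[8];
    static { for (int i = 0; i < 8; i++) replicaActive[i] = new AtomicBoolean(false); }

    // Called by a receiver thread for each incoming (id, value, activate) triple.
    static void apply(int replicaId, double value, boolean activate) {
        replicaValue[replicaId] = value;          // sole writer per replica: no lock needed
        if (activate) replicaActive[replicaId].set(true);
    }

    public static void main(String[] args) {
        apply(4, 0.8, true);
        System.out.println(replicaValue[4] + " active=" + replicaActive[4].get());
    }
}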

Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Evaluation

Multicore Support
Two challenges:
1. Two-level hierarchical organization
→ Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
→ High contention to buffer and parse messages
→ Poor locality in message parsing

Hierarchical Model
Design principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e. higher-level participants) just wait until all children finish their tasks
[Figure: Level-0 iteration loop with a global barrier, Level-1 workers with local barriers, Level-2 threads running the tasks]
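Below is a small sketch of this hierarchy using standard Java barriers: the threads of a worker meet at a local barrier, and the last thread of each worker joins a global barrier before anyone starts the next iteration. In Cyclops the global barrier spans machines; here it is simulated inside one process purely for illustration.

// Sketch of the hierarchical synchronization: local barrier per worker,
// whose barrier action joins a global barrier shared by all workers.
import java.util.concurrent.CyclicBarrier;

public class HierarchySketch {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 2, threadsPerWorker = 3, iterations = 2;
        // Level-0/1 barrier: one slot per worker (in Cyclops this spans machines).
        CyclicBarrier globalBarrier = new CyclicBarrier(workers,
                () -> System.out.println("== global barrier: iteration done =="));

        Thread[] all = new Thread[workers * threadsPerWorker];
        for (int w = 0; w < workers; w++) {
            final int workerId = w;
            // Level-1/2 barrier: the last thread to arrive enters the global barrier.
            CyclicBarrier localBarrier = new CyclicBarrier(threadsPerWorker, () -> {
                try { globalBarrier.await(); } catch (Exception e) { throw new RuntimeException(e); }
            });
            for (int t = 0; t < threadsPerWorker; t++) {
                final int threadId = t;
                Thread th = new Thread(() -> {
                    try {
                        for (int it = 0; it < iterations; it++) {
                            System.out.println("worker " + workerId + " thread " + threadId
                                               + " runs tasks of iteration " + it);
                            localBarrier.await();        // wait for sibling threads
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                });
                all[w * threadsPerWorker + t] = th;
                th.start();
            }
        }
        for (Thread th : all) th.join();
    }
}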

Parallelism Improvement
The original BSP-like model is hard to parallelize: per-vertex messages must be buffered in shared in-queues and parsed before computation
→ high contention, poor locality
[Figure: threads on M1 contending on in-queues and private out-queues while exchanging messages with M2 and M3]

Parallelism Improvement
Distributed immutable view opens an opportunity: masters update replicas directly, so the receive path needs no shared in-queues and no message parsing
→ lock-free, no interference between threads, sorted updates with good locality
[Figure: threads on M1 writing to private out-queues and applying lock-free, sorted updates directly to masters and replicas on M2 and M3]
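A sketch of that send path, assuming a modern JDK: each computation thread fills a private out-queue without locks, and the per-machine batches are then merged and sorted by destination vertex so the receiver applies them with good locality. All names (OutQueueSketch, Update) are invented for this example.

// Sketch of per-thread private out-queues merged into sorted per-machine batches.
import java.util.*;
import java.util.concurrent.*;

public class OutQueueSketch {
    record Update(int machine, int vertexId, double value) {}

    public static void main(String[] args) throws Exception {
        int threads = 4, machines = 3;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<Update>>> futures = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            final int seed = t;
            futures.add(pool.submit(() -> {
                List<Update> privateQueue = new ArrayList<>();   // thread-private: lock-free
                Random rnd = new Random(seed);
                for (int i = 0; i < 5; i++) {
                    int v = rnd.nextInt(100);
                    privateQueue.add(new Update(v % machines, v, rnd.nextDouble()));
                }
                return privateQueue;
            }));
        }
        pool.shutdown();

        // Merge private queues into one batch per destination machine, sorted by vertex id.
        Map<Integer, List<Update>> perMachine = new HashMap<>();
        for (Future<List<Update>> f : futures)
            for (Update u : f.get())
                perMachine.computeIfAbsent(u.machine(), k -> new ArrayList<>()).add(u);
        for (List<Update> batch : perMachine.values())
            batch.sort(Comparator.comparingInt(Update::vertexId));

        perMachine.forEach((m, batch) -> System.out.println("to M" + m + ": " + batch));
    }
}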

Outline
Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow
Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement
Implementation & Experiment

Implementation
Cyclops(MT)
□ Based on Apache Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
→ Compatible I/O interface
→ An additional phase to build replicas
□ Fault tolerance
→ Incremental checkpoint
→ Replication-based FT [DSN'14]

Experiment Settings
Platform
□ 6 machines, each with 12-core AMD Opteron CPUs (64GB RAM, 1GigE NIC)
Graph Algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload
□ 7 real-world datasets from SNAP
□ 1 synthetic dataset from GraphLab

Overall Performance Improvement
[Figure: speedup over Hama on PageRank, ALS, CD, and SSSP (push-mode) with 48 workers on 6 machines (8 per machine); speedups range from 2.06X to 8.69X]

Performance Scalability
[Figure: scalability with increasing numbers of workers and threads on the Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA datasets]

Performance Breakdown
[Figure: performance breakdown comparing Hama, Cyclops, and CyclopsMT on PageRank, ALS, CD, and SSSP]

Comparison with PowerGraph
Preliminary results with a Cyclops-like engine implemented on the GraphLab platform (C++ & Boost RPC lib.)
[Figure: comparison on synthetic 10-million-vertex regular (even-edge) and power-law (α = 2.0) graphs]

Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic computation nature (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity via distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source Code:

Questions? Thanks!
Cyclops: projects/cyclops.html
IPADS, Institute of Parallel and Distributed Systems

What's Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
Power-law: "most vertices have relatively few neighbors while a few have many neighbors"
[Figure: preliminary results comparing PowerLyra (PL), PowerGraph (PG), and Cyclops]

Generality
Algorithms that aggregate/activate all neighbors
□ e.g. Community Detection (CD)
□ Transform to an undirected graph by duplicating edges
□ Still aggregate in one direction (e.g. in-edges) and activate in the other direction (e.g. out-edges)
□ Preserve all benefits of Cyclops → x1 per replica, contention immunity & good locality
[Figure: undirected graph with duplicated edges partitioned across M1, M2, and M3]
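A minimal sketch of the transformation, assuming a simple edge-list representation: each directed edge is duplicated in the reverse direction, after which aggregating over in-edges alone reaches every (undirected) neighbor, preserving the one-way semantics. Edge and toUndirected are illustrative names only.

// Sketch of the undirected transformation used for all-neighbor algorithms.
import java.util.ArrayList;
import java.util.List;

public class UndirectedTransformSketch {
    record Edge(int src, int dst) {}

    static List<Edge> toUndirected(List<Edge> directed) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : directed) {
            result.add(e);                            // original direction
            result.add(new Edge(e.dst(), e.src()));   // duplicated reverse edge
        }
        return result;
    }

    public static void main(String[] args) {
        List<Edge> g = List.of(new Edge(1, 2), new Edge(2, 3));
        // After duplication, aggregating over in-edges of a vertex reaches all
        // of its (undirected) neighbors, preserving the one-way semantics.
        System.out.println(toUndirected(g));
    }
}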

Generality
Differences between Cyclops and GraphLab:
1. How the local sub-graph is constructed
2. How neighbors are aggregated/activated
[Figure: local sub-graphs of Cyclops vs. GraphLab across M1, M2, and M3]

Improvement of CyclopsMT
[Figure: Cyclops vs. CyclopsMT under configurations M×W×T/R, where M = #machines, W = #workers, T = #threads, R = #receivers]

Communication Efficiency
□ Hama: Hadoop RPC lib (Java), messages of the form (id, data) — send + buffer + parse (contention)
□ PowerGraph: Boost RPC lib (C++) — send + update (contention)
□ Cyclops: Hadoop RPC lib (Java) — send + update
[Figure: per-worker (W0–W5) communication cost for 5M/25M/50M messages]

Using Heuristic Edge-cut (i.e. Metis)
[Figure: speedup over Hama on PageRank, ALS, CD, and SSSP with 48 workers on 6 machines (8 per machine); speedups range from 5.95X to 23.04X]

Memory Consumption
[Figure: memory behavior per worker (PageRank on the Wiki dataset), measured with jstat; GC: Concurrent Mark-Sweep]

Ingress Time
[Figure: graph ingress time of Cyclops vs. Hama]

Selective Activation
Sync. & distributed activation
□ Merge update & activate messages:
1. Update value of replicas
2. Invite replicas to activate neighbors
□ Selective activation (e.g. ALS): the merged message optionally carries an activation list (msg: v|m|s → v|m|s|l), controlled by an Activation_List option
[Figure: example merged messages updating and activating replicas across M1, M2, and M3]
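To illustrate the extended v|m|s|l layout, the earlier merged-message sketch can be widened with an optional activation list naming exactly which out-neighbors the replica should activate (useful for ALS-style selective activation). As before, the class and fields are assumptions for this sketch, not the real Cyclops format.

// Sketch of the selective-activation variant of the merged message.
import java.util.List;

public class SelectiveReplicaMsg {
    final int vertexId;                  // which replica to update
    final double value;                  // new value from the master
    final boolean activate;              // activate anything at all?
    final List<Integer> activationList;  // optional: only these out-neighbors

    SelectiveReplicaMsg(int vertexId, double value, boolean activate, List<Integer> activationList) {
        this.vertexId = vertexId;
        this.value = value;
        this.activate = activate;
        this.activationList = activationList;
    }

    public static void main(String[] args) {
        // e.g. for ALS: activate only two chosen out-neighbors of vertex 4.
        SelectiveReplicaMsg m = new SelectiveReplicaMsg(4, 0.8, true, List.of(7, 9));
        System.out.println(m.vertexId + "|" + m.value + "|" + (m.activate ? 1 : 0)
                           + "|" + m.activationList);
    }
}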

Parallelism Improvement
Distributed immutable view opens an opportunity: lock-free, sorted updates with good locality
□ Computation threads vs. communication threads: separately configurable
[Figure: threads on M1 sending through private out-queues to masters and replicas on M2 and M3]

Cyclops(MT) → Distributed Immutable View vs. existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
□ Existing systems: w/o dynamic computation, high contention, hard to program, duplicated edges, heavy communication cost
□ Cyclops(MT): w/ dynamic computation, no contention, easy to program, no duplicated edges, low communication cost

What's Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with its default partitioning by up to 17.75X, and saves up to 96% of network traffic

Multicore Support
Two challenges:
1. Two-level hierarchical organization
→ Preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize
→ High contention to buffer and parse messages
→ Poor locality in message parsing
→ Asymmetric degree of parallelism between CPU and NIC