Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms
Gurbinder Gill, Roshan Dathathri, Loc Hoang, Andrew Lenharth, Keshav Pingali
Graph Analytics Applications
Machine learning and network analysis. Datasets: unstructured graphs.
Credits: Sentinel Visualizer; Wikipedia, SFL Scientific, MakeUseOf
Motivation
HPC hardware trends today:
Heterogeneous processors: GPU, FPGA, ...
Growing use of distributed memory to fit huge graphs
Hardware today is becoming more and more heterogeneous, with co-processors such as GPUs and FPGAs present alongside CPUs. To fit ever-growing input graphs, distributed memory is a popular way to do graph analytics. Different programming models and architectures have different semantics, and it is difficult to seamlessly manage communication among them.
Credits: HPP-Hardware architecture
Motivation (continued)
Writing an efficient distributed graph application takes:
Lots of expertise, both application-specific and device-specific
Complex programming, execution, and communication models
Different devices need different optimizations and data layouts
Expertise in one device or algorithm does not carry over to another, and changes over time
A naive implementation will be orders of magnitude slower than a hand-tuned one
Credits: HPP-Hardware architecture
Contributions
A compiler called Abelian that translates a shared-memory specification of graph analytics applications into distributed, heterogeneous code:
Simple programming model that increases programming productivity
Novel code transformations for distributed execution
Novel compiler-generated communication optimizations
Generated code outperforms the state of the art and matches hand-tuned performance
Abelian Compiler for Graph Analytics: Overview
[Pipeline diagram: input Galois program → LLVM AST → Graph-Data Access Analysis → Restructuring Computation → Inserting Communication → device-specific compilers (g++ for C++ CPU code; IrGL for CUDA/OpenCL GPU code), all linked against the Gluon runtime]
Abelian is a source-to-source translation tool based on LLVM Clang's libTooling.
Input: shared-memory Galois [SoSP'13] program
Output: efficient distributed, heterogeneous program using Gluon [PLDI'18] and IrGL [OOPSLA'16]
Outline
Abelian programming and execution model
Code transformations: required for correctness, and optional optimizations
Analysis and inserting communication
Communication optimizations
Device-specific compilers
Experimental results
Abelian Programming and Execution Model
Generalized Vertex Programming Model
Every node has a label, e.g., rank in PageRank (PR).
An operator is applied to an active node in the graph.
The operator can update the node, its neighbors, or both, e.g., the rank-update operator in PR.
Operators are push-style (read the active node, write its neighbors) or pull-style (read the neighbors, write the active node).
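The push-style operator described above can be sketched in plain C++ as follows. This is a minimal illustration, not the Galois API; the names Graph and pushOperator are made up for this sketch.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of the push-style vertex-operator model: the operator
// reads the active node and reduces a value into each neighbor's label.
struct Graph {
    std::vector<std::vector<int>> adj;  // adjacency list: out-edges per node
    std::vector<double> label;          // one label per node, e.g. residual in PR
};

// Push-style: read at the source, reduce (+=) into the destinations.
void pushOperator(Graph& g, int src, double delta) {
    for (int dst : g.adj[src])
        g.label[dst] += delta;  // += is the reduction (combine) operation
}
```

A pull-style operator would invert the direction: read the neighbors' labels and update only the active node.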
BSP Execution Model
Why distributed bulk-synchronous parallel (BSP) execution? The overheads of asynchronous execution in a distributed setting are prohibitively high, and state-of-the-art distributed graph analytics systems such as D-Galois [PLDI'18] and Gemini [OSDI'16] use BSP.
A BSP round consists of a computation phase followed by a communication phase.
Abelian converts asynchronous Galois programs to BSP-style execution.
Gluon API
Gluon [PLDI'18] provides an efficient, high-level bulk-synchronous API for synchronization: a communication substrate for building distributed and heterogeneous graph analytics systems.
Gluon provides a graph partitioner (edge-cuts and vertex-cuts) and a communication-optimized runtime that exploits partitioning structural invariants and avoids sending unnecessary metadata.
Abelian is required to insert Gluon API calls for distributed execution and to provide the information needed to exploit Gluon's optimizations.
[Diagram: compute engines such as Galois [SoSP'13], Ligra [PPoPP'13], and IrGL [OOPSLA'16] on CPUs and GPUs plug into the Gluon communication runtime and partitioner over the network layer (LCI [IPDPS'18] or MPI)]
Shared-Memory Galois PageRank (Input)

struct pageRank {
  Graph* g;
  pageRank(Graph* g) : g(g) {}
  void operator()(GNode src, Worklist& wl) {
    auto& srcData = g->getData(src);
    auto residual_old = srcData.residual.exchange(0);
    srcData.rank += residual_old;
    auto delta = residual_old * alpha / srcData.nout;
    for (auto e : g->getEdges(src)) {
      GNode dst = g->getEdgeDst(e);
      auto& dstData = g->getData(dst);
      dstData.residual += delta;
      if (dstData.residual > tolerance) {
        wl.push(dst);
      }
    }
  }
};

while (!wl.empty()) {
  Galois::do_all(wl, pageRank{&g});
}

Graph node labels: rank, residual, nout
Asynchronous, work-list based execution: the operator is applied to items in the work-list, and it pushes newly activated nodes back onto it.
Push-style operator: updates residual on immediate neighbors.
This is all the code: a simple shared-memory specification with nothing distributed and nothing device-specific; the compiler handles everything else.
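To make the semantics of the slide's pseudocode concrete, here is a runnable shared-memory analogue using plain STL containers instead of the Galois graph and work-list. It is a sequential sketch; the duplicate-tolerant check on pop (rather than on push) and the PRGraph name are choices of this sketch, not of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <deque>
#include <vector>

// Sequential analogue of residual-based, work-list-driven PageRank.
struct PRGraph {
    std::vector<std::vector<int>> out;  // out-edges per node
    std::vector<double> rank, residual;
    std::vector<int> nout;              // out-degree per node
};

void pageRank(PRGraph& g, double alpha, double tolerance) {
    std::deque<int> wl;
    for (int v = 0; v < (int)g.rank.size(); ++v) wl.push_back(v);
    while (!wl.empty()) {
        int src = wl.front(); wl.pop_front();
        double residual_old = g.residual[src];
        // Nodes may be pushed more than once; skip if already drained.
        if (residual_old <= tolerance) continue;
        g.residual[src] = 0;
        g.rank[src] += residual_old;           // absorb residual into rank
        double delta = residual_old * alpha / g.nout[src];
        for (int dst : g.out[src]) {
            g.residual[dst] += delta;          // push to immediate neighbors
            if (g.residual[dst] > tolerance) wl.push_back(dst);
        }
    }
}
```

On a two-node cycle with initial residual 0.15 per node and alpha = 0.85, each rank converges to about 0.15 / (1 - 0.85) = 1.0.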
Distributed Heterogeneous PageRank (Output)

do {
  Galois::do_all(graph.getSources(), pageRank{&graph, alpha, tolerance});
  Flag_rank.set_writeSrc();
  Flag_contrib.set_reduceDst();
  if (Flag_contrib.is_reduceDst()) {
    graph.sync<reduceDst, readSrc, Add_contrib, Bcast_contrib>(Bitvec_contrib);
    Flag_contrib.reset_reduceDst();
  } else if (…) { … } else { … }
  Galois::do_all(graph.getSources(), pageRank_split{&graph});
  Flag_residual.set_writeSrc();
} while (work_done.reduce());

struct pageRank {
  Graph* g;
  DistributedAcc work_done;
  pageRank(Graph* g) : g(g) {}
  void operator()(GNode src) {
    if (srcData.residual > tolerance) {
      …
      for (auto e : g->getEdges(src)) {
        dstData.contrib += delta;
      }
    }
  }
};

struct pageRank_split {
  Graph* g;
  void operator()(GNode src) {
    auto& srcData = g->getData(src);
    srcData.residual += srcData.contrib;
    srcData.contrib = 0;
  }
};

Graph node labels: rank, residual, nout, contrib
BSP-style execution with two operators, pageRank and pageRank_split
Filter-based data-driven execution
On-demand communication via Gluon sync calls
Input-to-Output Code Transformations
Required transformations:
Asynchronous to BSP-style execution
Closure conversion for heterogeneous execution
Optional optimization transformation:
Work-list elimination
Communication optimizations:
On-demand communication
Fine-grained synchronization
Code Transformations
Asynchronous to BSP Execution
The input uses fine-grain synchronization: it reads and writes residual at the source and updates residual at the destinations (immediate neighbors).
BSP execution does not permit fine-grain synchronization: hosts have disjoint address spaces, so they see different values during the computation phase, which requires synchronization.
(The shared-memory pageRank operator from the input slide is shown for reference.)
Restructuring Computation: Required Transformation 1
From fine-grain iteration-level synchronization to round-based BSP, by splitting the operator:
pageRank: read and write accesses
pageRank_split: the reduction
A new field (contrib) is introduced.

struct pageRank {
  Graph* g;
  float local_alpha, local_tolerance;
  void operator()(GNode src, Worklist& wl) {
    auto& srcData = g->getData(src);
    auto residual_old = srcData.residual.exchange(0);
    srcData.rank += residual_old;
    auto delta = residual_old * local_alpha / srcData.nout;
    for (auto e : g->getEdges(src)) {
      GNode dst = g->getEdgeDst(e);
      auto& dstData = g->getData(dst);
      dstData.contrib += delta;
      if (dstData.contrib > local_tolerance) {
        wl.push(dst);
      }
    }
  }
};

struct pageRank_split {
  Graph* g;
  pageRank_split(Graph* g) : g(g) {}
  void operator()(GNode src) {
    auto& srcData = g->getData(src);
    srcData.residual += srcData.contrib;
    srcData.contrib = 0;
  }
};
Transformations for Heterogeneous Execution
The operator references the global variables tolerance and alpha (in "auto delta = residual_old * alpha / srcData.nout;" and "if (dstData.contrib > tolerance)"), but heterogeneous hosts may not share a common visible memory space.
Restructuring Computation: Required Transformation 2
Eliminating global variables: make operators self-contained (closure conversion) by introducing corresponding local variables.

struct pageRank {
  Graph* g;
  float local_alpha, local_tolerance;   // local copies of the globals
  pageRank(Graph* g, float alpha, float tolerance)
    : g(g), local_alpha(alpha), local_tolerance(tolerance) {}
  void operator()(GNode src, Worklist& wl) {
    auto& srcData = g->getData(src);
    auto residual_old = srcData.residual.exchange(0);
    srcData.rank += residual_old;
    auto delta = residual_old * local_alpha / srcData.nout;
    for (auto e : g->getEdges(src)) {
      GNode dst = g->getEdgeDst(e);
      auto& dstData = g->getData(dst);
      dstData.contrib += delta;
      if (dstData.contrib > local_tolerance) {
        wl.push(dst);
      }
    }
  }
};
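Closure conversion in miniature: the sketch below (ScaleOp is a name invented for this example) shows an operator that, instead of reading a file-scope global, carries a local copy of every free variable as a constructor-initialized member, so the functor can be shipped to a device that cannot see host globals.

```cpp
#include <cassert>
#include <vector>

// After closure conversion the operator is self-contained: all state it
// needs travels with the object.
struct ScaleOp {
    std::vector<double>* data;
    double local_alpha;  // was a global; now captured in the closure
    ScaleOp(std::vector<double>* d, double alpha)
        : data(d), local_alpha(alpha) {}
    void operator()(int i) const { (*data)[i] *= local_alpha; }
};
```

Shipping the functor by value is all that is needed to run it on another host or device, since it no longer dereferences any host-global state.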
Optional Optimization Transformation: Work-List Elimination
A distributed work-list would be required, and it needs synchronization.
(The transformed operator still pushes to a work-list: if (dstData.contrib > local_tolerance) { wl.push(dst); })
Restructuring Computation: Optimization Transformation
Conversion to a filter-based data-driven algorithm avoids the synchronization overhead of a distributed work-list: the conditional previously used to push work is instead used as a filter.

struct pageRank {
  Graph* g;
  float local_alpha, local_tolerance;
  void operator()(GNode src) {
    auto& srcData = g->getData(src);
    if (srcData.residual > local_tolerance) {   // filter replaces wl.push
      auto residual_old = srcData.residual.exchange(0);
      srcData.rank += residual_old;
      auto delta = residual_old * local_alpha / srcData.nout;
      for (auto e : g->getEdges(src)) {
        GNode dst = g->getEdgeDst(e);
        auto& dstData = g->getData(dst);
        dstData.contrib += delta;
      }
    }
  }
};
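The work-list-to-filter conversion can be sketched as follows. This is an illustration with invented names (filterRound, and a halving update standing in for the real operator body): every node is visited each round, the old push condition becomes a filter, and the round reports whether any work was done so the driver knows when to stop.

```cpp
#include <cassert>
#include <vector>

// One filter-based round: visit all nodes, skip the inactive ones.
bool filterRound(std::vector<double>& residual, double tolerance) {
    bool workDone = false;
    for (auto& r : residual) {
        if (r <= tolerance) continue;  // filter: the old push condition
        r *= 0.5;                      // stand-in for the real operator body
        workDone = true;
    }
    return workDone;
}
```

The driver simply loops `while (filterRound(residual, tol));` — no distributed work-list to keep consistent across hosts, at the cost of scanning inactive nodes.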
Analysis and Inserting Communication
After the required transformations, the compiler analyzes the code and inserts the communication needed for distributed execution.
Information Required by the Gluon Synchronization API
Application-specific:
What: the field to synchronize
When: the point of synchronization
How: the reduction operator to use
Optional information for optimization:
A field-specific bit-set to track updated nodes
Precise read and write locations for exploiting structural-invariant optimizations
Graph-Data Access Analysis
The analysis determines the fields accessed on graph nodes (what) and, for each field, the operator's access type (how):
Reduction: the field is read and updated using a reduction operation (inside the edge iterator)
Read: the field is only read
Write: the field is written (not as part of a reduction)
It also determines the operator's field access location:
At source: at the source of an edge
At destination: at the destination of an edge
At any: at the node itself or at both end points of an edge
PageRank Example
Fields accessed: contrib, residual, rank
Reduction operation on contrib: +=
Field access locations: residual and rank are read and written at the source; contrib is updated (reduced) at the destination.
(The filter-based pageRank operator from the previous slide is shown for reference.)
Inserting Communication
Next, Abelian inserts the communication calls.
Inserting Naïve Communication: Unoptimized PageRank
Gluon sync API calls are inserted conservatively after every operator:

while (!workDone) {
  Galois::do_all(graph.getSources(), pageRank{&graph, alpha, tol});
  graph.sync<reduceAny, readAny, Add_contrib, Bcast_contrib>();
  Galois::do_all(graph.getSources(), pageRank_split{&graph});
  graph.sync<reduceAny, readAny, Add_residual, Bcast_residual>();
}

(The pageRank and pageRank_split operators are as before.)
Communication Optimization 1: On-Demand Communication
Data flow for the contrib field: the pageRank operator updates contrib; the pageRank_split operator reads it.
Naïve synchronization (conservative approach): contrib is updated in the pageRank operator, so sync contrib immediately after pageRank.
On-demand synchronization: sync only before an operator that needs the field, i.e., sync contrib on demand in pageRank_split, just before it is read.
Inserting On-Demand Communication
Abelian inserts field-specific sync-state flags:
Before each operator, generated code checks the sync-state flags against the precise read/write locations provided by Abelian and calls Gluon sync only if required.
After each operator, generated code sets or invalidates the field-specific sync-state flags based on the write location.
Because Abelian precisely identifies abstract read and write locations, the generated sync calls can exploit Gluon's structural-invariant optimization.
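The flag mechanics described above can be sketched with a tiny class. This is an illustration, not the Gluon API: SyncFlag and its methods are invented names, modeling one field whose last reduction happened at edge destinations and whose next read happens at edge sources.

```cpp
#include <cassert>

// A field-specific sync-state flag, as the compiler would insert it.
struct SyncFlag {
    bool dirtyAtDst = false;  // field was reduced at destinations, not yet synced

    void setReduceDst() { dirtyAtDst = true; }  // set after a reducing operator

    // Checked before an operator that reads the field at the source:
    // returns true (and clears the flag) when communication is required first.
    bool needSyncForReadSrc() {
        if (!dirtyAtDst) return false;  // up to date: skip communication
        dirtyAtDst = false;             // the sync we trigger makes it clean
        return true;
    }
};
```

The payoff is that a round which never touches the field skips the sync entirely, instead of paying for it conservatively after every operator.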
Inserting On-Demand Communication: PageRank Example
The compiler inserts and sets the sync flags, and checks them before inserting Gluon sync calls:

Galois::do_all(graph.getSources(), pageRank{&graph, alpha, tolerance});
Flag_rank.set_writeSrc();
Flag_contrib.set_reduceDst();
if (Flag_contrib.is_reduceDst()) {
  graph.sync<reduceDst, readSrc, Add_contrib, Bcast_contrib>(Bitvec_contrib);
  Flag_contrib.reset_reduceDst();
} else if (…) { … } else { … }
Galois::do_all(graph.getSources(), pageRank_split{&graph});
Flag_residual.set_writeSrc();

The extra logic checks, sets, and invalidates the sync flags.
Communication Optimization 2: Fine-Grained Communication
Abelian inserts a field-specific bit-vector to keep track of updated nodes and passes it to the Gluon API so that only updated nodes are communicated.
Static analysis is not adequate to determine which nodes are updated, so instrumentation code is inserted to track this dynamically.
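The bit-vector instrumentation can be sketched as follows. UpdateTracker, markUpdated, and drainUpdated are names invented for this sketch: the operator flags each node it touches, and the communication phase serializes only the flagged nodes instead of the whole field.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-field tracking of which nodes were updated since the last sync.
struct UpdateTracker {
    std::vector<bool> bits;
    explicit UpdateTracker(std::size_t n) : bits(n, false) {}

    void markUpdated(std::size_t node) { bits[node] = true; }  // set by operator

    // Collect only the updated nodes for the message, then clear the bits.
    std::vector<std::size_t> drainUpdated() {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < bits.size(); ++i)
            if (bits[i]) { out.push_back(i); bits[i] = false; }
        return out;
    }
};
```

When only a small fraction of nodes is active in a round, sending just the drained indices and their values shrinks the message dramatically compared to shipping the entire field.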
Inserting Fine-Grained Communication: PageRank Example
Abelian inserts bit-vector updates for every field that is updated; here, for contrib and residual:

struct pageRank {
  …
  for (auto e : g->getEdges(src)) {
    GNode dst = g->getEdgeDst(e);
    auto& dstData = g->getData(dst);
    dstData.contrib += delta;
    Bitvec_contrib.set(dst);
  }
  …
};

struct pageRank_split {
  Graph* g;
  pageRank_split(Graph* g) : g(g) {}
  void operator()(GNode src) {
    auto& srcData = g->getData(src);
    srcData.residual += srcData.contrib;
    Bitvec_residual.set(src);
    srcData.contrib = 0;
  }
};
Device-Specific Compilers
CPU code: C++, compiled with g++.
GPU code: Abelian generates IrGL [OOPSLA'16] intermediate code, and the IrGL compiler generates CUDA or OpenCL.
Experimental Results
Experimental Setup
Abelian code variants:
Unoptimized (UO)
Fine-grained comm. opt. (FG)
FG + on-demand comm. opt. (FO)
Hand-tuned (HT) (D-Galois)
Gemini (state of the art)
Benchmarks: betweenness centrality (bc), breadth-first search (bfs), connected components (cc), PageRank (pr), single-source shortest path (sssp), k-core decomposition (kcore), matrix completion (sgd)

Inputs (different graph sizes):
Input       |V|     |E|     |E|/|V|   Size (CSR)
rmat28      268M    4B      16        35GB
kron30      1073M   11B     -         136GB
clueweb12   978M    42B     44        325GB
Amazon      31M     82.5M   2.7       1.2GB

Clusters:
             Stampede (CPU)         Bridges (GPU)
Max. hosts   32                     16
Machine      Intel Xeon Phi KNL     4 NVIDIA Tesla K80s
Each host    272 threads of KNL     1 Tesla K80
Memory       96GB DDR3              128GB DDR5
Comparison with the State of the Art (kron30)
Abelian produces efficient code: it matches or outperforms Gemini, and matches D-Galois (difference < 12%).
CPU: Impact of Communication Optimizations (Stampede: 32 hosts, 68 cores per host)
On clueweb12 and kron30, FO reduces communication volume by 23x over UO, gives a geomean speedup of 3.4x over UO, and matches HT performance.
GPU: Impact of Communication Optimizations (Bridges: 16 hosts, 1 GPU per host)
Results shown for rmat28.
Different Partitioning Strategies
Abelian programs exploit Gluon's partition-aware optimizations (shown: betweenness centrality on clueweb12).
Conclusions
Abelian produces high-performance distributed, heterogeneous code from a shared-memory specification.
Fine-grained and on-demand communication optimizations give a speedup of 3.4x.
Generated code matches hand-tuned CPU and GPU implementations.
~ Thank you ~