A Lightweight Communication Runtime for Distributed Graph Analytics


1 A Lightweight Communication Runtime for Distributed Graph Analytics
Hoang-Vu Dang*, Roshan Dathathri+, Gurbinder Gill+, Alex Brooks*, Nikoli Dryden*, Andrew Lenharth+, Loc Hoang+, Keshav Pingali+, Marc Snir*
*: University of Illinois at Urbana-Champaign   +: University of Texas at Austin

2 Distributed Graph Analytics
Graph analytics is used in a variety of applications, such as:
- Search engines (e.g., PageRank)
- Machine learning (e.g., stochastic gradient descent)
- Social network analysis (e.g., betweenness centrality)
Datasets are unstructured graphs:
- Billions of nodes and trillions of edges.
- They do not fit in the memory of a single machine, so a distributed cluster is used.
(Image credit: Sentinel Visualizer)

3 Contributions
- A study of the communication requirements of graph analytics applications.
- Issues in the Message Passing Interface (MPI) standard that limit the performance of graph analytics applications.
- Design and implementation of a Lightweight Communication Interface (LCI).
- Evaluation of the benefits of LCI over MPI in state-of-the-art distributed graph analytics frameworks on distributed clusters of up to 128 multi-core hosts.

4 Communication in Distributed-Memory Graph Analytics

5 Distributed-memory graph analytics
E.g., D-Galois [Dathathri et al., PLDI'18] and Gemini [Zhu et al., OSDI'16].
- Every node has a label, e.g., the distance in single-source shortest path.
- An operator is applied to an active node in the graph (data-driven), e.g., the relaxation operator in single-source shortest path.
- Operators compute labels on nodes.
- Vertex programs: operators read or write the labels of the node and its neighbors.

6 D-Galois: Gluon + Galois
[Figure: the D-Galois stack on each host. A Gluon partitioner feeds Galois [Nguyen et al., SOSP'13] running on the multicore, and the Gluon communication runtime [Dathathri et al., PLDI'18] on each host exchanges data over MPI or LCI.]

7 Partitions of the graph
[Figure: the original graph is partitioned across hosts h1 and h2; a node that appears in several partitions has a master proxy on one host and mirror proxies on the others.]

8 Communication pattern
Sparse all-to-all communication:
- Labels of different nodes may be updated in different rounds, so the data to synchronize varies from round to round.
- The data to be sent or received is sparse and not contiguous in memory.
- Millions of nodes may need to be synchronized, so message aggregation (gather, communicate, scatter) is required to avoid a large number of pending messages.
[Figure: partitions of the graph on hosts h1 and h2, with master and mirror proxies.]

9 Communication runtime
- A dedicated communication thread handles communication requests and progress.
- Compute threads gather or scatter data and enqueue or dequeue communication requests.
- Incoming data is unknown: unknown size, unknown sender.
- Data is buffered until compute threads are available to process it: the size and duration of buffering are unknown.
- The communication thread needs to synchronize with the other threads.
This structure overlaps communication with computation.
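A generic sketch of this thread structure follows (illustrative only, not the frameworks' actual code); the request type, list, and `done` flag are assumptions used to show the handoff between compute threads and the communication thread.

    /* Compute threads gather data and enqueue send requests; one dedicated
     * communication thread dequeues them and drives the network. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct send_req {
        void *buf; size_t size; int dest;
        struct send_req *next;
    } send_req;

    static send_req *head = NULL;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile bool done = false;

    /* Compute thread: after gathering labels into 'buf', hand it off. */
    void enqueue_send(void *buf, size_t size, int dest)
    {
        send_req *r = malloc(sizeof *r);
        *r = (send_req){ buf, size, dest, NULL };
        pthread_mutex_lock(&lock);
        r->next = head; head = r;          /* LIFO is fine for a sketch */
        pthread_mutex_unlock(&lock);
    }

    /* Communication thread: pop requests, issue sends, poll for incoming data. */
    void *comm_thread(void *arg)
    {
        (void)arg;
        while (!done) {
            pthread_mutex_lock(&lock);
            send_req *r = head;
            if (r) head = r->next;
            pthread_mutex_unlock(&lock);
            if (r) { /* issue the send via MPI or LCI here */ free(r); }
            /* also poll/progress incoming messages here */
        }
        return NULL;
    }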

10 Checklist for efficient communication in distributed graph analytics
- Flow control tightly coupled with hardware buffer management (hardware and software buffers), to avoid a large number of pending messages.
- Explicit memory resource management.
- Direct mapping to the hardware interface; RDMA for long messages.
- Low-overhead thread-synchronization primitives for multicores.
- Message aggregation done by the upper layer.
MPI, in contrast, asks users to program as if buffers were unlimited.

11 Issues in Message Passing Interface (MPI)

12 MPI communication layer 1: MPI-Probe
MPI_THREAD_FUNNELED mode is used.
- Progressing sends: MPI_Isend is issued as soon as data is available.
- Progressing receives:
  - MPI_Iprobe: wildcard polling (MPI_ANY_SOURCE, MPI_ANY_TAG) for incoming data.
  - Allocate memory according to the size in the returned MPI_Status.
  - MPI_Irecv: finally receive the data into the allocated buffer.
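Below is a minimal C sketch of this probe-based receive path on the communication thread, assuming MPI_BYTE payloads; it is illustrative, not the frameworks' actual code (the runtime uses MPI_Irecv so it can overlap the receive with further probing).

    #include <mpi.h>
    #include <stdlib.h>

    void poll_incoming(void)
    {
        int flag = 0;
        MPI_Status status;

        /* Wildcard polling: we know neither the sender nor the message size. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (!flag)
            return;

        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);

        /* Allocate a buffer of exactly the probed size, then receive into it. */
        char *buf = malloc(count);
        MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... hand buf off to compute threads for scattering ... */
        free(buf);
    }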

13 MPI communication layer 1: MPI-Probe
Issues:
- Requires custom buffering to maintain flow control on top of the MPI implementation; otherwise the application crashes with an out-of-resource error (MPI_Isend always "succeeds").
- Probing and then receiving is slow due to the overhead of wildcard tag matching; MPI_Mprobe can help, but it is not fully supported by all implementations.
- Overhead of MPI with multi-threading: MPI_Test has the additional overhead of performing network progress.
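For reference, a sketch of the matched-probe variant mentioned above: MPI_Improbe removes the message from the matching engine, so MPI_Mrecv does not repeat the wildcard tag matching that the probe-then-receive pattern pays twice.

    #include <mpi.h>
    #include <stdlib.h>

    void poll_incoming_mprobe(void)
    {
        int flag = 0;
        MPI_Message msg;
        MPI_Status status;

        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                    &flag, &msg, &status);
        if (!flag)
            return;

        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);

        char *buf = malloc(count);
        MPI_Mrecv(buf, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);

        /* ... process buf ... */
        free(buf);
    }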

14 MPI communication layer 2: MPI-RMA
MPI_THREAD_MULTIPLE mode is used.
- RMA windows are created for receiving aggregated messages from each host, sized using a conservative upper bound that assumes all nodes are active.
- Generalized active target synchronization:
  - Enqueue: MPI_Win_start, MPI_Put, MPI_Win_complete
  - Dequeue: MPI_Win_post, MPI_Win_wait
- A dedicated communication thread uses MPI_Iprobe to ensure forward progress.
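A minimal sketch of this generalized active-target pattern, with the origin pushing an aggregated buffer into the target's window. Group construction and window setup are simplified; the real runtime keeps a window region per peer host.

    #include <mpi.h>

    void rma_enqueue(MPI_Win win, MPI_Group target_group,
                     const char *buf, int nbytes, int target_rank)
    {
        MPI_Win_start(target_group, 0, win);            /* open access epoch  */
        MPI_Put(buf, nbytes, MPI_BYTE, target_rank,
                0 /* target displacement */, nbytes, MPI_BYTE, win);
        MPI_Win_complete(win);                          /* close access epoch */
    }

    void rma_dequeue(MPI_Win win, MPI_Group origin_group)
    {
        MPI_Win_post(origin_group, 0, win);             /* open exposure epoch */
        /* ... compute threads may overlap other work here ... */
        MPI_Win_wait(win);                              /* data has arrived    */
    }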

15 MPI communication layer 2: MPI-RMA
Advantages:
- Avoids two-sided matching of sends and receives.
- Avoids synchronization between the communication thread and compute threads.
Issues:
- Pre-allocated buffers in the RMA windows use more memory than required.
- Window creation time increases with the number of hosts.
- Data locality is lost since communication buffers are not reused.

16 Lightweight Communication Interface (LCI)

17 LCI: lightweight communication interface
Primitives required from the network:
- lc_send(p): send a network packet.
- lc_put(src, dst): perform an RDMA write when the remote address is known.
- lc_progress(): ensure forward progress; executes callbacks associated with a packet (rendezvous).
Interfacing with threads/frameworks:
- wait/signal callback mechanism.
- Memory allocator passed to the interface.
APIs implemented over the network API:
- One-sided and two-sided, as in MPI [Dang et al., EuroMPI 2016].
- Queue (this paper).
[Figure: the LCI software stack. Two-sided (send/recv), one-sided (put/get), and queue interfaces with their completion mechanisms (signal/wait/queue) and shared resources (memory pools, thread synchronization, collectives) sit on a communication server targeting IBV, OFI, or PSM2.]

18 LCI design principles
Issue in MPI: overhead of wildcard tag matching. MPI_Isend and MPI_Irecv may need to use MPI_ANY_SOURCE and MPI_ANY_TAG, traversing sequential lists to find a match.
LCI design principle: decouple the different producer-consumer matching schemes.
- One-sided, two-sided: no wildcard matching [Dang-2016]: lc_send_tag(src, size, rank, tag) / lc_recv_tag(src, size, rank, tag)
- Queue: no matching; streaming to the consumer/producer is application-specific: lc_send_queue(src, size, rank, meta) / lc_recv_queue(src, &size, &rank, &meta)
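A sketch of the queue interface, using the simplified signatures shown on this slide; the real LCI calls also carry endpoint/request arguments, so the prototypes below are illustrative assumptions, not the actual header.

    #include <stddef.h>

    int lc_send_queue(void *src, size_t size, int rank, int meta);      /* assumed */
    int lc_recv_queue(void **src, size_t *size, int *rank, int *meta);  /* assumed */

    /* Sender: no tag to match -- the aggregated buffer plus one metadata word
     * (e.g., the synchronization round) is pushed toward the receiver's queue. */
    void send_update(void *aggregated_buf, size_t size, int dest_host, int round)
    {
        lc_send_queue(aggregated_buf, size, dest_host, round);
    }

    /* Receiver (communication thread): pop whichever message completed first
     * (first-packet policy); sender, size, and metadata come back as outputs
     * instead of being matched against a posted receive. A nonzero return is
     * assumed to mean a message was dequeued. */
    void poll_updates(void)
    {
        void *buf; size_t size; int src_host, meta;
        if (lc_recv_queue(&buf, &size, &src_host, &meta)) {
            /* hand (buf, size, src_host) to compute threads for scattering */
        }
    }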

19 LCI design principles
Issue in MPI: overhead of request-completion checks and thread synchronization. The MPI_Test* and MPI_Wait* functions combine network progress with checking the request object, plus an internal lock under MPI_THREAD_MULTIPLE.
LCI design principle: decouple completion events from progress.
- lc_progress() is called explicitly (by the dedicated communication thread).
- A simple lock-free flag check suffices for request completion.
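A sketch of this decoupling: lc_progress() comes from the slide, while the lc_req structure, its 'sync' flag, and the prototypes are assumptions standing in for whatever completion object the real API exposes.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { atomic_int sync; /* 0 = pending, 1 = complete */ } lc_req;

    int lc_progress(void);   /* drives the network, runs callbacks (assumed prototype) */
    static volatile bool shutting_down = false;

    /* Dedicated communication thread: the only place network progress is made. */
    void *comm_thread_main(void *arg)
    {
        (void)arg;
        while (!shutting_down)
            lc_progress();
        return NULL;
    }

    /* Compute thread: completion is a lock-free flag check, with no hidden
     * progress call and no internal lock, unlike MPI_Test/MPI_Wait under
     * MPI_THREAD_MULTIPLE. */
    bool is_complete(lc_req *req)
    {
        return atomic_load_explicit(&req->sync, memory_order_acquire) == 1;
    }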

20 LCI design principles
Issue in MPI: requires custom buffering to ensure flow control. Resource exhaustion is a fatal error in current MPI implementations.
LCI design principle: decouple fatal errors from recoverable errors.
- lc_send returns ERR_RETRY if no resource is available in the network.
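A sketch of this recoverable-error path: when the network has no send resources, lc_send returns ERR_RETRY and the caller keeps the message on a pending list instead of crashing. The prototype and constant value are simplified assumptions based on the slide, and the pending-list helper is hypothetical.

    #include <stddef.h>

    #define ERR_RETRY 1   /* assumed value; the name is from the slide */

    int  lc_send(void *buf, size_t size, int rank);                   /* simplified */
    void pending_list_push(void *buf, size_t size, int rank);         /* hypothetical */

    void try_send(void *buf, size_t size, int rank)
    {
        int rc = lc_send(buf, size, rank);
        if (rc == ERR_RETRY) {
            /* Hardware/software buffers are full: defer instead of aborting.
             * The communication thread retries entries on this list after the
             * next lc_progress() frees packets. */
            pending_list_push(buf, size, rank);
        }
    }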

21 LCI design principles
Issue in MPI: overhead of unused features. MPI implementations have branches for datatypes and a memory allocator for requests and buffering, and MPI enforces ordering constraints on received packets.
LCI design principle: decouple high-level features from low-level network features.
- No datatypes (pack/unpack is provided by the application).
- No implicit memory allocation (the allocator is provided by the application).
- Returns any completed request (first-packet policy).

22 LCI Queue interface
Sender: send short messages directly (eagerly); for long messages, send an RTS packet and issue the RDMA write when the RTR is received. When there is no more buffer space in the network layer, add the message to a pending list and retry later.
Receiver: short messages are piggy-backed in the network packet and consumed directly; when an RTS is received, allocate memory (via the application-provided allocator) and send back an RTR packet carrying the memory descriptor for the RDMA.
[Figure: the queue interface data path. Compute threads enqueue into the send buffer; the communication thread drains it into the network, retrying from the pending list on back-pressure. On the receive side, the communication thread dequeues into buffers obtained from the custom allocator, and compute threads scatter the data in parallel.]
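A sketch of the sender-side policy just described: messages that fit in a network packet go out eagerly, while longer messages send a request-to-send (RTS) and the RDMA (lc_put) is issued later, when the receiver's RTR arrives with a memory descriptor. The eager limit, send_rts helper, and callback wiring are hypothetical; lc_send/lc_put prototypes are simplified from the slide.

    #include <stddef.h>

    #define EAGER_LIMIT 8192   /* assumed packet payload size */

    int lc_send(void *buf, size_t size, int rank);                         /* simplified */
    int lc_put(void *src, void *remote_dst, size_t size, int rank);        /* simplified */
    int send_rts(void *buf, size_t size, int rank);                        /* hypothetical */

    /* Invoked when the receiver answers an RTS with an RTR carrying the
     * address it allocated for this message (hypothetical callback wiring). */
    void on_rtr(void *buf, size_t size, int rank, void *remote_dst)
    {
        lc_put(buf, remote_dst, size, rank);   /* one-sided transfer of the payload */
    }

    void send_message(void *buf, size_t size, int rank)
    {
        if (size <= EAGER_LIMIT)
            lc_send(buf, size, rank);          /* short: eager, piggy-backed */
        else
            send_rts(buf, size, rank);         /* long: rendezvous via RTS/RTR */
    }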

23 Experimental Results

24 Experimental setup
Benchmarks:
- Breadth-first search (bfs)
- Connected components (cc)
- PageRank (pr)
- Single-source shortest path (sssp)
Cluster: Stampede 2
- NIC: Omni-Path
- CPU: KNL
- Number of cores: 68
- Clock per core: 1.4 GHz
- Memory: 96 GB DDR4
- L3 cache: 16 GB
- MPI implementation: Intel MPI 17
Inputs:
    Input            clueweb12   kron30    rmat28
    |V|              978M        1073M     268M
    |E|              42,574M     10,791M   4,295M
    |E|/|V|          44          10        16
    Max out-degree   7,447       3.2M      4M
    Max in-degree    75M         -         0.3M

25 Microbenchmarking: queue vs. probe
Point-to-point latency and message rate, comparing MPI Send + Iprobe + Recv (probe) with the LCI queue (queue) and plain MPI_Send/MPI_Recv (no-probe). There is a 3x difference between queue and probe.

26 Strong scaling of D-Galois: LCI vs. MPI
LCI is 1.34x and 1.08x faster than MPI-Probe and MPI-RMA on 128 hosts

27 Memory usage comparison
MPI-RMA's maximum and minimum usage are fixed because it allocates based on a conservative upper bound (excluding MPI's internal memory). LCI has lower maximum and minimum usage across hosts since buffers are reused.

28 Strong scaling of Gemini: LCI vs. MPI
LCI is 1.64x faster than MPI-Probe on 128 hosts

29 Related Work
- Libraries targeting specific languages (ARMCI, GASNet): similar issues with multi-threading and high thread counts.
- Active-message frameworks (AM++, GASNet): extra overheads for simple operations (e.g., setting flags); AM handlers have to be non-blocking or fast.
- Generic frameworks targeting current architectures (libfabric, UCX): great flexibility, but they do not optimize for a specific domain and are not yet thread-friendly.

30 https://github.com/danghvu/LCI
Conclusion:
- The MPI interface does not match the requirements of distributed graph applications.
- LCI presents a clean ground-up design: a better match to the hardware and to distributed graph analytics frameworks.
- LCI's advantage will grow with more asynchronous communication.
- The LCI studies also benefit MPI:
  - [mpich/pull/2976] adds the non-catastrophic error MPIX_ERR_EAGAIN.
  - [mpich/pull/3068] implements better threading in MPI.
  - [mpi-forum/mpi-forum-historic/issues/381] proposes an info key for MPI relaxation.
Hoang-Vu Dang

31 Message Passing Interface (MPI)
MPI is the de facto communication layer for high-performance computing. Current state-of-the-art network + MPI implementations:
- InfiniBand (ibverbs): MVAPICH2, OpenMPI. RDMA in the driver/hardware; tag matching in software; completion notification via a completion queue.
- Intel Omni-Path (psm2): Intel MPI. RDMA in software; tag matching in the driver/hardware; completion via a matching queue.
- OFI (libfabric): MPICH / Intel MPI. No native support; a porting layer for now.

32 Experimental setup
                      Stampede 2     Stampede 1
    NIC               Omni-Path      Mellanox FDR
    CPU               KNL            Xeon E5
    Number of cores   68             16
    Clock per core    1.4 GHz        -
    Memory            96 GB DDR4     32 GB DDR3
    L3 cache          16 GB          20 MB
    MPI impl.         Intel MPI 17   MVAPICH2 2.1
Stampede 2 has the better network with more throughput, but higher latency.

33 Motivation
Distributed graph analytics applications differ from traditional scientific applications:
- Higher communication-to-computation ratio.
- Sparse all-to-all communication.
- The amount of data communicated changes dynamically.
- Message aggregation is required to communicate data.
- Multiple threads produce and consume communication data.
MPI's semantics and features do not match the needs of such applications.

34 Communication pattern
Sparse all-to-all communication pattern:
- A communication thread is dedicated to handling communication requests and progress.
- Compute threads gather data, then submit it to the communication thread (for overlap).
- Incoming data is unknown: unknown size, unknown sender.
- Data is buffered until threads are available to process it: the amount of buffering is unknown.
- Many threads are involved in communication.
[Figure: send phase and receive phase. Compute threads gather in parallel into the send buffer, the communication thread moves data through the network, and compute threads scatter in parallel from the receive buffer.]

35 LCI design principles
Decouple the different producer-consumer matching schemes:
- Tag-matching: no wildcard matching [Dang-2016]: lc_send_tag(src, size, rank, tag) / lc_recv_tag(src, size, rank, tag)
- Queue: no matching; streaming to the consumer/producer is application-specific: lc_send_queue(src, size, rank, meta) / lc_recv_queue(src, &size, &rank, &meta)
- In MPI, MPI_Send/MPI_Recv matching is mixed with MPI_ANY_SOURCE and MPI_ANY_TAG.
Decouple completion events from progress:
- lc_progress() is called explicitly by the communication server thread.
- A simple flag check is required for request completion.
- In MPI, the MPI_Test*/MPI_Wait* series combines network progress with checking the request object, plus an internal lock under MPI_THREAD_MULTIPLE.

36 LCI design principles
Decouple fatal errors from recoverable errors:
- lc_send returns ERR_RETRY if no resource is available in the network.
- Resource exhaustion in MPI is a fatal error in current MPI implementations.
Decouple high-level features from low-level network features:
- No datatypes; pack/unpack is provided by the application.
- No implicit memory allocation; the allocator is provided by the application.
- No ordering constraints.
- In contrast, MPI implementations have branches for datatypes and a memory allocator for requests and buffering, and MPI enforces ordering constraints.


Download ppt "A Lightweight Communication Runtime for Distributed Graph Analytics"

Similar presentations


Ads by Google