Efficient and Simplified Parallel Graph Processing over CPU and MIC Linchuan Chen, Xin Huo, Bin Ren, Surabhi Jain and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University
The System A Vertex Oriented Parallel Graph Programming System over CPU and Xeon Phi 9/20/2018
Intel Xeon Phi Intel Xeon Phi Large-scale Parallelism Wide SIMD Lanes 61 X86 cores, each supporting 4 hardware threads Wide SIMD Lanes 512 bit SIMD lanes 16 floats/ints or 8 doubles Limited Shared Main Memory per Core Only a few GBs of main memory 9/20/2018
Graph Applications Popular Irregular Classical algorithms Graph mining Route problems Graph mining Social network Irregular Hard to utilize thread level parallelism Loop dependency Load balance Hard to utilize SIMD Random access 9/20/2018
Vertex Oriented Programming Models E.g., Google’s Pregel Model Follows BSP Model Concurrent local processing Communication Synchronization Uses Message Passing to Abstract Graph Applications Simple to Use User Defined functions Expresse sequential logics: message generation, message processing, and vertex update General Enough to Specify Common Graph Algorithms 9/20/2018
The Challenges Xeon Phi Specific Utilizing CPU at the Same Time Memory Access and Contention Overhead among Threads Irregular memory access Locking leads to contention among threads Difficult to Automatically Utilize SIMD Complex SSE programming Load Imbalance Associated processing to different vertices vary Need to keep as many as possible cores busy Utilizing CPU at the Same Time Graph partitioning between devices 9/20/2018
Programming Interface 9/20/2018
Programming Interface Keeps the Simplicity of Pregel Users Express Application Logic through Message Passing message generation Active vertices send messages along neighbor links message processing Process the received messages for each vertex vertex update Update the vertex value using the processing result of the previous step 9/20/2018
Programming Interface: SSSP Example ∞ 1 2 2 1 1 1 2 1 2 ∞ 4 4 5 1 4 5 1 2 ∞ 4 2 ∞ 4 1 1 1 1 6 6 3 3 3 3 ∞ ∞ ∞ 1 1 1 2 2 1 1 1 2 3 1 2 3 1 4 5 4 2 2 1 4 5 4 2 2 1 1 1 1 6 6 3 3 3 3 4 4 1 1 active inactive 9/20/2018
Programming Interface: SSSP Example Vector types with overloaded vector operations // 1. Message generation void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) { float my_dist = g->vertex_value[vertex_id]; // Graph is in CSR format. for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) { send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]); } User defined functions: // 3. Vertex update void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id){ // Relaxation. if (msg < g->vertex_value[vertex_id]) { g->vertex_value[vertex_id] = msg; // Distance reduced. Will send messages in the next step. g->active[vertex_id] = 1; } else { // Distance not changed. No msgs will be sent. g->active[vertex_id] = 0; } // 2. SIMD message processing void process_messages(vmsg_array<MessageValue> &vmsgs){ // Reduce the vector messages to vmsgs[0]. vfloat res = vmsgs[0]; for (int i = 1; i < vmsgs.size(); ++i) { res = min(res, vmsgs[i]); } vmsgs[0] = res; // 1. Message generation void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) { float my_dist = g->vertex_value[vertex_id]; // Graph is in CSR format. for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) { send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]); } // 3. Vertex update void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id){ // Relaxation. if (msg < g->vertex_value[vertex_id]) { g->vertex_value[vertex_id] = msg; // Distance reduced. Will send messages in the next step. g->active[vertex_id] = 1; } else { // Distance not changed. No msgs will be sent. g->active[vertex_id] = 0; } // 2. SIMD message processing void process_messages(vmsg_array<MessageValue> &vmsgs){ // Reduce the vector messages to vmsgs[0]. vfloat res = vmsgs[0]; for (int i = 1; i < vmsgs.size(); ++i) { res = min(res, vmsgs[i]); } vmsgs[0] = res; 9/20/2018
Runtime 9/20/2018
Workflow Message Generation Message Buffer Message Processing Vertex Update 9/20/2018
Message Buffer Design Condensed Static Buffer Vertex Grouping Pre-allocated space Avoids frequent memory allocation Vertices sorted according to in-degree Each vertex granted no less than in-degree msg slots Vertex Grouping Multiple consecutive vertices are grouped together Same group uses the same message array Length of the array equals the max in-degree in the group Width equals the number of vertices in the group Messages can be processed in SIMD For associative and commutative operations Moderate Memory Consumption Take care of power-law graphs 9/20/2018
Using Message Buffer One-to-one mapping Dynamic column allocation Not all vertices receive msgs Lots of bubbles waste SIMD lanes Dynamic column allocation All columns in a group are dynamically allocated Columns are consumed from left to right continuously 9/20/2018
Message Insertion Using Buffers Locking Based Concurrent writes to the same buffer column: use locks Contention overhead, especially for dense/power-law graphs Pipelining avoids locks Each worker thread gens msgs to its private msg queues num_msg_queue = num_mover_threads qid = dst_id % num_mover_threads Mover thread tid moves messages from queues qid = tid to msg buffer Each queue is accessed by one writer and one reader locking free Message insertion is locking-free Workers and movers run concurrently Pipelining using Workers + Movers 9/20/2018
Pipelining Pros Cons Suitable for Avoids locking Each queue is accessed by one worker(writer) and one mover(reader) Each msg buffer column is written by one mover Cons Introduces extra message storage cost Choice of optimal workers/movers configuration is non-trivial Suitable for Dense graphs Message intensive applications 9/20/2018
Inter-core Load Balancing Message Generation & Vertex Updating Basic task unit: a vertex Dynamically retrieved by working threads in chunks Message Processing Basic task unit: a message array in the message buffer Dynamically allocated to processing threads 9/20/2018
CPU-MIC Coprocessing Same runtime code is used on both CPU and MIC MPI symmetric computing Key Issue: graph partitioning Load balance Communication overhead 9/20/2018
CPU-MIC Coprocessing Load Balance Communication Overhead Dynamic load balancing not feasible High data movement cost Static load balancing more practical User indicates the workload ratio between CPU and MIC (user estimates the relative speeds of both devices) System does graph partitioning based on the ratio Communication Overhead The less cross edges, the better 9/20/2018
CPU-MIC Graph Partitioning Problem: partitioning the graph according to a ratio of a : b Continuous Partitioning (An Intuitive Way) Directly divide vertices according to the partitioning ratio (a : b) Load imbalance, many graphs are power-law graphs --> # edges assigned to devices do not follow the ratio Round-robin For every a + b vertices, assign first a vertices to CPU, and remaining b vertices to MIC Good load balance High communication overhead due to cross edges between devices Hybrid Approach Partition graph into min-connectivity blocks (using Metis) Assign blocks to CPU and MIC round-robinly in (a : b) ratio Good load balance and less messages between devices 9/20/2018
Experiments 9/20/2018
Experimental Results Platform Applications A heterogeneous CPU-MIC node CPU Xeon E5-2680, 16 cores, 2.70 GH, 63 GB Ram MIC Xeon Phi SE10P, 61 cores, 1.1 GHz, 4 hyperthreads/core, 8GB Mpic++ 4.1, built on top of icpc 13.1.0 Symmetric mode Applications PageRank, BFS, Semi-clustering (SC), SSSP, Topological Sort 9/20/2018
Execution Modes CPU OMP: Multi-core CPU using OpenMP MIC OMP: MIC execution with OpenMP CPU Lock: Multi-core CPU using framework, locking-based msg generation MIC Lock: MIC execution using our framework with locking-based msg generation CPU Pipe: Multi-core CPU using framework with pipelined msg generation MIC Pipe: MIC execution using our framework with pipelined msg generation (use 180 worker threads + 660 mover threads) CPU-MIC: Heterogeneous execution using CPU and MIC, with the best graph partitioning ratio Could not benefit from SIMD 9/20/2018
Overall Performance - PageRank CPU CPU Lock is 30% faster than CPU Pipe (CPU does not suffer from contention) CPU Lock is as fast as CPU OMP MIC MIC Pipe is 2.32x faster than MIC Lock (MIC has much larger number of threads) MIC Pipe is 1.84x faster than MIC OMP CPU-MIC 1.30x faster than MIC Pipe 9/20/2018
Overall Performance - BFS CPU CPU Lock is 1.45x faster than CPU Pipe CPU OMP is 1.07x faster than CPU Lock MIC MIC Lock is 1.22x faster than MIC Pipe (BFS is not message intensive) MIC Lock is 1.5x faster than MIC OMP CPU-MIC 1.32x faster than CPU Lock 9/20/2018
Overall Performance - TopoSort CPU CPU Lock is 1.58x faster than CPU Pipe CPU OMP is 1.04x faster than CPU Lock MIC MIC Pipe is 3.36x faster than MIC Lock (TopoSort uses a dense graph - message intensive, contention intensive) MIC Pipe is 4.15x faster than MIC OMP CPU-MIC 1.2x faster than MIC Pipe 9/20/2018
Overall Performance CPU prefers Locking-based msg generation Less threads less contention Lower bandwidth, higher msg storage overhead for pipelining MIC prefers Pipelined msg generation For message-intensive applications (e.g., PageRank and TopoSort) More threads high contention Higher parallel bandwidth, less IO overhead for pipelining. More threads hide memory latency 9/20/2018
Benefit from SIMD Three of the five applications involve reductions (in msg processing sub-step) Sub-step speedup: CPU: 2.22x – 2.35x MIC: 5.16x – 7.85x Overall speedup (depends on relative amount of time of the msg processing step): CPU: 1.08x – 1.13x MIC: 1.18x – 1.23x 9/20/2018
Effect of Hybrid Graph Partitioning Partitioning ratio was chosen according to relative performance of single device executions Hybrid partitioning Communication time As low as Continuous partitioning Due to less cross edges between devices Execution time As low as round-robin partitioning Due to more balanced workload 9/20/2018
Summary Graph Processing Framework over CPU and MIC Condensed Static Buffer Moderate memory consumption Support efficient SIMD message processing Pipelining Execution Flow Overlaps message generation and message insertion Reduces locking for Xeon Phi in message insertion Hybrid Graph Partitioning Maintains load balance and low communication overhead 9/20/2018