Efficient and Simplified Parallel Graph Processing over CPU and MIC
Linchuan Chen, Xin Huo, Bin Ren, Surabhi Jain, and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
The System
A Vertex-Oriented Parallel Graph Programming System over CPU and Xeon Phi
Intel Xeon Phi
- Large-scale parallelism: 61 x86 cores, each supporting 4 hardware threads
- Wide SIMD lanes: 512-bit SIMD lanes hold 16 floats/ints or 8 doubles
- Limited shared main memory: only a few GBs of main memory
Graph Applications
- Popular: classical algorithms, route problems, graph mining, social networks
- Irregular:
  - Hard to utilize thread-level parallelism: loop dependencies, load balance
  - Hard to utilize SIMD: random access
Vertex-Oriented Programming Models
E.g., Google's Pregel model:
- Follows the BSP model: concurrent local processing, communication, synchronization
- Uses message passing to abstract graph applications
- Simple to use: user-defined functions express the sequential logic of message generation, message processing, and vertex update
- General enough to specify common graph algorithms
The Challenges
Xeon Phi specific:
- Memory access and contention overhead among threads: irregular memory accesses; locking leads to contention among threads
- Difficult to utilize SIMD automatically: complex SSE-style programming
- Load imbalance: the processing associated with different vertices varies, and as many cores as possible must be kept busy
Utilizing the CPU at the same time:
- Graph partitioning between the devices
Programming Interface
Keeps the simplicity of Pregel. Users express application logic through message passing:
- Message generation: active vertices send messages along their neighbor links
- Message processing: the received messages are processed for each vertex
- Vertex update: the vertex value is updated using the processing result of the previous step
Programming Interface: SSSP Example
[Figure: SSSP on a small weighted example graph; distance labels relax from ∞ to the final shortest distances over successive supersteps, with active and inactive vertices marked.]
Programming Interface: SSSP Example
Vector types come with overloaded vector operations. User-defined functions:

// 1. Message generation
void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) {
    float my_dist = g->vertex_value[vertex_id];
    // The graph is in CSR format.
    for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) {
        send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]);
    }
}

// 2. SIMD message processing
void process_messages(vmsg_array<MessageValue> &vmsgs) {
    // Reduce the vector messages into vmsgs[0].
    vfloat res = vmsgs[0];
    for (int i = 1; i < vmsgs.size(); ++i) {
        res = min(res, vmsgs[i]);
    }
    vmsgs[0] = res;
}

// 3. Vertex update
void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id) {
    // Relaxation.
    if (msg < g->vertex_value[vertex_id]) {
        g->vertex_value[vertex_id] = msg;
        // Distance reduced: messages will be sent in the next step.
        g->active[vertex_id] = 1;
    } else {
        // Distance unchanged: no messages will be sent.
        g->active[vertex_id] = 0;
    }
}
Runtime
Workflow
Message Generation → Message Buffer → Message Processing → Vertex Update (sketched below)
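To make the workflow concrete, here is a minimal sketch of a per-superstep driver loop chaining the three user-defined functions. The superstep helper, the Graph/MsgBuffer members (num_vertices, active, num_groups, vmsgs, has_message, result), and the chunk sizes are illustrative assumptions, not the framework's actual API; only generate_messages, process_messages, and update_vertex come from the programming interface.

#include <cstddef>

// Hypothetical driver for one BSP superstep over the message buffer.
template <typename Graph, typename MsgBuffer>
void superstep(Graph *g, MsgBuffer &buf) {
    // 1. Message generation: every active vertex sends along its out-edges.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long v = 0; v < (long) g->num_vertices; ++v)
        if (g->active[v])
            generate_messages((size_t) v, g);   // user function; fills buf

    // 2. Message processing: reduce each group's message array in SIMD.
    #pragma omp parallel for schedule(dynamic)
    for (long grp = 0; grp < (long) buf.num_groups(); ++grp)
        process_messages(buf.vmsgs(grp));       // user function; result in row 0

    // 3. Vertex update: apply each vertex's reduced message.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long v = 0; v < (long) g->num_vertices; ++v)
        if (buf.has_message(v))
            update_vertex(buf.result(v), g, (size_t) v);  // user function
}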
Message Buffer Design
Condensed static buffer:
- Pre-allocated space avoids frequent memory allocation
- Vertices are sorted by in-degree, and each vertex is granted at least as many message slots as its in-degree
Vertex grouping (see the data-layout sketch below):
- Multiple consecutive vertices are grouped together, and a group shares one message array
- The array's length equals the maximum in-degree in the group; its width equals the number of vertices in the group
- Messages can be processed in SIMD (for associative and commutative operations)
Moderate memory consumption, which takes care of power-law graphs
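A minimal data-layout sketch of such a buffer follows. The type and field names are assumptions, but the shape mirrors the slide: each group owns one pre-allocated 2-D array whose width is the group size and whose length is the group's maximum in-degree, stored row-major so that one row holds 16 messages a 512-bit SIMD instruction can reduce at once.

#include <cstddef>
#include <vector>

// Sketch of a condensed static message buffer (names are assumptions).
template <typename MessageValue>
struct CondensedStaticBuffer {
    static const size_t GROUP_WIDTH = 16;   // 512-bit lanes / 32-bit messages

    struct Group {
        size_t first_vertex;    // first vertex id (after sorting by in-degree)
        size_t max_in_degree;   // number of rows in this group's array
        size_t base;            // offset of the group's array within `slots`
    };

    std::vector<Group>        groups;
    std::vector<MessageValue> slots;    // allocated once, reused every superstep

    // Slot for the r-th message of the vertex in column c of group grp.
    // Row-major: a row is 16 consecutive messages, i.e., one SIMD vector.
    MessageValue &slot(size_t grp, size_t r, size_t c) {
        return slots[groups[grp].base + r * GROUP_WIDTH + c];
    }
};

Because vertices of similar in-degree land in the same group, the pre-allocated arrays stay compact even on power-law graphs.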
Using the Message Buffer
One-to-one mapping:
- Not all vertices receive messages, so many empty slots ("bubbles") waste SIMD lanes
Dynamic column allocation:
- All columns in a group are dynamically allocated
- Columns are consumed contiguously from left to right (see the sketch below)
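The slides do not spell out the allocation mechanism; one plausible sketch (the names and the per-group atomic cursor are assumptions) lets a vertex claim the next free column the first time it receives a message, so occupied columns stay contiguous on the left and SIMD lanes see no bubbles.

#include <atomic>

// Hypothetical per-group column allocator for the condensed buffer.
struct GroupColumns {
    std::atomic<unsigned> next_col{0};  // columns consumed left to right
    int owner[16];                      // owner[c] = vertex holding column c

    // Called the first time a message arrives for vertex_id in this group;
    // grouping guarantees at most 16 distinct receivers, so no bounds check.
    unsigned claim(int vertex_id) {
        unsigned c = next_col.fetch_add(1, std::memory_order_relaxed);
        owner[c] = vertex_id;           // mapping used later by vertex update
        return c;
    }
};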
Message Insertion Using Buffers
Locking based:
- Concurrent writes to the same buffer column are protected by locks
- Contention overhead, especially for dense/power-law graphs
Pipelining (workers + movers) avoids locks:
- Each worker thread generates messages into its private message queues, with num_msg_queues = num_mover_threads and qid = dst_id % num_mover_threads
- Mover thread tid moves messages from the queues with qid = tid into the message buffer
- Each queue is accessed by one writer and one reader, so message insertion is lock-free (see the queue sketch below)
- Workers and movers run concurrently
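Because each queue has exactly one writer (a worker) and one reader (a mover), a classic single-producer/single-consumer ring buffer is enough; below is a minimal sketch, with the class name and capacity chosen for illustration rather than taken from the framework.

#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer: lock-free because
// exactly one worker pushes and exactly one mover pops.
template <typename Msg, size_t N>
class SpscQueue {
    Msg buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};   // head: reader; tail: writer
public:
    bool push(const Msg &m) {                 // called by the worker only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;
        buf_[t % N] = m;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(Msg &m) {                        // called by the mover only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        m = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};

// Routing from the slide: worker w pushes a message for vertex dst into
// queues[w][dst % num_movers]; mover t drains queues[*][t] into the buffer.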
Pipelining
Pros:
- Avoids locking: each queue is accessed by one worker (writer) and one mover (reader), and each message buffer column is written by one mover
Cons:
- Introduces extra message storage cost
- Choosing the optimal worker/mover configuration is non-trivial
Suitable for:
- Dense graphs
- Message-intensive applications
Inter-core Load Balancing
Message generation & vertex update:
- Basic task unit: a vertex
- Tasks are dynamically retrieved by worker threads in chunks (see the sketch below)
Message processing:
- Basic task unit: a message array in the message buffer
- Tasks are dynamically allocated to processing threads
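The chunked dynamic retrieval can be pictured as a shared atomic cursor; the sketch below (the names and chunk size are assumptions) shows how threads grab vertex ranges so that faster threads naturally take more work.

#include <algorithm>
#include <atomic>
#include <cstddef>

// Sketch of dynamic task retrieval: threads repeatedly grab the next chunk
// of task units (vertices, or message arrays with chunk = 1) until none remain.
struct ChunkScheduler {
    std::atomic<size_t> next{0};
    size_t total;   // number of task units
    size_t chunk;   // task units handed out per grab, e.g. 64 vertices

    ChunkScheduler(size_t n, size_t c) : total(n), chunk(c) {}

    // Returns false once all tasks have been handed out.
    bool get(size_t &begin, size_t &end) {
        size_t b = next.fetch_add(chunk, std::memory_order_relaxed);
        if (b >= total) return false;
        begin = b;
        end = std::min(b + chunk, total);
        return true;
    }
};

// Each worker loops: while (sched.get(b, e)) { process vertices b..e-1; }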
CPU-MIC Coprocessing
- The same runtime code runs on both the CPU and the MIC
- MPI symmetric computing
- Key issue: graph partitioning (load balance, communication overhead)
CPU-MIC Coprocessing
Load balance:
- Dynamic load balancing is not feasible: the data movement cost is too high
- Static load balancing is more practical: the user indicates the workload ratio between the CPU and the MIC (estimating the relative speeds of the two devices), and the system partitions the graph based on that ratio
Communication overhead:
- The fewer cross edges, the better
CPU-MIC Graph Partitioning
Problem: partition the graph according to a ratio a : b.
Continuous partitioning (an intuitive way):
- Directly divide the vertices according to the partitioning ratio (a : b)
- Load imbalance: many graphs are power-law graphs, so the numbers of edges assigned to the devices do not follow the ratio
Round-robin partitioning:
- For every a + b vertices, assign the first a vertices to the CPU and the remaining b vertices to the MIC
- Good load balance, but high communication overhead due to cross edges between devices
Hybrid approach (see the sketch below):
- Partition the graph into minimum-connectivity blocks (using Metis)
- Assign blocks to the CPU and the MIC round-robin in the a : b ratio
- Good load balance and fewer messages between devices
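Assuming a Metis-style partitioner has already produced the min-connectivity blocks (the helper below does not call METIS itself, and its name is hypothetical), the hybrid assignment simply deals the blocks out round-robin in the a : b ratio:

#include <cstddef>
#include <vector>

// Sketch of the hybrid CPU-MIC assignment: within every window of a + b
// consecutive blocks, the first a go to the CPU and the remaining b to
// the MIC. Returns device[block] = 0 for CPU, 1 for MIC.
std::vector<int> assign_blocks(size_t num_blocks, size_t a, size_t b) {
    std::vector<int> device(num_blocks);
    for (size_t blk = 0; blk < num_blocks; ++blk)
        device[blk] = (blk % (a + b) < a) ? 0 : 1;
    return device;
}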
Experiments
Experimental Results
Platform: a heterogeneous CPU-MIC node
- CPU: Xeon E5-2680, 16 cores, 2.70 GHz, 63 GB RAM
- MIC: Xeon Phi SE10P, 61 cores, 1.1 GHz, 4 hyperthreads per core, 8 GB RAM
- mpic++ 4.1, built on top of icpc; symmetric mode
Applications: PageRank, BFS, Semi-Clustering (SC), SSSP, Topological Sort
Execution Modes
- CPU OMP: multi-core CPU using OpenMP
- MIC OMP: MIC execution using OpenMP
- CPU Lock: multi-core CPU using our framework with locking-based message generation
- MIC Lock: MIC execution using our framework with locking-based message generation
- CPU Pipe: multi-core CPU using our framework with pipelined message generation
- MIC Pipe: MIC execution using our framework with pipelined message generation (180 worker threads + 660 mover threads)
- CPU-MIC: heterogeneous execution using both CPU and MIC, with the best graph partitioning ratio
The OpenMP versions could not benefit from SIMD message processing.
Overall Performance: PageRank
CPU:
- CPU Lock is 30% faster than CPU Pipe (the CPU does not suffer from contention)
- CPU Lock is as fast as CPU OMP
MIC:
- MIC Pipe is 2.32x faster than MIC Lock (the MIC has a much larger number of threads)
- MIC Pipe is 1.84x faster than MIC OMP
CPU-MIC:
- 1.30x faster than MIC Pipe
Overall Performance: BFS
CPU:
- CPU Lock is 1.45x faster than CPU Pipe
- CPU OMP is 1.07x faster than CPU Lock
MIC:
- MIC Lock is 1.22x faster than MIC Pipe (BFS is not message intensive)
- MIC Lock is 1.5x faster than MIC OMP
CPU-MIC:
- 1.32x faster than CPU Lock
Overall Performance: Topological Sort
CPU:
- CPU Lock is 1.58x faster than CPU Pipe
- CPU OMP is 1.04x faster than CPU Lock
MIC:
- MIC Pipe is 3.36x faster than MIC Lock (TopoSort uses a dense graph, which is message intensive and contention intensive)
- MIC Pipe is 4.15x faster than MIC OMP
CPU-MIC:
- 1.2x faster than MIC Pipe
Overall Performance: Summary
The CPU prefers locking-based message generation:
- Fewer threads mean less contention
- Lower bandwidth and higher message storage overhead penalize pipelining
The MIC prefers pipelined message generation for message-intensive applications (e.g., PageRank and TopoSort):
- More threads mean high contention under locking
- Higher parallel bandwidth and lower I/O overhead favor pipelining
- More threads hide memory latency
Benefit from SIMD
Three of the five applications involve reductions in the message processing sub-step.
Sub-step speedup:
- CPU: 2.22x – 2.35x
- MIC: 5.16x – 7.85x
Overall speedup (depends on the relative amount of time spent in the message processing step):
- CPU: 1.08x – 1.13x
- MIC: 1.18x – 1.23x
Effect of Hybrid Graph Partitioning
The partitioning ratio was chosen according to the relative performance of single-device executions.
Hybrid partitioning achieves:
- Communication time as low as continuous partitioning, due to fewer cross edges between devices
- Execution time as low as round-robin partitioning, due to a more balanced workload
Summary
A graph processing framework over CPU and MIC:
- Condensed static buffer: moderate memory consumption; supports efficient SIMD message processing
- Pipelined execution flow: overlaps message generation and message insertion; reduces locking in message insertion on the Xeon Phi
- Hybrid graph partitioning: maintains load balance and low communication overhead