Efficient and Simplified Parallel Graph Processing over CPU and MIC

Efficient and Simplified Parallel Graph Processing over CPU and MIC
Linchuan Chen, Xin Huo, Bin Ren, Surabhi Jain and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

The System
A vertex-oriented parallel graph programming system over CPU and Xeon Phi

Intel Xeon Phi
- Large-scale parallelism: 61 x86 cores, each supporting 4 hardware threads
- Wide SIMD lanes: 512-bit SIMD lanes, holding 16 floats/ints or 8 doubles
- Limited shared main memory per core: only a few GBs of main memory

Graph Applications
Popular:
- Classical algorithms, graph mining, route problems, social networks
Irregular:
- Hard to utilize thread-level parallelism: loop dependencies, load balancing
- Hard to utilize SIMD: random access

Vertex-Oriented Programming Models
E.g., Google's Pregel model
- Follows the BSP model: concurrent local processing, communication, synchronization
- Uses message passing to abstract graph applications
- Simple to use: user-defined functions express the sequential logic of message generation, message processing, and vertex update
- General enough to specify common graph algorithms

The Challenges
Xeon Phi specific:
- Memory access and contention overhead among threads: irregular memory access; locking leads to contention among threads
- Difficult to automatically utilize SIMD: complex SSE programming
- Load imbalance: the processing associated with different vertices varies; need to keep as many cores as possible busy
Utilizing the CPU at the same time:
- Graph partitioning between devices

Programming Interface

Programming Interface
Keeps the simplicity of Pregel. Users express application logic through message passing:
- Message generation: active vertices send messages along neighbor links
- Message processing: process the received messages for each vertex
- Vertex update: update the vertex value using the processing result of the previous step

Programming Interface: SSSP Example
[Figure: an example graph for single-source shortest path, with edge weights on the links and per-vertex distance values (initially ∞); successive supersteps show distances being relaxed and vertices toggling between active and inactive.]

Programming Interface: SSSP Example
User-defined functions (message processing uses vector types with overloaded vector operations):

// 1. Message generation
void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) {
  float my_dist = g->vertex_value[vertex_id];
  // Graph is in CSR format.
  for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) {
    send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]);
  }
}

// 2. SIMD message processing
void process_messages(vmsg_array<MessageValue> &vmsgs) {
  // Reduce the vector messages to vmsgs[0].
  vfloat res = vmsgs[0];
  for (int i = 1; i < vmsgs.size(); ++i) {
    res = min(res, vmsgs[i]);
  }
  vmsgs[0] = res;
}

// 3. Vertex update
void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id) {
  // Relaxation.
  if (msg < g->vertex_value[vertex_id]) {
    g->vertex_value[vertex_id] = msg;
    // Distance reduced. Will send messages in the next step.
    g->active[vertex_id] = 1;
  } else {
    // Distance not changed. No messages will be sent.
    g->active[vertex_id] = 0;
  }
}

Runtime

Workflow
Message Generation → Message Buffer → Message Processing → Vertex Update
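To make the flow concrete, here is a minimal, self-contained sketch of one superstep of this workflow for SSSP over a CSR graph. It uses per-vertex std::vector message lists as a stand-in for the framework's condensed static buffer, so the names (CSRGraph, inbox, sssp_superstep) are illustrative rather than the framework's actual API.

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct CSRGraph {
    std::vector<size_t> row_ptr;   // size: num_vertices + 1
    std::vector<size_t> col_idx;   // destination vertex of each edge
    std::vector<float>  weight;    // edge weights
    std::vector<float>  dist;      // per-vertex value (distance from the source)
    std::vector<char>   active;    // 1 if the vertex sends messages this superstep
};

// Runs one superstep: message generation -> message buffer -> message
// processing -> vertex update. Returns true while any vertex is still active.
bool sssp_superstep(CSRGraph &g) {
    const size_t n = g.row_ptr.size() - 1;
    std::vector<std::vector<float>> inbox(n);          // message buffer stand-in

    // 1. Message generation: active vertices send dist + weight along out-edges.
    for (size_t u = 0; u < n; ++u) {
        if (!g.active[u]) continue;
        for (size_t e = g.row_ptr[u]; e < g.row_ptr[u + 1]; ++e)
            inbox[g.col_idx[e]].push_back(g.dist[u] + g.weight[e]);
    }

    // 2. Message processing: reduce each vertex's messages (min for SSSP).
    // 3. Vertex update: relax the distance and re-activate the vertex if it changed.
    bool any_active = false;
    for (size_t v = 0; v < n; ++v) {
        g.active[v] = 0;
        if (inbox[v].empty()) continue;
        float best = std::numeric_limits<float>::infinity();
        for (float m : inbox[v]) best = std::min(best, m);
        if (best < g.dist[v]) {
            g.dist[v] = best;
            g.active[v] = 1;
            any_active = true;
        }
    }
    return any_active;   // the runtime repeats supersteps until this is false
}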

Message Buffer Design: Condensed Static Buffer
Pre-allocated space:
- Avoids frequent memory allocation
- Vertices are sorted according to in-degree; each vertex is granted no fewer than in-degree message slots
Vertex grouping:
- Multiple consecutive vertices are grouped together; the same group uses the same message array
- The length of the array equals the maximum in-degree in the group; the width equals the number of vertices in the group
- Messages can be processed in SIMD (for associative and commutative operations)
Moderate memory consumption:
- Handles power-law graphs
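The group layout can be computed up front from the in-degrees. The sketch below is one possible way to derive it, assuming vertices are sorted by in-degree and split into fixed-width groups (the width would typically match the SIMD width, though that is an assumption here); each group's slab is max-in-degree rows by group-width columns. The names (GroupLayout, build_layout) are illustrative, not the framework's data structures.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct GroupLayout {
    std::vector<size_t> sorted_ids;     // vertex ids ordered by in-degree
    std::vector<size_t> group_offset;   // start of each group's slab in the buffer
    size_t group_width;                 // vertices per group (e.g., SIMD width)
    size_t total_slots;                 // pre-allocated message slots overall
};

GroupLayout build_layout(const std::vector<size_t> &in_degree, size_t group_width) {
    GroupLayout L;
    L.group_width = group_width;
    L.sorted_ids.resize(in_degree.size());
    std::iota(L.sorted_ids.begin(), L.sorted_ids.end(), 0);
    // Sorting by in-degree keeps vertices with similar degrees in the same group,
    // which limits wasted slots in each group's slab.
    std::sort(L.sorted_ids.begin(), L.sorted_ids.end(),
              [&](size_t a, size_t b) { return in_degree[a] < in_degree[b]; });

    L.total_slots = 0;
    for (size_t g = 0; g < L.sorted_ids.size(); g += group_width) {
        L.group_offset.push_back(L.total_slots);
        size_t end = std::min(g + group_width, L.sorted_ids.size());
        size_t max_deg = 0;                      // rows needed by this group
        for (size_t i = g; i < end; ++i)
            max_deg = std::max(max_deg, in_degree[L.sorted_ids[i]]);
        L.total_slots += max_deg * group_width;  // slab: max_deg rows x width cols
    }
    return L;
}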

Using the Message Buffer
One-to-one mapping:
- Not all vertices receive messages, so many bubbles waste SIMD lanes
Dynamic column allocation:
- All columns in a group are dynamically allocated
- Columns are consumed continuously from left to right
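A minimal sketch of the dynamic column allocation idea follows. It assumes each destination vertex is written by exactly one thread (as in the mover-based design on the next slide), so a plain atomic counter per group suffices; the names are hypothetical.

#include <atomic>
#include <cstddef>
#include <vector>

constexpr size_t NO_COLUMN = static_cast<size_t>(-1);

struct MessageGroup {
    std::atomic<size_t> next_col{0};     // columns are handed out left to right
};

// Returns the buffer column for 'vertex', claiming the next free column of its
// group on the first message. 'assigned' is reset to NO_COLUMN every superstep,
// so only vertices that actually receive messages occupy columns (no bubbles).
size_t claim_column(size_t vertex, std::vector<size_t> &assigned, MessageGroup &group) {
    if (assigned[vertex] == NO_COLUMN)
        assigned[vertex] = group.next_col.fetch_add(1);
    return assigned[vertex];
}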

Message Insertion into the Buffer
Locking-based:
- Concurrent writes to the same buffer column use locks
- Contention overhead, especially for dense/power-law graphs
Pipelining using workers + movers (avoids locks):
- Each worker thread generates messages into its private message queues, one queue per mover thread: qid = dst_id % num_mover_threads
- Mover thread tid moves messages from the queues with qid = tid into the message buffer
- Each queue is accessed by one writer and one reader, so message insertion is lock-free
- Workers and movers run concurrently
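This pipelining scheme can be built on single-producer/single-consumer queues, since each queue has exactly one worker writing and one mover reading. Below is a minimal sketch of such a queue plus the routing rule qid = dst_id % num_mover_threads; it is an illustration of the idea, not the framework's actual implementation, and the names (Msg, SpscQueue, queue_of) are hypothetical.

#include <atomic>
#include <cstddef>
#include <vector>

struct Msg { size_t dst; float value; };

// Minimal single-producer/single-consumer ring buffer: exactly one worker
// pushes and exactly one mover pops, so no locks are needed.
class SpscQueue {
    std::vector<Msg> buf;
    std::atomic<size_t> head{0}, tail{0};   // head: consumer index, tail: producer index
public:
    explicit SpscQueue(size_t cap) : buf(cap) {}
    bool push(const Msg &m) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == buf.size()) return false;  // full
        buf[t % buf.size()] = m;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(Msg &m) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;               // empty
        m = buf[h % buf.size()];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

// Routing: worker w puts a message for vertex dst into its queue dst % num_movers.
// With all queues stored in one flat array of size num_workers * num_movers,
// mover t drains exactly the queues whose index satisfies index % num_movers == t,
// so every message buffer column ends up with a single writer.
inline size_t queue_of(size_t worker, size_t dst, size_t num_movers) {
    return worker * num_movers + dst % num_movers;
}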

Pipelining
Pros:
- Avoids locking: each queue is accessed by one worker (writer) and one mover (reader); each message buffer column is written by one mover
Cons:
- Introduces extra message storage cost
- The choice of an optimal worker/mover configuration is non-trivial
Suitable for:
- Dense graphs and message-intensive applications

Inter-core Load Balancing
Message generation and vertex update:
- Basic task unit: a vertex
- Vertices are dynamically retrieved by worker threads in chunks
Message processing:
- Basic task unit: a message array in the message buffer
- Arrays are dynamically allocated to processing threads
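The chunk-based dynamic retrieval can be expressed with a shared atomic counter that hands out chunks of vertex ids; each thread runs the loop below until the counter passes the end. This is a sketch of the general technique under that assumption; the chunk size and function names are illustrative. For the message-processing stage, the unit handed out would be a message array rather than a vertex.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>

// Each worker thread calls this with the same shared counter. fetch_add hands
// out non-overlapping [begin, end) chunks of vertices until none remain.
void dynamic_worker(std::atomic<size_t> &next, size_t num_vertices, size_t chunk,
                    const std::function<void(size_t)> &per_vertex) {
    for (;;) {
        size_t begin = next.fetch_add(chunk);
        if (begin >= num_vertices) break;
        size_t end = std::min(begin + chunk, num_vertices);
        for (size_t v = begin; v < end; ++v)
            per_vertex(v);   // e.g., generate messages for v, or update vertex v
    }
}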

CPU-MIC Coprocessing
- The same runtime code is used on both CPU and MIC (MPI symmetric computing)
- Key issue: graph partitioning, which affects load balance and communication overhead

CPU-MIC Coprocessing
Load balance:
- Dynamic load balancing is not feasible due to the high data movement cost
- Static load balancing is more practical: the user indicates the workload ratio between CPU and MIC (estimating the relative speeds of the two devices), and the system partitions the graph based on that ratio
Communication overhead:
- The fewer cross edges between devices, the better

CPU-MIC Graph Partitioning
Problem: partition the graph according to a ratio a : b.
Continuous partitioning (the intuitive way):
- Directly divide the vertices according to the partitioning ratio a : b
- Load imbalance: many graphs are power-law, so the numbers of edges assigned to the devices do not follow the ratio
Round-robin partitioning:
- For every a + b vertices, assign the first a vertices to the CPU and the remaining b vertices to the MIC
- Good load balance, but high communication overhead due to cross edges between devices
Hybrid approach:
- Partition the graph into minimum-connectivity blocks (using Metis)
- Assign the blocks to CPU and MIC in round-robin fashion at the a : b ratio
- Good load balance and fewer messages between devices
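A sketch of the round-robin a : b assignment is below; in the hybrid approach the same rule would be applied to Metis blocks instead of individual vertices. The Device enum and function name are illustrative, not part of the framework.

#include <cstddef>
#include <vector>

enum Device { CPU = 0, MIC = 1 };

// Out of every a + b consecutive units (vertices, or blocks for the hybrid
// scheme), the first a go to the CPU and the remaining b go to the MIC.
std::vector<Device> round_robin_partition(size_t num_units, size_t a, size_t b) {
    std::vector<Device> owner(num_units);
    for (size_t u = 0; u < num_units; ++u)
        owner[u] = (u % (a + b) < a) ? CPU : MIC;
    return owner;
}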

Experiments

Experimental Results
Platform: a heterogeneous CPU-MIC node
- CPU: Xeon E5-2680, 16 cores, 2.70 GHz, 63 GB RAM
- MIC: Xeon Phi SE10P, 61 cores, 1.1 GHz, 4 hyperthreads/core, 8 GB
- mpic++ 4.1, built on top of icpc 13.1.0, running in symmetric mode
Applications: PageRank, BFS, Semi-Clustering (SC), SSSP, Topological Sort

Execution Modes
- CPU OMP: multi-core CPU using OpenMP
- MIC OMP: MIC execution with OpenMP
- CPU Lock: multi-core CPU using our framework with locking-based message generation
- MIC Lock: MIC execution using our framework with locking-based message generation
- CPU Pipe: multi-core CPU using our framework with pipelined message generation
- MIC Pipe: MIC execution using our framework with pipelined message generation (180 worker threads + 660 mover threads)
- CPU-MIC: heterogeneous execution using CPU and MIC, with the best graph partitioning ratio
Note: the OpenMP baselines could not benefit from SIMD.

Overall Performance - PageRank
CPU:
- CPU Lock is 30% faster than CPU Pipe (the CPU does not suffer from contention)
- CPU Lock is as fast as CPU OMP
MIC:
- MIC Pipe is 2.32x faster than MIC Lock (the MIC has a much larger number of threads)
- MIC Pipe is 1.84x faster than MIC OMP
CPU-MIC:
- 1.30x faster than MIC Pipe

Overall Performance - BFS
CPU:
- CPU Lock is 1.45x faster than CPU Pipe
- CPU OMP is 1.07x faster than CPU Lock
MIC:
- MIC Lock is 1.22x faster than MIC Pipe (BFS is not message intensive)
- MIC Lock is 1.5x faster than MIC OMP
CPU-MIC:
- 1.32x faster than CPU Lock

Overall Performance - TopoSort
CPU:
- CPU Lock is 1.58x faster than CPU Pipe
- CPU OMP is 1.04x faster than CPU Lock
MIC:
- MIC Pipe is 3.36x faster than MIC Lock (TopoSort uses a dense graph: message intensive and contention intensive)
- MIC Pipe is 4.15x faster than MIC OMP
CPU-MIC:
- 1.2x faster than MIC Pipe

Overall Performance
The CPU prefers locking-based message generation:
- Fewer threads → less contention
- Lower bandwidth and higher message storage overhead for pipelining
The MIC prefers pipelined message generation for message-intensive applications (e.g., PageRank and TopoSort):
- More threads → high contention under locking
- Higher parallel bandwidth and lower I/O overhead for pipelining; more threads hide memory latency

Benefit from SIMD
Three of the five applications involve reductions (in the message processing sub-step).
Sub-step speedup: 2.22x – 2.35x on the CPU, 5.16x – 7.85x on the MIC
Overall speedup (depends on the relative amount of time spent in the message processing step): 1.08x – 1.13x on the CPU, 1.18x – 1.23x on the MIC

Effect of Hybrid Graph Partitioning
The partitioning ratio was chosen according to the relative performance of single-device executions.
Hybrid partitioning achieves:
- Communication time as low as continuous partitioning, due to fewer cross edges between devices
- Execution time as low as round-robin partitioning, due to a more balanced workload

Summary
A graph processing framework over CPU and MIC:
- Condensed static buffer: moderate memory consumption; supports efficient SIMD message processing
- Pipelined execution flow: overlaps message generation and message insertion; reduces locking on the Xeon Phi during message insertion
- Hybrid graph partitioning: maintains load balance and low communication overhead