Presentation transcript:

If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Our ‘field’ to plow: graph processing. |V| = 1.4B, |E| = 6.6B

Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu. NetSysLab, The University of British Columbia.

Graph Processing: The Challenges
- Data-dependent memory access patterns: poor locality
- Large memory footprint (>128GB)
- Varying degrees of parallelism (both intra- and inter-stage)
- Low compute-to-memory-access ratio
What CPUs offer: large memory, plus caches + summary data structures to mitigate the poor locality.

Graph Processing: The GPU Opportunity
The same challenges: data-dependent memory access patterns (poor locality), a large memory footprint, varying degrees of parallelism (both intra- and inter-stage), and a low compute-to-memory-access ratio. What GPUs offer: massive hardware multithreading, plus caches + summary data structures. But only 6GB of memory! Hence: assemble a heterogeneous platform.

Motivating Question: can we efficiently use hybrid systems for large-scale graph processing? YES WE CAN! 2x speedup (8 billion edges).

Methodology
- Performance model: predicts speedup; intuitive
- Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
- Evaluation: predicted vs. achieved; hybrid vs. symmetric

The Performance Model (I)
Goal: predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host). The model's parameters: α, the fraction of edges left on the host; β, the fraction of boundary edges (edges that cross the partition and generate communication); r_cpu, the host's processing rate in edges per second; and c, the communication rate over a bus of bandwidth b.

The Performance Model (II)
Example instantiation for |V| = 32M, |E| = 1B: r_cpu = 0.5 BEPS (the best reported single-node BFS performance [Agarwal, V. 2010]); β = 20%, a worst case (e.g., a bipartite graph). Assuming a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes, c = b/m = 1 billion EPS. Takeaway: it is beneficial to process the graph on a hybrid system if the communication overhead is low.
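As a concrete illustration, here is a minimal sketch of how such a prediction could be computed. The slides' closed-form expression is not reproduced in this transcript, so the formula below is an assumption: hybrid time is modeled as the host processing its α share of the edges plus β|E| edges crossing the bus at rate c, with the GPU assumed not to be the bottleneck.

```python
def predicted_speedup(alpha, beta, r_cpu, c):
    """Predicted speedup of hybrid (CPU + GPU) over host-only processing.

    Assumed model (not quoted from the slides): hybrid time per edge is the
    host's share alpha/r_cpu plus communication beta/c; host-only time per
    edge is 1/r_cpu. alpha and beta are fractions of |E|; r_cpu and c are
    rates in edges per second.
    """
    host_only_time = 1.0 / r_cpu
    hybrid_time = alpha / r_cpu + beta / c
    return host_only_time / hybrid_time

# Numbers from the slide: r_cpu = 0.5 BEPS, c = 1 BEPS, beta = 20%,
# with half the edges (alpha = 0.5) kept on the host:
print(predicted_speedup(alpha=0.5, beta=0.2, r_cpu=0.5e9, c=1e9))  # ~1.67x
```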

Totem: Programming Model
Bulk Synchronous Parallel:
- rounds of computation and communication phases
- updates to remote vertices are delivered in the next round
- partitions vote to terminate execution

Totem: A BSP-based Engine
- Partitions are stored in compressed sparse row representation
- Computation: the kernel manipulates local state; updates to remote vertices are aggregated locally
- Communication, step 1: transfer the outbox buffer to the remote input buffer
- Communication, step 2: merge with local state
A sketch of one such round follows below.
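To make the round structure concrete, here is a minimal, illustrative model of one BSP superstep in a Totem-style engine. Totem itself is implemented in C/CUDA; every name below is hypothetical.

```python
# Hypothetical sketch of one BSP superstep in a Totem-style hybrid engine.

class Partition:
    def __init__(self, pid, num_partitions):
        self.pid = pid
        self.state = {}                      # vertex id -> algorithm state
        self.outbox = {p: [] for p in range(num_partitions) if p != pid}
        self.inbox = []
        self.done = False

    def compute(self):
        # Algorithm-specific kernel goes here (e.g., one BFS level);
        # updates to remote vertices are appended to the outboxes.
        pass

    def merge_inbox(self):
        for vid, value in self.inbox:
            self.state[vid] = value          # algorithm-specific merge
        self.inbox.clear()

def bsp_round(partitions):
    for p in partitions:                     # 1. computation phase
        p.compute()
    for p in partitions:                     # 2. communication phase
        for dest, msgs in p.outbox.items():
            partitions[dest].inbox.extend(msgs)
            msgs.clear()
    for p in partitions:                     # 3. merge received updates
        p.merge_inbox()
    return all(p.done for p in partitions)   # partitions vote to terminate
```

Because remote updates sit in the outbox until the communication phase, they reach their destination only in the next round, which is exactly the delivery guarantee stated in the programming model above.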

The Aggregation Opportunity
Real-world graphs are mostly scale-free, i.e., they have a skewed degree distribution, so many boundary edges point to the same few high-degree remote vertices and their updates can be combined locally before transfer. (Chart: random graphs, |E| = 512 million.) For a sparse graph, aggregation yields a ~5x reduction in communication volume; a denser graph has a better opportunity for aggregation: ~50x reduction.
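A minimal sketch of the aggregation idea (illustrative only, not Totem's actual code): updates addressed to the same remote vertex are combined with an algorithm-specific operator before crossing the PCI-E bus.

```python
from collections import defaultdict

def aggregate_outbox(updates, combine, identity):
    """updates: iterable of (remote_vertex_id, value) pairs.
    combine: algorithm-specific operator, e.g. addition for PageRank
    contributions or min for BFS levels; identity is its neutral element."""
    box = defaultdict(lambda: identity)
    for vid, value in updates:
        box[vid] = combine(box[vid], value)
    return dict(box)

# Three PageRank contributions to remote vertex 7 collapse into one message:
msgs = aggregate_outbox([(7, 0.1), (7, 0.3), (9, 0.2), (7, 0.05)],
                        combine=lambda a, b: a + b, identity=0.0)
print(msgs)  # {7: 0.45, 9: 0.2} (modulo float rounding)
```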

Evaluation Setup
- Workload: R-MAT graphs; |V| = 32M, |E| = 1B, unless otherwise noted
- Algorithms: breadth-first search, PageRank
- Metrics: speedup compared to processing on the host only
- Testbed: host: dual-socket Intel Xeon with 16GB; GPU: Nvidia Tesla C2050 with 3GB

Predicted vs. Achieved Speedup
Speedup is linear with respect to the offloaded part, until the GPU partition fills GPU memory. After aggregation, β = 2%; such a low value is critical for BFS.

Breakdown of Execution Time
PageRank is dominated by the compute phase. Aggregation significantly reduced the communication overhead. The GPU is >5x faster than the host.

So far…
- Performance modeling: simple; useful for initial system provisioning
- Totem: a generic graph processing framework; algorithm-agnostic optimizations
- Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS on a dual-socket, dual-GPU system
But all of this used random partitioning! Can we do better?

Better partitioning strategies: the search space
A strategy must:
- handle large (billion-edge scale) graphs: low space and time complexity, ideally quasi-linear
- handle scale-free graphs well
- minimize the algorithm's execution time by reducing computation time (rather than communication)

The strategies we explore:
- HIGH: vertices with high degree are left on the host
- LOW: vertices with low degree are left on the host
- RAND: random assignment
(Chart: the percentage of vertices placed on the CPU for a scale-28 RMAT graph, |V| = 256M, |E| = 4B.) A sketch of the degree-based strategies appears below.
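A minimal sketch of degree-based partitioning (illustrative; the function and parameter names are assumptions, not Totem's API): sort vertices by degree and keep one end of the spectrum on the host until it holds the desired fraction.

```python
# Sketch of the HIGH/LOW partitioning strategies (hypothetical names).

def partition(degrees, cpu_fraction, strategy):
    """degrees: list of vertex degrees, indexed by vertex id.
    strategy: 'HIGH' keeps high-degree vertices on the host,
              'LOW' keeps low-degree vertices on the host.
    Returns (cpu_vertices, gpu_vertices)."""
    order = sorted(range(len(degrees)), key=lambda v: degrees[v],
                   reverse=(strategy == 'HIGH'))
    k = int(cpu_fraction * len(degrees))
    return set(order[:k]), set(order[k:])

cpu, gpu = partition([9, 1, 4, 7, 2, 2], cpu_fraction=0.5, strategy='HIGH')
print(cpu)  # the three highest-degree vertices (ids 0, 3, 2) stay on the host
```

Sorting by degree is O(|V| log |V|), i.e., quasi-linear, which fits the complexity requirement from the previous slide.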

Evaluation platform

                     Intel Nehalem             Fermi GPU
                     Xeon X5650 (2x sockets)   Tesla C2075 (2x GPUs)
Core frequency       2.67GHz                   1.15GHz
Num cores (SMs)      6                         14
HW-threads/core      2x                        48 warps (x32 threads/warp)
Last level cache     12MB                      2MB
Main memory          144GB                     6GB
Memory bandwidth     32GB/sec                  144GB/sec
Total power (TDP)    95W                       225W

BFS performance
BFS traversal rate for a scale-28 RMAT graph (|V| = 256M, |E| = 4B). 2x performance gain! LOW: no gain over random! Next: exploring the 75% data point.

BFS performance: more details
The host is the bottleneck in all cases!

PageRank performance
PageRank processing rate for a scale-28 RMAT graph (|V| = 256M, |E| = 4B). 25% performance gain! LOW: minimal gain over random! Better packing.

Small graphs (scale-25 RMAT graphs: |V| = 32M, |E| = 512M)
(Panels: BFS, PageRank.)
- Intelligent partitioning provides benefits
- Key for performance: load balancing

Uniform graphs (not scale-free)
(Panels: BFS on a scale-25 uniform graph, |V| = 32M, |E| = 512M; BFS on scale-28.)
- Hybrid techniques are not useful for uniform graphs

Scalability
- Graph size: RMAT graphs, scale 25 to 29 (up to |V| = 512M, |E| = 8B)
- Platform size: 1, 2, 4 sockets; 2 sockets + 2 GPUs
(Panels: BFS, PageRank.)

Power
Normalizing by power (TDP: thermal design power). Metric: million TEPS per watt. (Panels: BFS, PageRank.)
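To illustrate the metric, a back-of-the-envelope computation using the TDP figures from the platform slide and the 1.13 billion TEPS BFS rate quoted earlier; the actual per-configuration rates are in the plots, which this transcript does not reproduce.

```python
# Illustrative power-normalized metric for the dual-socket + dual-GPU system.
teps = 1.13e9                    # traversed edges per second (from earlier slide)
tdp_watts = 2 * 95 + 2 * 225     # 2 CPU sockets + 2 GPUs = 640 W (TDP)
print(teps / tdp_watts / 1e6)    # ~1.77 million TEPS per watt
```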

Conclusions
Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)
Q: Can one design a processing engine for hybrid platforms that is both generic and efficient? A: Yes.
Q: Are there near-linear-complexity partitioning strategies that enable higher performance? A: Yes; partitioning strategies based on vertex connectivity provide better performance than random in all cases.
Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)? A: No (for scale-free graphs).
Q: Which strategies work best? A: It depends! Large graphs: shape the load. Small graphs: load balancing.

If you were plowing a field, which would you rather use?
- Two oxen, or 1024 chickens?
- Both!

Code available at: netsyslab.ece.ubc.ca
Papers:
- A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. PACT 2012.
- On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest. A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu. IPDPS.

A golf course… a (nudist) beach (… and 199 days of rain each year). Networked Systems Laboratory (NetSysLab), University of British Columbia.