George Caragea and Uzi Vishkin, University of Maryland. Speaker: James Edwards. 1

 It has proven quite difficult to obtain significant performance improvements using current parallel computing platforms.  National Research Council report [FM10]: while heroic programmers can exploit today’s vast amounts of parallelism, whole new computing “stacks” are required to allow expert and typical programmers to do that easily. 2

 The Parallel Random Access Machine (PRAM) is the simplest model of a parallel computer. ◦ Work-Depth is a conceptually simpler model that is equivalent to the PRAM. ◦ At each point in time, specify all operations that can be performed in parallel. ◦ Any processor can access any memory address in constant time.  Advantages ◦ Ease of algorithm design ◦ Provability of correctness ◦ Ease of truly PRAM-like programming  So, what’s the problem? 3
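To make the Work-Depth view concrete, here is a minimal sketch (not from the slides): summing an array with O(n) work and O(log n) depth, where each round lists all operations that may execute in parallel. The function name wd_sum is illustrative, and the OpenMP pragma merely marks the parallel step; this is not XMT code.

    #include <stdio.h>

    /* Work-Depth illustration: array sum in O(log n) rounds.
       Within one round, every iteration of the inner loop is independent
       and may execute in the same parallel time step. */
    long wd_sum(long *a, int n) {
        for (int stride = 1; stride < n; stride *= 2) {
            #pragma omp parallel for        /* one Work-Depth time step */
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];
        }
        return a[0];
    }

    int main(void) {
        long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%ld\n", wd_sum(a, 8));      /* prints 36 */
        return 0;
    }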

 Many doubt the direct practical relevance of PRAM algorithms. ◦ Example: lack of any poly-logarithmic PRAM graph algorithms in the new NSF/IEEE-TCPP curriculum  Past work provided very limited evidence to alleviate these doubts.  Graph algorithms in particular tend to be difficult to implement efficiently, as shown in two papers from Georgia Tech: ◦ Biconnectivity (IPDPS ‘05, 12-processor Sun machine): Speedups of up to 4x with a modified version of the Tarjan-Vishkin biconnectivity algorithm  No speedup without major changes to the algorithm ◦ Maximum flow (IPDPS ‘10, hybrid GPU-CPU implementation): Speedups of up to 2.5x 4

 Cause: PRAM algorithms are not a good match for current hardware: ◦ Fine-grained parallelism = overheads  Requires managing many threads  Synchronization and communication are expensive  Clustering work into coarser threads reduces these overheads, but at the cost of load balancing ◦ Irregular memory accesses = poor locality  Cache is not used efficiently  Performance becomes sensitive to memory latency  Unlike models such as BSP and LogP, the PRAM does not explicitly take these factors into account. 5

 The Explicit Multi-Threading (XMT) architecture was developed at the University of Maryland with the following goals in mind: ◦ Good performance on parallel algorithms of any granularity ◦ Support for regular or irregular memory access ◦ Efficient execution of code derived from PRAM algorithms  A 64-processor FPGA hardware prototype and a software toolchain (compiler and simulator) are freely available for download.  Note: Unless otherwise specified, speedup results for XMT were obtained using the simulator and are given in terms of cycle counts. 6

 Main feature of XMT: using hardware resources (e.g. silicon area, power consumption) similar to those of existing CPUs and GPUs, provide a platform that looks to the programmer as close to a PRAM as possible. ◦ Instead of ~8 “heavy” processor cores, provide ~1,024 “light” cores for parallel code and one “heavy” core for serial code. ◦ Devote on-chip bandwidth to a high-speed interconnection network rather than to maintaining coherence between private caches. 7

◦ For the PRAM algorithms presented, the number of HW threads is more important than the processing power per thread, because these algorithms happen to perform more work than an equivalent serial algorithm; this cost is offset by sufficient parallelism in hardware. ◦ Balance between the tight synchrony of the PRAM and hardware constraints (such as locality) is obtained through support for fine-grained multithreaded code, where a thread can advance at its own speed between (a form of) synchronization barriers. 8

 Consider the following two systems: 1. XMT running a PRAM algorithm with few or no modifications 2. A multi-core CPU or GPU running a heavily modified version of the same PRAM algorithm, or another algorithm solving the same problem  It is perhaps surprising that (1) can outperform (2) while being easier to implement.  This idea was demonstrated with the following four PRAM graph algorithms: ◦ BFS ◦ Connectivity ◦ Biconnectivity ◦ Maximum flow 9

 None of the 40+ students in a fall 2010 joint UIUC/UMD course got any speedup using OpenMP programming on simple irregular problems such as breadth-first search (BFS) on an 8-processor SMP, but they got 8x-25x speedups on the XMT FPGA prototype.  On BFS, we show potential speedups of 5.4x over an optimized GPU implementation, and of 73x when the input graph provides a low degree of parallelism during execution. 10
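For reference, a minimal serial sketch of the level-synchronous, frontier-based BFS that this kind of platform favors: in a PRAM/XMT rendering, the loop over the current frontier runs in parallel at each level and the check-and-claim of a neighbor must be atomic. The CSR layout and the name bfs_levels are illustrative assumptions, not the course or XMT code.

    #include <stdlib.h>

    /* Level-synchronous BFS on a CSR graph (row_ptr/col_idx).
       Returns an array of BFS levels (-1 = unreachable).
       In a parallel version, the loop over the frontier runs in parallel
       and the claim of an unvisited neighbor must be atomic. */
    int *bfs_levels(int n, const int *row_ptr, const int *col_idx, int src) {
        int *level    = malloc(n * sizeof(int));
        int *frontier = malloc(n * sizeof(int));
        int *next     = malloc(n * sizeof(int));
        for (int v = 0; v < n; v++) level[v] = -1;

        int f_size = 1, depth = 0;
        frontier[0] = src;
        level[src] = 0;

        while (f_size > 0) {
            int n_size = 0;
            depth++;
            for (int i = 0; i < f_size; i++) {            /* frontier loop */
                int u = frontier[i];
                for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                    int v = col_idx[e];
                    if (level[v] == -1) {                 /* claim v */
                        level[v] = depth;
                        next[n_size++] = v;
                    }
                }
            }
            int *tmp = frontier; frontier = next; next = tmp;
            f_size = n_size;
        }
        free(frontier); free(next);
        return level;
    }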

 Using the Shiloach-Vishkin (SV) PRAM connectivity algorithm, we show potential speedups of 39x-100x over the best serial implementation and 2.2x-4x over an optimized GPU implementation that greatly modified the original algorithm.  In fact, for XMT the SV PRAM connectivity algorithm did not need to wait for a research paper: it was given as one of six programming assignments in standard PRAM algorithms classes, and was even completed by a couple of 10th graders at Blair High School, Maryland. 11
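As a rough illustration of the algorithm’s structure, here is a simplified serial rendering of the SV idea: hook a component’s root onto a smaller-numbered neighboring representative, then shortcut with pointer jumping. It omits the star tests and unconditional hooking that give the real SV algorithm its O(log n) round bound, and the names are illustrative.

    /* Simplified Shiloach-Vishkin-style connectivity sketch.
       D[v] is v's candidate representative.  Each round: hook roots onto
       smaller-numbered neighboring representatives, then pointer-jump.
       On a PRAM/XMT, both inner loops run in parallel each round. */
    void sv_connectivity(int n, int m, const int (*edge)[2], int *D) {
        for (int v = 0; v < n; v++) D[v] = v;
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int e = 0; e < m; e++) {                 /* hooking */
                int ru = D[edge[e][0]], rv = D[edge[e][1]];
                if (ru < rv && rv == D[rv]) { D[rv] = ru; changed = 1; }
                else if (rv < ru && ru == D[ru]) { D[ru] = rv; changed = 1; }
            }
            for (int v = 0; v < n; v++)                   /* pointer jumping */
                if (D[v] != D[D[v]]) { D[v] = D[D[v]]; changed = 1; }
        }
        /* On exit, D[u] == D[v] exactly when u and v are connected. */
    }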

 Complete graph: Every vertex is connected to every other vertex  Random graph: Edges are added at random between unique pairs of vertices  Great Lakes road graph: From the 9th DIMACS Implementation Challenge  Google web graph: Undirected version of the largest connected component of the Google graph of web pages and the hyperlinks between them, from the Stanford Network Analysis Platform (SNAP) 12

 Maximal planar graph ◦ Built layer by layer ◦ The first layer has three vertices and three edges. ◦ Each additional layer has three vertices and nine edges. 13
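A possible generator for such a graph is sketched below. The slide fixes only the counts (3 vertices and 3 edges in the first layer, 3 vertices and 9 edges per additional layer); the concrete wiring used here, each layer a triangle joined to the previous triangle by six antiprism-style edges, which keeps the graph maximal planar, is an assumption of the sketch and not necessarily the generator used in the study.

    #include <stdio.h>

    /* Layer-by-layer maximal planar graph sketch.  Layer 0: a triangle.
       Every further layer: a new triangle (3 edges) plus 6 edges to the
       previous triangle in an antiprism-like pattern (assumed wiring).
       A graph with L layers has 3L vertices and 9L - 6 = 3(3L) - 6 edges. */
    static void emit(int u, int v) { printf("%d %d\n", u, v); }

    void gen_layered_planar(int layers) {
        emit(0, 1); emit(1, 2); emit(2, 0);                     /* first layer */
        for (int k = 1; k < layers; k++) {
            int p = 3 * (k - 1), c = 3 * k;                     /* base vertex ids */
            emit(c, c + 1); emit(c + 1, c + 2); emit(c + 2, c); /* new triangle */
            for (int i = 0; i < 3; i++) {                       /* 6 cross edges */
                emit(c + i, p + i);
                emit(c + i, p + (i + 1) % 3);
            }
        }
    }

    int main(void) { gen_layered_planar(4); return 0; }         /* 12 vertices, 30 edges */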

 On biconnectivity, we show potential speedups of 9x-33x using a direct implementation of the Tarjan-Vishkin (TV) biconnectivity algorithm, a logarithmic-time PRAM algorithm.  When compared with two other algorithms, one based on BFS and the other on DFS, TV was the only algorithm that provided strong speedups on all evaluated input graphs. ◦ The other algorithms perform less work, but lose out to TV on balance  Furthermore, TV provided the best speedup on sparse graphs. 15

 Biconnectivity provides a good example of how programming differs between XMT and other platforms.  For both XMT and SMPs, a significant challenge was to improve the work efficiency of subroutines used within the biconnectivity algorithm. 17

 On XMT, we left the core algorithm as is, without reducing its available parallelism: ◦ When computing graph connectivity (first on the input graph, then on an auxiliary graph), compact the adjacency list every few iterations ◦ When computing the preorder numbering of the spanning tree of the input graph, accelerate the iterations by choosing faster but more work-demanding list-ranking algorithms for different iterations (“accelerating cascades”, [CV86]); a basic pointer-jumping list-ranking sketch follows this slide ◦ Transition as many computations as possible from the original input graph to the spanning tree.  In contrast, speedups on SMPs could not be achieved without reducing the parallelism of TV (e.g. by performing a DFS traversal of the input graph), effectively replacing many of its components. 18
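To make the list-ranking ingredient concrete, here is a minimal serial sketch of basic pointer-jumping (Wyllie-style) list ranking, the simple but work-inefficient end of the accelerating-cascades trade-off mentioned above; the faster, more work-demanding and work-optimal variants that the cascade switches between are not shown, and the names are illustrative.

    #include <stdlib.h>
    #include <string.h>

    /* Pointer-jumping list ranking sketch: given next[v] for a linked list
       (next[tail] == tail), compute rank[v] = distance from v to the tail.
       O(n log n) work, O(log n) rounds; on a PRAM the loop over v is parallel.
       Double buffering mimics the synchronous PRAM step. */
    void list_rank(int n, const int *next_in, int *rank) {
        int *nxt   = malloc(n * sizeof(int));
        int *nxt2  = malloc(n * sizeof(int));
        int *rank2 = malloc(n * sizeof(int));
        memcpy(nxt, next_in, n * sizeof(int));
        for (int v = 0; v < n; v++) rank[v] = (nxt[v] == v) ? 0 : 1;

        for (int step = 0; (1 << step) < n; step++) {
            for (int v = 0; v < n; v++) {                 /* parallel on a PRAM */
                rank2[v] = rank[v] + rank[nxt[v]];
                nxt2[v]  = nxt[nxt[v]];
            }
            memcpy(rank, rank2, n * sizeof(int));
            memcpy(nxt,  nxt2,  n * sizeof(int));
        }
        free(nxt); free(nxt2); free(rank2);
    }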

 On maximum flow, we show potential speedups of up to 108x compared to a modern CPU architecture running the best serial implementation.  The XMT solution is a lock-free PRAM implementation, based on balancing the Goldberg-Tarjan push-relabel algorithm [GT88] with the first PRAM max-flow algorithm (SV) [SV82b].  Performance is highly dependent on the structure of the graph, as determined by: ◦ The amount of parallelism available during execution ◦ The number of parallel steps (kernel invocations) ◦ The amount of memory queuing due to conflicts 19
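For background on the push-relabel side, below is a textbook-style serial push-relabel on a small capacity matrix, just to make the push and relabel operations concrete; the lock-free, balanced XMT implementation described above is parallel and is not reproduced here. The names, the adjacency-matrix layout, and the toy network are assumptions of the sketch.

    #include <stdio.h>

    /* Generic serial push-relabel max-flow on a capacity matrix (sketch). */
    #define N 4                       /* vertices in the toy example below */

    static int cap[N][N];             /* residual capacities */
    static int excess[N];             /* excess flow per vertex */
    static int height[N];             /* height (label) per vertex */

    static void push(int u, int v) {  /* send min(excess, residual) from u to v */
        int d = excess[u] < cap[u][v] ? excess[u] : cap[u][v];
        cap[u][v] -= d; cap[v][u] += d;
        excess[u] -= d; excess[v] += d;
    }

    static void relabel(int u) {      /* raise u just above its lowest residual neighbor */
        int min_h = 2 * N;
        for (int v = 0; v < N; v++)
            if (cap[u][v] > 0 && height[v] < min_h) min_h = height[v];
        height[u] = min_h + 1;
    }

    static int max_flow(int s, int t) {
        height[s] = N;
        for (int v = 0; v < N; v++)   /* saturate edges out of the source */
            if (cap[s][v] > 0) {
                int d = cap[s][v];
                excess[v] += d; excess[s] -= d;
                cap[v][s] += d; cap[s][v] = 0;
            }
        int active = 1;
        while (active) {              /* discharge active vertices until none remain */
            active = 0;
            for (int u = 0; u < N; u++) {
                if (u == s || u == t || excess[u] <= 0) continue;
                active = 1;
                int pushed = 0;
                for (int v = 0; v < N && excess[u] > 0; v++)
                    if (cap[u][v] > 0 && height[u] == height[v] + 1) { push(u, v); pushed = 1; }
                if (!pushed) relabel(u);
            }
        }
        return excess[t];
    }

    int main(void) {                  /* toy network: 0 = source, 3 = sink */
        cap[0][1] = 3; cap[0][2] = 2; cap[1][2] = 5; cap[1][3] = 2; cap[2][3] = 3;
        printf("max flow = %d\n", max_flow(0, 3));   /* prints 5 */
        return 0;
    }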

 Acyclic Dense Graphs (ADG) ◦ From the 1st DIMACS Challenge [JM93] ◦ Complete directed acyclic graphs ◦ Node degrees range from N-1 down to 1  Washington Random Level Graphs (RLG) ◦ From the 1st DIMACS Challenge [JM93] ◦ Rectangular grids; each vertex in a row has three edges to randomly chosen vertices in the next row (a generator sketch in this spirit follows the RANDOM item below) ◦ Source and sink are external to the grid, connected to the first and last rows

 RMF Graphs ◦ From the 1st DIMACS Challenge [JM93] and [GG88] ◦ a frames (square grids) of b x b vertices each; N = a x b x b ◦ Each vertex is connected to its neighbors within the frame and to one random vertex in the next frame ◦ Source in the first frame, sink in the last frame ◦ RMF long: many “small” frames; RMF wide: fewer “large” frames  RANDOM ◦ Random unstructured graphs ◦ Edges are placed uniformly at random between pairs of nodes ◦ Average degree is 6 ◦ Short diameter, high degree of parallelism
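As noted in the RLG item above, here is a generator sketch in the spirit of the Washington RLG family; the DIMACS generators of [JM93] are the authoritative versions. Edge capacities, the output format, and the possibility of duplicate arcs are simplifying assumptions of this sketch.

    #include <stdio.h>
    #include <stdlib.h>

    /* Washington-RLG-style generator sketch: an R x C grid; every grid vertex
       gets 3 arcs to randomly chosen vertices of the next row; an external
       source feeds row 0 and the last row feeds an external sink.
       Capacities are placeholder random values (an assumption). */
    static int vid(int r, int c, int C) { return r * C + c; }

    void gen_rlg(int R, int C, int max_cap, unsigned seed) {
        srand(seed);
        int source = R * C, sink = R * C + 1;
        for (int c = 0; c < C; c++)                      /* source -> first row */
            printf("%d %d %d\n", source, vid(0, c, C), 1 + rand() % max_cap);
        for (int r = 0; r + 1 < R; r++)                  /* row r -> row r+1 */
            for (int c = 0; c < C; c++)
                for (int k = 0; k < 3; k++)
                    printf("%d %d %d\n", vid(r, c, C),
                           vid(r + 1, rand() % C, C), 1 + rand() % max_cap);
        for (int c = 0; c < C; c++)                      /* last row -> sink */
            printf("%d %d %d\n", vid(R - 1, c, C), sink, 1 + rand() % max_cap);
    }

    int main(void) { gen_rlg(4, 4, 100, 1); return 0; }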

 These experimental algorithmic results show not only that theory-based algorithms can provide good speedups in practice, but also that they are sometimes the only ones that can do so.  Perhaps most surprising to theorists would be that, for a fair comparison among same-generation many-core platforms, the nominal number of processors matters less than silicon area. 23

 [CB05] G. Cong and D.A. Bader. An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs). In Proc. 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), page 45b, April 2005.  [CKTV10] G.C. Caragea, F. Keceli, A. Tzannes, and U. Vishkin. General-purpose vs. GPU: Comparison of many-cores on irregular workloads. In HotPar ’10: Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism, June 2010.  [CV86] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. STOC 1986.  [CV11] G. Caragea and U. Vishkin. Better Speedups for Parallel Max-Flow. Brief Announcement, SPAA 2011.

 [EV11] J. Edwards and U. Vishkin. An Evaluation of Biconnectivity Algorithms on Many-Core Processors. Under review.  [FM10] S.H. Fuller and L.I. Millett (Eds.). The Future of Computing Performance: Game Over or Next Level? Computer Science and Telecommunications Board, National Academies Press, December 2010.  [GG88] D. Goldfarb and M. Grigoriadis. A Computational Comparison of the Dinic and Network Simplex Methods for Maximum Flow. Annals of Operations Research, 13:81-123, 1988.  [GT88] A. Goldberg and R. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 1988.

 [HH10] Z. He and B. Hong. Dynamically Tuned Push-Relabel Algorithm for the Maximum Flow Problem on CPU-GPU-Hybrid Platforms. In Proc. 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10), 2010.  [JM93] D.S. Johnson and C.C. McGeoch, editors. Network Flows and Matching: First DIMACS Implementation Challenge. AMS, Providence, RI, 1993.  [KTCBV11] F. Keceli, A. Tzannes, G. Caragea, R. Barua and U. Vishkin. Toolchain for programming, simulating and studying the XMT many-core architecture. In Proc. 16th Int. Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), in conjunction with IPDPS, Anchorage, Alaska, May 20, 2011, to appear. 26

 [SV82a] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57–67, 1982.  [SV82b] Y. Shiloach and U. Vishkin. An O(n^2 log n) parallel max-flow algorithm. J. Algorithms, 3:128–146, 1982.  [TCPP10] NSF/IEEE-TCPP curriculum initiative on parallel and distributed computing - core topics for undergraduates. December 2010.  [TV85] R.E. Tarjan and U. Vishkin. An Efficient Parallel Biconnectivity Algorithm. SIAM J. Computing, 14(4):862–874, 1985.  [WV08] X. Wen and U. Vishkin. FPGA-Based Prototype of a PRAM-on-Chip Processor. In Proceedings of the 5th Conference on Computing Frontiers, CF ’08, pages 55–66, New York, NY, USA, 2008. ACM. 27