Engineering Distributed Graph Algorithms in PGAS Languages. Guojing Cong, IBM Research. Joint work with George Almasi and Vijay Saraswat.


Programming language from the perspective of a not-so-distant admirer

Mapping graph algorithms onto distributed-memory machines has been a challenge
– Efficiently mapping PRAM algorithms onto SMPs is hard
– Mapping onto a cluster of SMPs is even harder
– Optimizations are available and have been shown to improve performance
– Can these be automated with help from language design, compiler, and runtime development?
Expectations of the languages
– Expressiveness: SPMD, task parallelism (spawn/async), pipeline, future, virtual shared-memory abstraction, work-stealing, data distribution, …
– Ease of programming
– Efficiency: mapping high-level constructs to run fast on the target machine (SMP; multi-core, multi-threaded; MPP; heterogeneous with accelerators)
– Leverage for tuning

A case study: connected components on a cluster of SMPs in UPC
A connected component of an undirected graph G=(V,E), |V|=n, |E|=m, is a maximal connected subgraph
– A connected components algorithm finds all such components in G
Sequential algorithms
– Breadth-first search (BFS)
– Depth-first search (DFS)
One parallel algorithm: the Shiloach-Vishkin algorithm (SV82)
– Takes an edge list as input
– Adopts the graft-and-shortcut approach: start with n isolated vertices; graft vertex v to a neighbor u with u < v; shortcut the connected components into super-vertices and continue on the reduced graph (a minimal sketch follows)
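To make the graft-and-shortcut idea concrete, here is a minimal sequential sketch in plain C. It is not the parallel SV82 algorithm itself, which runs the graft and shortcut phases concurrently over all edges and vertices; the names D, eu, and ev are illustrative.

```c
/* Minimal sequential sketch of the graft-and-shortcut idea (illustrative).
 * D[v] is the tentative super-vertex (component label) of vertex v.
 * The real SV82 algorithm performs these phases in parallel over all edges. */
#include <stdio.h>

#define NV 6
#define NE 4

int main(void) {
    int eu[NE] = {0, 1, 2, 4};      /* edge list: edge e is (eu[e], ev[e]) */
    int ev[NE] = {1, 2, 3, 5};
    int D[NV];
    for (int v = 0; v < NV; v++) D[v] = v;   /* start with n isolated vertices */

    int grafted = 1;
    while (grafted) {
        grafted = 0;
        /* Graft: hook the larger-labeled root onto the smaller label. */
        for (int e = 0; e < NE; e++) {
            int u = eu[e], v = ev[e];
            if (D[u] < D[v] && D[v] == D[D[v]]) { D[D[v]] = D[u]; grafted = 1; }
            if (D[v] < D[u] && D[u] == D[D[u]]) { D[D[u]] = D[v]; grafted = 1; }
        }
        /* Shortcut: pointer-jump so every vertex points directly to its root. */
        for (int v = 0; v < NV; v++)
            while (D[v] != D[D[v]]) D[v] = D[D[v]];
    }
    for (int v = 0; v < NV; v++)
        printf("vertex %d -> component %d\n", v, D[v]);
    return 0;
}
```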

Example: SV (figure): the input graph after the graft and shortcut steps of the 1st and 2nd iterations.

Simple? Yes, but it performs poorly: memory-intensive, irregular accesses, poor temporal locality (results shown for a Sun Enterprise E4500).

Typical behavior of graph algorithms (figures): CPI construction and LRU stack distance plots for BC (betweenness centrality), BiCC (biconnected components), and MST (minimum spanning tree).

On distributed-memory machines, random access and indirection make graph algorithms hard to
– implement, e.g., there is no fast MPI implementation
– optimize, i.e., random access creates problems for both communication and cache performance
The partitioned global address space (PGAS) paradigm
– presents a shared-memory abstraction to the programmer for distributed-memory machines
– has received a fair amount of attention recently
– allows the programmer to control data layout and work assignment
– improves ease of programming and also gives the programmer leverage to tune for high performance (a minimal UPC illustration follows)
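For readers new to the model, the fragment below is a minimal UPC sketch of the PGAS abstraction (not code from the talk): a shared array is physically distributed across threads, any thread can read or write any element with ordinary array syntax, and affinity determines which accesses are local. The names are illustrative and a static THREADS compilation environment is assumed.

```c
/* Minimal UPC illustration of the PGAS model (illustrative, not from the talk).
 * Assumes compilation in a static THREADS environment (e.g. upcc -T 4). */
#include <upc.h>
#include <stdio.h>

#define N 1024

shared int D[N];                 /* distributed cyclically across THREADS */

int main(void) {
    int i;

    /* Each thread initializes the elements it owns (affinity &D[i]). */
    upc_forall (i = 0; i < N; i++; &D[i])
        D[i] = i;
    upc_barrier;

    /* Any thread may also touch remote elements with the same syntax;
     * the compiler and runtime turn such accesses into communication. */
    if (MYTHREAD == 0)
        D[N - 1] = -1;           /* likely a remote write when THREADS > 1 */
    upc_barrier;

    if (MYTHREAD == 0)
        printf("D[N-1] = %d, owned by thread %d\n",
               D[N - 1], (int)upc_threadof(&D[N - 1]));
    return 0;
}
```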

Implementation in UPC is straightforward (the slide shows the UPC implementation and a Pthreads implementation side by side; a hedged sketch of the UPC grafting step follows).
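The UPC and Pthreads listings from the slide are not reproduced here. As a stand-in, below is a hedged sketch of what the straightforward UPC grafting step might look like: shared arrays for D and the edge list, a upc_forall over the edges, and fine-grained (possibly remote) reads and writes of D. All names and sizes are illustrative.

```c
/* Hedged sketch of a straightforward UPC grafting step (illustrative names).
 * Every access to D may be remote, which is exactly why the naive version
 * performs poorly on a cluster. Races between concurrent grafts are resolved
 * arbitrarily, as in the shared-memory SV algorithm. */
#include <upc.h>

#define NV 1024
#define NE 4096

shared int D[NV];                 /* super-vertex label of each vertex */
shared int eu[NE], ev[NE];        /* edge list */
shared int grafted;               /* set when any graft happened this round */

void graft_step(void) {
    int e;
    upc_forall (e = 0; e < NE; e++; &eu[e]) {
        int u = eu[e], v = ev[e];                    /* local reads (affinity)  */
        if (D[u] < D[v] && D[v] == D[D[v]]) {        /* possibly remote reads   */
            D[D[v]] = D[u];                          /* possibly remote write   */
            grafted = 1;
        }
        if (D[v] < D[u] && D[u] == D[D[u]]) {
            D[D[u]] = D[v];
            grafted = 1;
        }
    }
    upc_barrier;
}
```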

Performance is miserable

Communication-efficient algorithms
Proposed to address the "bottleneck of processor-to-processor communication"
– Goodrich [96] presented a communication-efficient sorting algorithm on weak-CREW BSP that runs in O(log n / log(h + 1)) communication rounds and O((n log n)/p) local computation time, for h = Θ(n/p)
– Adler et al. [98] presented a communication-optimal MST algorithm
– Dehne et al. [02] designed an efficient list-ranking algorithm for coarse-grained multicomputers (CGM) and BSP that takes O(log p) communication rounds with O(n/p) local computation
Common approach
– Simulate several (e.g., O(log p) or O(log log p)) steps of the PRAM algorithm to reduce the input size so that it fits in the memory of a single node
– A "sequential" algorithm is then invoked to process the reduced input of size O(n/p)
– Finally, the result is broadcast to all processors for computing the final solution
Questions
– How well do communication-efficient algorithms work in practice?
– How fast can optimized shared-memory-based algorithms run? Cache performance vs. communication performance
– Can these optimizations be automated through the necessary language/compiler support?

Locality-central optimization
Improve the locality behavior of the algorithm
– The key performance issues are communication and cache performance
– Both are determined by locality
Many prior cache-friendly results, but no tangible practical evidence
– Fine-grained parallelism makes it hard to optimize for temporal locality
– Focus on spatial locality instead, to take advantage of large cache lines, hardware prefetching, and software prefetching

Scheduling of the memory accesses in a parallel loop (the slide shows the typical loop in CC next to a generic loop; a hedged sketch of the idea follows).
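The loop listings from the slide are not reproduced here. The sketch below (plain C, illustrative names) shows the underlying idea for a generic loop that reads D[idx[i]]: instead of touching D in whatever order the iterations dictate, schedule the accesses by sorting them on the target index, so that reads of D become consecutive and friendly to long cache lines and prefetching.

```c
/* Hedged sketch of scheduling the memory accesses of a generic parallel loop
 * (illustrative; the sort stands in for whatever ordering the scheduler uses).
 * Baseline loop:            for (i = 0; i < n; i++) out[i] = D[idx[i]];
 * Scheduled variant below:  visit D in sorted index order, then scatter back. */
#include <stdlib.h>

typedef struct { int pos; int target; } ref_t;

static int by_target(const void *a, const void *b) {
    const ref_t *x = a, *y = b;
    return (x->target > y->target) - (x->target < y->target);
}

void scheduled_gather(const int *idx, const int *D, int *out, int n) {
    ref_t *refs = malloc((size_t)n * sizeof *refs);
    for (int i = 0; i < n; i++) {
        refs[i].pos = i;
        refs[i].target = idx[i];
    }
    qsort(refs, (size_t)n, sizeof *refs, by_target);   /* schedule the accesses */
    for (int i = 0; i < n; i++)
        out[refs[i].pos] = D[refs[i].target];          /* D is read in order    */
    free(refs);
}
```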

An example

Mapping to distributed environments
– All remote accesses are consecutive in our scheduling
– If the runtime provides remote prefetching or coalescing, communication efficiency improves automatically
– If not, coalescing can easily be done at the program level (the slide shows the coalescing code on the right; a hedged sketch follows)
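The coalescing code shown on the slide is not reproduced here. Below is a hedged UPC sketch of the idea: with a blocked layout, the elements this thread needs from a given remote thread are fetched with a single upc_memget into a private buffer, replacing many fine-grained remote reads with one bulk transfer. The blocked declaration, the names, and the static THREADS assumption are illustrative.

```c
/* Hedged sketch of program-level coalescing of remote reads in UPC.
 * D is block-distributed so each thread owns one contiguous chunk;
 * assumes a static THREADS environment so B is a compile-time constant. */
#include <upc.h>

#define NV 1024
#define B  (NV / THREADS)            /* elements owned by each thread */

shared [B] int D[NV];

/* Copy every remote thread's chunk of D into a private mirror of size NV:
 * one upc_memget per remote thread instead of B fine-grained reads each. */
void coalesce_reads(int *priv_D) {
    for (int t = 0; t < THREADS; t++) {
        if (t == MYTHREAD)
            continue;                /* the local chunk needs no copy */
        upc_memget(priv_D + t * B, &D[t * B], B * sizeof(int));
    }
}
```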

Performance improvement due to communication efficiency

Applying the approach on a single node for cache-friendly design
– Apply as many levels of recursion as necessary
– Simulate the recursions with virtual threads
– Assume a large-enough, one-level, fully associative cache
(the slide compares the original and optimized execution-time expressions)

Graph-specific optimizations
Compact the edge list
– The size of the list determines the number of elements to request from remote nodes
– Edges within a component no longer contribute to the merging of connected components and can be filtered out (a sketch of this filtering follows)
Avoid communication hotspots
– Grafting in CC shoots a pointer from a vertex with larger numbering to one with smaller numbering
– Thread thr0 owns vertex 0 and may quickly become a communication hotspot
– Avoid querying thr0 about D[0]
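As a concrete illustration of the compaction step, here is a hedged sketch in plain C (illustrative names): edges whose endpoints already carry the same label cannot trigger any further graft, so they are filtered out in place and the edge count shrinks for later rounds.

```c
/* Hedged sketch of edge-list compaction (illustrative names).
 * Edges internal to a component can no longer cause a graft, so drop them;
 * a smaller list means fewer elements to request from remote nodes later. */
static int compact_edges(int *eu, int *ev, int ne, const int *D) {
    int kept = 0;
    for (int e = 0; e < ne; e++) {
        if (D[eu[e]] != D[ev[e]]) {      /* still crosses two components */
            eu[kept] = eu[e];
            ev[kept] = ev[e];
            kept++;
        }
    }
    return kept;                         /* new, smaller edge count */
}
```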

UPC-specific optimizations
Avoid runtime cost on local data
– After optimization, all direct accesses to the shared arrays are local
– Yet the compiler is not able to recognize this
– With UPC, we use private pointer arithmetic for these accesses
Avoid intrinsics
– It is costly to invoke compiler intrinsics to determine the target thread id, and computing target thread ids is done in every iteration
– We compute these ids directly instead of invoking the intrinsics
– Noticing that the target ids do not change across iterations, we compute them once and store them in a global buffer
(a hedged sketch of both tricks follows)
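Below is a hedged sketch of the two tricks (illustrative names, with the blocked layout and static THREADS assumption from the coalescing sketch): cast a shared element that is known to be local into an ordinary C pointer so the inner loop avoids shared-pointer runtime overhead, and compute the target thread ids arithmetically, once, into a buffer instead of calling the intrinsic in every iteration.

```c
/* Hedged sketch of the UPC-specific optimizations (illustrative names). */
#include <upc.h>

#define NV 1024
#define B  (NV / THREADS)

shared [B] int D[NV];

void upc_specific_tricks(const int *targets, int *owner, int n) {
    /* 1. Private pointer arithmetic on local shared data: this thread's chunk
     *    of D is local, so cast it once to a plain int* and index it without
     *    shared-pointer overhead in the hot loop. */
    int *Dloc = (int *)&D[MYTHREAD * B];       /* legal: affinity is MYTHREAD */
    for (int i = 0; i < B; i++)
        Dloc[i] = MYTHREAD * B + i;

    /* 2. Avoid the intrinsic: for this blocked layout the owner of D[x] is
     *    (x / B) % THREADS, and it never changes across iterations, so fill
     *    the owner buffer once instead of calling upc_threadof(&D[x]) in
     *    every iteration. */
    for (int i = 0; i < n; i++)
        owner[i] = (targets[i] / B) % THREADS;
}
```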

Performance Results

So, how helpful is UPC?
Straightforward mapping of a shared-memory algorithm is easy
– Quick prototyping
– Quick profiling
– Incremental optimization (10 versions for CC)
All other optimizations are manual
– Many of them could be automated, though
UPC is not flexible enough to expose the hierarchy of nodes and processors to the programmer

Conclusion and future work
– We show that with appropriate optimizations, shared-memory graph algorithms can be mapped to the PGAS environment with high performance
– On inputs that fit in the main memory of one node, our implementation achieves good speedups over the best SMP implementation and the best sequential implementation
– Our results suggest that effective use of processors and caches can bring better performance than simply reducing the number of communication rounds
– Automating these optimizations is future work