
Performance Implications of Communication Mechanisms in All-Software Global Address Space Systems
Chi-Chao Chang, Dept. of Computer Science, Cornell University
Joint work with Beng-Hong Lim (IBM), Grzegorz Czajkowski, and Thorsten von Eicken

Framework
- Parallel computing on clusters of workstations
- Hardware communication primitives are message-based
- Global addressing of data structures

Problem
- Tolerating high network latencies and overheads when accessing remote data

Mechanisms for tolerating latencies and overheads (a bulk-transfer sketch follows this list)
- Caching: coherent data replication
- Bulk transfers: amortize the fixed cost of a single message
- Split-phase operations: overlap computation with communication
- Push-based transfers: sender-controlled communication
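To make the bulk-transfer point concrete, here is a minimal Split-C-style sketch, assuming a 100-element double array x that lives on processor 0 (as in the CRL-versus-Split-C example later in the talk); it uses only primitives that appear elsewhere in these slides:

    double *global y = toglobal(0, x);   /* global pointer to proc 0's x */
    double w = 0, z[100];
    int i;

    /* Fine-grained: each dereference of a global pointer is a separate
     * remote read, so the fixed per-message cost is paid 100 times. */
    for (i = 0; i < 100; i++)
        w += y[i];

    /* Bulk: one message moves all 100 doubles, amortizing that fixed
     * cost over the whole transfer. */
    bulk_read(z, y, 100 * sizeof(double));
    for (i = 0; i < 100; i++)
        w += z[i];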

Objective
Global addressing "languages":
- DSM: cache-coherent access to shared data
  - C Region Library (CRL) [Johnson et al. 95]
  - Mechanism: caching
- Global pointers and arrays: explicit access to remote data
  - Split-C [Culler et al. 93]
  - Mechanisms: bulk transfers, split-phase communication, push-based communication

Questions:
- Which of the two languages is easier to program?
- Which of the two yields better performance?
- Which mechanisms are more "effective"?

Approach
Develop comparable implementations of CRL and Split-C:
- Same compiler: GCC
- Common communication layer: Active Messages (see the sketch below)

Analyze the performance implications of the caching, bulk-transfer, split-phase, and push-based communication mechanisms:
- with five applications
- on the IBM SP, Meiko CS-2, and two simulated architectures
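For context on the common layer: in the Active Messages model, each message carries the address of a user-level handler that runs as soon as the message arrives. Below is a minimal C sketch of the request/reply idiom; am_request, am_poll, and the handler signatures are illustrative stand-ins, not the actual API of the layer used in this work:

    static double x[100];            /* data owned by this node */
    static double fetched;           /* filled in by the reply handler */
    static volatile int reply_done;  /* flag the requester spins on */

    /* Runs on the requester when the reply message arrives. */
    void reply_handler(int src, double value) {
        fetched = value;
        reply_done = 1;
    }

    /* Runs on the owner when the request message arrives. */
    void request_handler(int src, int index) {
        am_request(src, reply_handler, x[index]);   /* send the reply */
    }

    /* Requester side: send a request naming the handler to run
     * remotely, then poll the network until the reply handler fires. */
    double remote_fetch(int owner, int index) {
        reply_done = 0;
        am_request(owner, request_handler, index);
        while (!reply_done)
            am_poll();               /* drain network, run handlers */
        return fetched;
    }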

CRL versus Split-C

    // CRL
    rid_t r;
    double *x, *y, w = 0;
    int i;
    if (MYPROC == 0) {
        r = rgn_create(100*8);          /* region of 100 doubles */
        x = rgn_map(r);
        for (i = 0; i < 100; i++) x[i] = i;
        rgn_bcast_send(&r);
    } else {
        rgn_bcast_recv(&r);
        y = rgn_map(r);
        rgn_start_read(y);              /* region is cached locally */
        for (i = 0; i < 100; i++) w += y[i];
        rgn_end_read(y);
    }

    // Split-C
    double x[100];
    int i;
    if (MYPROC == 0) {
        for (i = 0; i < 100; i++) x[i] = i;
        barrier();
    } else {
        double *global y;
        double w = 0, z[100];
        barrier();
        y = toglobal(0, x);             /* global pointer to proc 0's x */
        for (i = 0; i < 100; i++) w += y[i];   /* per-element remote reads */
        bulk_read(z, y, 100*8);         /* or one explicit bulk transfer */
    }

CRL: caching (regions), implicit bulk transfers, size fixed at region creation.
Split-C: no caching; global pointers, explicit bulk transfers of variable size.

CRL versus Split-C

    // Split-C
    int i;
    int *global gp;
    i := *gp;        // split-phase get
    *gp := 5;        // split-phase store
    sync();          // wait until completion

CRL: no explicit communication.
Split-C: split-phase and push-based communication via special assignments (:=) and explicit synchronization.
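The payoff of split-phase operations is overlap. A minimal sketch, assuming the primitives above; N, remote_data, and do_local_work are illustrative placeholders:

    /* Issue many split-phase gets, compute while they are in flight,
     * then block once for all of them. */
    double *global gp = toglobal(0, remote_data);
    double buf[N];
    int i;

    for (i = 0; i < N; i++)
        buf[i] := gp[i];    /* get starts; nobody waits yet */

    do_local_work();        /* remote latency is hidden behind this */
    sync();                 /* single wait for all outstanding gets */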

Hardware Platforms
[table in the original slides: IBM SP, Meiko CS-2, and two simulated architectures]

Applications
[table in the original slides: the five applications (MM, LU, FFT, Water, Barnes)]

Overall Observations
Some applications benefit from caching:
- MM, Barnes
Others benefit from explicit communication:
- FFT, LU, Water
CRL and Split-C applications have similar performance:
- if the right mechanisms are used,
- if the programmer spends comparable effort, and
- if the underlying CRL and Split-C implementations are comparable.

Sample: Matrix Multiply
[chart: MM, 16x16 and 128x128 blocks, 8 processors]

Caching in CRL
Caching benefits applications with sufficient temporal and spatial locality.
Key parameter: region size (a sketch of the trade-off follows this list)
- Small regions increase coherence-protocol overhead
- Large regions increase communication overhead
Tuning region sizes can be difficult in many cases:
- The trade-off depends on communication latency
- Regions tend to correspond to static data structures (e.g., matrix blocks, molecule structures)
- Re-designing data structures can be time-consuming
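A minimal CRL-style sketch of the granularity trade-off, using the region calls from the earlier example; N and B are illustrative constants:

    #define N 1024   /* elements in the shared array */
    #define B 16     /* elements per region: the tuning knob */

    rid_t fine[N], coarse[N / B];
    int i;

    /* One region per element: every remote miss is a separate
     * coherence transaction, so protocol overhead dominates. */
    for (i = 0; i < N; i++)
        fine[i] = rgn_create(sizeof(double));

    /* One region per B-element block: a miss fetches the whole block
     * in one transfer, but a reader pulls in all B doubles even when
     * it touches only one of them. */
    for (i = 0; i < N / B; i++)
        coarse[i] = rgn_create(B * sizeof(double));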

Caching: Region Size
- Small regions can hurt caching, especially when latency is high.
  LU 4x4: CRL is much slower than Split-C.
- Large regions usually improve caching.
  LU 16x16: CRL closes the performance gap.
[chart: LU, 4x4 and 16x16 blocks, 8 processors]

Caching: Latency
- The advantage of caching diminishes as communication latency decreases.
  Barnes: Split-C closes the performance gap on the Meiko and is faster on RMC1.
[chart: Barnes, 512 bodies, 8 processors]

Caching vs. Bulk Transfer
- Large regions harm caching when the region size doesn't match the amount of data actually used (i.e., false sharing).
  Water 512: CRL is much slower than Split-C.
- The ability to specify the transfer size is an advantage of bulk transfers.
  Water 512: selective prefetching reduces the Split-C time substantially.
[chart: Water, 512 molecules, 8 processors]

Caching vs. Bulk Transfer
- Caching is harmful when temporal locality is lacking.
  FFT: Split-C is faster than CRL on all platforms.
[chart: FFT, 2M points, 8 processors]

Split-Phase and Push-Based
Two observations:
- Bandwidth is not the limitation.
- Split-phase and push-based operations allow communication phases to be pipelined (a sketch follows below).
Split-phase/push-based communication outperforms caching.
  LU 16x16: the baseline Split-C version is substantially faster than CRL.
[chart: LU, 16x16 blocks, 8 processors]
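What "pipelined communication phases" looks like, in a push-based Split-C-style sketch; NBLKS, BLK, compute_block, and the global pointer next (into a consumer's buffer) are illustrative assumptions:

    #define NBLKS 8
    #define BLK   64

    double *global next;       /* points into the consumer's buffer */
    double out[NBLKS][BLK];    /* one slot per block, so in-flight
                                  pushes never race with later work */
    int k, j;

    for (k = 0; k < NBLKS; k++) {
        compute_block(out[k], k);           /* produce block k locally */
        for (j = 0; j < BLK; j++)
            next[k * BLK + j] := out[k][j]; /* push block k; don't wait */
        /* While these stores drain, the loop is already computing
         * block k+1: communication and computation are pipelined. */
    }
    sync();                    /* one wait for all outstanding pushes */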

Related Work
- Previous research (Wind Tunnel, Alewife, FLASH, TreadMarks) shows:
  - the benefits of explicit bulk communication in shared-memory systems, and
  - that overhead in shared-memory systems is proportional to the number of cache/page/region misses.
- Split-C shows the benefits of explicit communication without caching.
- Scales and Lam demonstrate the benefits of caching and push-based communication in SAM.
- This is the first study to compare and evaluate the performance of the four communication mechanisms in global address space systems.

Conclusions
Split-C and CRL applications have comparable performance
- if a carefully controlled study is conducted.
Programming experience: "what" versus "when"
- CRL regions: the programmer optimizes what to transfer.
- Split-C: the programmer optimizes when to transfer...
  - pipelining communication phases with explicit synchronization
  - managing local copies of remote data
The paper contains detailed results for:
- multiple versions of 5 applications
- running on 4 machines