Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen (Unified Parallel C at LBNL/UCB)
CS252 Class Project, December 10, 2003

Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Unified Parallel C (UPC)
- UPC is an explicitly parallel global address space language with SPMD parallelism
  - An extension of ISO C
  - User-level shared memory, partitioned by threads
  - One-sided (bulk and fine-grained) communication through reads/writes of shared variables
(Figure: the global address space, with a shared region holding X[0] ... X[P] partitioned across threads, plus a private region per thread)
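To make the SPMD model concrete, here is a minimal UPC sketch (our illustration, not from the slides): every thread executes main(), THREADS and MYTHREAD are language built-ins, and each element of the cyclic shared array has affinity to one thread.

    #include <upc.h>
    #include <stdio.h>

    shared int X[THREADS];    /* one element with affinity to each thread */

    int main(void) {
        X[MYTHREAD] = MYTHREAD * MYTHREAD;   /* each thread writes its own element */
        upc_barrier;                         /* global synchronization point */
        if (MYTHREAD == 0 && THREADS > 1)    /* one-sided read of remote data */
            printf("X[1] = %d\n", X[1]);
        return 0;
    }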

Shared Arrays and Pointers in UPC
- Cyclic:       shared int A[n];
- Block cyclic: shared [2] int B[n];
- Indefinite:   shared [0] int *C = (shared [0] int *) upc_alloc(n);
- Use pointer-to-shared to access shared data
  - Block size is part of the pointer type
  - A generic pointer-to-shared contains: address, thread id, phase
  - Cyclic and indefinite pointers are phaseless
(Figure, layout for 2 threads: T0 holds A[0] A[2] A[4] ..., B[0] B[1] B[4] B[5] ..., and C[0] C[1] C[2] ...; T1 holds A[1] A[3] A[5] ... and B[2] B[3] B[6] B[7] ...)
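A short sketch (our example) of how pointer-to-shared arithmetic follows the distribution: a cyclic pointer steps to the next thread on each increment, while a block-cyclic pointer advances its phase within a block first.

    #include <upc.h>

    shared int A[8 * THREADS];        /* cyclic: A[i] lives on thread i % THREADS */
    shared [2] int B[8 * THREADS];    /* block-cyclic with block size 2 */

    void walk(void) {                 /* assumes THREADS > 1 */
        shared int *pa = &A[0];       /* cyclic pointer: phaseless */
        pa++;                         /* now points at A[1], on thread 1 */

        shared [2] int *pb = &B[0];   /* block size 2 is part of the type */
        pb++;                         /* phase 0 -> 1: B[1], still thread 0 */
        pb++;                         /* phase wraps: B[2], on thread 1 */
    }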

Accessing Shared Memory in UPC
(Figure: a generic pointer-to-shared is the triple address, thread id, phase. The thread id selects one of the shared-memory partitions of threads 0 ... N-1, the address locates the start of the array object and the current block within that partition, and the phase gives the offset within the block, bounded by the block size.)

UPC Programming Model Features
- Block-cyclically distributed arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
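Several of these features appear in the illustrative fragment below (standard UPC, our example): a work-sharing upc_forall loop with an affinity expression, a lock for pair-wise synchronization, and a barrier.

    #include <upc.h>

    #define N 1024
    shared double A[N * THREADS];
    shared double sum;
    upc_lock_t *lock;    /* set once, collectively: lock = upc_all_lock_alloc(); */

    void accumulate(void) {
        int i;
        double local = 0.0;
        /* parallel loop: iteration i runs on the thread with affinity to &A[i] */
        upc_forall (i = 0; i < N * THREADS; i++; &A[i])
            local += A[i];
        upc_lock(lock);              /* pair-wise synchronization around the update */
        sum += local;
        upc_unlock(lock);
        upc_barrier;                 /* global synchronization */
    }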

Overview of the Berkeley UPC Compiler
- Two goals: portability and high performance
- Translator: lowers UPC code into ANSI C (the translator is platform-independent, its generated C code network-independent)
- Berkeley UPC runtime system: shared memory management and pointer operations (compiler-independent)
- GASNet communication system: uniform get/put interface for the underlying networks (language-independent)
(Figure: layered stack, top to bottom: UPC code, translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware)
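As a rough illustration of the lowering step (the type and function names below are hypothetical stand-ins, not the actual Berkeley UPC runtime API): a store to a shared variable becomes a runtime call, which the runtime maps to either a local store or a GASNet put.

    /* UPC source:
     *     shared int x;
     *     x = 5;
     */

    #include <stddef.h>

    /* hypothetical runtime representation and entry point, for illustration */
    typedef struct { void *addr; int thread; int phase; } upcr_shared_ptr_t;
    extern void upcr_put_shared_val(upcr_shared_ptr_t p, long val, size_t size);

    /* roughly the shape of the translator's output */
    void lowered_assignment(upcr_shared_ptr_t x_ptr) {
        upcr_put_shared_val(x_ptr, 5, sizeof(int));
    }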

A Layered Design
- Portable:
  - C is our intermediate language
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
- High-performance:
  - Native C compiler optimizes the serial code
  - Translator can perform high-level communication optimizations
  - GASNet can access network hardware directly and provides a rich set of communication/synchronization primitives


The Cray X1 Architecture
- New line of vector architecture
  - Two modes of operation: SSP (up to 16 CPUs/node) and MSP (multistreams long loops)
  - Single-node UMA, multi-node NUMA (no caching of remote data)
  - Global pointers: low latency, high bandwidth
- All gets/puts must be loads/stores (directly or via the shmem interface)
  - Only puts are "non-blocking"; gets are blocking
- Vectorization is crucial
  - Vector pipeline is 2x faster than scalar
  - Needed to utilize memory bandwidth: strided accesses, scatter-gather, reductions, etc.
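The shmem interface mentioned above looks roughly like this (standard SHMEM calls, our example; the header is <mpp/shmem.h> on Cray systems, <shmem.h> elsewhere, and two or more PEs are assumed):

    #include <mpp/shmem.h>

    long dest[8];                            /* symmetric: same address on every PE */

    void exchange(void) {
        static long src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        shmem_long_put(dest, src, 8, 1);     /* put is "non-blocking" ...           */
        shmem_quiet();                       /* ... so completion needs an explicit sync */
        long v = shmem_long_g(&dest[0], 1);  /* get is blocking: v is ready on return */
        (void)v;
    }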


GASNet Communication System: Architecture
- 2-level architecture to ease implementation
- Core API:
  - Based heavily on Active Messages
  - Compatibility layer
  - Ported to the X1 in 2 days; a new algorithm to manipulate queues in shared memory
- Extended API:
  - Wider interface that includes more complicated operations (puts, gets)
  - A reference implementation of the extended API in terms of the core API is provided
  - Current revision is tuned especially for the X1, with shared memory as the primary focus (minimal overhead)
(Figure: compiler-generated code, compiler-specific runtime system, GASNet extended API, GASNet core API, network hardware)
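The core API's Active Message style, sketched in rough GASNet 1.x terms (our example; exact signatures are in the GASNet spec, and htable must be registered via gasnet_attach() after gasnet_init()): the requester names a handler index, and the handler runs on the target node.

    #include <gasnet.h>

    #define HIDX_PING 201                 /* client handler indices are 128..255 */

    static volatile int got_ping = 0;

    /* runs on the node the request was sent to */
    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
        got_ping = (int)arg;
    }

    static gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)())ping_handler },  /* passed to gasnet_attach() */
    };

    void send_ping(gasnet_node_t peer) {
        gasnet_AMRequestShort1(peer, HIDX_PING, 42);
    }

    void wait_for_ping(void) {
        GASNET_BLOCKUNTIL(got_ping != 0); /* poll the network until the handler ran */
    }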

GASNet Extended API: Remote Memory Operations
- GASNet offers expressive put/get primitives
  - All gets/puts can be blocking or non-blocking
  - Non-blocking can be explicit (handle-based)
  - Non-blocking can be implicit (global or region-based)
  - Synchronization can poll or block
  - Paves the way for complex split-phase communication (compiler optimizations)
- Cray X1 uses exclusively shared memory
  - All gets/puts must be loads/stores
  - Only puts are "non-blocking"; gets are blocking
  - Very limited synchronization mechanisms
  - Efficient communication only through vectors (one order of magnitude between scalar and vector communication)
  - Vectorization instead of split-phase?
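The primitives below are from the published GASNet extended API; the surrounding functions are our sketch of typical split-phase usage, in both the explicit (handle-based) and implicit forms.

    #include <gasnet.h>

    extern void independent_compute(void);

    void split_phase_get(void *lbuf, gasnet_node_t node, void *raddr, size_t nbytes) {
        /* explicit non-blocking get: returns a handle immediately */
        gasnet_handle_t h = gasnet_get_nb(lbuf, node, raddr, nbytes);

        independent_compute();    /* overlap: compute while the data is in flight */

        gasnet_wait_syncnb(h);    /* split-phase sync: block until completion */
    }

    void implicit_puts(gasnet_node_t node, void *raddr, void *lbuf, size_t nbytes) {
        /* implicit-handle puts: no handles to track individually */
        gasnet_put_nbi(node, raddr, lbuf, nbytes);
        gasnet_put_nbi(node, (char *)raddr + nbytes, lbuf, nbytes);
        gasnet_wait_syncnbi_puts();   /* sync all outstanding implicit puts */
    }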

Unified Parallel C at LBNL/UCB GASNetCray X1 InstructionComment Bulk operationsVector bcopy()Fully vectorized, suitable for GASNet/UPC Non-bulk blocking putsStore + gsyncNo vectorization Non-bulk blocking getsLoad Non-bulk Non-blocking explicit puts/gets Store/load + gsyncNo vectorization if sync done in the loop Non-bulk Non-blocking implicit puts/gets Store/load + gsyncNo vectorization if sync done in the loop GASNet and Cray X1 Remote memory operations Flexible communications provides no benefit without vectorization (factor of 10 between vector and scalar) Difficult to expose vectorization through a layered software stack: Native C compiler now has to optimize parallel code! Cray X1 “big hammer” gsync() prevents interesting communication optimizations

GASNet/X1 Performance
- GASNet/X1 improves small-message performance
- Minimal overhead as a "portable network assembly language"
- The core API (Active Messages) solves the Cray problem of upc_global_alloc (non-collective memory allocation)
- Synthetic benchmarks show no GASNet interference, but that is not necessarily the case for application benchmarks
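upc_global_alloc is the non-collective allocator in question (standard UPC; the example below is ours): a single thread allocates a shared array that is distributed across all threads, which is awkward to support on a purely symmetric heap and is why an Active Message to each node helps.

    #include <upc.h>

    shared [16] int * shared grid;    /* one shared copy of the pointer itself */

    void setup(void) {
        if (MYTHREAD == 0)
            /* non-collective: thread 0 alone allocates THREADS blocks of
               16 ints, distributed round-robin across all threads */
            grid = (shared [16] int *)upc_global_alloc(THREADS, 16 * sizeof(int));
        upc_barrier;                  /* afterwards every thread may use grid */
    }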


Shared Pointer Representations
- The Cray X1 "memory centrifuge" is useful for UPC
- UPC phaseless pointers can be manipulated directly as X1 global pointers allocated from the symmetric heap
- Heavy function inlining and macros remove all traces of UPC runtime and GASNet calls
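Illustratively (this struct layout is our sketch, not the runtime's actual definition), the contrast between the portable representation and the X1 phaseless case:

    #include <stdint.h>

    /* portable generic pointer-to-shared: the (address, thread, phase) triple */
    typedef struct {
        uintptr_t addr;      /* location within the owning thread's shared segment */
        uint32_t  thread;    /* which thread's partition */
        uint32_t  phase;     /* offset within the current block */
    } pshared_generic_t;

    /* On the X1, a phaseless (cyclic or indefinite) pointer can instead be a
       plain 64-bit global pointer into the symmetric heap: the "memory
       centrifuge" decodes the owning node from the address bits, so pointer
       arithmetic and dereference compile to ordinary loads/stores. */
    typedef int *pshared_x1_phaseless_t;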

Cost of Shared Pointer Arithmetic and Accesses


Serial Performance
- It's all about vectorization
  - Cray C is highly sensitive to changes in the inner loop
  - Want the translator's output to be as vectorizable as the original C source
- Strategy: keep translated code syntactically close to the source
  - Preserve high-level loops
  - a[exp] becomes *(a + exp)
  - Multidimensional arrays are linearized
  - Preserve the restrict qualifier and ivdep pragmas
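For instance (our illustration of the strategy; the exact ivdep pragma spelling varies by compiler, e.g. Cray C accepts a vendor-prefixed form), a source loop and a translated form that stays vectorizable:

    /* original inner loop, as the programmer wrote it */
    void scale(double *restrict a, const double *restrict b, int n) {
    #pragma ivdep
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }

    /* the translator's output keeps the same shape: a[i] becomes *(a + i),
       the loop stays a simple for, and restrict/ivdep are preserved */
    void scale_translated(double *restrict a, const double *restrict b, int n) {
    #pragma ivdep
        for (int i = 0; i < n; i++)
            *(a + i) = 2.0 * *(b + i);
    }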

Livermore Loop Kernels

Evaluating Communication Optimizations on the Cray X1
- Message aggregation
  - LogGP model: fewer messages means less overhead
  - Techniques: message vectorization (sketched below), coalescing, bulk prefetching
- Still true for the Cray X1?
  - Remote access latency is comparable to local accesses
  - Vectorization should hide most of the overhead of small messages
  - Remote data is not cache-coherent, so it may still help to store it into local buffers
- Essentially, a question of fine-grained vs. coarse-grained programming model
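A sketch of message vectorization in UPC terms (our example, using the standard upc_memget bulk primitive, and assuming THREADS > 1): N fine-grained remote reads become one bulk transfer into a local buffer.

    #include <upc.h>

    #define N 1024
    shared [N] double row[N * THREADS];   /* block size N: block i on thread i */

    double sum_fine_grained(void) {       /* N separate fine-grained remote reads */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += row[N + i];              /* each access touches thread 1 */
        return s;
    }

    double sum_vectorized(void) {         /* one aggregated transfer, local math */
        double buf[N];
        upc_memget(buf, &row[N], N * sizeof(double));  /* single bulk get */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += buf[i];
        return s;
    }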

NAS CG: OpenMP Style vs. MPI Style
- The fine-grained (OpenMP-style) version is still slower
  - The shared-memory programming style leads to more overhead (redundant boundary computation)
- UPC's hybrid programming model can really help

More Optimizations
- Overlapping communication and computation
  - Hides communication latency with independent computation
  - Examples: communication scheduling, message pipelining
  - Requires split-phase operations: try to separate the sync() as far as possible from the non-blocking get/put
- But the Cray X1 lacks support for non-blocking gets
  - No user- or compiler-level overlapping
  - All communication optimizations rely on vectorization (e.g., gups)
  - Vectorization is too restrictive in our opinion: it gives up on pointer code and sync(), bulk-synchronous programs, etc.

Conclusion
- We have an efficient UPC implementation on the Cray X1
- Evaluation of the Cray X1 for GAS languages:
  + Great latency/bandwidth for both local and remote memory operations
  + Remote communication is transparent through global loads and stores
  - The lack of split-phase gets means losing optimization opportunities
  - Poor user-level support for communication and synchronization of remote operations (no prefetching, no non-binding or per-operation completion mechanisms)
  - Heavy reliance on vectorization for performance: great when it happens, not so great otherwise (slow scalar processor)
  - The first platform that is more sensitive to the translated code than to communication/computation scheduling
  ± First possible mismatch for GASNet between semantics and platform; we are hoping the X2 can address our concerns