The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
the LBNL/Berkeley UPC Group

Outline
- An Overview of UPC
- Design and Implementation of the Berkeley UPC Compiler
- Preliminary Performance Results
- Communication Optimizations

Unified Parallel C (UPC)
- UPC is an explicitly parallel global address space language with SPMD parallelism:
  - An extension of C
  - Shared memory is partitioned among threads
  - One-sided (bulk and fine-grained) communication through reads/writes of shared variables
- A collective effort by industry, academia, and government
[Figure: the partitioned global address space; each thread contributes a shared partition (X[0] ... X[P]) and keeps its own private memory region]

UPC Programming Model Features
- Block-cyclically distributed shared arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops (upc_forall)
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
A small example exercising several of these features follows.
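A minimal sketch (not from the original slides) combining shared arrays, a parallel loop, a barrier, a bulk read, and a lock; it assumes a static-THREADS compilation so the shared array dimensions need not mention THREADS:

    #include <upc.h>

    #define N 1024
    shared int cyc[N];          /* cyclic layout: element i lives on thread i % THREADS */
    shared [4] int blk[N];      /* block-cyclic layout, block size 4 */
    upc_lock_t *lock;           /* pair-wise synchronization */

    int main(void) {
        int i, local[4];
        lock = upc_all_lock_alloc();            /* collective lock creation */
        upc_forall (i = 0; i < N; i++; &blk[i]) /* parallel loop: owner computes */
            blk[i] = MYTHREAD;
        upc_barrier;                            /* global synchronization */
        upc_memget(local, &blk[0], sizeof(local)); /* bulk shared-memory read */
        if (MYTHREAD == 0) upc_lock_free(lock);
        return 0;
    }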

Design and Implementation of the Berkeley UPC Compiler

Overview of the Berkeley UPC Compiler
- Two goals: portability and high performance
[Figure: the compilation stack. UPC Code -> Translator -> Translator-Generated C Code -> Berkeley UPC Runtime System -> GASNet Communication System -> Network Hardware; the layers are labeled platform-independent, network-independent, compiler-independent, and language-independent]
- Translator: lowers UPC code into ANSI C code
- Runtime: shared memory management and pointer-to-shared operations
- GASNet: a uniform get/put interface over the underlying networks

A Layered Design
- Portable:
  - C is our intermediate language
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
- High-performance:
  - The native C compiler optimizes the serial code
  - The translator can perform high-level communication optimizations
  - GASNet can access network hardware directly

Implementing the UPC-to-C Translator
- Pipeline: Preprocessed file -> C front end -> WHIRL with shared types -> Backend lowering -> WHIRL with runtime calls -> whirl2c -> ANSI-compliant C code
- Based on the Open64 compiler
- Source-to-source transformation: converts shared memory operations into runtime library calls
- Designed to incorporate the existing optimization framework in Open64
- Communicates with the runtime via a standard API
A sketch of what this lowering looks like appears below.
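For intuition only, a sketch of the kind of rewrite the translator performs. The runtime call and type names below are hypothetical placeholders, not the actual Berkeley UPC runtime API:

    /* Original UPC: */
    shared int x;
    x = 42;
    int y = x;

    /* Hypothetical translated C (illustrative names only): */
    upcr_shared_ptr_t x_ps;   /* pointer-to-shared referring to x */
    upcr_put_shared_val(x_ps, /*offset*/ 0, 42, sizeof(int));
    int y = (int) upcr_get_shared_val(x_ps, /*offset*/ 0, sizeof(int));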

Shared Arrays and Pointers in UPC
- Cyclic:       shared int A[n];
- Block-cyclic: shared [2] int B[n];
- Indefinite:   shared [0] int *C = (shared [0] int *) upc_alloc(n);
- Use global pointers (pointer-to-shared) to access shared (possibly remote) data
  - The block size is part of the pointer type
  - A generic pointer-to-shared contains: address, thread id, phase
[Figure: layout across two threads. T0 holds A[0] A[2] A[4] ..., B[0] B[1] B[4] B[5] ..., and C[0] C[1] C[2] ...; T1 holds A[1] A[3] A[5] ... and B[2] B[3] B[6] B[7] ...]

Accessing Shared Memory in UPC
[Figure: a pointer-to-shared with fields (address, thread, phase) selecting an element of a shared array spread over threads 0 ... N-1; the address points into a block on the owning thread, the phase gives the offset from the start of the current block, and the block size determines where each block ends relative to the start of the array object]

Phaseless Pointer-to-Shared
- A pointer needs a "phase" to keep track of where it is within a block
  - A source of overhead for pointer arithmetic
- Special case for "phaseless" pointers: cyclic and indefinite
  - Cyclic pointers always have phase 0
  - Indefinite pointers only have one block
  - No need to keep the phase in pointer operations for cyclic and indefinite pointers
  - No need to update the thread id in indefinite pointer arithmetic
The sketch below contrasts the arithmetic in the three cases.
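A sketch (assumed struct layout and field names, not the actual runtime's) of why phaseless pointers make p++ cheaper:

    #include <stddef.h>

    typedef struct { size_t addr; int thread; int phase; } ps_t;  /* assumed */

    /* Generic (block-cyclic) p++ : full phase/thread bookkeeping. */
    void inc_generic(ps_t *p, int bsize, int nthr, size_t esz) {
        if (++p->phase == bsize) {             /* crossed a block boundary */
            p->phase = 0;
            if (++p->thread == nthr) {         /* wrapped past the last thread */
                p->thread = 0;
                p->addr += esz;                /* first element of the next block row */
            } else {
                p->addr -= (bsize - 1) * esz;  /* same block row, next thread */
            }
        } else {
            p->addr += esz;                    /* next element in the same block */
        }
    }

    /* Cyclic p++ : phase is always 0, so only the thread id moves. */
    void inc_cyclic(ps_t *p, int nthr, size_t esz) {
        if (++p->thread == nthr) { p->thread = 0; p->addr += esz; }
    }

    /* Indefinite p++ : one block on one thread, a plain address bump. */
    void inc_indefinite(ps_t *p, size_t esz) { p->addr += esz; }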

Pointer-to-Shared Representation
- Pointer size
  - Want to allow pointers to reside in a register
  - But very large machines may require a longer representation
- Datatype
  - Using a scalar type (long) rather than a struct may improve backend code quality
  - Faster pointer manipulation (e.g., ptr + int) as well as dereferencing
- Balancing portability and performance in the UPC compiler
  - 8-byte scalar vs. struct format (a configure-time option)
  - The pointer representation is hidden in the runtime layer
  - The modular design makes it easy to add new representations
A sketch of a packed scalar format follows.
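A sketch of how a packed 8-byte format might divide its bits; the field widths below are arbitrary assumptions for illustration, not the actual Berkeley UPC encoding:

    #include <stdint.h>

    /* Assumed split: 16-bit thread, 16-bit phase, 32-bit address offset. */
    typedef uint64_t packed_ps_t;

    static inline packed_ps_t ps_pack(uint64_t thr, uint64_t ph, uint64_t addr) {
        return (thr << 48) | (ph << 32) | addr;
    }
    static inline unsigned ps_thread(packed_ps_t p) { return (unsigned)(p >> 48); }
    static inline unsigned ps_phase(packed_ps_t p)  { return (unsigned)(p >> 32) & 0xFFFFu; }
    static inline uint64_t ps_addr(packed_ps_t p)   { return p & 0xFFFFFFFFu; }

    /* An indefinite (phaseless) pointer bump then compiles to one integer
       add -- p = p + n * elemsz; -- as long as the address field cannot
       overflow into the phase bits. */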

Performance Results

Performance Evaluation
- Testbed
  - HP AlphaServer (1 GHz processors) with a Quadrics interconnect
  - Compaq C compiler for the translated C code
  - Compared with HP UPC 2.1
- Cost of language features
  - Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
- Application benchmarks
  - EP: no communication
  - IS: large bulk memory operations
  - MG: bulk memget
  - CG: fine-grained vs. bulk memput
- Potential of optimizations
  - Measure the effectiveness of various communication optimizations

Performance of Shared Pointer Arithmetic
- Phaseless pointers are an important optimization
- The packed representation also helps
[Chart notes: 1 cycle = 1 ns; the struct format is 16 bytes]

Cost of Shared Memory Access
- Local shared accesses are somewhat slower than private accesses
- The layered design does not add additional overhead
- Remote accesses are a few orders of magnitude slower than local ones

Parallel Loops in UPC
- UPC has a "forall" construct for distributing computation:

    shared int v1[N], v2[N], v3[N];
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

- An affinity test is performed on every iteration to decide whether it should execute
- Two kinds of affinity expressions:
  - Integer (compared with the thread id)
  - Shared address (check the affinity of the address)

  Affinity expression:  none  integer  shared address
  # cycles:             6     17       10

Both affinity forms are shown in the sketch below.
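A small sketch (not from the slides) of the two affinity forms; the array and bounds are illustrative, and a static-THREADS compilation is assumed:

    shared int v[100];   /* cyclic layout: v[i] owned by thread i % THREADS */
    int i;

    /* Integer affinity: iteration i runs on thread (i % THREADS). */
    upc_forall (i = 0; i < 100; i++; i)
        v[i] = i;

    /* Shared-address affinity: iteration i runs on the owner of v[i]
       (the same thread as above here, since v has a cyclic layout). */
    upc_forall (i = 0; i < 100; i++; &v[i])
        v[i] = i;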

Application Performance

NAS Benchmarks (EP and IS)
- EP shows that the backend C compiler can still successfully optimize translated C code
- IS shows that the Berkeley UPC compiler is effective for communication operations

NAS Benchmarks (CG and MG)
- The Berkeley UPC compiler scales well

Performance of Fine-grained Applications
- Does not scale well, due to the nature of the benchmark (lots of small reads)
- HP UPC's software caching helps its performance

Observations on the Results
- Acceptable worst-case overhead for shared memory access latencies
  - Less than 10 cycles of overhead for local shared accesses
  - About 1.5 usec of overhead on top of the end-to-end network latency
- The pointer-to-shared optimizations are effective
  - Both phaseless pointers and the packed 8-byte format
- Good performance compared to HP UPC 2.1

Communication Optimizations for UPC

Communication Optimizations
- Hiding communication latencies
  - Use of non-blocking operations
  - Possible placement analysis, separating get()/put() as far as possible from the matching sync()
  - Message pipelining to overlap communication with more communication
- Optimizing shared memory accesses
  - Eliminating locality tests for local shared pointers: flow- and context-sensitive analysis
  - Transforming upc_forall loops into equivalent for loops
  - Eliminating redundant pointer arithmetic for pointers with the same thread and phase

More Optimizations
- Message vectorization and aggregation
  - Scatter/gather techniques
  - Packing generally pays off for small (< 500 byte) messages
- Software caching and prefetching
  - A prototype implementation:
    - Local knowledge only; no coherence messages
    - Caches remote reads and buffers outgoing writes
    - Based on the weak-ordering consistency model

Example: Optimizing Loops

    /* (base) */
    for (...) {
        init 1; sync 1; compute 1; write 1;
        init 2; sync 2; compute 2; write 2;
        ...
    }

    /* (read pipelining) */
    for (...) {
        init 1; init 2; init 3; ...
        sync_all;
        compute all;
    }

    /* (communication/computation overlap) */
    for (...) {
        init 1; init 2;
        sync 1; compute 1;
        ...
    }
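As a concrete sketch of the overlap pattern, using the non-blocking bulk-copy extensions Berkeley UPC provides (bupc_memget_async / bupc_waitsync); the header name, arrays, sizes, and the compute() helper are illustrative assumptions:

    #include <upc.h>
    #include <bupc_extensions.h>    /* assumed location of the async extensions */

    #define M 256
    shared [M] double src[M*THREADS];   /* one block of M doubles per thread */
    double buf1[M], buf2[M];

    void compute(double *b, int n);     /* placeholder for local work */

    void pipelined_reads(void) {
        int n1 = (MYTHREAD + 1) % THREADS, n2 = (MYTHREAD + 2) % THREADS;
        /* init 1, init 2: issue both gets before waiting on either */
        bupc_handle_t h1 = bupc_memget_async(buf1, &src[n1*M], M*sizeof(double));
        bupc_handle_t h2 = bupc_memget_async(buf2, &src[n2*M], M*sizeof(double));
        bupc_waitsync(h1);              /* sync 1 */
        compute(buf1, M);               /* compute 1 overlaps the second get */
        bupc_waitsync(h2);              /* sync 2 */
        compute(buf2, M);
    }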

Experimental Results
- Computation/communication overlap is better than communication/communication overlap on Quadrics
- The results are likely different for other networks

Example: Optimizing Local Shared Accesses

    shared [B] int a1[N], a2[N], a3[N];

    /* (base) */
    upc_forall (i = 0; i < N; i++; &a1[i]) {
        a1[i] = a2[i] + a3[i];
    }

    /* (use of private pointers) */
    int *l1, *l2, *l3;
    for (i = MYTHREAD; i < N/B; i += THREADS) {  /* one iteration per locally owned block */
        l1 = (int *) &a1[i*B];                   /* cast local shared blocks to private pointers */
        l2 = (int *) &a2[i*B];
        l3 = (int *) &a3[i*B];
        for (j = 0; j < B; j++)
            l1[j] = l2[j] + l3[j];
    }

Experimental Results
- Neither compiler performs well on the naive version
  - The culprit: pointer-to-shared operations plus affinity tests
- Privatizing local shared accesses improves performance by an order of magnitude

Compiler Status
- A fully UPC 1.1 compliant public release in April
- Supported platforms:
  - HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
- Supported networks:
  - Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
- A release this summer will include:
  - Pthreads/System V shared memory support
  - GASNet Infiniband support

Conclusion
- The Berkeley UPC Compiler achieves both portability and good performance
  - Layered, modular design
  - Effective pointer-to-shared optimizations
  - Good performance compared to a commercially available UPC compiler
- Still lots of opportunities for communication optimizations
- Available for download at: