Unified Parallel C at LBNL/UCB The Berkeley UPC Compiler: Implementation and Performance Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick the LBNL/Berkeley UPC Group
Unified Parallel C at LBNL/UCB Outline An Overview of UPC Design and Implementation of the Berkeley UPC Compiler Preliminary Performance Results Communication Optimizations
Unified Parallel C at LBNL/UCB Unified Parallel C (UPC) UPC is an explicitly parallel global address space language with SPMD parallelism -An extension of C -Shared memory is partitioned by threads -One-sided (bulk and fine-grained) communication through reads/writes of shared variables Collective Efforts by industry, academia, and government - Shared Global address space X[0] Private X[1] X[P]
Unified Parallel C at LBNL/UCB UPC Programming Model Features Block cyclically distributed arrays Shared and private pointers Global synchronization -- barriers Pair-wise synchronization – locks Parallel loops dynamic shared memory allocation Bulk Shared Memory accesses Strict vs. Relaxed memory consistency models
Unified Parallel C at LBNL/UCB Design and Implementation of the Berkeley UPC Compiler
Unified Parallel C at LBNL/UCB Overview of the Berkeley UPC Compiler Translator UPC Code Translator Generated C Code Berkeley UPC Runtime System GASNet Communication System Network Hardware Platform- independent Network- independent Compiler- independent Language- independent Two Goals: Portability and High-Performance Lower UPC code into ANSI-C code Shared Memory Management and pointer operations Uniform get/put interface for underlying networks
Unified Parallel C at LBNL/UCB A Layered Design Portable: -C is our intermediate language -Can run on top of MPI (with performance penalty) -GASNet has a layered design with a small core High-Performance: -Native C compiler optimizes serial code -Translator can perform high-level communication optimizations -GASNet can access network hardware directly
Unified Parallel C at LBNL/UCB Implementing the UPC to C Translator Preprocessed File C front end Whirl w/ shared types Backend lowering Whirl w/ runtime calls Whirl2c ANSI-compliant C Code Based on the Open64 compiler Source to source transformation Convert shared memory operations into runtime library calls Designed to incorporate existing optimization framework in open64 Communicate with runtime via a standard API
Unified Parallel C at LBNL/UCB Shared Arrays and Pointers in UPC Cyclicshared int A[n]; Block Cyclicshared [2] int B[n]; Indefiniteshared [0] int * C = (shared [0] int *) upc_alloc(n); Use global pointers (pointer-to-shared) to access shared (possibly remote) data -Block size part of the pointer type -A generic pointer-to-shared contains: -Address, Thread id, Phase A[0] A[2] A[4] … B[0] B[1] B[4] B[5]… C[0] C[1] C[2] … A[1] A[3] A[5] … B[2] B[3] B[6] B[7]… T0 T1
Unified Parallel C at LBNL/UCB Thread 1Thread N -1 Address ThreadPhase 0 2addr Phase Shared Memory Thread 0 block size start of array object … … Accessing Shared Memory in UPC start of block
Unified Parallel C at LBNL/UCB Phaseless Pointer-to-Shared A pointer needs a “phase” to keep track of where it is in a block -Source of overhead for pointer arithmetic Special case for “phaseless” pointers: Cyclic + Indefinite -Cyclic pointers always have phase 0 -Indefinite pointers only have one block -Don’t need to keep phase in pointer operations for cyclic and indefinite -Don’t need to update thread id for indefinite pointer arithmetic
Unified Parallel C at LBNL/UCB Pointer-to-Shared Representation Pointer Size -Want to allow pointers to reside in a register -But very large machines may require a longer representation Datatype -Use of scalar types (long) rather than a struct may improve backend code quality -Faster pointer manipulation, e.g., ptr+int as well as dereferencing Portability and performance balance in UPC compiler -8-byte scalar vs. struct format (configuration time option) -The pointer representation is hidden in the runtime layer -Modular design means easy to add new representations
Unified Parallel C at LBNL/UCB Performance Results
Unified Parallel C at LBNL/UCB Performance Evaluation Testbed -HP AlphaServer (1GHz processor), with Quadrics interconnect -Compaq C compiler for the translated C code -Compare with HP UPC 2.1 Cost of Language Features -Shared pointer arithmetic, shared memory accesses, parallel loops, etc Application Benchmarks -EP: no communication -IS: large bulk memory operations - MG: bulk memget -CG: fine-grained vs. bulk memput Potentials of Optimizations -Measure the effectiveness of various communication optimizations
Unified Parallel C at LBNL/UCB Performance of Shared Pointer Arithmetic Phaseless pointer an important optimization Packed representation also helps. 1 cycle = 1ns Struct: 16 bytes
Unified Parallel C at LBNL/UCB Cost of Shared Memory Access Local accesses somewhat slower than private accesses Layered design does not add additional overhead Remote accesses a few magnitude worse than local
Unified Parallel C at LBNL/UCB Parallel Loops in UPC UPC has a “forall” construct for distributing computation shared int v1[N], v2[N], v3[N]; upc_forall(i=0; i < N; i++; &v3[i]) v3[i] = v2[i] + v1[i]; Affinity tests performed on every iteration to decide if it should execute Two kinds of affinity expressions: -Integer (compare with thread id) -Shared address (check the affinity of address) Exp TypenoneintegerShared address # Cycles61710
Unified Parallel C at LBNL/UCB Application Performance
Unified Parallel C at LBNL/UCB NAS Benchmarks (EP and IS) EP shows the backend C compiler can still successfully optimize translated C code IS shows Berkeley UPC compiler is effective for communication operations
Unified Parallel C at LBNL/UCB NAS Benchmarks (CG and MG) Berkeley UPC compiler scales well
Unified Parallel C at LBNL/UCB Performance of Fine-grained Applications Doesn’t scale well due to nature of the benchmark (lots of small reads) HP UPC’s software caching helps performance
Unified Parallel C at LBNL/UCB Observations on the Results Acceptable worst-case overhead for shared memory access latencies -< 10 cycle overhead for shared local accesses -~ 1.5 usec overhead compared to end-to-end network latency Optimizations on pointer-to-shared representation are effective -Both phaseless pointers and packed 8-byte format Good performance compared to HP UPC 2.1
Unified Parallel C at LBNL/UCB Communication Optimizations for UPC
Unified Parallel C at LBNL/UCB Communication Optimizations Hiding Communication Latencies Use of non-blocking operations Possible placement analysis, by separating get(), put() as far as possible from sync() Message pipelining to overlap communication with more communication Optimizing Shared Memory Accesses Eliminating locality test for local shared pointers: flow- and context-sensitive analysis Transforming forall loops into equivalent for loops Eliminate redundant pointer arithmetic for pointers with same thread and phase
Unified Parallel C at LBNL/UCB More Optimizations Message Vectorization and Aggregation -Scatter/gather techniques -Packing generally pays off for small (< 500 byte) messages Software Caching and Prefetching -A prototype implementation: -Local knowledge only – no coherence messages -Cache remote reads and buffer outgoing writes -Based on the weak ordering consistency model
Unified Parallel C at LBNL/UCB Example: Optimizing Loops for(…) { init 1; sync1; compute1; write1; init 2; sync 2; compute 2; write 2; ……. } (base) for (…) { init 1; init 2; init 3; …. sync_all; compute all; } (read pipelining) for (…) { init 1; init 2; sync 1; compute 1; …. } (Communication/ Computation Overlap)
Unified Parallel C at LBNL/UCB Experimental Results Computation/communication overlap better than communication/communication overlap, for Quadrics Results likely different for other networks
Unified Parallel C at LBNL/UCB Example: Optimizing Local Shared Accesses shared [B] int a1[N], a2[N], a3[N]; forall (i = 0; i < N; i++; &a[i]) { a[i] = a1[i] + a2[i]; } (base) int *l1, *l2, *l3; for (i = MYTHREAD; i < N/THREADS; i+=THREADS) { l1 = (int*) a1[i*B]; l2 = (int*) a2[i*B]; l3 = (int*) a3[i*B]; for (j = 0; j < B; j++) l1[j] = l2[j] + l3[j]; } (Use of private pointers)
Unified Parallel C at LBNL/UCB Experimental Results Neither compiler performs well for naïve version Culprit: pointer-to-shared operations + affinity tests Privatizing local shared accesses improves performance by an order of magnitude
Unified Parallel C at LBNL/UCB Compiler Status A fully UPC 1.1 compliant public release in April Supported Platforms: -HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin2000, Solaris Sparc/x86, Mac OSX PowerPC Supported Networks: -Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI A release this summer will include: -Pthreads/System V shared memory support -GASNet Infiniband support
Unified Parallel C at LBNL/UCB Conclusion The Berkeley UPC Compiler achieves both portability and good performance -Layered, modular design -Effective pointer-to-shared optimizations -Good performance compared to commercially available UPC compiler -Still lots of opportunities for communication optimizations -Available for download at: