Unified Parallel C at LBNL/UCB

Compiler Optimizations in the Berkeley UPC Translator
Wei Chen, the Berkeley UPC Group

Overview of the Berkeley UPC Compiler

Layered implementation:

    UPC code
      -> Translator
      -> translator-generated C code
      -> Berkeley UPC runtime system
      -> GASNet communication system
      -> network hardware

The layers are designed to be platform-, network-, compiler-, and language-independent where possible.

Two goals: portability and high performance.

UPC Translation Process

    Preprocessed file
      -> C front end
      -> WHIRL with shared types
      -> PREOPT / LoopNestOpt  (WHIRL with analysis info)
      -> backend lowering
      -> WHIRL with runtime calls
      -> whirl2c
      -> ISO-compliant C code

WHIRL is the intermediate form of the Open64 infrastructure.

Preopt Phase

Enables the other optimization phases:
  - Loop Nest Optimization (LNO)
  - WHIRL optimizations (WOPT)
Cleans up the control flow:
  - Eliminates gotos (converting them to loops and ifs)
  - Sets up the CFG, def-use chains, and SSA form
  - Performs intraprocedural alias analysis
  - Identifies DO_LOOPs (including forall loops)
  - Performs the high-level optimizations (next slide)
  - Converts the CFG back to WHIRL
  - Reruns alias analysis

Optimizations in PREOPT

In their order of application:
  - Dead store elimination
  - Induction variable recognition
  - Global value numbering
  - Copy propagation (multiple passes)
  - Boolean expression simplification
  - Dead code elimination (multiple passes)
A lot of effort went into teaching the optimizer to work with UPC code:
  - Preserve casts involving shared types
  - Patch the high-level types for whirl2c's use
  - Convert shared pointer arithmetic into array accesses (illustrated below)
  - Various bug fixes
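As a hedged illustration of that last rewrite (illustrative C, not actual translator output), the same shared read is shown as pointer arithmetic and as the indexed access it is converted to:

    #include <upc.h>

    /* Illustrative only: the conversion of shared pointer arithmetic into an
     * array-style access, so later passes see an indexed reference. */
    int read_elem(shared int *p, int i)
    {
        int x = *(p + i);   /* shared pointer arithmetic form */
        int y = p[i];       /* equivalent array access after the rewrite */
        return x + y;
    }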

Loop Nest Optimizer (LNO)

Operates on High WHIRL:
  - Structured control flow and explicit arrays
  - Intraprocedural
Converts pointer expressions into 1D array accesses.
Optimizes DO_LOOP nodes, which require:
  - A single index variable
  - An integer comparison as the end condition
  - A loop-invariant increment
  - No function calls, break, or continue in the loop body
(A loop that fits this shape, and one that does not, are sketched below.)
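A small illustration of the DO_LOOP shape, in plain C (function and variable names are illustrative):

    #include <math.h>

    void scale(int n, double *a, const double *b)
    {
        int i;

        /* Fits the DO_LOOP shape: one index variable, an integer comparison
         * as the end condition, a loop-invariant increment, and no calls or
         * early exits in the body. */
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];

        /* Not a DO_LOOP candidate: the body contains a call and a break. */
        for (i = 0; i < n; i++) {
            if (a[i] < 0.0)
                break;
            a[i] = sqrt(a[i]);
        }
    }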

Loop Optimizations

Use a representation separate from preopt's:
  - Access vectors
  - Array dependence graphs
  - Dependence vectors
  - Regions for array accesses
  - A cache model for tiling loops and changing loop order
Long list of optimizations:
  - Unrolling, interchange, fission/fusion, tiling, parallelization, prefetching, etc. (interchange is illustrated below)
  - May need a performance model for a distributed environment
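As a hedged illustration of one transformation from the list (loop interchange, in plain C; this is not LNO output), swapping the loops turns a strided walk over a row-major array into a unit-stride one:

    #define N 1024

    void interchange_example(double a[N][N], double b[N][N])
    {
        int i, j;

        /* Before: the inner loop varies i, so a[i][j] is accessed with a
         * stride of N doubles. */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                a[i][j] = 2.0 * b[i][j];

        /* After interchange: the inner loop varies j, giving unit stride. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 2.0 * b[i][j];
    }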

Motivation for the Optimizer

Translator optimizations are necessary to improve UPC performance:
  - The backend C compiler cannot optimize communication code
  - The translator is one step closer to the user program
The goal is to extend the code base to build UPC-specific optimizations and analyses:
  - PRE on shared pointer arithmetic and casts
  - Communication scheduling
  - Forall loop optimizations
  - Message coalescing

Message Coalescing

Implemented in a number of parallel Fortran compilers (e.g., HPF, F90).
Idea: replace individual puts/gets with bulk calls that move the remote data into a private buffer.
Targets the memget/memput interface, as well as the new UPC runtime memory copy library functions.
The goal is to speed up shared-memory-style code.

Unoptimized loop:

    shared [0] int *r;
    ...
    for (i = L; i < U; i++)
        exp1 = exp2 + r[i];

Optimized loop:

    int lr[U-L];
    ...
    upcr_memget(lr, &r[L], U-L);
    for (i = L; i < U; i++)
        exp1 = exp2 + lr[i-L];

Analysis for Coalescing

Handles multiple loop nests.
For each array referenced in the loop:
  - Compute a bounding box (lo, up) of its index values
  - Multiple accesses to the same array are handled (e.g., ar[i] and ar[i+1] get the same (lo, up) pair; see the example below)
Requirements:
  - Loop bounds must be loop-invariant
  - Indices must be affine
  - No "bad" memory accesses (pointers)
  - Watch for strict accesses / synchronization operations in the loop, where reordering is illegal
Current limitations:
  - Bounds cannot contain field accesses (e.g., a + b is fine, but not a.x)
  - The base address must be a pointer or array variable
  - No arrays of structs, and no array fields inside structs
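A small, hedged example of the bounding-box idea (the names ar, L, and U are illustrative): both references to ar below are covered by a single (lo, up) pair, so one bulk get of ar[L .. U-1] would suffice.

    #include <upc.h>

    int sum_pairs(shared [] int *ar, int L, int U)
    {
        int i, s = 0;
        for (i = L; i < U - 1; i++)
            s += ar[i] + ar[i + 1]; /* ar[i] spans [L, U-2]; ar[i+1] spans [L+1, U-1] */
        return s;                   /* combined bounding box: lo = L, up = U-1 */
    }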

Basic Case: 1D Indefinite Array

Use memget to fetch the up - lo + 1 contiguous elements src[lo .. up] from the source thread into a private dst buffer.
Change shared array accesses into private ones, with index translation if lo != 0.
Unit-stride writes are coalesced the same way as reads, except that memput() is called at loop exit.
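A minimal sketch of the generated code shape for this case, assuming the upc_memget interface (buffer and function names are illustrative):

    #include <upc.h>

    int coalesced_sum(shared [] int *ar, int lo, int up)
    {
        int n = up - lo + 1;
        int buf[n];                                /* private copy of ar[lo .. up] */
        int i, sum = 0;

        upc_memget(buf, &ar[lo], n * sizeof(int)); /* one bulk get of n elements */
        for (i = lo; i <= up; i++)
            sum += buf[i - lo];                    /* index translation: i - lo */
        return sum;
    }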

Coalescing Block-Cyclic Arrays

May need communication with multiple threads: the source blocks are spread across threads and are gathered into a private dst buffer.
Call memgets on the individual threads to get their contiguous data:
  - Copy in units of blocks to simplify the math
  - Number of blocks per thread = ceil(total_blk / THREADS)
  - Temporary buffer: dst_tmp[THREADS][blk_per_thread]
  - Overlapped memgets fill dst_tmp from each thread
  - Pack the contents of dst_tmp into the dst array following the shared pointer arithmetic rule: the first block comes from T0, the second from T1, and so on
(A sketch of the per-thread gets and the packing loop follows.)
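A minimal sketch of this scheme, assuming a points to the first element of a [B]-blocked array whose length n is a multiple of B * THREADS, so every thread owns exactly blk_per_thread full blocks (the general ceil() case, partial blocks, and the overlap of the gets are omitted; names are illustrative, not translator output):

    #include <upc.h>

    #define B 4                                    /* block size, illustrative */

    void coalesce_block_cyclic(shared [B] int *a, int *dst, int n)
    {
        int blk_per_thread = (n / B) / THREADS;
        int tmp[THREADS][blk_per_thread * B];      /* dst_tmp in the slide */
        int t, g;

        /* One get per thread: &a[t*B] is the first element with affinity to
         * thread t, and that thread's blocks are contiguous in its local memory. */
        for (t = 0; t < THREADS; t++)
            upc_memget(tmp[t], &a[t * B], blk_per_thread * B * sizeof(int));

        /* Pack back into linear order: the first block of dst comes from T0,
         * the second from T1, and so on. */
        for (g = 0; g < n; g++) {
            int gb = g / B, off = g % B;
            dst[g] = tmp[gb % THREADS][(gb / THREADS) * B + off];
        }
    }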

Coalescing 2D Indefinite Arrays

    for (i = L1; i <= U1; i++)
        for (j = L2; j <= U2; j++)
            exp = ar[i][j];

Fetch the entire (U1-L1+1) x (U2-L2+1) rectangular box of ar at once.
Use upc_memget_fstrided(), which takes the address, stride, and length of both the source and the destination.
Alternative scheme:
  - Optimize the inner loop by fetching one row at a time (sketched below)
  - Pipeline the outer loop to overlap the memgets on each row
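A minimal sketch of the row-at-a-time alternative, assuming an indefinite (single-thread) row-major 2D array of width ncols addressed through a flat pointer; the pipelining of the per-row gets is not shown, and the names are illustrative:

    #include <upc.h>

    long sum_box(shared [] int *ar, int ncols,
                 int L1, int U1, int L2, int U2)
    {
        int row[U2 - L2 + 1];                      /* private buffer for one row */
        long s = 0;
        int i, j;

        for (i = L1; i <= U1; i++) {
            /* fetch ar[i][L2 .. U2] with one bulk get instead of U2-L2+1 reads */
            upc_memget(row, &ar[i * ncols + L2], (U2 - L2 + 1) * sizeof(int));
            for (j = L2; j <= U2; j++)
                s += row[j - L2];
        }
        return s;
    }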

Handling Strided Accesses

Want to avoid fetching the unused elements between lo and up.
Indefinite array:
  - A special case of upc_memget_fstrided
Block-cyclic array:
  - Use the upc_memget_ilist interface
  - Send a list of fixed-size (in this case one-element) regions to the remote threads
  - Alternatively, use a strided memcpy function on each thread: messy pointer arithmetic, but possibly faster
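A hedged illustration of the problem (not of the fstrided/ilist interfaces themselves, whose signatures are not reproduced here): with stride k over an indefinite array, a bounding-box get of a[lo .. up] moves roughly k times more data than is used, while the one-element gets below send many small messages; the strided/list interfaces are meant to describe exactly the used elements in a single call.

    #include <upc.h>

    long strided_sum(shared [] int *a, int lo, int up, int k)
    {
        long s = 0;
        int i, v;

        for (i = lo; i <= up; i += k) {
            upc_memget(&v, &a[i], sizeof(int));    /* one small get per element */
            s += v;
        }
        return s;
    }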

Preliminary Results: Performance

Uses a simple parallel matrix-vector multiply with the rows distributed cyclically.
Two configurations for the vector:
  - Indefinite array (entirely on thread 0)
  - Cyclic layout
Compare the performance of three versions:
  - Naive fine-grained accesses
  - Message-coalesced output
  - Bulk-style code
    - Indefinite: call upc_memget before the outer loop
    - Cyclic: like the message-coalesced code, except it reads from the 2D tmp array directly (avoiding the flattening of the 2D array)
(A sketch of the fine-grained and bulk variants for the indefinite-vector case follows.)
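A hedged sketch of the setup for the indefinite-vector case (not the benchmark source; names and sizes are illustrative, and NCOLS is reduced from the 100K used in the experiment): rows of A are distributed cyclically, x lives entirely on thread 0, and the two functions correspond to the naive fine-grained and bulk-style versions compared above.

    #include <upc.h>

    #define NROWS (100 * THREADS)
    #define NCOLS 1024                       /* 100K elements in the experiment */

    shared [NCOLS] double A[NROWS][NCOLS];   /* one whole row per block, cyclic */
    shared [] double *x;                     /* NCOLS doubles on thread 0; initialization not shown */
    shared double y[NROWS];

    void matvec_fine_grained(void)
    {
        int i, j;
        upc_forall (i = 0; i < NROWS; i++; &A[i][0]) {  /* owner computes row i */
            double s = 0.0;
            for (j = 0; j < NCOLS; j++)
                s += A[i][j] * x[j];         /* x[j] is a one-element remote get */
            y[i] = s;
        }
    }

    void matvec_bulk(void)
    {
        static double xl[NCOLS];             /* private copy of the vector */
        int i, j;
        upc_memget(xl, x, NCOLS * sizeof(double));      /* one bulk get up front */
        upc_forall (i = 0; i < NROWS; i++; &A[i][0]) {
            double s = 0.0;
            for (j = 0; j < NCOLS; j++)
                s += A[i][j] * xl[j];
            y[i] = s;
        }
    }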

Message Coalescing vs. Fine-Grained

One thread per node.
The vector is 100K elements; the number of rows is 100 * THREADS.
The message-coalesced code is more than 100x faster.
The fine-grained code also does not scale well, due to network contention.

Message Coalescing vs. Bulk

Message coalescing and bulk-style code have comparable performance:
  - For the indefinite array, the generated code is identical
  - For the cyclic array, coalescing is faster than the manual bulk code on Elan, because the memgets to each thread are overlapped

Preliminary Results: Programmability

Evaluation methodology:
  - Count the number of loops that can be coalesced
  - Count the number of memgets that could be coalesced if converted to fine-grained loops
  - Use the NAS UPC benchmarks
MG (Berkeley): 4/4 memgets can be converted to loops that can be coalesced.
CG (UMD): 2 fine-grained loops copying the contents of cyclic arrays locally can be coalesced.
FT (GWU): one loop broadcasting elements of a cyclic array can be coalesced.
IS (GWU): 3/3 memgets can be coalesced if converted to loops.

Conclusion

Message coalescing can be a big win in programmability:
  - It can offer performance comparable to bulk-style code
  - It is great for shared-memory-style code
Many choices in code generation:
  - Bounding box vs. strided vs. variable-size transfers
  - Performance is platform dependent
Lots of research and future work remains in this area:
  - Construct a performance model
  - Handle more complicated access patterns
  - Add support for arrays in structs
  - Optimize special cases (e.g., constant bounds/strides)