Eliminating affinity tests and simplifying shared accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta  **IBM Research
UPC : Unified Parallel C
Partitioned Global Address Space (PGAS) model
(figure: the shared address space partitioned among THREADS = 6 threads)
Shared arrays
Arrays can be shared between all threads.
E.g.: shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-d block-cyclic distribution, similar to HPF cyclic(k).
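A minimal sketch (assuming a UPC compiler and a program compiled for a fixed THREADS = 3, as in the example above) that prints which thread owns each element, using the standard upc_threadof() query:

#include <upc.h>
#include <stdio.h>

shared [2] double A[9];

int main(void) {
    if (MYTHREAD == 0) {
        /* Blocks of 2 handed out cyclically over 3 threads:
           A[0..1] -> thread 0, A[2..3] -> thread 1, A[4..5] -> thread 2,
           A[6..7] -> thread 0, A[8]    -> thread 1 */
        for (int i = 0; i < 9; i++)
            printf("A[%d] lives on thread %d\n", i, (int) upc_threadof(&A[i]));
    }
    return 0;
}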
Vector addition example

#include <upc.h>
shared [2] double A[10];
shared [3] double B[10], C[10];

int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
  upc_forall loops: syntax and uses
  Compiling upc_forall loops
  Data distributions in UPC: multiblocking distributions
  Privatization of shared accesses
  Results
upc_forall and affinity tests
upc_forall is a work distribution construct. Form:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

The "affinity test" expression (the fourth clause, &A[i] here) determines which thread executes which iteration.
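A small sketch (assuming a UPC compiler and a program compiled for a fixed number of threads; BF = 2 and M = 8 are arbitrary values chosen for illustration) in which each thread prints the iterations it executes; with a pointer affinity expression, iteration i runs on the thread that owns A[i]:

#include <upc.h>
#include <stdio.h>

#define BF 2
#define M  8
shared [BF] double A[M];

int main(void) {
    int i;
    /* Iteration i is executed by the thread that owns A[i],
       i.e. by upc_threadof(&A[i]). */
    upc_forall(i = 0; i < M; i++; &A[i])
        printf("thread %d executes iteration %d\n", MYTHREAD, i);
    return 0;
}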
Affinity test elimination : naive

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated to:

shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination : optimized

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated to:

shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {
        // loop body (with i replaced by j)
    }
}
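A self-contained C check (no UPC compiler needed; MYTHREAD, THREADS, BF and M are simulated with ordinary constants here) that the strip-mined loop visits exactly the iterations whose block maps to the given thread:

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simulated parameters; in real UPC code MYTHREAD and THREADS are built-ins. */
enum { THREADS = 3, BF = 2, M = 10 };

int main(void) {
    for (int MYTHREAD = 0; MYTHREAD < THREADS; MYTHREAD++) {
        int naive[M] = {0}, opt[M] = {0};

        /* Naive version: test the owner of A[i] on every iteration.
           Owner of element i under a [BF] block-cyclic layout: (i / BF) % THREADS. */
        for (int i = 0; i < M; i++)
            if ((i / BF) % THREADS == MYTHREAD)
                naive[i] = 1;

        /* Optimized version: jump directly from block to block. */
        for (int i = MYTHREAD * BF; i < M; i += BF * THREADS)
            for (int j = i; j < i + BF && j < M; j++)
                opt[j] = 1;

        assert(memcmp(naive, opt, sizeof naive) == 0);
    }
    printf("both loops execute the same iterations\n");
    return 0;
}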
Integer Affinity Tests
With an integer affinity expression, iteration i is executed by the thread with MYTHREAD == i % THREADS, so

upc_forall(i = 0; i < M; i++; i) {
    // loop body
}

is translated to:

for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC specification only supports 1-d block-cyclic distributions.
The IBM xlupc compiler supports a more general data distribution: multidimensional blocking.
E.g.: shared [2][3] double A[5][5];
Divide the array into multidimensional tiles and distribute the tiles among the threads in a cyclic fashion.
More general than the UPC spec, but not as general as ScaLAPACK or HPF distributions.
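A plain C sketch of one possible ownership computation for this example, assuming tiles are numbered in row-major order and dealt out round-robin to an assumed THREADS = 4 (this matches the "distribute tiles cyclically" description above, but the exact xlupc layout is an assumption of this sketch):

#include <stdio.h>

/* Owner of A[i][j] for shared [B0][B1] double A[N0][N1], under the assumed layout. */
enum { THREADS = 4, B0 = 2, B1 = 3, N0 = 5, N1 = 5 };

static int owner(int i, int j) {
    int tiles_per_row = (N1 + B1 - 1) / B1;          /* tiles in one row of tiles */
    int tile = (i / B0) * tiles_per_row + (j / B1);  /* row-major tile index */
    return tile % THREADS;                           /* cyclic assignment of tiles */
}

int main(void) {
    for (int i = 0; i < N0; i++) {
        for (int j = 0; j < N1; j++)
            printf("%d ", owner(i, j));
        printf("\n");
    }
    return 0;
}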
Multidimensional Blocking
shared [2][2] double A[5][5];
(figure: the 5x5 array divided into 2x2 tiles, tiles distributed cyclically among the threads)
Locality analysis and privatization
Consider:

shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}

What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
The straightforward translation turns every shared access into a runtime call:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}

Source loop, for reference:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
Do we really need the function calls?
A[i][j] has affinity to the executing thread, so it should be just a memory load/store.
What about B[i+1][j]? On an SMP it should be just a load; what about on hybrid (distributed) systems?

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
(figure: the area of B belonging to thread 0 versus the area referenced by thread 0 through B[i+1][j])

for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
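A plain C sketch reconstructing the pattern the figure illustrates for this loop. Because the executing thread owns the tile of A[i][j], and A and B share the same [2][3] blocking, B[i+1][j] is guaranteed (structurally) local exactly when it falls in the same tile as A[i][j]; the column offset is 0, so only the tile row matters:

#include <stdio.h>

int main(void) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            /* Same tile row as A[i][j]? (row blocking factor is 2) */
            int same_tile_row = (i / 2) == ((i + 1) / 2);
            printf("%c ", same_tile_row ? 'L' : 'R');  /* L = local, R = possibly remote */
        }
        printf("\n");
    }
    /* Rows alternate L and R: locality changes exactly when i % 2 == 1,
       which is the condition used in the generated code shown later. */
    return 0;
}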
Locality Analysis : Intuition
The locality of B[i+1][j] can only change when the index (i+1) crosses a block boundary in that dimension.
Block boundaries are at 0, BF, 2*BF, ...; (i+1) % BF == 0 marks a block boundary.
So we only need to test whether (i+1) % BF == 0 to find the places where locality can change.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2] of B[i+k1][j+k2] relative to the affinity expression &A[i][j]; here k1 = 1, k2 = 0, and k1, k2 are integer constants.
The reference crosses a block boundary where (i + k1) % BF == 0, i.e. at i % BF == BF - (k1 % BF); we refer to this value as the "cut".
For i % BF < BF - k1 the reference falls in the same block row as A[i][j]; otherwise it falls in the next block row.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
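A small C check of the cut condition (plain C, no UPC needed; BF = 2 and k1 = 1 are the values from the example, and the check assumes 0 <= k1 < BF): the reference B[i+k1][j] stays in the same row of tiles as A[i][j] exactly while i % BF < BF - k1, so locality can only change at the cut.

#include <assert.h>
#include <stdio.h>

int main(void) {
    int BF = 2, k1 = 1;
    for (int i = 0; i < 100; i++) {
        /* Same tile row iff the block index of i+k1 equals the block index of i. */
        int same_tile_row = ((i + k1) / BF) == (i / BF);
        int below_cut     = (i % BF) < (BF - k1);
        assert(same_tile_row == below_cut);
    }
    printf("locality changes only at the cut i %% BF == %d\n", BF - k1);
    return 0;
}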
Shared access code generation

for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {
        /* B[i+1][j] is in the same tile as A[i][j]: both accesses are local */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {
        /* B[i+1][j] may be remote: keep the runtime call for it */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis : algorithm
For each shared reference in the loop:
  Check that its blocking factor matches that of the affinity expression
  Check that its distance (offset) vector is constant
If the reference is eligible:
  Generate its cut expressions
  Insert each cut into a sorted "cut list" (sketched below)
Replicate the loop body once per region between cuts
Insert a direct memory load/store for each local reference; otherwise insert an RTS call
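A minimal sketch of the cut-list construction step (plain C; build_cut_list is a hypothetical helper name, and the example offsets are chosen for illustration): for each eligible reference with row offset k1, the cut BF - (k1 % BF) is inserted into a sorted, duplicate-free list.

#include <stdio.h>

/* Hypothetical helper: build the sorted cut list for a set of row offsets k1[]
   against blocking factor BF. A cut equal to BF (k1 a multiple of BF) never
   fires and is skipped. Returns the number of cuts stored in cuts[]. */
static int build_cut_list(int BF, const int *k1, int nrefs, int *cuts) {
    int n = 0;
    for (int r = 0; r < nrefs; r++) {
        int c = BF - (((k1[r] % BF) + BF) % BF);   /* handles negative offsets too */
        if (c == BF) continue;                     /* reference never changes locality */
        int pos = 0;
        while (pos < n && cuts[pos] < c) pos++;    /* keep the list sorted */
        if (pos < n && cuts[pos] == c) continue;   /* drop duplicates */
        for (int m = n; m > pos; m--) cuts[m] = cuts[m - 1];
        cuts[pos] = c;
        n++;
    }
    return n;
}

int main(void) {
    int k1[] = { 1, 0, -1 };   /* example row offsets of three shared references */
    int cuts[3];
    int n = build_cut_list(2, k1, 3, cuts);
    for (int i = 0; i < n; i++)
        printf("cut at i %% BF == %d\n", cuts[i]);
    return 0;
}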
Improvements of locality analysis in isolation
Improvements of affinity test elimination in isolation
Results : Vector addition
Matrix-vector multiplication
Matrix-vector scalability
Conclusions
UPC requires extensive compiler support.
upc_forall is a challenging construct to compile efficiently.
Efficient shared access implementation requires compiler support.
The optimizations working together produce good results: compiler optimizations can produce >80x speedup over unoptimized code.
If one of the optimizations fails, results can still be poor.