Eliminating affinity tests and simplifying shared accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta  **IBM Research
UPC : Unified Parallel C
Partitioned Global Address Space (PGAS) model
(figure: the shared address space partitioned among THREADS = 6 threads)
Shared arrays
Arrays can be shared between all threads.
E.g.: shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-d block-cyclic distribution, similar to HPF cyclic(k).
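A minimal sketch (assuming a UPC compiler and a program compiled for a fixed THREADS = 3, as in the example above) that prints which thread owns each element, using the standard upc_threadof() query:

#include <upc.h>
#include <stdio.h>

shared [2] double A[9];

int main(void) {
    if (MYTHREAD == 0) {
        /* Blocks of 2 handed out cyclically over 3 threads:
           A[0..1] -> thread 0, A[2..3] -> thread 1, A[4..5] -> thread 2,
           A[6..7] -> thread 0, A[8]    -> thread 1 */
        for (int i = 0; i < 9; i++)
            printf("A[%d] lives on thread %d\n", i, (int) upc_threadof(&A[i]));
    }
    return 0;
}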
Vector addition example

#include <upc.h>
shared [2] double A[10];
shared [3] double B[10], C[10];

int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
  upc_forall loops: syntax and uses
  Compiling upc_forall loops
  Data distributions in UPC: multiblocking distributions
  Privatization of shared accesses
  Results
upc_forall and affinity tests
upc_forall is a work distribution construct. Form:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

The "affinity test" expression (the fourth clause, &A[i] here) determines which thread executes which iteration.
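A small sketch (assuming a UPC compiler and a program compiled for a fixed number of threads; BF = 2 and M = 8 are arbitrary values chosen for illustration) in which each thread prints the iterations it executes; with a pointer affinity expression, iteration i runs on the thread that owns A[i]:

#include <upc.h>
#include <stdio.h>

#define BF 2
#define M  8
shared [BF] double A[M];

int main(void) {
    int i;
    /* Iteration i is executed by the thread that owns A[i],
       i.e. by upc_threadof(&A[i]). */
    upc_forall(i = 0; i < M; i++; &A[i])
        printf("thread %d executes iteration %d\n", MYTHREAD, i);
    return 0;
}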
Affinity test elimination : naive

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated to:

shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination : optimized

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated to:

shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {
        // loop body (with i replaced by j)
    }
}
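A self-contained C check (no UPC compiler needed; MYTHREAD, THREADS, BF and M are simulated with ordinary constants here) that the strip-mined loop visits exactly the iterations whose block maps to the given thread:

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simulated parameters; in real UPC code MYTHREAD and THREADS are built-ins. */
enum { THREADS = 3, BF = 2, M = 10 };

int main(void) {
    for (int MYTHREAD = 0; MYTHREAD < THREADS; MYTHREAD++) {
        int naive[M] = {0}, opt[M] = {0};

        /* Naive version: test the owner of A[i] on every iteration.
           Owner of element i under a [BF] block-cyclic layout: (i / BF) % THREADS. */
        for (int i = 0; i < M; i++)
            if ((i / BF) % THREADS == MYTHREAD)
                naive[i] = 1;

        /* Optimized version: jump directly from block to block. */
        for (int i = MYTHREAD * BF; i < M; i += BF * THREADS)
            for (int j = i; j < i + BF && j < M; j++)
                opt[j] = 1;

        assert(memcmp(naive, opt, sizeof naive) == 0);
    }
    printf("both loops execute the same iterations\n");
    return 0;
}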
Integer Affinity Tests
With an integer affinity expression, iteration i is executed by the thread with MYTHREAD == i % THREADS, so

upc_forall(i = 0; i < M; i++; i) {
    // loop body
}

is translated to:

for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC specification only supports 1-d block-cyclic distributions.
The IBM xlupc compiler supports a more general data distribution: multidimensional blocking.
E.g.: shared [2][3] double A[5][5];
Divide the array into multidimensional tiles and distribute the tiles among the threads in a cyclic fashion.
More general than the UPC spec, but not as general as ScaLAPACK or HPF distributions.
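A plain C sketch of one possible ownership computation for this example, assuming tiles are numbered in row-major order and dealt out round-robin to an assumed THREADS = 4 (this matches the "distribute tiles cyclically" description above, but the exact xlupc layout is an assumption of this sketch):

#include <stdio.h>

/* Owner of A[i][j] for shared [B0][B1] double A[N0][N1], under the assumed layout. */
enum { THREADS = 4, B0 = 2, B1 = 3, N0 = 5, N1 = 5 };

static int owner(int i, int j) {
    int tiles_per_row = (N1 + B1 - 1) / B1;          /* tiles in one row of tiles */
    int tile = (i / B0) * tiles_per_row + (j / B1);  /* row-major tile index */
    return tile % THREADS;                           /* cyclic assignment of tiles */
}

int main(void) {
    for (int i = 0; i < N0; i++) {
        for (int j = 0; j < N1; j++)
            printf("%d ", owner(i, j));
        printf("\n");
    }
    return 0;
}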
Multidimensional Blocking
shared [2][2] double A[5][5];
(figure: the 5x5 array divided into 2x2 tiles, tiles distributed cyclically among the threads)
Locality analysis and privatization
Consider:

shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}

What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
The straightforward translation turns every shared access into a runtime call:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}

Source loop, for reference:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
Do we really need the function calls?
A[i][j] has affinity to the executing thread, so it should be just a memory load/store.
What about B[i+1][j]? On an SMP it should be just a load; what about on hybrid (distributed) systems?

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
(figure: the area of B belonging to thread 0 versus the area referenced by thread 0 through B[i+1][j])

for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
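A plain C sketch reconstructing the pattern the figure illustrates for this loop. Because the executing thread owns the tile of A[i][j], and A and B share the same [2][3] blocking, B[i+1][j] is guaranteed (structurally) local exactly when it falls in the same tile as A[i][j]; the column offset is 0, so only the tile row matters:

#include <stdio.h>

int main(void) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            /* Same tile row as A[i][j]? (row blocking factor is 2) */
            int same_tile_row = (i / 2) == ((i + 1) / 2);
            printf("%c ", same_tile_row ? 'L' : 'R');  /* L = local, R = possibly remote */
        }
        printf("\n");
    }
    /* Rows alternate L and R: locality changes exactly when i % 2 == 1,
       which is the condition used in the generated code shown later. */
    return 0;
}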
Locality Analysis : Intuition
The locality of B[i+1][j] can only change when the index (i+1) crosses a block boundary in that dimension.
Block boundaries are at 0, BF, 2*BF, ...; (i+1) % BF == 0 marks a block boundary.
So we only need to test whether (i+1) % BF == 0 to find the places where locality can change.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2] of B[i+k1][j+k2] relative to the affinity expression &A[i][j]; here k1 = 1, k2 = 0, and k1, k2 are integer constants.
The reference crosses a block boundary where (i + k1) % BF == 0, i.e. at i % BF == BF - (k1 % BF); we refer to this value as the "cut".
For i % BF < BF - k1 the reference falls in the same block row as A[i][j]; otherwise it falls in the next block row.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
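A small C check of the cut condition (plain C, no UPC needed; BF = 2 and k1 = 1 are the values from the example, and the check assumes 0 <= k1 < BF): the reference B[i+k1][j] stays in the same row of tiles as A[i][j] exactly while i % BF < BF - k1, so locality can only change at the cut.

#include <assert.h>
#include <stdio.h>

int main(void) {
    int BF = 2, k1 = 1;
    for (int i = 0; i < 100; i++) {
        /* Same tile row iff the block index of i+k1 equals the block index of i. */
        int same_tile_row = ((i + k1) / BF) == (i / BF);
        int below_cut     = (i % BF) < (BF - k1);
        assert(same_tile_row == below_cut);
    }
    printf("locality changes only at the cut i %% BF == %d\n", BF - k1);
    return 0;
}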
Shared access code generation

for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {
        /* B[i+1][j] is in the same tile as A[i][j]: both accesses are local */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {
        /* B[i+1][j] may be remote: keep the runtime call for it */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis : algorithm
For each shared reference in the loop:
  Check that its blocking factor matches that of the affinity expression
  Check that its distance (offset) vector is constant
If the reference is eligible:
  Generate its cut expressions
  Insert each cut into a sorted "cut list" (sketched below)
Replicate the loop body once per region between cuts
Insert a direct memory load/store for each local reference; otherwise insert an RTS call
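A minimal sketch of the cut-list construction step (plain C; build_cut_list is a hypothetical helper name, and the example offsets are chosen for illustration): for each eligible reference with row offset k1, the cut BF - (k1 % BF) is inserted into a sorted, duplicate-free list.

#include <stdio.h>

/* Hypothetical helper: build the sorted cut list for a set of row offsets k1[]
   against blocking factor BF. A cut equal to BF (k1 a multiple of BF) never
   fires and is skipped. Returns the number of cuts stored in cuts[]. */
static int build_cut_list(int BF, const int *k1, int nrefs, int *cuts) {
    int n = 0;
    for (int r = 0; r < nrefs; r++) {
        int c = BF - (((k1[r] % BF) + BF) % BF);   /* handles negative offsets too */
        if (c == BF) continue;                     /* reference never changes locality */
        int pos = 0;
        while (pos < n && cuts[pos] < c) pos++;    /* keep the list sorted */
        if (pos < n && cuts[pos] == c) continue;   /* drop duplicates */
        for (int m = n; m > pos; m--) cuts[m] = cuts[m - 1];
        cuts[pos] = c;
        n++;
    }
    return n;
}

int main(void) {
    int k1[] = { 1, 0, -1 };   /* example row offsets of three shared references */
    int cuts[3];
    int n = build_cut_list(2, k1, 3, cuts);
    for (int i = 0; i < n; i++)
        printf("cut at i %% BF == %d\n", cuts[i]);
    return 0;
}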
Improvements of locality analysis in isolation
Improvements of affinity test elimination in isolation
Results : Vector addition
Matrix-vector multiplication
Matrix-vector scalability
Conclusions
UPC requires extensive compiler support.
upc_forall is a challenging construct to compile efficiently.
Efficient shared access implementation requires compiler support.
The optimizations working together produce good results: compiler optimizations can produce >80x speedup over unoptimized code.
If one of the optimizations fails, results can still be poor.