Published by Suzanna Barker. Modified over 9 years ago.
Eliminating Affinity Tests and Simplifying Shared Accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta  **IBM Research
UPC: Unified Parallel C
UPC is a Partitioned Global Address Space (PGAS) language. (Figure: six threads, numbered 0-5, so THREADS = 6, each owning one partition of the shared address space.)
Shared arrays
Arrays can be shared between all threads, e.g.:
shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-D block-cyclic distribution, similar to HPF's cyclic(k). (Figure: elements 0-8 laid out in blocks of two, dealt to the threads in round-robin order.)
Vector addition example
#include <upc.h>
shared [2] double A[10];
shared [3] double B[10], C[10];
int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
- upc_forall loops: syntax and uses
- Compiling upc_forall loops
- Data distributions in UPC
- Multiblocking distributions
- Privatization of accesses
- Results
upc_forall and affinity tests
upc_forall is a work-distribution construct. Its form is:
shared [BF] double A[M];
upc_forall(i = 0; i < N; i++; &A[i]) {
    // loop body
}
The fourth clause, &A[i], is the affinity test expression: it determines which thread executes which iteration (here, the thread that owns A[i]).
Affinity test elimination: naive
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}
is lowered to a loop that guards every iteration with the affinity test:
shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination: optimized
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}
is instead lowered to a strip-mined loop nest that visits only the local blocks:
shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {
        // loop body
    }
}
Integer affinity tests
upc_forall(i = 0; i < M; i++; i) {
    // loop body
}
becomes a round-robin loop:
for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC specification supports only 1-D block-cyclic distributions. The IBM xlupc compiler supports a more general data distribution, 'multi-dimensional blocking', e.g.:
shared [2][3] double A[5][5];
The array is divided into multidimensional tiles, and the tiles are distributed among the threads in cyclic fashion. This is more general than the UPC specification, but not as general as the distributions of ScaLAPACK or HPF.
Multidimensional Blocking
shared [2][2] double A[5][5];
With THREADS = 4, the owner of each element is:
0 0 1 1 2
0 0 1 1 2
3 3 0 0 1
3 3 0 0 1
2 2 3 3 0
Locality analysis and privatization
Consider:
shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
The source loop:
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
is naively lowered to runtime calls for every shared access:
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}
Shared access code generation
Do we really need the function calls? By the affinity test, A[i][j] is always local, so it should be only a memory load/store. What about B[i+1][j]? On an SMP it should be just a load; on hybrid systems it may be remote.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
(Figure: the area of B belonging to thread 0, versus the area referenced by thread 0 through B[i+1][j].)
for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
Locality analysis: intuition
The locality of B[i+1][j] can only change when the index i+1 crosses a block boundary in some dimension. Block boundaries lie at 0, BF, 2*BF, ...; that is, (i+1) % BF == 0 exactly at a block boundary. So we only need to test whether (i+1) % BF == 0 to find the places where locality can change.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2] of the reference relative to the affinity expression; here k1 = 1 and k2 = 0, and k1, k2 must be integer constants. The reference crosses a block boundary where (i + k1) % BF == 0, i.e. at i % BF == BF - (k1 % BF). Iterations with i % BF < BF - k1 stay in the same block as the affinity expression; the boundary value is what we refer to as a 'cut'.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {   /* B[i+1][j] is in the same block: privatize */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {           /* B[i+1][j] may be remote: runtime call */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis: algorithm
For each shared reference in the loop:
- check that the blocking factor matches that of the affinity expression
- check that the distance (offset) vector is constant
If the reference is eligible:
- generate its cut expressions
- insert each cut into a sorted cut list
- replicate the loop body as necessary, once per region between cuts
- insert a direct memory load/store for local references; otherwise insert an RTS (runtime system) call
Improvements from locality analysis in isolation
Improvements from affinity test elimination in isolation
Results: Vector addition
Matrix-vector multiplication
Matrix-vector scalability
Conclusions
- UPC requires extensive compiler support; upc_forall is a challenging construct to compile efficiently.
- Implementing shared accesses efficiently requires compiler support.
- The optimizations working together produce good results: they can yield more than 80x speedup over unoptimized code.
- If any one optimization fails to apply, performance can still be poor.