Published by Suzanna Barker. Modified over 9 years ago.
Eliminating Affinity Tests and Simplifying Shared Accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta  **IBM Research
UPC: Unified Parallel C
UPC is a Partitioned Global Address Space (PGAS) language. (Figure: six threads, numbered 0-5, so THREADS = 6, each owning one partition of the shared address space.)
Shared arrays
Arrays can be shared between all threads, e.g.:
shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-D block-cyclic distribution, similar to HPF's cyclic(k). (Figure: elements 0-8 laid out in blocks of two, dealt to the threads in round-robin order.)
Vector addition example
#include <upc.h>
shared [2] double A[10];
shared [3] double B[10], C[10];
int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
- upc_forall loops: syntax and uses
- Compiling upc_forall loops
- Data distributions in UPC
- Multiblocking distributions
- Privatization of accesses
- Results
upc_forall and affinity tests
upc_forall is a work-distribution construct. Its form is:
shared [BF] double A[M];
upc_forall(i = 0; i < N; i++; &A[i]) {
    // loop body
}
The fourth clause, &A[i], is the affinity test expression: it determines which thread executes which iteration (here, the thread that owns A[i]).
Affinity test elimination: naive
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}
is lowered to a loop that guards every iteration with the affinity test:
shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination: optimized
shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}
is instead lowered to a strip-mined loop nest that visits only the local blocks:
shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {
        // loop body
    }
}
Integer affinity tests
upc_forall(i = 0; i < M; i++; i) {
    // loop body
}
becomes a round-robin loop:
for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC specification supports only 1-D block-cyclic distributions. The IBM xlupc compiler supports a more general data distribution, 'multi-dimensional blocking', e.g.:
shared [2][3] double A[5][5];
The array is divided into multidimensional tiles, and the tiles are distributed among the threads in cyclic fashion. This is more general than the UPC specification, but not as general as the distributions of ScaLAPACK or HPF.
Multidimensional Blocking
shared [2][2] double A[5][5];
With THREADS = 4, the owner of each element is:
0 0 1 1 2
0 0 1 1 2
3 3 0 0 1
3 3 0 0 1
2 2 3 3 0
Locality analysis and privatization
Consider:
shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
The source loop:
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
is naively lowered to runtime calls for every shared access:
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}
Shared access code generation
Do we really need the function calls? By the affinity test, A[i][j] is always local, so it should be only a memory load/store. What about B[i+1][j]? On an SMP it should be just a load; on hybrid systems it may be remote.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
(Figure: the area of B belonging to thread 0, versus the area referenced by thread 0 through B[i+1][j].)
for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
Locality analysis: intuition
The locality of B[i+1][j] can only change when the index i+1 crosses a block boundary in some dimension. Block boundaries lie at 0, BF, 2*BF, ...; that is, (i+1) % BF == 0 exactly at a block boundary. So we only need to test whether (i+1) % BF == 0 to find the places where locality can change.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2] of the reference relative to the affinity expression; here k1 = 1 and k2 = 0, and k1, k2 must be integer constants. The reference crosses a block boundary where (i + k1) % BF == 0, i.e. at i % BF == BF - (k1 % BF). Iterations with i % BF < BF - k1 stay in the same block as the affinity expression; the boundary value is what we refer to as a 'cut'.
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {   /* B[i+1][j] is in the same block: privatize */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {           /* B[i+1][j] may be remote: runtime call */
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis: algorithm
For each shared reference in the loop:
- check that the blocking factor matches that of the affinity expression
- check that the distance (offset) vector is constant
If the reference is eligible:
- generate its cut expressions
- insert each cut into a sorted cut list
- replicate the loop body as necessary, once per region between cuts
- insert a direct memory load/store for local references; otherwise insert an RTS (runtime system) call
Improvements from locality analysis in isolation
Improvements from affinity test elimination in isolation
Results: Vector addition
Matrix-vector multiplication
Matrix-vector scalability
Conclusions
- UPC requires extensive compiler support; upc_forall is a challenging construct to compile efficiently.
- Implementing shared accesses efficiently requires compiler support.
- The optimizations working together produce good results: they can yield more than 80x speedup over unoptimized code.
- If any one optimization fails to apply, performance can still be poor.