Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers
Priya Unnikrishnan, IBM Toronto Lab
CASCON 2005
IBM Software Group, Compilation Technology © 2005 IBM Corporation
Overview
– Parallelization in IBM XL compilers
– Outlining
– Automatic parallelization
– Cost analysis
– Controlled parallelization
– Future work
Parallelization
– IBM XL compilers support Fortran 77/90/95, C, and C++
– Both OpenMP and auto-parallelization are implemented; both target SMP (shared-memory parallel) machines
– Non-threadsafe code is generated by default: use an _r invocation (xlf_r, xlc_r, ...) to generate threadsafe code
Parallelization options
-qsmp=noopt    Parallelizes code with minimal optimization, to allow for better debugging of OpenMP applications
-qsmp=omp      Parallelizes code containing OpenMP directives
-qsmp=auto     Automatically parallelizes loops
-qsmp=noauto   No auto-parallelization; processes IBM and OpenMP parallel directives
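For example, combining the _r invocations from the previous slide with these options (illustrative command lines, not from the slides):

    xlf_r -qsmp=omp prog.f      # threadsafe, compile OpenMP directives
    xlc_r -qsmp=auto prog.c     # threadsafe, auto-parallelize loops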
Outlining: the parallelization transformation
Outlining

Original code:

    int main() {
    #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

After outlining (reconstructed from the slide; the outlined routine's name and bounds parameters are illustrative):

    /* Outlined routine: the loop body over a normalized induction
       variable CIV1, runnable on any subrange of the iteration space */
    void main@OL@1(unsigned lower, unsigned upper) {
      unsigned CIV1 = 0;
      do {
        a[lower + CIV1] = const;
        CIV1 = CIV1 + 1;
      } while (CIV1 < upper - lower);
      return;
    }

    int main() {
      long rc = _xlsmpInitializeRTE();
      if (n > 0) {
        ……   /* runtime call: dispatches main@OL@1 across threads */
      }
      return 0;
    }
SMP parallel runtime
– The outlined function is parameterized: it can be invoked for different ranges of the iteration space
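A minimal sketch of what this parameterization enables; the chunking scheme and the name run_chunks are illustrative assumptions, not the actual XL runtime:

    /* Hypothetical driver: each thread would invoke the outlined routine
       on its own chunk of the iteration space [0, n). */
    void run_chunks(void (*outlined)(unsigned, unsigned),
                    unsigned n, unsigned nthreads) {
      unsigned chunk = (n + nthreads - 1) / nthreads;   /* ceiling division */
      for (unsigned t = 0; t < nthreads; t++) {
        unsigned lower = t * chunk;
        unsigned upper = (lower + chunk < n) ? lower + chunk : n;
        if (lower < upper)
          outlined(lower, upper);   /* in reality, dispatched to thread t */
      }
    }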
Auto-parallelization
– Integrated framework for OpenMP and auto-parallelization
– Auto-parallelization is restricted to loops
– Auto-parallelization is done at the link step when possible; this allows various interprocedural analyses and optimizations to run before automatic parallelization
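As an illustration of a link-step auto-parallelization build, assuming XL's -qipa option is what enables interprocedural analysis at link time (the exact option combination is an assumption, not from the slides):

    xlc_r -O3 -qipa -qsmp=auto file1.c file2.c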
Auto-parallelization transformation

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

The auto-parallelizer marks the loop, and the same outlining machinery then applies:

    int main() {
    #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

    + Outlining
Auto-parallelizing OpenMP applications

We can auto-parallelize OpenMP applications while skipping user-parallel code - a good thing!

    int main() {
      for (int i = 0; i < n; i++) {   /* serial loop: auto-parallel candidate */
        a[i] = const;
        ……
      }
    #pragma omp parallel for          /* already user-parallel: skipped */
      for (int j = 0; j < n; j++) {
        b[j] = a[j];
      }
    }

After marking and outlining, only the first loop carries the #auto-parallel-loop annotation; the OpenMP loop is left untouched.
Pre-parallelization phase
– Loop normalization (normalize countable loops)
– Scalar privatization (sketched below)
– Array privatization
– Reduction variable analysis (sketched below)
– Loop interchange (when it helps parallelization)
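A minimal sketch of two of these transformations on source-level loops (illustrative code, not compiler output):

    #define N 1024
    double a[N], b[N];

    double pre_par_examples(void) {
      /* scalar privatization: t is written before it is read in every
         iteration, so each thread can safely get its own private copy */
      double t;
      for (int i = 0; i < N; i++) {
        t = a[i] * 2.0;
        b[i] = t + 1.0;
      }

      /* reduction variable analysis: sum carries a cross-iteration
         dependence, but it can run in parallel using per-thread partial
         sums that are combined at the end */
      double sum = 0.0;
      for (int i = 0; i < N; i++)
        sum += a[i];
      return sum;
    }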
Cost analysis
– Automatic parallelization tests:
  – Dependence analysis: is it safe to parallelize?
  – Cost analysis: is it worthwhile to parallelize?
– Cost analysis estimates the total workload of the loop:
  LoopCost = IterationCount * ExecTimeOfLoopBody
– If the cost is known at compile time, the test is trivial
– Runtime cost analysis is more complex
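A worked example with assumed numbers (the threshold and per-iteration cost are illustrative): for a loop with a compile-time trip count of 1000 and an estimated body cost of 5 time units, LoopCost = 1000 * 5 = 5000, and the compiler simply compares 5000 against its profitability threshold at compile time. When the trip count n is known only at run time, LoopCost = n * 5 must instead be computed and compared by generated code, which motivates the runtime checks on the next slides.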
Conditional parallelization

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

After auto-parallelization with a runtime cost check (reconstructed sketch):

    int main() {
      long rc = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          ……   /* runtime call: run the outlined loop in parallel */
        } else {
          ……   /* fall back to the serial loop */
        }
      }
      return 0;
    }

    + Runtime check
Runtime cost analysis challenges
Runtime checks should be:
– Lightweight: they must not introduce large overhead in applications that are mostly serial
– Safe from overflow: an overflow leads to an incorrect decision - costly!
  loop_cost = (((c1*n1) + (c2*n2) + const) * n3) * ...
– Restricted to integer operations
– Accurate
Balancing all of these factors is the challenge.
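One way to satisfy both the overflow and integer-only requirements is saturating arithmetic, so an oversized cost clamps to a maximum instead of wrapping around and looking cheap. A minimal sketch assuming this approach (names and structure are illustrative, not the XL runtime):

    #include <limits.h>

    /* Saturating add/multiply for nonnegative operands: clamp to
       LONG_MAX rather than overflow, so a huge cost never wraps. */
    static long sat_add(long x, long y) {
      return (x > LONG_MAX - y) ? LONG_MAX : x + y;
    }
    static long sat_mul(long x, long y) {
      if (x != 0 && y > LONG_MAX / x) return LONG_MAX;
      return x * y;
    }

    /* loop_cost = (((c1*n1) + (c2*n2) + k) * n3), evaluated safely */
    long loop_cost(long c1, long n1, long c2, long n2, long k, long n3) {
      long inner = sat_add(sat_add(sat_mul(c1, n1), sat_mul(c2, n2)), k);
      return sat_mul(inner, n3);
    }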
Runtime dependence test (work by Peng Zhao)

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

When static dependence analysis is inconclusive, a runtime dependence test is combined with the cost check (reconstructed sketch):

    int main() {
      long rc = _xlsmpInitializeRTE();
      if (n > 0) {
        if (no_dependence && loop_cost > threshold) {
          ……   /* runtime call: run the outlined loop in parallel */
        } else {
          ……   /* fall back to the serial loop */
        }
      }
      return 0;
    }

    + Runtime dependence test
Controlled parallelization
– Cost analysis selects big loops, but selection alone is not enough
– Parallel performance depends on both the amount of work and the number of processors used
– Using a large number of processors on a small loop causes huge degradations!
[Chart: measured on a 64-way POWER5 system. Takeaway: "Small is good!"]
Controlled parallelization
– Introduce another runtime parameter, IPT (minimum iterations per thread)
– IPT is passed to the SMP runtime
– The SMP runtime limits the number of threads working on the parallel loop based on IPT
– IPT = function(loop_cost, memory access info, ...)
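The slides leave the function unspecified; purely as an illustration, one plausible shape makes IPT inversely proportional to the per-iteration cost, so that each thread's share of work covers the parallel overhead (hypothetical heuristic):

    /* Hypothetical: pick IPT so that body_cost * IPT >= overhead_budget. */
    long compute_ipt(long body_cost, long overhead_budget) {
      long ipt = overhead_budget / (body_cost > 0 ? body_cost : 1);
      return (ipt > 1) ? ipt : 1;   /* at least one iteration per thread */
    }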
Controlled parallelization (transformation)

    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }

The cost check now also computes IPT and passes it to the runtime (reconstructed sketch):

    int main() {
      long rc = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          IPT = func(loop_cost);
          ……   /* runtime call, with IPT as an extra parameter */
        } else {
          ……   /* fall back to the serial loop */
        }
      }
      return 0;
    }

    + Runtime parameter IPT
SMP parallel runtime

    {
      /* cap the thread count so each thread gets at least IPT iterations */
      threadsUsed = IterCount / IPT;
      if (threadsUsed > threadsAvailable)
        threadsUsed = threadsAvailable;
      ……
    }
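For example, with illustrative numbers: IterCount = 1000 and IPT = 400 give threadsUsed = 2, so only two threads work on the loop even on a 64-way machine, while IterCount = 100000 with the same IPT gives min(250, threadsAvailable) threads.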
Controlled parallelization for OpenMP
– Improves performance and scalability
– Allows fine-grained control at loop-level granularity
– Can be applied to OpenMP loops as well
– Adjusts the number of threads when the environment variable OMP_DYNAMIC is turned on
– Issues with threadprivate data
– Encouraging results on the galgel benchmark
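For reference, the OpenMP specification defines OMP_DYNAMIC as permission for the runtime to adjust the number of threads used for parallel regions, e.g. (shell syntax illustrative):

    export OMP_DYNAMIC=TRUE

With dynamic adjustment enabled, running a user-parallel loop with fewer threads than requested is conforming behavior, which is what lets IPT-based throttling be applied to OpenMP loops.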
[Chart: measured on a 64-way POWER5 system]
Future work
– Improve the cost analysis algorithm and fine-tune heuristics
– Implement interprocedural cost analysis
– Extend cost analysis and controlled parallelization to non-loop regions in user-parallel code, for scalability
– Implement interprocedural dependence analysis