Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces
Arun Kejariwal (Yahoo! Inc., Santa Clara, CA)
Alexandru Nicolau, Utpal Banerjee, Alexander V. Veidenbaum (UC Irvine, CA)
Constantine D. Polychronopoulos (University of Illinois, Urbana, IL)
Presenter: Olga Golovanevsky
Outline
  Motivation
  Motivating examples
  The techniques
  Results
  Conclusion
Motivation
  Multi-cores are becoming ubiquitous
    Examples: Intel's Sandy Bridge, IBM's Cell and POWER, Sun's UltraSPARC T* family
  The number of cores is expected to increase
    Large-scale hardware parallelism available
  Software challenges
    Thread-level application parallelization
    How to map threads onto different cores
    Load balancing
    Data affinity
Application Parallelization
  Loops account for most of application run-time
  Loop classification
    DOALL: no loop-carried dependences
      Amenable to auto-parallelization
      Iterations execute in parallel on different threads
    Non-DOALL
      Thread synchronization support needed for parallelization
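A minimal illustration of the DOALL/non-DOALL distinction (hypothetical loops, not from the benchmarks in this talk): in the first loop every iteration reads only inputs, so iterations are independent; in the second, each iteration reads the previous iteration's result, a loop-carried dependence that prevents naive parallel execution.

```python
# DOALL: each iteration writes a[i] from inputs only -> iterations are
# independent and can run on different threads in any order
def doall(b, c):
    return [b[i] + c[i] for i in range(len(b))]

# Non-DOALL: a[i] depends on a[i-1] (loop-carried dependence), so
# iterations cannot simply be distributed without synchronization
def non_doall(b):
    a = [0] * len(b)
    a[0] = b[0]
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]
    return a
```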
Parallel Execution of DOALL Loops
  Auto-parallelized, or directive-driven parallelization (example: OpenMP pragmas)
  Issue with parallel execution: load balancing
    How to partition the iteration space for best performance?
    Naïve way: partition the iteration space uniformly amongst the different threads
    This does not yield the best performance!
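The naïve scheme can be sketched as follows (a hypothetical helper, assuming a one-dimensional outer iteration space of n_iters iterations and n_threads threads):

```python
def uniform_partition(n_iters, n_threads):
    """Split [0, n_iters) into n_threads near-equal contiguous chunks."""
    base, rem = divmod(n_iters, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < rem else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks
```

Each thread gets the same number of outer iterations, which balances the load only when every iteration costs the same, the assumption the rest of the talk challenges.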
Iteration Space Partitioning
  Why is it non-trivial?
  Non-rectangular geometry of the iteration space
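For example, in a triangular nest (do i = 1, n; do j = 1, i) the inner trip count grows with i, so splitting the outer loop uniformly leaves the last thread with far more work. A small sketch with hypothetical sizes (n divisible by the thread count for simplicity):

```python
def work_per_thread(n, n_threads):
    """Inner-iteration count per thread when the outer loop i = 1..n of a
    triangular nest (inner j = 1..i) is split uniformly across threads."""
    chunk = n // n_threads
    return [sum(range(t * chunk + 1, (t + 1) * chunk + 1))
            for t in range(n_threads)]
```

With n = 100 and 4 threads, the last thread executes roughly 7x the inner iterations of the first, even though all threads receive the same number of outer iterations.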
Iteration Space Partitioning (contd.)
  Why is it non-trivial?
  Use of indirect referencing leads to a non-uniform cache miss profile
  (Variation in L1 cache misses: 462.libquantum, 435.gromacs)

  do k=nj0,nj1
    jnr = jjnr(k)+1
    j3  = 3*jnr-2
    ...
    faction(j3)   = faction(j3)-tx11
    faction(j3+1) = faction(j3+1)-ty11
    faction(j3+2) = faction(j3+2)-tz11
  end do
Iteration Space Partitioning (contd.)
  Why is it non-trivial?
  Non-perfect multi-way loops
    Outermost loop may contain multiple loops at the same nesting level
    Conditional execution of inner loops

  do k=2, nk-1
    ...
    do j=1, ny          ! first loop
      do i=1, nx
        read A[k,j,i]
      end do
    end do
    do j=2, ny-1        ! second loop
      do i=2, nx-1
        write A[k,j,i]
      end do
    end do
  end do

  (figure: outer iterations mapped to threads, e.g. T1 at k=1, T2 at k=4)
Iteration Space Partitioning (contd.)
  Why is it non-trivial?
  Presence of conditionals in the loop body leads to non-uniform workload distribution
  (Variation in instructions retired: 403.gcc, 434.zeusmp)

  do 90 k=ks,ke
    do 80 j=js,je
      do 70 i=is,ie
        ...
        if ((rin .lt. rsq) .and. (rout .le. rsq)) then
          ...
        endif
        if ((rin .lt. rsq) .and. (rout .gt. rsq)) then
          ...
        endif
70    continue
80  continue
90 continue
How to Partition? Guiding Factors
  Partition the outermost loop: minimizes scheduling overhead
  Geometry-aware
    Model the iteration space as a convex polytope
    Loop bounds are affine functions of outer loop indices
  Cache-aware
    Account for the non-uniform cache miss profile across the iteration space
    Account for the non-uniform workload distribution across the iteration space
Algorithm: High-Level Steps
  Obtain the cache miss profile
  Obtain the workload distribution
  Compute the total volume of the iteration space, weighted by cache misses and instructions retired
  Given n threads:
    Compute n-1 breakpoints along the axis corresponding to the outermost loop, wherein
      Each breakpoint delimits a set
      Each set has equal weighted volume
    Map each set onto a different thread
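The steps above can be sketched as follows. This is a simplified, hypothetical version: a one-dimensional list of per-outer-iteration weights stands in for the cache-miss- and instruction-weighted volume, and a greedy sweep places each breakpoint as soon as the accumulated weight reaches the next equal-share target.

```python
def breakpoints(weights, n_threads):
    """Choose n_threads-1 breakpoints along the outermost axis so that each
    contiguous set has approximately equal weighted volume.

    weights[k] is the weighted volume contributed by outer iteration k.
    Returns the outer indices at which a new set begins."""
    total = sum(weights)
    target = total / n_threads
    cuts, acc, goal = [], 0.0, target
    for k, w in enumerate(weights):
        acc += w
        if acc >= goal and len(cuts) < n_threads - 1:
            cuts.append(k + 1)   # next set starts after outer index k
            goal += target
    return cuts
```

With uniform weights this degenerates to the naïve uniform split; with a skewed profile the breakpoints shift so that heavy outer iterations are spread across fewer iterations per thread.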
Experimental Setup
  Use built-in hardware performance counters
    MEM_LOAD_RETIRED.L1D_MISS: obtain the cache miss profile
    INST_RETIRED.ANY: obtain the instructions-retired profile
Kernel Set
Results (contd.)
  Compute two metrics:
  Speedup = (tco - tca) / tca * 100
    tco = execution time with cache-oblivious partitioning
    tca = execution time with cache-aware partitioning
  Deviation: difference between the proposed technique and the worst case
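With the formula written out, Speedup = (tco - tca) / tca * 100, a worked check with hypothetical timings (not measured values from the talk):

```python
def speedup_percent(t_cache_oblivious, t_cache_aware):
    """Relative improvement (%) of cache-aware over cache-oblivious
    partitioning, per the speedup metric above."""
    return (t_cache_oblivious - t_cache_aware) / t_cache_aware * 100.0
```

For instance, if a cache-oblivious partition runs in 10 s and the cache-aware one in 8 s, the metric reports a 25% speedup.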
Thank You!
Results (contd.)
  Performance variation with different partitioning planes (3 threads)
Results
  Performance variation with different partitioning planes
  A kernel from 178.galgel: nested non-perfect multi-way DOALL loop (2 threads)