Code Transformations to Improve Memory Parallelism Vijay S. Pai and Sarita Adve MICRO-32, 1999
Motivation and Solutions
Problem: the memory system is the bottleneck in ILP-based systems
–Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality
Problem: a single instruction window rarely contains enough independent load misses
–Solution: read miss clustering, enabled by code transformations such as unroll-and-jam
Problem: automating the code transformation
–Solution: map the memory-parallelism problem onto floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Machines. Journal of Parallel and Distributed Computing, Aug. 1988)
Unroll-and-jam
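A minimal before/after sketch of the transformation (illustrative C code, not taken from the paper; the array names, sizes, and the degree-4 unroll are assumptions):

    #define N 1024

    /* Before: each outer iteration walks one row of b. Inner-loop misses hit
       mostly the same cache line (spatial locality), so few independent read
       misses fit in one instruction window. */
    void matvec(double a[N], double b[N][N], double x[N]) {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += b[i][j] * x[j];
            a[i] = sum;
        }
    }

    /* After unroll-and-jam by 4: the i loop is unrolled and the copies are
       fused ("jammed") into one inner loop, so misses to rows i..i+3 (which
       lie on different cache lines) are clustered in the same window and can
       overlap, while each row keeps its spatial locality. Assumes N is a
       multiple of 4 for brevity. */
    void matvec_uaj(double a[N], double b[N][N], double x[N]) {
        for (int i = 0; i < N; i += 4) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            for (int j = 0; j < N; j++) {
                double xj = x[j];
                s0 += b[i][j]     * xj;
                s1 += b[i + 1][j] * xj;
                s2 += b[i + 2][j] * xj;
                s3 += b[i + 3][j] * xj;
            }
            a[i] = s0;
            a[i + 1] = s1;
            a[i + 2] = s2;
            a[i + 3] = s3;
        }
    }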
Applying the code transformations in a compiler
–Automatic unroll-and-jam transformation
–Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991)
–Analysis of the dependences that limit memory parallelism: cache-line dependences, address dependences, and instruction-window constraints (see the sketch after this slide)
Experimental methodology
–Environment: Rice Simulator for ILP Multiprocessors
–Workload: Latbench and five scientific applications
–Miss clustering incorporated by hand
Results
–9-39% reduction in multiprocessor execution time
–11-48% reduction in uniprocessor execution time
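A small illustration of two of the analyses above (hypothetical C code, not from the paper): locality analysis picks the leading reference within a group of references that share a cache line, and a cache-line dependence shows why unrolling the wrong loop adds no memory parallelism.

    #define N 1024

    /* b[i][j] and b[i][j+1] normally fall in the same cache line, so they form
       one group-spatial class; only one of them (the leading reference) incurs
       the misses, and the clustering analysis is driven by that reference. */
    void stencil_row(double c[N], double b[N][N], int i) {
        for (int j = 0; j < N - 1; j++)
            c[j] = b[i][j] + b[i][j + 1];
    }

    /* Cache-line dependence: unrolling the j loop clusters references that lie
       on the same line, so they still wait on a single outstanding miss. To
       get overlapped misses, the transformation must instead unroll-and-jam a
       loop whose iterations touch different lines (such as the i loop in the
       earlier sketch). Remainder iteration omitted for brevity. */
    void stencil_row_unrolled(double c[N], double b[N][N], int i) {
        for (int j = 0; j < N - 2; j += 2) {
            c[j]     = b[i][j]     + b[i][j + 1];
            c[j + 1] = b[i][j + 1] + b[i][j + 2];
        }
    }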
Strengths
–Strong performance improvements in both uniprocessor and multiprocessor runs
Weaknesses
–The validity of the transformations is not well established (miss clustering was incorporated by hand rather than by an implemented compiler pass)
Questions to discuss:
–What hardware support is needed to overlap multiple read misses?
–Why use unroll-and-jam instead of the strip-mine-and-interchange transformation? (A sketch comparing the two follows below.)
–What do you think of the follow-up work? (V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism.)
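To make the second discussion question concrete, here is a hedged sketch of strip-mine and interchange applied to the same loop nest as the earlier unroll-and-jam sketch (illustrative C, not from the paper):

    #define N 1024

    /* Strip-mine the i loop into tiles of 4 rows and interchange the strip
       loop with the j loop: per j iteration, four rows of b are touched, so
       read misses are clustered much as with unroll-and-jam. Assumes N is a
       multiple of 4. */
    void matvec_smi(double a[N], double b[N][N], double x[N]) {
        for (int i = 0; i < N; i += 4) {            /* tile loop over rows  */
            double sum[4] = {0.0, 0.0, 0.0, 0.0};   /* per-row accumulators */
            for (int j = 0; j < N; j++)
                for (int ii = 0; ii < 4; ii++)      /* inner strip loop     */
                    sum[ii] += b[i + ii][j] * x[j];
            for (int ii = 0; ii < 4; ii++)
                a[i + ii] = sum[ii];
        }
    }

The usual textbook argument is that unroll-and-jam equals strip-mine and interchange plus full unrolling of the inner strip loop, which exposes the accumulators and the x[j] load to scalar replacement and removes the strip-loop overhead; whether this matches the authors' reasoning is left for the discussion.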