The Challenge of Teaching Program Performance Tuning David Padua Department of Computer Science University of Illinois at Urbana-Champaign
1. Parallel Programming and Performance a. Expressiveness Parallel programming models are most convenient to express certain classes of problems: Simulations (the real world is parallel) Programming of reactive codes – “dining philosophers” These computations can be represented in sequential form, but less clearly.
1. Parallel Programming and Performance b. Performance However, the reason why parallel programming is increasingly popular is performance. Physical limitations slowed performance improvements and led (among other factors) to the advent of multicores. The idea is to use (coarse-grain) parallelism for continuing gains in performance. Scalability: without software scalability there is no reason to buy new machines. For fixed performance, parallelism can reduce energy costs.
1. Parallel Programming and Performance b. Performance (continued) D. Yen, “Chip multithreading processors enable reliable high throughput computing,” Keynote speech at International Symposium on Reliability Physics (IRPS), April 2005. From Pradip Bose. Power Wall. Encyclopedia of Parallel Computing. Springer Verlag. Forthcoming.
1. Parallel Programming and Performance c. A difficult option Industry reluctantly moved towards parallelism. A necessary evil. Parallel programming is hard in good part (mainly?) because of performance. The most serious challenge programmers face is that obtaining performance is perhaps more difficult than exposing parallelism. In fact, performance does not increase linearly with parallelism. Locality, locality, locality. Communication costs. Redundant computation is sometimes desirable. Exposing more parallelism is sometimes harmful.
1. Parallel Programming and Performance d. Too much parallelism considered harmful For example, non-consistent algorithms suffer from too much parallelism as the input size increases. A vector algorithm for solving a problem of size n is consistent with the best serial algorithm for the same problem if the redundancy is bounded as n → ∞.
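The consistency criterion can be made concrete with a small operation count. The sketch below is an illustration chosen for this purpose, not an example from the slides: it assumes that redundancy means the ratio of operations performed by the parallel algorithm to those performed by the best serial one, and it uses a step-parallel prefix sum (stride doubling) as the vector algorithm. All function names are hypothetical.

```c
#include <stddef.h>

/* Additions performed by a serial running (prefix) sum of n elements. */
size_t serial_scan_adds(size_t n) {
    return n > 0 ? n - 1 : 0;
}

/* Additions performed by a step-parallel prefix sum with stride doubling:
   in the step with stride d, every element at index >= d does one addition. */
size_t parallel_scan_adds(size_t n) {
    size_t adds = 0;
    for (size_t d = 1; d < n; d *= 2) {
        adds += n - d;
    }
    return adds;
}

/* Redundancy (assumed definition): parallel operation count divided by
   the serial operation count for the same problem size. */
double redundancy(size_t n) {
    size_t serial = serial_scan_adds(n);
    return serial == 0 ? 1.0 : (double)parallel_scan_adds(n) / (double)serial;
}
```

Under this assumed definition the ratio grows roughly like log2(n), so it is not bounded as n grows and the stride-doubling scan would not count as consistent with the serial running sum.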
1. Parallel Programming and Performance e. It is not a second order effect: Matrix-matrix multiplication, Intel Xeon [Plot: MFLOPS (y-axis) versus matrix size (x-axis). Upper line: Intel MKL (hand-tuned assembly). Lower line: triply-nested loop + icc optimizations. Gap: 60X.] In this plot we compare the performance of a hand-tuned matrix-matrix multiplication program and compiler-generated code on Intel Xeon. The x-axis is matrix size and the y-axis is MFLOPS, so higher is better. The lower line is the triply-nested loop plus compiler optimizations. The higher line is Intel MKL, which is hand-tuned assembly code. Hand-tuned code is about 60 times faster than compiler-generated code, a big gap. What about other programs?
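For reference, the lower line in the plot corresponds to code of roughly the following shape. This is a minimal sketch, not the code that was measured: the row-major layout, the function names and the tiled variant are assumptions added for illustration. Tiling is only one of the transformations a hand-tuned library applies, so this does not by itself close the gap.

```c
#include <stddef.h>

/* Baseline: C += A * B for n x n row-major matrices (triply-nested loop). */
void matmul_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Same computation with loop tiling. bs is a tile size chosen by the
   caller (assumed parameter, must be nonzero); each tile of B is reused
   while it is likely to still be in cache. */
void matmul_tiled(size_t n, size_t bs, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t imax = ii + bs < n ? ii + bs : n;
                size_t kmax = kk + bs < n ? kk + bs : n;
                size_t jmax = jj + bs < n ? jj + bs : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Both functions accumulate into C, so the caller is expected to zero C first. The best value of bs depends on the machine.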
1. Parallel Programming and Performance e. It is not a second order effect: FFT, Intel Xeon [Plot: upper line: best available implementation (Intel MKL). Lower line: reasonable implementation (Numerical Recipes, GNU Scientific Library). Gap: 10X.] In this plot we compare the performance of hand-tuned FFT and compiler-generated code on Intel Xeon. Again, higher is better. The lower line is a reasonable implementation of FFT, the kind of code you can find in Numerical Recipes or the GNU Scientific Library, compiled as is. The higher line is the best hand-tuned FFT code, including the code found in Intel MKL, Spiral and FFTW. Again, hand-tuned code is about 10 times faster than compiler-generated code. These are two examples, but the same gap has been seen for other applications.
2. Parallel Programming and Program Tuning The two topics are therefore highly related. To teach parallel programming, we must also teach performance tuning. Like parallel programming, performance tuning must permeate the entire CS curriculum
3. The Art of Program Tuning The goal is the obvious one: to develop/transform codes for fast execution, low energy consumption, and scalability on a machine, a class of machines, or a range of classes of machines.
3. The Art of Program Tuning It is of practical importance. Faster programs are good for users and systems. It integrates knowledge of: Algorithms, Data structures; Complexity (Is the algorithm optimal? Given the algorithm, is the code optimal?); Machines, Tools; Program Optimization Techniques - Autotuning; Performance Prediction Techniques. So, its teaching should contribute to a better understanding of CS topics. It is good for education.
4. Challenges There is much to be done in education in program tuning. CS students of today are typically performance illiterate. There are numerous obstacles that have led to this situation. It is not always clear how to overcome these obstacles. What is clear is that they must be overcome.
4. Challenges a. It is the correctness, stupid The focus of programming education today is correctness. Understandable, since writing correct programs and debugging are hard. Correctness must be the priority. However, at the same time, performance is not a focus, although algorithm courses focus on asymptotic complexity and Computer Architecture is mainly about performance.
4. Challenges a. It is the correctness, stupid Automate performance? In an ideal world, programmers would focus on correctness and performance would be delivered automatically by libraries and compilers. But … Many programs cannot be cast in terms of libraries. Compilers help but are not reliable.
4. Challenges d. Cannot reason about it. A first tale from compilerland (1/2) S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. In preparation. 2011.
4. Challenges d. Cannot reason about it. A first tale from compilerland (1/2) Table shows whole-program speedups measured against the unvectorized application. Appl XLC ICC GCC Automatic Manual JPEG Enc - 1.33 1.39 2.13 1.57 JPEG Dec 1.14 1.13 H263 Enc 1.25 2.28 2.06 H263 Dec 1.31 1.45 MPEG2 Enc 1.06 1.96 2.43 MPEG2 Dec 1.15 1.37 1.55 MPEG4 Enc 1.44 1.81 1.74 MPEG4 Dec 1.12 1.18 S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. In preparation. 2011.
4. Challenges b. It cramps my style The topic has not been fashionable since the days of the bit tweaking programmers in the 1950s. The best and the brightest prefer AI, Formal Models, …
4. Challenges c. Too much diversity, too little consensus There is no good model that can be used to study performance in the abstract. RAM machines do not reflect the reality of the memory hierarchy. More accurate models have been developed to capture memory hierarchies, synchronization and communication costs, etc. But there are numerous models to represent the numerous ways of implementing coarse-grain parallelism at different degrees of detail. Real machines have idiosyncrasies that are not reflected by these models. Complex reality is not fertile ground for mathematical studies.
4. Challenges d. Cannot reason about it. In many cases, it is not known how machines are implemented. Nobody knows how compilers will react. There is also a lack of performance abstractions for many runtime systems. The impact on performance of different algorithmic and programming choices is difficult to predict beforehand. This also implies that grading programs for performance would be difficult.
4. Challenges d. Cannot reason about it. Another tale from compilerland
Contrary to what we would expect, the first form is faster. The compiler inlines and interchanges the iter loop in the first case. This improves locality. The compiler coalesces the second loop and then does not interchange.
First form:
for (iter=0; iter<num_iterations; iter++) { matrix_sum1(a); }
void matrix_sum1(double a[N][N]) { int i,j; for (j=0; j<N; j++) { for (i=0; i<N; i++) { a[i][j] += 1; } } }
Second form:
for (iter=0; iter<num_iterations; iter++) { matrix_sum2(a); }
void matrix_sum2(double a[N][N]){
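The locality effect of loop interchange can be tried by hand. The sketch below is an illustration under stated assumptions, not the code from the slide: sum_by_columns mirrors the loop order of matrix_sum1, while sum_by_rows is a hypothetical hand-interchanged variant. The size N and both function names are chosen here for the example.

```c
#ifndef N
#define N 256   /* assumed matrix size for the illustration */
#endif

/* Column-order traversal, the loop order of matrix_sum1: consecutive
   inner-loop iterations touch elements that are N doubles apart. */
void sum_by_columns(double a[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1;
}

/* Hand-interchanged loops (hypothetical variant): the inner loop now
   walks one contiguous row, which is friendlier to the cache in C's
   row-major layout. */
void sum_by_rows(double a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1;
}
```

Both functions compute the same result. Which one runs faster after compilation depends on whether the compiler performs the interchange itself, which is the unpredictability this slide describes.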
5. Revamping education for performance a. Permeate the curriculum Students should learn how to use performance monitoring tools and how to access hardware counters early on. Most programming assignments should request performance measurements. Programming assignments should request an explanation of the performance. Algorithm complexity and architecture courses should include programming assignments. This may also help researchers focus on performance issues and their impact on real needs. Perhaps this will encourage architects to develop machines we can reason about.
5. Revamping education for performance b. Case studies The best way to learn performance tuning is to do numerous case studies. Diverse classes of applications. Variety of machines. Different types of optimizations. The challenge is to find the right class of teachers. The difficulty is assessing the quality of the result. A capstone course on program optimization would be a place to do these case studies. Master lectures on performance programming?
5. Revamping education for performance c. Old and new topics in the curriculum Need to teach program optimization. Compiler transformation techniques are among the best for this purpose. Understood well enough to be automated. Have been formalized to the point of having a program transformation algebra. But not sufficient. Need to develop a methodology to search for the best strategy. Need to also teach autotuning. Search strategies. Multialgorithm programs.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning Automatically generate highly efficient code for each machine. Much work at first, but no need to retune for new machines within the same “class”. Domain-specific.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning [Diagram labels: Algorithm description; Generator / Search space explorer; High-level code; Source-to-source optimizer; High-level code; Native compiler; Object code; Execution; performance; Selected code]
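The generate, execute, measure, select cycle in the diagram can be sketched in a few lines. This is a minimal illustration under assumptions of my own, not the design of any particular autotuner: the tuned kernel is a tiled matrix transpose, the search space is a caller-supplied list of tile sizes, and the search is exhaustive. The names transpose_tiled and search_best_tile are hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* Kernel with one tunable parameter: transpose an n x n row-major matrix
   in bs x bs tiles. bs must be nonzero. */
void transpose_tiled(size_t n, size_t bs, const double *src, double *dst) {
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t jj = 0; jj < n; jj += bs)
            for (size_t i = ii; i < n && i < ii + bs; i++)
                for (size_t j = jj; j < n && j < jj + bs; j++)
                    dst[j * n + i] = src[i * n + j];
}

/* Empirical search: run every candidate tile size reps times, measure CPU
   time with clock(), and return the index of the fastest candidate.
   Returns 0 when count is 0. */
size_t search_best_tile(size_t n, const double *src, double *dst,
                        const size_t *candidates, size_t count, int reps) {
    size_t best = 0;
    double best_time = -1.0;
    for (size_t c = 0; c < count; c++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            transpose_tiled(n, candidates[c], src, dst);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = c;
        }
    }
    return best;
}
```

A real generator would emit different code variants rather than pass a parameter at run time, and would usually prune the search space instead of timing every point. The selected value is only meaningful on the machine where the search ran.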
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning Examples: In linear algebra: ATLAS, PHiPAC. In signal processing: FFTW, SPIRAL. Library synthesizers (LS) usually handle a fixed set of algorithms. Exception: SPIRAL accepts formulas and rewriting rules as input.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning: Three projects
Spiral. Joint project with CMU and Drexel. M. Püschel, J. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE special issue on “Program Generation, Optimization, and Platform Adaptation”. Vol. 93, No. 2, pp. 232-275. February 2005.
ATLAS and analytical model. Joint project with Cornell. K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? Proceedings of the IEEE special issue on “Program Generation, Optimization, and Platform Adaptation”. Vol. 93, No. 2, pp. 358-386. February 2005.
Sorting and adaptation to the input.
In all cases results are surprisingly good. Competitive with or better than the best manual results.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning notations The objective is to develop language extensions to implement parameterized programs. Values of the parameters are a function of the target machine and execution environment. Program synthesizers could be implemented using autotuning extensions.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning notations Example: C program with autotuning extensions
#pragma search (1<=m<=10, 0<=a<=1)
#pragma unroll m
for(i=1;i<n;i++) { … }
%if (a) {algorithm 1 }
%else {algorithm 2 }
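The search pragma and the %if notation above are proposed extensions, not standard C. A rough way to approximate the same idea with today's tools is to expose the parameters as preprocessor macros that an external search script sets at compile time (for example with -DUNROLL_M=8 -DUSE_ALG1=0) before timing each build. The sketch below assumes that setup. The macro names, the two summation routines and tuned_sum are all hypothetical stand-ins for "algorithm 1" and "algorithm 2".

```c
#include <stddef.h>

#ifndef UNROLL_M
#define UNROLL_M 4   /* hypothetical: value the search would pick in 1..10 */
#endif
#ifndef USE_ALG1
#define USE_ALG1 1   /* hypothetical: 1 selects algorithm 1, 0 algorithm 2 */
#endif

/* Stand-in for algorithm 1: sum in chunks of UNROLL_M elements. The inner
   loop has a compile-time trip count, which makes it an easy target for
   the compiler's unroller. */
double sum_chunked(const double *x, size_t n) {
    double s = 0.0;
    size_t i = 0;
    for (; i + (size_t)UNROLL_M <= n; i += (size_t)UNROLL_M)
        for (size_t u = 0; u < (size_t)UNROLL_M; u++)
            s += x[i + u];
    for (; i < n; i++)   /* remainder elements */
        s += x[i];
    return s;
}

/* Stand-in for algorithm 2: plain loop. */
double sum_plain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Entry point whose implementation is fixed by the tuning parameters. */
double tuned_sum(const double *x, size_t n) {
#if USE_ALG1
    return sum_chunked(x, n);
#else
    return sum_plain(x, n);
#endif
}
```

Unlike the proposed notation, this approach keeps the search outside the language, so the ranges of the parameters live in the build script rather than next to the code they tune.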
6. Conclusions The advent of parallelism introduces new challenges. The most difficult is education. Programmers of the future must know more: parallelism of course, but equally important is performance tuning. One of the great challenges is that performance tuning is not a mature field. Emphasis in teaching may help.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning: Sorting generation The best selection of sorting routine and implementation depends on: Architectural features (different from platform to platform). Input characteristics (only known at runtime). As we have seen, the best selection of sorting routines depends on two factors. First, the selection is different from platform to platform due to various architectural features. Second, the selection must adapt to input characteristics. The key challenge is that we only know the inputs at runtime. Because of the interaction between the two factors, it is really difficult to derive the selection analytically, the way we calculate the tile size for loop tiling. Our solution is to use machine learning techniques to learn the best selection.
Factors: input statistics [Plots: performance (keys per cycle) versus standard deviation of the input for CC-Radix, Quicksort and Merge Sort, one plot for Intel Xeon and one for AMD Athlon MP.] Furthermore, we observe that the two platforms have different numbers and different locations of the crossover points. The knowledge we learn from the comparison is that there is no single universal best sorting algorithm. If we want to achieve better performance, we need to make selections at runtime.
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning: Synthesis strategy for sorting [Diagram labels: Training inputs; Learning Mechanism; Target Machine; Mapping input data ➔ best algorithm; Used at runtime] This is the framework we use to learn the selection of sorting algorithms. A set of training inputs is generated to simulate real-world inputs. Our machine learning mechanism sorts each training input using a small set of sorting algorithms. The performance of the sorting on each input depends on the architectural features of the target machine and the input characteristics. From the results, the learning mechanism learns the mapping from input data to the best algorithm on the target machine. The mapping is used at run time to predict the best algorithm for a specific input. First let’s look at our algorithm candidates.
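The runtime half of this framework, applying a learned mapping to pick an algorithm, can be sketched as follows. This is a simplified illustration, not the mechanism from the project: the learned mapping is reduced to a single hypothetical threshold on the standard deviation of the keys, the direction of the rule (radix sort above the threshold) is an assumption, and the candidates are the C library qsort and a plain byte-wise radix sort rather than the tuned routines named on the slides. All function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef enum { SORT_QUICK, SORT_RADIX } SortChoice;

/* Variance of the keys (the standard deviation squared). */
double key_variance(const uint32_t *keys, size_t n) {
    if (n == 0) return 0.0;
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) mean += (double)keys[i];
    mean /= (double)n;
    double var = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)keys[i] - mean;
        var += d * d;
    }
    return var / (double)n;
}

/* Hypothetical learned mapping: one threshold on the standard deviation.
   Comparing squared values avoids a square root. */
SortChoice choose_sort(double variance, double stddev_threshold) {
    return variance >= stddev_threshold * stddev_threshold ? SORT_RADIX
                                                           : SORT_QUICK;
}

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Least-significant-digit radix sort, one byte per pass.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort_u32(uint32_t *keys, size_t n) {
    if (n < 2) return 0;
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)
            count[((keys[i] >> shift) & 0xFF) + 1]++;
        for (int b = 0; b < 256; b++)      /* count[b] = start of bucket b */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & 0xFF]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    return 0;
}

/* Inspect the input, pick an algorithm, sort in place, and report which
   algorithm actually ran (qsort is the fallback if radix sort fails). */
SortChoice adaptive_sort(uint32_t *keys, size_t n, double stddev_threshold) {
    SortChoice choice = choose_sort(key_variance(keys, n), stddev_threshold);
    if (choice == SORT_RADIX && radix_sort_u32(keys, n) == 0)
        return SORT_RADIX;
    qsort(keys, n, sizeof *keys, cmp_u32);
    return SORT_QUICK;
}
```

In the framework described above the threshold would come out of the training phase on the target machine, and a realistic mapping would likely use more input features and more than two candidates. Computing the statistic over the whole input also adds a full pass, which a practical version would probably replace with sampling.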
5. Revamping education for performance c. Old and new topics in the curriculum Autotuning: Sorting routine performance on Power4