Autotuning at Illinois María Jesús Garzarán University of Illinois
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems
Why autotuning?
In the era of parallelism… Applications and software must maintain high efficiency as machines evolve.
– Otherwise, there is no reason for new machines.
Problem: High efficiency requires laborious tuning.
– Cost increases.
– Low performance if there are not enough resources for tuning.
We would like to automate tuning.
Compilers One way is compilers, but compilers have limitations. – Lack semantic information → fewer choices – Must target all applications – Must be reasonably fast
Compiler vs. Manual Tuning Discrete Fourier Transform
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication [Figure: MFLOPS versus matrix size for Intel MKL, icc -O3 -xT, and icc -O3; the annotated gap is 20x.]
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]
loop 2: c[i][j] += a[i][k]*b[k][j]
loop 3: C += a[i][k]*b[k][j]
Compilers can and should improve, but we will need other strategies (at least in the short term).
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems
What is Autotuning?
An emerging strategy: empirical search.
– Goal: automatically generate highly efficient code for each target machine (and input set).
– Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms/implementations.
Autotuning with empirical search [Diagram: Metaprogram (description of the space of versions) → Generator of the versions → High-level code → Source-to-source optimizer → High-level code → Native compiler → Object code → Execution on input data (training) → performance, which is fed back to the generator → Selected code.]
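The empirical search loop in this diagram can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of any system named in this talk: the kernel (a blocked matrix multiplication in pure Python), the candidate tile sizes, and the names `blocked_matmul` and `autotune` are all assumptions made for the example.

```python
import time

def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(candidates, n=48, repeats=3):
    """Time every candidate tile size on training data and keep the fastest."""
    # Hypothetical training input; a real tuner would use representative data.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # best-of-repeats reduces timing noise
    selected = min(timings, key=timings.get)
    return selected, timings

if __name__ == "__main__":
    tile, timings = autotune([4, 8, 16, 48])
    print("selected tile size:", tile)
```

The selected tile size depends on the machine that runs the search, which is the point of the approach: the same metaprogram yields different selected code on different targets.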
Autotuning More laborious than conventional programming, but – Longer lifetime → cost reduction – Can accumulate experience → better results – Can afford to search more extensively → better results
Examples of Existing Autotuning Systems
– ATLAS: Whaley, Petitet, Dongarra (Tennessee)
– BeBop: Demmel, Yelick, Im, Vuduc (Berkeley)
– Datamining: Jian, Garzarán, Snir (Illinois)
– FFTW: Frigo (MIT)
– Illinois Sorting: Li, Garzarán, Padua (Illinois)
– Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois)
– Phipac: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley)
– Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois)
– SPIRAL: Moura, Pueschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois)
– SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems
Autotuning with empirical search [Diagram: Metaprogram (description of the version space) → Generator of the versions → High-level code → Source-to-source optimizer → High-level code → Native compiler → Object code → Execution on input data (training) → performance → Selected code.]
Questions annotated on the diagram:
– What to do when performance depends on the input?
– How to specify the search space?
– What is performance (execution time, power)?
– How to drive the search?
Research Issues 1. What to do when performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for. Very promising, but much to learn.
Issue 1: Performance depends on input
When performance depends on the input, we must generate dynamically adapting routines.
– Illustrated with the generation of sorting routines
[CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004.
[CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.
Issue 1: Sorting Different algorithms to perform sorting – Radix sort – Quick sort – Merge sort No single algorithm is the best for all inputs and platforms
Our Contribution Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics. Result: – Generation of the fastest sorting routines for sequential and parallel execution
Sorting Performance (keys per cycle) [Figure: CC-Radix, Merge Sort, and Quicksort on Intel Xeon and AMD Athlon MP, plotted against the standard deviation of the input. Annotation: same input, different performance.]
Hybrid sorting for dynamic adaptation [Diagram: a sorting genome built from the primitives Divide by digit, Divide with pivot, Divide into block, and Select with entropy (branches: < theta, ≥ theta).]
Example of hybrid sorting [Animated diagram: the sorting genome (Divide with pivot, Select with entropy with branches < theta and ≥ theta, Divide by digit, Divide into block) is applied to an input. The pivot splits the input into Bucket 1 and Bucket 2, operations are selected based on entropy, and the output is sorted.]
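Learning: Algorithm Selection [Diagram: training inputs are run on the target machine and fed to a learning mechanism, which produces a mapping from input data to the best algorithm. The mapping is used at runtime.]
Results: Sequential Sorting [Figure: Classifier Sort versus IBM ESSL and C++ STL on IBM Power3; annotated difference of 26%.]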
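A runtime selector of this kind can be sketched as follows. This is only an illustration under stated assumptions: the entropy measure (Shannon entropy of the least significant byte), the threshold `theta`, the assignment of routines to the two branches, and the function names are all hypothetical, and keys are assumed to be non-negative integers. In a real system the mapping would be learned from training inputs on the target machine.

```python
import math
from collections import Counter

def digit_entropy(keys, shift=0, bits=8):
    """Shannon entropy (in bits) of one radix digit across all keys."""
    if not keys:
        return 0.0
    mask = (1 << bits) - 1
    counts = Counter((k >> shift) & mask for k in keys)
    total = len(keys)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def radix_sort(keys, bits=8):
    """LSD radix sort for non-negative integers."""
    if not keys:
        return []
    out, shift, mask = list(keys), 0, (1 << bits) - 1
    largest = max(out)
    while (largest >> shift) > 0:
        buckets = [[] for _ in range(1 << bits)]
        for k in out:
            buckets[(k >> shift) & mask].append(k)
        out = [k for bucket in buckets for k in bucket]
        shift += bits
    return out

def hybrid_sort(keys, theta=4.0):
    """Pick a routine from the input's entropy.

    theta is a hypothetical threshold; which routine belongs on which
    branch is an assumption of this sketch, not a result from the talk.
    """
    if digit_entropy(keys) >= theta:
        return radix_sort(keys)   # assumed choice for the >= theta branch
    return sorted(keys)           # stand-in comparison sort for < theta
```

Both branches return the same sorted output; only the running time differs, so a wrong threshold costs performance but never correctness.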
Results: Parallel Sorting Intel Quad Core
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for
Issue 2: Modeling/Search
When the search space is too big, we must use models or better search mechanisms. Illustrated with:
1. An analytical model and hybrid approach for ATLAS
[PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003.
[Proc of IEEE] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE.
[LCPC05] Epshteyn, Garzarán, Dejong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
2. Genetic search for sorting [CGO04, CGO05]
ATLAS Modeling
ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet and Jack Dongarra at the University of Tennessee.
ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS) libraries.
– It uses search to adapt to the target machine.
Our Contribution
Development of methods to speed up the search process.
– Analytical models that replace the search
– Hybrid models that combine models with empirical search
[LCPC05] Epshteyn, Garzarán, Dejong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
The result:
– Same performance
– Faster generation
ATLAS Infrastructure [Diagram: Detect Hardware Parameters supplies L1Size, NR, MulAdd, and Latency to the ATLAS Search Engine (MMSearch). The search engine passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase). The generated MiniMMM source is compiled, executed, and measured, and the MFLOPS are fed back to the search engine.]
Modeling for Optimization Parameters
Our modeling engine sets the optimization parameters:
– NB: Hierarchy of Models (later)
– MU, NU:
– KU: maximize subject to L1 Instruction Cache
– Latency, MulAdd: from hardware parameters
– xFetch: set to 2
[Diagram: the ATLAS infrastructure with a Model in the place of the ATLAS Search Engine (MMSearch). Detect Hardware Parameters supplies L1Size, L1I$Size, NR, MulAdd, and Latency; NB, MU, NU, KU, xFetch, MulAdd, and Latency go to the ATLAS MM Code Generator (MMCase), which emits the MiniMMM source.]
Modeling for Tile Size (NB)
Models of increasing complexity:
– $3 \cdot NB^2 \le C$: the whole working set fits in L1
– $NB^2 + NB + 1 \le C$: fully associative cache, optimal replacement, line size of 1 word
– or line size > 1 word
– or LRU replacement
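The first two models can be evaluated directly. The sketch below assumes that C is the L1 data cache capacity measured in matrix elements (words); the 4096-element example corresponds to a hypothetical 32 KiB cache holding 8-byte values. The function names are illustrative, and `math.isqrt` requires Python 3.8 or later.

```python
import math

def nb_whole_working_set(c):
    """Largest NB with 3 * NB**2 <= C (three NB x NB tiles fit in L1)."""
    return math.isqrt(c // 3)

def nb_fully_associative(c):
    """Largest NB with NB**2 + NB + 1 <= C."""
    nb = math.isqrt(c)
    while nb > 0 and nb * nb + nb + 1 > c:
        nb -= 1
    return nb

if __name__ == "__main__":
    capacity = 4096  # hypothetical: 32 KiB L1 / 8-byte elements
    print(nb_whole_working_set(capacity))  # 36
    print(nb_fully_associative(capacity))  # 63
```

The more refined model admits a much larger tile (63 instead of 36 for this capacity), which illustrates why the hierarchy of models matters.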
MMM Performance [Figure: MFLOPS of BLAS, COMPILER, ATLAS, and MODEL on SGI R12000, Sun UltraSparc III, and Intel Pentium III.]
Models/Search Models reduce search time to zero. However, search is still necessary when a model does not exist.
Genetic search for sorting
Genetic operators are used to derive new offspring:
– Mutation (add or remove subtrees, change parameters)
– Crossover
[Diagram: the sorting genome, built from Divide by digit, Divide with pivot, Divide into block, and Select with entropy (branches: < theta, ≥ theta).]
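The two operators can be illustrated with a deliberately simplified genetic search. The sketch below is an assumption-laden stand-in: genomes are flat tuples of integer parameters rather than the tree-shaped sorting genome, mutation only changes a parameter (it does not add or remove subtrees), and `cost` is any user-supplied function, where a real tuner would use measured execution time. All names and default values are hypothetical.

```python
import random

def genetic_search(cost, param_ranges, generations=30, pop_size=12, seed=0):
    """Minimize cost(genome) over tuples of integer parameters.

    param_ranges is a list of (low, high) bounds, one per parameter;
    at least two parameters are required for crossover.
    """
    rng = random.Random(seed)

    def random_genome():
        return tuple(rng.randint(lo, hi) for lo, hi in param_ranges)

    def mutate(genome):
        # Change one parameter to a new random value within its range.
        i = rng.randrange(len(genome))
        lo, hi = param_ranges[i]
        return genome[:i] + (rng.randint(lo, hi),) + genome[i + 1:]

    def crossover(mother, father):
        # Single-point crossover: prefix of one parent, suffix of the other.
        cut = rng.randrange(1, len(mother))
        return mother[:cut] + father[cut:]

    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]  # the fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents))
            if rng.random() < 0.3:  # hypothetical mutation rate
                child = mutate(child)
            children.append(child)
        population = parents + children
    return min(population, key=cost)
```

Because the fitter half of each generation survives unchanged, the best cost found never gets worse from one generation to the next.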
Issue 2: Modeling/Search
We need tools to guide models and search:
P-Ray: Characterization of hardware
[LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-RAY: A Suite of Micro benchmarks for Multi-core Architectures. In LCPC, 2008.
Characterize Hardware
P-Ray: Development of benchmarks to measure hardware characteristics of multicore platforms.
[Diagram: the ATLAS infrastructure, in which Detect Hardware Parameters supplies L1Size, L1I$Size, NR, MulAdd, and Latency to the ATLAS Search Engine (MMSearch) and the ATLAS MM Code Generator (MMCase), which emits the MiniMMM source from NB, MU, NU, KU, xFetch, MulAdd, and Latency.]
Our Contribution
P-Ray: a tool to measure
– Block Size
– Cache Mapping
– Processor Mapping
– Effective Bandwidth
The result:
– Correct results for 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield)
P-Ray: Processor Mapping [Diagram: an 8-core Intel Harpertown system with two chips (Chip 1, Chip 2), showing how Cores 1 to 8 map to the L2 caches.]
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to automate 5. What to tune for
Issue 3: Description of the Space
The ATLAS generator is written in C.
We need more effective notations to implement a generator (describe the search space).
Two possibilities:
– Domain Specific Languages
– General Purpose Languages
Issue 3: Description of the Space
Illustrated with:
1. SPIRAL (Domain Specific Language)
[Proc. of IEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. Spiral: Code Generation for DSP Transforms. Proc. of IEEE, 2005.
2. Metalanguage (General Purpose Language)
[LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.
SPIRAL SPIRAL, generator of signal processing algorithms (DFT, DCT, WHT, filters, …) SPIRAL uses empirical search to generate routines that adapt to the target machine: – Sequential, parallel, SIMD, …
SPIRAL Contribution Declarative domain-specific language and rewriting rules to specify the search space. The result – Generation of routines that run faster than IPP (manually tuned) – Intel has started to use SPIRAL to generate parts of the IPP library
SPIRAL Search is based on breakdown and rewriting rules, written in SPL, the SPIRAL metalanguage.
SPIRAL Program Generation [Diagram: Transform (a parameterized matrix) → Rule (a breakdown strategy, here Cooley-Tukey, CT) → SPL Formula (a product of sparse matrices). The rules applied form a ruletree; panels (a) and (b) show two examples.]
SPIRAL Program Generation
SPIRAL Why is search important? – Different formulas (algorithms) have different execution times They differ in the memory access pattern Have different ILP
SPIRAL Performance Results
Metaprogramming General-purpose programming of autotuned libraries and applications. A metaprogram contains a compact description of the space of program versions and how to proceed with the search.
Metaprogram example
Search strategy:
  %try s in {2,4,8}
Program shape for each value:
  for j=1 to 128 by %s
    %for k=j to j+s-1
      a(%k) = …
Generated versions:
  for j=1 to 128 by 2
    a(j) = …
    a(j+1) = …
  for j=1 to 128 by 4
    a(j) = …
    a(j+1) = …
    a(j+2) = …
    a(j+3) = …
  for j=1 to 128 by 8
    a(j) = …
    a(j+1) = …
    a(j+2) = …
    a(j+3) = …
    a(j+4) = …
    a(j+5) = …
    a(j+6) = …
    a(j+7) = …
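The expansion performed by this metaprogram can be mimicked with a small generator. The sketch below is not the metalanguage implementation from [LCPC05]; the function names, the output format, and the use of "..." for the elided loop body are assumptions chosen to mirror the slide.

```python
def generate_unrolled(s, n=128, body="a({idx}) = ..."):
    """Return the text of the loop unrolled by factor s."""
    lines = [f"for j=1 to {n} by {s}"]
    for k in range(s):
        # Corresponds to '%for k=j to j+s-1': one statement per offset.
        idx = "j" if k == 0 else f"j+{k}"
        lines.append("  " + body.format(idx=idx))
    return "\n".join(lines)

def expand_try(factors=(2, 4, 8)):
    """Expand '%try s in {...}' into one program version per value of s."""
    return {s: generate_unrolled(s) for s in factors}

if __name__ == "__main__":
    for s, text in expand_try().items():
        print(f"--- s = {s} ---")
        print(text)
```

A search driver would then compile and time each generated version and keep the fastest one for the target machine.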
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for
Issue 4: What to tune 1. Kernels (MMM, FFT, sorting, …) 2. Codelets 3. Primitives
Codelets
– A class of (short) code sequences that appear often in an application domain.
– The set of codelets should cover much of the execution domain.
– Applications are decomposed into codelets.
– Codelets are autotuned.
Codelets
We need a database of codelets.
– Each codelet in the database contains a set of compiler optimizations.
The application is decomposed into codelets that are matched against the codelets in the database.
– Application codelets are optimized using the set of optimizations of the matched codelet in the database.
Collaboration with David Kuck and David Wong, Intel.
Primitive Operations
– Same as codelets, but not identified automatically by the compiler.
– The user is expected to write the application using primitives.
– The primitive operations are tuned for each target platform.
Example of Primitive Operations
HTA: Hierarchically Tiled Arrays
[PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006.
[PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.
Hierarchically Tiled Arrays (HTAs)
An HTA is a data type where tiles are explicit.
HTAs are manipulated with data parallel primitives.
– HTA programs look like sequential programs where parallelism is encapsulated in the data parallel primitives.
Result:
– Programs that run as fast as MPI (tested with the NAS benchmarks)
– Fewer lines of code
– Portable codes
FFT using HTA parallel primitives Can be autotuned
Data Parallel Primitives Challenge: Can we extend data parallel primitive operations to other complex data types, such as sets, trees, graphs?
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of options/space search 4. What to tune 5. What to tune for
Issue 5: What to tune for 1. Execution Time (all the previous systems) 2. Power (preliminary data in next slides) 3. Space 4. Reliability
Power in SPIRAL
Processors allow software control of operating frequency and voltage, e.g. the Intel Pentium M 770 has 6 settings:
– 2.13 GHz at volt (max performance)
– 800 MHz at volt (min power/energy)
Experimental Setup
Intel Pentium M model 770
–,,,,,
Measurements
– HW: Agilent 34134A current probe and Agilent 34401A DMM
– SW: SPIRAL-controlled automatic runtime and energy measurement routine
Optimization space
– voltage-frequency scaling
Dynamic voltage-frequency scaling
Use of voltage scaling instructions:
– CPU-bound region → run at high frequency
– Memory-bound region → run at low frequency
Minimum impact on execution time and significant reduction in energy consumption.
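The region-based policy can be sketched as a mapping from a profiled cache miss ratio to a frequency setting. The two frequencies are the maximum and minimum settings named for the Pentium M 770 in this talk; the miss-ratio threshold, the function name, and the use of the miss ratio as the test for "memory bound" are assumptions of this sketch, and no actual frequency-setting instruction is issued.

```python
HIGH_GHZ = 2.13  # maximum-performance setting named in the talk
LOW_GHZ = 0.80   # minimum-power setting named in the talk

def frequency_schedule(miss_ratios, threshold=0.05):
    """Map each profiled region to a frequency and merge adjacent regions.

    miss_ratios holds one cache miss ratio per profiled interval.
    threshold is hypothetical: intervals at or above it are treated as
    memory bound (low frequency), the others as CPU bound (high frequency).
    Returns a list of (start, end_exclusive, frequency_in_ghz) tuples.
    """
    schedule = []
    for i, ratio in enumerate(miss_ratios):
        freq = LOW_GHZ if ratio >= threshold else HIGH_GHZ
        if schedule and schedule[-1][2] == freq:
            schedule[-1][1] = i + 1  # extend the current region
        else:
            schedule.append([i, i + 1, freq])
    return [tuple(segment) for segment in schedule]

if __name__ == "__main__":
    print(frequency_schedule([0.01, 0.02, 0.20, 0.30, 0.01]))
```

Merging adjacent intervals with the same setting keeps the number of frequency switches low, which matters because each switch has a cost of its own.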
Dynamic voltage-frequency scaling: memory profile [Figure: cache miss ratio over time for WHT-2^19 (out-of-cache), with a zoomed view. Each point shows the cache miss ratio every 100 seconds. Regions are marked low frequency and high frequency.]
Dynamic voltage-frequency scaling: results [Figure: energy (Joules) versus execution time (seconds) for WHT-2^19. Annotations for dynamic voltage scaling: same execution time with 10% less energy; same energy with less execution time.]
Compiler Optimizations (Future work)
Applying dependence analysis to group together iterations with similar cache miss ratios increases the benefit of dynamic voltage scaling.
[Figures: cache miss ratio versus iterations.]
Research Agenda 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to automate 5. What to tune for