Autotuning at Illinois María Jesús Garzarán University of Illinois

Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

Why autotuning? In the era of parallelism, applications and software must maintain high efficiency as machines evolve – otherwise there is no reason for new machines. Problem: high efficiency requires laborious tuning. – Costs increase. – Performance is low when not enough resources are devoted to tuning. We would like to automate tuning.

Compilers One way to automate tuning is the compiler, but compilers have limitations: – They lack semantic information → fewer choices – They must target all applications – They must be reasonably fast

Compiler vs. Manual Tuning Discrete Fourier Transform

Compiler vs. Manual Tuning: Matrix-Matrix Multiplication. [Chart: MFLOPS vs. matrix size for Intel MKL, icc -O3 -xT, and icc -O3; the hand-tuned MKL is roughly 20x faster than the compiled code.]

Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
– loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]
– loop 2: c[i][j] += a[i][k]*b[k][j]
– loop 3: C += a[i][k]*b[k][j]
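A minimal C sketch of the full loop nests behind the three loop bodies above; the surrounding i, j, k loops and the scalar temporary s are my reconstruction of what the slide abbreviates (reading loop 3's C as a scalar accumulator):

#define N 512

/* loop 1: one-dimensional (flattened) indexing */
void mmm1(double *a, double *b, double *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i*N + j] += a[i*N + k] * b[k*N + j];
}

/* loop 2: two-dimensional array indexing */
void mmm2(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* loop 3: accumulate into a scalar so the compiler can keep the
   running sum in a register, storing once per (i, j) element */
void mmm3(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = c[i][j];
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}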

Compilers can and should improve, but we will need other strategies (at least in the short term).

Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

What is Autotuning? An emerging strategy: empirical search. – Goal: automatically generate highly efficient code for each target machine (and input set). – Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms/implementations.
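A minimal sketch of the empirical-search loop at the heart of this approach: time every candidate version on training data and keep the fastest. The routine names (version0–version2, select_best, seconds) are illustrative; in a real system the metaprogram would generate, compile, and run each version:

#include <time.h>

void version0(double *x, int n);  /* hypothetical candidate implementations */
void version1(double *x, int n);
void version2(double *x, int n);

typedef void (*version_fn)(double *, int);

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + 1e-9 * t.tv_nsec;
}

/* Empirical search: time each candidate on training input, keep the fastest. */
int select_best(version_fn v[], int nversions, double *training, int n) {
    int best = 0;
    double best_time = 1e30;
    for (int i = 0; i < nversions; i++) {
        double t0 = seconds();
        v[i](training, n);
        double t = seconds() - t0;
        if (t < best_time) { best_time = t; best = i; }
    }
    return best;
}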

Autotuning with empirical search. [Diagram: a metaprogram (a description of the space of versions) drives a generator that produces high-level code versions; each version goes through a source-to-source optimizer and the native compiler, the object code is executed on training input data, and the measured performance feeds back into the generator until a code version is selected.]

Autotuning is more laborious than conventional programming, but: – Longer lifetime → cost reduction – Can accumulate experience → better results – Can afford to search more extensively → better results

Examples of Existing Autotuning Systems ATLAS: Whaley, Petitet, Dongarra (Tennessee) BeBOP: Demmel, Yelick, Im, Vuduc (Berkeley) Datamining: Jian, Garzarán, Snir (Illinois) FFTW: Frigo (MIT) Illinois Sorting: Li, Garzarán, Padua (Illinois) Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois) PHiPAC: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley) Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois) SPIRAL: Moura, Püschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois) SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)

Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

Autotuning with empirical search: research questions. [The same diagram, annotated: What to do when performance depends on the input? How to specify the search space (the metaprogram's description of the version space)? What is performance (execution time, power)? How to drive the search?]

Research Issues 1. What to do when performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for Very promising, but much to learn.

Issue 1: Performance Depends on the Input When performance depends on the input, we must generate dynamically adapting routines. – Illustrated with the generation of sorting routines. [CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004. [CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.

Issue 1: Sorting There are different algorithms to perform sorting: – Radix sort – Quicksort – Merge sort No single algorithm is best for all inputs and platforms.

Our Contribution Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics. Result: – Generation of the fastest sorting routines for sequential and parallel execution

Sorting performance (keys per cycle) as a function of the standard deviation of the input keys, on an Intel Xeon and an AMD Athlon MP, for CC-Radix, Merge Sort, and Quicksort. [Charts across two slides: the best algorithm changes with the input's standard deviation, and the same input gives different performance on the two machines.]

Hybrid sorting for dynamic adaptation: the sorting "genome". [Diagram: a tree of primitives – divide with pivot, divide by digit, divide into blocks – where a "select with entropy" node chooses the next primitive depending on whether the input's entropy is < theta or ≥ theta.]

Example of hybrid sorting. [Animation across several slides: the input is divided with a pivot into Bucket 1 and Bucket 2; within each bucket the next operation is selected based on entropy (divide by digit when entropy ≥ theta, divide with pivot otherwise), and the process repeats until the whole input is sorted.]
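A minimal C sketch of the "select with entropy" idea: compute the entropy of one digit (byte) position of the keys and branch on a trained threshold theta. The radix_sort/quicksort routines, the digit choice, and theta are hypothetical stand-ins; the tuned library's actual primitives and parameters differ:

#include <math.h>

void radix_sort(unsigned *keys, int n);  /* hypothetical: divide by digit */
void quicksort(unsigned *keys, int n);   /* hypothetical: divide with pivot */

/* Entropy of the distribution of one byte of the keys. */
double digit_entropy(const unsigned *keys, int n, int byte) {
    int count[256] = {0};
    for (int i = 0; i < n; i++)
        count[(keys[i] >> (8 * byte)) & 0xFF]++;
    double h = 0.0;
    for (int b = 0; b < 256; b++) {
        if (count[b] == 0) continue;
        double p = (double)count[b] / n;
        h -= p * log2(p);
    }
    return h;
}

/* Select the next primitive based on the measured entropy. */
void hybrid_sort(unsigned *keys, int n, double theta) {
    if (digit_entropy(keys, n, 3) >= theta)
        radix_sort(keys, n);
    else
        quicksort(keys, n);
}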

Learning: algorithm selection. [Diagram: training inputs run on the target machine feed a learning mechanism, which produces a mapping from input data to the best algorithm; the mapping is used at runtime.]

Results: sequential sorting on IBM Power3. [Chart: keys per cycle for Classifier Sort, IBM ESSL, and C++ STL sort; the 26% annotation marks the advantage of the generated Classifier Sort.]

Results: parallel sorting on an Intel quad-core. [Chart elided in the transcript.]

Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

Issue 2: Modeling/Search When the search space is too big, we must use models or better search mechanisms. Illustrated with: 1. An analytical model and a hybrid approach for ATLAS. [PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003. [ProcIEEE05] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE, 2005. [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005. 2. Genetic search for sorting [CGO04, CGO05].

ATLAS Modeling ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee. ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS) libraries. – It uses search to adapt to the target machine.

Our Contribution Development of methods to speed up the search process: – Analytical models that replace the search – Hybrid approaches that combine models with empirical search [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005. The result: – Same performance – Faster generation

ATLAS Infrastructure. [Diagram: a "detect hardware parameters" stage supplies NR, MulAdd, Latency, and L1Size to the ATLAS search engine (MMSearch); the search engine passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM code generator (MMCase); the generated mini-MMM source is compiled, executed, and measured in MFLOPS, and the measurements feed back into the search.]

Modeling for Optimization Parameters. [Diagram: our modeling engine replaces the ATLAS search engine, computing the optimization parameters directly from the detected hardware parameters, which now include the L1 instruction cache size.] Optimization parameters: – NB: hierarchy of models (next slide) – MU, NU: chosen so the register working set fits in the NR registers [formula elided in the transcript] – KU: maximized subject to the L1 instruction cache size – Latency, MulAdd: from the hardware parameters – xFetch: set to 2

Modeling for Tile Size (NB) Models of increasing complexity: – 3·NB² ≤ C: the whole working set fits in L1. – NB² + NB + 1 ≤ C: fully associative cache, optimal replacement, line size of 1 word. – Refined inequalities (elided in the transcript) for line size > 1 word and for LRU replacement.
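The first two models translate directly into code; a small sketch in C, where c_words is the L1 capacity C in words (function names are mine, not ATLAS's):

#include <math.h>

/* Simplest model: three NB x NB tiles fit in L1, i.e. 3*NB^2 <= C. */
int nb_simple(int c_words) {
    return (int)floor(sqrt(c_words / 3.0));
}

/* Refined model for a fully associative cache with optimal replacement
   and 1-word lines: largest NB with NB^2 + NB + 1 <= C. */
int nb_refined(int c_words) {
    int nb = (int)floor(sqrt((double)c_words));
    while (nb > 0 && (long)nb * nb + nb + 1 > c_words)
        nb--;
    return nb;
}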

MMM performance on SGI R12000, Sun UltraSparc III, and Intel Pentium III. [Charts: MFLOPS achieved by the vendor BLAS, the native compiler, ATLAS with search, and the model-based version; the model is competitive with search.]

Models/Search Models reduce search time to zero. However, search is still necessary when a model does not exist.

Genetic search for sorting. [Diagram: the sorting genome tree from before.] Genetic operators are used to derive new offspring: – Mutation (add or remove subtrees, change parameters) – Crossover
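A rough sketch of what mutation over such genome trees could look like, with a hypothetical node representation (the real system's encoding and operators are richer):

#include <stdlib.h>

/* Hypothetical genome node: a sorting primitive plus a parameter
   (e.g., the entropy threshold theta or a digit position). */
enum prim { DIVIDE_PIVOT, DIVIDE_DIGIT, DIVIDE_BLOCK, SELECT_ENTROPY };

struct node {
    enum prim op;
    double param;
    struct node *left, *right;   /* subtrees for each branch */
};

/* Mutation: with equal probability, perturb this node's parameter or
   change its primitive; then recurse into a random child. Crossover
   (not shown) would exchange subtrees between two genomes. */
void mutate(struct node *n) {
    if (!n) return;
    if (rand() % 2)
        n->param *= 0.8 + 0.4 * ((double)rand() / RAND_MAX);
    else
        n->op = (enum prim)(rand() % 4);
    mutate(rand() % 2 ? n->left : n->right);
}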

Issue 2: Modeling/Search We need tools to support models and search: P-Ray, a characterization of the hardware. [LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-RAY: A Suite of Micro-benchmarks for Multi-core Architectures. In LCPC, 2008.

Characterize Hardware P-Ray: development of benchmarks to measure the hardware characteristics of multicore platforms. [Diagram: the ATLAS-style infrastructure again, with P-Ray providing the "detect hardware parameters" stage.]

Our Contribution P-Ray: a tool to measure – Block (cache line) size – Cache mapping – Processor mapping – Effective bandwidth The result: – Correct results on 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield)
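To give the flavor of such microbenchmarks: the cache block (line) size can be estimated by timing strided traversals of a large buffer and looking for the stride beyond which the per-access cost stops growing. A minimal sketch, not P-Ray's actual code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF (32 * 1024 * 1024)

/* Walk the buffer with a given stride; once the stride reaches the
   line size, every access touches a new cache line and the time per
   access stops increasing. Returns seconds per access. */
double time_stride(volatile char *buf, int stride) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < BUF; i += stride)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    return dt / (BUF / stride);
}

int main(void) {
    volatile char *buf = malloc(BUF);
    if (!buf) return 1;
    for (int stride = 4; stride <= 512; stride *= 2)
        printf("stride %4d: %.2f ns/access\n",
               stride, 1e9 * time_stride(buf, stride));
    free((void *)buf);
    return 0;
}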

P-Ray: Processor Mapping. [Diagram: an 8-core Intel Harpertown system with two chips; chip 1 holds cores 1, 3, 5, and 7, chip 2 holds cores 2, 4, 6, and 8, and each L2 cache is shared by a pair of cores.]

Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

Issue 3: Description of the Space The ATLAS generator is written in C. We need more effective notations to implement a generator (that is, to describe the search space). Two possibilities: – Domain-specific languages – General-purpose languages

Issue 3: Description of the Space Illustrated with: 1. SPIRAL (domain-specific language). [ProcIEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proc. of the IEEE, 2005. 2. Metalanguage (general-purpose language). [LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.

SPIRAL SPIRAL is a generator of signal processing algorithms (DFT, DCT, WHT, filters, …). SPIRAL uses empirical search to generate routines that adapt to the target machine: – Sequential, parallel, SIMD, …

SPIRAL Contribution A declarative domain-specific language and rewriting rules to specify the search space. The result: – Generation of routines that run faster than IPP (Intel's manually tuned library) – Intel has started to use SPIRAL to generate parts of the IPP library

SPIRAL Search is based on breakdown and rewriting rules, written in SPL, SPIRAL's metalanguage. [Example formulas elided in the transcript.]

SPIRAL Program Generation. [Diagram: (a) a transform (a parameterized matrix) is expanded by a breakdown strategy such as Cooley-Tukey (CT) into a ruletree; (b) the ruletree is translated into an SPL formula, a product of sparse matrices.]
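For reference, the Cooley-Tukey breakdown rule mentioned above has the standard SPL form (standard formulation, reconstructed rather than copied from the slide):

DFT_nm = (DFT_n ⊗ I_m) · T^nm_m · (I_n ⊗ DFT_m) · L^nm_n

where ⊗ is the tensor (Kronecker) product, T^nm_m is a diagonal matrix of twiddle factors, and L^nm_n is a stride permutation. Different recursive choices of n and m yield the different formulas that SPIRAL searches over.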

SPIRAL Program Generation. [Diagram elided in the transcript.]

SPIRAL Why is search important? – Different formulas (algorithms) have different execution times: they differ in memory access pattern and in instruction-level parallelism (ILP).

SPIRAL Performance Results

Metaprogramming General-purpose programming of autotuned libraries and applications. A metaprogram contains a compact description of the space of program versions and how to proceed with the search.

Metaprogram example. The metaprogram:

%try s in {2,4,8}
for j=1 to 128 by %s
  %for k=j to j+s-1
    a(%k) = …

The %try directive expresses the search strategy; each value of s expands to a different program shape. For s=2:

for j=1 to 128 by 2
  a(j) = …
  a(j+1) = …

For s=4:

for j=1 to 128 by 4
  a(j) = …
  a(j+1) = …
  a(j+2) = …
  a(j+3) = …

For s=8:

for j=1 to 128 by 8
  a(j) = …
  a(j+1) = …
  a(j+2) = …
  a(j+3) = …
  a(j+4) = …
  a(j+5) = …
  a(j+6) = …
  a(j+7) = …
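A generator for this metaprogram is easy to sketch; here is a hypothetical C program that emits the three expanded versions above (a real metaprogram processor would also compile and time each version to drive the search):

#include <stdio.h>

/* Emit one unrolled-loop version per candidate unroll factor s,
   mimicking the %try/%for expansion of the metaprogram. */
int main(void) {
    int candidates[] = {2, 4, 8};
    for (int v = 0; v < 3; v++) {
        int s = candidates[v];
        printf("/* version: s = %d */\n", s);
        printf("for (j = 1; j <= 128; j += %d) {\n", s);
        for (int k = 0; k < s; k++)
            printf("    a[j + %d] = /* ... */;\n", k);
        printf("}\n\n");
    }
    return 0;
}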

Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

Issue 4: What to tune 1. Kernels (MMM, FFT, sorting, …) 2. Codelets 3. Primitives

Codelets A codelet is a (short) code sequence from a class that appears often in an application domain. The set of codelets should cover much of the execution in the domain. Applications are decomposed into codelets, and the codelets are autotuned.

Codelets We need a database of codelets: – Each codelet in the database carries a set of compiler optimizations. An application is decomposed into codelets that are matched against the codelets in the database: – The application's codelets are optimized using the set of optimizations of the matched database codelet. Collaboration with David Kuck and David Wong, Intel.

Primitive Operations The same idea as codelets, but not identified automatically by the compiler: the user is expected to write the application using primitives, and the primitive operations are tuned for each target platform.

Example of Primitive Operations HTA: Hierarchically Tiled Arrays. [PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006. [PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.

Hierarchically Tiled Arrays (HTAs) An HTA is a data type where tiles are explicit. HTAs are manipulated with data-parallel primitives: – HTA programs look like sequential programs where parallelism is encapsulated in the data-parallel primitives. Result: – Programs that run as fast as MPI (tested with the NAS benchmarks) – Fewer lines of code – Portable codes

FFT using HTA parallel primitives. [Code figure elided in the transcript.] The primitives can be autotuned.

Data-Parallel Primitives Challenge: can we extend data-parallel primitive operations to other complex data types, such as sets, trees, and graphs?

Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

Issue 5: What to tune for 1. Execution time (all the previous systems) 2. Power (preliminary data in the next slides) 3. Space 4. Reliability

Power in SPIRAL Processors allow software control of operating frequency and voltage. E.g., the Intel Pentium M 770 has 6 settings: – 2.13 GHz at the maximum voltage (max performance) – 800 MHz at the minimum voltage (min power/energy) [Voltage values elided in the transcript.]

Experimental Setup Intel Pentium M model 770 (the list of frequency/voltage settings is elided in the transcript). Measurements: – HW: Agilent 34134A current probe and Agilent 34401A DMM – SW: SPIRAL-controlled automatic runtime and energy measurement routine. Optimization space: – voltage-frequency scaling

Dynamic voltage-frequency scaling Use of voltage-scaling instructions: – CPU-bound region → run at high frequency – Memory-bound region → run at low frequency This has minimal impact on execution time and yields a significant reduction in energy consumption.
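As an illustration only: on Linux, a runtime could request frequency changes through the cpufreq sysfs interface. This is a hypothetical sketch, assuming the "userspace" governor is active and the process has write permission (the work described in the slides used the processor's voltage-scaling mechanism directly via SPIRAL's own infrastructure):

#include <stdio.h>

/* Hypothetical: request a CPU frequency (in kHz) via Linux cpufreq.
   Real code would check per-core paths and handle errors robustly. */
static int set_cpu_frequency_khz(int cpu, long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

void compute_phases(void) {
    set_cpu_frequency_khz(0, 800000);   /* memory-bound phase: low frequency */
    /* ... out-of-cache traversal ... */
    set_cpu_frequency_khz(0, 2130000);  /* CPU-bound phase: high frequency */
    /* ... in-cache computation ... */
}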

Dynamic voltage-frequency scaling: memory profile of WHT-2^19 (out-of-cache). [Chart: cache miss ratio over time; each point shows the cache miss ratio over a 100-microsecond interval, with a zoomed-in view of one region.]

Dynamic voltage-frequency scaling: memory profile of WHT-2^19 (out-of-cache). [The same chart, annotated with the regions to run at low frequency (high cache miss ratio) and at high frequency (low cache miss ratio).]

Dynamic voltage-frequency scaling: results for WHT-2^19. [Chart: energy (Joules) versus execution time (seconds).]

Dynamic voltage-frequency scaling: results for WHT-2^19. [Chart: energy versus execution time with dynamic voltage scaling; compared to a fixed setting, it achieves either the same execution time with 10% less energy, or the same energy with less execution time.]

Compiler Optimizations (future work) Apply dependence analysis and group together iterations with similar cache miss ratios → this increases the benefit of dynamic voltage scaling. [Charts: cache miss ratio vs. iterations, before and after grouping.]

Research Agenda 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for