Automatic Tuning of Scientific Applications. Apan Qasem, Ken Kennedy. Rice University, Houston, TX.

Presentation transcript:

Automatic Tuning of Scientific Applications. Apan Qasem, Ken Kennedy. Rice University, Houston, TX

AutoTuning Workshop 06, Rice University

Recap from Last Year
A framework for automatic tuning of applications
–Fine-grain control of transformations
–Feedback beyond whole-program execution time
–Parameterized search engine
–Target: whole applications
–Search space: multi-loop transformations (e.g., loop fusion) and numerical parameters

Recap from Last Year
Experiments with Direct Search
–Direct Search was able to find suitable tile sizes and unroll factors by exploring only a small fraction of the search space: 95% of the best performance was obtained by exploring 5% of the space
–Search-space pruning is still needed to make the search more efficient: the search wandered into regions containing mostly bad values, and Direct Search required more than 30 program evaluations in many cases

Today's Talk: Search Space Pruning

Search Space Pruning
Key idea: search for architecture-dependent model parameters rather than transformation parameters
–A fundamentally different way of looking at the optimization search space
–Implemented for loop fusion and tiling [Qasem and Kennedy, ICS '06]

[Figure: the transformation search space has (L + 1) dimensions and N^L x 2^L points: one tile-size dimension per loop (T_01 ... T_0N, T_11 ... T_1N, ...) plus one dimension for the fusion configuration (F_0 ... F_2^L), given in a Gray code representation. Cost models supply estimates of the architectural parameters, the effective register set and the effective cache capacity (L1), pruning the space to (N - p)^L x (2^L - q) points. The new search space, over the architectural parameters themselves, has only two dimensions!]
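The fusion dimension of the search space is Gray-coded so that neighboring points differ in exactly one fusion decision. A minimal sketch of such an encoding (the helper names and the bit-per-boundary interpretation are mine, not from the talk):

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code: consecutive integers map to codes
    that differ in exactly one bit (one fusion decision)."""
    return n ^ (n >> 1)

def from_gray(g: int) -> int:
    """Invert the Gray code by folding the high bits back down."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# With L fusible loop boundaries there are 2**L fusion configurations;
# bit k of a configuration says whether boundary k is fused.
L = 3
configs = [to_gray(i) for i in range(2 ** L)]

# Walking the sequence flips exactly one fusion decision per step.
for a, b in zip(configs, configs[1:]):
    assert bin(a ^ b).count("1") == 1
```

Stepping through configurations in Gray-code order means each search move perturbs only one fusion boundary, which keeps neighboring points in the space structurally similar.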

Our Approach
Build a combined cost model for fusion and tiling to capture the interaction between the two transformations
–Use reuse analysis to estimate trade-offs
Expose architecture-dependent parameters within the model for tuning through empirical search
–Pick the tile size T such that Working Set ≤ Effective Cache Capacity
–Search for a suitable Effective Cache Capacity

Tuning Parameters
Use a tolerance term T to determine how much of a resource we can use at each tuning step
–Effective Register Set = T x Register Set Size [0 < T ≤ 1]
–Effective Cache Capacity = E(a, s, T) [0.01 ≤ T ≤ 0.20], where T is a tolerance on the miss rate
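The slide gives the register rule exactly but not a closed form for E(a, s, T), so the cache half of this sketch is a placeholder of my own, included only to show the shape of the two tuning parameters (the associativity argument is accepted for signature fidelity but unused in the toy stand-in):

```python
def effective_register_set(t: float, register_set_size: int) -> int:
    """Effective register set = T x register set size, with 0 < T <= 1."""
    assert 0.0 < t <= 1.0
    return int(t * register_set_size)

def effective_cache_capacity(assoc: int, size_bytes: int, t: float) -> int:
    """Illustrative stand-in for E(a, s, T): the talk does not give E's
    formula, so this simply scales capacity down as the tolerated miss
    rate T (0.01 <= T <= 0.20) shrinks. Not the authors' model."""
    assert 0.01 <= t <= 0.20
    return int(size_bytes * (t / 0.20))
```

Whatever form E takes, the point of the approach is that the search tunes T (two scalars in total) rather than per-loop tile sizes and fusion choices.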

Search Strategy
Start off conservatively with a low tolerance value and increase the tolerance at each step
–Each tuning parameter constitutes a single search dimension
–Search is sequential and orthogonal: stop when performance starts to worsen, and use reference values for the other dimensions when searching a particular dimension
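The strategy above can be sketched as a small driver. All names here (dimension labels, the `evaluate` callback) are illustrative, and the stopping rule is the slide's: quit a dimension as soon as the measured runtime worsens.

```python
def tune_sequential(evaluate, start, step, limit, reference):
    """Sequential, orthogonal search over tolerance parameters.

    evaluate(params) -> runtime (lower is better). `start`, `step`, and
    `limit` map each dimension name to its initial tolerance, increment,
    and upper bound; `reference` holds the values used for the dimensions
    not currently being searched.
    """
    best = dict(reference)
    for dim in start:
        # Search this dimension with the others fixed at reference values.
        params = dict(reference)
        t = start[dim]
        params[dim] = t
        best_t, best_time = t, evaluate(params)
        while t + step[dim] <= limit[dim] + 1e-12:
            t += step[dim]                 # raise the tolerance one step
            params[dim] = t
            runtime = evaluate(params)
            if runtime >= best_time:       # performance started to worsen
                break
            best_t, best_time = t, runtime
        best[dim] = best_t
    return best
```

For example, with a synthetic runtime model minimized at cache tolerance 0.10 and register tolerance 0.60, the driver recovers both optima one dimension at a time.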

Benefits of Pruning Strategy
Reduces the size of the exploration search space
–A single parameter captures the effects of multiple transformations: effective cache capacity for fusion and tiling choices; register pressure for fusion and loop unrolling
–The search space does not grow with program size: one parameter serves all tiled loops in the application
Corrects for inaccuracies in the model

Performance Across Architectures

Performance Comparison with Direct Search

Tuning Time Comparison with Direct Search

Conclusions and Future Work
Tuning for architectural parameters can significantly reduce the optimization search space while incurring only a small performance penalty
Extend the pruning strategy to cover more
–transformations: unroll-and-jam, array padding
–architectural parameters: TLB

Questions

Extra Slides Begin Here

Performance Improvement Comparison

Tuning Time Comparison

Framework Overview
[Figure: the user supplies the source and a tuning level; LoopTool produces transformed source, which the native compiler builds into a binary; performance measurement tools feed results back to the parameterized search engine, which draws on the architectural specs and the search space to choose the parameters for the next iteration.]

Why Direct Search?
–Search decisions are based solely on function evaluations: no modeling of the search space is required
–Provides approximate solutions at each stage of the calculation: the search can be stopped at any point when constrained by tuning time
–Flexible: step sizes can be tuned in different dimensions
–Parallelizable
–Relatively easy to implement
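A minimal compass-style direct search illustrates these properties: it uses only function evaluations, and the best point so far is a usable approximate answer whenever the tuning-time budget runs out. This is a generic sketch, not the exact variant used in the talk's framework.

```python
def direct_search(f, x0, step=1.0, min_step=1e-3, max_evals=200):
    """Compass-style direct search: probe +/- step along each axis,
    move to any improving point, and halve the step when no probe
    improves. No derivatives or model of f are needed."""
    x, fx, evals = list(x0), f(x0), 1
    while step >= min_step and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                ft = f(trial)
                evals += 1
                if ft < fx:               # accept the first improvement
                    x, fx, improved = trial, ft, True
                    break
            if improved:
                break
        if not improved:
            step /= 2.0                    # refine around the current best
    return x, fx
```

On a simple convex objective such as (x - 3)^2 + (y + 1)^2 starting from the origin, the search walks to the minimizer using unit steps and then shrinks the step until it hits `min_step`.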

(a) code before transformations

L_A: do j = 1, N
       do i = 1, M
         b(i,j) = a(i,j) + a(i,j-1)
       enddo
     enddo

L_B: do j = 1, N
       do i = 1, M
         c(i,j) = b(i,j) + d(j)
       enddo
     enddo

Reuse: outer-loop reuse of a(), cross-loop reuse of b(), inner-loop reuse of d()

(b) code after two-level fusion

L_AB: do j = 1, N
        do i = 1, M
          b(i,j) = a(i,j) + a(i,j-1)
          c(i,j) = b(i,j) + d(j)
        enddo
      enddo

Saved loads of b(), but lost reuse of a() and increased potential for conflict misses

(c) code after fusion and tiling

do i = 1, M, T
  do j = 1, N
    do ii = i, i + T - 1
      b(ii,j) = a(ii,j) + a(ii,j-1)
      c(ii,j) = b(ii,j) + d(j)
    enddo
  enddo
enddo

Regained reuse of a(), but reduced reuse of d()

How do we pick T?
–Not too difficult if caches are fully associative
–Can use models to estimate the effective cache size for set-associative caches
–The model is unlikely to be totally accurate, so we need a way to correct for inaccuracies
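One way to read the talk's rule "pick T such that working set ≤ effective cache capacity" is as a simple bounded scan, sketched below. The working-set function is a hypothetical stand-in supplied by a cost model, not the authors' actual model; it is assumed monotonically increasing in T.

```python
def pick_tile_size(working_set_bytes, effective_cache_bytes, t_max=512):
    """Largest tile size T whose working set fits in the effective cache
    capacity; falls back to T = 1 if even the smallest tile overflows."""
    best = 1
    for t in range(1, t_max + 1):
        if working_set_bytes(t) <= effective_cache_bytes:
            best = t
        else:
            break                       # working set assumed monotone in T
    return best

# Hypothetical working set for the fused loop above: roughly four
# T-element vectors (tiles of a, b, c plus a(ii, j-1)) of 8-byte doubles.
ws = lambda t: 8 * (4 * t)
```

Because the effective cache capacity itself is a tuned parameter, the empirical search can correct for inaccuracies in this model by shrinking or growing the capacity it hands to `pick_tile_size`.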

[Figures: step-by-step build-up of the search-space diagram from slide 6, ending with the pruned (N - p)^L x (2^L - q)-point space and the two-dimensional space of machine parameters (effective register set and effective cache capacity) estimated by the cost models.]