
The Power of Belady's Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign

Motivation
Long basic blocks arise from:
- Compiler optimizations
- Library generators such as SPIRAL and ATLAS
(Chart: speedup obtained from unrolling FFTs.)
This talk studies the effectiveness of register allocation on long basic blocks.

Contributions
We apply Belady's MIN algorithm to register allocation for long basic blocks. Compared with MIPSPro, Belady's MIN algorithm performs:
- 10% faster on matrix multiplication code
- 12% faster on FFT code of size 32
- 33% faster on FFT code of size 64

Outline
- Belady's MIN algorithm
- On FFT code
- On matrix multiplication code
- Conclusions

Belady's MIN Algorithm
- When a register must be freed, evict the value whose next use is farthest in the future.
- Guarantees the minimum number of reloads, but not the minimum number of stores.

Example with 3 FP registers:

Source:
  1: c = a + b
  2: e = a + d
  3: f = a - b
  4: h = d + c

Generated code:
  load R1, a
  load R2, b
  add R3, R1, R2
  store R3, c    ; to load d, spill c: among the residents its next use (statement 4) is farthest, and it is dirty
  load R3, d
  ...

Register state when d replaces c in R3:
  Reg:       R1      R2      R3
  Var:       a       b       c, then d
  Next use:  2       3       4, then 4
  Status:    clean   clean   dirty, spilled, then clean
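The eviction rule is simple enough to sketch in a few lines. The C++ below is a minimal, hypothetical illustration of farthest-next-use spilling on straight-line code, not the paper's implementation: the toy IR (Instr) and the helper names (nextUse, allocate, place) are ours, and code emission is reduced to trace output. "Evict" in the trace means the register is reused; a store is only actually needed when the evicted value is dirty.

    #include <cstddef>
    #include <iostream>
    #include <limits>
    #include <string>
    #include <vector>

    // One three-address statement: def = f(uses).
    struct Instr {
        std::string def;                // variable written
        std::vector<std::string> uses;  // variables read
    };

    // Index of the first statement at or after `pos` that reads `var`;
    // "infinite" distance means the value is dead (the ideal victim).
    std::size_t nextUse(const std::vector<Instr>& block, std::size_t pos,
                        const std::string& var) {
        for (std::size_t i = pos; i < block.size(); ++i)
            for (const std::string& u : block[i].uses)
                if (u == var) return i;
        return std::numeric_limits<std::size_t>::max();
    }

    // Walk the block; when every register is busy, evict the resident value
    // whose next use is farthest in the future (Belady's MIN rule).
    void allocate(const std::vector<Instr>& block, std::size_t numRegs) {
        std::vector<std::string> reg(numRegs);  // reg[r] = resident var, "" if free

        auto place = [&](const std::string& var, std::size_t pos, bool isLoad) {
            for (const std::string& v : reg)
                if (v == var) return;                     // already resident
            std::size_t slot = numRegs;
            for (std::size_t r = 0; r < numRegs; ++r)
                if (reg[r].empty()) { slot = r; break; }  // take a free register
            if (slot == numRegs) {                        // all busy: apply MIN
                std::size_t farthest = 0;
                slot = 0;
                for (std::size_t r = 0; r < numRegs; ++r) {
                    std::size_t d = nextUse(block, pos, reg[r]);
                    if (d >= farthest) { farthest = d; slot = r; }
                }
                std::cout << "  evict " << reg[slot] << " (store only if dirty)\n";
            }
            reg[slot] = var;
            if (isLoad) std::cout << "  load " << var << " -> R" << slot << "\n";
        };

        for (std::size_t i = 0; i < block.size(); ++i) {
            std::cout << "stmt " << i + 1 << ":\n";
            for (const std::string& u : block[i].uses)
                place(u, i, true);             // operand distances count statement i itself
            place(block[i].def, i + 1, false); // the result's next use starts after i
        }
    }

    int main() {
        // The slide's example with 3 FP registers:
        //   c = a + b;  e = a + d;  f = a - b;  h = d + c
        std::vector<Instr> block = {{"c", {"a", "b"}},
                                    {"e", {"a", "d"}},
                                    {"f", {"a", "b"}},
                                    {"h", {"d", "c"}}};
        allocate(block, 3);
        return 0;
    }

Running it on the slide's four-statement block reproduces the slide's decision: at statement 2, c is evicted to make room for d because c's next use is the farthest.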

LCPC20036 Belady’s algorithm A Simple Compiler Long Basic Blocks Parsing Register Allocation MIPS assembly code Target code generation SPIRAL after optimizations ATLAS after optimizations

Outline
- Belady's MIN algorithm
- On FFT code
- On matrix multiplication code
- Conclusions

SPIRAL and FFT Code
- SPIRAL uses formula transformations and intelligent search to generate optimized DSP libraries.
- It searches for the best degree of unrolling:
  - Small sizes (2 to 64): straight-line code; FFT64 unrolls into a long list of statements.
  - Large sizes (128 and up): built from the small-size results as components.
(Figure: decomposition of a large FFT into smaller components F_r and F_s.)
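The F_r and F_s in the figure refer to the Cooley-Tukey tensor-product factorization that SPIRAL's formula transformations are built on. The slide's own equation did not survive extraction, so the following is our reconstruction of the standard breakdown rule, in LaTeX notation:

    F_{rs} = (F_r \otimes I_s) \, T^{rs}_s \, (I_r \otimes F_s) \, L^{rs}_r

Here F_n is the order-n DFT matrix, \otimes the tensor product, T^{rs}_s the diagonal matrix of twiddle factors, and L^{rs}_r the stride permutation. Applying the rule recursively to F_r and F_s is what lets the large sizes be built from the small-size straight-line codes.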

Interesting Patterns in FFT Codes

  ...
  y32 = y2 - y18
  y33 = y3 - y19
  y34 = y2 + y18
  y35 = y3 + y19
  y36 = y10 - y26
  y37 = y11 - y27
  y38 = y10 + y26
  y39 = y11 + y27
  y40 = y34 - y38
  y41 = y35 - y39
  y42 = y34 + y38
  y43 = y35 + y39
  y44 = y32 - y37
  y45 = y33 + y36
  ...

Patterns: each variable has one definition and two uses, and the uses are close to the definition. Simplifying this to a one-define, one-use program, where every value dies at its single use, it can be proved that Belady's MIN algorithm generates the minimum number of reloads and stores.

Performance Evaluation: FFT
Speedup over MIPSPro: 12% for FFT32, 33% for FFT64.
(Charts: performance of the best formula for FFT sizes 4 to 64, and performance of all formulas for FFT64; points are marked "no spill" or "spills".)

Outline
- Belady's MIN algorithm
- On FFT code
- On matrix multiplication code
- Conclusions

ATLAS and Matrix Multiplication
- ATLAS is an empirical optimizer that searches for the best parameter values (degree of unrolling, etc.).
- Study: the innermost loop body when KU = 64, with NU:MU varied from 2:2 up to 8:... (chart: lines of code per configuration).
(Diagram: an MU x KU block of A and a KU x NU block of B update a C tile of size TileSize.)

Loop body (pseudocode):
  for (int j = 0; j < TileSize; j += NU)
    for (int i = 0; i < TileSize; i += MU) {
      load C1..C8 into registers
      for (int k = 0; k < TileSize; k += KU) {
        (*) load A[k][1..4] into registers
            load B[1..2][k] into registers
            C1 += A[k][1] * B[1][k]
            C2 += A[k][1] * B[2][k]
            ...
            C8 += A[k][4] * B[2][k]
        repeat (*) for k+1 through k+KU-1
      }
      store the C registers back to memory
    }
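To make the shape of this loop body concrete, here is a compilable C++ sketch of a register-tiled micro-kernel with MU = 4 and NU = 2, which gives the eight C accumulators of the slide. The tile size, the row-major array layout, and the function name are our assumptions for illustration; ATLAS generates and tunes such code automatically, and the KU-fold unrolling of the k loop is left to the compiler here.

    constexpr int TS = 64;  // tile size (assumption for the sketch)
    constexpr int MU = 4;   // unroll factor in i
    constexpr int NU = 2;   // unroll factor in j; MU * NU = 8 C accumulators

    // C += A * B on one TS x TS tile, keeping an MU x NU block of C in
    // scalar accumulators so the compiler can hold them in registers.
    void microKernel(const double A[TS][TS], const double B[TS][TS],
                     double C[TS][TS]) {
        for (int j = 0; j < TS; j += NU)
            for (int i = 0; i < TS; i += MU) {
                double c[MU][NU];  // the 8 register-resident C elements
                for (int m = 0; m < MU; ++m)
                    for (int n = 0; n < NU; ++n)
                        c[m][n] = C[i + m][j + n];
                for (int k = 0; k < TS; ++k) {  // ATLAS unrolls this KU times
                    double a[MU], b[NU];
                    for (int m = 0; m < MU; ++m) a[m] = A[i + m][k];
                    for (int n = 0; n < NU; ++n) b[n] = B[k][j + n];
                    for (int m = 0; m < MU; ++m)
                        for (int n = 0; n < NU; ++n)
                            c[m][n] += a[m] * b[n];
                }
                for (int m = 0; m < MU; ++m)
                    for (int n = 0; n < NU; ++n)
                        C[i + m][j + n] = c[m][n];
            }
    }

Note the register pressure this style implies: MU * NU accumulators plus MU values of A and NU values of B must be live at once, which is why spills appear as the unroll factors grow.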

Performance Evaluation for MM
(Chart: performance of MIN vs. MIPSPro; points marked "no spill" or "spills".)
Spills for NU:MU = 4:8: MIPSPro 438, MIN 892.

Explanation
- MIN keeps spilling c elements. The c values are accumulators and therefore dirty, so every such spill costs a store.
- The resulting code has a long dependency chain.

Original statement order:
  1:  c0  += a0  * b0
  2:  c1  += a1  * b0
  3:  c2  += a2  * b0
  4:  c3  += a3  * b0
  5:  c64 += a0  * b64
  6:  c65 += a1  * b64
  7:  c66 += a2  * b64
  8:  c67 += a3  * b64
  9:  c0  += a64 * b1
  10: c1  += a65 * b1
  11: c2  += a66 * b1
  12: c3  += a67 * b1
  13: c64 += a64 * b65
  14: c65 += a65 * b65
  15: c66 += a66 * b65
  16: c67 += a67 * b65

Code generated with MIN:
  load R0, a0
  load R1, b0
  load R2, c0
  madd R2, R2, R0, R1    ; statement 1
  load R3, a1
  load R4, c1
  madd R4, R4, R3, R1    ; statement 2
  load R5, a2
  store R4, c1           ; c1 is dirty: spilling it costs a store
  load R4, c2
  madd R4, R4, R5, R1    ; statement 3
  store R4, c2
  load R4, a3
  store R2, c0
  load R2, c3
  madd R2, R2, R4, R1    ; statement 4
  ...

Solution
- Spill a, b, and c elements rather than only c: a and b values are clean, so evicting them costs no store, which means fewer stores overall.
- Break the dependency chain.
- Use the instruction scheduling from MIPSPro.

Statement order as scheduled by MIPSPro:
  c0  += a0  * b0
  c0  += a64 * b1
  c1  += a65 * b1
  c1  += a1  * b0
  c2  += a2  * b0
  c3  += a3  * b0
  c2  += a66 * b1
  c3  += a67 * b1
  c66 += a2  * b64
  c64 += a0  * b64
  c66 += a66 * b65
  c64 += a64 * b65
  c65 += a1  * b64
  c67 += a3  * b64
  c67 += a67 * b65
  c65 += a65 * b65
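The same chain-breaking idea can be seen in a smaller setting. The C++ comparison below is our illustration, not the paper's code: the naive reduction serializes every add on a single accumulator, while splitting the sum into independent partial accumulators, like the interleaved c updates in the schedule above, lets the floating-point pipeline overlap operations.

    #include <cstddef>

    // One serial dependence chain: each add must wait for the previous one.
    double dotNaive(const double* a, const double* b, std::size_t n) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

    // Four independent chains: adds from different chains can overlap in the
    // pipeline, the same effect the MIPSPro schedule gets by interleaving
    // updates to different c accumulators.
    double dotSplit(const double* a, const double* b, std::size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        std::size_t i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }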

Performance Evaluation for MM
- MINSched is 10% better when MU:NU is larger than 4:6.
- Spills for NU:MU = 4:8: MINSched 356, MIPSPro 438, MIN 892.
(Chart: points marked "no spill" or "spills".)

Outline
- Belady's MIN algorithm
- On FFT code
- On matrix multiplication code
- Conclusions

Conclusions
Belady's MIN algorithm:
- Generates the minimum number of reloads and stores for the one-define, one-use problem.
- Performs better than state-of-the-art compilers such as MIPSPro and GCC.
- Speedups: 12% (FFT32), 33% (FFT64), 10% (MM).
Further benchmarks remain to be tested.

Code Size after Register Allocation (chart)

Speedup from Fully Unrolling (chart)