Inter-Iteration Scalar Replacement in the Presence of Control-Flow Mihai Budiu – Microsoft Research, Silicon Valley Seth Copen Goldstein – Carnegie Mellon.

Presentation transcript:

Inter-Iteration Scalar Replacement in the Presence of Control-Flow Mihai Budiu – Microsoft Research, Silicon Valley Seth Copen Goldstein – Carnegie Mellon University ODES 2005

2 Summary. What: compiler optimization. Where: dense regular matrix codes (FORTRAN, some media processing). Goal: reduce the number of memory accesses. How: allocate array elements to registers. New: optimal algorithm based on predication.

3 Outline Scalar Replacement Predicated PRE Combining the two Results

4 Scalar Replacement
Front-end, before: a[i] = a[i] + 2; a[i] <<= 4;
Front-end, after: tmp = a[i]; tmp += 2; tmp <<= 4; a[i] = tmp;
Back-end, before: ld a[i]; arith …; st a[i]; ld a[i]; arith …; st a[i]
Back-end, after: ld a[i]; arith …; st a[i]

5 Inter-Iteration Scalar Replacement
Original loop: for (i=0; i < N; i++) a[i] += a[i+1];
Runtime, original: i=0: ld a[0], ld a[1], st a[0]; i=1: ld a[1], ld a[2], st a[1]
Transformed loop: tmp0 = a[0]; for (i=0; i < N; i++) { tmp1 = a[i+1]; a[i] = tmp0 + tmp1; tmp0 = tmp1; }
Runtime, transformed: i=0: ld a[0], ld a[1], st a[0]; i=1: ld a[2], st a[1] (the value of a[1] is carried over in tmp1)
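The transformation on this slide can be sketched in C as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the guard for n == 0 is an added assumption so that the transformed version never reads a[0] when the original loop would not run.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: two loads and one store per iteration.
   The array must hold n + 1 elements because a[n] is read. */
void sum_next_original(int *a, size_t n) {
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] += a[i + 1];
}

/* Scalar-replaced loop (hypothetical name): the value loaded as a[i+1]
   in iteration i is reused as a[i] in iteration i+1, so only one load
   per iteration remains inside the loop. */
void sum_next_scalarized(int *a, size_t n) {
    assert(a != NULL);
    if (n == 0)
        return;               /* assumption: avoid reading a[0] needlessly */
    int tmp0 = a[0];          /* prologue load */
    for (size_t i = 0; i < n; i++) {
        int tmp1 = a[i + 1];  /* the only load in the loop body */
        a[i] = tmp0 + tmp1;
        tmp0 = tmp1;          /* carry a[i+1] into the next iteration */
    }
}
```

Carrying the old value is safe here because iteration i writes only a[i], so a[i+1] still holds its original value when iteration i+1 needs it.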

6 Rotating Scalars
Loop: for (i=0; i < N; i++) a[i] += a[i+3];
Invariant: tmp0 = a[i+0], tmp1 = a[i+1], tmp2 = a[i+2], tmp3 = a[i+3]
Rotation: for (…) { … tmp0 = tmp1; tmp1 = tmp2; tmp2 = tmp3; tmp3 = a[i+4]; }
Itanium has hardware support for rotating registers.
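A possible C rendering of the rotating-scalar scheme for this loop is sketched below. The function names are hypothetical. The bounds check before the last load is an added assumption: the slide's tmp3 = a[i+4] would read one element past what the original loop touches in the final iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop; the array must hold n + 3 elements. */
void add_dist3_original(int *a, size_t n) {
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] += a[i + 3];
}

/* Rotating-scalar version (hypothetical name).
   Invariant at the top of iteration i: tmpk == a[i + k] for k = 0..3. */
void add_dist3_scalarized(int *a, size_t n) {
    assert(a != NULL);
    if (n == 0)
        return;
    /* Prologue establishes the invariant for i = 0. */
    int tmp0 = a[0], tmp1 = a[1], tmp2 = a[2], tmp3 = a[3];
    for (size_t i = 0; i < n; i++) {
        a[i] = tmp0 + tmp3;
        /* Rotate: each scalar moves one slot toward index 0. */
        tmp0 = tmp1;
        tmp1 = tmp2;
        tmp2 = tmp3;
        if (i + 1 < n)        /* assumption: skip the out-of-range last load */
            tmp3 = a[i + 4];
    }
}
```

In steady state each iteration performs one load instead of two. The prologue still loads a[1] and a[2] unconditionally, which is only safe because the array is assumed to hold n + 3 elements.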

7 Control-Flow for (i=0; i < N; i++) if (i & 1) a[i] += a[i+3];

8 Outline Scalar Replacement Predicated PRE Combining the two Results

9 Availability. y = a[i]; ... if (x) { ... = a[i]; } (y is available at the second read of a[i])

10 Conservative Analysis. if (x) { ... y = a[i]; } ... = a[i]; (is y available at the second read?)

11 Predicated PRE. flag = false; if (x) { ... y = a[i]; flag = true; } ... = flag ? y : a[i]; Invariant: flag = true ⇒ y = a[i]
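The flag-based scheme on this slide can be tried out with a small instrumented C sketch. Everything named here (load_count, load, use_original, use_predicated) is hypothetical and exists only to count how many times memory is read. The return value y + z is an arbitrary stand-in for the uses elided on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instrumentation: every memory read goes through load(). */
int load_count = 0;
int load(const int *p) { load_count++; return *p; }

/* Original code: the second read of a[i] is redundant whenever x is true. */
int use_original(const int *a, size_t i, bool x) {
    assert(a != NULL);
    int y = 0;
    if (x)
        y = load(&a[i]);
    int z = load(&a[i]);
    return y + z;
}

/* Predicated PRE: flag records at run time whether y already holds a[i].
   Invariant: flag == true implies y == a[i]. */
int use_predicated(const int *a, size_t i, bool x) {
    assert(a != NULL);
    int y = 0;
    bool flag = false;
    if (x) {
        y = load(&a[i]);
        flag = true;
    }
    int z = flag ? y : load(&a[i]);   /* load only when y is not valid */
    return y + z;
}
```

With x true, use_original reads a[i] twice and use_predicated reads it once. With x false, both read it exactly once, so the predicated version never performs a load the original would not.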

12 Outline Scalar Replacement Predicated PRE Combining the two Results

13 Scalars and Flags
Loop: for (i=0; i < N; i++) if (i & 1) a[i] += a[i+3];
Invariant (valid_k is a bool flag, tmp_k is a scalar):
valid0 = true ⇒ tmp0 = a[i+0]
valid1 = true ⇒ tmp1 = a[i+1]
valid2 = true ⇒ tmp2 = a[i+2]
valid3 = true ⇒ tmp3 = a[i+3]

14 Scalar Replacement Algorithm
ld a[i+k] → if (!valid_k) { tmp_k = a[i+k]; valid_k = true; }
st a[i+k], v → tmp_k = v; valid_k = true;
Can be implemented with predication or conditional moves.
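The rewriting rules on this slide can be exercised on a guarded loop with a short C sketch. Several things here are assumptions made for illustration: the guard is a caller-supplied array cond[] standing in for control flow the compiler cannot analyze, loads and stores go through hypothetical counting helpers ld() and st(), and stores are kept as write-through (the scalar is updated and memory is written immediately) rather than deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instrumentation for counting memory traffic. */
long loads = 0, stores = 0;
int  ld(const int *p)  { loads++;  return *p; }
void st(int *p, int v) { stores++; *p = v; }

/* Reference loop: a holds n + 3 elements, cond holds n. */
void guarded_original(int *a, const bool *cond, size_t n) {
    assert(a != NULL && cond != NULL);
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            st(&a[i], ld(&a[i]) + ld(&a[i + 3]));
}

/* Scalarized loop. Invariant at the top of iteration i:
   valid[k] implies tmp[k] == a[i + k]. */
void guarded_scalarized(int *a, const bool *cond, size_t n) {
    assert(a != NULL && cond != NULL);
    int  tmp[4]   = {0};
    bool valid[4] = {false};
    for (size_t i = 0; i < n; i++) {
        if (cond[i]) {
            /* ld a[i+k] becomes a load guarded by !valid[k]. */
            if (!valid[0]) { tmp[0] = ld(&a[i]);     valid[0] = true; }
            if (!valid[3]) { tmp[3] = ld(&a[i + 3]); valid[3] = true; }
            int v = tmp[0] + tmp[3];
            /* st a[i], v also refreshes the scalar (write-through here). */
            st(&a[i], v);
            tmp[0] = v;
            valid[0] = true;
        }
        /* Rotate scalars and flags; the new slot 3 starts out invalid. */
        for (int k = 0; k < 3; k++) {
            tmp[k]   = tmp[k + 1];
            valid[k] = valid[k + 1];
        }
        valid[3] = false;
    }
}
```

When every guard is true and n is 6, the reference loop performs 12 loads while the scalarized loop performs 9, because a[3], a[4] and a[5] are each read only once. In this particular loop slot 3 is always invalid on entry, so its check is kept only to mirror the general rule.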

15 Optimality. No scalarized memory location is read or written more than once. The resulting program touches exactly the same memory locations as the original program. Proof: trivial, based on the valid-flags invariant [given perfect dependence analysis and enough registers].

16 Additional Details
- Initialize valid_k to false
- Rotate scalars and valid flags
- Use dirty_k flags to avoid extra stores
- Postlude for missing stores: if (valid_k) a[N+k] = tmp_k
- Lift loop-invariant accesses (finding loop-invariant predicates)
- Hardware support for rotating registers and flags (see paper)
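One way the dirty flags and the postlude might fit together is sketched below in C. The loop body, the cond[] guard array, the counting helpers ld() and st(), and the function names are all hypothetical. The sketch also tests dirty rather than valid in the postlude, which is an assumption about how the two flags combine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instrumentation for counting memory traffic. */
long loads = 0, stores = 0;
int  ld(const int *p)  { loads++;  return *p; }
void st(int *p, int v) { stores++; *p = v; }

/* Reference loop: a[i+1] is written in iteration i and again (as a[i])
   in iteration i+1. The array a holds n + 1 elements. */
void bump_original(int *a, const bool *cond, size_t n) {
    assert(a != NULL && cond != NULL);
    for (size_t i = 0; i < n; i++)
        if (cond[i]) {
            st(&a[i],     ld(&a[i])     + 1);
            st(&a[i + 1], ld(&a[i + 1]) + 2);
        }
}

/* Scalarized loop with deferred stores. Invariant at the top of iteration i:
   valid[k] implies tmp[k] is the current value of a[i + k];
   dirty[k] means that value has not been written to memory yet. */
void bump_scalarized(int *a, const bool *cond, size_t n) {
    assert(a != NULL && cond != NULL);
    int  tmp[2]   = {0};
    bool valid[2] = {false}, dirty[2] = {false};
    for (size_t i = 0; i < n; i++) {
        if (cond[i]) {
            if (!valid[0]) { tmp[0] = ld(&a[i]);     valid[0] = true; }
            if (!valid[1]) { tmp[1] = ld(&a[i + 1]); valid[1] = true; }
            tmp[0] += 1; dirty[0] = true;   /* store deferred */
            tmp[1] += 2; dirty[1] = true;   /* store deferred */
        }
        /* a[i] leaves the window: write it back once if it changed. */
        if (dirty[0])
            st(&a[i], tmp[0]);
        /* Rotate scalars, valid flags and dirty flags. */
        tmp[0] = tmp[1]; valid[0] = valid[1]; dirty[0] = dirty[1];
        valid[1] = false; dirty[1] = false;
    }
    /* Postlude: after the last rotation slot 0 corresponds to a[n]. */
    if (dirty[0])
        st(&a[n], tmp[0]);
}
```

With every guard true and n equal to 4, the reference loop issues 8 loads and 8 stores, while the scalarized loop issues 5 of each, one per array element touched.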

17 Outline Scalar Replacement Predicated PRE Combining the two Results

18 Redundant Stores [chart: % reduction]

19 Redundant Loads [chart: % reduction]

20 Performance Impact [chart: % reduction in running time; target: Spatial Computation]. Removed accesses tend to be cache hits, so their contribution to running time is small.

21 Conclusions. Use predicates to dynamically detect redundant memory accesses. A simple algorithm gives an optimal result even with un-analyzable control flow. It can dramatically reduce memory accesses.

22 Related Work
- Morel & Renvoise, CACM 1979: Partial Redundancy Elimination. Not across remote iterations.
- Carr & Kennedy, PLDI 1990: Scalar Replacement. Arrays, no control flow.
- Carr & Kennedy, SPE 1994: Generalized Scalar Replacement. Restricted control-flow.
- Scholz, Europar 2003: Predicated PRE. Single iteration, no writes.
- This work, ODES 2005: PPRE across iterations. Optimal.
Diagram labels: Non-speculative promotion; Speculative promotion.