CMPUT 680 - Compiler Design and Optimization, Winter 2006. Topic E: Software Pipelining. José Nelson Amaral

Presentation transcript:

Reading List
- Tiger book: chapter 20
- Other papers such as: GovindAltmanGao97, RutenbergAtAl97

Software Pipeline
Software pipelining is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources.
[Figure: overlapped loop iterations over time, each consisting of ldf, fadds, stf, sub, cmp, bg]

Software Pipeline
What limits the speed of a loop?
- Data dependencies: recurrence initiation interval (rec_mii)
- Processor resources: resource initiation interval (res_mii)
- Memory accesses: memory initiation interval (mem_mii)
[Figure: overlapped iterations of ldf, fadds, stf, sub, cmp, bg over time, with the initiation interval marked]

Problem Formulation (I)
Given a weighted dependence graph, derive a schedule that is "time-optimal" under a machine model M.
Def: A schedule S of a loop L is time-optimal if, among all "legal" schedules of L, no schedule is faster than S.
Note: There may be more than one time-optimal schedule.

Example: The Inner Product

    Q = 0.0
    DO k = 1, N
      Q = Q + Z(k)*X(k)
    ENDDO

The same loop in dynamic single assignment form:

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}
      v_k ← load x_{k-1}
      w_k ← u_k * v_k
      q_k ← q_{k-1} + w_k
      z_k ← z_{k-1} + 4
      x_k ← x_{k-1} + 4
    END DO

(Dehnert, J. and Towle, R. A., "Compiling for the Cydra 5")

Dynamic Single Assignment (DSA): uses expanded virtual registers (EVRs). An EVR is an infinite, linearly ordered set of virtual registers. A program in DSA form has no anti-dependencies and no output dependencies.
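To make the DSA idea concrete, here is a minimal Python sketch of the inner-product loop in which every EVR is modeled as a list that is only ever appended to. The function name `inner_product_dsa` and the use of list indices in place of byte addresses (so the slide's "+ 4" becomes "+ 1") are assumptions made for illustration.

```python
def inner_product_dsa(Z, X):
    """Inner product written in dynamic single assignment style.

    Each list stands in for an expanded virtual register (EVR):
    element k is written exactly once and never overwritten.
    """
    n = len(Z)
    z = [0]                              # z_0: index of Z(1) (stands in for &Z(1))
    x = [0]                              # x_0: index of X(1)
    q = [0.0]                            # q_0
    u, v, w = [None], [None], [None]     # slot 0 unused so that indices match k
    for k in range(1, n + 1):
        u.append(Z[z[k - 1]])            # u_k <- load z_{k-1}
        v.append(X[x[k - 1]])            # v_k <- load x_{k-1}
        w.append(u[k] * v[k])            # w_k <- u_k * v_k
        q.append(q[k - 1] + w[k])        # q_k <- q_{k-1} + w_k
        z.append(z[k - 1] + 1)           # z_k <- z_{k-1} + 1 (the slide adds 4 bytes)
        x.append(x[k - 1] + 1)           # x_k <- x_{k-1} + 1
    return q[n]
```

Because no element is ever rewritten, the only dependences left are true (flow) dependences, which is what removes the anti- and output dependencies.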

Machine Model and Resource Constraints
Which unit does each operation in the loop use?

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}       MEM
      v_k ← load x_{k-1}       MEM
      w_k ← u_k * v_k          FMULT
      q_k ← q_{k-1} + w_k      FADD
      z_k ← z_{k-1} + 4        ADDR
      x_k ← x_{k-1} + 4        ADDR
    END DO

Machine Model

    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2

Without instruction-level parallelism, how long does the loop take to execute? (6+6+2+2+1+1)*N = 18*N cycles.

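Resource Minimum Initiation Interval (ResMII)
Each processor resource defines a minimum initiation interval for the execution of the loop. The Resource Minimum Initiation Interval of a loop is the largest of these per-resource bounds:

$\mathrm{ResMII} = \max_{r} \mathrm{ResMII}(r), \qquad \mathrm{ResMII}(r) = \left\lceil \frac{n_r \cdot c_r}{u_r} \right\rceil$

where $n_r$ is the number of operations per iteration that use resource $r$, $c_r$ is the number of cycles each of them occupies the unit, and $u_r$ is the number of units of that kind.
For instance, in the machine model of the previous example, a loop that requires the computation of 6 addresses has ResMII(ADDR) = 6*1/2 = 3.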
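The per-resource bound can be sketched in a few lines of Python. The names `res_mii_of_resource` and `res_mii`, and the representation of demand as `(number of operations, cycles each occupies the unit)`, are assumptions made for this illustration. The inner-product data further assumes fully pipelined units, so that each operation occupies its unit for one cycle regardless of latency.

```python
def res_mii_of_resource(num_ops, cycles_per_op, num_units):
    """Ceiling of (total cycles demanded of a resource) / (units available)."""
    total = num_ops * cycles_per_op
    return (total + num_units - 1) // num_units   # integer ceiling

def res_mii(demands, units):
    """demands: resource -> (num_ops, cycles_per_op); units: resource -> count."""
    return max(res_mii_of_resource(n, c, units[r]) for r, (n, c) in demands.items())

# The slide's address example: 6 address computations, 1 cycle each, 2 ADDR units.
addr_bound = res_mii_of_resource(6, 1, 2)

# Inner-product loop, assuming fully pipelined units (occupancy of 1 cycle).
inner_product = res_mii(
    {"MEM": (2, 1), "ADDR": (2, 1), "FMULT": (1, 1), "FADD": (1, 1)},
    {"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1},
)
```

Under these assumptions `addr_bound` is 3 and `inner_product` is 1, matching the values quoted on the slides.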

ResMII

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}       MEM
      v_k ← load x_{k-1}       MEM
      w_k ← u_k * v_k          FMULT
      q_k ← q_{k-1} + w_k      FADD
      z_k ← z_{k-1} + 4        ADDR
      x_k ← x_{k-1} + 4        ADDR
    END DO

Machine Model

    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2

There are enough units to schedule all the instructions of the loop in the same cycle. Therefore ResMII = 1.
Can we execute the loop in N+C cycles (C = a small constant)?

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (c) w_k ← u_k * v_k
      (d) q_k ← q_{k-1} + w_k
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO

[Figure: dependence graph on operations a to f, drawn unrolled for k = 1, 2, 3 and then folded into a single graph whose loop-carried edges are labeled with distance (1)]

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Unit    Lat.
      (a) u_k ← load z_{k-1}         MEM     (6)
      (b) v_k ← load x_{k-1}         MEM     (6)
      (c) w_k ← u_k * v_k            FMULT   (2)
      (d) q_k ← q_{k-1} + w_k        FADD    (2)
      (e) z_k ← z_{k-1} + 4          ADDR    (1)
      (f) x_k ← x_{k-1} + 4          ADDR    (1)
    END DO

[Figure: dependence graph on operations a to f; loop-carried edges are labeled (dist,lat), with labels (1,2) and (1,1)]

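Recurrence Minimum Initiation Interval (RecMII)
[Figure: dependence graph on operations a to f; loop-carried edges are labeled (dist,lat), with labels (1,2) and (1,1)]
The recurrence minimum initiation interval (rec_mii) is given by the worst ratio of total latency to total distance over all cycles $c$ of the dependence graph:

$\mathrm{RecMII} = \max_{c} \left\lceil \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)} \right\rceil$

Quiz: What is the rec_mii for the example?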
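As a rough sketch, the recurrence bound can be computed once the cycles of the dependence graph are known. The function name `rec_mii` and the input format (each cycle given explicitly as a list of `(dist, lat)` edge labels) are assumptions for illustration; enumerating the cycles of a general graph is not shown. The example data assumes that the only cycles in the inner-product loop are the self-dependences of d (FADD, latency 2), e, and f (ADDR, latency 1), each with distance 1.

```python
def rec_mii(cycles):
    """cycles: list of dependence cycles, each a list of (dist, lat) edges.

    Returns the largest ceil(total latency / total distance) over all cycles.
    Every cycle must have a positive total distance.
    """
    bound = 0
    for edges in cycles:
        total_dist = sum(dist for dist, _ in edges)
        total_lat = sum(lat for _, lat in edges)
        if total_dist <= 0:
            raise ValueError("a dependence cycle must have positive total distance")
        bound = max(bound, (total_lat + total_dist - 1) // total_dist)  # integer ceiling
    return bound

# Assumed cycles of the inner-product loop: self-loops on d, e, and f.
inner_product_cycles = [
    [(1, 2)],   # q_k depends on q_{k-1}; FADD latency 2
    [(1, 1)],   # z_k depends on z_{k-1}; ADDR latency 1
    [(1, 1)],   # x_k depends on x_{k-1}; ADDR latency 1
]
example_rec_mii = rec_mii(inner_product_cycles)
```

With these assumed cycles the accumulation into q is the limiting recurrence.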

Minimum Initiation Interval
The Minimum Initiation Interval (MII) for a loop is constrained both by resources and by recurrences. It is therefore given by:

$\mathrm{MII} = \max(\mathrm{ResMII}, \mathrm{RecMII})$

In our example we have MII = max(1,2) = 2. Therefore the best that we can do without transforming the loop is to execute it in 2*N+C cycles.

Modulo Schedule
In modulo scheduling, we:
(1) start with the first instruction;
(2) schedule as many instructions as we can in every cycle, limited only by the resources available and by the dependences.
When a pattern emerges, we adopt the pattern as our modulo schedule. Instructions before this pattern form the loop prologue. Instructions after this pattern form the loop epilogue.

Recurrence Minimum Initiation Interval (RecMII)

    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Lat.
      (a) u_k ← load z_{k-1}         (6)
      (b) v_k ← load x_{k-1}         (6)
      (c) w_k ← u_k * v_k            (2)
      (d) q_k ← q_{k-1} + w_k        (2)
      (e) z_k ← z_{k-1} + 4          (1)
      (f) x_k ← x_{k-1} + 4          (1)
    END DO

Why an eager scheduler fails in our example
[Figure: eager schedule of operations b, c, and d, plotted by iteration and by cycle (cycles 0 to 23)]

Why an eager scheduler fails in our example
[Figure: schedule of operations b, c, and d, plotted by iteration and by cycle (cycles 0 to 23), with a new iteration started every 2 cycles]
Therefore we can do it in 2*N+9 cycles.

Collision vectors
Given the reservation tables for two operations A and B, the set of forbidden intervals, i.e., the issue distances at which A and B cannot both be issued, is called the collision vector for the reservation tables.
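A small Python sketch can make the definition concrete. It assumes a reservation table is represented as a mapping from resource name to the set of cycle offsets (relative to issue) at which the operation holds that resource. The function name `forbidden_intervals` and the two sample tables are hypothetical and are not taken from the slides.

```python
def forbidden_intervals(table_a, table_b):
    """Issue distances d >= 0 at which B, issued d cycles after A, collides with A.

    A holds resource r at cycle ca; B holds it at cycle d + cb.
    The two collide exactly when d == ca - cb.
    """
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for ca in cycles_a:
            for cb in table_b.get(resource, ()):
                d = ca - cb
                if d >= 0:
                    forbidden.add(d)
    return forbidden

# Hypothetical reservation tables (offsets are cycles after issue).
add = {"adder": {0, 1}}                                  # non-pipelined 2-cycle add
mul_add = {"mult": {0, 1}, "adder": {2}, "bus": {3}}     # multiply, then add, then bus

add_after_mul_add = forbidden_intervals(mul_add, add)    # add issued after mul_add
add_after_add = forbidden_intervals(add, add)            # two adds back to back
```

With these sample tables, an add cannot be issued 1 or 2 cycles after a mul_add, and two adds cannot be issued 0 or 1 cycles apart.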

A Simplistic Modulo Scheduling Algorithm
1. Compute MII as discussed.
2. Use a modified list scheduling algorithm to generate a modulo schedule.
The scheduling algorithm must obey the following restriction: if an operation P is scheduled at time t, the resources it uses are taken at every time t ± k*II for any k ≥ 0, so no conflicting operation can be scheduled at those times.
The Modulo Reservation Table has II rows, representing the cycles of the initiation interval, and as many columns as there are resources to keep track of.
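The modulo reservation table can be sketched as a small Python class. The class name `ModuloReservationTable`, its methods, and the representation of an operation's resource usage as a list of `(resource, cycle offset)` pairs are assumptions for illustration, not the slides' own data structure.

```python
class ModuloReservationTable:
    """II rows by one column per resource; row index is (time mod II)."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def _slots(self, time, usage):
        # Map each (resource, offset) use to the table slot it would occupy.
        return [((time + offset) % self.ii, res) for res, offset in usage]

    def is_free(self, time, usage):
        slots = self._slots(time, usage)
        if len(set(slots)) != len(slots):
            return False   # the operation would collide with its own next instance
        return all(self.rows[row][res] is None for row, res in slots)

    def reserve(self, op, time, usage):
        """Reserve the slots for op issued at `time`; return False on conflict."""
        if not self.is_free(time, usage):
            return False
        for row, res in self._slots(time, usage):
            self.rows[row][res] = op
        return True


mrt = ModuloReservationTable(2, ["ADDER", "MULT"])
first = mrt.reserve("A1", 0, [("ADDER", 0)])      # takes row 0 of ADDER
clash = mrt.reserve("A2", 4, [("ADDER", 0)])      # 4 mod 2 == 0: same slot as A1
second = mrt.reserve("A2", 5, [("ADDER", 0)])     # 5 mod 2 == 1: free slot
too_long = mrt.reserve("M", 0, [("MULT", 0), ("MULT", 1), ("MULT", 2)])
```

Indexing rows by `time % II` is what enforces the t ± k*II restriction: any two issue times that differ by a multiple of II land in the same row.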

Heuristic Method for Modulo Scheduling
Why might a simple variant of list scheduling not work?
Problem: generate a modulo schedule of a loop by scheduling instructions until a pattern emerges.

Counter Example I: List Scheduling May Fail
[Figure: dependence graph on operations A, B, C, D with edges labeled (0,4), (0,2), (1,2)]
There is only one cycle in the dependence graph, therefore RecMII is determined by that cycle alone. Therefore, in a machine with infinite resources, we must be able to schedule the loop in 4 cycles.

Counter Example I: List Scheduling May Fail
[Figure: dependence graph on operations A, B, C, D with edges labeled (0,4), (0,2), (1,2), next to a partial schedule in which A, C, and D are placed and B has no slot]
List scheduling: a greedy algorithm that schedules each operation at its earliest possible time.
B must be scheduled after the A of the current iteration and before the C of the next iteration. We are deadlocked!

Counter Example I: List Scheduling May Fail
[Figure: dependence graph on operations A, B, C, D with edges labeled (0,4), (0,2), (1,2), next to a schedule split into prologue, kernel, and epilogue: D(0) A(0) C(0) | A(1) B(0) C(1) ... | D(N) B(N)]
The solution is to create a kernel with operations from different iterations, and use a prologue and an epilogue.
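The shape of a prologue, kernel, and epilogue can be sketched in Python once each operation has been assigned to a pipeline stage. The function name `pipelined_code` is hypothetical, and the stage assignment used in the example (A, C, and D in stage 0, B in stage 1) is an assumption chosen to resemble the slide's picture, not something the slide states. The sketch also assumes there are at least as many iterations as stages.

```python
def pipelined_code(stages, n_iters):
    """stages: op name -> stage number (0-based). Returns (prologue, kernel, epilogue).

    At step t, the operation in stage s works on iteration t - s,
    so one kernel step mixes operations from several iterations.
    """
    depth = max(stages.values()) + 1
    steps = []
    for t in range(n_iters + depth - 1):
        step = [f"{op}({t - s})" for op, s in stages.items() if 0 <= t - s < n_iters]
        steps.append(step)
    prologue = steps[:depth - 1]          # pipeline filling: some stages still idle
    kernel = steps[depth - 1:n_iters]     # steady state: every stage is busy
    epilogue = steps[n_iters:]            # pipeline draining
    return prologue, kernel, epilogue


# Assumed stage assignment for operations A, B, C, D.
prologue, kernel, epilogue = pipelined_code({"A": 0, "C": 0, "D": 0, "B": 1}, 3)
```

Here the kernel steps pair B from one iteration with A, C, and D from the next, which is how B ends up between the A of its own iteration and the C of the following one.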

Counter Example II: List Scheduling May Fail
[Figure: dependence graph on operations A1, C2, A3, A4, M5, M6 with edges labeled (0,2), (0,1), (0,2), (0,3)]
A1, A3, and A4 are non-pipelined adds that take two cycles each at the adder.
M5 and M6 are non-pipelined multiply operations that take three cycles each on the multiplier.
C2 is a copy operation that uses the bus for one cycle.
What is the ResMII for these operations in a machine that has one adder, one multiplier, and one bus?
ResMII(Adder) = 6; ResMII(Multiplier) = 6; ResMII(Bus) = 1. Therefore ResMII = 6.

Counter Example II: List Scheduling May Fail
[Figure: dependence graph on operations A1, C2, A3, A4, M5, M6 with edges labeled (0,2), (0,1), (0,2), (0,3), next to a reservation table with columns Adder, Mult, Bus in which A4 cannot be placed]
We cannot schedule A4 and achieve an MII = ResMII = 6!

Counter Example II: List Scheduling May Fail
[Figure: dependence graph on operations A1, C2, A3, A4, M5, M6 with edges labeled (0,2), (0,1), (0,2), (0,3), next to a complete reservation table with columns Adder, Mult, Bus]
Although it seems counter-intuitive, we obtain a modulo schedule with MII = 6 if we initially schedule both M6 and A3 one cycle later than the earliest possible time for these operations.

Complex Reservation Tables
Consider three independent operations, A1, M2, and MA3, with the reservation tables shown below.
[Figure: reservation tables for A1, M2, and MA3 over the resources Add, Mult, Bus, labeled (0,2), (0,3), (0,4)]
What is the MII for a loop formed by these three operations?
ResMII(Add) = 2; ResMII(Mult) = 2; ResMII(Bus) = 2. Therefore ResMII = 2.

Is the MII = 2 Feasible?
[Figure: reservation tables for A1, M2, and MA3 over the resources Add, Mult, Bus, next to a modulo reservation table with columns Adder, Mult, Bus holding A1 and M2]
Deadlocked. Cannot allocate MA3.
Even though MII = max(ResMII, RecMII) = 2, MII = 2 is not feasible!

Does Increasing MII to 3 Help?
[Figure: reservation tables for A1, M2, and MA3 over the resources Add, Mult, Bus, next to a modulo reservation table with columns Adder, Mult, Bus holding A1, M2, and MA3]
We find a modulo schedule with MII = 3!

Interaction Between Recurrence Constraints and Resource Constraints
[Figure: dependence graph on operations A1, A2, A3, A4 with edges labeled (0,2), (2,2), (0,2), next to a reservation table over the resources Add, Mult, Bus]
What is the RecMII for this loop? RecMII = (total latency around the cycle)/2 = 4.
What is the ResMII for the loop? ResMII(Add) = 4; ResMII(Mult) = 0; ResMII(Bus) = 4. Therefore ResMII = 4.
Therefore MII = max(ResMII, RecMII) = 4.

Is the MII = 4 feasible?
[Figure: dependence graph on operations A1, A2, A3, A4 with edges labeled (0,2), (2,2), (0,2), next to a modulo reservation table with columns Adder, Mult, Bus holding A1]
In order to finish A4 in time to produce the result for two iterations later, A3 must be scheduled at time 4. But 4 modulo 4 = 0, which conflicts with A1. Therefore there is no feasible schedule with MII = 4.

Scheduling Strategy
An exhaustive search will eventually reveal that the MII calculated is not feasible, but it might take too long.
In practice, we compute the MII and spend a pre-allocated budget of time trying to find a schedule with the MII. If we don't find one, we increase the MII.
In some commercial compilers, the search for the smallest feasible II is a binary search, where the II is doubled at each step until a feasible one is found, at which point a linear search between the last infeasible II and the feasible one is conducted.
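The doubling-then-linear search described above might look like the following Python sketch. The function name `find_ii`, the `try_schedule` callback (standing in for a budgeted scheduling attempt at a given II), and the `max_ii` cutoff are all hypothetical names introduced for illustration.

```python
def find_ii(mii, try_schedule, max_ii=1024):
    """Search for a small feasible initiation interval, starting from mii >= 1.

    try_schedule(ii) returns True if a schedule was found at that II
    within its time budget. Returns None if nothing up to max_ii works.
    """
    ii = mii
    last_infeasible = None
    # Phase 1: double the II until some attempt succeeds.
    while not try_schedule(ii):
        last_infeasible = ii
        ii *= 2
        if ii > max_ii:
            return None
    if last_infeasible is None:
        return ii                      # the MII itself was feasible
    # Phase 2: linear scan between the last failure and the first success.
    for candidate in range(last_infeasible + 1, ii):
        if try_schedule(candidate):
            return candidate
    return ii


# Toy feasibility predicate: pretend any II of at least 7 can be scheduled.
smallest = find_ii(2, lambda ii: ii >= 7)
```

With the toy predicate, the doubling phase tries 2, 4, and 8, and the linear phase then tries 5, 6, and 7. Since feasibility under a time budget need not be monotone in II, the result is the smallest II found, not a proven minimum.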

Previous Approaches
- Approach I (Operational): "Emulate" the loop execution under the machine model, and a "pattern" will eventually occur [AikenNic88, EbciogluNic89, GaoEtAl91]
- Approach II (Periodic scheduling): Cast the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94]

Software Pipelining
- Operational Approach
  - Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)
  - Formal Model (GaoWonNin91)
- Periodic Scheduling (Modulo Scheduling)
  - Non-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93)
  - Exact Method
    - Basic Formulation (DongenGao92)
    - ILP based
      - Register Optimal (NingGao91, NingGao93, Ning93)
      - Resource Constrained (GovindAltGao94)
      - Resource & Register (GovindAltGao95, Altman95, EichenbergerDav95)
    - Exhaustive Search (Altman95, AltmanGao96)
- "Showdown" (RuttenbergGao StouchininWoody96)