Probabilistic Predicate-Aware Modulo Scheduling
Mikhail Smelyanskiy 1, Scott Mahlke, Edward Davidson
Department of EECS, University of Michigan
1 Currently with the System Technology Lab at Intel Corporation

2 Introduction to Deterministic Predicate-Aware Scheduling (DPAS) [Smelyanskiy03]
- Predication eliminates branch instructions but increases resource requirements
- Predicate-aware scheduling oversubscribes resources:
  - reduces resource requirements
  - reduces schedule length
Example: block A ends in "br cond"; B executes on the taken (T) path, C on the fall-through (F) path, and both merge at D. After if-conversion, B is guarded by p1 and C by p2, where p1,p2 = cmpp(cond).

Conventional schedule (one FU):
  Time  FU
  0     A
  1     p1,p2 = cmpp(cond)
  2     B if p1
  3     C if p2
  4     D

Predicate-aware schedule (B and C share the slot because p1 and p2 are disjoint):
  Time  FU
  0     A
  1     p1,p2 = cmpp(cond)
  2     B if p1, C if p2
  3     D
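A minimal sketch (mine, not from the slides) of the DPAS legality test implied here: two predicated operations may occupy the same FU slot only if their guard predicates can never both be true. This sketch only tracks the complementary pairs produced by a single cmpp; the full analysis in [Smelyanskiy03] answers the query from a richer predicate-relation database. Function and variable names are illustrative.

```python
def disjoint(pred_a, pred_b, cmpp_pairs):
    """True if pred_a and pred_b can never be true at the same time.
    cmpp_pairs holds the complementary predicate pairs defined by cmpp ops."""
    return frozenset((pred_a, pred_b)) in cmpp_pairs

def can_share_slot_dpas(pred_a, pred_b, cmpp_pairs):
    # Under deterministic predicate-aware scheduling, two operations may share
    # one FU slot only when at most one of them can execute on any given run.
    return disjoint(pred_a, pred_b, cmpp_pairs)

# "B if p1" and "C if p2", where p1,p2 = cmpp(cond) are complements:
cmpp_pairs = {frozenset(("p1", "p2"))}
print(can_share_slot_dpas("p1", "p2", cmpp_pairs))  # True: B and C may share the slot
print(can_share_slot_dpas("p1", "p1", cmpp_pairs))  # False: same predicate, not disjoint
```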

3 Motivation for Probabilistic Predicate-aware Scheduling (PPAS)
- DPAS can only combine A5 with A2, A3 and A4
- What about combining A2 with A3? A3 with A4? A2 with A6?
- PPAS allows much more aggressive sharing than DPAS, but can result in delay due to resource conflicts
(The slide shows an example fragment with operations A1, A2, A3, A4, A5, A6, M1, M2 and a branch, whose guard predicates are not all provably disjoint.)

4 Characteristics of Predicated Code
- 52% of time is spent in cyclic regions
- Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions

5 Outline
- Motivation
- Resource Pressure Problem in Predicated Code
- Probabilistic Predicate-Aware Architecture
- Probabilistic Predicate-Aware Modulo Scheduling
- Performance Results
- Conclusions

6 Modulo Scheduling Example
Loop body: +1 ; p1 = cmpp ; +2 if p1 ; +3 if p1 ; st ; br, with freq = 0.3 on the predicated path
- This control path is taken 30% of the time
- Assumed machine: 1 ALU, 1 MEMORY and 1 BRANCH unit
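For concreteness, here is one hypothetical source loop that would if-convert into these six operations; the variable names and the comparison are illustrative, and only the operation mix and the 30% path frequency come from the slide.

```python
def kernel(x, y):
    # Hypothetical source form of the slide's example loop.
    for i in range(len(x)):
        a = x[i] + 1      # +1
        p1 = a > 0        # p1 = cmpp (true ~30% of the time in the example profile)
        if p1:            # predicated path, freq = 0.3
            a += 2        # +2 if p1
            a += 3        # +3 if p1
        y[i] = a          # st
        # loop back-edge: br
```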

7 Traditional Modulo Schedule (Rau 94)
Successive loop iterations are overlapped: iteration i+1 starts II cycles after iteration i. The slide shows two overlapped iterations of the example loop; their stores complete at cycles 6 and 10, four cycles apart.

Modulo scheduled loop kernel (1 ALU, 1 MEM, 1 BR; II = 4):
  Time  ALU        MEM  BR
  I0    +1
  I1    +3 if p1
  I2    p1=cmpp    st
  I3    +2 if p1        br

8 Probabilistic Predicate-Aware Modulo Scheduling
Example loop (freq = 0.3): +1 ; p1 = cmpp ; +2 if p1 ; +3 if p1 ; st ; br, on a machine with 1 ALU (A), 1 MEM (M) and 1 BR (B) unit.

Baseline modulo schedule (II = 4):
  Time  A         M    B
  0     +1
  1     +2 if p1
  2     p1=cmpp   st
  3     +3 if p1       br

Deterministic predicate-aware modulo schedule (II = 4): +2 if p1 and +3 if p1 are guarded by the same predicate, so DPAS cannot overlap them and the schedule matches the baseline.

Probabilistic predicate-aware modulo schedule (II = 3 + 0.18 expected delay due to conflicts = 3.18):
  Time  A                   M    B
  0     +1
  1     +2 if p1, +3 if p1
  2     p1=cmpp             st   br
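Put differently, the probabilistic schedule trades a guaranteed II of 4 for an expected II of 3 + 0.18 = 3.18 cycles per iteration, an expected speedup of about 4 / 3.18 ≈ 1.26x (roughly 21% fewer cycles per iteration) on this kernel.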

9 Baseline Architecture Model
- Predicate Register File is only accessed in the EXECUTE stage
- Resources from FETCH to EXECUTE are unconditionally reserved
Pipeline: FETCH → DISPATCH → DECODE → REGISTER READ → PRED READ & EXECUTE → WRITE BACK. Because the predicate is not read until EXECUTE, every fetched operation occupies the must-use resources up through EXECUTE whether or not its predicate turns out to be true; only the stages after the predicate read are may-use.

10 Extended Predicate-Aware Architecture
Pipeline: FETCH → PRED READ & DISPATCH → DECODE → REGISTER READ → EXECUTE → WRITE BACK. The Predicate Register File (PRF) is now read during dispatch, so the resources beyond the front end become may-use: operations whose predicates are false need not reserve them. A Resource Conflict Detection and Recovery Unit handles the case where too many operations with true predicates compete for the same resource: it detects the conflict and recovers by stalling the pipeline.
- Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles

11 Expected Delay Model
- ev is an execution vector: one assignment of true/false to the guard predicates of the operations sharing a slot
- delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1
- P(ev) is the probability of occurrence of ev
- P(ev) is computed using disjointness and implication, and assuming independence otherwise
- Expected delay due to conflicts: ED_cfl = sum over conflicting ev of delay_cycles(ev) × P(ev)
- Example (assume 3 operations, one FU and CDRL=1):
  ED_cfl(op1 if p1, op2 if p2, op3 if p3) =
      (1 + 3 - 1) × P(p1=T, p2=T, p3=T)
    + (1 + 2 - 1) × P(p1=T, p2=T, p3=F)
    + (1 + 2 - 1) × P(p1=F, p2=T, p3=T)
    + (1 + 2 - 1) × P(p1=T, p2=F, p3=T)
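A small Python sketch (my own, following the formula above) that enumerates execution vectors for a set of operations sharing one FU class and accumulates the expected conflict delay. It treats predicates as independent; the paper's model additionally exploits disjointness and implication, which this sketch omits. The function name and the example probabilities are illustrative only.

```python
from itertools import product

def expected_conflict_delay(op_probs, num_fus=1, cdrl=1):
    """Expected extra cycles when the given predicated operations share one
    dispatch slot.  op_probs[i] = probability that operation i's predicate is
    true; predicates are assumed independent (disjointness/implication omitted)."""
    ed = 0.0
    for ev in product([True, False], repeat=len(op_probs)):
        live = sum(ev)                         # operations that really need an FU
        if live <= num_fus:
            continue                           # no conflict in this execution vector
        dispatch_cycles = -(-live // num_fus)  # ceil(live / num_fus)
        delay = cdrl + dispatch_cycles - 1     # delay_cycles(ev) from the slide
        p = 1.0
        for prob, taken in zip(op_probs, ev):
            p *= prob if taken else 1.0 - prob
        ed += delay * p
    return ed

# Three operations on one FU with CDRL = 1, as in the slide's example
# (the probabilities 0.3, 0.5, 0.5 are made up for illustration):
print(expected_conflict_delay([0.3, 0.5, 0.5], num_fus=1, cdrl=1))  # ≈ 0.875
```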

12 Modulo Scheduling using Expected Delay Model (scheduling operation +3 if p1)
Loop (freq = 0.3): +1 ; p1 = cmpp ; +2 if p1 ; +3 if p1 ; st ; br
The partial schedule (SRT/MRT) already holds +1, +2 if p1 and p1=cmpp in the ALU (may-use) column and st, br in the MEM and BR (may-use) columns. The conflict probabilities for placing +3 if p1 in each candidate ALU slot (CDRL = 1) are:
- with +2 if p1 (a different iteration's predicate instance): P_conf = 0.3 × 0.3 = 0.09
- with +1 (unpredicated): P_conf = 1.0 × 0.3 = 0.3
- with p1=cmpp (unpredicated): P_conf = 1.0 × 0.3 = 0.3
Placing +3 if p1 in the slot shared with +2 if p1 gives the smallest total expected delay due to conflicts: 0.18.
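A tiny worked version of the slot choice above (my own sketch; the 0.3 is the loop's path frequency, and treating the two p1 instances from different iterations as independent follows the slide's arithmetic). With CDRL = 1, a two-way conflict costs 2 cycles under the delay model, so 2 × 0.09 = 0.18 is consistent with the total expected delay quoted on the slide.

```python
def conflict_prob(p_new, p_existing):
    # Probability that both the operation being placed and the operation
    # already in the slot execute (their predicates treated as independent).
    return p_new * p_existing

p_p1 = 0.3   # profile frequency of the predicated path
candidates = {
    "+2 if p1 (other iteration)": conflict_prob(p_p1, p_p1),  # 0.3 * 0.3 = 0.09
    "+1 (unpredicated)":          conflict_prob(p_p1, 1.0),   # 1.0 * 0.3 = 0.30
    "p1 = cmpp (unpredicated)":   conflict_prob(p_p1, 1.0),   # 1.0 * 0.3 = 0.30
}
best = min(candidates, key=candidates.get)
print(best)   # '+2 if p1 (other iteration)' -- the lowest-conflict slot
```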

13 Modulo Scheduling using Expected Delay Model (Finding Expected Initiation Interval, II_exp)
- More than one schedule can achieve the same II_exp (e.g. 3.2):

  II = 3, total expected conflict delay < 0.2:
  Time  ALU                  MEM  BR
  0     +1
  1     +2 if p1, +3 if p1
  2     p1=cmpp              st   br

  II = 2, total expected conflict delay < 1.2:
  Time  ALU                  MEM  BR
  0     +1, +2 if p1         st
  1     p1=cmpp, +3 if p1         br

- Use binary search to find II_exp:
  - start with a lower bound and increase it until an upper bound is hit or a schedule is found
  - the II_exp of a schedule that is found becomes the new upper bound
  - the tried value becomes the new lower bound if no schedule is found
  - stop when upper bound = lower bound
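A sketch of the binary search for the smallest expected II (my reading of the search description above, not the paper's exact algorithm). Here try_schedule stands in for the real PPAS modulo scheduler: it returns the total expected conflict delay of a schedule at a given integer II, or None when no schedule exists at that II.

```python
def find_min_expected_ii(try_schedule, ii_low, ii_high):
    """Binary search over integer II; the expected II of a found schedule is
    II plus that schedule's total expected conflict delay."""
    best = None
    lo, hi = ii_low, ii_high
    while lo <= hi:
        ii = (lo + hi) // 2
        delay = try_schedule(ii)
        if delay is not None:
            candidate = ii + delay              # II_exp of the found schedule
            best = candidate if best is None else min(best, candidate)
            hi = ii - 1                         # schedule found: try a smaller II
        else:
            lo = ii + 1                         # no schedule: raise the lower bound
    return best

# Toy stand-in scheduler: no schedule exists below II = 3, II = 3 carries
# 0.18 expected conflict delay, larger IIs are conflict-free.
def toy_try_schedule(ii):
    if ii < 3:
        return None
    return 0.18 if ii == 3 else 0.0

print(find_min_expected_ii(toy_try_schedule, ii_low=1, ii_high=6))  # ≈ 3.18
```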

14 Performance Results
- Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling
- Compiler support: Trimaran and ELCOR [Trimaran99]
- The Mediabench [Lee97] benchmark suite was evaluated
- Processor models (BA – base, PA – predicate-aware):

  4-wide   Fetch Width  Int ALU  cmpp latency  Memory  CDRL
  BASE     4            2        1             1       -
  DPAS     4            2        3             1       -
  PPAS     4            2        3             1       0 and 1

  6-wide   Fetch Width  Int ALU  cmpp latency  Memory  CDRL
  BASE     6            4        1             2       -
  DPAS     6            4        3             2       -
  PPAS     6            4        3             2       0 and 1

15 Cyclic PPAS Speedup over BASE (4-wide machine)
- 4-wide cyclic PPAS with CDRL=0 is 20% better than base and 10% better than cyclic DPAS
- Increasing CDRL degrades performance

16 Various Scheduling Measurements (4-wide machine, CDRL = 0)
- Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS
- The expected delay model accurately predicts the delay due to conflicts
- Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE
(The slide's table of II_compile, II_runtime, absolute error, epilogue size and number of rotating registers for BASE, DPAS and PPAS is not legible in this transcript.)

17 Overall Speedup over BASE with Cyclic PPAS
- Only 52% of regions are scheduled with cyclic PPAS
- Overall, 4-wide cyclic PPAS is 10% better than base and 6-wide cyclic PPAS is 4% better than base

18 Summary of PPAS
- PPAS significantly reduces resource requirements in predicated cyclic code, but can cause conflicts:
  - the compiler maximizes sharing in view of the expected conflict delay
  - the PPAS architecture detects and recovers from conflicts
- PPAS improves performance (cmpp latency = 3, CDRL = 0):

            Cyclic vs. Base   Overall vs. Base   Overall vs. DPAS
  4-wide    20%               10%                6%
  6-wide    8%                4%                 3%

- For further discussion, see Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004.

Questions?

Backup Foils

21 Resource Conflict Detection and Recovery Unit
- Design alternatives for dispatching conflicting operations (example: A1 if 1, A2 if 1, A3 if 1, A4 if 0, A5 if 1, where A1–A4 are assigned to ALU1 and A5 to ALU2):
  - one operation per assigned FU:
      cycle 0: A1 (ALU1), A5 (ALU2); cycle 1: A2; cycle 2: A3
  - one operation per any FU (not evaluated):
      cycle 0: A1 (ALU1), A5 (ALU2); cycle 1: A2 (ALU1), A3 (ALU2)
- Conflict Detection and Recovery Latency (CDRL):
  - CDRL = 0: cycle 0: A1, A5; cycle 1: A2; cycle 2: A3
  - CDRL = 1: cycle 0: conflict detected (dispatch bubble); cycle 1: A1, A5; cycle 2: A2; cycle 3: A3
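A minimal simulation sketch (mine; the dict representation and function name are illustrative) of the "one operation per assigned FU" policy shown above, with the CDRL bubble inserted when a conflict is detected:

```python
def dispatch(ops, cdrl=0):
    """Dispatch predicated ops under the 'one operation per assigned FU' policy.
    Ops whose predicate is false are squashed; surviving ops assigned to the
    same FU issue one per cycle; a detected conflict costs cdrl bubble cycles."""
    live = [op for op in ops if op["pred"]]
    conflict = any(sum(1 for op in live if op["fu"] == fu) > 1
                   for fu in {op["fu"] for op in live})
    schedule, cycle = {}, (cdrl if conflict else 0)
    pending = list(live)
    while pending:
        issued, used = [], set()
        for op in pending:
            if op["fu"] not in used:      # at most one op per FU per cycle
                issued.append(op)
                used.add(op["fu"])
        schedule[cycle] = [op["name"] for op in issued]
        pending = [op for op in pending if op not in issued]
        cycle += 1
    return schedule

ops = [{"name": "A1", "pred": 1, "fu": "ALU1"},
       {"name": "A2", "pred": 1, "fu": "ALU1"},
       {"name": "A3", "pred": 1, "fu": "ALU1"},
       {"name": "A4", "pred": 0, "fu": "ALU1"},
       {"name": "A5", "pred": 1, "fu": "ALU2"}]
print(dispatch(ops, cdrl=0))  # {0: ['A1', 'A5'], 1: ['A2'], 2: ['A3']}
print(dispatch(ops, cdrl=1))  # {1: ['A1', 'A5'], 2: ['A2'], 3: ['A3']}
```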

22 Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)