Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints
Mikhail Smelyanskiy, Scott Mahlke, Edward Davidson (Department of EECS, University of Michigan)
Hsien-Hsin (Sean) Lee (School of ECE, Georgia Institute of Technology)

2 Motivation
- Predication eliminates branch instructions but increases resource requirements
- Predicate-aware scheduling oversubscribes resources, which reduces resource requirements and reduces schedule length

Example from the slide figure: A branches on cond to either B or C, and both paths rejoin at D. If-conversion yields
  0: A
  1: p1,p2 = pred_def(cond)
  2: B if p1
  3: C if p2
  4: D
and predicate-aware scheduling lets the disjoint operations B and C share a cycle:
  0: A
  1: p1,p2 = pred_def(cond)
  2: B if p1   C if p2
  3: D

3 Potential for Disjoint Operations
- Combining disjoint operations reduces dynamic operation count by 13%

4 Outline
- Motivation
- Resource Pressure Problem in Predicated Code
- PRAVO: PRedicate-Aware VLIW Processor
- Predicate-Aware Scheduling
- Performance Results
- Conclusion and Future Work

5 Modulo Scheduling Example

Source Code:
  for (i = 0; i < im_size; i++) {
    if (q_im[i] ≥ 1)
      res[i] = q_im[i] * bin_size - correction;
    else if (q_im[i] ≤ -1)
      res[i] = q_im[i] * bin_size + correction;
    else
      res[i] = bin_size + correction;
  }

Predicated Code:
  op1:  t1 = load(i1, q_im)           if T
  op2:  p1,p2 = pred_def(t1 ≥ 1)      if T
  op3:  t2 = multsub(t1, tbs, tcor)   if p1
  op4:  store(i1, res, t2)            if p1
  op5:  p3,p4 = pred_def(t1 ≤ -1)     if p2
  op6:  t2 = multadd(t1, tbs, tcor)   if p3
  op7:  store(i1, res, t2)            if p3
  op8:  t2 = add(tbs, tcor)           if p4
  op9:  store(i1++, res, t2)          if p4
  op10: if (i++ < im_size) goto op1   if T

- Three control paths: P_T, P_FT, P_FF
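To make the three control paths concrete, here is a minimal C rendering of what the predicated code computes. The pred_def semantics shown (the predicate and its complement, evaluated only under the guarding predicate) and the test values in main are my own illustration, not code from the paper.

```c
/* Illustrative only: scalar semantics of the predicated code above.  p1/p2
 * come from pred_def(t1 >= 1); p3/p4 come from pred_def(t1 <= -1), which is
 * itself guarded by p2.  Exactly one of P_T (p1), P_FT (p3), P_FF (p4) holds
 * in each iteration. */
#include <stdio.h>

void quantize(const int *q_im, int *res, int im_size, int bin_size, int correction)
{
    for (int i = 0; i < im_size; i++) {
        int t1 = q_im[i];                       /* op1: load                       */
        int p1 = (t1 >= 1), p2 = !p1;           /* op2: p1,p2 = pred_def(t1 >= 1)  */
        int p3 = p2 && (t1 <= -1);              /* op5: p3,p4 = pred_def(t1 <= -1) */
        int p4 = p2 && !(t1 <= -1);             /*      ...executed only if p2     */
        if (p1) res[i] = t1 * bin_size - correction;  /* path P_T:  op3, op4 */
        if (p3) res[i] = t1 * bin_size + correction;  /* path P_FT: op6, op7 */
        if (p4) res[i] = bin_size + correction;       /* path P_FF: op8, op9 */
    }
}

int main(void)
{
    int q_im[5] = { 3, 0, -2, 1, -1 }, res[5];
    quantize(q_im, res, 5, 10, 4);
    for (int i = 0; i < 5; i++)
        printf("%d ", res[i]);                  /* prints: 26 14 -16 6 -6 */
    printf("\n");
    return 0;
}
```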

6 Traditional Modulo Schedule (Rau 94)

Modulo Schedule (II = 5):
  Time   Iteration i   Iteration i+1
   0     op1
   1
   2     op2
   3     op5
   4     op3, op10
   5     op6           op1
   6     op8
   7     op4           op2
   8     op7           op5
   9     op9           op3, op10
  10                   op6
  11                   op8
  12                   op4
  13                   op7
  14                   op9

Modulo Scheduled Loop Kernel (II = 5):
        ALU   MEM   BR
  I0    op6   op1
  I1    op8
  I2    op2   op4
  I3    op5   op7
  I4    op3   op9   op10
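For intuition about why the initiation interval (II) dominates loop run time, here is a small sketch of the standard modulo-scheduling cost model: roughly (iterations - 1) x II plus the length of one iteration's schedule. The formula and the trip count are my own illustration; only II = 5 and the roughly 10-cycle single-iteration schedule come from the slide above.

```c
/* Illustrative cost model, not from the slides: a modulo-scheduled loop
 * starts a new iteration every II cycles, and the final iteration still runs
 * its full single-iteration schedule of length L before the loop drains. */
#include <stdio.h>

long modulo_cycles(long iterations, int II, int L)
{
    return iterations == 0 ? 0 : (iterations - 1) * (long)II + L;
}

int main(void)
{
    long n = 1000;                                          /* hypothetical trip count */
    printf("II=5: %ld cycles\n", modulo_cycles(n, 5, 10));  /* baseline kernel         */
    printf("II=3: %ld cycles\n", modulo_cycles(n, 3, 10));  /* predicate-aware kernel  */
    return 0;
}
```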

7 Two Predicate-Aware Modulo Schedules

Modulo Scheduled Loop Kernel 1 (FW = 3, II = 4):
        ALU               MEM         BR
        op3 | op6         op1
        op8               op7
        op2               op9
        op5               op4         op10

Modulo Scheduled Loop Kernel 2 (FW = 4, II = 3):
        ALU               MEM         BR
        op3 | op6 | op8   op1
        op5               op4 | op7
        op2               op9         op10

- Resource oversubscription can produce more efficient schedules (the operations shown sharing an entry must have disjoint guard predicates)
- A larger fetch width (FW) allows more oversubscription and a faster schedule
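The sharing in these two kernels hinges on op3, op6, and op8 having pairwise disjoint guards (p1, p3, p4). A small self-contained check, written here purely for illustration, enumerates the two compare outcomes from slide 5 and verifies that exactly one of the three path predicates holds at a time:

```c
/* Illustrative check: enumerate the two compare outcomes of the slide-5 loop
 * body and confirm that the path predicates p1 (P_T), p3 (P_FT), p4 (P_FF)
 * are pairwise disjoint, which is what lets op3 / op6 / op8 (and op4 / op7 /
 * op9) legally share one function-unit entry. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    for (int c1 = 0; c1 <= 1; c1++) {           /* c1: t1 >= 1  */
        for (int c2 = 0; c2 <= 1; c2++) {       /* c2: t1 <= -1 */
            if (c1 && c2)
                continue;                        /* no t1 satisfies both       */
            int p1 = c1, p2 = !c1;
            int p3 = p2 && c2, p4 = p2 && !c2;
            assert(p1 + p3 + p4 == 1);           /* exactly one path predicate */
            assert(!(p1 && p2));                 /* p1 and p2 are disjoint too */
        }
    }
    printf("p1, p3, p4 are pairwise disjoint\n");
    return 0;
}
```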

8 Baseline Architecture Model
- Pipeline: FETCH, DISPATCH, DECODE, REGISTER READ, PRED READ & EXECUTE, WRITE BACK
- The Predicate Register File is only accessed in the EXECUTE stage
- Resources from FETCH to EXECUTE are unconditionally reserved (must-use); only the later stages are may-use

9 Predicate-Aware Architecture (PRAVO)
- Pipeline: FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE, WRITE BACK
- The Predicate Register File (PRF) is accessed early, in the DISPATCH stage, which increases predicate defining operation latency
- Stages after the predicate read become may-use resources; only the front end remains must-use

10 Predicate-Aware Architecture (PRAVO)
- DECODE and DISPATCH are reversed: FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE, WRITE BACK
- The PRF is read in the PRED READ & DISPATCH stage; REGISTER READ, EXECUTE, and WRITE BACK are may-use resources
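One way to read slides 8 through 10 side by side, expressed as a small data structure. This is my own summary, with the must-use / may-use boundary inferred from where the predicate register file is read, not a structure from the paper.

```c
/* Illustrative summary of slides 8-10 (stage order and the must-use boundary
 * are one reading of the pipeline figures): in the baseline the predicate is
 * read at execute, so every stage up to and including execute is reserved
 * unconditionally; in PRAVO the predicate is read at dispatch, so the later
 * stages become may-use and can be shared by disjoint operations. */
#include <stdio.h>

struct stage { const char *name; int must_use; };

static const struct stage baseline[6] = {
    { "FETCH", 1 }, { "DISPATCH", 1 }, { "DECODE", 1 },
    { "REGISTER READ", 1 }, { "PRED READ & EXECUTE", 1 }, { "WRITE BACK", 0 },
};

static const struct stage pravo[6] = {          /* slide-10 variant */
    { "FETCH", 1 }, { "DECODE", 1 }, { "PRED READ & DISPATCH", 1 },
    { "REGISTER READ", 0 }, { "EXECUTE", 0 }, { "WRITE BACK", 0 },
};

static void show(const char *tag, const struct stage *s)
{
    printf("%s:\n", tag);
    for (int i = 0; i < 6; i++)
        printf("  %-22s %s\n", s[i].name, s[i].must_use ? "must-use" : "may-use");
}

int main(void)
{
    show("Baseline", baseline);
    show("PRAVO (slide 10)", pravo);
    return 0;
}
```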

11 Three Main Changes to the Conventional Scheduler
- Predicate defining operation edge latency adjustment
- ResMII computation
- Predicate-Aware Reservation Table

Scheduler flow (from the slide figure): Build DDG, Compute ResMII / RecMII, Cyclic Scheduler / Acyclic Scheduler, Reservation Tables

12 Data Dependence Graph Latency Adjustment
[Figure: per-cycle schedules of a small example (p1,p2 = pred_def; a load; adds +1 and +2 guarded by p1; adds +3 and +4 guarded by p2) on one ALU (A) and one memory unit (M), comparing three latency-adjustment policies: Original, Brute force, and Selective.]
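A sketch of the "brute force" policy named on this slide, under the assumption that it simply raises every edge from a predicate define to an operation guarded by the defined predicate up to the cmpp latency, since in PRAVO the guard must be available by the consumer's dispatch stage. The slide's "selective" policy raises only a subset of these edges; its exact rule is not recoverable from the transcript. The data layout and field names below are assumptions for illustration.

```c
/* Illustrative sketch of the "brute force" adjustment only: any DDG edge from
 * a predicate define to an operation guarded by the predicate it defines has
 * its latency raised to the cmpp latency. */
#include <stdio.h>

struct ddg_edge {
    int src, dst;            /* operation ids                         */
    int latency;             /* baseline edge latency                 */
    int src_is_pred_def;     /* source is a cmpp / pred_def           */
    int dst_guarded_by_src;  /* destination's guard is defined by src */
};

static void adjust_latencies(struct ddg_edge *e, int n, int cmpp_latency)
{
    for (int i = 0; i < n; i++)
        if (e[i].src_is_pred_def && e[i].dst_guarded_by_src &&
            e[i].latency < cmpp_latency)
            e[i].latency = cmpp_latency;
}

int main(void)
{
    /* op2 (pred_def) -> op3 (if p1): raised; op1 -> op2 (flow edge): untouched. */
    struct ddg_edge ddg[2] = { { 2, 3, 1, 1, 1 }, { 1, 2, 1, 0, 0 } };
    adjust_latencies(ddg, 2, 3);                 /* cmpp latency of 3 cycles */
    for (int i = 0; i < 2; i++)
        printf("op%d -> op%d : latency %d\n", ddg[i].src, ddg[i].dst, ddg[i].latency);
    return 0;
}
```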

13 Computation of the Resource-Constrained Lower Bound
- Predicate-aware ResMII computation uses "first-fit" combining of disjoint operations
- The fetch width (FW) acts as an additional resource constraint
[Figure: may-use ALU (A) and memory (M) columns and a must-use fetch-width (FW) column for the example, comparing Original (ResMII = 5) with Predicate-Aware (ResMII = 3).]
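"First-fit" combining and the FW constraint are named on the slide; the sketch below is a guess at how they fit together, using the slide-5 loop: operations on the same function-unit type are first-fit packed into slots of pairwise-disjoint operations, the per-FU bound counts slots, and the FW bound still counts every operation. Note that ceil(10 ops / FW 3) = 4 and ceil(10 / 4) = 3, which lines up with the II values of the two kernels on slide 7.

```c
/* Illustrative predicate-aware ResMII for the slide-5 loop's five ALU
 * operations (op2, op3, op5, op6, op8) on a 1-ALU machine with FW = 4.
 * The disjointness matrix encodes only the combination the slide-7 kernels
 * actually use; memory operations would be packed the same way. */
#include <stdio.h>

#define N_ALU_OPS 5

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* First-fit packing: returns the number of slots needed so that all
 * operations sharing a slot are pairwise disjoint. */
static int pack_slots(int n, int disjoint[N_ALU_OPS][N_ALU_OPS])
{
    int slot_of[N_ALU_OPS], slots = 0;
    for (int i = 0; i < n; i++) {
        int placed = 0;
        for (int s = 0; s < slots && !placed; s++) {
            int ok = 1;
            for (int j = 0; j < i; j++)
                if (slot_of[j] == s && !disjoint[i][j])
                    ok = 0;
            if (ok) { slot_of[i] = s; placed = 1; }
        }
        if (!placed)
            slot_of[i] = slots++;
    }
    return slots;
}

int main(void)
{
    /* Indices 0..4 = op2, op3, op5, op6, op8; op3 (p1), op6 (p3), op8 (p4)
     * are pairwise disjoint. */
    int d[N_ALU_OPS][N_ALU_OPS] = { { 0 } };
    d[1][3] = d[3][1] = 1;                       /* op3 vs op6 */
    d[1][4] = d[4][1] = 1;                       /* op3 vs op8 */
    d[3][4] = d[4][3] = 1;                       /* op6 vs op8 */

    int alu_slots = pack_slots(N_ALU_OPS, d);    /* 3: {op2} {op3,op6,op8} {op5} */
    int num_alu = 1, fw = 4, total_ops = 10;
    int resmii_alu = ceil_div(alu_slots, num_alu);   /* 3              */
    int resmii_fw  = ceil_div(total_ops, fw);        /* ceil(10/4) = 3 */
    int resmii = resmii_alu > resmii_fw ? resmii_alu : resmii_fw;
    printf("predicate-aware ResMII = %d (original ResMII = 5)\n", resmii);
    return 0;
}
```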

14 Reservation Table (similar to [Warter 92])
- Baseline: one operation per reservation table entry
- Predicate-aware: multiple disjoint operations per entry of a may-use resource
- Disjointness is checked with a predicate query system (PQS [Johnson96])
[Figure: a conventional reservation table (one op per Time x Resource entry) next to a predicate-aware one, in which a may-use entry holds op1 and op2 under p1 | p2, must-use entries still hold one op each, and op3 is guarded by TRUE at time r.]
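A sketch of the placement test the predicate-aware reservation table implies: a may-use entry can accept another operation only if its guard is disjoint from everything already there, while a must-use entry (for example a fetch slot) holds one operation regardless of predicates. The disjoint() callback stands in for PQS [Johnson96]; the data layout is an assumption for illustration.

```c
/* Illustrative placement check for a predicate-aware reservation table entry.
 * May-use entries may hold several operations if their guard predicates are
 * pairwise disjoint; must-use entries hold at most one operation. */
#include <stdio.h>

#define MAX_SHARERS 4

struct rt_entry {
    int must_use;           /* 1: unconditionally reserved resource     */
    int nops;               /* operations already placed in this entry  */
    int pred[MAX_SHARERS];  /* their guard predicate ids                */
};

/* Example disjointness relation for the slide-5 predicates (1=p1 .. 4=p4). */
static int disjoint(int a, int b)
{
    static const int pairs[][2] = { { 1, 2 }, { 1, 3 }, { 1, 4 }, { 3, 4 } };
    for (unsigned i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        if ((pairs[i][0] == a && pairs[i][1] == b) ||
            (pairs[i][0] == b && pairs[i][1] == a))
            return 1;
    return 0;
}

static int can_place(const struct rt_entry *e, int guard)
{
    if (e->nops == 0) return 1;                 /* empty entry always accepts */
    if (e->must_use || e->nops == MAX_SHARERS) return 0;
    for (int i = 0; i < e->nops; i++)
        if (!disjoint(e->pred[i], guard)) return 0;
    return 1;
}

static void place(struct rt_entry *e, int guard) { e->pred[e->nops++] = guard; }

int main(void)
{
    struct rt_entry alu   = { 0, 0, { 0 } };    /* may-use ALU slot             */
    struct rt_entry fetch = { 1, 0, { 0 } };    /* must-use fetch slot          */
    int guards[3] = { 1, 3, 4 };                /* op3 (p1), op6 (p3), op8 (p4) */
    for (int i = 0; i < 3; i++)
        if (can_place(&alu, guards[i])) place(&alu, guards[i]);
    printf("ALU entry holds %d disjoint ops\n", alu.nops);              /* 3  */
    place(&fetch, 1);
    printf("fetch entry accepts a second op? %s\n",
           can_place(&fetch, 3) ? "yes" : "no");                        /* no */
    return 0;
}
```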

15 Performance Results
- Compare the performance of baseline and predicate-aware scheduling
- Compiler support: Trimaran and ELCOR [Trimaran99]
- The Mediabench [Lee97] benchmark suite was evaluated
- Processor models (BA: baseline, PA: predicate-aware):

  Model         Fetch Width   Int ALU   cmpp latency   Memory
  BA42 / PA42   4             2         1 / 2 / 3      1
  BA64 / PA64   6             4         1 / 2 / 3      2

16 Predicate-Aware Speedup over Baseline (PA42 vs. BA42)
- Speedup comes only from the regions that predicate-aware scheduling can improve
- Speedup decreases for higher cmpp latency and for the wider machine
[Figure: per-benchmark speedup bars plus the overall average.]

17 Average Speedup Breakdown
- Only 68% of regions are PA scheduled
- PA is more effective in modulo scheduled loops

18 Summary and Future Work
- Summary: predicate-aware scheduling
  - reduces resource constraints in predicated code
  - is supported by the PRAVO architecture
  - is effective in cyclic regions (16% speedup on a 4-wide PRAVO)
- Future work: more resource sharing can be achieved by combining probabilistically disjoint operations

Q&A and Suggestions

Backup Foils

21 Modulo Scheduling Using PART
[Table: the predicate-aware reservation table (PART) for the example loop, with may-use ALU (A), memory (M), and branch (B) columns and must-use issue-width columns (IW1, IW2, IW3). Entries record the path predicates under which each operation uses the resource: op1, op2, and op10 (guarded by T) occupy entries under P_T | P_FT | P_FF at times 0, 2, and 3; op5 is placed at time 4; op3 under P_T at time 5; op6 and op8 share an entry under P_FT | P_FF at time 7; op4 and op9 share an entry under P_T | P_FF at time 9; op7 under P_FT at time 10.]

22 Speedup Analysis
[Figure: two panels, Predicate-Aware Acyclic Region and Predicate-Aware Cyclic Region, plotting cycles for six cases on 4-wide and 6-wide machines with cmpp latency 2 or 3; under the heading "PA Potential" the plotted series are Base Sched. Length, PA Sched. Length, PA Critical Path Length, and PA Resource Bound.]