Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints
  Mikhail Smelyanskiy, Scott Mahlke, Edward Davidson (Department of EECS, University of Michigan)
  Hsien-Hsin (Sean) Lee (School of ECE, Georgia Institute of Technology)
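2 Motivation
  Predication eliminates branch instructions but increases resource requirements.
  Predicate-aware scheduling oversubscribes resources, which:
    reduces resource requirements
    reduces schedule length
  [Figure: control-flow graph in which block A ends with "br cond", the T and F edges lead to blocks B and C, and both paths rejoin at D.]
  Conventional predicated schedule:
    0: A
    1: p1,p2 = pred_def(cond)
    2: B if p1
    3: C if p2
    4: D
  Predicate-aware schedule (B and C share cycle 2):
    0: A
    1: p1,p2 = pred_def(cond)
    2: B if p1    C if p2
    3: D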
3 Potential for Disjoint Operations
  Combining disjoint operations reduces the dynamic operation count by 13%.
4 Outline
  Motivation
  Resource Pressure Problem in Predicated Code
  PRAVO: PRedicate-Aware VLIW Processor
  Predicate-aware Scheduling
  Performance Results
  Conclusion and Future Work
5 Modulo Scheduling Example
  Source Code:
    for (i = 0; i < im_size; i++) {
      if (q_im[i] >= 1)
        res[i] = q_im[i] * bin_size - correction;
      else if (q_im[i] <= -1)
        res[i] = q_im[i] * bin_size + correction;
      else
        res[i] = bin_size + correction;
    }
  Predicated Code:
    op1:  t1 = load(i1, q_im)              if T
    op2:  p1,p2 = pred_def(t1 >= 1)        if T
    op3:  t2 = multsub(t1, tbs, tcor)      if p1
    op4:  store(i1, res, t2)               if p1
    op5:  p3,p4 = pred_def(t1 <= -1)       if p2
    op6:  t2 = multadd(t1, tbs, tcor)      if p3
    op7:  store(i1, res, t2)               if p3
    op8:  t2 = add(tbs, tcor)              if p4
    op9:  store(i1++, res, t2)             if p4
    op10: if (i++ < im_size) goto op1      if T
  Three control paths: P_T, P_FT, P_FF
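  The correspondence between the source loop and the predicated code can be checked with a small simulation. The following is a minimal sketch, not the authors' tooling: it assumes that pred_def writes (guard AND cond, guard AND NOT cond), so both results are false when the guarding predicate is false. The slide does not spell out that semantics. Guarded operations are modeled with plain Python "if" statements, and the function names are hypothetical.

```python
def source_loop(q_im, bin_size, correction):
    """Direct transcription of the slide's source code."""
    res = []
    for q in q_im:
        if q >= 1:
            res.append(q * bin_size - correction)
        elif q <= -1:
            res.append(q * bin_size + correction)
        else:
            res.append(bin_size + correction)
    return res


def pred_def(cond, guard=True):
    """Assumed semantics: both predicates are false when the guard is false."""
    return (guard and cond, guard and not cond)


def predicated_loop(q_im, tbs, tcor):
    """Branch-free body: every operation is guarded by a predicate."""
    res = [None] * len(q_im)
    for i1 in range(len(q_im)):                 # op10 and the i1++ of op9
        t1 = q_im[i1]                           # op1
        p1, p2 = pred_def(t1 >= 1)              # op2 (guarded by T)
        if p1: t2 = t1 * tbs - tcor             # op3
        if p1: res[i1] = t2                     # op4
        p3, p4 = pred_def(t1 <= -1, guard=p2)   # op5
        if p3: t2 = t1 * tbs + tcor             # op6
        if p3: res[i1] = t2                     # op7
        if p4: t2 = tbs + tcor                  # op8
        if p4: res[i1] = t2                     # op9
        # Exactly one of the three control paths P_T, P_FT, P_FF is taken.
        assert [p1, p3, p4].count(True) == 1
    return res


q_im = [3, -2, 0, 1, -1]
print(predicated_loop(q_im, 4, 1) == source_loop(q_im, 4, 1))
```

  Because p1, p3, and p4 are never true together, operations guarded by different ones of them (for example op3, op6, and op8) never execute in the same iteration, which is the property predicate-aware scheduling relies on.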
6 Traditional Modulo Schedule (Rau 94)
  Modulo Schedule:
    Time   Iteration i    Iteration i + 1
    0      op1
    1
    2      op2
    3      op5
    4      op3 op10
    5      op6            op1
    6      op8
    7      op4            op2
    8      op7            op5
    9      op9            op3 op10
    10                    op6
    11                    op8
    12                    op4
    13                    op7
    14                    op9
  Modulo Scheduled Loop Kernel (II = 5):
          ALU    MEM    BR
    I0    op6    op1
    I1    op8
    I2    op2    op4
    I3    op5    op7
    I4    op3    op9    op10
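  The kernel is obtained by folding each operation's issue time modulo II. The sketch below illustrates that folding step using the issue times of iteration i from this slide. The unit assignment (ALU, MEM, BR) is read off the kernel columns, one unit of each kind is assumed, and the function names are hypothetical.

```python
from collections import defaultdict

# Issue time and functional unit of each operation in iteration i.
SCHEDULE = {
    "op1": (0, "MEM"), "op2": (2, "ALU"), "op5": (3, "ALU"),
    "op3": (4, "ALU"), "op10": (4, "BR"), "op6": (5, "ALU"),
    "op8": (6, "ALU"), "op4": (7, "MEM"), "op7": (8, "MEM"),
    "op9": (9, "MEM"),
}


def fold_into_kernel(schedule, ii):
    """Place each operation in kernel row (issue time mod II)."""
    kernel = [defaultdict(list) for _ in range(ii)]
    for op, (time, unit) in schedule.items():
        kernel[time % ii][unit].append(op)
    return [dict(row) for row in kernel]


def has_conflict(kernel):
    """A traditional kernel allows at most one operation per unit per row."""
    return any(len(ops) > 1 for row in kernel for ops in row.values())


kernel = fold_into_kernel(SCHEDULE, ii=5)
for slot, row in enumerate(kernel):
    print(f"I{slot}", row)
print("conflict at II=5:", has_conflict(kernel))
```

  Folding these same issue times at II = 4 puts two ALU operations in one row, so a traditional scheduler cannot simply shrink II without rescheduling or, as in the predicate-aware approach, letting disjoint operations share an entry.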
7 Two Predicate-Aware Modulo Schedules
  Modulo Scheduled Loop Kernel 1 (FW = 3, II = 4):
    ALU            MEM        BR
    op3 op6        op1
    op8            op7
    op2            op9
    op5            op4        op10
  Modulo Scheduled Loop Kernel 2 (FW = 4, II = 3):
    ALU            MEM        BR
    op3 op6 op8    op1
    op5            op4 op7
    op2            op9        op10
  Resource oversubscription can produce more efficient schedules (if the colored operations, shown here in a shared cell, can share an entry).
  A larger Fetch Width (FW) allows more oversubscription and a faster schedule.
8 Baseline Architecture Model
  The Predicate Register File is accessed only in the EXECUTE stage.
  Resources from FETCH to EXECUTE are unconditionally reserved.
  [Figure: pipeline diagram with stages FETCH, DECODE, DISPATCH, REGISTER READ, PRED READ & EXECUTE, and WRITE BACK, plus the Predicate Register File; stages are marked as Must-use Resources or May-use Resources.]
9 Predicate-aware Architecture (PRAVO)
  The PRF is accessed early, in the DISPATCH stage.
  This increases predicate-defining operation latency.
  [Figure: pipeline diagram with stages FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE, and WRITE BACK, plus the Predicate Register File (PRF); stages are marked as Must-use Resources or May-use Resources.]
10 Predicate-aware Architecture (PRAVO)
  DECODE and DISPATCH are reversed.
  [Figure: pipeline diagram with stages FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE, and WRITE BACK, plus the Predicate Register File (PRF); stages are marked as Must-use Resources or May-use Resources.]
11 Three Main Changes to Conventional Scheduler
  Predicate-defining operation edge latency adjustment
  ResMII computation
  Predicate-Aware Reservation Table
  [Figure: scheduler flow with boxes Build DDG, Compute ResMII / RecMII, Cyclic Scheduler, Acyclic Scheduler, and Reservation Tables.]
12 Data Dependence Graph Latency Adjustment
  [Figure: a data dependence graph rooted at p1,p2 = pred_def, whose successors are additions and a load guarded by p1 or p2, shown with three schedules (columns Time, A, M) labeled Original, Brute force, and Selective.]
13 Computation of Resource-Constrained Lower Bound
  Predicate-aware ResMII computation
    "first-fit" combining
    Fetch Width (FW) resource constraint
  [Figure: resource usage for the operations p1,p2 = pred_def; + 1 if p1; + 2 if p1; + 3 if p2; + 4 if p2; ld if p2, with columns A (may), M (may), and FW (must). Original: ResMII = 5. Predicate-Aware: ResMII = 3.]
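  The two ResMII values on this slide can be reproduced with a short sketch of first-fit combining. This is an illustration under stated assumptions, not the authors' implementation: the machine is assumed to have one ALU, one memory unit, and FW = 2 (the slide does not give these numbers), and each predicate is modeled as the set of control paths on which it is true, so two operations are disjoint when their path sets do not intersect. All names are hypothetical.

```python
from math import ceil

# Predicates as sets of control paths: cond true ("T") or false ("F").
TRUE = frozenset({"T", "F"})
P1 = frozenset({"T"})
P2 = frozenset({"F"})

# (functional unit, guarding predicate) for the six operations on the slide.
OPS = [
    ("ALU", TRUE),  # p1,p2 = pred_def
    ("ALU", P1),    # + 1 if p1
    ("ALU", P1),    # + 2 if p1
    ("MEM", P2),    # ld  if p2
    ("ALU", P2),    # + 3 if p2
    ("ALU", P2),    # + 4 if p2
]
UNITS = {"ALU": 1, "MEM": 1}  # assumed machine: one unit of each kind


def first_fit_entries(predicates):
    """Count resource entries when disjoint operations share an entry."""
    entries = []
    for pred in predicates:
        for entry in entries:
            if all(pred.isdisjoint(other) for other in entry):
                entry.append(pred)   # first entry that fits
                break
        else:
            entries.append([pred])   # no entry fits: open a new one
    return len(entries)


def res_mii(ops, units, fetch_width, predicate_aware):
    # Fetch slots are must-use: every operation needs its own slot.
    bounds = [ceil(len(ops) / fetch_width)]
    for unit, count in units.items():
        preds = [p for u, p in ops if u == unit]
        used = first_fit_entries(preds) if predicate_aware else len(preds)
        bounds.append(ceil(used / count))
    return max(bounds)


print(res_mii(OPS, UNITS, fetch_width=2, predicate_aware=False))
print(res_mii(OPS, UNITS, fetch_width=2, predicate_aware=True))
```

  With these assumptions the baseline bound is set by the five ALU operations, while the predicate-aware bound drops to three because the p1 and p2 additions pair up; at that point the must-use fetch width becomes an equally tight constraint.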
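14 Reservation Table (similar to [Warter 92])
  Conventional: one operation per RT entry
    [Table: columns Time, Res 1, …, Res n, Res n+1; rows 0 through r; op1, op2, and op3 each occupy their own entry.]
  Predicate-aware: multiple disjoint operations per RT entry
    [Table: columns Time, Res 1 (may), …, Res n (must), Res n+1 (must); at time 0 a may-use entry holds op1 and op2 with predicate p1 | p2, while op1 and op2 occupy separate must-use entries; at time r op3 is recorded with predicate TRUE.]
  Check disjointness (using PQS [Johnson96])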
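  One way to picture a reservation table that holds several disjoint operations per entry is sketched below. It is a simplified model under stated assumptions: predicates are represented as sets of control paths (P_T, P_FT, P_FF from the modulo scheduling example), set intersection stands in for the PQS disjointness query, functional units are may-use, and fetch slots are the only must-use resource. The class and method names are hypothetical.

```python
P_T, P_FT, P_FF = frozenset({"T"}), frozenset({"FT"}), frozenset({"FF"})
ALL_PATHS = P_T | P_FT | P_FF


class PredicateAwareRT:
    """Modulo reservation table with shared may-use entries (sketch)."""

    def __init__(self, ii, fetch_width):
        self.ii = ii
        self.fetch_width = fetch_width
        self.may = {}    # (row, unit) -> list of (op, path set)
        self.must = {}   # row -> number of fetch slots already used

    def can_place(self, time, unit, paths):
        row = time % self.ii
        # Must-use check: every operation consumes a fetch slot.
        if self.must.get(row, 0) >= self.fetch_width:
            return False
        # May-use check: the new operation must be disjoint from every
        # operation already recorded in this entry.
        return all(paths.isdisjoint(p) for _, p in self.may.get((row, unit), []))

    def place(self, op, time, unit, paths):
        if not self.can_place(time, unit, paths):
            return False
        row = time % self.ii
        self.may.setdefault((row, unit), []).append((op, paths))
        self.must[row] = self.must.get(row, 0) + 1
        return True

    def entry(self, row, unit):
        return [op for op, _ in self.may.get((row, unit), [])]


# Row 0 of a kernel with II = 3 and FW = 4.
rt = PredicateAwareRT(ii=3, fetch_width=4)
rt.place("op3", 0, "ALU", P_T)
rt.place("op6", 3, "ALU", P_FT)   # time 3 folds onto row 0
rt.place("op8", 6, "ALU", P_FF)
rt.place("op1", 0, "MEM", ALL_PATHS)
print(rt.entry(0, "ALU"))
```

  In this model an entry's combined predicate is simply the union of the path sets it holds, which corresponds to annotations such as P_FT | P_FF next to a shared entry.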
15 Performance Results
  Compare the performance of baseline and predicate-aware scheduling
  Compiler Support: Trimaran and ELCOR [Trimaran99]
  Mediabench [Lee97] benchmark suite was evaluated
  Processor Models (BA: base, PA: predicate-aware)
    Model          Fetch Width   Int ALU   cmpp latency   Memory
    BA42 / PA42    4             2         1 / 2 / 3      1
    BA64 / PA64    6             4         1 / 2 / 3      2
16 Predicate-aware Speedup over Baseline (PA42 vs. BA42)
  Speedup is due only to improvable PA regions.
  Speedup decreases for higher latency and wider machines.
  [Figure: speedup chart, including an "average" entry.]
17 Average Speedup Breakdown
  Only 68% of regions are PA scheduled.
  PA is more effective in modulo scheduled loops.
18 Summary and Future Work
  Summary: Predicate-aware Scheduling
    reduces resource constraints in predicated code
    is supported by the PRAVO architecture
    is effective in cyclic regions (16% speedup on 4-wide PRAVO)
  Future work
    More resource sharing can be achieved by combining probabilistically disjoint operations.
Q&A and Suggestions
Backup Foils
21 Modulo Scheduling Using PART
  Resource columns: A (may), M (may), B (may), IW1 (must), IW2 (must), IW3 (must)
    Time   May-use entry   Entry predicate       Must-use (IW) entries
    0      op1             P_T | P_FT | P_FF     op1
    1
    2      op2             P_T | P_FT | P_FF     op2
    3      op10            P_T | P_FT | P_FF     op10
    4      op5             P_T | P_FT | P_FF     op5
    5      op3             P_T
    6
    7      op6 op8         P_FT | P_FF           op6 op8
    8
    9      op4 op9         P_T | P_FF            op4 op9
    10     op7             P_FT                  op7
    11
22 Speedup Analysis
  [Figure: two panels, Predicate-Aware Acyclic Region and Predicate-Aware Cyclic Region; y-axis: Cycles; legend: PA Potential, Base Sched. Length, PA Sched. Length, PA Critical Path Length, PA Resource Bound; Cases 1 through 6 cover 4-wide and 6-wide machines with cmpp latency 2 or 3.]