Probabilistic Predicate-Aware Modulo Scheduling
Mikhail Smelyanskiy (1), Scott Mahlke, Edward Davidson
Department of EECS, University of Michigan
(1) Currently with the System Technology Lab at Intel Corporation
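2 Introduction to Deterministic Predicate-aware Scheduling (DPAS) [Smelyanskiy03]
- Predication eliminates branch instructions but increases resource requirements
- Predicate-aware scheduling oversubscribes resources, which
  - reduces resource requirements
  - reduces schedule length
[Figure: control-flow graph in which A ends in "br cond", B and C lie on the two branch outcomes (edges labelled T and F), and both paths rejoin at D]
Predicated schedule without resource sharing:
Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1
3     C if p2
4     D
Predicate-aware schedule (B and C share one slot):
Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1, C if p2
3     D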
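The effect of sharing a slot between operations guarded by disjoint predicates can be sketched in a few lines. This is only an illustration of the idea on this slide, not the DPAS scheduler itself: `pack_on_one_fu` is a hypothetical helper, it ignores dependences beyond program order, and it assumes the compiler already knows that p1 and p2 are disjoint because they come from the same cmpp.

```python
def pack_on_one_fu(ops, disjoint, predicate_aware):
    """Greedily pack ops, in program order, into issue slots of a single FU.

    ops: list of (name, guarding predicate or None for unconditional ops)
    disjoint: set of frozensets naming predicate pairs that are never both true
    """
    slots = []
    for name, pred in ops:
        # An op may join the previous slot only if it is predicated and its
        # predicate is disjoint from every op already placed in that slot.
        can_share = (
            predicate_aware
            and slots
            and pred is not None
            and all(
                other is not None and frozenset((other, pred)) in disjoint
                for _, other in slots[-1]
            )
        )
        if can_share:
            slots[-1].append((name, pred))
        else:
            slots.append([(name, pred)])
    return slots


OPS = [("A", None), ("cmpp", None), ("B", "p1"), ("C", "p2"), ("D", None)]
DISJOINT = {frozenset(("p1", "p2"))}  # assumption: p1 and p2 are complementary

baseline = pack_on_one_fu(OPS, DISJOINT, predicate_aware=False)
shared = pack_on_one_fu(OPS, DISJOINT, predicate_aware=True)
print(len(baseline), len(shared))
```

With the five operations above, the sketch needs five slots without sharing and four with it, which matches the two schedule tables on the slide.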
3 Motivation for Probabilistic Predicate-aware Scheduling (PPAS)
- DPAS can only combine A5 with A2, A3 and A4
- What about combining A2 with A3? A3 with A4? A2 with A6?
- PPAS allows much more aggressive sharing than DPAS, but can result in delay due to resource conflicts
[Figure: control-flow graph containing operations A1 to A6, M1, M2 and a branch]
4 Characteristics of Predicated Code
- 52% of time is spent in cyclic regions
- Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions
5 Outline
- Motivation
- Resource Pressure Problem in Predicated Code
- Probabilistic Predicate-Aware Architecture
- Probabilistic Predicate-aware Modulo Scheduling
- Performance Results
- Conclusions
6 Modulo Scheduling Example
Loop body: +1; p1=cmpp; +2 if p1; +3 if p1; st; br
- The indicated control path (edge labelled T in the figure) is taken 30% of the time
- Assumed machine: 1 ALU, 1 MEMORY and 1 BRANCH unit
7 Traditional Modulo Schedule (Rau 94)
[Figure: modulo schedule showing overlapped loop iterations over time, next to the modulo-scheduled loop kernel; annotations II=4 and II=5]
Modulo Scheduled Loop Kernel (entries as they survive extraction; the first kernel row is not recoverable):
      ALU        MEM   BR
I1    +3 if p1
I2    p1=cmpp    st
I3    +2 if p1         br
8 Probabilistic Predicate-Aware Modulo Scheduling
Loop body: +1; p1=cmpp; +2 if p1; +3 if p1; st; br (freq=0.3)
Three modulo schedules are compared (table columns: Time, A, M, B):
- Baseline Modulo Schedule: II = 4
- Deterministic Predicate-Aware Modulo Schedule: II = 4
- Probabilistic Predicate-Aware Modulo Schedule: II = 3.18, which includes 0.18 expected delay due to conflicts
[Figure: the three schedule tables; individual cell placements are not recoverable from the extracted text]
9 Baseline Architecture Model
Pipeline stages shown: FETCH, DISPATCH, DECODE, REGISTER READ, PRED READ & EXECUTE, WRITE BACK (the figure marks resources as must-use or may-use)
- Predicate Register File is only accessed in the EXECUTE stage
- Resources from FETCH to EXECUTE are unconditionally reserved
10 Extended Predicate-Aware Architecture
Pipeline stages shown: FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE, WRITE BACK (the figure marks resources as must-use or may-use)
- The Predicate Register File (PRF) is read in the PRED READ & DISPATCH stage
- Resource Conflict Detection and Recovery Unit (figure labels: stall, conflict detection, conflict recovery)
- Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles
11 Expected Delay Model
- ev is an execution vector
- delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1
- P(ev) is the probability of occurrence of ev
- P(ev) is computed using disjointness and implication, and assuming independence otherwise
Example (assume 3 operations, one FU and CDRL=1):
ED_cfl(op1 if p1, op2 if p2, op3 if p3) =
    delay_cycles(T,T,T) × P(p1=T, p2=T, p3=T)
  + delay_cycles(T,T,F) × P(p1=T, p2=T, p3=F)
  + delay_cycles(F,T,T) × P(p1=F, p2=T, p3=T)
  + delay_cycles(T,F,T) × P(p1=T, p2=F, p3=T)
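The expected delay sum can be sketched by enumerating execution vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: all predicates are treated as independent (the disjointness and implication rules are left out), all operations compete for one FU so dispatch_cycles(ev) equals the number of true predicates, and vectors with fewer than two true predicates contribute no delay. The function name and the 0.5 probabilities are hypothetical.

```python
from itertools import product


def expected_conflict_delay(p_true, cdrl):
    """Expected delay for ops sharing one FU slot.

    p_true: probability that each op's guarding predicate is true
            (assumed independent; disjointness/implication not modelled)
    cdrl:   conflict detection and recovery latency in cycles
    """
    total = 0.0
    for ev in product([True, False], repeat=len(p_true)):
        active = sum(ev)  # ops that actually need the FU
        if active < 2:
            continue  # no conflict, so no delay
        p_ev = 1.0
        for is_true, p in zip(ev, p_true):
            p_ev *= p if is_true else 1.0 - p
        # delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1, one FU assumed
        total += (cdrl + active - 1) * p_ev
    return total


# Hypothetical example: three ops, each executing half of the time, CDRL = 1.
example = expected_conflict_delay([0.5, 0.5, 0.5], cdrl=1)
print(example)
```

Under these assumptions the all-true vector costs 1 + 3 - 1 = 3 cycles and each two-true vector costs 1 + 2 - 1 = 2 cycles, which are the four terms of the example on the slide.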
12 Modulo Scheduling using Expected Delay Model (scheduling operation +3 if p1)
Loop body: +1; p1=cmpp; +2 if p1; +3 if p1; st; br
- Conflict probabilities considered when placing +3: P_conf(+2, +3), P_conf(+1, +3), P_conf(p1=pred, +3)
- Surviving numeric fragments for these terms: "0.3 0.3 =", "1.0 0.3 =", "1.0 0.3 ="
- Expected delay due to conflicts is evaluated with CDRL = 1
- Total expected delay due to conflicts: 0.18
[Figure: SRT and MRT tables with BR, MEM and A columns marked as may-use resources; cell contents are not recoverable from the extracted text]
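One reading that is consistent with the 0.18 total can be sketched as follows. It rests on assumptions that the extracted slide does not confirm: the three ALU slots are taken to hold +1, p1=cmpp and +2 if p1; the +2 and +3 operations that meet in one kernel slot are taken to belong to different iterations, so their p1 values are treated as independent; and each pairwise conflict costs CDRL + 2 - 1 cycles. All names in the code are hypothetical.

```python
CDRL = 1
P_NEW_OP = 0.3  # probability that "+3 if p1" executes (freq = 0.3)

# Assumed execution probability of the op already occupying each ALU slot.
P_RESIDENT = {"+1": 1.0, "p1=cmpp": 1.0, "+2 if p1": 0.3}


def pair_expected_delay(p_resident, p_new, cdrl):
    """Expected delay when two ops share one slot on one FU."""
    p_conflict = p_resident * p_new  # independence assumed
    delay_cycles = cdrl + 2 - 1      # two ops must dispatch one after another
    return delay_cycles * p_conflict


costs = {
    op: pair_expected_delay(p, P_NEW_OP, CDRL) for op, p in P_RESIDENT.items()
}
best = min(costs, key=costs.get)
print(best, round(costs[best], 2))
```

Under these assumptions, sharing a slot with +2 if p1 is the cheapest choice and gives an expected delay of about 0.18 cycles, the total quoted on the slide.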
13 Modulo Scheduling using Expected Delay Model (Finding Expected Initiation Interval, II_exp)
- More than one way to achieve the same II_exp (e.g. 3.2):
  - one schedule (columns ALU, MEM, BR) with total expected conflict delay < 0.2
  - another schedule with total expected conflict delay < 1.2
- Search procedure (symbols and bound values marked [...] are missing from the extracted text):
  - start with [...] and increase till [...] or a schedule is found
  - use binary search to find [...]
  - [...] of the schedule found becomes the new upper bound
  - [...] becomes the new lower bound if no schedule is found
  - upper bound = [...], lower bound = [...]
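The search outlined above can be sketched as a bound-raising phase followed by a binary search. Because the slide's starting value and bounds did not survive extraction, everything concrete here is an assumption: `try_schedule` is a hypothetical callback that returns the achieved expected II (no larger than the budget it was given) or None when no schedule fits, and `start`, `step`, `tol` and `limit` are illustrative parameters.

```python
def find_expected_ii(try_schedule, start, step=1.0, tol=0.01, limit=64.0):
    """Search for a small expected initiation interval.

    try_schedule(budget) -> achieved expected II (<= budget), or None.
    """
    lower = start
    budget = start
    achieved = try_schedule(budget)
    # Phase 1: raise the budget until some schedule is found.
    while achieved is None:
        lower = budget
        budget += step
        if budget > limit:
            raise ValueError("no schedule found below the limit")
        achieved = try_schedule(budget)
    upper = achieved
    # Phase 2: binary search between the failed and the successful budget.
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        achieved = try_schedule(mid)
        if achieved is None:
            lower = mid  # no schedule found: new lower bound
        else:
            upper = min(achieved, mid)  # schedule found: new upper bound
    return upper


def toy_scheduler(budget):
    """Stand-in for a real scheduler: only an expected II of 3.18 is feasible."""
    return 3.18 if budget >= 3.18 else None


result = find_expected_ii(toy_scheduler, start=3.0)
print(result)
```

The toy scheduler only exercises the control flow; a real `try_schedule` would run the modulo scheduler with the expected delay model and report the II plus expected conflict delay it achieved.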
14 Performance Results
- Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling
- Compiler support: Trimaran and ELCOR [Trimaran99]
- The Mediabench [Lee97] benchmark suite was evaluated
- Processor models (BA = base, PA = predicate-aware):
Machine  Model  Fetch Width  Int ALU  cmpp latency  Memory  CDRL
4-wide   BASE   4            2        1             1       -
4-wide   DPAS   4            2        3             1       -
4-wide   PPAS   4            2        3             1       0 and 1
6-wide   BASE   6            4        1             2       -
6-wide   DPAS   6            4        3             2       -
6-wide   PPAS   6            4        3             2       0 and 1
15 Cyclic PPAS Speedup over BASE (4-wide machine)
- 4-wide cyclic PPAS with CDRL=0 is 20% better than BASE and 10% better than cyclic DPAS
- Increasing CDRL degrades performance
16 Various Scheduling Measurements (4-wide machine, CDRL = 0)
- Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS
- The expected delay model accurately predicts delay due to conflicts
- Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE
[Table: rows BASE, DPAS, PPAS; columns II compile, II runtime, Absolute Error, Epilogue Size, # Rotating Registers. Cell values are not recoverable from the extracted text; surviving fragments: 23.5, 27.6]
17 Overall Speedup over BASE with Cyclic PPAS
- Only 52% of regions are scheduled with cyclic PPAS
- Overall, 4-wide cyclic PPAS is 10% better than BASE and 6-wide cyclic PPAS is 4% better than BASE
18 Summary of PPAS
- PPAS significantly reduces resource requirements in predicated cyclic code, but can cause conflicts
  - the compiler maximizes sharing in view of expected conflicts
  - the PPAS architecture detects and recovers from conflicts
- PPAS improves performance by (cmpplat=3, CDRL=0; surviving column labels: Cyclic, Overall, vs. Base, vs. DPAS):
  4-wide   20%   10%   6%
  6-wide   8%    4%    3%
- For further discussion, see Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004
Backup Foils
21 Resource Conflict Detection and Recovery Unit
Example input: A1 if 1, A2 if 1, A3 if 1, A4 if 0, A5 if 1 (each predicate shown with its runtime value)
Design alternatives to dispatch conflicting operations:
- One operation per assigned FU
  Cycle  ALU1  ALU2
  0      A1    A5
  1      A2
  2      A3
- One operation per any FU (not evaluated)
  Cycle  ALU1  ALU2
  0      A1    A5
  1      A2    A3
Conflict Detection and Recovery Latency (CDRL):
- CDRL = 0
  Cycle  ALU1  ALU2
  0      A1    A5
  1      A2
  2      A3
- CDRL = 1
  Cycle  ALU1  ALU2
  0      conflict detected (dispatch bubble)
  1      A1    A5
  2      A2
  3      A3
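The "one operation per assigned FU" policy and the effect of CDRL can be sketched with a small simulation. This is an illustrative model, not a description of the actual hardware: the `dispatch` function is hypothetical, the assignment of A1 to A3 to ALU1 and A5 to ALU2 is read off the slide's table layout, and the FU given to the nullified A4 is an arbitrary assumption.

```python
def dispatch(ops, cdrl):
    """Return one dict per cycle mapping FU name -> dispatched op.

    ops:  list of (name, predicate value, assigned FU)
    cdrl: bubble cycles inserted when a conflict is detected
    """
    queues = {}
    for name, pred, fu in ops:
        if pred:  # ops with a false predicate are nullified
            queues.setdefault(fu, []).append(name)
    depth = max((len(q) for q in queues.values()), default=0)
    conflict = depth > 1  # some FU received more than one live op
    rows = [{} for _ in range(cdrl)] if conflict else []  # dispatch bubbles
    for i in range(depth):
        rows.append({fu: q[i] for fu, q in queues.items() if i < len(q)})
    return rows


OPS = [
    ("A1", 1, "ALU1"),
    ("A2", 1, "ALU1"),
    ("A3", 1, "ALU1"),
    ("A4", 0, "ALU2"),  # FU assignment for A4 is an assumption
    ("A5", 1, "ALU2"),
]

for cdrl in (0, 1):
    rows = dispatch(OPS, cdrl)
    print(cdrl, rows, "delay =", len(rows) - 1)
```

In this model the delay relative to a conflict-free single dispatch cycle is the number of rows minus one, which agrees with delay_cycles = CDRL + dispatch_cycles - 1 for both CDRL settings.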
22 Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)