Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research
2 Outline
  Predicated Execution
  Diverge-Merge Processor (DMP)
  Implementation of DMP
  Experimental Evaluation
  Conclusion
3 Predicated Execution
Convert control flow dependence to data dependence.
Source code:
  if (cond) { b = 0; } else { b = 1; }
(normal branch code)
  p1 = (cond)
  branch p1, TARGET
  mov b, 1
  jmp JOIN
  TARGET: mov b, 0
(predicated code)
  p1 = (cond)
  (!p1) mov b, 1
  (p1) mov b, 0
[Figure: control-flow graph with blocks A, B, C, D and taken (T) / not-taken (N) edges, shown next to the branch code and the predicated code.]
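The conversion on this slide can be sketched in Python. This is a minimal illustration, not the ISA's semantics: `predicated_mov` is a hypothetical helper that models a predicated `mov` as an instruction that always executes but writes its destination only when its predicate is true.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves is executed."""
    if cond:
        b = 0  # TARGET: mov b, 0
    else:
        b = 1  # mov b, 1
    return b


def predicated_mov(pred: bool, old_value: int, new_value: int) -> int:
    """Hypothetical model of a predicated mov: write only if pred is true."""
    return new_value if pred else old_value


def predicated_version(cond: bool, b: int = -1) -> int:
    """Predicated code: both moves are fetched; the predicates pick the result."""
    p1 = cond                          # p1 = (cond)
    b = predicated_mov(not p1, b, 1)   # (!p1) mov b, 1
    b = predicated_mov(p1, b, 0)       # (p1)  mov b, 0
    return b
```

Both versions produce the same value of `b` for either outcome of `cond`; the predicated version simply has no branch to mispredict.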
4 Benefit of Predicated Execution
Predicated Execution can be high performance and energy-efficient.
[Figure: blocks A to F flowing through the pipeline stages Fetch, Decode, Rename, Schedule, RegisterRead, Execute, comparing Predicated Execution with Branch Prediction. With branch prediction, a misprediction causes a pipeline flush; the predicated version has no flush, and the instructions whose predicate is false act as nops.]
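The trade-off on this slide can be put into a rough back-of-the-envelope model. The sketch below is an assumption for illustration only, not a model from the talk: it compares the expected flush cost per branch with the cost of fetching the instructions that predication wastes, and it ignores the extra data dependences on the predicate. All function and parameter names are hypothetical.

```python
def flush_cost(misprediction_rate: float, flush_penalty: int) -> float:
    """Expected cycles lost per branch to pipeline flushes."""
    return misprediction_rate * flush_penalty


def predication_cost(wasted_instructions: int, fetch_width: int) -> float:
    """Cycles spent fetching instructions whose predicate turns out false."""
    return wasted_instructions / fetch_width


def predication_wins(misprediction_rate: float, flush_penalty: int,
                     wasted_instructions: int, fetch_width: int) -> bool:
    """True if the assumed predication overhead is below the expected flush cost."""
    return (predication_cost(wasted_instructions, fetch_width)
            < flush_cost(misprediction_rate, flush_penalty))
```

For example, with an assumed 10% misprediction rate, a 30-cycle flush, and 8 wasted instructions on an 8-wide machine, the flush costs about 3 cycles per branch against 1 cycle for predication. For a branch mispredicted only 1% of the time, the same overhead no longer pays off.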
5 Limitations/Problems of Predication
ISA: Predicate registers and predicated instructions
  Dynamic-Hammock Predication [Klauser ’98] can solve this problem, but it is only applicable to simple hammocks.
Adaptivity: Static predication is not adaptive to run-time branch behavior.
  Branch behavior changes based on input set, phase, and control-flow path.
  Wish Branches [Kim ’05]
Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  Function calls, loops, many instructions inside a region, and complex CFGs
  Hyperblock [Mahlke ’92] cannot adapt to frequently-executed paths dynamically.
7 Diverge-Merge Processor (DMP)
DMP can dynamically predicate complex branches (in addition to simple hammocks).
The compiler identifies:
  Diverge branches
  Control-flow merge (CFM) points
The microarchitecture decides when and what to predicate dynamically.
8 Dynamic Predication
Klauser et al. [PACT’98]: Dynamic-hammock predication
Code:
  p1 = (cond)
  branch p1, TARGET
  mov R1, 1
  jmp JOIN
  TARGET: mov R1, 0
  JOIN: add R5, R1, 1
When the branch is low-confidence, both sides are executed with renamed destinations:
  (mov R1, 1)  PR10 = 1
  (mov R1, 0)  PR11 = 0
  select-µops (φ-nodes in SSA): PR12 = (cond) ? PR11 : PR10
[Figure: control-flow graph with blocks A, B, C, H and taken (T) / not-taken (N) edges.]
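The role of the select-µop can be sketched as a tiny simulation. This is a minimal sketch that assumes a dictionary as the physical register file; the register names follow the slide, while the function name is hypothetical.

```python
def dynamic_hammock(cond: bool) -> dict:
    """Execute both sides of the hammock, then merge with a select-uop."""
    pr = {}                     # physical register file: name -> value
    pr["PR10"] = 1              # not-taken side: mov R1, 1
    pr["PR11"] = 0              # taken side (TARGET): mov R1, 0
    # select-uop: keep the value produced on the side the branch really takes
    pr["PR12"] = pr["PR11"] if cond else pr["PR10"]
    pr["R5"] = pr["PR12"] + 1   # JOIN: add R5, R1, 1 reads the merged R1
    return pr
```

Whichever way `cond` resolves, the instruction at `JOIN` reads a single register (`PR12`), so no flush is needed.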
9 Diverge-Merge Processor
[Figure: control-flow graph with blocks A to H. A ends in the diverge branch and H is the CFM point. Edges are marked as frequently executed path or not frequently executed path. Label: Insert select-µops.]
10 Diverge-Merge Processor
[Figure: control-flow graph with blocks A to H, with frequently executed and not frequently executed paths, and a legend for diverge-branch, executed block, and CFM point.]
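11 Control-Flow Graphs
[Table: which CFG types each mechanism can handle. CFG types: a simple hammock, a nested hammock, a frequently-hammock, a loop, a non-merging. Mechanisms: DMP, Dynamic-hammock, SW pred., Wish br., Dual-path. The table entries were not preserved.]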
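12 Dual-path Execution vs. DMP
[Figure: after a low-confidence branch in A, dual-path execution fetches path 1 (C, D, E, F) and path 2 (B, D, E, F) separately. DMP fetches path 1 (C) and path 2 (B) only up to the CFM point and then fetches D, E, F once.]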
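13 Control-Flow Graphs
[Table: which CFG types each mechanism can handle. CFG types: a simple hammock, a nested hammock, a frequently-hammock, a loop, a non-merging. Mechanisms: DMP, Dynamic-hammock, SW pred., Wish br., Dual-path. One entry reads "sometimes"; the other table entries were not preserved.]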
14 Distribution of Mispredicted Branches 66% of mispredicted branches can be dynamically predicated in DMP.
17 Fetch Mechanism
[Figure: control-flow graph with blocks A to H, where A ends in the diverge branch and H is the CFM point. Labels: predicted path, Low Confidence, Round-robin fetch.]
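The fetch policy named on this slide can be sketched as follows. This is a minimal sketch under stated assumptions: each path is a list of basic-block names that ends just before the CFM point, the predicted path is fetched first in each round, and the example paths are hypothetical rather than taken from the figure.

```python
from itertools import zip_longest


def fetch_order(low_confidence: bool, predicted_path: list,
                other_path: list, cfm: str) -> list:
    """Return the order in which basic blocks are fetched after a diverge branch."""
    if not low_confidence:
        # High confidence: ordinary branch prediction, predicted path only.
        return predicted_path + [cfm]
    order = []
    # Low confidence: alternate between the two paths until both reach the CFM point.
    for predicted_block, other_block in zip_longest(predicted_path, other_path):
        if predicted_block is not None:
            order.append(predicted_block)
        if other_block is not None:
            order.append(other_block)
    order.append(cfm)  # after the merge, fetch continues on a single path
    return order
```

With a hypothetical predicted path `["C"]`, other path `["B", "E"]`, and CFM point `"H"`, a low-confidence branch yields the order C, B, E, H, while a high-confidence branch yields C, H.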
18 Dynamic Predication
Original code:
  branch r0, C
  add r1 ← r3, #1
  add r1 ← r2, #-1
  add r4 ← r1, r3
Renamed code:
  branch pr10, C        p1 = pr10
  add pr21 ← pr13, #1   (p1)
  add pr31 ← pr12, #-1  (!p1)
  select-µop pr41 = p1 ? pr21 : pr31
  add pr24 ← pr41, pr13
Forks RAT, RAS, and GHR
[Figure: blocks A, B, C, E, H and two register alias tables, RAT1 and RAT2, with columns Arch., Phy., M. R2 maps to PR12 and R3 to PR13; the physical registers PR11, PR21, and PR41 appear for R1.]
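The renaming example on this slide can be sketched in Python. This is a simplified sketch, not the hardware algorithm: it assumes each RAT is a dictionary, it finds the registers to merge by comparing the two forked RATs rather than by tracking the M column, and the helper names and the register-name generator are hypothetical.

```python
from itertools import count


def rename_path(rat: dict, instructions: list, fresh) -> tuple:
    """Rename one path on a forked copy of the RAT.

    Each instruction is (dest_arch_reg, [source_arch_regs]).
    """
    forked = dict(rat)  # fork: this path gets a private copy of the RAT
    renamed = []
    for dest, sources in instructions:
        physical_sources = [forked[s] for s in sources]
        forked[dest] = next(fresh)  # allocate a new physical register
        renamed.append((forked[dest], physical_sources))
    return forked, renamed


def make_select_uops(rat1: dict, rat2: dict, fresh, pred: str = "p1") -> tuple:
    """At the CFM point, merge every register that the two paths map differently."""
    merged = dict(rat1)
    uops = []
    for arch in sorted(rat1):
        if rat1[arch] != rat2[arch]:
            dest = next(fresh)
            uops.append(f"{dest} = {pred} ? {rat1[arch]} : {rat2[arch]}")
            merged[arch] = dest
    return merged, uops


# Usage mirroring the slide; names PR21, PR31, PR41 come from the generator below.
rat = {"R1": "PR11", "R2": "PR12", "R3": "PR13"}
fresh = (f"PR{n}1" for n in count(2))
rat1, _ = rename_path(rat, [("R1", ["R3"])], fresh)   # add r1 <- r3, #1   (p1)
rat2, _ = rename_path(rat, [("R1", ["R2"])], fresh)   # add r1 <- r2, #-1  (!p1)
merged_rat, select_uops = make_select_uops(rat1, rat2, fresh)
```

After the merge, `R1` maps to the select-µop's destination, so the instruction after the CFM point reads one register regardless of which path was correct.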
19 DMP Support
ISA Support
  Mark diverge branches/CFM points.
Compiler Support [CGO’07]
  The compiler identifies diverge branches and the corresponding CFM points.
Hardware Support
  Confidence estimator
  Fetch mechanisms
  Load/store processing
  Instruction retirement
  Dynamic predication
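20 Hardware Complexity Analysis
[Table: hardware support required by each mechanism. Mechanisms: SW pred., Wish br., Dyn. ham., Dual path, Multi path, DMP. Hardware features: ST-LD Forwarding, Select-Uop Gen., Rename Support, Front-End Check Flush/no Flush, Predicate Registers, Confidence Estimator. The table entries were not preserved.]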
22 Simulation Methodology
12 SPEC 2000 INT, 5 SPEC 95 INT
  Different input sets for profiling and evaluation
Alpha ISA execution-driven simulator
Baseline processor configuration
  64KB perceptron predictor/O-GEHL (paper)
  Minimum 30-cycle branch misprediction penalty
  8-wide, 512-entry instruction window
  2 KB 12-bit history enhanced JRS confidence estimator
Less aggressive processor (paper)
Power model using Wattch
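A JRS-style confidence estimator can be sketched as a table of resetting counters indexed by the branch PC XORed with global branch history. The sketch below is an illustration under assumptions, not the evaluated design: the 4-bit counters, the threshold of 15, and the class and method names are hypothetical, and the "enhanced" variant on the slide may differ. With these assumed parameters the table holds 4096 × 4 bits = 2 KB.

```python
class JRSConfidenceEstimator:
    """Resetting-counter confidence estimator (assumed parameters, see lead-in)."""

    def __init__(self, index_bits: int = 12, counter_bits: int = 4,
                 threshold: int = 15):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)
        self.history = 0  # global branch history, index_bits wide

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def is_high_confidence(self, pc: int) -> bool:
        """A branch is high-confidence after enough consecutive correct predictions."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, prediction_correct: bool, taken: bool) -> None:
        i = self._index(pc)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)  # saturate
        else:
            self.table[i] = 0  # a misprediction resets the counter
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

In DMP, a diverge branch that such an estimator reports as low-confidence is the trigger for dynamic predication.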
23 Different CFG types
24 Performance Improvement
25 Energy Consumption
27 Conclusion
DMP introduces the concept of frequently-hammocks, and it dynamically predicates complex CFGs.
DMP can overcome the three major limitations of software predication: ISA support, adaptivity, and complex CFGs.
DMP reduces branch mispredictions energy-efficiently.
  19% performance improvement, 9% less energy
DMP divides the work between the compiler and the microarchitecture:
  The compiler analyzes the control-flow graphs.
  The microarchitecture decides when and what to predicate dynamically.
Thank You!!
Questions?