Code Size Efficiency in Global Scheduling for ILP Processors TINKER Research Group Department of Electrical & Computer Engineering North Carolina State University Huiyang Zhou, Tom Conte
2 Outline: Introduction; Quantitative measure of code size efficiency; Best code size efficiency for a given code size limit; Optimal code size efficiency for a program; Summary; Future work
3 Introduction
Instruction level parallelism (ILP) vs. static code size
– Region-enlarging optimizations usually enhance ILP
  Cyclic scheduling: loop unrolling, loop peeling, etc.
  Acyclic scheduling: tail duplication, recovery code, etc.
I-cache and ITLB performance vs. static code size
– Larger code usually means a larger I-cache footprint
Trade-off between the conflicting effects of code size increase
– Especially in acyclic global scheduling
4 Background of Treegion Scheduling
Treegion scheduling
– An acyclic scheduling technique
– Two phases: treegion formation, then treegion-based instruction scheduling, i.e. Tree Traversal Scheduling (TTS) (HPCA-4, LCPC’01)
Treegion
– Basic scheduling unit
– A single-entry / multiple-exit nonlinear region whose CFG forms a tree (i.e., no merge points and no back-edges in a treegion)
[Figure: CFG of basic blocks BB1 to BB6 partitioned into treegions Tree1 and Tree2.]
5 Background of Treegion Scheduling: treegion examples
Natural treegion: a treegion formed without tail duplication (i.e., no code size increase during natural treegion formation)
[Figure: the CFG of BB1 to BB6 shown without and with tail duplication; labels include the duplicated blocks BB4’, BB5’, BB6’ and the treegions Tree1, Tree2, and Tree1’.]
6 Code Size Effects in Treegion Scheduling
Tail duplication increases code size
General operation combining reduces code size
[Figure: CFG with BB1, BB2, BB3, BB5, BB6, BB4’, BB5’ before and after operation combining. Before: R1=R3+R4 appears along with a redundant R7=R3+R4 followed by R9=R7*4. After: the redundant R7=R3+R4 is removed and the dependent operation becomes R9=R1*4.]
7 Quantitative Measure of Code Size Efficiency: ILP vs. static code size
Havanki’s heuristic: a previously proposed treegion formation heuristic [HPCA-4].
8 Code Size Efficiency for Any Code Size Related Optimizations Use the ratio of IPC changes over code size changes as an indication of code size efficiency. –Average code size efficiency –Instantaneous code size efficiency
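The two ratios can be sketched in a few lines of Python. The slide does not give formulas, so the definitions below are an assumed reading: average efficiency is measured from the starting point (A0 on the next slide) and instantaneous efficiency from the immediately preceding point. The names `average_efficiency` and `instantaneous_efficiency` and the sample points are hypothetical.

```python
from typing import Sequence, Tuple

# Each point is (code_size, static_ipc); points[0] is the baseline before
# any code-size-increasing optimization. This layout is an assumption.
Point = Tuple[float, float]


def average_efficiency(points: Sequence[Point], i: int) -> float:
    """IPC change over code size change, measured from the baseline point."""
    if i < 1:
        raise ValueError("need a point after the baseline")
    size0, ipc0 = points[0]
    size_i, ipc_i = points[i]
    return (ipc_i - ipc0) / (size_i - size0)


def instantaneous_efficiency(points: Sequence[Point], i: int) -> float:
    """IPC change over code size change, measured from the previous point."""
    if i < 1:
        raise ValueError("need a previous point")
    size_prev, ipc_prev = points[i - 1]
    size_i, ipc_i = points[i]
    return (ipc_i - ipc_prev) / (size_i - size_prev)


if __name__ == "__main__":
    # Hypothetical curve: relative code size vs. static IPC.
    curve = [(1.00, 2.0), (1.02, 2.4), (1.05, 2.5)]
    print(average_efficiency(curve, 2))
    print(instantaneous_efficiency(curve, 2))
```

In this made-up curve the second step adds more code than the first but gains less IPC, so the instantaneous efficiency drops while the average efficiency stays higher.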
9 Average and Instantaneous Code Size Efficiency
[Figure: static IPC vs. code size curve through the points A0, A1, A2, A3, A4.]
10 Estimate Static IPC Before Scheduling
Use the expected execution time to calculate the static IPC.
For a multi-path region: [equation missing in source]
Now, IPC changes can be calculated as the execution time saved by the optimization.
[Example figure: tree1, tree2, and Tree1’.]
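The formula for a multi-path region is not preserved in the slide text. A minimal sketch, assuming the usual probability-weighted form (each path's estimated length in cycles weighted by its execution probability), is shown below. The function names, the `(probability, cycles)` representation, and the definition of static IPC as expected operations over expected cycles are all assumptions, not the authors' stated formulas.

```python
import math
from typing import Sequence, Tuple

# A path is (execution_probability, estimated_cycles). Assumed representation.
Path = Tuple[float, float]


def expected_time(paths: Sequence[Path]) -> float:
    """Probability-weighted execution time of a multi-path region."""
    total_prob = sum(p for p, _ in paths)
    if not math.isclose(total_prob, 1.0):
        raise ValueError("path probabilities should sum to 1")
    return sum(p * cycles for p, cycles in paths)


def static_ipc(expected_ops: float, paths: Sequence[Path]) -> float:
    """Assumed definition: expected operations per expected cycle."""
    return expected_ops / expected_time(paths)


def time_saved(before: Sequence[Path], after: Sequence[Path]) -> float:
    """Expected cycles saved by an optimization such as tail duplication."""
    return expected_time(before) - expected_time(after)


if __name__ == "__main__":
    # Hypothetical region: duplication shortens the hot path from 10 to 8 cycles.
    before = [(0.7, 10.0), (0.3, 12.0)]
    after = [(0.7, 8.0), (0.3, 12.0)]
    print(expected_time(before), expected_time(after), time_saved(before, after))
```

Because the estimate needs only path probabilities and path lengths, it can be evaluated for each candidate before any scheduling is done.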
11 Optimal Code Size Efficiency For A Given Code Size Limit
With the code size fixed, try to maximize the static IPC, i.e., maximize the average code size efficiency.
[Figure: static IPC vs. code size, with the natural treegion point and the size limit marked.]
12 Optimal Tail Duplication Under Code Size Constraint
1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
2. Find the one with the best code size efficiency.
3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates that are affected by the tail duplication process.
4. Repeat steps 2-3 until the code size limit is reached.
[Figure: IPC vs. relative code size, with the code size limit marked.]
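The four steps can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' implementation: `Candidate`, `greedy_tail_duplication`, and the `on_duplicate` callback are hypothetical names, the gain and cost numbers are made up, and the loop stops at the first best candidate that does not fit the budget (skipping it and trying the next one is another possible reading of step 4).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    size_cost: float  # code size added by duplicating this candidate (> 0)
    ipc_gain: float   # estimated static IPC gain from duplicating it

    @property
    def efficiency(self) -> float:
        """Instantaneous code size efficiency of this candidate."""
        return self.ipc_gain / self.size_cost


def greedy_tail_duplication(
    candidates: List[Candidate],
    size_budget: float,
    on_duplicate: Optional[Callable[[Candidate, List[Candidate]], None]] = None,
) -> Tuple[List[str], float]:
    remaining = list(candidates)
    chosen: List[str] = []
    used = 0.0
    while remaining:
        # Step 2: pick the candidate with the best efficiency.
        best = max(remaining, key=lambda c: c.efficiency)
        # Step 4 (assumed reading): stop once the best candidate no longer fits.
        if used + best.size_cost > size_budget:
            break
        # Step 3: perform the duplication and account for its size.
        remaining.remove(best)
        chosen.append(best.name)
        used += best.size_cost
        # Step 3: let the caller refresh the estimates of affected candidates.
        if on_duplicate is not None:
            on_duplicate(best, remaining)
    return chosen, used


if __name__ == "__main__":
    cands = [
        Candidate("A", size_cost=2.0, ipc_gain=0.4),
        Candidate("B", size_cost=1.0, ipc_gain=0.1),
        Candidate("C", size_cost=4.0, ipc_gain=1.2),
    ]
    print(greedy_tail_duplication(cands, size_budget=6.0))
```

The callback is where a real compiler would recompute the gain and cost of candidates whose treegions changed as a result of the duplication; here it is left to the caller.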
13 Processor Model Specification
Execution: dispatch/issue/retire bandwidth: 8; universal function units: 8; operation latency: ALU, ST, BR: 1 cycle; LD, floating-point (FP) add/subtract: 2 cycles
I-cache: compressed (zero-nop); two banks, each 2-way 16KB; line size: 16 operations of 4 bytes each; miss latency: 12 cycles
D-cache: size/associativity/replacement: 64KB/4-way/LRU; line size: 32 bytes; miss penalty: 14 cycles
Branch predictor: G-share style multiway branch prediction [20]; branch prediction table: 2^14 entries; branch target buffer: 2^14 entries/8-way/LRU; branch misprediction penalty: 10 cycles
14 Results: ILP vs. Code Size
[Figure: results for the series labeled 0%, 2%, 5%, 80%, 30%.]
15 Results: ILP vs. Code Size (cont.)
[Figure: results for the series labeled 0%, 2%, 5%, 80%, 30%.]
Reason: only a very small part of the program is frequently executed.
16 Optimal Code Size Efficiency
Definition: the point where the ‘diminishing returns’ start
Finding the optimal code size efficiency
[Figure: IPC vs. relative code size, with points A and A’ and line l marked.]
17 Finding the Optimal Code Size Efficiency
Use a threshold on the first derivative of the IPC vs. code size curve, which is simply a threshold on the instantaneous code size efficiency.
K is the slope of line l.
[Figure: plot against relative code size with the levels 0, K, K1, and K2 marked; the threshold identifies point A or A’.]
18 Finding the Optimal Code Size Efficiency (cont.)
Meaning of K1 and K2: K1 and K2 are the slopes of the lines l1 and l2.
The range (K1 – K2) determines the robustness of the threshold scheme.
Point B: threshold as K1. Point C: threshold as K2.
[Figure: IPC vs. relative code size, with points A, B, C and lines l1 and l2 marked.]
19 Algorithm for Finding the Optimal Code Size Efficiency
1. Set the threshold k anywhere between tan(π/12) and tan(π/6).
2. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
3. If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate the candidate and update the efficiency of the affected candidates; repeat until there are no more such candidates.
When the expected execution time is used, the threshold scheme becomes [equation missing in source] (derivation details in ref [21]).
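The threshold scheme can be sketched as follows. This is a hedged illustration, not the derivation in ref [21]: `DupCandidate` and `threshold_tail_duplication` are hypothetical names, and efficiency is assumed to be relative IPC gain over relative code size increase so that slopes in the range tan(π/12) to tan(π/6) are meaningful. Taking the most efficient qualifying candidate first is a design choice of this sketch; the slide only requires that the candidate be above the threshold.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DupCandidate:
    name: str
    rel_ipc_gain: float       # assumed: IPC gain relative to the baseline IPC
    rel_size_increase: float  # assumed: size increase relative to baseline size

    @property
    def efficiency(self) -> float:
        """Instantaneous code size efficiency (slope of the IPC/size curve)."""
        return self.rel_ipc_gain / self.rel_size_increase


def threshold_tail_duplication(
    candidates: List[DupCandidate],
    threshold: float,
    on_duplicate: Optional[Callable[[DupCandidate, List[DupCandidate]], None]] = None,
) -> List[str]:
    remaining = list(candidates)
    chosen: List[str] = []
    while True:
        # Step 3: look for candidates whose efficiency exceeds the threshold.
        above = [c for c in remaining if c.efficiency > threshold]
        if not above:
            break
        best = max(above, key=lambda c: c.efficiency)
        remaining.remove(best)
        chosen.append(best.name)
        # Step 3: let the caller update the efficiency of affected candidates.
        if on_duplicate is not None:
            on_duplicate(best, remaining)
    return chosen


if __name__ == "__main__":
    # Step 1: any threshold between tan(pi/12) and tan(pi/6); pi/8 is in range.
    k = math.tan(math.pi / 8)
    cands = [
        DupCandidate("X", rel_ipc_gain=0.08, rel_size_increase=0.10),
        DupCandidate("Y", rel_ipc_gain=0.03, rel_size_increase=0.10),
        DupCandidate("Z", rel_ipc_gain=0.05, rel_size_increase=0.10),
    ]
    print(threshold_tail_duplication(cands, k))
```

Unlike the budget-driven loop, this version has no explicit size limit: duplication stops on its own once every remaining candidate falls below the slope threshold.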
20 Results for Optimal Code Size Efficiency
As the threshold varies from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately. m88ksim is used as an example.
[Figure: results for the series labeled 0%, 2%, 5%, 10%, 20%.]
21 I-Cache Impacts of the Code Size Increase Code size impacts and locality impacts (ref [3])
22 I-Cache Impacts of the Code Size Increase (cont.) Denser schedule of optimal efficiency results
23 I-Cache Impacts of the Code Size Increase (cont.) The combined impact
24 Processor Performance
On average, a significant speedup in dynamic IPC (17% over natural treegions) at the cost of a 2% code size increase.
25 Conclusions
Quantitative measure of the code size efficiency: the ratio of IPC changes over code size increase
Best code size efficiency for a given code size limit
– Results: significant but varying impact on IPC
Optimal efficiency: a simple yet robust threshold scheme to find the ‘knee’ of the curve
– Results: improved I-cache performance (4%); significant speedup (17%); moderate static code size increase (2%)
Future work
– Combine with other optimizations, e.g., loop unrolling.
26 Contact Information Huiyang Zhou Tom Conte TINKER Research Group North Carolina State University