Compiler Optimizations for Modern Hardware Architectures - Part II Bob Wall CS 550 (Fall 2003) Class Presentation Compiling for the Intel® Itanium® – A Processor of EPIC Proportions
Acknowledgements As part of the presentation, I refer to slides from a presentation by Dr. Yong-fong Lee of Intel, "An Overview of IA-64 Architectural Features and Compiler Optimization“, presented at the 14th International Conference on Parallel and Distributed Computing Systems (PDCS-2001), Aug. 8-10, 2001, Richardson, TX. A link to the PDF file is on my presentation site – and here too:
Explicitly Parallel Instruction-set Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word). EPIC features: –Parallel instruction encoding –Flexible instruction grouping –Large register file –Predicated instruction set –Control and data speculation –Compiler control of memory hierarchy What is EPIC?
History A timeline of architectures: Hennessy and Patterson were early RISC proponents. VLIW (very long instruction word) architectures, –Alan Charlesworth (Floating Point Systems) –Josh Fisher (Yale / Multiflow – Multiflow Trace machines) –Bob Rau (TRW / Cydrome – Cydra-5 machine) Hard-wired discrete logic Microprocessor (advent of VLSI) CISC (microcode) RISC (microcode bad!) VLIW Superpipelined, Superscalar EPIC
History (cont.) Superscalar processors –Tilak Agerwala and John Cocke of IBM coined term –First microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in –Intel Pentium in Foundations of EPIC - HP –HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture. – In 1990 Bill Worley started the PA-Wide Word project (PA-WW). Josh Fisher, also hired by HP, made contributions to these projects. –HP teamed with Intel in 1994 – announced EPIC, IA-64 architecture in –First implementation was Merced / Itanium, announced in 1999.
IA-64 / Itanium IA-64 is not a complete architecture – it is more a set of high-level requirements for an architecture and instruction set, co-developed by HP and Intel. Itanium is Intel’s implementation of IA-64. HP announced an IA-64 successor to their PA-RISC line, but apparently abandoned it in favor of joint development of the Itanium 2. Architectural features: –No complex out-of-order logic in processor – leave it for compiler –Provide large register file, multiple execution units (again, compiler has visibility to explicitly manage) –Add hardware support for advanced ILP optimization: predication, control and data speculation, register rotation, multi-way branches, parallel compares, visible memory hierarchy –Load / store architecture – no direct-to-memory instructions
Hardware Resources General (user-accessible) registers – bit general registers; each has 1-bit NaT flag attached (GR0 hardwired to 0) – bit floating-point registers –64 1-bit predicate registers (p0 hardwired to 0) –8 64-bit branch registers Special application registers –Current Frame Marker (CFM), Loop Counter (LC), Epilog Counter (EC) Advanced Load Address Table (ALAT) Register Stack Engine (RSE)
IA-64 Instruction Formats Instruction Bundle – three instructions per 128-bit “word” Template indicates how each instruction is used –Instructions are divided into types: –Predefined templates Instr 2 – 41 bitsInstr 1 – 41 bitsInstr 0 – 41 bitstemplate – 5 bits F (float) L+X (long immed.) M (memory) I (integer) B (branch) Branch – MIB, MMB, MFB, MBB, BBB All come with or without stop at end Regular – MII, MLX, MMI, MFI, MMF Stop – MI_I, M_MI (treated as separate groups)
Instruction Groups Sequence of instructions with no register dependencies (can all be executed in parallel) –Write after read dependencies are allowed –Memory operations still require sequential execution Stops delimit groups, so a group of parallel instructions can span bundles (allows new hardware implementation to add functional units – it would still be able to execute code compiled for a smaller number of units). –Itanium 2 can issue up to six instructions simultaneously
The IA-64 includes 64 special-purpose registers that can be used to predicate most of the normal instructions. For example, this allows the following conversion: Most special compare / test instructions write the result and its complement into two different predicate registers Predication converts control dependencies into data dependencies, allowing more scheduling options Predication if (a > b) then c = c + a else c = c + b cmp.lt p1, p2 = a, b (p1) c = c + a (p2) c = c + b
Predication (cont.) The removal of branches has the following effects: –Reduces branch mis-predicts and pipeline bubbles –Improves instruction fetch efficiency –Better utilizes wide-issue instruction efficiency The down-side –Instruction cache “polluted” with non-executed code –Execution units used to execute instructions with false predicates –Need to determine when it will be beneficial – determine hard-to- predict branches
Parallel Compares Used in conjunction with predication – allows multiple checks to be executed in a single instruction group Add.and and.or suffixes to comparison instructions Example: –If ( A & B & C ){ S }; –cmp.eq p1 = r0,r0 ;; // initialize p1=1 cmp.ne.and p1 = rA,0 cmp.ne.and p1 = rB,0 cmp.ne.and p1 = rC,0 (p1) S
Multi-way Branches Also used in conjunction with predication – allows multiple branch targets within a single instruction group Example: –(p1) br.cond target_1 (p2) br.cond target_2 (p3) br.call b1 Takes the branch associated with the first true condition, or falls through if no predicate set
Control Speculation The IA-64 instruction set allows speculative loads – load a register with a value from memory, but do not generate any exceptions. Exceptions can be deferred; when the program is ready to use the register, it executes an instruction that checks - if an exception was generated, it branches to compiler-generated patch-up code to re-execute the load and jump back to the point where execution should continue. Allows “hoisting” of loads above branches – provides more options for concurrent instruction scheduling. In conjunction with parallel instruction execution, masks memory latency.
Control Speculation (cont.) Hardware implementation –Add extra NaT bit to each general register, special NaTVal value for floating point register –Add speculative load (ld.s) instruction (sets NaT if exception occurs), and speculation check (chk.s) instruction to determine if exception occurred earlier (conditional branch if NaT). –NaT propagates through operations The down-side –Increased code size –Extended live range for register –Might unnecessarily execute ld.s –Don’t speculatively move load out of unlikely block
Data Speculation The IA-64 instruction set also allows advanced loads – load a register from memory, but remember that it might contain a value that is subsequently overwritten by a store. The program can later check to see if the overwrite took place – if so, it can reload the value and continue. Allows “hoisting” of loads above stores – also provides more opportunities for parallelism and masking of memory latency.
Data Speculation (cont.) Hardware implementation –Add advance load (ld.a) instruction – perform the load, but store the physical address and target register in the ALAT. (Also combined ld.sa.) –Add check load (ld.c) instruction – check whether the specified address / register are in the ALAT. If not, branch to compiler- generated patch-up code to re-execute the load and jump back. –ALAT is an associative lookup table. Note that it works on physical addresses, rather than virtual ones. The down-side –Increased code size –Extended live range for register –Need to avoid for likely conflicts
Software Pipelining Similar to hardware pipelining – overlap executions of different pieces of a loop. Requires multiple execution units. Problem with loop unrolling / pipelining – requires a lot of registers. Hardware support: –Rotating predicate registers to embed the prolog and epilog of the software pipeline within the main loop – minimizes code expansion. –Rotating registers can be used to avoid register renaming within unrolled loops, further minimizing code expansion –Loop Count (LC) and Epilog Count (EC) registers, special loop branch instruction to rotate registers
Software Pipelining (cont.) Modulo scheduling algorithm –Divide loop into prolog (filling pipeline), kernel (steady state), and epilog (flush pipeline) Issues –Loops with multiple exits (either need to convert to single exit or generate additional epilogs) –Complicates control and data speculation
The Register Stack Portion of process address space set aside as backing store for registers – spill to this area. Processor supports a register stack frame similar to normal memory stack frame. –First 32 registers are static. Remaining 96 form the register stack. –Current Frame Marker (CFM) and Top of Stack (TOS) registers used to manipulate stack frames –Called routine calls alloc instruction to indicate how many registers it requires. This allocates the next frame and updates the registers. –Code uses virtual register names (i.e. always uses register 32 as the first register in its frame) – processor maps to physical register
The Register Stack (cont.) –Register frame divided into fixed region, rotating region (if needed), and output region (for passing parameters to other functions) –The incoming / outgoing registers allow up to 8 parameters – can eliminate any memory access for procedure calls. When register stack overflows (wraps around all 96 registers), the Register Stack Engine (RSE) automatically stalls the processor and spills registers to the backing store. The RSE also automatically fills registers from backing store when room is available. Flexibility in IA-64 spec to make RSE more aggressive – can asynchronously spill / fill when memory bandwidth available.
Memory Hierarchy Control Compiler can insert prefetch hints (lfetch) in instruction stream. Load, store, and prefetch instructions can include post- increment – will also cause cache load of line containing post-incremented address
What’s a Compiler to Do? Many new architectural features are based on improving opportunities for ILP. Obviously, there are high expectations for the compiler to perform good instruction scheduling. Scheduler should probably drive many other optimizations. Tradeoff between ILP and code expansion. The choice of making the compiler do a lot of the work is probably a good one – it has much more context available to make decisions. However, it lacks the knowledge of the dynamic run-time environment.
Conclusions x86 architecture – NOT COOL!!! IA-64 architecture – WAY COOL!!! Could fuel a new round of “compiler wars”? Future directions (some speculation on my part): –Out of order execution in hardware –Possibly introduce some dynamic optimization capability – base speculation on runtime profile (i.e. remove speculation if inside a rarely executed block)
And Finally … Q: How many Intel architects does it take to change a lightbulb ? A: None, they have a predicating compiler that eliminates lightbulb dependencies. If the dependencies are not entirely eliminated, they have four levels of prediction to determine if you need to replace the lightbulb. Q: How many AMD architects does it take to change a lightbulb? A: I have no idea, but they'll do anything to stick twice as many lightbulbs into the same socket and ensure that it glows brighter than the bulb Intel replaced. Q: How many Apple PowerPC architects does it take to change a lightbulb? A: Our Power G4 processor at 733MHz, with the phenomenally powerful velocity engine(TM) technology, has the power to burn. It'll burn CDs, DVDs, and probably light up your room as well. You don't need a lightbulb. Oh, by the way, it'll burn Pentiums too, so you don't need Intel to replace your lightbulb. (Borrowed from