IA-64 Microarchitecture --- Itanium Processor Jun Feng Jun Xie Huafeng Lü
Outline Introduction Pipeline Issue Performance Comparison Summary
Itanium Processor First implementation of IA-64 Compiler based exploitation of ILP Also has many features of superscalar
10-stage Pipeline Front-end Instruction delivery Operand delivery Execution
Front-end IPG, Fetch, Rotate Prefetches up to 32 bytes per cycle (2 bundles) into a prefetch buffer (up to hold 8 bundles) Branch prediction is done using a multilevel adaptive predictor
Instruction delivery EXP and REN Distributes up to 6 instructions to the 9 functional units Implements registers renaming for both rotation and register stacking
Operand delivery WLD and REG Accesses the register file Performs register bypassing Accesses and updates a register scoreboard Checks predicate dependences
Execution EXE, DET and WRB Executes instructions through ALUs and load/store units Detects exceptions and posts NaTs Retires instructions and performs write-back
Integer Performance SPECint benchmark: considerably slower Itanium is considerably slower than Alpha 21264 and Pentium 4. Only: 60% of of P4, 68% of Alpha Itanium: HP rx4610, 800MHz, 4MB off-chip L3 cache Alpha 21264: Compaq GS320, 1GHz, on-chip L2 cache Pentium 4: Compaq Precision 330, 2GHz, 256KB on-chip L2 cache
Floating Point Performance SPECfp benchmarks: a different story Itanium is quicker than Alpha 21264 and Pentium 4. 108% of of P4, 120% of Alpha Itanium: HP rx4610, 800MHz, 4MB off-chip, L3 cache Alpha 21264: Compaq GS320, 1GHz, on-chip L2 cache Pentium 4: Compaq Precision 330, 2GHz, on-chip L2 cache
Discussion on SPECfp Floating point app: competitive .higher degrees of ILP .aggressive memory system Art benchmark: 4 times of Pentium 4 Alpha: outperform when tuned In terms of power: worse than P4 56% of floating point performance per watt
Summary By Us Good floating point performance Poor integer performance Overall: not so good as Intel has advertised
Conclusion Large code size Only static instruction-level parallelism Cannot manage cache misses/hits flexibly Lack of applications