The IA-64 architecture and Itanium processors: Explicitly Parallel Instruction Computing. Frans Dondorp, presentation et 4 074, January 8th 2001.


1 The IA-64 architecture and Itanium processors
Explicitly Parallel Instruction Computing
Frans Dondorp
Presentation et 4 074, January 8th 2001

2 Contents
• Introduction to the IA-64 architecture and EPIC
• The Itanium processor
• Branch removal
• Predication
• Speculative execution
• Control speculation
• Comparison: ARM conditional instructions
• Data speculation

3 Introduction to the IA-64 architecture
• Joint research by Intel and Hewlett-Packard (1994)
• Exploitation of the ILP (instruction-level parallelism) concept
• Tight coupling of hardware and software
EPIC is introduced as the basic concept: Explicitly Parallel Instruction Computing.
This results in:
• a more complex task for the compiler
• hardware support for communication of meta-information: speculation, predication and branch hints
"The future of computing" (Intel web site)

4 The Itanium processor
The Itanium, formerly code-named Merced, is the first processor based on the IA-64 architecture.
• Still a prototype; compilers announced (as of Nov. 2000)
• 10-stage pipeline, running at 800 MHz
• To support EPIC, it is equipped with 4 ALUs, 4 MMX units, 4 FPUs (2 SP, 2 DP), 2 L/S units and 3 branch units
• MS Win2K and Linux announced (as of Oct. 2000)

5 IA-64 resources and instructions: register resources
GR (general registers): r0 ... r127; 128 registers of 64 + 1 bits; r0-r31 static, r32-r127 stacked / rotating
FR (floating-point registers): f0 ... f127; 128 registers of 82 bits; f32-f127 rotating
AR (application registers): ar0 ... ar127; 128 registers of 64 bits
BR (branch registers): b0 ... b7; 8 registers of 64 bits
PR (predicate registers): p0 ... p63; 64 registers of 1 bit; p16-p63 rotating
Notes:
• The extra bit on each GR marks a deferred exception (Not A Thing, NaT) and is used for control speculation.
• Branch registers hold function call linkage and return addresses (64-bit address space!).
• A predicate register holds the result of a conditional expression evaluation and is used for predication.
• Stacked and rotating registers provide support for the register stack and software pipelining.

6 IA-64 resources and instructions: instruction encoding
IA-64 "bundle" (3 x 41 + 5 = 128 bits):
    Instruction 2 (41 b) | Instruction 1 (41 b) | Instruction 0 (41 b) | Template (5 b)
Instruction format (41 bits):
    Op (14 b) | Reg 1 (7 b) | Reg 2 (7 b) | Reg 3 (7 b) | Predicate (6 b)
Example:
    { .mii
          ld8 r1 = 4[r2]
          add r3 = r1, r3
          shr r7 = r4, r12
    }
    { .mbb
          ld8 r6 = 8[r5]
    (p3)  br.cond Label1
    (p4)  br.cond Label2
    }
• Templates are used to group instructions so that parallel execution keeps the execution units busy.
• Predicates allow for conditional execution; 6 bits are used to address the 64 predicate registers.
• The Itanium processor issues 8 ops/clock.
[Figure: the MII and MBB bundles dispersed over the ALU, MMX, L/S, FP (SP/DP) and BR units]
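
The field widths on this slide can be checked with a small packing sketch. It is only an illustration: the field order follows the slide from left to right (which is an assumption and not necessarily the bit order of real IA-64 encodings), and the opcode and template numbers used in the usage note are placeholders, not real encodings.

```python
# Sketch of bundle packing using the field widths from the slide.
# Field order and all opcode/template values are illustrative assumptions.

SLOT_BITS = 41       # one instruction slot
TEMPLATE_BITS = 5    # template field
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_instruction(op: int, r1: int, r2: int, r3: int, pred: int) -> int:
    """Pack op(14) | reg1(7) | reg2(7) | reg3(7) | predicate(6) into 41 bits."""
    assert 0 <= op < (1 << 14)
    assert all(0 <= r < (1 << 7) for r in (r1, r2, r3))  # 128 registers
    assert 0 <= pred < (1 << 6)                          # 64 predicate registers
    return (op << 27) | (r1 << 20) | (r2 << 13) | (r3 << 6) | pred


def encode_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Pack three 41-bit slots and a 5-bit template into one 128-bit bundle."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert all(0 <= s <= SLOT_MASK for s in (slot0, slot1, slot2))
    return (
        (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))
        | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
        | (slot0 << TEMPLATE_BITS)
        | template
    )


def decode_bundle(bundle: int):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2
```

For example, `encode_bundle(3, encode_instruction(1, 1, 2, 0, 0), 0, 0)` yields an integer below `2**128`, and `decode_bundle` recovers the same template and slots.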

7 Branch removal
• Branch prediction is costly
• The cost of a misprediction is proportional to the pipeline length
Optimizing the use of prediction resources can significantly improve overall performance.
Conditional instructions can eliminate the need for branches.
With branches:
        cmp r1, r2
        beq equal
        mov r1, #0
        bal end
    .equal
        mov r2, #0
    .end
With conditional instructions:
        cmp r1, r2
        movne r1, #0
        moveq r2, #0
moveq executes only if the eq bit is set in the status register; otherwise it acts as a NOP (and movne the other way round).

8 Branch removal – Conditional instructions
Conditional instructions can reduce the branch penalty of a misprediction from N pipeline stages to 1.
• Implementing conditions directly in the instruction encoding increases instruction size, while the number of conditions that can be tested is limited (typically to a few bits in the processor status register)
• Unbalanced execution paths: conditional code can cost more cycles than a mispredicted branch would
See: ARM conditional instructions (slide 16)

9 Branch removal – Conditional instructions
Example: conditional code performance (one instruction executed each cycle)
Conditional code:
        cmp r1, r2
        moveq r1, #0
        addeq r2, r2, #10
        ldbeq r3, (r5)+
        inceq r3
        stbeq r3, (r5)+
        inceq r1
        mov r2, #0
    If r1 ≠ r2: 6 NOPs. LOSS: 6 cycles
Branch code:
        cmp r1, r2
        bne end
        mov r1, #0
        add r2, r2, #10
        ldb r3, (r5)+
        inc r3
        stb r3, (r5)+
        inc r1
    .end
        mov r2, #0
    If r1 ≠ r2 and the branch is mispredicted: pipeline flushed (branch penalty). LOSS: pipeline length
On a machine with a 5-stage pipeline, conditional instructions would lead to a performance loss here (6 cycles versus 5).
The compiler should decide!
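
The trade-off in this example can be written as a small cost model. This is a sketch under the slide's simplifying assumptions (one instruction per cycle, a misprediction costs the full pipeline length); the function names and the probability parameters are illustrative and not part of the presentation.

```python
def conditional_loss(n_conditional: int, p_condition_false: float) -> float:
    """Expected wasted cycles for conditional code.

    Every conditional instruction turns into a NOP when the condition is false.
    """
    return n_conditional * p_condition_false


def branch_loss(pipeline_length: int, p_mispredict: float) -> float:
    """Expected wasted cycles for branch code: a flush costs the pipeline length."""
    return pipeline_length * p_mispredict


def prefer_conditional(n_conditional: int, p_condition_false: float,
                       pipeline_length: int, p_mispredict: float) -> bool:
    """True if conditional code is expected to waste fewer cycles than a branch."""
    return (conditional_loss(n_conditional, p_condition_false)
            < branch_loss(pipeline_length, p_mispredict))
```

In the worst case of the slide (condition false and branch mispredicted, both with probability 1), six conditional instructions lose against a 5-stage pipeline but win against a 10-stage one, which is the kind of decision the compiler has to make.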

10 Predication
Predication: tagging instructions with a boolean value
          cmp.ne p1, p2 = r4, 0 ;;
    (p1)  add r1 = r2, r3
    (p1)  ld8 r6 = [r5]
• cmp.ne sets boolean values: compare r4 to 0 for "not equal"; p1 is TRUE if r4 ≠ 0, and p2 = NOT(p1)
• (p1) add: if r4 ≠ 0 then r1 = (r2 + r3)
• (p1) ld8: if r4 ≠ 0 then r6 = MEM(r5)
Predication reduces the limitations of conditional instructions: the number of conditions that can be tested equals the number of predicate registers.
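
The predicated sequence above can be mimicked with a toy interpreter. This is a minimal sketch, not a model of real hardware: the class `PredicatedMachine`, its method names and the sample register values are all made up for illustration. It assumes a compare that writes a predicate and its complement, and that p0 always reads as true.

```python
MASK64 = (1 << 64) - 1


class PredicatedMachine:
    """Toy model: an instruction takes effect only if its qualifying predicate is true."""

    def __init__(self):
        self.gr = [0] * 128        # general registers
        self.pr = [False] * 64     # predicate registers
        self.pr[0] = True          # p0 always reads as true
        self.mem = {}              # address -> 64-bit value

    def _set_pr(self, index: int, value: bool) -> None:
        if index != 0:             # writes to p0 are discarded
            self.pr[index] = value

    def cmp_ne(self, p_true: int, p_false: int, reg: int, imm: int, qp: int = 0) -> None:
        """cmp.ne p_true, p_false = reg, imm: write the result and its complement."""
        if self.pr[qp]:
            result = self.gr[reg] != imm
            self._set_pr(p_true, result)
            self._set_pr(p_false, not result)

    def add(self, dst: int, a: int, b: int, qp: int = 0) -> None:
        if self.pr[qp]:
            self.gr[dst] = (self.gr[a] + self.gr[b]) & MASK64

    def ld8(self, dst: int, addr_reg: int, qp: int = 0) -> None:
        if self.pr[qp]:
            self.gr[dst] = self.mem.get(self.gr[addr_reg], 0)


def run_example(r4: int):
    """Run the three-instruction example and return (r1, r6, p1, p2)."""
    m = PredicatedMachine()
    m.gr[2], m.gr[3], m.gr[4], m.gr[5] = 2, 3, r4, 0x100
    m.mem[0x100] = 42
    m.cmp_ne(1, 2, reg=4, imm=0)   # cmp.ne p1, p2 = r4, 0
    m.add(1, 2, 3, qp=1)           # (p1) add r1 = r2, r3
    m.ld8(6, 5, qp=1)              # (p1) ld8 r6 = [r5]
    return m.gr[1], m.gr[6], m.pr[1], m.pr[2]
```

With `r4 = 7` both predicated instructions take effect; with `r4 = 0` they behave as NOPs and r1 and r6 keep their old values.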

11 Advantages of predication
• The compiler has more freedom when scheduling if predicates are guaranteed not to conflict.
• Code motion past branches and load/store operations results in speculative execution.
[Figure: predication and moving instructions; code motion upward and downward]

12 Speculative execution
• The compiler selects commonly executed blocks
• Instruction selection, prioritization and reordering
To enable aggressive code motion by the compiler, explicitly speculative instructions must be available.

13 Speculative execution – Control speculation
IA-64 provides speculative load instructions.
Original code:
        instrA
        instrB
        ...
        br
        ld8 r1 = [r2]
        use r1
With control speculation:
        ld8.s r1 = [r2]     (speculative load; NaT may be written in r1)
        use r1
        instrA
        instrB
        ...
        br
        chk.s               (speculation check)
The load instruction is replaced by a speculative load and a speculation check.
Exception handling: if a speculative load raises an exception, a deferred exception token (NaT) is written to the target register. This NaT is propagated by almost all instructions. chk.s checks for a NaT and, if one is present, jumps to compiler-generated fix-up code. This fix-up code may execute the load non-speculatively and return to the main code afterwards.
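
The NaT mechanism can be imitated in a few lines. This is a sketch only: the token `NAT`, the helper names and the use of a missing dictionary key as a stand-in for a faulting address are all assumptions made for illustration.

```python
NAT = object()  # stands in for the deferred exception token (Not A Thing)


def ld8_s(memory: dict, addr: int):
    """Speculative load: return NaT instead of raising when the access would fault.

    A missing key plays the role of a faulting address (an assumption of this sketch).
    """
    return memory.get(addr, NAT)


def add(a, b):
    """Ordinary instruction: propagates NaT instead of computing a result."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, fixup):
    """Speculation check: run the fix-up code only if a NaT arrived."""
    return fixup() if value is NAT else value


def load_and_use(memory: dict, addr: int):
    r1 = ld8_s(memory, addr)       # ld8.s, hoisted above the branch
    r3 = add(r1, 1)                # use of r1, hoisted as well

    def fixup():
        value = memory[addr]       # non-speculative load: a fault surfaces here
        return add(value, 1)       # redo the dependent instruction

    return chk_s(r3, fixup)        # chk.s at the original position of the load
```

A valid address gives the same result as the non-speculative code, and a faulting address only raises once the check is reached, which is where the original load would have faulted.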

14 Speculative execution – Data speculation
IA-64 provides advanced load instructions.
Original code:
        instrA
        ...
        store
        ld8 r1 = [r2]
        use r1
With data speculation:
        ld8.a r1 = [r2]     (advanced load)
        use r1
        instrA
        ...
        store
        chk.a               (advanced load check)
The load instruction is replaced by an advanced load and an advanced load check.
For each advanced load, reg#, addr and size are stored in the advanced load address table (ALAT).
WaR handling: when the store is executed, all ALAT entries are compared with the store address. Overlapping entries are removed. chk.a looks up the address of its corresponding advanced load in the ALAT. If the address is still there, chk.a does nothing. If it is gone, chk.a jumps to fix-up code.
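
The ALAT bookkeeping described above can be sketched as follows. The class layout, the method names and the word-granular memory dictionary are assumptions of this sketch; only the idea (record reg#, addr and size on an advanced load, drop overlapping entries on a store, check for the entry later) comes from the slide.

```python
class ALAT:
    """Toy advanced load address table: reg# -> (addr, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg: int, addr: int, size: int) -> None:
        """Record an advanced load (ld.a) for the target register."""
        self.entries[reg] = (addr, size)

    def store(self, addr: int, size: int) -> None:
        """Remove every entry whose byte range overlaps the store."""
        overlapping = [
            reg for reg, (a, s) in self.entries.items()
            if a < addr + size and addr < a + s
        ]
        for reg in overlapping:
            del self.entries[reg]

    def check(self, reg: int) -> bool:
        """chk.a: True if the entry survived, False if fix-up code must run."""
        return reg in self.entries


def speculate(memory: dict, load_addr: int, store_addr: int, store_value: int) -> int:
    """Load hoisted above a store, followed by the check and an optional fix-up."""
    alat = ALAT()
    r1 = memory[load_addr]             # ld8.a r1 = [r2]
    alat.advanced_load(1, load_addr, 8)
    result = r1 + 1                    # use r1 (also moved above the store)
    memory[store_addr] = store_value   # the possibly conflicting store
    alat.store(store_addr, 8)
    if not alat.check(1):              # chk.a failed: run the fix-up code
        r1 = memory[load_addr]         # reload the value non-speculatively
        result = r1 + 1                # redo the dependent computation
    return result
```

If the store hits a different address the speculative result is kept; if it hits the loaded address the fix-up path recomputes the result from the stored value.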

15 Speculative execution – fix-up
The fix-up code generated by the compiler is general:
• In case of control speculation: not only the load is speculative, but also all instructions using the destination register.
• In case of data speculation: not only the load is speculative, but also all computations before the (possibly conflicting) store.
Although the compiler must include fix-up code to handle exceptions and WaR conflicts, this relatively simple mechanism allows for aggressive code motion.

16 Comparison: ARM conditional instructions
Conditional instructions to allow for branch removal, as implemented in the ARM processor (circa 1985).
Condition codes:
    0000  EQ  Z
    0001  NE  ~Z
    0010  CS  C
    0011  CC  ~C
    0100  MI  N
    0101  PL  ~N
    0110  VS  V
    0111  VC  ~V
    1000  HI  C and ~Z
    1001  LS  ~C or Z
    1010  GE  N = V
    1011  LT  N = ~V
    1100  GT  (N = V) and ~Z
    1101  LE  (N = ~V) or Z
    1110  AL  True
    1111  NV  False (= NOP)
Instruction format:
    Cond | 000 | OPC | S | SRC1 | DEST | SH# | SH | SRC2
Example:
    ADDEQS Rd, Rn, Rm, ASL Rc        Rd = Sign(Rn + (Rm << Rc))
• Single-cycle execution
• Straightforward orthogonal instruction coding: all instructions can be coded conditionally on all conditions
• Only 4 condition bits (Z, C, N, V) in the processor status register, set by CMN, CMP, TEQ, TST
• Flexibility: branch removal, but no code motion! (conditional instructions must come after the CMP)
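
The condition table on this slide translates directly into a lookup. The sketch below is only an illustration of that table; the names `CONDITIONS` and `condition_passes` are made up here.

```python
# 4-bit condition code -> (mnemonic, predicate over the N, Z, C, V flags),
# transcribed from the table on the slide.
CONDITIONS = {
    0b0000: ("EQ", lambda n, z, c, v: z),
    0b0001: ("NE", lambda n, z, c, v: not z),
    0b0010: ("CS", lambda n, z, c, v: c),
    0b0011: ("CC", lambda n, z, c, v: not c),
    0b0100: ("MI", lambda n, z, c, v: n),
    0b0101: ("PL", lambda n, z, c, v: not n),
    0b0110: ("VS", lambda n, z, c, v: v),
    0b0111: ("VC", lambda n, z, c, v: not v),
    0b1000: ("HI", lambda n, z, c, v: c and not z),
    0b1001: ("LS", lambda n, z, c, v: (not c) or z),
    0b1010: ("GE", lambda n, z, c, v: n == v),
    0b1011: ("LT", lambda n, z, c, v: n != v),
    0b1100: ("GT", lambda n, z, c, v: n == v and not z),
    0b1101: ("LE", lambda n, z, c, v: n != v or z),
    0b1110: ("AL", lambda n, z, c, v: True),
    0b1111: ("NV", lambda n, z, c, v: False),
}


def condition_passes(cond: int, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Return True if an instruction with this condition field would execute."""
    _, predicate = CONDITIONS[cond]
    return bool(predicate(n, z, c, v))
```

For example, after a compare of two equal values (Z set), `condition_passes(0b0000, n=False, z=True, c=True, v=False)` is true, so a `moveq` executes while a `movne` becomes a NOP.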

17 In conclusion
EPIC: the future of computing?
• As processors grow in complexity, shifting responsibilities to the compiler seems obvious
• Keeping up with Moore's law calls for conceptual innovations, not only technological ones

18 References
[1] "Introducing the IA-64 architecture", J. Huck, D. Morris, J. Ross (HP), A. Knies, H. Mulder, R. Zahir (Intel), IEEE Micro, Sep-Oct 2000, p. 12-23
[2] "Itanium processor microarchitecture", H. Sharangpani, K. Arora (Intel), IEEE Micro, Sep-Oct 2000, p. 24-43
[3] "IA-64 Application developer's architecture guide, Rev. 1.0", Intel Documentation, May 1999, Chap. 11: "Predication, Control Flow and Instruction Stream", http://developer.intel.com/software/idap/media/pdf/ADAG.pdf
[4] "Itanium processor microarchitecture reference", Intel Documentation, Aug. 2000, http://developer.intel.com/design/ia-64/downloads/245474.htm
[5] "ARM Instruction formats and timings", R. Watts, Nov. 1995, http://www.pinknoise.demon.co.uk/ARMinstrs/index.html
Websites:
- www.intel.com/pressroom
- developer.intel.com/design/ia-64

19 It is now safe to ask your questions

