1
Henk Corporaal www.ics.ele.tue.nl/~heco TUEindhoven 2011
Advanced Computer Architecture 5MD00 / 5Z033
EPIC / Itanium architecture: best of both worlds?
2
Avoiding superscalar complexity
- An alternative: EPIC (Explicitly Parallel Instruction Computing)
- EPIC: best of both worlds?
  - Superscalar: expensive, but binary compatible
  - VLIW: simple, but not binary compatible
- Or: use a VLIW with binary translation at run time
  - Transmeta Crusoe: a VLIW processor that runs x86 code
1/2/2019 ACA H.Corporaal
3
EPIC Architecture: IA-64 / Itanium
- Explicitly Parallel Instruction Computing architecture
- IA-64, now called Itanium
- Implementations:
  - Merced (2001)
  - McKinley (2002)
  - Montecito (2006): 2 cores, 4-way multi-threading, 2 x 12 MB L3, 596 mm2, 90 nm
  - Tukwila (2010): 4 cores, 65 nm, 699 mm2, 24 MB L3
  - Poulson (2012): 8 cores, 32 nm, 3 billion transistors, 48 MB L3 cache, 544 mm2, 4-way hyperthreading per core, 12-issue per core
  - Kittson (?? 2014)
4
(2002)
5
Itanium: Register model
- 128 64-bit integer registers, with stack and rotating register file support
- 128 82-bit floating-point registers, rotating
- 64 1-bit booleans (predicate registers)
- 8 64-bit branch target address registers
- System control registers
6
Itanium Instruction format
- Instructions are grouped in 128-bit bundles
  - 3 x 41-bit instructions
  - 5 template bits, which indicate the instruction types and the stop location
- Each 41-bit instruction starts with a 4-bit opcode and ends with a 6-bit guard (boolean) register id
- Bundle layout (field widths in bits): template 5 | instruction 41 | instruction 41 | instruction 41
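The bundle layout above can be sketched as plain bit slicing. This is a minimal illustration, not a real decoder: the bit positions (template in the lowest 5 bits, guard register id in the lowest 6 bits of each slot, opcode in the highest 4 bits) follow the usual IA-64 numbering and are an assumption relative to the slide, and the helper names `encode_slot`, `encode_bundle`, and `decode_bundle` are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_slot(major_opcode, body, qp):
    """Pack one 41-bit instruction: 4-bit opcode, 31-bit body, 6-bit guard id."""
    return ((major_opcode & 0xF) << 37) | ((body & ((1 << 31) - 1)) << 6) | (qp & 0x3F)


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    value = template & 0x1F
    for i, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return value


def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and three decoded slots."""
    template = bundle & 0x1F
    slots = []
    for i in range(3):
        raw = (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
        slots.append({
            "raw": raw,
            "major_opcode": raw >> 37,  # first 4 bits of the instruction
            "qp": raw & 0x3F,           # guard (predicate) register id
        })
    return template, slots
```

Note that 5 + 3 x 41 = 128, so the three slots and the template fill the bundle exactly.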
7
8
Predication
- Predicated execution of virtually all instructions
    (p) add r1 = r2, r3
  - If p is true, the add executes normally; otherwise it acts as a NOP
- 64 1-bit predicate registers
- Advantages of predicated execution:
  - Removes branches
  - Converts control dependences to data dependences
  - Reduces misprediction penalties
  - Increases the size of basic blocks
  - Code from both the taken and the not-taken path can be scheduled in the same cycle
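How predication turns a branch into straight-line code (if-conversion) can be sketched in a few lines. This is an illustrative model only: `predicated` is a hypothetical helper standing in for a guarded instruction, and the absolute-difference example is not from the slides.

```python
def predicated(p, op, old):
    """Model of '(p) instr': run op() if p is true, otherwise keep the old value (NOP)."""
    return op() if p else old


def abs_diff_branchy(a, b):
    # Original code with a branch: only one path executes.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_predicated(a, b):
    # One compare sets two complementary predicates (like cmp.gt p1, p2 = a, b).
    p1 = a > b
    p2 = not p1
    r = 0
    # Both paths are issued unconditionally; the predicates decide which one takes effect.
    r = predicated(p1, lambda: a - b, r)   # (p1) sub r = a, b
    r = predicated(p2, lambda: b - a, r)   # (p2) sub r = b, a
    return r
```

The control dependence on the branch has become a data dependence on p1 and p2, so both guarded instructions can be scheduled in the same cycle.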
9
Control Speculation
- Loads incur high latency
- Need to schedule loads as early as possible
- Two barriers: branches and stores
- Control speculation: move loads above branches
10
Control speculation – move loads above branches
- Problem: loads can cause exceptions
- Solution: separate load behavior from exception behavior
  - A speculative load (ld.s) initiates the load and detects exceptions
  - On an exception, hardware propagates an exception token (stored with the destination register) from ld.s to chk.s
  - A speculative check (chk.s) delivers the exception detected by ld.s
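The ld.s / chk.s mechanism can be modelled roughly as follows. This is a sketch under stated assumptions: memory is a Python dict, a missing address stands for any faulting load, the sentinel `NAT` plays the role of the exception token, and the function names are hypothetical rather than real instruction semantics.

```python
class LoadFault(Exception):
    """Raised by a non-speculative load from an invalid address."""


NAT = object()  # exception token carried with the destination register


def load(memory, addr):
    """Ordinary load: faults immediately on a bad address."""
    if addr not in memory:
        raise LoadFault(addr)
    return memory[addr]


def ld_s(memory, addr):
    """Speculative load: on a fault, return the token instead of raising."""
    return memory[addr] if addr in memory else NAT


def add(a, b):
    """Dependent instruction: propagates the token instead of computing."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Speculative check: run recovery code if the token arrived, else keep the value."""
    return recovery() if value is NAT else value


def speculated(memory, ptr):
    # The load and its use are hoisted above the branch.
    t = ld_s(memory, ptr)
    u = add(t, 1)
    if ptr is not None:
        # Only on the path that really needs the value is the exception delivered.
        return chk_s(u, lambda: load(memory, ptr) + 1)
    return None
```

With `ptr = None` the hoisted load "faults" silently and the token is simply discarded, because the path containing chk.s is never executed.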
11
Control Speculation
- Speculating the uses of a control-speculated load further increases ILP
- Dependent instructions following the load can also be speculated above branches
12
Data Speculation
- Loads and previous stores can conflict
  - When a load and a store overlap (access the same memory location), the load must wait for the previous store because of the RAW dependence
- IA-64 enables data speculation with ld.a and ld.c/chk.a, backed by the ALAT (Advanced Load Address Table):
  - ld.a performs a normal load and inserts the address into the ALAT
  - Any intervening store removes overlapping entries from the ALAT
  - The advanced load check (ld.c) checks the ALAT for a violation and reissues the load if necessary
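The ALAT protocol described above can be sketched as a small table keyed by destination register. This is a simplified model: it assumes exact address matches instead of real overlap checks on address ranges, and the class and method names are hypothetical.

```python
class ALAT:
    """Toy Advanced Load Address Table: destination register -> loaded address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: perform the load and record the address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and remove every entry with the same address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, memory, reg, addr, speculative_value):
        """Check load: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return speculative_value
        return memory[addr]


def demo():
    memory = {0x10: 1, 0x20: 2}
    alat = ALAT()

    # Case 1: the intervening store goes elsewhere, the early load stays valid.
    r1 = alat.ld_a(memory, "r1", 0x10)
    alat.store(memory, 0x20, 9)
    no_conflict = alat.ld_c(memory, "r1", 0x10, r1)

    # Case 2: the intervening store hits the same address, the load is reissued.
    r2 = alat.ld_a(memory, "r2", 0x10)
    alat.store(memory, 0x10, 5)
    conflict = alat.ld_c(memory, "r2", 0x10, r2)

    return no_conflict, conflict
```

In the conflicting case the stale value read by ld.a is dropped and ld.c returns the value written by the store.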
13
Data Speculation
- Move loads above potentially overlapping stores
14
Data Speculation
- Uses of speculative data can be further speculated
- Control and data speculation can also be combined
  - Schedule loads across branches and across stores at the same time
  - Speculative advanced load: ld.sa combines the semantics of ld.a and ld.s
15
Register Stack
- Procedure call overhead
  - Spill registers to memory on call
  - Restore them on procedure return
- Register stack
  - Used to save/restore procedure contexts across calls
  - Stack area in memory to save/restore procedure context
  - Explicit allocation of stack frames
  - Effective use of the 96 stacked registers: allocate only what is needed
  - Overlapping stack frames avoid parameter copying
  - Mechanism implemented by renaming register addresses
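Renaming with overlapping frames can be illustrated with a small model. The sketch assumes that logical registers r32 and up map to `base + offset` in a ring of 96 physical registers, that a frame consists of inputs plus locals followed by outputs, and that a call moves the base so the caller's outputs become the callee's first registers. The class and its method names are hypothetical, and overflow handling is left out.

```python
class RegisterStack:
    N_PHYSICAL = 96  # stacked registers r32..r127

    def __init__(self):
        self.phys = [0] * self.N_PHYSICAL
        self.base = 0      # physical index of the current frame's r32
        self.sol = 0       # size of locals (inputs + locals)
        self.sof = 0       # size of frame (locals + outputs)
        self.saved = []    # caller frame markers, restored on return

    def alloc(self, n_in, n_local, n_out):
        """Explicitly size the current frame (only what is needed)."""
        self.sol = n_in + n_local
        self.sof = self.sol + n_out

    def _index(self, reg):
        offset = reg - 32
        if not 0 <= offset < self.sof:
            raise IndexError(f"r{reg} is outside the current frame")
        return (self.base + offset) % self.N_PHYSICAL

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def call(self):
        """Rename: the caller's output registers become the callee's r32..."""
        self.saved.append((self.base, self.sol, self.sof))
        self.base = (self.base + self.sol) % self.N_PHYSICAL
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """Restore the caller's frame marker; no register contents are copied."""
        self.base, self.sol, self.sof = self.saved.pop()
```

A short usage example: the caller passes a parameter in its output register r34, and the callee sees it as r32 without any copy.

```python
rs = RegisterStack()
rs.alloc(0, 2, 1)      # caller: r32, r33 local, r34 output
rs.write(32, 1)
rs.write(34, 42)       # parameter
rs.call()
seen = rs.read(32)     # 42, same physical register as the caller's r34
rs.ret()
```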
16
Register Stack
17
Register Stack Engine (RSE)
- Automatically saves/restores stack registers without software intervention
  - Avoids explicit spill/fill (eliminates stack management overhead)
  - Provides the illusion of infinite physical registers
- The RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
- Overflow: alloc needs more registers than are available
- Underflow: return needs to restore a frame saved in memory
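Overflow and underflow can be sketched with a deliberately coarse model. It assumes that whole frames (not individual registers) are spilled and filled, that frames do not overlap, and that spills happen synchronously at the call rather than in the background; the class name and counters are hypothetical.

```python
from collections import deque


class RegisterStackEngine:
    def __init__(self, n_physical=96):
        self.n_physical = n_physical
        self.resident = deque()    # frames held in physical registers, oldest first
        self.backing_store = []    # frames spilled to memory, oldest first
        self.spills = 0
        self.fills = 0

    def _used(self):
        return sum(len(frame) for frame in self.resident)

    def call(self, frame_size):
        """Allocate a new frame; on overflow, spill the oldest frames to memory."""
        if frame_size > self.n_physical:
            raise ValueError("frame larger than the physical register file")
        while self._used() + frame_size > self.n_physical:
            self.backing_store.append(self.resident.popleft())
            self.spills += 1
        self.resident.append([0] * frame_size)

    def ret(self):
        """Drop the current frame; on underflow, fill the caller's frame from memory."""
        self.resident.pop()
        if not self.resident and self.backing_store:
            self.resident.append(self.backing_store.pop())
            self.fills += 1

    def current(self):
        return self.resident[-1]
```

From the program's point of view every call simply gets fresh registers and every return finds the caller's values intact, which is the "infinite register file" illusion mentioned above.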
18
Software Pipelining Support
- High-performance loops without code size overhead
  - No prologue and epilogue
- Rotating registers
  - Provide automatic renaming
- Rotating predicates (stage predicates)
  - Unify prologue, kernel, and epilogue
- Loop control registers (LC, EC)
- Loop branches
  - Counted loop (br.ctop)
  - While loop (br.wtop)
- Especially valuable for integer loops with small trip counts
19
Software Pipelining Example
- Schedule: overlapped iterations of ld, add, st, divided into prolog, kernel, and epilog [figure not reproduced]
- Original loop:
    L1: ld4 r4 = [r5], 4     // Cycle 0
        add r7 = r4, r9      // Cycle 2
        st4 [r6] = r7, 4     // Cycle 3
        br.cloop L1 ;;
- Software-pipelined loop:
    L1: (p16) ld4 r32 = [r5], 4     // Cycle 0
        (p18) add r35 = r34, r9     // Cycle 0
        (p19) st4 [r6] = r36, 4     // Cycle 0
        br.ctop L1                  // Cycle 0
- What happens at run time? The same registers are addressed under these names in successive iterations (one row per iteration):
    r32 r33 r34 r35 ...   p16 p17 p18 p19 ...
    r33 r34 r35 r36 ...   p17 p18 p19 p16
    r34 r35 r36 r37 ...   p18 p19 p16 p17
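The run-time behaviour of the pipelined loop can be simulated directly. The sketch below models the rotating general registers and stage predicates as deques and follows the usual br.ctop rules (decrement LC while it is non-zero and feed a 1 into p16; afterwards decrement EC and feed a 0; rotate on every pass). The region sizes, the 4-stage epilog count, and the function name are assumptions made for illustration.

```python
from collections import deque


def run_pipelined_loop(src, r9):
    """Simulate: (p16) ld4 r32 / (p18) add r35 = r34, r9 / (p19) st4 r36 / br.ctop."""
    if not src:
        return []
    gr = deque([0] * 8)        # rotating registers r32..r39 (index 0 is r32)
    pr = deque([False] * 48)   # rotating predicates p16..p63 (index 0 is p16)
    pr[0] = True               # p16 = 1 enables the first load
    lc = len(src) - 1          # loop count: iterations remaining after the first
    ec = 4                     # epilog count: number of pipeline stages
    dst = []
    next_load = 0

    while True:
        if pr[0]:                      # (p16) ld4 r32 = [r5], 4
            gr[0] = src[next_load]
            next_load += 1
        if pr[2]:                      # (p18) add r35 = r34, r9
            gr[3] = gr[2] + r9
        if pr[3]:                      # (p19) st4 [r6] = r36, 4
            dst.append(gr[4])

        # br.ctop: decide whether a new iteration starts and whether to loop again.
        if lc > 0:
            lc -= 1
            new_p16, taken = True, True
        elif ec > 1:
            ec -= 1
            new_p16, taken = False, True
        else:
            ec = 0
            new_p16, taken = False, False

        # Rotation: what was r32 is now named r33, what was p16 is now p17.
        gr.rotate(1)
        pr.rotate(1)
        pr[0] = new_p16

        if not taken:
            return dst
```

For a source array of N elements the single loop body executes N + 3 times: the stage predicates switch the load, add, and store on and off, which is how prolog, kernel, and epilog share one copy of the code.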
20
IA-64 / Itanium architecture: a VLIW?
- Yes, but:
  - Instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
  - HW does the (operation to FU) binding
  - Pipeline latencies are not visible in the ISA
- These measures make the ISA independent of the number of FUs and of pipeline latencies
  - The ISA supports multiple implementations
21
HW vs SW scheduling + binding?
Architecture options:

    Architecture          Scheduling operations   Binding operations to FUs
    O-o-O Superscalar     HW                      HW
    Itanium               SW                      HW
    TRIPS                 HW                      SW
    VLIW                  SW                      SW
22
Montecito 2006: dual 11-issue cores
23
Tukwila 4 core Itanium, 2010