The IA-64 architecture and Itanium processors: Explicitly Parallel Instruction Computing
Frans Dondorp, presentation et 4 074, January 8th, 2001



Contents
- Introduction to the IA-64 architecture and EPIC
- The Itanium processor
- Branch removal
- Predication
- Speculative execution
- Control speculation
- Comparison: ARM conditional instructions
- Data speculation

Introduction to the IA-64 architecture
- Joint research by Intel and Hewlett-Packard (1994)
- Exploitation of the ILP (instruction-level parallelism) concept
- Tight coupling of hardware and software

EPIC (Explicitly Parallel Instruction Computing) is introduced as the basic concept. This results in:
- a more complex task for the compiler
- hardware support for communicating meta-information: speculation, predication and branch hints

"The future of computing" (Intel web site)

The Itanium processor
- The Itanium, formerly code-named Merced, is the first processor based on the IA-64 architecture
- Still a prototype; compilers announced (as of Nov. 2000)
- 10-stage pipeline, running at 800 MHz
- To support EPIC, it is equipped with 4 ALUs, 4 MMX units, 4 FPUs (2 SP, 2 DP), 2 L/S units and 3 branch units
- MS Win2K and Linux announced (as of Oct. 2000)

IA-64 resources and instructions: register resources
- GR (general register): r0 ... r127, 128 GRs, static and stacked/rotating. Each GR carries a NaT ("Not a Thing") bit that marks a deferred exception; it is used for control speculation.
- FR (floating-point register): f0 ... f127, 128 FRs, rotating.
- AR (application register): ar0 ... ar127, 128 ARs. Support for the register stack and software pipelining.
- BR (branch register): b0 ... b7, 8 BRs of 64 b. Function call linkage and return (64 b address space!).
- PR (predicate register): p0 ... p63, 64 PRs of 1 b. Each holds the result of a conditional expression evaluation; used for predication.

IA-64 resources and instructions: instruction encoding

IA-64 "bundle" (3 x 41 b + 5 b = 128 b):

    | Instruction 2 (41 b) | Instruction 1 (41 b) | Instruction 0 (41 b) | Template (5 b) |

Instruction format (41 b):

    | Op (14 b) | Reg 1 (7 b) | Reg 2 (7 b) | Reg 3 (7 b) | Predicate (6 b) |

Example:

    { .mii
      ld8 r1 = 4[r2]
      add r3 = r1, r3
      shr r7 = r4, r12
    }
    { .mbb
      ld8 r6 = 8[r5]
      (p3) br.cond Label1
      (p4) br.cond Label2
    }

- Templates are used to group instructions so that parallel execution keeps the execution units busy.
- Predicates allow for conditional execution. The 6 predicate bits address the 64 predicate registers.
- The Itanium processor issues 8 ops/clock.

(Figure: the slots of the MII and MBB bundles mapped onto the ALU, MMX, L/S, FP and branch units.)
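The bundle layout above can be sketched as plain bit packing. This is a minimal illustration, not an assembler: the function names are hypothetical, and it assumes the template occupies the five least significant bits with instruction slot 0 directly above it, and that the predicate number sits in the low six bits of a slot, following the right-hand position of those fields on the slide.

```python
TEMPLATE_BITS = 5     # width of the template field, as on the slide
SLOT_BITS = 41        # width of one instruction slot, as on the slide
PREDICATE_BITS = 6    # 6 bits address the 64 predicate registers

TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1
PREDICATE_MASK = (1 << PREDICATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError(f"slot {index} does not fit in 41 bits")
        # Assumed layout: slot 0 sits directly above the template.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots


def predicate_field(slot):
    """Return the predicate register number (assumed to be the low 6 bits)."""
    return slot & PREDICATE_MASK
```

Packing and unpacking round-trip, and the four fields together fill exactly 128 bits, which is why one bundle carries three instructions and no more.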

Branch removal
- Branch prediction is costly
- The cost of a misprediction is proportional to the pipeline length
- Optimizing the use of prediction resources can significantly improve overall performance
- Conditional instructions can eliminate the need for branches

With branches:

    cmp r1, r2
    beq equal
    mov r1, #0
    bal end
    .equal
    mov r2, #0
    .end

With conditional instructions:

    cmp r1, r2
    movne r1, #0
    moveq r2, #0

A conditional instruction such as moveq executes only if the eq-bit is set in the status register; otherwise it acts as a NOP.
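The equivalence of the two forms can be checked with a small model. The sketch below is illustrative only and the function names are hypothetical. The conditional version uses the condition codes that reproduce the branching listing: r1 is cleared when the operands differ and r2 is cleared when they are equal.

```python
def with_branches(r1, r2):
    """Branching version: control flow selects one of the two moves."""
    if r1 == r2:       # cmp r1, r2 / beq equal
        r2 = 0         # .equal: mov r2, #0
    else:
        r1 = 0         # mov r1, #0 / bal end
    return r1, r2


def with_conditional_moves(r1, r2):
    """Branch-free version: both moves are issued, one of them acts as a NOP."""
    eq = (r1 == r2)    # cmp r1, r2 sets the status flag once
    if not eq:         # movne r1, #0
        r1 = 0
    if eq:             # moveq r2, #0
        r2 = 0
    return r1, r2
```

Both functions return the same register pair for every input, but only the first one gives the branch predictor something to get wrong.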

Branch removal: conditional instructions

Conditional instructions can reduce the branch penalty of a misprediction from N pipeline stages to 1. Drawbacks:
- Encoding the condition directly in the instruction increases instruction size, while the number of conditions that can be tested is limited (typically to a few bits in the processor status register)
- Unbalanced execution paths: conditional code can cost more cycles than the branch misprediction it avoids

See: ARM conditional instructions

Branch removal: conditional instructions

Example: conditional code performance (one instruction executed each cycle)

Conditional version:

    cmp r1, r2
    moveq r1, #0
    addeq r2, r2, #10
    ldbeq r3, (r5)+
    inceq r3
    stbeq r3, (r5)+
    inceq r1
    mov r2, #0

If r1 ≠ r2, the six conditional instructions become 6 NOPs. LOSS: 6 cycles.

Branching version:

    cmp r1, r2
    bne end
    mov r1, #0
    add r2, r2, #10
    ldb r3, (r5)+
    inc r3
    stb r3, (r5)+
    inc r1
    .end
    mov r2, #0

If r1 ≠ r2 and the branch is mispredicted, the pipeline is flushed (branch penalty). LOSS: the number of pipeline stages.

On a machine with a 5-stage pipeline, conditional instructions would lead to a performance loss here (6 cycles versus 5). The compiler should decide!
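The trade-off in this example can be written as a small cost model. This is a sketch under the slide's simplifying assumptions (one instruction per cycle, a misprediction costs one cycle per pipeline stage); the function names and the probability parameters are hypothetical and not part of the presentation.

```python
def conditional_loss(guarded_instructions, p_condition_false):
    """Expected cycles wasted on conditional instructions that act as NOPs."""
    return guarded_instructions * p_condition_false


def branch_loss(pipeline_stages, p_mispredict):
    """Expected cycles wasted on pipeline flushes after a misprediction."""
    return pipeline_stages * p_mispredict


def prefer_conditional(guarded_instructions, pipeline_stages,
                       p_condition_false, p_mispredict):
    """True if the conditional version is expected to waste fewer cycles."""
    return (conditional_loss(guarded_instructions, p_condition_false)
            < branch_loss(pipeline_stages, p_mispredict))
```

With six guarded instructions and a 5-stage pipeline, the worst case (`p_condition_false = p_mispredict = 1.0`) favors the branch, as on the slide. Under the same model a 10-stage pipeline would favor the conditional code.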

Predication

Predication: tagging instructions with a boolean value.

    cmp.ne p1, p0 = r4, 0 ;;     set boolean values: compare r4 to 0, p1 is TRUE if r4 ≠ 0
    (p1) add r1 = r2, r3         if r4 ≠ 0 then r1 = (r2 + r3)
    (p1) ld8 r6 = [r5]           if r4 ≠ 0 then r6 = MEM(r5)

The compare sets two boolean values: the first target (p1) receives the result of the comparison and the second target receives NOT(p1).

Predication reduces the limitations of conditional instructions: the number of conditions that can be tested equals the number of predicate registers.
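The three-instruction example can be modelled with a toy register machine. This is a rough sketch, not an IA-64 simulator: the class and method names are hypothetical, memory is a plain dictionary, and the only architectural fact relied on beyond the slide is that p0 always reads as true, so writes to it are discarded.

```python
MASK64 = (1 << 64) - 1


class PredicatedMachine:
    """Toy model: 128 general registers, 64 one-bit predicate registers."""

    def __init__(self):
        self.r = [0] * 128
        self.p = [False] * 64
        self.p[0] = True          # p0 is always true
        self.mem = {}

    def _set_pred(self, index, value):
        if index != 0:            # writes to p0 are discarded
            self.p[index] = bool(value)

    def cmp_ne(self, p_true, p_false, reg, imm, qp=0):
        """(qp) cmp.ne p_true, p_false = reg, imm"""
        if self.p[qp]:
            result = self.r[reg] != imm
            self._set_pred(p_true, result)
            self._set_pred(p_false, not result)

    def add(self, dst, src1, src2, qp=0):
        """(qp) add dst = src1, src2"""
        if self.p[qp]:
            self.r[dst] = (self.r[src1] + self.r[src2]) & MASK64

    def ld8(self, dst, addr_reg, qp=0):
        """(qp) ld8 dst = [addr_reg]"""
        if self.p[qp]:
            self.r[dst] = self.mem.get(self.r[addr_reg], 0)


def run_example(r4):
    """Run the slide's sequence and return (r1, r6)."""
    m = PredicatedMachine()
    m.r[2], m.r[3], m.r[4], m.r[5] = 2, 3, r4, 0x100
    m.mem[0x100] = 42
    m.cmp_ne(1, 0, 4, 0)      # cmp.ne p1, p0 = r4, 0
    m.add(1, 2, 3, qp=1)      # (p1) add r1 = r2, r3
    m.ld8(6, 5, qp=1)         # (p1) ld8 r6 = [r5]
    return m.r[1], m.r[6]
```

Every instruction is issued in both cases; the predicate only decides whether its result is committed, so there is no branch to predict.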

Advantages of predication
- The compiler has more freedom when scheduling if predicates are guaranteed not to conflict.
- Code motion past branches and load/store operations results in speculative execution.

(Figure: predication and moving instructions; code motion upward and downward.)

Speculative execution
- The compiler selects commonly executed blocks
- Instruction selection, prioritization and reordering
- To enable aggressive code motion by the compiler, explicitly speculative instructions must be available

Speculative execution: control speculation

IA-64 provides speculative load instructions.

Original code:

    instrA
    instrB
    ...
    br
    ld8 r1 = [r2]
    use r1

With control speculation, the load instruction is replaced by a speculative load and a speculation check:

    ld8.s r1 = [r2]       speculative load; NaT may be written in r1
    use r1
    instrA
    instrB
    ...
    br
    chk.s                 speculation check

Exception handling: if a speculative load raises an exception, a deferred exception token (NaT) is written to the target register. This NaT is propagated by almost all instructions. chk.s checks for the NaT and, if it is present, jumps to compiler-generated fix-up code. This fix-up code may execute the load non-speculatively and return to the main code afterwards.
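The deferred-exception mechanism can be sketched in a few lines. This is an illustrative model only: `NAT`, `ld_s`, `chk_s` and the other names are hypothetical stand-ins for the hardware behaviour described above, memory is a dictionary, and a missing key plays the role of a faulting address.

```python
class MemoryFault(Exception):
    """Raised by a non-speculative load from an unmapped address."""


NAT = object()  # deferred exception token ("Not a Thing")


def ld(memory, address):
    """Non-speculative load: faults immediately."""
    if address not in memory:
        raise MemoryFault(hex(address))
    return memory[address]


def ld_s(memory, address):
    """Speculative load: a fault is deferred by returning NaT."""
    return memory.get(address, NAT)


def add(a, b):
    """Ordinary instruction: propagates NaT instead of computing."""
    if a is NAT or b is NAT:
        return NAT
    return a + b


def chk_s(value, fix_up):
    """Speculation check: run the fix-up code only if NaT arrived."""
    return fix_up() if value is NAT else value


def hoisted_sequence(memory, address, branch_taken):
    r1 = ld_s(memory, address)    # ld8.s hoisted above the branch
    r3 = add(r1, 1)               # speculative use of r1
    if branch_taken:              # br: the load was never needed
        return None
    # chk.s: the fix-up code repeats the load and its uses non-speculatively
    return chk_s(r3, lambda: add(ld(memory, address), 1))
```

A bad address is harmless when the branch is taken, and the fault surfaces only at the check when the result is actually required.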

Speculative execution: data speculation

IA-64 provides advanced load instructions.

Original code:

    instrA
    ...
    store
    ld8 r1 = [r2]
    use r1

With data speculation, the load instruction is replaced by an advanced load and an advanced load check:

    ld8.a r1 = [r2]       advanced load
    use r1
    instrA
    ...
    store
    chk.a                 advanced load check

The register number, address and size of each advanced load are stored in the advanced load address table (ALAT):

    reg#   addr   size
    ...    ...    ...

WaR (write-after-read) handling: when the store is executed, all ALAT entries are compared with the store address, and overlapping entries are removed. chk.a looks up the address of its corresponding advanced load in the ALAT. If the address is still there, chk.a does nothing. If it is gone, chk.a jumps to fix-up code.
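The ALAT bookkeeping can be sketched as a small table keyed by register number. This is a simplified model with hypothetical names: a real ALAT has a fixed capacity and hardware-specific matching rules, neither of which is modelled here.

```python
def overlaps(addr_a, size_a, addr_b, size_b):
    """True if the byte ranges [addr, addr + size) intersect."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


class ALAT:
    """Toy advanced load address table: reg# -> (addr, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        """ld.a: record the register, address and size of the load."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Every store removes the entries it overlaps."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if not overlaps(a, s, addr, size)
        }

    def chk_a(self, reg):
        """True if the entry survived; False means: jump to fix-up code."""
        return reg in self.entries
```

For example, after `advanced_load(1, 0x100, 8)` a store to `0x200` leaves `chk_a(1)` true, while a 4-byte store to `0x104` removes the entry and sends execution to the fix-up code.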

Speculative execution: fix-up

The fix-up code generated by the compiler is general:
- In case of control speculation: not only the load is speculative, but also all instructions using the destination register.
- In case of data speculation: not only the load is speculative, but also all computations before the (possibly conflicting) store.

Although the compiler must include fix-up code to handle exceptions and WaR conflicts, this relatively simple mechanism allows for aggressive code motion.

Comparison: ARM conditional instructions

Conditional instructions to allow for branch removal, as implemented in the ARM processor (circa 1985).

Condition codes:

    Code  Name  Condition
    0000  EQ    Z
    0001  NE    ~Z
    0010  CS    C
    0011  CC    ~C
    0100  MI    N
    0101  PL    ~N
    0110  VS    V
    0111  VC    ~V
    1000  HI    C and ~Z
    1001  LS    ~C or Z
    1010  GE    N = V
    1011  LT    N = ~V
    1100  GT    (N = V) and ~Z
    1101  LE    (N = ~V) or Z
    1110  AL    True
    1111  NV    False (= NOP)

Instruction format:

    | Cond | 000 | OPC | S | SRC1 | DEST | SH# | SH | SRC2 |

Code:

    ADD EQ S  Rd, Rn, Rm, ASL Rc        Rd = Sign(Rn + (Rm << Rc))

- Single-cycle execution
- Straightforward orthogonal instruction coding: all instructions can be coded conditionally on all conditions
- Only 4 condition bits (Z, C, N, V) in the processor status register, set by CMN, CMP, TEQ, TST
- Flexibility: branch removal, but no code motion! (Conditional instructions must follow the CMP.)
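The condition table translates directly into a lookup of boolean predicates over the four flags. The sketch below is an illustration with hypothetical names; `cmp_flags` additionally assumes 32-bit registers and the ARM convention that a compare sets C when the subtraction produces no borrow.

```python
# 4-bit condition code -> (name, predicate over the N, Z, C, V flags)
CONDITIONS = {
    0b0000: ("EQ", lambda n, z, c, v: z),
    0b0001: ("NE", lambda n, z, c, v: not z),
    0b0010: ("CS", lambda n, z, c, v: c),
    0b0011: ("CC", lambda n, z, c, v: not c),
    0b0100: ("MI", lambda n, z, c, v: n),
    0b0101: ("PL", lambda n, z, c, v: not n),
    0b0110: ("VS", lambda n, z, c, v: v),
    0b0111: ("VC", lambda n, z, c, v: not v),
    0b1000: ("HI", lambda n, z, c, v: c and not z),
    0b1001: ("LS", lambda n, z, c, v: (not c) or z),
    0b1010: ("GE", lambda n, z, c, v: n == v),
    0b1011: ("LT", lambda n, z, c, v: n != v),
    0b1100: ("GT", lambda n, z, c, v: n == v and not z),
    0b1101: ("LE", lambda n, z, c, v: n != v or z),
    0b1110: ("AL", lambda n, z, c, v: True),
    0b1111: ("NV", lambda n, z, c, v: False),
}


def condition_passes(cond, n, z, c, v):
    """True if an instruction with this condition field executes."""
    return bool(CONDITIONS[cond][1](n, z, c, v))


def cmp_flags(a, b, bits=32):
    """Flags (N, Z, C, V) after CMP a, b, i.e. after computing a - b."""
    mask = (1 << bits) - 1
    a, b = a & mask, b & mask
    result = (a - b) & mask
    sign = bits - 1
    n = bool(result >> sign)
    z = result == 0
    c = a >= b                      # assumption: C set means "no borrow"
    v = (a >> sign) != (b >> sign) and (result >> sign) != (a >> sign)
    return n, z, c, v
```

Because every instruction carries one of these 16 codes, any instruction following a `CMP` can be turned into a NOP without a branch, which is the flexibility noted above.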

In conclusion

EPIC: the future of computing?
- As processors grow in complexity, shifting responsibilities to the compiler seems obvious
- Keeping up with Moore's law calls for conceptual innovations, not only technological ones


It is now safe to ask your questions