Chapter 21 IA-64 Architecture (Think Intel Itanium)

Chapter 21 IA-64 Architecture (Think Intel Itanium), also known as EPIC – Explicitly Parallel Instruction Computing

Superpipelined & Superscalar Machines
Superpipelined machine: overlaps pipe stages; relies on a stage being able to begin an operation before the previous one has completed.
Superscalar machine: employs multiple independent pipelines to execute multiple independent instructions in parallel. In particular, common instructions (arithmetic, load/store, conditional branch) can be executed independently.

Why a New Architecture Direction?
Processor designers' obvious choices for using the increasing number of transistors on chip and the extra speed:
Bigger caches → diminishing returns
Higher degree of superscaling by adding more execution units → complexity wall: more logic, improved branch prediction, more renaming registers, more complicated dependencies
Multiple processors → a challenge to use effectively in general-purpose computing
Longer pipelines → greater penalty for misprediction

IA-64: Background
Explicitly Parallel Instruction Computing (EPIC), jointly developed by Intel & Hewlett-Packard (HP)
A new 64-bit architecture: not an extension of the x86 series, not an adaptation of HP's 64-bit RISC architecture
Designed to exploit increasing chip transistor counts and increasing speeds
Utilizes systematic parallelism; a departure from the superscalar trend
Note: became the architecture of the Intel Itanium

Basic Concepts for IA-64
Instruction-level parallelism made EXPLICIT in the machine instructions, rather than determined at run time by the processor
Long or very long instruction words (LIW/VLIW): fetch bigger chunks that are already “preprocessed”
Predicated execution: marking groups of instructions for a late decision on whether to “execute” them
Control speculation: go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can practically be made later
Data speculation (or speculative loading): go ahead and load data early so it is ready when needed, with a practical way to recover if the speculation proves wrong
Software pipelining: multiple iterations of a loop can be executed in parallel
“Revolvable” register stack: stack frames are programmable and are used to reduce unnecessary movement of data on procedure calls

Predication

Speculative Loading

General Organization

IA-64 Key Hardware Features
Large number of registers: the IA-64 instruction format assumes 256 registers
128 × 64-bit integer, logical & general-purpose registers
128 × 82-bit floating-point and graphics registers
64 × 1-bit predicate registers (to support a high degree of parallelism)
Multiple execution units: probably 8 or more, pipelined

IA-64 Register Set

Predicate Registers
Used as flags for instructions that may or may not be executed.
A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch).
Only instructions with a predicate value of true are allowed to commit their results.
When it becomes known that an instruction path will be taken, its predicate is set; all instructions with that predicate true can now be completed, while those whose predicate is false become candidates for cleanup.
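As a minimal sketch of how predicates gate execution (the register and predicate numbers are arbitrary, chosen only for illustration):
cmp.eq p1, p2 = r20, r21 ;;  // p1 = (r20 == r21), p2 = its complement
(p1) add r10 = r11, r12      // commits only if p1 is true (the "then" path)
(p2) sub r10 = r11, r12      // commits only if p2 is true (the "else" path)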

Instruction Format
128-bit bundles; the processor can fetch one or more bundles at a time
A bundle holds three instructions plus a template
Each instruction is 41 bits long and has an associated predicate register
The template contains information on which instructions can be executed in parallel
Parallel groups are not confined to a single bundle, e.g. a stream of 8 instructions may be executed in parallel
The compiler will have re-ordered instructions to form contiguous bundles
Dependent and independent instructions can be mixed in the same bundle
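The arithmetic works out as follows (a sketch of the bundle layout, consistent with the sizes above): 3 instruction slots × 41 bits = 123 bits, plus a 5-bit template = 128 bits per bundle.
| template (5 bits) | slot 0 (41 bits) | slot 1 (41 bits) | slot 2 (41 bits) |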

Instruction Format Diagram

IA-64 Execution Units
I-Unit: integer arithmetic, shift-and-add, logical, compare, integer multimedia ops
M-Unit: load and store between register and memory, plus some integer ALU operations
B-Unit: branch instructions
F-Unit: floating-point instructions

Relationship between Instruction Type & Execution Unit

Field Encoding & Instruction Set Mapping
Note: a bar indicates a stop: there may be dependencies between instructions before the stop and instructions after it.

Predication (review)

Speculative Loading (review)

Assembly Language Format
[qp] mnemonic [.comp] dest = srcs ;; // comment
qp – qualifying predicate register: 1 at execution time → execute and commit the result to hardware; 0 at execution time → the result is discarded
mnemonic – name of the instruction
comp – one or more instruction completers used to qualify the mnemonic
dest – one or more destination operands
srcs – one or more source operands
;; – instruction-group stop: the sequence before it contains no hazards (read after write, write after write, ...)
// – comment follows
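For example, a single instruction written in this format could look like the following (the register and predicate numbers are invented for illustration):
(p3) ld8.s r4 = [r9] ;;   // qp = p3, mnemonic = ld8, completer = .s (speculative), dest = r4, src = [r9]; the ;; ends the group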

Assembly Example – Register Dependency
ld8 r1 = [r5] ;;   // first group
add r3 = r1, r4    // second group
The second instruction depends on the value in r1, which is changed by the first instruction, so the two cannot be in the same group for parallel execution. Note that ;; ends the group of instructions that can be executed in parallel.

Assembly Example – Multiple Register Dependencies
ld8 r1 = [r5]        // first group
sub r6 = r8, r9 ;;   // first group
add r3 = r1, r4      // second group
st8 [r6] = r12       // second group
The last instruction stores into the memory location whose address is in r6, which is established by the second instruction.

Assembly Example – Predicated Code
Consider the following program with branches:
if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;

Assembly Example – Predicated Code
Source code:
if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;
Pentium assembly code:
    cmp a, 0   ; compare with 0
    je L1      ; branch to L1 if a = 0
    cmp b, 0
    je L1
    add j, 1   ; j = j + 1
    jmp L3
L1: cmp c, 0
    je L2
    add k, 1   ; k = k + 1
    jmp L3
L2: sub k, 1   ; k = k - 1
L3: add i, 1   ; i = i + 1

Assembly Example – Predicated Code
Source code:
if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;
Pentium code:
    cmp a, 0
    je L1
    cmp b, 0
    je L1
    add j, 1
    jmp L3
L1: cmp c, 0
    je L2
    add k, 1
    jmp L3
L2: sub k, 1
L3: add i, 1
IA-64 code:
     cmp.eq p1, p2 = 0, a ;;
(p2) cmp.eq p1, p3 = 0, b
(p3) add j = 1, j
(p1) cmp.ne p4, p5 = 0, c
(p4) add k = 1, k
(p5) add k = -1, k
     add i = 1, i

Example of Predication
IA-64 code:
     cmp.eq p1, p2 = 0, a ;;
(p2) cmp.eq p1, p3 = 0, b
(p3) add j = 1, j
(p1) cmp.ne p4, p5 = 0, c
(p4) add k = 1, k
(p5) add k = -1, k
     add i = 1, i

Data Speculation
Load data from memory before it is needed.
What might go wrong? The load may be moved ahead of a branch or of a store to the same location, so it could later turn out to be unneeded or to have returned an incorrect value.
A subsequent check on the loaded value is therefore required.

Assembly Example – Data Speculation
Consider the following code:
(p1) br some_label    // cycle 0
ld8 r1 = [r5] ;;      // cycle 0 (indirect memory op – 2 cycles)
add r1 = r1, r3       // cycle 2

Assembly Example – Data Speculation
Original code:
(p1) br some_label    // cycle 0
ld8 r1 = [r5] ;;      // cycle 0
add r1 = r1, r3       // cycle 2
Speculated code:
ld8.s r1 = [r5] ;;    // cycle -2
// other instructions
(p1) br some_label    // cycle 0
chk.s r1, recovery    // cycle 0
add r2 = r1, r3       // cycle 0

Assembly Example – Data Speculation
Consider the following code:
st8 [r4] = r12        // cycle 0
ld8 r6 = [r8] ;;      // cycle 0 (indirect memory op – 2 cycles)
add r5 = r6, r7 ;;    // cycle 2
st8 [r18] = r5        // cycle 3
What if r4 and r8 point to the same address?

Assembly Example – Data Speculation
Without data speculation:
st8 [r4] = r12        // cycle 0
ld8 r6 = [r8] ;;      // cycle 0
add r5 = r6, r7 ;;    // cycle 2
st8 [r18] = r5        // cycle 3
With data speculation:
ld8.a r6 = [r8] ;;    // cycle -2, advanced load
// other instructions
st8 [r4] = r12        // cycle 0
ld8.c r6 = [r8]       // cycle 0, check load
add r5 = r6, r7 ;;    // cycle 0
st8 [r18] = r5        // cycle 1
Note: the check load consults the Advanced Load Address Table for the entry made by the advanced load. It should be there; if another access has been made to that target address, the entry will have been removed and the load is redone.

Assembly Example – Data Speculation
Consider the following code with an additional data dependency:
Speculation:
ld8.a r6 = [r8] ;;    // cycle -2
// other instructions
st8 [r4] = r12        // cycle 0
ld8.c r6 = [r8]       // cycle 0
add r5 = r6, r7 ;;    // cycle 0
st8 [r18] = r5        // cycle 1
Speculation with data dependency:
ld8.a r6 = [r8] ;;    // cycle -3, advanced load
// other instructions
add r5 = r6, r7       // cycle -1, uses r6
st8 [r4] = r12        // cycle 0
chk.a r6, recover     // cycle 0, check
back:                 // return point
st8 [r18] = r5        // cycle 0
recover:
ld8 r6 = [r8] ;;      // get r6 from [r8]
add r5 = r6, r7 ;;    // re-execute
br back               // jump back

Software Pipelining
Consider a loop in which y[i] = x[i] + c:
L1: ld4 r4 = [r5], 4 ;;   // cycle 0: load, post-increment by 4
    add r7 = r4, r9 ;;    // cycle 2: r9 holds c
    st4 [r6] = r7, 4      // cycle 3: store, post-increment by 4
    br.cloop L1 ;;        // cycle 3
Adds a constant to one vector and stores the result in another.
There is no opportunity for instruction-level parallelism within one iteration: the instructions of iteration x are all executed before iteration x+1 begins.

IA-64 Register Set (recall)

Pipeline – Unrolled Loop, Pipeline Display
Original loop:
L1: ld4 r4 = [r5], 4 ;;   // cycle 0: load, post-increment by 4
    add r7 = r4, r9 ;;    // cycle 2
    st4 [r6] = r7, 4      // cycle 3: store, post-increment by 4
    br.cloop L1 ;;        // cycle 3
Pipeline display (unrolled):
ld4 r32 = [r5], 4 ;;   // cycle 0
ld4 r33 = [r5], 4 ;;   // cycle 1
ld4 r34 = [r5], 4      // cycle 2
add r36 = r32, r9 ;;   // cycle 2
ld4 r35 = [r5], 4      // cycle 3
add r37 = r33, r9      // cycle 3
st4 [r6] = r36, 4 ;;   // cycle 3
ld4 r36 = [r5], 4      // cycle 3
add r38 = r34, r9      // cycle 4
st4 [r6] = r37, 4 ;;   // cycle 4
add r39 = r35, r9      // cycle 5
st4 [r6] = r38, 4 ;;   // cycle 5
add r40 = r36, r9      // cycle 6
st4 [r6] = r39, 4 ;;   // cycle 6
st4 [r6] = r40, 4 ;;   // cycle 7

Mechanism for “Unrolling” Loops
Automatic register renaming: r32–r127, fr32–fr127, and pr16–pr63 are capable of rotation, giving automatic renaming of registers.
Predication of loops: each instruction in a given loop is predicated; during the prolog, each cycle one additional stage predicate becomes true; during the kernel, all n stage predicates are true; during the epilog, each cycle one additional predicate is made false.
Special loop-terminating branch instructions: the loop count (LC) and epilog count (EC) are used to determine when the loop is complete and the process stops.
A rough sketch of a rotating-register version of the y[i] = x[i] + c loop is shown below.
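This sketch assumes a two-cycle load latency and one new iteration started per cycle; the stage assignments, register choices, and counts are illustrative only, not taken from the slides:
      mov ar.lc = 4           // LC = iteration count - 1 (5 iterations, as in the example)
      mov ar.ec = 4           // EC = number of pipeline stages, to drain the epilog
      mov pr.rot = 0x10000 ;; // set p16 = 1, clear the other rotating predicates
L1:
(p16) ld4 r32 = [r5], 4       // stage 1: load x[i], post-increment the pointer
(p18) add r35 = r34, r9       // stage 3: the value loaded two rotations ago is now r34; add c
(p19) st4 [r6] = r36, 4       // stage 4: the sum has rotated to r36; store y[i]
      br.ctop.sptk.few L1 ;;  // rotate registers and predicates, count down LC then EC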

Unrolled Loop Example – Observations
Completes 5 iterations in 7 cycles, compared with 20 cycles for the original code.
Assumes two memory ports, so a load and a store can be done in parallel.

IA-64 Register Stack
The register stack mechanism avoids unnecessary movement of register data during procedure call and return (r32–r127 are the stacked registers).
The numbers of local and pass/return registers are specifiable.
The register renaming allows locals to become hidden and pass/return registers to become local on a call, and to be changed back on a return.
If the stacking mechanism runs out of registers, the oldest frames are spilled to memory.
A rough sketch of a callee allocating its own frame appears below.
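Assuming a simple leaf routine that takes two stacked arguments and returns their sum; the frame sizes, register numbers, and label are invented for illustration:
callee:
    alloc r34 = ar.pfs, 2, 2, 1, 0   // frame: 2 inputs (r32-r33), 2 locals (r34-r35), 1 output (r36); r34 saves the old frame marker
    add   r35 = r32, r33 ;;          // work in a local register
    mov   r8  = r35                  // r8 is the conventional integer return-value register
    mov   ar.pfs = r34 ;;            // restore the caller's frame marker
    br.ret.sptk.many b0              // return; the caller again sees its own stacked registers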

Basic Concepts for IA-64
Instruction-level parallelism made EXPLICIT in the machine instructions, rather than determined at run time by the processor
Long or very long instruction words (LIW/VLIW): fetch bigger chunks that are already “preprocessed”
Predicated execution: marking groups of instructions for a late decision on whether to “execute” them
Control speculation: go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can practically be made later
Data speculation (or speculative loading): go ahead and load data early so it is ready when needed, with a practical way to recover if the speculation proves wrong
Software pipelining: multiple iterations of a loop can be executed in parallel
“Revolvable” register stack: stack frames are programmable and are used to reduce unnecessary movement of data on procedure calls