Microprocessors: Introduction to ia64 Architecture, Jan 31st, 2002. General Principles.


Instruction Level Parallelism
- Certain instructions can be executed in parallel
- Certain instructions can be executed in any order
- Both of these stem from a lack of dependency between instructions
- The goal of the ia64 design: exploit ILP more effectively
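
Since everything that follows hinges on this notion of independence, here is a tiny C sketch of what "no dependency" looks like in practice (the variables are invented for illustration, not taken from the slides):

    /* Two independent statements: neither reads or writes anything the
     * other touches, so they could run in parallel or in either order. */
    int independent(int x, int y) {
        int a = x + 1;
        int b = y * 2;
        return a + b;
    }

    /* A dependent pair: the second statement reads a, which the first one
     * writes, so their order is fixed and they cannot overlap. */
    int dependent(int x) {
        int a = x + 1;
        int b = a * 2;
        return b;
    }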

EPIC: Explicitly Parallel Instruction Computing
- Conventional RISC: the processor discovers and exploits ILP
- Conventional VLIW: the programmer knows the precise execution model and explicitly lays out the program to take advantage of ILP
- EPIC: the programmer indicates possible ILP; the processor does the rest of the job

The ia64 Architecture
- Instructions are bundled in packets of 3
- Packet length is 128 bits
  - Three 41-bit instructions in each packet
  - 5 bits of scheduling information
- The scheduling information indicates
  - What functional units are required for each instruction in the packet
  - Which instructions can be executed in parallel
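
To make the numbers concrete, here is a small C sketch of one plausible layout of such a 128-bit packet: 3 x 41 = 123 bits of instruction slots plus 5 template bits. The exact bit positions, and all struct and function names, are assumptions made for illustration, not a statement of the official encoding.

    #include <stdint.h>

    /* Hypothetical layout: a 128-bit packet = 5 template bits + 3 x 41-bit
     * instruction slots. Bit positions are an assumption for illustration. */
    #define SLOT_BITS     41
    #define TEMPLATE_BITS  5

    typedef struct {
        uint64_t lo;   /* bits  0..63  of the packet */
        uint64_t hi;   /* bits 64..127 of the packet */
    } ia64_packet;

    /* Extract the 5-bit template ("scheduling information"), assumed to
     * live in bits 0..4. */
    static inline unsigned packet_template(ia64_packet p) {
        return (unsigned)(p.lo & ((1u << TEMPLATE_BITS) - 1u));
    }

    /* Extract instruction slot n (0..2), assumed to start at bit 5. */
    static inline uint64_t packet_slot(ia64_packet p, int n) {
        int start = TEMPLATE_BITS + n * SLOT_BITS;          /* 5, 46 or 87 */
        uint64_t raw;
        if (start + SLOT_BITS <= 64)
            raw = p.lo >> start;                            /* slot 0 */
        else if (start >= 64)
            raw = p.hi >> (start - 64);                     /* slot 2 */
        else
            raw = (p.lo >> start) | (p.hi << (64 - start)); /* slot 1 */
        return raw & ((UINT64_C(1) << SLOT_BITS) - 1);
    }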

Instruction Bundles
- An instruction bundle is a group of instructions that can be executed in parallel
- No read-after-write dependencies
  - That's where one instruction writes a value to memory or a register that is read by another instruction
- No write-after-write dependencies
  - That's where two instructions write the same register or location in memory
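
A minimal C illustration of the two kinds of dependence the slide rules out (the variables stand in for registers and are invented):

    void hazards(int *out) {
        int r1, r2, r3;

        /* Read after write (RAW): the second line reads r1, which the first
         * line writes, so the two cannot go in the same bundle. */
        r1 = 10;
        r2 = r1 + 1;

        /* Write after write (WAW): both lines write r3; reordering them (or
         * issuing them together) would leave the wrong final value in r3. */
        r3 = 20;
        r3 = 30;

        *out = r2 + r3;   /* keep the results observable: 11 + 30 = 41 */
    }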

More on Bundles
- The scheduling bits indicate the length of a particular instruction bundle
- At one extreme, one instruction per bundle: no parallelism. Works, but slow!
- At the other extreme, packets can be joined together to make bundles of arbitrary length
- The compiler is supposed to construct bundles as big as possible, but does not otherwise have to worry about latencies for correctness

Bundles and MP Versions
- Versions of the ia64 implementation may differ in their ability to execute instructions in parallel
- If a bundle is larger than what the implementation can handle, the processor simply breaks it up into pieces that are executed sequentially
- Unlike VLIW, or even conventional RISC, there is no need to recompile for new versions of the processor

Bundles and Jumps
- A jump can dynamically end a bundle
  - The first jump that is taken ends the bundle dynamically
- So it is permissible to have multiple jumps in one bundle; the processor takes care of this

The Compiler and Bundles
- The compiler needs to analyze the code for ILP in order to construct the largest possible bundles
- In some cases, this may entail predication, trace scheduling, speculative execution, etc.
- These can all be done as much as the compiler wants, but are not required

Speculative Execution: Predication
- All instructions are predicated
- Large number of predicate registers
- An instruction takes effect only if its predicate is true
- Allows larger bundles
  - For example, all instructions of both the then and else branches of an IF statement can go in a single bundle, with only the relevant branch actually taking effect
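
The following C sketch shows the if-conversion idea behind predication (it is ordinary C, not IA-64 code, and the function names are made up): both arms are evaluated, and the condition merely selects which result takes effect, which is roughly what per-instruction predicates let the hardware do.

    /* Branching form: control flow separates the two arms, so the compiler
     * cannot put their instructions in one bundle. */
    int with_branch(int cond, int x, int y) {
        int result;
        if (cond)
            result = x + 1;   /* "then" arm */
        else
            result = y - 1;   /* "else" arm */
        return result;
    }

    /* Predicated form: both arms are computed unconditionally and the
     * condition only selects which result takes effect. */
    int if_converted(int cond, int x, int y) {
        int then_val = x + 1;              /* always computed            */
        int else_val = y - 1;              /* always computed            */
        return cond ? then_val : else_val; /* "predicate" picks a result */
    }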

Speculative Execution: Propagation
- If instructions are executed speculatively, i.e. before you know whether they should be executed at all, some instruction may produce a garbage value (e.g. a divide by zero)
- We don't want a trap, since we may find out in a moment that the whole speculative thread should be discarded
- Therefore, an indication of a bad value ("not a value") must be propagated silently
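
As a rough software analogy (this is not the hardware mechanism; the types and names are invented), the "not a value" idea can be pictured as a poison flag that rides along with each speculative result and is propagated instead of trapping:

    #include <stdbool.h>

    /* A speculative result carries a poison flag (the slide's "not a
     * value"); consuming a poisoned input, or hitting a would-be trap,
     * silently produces a poisoned output rather than trapping. */
    typedef struct {
        long value;
        bool nat;                               /* "not a value" marker      */
    } spec_value;

    static spec_value spec_div(spec_value a, spec_value b) {
        spec_value r = { 0, a.nat || b.nat };   /* propagate existing poison */
        if (!r.nat && b.value == 0)
            r.nat = true;                       /* would trap: mark instead  */
        if (!r.nat)
            r.value = a.value / b.value;
        return r;
    }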

Speculative Execution: Loads
- Loads can cause pipeline stalls
  - Therefore you want to issue them early
  - But there is danger in moving them across stores
- So there is a load-predict instruction
  - "Please load this value, I think I will need it"
- And a load-confirm instruction
  - "OK, now I want that value; check that no one stored to that location since my load-predict. If someone did, too bad, you will have to go load it now."
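
Here is a sketch, in plain C rather than IA-64 assembly, of the transformation the load-predict/load-confirm pair enables; the function and variable names are invented for illustration.

    /* Original order: the load of *src cannot be hoisted above the store to
     * *dst, because dst and src might point to the same location. */
    long original(long *dst, long *src, long v) {
        *dst = v;
        return *src + 1;
    }

    /* Speculated order: load early ("load-predict"), do the store, then
     * check whether the store hit the loaded address ("load-confirm") and
     * redo the load only in that unlucky case. */
    long speculated(long *dst, long *src, long v) {
        long early = *src;   /* load-predict: hoisted above the store  */
        *dst = v;            /* the possibly conflicting store         */
        if (dst == src)      /* load-confirm: did anyone store there?  */
            early = *src;    /* yes: recovery, reload the value        */
        return early + 1;
    }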

Lots and Lots of Registers
- The ia64 has hundreds of user-level registers
- It is easier to do speculative execution in registers
- As usual, we hate loads, so avoid them
- Instructions are not limited to 32 bits, so we can afford long register-identifier fields

Register Windows
- Register windows are provided
- Like the SPARC, except that you can say how much to move the window by
- Overlap between caller and callee is possible, as on the SPARC
- But if you only need a few registers, you don't have to consume a large fixed chunk of registers
- (An old idea: the AMD29K had a similar design)
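
A toy model in C of the variable-size window idea (this is not the real IA-64 register-stack mechanism; all sizes and names are invented): the window slides by a caller-chosen amount, and the caller's output registers overlap the callee's first registers.

    #include <stdio.h>

    enum { NPHYS = 64 };          /* size of the physical register file */
    static long phys_regs[NPHYS]; /* backing storage                    */
    static int  base = 0;         /* start of the current window        */

    /* Architectural register r of the current frame. */
    static long *reg(int r) { return &phys_regs[base + r]; }

    int main(void) {
        /* The caller uses a window of 6 registers, of which the last 2
         * (r4, r5) are its outputs. */
        int caller_size = 6, caller_outs = 2;
        *reg(4) = 111;
        *reg(5) = 222;

        /* Call: slide the window by a caller-chosen amount so the callee's
         * r0, r1 are the caller's r4, r5 (the SPARC-style overlap); the
         * callee only occupies the few registers it actually needs. */
        int saved_base = base;
        base += caller_size - caller_outs;

        printf("callee sees %ld %ld\n", *reg(0), *reg(1));  /* 111 222        */
        *reg(2) = *reg(0) + *reg(1);                        /* a callee local */

        /* Return: restore the caller's window. */
        base = saved_base;
        printf("caller's r4 is still %ld\n", *reg(4));      /* 111            */
        return 0;
    }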

Efficient Code for Loops
- Suppose we have a loop whose body has the form
  - Load value
  - Add some constant to that value
  - Store result
- That's nasty for dependencies
  - We want space between the load and the add
  - And space between the add and the store
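
In C, the loop the slide has in mind might look like this (names invented); each iteration is a straight dependence chain, load then add then store:

    /* Each iteration is a pure dependence chain: the add needs the load,
     * and the store needs the add. */
    void add_constant(long *a, long n, long k) {
        for (long i = 0; i < n; i++) {
            long v = a[i];   /* load value                      */
            v = v + k;       /* add some constant to that value */
            a[i] = v;        /* store result                    */
        }
    }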

Loop Unrolling and Software Pipelining
- If we unroll several iterations of the loop, we can be doing the add of a previous iteration while loading the next
- Generates much more code
- Requires complex prolog (get things started) and epilog (finish things off) code
- In practice, hard to apply in all cases
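
As a concrete, hand-written illustration (not compiler output), here is what a software-pipelined version of the loop above could look like in C: in the steady state each pass stores iteration i, adds for iteration i+1, and loads iteration i+2, so the dependent steps come from different iterations. Note the explicit prolog and epilog the slide warns about.

    /* Falls back to the plain loop when there are too few iterations to
     * pipeline. (Names invented for illustration.) */
    void add_constant_pipelined(long *a, long n, long k) {
        if (n < 3) {
            for (long i = 0; i < n; i++)
                a[i] += k;
            return;
        }

        /* Prolog: fill the pipeline. */
        long loaded = a[0];            /* iteration 0 loaded          */
        long added  = loaded + k;      /* iteration 0 added           */
        loaded = a[1];                 /* iteration 1 loaded          */

        /* Steady state: one store, one add and one load per pass, each
         * belonging to a different iteration, so no statement uses a
         * result produced earlier in the same pass. */
        for (long i = 0; i < n - 2; i++) {
            a[i]   = added;            /* store iteration i           */
            added  = loaded + k;       /* add for iteration i + 1     */
            loaded = a[i + 2];         /* load iteration i + 2        */
        }

        /* Epilog: drain the pipeline. */
        a[n - 2] = added;
        a[n - 1] = loaded + k;
    }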

Rotating Registers
- Suppose we generate this code for the loop
  - Load register R7 with the input value
  - Add the constant to register R8
  - Store register R9 to memory
- Certainly no dependencies
- But the code looks wrong and useless!
- How can we make the above make sense?

More on Rotating Registers
- Here is the code
  - Load register R7 with the input value
  - Add the constant to register R8
  - Store register R9 to memory
- Now renumber the registers on each loop iteration
  - Old R7 is new R8
  - Old R8 is new R9
  - Old R9 is new R7
- Aha! Magic, the generated code is OK!

More on Rotating Registers
- Limited subsets of registers can rotate
  - Giving the renumbering on the previous slide
- The loop instruction automatically triggers the rotation (a bit like register windows)
- Special prolog/epilog counts deal with the setup and cleanup cases
- Voila! Efficient loops without
  - Loop unrolling
  - Software pipelining
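
To see that the renaming really does turn that strange-looking code into a correctly pipelined loop, here is a toy C simulation of rotating registers applied to the load/add/store loop (invented names throughout; nothing here is real IA-64 code):

    #include <stdio.h>

    enum { NROT = 3 };            /* pretend only R7..R9 rotate                */
    static long phys[NROT];       /* physical storage behind the rotating regs */
    static int  rot = 0;          /* rotation offset, changed once per pass    */

    /* Map architectural name R7..R9 onto a physical slot. */
    static long *reg(int name) {
        return &phys[(name - 7 + rot) % NROT];
    }

    /* Rotate: what was R7 is now addressed as R8, R8 as R9, R9 as R7. */
    static void rotate(void) {
        rot = (rot + NROT - 1) % NROT;
    }

    int main(void) {
        long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        const long k = 10, n = 8;

        /* The body is always "store R9, add to R8, load R7"; the prolog and
         * epilog counts simply suppress the store/add during the first/last
         * couple of passes while the pipeline fills and drains. */
        for (long i = 0; i < n + 2; i++) {
            if (i >= 2)              a[i - 2] = *reg(9);   /* store result     */
            if (i >= 1 && i < n + 1) *reg(8) += k;         /* add the constant */
            if (i < n)               *reg(7) = a[i];       /* load next value  */
            rotate();
        }

        for (long i = 0; i < n; i++)
            printf("%ld ", a[i]);                          /* 11 12 ... 18     */
        printf("\n");
        return 0;
    }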

The Bottom Line
- The advantages of VLIW
  - Greater ILP exploitation
  - Simpler hardware
- Without the disadvantages
  - Code does not depend on the processor model
- But
  - We still depend on the compiler a whole lot!
- Next time: details of the ia64 architecture