Published by Noel Burns. Modified over 9 years ago.
1
Arun Hariharan (N.M.S.U)
2
MOTIVATION
  Need for high-speed computing and architecture
  More complex compilers (Java)
  Large database systems
  Distributed computing on the Internet
  Peer competition from other manufacturers
SOLUTION
  Instruction-level parallelism (ILP) in general-purpose microprocessors
  Wide floating-point exponents
  Register Stack Engine
  Hardware exception deferral
  Control speculation
  Register rotation
  Large register files
  Data speculation
  Predication
  Parallel semantics
3
GOALS OF ARCHITECTURE
Overcome performance limiters:
  Branches
  Memory latency
  Sequential program model
Long architectural life:
  Large register file
  Fully interlocked architecture, not tied to any particular design
  No fixed issue (e.g., instruction length)
4
REGISTER RESOURCES
  128 65-bit general registers (1 KB of data; 64 bits + 1 "NaT" bit each)
  128 82-bit floating-point registers
  Space for up to 128 64-bit special-purpose application registers (1 KB)
  Eight 64-bit branch registers for function call linkage and return
  64 one-bit predicate registers
7
INSTRUCTION ENCODING
Key words: long life, instruction bundle
Instruction (41 bits):
  Op code (the remaining 14 bits)
  Reg 1 (7 bits)
  Reg 2 (7 bits)
  Reg 3 (7 bits)
  Predicate (6 bits)
Template (5 bits per bundle):
  Helps to decode and route instructions
  Marks the end of a basic block
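The field widths above can be illustrated with a small decoder. This is a minimal sketch, not taken from the slides: it assumes a 128-bit bundle holding the 5-bit template in the lowest bits followed by three 41-bit instruction slots, and it assumes the fields sit in the slot from low to high as predicate, Reg 1, Reg 2, Reg 3, op code. The function names `split_bundle` and `split_instruction` are hypothetical.

```python
TEMPLATE_BITS = 5   # template width per bundle
SLOT_BITS = 41      # width of one instruction slot


def split_bundle(bundle: int):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(3):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        slots.append((bundle >> shift) & ((1 << SLOT_BITS) - 1))
    return template, slots


def split_instruction(slot: int):
    """Split one 41-bit slot into its fields (assumed low-to-high order)."""
    return {
        "predicate": slot & 0x3F,          # 6 bits
        "reg1": (slot >> 6) & 0x7F,        # 7 bits
        "reg2": (slot >> 13) & 0x7F,       # 7 bits
        "reg3": (slot >> 20) & 0x7F,       # 7 bits
        "opcode": (slot >> 27) & 0x3FFF,   # remaining 14 bits
    }
```

The widths add up as 5 + 3 × 41 = 128 for the bundle and 6 + 7 + 7 + 7 + 14 = 41 for one instruction.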
10
DISTRIBUTING RESPONSIBILITY
Shift a lot of the complexity to the compiler:
  ILP
  Out-of-order execution
  Control flow parallelism
  Influencing dynamic events: the hardware takes hints from the compiler about branch prediction, instruction/data caching, and prefetching
11
ILP: Instruction Level Parallelism
  Sequential in-order execution is not enough to achieve maximum parallelism
  Out-of-order execution: it is the compiler's task to create instruction groups such that all instructions in a group can safely be executed in parallel
  Key word: basic block
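To make the idea of instruction groups concrete, here is a hedged sketch of how a compiler pass might cut a straight-line sequence into groups. It is a simplification and not the algorithm of any real IA-64 compiler: instructions are modeled as hypothetical `(dest, sources)` tuples, and a new group starts whenever an instruction reads or writes a register that was already written in the current group.

```python
def form_groups(instrs):
    """Greedily split instructions into groups free of register conflicts.

    instrs: list of (dest, sources) tuples, in program order.
    Returns a list of groups, each a list of instruction indices.
    """
    groups, current, written = [], [], set()
    for idx, (dest, srcs) in enumerate(instrs):
        # A read-after-write or write-after-write conflict ends the group.
        if dest in written or any(s in written for s in srcs):
            groups.append(current)
            current, written = [], set()
        current.append(idx)
        written.add(dest)
    if current:
        groups.append(current)
    return groups


example = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new group
    ("r8", ("r2", "r5")),
]
```

For `example`, `form_groups` returns `[[0, 1], [2, 3]]`: the first two instructions are independent, and the third must wait for both of them.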
12
CONTROL FLOW PARALLELISM
Traditional execution:
  Compare a and 0; check the flag; if true, store the flag value for further computation
  Compare b <= 5; check the flag; if true, store the flag value for further computation
  Check whether either compare set the flag
  Move 8 to r3
In IA-64:
  Initialize p1 to false
  Set the compare conditions' prerequisite
  Compare in parallel
  Branch
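The IA-64 sequence can be modeled in a few lines. This sketch assumes the slide's example corresponds to `if (a == 0 || b <= 5) r3 = 8;`, which is an interpretation of the steps listed rather than something the slide states. The function name `parallel_or_compare` is hypothetical.

```python
def parallel_or_compare(a, b, r3):
    """Model of an OR-type parallel compare followed by a predicated move."""
    p1 = False                    # initialize p1 to false
    # Each compare can only set p1, never clear it, so the two writes
    # cannot conflict and may be issued in the same cycle, in any order.
    for result in (a == 0, b <= 5):
        if result:
            p1 = True
    if p1:                        # predicated: move 8 to r3 only if p1 is set
        r3 = 8
    return r3
```

Because neither compare depends on the outcome of the other, no intermediate flag has to be stored and tested between them.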
13
FINDING AND CREATING PARALLELISM
Branches limit ILP:
  Sequential, no prediction: normal bank teller
  Sequential, with prediction: fill out the slip in advance (predict whether deposit or withdrawal)
  Predicated execution: fill out both slips, throw away whichever is wrong
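The bank-teller analogy maps directly onto code. The sketch below contrasts a branching version with a predicated one that computes both "slips" and keeps only the one whose predicate is true. The function names and the account example are illustrative assumptions, not part of the slides.

```python
def branchy(balance, amount, is_deposit):
    """Conventional control flow: the branch decides which path runs."""
    if is_deposit:
        return balance + amount
    return balance - amount


def predicated(balance, amount, is_deposit):
    """Predicated execution: both paths run, one result is kept."""
    p1 = is_deposit               # a compare sets two complementary predicates
    p2 = not is_deposit
    deposit_slip = balance + amount       # guarded by p1
    withdrawal_slip = balance - amount    # guarded by p2
    # Only the result whose predicate is true takes effect.
    return int(p1) * deposit_slip + int(p2) * withdrawal_slip
```

Both functions return the same value for every input; the predicated form trades a little extra work for the absence of a branch that could be mispredicted.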
14
FINDING AND CREATING PARALLELISM (cont.)
Scheduling and speculation:
  Moving basic blocks ahead of barriers: it is the compiler's task, rather than the processor's, to find a possible route and schedule it
  Use of basic blocks (define)
  Best possible route: the most likely flow of the program (speculation); not all instructions are executed
  Compilers have a bird's-eye view of the program, unlike the processor
15
CONTROL SPECULATION
  Removing branches is expensive, and not all branches can be removed
  Moving basic blocks can cause exceptions
  Key word: fix-up code
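A toy model may help show how a load can be moved above a branch without raising an exception early. In this sketch, which rests on simplifying assumptions, a speculative load that would fault returns a `NAT` marker (standing in for the NaT bit) instead of raising, and a later check runs fix-up code only if the value is actually needed. Memory is modeled as a dictionary, and all names are hypothetical.

```python
NAT = object()   # stands in for the NaT ("Not a Thing") bit


def speculative_load(memory, addr):
    """Speculative load: defer a fault by returning NAT instead of raising."""
    return memory.get(addr, NAT)


def check(value, fixup):
    """Speculation check: run the fix-up code only if the load was deferred."""
    return fixup() if value is NAT else value


def load_plus_one(memory, ptr):
    hoisted = speculative_load(memory, ptr)   # moved above the branch
    if ptr is None:
        return 0                              # hoisted value is never used
    # Fix-up code repeats the load non-speculatively, so a real fault
    # surfaces here, at the point where the original program would fault.
    return check(hoisted, lambda: memory[ptr]) + 1
```

With `memory = {100: 41}`, `load_plus_one(memory, 100)` returns 42, `load_plus_one(memory, None)` returns 0 without any exception, and `load_plus_one(memory, 999)` raises `KeyError` only when the value is actually consumed.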
16
DATA SPECULATION
  ALAT: Advanced Load Address Table
  Key word: fix-up code
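The role of the ALAT can be sketched as follows. This is a simplified model under stated assumptions: an advanced load records its address in a table keyed by target register, every store invalidates entries with a matching address, and a check reloads the value (the fix-up) if the entry has disappeared. The class and method names are hypothetical, and real hardware tracks more detail, such as access size.

```python
class ALAT:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, memory, addr):
        """Load early, above possibly conflicting stores, and record the address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Perform a store and invalidate any entry for the same address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg, value, memory, addr):
        """Keep the speculative value if the entry survived, else reload."""
        if reg in self.entries:
            return value
        return memory[addr]   # fix-up: the early load was stale
```

If a store hits a different address, the early value is kept at no cost; if it hits the same address, the check notices the missing entry and the fix-up path supplies the fresh value.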
17
REGISTER MODEL
  128 64-bit registers, of which 32 are fixed for µP operations (as in RISC) and 96 are free for the compiler to use
  Seemingly unlimited register use is possible, because registers are paged to memory in the background by the RSE (Register Stack Engine)
  "alloc" specifies the number of registers for locals and outputs (outputs carry parameters to calls)
  The program's registers are renamed so that its frame starts at register 32 (range 32 to 127)
18
RSE (Register Stack Engine)
  Automatically saves/restores stack registers without software intervention (can work synchronously)
  Provides the illusion of infinite physical registers by mapping the stack of physical registers to memory
  Overflow: alloc needs more registers than are available
  Underflow: return needs to restore a frame saved in memory
  The RSE may be designed to use otherwise unused memory bandwidth to perform register spill and fill operations in the background (asynchronously, loading and storing data speculatively)
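Overflow and underflow can be illustrated with a counting model. This sketch makes strong simplifying assumptions: it tracks only how many registers are in use, not their contents; it assumes 96 stacked physical registers; and it spills from the oldest frames first. The class name and its methods are hypothetical and do not describe how any particular processor implements the RSE.

```python
class RegisterStackEngine:
    """Toy RSE that counts registers only and ignores register contents."""

    def __init__(self, physical=96):
        self.physical = physical   # stacked physical registers available
        self.frames = []           # frame sizes, oldest first
        self.in_memory = 0         # registers of the oldest frames spilled to memory

    def resident(self):
        """Number of stacked registers currently held in the register file."""
        return sum(self.frames) - self.in_memory

    def alloc(self, size):
        """Open a new frame; return how many registers had to be spilled."""
        if size > self.physical:
            raise ValueError("frame larger than the physical register stack")
        self.frames.append(size)
        overflow = max(0, self.resident() - self.physical)
        self.in_memory += overflow
        return overflow

    def ret(self):
        """Drop the current frame; return how many registers had to be filled."""
        self.frames.pop()
        if not self.frames:
            return 0
        underflow = max(0, self.frames[-1] - self.resident())
        self.in_memory -= underflow
        return underflow
```

With three nested calls that each allocate 40 registers, the third `alloc` overflows and spills 24 registers (120 needed, 96 available). The first return needs no fill, and the second return underflows and fills those 24 registers back.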
19
SOFTWARE PIPELINE
  Time complexity is expressed in O(n) notation
  This notation is used to count the time spent in loops, because loops take up most of the execution time
  Time complexity is calculated by ____ ?
  Can we implement loops in parallel? Answer: yes, if we resolve some problems:
    Managing the loop count
    Handling the renaming of registers for the pipeline
    Finishing the work in progress when the loop ends
    Starting the pipeline when the loop is entered
    Unrolling to expose cross-iteration parallelism
IA-64 solution: special architectural support
  Loop count (LC)
  Epilog count (EC)
  Use of the register rename base (rrb)
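The interplay of loop count, epilog count, and register rotation can be sketched with a small simulation. The following is an illustrative model, not a description of the exact hardware behavior: it assumes a three-stage loop body (load, add, store), models rotating registers and stage predicates as deques that shift by one position per pass, and uses LC to count the iterations still to be started and EC to count the passes needed to drain the pipeline. The function name `pipelined_add` is hypothetical.

```python
from collections import deque


def pipelined_add(src, addend):
    """Add `addend` to every element using a 3-stage software pipeline.

    Returns (results, number_of_kernel_passes).
    """
    if not src:
        return [], 0
    stages = 3
    lc = len(src) - 1                      # loop count: iterations still to start
    ec = stages                            # epilog count: passes left to drain
    preds = deque([True, False, False])    # stage predicates
    regs = deque([None, None, None])       # rotating data registers
    out, load_idx, passes = [], 0, 0
    while True:
        passes += 1
        # One kernel pass: every enabled stage works on a different iteration.
        if preds[2]:
            out.append(regs[2])                    # stage 3: store
        if preds[1]:
            regs[1] = regs[1] + addend             # stage 2: add
        if preds[0]:
            regs[0] = src[load_idx]                # stage 1: load
            load_idx += 1
        # Loop branch: start a new iteration while LC > 0, then drain with EC.
        if lc > 0:
            lc -= 1
            start_new = True
        else:
            ec -= 1
            start_new = False
            if ec == 0:
                break
        # Rotation renames registers: what was register 0 is now register 1.
        regs.rotate(1)
        regs[0] = None
        preds.rotate(1)
        preds[0] = start_new
    return out, passes
```

For `pipelined_add([1, 2, 3], 10)` the result is `([11, 12, 13], 5)`: three iterations finish in 3 + 3 - 1 = 5 kernel passes, instead of the 9 stage executions a fully sequential loop would perform one after another.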
21
SUMMARY
  Synergy: ILP by compiler and hardware
  Data and control speculation
  Multi-chip and multi-processing
  EPIC: Explicit Parallel Instruction Computing
22
“RISC architectures claim to match many of the features of IA-64 with similar sounding instructions. However, just like a tank formed by bolting weapons and armor to an old truck, the benefits are limited to specific conditions, but fall short in the heat of battle.”

Existing RISC architectures that use ‘cmoves’ and similar instructions may remove branches, but at the cost of adding so many instructions that the benefits are nearly outweighed by the code-bloat (hardly worth the trade-off). The reason why ILP works with IA-64 is the use of completely new architectural constructs such as predicates that are not available to any existing RISC architecture.

Traditional RISC architectures can use a ‘non-faulting load’ to avoid costly error handling when loading data ahead of time which may not be valid. But if you want to turn off the errors, why have errors in the first place? Traditional RISC architectures face one of two alternatives: add extra error-checking code which, once again, cancels out the performance benefit of speculative execution; or ‘work without a net,’ risking disastrous undetected errors due to turning off the error messages. IA-64 gets around both problems by offering a novel architectural approach to dealing with errors when loading data.

Source: RISC vs. IA-64, whitepaper by Intel & HP (1999)
23
Benchmark comparison
24
BACKWARD COMPATIBILITY
Intel promises compatibility with 32-bit software (IA-32). It should be possible to run software in real mode (16 bits), protected mode (32 bits), and virtual 8086 mode (16 bits).
26
Questions?
REFERENCES
1. Ricardo Zelenovsky and Alexandre Mendonca, “Intel 64-bit Architecture”, 2001
2. Bruce Jacob, “The IA-64 Architecture”, University of Maryland (College Park)
3. Whitepaper, “IA-64 Architecture Innovations”, HP & Intel, 1999
4. Carole Dulong et al., “An overview of Intel IA-64 Compiler”
5. M. F. Guest, “Intel’s Itanium IA-64 Processor: Overview and Initial Experience”, CLRC Daresbury Laboratory