CSE 586 Computer Architecture Lecture 4
Jean-Loup Baer CSE 586 Spring 00
Highlights from last week
- ILP: where can the compiler optimize
  - Loop unrolling and software pipelining
  - Speculative execution (we'll see predication today)
- ILP: dynamic scheduling in a single-issue machine
  - Scoreboard
  - Tomasulo's algorithm
Highlights from last week (c'ed) -- Scoreboard
- The scoreboard keeps a record of all data dependencies
- The scoreboard keeps a record of all functional-unit occupancies
- The scoreboard decides if an instruction can be issued
- The scoreboard decides if an instruction can store its result
- Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them
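The bookkeeping above can be sketched in a few lines of Python. This is a toy model (not the actual CDC 6600 tables): the issue test stalls on a structural hazard or a WAW hazard; RAW hazards are handled later, at operand read.

```python
# Toy scoreboard state: which functional units are busy, and which
# unit will produce each pending destination register.
def can_issue(fu, dst, fu_busy, result_of):
    """Issue test: stall on a structural hazard (unit busy) or a
    WAW hazard (an earlier instruction already writes dst)."""
    return not fu_busy.get(fu) and dst not in result_of

def issue(fu, dst, fu_busy, result_of):
    fu_busy[fu] = True
    result_of[dst] = fu   # later readers of dst wait on this unit
```

For example, after issuing a multiply writing F0 on unit "mult1", a second multiply stalls (unit busy) and any instruction writing F0 stalls (WAW), while an add writing F2 issues freely.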
Highlights from last week (c'ed) -- Tomasulo's algorithm
- Decentralized control
- Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
- Results (and their names) are broadcast to reservation stations and the register file
- Instructions are issued in order but can be dispatched, executed and completed out of order
Highlights from last week (c'ed)
- Register renaming: avoids WAW and WAR hazards
- Performed at "decode" time to rename the result register
- Two basic implementation schemes:
  - Have a separate physical register file
  - Use a reorder buffer and reservation stations (cf. the extended Tomasulo implementation)
- Often a mix of the two (cf. description of MIPS in Smith and Sohi)
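A minimal renamer sketch in Python (the register names, free-list policy and tuple encoding are all hypothetical, for illustration only): each destination gets a fresh physical register at decode, so reuse of an architectural name no longer creates WAW/WAR hazards, and sources read the latest mapping.

```python
def rename(instrs, n_phys=64):
    """Rename (op, dst, srcs) tuples; toy model of decode-time renaming."""
    free = list(range(n_phys))   # free physical registers
    amap = {}                    # architectural name -> physical register
    out = []
    for op, dst, srcs in instrs:
        # Sources read the current mapping (allocate on first use).
        srcs = tuple(amap.setdefault(s, free.pop(0)) for s in srcs)
        amap[dst] = free.pop(0)  # fresh name: kills WAW/WAR on dst
        out.append((op, amap[dst], srcs))
    return out
```

Two successive writes to the same architectural register (a WAW hazard before renaming) end up with distinct physical destinations and can complete in any order.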
Multiple Issue Alternatives
- Superscalar (hardware detects conflicts)
  - Statically scheduled (in-order dispatch and hence execution; cf. DEC Alpha 21164)
  - Dynamically scheduled (in-order issue, out-of-order dispatch and execution; cf. MIPS R10000, IBM PowerPC 620 and Intel Pentium Pro)
- VLIW -- EPIC (Explicitly Parallel Instruction Computing)
  - Compiler generates "bundles" of instructions that can be executed concurrently (cf. Intel Merced -- Itanium)
Multiple Issue for Static/Dynamic Scheduling
- Issue in order
  - Otherwise bookkeeping is too complex (the old "data flow" machines could issue any ready instruction in the whole program)
  - Check for structural hazards; if any, stall
- Dispatch for static scheduling
  - Check for data dependencies; stall adequately
  - Can take forwarding into account
- Dispatch for dynamic scheduling
  - Dispatch out of order (reservation stations, instruction window)
  - Requires the possibility of dispatching dependent instructions concurrently (otherwise little benefit over static scheduling)
Impact of Multiple Issue on IF
- IF: need to fetch more than one instruction at a time
  - Simpler if instructions are of fixed length
  - In fact, need to fetch as many instructions as the issue stage can handle in one cycle (otherwise the issue stage will stall)
  - Simpler if restricted not to overlap I-cache lines, but with branch prediction and superblocks this is not realistic; hence the introduction of (instruction) fetch buffers
- Always attempt to keep at least as many instructions in the fetch buffer as can be issued in the next cycle (BTBs help for that)
  - For example, have an 8-wide instruction buffer for a machine that can issue 4 instructions per cycle
Stalls at the IF Stage
- Instruction buffer is full
  - Most likely there are stalls in the stages downstream
- Branch misprediction
- Instructions are stored in several I-cache lines
  - In one cycle, one I-cache line can be brought into the fetch buffer
  - A basic block might start in the middle (or end) of an I-cache line
  - Filling the buffer may therefore require several cache lines
  - The ID (issue-dispatch) stage will stall if there are not enough instructions in the fetch buffer
- Instruction cache miss
Sample of Current Micros
- Two-instruction issue: Alpha 21064, Sparc 2, Pentium, Cyrix
- Three-instruction issue: Pentium Pro (but 5 uops from IF/ID to EX; AMD has 4 uops)
- Four-instruction issue: Alpha 21164, Alpha 21264, PowerPC 620, Sun UltraSparc, HP PA-8000, MIPS R10000
- Many papers predicted 16-way issue by now; we are still at 4!
The Decode Stage (simple case: dual issue and static scheduling)
- ID = Issue + Dispatch
- Look for conflicts between the (say) 2 instructions
  - If one integer unit and one f-p unit, only check for structural hazards, i.e., whether the two instructions need the same f-u (easy to check with opcodes)
  - Slight difficulty for integer ops that are f-p load/store/move (potential multiple accesses to the f-p register file; solution: provide additional ports)
- RAW dependencies resolved as in single pipelines
  - Note that the load delay (assume 1 cycle) can now delay up to 3 instructions, i.e., 3 issue slots are lost
Decode in Simple Multiple Issue Case
- If instructions i and i+1 are fetched together and:
  - Instruction i stalls, then instruction i+1 will stall
  - Instruction i is dispatched but instruction i+1 stalls (e.g., because of a structural hazard: they need the same f-u), then instruction i+2 will not advance to the issue stage; it will have to wait till both i and i+1 have been dispatched
Alpha pipelines (figure)
- [Pipeline diagram: Fetch, Swap, Decode, Issue stages feeding integer, load-store and FP pipes; the Alpha 21064 is 2-way issue, the Alpha 21164 is 4-way issue (more pipes)]
- Four stages are common to all instructions; once past the 4th stage, no stalls
- Branch prediction occurs during the "swap" stage
- Structural and data hazards are checked during issue
- In blue in the figure, a subset of the 38 bypasses (forwarding paths)
Alpha 21064
- IF (S0): access I-cache; the prefetcher fetches 2 instructions (8 bytes) at a time
- Swap stage (S1): the prefetcher contains branch prediction logic tested at this stage: a 4-entry return stack; 1 bit/instruction in the I-cache + static prediction BTFNT
  - Initial decode yields a potential issue of 0, 1 or 2 instructions; align instructions depending on the functional unit they are headed for
- End of decode (S2): check for WAW and WAR (my guess)
Alpha 21064 (c'ed)
- Instruction issue (S3): check for RAW; forwarding etc.
- Conditions for 2-instruction issue (S2 and S3):
  - The first instruction must be able to issue (in-order execution)
  - A load/store can issue with an operate, except that stores cannot issue with an operate of a different format (they share the same result bus)
  - An integer op can issue with a f-p op
  - A branch can issue with a load/store/operate (but not with stores of the same format)
Alpha 21164
- Main differences with the 21064 (besides caches):
  - Up to 4 instructions issued/cycle
  - Two integer units; two f-p units (one add, one multiply; divide can be concurrent with add)
  - Slightly different execution pipe organizations
- Still a common trunk of 4 stages
  - S0: access I-cache. The instructions are predecoded (determination of whether the instruction is a branch, used in S1, and of the pipeline executing the instruction, used in S2)
  - S1: branch prediction (2-bit saturating counters in the I-cache associated with each instruction). Buffer 4 instructions for the next stage
Alpha 21164 (c'ed)
- Still a common trunk of 4 stages (c'ed)
  - S2: slot-swap instructions so that they are headed for the right pipeline. On functional-unit conflicts, stall previous stages until all four are gone
  - S3: check for WAW hazards. Read the integer register file. Stall if results are not ready
Pentium
- Recall: dual integer pipelines and a f-p pipeline
- Decode 2 consecutive instructions I1 and I2. If both are intended for the integer pipes, issue both iff:
  - I1 and I2 are "simple" instructions (no microcode)
  - I1 is not a jump instruction
  - No WAR and WAW hazards between I1 and I2 (I1 precedes I2)
The Decode Stage (dynamic scheduling)
- Decode means:
  - Dispatch to either:
    - A FIFO queue associated with each functional unit (not done any more)
    - A centralized instruction window common to all functional units (Pentium Pro and Pentium III -- I think)
    - Reservation stations associated with functional units (MIPS R10000, AMD K5, IBM PowerPC 620)
  - Rename registers (if supported by the architecture)
  - Set up an entry at the tail of the reorder buffer (if supported by the architecture)
  - Issue operands, when ready, to the functional unit
Stalls in the Decode (issue/dispatch) Stage
- There can be several instructions ready to be dispatched to the same functional unit in the same cycle
- There might not be enough buses/ports to forward values to all the reservation stations that need them in the same cycle
The Execute Stage
- Use of forwarding in the case of static scheduling
- Use of a broadcast bus and reservation stations for dynamic scheduling
- We'll talk at length about memory operations (load-store) when we study memory hierarchies
The Commit Step (in-order completion)
- Recall: need a mechanism (reorder buffer) to:
  - "Complete" instructions in order. This commits the instruction. Since this is a multiple-issue machine, it should be able to commit (retire) several instructions per cycle
  - Know when an instruction has completed non-speculatively, i.e., what to do with branches
  - Know whether the result of an instruction is correct, i.e., what to do with exceptions
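In-order commit from a reorder buffer can be sketched as follows (a toy model: the ROB is a list of dicts and the commit width is a hypothetical parameter): retire from the head only, stop at the first instruction that has not completed, and retire at most `width` instructions per cycle.

```python
def commit(rob, width=4):
    """Retire completed instructions from the ROB head, in order."""
    retired = []
    while rob and rob[0]["done"] and len(retired) < width:
        retired.append(rob.pop(0)["dst"])
    return retired
```

Note that an instruction that finished early still waits behind an older, unfinished one; this is exactly what keeps branches and exceptions precise.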
PowerPC 620 (see figure in book)
- Issue stage: up to 4 instructions issued/cycle, except on structural hazards such as:
  - No reservation station available
  - No rename register available
  - Reorder buffer is full
  - Two operations (e.g., forwarding operands) for the same unit; only one write port per set of reservation stations for each unit
  - Miscellaneous structural hazards, e.g., too many concurrent reads to the register file
- The first 3 are due to "the program", the last 2 to the implementation
Pentium Pro
- Fetch-decode unit
  - Transforms instructions into micro-operations (uops) and stores them in a global reservation table (instruction window). Does register renaming (RAT = register alias table)
- Dispatch (aka issue)-execution unit
  - Issues uops to the functional units that execute them and temporarily stores the results (the reservation table is 5-ported, hence 5 uops can be issued concurrently)
- Retire unit
  - Commits the instructions in order (up to 3 commits/cycle)
[Figure: the 3 units of the Pentium Pro (fetch/decode unit, dispatch/execute unit, retire unit) are "independent" and communicate through the instruction pool]
Impact on Branch Prediction and Completion
- When a conditional branch is decoded:
  - Save the current physical-logical mapping
  - Predict and proceed
- When the branch is ready to commit (head of buffer):
  - If the prediction was correct, discard the saved mapping
  - If the prediction was incorrect:
    - Flush all instructions following the mispredicted branch in the reorder buffer
    - Restore the mapping as it was before the branch, as per the saved map
- Note that there have been proposals to execute both sides of a branch using register shadows limited to one extra set of registers
Exceptions
- Instructions carry their exception status
- When an instruction is ready to commit:
  - No exception: proceed normally
  - Exception:
    - Flush (as on a mispredicted branch)
    - Restore the mapping (more difficult than with branches because the mapping is not saved at every instruction; this method can also be used for branches)
Limits to Hardware-based ILP
- Inherent lack of parallelism in programs
  - Partial remedy: loop unrolling and other compiler optimizations; branch prediction to allow earlier issue and dispatch
- Complexity in hardware
  - Needs large bandwidth for instruction fetch (might need to fetch from more than one I-cache line in one cycle)
  - Requires large register bandwidth (multiported register files)
  - Forwarding/broadcast requires "long wires" (long wires are slow) as soon as there are many units
Limits to Hardware-based ILP (c'ed)
- Difficulties specific to the implementation
  - More possibilities of structural hazards (need to encode some priorities in case of conflicts in resource allocation)
  - Parallel search in reservation stations, reorder buffer etc.
  - Additional state saving for branches (mappings); more complex updating of BPTs and BTBs
  - Keeping precise exceptions is more complex
A (naïve) Primer on VLIW - EPIC
- Disclaimer: some of the next few slides are taken (and slightly edited) from an Intel-HP presentation (see Outline for a reference)
- VLIW is a direct descendant of horizontal microprogramming
  - Two commercially unsuccessful machines: Multiflow and Cydrome
- Compiler generates instructions that can execute together
  - Instructions executed in order and assumed to have a fixed latency
- Difficulties occur with:
  - Branch prediction -> use of predication
  - Pointer-based computations -> use cache hints and speculative loads
  - Unpredictable latencies (e.g., cache misses)
IA-64 Architecture: Explicit Parallelism
- [Figure: the compiler turns the original source code into parallel machine code; the IA-64 compiler views a wider scope and makes more efficient use of the hardware's multiple execution resources]
- Fundamental design philosophy enables new levels of headroom
IA-64: Explicitly Parallel Architecture
- [Figure: a 128-bit bundle holds three 41-bit instructions plus a 5-bit template, e.g., Memory (M), Memory (M), Integer (I) = MMI]
- The IA-64 template specifies:
  - The type of operation for each instruction: MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB (M=Memory, F=Floating-point, I=Integer, L=Long Immediate, B=Branch)
  - Intra-bundle relationships (e.g., M / MI or MI / I)
  - Inter-bundle relationships
- The most common combinations are covered by templates; headroom for additional templates
- Simplifies hardware requirements; scales compatibly to future generations
- Basis for increased parallelism
Merced/Itanium implementation (?)
- Can execute 2 bundles (6 instructions) per cycle
- 10-stage pipeline
- 4 integer units (2 of them can handle load-store), 2 f-p units and 3 branch units
- Issues in order, executes in order, but can complete out of order
- Uses a (restricted) register scoreboard technique to resolve dependencies
Merced/Itanium implementation (?)
- Predication reduces the number of branches and the number of mispredicts; nonetheless, a sophisticated branch predictor:
  - Compiler hints: the BPR instruction provides "easy to predict" branch addresses; reduces the number of entries in the BTB
  - Two-level hardware prediction Sas(4,2): a 512-entry local-history table, 4-way set-associative, indexing 128 PHTs (one per set), each with 16 entries of 2-bit saturating counters. Number of bubbles on a predicted-taken branch: 2 or 3
  - And a 64-entry BTB (only 1 bubble)
- Mispredicted-branch penalty: 9 cycles
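A generic two-level local-history predictor can be sketched as follows. This is much simplified relative to the Sas(4,2) organization above (no set-associativity; tables are modeled as dicts): per-branch history bits select a 2-bit saturating counter in a pattern history table.

```python
class TwoLevelPredictor:
    """Toy two-level local predictor: history per branch PC,
    one 2-bit saturating counter per (PC, history) pattern."""

    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.hist = {}   # branch PC -> local history bits
        self.pht = {}    # (PC, history) -> 2-bit counter (0..3)

    def predict(self, pc):
        h = self.hist.get(pc, 0)
        return self.pht.get((pc, h), 1) >= 2   # taken if counter in {2, 3}

    def update(self, pc, taken):
        h = self.hist.get(pc, 0)
        c = self.pht.get((pc, h), 1)
        self.pht[(pc, h)] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist[pc] = ((h << 1) | int(taken)) & ((1 << self.hist_bits) - 1)
```

After a short warm-up on a repeatedly taken branch, both the history and the counter saturate and the predictor answers "taken".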
Merced/Itanium implementation (?)
- There are "instruction queues" between the fetch unit and the execution units
- Therefore branch bubbles can often be absorbed because of long latencies (and stalls) in the execute stages
IA-64 for High Performance
- The number of branches in large server apps overwhelms traditional processors
  - IA-64 predication removes branches and avoids mispredicts
- Environments with a large number of users require high performance
  - IA-64 uses speculation to reduce the impact of memory latency
- 64-bit addressing enables systems with very large virtual and physical memory
Middle Tier Application Needs
- Mid-tier applications (ERP, etc.) have diverse code requirements
  - Integer code with many small loops
  - Significant call/return requirements (C++, Java)
- IA-64's register model supports these various requirements
  - A large register file provides significant resources for optimized performance
  - Rotating registers enable efficient loop execution
  - A register stack handles call-intensive code
- IA-64 resources enable optimization for a variety of application requirements
IA-64's Large Register File
- [Figure: 128 integer registers GR0-GR127 (64 bits + NaT bit; 32 static, 96 stacked/rotating), 128 floating-point registers (82 bits; 32 static, 96 rotating), 64 predicate registers PR0-PR63 (1 bit; 16 static, 48 rotating), and 8 branch registers BR0-BR7]
- A large number of registers enables flexibility and performance
Software Pipelining via Rotating Registers
- Software pipelining improves performance by overlapping the execution of different loop iterations: execute more of the loop in the same amount of time
- [Figure: sequential loop execution vs. software-pipelined loop execution over time]
- Traditional architectures need complex software loop unrolling for pipelining
  - Results in code expansion -> increases cache misses -> reduces performance
- IA-64 utilizes rotating registers to achieve software pipelining
  - Avoids code expansion -> reduces cache misses -> higher performance
- IA-64 rotating registers enable optimized loop execution
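Register rotation can be modeled with simple modular arithmetic (a toy: base, size and the direction of rotation are simplified versus real IA-64, where a rotating register base is decremented at each iteration): the same architectural name maps to a different physical slot each iteration, so values from overlapped iterations coexist without unrolling.

```python
def rotate(reg, iteration, base=32, size=96):
    """Map an architectural register number to a physical slot for a
    given loop iteration; registers below `base` do not rotate."""
    if reg < base:
        return reg                              # static registers
    return base + (reg - base + iteration) % size
```

Writing "r32" in iteration 0 and again in iteration 1 lands in two different physical slots, which is exactly what lets a software-pipelined loop keep several iterations in flight.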
Traditional Register Models
- [Figure: procedures A, B, C, D sharing register-file space, spilling to memory]
- Procedure A calls procedure B
  - Procedures must share space in the register file
  - Performance penalty due to register save/restore
- Traditional register stacks eliminate the need for save/restore by reserving fixed blocks in the register file
  - However, fixed blocks waste resources
  - I think that the "traditional register stack" model they refer to is the "register windows" model
- IA-64 significantly improves upon this
Traditional Register Stacks vs. IA-64 Register Stack
- [Figure: procedures A, B, C, D allocated in fixed blocks vs. variable-size blocks in the register file]
- Traditional register stacks eliminate the need for save/restore by reserving fixed blocks in the register file; however, fixed blocks waste resources
- The IA-64 register stack is able to reserve variable block sizes: no wasted resources
- IA-64 combines high performance and high efficiency
IA-64 Floating-Point Architecture
- [Figure: a 128-entry FP register file (82-bit floating-point numbers) with multiple read and write ports feeding several FMAC units]
- 128 registers allow parallel execution of multiple floating-point operations
- Simultaneous multiply-accumulate (FMAC)
  - 3-input, 1-output operation: a * b + c = d
  - Shorter latency than independent multiply and add
  - Greater internal precision and a single rounding
- Resourced for scientific analysis and 3D graphics
IA-64: Next Generation Architecture
- Explicit parallelism (compiler/hardware synergy): executes more instructions in the same amount of time; maximizes headroom for the future
- Register model (large register file, rotating registers, register stack engine): able to optimize for scalar and object-oriented applications; world-class performance for complex applications
- Floating-point architecture (extended-precision calculations, 128 registers, FMAC, SIMD): high-performance 3D graphics and scientific analysis; enables more complex scientific analysis
- Multimedia architecture (parallel arithmetic, parallel shift, data-arrangement instructions): improves calculation throughput for multimedia data; faster digital content creation and rendering; efficient delivery of rich Web content
- Memory management (64-bit addressing, speculation, memory-hierarchy control): manages large amounts of memory and efficiently organizes data from/to memory; increased architecture and system scalability
- Compatibility (full binary compatibility with existing IA-32 instructions in hardware, PA-RISC through software translation): existing software runs seamlessly; preserves investment in existing software
Predication: Basic Idea
- Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction
  - The stage in which to test the predicate is an implementation choice
- If the predicate is true, the result of the instruction is kept
- If the predicate is false, the instruction is nullified
- Distinction between:
  - Partial predication: only a few opcodes can be predicated
  - Full predication: every instruction is predicated
Predication: Benefits
- Allows the compiler to overlap the execution of independent control constructs w/o code explosion
- Allows the compiler to reduce the frequency of branch instructions and, consequently, of branch mispredictions
- Reduces the number of branches to be tested in a given cycle
- Reduces the number of multiple execution paths and the associated hardware costs (copies of register maps etc.)
- Allows code movement in superblocks
Predication: Costs
- Increased fetch utilization
- Increased register consumption
- If predication is tested at commit time, increased functional-unit utilization
- With code movement, increased complexity of exception handling
  - For example, insert extra instructions for exception checking
Flavors of Predication Implementation
- Has its roots in vector machines like the CRAY-1
  - Creation of vector masks to control vector operations on an element-per-element basis
- Often (partial) predication limited to conditional moves as, e.g., in the Alpha, MIPS R10000, PowerPC, SPARC and the Pentium Pro
- Predication to nullify the next instruction, as in HP PA-RISC
- Full predication: every instruction predicated, as in IA-64
  - The guarded execution model, where a special instruction controls the conditional execution of several of the subsequent instructions, is similar
Partial Predication: Conditional Moves
- CMOV R1, R2, R3: move R2 to R1 if R3 = 0
- Main compiler use: if (cond) S1 (with result in Rres)
  - (1) Compute the result of S1 in Rs1; (2) compute the condition in Rcond; (3) CMOV Rres, Rs1, Rcond
- Increases register pressure (Rcond is a general register)
- No need (in this example) for branch prediction
- Very useful if the condition can be computed ahead of, or, e.g., in parallel with, the result
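The three-step CMOV pattern above, written out in Python (registers modeled as plain variables; the example statement `if (x < 0) res = x + 10` is hypothetical): the result and the condition are computed unconditionally, then a conditional move commits the result, with no branch involved.

```python
def cmov(dest, src, cond_reg):
    """CMOV semantics from the slide: move src to dest if cond_reg == 0."""
    return src if cond_reg == 0 else dest

def example(res, x):
    rs1 = x + 10                   # (1) result of S1, computed unconditionally
    rcond = 0 if x < 0 else 1      # (2) condition in a general register
    return cmov(res, rs1, rcond)   # (3) conditional commit, branch-free
```

Because there is no branch, there is nothing to mispredict; the cost is that both the result and the condition occupy registers.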
Other Forms of Partial Predication
- Select: select dest, src1, src2, cond
  - Corresponds to the C-like dest = (cond) ? src1 : src2
  - Note the destination register is always assigned a value
  - Used in the Multiflow (first commercial VLIW machine)
- Nullify
  - Any register-register instruction can nullify the next instruction, thus making it conditional
Full Predication
- Define predicates with instructions of the form:
  Pred_<cmp> Pout1<type>, Pout2<type>, src1, src2 (Pin)
  where Pout1 and Pout2 are assigned values according to the comparison between src1 and src2 and the cmp "opcode"
- The predicate types are most often U (unconditional) and U-bar (its complement), and OR and OR-bar
- The predicate-define instruction can itself be predicated with the value of Pin
  - There are definite rules for that, e.g., if Pin = 0, U and U-bar are set to 0 independently of the result of the comparison, and the OR predicates are not modified
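A hedged sketch of the rules just quoted (only the U, U-bar and one OR-type output are modeled; the function signature is an illustration, not any ISA's encoding): with Pin = 1, the U output gets the compare result and U-bar its complement, and an OR-type predicate can only be set; with Pin = 0, the U-type outputs are cleared and the OR-type output is left unchanged.

```python
def pred_define(cmp_result, pin, p_or):
    """Return (U, U-bar, OR) predicate values after a predicate define."""
    if not pin:
        p_u, p_ubar = 0, 0                 # U and U-bar forced to 0
        # OR predicates are not modified
    else:
        p_u, p_ubar = int(cmp_result), int(not cmp_result)
        p_or = p_or or int(cmp_result)     # OR type: set-only
    return p_u, p_ubar, p_or
```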
If-conversion
- The if condition will set p1 to U
- The then part will be executed predicated on p1 (U)
- The else part will be executed predicated on p1 (U-bar)
- The "join" will in general be predicated on some form of OR predicate
Levels of Parallelism within a Single Processor
- ILP: the smallest grain of parallelism
  - Resources are not that well utilized (far from ideal CPI)
  - Stalls on operations with long latencies (division, cache miss)
- Multiprogramming: several applications (or large sections of applications) running concurrently
  - O.S.-directed activity
  - A change of application requires a context switch (e.g., on a page fault)
- Multithreading
  - Main goal: tolerate the latency of long operations without paying the price of a full context switch
Multithreading
- The processor supports several instruction streams running "concurrently"
- Each instruction stream has its own context (process state): registers, PC, status register, special control registers etc.
- The multiple streams are multiplexed by the hardware on a set of common functional units
Fine Grain Multithreading
- Conceptually, at every cycle a new instruction stream dispatches an instruction to the functional units
- If enough instruction streams are present, long latencies can be hidden
  - For example, if 32 streams can dispatch an instruction, latencies of 32 cycles could be tolerated
- For a single application, requires highly sophisticated compiler technology to discover many threads in that application
- Basic idea behind Tera's MTA
  - Burton Smith's third such machine (he started in the late 70's)
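The dispatch policy above can be modeled as a round-robin scheduler (a toy: streams are lists of instruction labels, and "ready" simply means non-empty): each cycle, the hardware rotates to the next stream with work, so successive issue slots come from different streams and a long-latency stall in one stream is hidden behind the others.

```python
def run(streams, cycles):
    """Issue one instruction per cycle, rotating over ready streams."""
    issued, s = [], 0
    for _ in range(cycles):
        for _ in range(len(streams)):      # find the next ready stream
            if streams[s % len(streams)]:
                issued.append(streams[s % len(streams)].pop(0))
                s += 1
                break
            s += 1
    return issued
```

With three streams, the issue order interleaves them one instruction at a time, which is the mechanism that tolerates per-stream latency.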
Tera's MTA
- Each processor can execute:
  - 16 applications in parallel (multiprogramming)
  - 128 streams
- At every clock cycle, the processor selects a ready stream and issues an instruction from that stream
- An instruction is a "LIW": memory, arithmetic, control
- Several instructions of the same stream can be in flight simultaneously (ILP)
- Instructions of different streams can be in flight simultaneously (multithreading)
Tera's MTA (c'ed)
- Since several streams belong to the same application, synchronization is very important (will be discussed later in the quarter)
- Needs instructions, and compiler support, to allocate, activate, and deallocate streams
  - Compiler support: loop-level parallelism and software pipelining
  - Hardware support: dynamic allocation of streams (depending on the mix of applications etc.)
Coarse Grain Multithreading
- Switch threads (contexts) only at certain events
  - Changing thread context takes a few (10-20?) cycles
- Used for long memory-latency operations, e.g., access to a remote memory in a shared-memory multiprocessor (100's of cycles)
- Of course, context switches also occur on exceptions such as page faults
- Many fewer contexts needed than in fine-grain multithreading
Simultaneous Multithreading (SMT)
- Combines the advantages of ILP and fine-grain multithreading
- [Figure: issue-slot utilization under ILP vs. SMT, showing horizontal and vertical waste]
  - Horizontal waste is still present but not as much; vertical waste does not necessarily all disappear as the figure implies
  - Vertical waste is of the order of 60% of overall waste
SMT (a UW invention)
- Needs one context per thread
  - But fewer threads are needed than in fine-grain multithreading
- Can issue from distinct threads simultaneously in the same cycle
- Can share resources, for example: physical registers for renaming, caches, BPTs etc.
- A future-generation Alpha will be based on SMT
SMT (c'ed)
- Compared with an ILP superscalar of the same issue width:
  - Requires 5% more real estate
  - Slightly more complex to design (thread scheduling, identifying threads that raise exceptions etc.)
  - Drawback (common to all wide-issue processors): centralized design
- Benefits
  - Increases the throughput of applications running concurrently
  - Dynamic scheduling
  - No partitioning of many resources (in contrast with chip multiprocessors)
Trace Caches
- Filling up the instruction buffer of wide-issue processors is a challenge (even more so in SMT)
- Instead of fetching from the I-cache, fetch from a trace cache
- The trace cache is a complementary instruction cache that stores sequences of instructions organized in dynamic program-execution order
- Implemented in the forthcoming Intel Willamette (thanks Luni for the pointer) and some Sun Sparc architectures
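A trace cache can be sketched as a map from a starting PC plus the predicted directions of the branches inside the trace to a pre-built block of instructions in dynamic order (a toy model; real designs index and tag quite differently): one lookup then supplies a multi-basic-block fetch for a wide issue stage.

```python
class TraceCache:
    """Toy trace cache: traces keyed by (start PC, branch directions)."""

    def __init__(self):
        self.lines = {}

    def fill(self, start_pc, branch_dirs, instrs):
        """Store a trace observed in dynamic execution order."""
        self.lines[(start_pc, tuple(branch_dirs))] = list(instrs)

    def fetch(self, start_pc, branch_dirs):
        """Hit only if the current predictions match the stored trace."""
        return self.lines.get((start_pc, tuple(branch_dirs)))
```

A fetch with a different predicted path misses, even at the same starting PC, and falls back to the ordinary I-cache.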