1
OMSE 510: Computing Foundations 4: The CPU!
Chris Gilmore Systems Software Lab Portland State University/OMSE
2
Today: Caches, DLX Assembly, CPU Overview
3
Introduction to RISC Reduced Instruction Set Computer
1975: John Cocke, IBM 801. IBM started working on a RISC-type computer in 1975, without calling it by that name; it was used as an I/O processor for an IBM mainframe. Patterson and Hennessy: the term RISC was first introduced by Patterson and Ditzel in 1980. The first RISC chips were produced in the early 1980s: RISC I and RISC II from Berkeley, and MIPS from Stanford.
4
RISC Chips: RISC II had 39 instructions, 2 addressing modes, and 3 data types (234 combinations), compared to the VAX's 304 instructions, 16 addressing modes, and 14 data types (68,096 combinations). Found that compiled programs were 30% larger than on CISC (VAX 11/780) but ran up to 5 times faster than on the 68000. Assembler-compiler ratio (execution time of the assembly-language program divided by the execution time of the compiled version): under 50% for CISC, about 90% for RISC.
5
RISC Definition
1. Single-cycle operation
2. Load/store design
3. Hardwired control unit
4. Few instructions and addressing modes
5. Fixed instruction format
6. More compile-time effort to avoid pipeline penalties
6
Disadvantages of CISC Large, complicated, and time-consuming instruction set; complex control unit to decode and execute it; not necessarily faster than a sequence of several RISC instructions. The complexity of the CISC control unit means a large number of design errors and a longer design time. Too large a choice for the compiler: very difficult to design the optimal compiler, and it does not always yield the most efficient code. Instructions specialized to fit certain HLL constructs may be redundant for another HLL. Relatively low cost/benefit factor. The RISC Controversy ('80s point of view)
7
The Advantage of RISC RISC and VLSI realization
Relatively small and simple control-unit hardware (fraction of chip area devoted to control: RISC I: 6%, RISC II: 10%, MC68020: 68%). Higher chance of fitting other features on a chip; can fit a large number of CPU registers, which enhances throughput and HLL support. Increases the regularization factor.
8
The Advantage of RISC RISC and Computing Speed Faster decoding process
Small instruction set, few addressing modes, fixed instruction format. Reduced memory accesses: a large number of CPU registers permits register-to-register operations and faster parameter passing (register windows in RISC I and RISC II). Streamlined instruction handling: all instructions have the same length and all execute in one cycle, which suits a pipelined implementation.
9
The Advantage of RISC RISC and design costs and reliability
Shorter time to design and reduction of overall design costs; reduces the probability that the end product will be obsolete. Reduced number of design errors. Virtual memory management enhancement: instructions will not cross word boundaries and can't wind up on two separate pages.
10
The Advantage of RISC RISC and HLL Support
Shorter and simpler compiler: usually only a single choice rather than several choices as in CISC. Large number of CPU registers enables more efficient code optimization and fast parameter passing between procedures ("register windows"). Reduced burden on the compiler writer. HLL = High-Level Language
11
The Disadvantages and Criticisms of RISC ('80s)
RISC code tends to be longer: extra burden on the machine- and assembly-language programmer, several instructions required per single CISC instruction, and more memory locations needed for their storage. Floating-point and VMM support: floating point is harder to "wedge in" because it tends to take longer to compute, and virtual memory management also destroys some of the simplicity.
12
RISC Characteristics Pipelined operation
Compiler responsible for pipeline conflict resolution Delayed branch Delayed load
13
Question #1: Why do microcoding?
If simple instruction could execute at very high clock rate… If you could even write compilers to produce microinstructions… If most programs use simple instructions and addressing modes… If microcode is kept in RAM instead of ROM so as to fix bugs … If same memory used for control memory could be used instead as cache for “macroinstructions”… Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1)
14
Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D
15
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?
16
Pipelined Laundry: Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads.
17
Pipelining Lessons
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload. Pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce speedup. Time to "fill" the pipeline and time to "drain" it reduce speedup. Stall for dependences.
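A quick sanity check on these numbers: a minimal C sketch (using the 30/40/20-minute stage times and four loads from the slides; an illustrative calculation only) showing that the pipelined rate is set by the slowest stage:

/* Sequential time = loads * (sum of stage times);
 * pipelined time  = sum of stage times + (loads - 1) * slowest stage. */
#include <stdio.h>

int main(void) {
    int stages[] = {30, 40, 20};        /* wash, dry, fold (minutes) */
    int n_stages = 3, loads = 4;
    int sum = 0, slowest = 0;
    for (int i = 0; i < n_stages; i++) {
        sum += stages[i];
        if (stages[i] > slowest) slowest = stages[i];
    }
    int sequential = loads * sum;                  /* 4 * 90 = 360 min = 6 hours   */
    int pipelined  = sum + (loads - 1) * slowest;  /* 90 + 3*40 = 210 min = 3.5 hours */
    printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
    return 0;
}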
18
Execution Cycle
Instruction Fetch: obtain instruction from program storage
Decode: determine required actions and instruction size
Operand: locate and obtain operand data
Execute: compute result value or status
Result Store: deposit results in storage for later use
Next: determine successor instruction
19
The Five Stages of Load
Load goes through Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 through 5.
Ifetch (Instruction Fetch): fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: calculate the memory address
Mem: read the data from the Data Memory
Wr: write the data back to the register file
Each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
20
Note: These 5 stages were there all along!
Fetch: IR <= MEM[PC]; PC <= PC + 4
Execute: R-type: ALUout <= A fun B; ORi: ALUout <= A op ZX; LW/SW: ALUout <= A + SX; BEQ: ALUout <= PC + SX, then if A = B, PC <= ALUout
Memory: LW: M <= MEM[ALUout]; SW: MEM[ALUout] <= B
Write-back: R-type: R[rd] <= ALUout; ORi: R[rt] <= ALUout; LW: R[rt] <= M
The multicycle state machine's states group naturally into Fetch, Decode, Execute, Memory, Write-back.
21
Pipelining Improve performance by increasing throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
22
Basic Idea What do we need to add to split the datapath into stages?
23
Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths
24
Conventional Pipelined Execution Representation
(Diagram: each instruction passes through IFetch, Dcd, Exec, Mem, WB; successive instructions in program order start one cycle later, so their stages overlap in time.)
25
Single Cycle, Multiple Cycle, vs. Pipeline
Timing diagrams compare the single-cycle, multiple-cycle, and pipeline implementations for the sequence Load, Store, R-type. In the pipeline implementation, each instruction goes through Ifetch, Reg, Exec, Mem, Wr, and the whole sequence finishes in seven cycles. In the multiple-clock-cycle implementation, the Store cannot start executing until Cycle 6 because we must wait for the Load instruction to complete, and the R-type instruction cannot start until the Store has completed its execution in Cycle 9. In the single-cycle implementation, the cycle time is set to accommodate the longest instruction, the Load; consequently its cycle time can be five times longer than the multiple-cycle implementation's. Perhaps more importantly, since the cycle time has to be long enough for the Load, it is too long for the Store, so the last part of that cycle is wasted.
26
Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
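The same arithmetic, as a small C sketch (cycle times and CPIs taken straight from the slide; nothing here is measured data):

#include <stdio.h>

int main(void) {
    int n = 100;                               /* instructions executed */
    double single    = 45.0 * 1.0 * n;         /* 45 ns/cycle * 1 CPI          = 4500 ns */
    double multi     = 10.0 * 4.6 * n;         /* 10 ns/cycle * 4.6 CPI        = 4600 ns */
    double pipelined = 10.0 * (1.0 * n + 4);   /* 1 CPI plus 4 cycles to drain = 1040 ns */
    printf("single-cycle: %.0f ns, multicycle: %.0f ns, pipelined: %.0f ns\n",
           single, multi, pipelined);
    return 0;
}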
27
Why Pipeline? Because we can!
(Pipeline diagram: Inst 0 through Inst 4 each flow through Im, Reg, ALU, Dm, Reg, each starting one clock cycle after the previous one.)
28
Can pipelining get us into trouble?
Yes: pipeline hazards.
Structural hazards: attempt to use the same resource two different ways at the same time. E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV).
Control hazards: attempt to make a decision before the condition is evaluated. E.g., washing football uniforms and needing to get the proper detergent level: need to see the result after the dryer before starting the next load. Branch instructions.
Data hazards: attempt to use an item before it is ready. E.g., one sock of a pair is in the dryer and one is in the washer: can't fold until the sock gets from the washer through the dryer. An instruction depends on the result of a prior instruction still in the pipeline.
Can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it.
29
Single Memory is a Structural Hazard
(Pipeline diagram: Load and Instr 1-4 overlap; the Load's data-memory access in its Mem stage collides with a later instruction's fetch from the same single memory.) Detection is easy in this case! (right half highlight means read, left half write)
30
Structural Hazards limit performance
Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then the average CPI is at least 1.3; otherwise the memory resource would be more than 100% utilized.
31
Control Hazard Solution #1: Stall
Stall: wait until the decision is clear. Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow. Moving the decision to the end of decode saves 1 cycle per branch.
32
Control Hazard Solution #2: Predict
Predict: guess one direction, then back up if wrong. Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of the time). Need to "squash" and restart the following instruction if wrong. Produces a CPI on branches of (1 * .5 + 2 * .5) = 1.5. Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (with 20% branches). A more dynamic scheme keeps a history for each branch (right roughly 90% of the time).
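To make the per-branch history idea concrete, here is a minimal sketch of a 1-bit branch-history table in C. The table size and the PC indexing are illustrative assumptions, not a description of any particular processor:

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 1024
static bool bht[BHT_ENTRIES];              /* one bit per entry: true = predict taken */

/* Predict using the last recorded outcome of the branch that indexed this entry. */
static bool predict(uint32_t pc) {
    return bht[(pc >> 2) % BHT_ENTRIES];   /* index by word-aligned PC */
}

/* After the branch resolves, remember its actual outcome. */
static void update(uint32_t pc, bool taken) {
    bht[(pc >> 2) % BHT_ENTRIES] = taken;
}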
33
Control Hazard Solution #3: Delayed Branch
Delayed Branch: redefine branch behavior (the branch takes effect after the next instruction). Impact: 0 lost clock cycles per branch instruction if the compiler can find an instruction to put in the "slot" (about 50% of the time). As we launch more instructions per clock cycle, this becomes less useful.
34
Delayed/Predicted Branch
Where to get instructions to fill the branch delay slot? From before the branch instruction; from the target address (only valuable when the branch is taken); or from the fall-through path (only valuable when the branch is not taken). Cancelling branches allow more slots to be filled. Compiler effectiveness for a single branch delay slot: fills about 60% of branch delay slots; about 80% of instructions executed in branch delay slots are useful in computation; so about 50% (60% x 80%) of slots are usefully filled. Delayed-branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), a single delay slot helps less.
35
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
36
Data Hazard on r1: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9
Dependencies backwards in time are hazards. (Pipeline diagram, stages IF, ID/RF, EX, MEM, WB: add r1,r2,r3 writes r1 in WB, while the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11 read r1 in earlier cycles.)
37
Data Hazard Solution: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7
"Forward" the result from one stage to another. The "or" is OK if register read/write are defined properly. (Pipeline diagram as before, with the add's result bypassed to the dependent instructions' EX stages.)
38
Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards. Can't solve this one with forwarding: must delay/stall the instruction dependent on the load. (Diagram: lw r1,0(r2) produces r1 at the end of its MEM stage, but sub r4,r1,r3 wants it at the start of its EX stage, one cycle earlier.)
39
Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards. Can't solve with forwarding: must delay/stall the instruction dependent on the load. (Diagram: sub r4,r1,r3 is stalled one cycle so that r1 can be forwarded from the lw's MEM stage.)
40
Detecting Control Signals
Situation: no dependence. Example: LD R1, 45(R2) / DADD R5, R6, R7 / DSUB R8, R6, R7 / OR R9, R6, R7. Action: no hazards.
Situation: dependence requiring a stall. Example: LD R1, 45(R2) / DADD R5, R1, R7. Action: detect the use of R1 during ID of DADD and stall.
Situation: dependence overcome by forwarding. Example: DSUB R8, R1, R7. Action: detect the use of R1 during ID of DSUB and set the mux control signal that accepts the result from the bypass path.
Situation: dependence with accesses in order. Example: OR R9, R1, R7. Action: no action required.
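A hedged sketch of the two checks this table describes, written in C for a generic 5-stage pipeline; the struct fields, enum names, and the absence of a register-zero special case are simplifying assumptions, not a real design:

#include <stdbool.h>

typedef struct { int rs, rt, rd; bool is_load, reg_write; } Instr;

enum FwdSel { FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB };

/* Stall if the instruction in EX is a load whose destination is
 * needed by the instruction currently in ID (load-use hazard). */
bool load_use_stall(const Instr *id, const Instr *ex) {
    return ex->is_load && (ex->rd == id->rs || ex->rd == id->rt);
}

/* Pick the source for one ALU operand (here, rs) of the instruction in EX,
 * preferring the most recent producer still in the pipeline. */
enum FwdSel forward_rs(const Instr *ex, const Instr *ex_mem, const Instr *mem_wb) {
    if (ex_mem->reg_write && ex_mem->rd == ex->rs) return FROM_EX_MEM;
    if (mem_wb->reg_write && mem_wb->rd == ex->rs) return FROM_MEM_WB;
    return FROM_REGFILE;
}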
41
Conflicts/Problems I-cache and D-cache are accessed in the same cycle – it helps to implement them separately Registers are read and written in the same cycle – easy to deal with if register read/write time equals cycle time/2 (else, use bypassing) Branch target changes only at the end of the second stage -- what do you do in the meantime? Data between stages get latched into registers (overhead that increases latency per instruction)
42
Control Hazards Simple techniques to handle control hazard stalls:
for every branch, introduce a stall cycle (note: every 6th instruction is a branch!) assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
43
Slowdowns from Stalls Perfect pipelining with no hazards: an instruction completes every cycle (total cycles ~ number of instructions), and speedup = increase in clock speed = number of pipeline stages. With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes. Total cycles = number of instructions + stall cycles. Slowdown because of stalls = 1 / (1 + stall cycles per instruction); for example, 0.2 stall cycles per instruction leaves 1/1.2, about 83%, of the ideal throughput.
44
Control and Datapath: Split state diag into 5 pieces
Fetch: IR <- Mem[PC]; PC <- PC+4
Decode: A <- R[rs]; B <- R[rt]
Execute: S <- A + B; S <- A or ZX; S <- A + SX; if Cond, PC <- PC+SX
Memory: M <- Mem[S]; Mem[S] <- B
Write-back: R[rd] <- S; R[rt] <- S; R[rd] <- M
(Datapath blocks: Inst. Mem, Reg. File, Exec, Mem Access / Data Mem, with PC, IR, A, B, S, M, D as the registers between stages.)
45
Three Generic Data Hazards
InstrI followed by InstrJ Read After Write (RAW) InstrJ tries to read operand before InstrI writes it
46
Three Generic Data Hazards
InstrI followed by InstrJ. Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand. Can't happen in the DLX 5-stage pipeline because: all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.
47
Three Generic Data Hazards
InstrI followed by InstrJ Write After Write (WAW) InstrJ tries to write operand before InstrI writes it Leaves wrong result ( InstrI not InstrJ ) Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Can have WAR and WAW in more complicated pipes
48
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
49
Summary: Pipelining Reduce CPI by overlapping many instructions
Average throughput of approximately 1 CPI with fast clock Utilize capabilities of the Datapath start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction
50
Some Issues for your consideration
Won't be tested. We'll talk about modern processors and what's really hard: exception handling, trying to improve performance with out-of-order execution, etc., and trying to get CPI < 1 (superscalar execution).
51
Superscalar Execution
Throwing more hardware at the problem: instruction-level parallelism (ILP) via multiple functional units, e.g., multiple ALUs. Independent instructions such as Add a, b, c and Add d, e, f can execute at the same time, so we can get CPI < 1!
52
Out-of-order execution
Idea: It’s best if we keep all functional units busy Can sometimes reorder computations to take advantage of functional units that are otherwise idle Automatically do reordering like we did 4 slides ago!
53
Register Renaming Internally rename registers, allow for better ILP
Add a, b, c Add b, c, d
54
Hyperthreading/Multicore
>1 “virtual” CPUs Multi-core: >1 actual CPUs per die
55
Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer * Die yield)
Dies per wafer = [pi * (Wafer_diam / 2)^2] / Die_Area - [pi * Wafer_diam] / sqrt(2 * Die_Area) - Test dies
Die yield = Wafer yield * (1 + Defects_per_unit_area * Die_Area / alpha)^(-alpha)
(Not on the midterm.) Die cost goes roughly with die area^4.
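A small C sketch that plugs numbers into these formulas; all of the input values below are made up for illustration, not data for any real process:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double PI = 3.14159265358979;
    double wafer_diam  = 20.0;    /* cm (illustrative)           */
    double die_area    = 1.0;     /* cm^2 (illustrative)         */
    double wafer_cost  = 1000.0;  /* $ (illustrative)            */
    double defects     = 1.0;     /* defects per cm^2            */
    double alpha       = 3.0;     /* process complexity parameter */
    double wafer_yield = 1.0;
    int    test_dies   = 4;

    double dies_per_wafer = PI * pow(wafer_diam / 2, 2) / die_area
                          - PI * wafer_diam / sqrt(2 * die_area)
                          - test_dies;
    double die_yield = wafer_yield * pow(1 + defects * die_area / alpha, -alpha);
    double die_cost  = wafer_cost / (dies_per_wafer * die_yield);

    printf("dies/wafer = %.0f, die yield = %.2f, die cost = $%.2f\n",
           dies_per_wafer, die_yield, die_cost);
    return 0;
}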
56
Real World Examples (die cost per chip): 386DX $4, 486DX $12, PowerPC $53, HP PA $73, DEC Alpha $149, SuperSPARC $272, Pentium $417. The original table also listed metal layers, line width, wafer cost, defects/cm2, die area (mm2), dies per wafer, and yield for each chip. From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15.
57
Midterm Questions Examples: List and describe 3 types of DRAM
What are the relative advantages/disadvantages of RISC vs. CISC? Why do we have a memory hierarchy? Using your choice of assembly, write a (commented) routine that computes the nth Fibonacci number. Why do CPUs have registers? Describe how a 3-disk RAID-5 system works.
58
Midterm Questions More Examples:
What are the differences between a synchronous and an asynchronous bus? What are the relative advantages/disadvantages? List and describe techniques to improve cache miss rate, reduce cache miss penalty, and reduce cache hit time.
59
Topics for further study
The following slides will not be covered in class or on tests.
60
Multicycle Instructions
Functional unit: Latency / Initiation interval
Integer ALU: 1
Data memory: 2
FP add: 4
FP multiply: 7
FP divide: 25
61
Effects of Multicycle Instructions
Structural hazards if the unit is not fully pipelined (divider) Frequent RAW hazard stalls Potentially multiple writes to the register file in a cycle WAW hazards because of out-of-order instr completion Imprecise exceptions because of o-o-o instr completion
62
Precise Exceptions On an exception:
must save PC of instruction where program must resume all instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own) temporary program state not in memory (in other words, registers) has to be stored in memory potential problems if a later instruction has already modified memory or registers A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and of course, correctness)
63
Dealing with these Effects
Multiple writes to the register file: increase the number of ports, stall one of the writers during ID, stall one of the writers during WB (the stall will propagate) WAW hazards: detect the hazard during ID and stall the later instruction Imprecise exceptions: buffer the results if they complete early or save more pipeline state so that you can return to exactly the same state that you left at
64
ILP Instruction-level parallelism: overlap among instructions:
pipelining or multiple instruction execution What determines the degree of ILP? dependences: property of the program hazards: property of the pipeline
65
Types of Dependences Data dependences: an instr produces a result for another (true dependence, results in RAW hazards in a pipeline) Name dependences: two instrs that use the same names (anti and output dependences, result in WAR and WAW hazards in a pipeline) Control dependences: an instruction’s execution depends on the result of a branch – re-ordering should preserve exception behavior and dataflow
66
An Out-of-Order Processor Implementation
(Block diagram: instructions flow from the Instr Fetch Queue through branch prediction and instruction fetch, then Decode & Rename, into the Issue Queue (IQ) and the Reorder Buffer (ROB, holding Instr 1-6 with temporaries T1-T6); results are written to the ROB and tags are broadcast to the IQ; the Register File holds R1-R32.)
Original code: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2; R1 <- R3+R2
Renamed code: T1 <- R1+R2; T2 <- T1+R3; BEQZ T2; T4 <- T1+T2; T5 <- T4+T2
67
Design Details - I Instructions enter the pipeline in order
No need for branch delay slots if prediction happens in time Instructions leave the pipeline in order – all instructions that enter also get placed in the ROB – the process of an instruction leaving the ROB (in order) is called commit – an instruction commits only if it and all instructions before it have completed successfully (without an exception) To preserve precise exceptions, a result is written into the register file only when the instruction commits – until then, the result is saved in a temporary register in the ROB
68
Design Details - II Instructions get renamed and placed in the issue queue – some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6) As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue – instructions now know if their operands are ready When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution) Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided
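A minimal C sketch of the wakeup/select step just described: completing instructions broadcast their destination tag, matching source operands are marked ready, and a ready instruction may issue. The entry layout, tag encoding, and selection policy are assumptions for illustration:

#include <stdbool.h>

#define IQ_SIZE 16

typedef struct {
    bool valid;
    int  src_tag[2];     /* which in-flight result (tag) each source waits on */
    bool src_ready[2];   /* true once that operand's value is available       */
} IQEntry;

static IQEntry iq[IQ_SIZE];

/* A completing instruction broadcasts its destination tag to the issue queue. */
void broadcast(int tag) {
    for (int i = 0; i < IQ_SIZE; i++)
        for (int s = 0; s < 2; s++)
            if (iq[i].valid && iq[i].src_tag[s] == tag)
                iq[i].src_ready[s] = true;
}

/* Select one instruction whose operands are all ready (returns its index, or -1). */
int select_ready(void) {
    for (int i = 0; i < IQ_SIZE; i++)
        if (iq[i].valid && iq[i].src_ready[0] && iq[i].src_ready[1])
            return i;
    return -1;
}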
69
Design Details - III If instr-3 raises an exception, wait until it reaches the top of the ROB – at this point, R1-R32 contain results for all instructions up to instr-3 – save registers, save PC of instr-3, and service the exception If branch is a mispredict, flush all instructions after the branch and start on the correct path – mispredicted instrs will not have updated registers (the branch cannot commit until it has completed and the flush happens as soon as the branch completes) Potential problems: ?
70
Managing Register Names
Temporary values are stored in the register file and not in the ROB. Logical registers: R1-R32. Physical registers: P1-P64. At the start, R1-R32 can be found in P1-P32. Instructions stop entering the pipeline when P64 is assigned.
Original: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2
Renamed: P33 <- P1+P2; P34 <- P33+P3; BEQZ P34; P35 <- P33+P34
What happens on commit?
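A minimal C sketch of the renaming step described above: a map table from logical to physical registers plus a free list. The sizes match the slide (32 logical, 64 physical, 0-indexed here), but the API and the stall-when-no-free-register policy are illustrative assumptions:

#define N_LOGICAL  32
#define N_PHYSICAL 64

static int map_table[N_LOGICAL];               /* logical -> physical register      */
static int free_list[N_PHYSICAL], free_count;  /* physical registers not yet mapped */

void rename_init(void) {
    for (int r = 0; r < N_LOGICAL; r++) map_table[r] = r;   /* R1-R32 start in P1-P32 */
    free_count = 0;
    for (int p = N_LOGICAL; p < N_PHYSICAL; p++) free_list[free_count++] = p;
}

/* Rename one instruction "dst <- src1 op src2": sources read the current
 * mapping, the destination gets a fresh physical register, or we stall
 * (return -1) when all extra physical registers are in use. */
int rename_instr(int dst, int src1, int src2, int *psrc1, int *psrc2) {
    *psrc1 = map_table[src1];
    *psrc2 = map_table[src2];
    if (free_count == 0) return -1;            /* no free physical register: stall */
    int pdst = free_list[--free_count];
    map_table[dst] = pdst;
    return pdst;
}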
71
The Commit Process On commit, no copy is required
The register map table is updated – the “committed” value of R1 is now in P33 and not P1 – on an exception, P33 is copied to memory and not P1 An instruction in the issue queue need not modify its input operand when the producer commits When instruction-1 commits, we no longer have any use for P1 – it is put in a free pool and a new instruction can now enter the pipeline for every instr that commits, a new instr can enter the pipeline number of in-flight instrs is a constant = number of extra (rename) registers
72
The Alpha 21264 Out-of-Order Implementation
(Block diagram: like the generic out-of-order design, but temporaries live in one physical register file P1-P64, and a Register Map Table (R1->P1, R2->P2, ...) tracks the current mapping; results are written to the register file and tags are broadcast to the IQ.)
Original: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2; R1 <- R3+R2
Renamed: P33 <- P1+P2; P34 <- P33+P3; BEQZ P34; P35 <- P33+P34; P36 <- P35+P34
73
Lecture 11: Advanced Static ILP
Topics: loop unrolling, software pipelining (Section 4.4)
74
Loop Dependences If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as the order within an iteration is preserved. If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried". Not all loop-carried dependences imply a lack of parallelism. Parallel loops are especially desirable in a multiprocessor system.
75
Examples
For (i=1000; i>0; i=i-1) x[i] = x[i] + s;   (no dependences)
For (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];     S1
  B[i+1] = B[i] + A[i+1];   S2
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
For (i=1000; i>0; i=i-1) x[i] = x[i-3] + s;   S1
S1 depends on S1 from 3 previous iterations; referred to as a recursion; dependence distance 3; limited parallelism.
76
Finding Dependences – the GCD Test
Do A[ai + b] and A[ci + d] refer to the same element? Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a, b are constants); an example of a non-affine index is x[y[i]]. For a dependence to exist, there must be two indices j and k within the loop bounds such that aj + b = ck + d, i.e., aj - ck = d - b. Let G = GCD(a, c); then (aj/G - ck/G) = (d - b)/G. If (d - b)/G is not an integer, the initial equality cannot be true.
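The test is easy to express in code. A small C sketch, under the affine-index assumption above (note the test is conservative: passing it only means a dependence cannot be ruled out):

#include <stdbool.h>
#include <stdlib.h>

static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* Returns true when the GCD test cannot rule out a dependence between
 * accesses A[a*i + b] and A[c*i + d]. */
bool gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    return g != 0 && (d - b) % g == 0;
}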
77
Static vs. Dynamic ILP
Original loop:
Loop: L.D F0, 0(R1)        ; F0 = array element
      ADD.D F4, F0, F2     ; add scalar in F2
      S.D F4, 0(R1)        ; store result
      DADDUI R1, R1, #-8   ; decrement address pointer
      BNE R1, R2, Loop     ; branch if R1 != R2
Statically unrolled (and scheduled) loop:
Loop: L.D F0, 0(R1)
      L.D F6, -8(R1)
      L.D F10, -16(R1)
      L.D F14, -24(R1)
      ADD.D F4, F0, F2
      ADD.D F8, F6, F2
      ADD.D F12, F10, F2
      ADD.D F16, F14, F2
      S.D F4, 0(R1)
      S.D F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D F12, 16(R1)
      BNE R1, R2, Loop
      S.D F16, 8(R1)
Large-window dynamic out-of-order processor: runs the original (not unrolled) loop, relying on hardware to overlap iterations.
78
Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1)
DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop Renamed
79
Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1)
DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop 1 3 6 2 4 7 5 8 9 Cycle of Issue Renamed
80
Loop Pipeline
(Diagram: the L.D, ADD.D, S.D, DADDUI, BNE of successive loop iterations overlap in time, forming a pipeline of iterations.)
81
Statically Unrolled Loop
Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) L.D F18, -32(R1) ADD.D F4, F0, F2 L.D F22, -40(R1) ADD.D F8, F6, F2 L.D F26, -48(R1) ADD.D F12, F10, F2 L.D F30, -56(R1) ADD.D F16, F14, F2 L.D F34, -64(R1) ADD.D F20, F18, F2 S.D F4, 0(R1) L.D F38, -72(R1) ADD.D F24, F22, F2 S.D F8, -8(R1) … … S.D F12, 16(R1) S.D F16, 8(R1) DADDUI R1, R1, # -32 S.D BNE R1,R2, Loop
82
Static Vs. Dynamic
(Graph: new iterations completed per cycle, for dynamic ILP vs. static ILP.)
What if I doubled the number of resources in each processor? What if I unrolled the loop and executed it on a dynamic ILP processor?
83
Static vs. Dynamic Dynamic: because of the loop index, at most one iteration can start every cycle – even fewer if there are resource constraints – in other words, we have a pipeline that has a throughput of one iteration per cycle! Static: by eliminating loop index, each iteration is independent as many loops can start in a cycle as there are resources – however, after a while, we don’t start any more iterations – thus, loop unrolling provides a brief steady state, where an iteration starts/finishes every cycle and the rest is start-up/wind-down for each unrolled loop
84
Software Pipeline?!
(Diagram: as with the loop pipeline, the L.D, ADD.D, S.D, DADDUI, BNE of successive iterations overlap in time.)
85
Software Pipelining
Original loop:
Loop: L.D F0, 0(R1)
      ADD.D F4, F0, F2
      S.D F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE R1, R2, Loop
Software-pipelined loop:
Loop: S.D F4, 16(R1)
      ADD.D F4, F0, F2
      L.D F0, 0(R1)
      DADDUI R1, R1, #-8
      BNE R1, R2, Loop
Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion; an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state; a sw-pipelined loop can also be unrolled to reduce loop overhead. Disadvantages: does not reduce loop overhead, and may require more registers.
86
Loop Dependences If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as the order within an iteration is preserved. If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried". Not all loop-carried dependences imply a lack of parallelism. Parallel loops are especially desirable in a multiprocessor system.
87
Examples
For (i=1000; i>0; i=i-1) x[i] = x[i] + s;   (no dependences)
For (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];     S1
  B[i+1] = B[i] + A[i+1];   S2
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
For (i=1000; i>0; i=i-1) x[i] = x[i-3] + s;   S1
S1 depends on S1 from 3 previous iterations; referred to as a recursion; dependence distance 3; limited parallelism.
88
Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
Restructured:
A[1] = A[1] + B[1];
For (i=1; i<=99; i=i+1) {
  B[i+1] = C[i] + D[i];         S3
  A[i+1] = A[i+1] + B[i+1];     S4
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration.
89
Finding Dependences – the GCD Test
Do A[ai + b] and A[ci + d] refer to the same element? Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a, b are constants); an example of a non-affine index is x[y[i]]. For a dependence to exist, there must be two indices j and k within the loop bounds such that aj + b = ck + d, i.e., aj - ck = d - b. Let G = GCD(a, c); then (aj/G - ck/G) = (d - b)/G. If (d - b)/G is not an integer, the initial equality cannot be true.
90
Predication A branch within a loop can be problematic to schedule
Control dependences are a problem because of the need to re-fetch on a mispredict For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions
91
Predicated or Conditional Instructions
The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op. Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op. This replaces a control dependence with a data dependence (branches disappear); it may need register copies for the condition or for values used by both directions.
if (R1 == 0)
  R2 = R2 + R4
else
  R6 = R3 + R5
R4 = R2 + R3
becomes:
R7 = !R1 ; R8 = R2
R2 = R2 + R4   (predicated on R7)
R6 = R3 + R5   (predicated on R1)
R4 = R8 + R3   (predicated on R1)
92
Complications Each instruction has one more input operand, requiring more register ports/bypassing. If the branch condition is not known, the instruction stalls (remember, these are in-order processors). Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline: wasted work. Increases register pressure and activity on the functional units. Does not help if the branch condition takes a while to evaluate.
93
Support for Speculation
In general, when we re-order instructions, register renaming can ensure that we do not violate register data dependences. However, we need hardware support to ensure that an exception is raised at the correct point, and to ensure that we do not violate memory dependences (e.g., when a ld is hoisted above a preceding st or br).
94
Detecting Exceptions Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults) For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative Note that a speculative instruction needs a special opcode to indicate that it is speculative
95
Program-Terminate Exceptions
When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing (NAT) value into the destination register. If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desirable that the error is caused by an array access, but the core dump happens two procedures later). Alternatively, an instruction (the sentinel) in the speculative instruction's original location checks the register value and initiates recovery.
96
Memory Dependence Detection
If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address, else, the load has to re-execute When the speculative load issues, it stores its address in a table (Advanced Load Address Table in the IA-64) If a store finds its address in the ALAT, it indicates that a violation occurred for that address A special instruction (the sentinel) in the load’s original location checks to see if the address had a violation and re-executes the load if necessary
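A hedged C sketch of the idea (not the actual IA-64 ALAT structure): speculative loads record their addresses, every store checks for a match, and the sentinel at the load's original location decides whether the load must re-execute. Table size and indexing are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define ALAT_ENTRIES 32
static struct { uint64_t addr; bool valid, violated; } alat[ALAT_ENTRIES];

static unsigned slot(uint64_t addr) { return (addr >> 3) % ALAT_ENTRIES; }

/* A speculative load records its address when it issues. */
void alat_record_load(uint64_t addr) {
    unsigned s = slot(addr);
    alat[s].addr = addr; alat[s].valid = true; alat[s].violated = false;
}

/* Every store checks the table; a match marks a violation. */
void alat_check_store(uint64_t addr) {
    unsigned s = slot(addr);
    if (alat[s].valid && alat[s].addr == addr) alat[s].violated = true;
}

/* Sentinel at the load's original location: re-execute if the entry is
 * missing, was overwritten, or saw a conflicting store. */
bool alat_must_replay(uint64_t addr) {
    unsigned s = slot(addr);
    return !alat[s].valid || alat[s].addr != addr || alat[s].violated;
}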
97
Dynamic Vs. Static ILP Static ILP:
+ The compiler finds parallelism, so no scoreboarding is needed, giving higher clock speeds and lower power.
+ The compiler knows what comes next, so it can build a better global schedule.
- The compiler cannot react to dynamic events (cache misses).
- It cannot re-order instructions unless you provide hardware and extra instructions to detect violations (which eats into the low complexity/power argument).
- Static branch prediction is poor; even statically scheduled processors use hardware branch predictors.
- Building an optimizing compiler is easier said than done.
A comparison of the Alpha, Pentium 4, and Itanium (the statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power.
98
Summary Topics: scheduling, loop unrolling, software pipelining,
predication, violations while re-ordering instructions Static ILP is a great approach for handling embedded domains For the high performance domain, designers have added many frills, bells, and whistles to eke out additional performance, while compromising power/complexity