1
OMSE 510: Computing Foundations 4: The CPU!
Chris Gilmore Systems Software Lab Portland State University/OMSE
2
Today: Caches, DLX Assembly, CPU Overview
3
Introduction to RISC Reduced Instruction Set Computer
1975: John Cocke, IBM 801. IBM started working on a RISC-type computer in 1975, without calling it by that name; it was used as an I/O processor for an IBM mainframe. Patterson and Hennessy: the term RISC was first introduced by Patterson and Ditzel in 1980. The first RISC chips were produced in the early 1980s: RISC I and RISC II from Berkeley, and MIPS from Stanford.
4
RISC Chips: RISC II had 39 instructions, 2 addressing modes, and 3 data types (234 combinations), compared to the VAX's 304 instructions, 16 addressing modes, and 14 data types (68,096 combinations). Found that compiled programs were 30% larger than on CISC (VAX 11/780) but ran up to 5 times faster than on the 68000. Assembler-compiler ratio (execution time of the assembly-language program divided by the execution time of the compiled version): under 50% for CISC, about 90% for RISC.
5
RISC Definition
1. Single-cycle operation
2. Load/store design
3. Hardwired control unit
4. Few instructions and addressing modes
5. Fixed instruction format
6. More compile-time effort to avoid pipeline penalties
6
Disadvantages of CISC Large, complicated, and time-consuming instruction set; complex control unit to decode and execute it; not necessarily faster than a sequence of several RISC instructions. The complexity of the CISC control unit means a large number of design errors and a longer design time. Too large a choice for the compiler: very difficult to design the optimal compiler, and it does not always yield the most efficient code. Instructions specialized to fit certain HLL constructs may be redundant for another HLL. Relatively low cost/benefit factor. The RISC Controversy ('80s point of view)
7
The Advantage of RISC RISC and VLSI realization
Relatively small and simple control-unit hardware (fraction of chip area devoted to control: RISC I: 6%, RISC II: 10%, MC68020: 68%). Higher chance of fitting other features on a chip; can fit a large number of CPU registers, which enhances throughput and HLL support. Increases the regularization factor.
8
The Advantage of RISC RISC and Computing Speed Faster decoding process
Small instruction set, few addressing modes, fixed instruction format. Reduced memory accesses: a large number of CPU registers permits register-to-register operations and faster parameter passing (register windows in RISC I and RISC II). Streamlined instruction handling: all instructions have the same length and all execute in one cycle, which suits a pipelined implementation.
9
The Advantage of RISC RISC and design costs and reliability
Shorter time to design and reduction of overall design costs; reduces the probability that the end product will be obsolete. Reduced number of design errors. Virtual memory management enhancement: instructions will not cross word boundaries and can't wind up on two separate pages.
10
The Advantage of RISC RISC and HLL Support
Shorter and simpler compiler: usually only a single choice rather than several choices as in CISC. Large number of CPU registers enables more efficient code optimization and fast parameter passing between procedures ("register windows"). Reduced burden on the compiler writer. HLL = High-Level Language
11
The Disadvantages and Criticisms of RISC ('80s)
RISC code tends to be longer: extra burden on the machine- and assembly-language programmer, several instructions required per single CISC instruction, and more memory locations needed for their storage. Floating-point and VMM support: floating point is harder to "wedge in" because it tends to take longer to compute, and virtual memory management also destroys some of the simplicity.
12
RISC Characteristics Pipelined operation
Compiler responsible for pipeline conflict resolution Delayed branch Delayed load
13
Question #1: Why do microcoding?
If simple instruction could execute at very high clock rate… If you could even write compilers to produce microinstructions… If most programs use simple instructions and addressing modes… If microcode is kept in RAM instead of ROM so as to fix bugs … If same memory used for control memory could be used instead as cache for “macroinstructions”… Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1)
14
Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D
15
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?
16
Pipelined Laundry: Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads.
17
Pipelining Lessons
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload. Pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce speedup. Time to "fill" the pipeline and time to "drain" it reduce speedup. Stall for dependences.
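A quick sanity check on these numbers: a minimal C sketch (using the 30/40/20-minute stage times and four loads from the slides; an illustrative calculation only) showing that the pipelined rate is set by the slowest stage:

/* Sequential time = loads * (sum of stage times);
 * pipelined time  = sum of stage times + (loads - 1) * slowest stage. */
#include <stdio.h>

int main(void) {
    int stages[] = {30, 40, 20};        /* wash, dry, fold (minutes) */
    int n_stages = 3, loads = 4;
    int sum = 0, slowest = 0;
    for (int i = 0; i < n_stages; i++) {
        sum += stages[i];
        if (stages[i] > slowest) slowest = stages[i];
    }
    int sequential = loads * sum;                  /* 4 * 90 = 360 min = 6 hours   */
    int pipelined  = sum + (loads - 1) * slowest;  /* 90 + 3*40 = 210 min = 3.5 hours */
    printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
    return 0;
}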
18
Execution Cycle
Instruction Fetch: obtain instruction from program storage
Decode: determine required actions and instruction size
Operand: locate and obtain operand data
Execute: compute result value or status
Result Store: deposit results in storage for later use
Next: determine successor instruction
19
The Five Stages of Load
Load goes through Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 through 5.
Ifetch (Instruction Fetch): fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: calculate the memory address
Mem: read the data from the Data Memory
Wr: write the data back to the register file
Each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
20
Note: These 5 stages were there all along!
Fetch: IR <= MEM[PC]; PC <= PC + 4
Execute: R-type: ALUout <= A fun B; ORi: ALUout <= A op ZX; LW/SW: ALUout <= A + SX; BEQ: ALUout <= PC + SX, then if A = B, PC <= ALUout
Memory: LW: M <= MEM[ALUout]; SW: MEM[ALUout] <= B
Write-back: R-type: R[rd] <= ALUout; ORi: R[rt] <= ALUout; LW: R[rt] <= M
The multicycle state machine's states group naturally into Fetch, Decode, Execute, Memory, Write-back.
21
Pipelining Improve performance by increasing throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
22
Basic Idea What do we need to add to split the datapath into stages?
23
Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths
24
Conventional Pipelined Execution Representation
(Diagram: each instruction passes through IFetch, Dcd, Exec, Mem, WB; successive instructions in program order start one cycle later, so their stages overlap in time.)
25
Single Cycle, Multiple Cycle, vs. Pipeline
Timing diagrams compare the single-cycle, multiple-cycle, and pipeline implementations for the sequence Load, Store, R-type. In the pipeline implementation, each instruction goes through Ifetch, Reg, Exec, Mem, Wr, and the whole sequence finishes in seven cycles. In the multiple-clock-cycle implementation, the Store cannot start executing until Cycle 6 because we must wait for the Load instruction to complete, and the R-type instruction cannot start until the Store has completed its execution in Cycle 9. In the single-cycle implementation, the cycle time is set to accommodate the longest instruction, the Load; consequently its cycle time can be five times longer than the multiple-cycle implementation's. Perhaps more importantly, since the cycle time has to be long enough for the Load, it is too long for the Store, so the last part of that cycle is wasted.
26
Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
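The same arithmetic, as a small C sketch (cycle times and CPIs taken straight from the slide; nothing here is measured data):

#include <stdio.h>

int main(void) {
    int n = 100;                               /* instructions executed */
    double single    = 45.0 * 1.0 * n;         /* 45 ns/cycle * 1 CPI          = 4500 ns */
    double multi     = 10.0 * 4.6 * n;         /* 10 ns/cycle * 4.6 CPI        = 4600 ns */
    double pipelined = 10.0 * (1.0 * n + 4);   /* 1 CPI plus 4 cycles to drain = 1040 ns */
    printf("single-cycle: %.0f ns, multicycle: %.0f ns, pipelined: %.0f ns\n",
           single, multi, pipelined);
    return 0;
}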
27
Why Pipeline? Because we can!
(Pipeline diagram: Inst 0 through Inst 4 each flow through Im, Reg, ALU, Dm, Reg, each starting one clock cycle after the previous one.)
28
Can pipelining get us into trouble?
Yes: pipeline hazards.
Structural hazards: attempt to use the same resource two different ways at the same time. E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV).
Control hazards: attempt to make a decision before the condition is evaluated. E.g., washing football uniforms and needing to get the proper detergent level: need to see the result after the dryer before starting the next load. Branch instructions.
Data hazards: attempt to use an item before it is ready. E.g., one sock of a pair is in the dryer and one is in the washer: can't fold until the sock gets from the washer through the dryer. An instruction depends on the result of a prior instruction still in the pipeline.
Can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it.
29
Single Memory is a Structural Hazard
(Pipeline diagram: Load and Instr 1-4 overlap; the Load's data-memory access in its Mem stage collides with a later instruction's fetch from the same single memory.) Detection is easy in this case! (right half highlight means read, left half write)
30
Structural Hazards limit performance
Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then the average CPI is at least 1.3; otherwise the memory resource would be more than 100% utilized.
31
Control Hazard Solution #1: Stall
Stall: wait until the decision is clear. Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow. Moving the decision to the end of decode saves 1 cycle per branch.
32
Control Hazard Solution #2: Predict
Predict: guess one direction, then back up if wrong. Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of the time). Need to "squash" and restart the following instruction if wrong. Produces a CPI on branches of (1 * .5 + 2 * .5) = 1.5. Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (with 20% branches). A more dynamic scheme keeps a history for each branch (right roughly 90% of the time).
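To make the per-branch history idea concrete, here is a minimal sketch of a 1-bit branch-history table in C. The table size and the PC indexing are illustrative assumptions, not a description of any particular processor:

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 1024
static bool bht[BHT_ENTRIES];              /* one bit per entry: true = predict taken */

/* Predict using the last recorded outcome of the branch that indexed this entry. */
static bool predict(uint32_t pc) {
    return bht[(pc >> 2) % BHT_ENTRIES];   /* index by word-aligned PC */
}

/* After the branch resolves, remember its actual outcome. */
static void update(uint32_t pc, bool taken) {
    bht[(pc >> 2) % BHT_ENTRIES] = taken;
}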
33
Control Hazard Solution #3: Delayed Branch
Delayed Branch: redefine branch behavior (the branch takes effect after the next instruction). Impact: 0 lost clock cycles per branch instruction if the compiler can find an instruction to put in the "slot" (about 50% of the time). As we launch more instructions per clock cycle, this becomes less useful.
34
Delayed/Predicted Branch
Where to get instructions to fill the branch delay slot? From before the branch instruction; from the target address (only valuable when the branch is taken); or from the fall-through path (only valuable when the branch is not taken). Cancelling branches allow more slots to be filled. Compiler effectiveness for a single branch delay slot: fills about 60% of branch delay slots; about 80% of instructions executed in branch delay slots are useful in computation; so about 50% (60% x 80%) of slots are usefully filled. Delayed-branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), a single delay slot helps less.
35
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
36
Data Hazard on r1: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9
Dependencies backwards in time are hazards. (Pipeline diagram, stages IF, ID/RF, EX, MEM, WB: add r1,r2,r3 writes r1 in WB, while the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11 read r1 in earlier cycles.)
37
Data Hazard Solution: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7
"Forward" the result from one stage to another. The "or" is OK if register read/write are defined properly. (Pipeline diagram as before, with the add's result bypassed to the dependent instructions' EX stages.)
38
Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards. Can't solve this one with forwarding: must delay/stall the instruction dependent on the load. (Diagram: lw r1,0(r2) produces r1 at the end of its MEM stage, but sub r4,r1,r3 wants it at the start of its EX stage, one cycle earlier.)
39
Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards. Can't solve with forwarding: must delay/stall the instruction dependent on the load. (Diagram: sub r4,r1,r3 is stalled one cycle so that r1 can be forwarded from the lw's MEM stage.)
40
Detecting Control Signals
Situation: no dependence. Example: LD R1, 45(R2) / DADD R5, R6, R7 / DSUB R8, R6, R7 / OR R9, R6, R7. Action: no hazards.
Situation: dependence requiring a stall. Example: LD R1, 45(R2) / DADD R5, R1, R7. Action: detect the use of R1 during ID of DADD and stall.
Situation: dependence overcome by forwarding. Example: DSUB R8, R1, R7. Action: detect the use of R1 during ID of DSUB and set the mux control signal that accepts the result from the bypass path.
Situation: dependence with accesses in order. Example: OR R9, R1, R7. Action: no action required.
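A hedged sketch of the two checks this table describes, written in C for a generic 5-stage pipeline; the struct fields, enum names, and the absence of a register-zero special case are simplifying assumptions, not a real design:

#include <stdbool.h>

typedef struct { int rs, rt, rd; bool is_load, reg_write; } Instr;

enum FwdSel { FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB };

/* Stall if the instruction in EX is a load whose destination is
 * needed by the instruction currently in ID (load-use hazard). */
bool load_use_stall(const Instr *id, const Instr *ex) {
    return ex->is_load && (ex->rd == id->rs || ex->rd == id->rt);
}

/* Pick the source for one ALU operand (here, rs) of the instruction in EX,
 * preferring the most recent producer still in the pipeline. */
enum FwdSel forward_rs(const Instr *ex, const Instr *ex_mem, const Instr *mem_wb) {
    if (ex_mem->reg_write && ex_mem->rd == ex->rs) return FROM_EX_MEM;
    if (mem_wb->reg_write && mem_wb->rd == ex->rs) return FROM_MEM_WB;
    return FROM_REGFILE;
}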
41
Conflicts/Problems I-cache and D-cache are accessed in the same cycle – it helps to implement them separately Registers are read and written in the same cycle – easy to deal with if register read/write time equals cycle time/2 (else, use bypassing) Branch target changes only at the end of the second stage -- what do you do in the meantime? Data between stages get latched into registers (overhead that increases latency per instruction)
42
Control Hazards Simple techniques to handle control hazard stalls:
for every branch, introduce a stall cycle (note: every 6th instruction is a branch!) assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
43
Slowdowns from Stalls Perfect pipelining with no hazards: an instruction completes every cycle (total cycles ~ number of instructions), and speedup = increase in clock speed = number of pipeline stages. With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes. Total cycles = number of instructions + stall cycles. Slowdown because of stalls = 1 / (1 + stall cycles per instruction); for example, 0.2 stall cycles per instruction leaves 1/1.2, about 83%, of the ideal throughput.
44
Control and Datapath: Split state diag into 5 pieces
Fetch: IR <- Mem[PC]; PC <- PC+4
Decode: A <- R[rs]; B <- R[rt]
Execute: S <- A + B; S <- A or ZX; S <- A + SX; if Cond, PC <- PC+SX
Memory: M <- Mem[S]; Mem[S] <- B
Write-back: R[rd] <- S; R[rt] <- S; R[rd] <- M
(Datapath blocks: Inst. Mem, Reg. File, Exec, Mem Access / Data Mem, with PC, IR, A, B, S, M, D as the registers between stages.)
45
Three Generic Data Hazards
InstrI followed by InstrJ Read After Write (RAW) InstrJ tries to read operand before InstrI writes it
46
Three Generic Data Hazards
InstrI followed by InstrJ. Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand. Can't happen in the DLX 5-stage pipeline because: all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.
47
Three Generic Data Hazards
InstrI followed by InstrJ Write After Write (WAW) InstrJ tries to write operand before InstrI writes it Leaves wrong result ( InstrI not InstrJ ) Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Can have WAR and WAW in more complicated pipes
48
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
49
Summary: Pipelining Reduce CPI by overlapping many instructions
Average throughput of approximately 1 CPI with fast clock Utilize capabilities of the Datapath start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction
50
Some Issues for your consideration
Won't be tested. We'll talk about modern processors and what's really hard: exception handling, trying to improve performance with out-of-order execution, etc., and trying to get CPI < 1 (superscalar execution).
51
Superscalar Execution
Throwing more hardware at the problem: instruction-level parallelism (ILP) via multiple functional units, e.g., multiple ALUs. Independent instructions such as Add a, b, c and Add d, e, f can execute at the same time, so we can get CPI < 1!
52
Out-of-order execution
Idea: It’s best if we keep all functional units busy Can sometimes reorder computations to take advantage of functional units that are otherwise idle Automatically do reordering like we did 4 slides ago!
53
Register Renaming Internally rename registers, allow for better ILP
Add a, b, c Add b, c, d
54
Hyperthreading/Multicore
>1 “virtual” CPUs Multi-core: >1 actual CPUs per die
55
Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer * Die yield)
Dies per wafer = [pi * (Wafer_diam / 2)^2] / Die_Area - [pi * Wafer_diam] / sqrt(2 * Die_Area) - Test dies
Die yield = Wafer yield * (1 + Defects_per_unit_area * Die_Area / alpha)^(-alpha)
(Not on the midterm.) Die cost goes roughly with die area^4.
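A small C sketch that plugs numbers into these formulas; all of the input values below are made up for illustration, not data for any real process:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double PI = 3.14159265358979;
    double wafer_diam  = 20.0;    /* cm (illustrative)           */
    double die_area    = 1.0;     /* cm^2 (illustrative)         */
    double wafer_cost  = 1000.0;  /* $ (illustrative)            */
    double defects     = 1.0;     /* defects per cm^2            */
    double alpha       = 3.0;     /* process complexity parameter */
    double wafer_yield = 1.0;
    int    test_dies   = 4;

    double dies_per_wafer = PI * pow(wafer_diam / 2, 2) / die_area
                          - PI * wafer_diam / sqrt(2 * die_area)
                          - test_dies;
    double die_yield = wafer_yield * pow(1 + defects * die_area / alpha, -alpha);
    double die_cost  = wafer_cost / (dies_per_wafer * die_yield);

    printf("dies/wafer = %.0f, die yield = %.2f, die cost = $%.2f\n",
           dies_per_wafer, die_yield, die_cost);
    return 0;
}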
56
Real World Examples (die cost per chip): 386DX $4, 486DX $12, PowerPC $53, HP PA $73, DEC Alpha $149, SuperSPARC $272, Pentium $417. The original table also listed metal layers, line width, wafer cost, defects/cm2, die area (mm2), dies per wafer, and yield for each chip. From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15.
57
Midterm Questions Examples: List and describe 3 types of DRAM
What are the relative advantages/disadvantages of RISC vs. CISC? Why do we have a memory hierarchy? Using your choice of assembly, write a (commented) routine that computes the nth Fibonacci number. Why do CPUs have registers? Describe how a 3-disk RAID-5 system works.
58
Midterm Questions More Examples:
What are the differences between a synchronous and an asynchronous bus? What are the relative advantages/disadvantages? List and describe techniques to improve cache miss rate, reduce cache miss penalty, and reduce cache hit time.
59
Topics for further study
The following slides will not be covered in class or on tests.
60
Multicycle Instructions
Functional unit: Latency / Initiation interval
Integer ALU: 1
Data memory: 2
FP add: 4
FP multiply: 7
FP divide: 25
61
Effects of Multicycle Instructions
Structural hazards if the unit is not fully pipelined (divider) Frequent RAW hazard stalls Potentially multiple writes to the register file in a cycle WAW hazards because of out-of-order instr completion Imprecise exceptions because of o-o-o instr completion
62
Precise Exceptions On an exception:
must save PC of instruction where program must resume all instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own) temporary program state not in memory (in other words, registers) has to be stored in memory potential problems if a later instruction has already modified memory or registers A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and of course, correctness)
63
Dealing with these Effects
Multiple writes to the register file: increase the number of ports, stall one of the writers during ID, stall one of the writers during WB (the stall will propagate) WAW hazards: detect the hazard during ID and stall the later instruction Imprecise exceptions: buffer the results if they complete early or save more pipeline state so that you can return to exactly the same state that you left at
64
ILP Instruction-level parallelism: overlap among instructions:
pipelining or multiple instruction execution What determines the degree of ILP? dependences: property of the program hazards: property of the pipeline
65
Types of Dependences Data dependences: an instr produces a result for another (true dependence, results in RAW hazards in a pipeline) Name dependences: two instrs that use the same names (anti and output dependences, result in WAR and WAW hazards in a pipeline) Control dependences: an instruction’s execution depends on the result of a branch – re-ordering should preserve exception behavior and dataflow
66
An Out-of-Order Processor Implementation
(Block diagram: instructions flow from the Instr Fetch Queue through branch prediction and instruction fetch, then Decode & Rename, into the Issue Queue (IQ) and the Reorder Buffer (ROB, holding Instr 1-6 with temporaries T1-T6); results are written to the ROB and tags are broadcast to the IQ; the Register File holds R1-R32.)
Original code: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2; R1 <- R3+R2
Renamed code: T1 <- R1+R2; T2 <- T1+R3; BEQZ T2; T4 <- T1+T2; T5 <- T4+T2
67
Design Details - I Instructions enter the pipeline in order
No need for branch delay slots if prediction happens in time Instructions leave the pipeline in order – all instructions that enter also get placed in the ROB – the process of an instruction leaving the ROB (in order) is called commit – an instruction commits only if it and all instructions before it have completed successfully (without an exception) To preserve precise exceptions, a result is written into the register file only when the instruction commits – until then, the result is saved in a temporary register in the ROB
68
Design Details - II Instructions get renamed and placed in the issue queue – some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6) As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue – instructions now know if their operands are ready When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution) Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided
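A minimal C sketch of the wakeup/select step just described: completing instructions broadcast their destination tag, matching source operands are marked ready, and a ready instruction may issue. The entry layout, tag encoding, and selection policy are assumptions for illustration:

#include <stdbool.h>

#define IQ_SIZE 16

typedef struct {
    bool valid;
    int  src_tag[2];     /* which in-flight result (tag) each source waits on */
    bool src_ready[2];   /* true once that operand's value is available       */
} IQEntry;

static IQEntry iq[IQ_SIZE];

/* A completing instruction broadcasts its destination tag to the issue queue. */
void broadcast(int tag) {
    for (int i = 0; i < IQ_SIZE; i++)
        for (int s = 0; s < 2; s++)
            if (iq[i].valid && iq[i].src_tag[s] == tag)
                iq[i].src_ready[s] = true;
}

/* Select one instruction whose operands are all ready (returns its index, or -1). */
int select_ready(void) {
    for (int i = 0; i < IQ_SIZE; i++)
        if (iq[i].valid && iq[i].src_ready[0] && iq[i].src_ready[1])
            return i;
    return -1;
}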
69
Design Details - III If instr-3 raises an exception, wait until it reaches the top of the ROB – at this point, R1-R32 contain results for all instructions up to instr-3 – save registers, save PC of instr-3, and service the exception If branch is a mispredict, flush all instructions after the branch and start on the correct path – mispredicted instrs will not have updated registers (the branch cannot commit until it has completed and the flush happens as soon as the branch completes) Potential problems: ?
70
Managing Register Names
Temporary values are stored in the register file and not in the ROB. Logical registers: R1-R32. Physical registers: P1-P64. At the start, R1-R32 can be found in P1-P32. Instructions stop entering the pipeline when P64 is assigned.
Original: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2
Renamed: P33 <- P1+P2; P34 <- P33+P3; BEQZ P34; P35 <- P33+P34
What happens on commit?
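A minimal C sketch of the renaming step described above: a map table from logical to physical registers plus a free list. The sizes match the slide (32 logical, 64 physical, 0-indexed here), but the API and the stall-when-no-free-register policy are illustrative assumptions:

#define N_LOGICAL  32
#define N_PHYSICAL 64

static int map_table[N_LOGICAL];               /* logical -> physical register      */
static int free_list[N_PHYSICAL], free_count;  /* physical registers not yet mapped */

void rename_init(void) {
    for (int r = 0; r < N_LOGICAL; r++) map_table[r] = r;   /* R1-R32 start in P1-P32 */
    free_count = 0;
    for (int p = N_LOGICAL; p < N_PHYSICAL; p++) free_list[free_count++] = p;
}

/* Rename one instruction "dst <- src1 op src2": sources read the current
 * mapping, the destination gets a fresh physical register, or we stall
 * (return -1) when all extra physical registers are in use. */
int rename_instr(int dst, int src1, int src2, int *psrc1, int *psrc2) {
    *psrc1 = map_table[src1];
    *psrc2 = map_table[src2];
    if (free_count == 0) return -1;            /* no free physical register: stall */
    int pdst = free_list[--free_count];
    map_table[dst] = pdst;
    return pdst;
}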
71
The Commit Process On commit, no copy is required
The register map table is updated – the “committed” value of R1 is now in P33 and not P1 – on an exception, P33 is copied to memory and not P1 An instruction in the issue queue need not modify its input operand when the producer commits When instruction-1 commits, we no longer have any use for P1 – it is put in a free pool and a new instruction can now enter the pipeline for every instr that commits, a new instr can enter the pipeline number of in-flight instrs is a constant = number of extra (rename) registers
72
The Alpha 21264 Out-of-Order Implementation
(Block diagram: like the generic out-of-order design, but temporaries live in one physical register file P1-P64, and a Register Map Table (R1->P1, R2->P2, ...) tracks the current mapping; results are written to the register file and tags are broadcast to the IQ.)
Original: R1 <- R1+R2; R2 <- R1+R3; BEQZ R2; R3 <- R1+R2; R1 <- R3+R2
Renamed: P33 <- P1+P2; P34 <- P33+P3; BEQZ P34; P35 <- P33+P34; P36 <- P35+P34
73
Lecture 11: Advanced Static ILP
Topics: loop unrolling, software pipelining (Section 4.4)
74
Loop Dependences If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as the order within an iteration is preserved. If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried". Not all loop-carried dependences imply a lack of parallelism. Parallel loops are especially desirable in a multiprocessor system.
75
Examples
For (i=1000; i>0; i=i-1) x[i] = x[i] + s;   (no dependences)
For (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];     S1
  B[i+1] = B[i] + A[i+1];   S2
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
For (i=1000; i>0; i=i-1) x[i] = x[i-3] + s;   S1
S1 depends on S1 from 3 previous iterations; referred to as a recursion; dependence distance 3; limited parallelism.
76
Finding Dependences – the GCD Test
Do A[ai + b] and A[ci + d] refer to the same element? Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a, b are constants); an example of a non-affine index is x[y[i]]. For a dependence to exist, there must be two indices j and k within the loop bounds such that aj + b = ck + d, i.e., aj - ck = d - b. Let G = GCD(a, c); then (aj/G - ck/G) = (d - b)/G. If (d - b)/G is not an integer, the initial equality cannot be true.
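The test is easy to express in code. A small C sketch, under the affine-index assumption above (note the test is conservative: passing it only means a dependence cannot be ruled out):

#include <stdbool.h>
#include <stdlib.h>

static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* Returns true when the GCD test cannot rule out a dependence between
 * accesses A[a*i + b] and A[c*i + d]. */
bool gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    return g != 0 && (d - b) % g == 0;
}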
77
Static vs. Dynamic ILP
Original loop:
Loop: L.D F0, 0(R1)        ; F0 = array element
      ADD.D F4, F0, F2     ; add scalar in F2
      S.D F4, 0(R1)        ; store result
      DADDUI R1, R1, #-8   ; decrement address pointer
      BNE R1, R2, Loop     ; branch if R1 != R2
Statically unrolled (and scheduled) loop:
Loop: L.D F0, 0(R1)
      L.D F6, -8(R1)
      L.D F10, -16(R1)
      L.D F14, -24(R1)
      ADD.D F4, F0, F2
      ADD.D F8, F6, F2
      ADD.D F12, F10, F2
      ADD.D F16, F14, F2
      S.D F4, 0(R1)
      S.D F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D F12, 16(R1)
      BNE R1, R2, Loop
      S.D F16, 8(R1)
Large-window dynamic out-of-order processor: runs the original (not unrolled) loop, relying on hardware to overlap iterations.
78
Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1)
DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop Renamed
79
Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1)
DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop 1 3 6 2 4 7 5 8 9 Cycle of Issue Renamed
80
Loop Pipeline
(Diagram: the L.D, ADD.D, S.D, DADDUI, BNE of successive loop iterations overlap in time, forming a pipeline of iterations.)
81
Statically Unrolled Loop
Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) L.D F18, -32(R1) ADD.D F4, F0, F2 L.D F22, -40(R1) ADD.D F8, F6, F2 L.D F26, -48(R1) ADD.D F12, F10, F2 L.D F30, -56(R1) ADD.D F16, F14, F2 L.D F34, -64(R1) ADD.D F20, F18, F2 S.D F4, 0(R1) L.D F38, -72(R1) ADD.D F24, F22, F2 S.D F8, -8(R1) … … S.D F12, 16(R1) S.D F16, 8(R1) DADDUI R1, R1, # -32 S.D BNE R1,R2, Loop
82
Static Vs. Dynamic
(Graph: new iterations completed per cycle, for dynamic ILP vs. static ILP.)
What if I doubled the number of resources in each processor? What if I unrolled the loop and executed it on a dynamic ILP processor?
83
Static vs. Dynamic Dynamic: because of the loop index, at most one iteration can start every cycle – even fewer if there are resource constraints – in other words, we have a pipeline that has a throughput of one iteration per cycle! Static: by eliminating loop index, each iteration is independent as many loops can start in a cycle as there are resources – however, after a while, we don’t start any more iterations – thus, loop unrolling provides a brief steady state, where an iteration starts/finishes every cycle and the rest is start-up/wind-down for each unrolled loop
84
Software Pipeline?!
(Diagram: as with the loop pipeline, the L.D, ADD.D, S.D, DADDUI, BNE of successive iterations overlap in time.)
85
Software Pipelining
Original loop:
Loop: L.D F0, 0(R1)
      ADD.D F4, F0, F2
      S.D F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE R1, R2, Loop
Software-pipelined loop:
Loop: S.D F4, 16(R1)
      ADD.D F4, F0, F2
      L.D F0, 0(R1)
      DADDUI R1, R1, #-8
      BNE R1, R2, Loop
Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion; an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state; a sw-pipelined loop can also be unrolled to reduce loop overhead. Disadvantages: does not reduce loop overhead, and may require more registers.
86
Loop Dependences If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as the order within an iteration is preserved. If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried". Not all loop-carried dependences imply a lack of parallelism. Parallel loops are especially desirable in a multiprocessor system.
87
Examples
For (i=1000; i>0; i=i-1) x[i] = x[i] + s;   (no dependences)
For (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];     S1
  B[i+1] = B[i] + A[i+1];   S2
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
For (i=1000; i>0; i=i-1) x[i] = x[i-3] + s;   S1
S1 depends on S1 from 3 previous iterations; referred to as a recursion; dependence distance 3; limited parallelism.
88
Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel.
For (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];       S1
  B[i+1] = C[i] + D[i];     S2
}
S1 depends on S2 from the previous iteration.
Restructured:
A[1] = A[1] + B[1];
For (i=1; i<=99; i=i+1) {
  B[i+1] = C[i] + D[i];         S3
  A[i+1] = A[i+1] + B[i+1];     S4
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration.
89
Finding Dependences – the GCD Test
Do A[ai + b] and A[ci + d] refer to the same element? Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a, b are constants); an example of a non-affine index is x[y[i]]. For a dependence to exist, there must be two indices j and k within the loop bounds such that aj + b = ck + d, i.e., aj - ck = d - b. Let G = GCD(a, c); then (aj/G - ck/G) = (d - b)/G. If (d - b)/G is not an integer, the initial equality cannot be true.
90
Predication A branch within a loop can be problematic to schedule
Control dependences are a problem because of the need to re-fetch on a mispredict For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions
91
Predicated or Conditional Instructions
The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op. Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op. This replaces a control dependence with a data dependence (branches disappear); it may need register copies for the condition or for values used by both directions.
if (R1 == 0)
  R2 = R2 + R4
else
  R6 = R3 + R5
R4 = R2 + R3
becomes:
R7 = !R1 ; R8 = R2
R2 = R2 + R4   (predicated on R7)
R6 = R3 + R5   (predicated on R1)
R4 = R8 + R3   (predicated on R1)
92
Complications Each instruction has one more input operand, requiring more register ports/bypassing. If the branch condition is not known, the instruction stalls (remember, these are in-order processors). Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline: wasted work. Increases register pressure and activity on the functional units. Does not help if the branch condition takes a while to evaluate.
93
Support for Speculation
In general, when we re-order instructions, register renaming can ensure that we do not violate register data dependences. However, we need hardware support to ensure that an exception is raised at the correct point, and to ensure that we do not violate memory dependences (e.g., when a ld is hoisted above a preceding st or br).
94
Detecting Exceptions Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults) For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative Note that a speculative instruction needs a special opcode to indicate that it is speculative
95
Program-Terminate Exceptions
When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing (NAT) value into the destination register. If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desirable that the error is caused by an array access, but the core dump happens two procedures later). Alternatively, an instruction (the sentinel) in the speculative instruction's original location checks the register value and initiates recovery.
96
Memory Dependence Detection
If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address, else, the load has to re-execute When the speculative load issues, it stores its address in a table (Advanced Load Address Table in the IA-64) If a store finds its address in the ALAT, it indicates that a violation occurred for that address A special instruction (the sentinel) in the load’s original location checks to see if the address had a violation and re-executes the load if necessary
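A hedged C sketch of the idea (not the actual IA-64 ALAT structure): speculative loads record their addresses, every store checks for a match, and the sentinel at the load's original location decides whether the load must re-execute. Table size and indexing are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define ALAT_ENTRIES 32
static struct { uint64_t addr; bool valid, violated; } alat[ALAT_ENTRIES];

static unsigned slot(uint64_t addr) { return (addr >> 3) % ALAT_ENTRIES; }

/* A speculative load records its address when it issues. */
void alat_record_load(uint64_t addr) {
    unsigned s = slot(addr);
    alat[s].addr = addr; alat[s].valid = true; alat[s].violated = false;
}

/* Every store checks the table; a match marks a violation. */
void alat_check_store(uint64_t addr) {
    unsigned s = slot(addr);
    if (alat[s].valid && alat[s].addr == addr) alat[s].violated = true;
}

/* Sentinel at the load's original location: re-execute if the entry is
 * missing, was overwritten, or saw a conflicting store. */
bool alat_must_replay(uint64_t addr) {
    unsigned s = slot(addr);
    return !alat[s].valid || alat[s].addr != addr || alat[s].violated;
}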
97
Dynamic Vs. Static ILP Static ILP:
+ The compiler finds parallelism, so no scoreboarding is needed, giving higher clock speeds and lower power.
+ The compiler knows what comes next, so it can build a better global schedule.
- The compiler cannot react to dynamic events (cache misses).
- It cannot re-order instructions unless you provide hardware and extra instructions to detect violations (which eats into the low complexity/power argument).
- Static branch prediction is poor; even statically scheduled processors use hardware branch predictors.
- Building an optimizing compiler is easier said than done.
A comparison of the Alpha, Pentium 4, and Itanium (the statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power.
98
Summary Topics: scheduling, loop unrolling, software pipelining,
predication, violations while re-ordering instructions Static ILP is a great approach for handling embedded domains For the high performance domain, designers have added many frills, bells, and whistles to eke out additional performance, while compromising power/complexity