Advanced Pipelining
  Optimally Scheduling Code
  Optimally Programming Code
  Scheduling for Superscalars (6.9)
  Exceptions (5.6, 6.8)

Optimally schedule code
  for (i = 0; i < N; i++)
      A[i] = A[i] + 10;
  &(A[0]) in $s1
  &(A[N]) in $s2
        slt  $t1, $s3, $s0
        beq  $t1, $0, end
  loop: lw   $t0, 0($s1)
        addi $t0, $t0, 10
        sw   $t0, 0($s1)
        addi $s1, $s1, 4
        slt  $t1, $s1, $s2
        bne  $t1, $0, loop

1. Identify Dependencies
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  $t0: lw -> addi (RAW)
  $t0: addi -> sw (RAW)

2. Draw timing diagram WITH DATA FORWARDING
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  (Timing diagram: one row per instruction, stages F D X M W.)

3. Remove WAR/WAW dependencies
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  Dependency types: RAW, WAR, WAW
  (Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne.)
  Target the false dependencies

3. Remove WAR/WAW dependencies
  Original            Incorrect           Correct
  lw   $t0, 0($s1)    lw   $t0, 0($s1)    lw   $t0, 0($s1)
  sw   $t0, 0($s1)    addi $s1, $s1, 4    addi $s1, $s1, 4
  addi $s1, $s1, 4    sw   $t0, 0($s1)    sw   $t0, -4($s1)

  Reordered (fill in the store offset)    Original
  lw   $t0, 0($s1)                        lw   $t0, 0($s1)
  addi $s1, $s1, 4                        addi $t0, $t0, 10
  addi $t0, $t0, 10                       sw   $t0, 0($s1)
  sw   $t0, ____($s1)                     addi $s1, $s1, 4
  slt  $t1, $s1, $s2                      slt  $t1, $s1, $s2
  bne  $t1, $0, loop                      bne  $t1, $0, loop

3. Remove WAR/WAW dependencies
    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    addi $t0, $t0, 10
    slt  $t1, $s1, $s2
    sw   $t0, -4($s1)
    bne  $t1, $0, loop
  (Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne.)
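One way to convince yourself that the reordering is legal is to execute both versions. The sketch below is a hypothetical Python interpreter for only the five instructions this loop uses; `run`, `ORIGINAL`, `SCHEDULED`, `NAIVE`, the tuple encoding, and the base address 100 are illustrative assumptions, not part of the slides. It models no pipeline timing, only the values that end up in memory.

```python
def run(program, mem, s1, s2):
    """Interpret a tiny MIPS subset: lw, sw, addi, slt, bne (no delay slots)."""
    mem = dict(mem)  # work on a copy so the caller's memory image is reusable
    reg = {"$0": 0, "$s1": s1, "$s2": s2, "$t0": 0, "$t1": 0}
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        pc += 1
        if op == "lw":        # lw a, b(c)
            reg[a] = mem[reg[c] + b]
        elif op == "sw":      # sw a, b(c)
            mem[reg[c] + b] = reg[a]
        elif op == "addi":    # addi a, b, c
            reg[a] = reg[b] + c
        elif op == "slt":     # slt a, b, c
            reg[a] = 1 if reg[b] < reg[c] else 0
        elif op == "bne":     # bne a, b, <instruction index c>
            if reg[a] != reg[b]:
                pc = c
    return mem

ORIGINAL = [
    ("lw", "$t0", 0, "$s1"), ("addi", "$t0", "$t0", 10),
    ("sw", "$t0", 0, "$s1"), ("addi", "$s1", "$s1", 4),
    ("slt", "$t1", "$s1", "$s2"), ("bne", "$t1", "$0", 0),
]
# Pointer update hoisted above the store; the store offset compensates with -4.
SCHEDULED = [
    ("lw", "$t0", 0, "$s1"), ("addi", "$s1", "$s1", 4),
    ("addi", "$t0", "$t0", 10), ("slt", "$t1", "$s1", "$s2"),
    ("sw", "$t0", -4, "$s1"), ("bne", "$t1", "$0", 0),
]
# Same order as SCHEDULED but with the store offset left at 0 (the "Incorrect" case).
NAIVE = SCHEDULED[:4] + [("sw", "$t0", 0, "$s1")] + SCHEDULED[5:]

if __name__ == "__main__":
    A = {100: 1, 104: 2, 108: 3}          # three words starting at address 100
    print(run(ORIGINAL, A, 100, 112) == run(SCHEDULED, A, 100, 112))
    print(run(ORIGINAL, A, 100, 112) == run(NAIVE, A, 100, 112))
```

The first comparison holds and the second does not, which is the point of the -4 offset: once the addi moves above the sw, the store has to undo the pointer increment.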

Software Control Hazard Removal
    if ((x % 2) == 1)
        isodd = 1;
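A possible branch-free rewrite replaces the test with arithmetic on the low bit. The Python sketch below only demonstrates that the two forms compute the same value; the function names are made up for illustration, and the performance benefit applies to compiled code running on a pipelined processor, not to Python itself.

```python
def isodd_branch(x, isodd=0):
    # Original form: a conditional that the hardware must predict.
    if (x % 2) == 1:
        isodd = 1
    return isodd

def isodd_branchless(x, isodd=0):
    # Calculation instead of control flow: OR in the low bit of x.
    return isodd | (x & 1)
```

In C the two forms differ for negative odd `x` (there `x % 2` is -1), so a real rewrite would need to decide which behavior is intended.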

Software Control Hazard Removal
    if (x == true)
        y = false;
    else
        y = true;
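Here the if/else only selects the opposite truth value, so the whole statement can plausibly collapse to a single negation. A small sketch, with hypothetical function names and assuming `x` is a genuine boolean:

```python
def negate_branch(x):
    # Original form: two-way branch.
    if x == True:
        y = False
    else:
        y = True
    return y

def negate_branchless(x):
    # One logical operation, no branch to predict.
    return not x
```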

Software Control Hazard Removal
    if ((x == MON) || (x == TUE) || (x == WED)) { }
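A chain of equality tests like this one compiles to several branches. One common alternative is a bitmask lookup, sketched below. The numeric encoding of the days (MON = 0 through SUN = 6) and the names `EARLY_WEEK`, `in_set_branches`, and `in_set_mask` are assumptions for illustration; the slide does not say how the days are encoded.

```python
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)   # assumed encoding 0..6

# One bit per day; the set {MON, TUE, WED} becomes a constant.
EARLY_WEEK = (1 << MON) | (1 << TUE) | (1 << WED)

def in_set_branches(x):
    # Original form: up to three comparisons and branches.
    return x == MON or x == TUE or x == WED

def in_set_mask(x):
    # Shift and mask: a single test regardless of how many days are in the set.
    return ((EARLY_WEEK >> x) & 1) == 1
```

The mask version still ends in one conditional when it guards an `if`, but that is one branch instead of three.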

Increasing Branch Performance
    if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) { }

What does it all mean? Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

The moral is….. Calculation is less expensive than …..

Superscalars - Parallelism
  Ford mass produces cars. We want to “mass produce” instructions.
  Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.
  Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

“Superpipelining” (deep pipelining – many stages)
  Limiting returns because…
  – Register delays are __________________________ of clock
  – Difficult to __________________
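The effect of register delay on deep pipelines can be illustrated with a simple and commonly used model: the clock period is the logic delay divided evenly among the stages, plus a fixed pipeline-register delay per stage. The model, the function names, and the numbers (10 ns of logic, 0.5 ns per register) are assumptions chosen for illustration, not figures from the slides.

```python
def cycle_time(logic_delay_ns, stages, register_delay_ns):
    """Clock period if the logic splits evenly and every stage pays one register delay."""
    return logic_delay_ns / stages + register_delay_ns

def register_fraction(logic_delay_ns, stages, register_delay_ns):
    """Share of each clock period spent in the pipeline register rather than in logic."""
    return register_delay_ns / cycle_time(logic_delay_ns, stages, register_delay_ns)

if __name__ == "__main__":
    for stages in (5, 10, 20, 40):
        t = cycle_time(10.0, stages, 0.5)
        f = register_fraction(10.0, stages, 0.5)
        print(stages, "stages:", t, "ns per cycle,", round(100 * f), "% register delay")
```

Under these assumptions, doubling the depth from 5 to 10 stages shortens the cycle from 2.5 ns to 1.5 ns, while doubling again to 20 stages only reaches 1.0 ns, because the fixed register delay does not shrink.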

SuperScalars
  __________ parts of pipeline
  Multiple instructions in _______ stage at once

SuperScalars
  Which instructions can execute in parallel?
  Fetching multiple instructions per cycle

Static Scheduling – VLIW or EPIC (Itanium)
  __________ schedules the instructions
  If one instruction stalls, all following instructions stall
  Book Example: SuperScalar MIPS
  – Two instructions / cycle
  – One ALU/branch, one ld/st each cycle

Schedule for SS MIPS
  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $zero, Loop
  PC    ALU/branch    ld/st
  0
  8
  16
  24
  32

SuperScalars - Static
  (Pipeline diagram: stages Fetch, Decode, Execute, Memory, WriteBack, with Read Values and Write Values marked; instructions shown: lw, addu, sw, addi, bne.)

Loop Problem
  Problem:
  – Too many _______________ in loop
  – Not enough ______________ to fill in holes
  Solution:
  – Do ______________ at once
  – More instructions
  – Only one branch

Loop Unrolling 1. Unroll Loop
  Unrolled:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop
  Original:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop

Loop Unrolling 2. Rename Registers
  But wait!!! How has this helped? There are tons of dependencies! Whatever are we to do? Register Renaming!!!
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t1, 0($s1)
        addi $s1, $s1, -4
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop

Loop Unrolling 2. Rename Registers (Repeated slide for your reference)
  Renamed:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t1, 0($s1)
        addi $s1, $s1, -4
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
  Before renaming:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop

Loop Unrolling 3. Reduce Instructions
  Reduced (one pointer update per two elements):
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -8
        addu $t0, $t0, $s2
        sw   $t0, 8($s1)
        lw   $t1, 4($s1)
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
  Exercise version (fill in the offsets):
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, ___($s1)
        lw   $t1, ___($s1)
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
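The offsets in the reduced loop follow from moving the single pointer update ahead of the remaining memory accesses. A high-level Python analogue, indexing elements instead of bytes, may make that easier to see. The function names are hypothetical, and the index arithmetic (+2 and +1 elements standing in for the 8 and 4 byte offsets) is an assumption of this sketch rather than a translation of the exact loop bounds.

```python
def add_scalar(a, s):
    """One element per iteration, walking from the end of the array to the start."""
    i = len(a) - 1
    while i >= 0:
        a[i] = a[i] + s
        i -= 1

def add_scalar_unrolled(a, s):
    """Two elements per iteration with a single combined index update."""
    if len(a) % 2 != 0:
        raise ValueError("this sketch assumes an even number of elements")
    i = len(a) - 1
    while i >= 0:
        t0 = a[i]          # lw   $t0, 0($s1)
        i -= 2             # addi $s1, $s1, -8   (one update instead of two)
        t0 = t0 + s        # addu $t0, $t0, $s2
        a[i + 2] = t0      # sw   $t0, 8($s1)    (offset undoes the update)
        t1 = a[i + 1]      # lw   $t1, 4($s1)
        t1 = t1 + s        # addu $t1, $t1, $s2
        a[i + 1] = t1      # sw   $t1, 4($s1)
```

Both functions leave the same values in the list; the unrolled one executes half as many loop tests and index updates.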

Loop Unrolling 4. Schedule
  Loop: lw   $t0, 0($s1)       (lw1)
        addi $s1, $s1, -8
        addu $t0, $t0, $s2     (addu1)
        sw   $t0, 8($s1)       (sw1)
        lw   $t1, 4($s1)       (lw2)
        addu $t1, $t1, $s2     (addu2)
        sw   $t1, 4($s1)       (sw2)
        bne  $s1, $zero, Loop
  ALU/branch    lw/sw
                lw1

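Performance Comparison
  Original    Unrolled
  (Schedule tables with columns ALU/branch and ld/st. Entries shown for the original loop:
   ALU/branch: addi $s1, $s1, -4; addu $t0, $t0, $s2; bne $s1, $zero, L
   ld/st: lw $t0, 0($s1); sw $t0, 4($s1))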
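To compare the two loops, the schedules can be produced mechanically. The sketch below is a greedy list scheduler for the two-issue machine described earlier (one ALU/branch slot and one ld/st slot per cycle). It is an illustrative stand-in, not the book's or the lecturer's method, and it rests on stated assumptions: a loaded value is usable two cycles after the load issues, an ALU result one cycle after, a write may share a cycle with an earlier read of the same register, the branch issues last, and the `addr` tags telling the scheduler which accesses touch the same word are supplied by hand. All names (`Instr`, `dependencies`, `schedule`, `ORIGINAL`, `UNROLLED`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    text: str
    slot: str               # "alu" (ALU/branch) or "mem" (ld/st)
    writes: tuple = ()
    reads: tuple = ()
    latency: int = 1        # cycles before a consumer of the result may issue
    addr: str = ""          # hand-written tag naming the memory word accessed
    is_store: bool = False
    is_branch: bool = False

def dependencies(prog):
    """Map j -> [(i, delay)]: instruction j may issue no earlier than issue[i] + delay."""
    deps = {j: [] for j in range(len(prog))}
    for j, later in enumerate(prog):
        for i in range(j):
            early = prog[i]
            if set(early.writes) & set(later.reads):      # RAW: wait for the result
                deps[j].append((i, early.latency))
            if set(early.reads) & set(later.writes):      # WAR: same cycle is fine
                deps[j].append((i, 0))
            if set(early.writes) & set(later.writes):     # WAW: keep write order
                deps[j].append((i, 1))
            if early.addr and early.addr == later.addr and (early.is_store or later.is_store):
                deps[j].append((i, 1))                    # same word, one is a store
            if later.is_branch:
                deps[j].append((i, 0))                    # branch ends the loop body
    return deps

def schedule(prog):
    """Return one (alu_text, mem_text) pair per cycle, filling slots in program order."""
    deps = dependencies(prog)
    issue = {}
    rows = []
    cycle = 0
    while len(issue) < len(prog):
        cycle += 1
        row = {"alu": "nop", "mem": "nop"}
        for j, ins in enumerate(prog):
            if j in issue or row[ins.slot] != "nop":
                continue
            if all(i in issue and issue[i] + d <= cycle for i, d in deps[j]):
                issue[j] = cycle
                row[ins.slot] = ins.text
        rows.append((row["alu"], row["mem"]))
    return rows

ORIGINAL = [
    Instr("lw $t0, 0($s1)", "mem", ("$t0",), ("$s1",), latency=2, addr="A[i]"),
    Instr("addi $s1, $s1, -4", "alu", ("$s1",), ("$s1",)),
    Instr("addu $t0, $t0, $s2", "alu", ("$t0",), ("$t0", "$s2")),
    Instr("sw $t0, 4($s1)", "mem", (), ("$t0", "$s1"), addr="A[i]", is_store=True),
    Instr("bne $s1, $zero, Loop", "alu", (), ("$s1",), is_branch=True),
]
UNROLLED = [
    Instr("lw $t0, 0($s1)", "mem", ("$t0",), ("$s1",), latency=2, addr="A[i]"),
    Instr("addi $s1, $s1, -8", "alu", ("$s1",), ("$s1",)),
    Instr("addu $t0, $t0, $s2", "alu", ("$t0",), ("$t0", "$s2")),
    Instr("sw $t0, 8($s1)", "mem", (), ("$t0", "$s1"), addr="A[i]", is_store=True),
    Instr("lw $t1, 4($s1)", "mem", ("$t1",), ("$s1",), latency=2, addr="A[i-1]"),
    Instr("addu $t1, $t1, $s2", "alu", ("$t1",), ("$t1", "$s2")),
    Instr("sw $t1, 4($s1)", "mem", (), ("$t1", "$s1"), addr="A[i-1]", is_store=True),
    Instr("bne $s1, $zero, Loop", "alu", (), ("$s1",), is_branch=True),
]

if __name__ == "__main__":
    for name, prog, elements in (("original", ORIGINAL, 1), ("unrolled", UNROLLED, 2)):
        rows = schedule(prog)
        for alu, mem in rows:
            print(f"{alu:24}{mem}")
        print(name, "cycles per element:", len(rows) / elements)
```

Under these assumptions the original loop needs 4 cycles per element and the two-way unrolled loop needs 5 cycles for two elements. Different latency assumptions, or unrolling further, would change those numbers.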

Static Scheduling Summary
  Code size ______________ (because of nops)
  It cannot resolve __________ dependencies
  If one instruction stalls, ___________________

Dynamic Scheduling
  _________ schedules ready instructions
  Only ___________ instructions stall
  _______________ resolved in hardware

4-wide Dynamic Superscalar - Fetch
  (Block diagram: Fetch, Register Alias Table, Register File, Instruction Window, functional units Ld/St, 1Add, 2Add, 3Add, Ld/St Queue, Commit Buffer.)
  Loop: lw   r2, 0(r1)
        addu r2, r2, r5
        sw   r2, 0(r1)
        addi r1, r1, -4
        bne  r1, r7, Loop
  Fetch 4 instructions each cycle

4-wide Dynamic Superscalar - Decode
  (Same block diagram. Instruction window entries shown after renaming: addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop.)
  Register Alias Table records
  1. Current Register Number (WAW/WAR – Register Renaming), or
  2. Functional Unit (RAW – result not ready)
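The renaming role of the Register Alias Table can be sketched in a few lines. Note the substitution: the slide tags operands with the functional unit that will produce them (for example `ldst1`), whereas this sketch uses the more generic scheme of handing out a fresh physical register name (`p0`, `p1`, ...) for every destination. The function `rename`, the tuple encoding, and the `p` names are all illustrative assumptions.

```python
def rename(program):
    """program: list of (op, dest_or_None, source_registers). Returns the renamed list."""
    rat = {}            # Register Alias Table: architectural name -> newest physical name
    renamed = []
    next_phys = 0
    for op, dest, srcs in program:
        # Sources read whichever name currently holds the value (true RAW dependence).
        srcs = tuple(rat.get(s, s) for s in srcs)
        if dest is not None:
            # A fresh destination name means later writes never clash with
            # earlier readers or writers (no WAR or WAW).
            rat[dest] = f"p{next_phys}"
            next_phys += 1
            dest = rat[dest]
        renamed.append((op, dest, srcs))
    return renamed

# One iteration of the loop from the slide, with immediates omitted.
BODY = [
    ("lw", "r2", ("r1",)),
    ("addu", "r2", ("r2", "r5")),
    ("sw", None, ("r2", "r1")),
    ("addi", "r1", ("r1",)),
    ("bne", None, ("r1", "r7")),
]

if __name__ == "__main__":
    for entry in rename(BODY * 2):      # two back-to-back iterations
        print(entry)
```

After renaming, the second iteration's `lw` writes a different name than the first iteration's, so the two iterations only remain linked through the pointer update, which is a true dependence.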

4-wide Dynamic Superscalar - Execute
  (Same block diagram and instruction window entries.)
  Wait until your inputs are ready
  Execute once they are ready

4-wide Dynamic Superscalar - Memory
  (Same block diagram and instruction window entries.)
  First calculate the address
  Ld/St Queue checks memory addresses – out of order lw/sw
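The slide does not spell out what the address check looks like. One plausible policy, sketched here purely as an assumption, is that a load looks at the older stores still in the queue, youngest first: it waits if it meets a store whose address is not yet known, takes the value directly from a store to the same address, and otherwise reads memory. The function name, the tuple layout, and the three result labels are invented for this sketch.

```python
def resolve_load(load_addr, older_stores, memory):
    """older_stores: (address or None, value) pairs for older stores, oldest first.

    Returns ("wait", None), ("forward", value), or ("memory", value).
    """
    for addr, value in reversed(older_stores):    # youngest older store first
        if addr is None:
            return ("wait", None)                 # address unknown: cannot pass it safely
        if addr == load_addr:
            return ("forward", value)             # newest store to this word supplies it
    return ("memory", memory[load_addr])          # no conflict: load may go ahead
```

For example, with `memory = {8: 1}`, a load from address 8 behind a store to address 4 reads memory, while the same load behind a store of 7 to address 8 receives 7 without touching memory.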

4-wide Dynamic Superscalar - Commit
  (Same block diagram. Key: Waiting for value / Reading value.)
  Instructions wait until all previous instructions have completed

Fallacies & Pitfalls
  Pipelining is easy
  – ______________ is difficult
  Instruction set has no impact on pipelining
  – Complicated _____________ & _____________________ instructions complicate pipelining immensely

Technology Influences
  Pipelining ideas are good ideas regardless of technology
  – Only recently, with extra chip space, has ___________________ become better than ____________________
  – Now, pipelining limited by ________

Exceptions – Unexpected Events
  Internal    External

Definitions
  a. Anything unexpected happens
  b. External event occurs
  c. Internal event occurs
  d. Change in control flow
             Exception    Interrupt
  PowerPC
  Intel
  MIPS

Exception-Handling
  Stop
  Transfer control to OS
  Tell OS what happened
  Begin executing where we left off

1. Detect Exception
  Add control lines to detect errors

Step 2: Store PC into EPC
  (Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory.)

Step 3: Tell OS the problem
  Store error code in the _________
  Use vectored interrupts
  – Use error code to determine _________

Cause Register
  Set a flag in the cause register
  How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
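One way to answer the question is a shift and mask on the value read from the cause register. The sketch below takes the bit position 5 from the slide; the names `OVERFLOW_BIT` and `overflow_occurred` are hypothetical, and how the OS reads the register in the first place is outside the sketch.

```python
OVERFLOW_BIT = 5   # bit position given on the slide

def overflow_occurred(cause):
    """True if bit 5 of the cause-register value is set."""
    return ((cause >> OVERFLOW_BIT) & 1) == 1
```

An equivalent test is `cause & 0x20 != 0`, since bit 5 has the value 32 (0x20).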

Vectored Interrupts
  The address of the trap handler is determined by the cause
  Exception type           Exception vector address (in hex)
  Undefined Instruction    C0 00 00 00 hex
  Arithmetic Overflow      C0 00 00 20 hex
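The table above amounts to a lookup from exception type to handler address. A minimal sketch follows; the two addresses come from the slide, while the dictionary keys and the function name are illustrative.

```python
# Addresses from the slide; key names are made up for this sketch.
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow": 0xC0000020,
}

def handler_address(exception_type):
    """Return the address the PC is set to for the given exception type."""
    return VECTOR_TABLE[exception_type]
```

The two entries are 0x20 = 32 bytes apart, which leaves room for eight 4-byte instructions at each vector.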

Cause Register – Go to OS
  (Datapath diagram as before, with added EPC register fed through a -4 adjustment, Cause register, and a fixed Handler PC input to the PC.)

Vectored Interrupt – Go to OS
  (Datapath diagram as before, with added EPC register fed through a -4 adjustment, Cause register, and a Vector Table supplying the new PC.)

Steps for Exceptions
  1. Detect exception
  2. Place processor in state before offending instruction
  3. Record exception type
  4. Record instruction’s PC in EPC
  5. Transfer control to OS
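The steps above can be sketched as a single state update on a toy processor model. Everything here is an assumption made for illustration: the dictionary fields, the function name, the idea that `pc` already holds PC+4 when the exception is noticed (suggested by the -4 adjustment in the datapath diagrams), and the handler address passed in by the caller. Step 1, detection, is taken to have happened before the call.

```python
def take_exception(cpu, cause_bit, handler_pc):
    """Apply steps 2-5 to a toy CPU state and return the new state.

    cpu is a dict with keys:
      'pc'            - already incremented past the offending instruction (PC + 4)
      'epc', 'cause'  - exception registers
      'pending_write' - (register, value) the offending instruction would write, or None
    """
    cpu = dict(cpu)                     # leave the caller's state untouched
    cpu["pending_write"] = None         # 2. discard the offending instruction's result
    cpu["cause"] |= 1 << cause_bit      # 3. record the exception type
    cpu["epc"] = cpu["pc"] - 4          # 4. record the offending instruction's PC
    cpu["pc"] = handler_pc              # 5. transfer control to the OS
    return cpu
```

For instance, an overflow (bit 5) detected while `pc` is 0x0040000C would leave `epc` at 0x00400008 and continue at whatever handler address the caller supplies.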

1. Detection
  What happens if the third instruction is undefined?
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3
  (Pipeline diagram: one row per instruction, stages from IF to WB, over cycles 1 to 8.)
  In what stage is it detected? In what cycle?

Must associate exception with proper instruction
  What happens if multiple exceptions happen in the same cycle?

2. Preserve state before instruction
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3
  (Pipeline diagram: one row per instruction over cycles 1 to 8.)
  What? What does that mean?!?

3. Record exception type
  Place value in cause register
  or
  Use vectored interrupts
  – (exception routine address dependent on exception type)

4. Record PC in EPC
  Machine in detection cycle
  (Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; instructions in flight: add, lw, Undef, or.)

4. Record PC in EPC
  Machine just before the transfer
  (Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; Undef instruction marked.)
  Where is the proper PC? Long gone!!!

4. Record PC in EPC
  Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)
  Precise exceptions
  Imprecise exceptions

5. Transfer control to OS
  Same as before
