Published by Tylor Birkes. Modified over 9 years ago.
1
Advanced Pipelining
Optimally Scheduling Code
Optimally Programming Code
Scheduling for Superscalars (6.9)
Exceptions (5.6, 6.8)
2
Optimally schedule code
for (i = 0; i < N; i++) A[i] = A[i] + 10;
&(A[0]) in $s1, &(A[i]) in $s2
      slt  $t1, $s3, $s0
      beq  $t1, $0, end
loop: lw   $t0, 0($s1)
      addi $t0, $t0, 10
      sw   $t0, 0($s1)
      addi $s1, $s1, 4
      slt  $t1, $s1, $s2
      bne  $t1, $0, loop
3
1. Identify Dependencies
lw   $t0, 0($s1)
addi $t0, $t0, 10
sw   $t0, 0($s1)
addi $s1, $s1, 4
slt  $t1, $s1, $s2
bne  $t1, $0, loop
$t0: lw -> addi (RAW)
$t0: addi -> sw (RAW)
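The dependency hunt in step 1 can be mechanized. The following is a minimal Python sketch, assuming a register-only view of the loop body (dependencies through memory are ignored). The helper names `parse` and `dependencies` are hypothetical and do not come from the slides.

```python
import re

LOOP = [
    "lw   $t0, 0($s1)",
    "addi $t0, $t0, 10",
    "sw   $t0, 0($s1)",
    "addi $s1, $s1, 4",
    "slt  $t1, $s1, $s2",
    "bne  $t1, $0, loop",
]

NO_DEST = {"sw", "bne", "beq"}  # these opcodes only read registers


def parse(instr):
    """Return (dest, sources) for one instruction string."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op in NO_DEST:
        return None, regs
    return regs[0], regs[1:]


def dependencies(program):
    """List (kind, earlier_index, later_index, register) tuples."""
    deps = []
    last_writer = {}  # register -> index of most recent writer
    readers = {}      # register -> readers since that write
    for j, instr in enumerate(program):
        dest, srcs = parse(instr)
        for s in srcs:
            if s in last_writer:
                deps.append(("RAW", last_writer[s], j, s))
            readers.setdefault(s, []).append(j)
        if dest is not None:
            for r in readers.get(dest, []):
                if r != j:  # an instruction does not depend on itself
                    deps.append(("WAR", r, j, dest))
            if dest in last_writer:
                deps.append(("WAW", last_writer[dest], j, dest))
            last_writer[dest] = j
            readers[dest] = []
    return deps


for dep in dependencies(LOOP):
    print(dep)
```

Besides the two RAW dependencies on $t0 listed on the slide, the sketch also reports the WAR dependencies on $s1 (lw and sw read it before addi overwrites it), which are the false dependencies targeted in step 3.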
4
2. Draw timing diagram WITH DATA FORWARDING
lw   $t0, 0($s1)
addi $t0, $t0, 10
sw   $t0, 0($s1)
addi $s1, $s1, 4
slt  $t1, $s1, $s2
bne  $t1, $0, loop
Stages: F D X M W
5
3. Remove WAR/WAW dependencies
lw   $t0, 0($s1)
addi $t0, $t0, 10
sw   $t0, 0($s1)
addi $s1, $s1, 4
slt  $t1, $s1, $s2
bne  $t1, $0, loop
Dependency types: RAW, WAR, WAW
[Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne]
Target the false dependencies
6
3. Remove WAR/WAW dependencies
Original:
lw   $t0, 0($s1)
sw   $t0, 0($s1)
addi $s1, $s1, 4
Incorrect:
lw   $t0, 0($s1)
addi $s1, $s1, 4
sw   $t0, 0($s1)
Correct:
lw   $t0, 0($s1)
addi
sw
7
Reordered:
lw   $t0, 0($s1)
addi $s1, $s1, 4
addi $t0, $t0, 10
sw   $t0, ____($s1)
slt  $t1, $s1, $s2
bne  $t1, $0, loop
Original:
lw   $t0, 0($s1)
addi $t0, $t0, 10
sw   $t0, 0($s1)
addi $s1, $s1, 4
slt  $t1, $s1, $s2
bne  $t1, $0, loop
8
3. Remove WAR/WAW dependencies
lw   $t0, 0($s1)
addi $s1, $s1, 4
addi $t0, $t0, 10
slt  $t1, $s1, $s2
sw   $t0, -4($s1)
bne  $t1, $0, loop
[Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne]
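One way to gain confidence in a reordering is to run both versions and compare results. The sketch below is a toy interpreter for just the five opcodes in this loop. It assumes $s1 starts at the address of A[0], $s2 holds the end address of the array, the array is non-empty, and there is no branch delay slot. The function `run` and the tuple encoding are hypothetical.

```python
def run(loop, values):
    """Execute the loop body until its final bne falls through."""
    mem = {4 * i: v for i, v in enumerate(values)}
    reg = {"s1": 0, "s2": 4 * len(values), "t0": 0, "t1": 0, "zero": 0}
    taken = True
    while taken:
        for op, a, b, c in loop:
            if op == "lw":      # lw a, b(c)
                reg[a] = mem[reg[c] + b]
            elif op == "sw":    # sw a, b(c)
                mem[reg[c] + b] = reg[a]
            elif op == "addi":  # addi a, b, c
                reg[a] = reg[b] + c
            elif op == "slt":   # slt a, b, c
                reg[a] = int(reg[b] < reg[c])
            elif op == "bne":   # bne a, b, loop
                taken = reg[a] != reg[b]
    return [mem[4 * i] for i in range(len(values))]


ORIGINAL = [
    ("lw", "t0", 0, "s1"),
    ("addi", "t0", "t0", 10),
    ("sw", "t0", 0, "s1"),
    ("addi", "s1", "s1", 4),
    ("slt", "t1", "s1", "s2"),
    ("bne", "t1", "zero", "loop"),
]

SCHEDULED = [
    ("lw", "t0", 0, "s1"),
    ("addi", "s1", "s1", 4),
    ("addi", "t0", "t0", 10),
    ("slt", "t1", "s1", "s2"),
    ("sw", "t0", -4, "s1"),   # offset compensates for the earlier addi
    ("bne", "t1", "zero", "loop"),
]

# Same order as SCHEDULED, but with the store offset left at 0.
WRONG_OFFSET = SCHEDULED[:4] + [("sw", "t0", 0, "s1")] + SCHEDULED[5:]
```

Running `run(ORIGINAL, data)` and `run(SCHEDULED, data)` on the same input should give the same array, while `WRONG_OFFSET` stores each result one element too far.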
9
Software Control Hazard Removal
if ((x % 2) == 1) isodd = 1;
10
Software Control Hazard Removal
if (x == true) y = false; else y = true;
11
Software Control Hazard Removal
if ((x == MON) || (x == TUE) || (x == WED)) { }
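Each of the three fragments on these slides can be rewritten as a calculation instead of a branch. The sketch below only demonstrates that the arithmetic gives the same answers; the actual branch savings apply to compiled code, not to Python itself. It assumes x is a non-negative integer, that booleans are stored as 0 or 1, and that the days are encoded 0 through 6. All function names are hypothetical.

```python
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)  # assumed day encoding
EARLY_WEEK = (1 << MON) | (1 << TUE) | (1 << WED)  # one bit per matching day


def set_isodd(x, isodd=0):
    # replaces: if ((x % 2) == 1) isodd = 1;
    return isodd | (x & 1)


def invert(x):
    # replaces: if (x == true) y = false; else y = true;  (x is 0 or 1)
    return x ^ 1


def is_early_week(x):
    # replaces: (x == MON) || (x == TUE) || (x == WED)
    return (EARLY_WEEK >> x) & 1
```

The bitmask in `is_early_week` trades three compare-and-branch steps for one shift and one AND, at the cost of requiring a small, dense encoding of the days.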
12
Increasing Branch Performance
if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) { }
13
What does it all mean? Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!
14
The moral is….. Calculation is less expensive than …..
15
Superscalars - Parallelism
Ford mass produces cars. We want to “mass produce” instructions.
Increase Depth (assembly line): build many cars at the same time, but each car is in a different stage of assembly.
Increase Width (multiple assembly lines): build many cars at the same time by building many lines, all of which operate simultaneously.
16
“Superpipelining” (deep pipelining: many stages)
Diminishing returns because….
Register delays are __________________________ of clock
Difficult to __________________
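A simple timing model makes the diminishing returns concrete. The sketch below assumes an ideal pipeline (no hazards or stalls) in which the cycle time is the logic delay divided evenly across the stages plus a fixed pipeline-register delay. The 10 ns and 0.5 ns figures are made-up illustration values, not numbers from the slides.

```python
def cycle_time(logic_ns, stages, register_ns):
    """Clock period when logic is split evenly across `stages` stages."""
    return logic_ns / stages + register_ns


def speedup(logic_ns, stages, register_ns):
    """Throughput gain over a single-stage design with the same logic."""
    return cycle_time(logic_ns, 1, register_ns) / cycle_time(logic_ns, stages, register_ns)


for n in (5, 10, 20, 40):
    print(n, round(speedup(10.0, n, 0.5), 2))
```

Under these assumptions the speedup can never exceed (logic + register) / register, because the register delay does not shrink as stages are added.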
17
SuperScalars
__________ parts of pipeline
Multiple instructions in _______ stage at once
18
SuperScalars
Which instructions can execute in parallel?
Fetching multiple instructions per cycle
19
Static Scheduling: VLIW or EPIC (Itanium)
__________ schedules the instructions
If one instruction stalls, all following instructions stall
Book Example: SuperScalar MIPS
Two instructions per cycle: one ALU/branch and one ld/st each cycle
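The pairing rule for the two-issue machine can be written as a small check. This is a sketch under simplifying assumptions: one ALU/branch slot and one ld/st slot, and the second instruction of a pair may not read a register that the first one writes in the same cycle. Instructions are encoded as hypothetical `(op, dest, sources)` tuples.

```python
MEM_OPS = {"lw", "sw"}


def unit(instr):
    """Which issue slot an instruction needs."""
    return "ld/st" if instr[0] in MEM_OPS else "alu/branch"


def can_issue_together(first, second):
    """True if `first` and `second` (in program order) fit in one cycle."""
    if unit(first) == unit(second):
        return False                  # both need the same slot
    dest = first[1]
    return dest is None or dest not in second[2]  # no RAW inside the pair


LW = ("lw", "$t0", ["$s1"])
ADDU = ("addu", "$t0", ["$t0", "$s2"])
SW = ("sw", None, ["$t0", "$s1"])
ADDI = ("addi", "$s1", ["$s1"])
BNE = ("bne", None, ["$s1", "$zero"])
```

For example, `can_issue_together(BNE, SW)` holds because the two use different slots and bne writes no register, while `can_issue_together(LW, ADDU)` fails on the RAW dependency through $t0.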
20
Schedule for SS MIPS
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
PC    ALU/branch    ld/st
0
8
16
24
32
21
SuperScalars - Static
[Pipeline diagram: stages Fetch, Decode, Execute, Memory, Write Back; register file with Read Values and Write Values; instructions lw, addu, sw, addi, bne]
22
Loop Problem
Problem:
–Too many _______________ in loop
–Not enough ______________ to fill in holes
Solution:
–Do ______________ at once
–More instructions
–Only one branch
23
Loop Unrolling
1. Unroll Loop
Original:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop
Unrolled:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop
24
Loop Unrolling
2. Rename Registers
But wait!!! How has this helped? There are tons of dependencies? Whatever are we to do? Register Renaming!!!
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t1, 0($s1)
      addi $s1, $s1, -4
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
25
Loop Unrolling
2. Rename Registers
(Repeated slide for your reference)
After renaming:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t1, 0($s1)
      addi $s1, $s1, -4
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
Before renaming:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop
26
Loop Unrolling
3. Reduce Instructions
With offsets filled in:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw   $t0, 8($s1)
      lw   $t1, 4($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
Exercise version:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, ___($s1)
      lw   $t1, ___($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
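The merged pointer update and its adjusted offsets can be checked by modeling both loops directly. In this sketch, memory is a dict keyed by byte address, $s1 counts down to 0, and the trip count is assumed to be even so the two-way unrolled loop terminates exactly. The function names are hypothetical.

```python
def original_loop(mem, s1, s2):
    while True:
        t0 = mem[s1]         # lw   $t0, 0($s1)
        s1 -= 4              # addi $s1, $s1, -4
        t0 += s2             # addu $t0, $t0, $s2
        mem[s1 + 4] = t0     # sw   $t0, 4($s1)
        if s1 == 0:          # bne  $s1, $zero, Loop
            return mem


def unrolled_loop(mem, s1, s2):
    while True:
        t0 = mem[s1]         # lw   $t0, 0($s1)
        s1 -= 8              # addi $s1, $s1, -8
        t0 += s2             # addu $t0, $t0, $s2
        mem[s1 + 8] = t0     # sw   $t0, 8($s1)
        t1 = mem[s1 + 4]     # lw   $t1, 4($s1)
        t1 += s2             # addu $t1, $t1, $s2
        mem[s1 + 4] = t1     # sw   $t1, 4($s1)
        if s1 == 0:          # bne  $s1, $zero, Loop
            return mem


a = original_loop({4: 1, 8: 2, 12: 3, 16: 4}, 16, 10)
b = unrolled_loop({4: 1, 8: 2, 12: 3, 16: 4}, 16, 10)
print(a == b)
```

The offsets 8 and 4 follow from the single `addi -8`: after it, the first element sits 8 bytes above $s1 and the second 4 bytes above.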
27
Loop Unrolling
4. Schedule
Loop: lw1   $t0, 0($s1)
      addi  $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1   $t0, 8($s1)
      lw2   $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2   $t1, 4($s1)
      bne   $s1, $zero, Loop
ALU/branch    lw/sw
              lw1
28
Performance Comparison
Original:
ALU/branch               ld/st
                         lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne  $s1, $zero, L       sw $t0, 4($s1)
Unrolled:
29
Static Scheduling Summary
Code size ______________ (because of nops)
It can not resolve __________ dependencies
If one instruction stalls, ___________________
30
Dynamic Scheduling
_________ schedules ready instructions
Only ___________ instructions stall
_______________ resolved in hardware
31
4-wide Dynamic Superscalar: Fetch
Fetch 4 instructions each cycle
Loop: lw   r2, 0(r1)
      addu r2, r2, r5
      sw   r2, 0(r1)
      addi r1, r1, -4
      bne  r1, r7, Loop
[Diagram: Fetch, Register Alias Table, Instruction Window, Register File, functional units (Ld/St and three Add units), Ld/St Queue, Commit Buffer. Renamed entries shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop]
32
4-wide Dynamic Superscalar: Decode
Register Alias Table records
1. Current Register Number (WAW/WAR: Register Renaming) or
[Diagram: same datapath and loop as on the Fetch slide]
33
4-wide Dynamic Superscalar: Decode
Register Alias Table records
1. Current Register Number (WAW/WAR: Register Renaming) or
2. Functional Unit (RAW: result not ready)
[Diagram: same datapath and loop as on the Fetch slide]
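The renaming half of the Register Alias Table can be sketched in a few lines. This is a simplified model that assumes an unlimited supply of physical registers (named p0, p1, ...) and ignores the "functional unit" entries used for results that are not ready yet. The tuple encoding `(op, dest, sources)` and the function `rename` are hypothetical.

```python
def rename(program):
    """Give every register write a fresh physical name."""
    rat = {}          # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources read the newest mapping (or the original name if never written).
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"p{next_phys}"
            next_phys += 1
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two copies of the loop body, both using t0 (as in the unrolled loop before renaming).
BODY = [
    ("lw", "t0", ["s1"]),
    ("addi", "s1", ["s1"]),
    ("addu", "t0", ["t0", "s2"]),
    ("sw", None, ["t0", "s1"]),
]
RENAMED = rename(BODY + BODY)
for instr in RENAMED:
    print(instr)
```

Because every write gets its own name, the second `lw` no longer conflicts with the first copy's uses of t0: the WAW and WAR dependencies disappear and only the true RAW chains remain.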
34
4-wide Dynamic Superscalar: Execute
Wait until your inputs are ready
[Diagram: same datapath and loop as on the Fetch slide]
35
4-wide Dynamic Superscalar: Execute
Execute once they are ready
[Diagram: same datapath and loop as on the Fetch slide]
36
4-wide Dynamic Superscalar: Memory
First calculate the address
[Diagram: same datapath and loop as on the Fetch slide]
37
4-wide Dynamic Superscalar: Memory
Ld/St Queue checks memory addresses (out-of-order lw/sw)
[Diagram: same datapath and loop as on the Fetch slide]
38
4-wide Dynamic Superscalar: Commit
Instructions wait until all previous instructions have completed
KEY: Waiting for value / Reading value
[Diagram: same datapath and loop as on the Fetch slide]
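In-order commit can be modeled as a running maximum over completion times. The sketch below assumes that an instruction may commit in the same cycle it finishes and that any number of instructions may commit per cycle; a real commit buffer has a width limit. The finish times are invented for illustration.

```python
def commit_cycles(finish_cycles):
    """finish_cycles[i] is when instruction i (in program order) finishes executing."""
    commits = []
    latest = 0
    for finish in finish_cycles:
        latest = max(latest, finish)  # cannot commit before any earlier instruction
        commits.append(latest)
    return commits


print(commit_cycles([5, 3, 4, 7, 6]))
```

The second and third instructions finish early but still commit at cycle 5, behind the slow first instruction.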
39
Fallacies & Pitfalls
Pipelining is easy
–______________ is difficult
Instruction set has no impact on pipelining
–Complicated _____________ & _____________________ instructions complicate pipelining immensely
40
Technology Influences
Pipelining ideas are good ideas regardless of technology
–Only recently, with extra chip space, has ___________________ become better than ____________________
–Now, pipelining limited by ________
41
Exceptions – Unexpected Events
Internal    External
42
Definitions
a. Anything unexpected happens
b. External event occurs
c. Internal event occurs
d. Change in control flow
            Exception    Interrupt
Power PC
Intel
MIPS
43
Exception-Handling
Stop
Transfer control to OS
Tell OS what happened
Begin executing where we left off
44
1. Detect Exception
Add control lines to detect errors
45
Step 2: Store PC into EPC
[Datapath diagram: PC, +4 adder, Instruction Memory, Register File, Sign Ext (16 to 32 bits), shift left 2, Data Memory]
46
Step 3: Tell OS the problem
Store error code in the _________
Use vectored interrupts
–Use error code to determine _________
47
Cause Register
Set a flag in the cause register
How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
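Testing a single flag comes down to masking. The sketch below assumes bits are numbered from 0 at the least significant end, so "bit 5" has mask value 0x20. The names are hypothetical.

```python
OVERFLOW_BIT = 5  # bit position from the slide, counted from bit 0 (assumption)


def overflow_occurred(cause):
    """True if the overflow flag is set in the cause register value."""
    return bool(cause & (1 << OVERFLOW_BIT))


print(overflow_occurred(0b100000), overflow_occurred(0b011111))
```

In MIPS assembly the same test would typically be an `andi` with the mask followed by a branch on the result.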
48
Vectored Interrupts
The address of trap handler is determined by cause
Exception type           Exception vector address (in hex)
Undefined Instruction    C0 00 00 00 hex
Arithmetic Overflow      C0 00 00 20 hex
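The vector table is just a lookup from exception type to handler address. The sketch below uses the two addresses from the slide; the dictionary keys and the function name are hypothetical.

```python
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow": 0xC0000020,
}


def handler_address(exception_type):
    """Address the PC is set to when this exception is taken."""
    return VECTOR_TABLE[exception_type]


spacing = handler_address("arithmetic_overflow") - handler_address("undefined_instruction")
print(hex(handler_address("arithmetic_overflow")), spacing)
```

The two entries are 0x20 bytes apart, which leaves room for eight 4-byte instructions per handler stub before the next vector begins.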
49
Cause Register – Go to OS
[Datapath diagram: the Step 2 datapath extended with EPC (fed by PC - 4), Cause, and Handler PC]
50
Vectored Interrupt – Go to OS
[Datapath diagram: the Step 2 datapath extended with EPC (fed by PC - 4), Cause, and Vector Table]
51
Steps for Exceptions
Detect exception
Place processor in state before offending instruction
Record exception type
Record instruction’s PC in EPC
Transfer control to OS
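The recording and transfer steps can be sketched as a state update. This model assumes the PC has already been incremented past the offending instruction (hence the -4, as in the datapath slides), uses the cause-register approach with a single handler, and picks 0x80000180 as the handler address purely for illustration. Restoring the pre-instruction state is modeled only by leaving the general registers untouched. All names are hypothetical.

```python
HANDLER_PC = 0x80000180      # assumed handler address, for illustration only
OVERFLOW_FLAG = 1 << 5       # flag bit from the Cause Register slide


def take_exception(cpu, cause_flag):
    """Return the CPU state after an exception is taken."""
    new = dict(cpu)                    # leave the caller's state untouched
    new["EPC"] = cpu["PC"] - 4         # record the offending instruction's address
    new["Cause"] = cpu["Cause"] | cause_flag   # record the exception type
    new["PC"] = HANDLER_PC             # transfer control to the OS
    return new


before = {"PC": 0x00400010, "EPC": 0, "Cause": 0}
after = take_exception(before, OVERFLOW_FLAG)
print(hex(after["EPC"]), hex(after["PC"]))
```

When the handler finishes, execution can resume from the address saved in EPC.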
52
1. Detection
What happens if the third instruction is undefined?
In what stage is it detected? In what cycle?
Time (cycle) ->     1   2   3   4   5   6   7   8
add $s0, $0, $0     IF  ID  EX  MEM WB
lw  $s1, 0($t0)         IF  ID  EX  MEM WB
undefined                   IF  ID  EX  MEM WB
or  $s3, $s4, $t3               IF  ID  EX  MEM WB
53
Must associate exception with proper instruction
What happens if multiple exceptions happen in the same cycle?
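Both questions can be explored with a little arithmetic on the five-stage pipeline. The sketch below assumes no stalls, so instruction number n (counting from 1) is in stage IF during cycle n. For simultaneous exceptions it applies the usual convention of servicing the earliest instruction in program order, which is an assumption here rather than something stated on the slide. The names are hypothetical.

```python
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def detection_cycle(instr_number, stage):
    """Cycle in which instruction `instr_number` occupies `stage` (no stalls)."""
    return instr_number + STAGE_OFFSET[stage]


def first_to_service(exceptions):
    """Pick the exception of the earliest instruction in program order.

    `exceptions` holds (instr_number, stage, kind) tuples raised in one cycle.
    """
    return min(exceptions, key=lambda e: e[0])


# The undefined third instruction is recognized when it is decoded.
print(detection_cycle(3, "ID"))

# A memory fault in instruction 2 and an undefined instruction 4 collide in cycle 5.
same_cycle = [(4, "ID", "undefined"), (2, "MEM", "bad address")]
print(first_to_service(same_cycle))
```

Servicing the older instruction first keeps the exception associated with the right point in program order, since the younger instruction would never have executed had the older one trapped first.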
54
2. Preserve state before instruction
What? What does that mean?!?
Time (cycle) ->     1   2   3   4   5   6   7   8
add $s0, $0, $0     IF  ID  EX  MEM
lw  $s1, 0($t0)         IF  ID  EX
undefined                   IF  ID
or  $s3, $s4, $t3               IF
55
3. Record exception type
Place value in cause register
or
Use vectored interrupts
–(exception routine address dependent on exception type)
56
4. Record PC in EPC
Machine in detection cycle
[Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; the undefined instruction is flagged (Undef) while add, lw, and or occupy the other stages]
57
4. Record PC in EPC
Machine before transfer
Where is the proper PC? Long gone!!!
[Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; the undefined instruction is flagged (Undef)]
58
4. Record PC in EPC
Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)
Precise exceptions
Imprecise exceptions
59
5. Transfer control to OS
Same as before