Advanced Pipelining
Optimally Scheduling Code
Optimally Programming Code
Scheduling for Superscalars (6.9)
Exceptions (5.6, 6.8)



Optimally schedule code

    for (i = 0; i < N; i++)
        A[i] = A[i] + 10;

&(A[0]) in $s1, &(A[N]) in $s2

          slt  $t1, $s3, $s0
          beq  $t1, $0, end
    loop: lw   $t0, 0($s1)
          addi $t0, $t0, 10
          sw   $t0, 0($s1)
          addi $s1, $s1, 4
          slt  $t1, $s1, $s2
          bne  $t1, $0, loop

1. Identify Dependencies

    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

$t0 – lw -> addi – RAW
$t0 – addi -> sw – RAW
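The dependency check in this step can be sketched in C. This is only an illustration, not part of the original slides: the `Instr` struct, the `classify` function, and the `DEP_*` flags are hypothetical names, registers are given by their MIPS numbers ($t0 = 8, $s1 = 17), and only register dependencies are considered (memory dependencies are ignored).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction record: register numbers, -1 means "unused". */
typedef struct {
    int dest;    /* register written, or -1 if none (e.g. sw, bne) */
    int src[2];  /* registers read, or -1 */
} Instr;

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Does instruction i read register reg? */
static int reads(const Instr *i, int reg) {
    return reg >= 0 && (i->src[0] == reg || i->src[1] == reg);
}

/* Bitmask of register dependencies from `earlier` to `later`. */
int classify(const Instr *earlier, const Instr *later) {
    assert(earlier != NULL && later != NULL);
    int deps = 0;
    if (reads(later, earlier->dest))                       /* read after write  */
        deps |= DEP_RAW;
    if (reads(earlier, later->dest))                       /* write after read  */
        deps |= DEP_WAR;
    if (earlier->dest >= 0 && earlier->dest == later->dest) /* write after write */
        deps |= DEP_WAW;
    return deps;
}
```

Applied to the loop body, this sketch reports RAW (and WAW, since both write $t0) for lw -> addi, RAW for addi -> sw, and WAR on $s1 for sw -> addi $s1.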

2. Draw timing diagram WITH DATA FORWARDING

    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

Stages: F D X M W

3. Remove WAR/WAW dependencies

    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

Dependency types: RAW, WAR, WAW
[Timing diagram: lw, addi, sw, addi, slt, bne through stages F D X M W]
Target the false dependencies

3. Remove WAR/WAW dependencies

    Original            Incorrect           Correct
    lw   $t0, 0($s1)    lw   $t0, 0($s1)    lw   $t0, 0($s1)
    sw   $t0, 0($s1)    addi $s1, $s1, 4    addi
    addi $s1, $s1, 4    sw   $t0, 0($s1)    sw

Reordered:

    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    addi $t0, $t0, 10
    sw   $t0, ____($s1)
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

Original:

    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

3. Remove WAR/WAW dependencies

    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    addi $t0, $t0, 10
    slt  $t1, $s1, $s2
    sw   $t0, -4($s1)
    bne  $t1, $0, loop

[Timing diagram: lw, addi, sw, addi, slt, bne through stages F D X M W]

Software Control Hazard Removal

    if ((x % 2) == 1)
        isodd = 1;
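One way this branch could be replaced by a calculation is sketched below. The function names are hypothetical, and the sketch assumes `x` is unsigned and `isodd` is a 0/1 flag; an optimizing compiler may already produce branch-free code for either form.

```c
#include <assert.h>

/* Branching form, as on the slide. */
unsigned isodd_branch(unsigned x, unsigned isodd) {
    if ((x % 2) == 1)
        isodd = 1;
    return isodd;
}

/* Calculated form: the low bit of x is 1 exactly when x is odd,
   so OR-ing it into the flag sets the flag without a branch. */
unsigned isodd_calc(unsigned x, unsigned isodd) {
    assert(isodd <= 1);  /* assumption: isodd is a 0/1 flag */
    return isodd | (x & 1u);
}
```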

Software Control Hazard Removal

    if (x == true)
        y = false;
    else
        y = true;
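A possible branch-free rewrite of this if/else, assuming `x` and `y` are C `bool` values (the function names are hypothetical):

```c
#include <stdbool.h>

/* Branching form, as on the slide. */
bool invert_branch(bool x) {
    bool y;
    if (x == true)
        y = false;
    else
        y = true;
    return y;
}

/* Calculated form: logical negation gives the same result with no branch. */
bool invert_calc(bool x) {
    return !x;
}
```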

Software Control Hazard Removal

    if ((x == MON) || (x == TUE) || (x == WED)) { }
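If the day constants happen to be consecutive integers, the three comparisons can be folded into one range check. The sketch below assumes a hypothetical `enum Day` numbering; the slides do not say how MON, TUE, and WED are defined.

```c
/* Hypothetical numbering: the technique needs MON, TUE, WED to be consecutive. */
enum Day { SUN, MON, TUE, WED, THU, FRI, SAT };

/* Short-circuit form from the slide: up to three compare-and-branch steps. */
int early_week_branches(enum Day x) {
    return (x == MON) || (x == TUE) || (x == WED);
}

/* Range form: one subtraction and one unsigned comparison.
   Values below MON wrap around to a large unsigned number and fail the test. */
int early_week_range(enum Day x) {
    return (unsigned)x - MON <= (unsigned)(WED - MON);
}
```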

Increasing Branch Performance

    if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) { }

What does it all mean? Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

The moral is….. Calculation is less expensive than …..

Superscalars - Parallelism
Ford mass produces cars. We want to “mass produce” instructions.
– Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.
– Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

“Superpipelining” (deep pipelining – many stages)
Limiting returns because….
– Register delays are __________________________ of clock
– Difficult to __________________

SuperScalars
– __________ parts of pipeline
– Multiple instructions in _______ stage at once

SuperScalars
– Which instructions can execute in parallel?
– Fetching multiple instructions per cycle

Static Scheduling – VLIW or EPIC (Itanium)
– __________ schedules the instructions
– If one instruction stalls, all following instructions stall
Book Example: SuperScalar MIPS
– Two instructions / cycle
– One ALU/branch, one ld/st each cycle

Schedule for SS MIPS

    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop

    PC    ALU/branch    ld/st

SuperScalars - Static
[Pipeline diagram: stages Fetch, Decode (Read Values), Execute, Memory, WriteBack (Write Values); instructions lw, addu, sw, addi, bne]

Loop Problem
Problem:
– Too many _______________ in loop
– Not enough ______________ to fill in holes
Solution:
– Do ______________ at once
– More instructions
– Only one branch

Loop Unrolling 1. Unroll Loop

Unrolled:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          bne  $s1, $zero, Loop

Original:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          bne  $s1, $zero, Loop

Loop Unrolling 2. Rename Registers

But wait!!! How has this helped? There are tons of dependencies? Whatever are we to do? Register Renaming!!!

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          lw   $t1, 0($s1)
          addi $s1, $s1, -4
          addu $t1, $t1, $s2
          sw   $t1, 4($s1)
          bne  $s1, $zero, Loop

Loop Unrolling 2. Rename Registers (Repeated slide for your reference)

After renaming:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          lw   $t1, 0($s1)
          addi $s1, $s1, -4
          addu $t1, $t1, $s2
          sw   $t1, 4($s1)
          bne  $s1, $zero, Loop

Before renaming:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          bne  $s1, $zero, Loop

Loop Unrolling 3. Reduce Instructions

With offsets filled in:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -8
          addu $t0, $t0, $s2
          sw   $t0, 8($s1)
          lw   $t1, 4($s1)
          addu $t1, $t1, $s2
          sw   $t1, 4($s1)
          bne  $s1, $zero, Loop

With blanks:

    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, ___($s1)
          lw   $t1, ___($s1)
          addu $t1, $t1, $s2
          sw   $t1, 4($s1)
          bne  $s1, $zero, Loop

Loop Unrolling 4. Schedule

    Loop: lw(1)   $t0, 0($s1)
          addi    $s1, $s1, -8
          addu(1) $t0, $t0, $s2
          sw(1)   $t0, 8($s1)
          lw(2)   $t1, 4($s1)
          addu(2) $t1, $t1, $s2
          sw(2)   $t1, 4($s1)
          bne     $s1, $zero, Loop

    ALU/branch    lw/sw
                  lw(1)
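At the source level, the unrolling steps above correspond roughly to the C sketch below. The function names are hypothetical, the array is walked from the top down as in the assembly, and the unrolled version assumes an even element count so that no cleanup iteration is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration, one branch per element. */
void add_scalar(int *a, size_t n, int s) {
    for (size_t i = n; i > 0; i--)
        a[i - 1] += s;
}

/* Unrolled by two: separate temporaries (t0, t1) play the role of the
   renamed registers, the index is decremented once per pair, and there
   is one branch per two elements. */
void add_scalar_unrolled2(int *a, size_t n, int s) {
    assert(n % 2 == 0);  /* assumption: even count, no leftover element */
    for (size_t i = n; i > 0; i -= 2) {
        int t0 = a[i - 1];
        int t1 = a[i - 2];
        t0 += s;
        t1 += s;
        a[i - 1] = t0;
        a[i - 2] = t1;
    }
}
```

The two loads, two adds, and two stores in the unrolled body are independent of each other, which is what gives the scheduler instructions to fill the empty issue slots.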

Performance Comparison

Original:

    ALU/branch               ld/st
                             lw   $t0, 0($s1)
    addi $s1, $s1, -4
    addu $t0, $t0, $s2
    bne  $s1, $zero, L       sw   $t0, 4($s1)

Unrolled:

    ALU/branch               ld/st

Static Scheduling Summary
– Code size ______________ (because of nops)
– It cannot resolve __________ dependencies
– If one instruction stalls, ___________________

Dynamic Scheduling
– _________ schedules ready instructions
– Only ___________ instructions stall
– _______________ resolved in hardware

4-wide Dynamic Superscalar

[Diagram components: Fetch, Register Alias Table, Register File, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]

Program:

    Loop: lw   r2, 0(r1)
          addu r2, r2, r5
          sw   r2, 0(r1)
          addi r1, r1, -4
          bne  r1, r7, Loop

Instruction forms shown in the diagram (a source operand is replaced by a functional-unit tag where the result is not ready):

    lw   r2, 0(s1)
    addu r2, ldst1, r5
    sw   1add1, 0(s1)
    addi r1, r1, -4
    bne  2add1, r7, Loop

Fetch
– Fetch 4 instructions each cycle

Decode
– Register Alias Table records
  1. Current Register Number (WAW/WAR – Register Renaming), or
  2. Functional Unit (RAW – result not ready)

Execute
– Wait until your inputs are ready
– Execute once they are ready

Memory
– First calculate the address
– Ld/St Queue checks memory addresses – out of order lw/sw

Commit
– Instructions wait until all previous instructions have completed
– KEY: Waiting for value / Reading value

Fallacies & Pitfalls
Pipelining is easy
– ______________ is difficult
Instruction set has no impact on pipelining
– Complicated _____________ & _____________________ instructions complicate pipelining immensely

Technology Influences
Pipelining ideas are good ideas regardless of technology
– Only recently, with extra chip space, has ___________________ become better than ____________________
– Now, pipelining limited by ________

Exceptions – Unexpected Events

    Internal    External

Definitions
a. Anything unexpected happens
b. External event occurs
c. Internal event occurs
d. Change in control flow

               Exception    Interrupt
    PowerPC
    Intel
    MIPS

Exception-Handling
– Stop
– Transfer control to OS
– Tell OS what happened
– Begin executing where we left off

1. Detect Exception Add control lines to detect errors

Step 2: Store PC into EPC
[Single-cycle datapath diagram: PC, Instruction Memory, Register File (src1, src2, destreg; fields op/fun, rs, rt, rd, imm), Sign Ext (16 to 32), shift left 2, Data Memory]

Step 3: Tell OS the problem
– Store error code in the _________
– Use vectored interrupts
  – Use error code to determine _________

Cause Register
Set a flag in the cause register.
How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
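One way to answer the question is to shift and mask the cause value. The sketch below uses a hypothetical helper name and treats the cause register as a plain 32-bit value; only the bit position (5) comes from the slide.

```c
#include <stdint.h>

/* Bit position of the overflow flag, taken from the slide's question. */
#define CAUSE_OVERFLOW_BIT 5u

/* Returns 1 if the overflow flag is set in the cause value, else 0. */
int overflow_occurred(uint32_t cause) {
    return (int)((cause >> CAUSE_OVERFLOW_BIT) & 1u);
}
```

Equivalently, the handler can AND the cause value with the mask 0x20 (that is, 1 << 5) and test whether the result is nonzero.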

Vectored Interrupts
The address of trap handler is determined by cause

    Exception type           Exception vector address (in hex)
    Undefined Instruction    C hex
    Arithmetic Overflow      C hex

Cause Register – Go to OS
[Single-cycle datapath diagram: PC, Instruction Memory, Register File, Sign Ext, shift left 2, Data Memory, with added EPC (PC - 4), Cause register, and Handler PC input]

Vectored Interrupt – Go to OS
[Single-cycle datapath diagram: PC, Instruction Memory, Register File, Sign Ext, shift left 2, Data Memory, with added EPC (PC - 4), Cause, and Vector Table]

Steps for Exceptions
1. Detect exception
2. Place processor in state before offending instruction
3. Record exception type
4. Record instruction’s PC in EPC
5. Transfer control to OS
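Steps 3 to 5 can be illustrated with a toy software model. This is only a sketch: the `Cpu` struct, the `take_exception` function, and the `HANDLER_PC` value are assumptions made for illustration, and detection (step 1) and state rollback (step 2) are not modelled.

```c
#include <stdint.h>

/* Assumed handler address for this sketch; the real value is machine-specific. */
#define HANDLER_PC 0x80000180u

/* Toy model of the state involved in taking an exception. */
typedef struct {
    uint32_t pc;     /* program counter */
    uint32_t epc;    /* exception program counter */
    uint32_t cause;  /* cause register */
} Cpu;

/* faulting_pc is the address of the offending instruction, which the
   hardware must have kept track of; cause_code identifies the exception. */
void take_exception(Cpu *cpu, uint32_t faulting_pc, uint32_t cause_code) {
    cpu->cause = cause_code;   /* step 3: record exception type  */
    cpu->epc   = faulting_pc;  /* step 4: record PC in EPC       */
    cpu->pc    = HANDLER_PC;   /* step 5: transfer control to OS */
}
```

With vectored interrupts, the last assignment would instead select the new PC from a table indexed by the exception type.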

1. Detection
What happens if the third instruction is undefined?

    Time->
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3

[Pipeline timing diagram: the four instructions start one cycle apart; stage labels IF, ID, MEM, WB]
In what stage is it detected? In what cycle?

Must associate exception with proper instruction.
What happens if multiple exceptions happen in the same cycle?

    Time->
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3

[Pipeline timing diagram: the four instructions start one cycle apart; stage labels IF, ID, MEM]
Preserve state before instruction
What? What does that mean?!?

3. Record exception type
– Place value in cause register, or
– Use vectored interrupts (exception routine address dependent on exception type)

4. Record PC in EPC
Machine in detection cycle
[Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; instructions in flight: add, lw, Undef, or]

4. Record PC in EPC
Machine just before transfer
[Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; Undef instruction in flight]
Where is the proper PC? Long gone!!!

4. Record PC in EPC
Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)
– Precise exceptions
– Imprecise exceptions

5. Transfer control to OS Same as before