1 CS3410 Guest Lecture A Simple CPU: remaining branch instructions CPU Performance Pipelined CPU Tudor Marian.

Slides:

Advertisements

Similar presentations

Lecture 4: CPU Performance

Advertisements

Adding the Jump Instruction

Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.

331 W08.1Spring :332:331 Computer Architecture and Assembly Language Spring 2006 Week 8: Datapath Design [Adapted from Dave Patterson’s UCB CS152.

1 RISC Pipeline Han Wang CS3410, Spring 2010 Computer Science Cornell University See: P&H Chapter 4.6.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University RISC Pipeline See: P&H Chapter 4.6.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Performance See: P&H 1.4.

© Kavita Bala, Computer Science, Cornell University Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Pipelining See: P&H Chapter 4.5.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University A Processor See: P&H Chapter ,

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University See P&H Chapter: 1.6,

CS 3410, Spring 2014 Computer Science Cornell University See P&H Chapter: , 1.4, Appendix A.

Pipelined Datapath and Control (Lecture #13) ECE 445 – Computer Organization The slides included herein were taken from the materials accompanying Computer.

CS-447– Computer Architecture Lecture 12 Multiple Cycle Datapath

Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

1 Recap (Pipelining). 2 What is Pipelining? A way of speeding up execution of tasks Key idea : overlap execution of multiple taks.

1 COMP541 Sequencing – III (Sequencing a Computer) Montek Singh April 9, 2007.

Computer ArchitectureFall 2007 © October 3rd, 2007 Majd F. Sakr CS-447– Computer Architecture.

Lec 8: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.

The Processor 2 Andreas Klappenecker CPSC321 Computer Architecture.

Computer ArchitectureFall 2007 © October 31, CS-447– Computer Architecture M,W 10-11:20am Lecture 17 Review.

Lec 9: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.

Appendix A Pipelining: Basic and Intermediate Concepts

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University See P&H Chapter: , 1.6, Appendix B.

Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.

Hakim Weatherspoon CS 3410, Spring 2010 Computer Science Cornell University A Processor See: P&H Chapter ,

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University See P&H Chapter: 1.6,

COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University See P&H Chapter: , 1.6, Appendix B.

Computer Science Education

CPU Performance Pipelined CPU Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapters 1.4 and 4.5.

Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,

Lecture 14: Processors CS 2011 Fall 2014, Dr. Rozier.

Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University CPU Performance Pipelined CPU See P&H Chapters 1.4 and 4.5.

1 COMP541 Multicycle MIPS Montek Singh Apr 4, 2012.

Chapter 4 The Processor CprE 381 Computer Organization and Assembly Level Programming, Fall 2012 Revised from original slides provided by MKP.

COMP541 Multicycle MIPS Montek Singh Apr 8, 2015.

MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapter 4.6.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

1 Pipelining Part I CS What is Pipelining? Like an Automobile Assembly Line for Instructions –Each step does a little job of processing the instruction.

Datapath and Control Unit Design

COMP541 Multicycle MIPS Montek Singh Mar 25, 2010.

Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University MIPS Pipeline See P&H Chapter 4.6.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

Performance of Single-cycle Design

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.

Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University MIPS Pipeline See P&H Chapter 4.6.

Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University A Processor See: P&H Chapter ,

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

CMPE 421 REVIEW: MIDTERM 1. A MODIFIED FIVE-Stage Pipeline PC A Y R MD1 addr inst Inst Memory Imm Ext add rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata.

COM181 Computer Hardware Lecture 6: The MIPs CPU.

Processor Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapter ,

CPU Performance Pipelined CPU Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapters 1.4 and 4.5.

CPU Performance Pipelined CPU

CS161 – Design and Architecture of Computer Systems

Computer Organization

Morgan Kaufmann Publishers

Performance of Single-cycle Design

Morgan Kaufmann Publishers The Processor

ECE232: Hardware Organization and Design

Chapter 4 The Processor Part 2

Serial versus Pipelined Execution

Rocky K. C. Chang 6 November 2017

The Processor Lecture 3.4: Pipelining Datapath and Control

Guest Lecturer TA: Shreyas Chand

Pipelining Appendix A and Chapter 3.

Morgan Kaufmann Publishers The Processor

MIPS Pipeline Hakim Weatherspoon CS 3410, Spring 2013 Computer Science

Presentation transcript:

1 CS3410 Guest Lecture A Simple CPU: remaining branch instructions CPU Performance Pipelined CPU Tudor Marian

2 Memory Layout Examples (big/little endian): # r5 contains 5 (0x ) sb r5, 2(r0) lb r6, 2(r0) sw r5, 8(r0) lb r7, 8(r0) lb r8, 11(r0) 0x x x x x x x x x x x a 0x b... 0xffffffff

3 Control Flow: More Branches oprssubopoffset 6 bits5 bits 16 bits signed offsets almost I-Type opsubopmnemonicdescription 0x10x0BLTZ rs, offsetif R[rs] < 0 then PC = PC+4+ (offset<<2) 0x1 BGEZ rs, offsetif R[rs] ≥ 0 then PC = PC+4+ (offset<<2) 0x60x0BLEZ rs, offsetif R[rs] ≤ 0 then PC = PC+4+ (offset<<2) 0x70x0BGTZ rs, offsetif R[rs] > 0 then PC = PC+4+ (offset<<2) Conditional Jumps (cont.)

4 Absolute Jump tgt +4 || Data Mem addr ext 555 Reg. File PC Prog. Mem ALU inst control imm offset + Could have used ALU for branch cmp =? cmp opsubopmnemonicdescription 0x10x0BLTZ rs, offsetif R[rs] < 0 then PC = PC+4+ (offset<<2) 0x1 BGEZ rs, offsetif R[rs] ≥ 0 then PC = PC+4+ (offset<<2) 0x60x0BLEZ rs, offsetif R[rs] ≤ 0 then PC = PC+4+ (offset<<2) 0x70x0BGTZ rs, offsetif R[rs] > 0 then PC = PC+4+ (offset<<2)

5 Control Flow: Jump and Link opmnemonicdescription 0x3JAL targetr31 = PC+8 (+8 due to branch delay slot) PC = (PC+4) || (target << 2) opimmediate 6 bits 26 bits J-Type Function/procedure calls opmnemonicdescription 0x2J targetPC = (PC+4) || (target << 2)

6 Absolute Jump tgt +4 || Data Mem addr ext 555 Reg. File PC Prog. Mem ALU inst control imm offset + =? cmp Could have used ALU for link add +4 opmnemonicdescription 0x3JAL targetr31 = PC+8 (+8 due to branch delay slot) PC = (PC+4) || (target << 2)

7 Performance See: P&H 1.4

8 Design Goals What to look for in a computer system? Correctness: negotiable? Cost –purchase cost = f(silicon size = gate count, economics) –operating cost = f(energy, cooling) –operating cost >= purchase cost Efficiency –power = f(transistor usage, voltage, wire size, clock rate, …) –heat = f(power) Intel Core i7 Bloomfield: 130 Watts AMD Turion: 35 Watts Intel Core 2 Solo: 5.5 Watts Cortex-A9 Dual 0.4 Watts Performance Other: availability, size, greenness, features, …

9 Performance How to measure performance? GHz (billions of cycles per second) MIPS (millions of instructions per second) MFLOPS (millions of floating point operations per second) benchmarks (SPEC, TPC, …) Metrics latency: how long to finish my program throughput: how much work finished per unit time

10 How Fast? ALU PC Prog. Mem control new pc Reg. File ~ 3 gates All signals are stable 80 gates => clock period of at least 160 ns, max frequency ~6MHz Better: 21 gates => clock period of at least 42 ns, max frequency ~24MHz Assumptions: alu: 32 bit ripple carry + some muxes next PC: 30 bit ripple carry control: minimized for delay (~3 gates) transistors: 2 ns per gate prog,. memory: 16 ns (as much as 8 gates) register file: 2 ns access ignore wires, register setup time Better: alu: 32 bit carry lookahead + some muxes (~ 9 gates) next PC: 30 bit carry lookahead (~ 6 gates) Better Still: next PC: cheapest adder faster than 21 gate delays

11 Adder Performance 32 Bit Adder DesignSpaceTime Ripple Carry≈ 300 gates≈ 64 gate delays 2-Way Carry-Skip≈ 360 gates≈ 35 gate delays 3-Way Carry-Skip≈ 500 gates≈ 22 gate delays 4-Way Carry-Skip≈ 600 gates≈ 18 gate delays 2-Way Look-Ahead≈ 550 gates≈ 16 gate delays Split Look-Ahead ≈ 800 gates≈ 10 gate delays Full Look-Ahead≈ 1200 gates≈ 5 gate delays

12 Optimization: Summary Critical Path Longest path from a register output to a register input Determines minimum cycle, maximum clock frequency Strategy 1 (we just employed) Optimize for delay on the critical path Optimize for size / power / simplicity elsewhere –next PC

13 Processor Clock Cycle alu PC imm memory d in d out addr target offset cmp control =? extend new pc register file opmnemonicdescription 0x20LB rd, offset(rs)R[rd] = sign_ext(Mem[offset+R[rs]]) 0x23LW rd, offset(rs)R[rd] = Mem[offset+R[rs]] 0x28SB rd, offset(rs)Mem[offset+R[rs]] = R[rd] 0x2bSW rd, offset(rs)Mem[offset+R[rs]] = R[rd]

14 Processor Clock Cycle alu PC imm memory d in d out addr target offset cmp control =? extend new pc register file opfuncmnemonicdescription 0x00x08JR rsPC = R[rs] opmnemonicdescription 0x2J targetPC = (PC+4) || (target << 2)

15 Multi-Cycle Instructions Strategy 2 Multiple cycles to complete a single instruction E.g: Assume: load/store: 100 ns arithmetic: 50 ns branches: 33 ns Multi-Cycle CPU 30 MHz (33 ns cycle) with –3 cycles per load/store –2 cycles per arithmetic –1 cycle per branch Faster than Single-Cycle CPU? 10 MHz (100 ns cycle) with –1 cycle per instruction

16 CPI Instruction mix for some program P, assume: 25% load/store ( 3 cycles / instruction) 60% arithmetic ( 2 cycles / instruction) 15% branches ( 1 cycle / instruction) Multi-Cycle performance for program P: 3 * * *.15 = 2.1 average cycles per instruction (CPI) = MHz 10 MHz 15 MHz 800 MHz PIII “faster” than 1 GHz P4

17 Example Goal: Make 30 MHz CPU (15MIPS) run 2x faster by making arithmetic instructions faster Instruction mix (for P): 25% load/store, CPI = 3 60% arithmetic, CPI = 2 15% branches, CPI = 1

18 Amdahl’s Law Execution time after improvement = Or: Speedup is limited by popularity of improved feature Corollary: Make the common case fast Caveat: Law of diminishing returns execution time affected by improvement amount of improvement + execution time unaffected

19 Pipelining See: P&H Chapter 4.5

20 The Kids Alice Bob They don’t always get along…

21 The Bicycle

22 The Materials Saw Drill Glue Paint

23 The Instructions N pieces, each built following same sequence: SawDrillGluePaint

24 Design 1: Sequential Schedule Alice owns the room Bob can enter when Alice is finished Repeat for remaining tasks No possibility for conflicts

25 Elapsed Time for Alice: 4 Elapsed Time for Bob: 4 Total elapsed time: 4*N Can we do better? Sequential Performance time … Latency: Throughput: Concurrency:

26 Design 2: Pipelined Design Partition room into stages of a pipeline One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep AliceBobCarolDave

27 Pipelined Performance time … Latency: Throughput: Concurrency:

28 Unequal Pipeline Stages 0h1h2h3h… 15 min 30 min 45 min 90 min Latency: Throughput: Concurrency:

29 Poorly-balanced Pipeline Stages 0h1h2h3h… 15 min 30 min 45 min 90 min Latency: Throughput: Concurrency:

30 Re-Balanced Pipeline Stages 0h1h2h3h… 15 min 30 min 45 min 90 min Latency: Throughput: Concurrency:

31 Splitting Pipeline Stages 0h1h2h3h… 15 min 30 min 45 min min Latency: Throughput: Concurrency:

32 Pipeline Hazards 0h1h2h3h… Q: What if glue step of task 3 depends on output of task 1? Latency: Throughput: Concurrency:

33 Lessons Principle: Throughput increased by parallel execution Pipelining: Identify pipeline stages Isolate stages from each other Resolve pipeline hazards

34 A Processor alu PC imm memory d in d out addr target offset cmp control =? new pc register file inst extend +4

35 Write- Back Memory Instruction Fetch Execute Instruction Decode register file control A Processor alu imm memory d in d out addr inst PC memory compute jump/branch targets new pc +4 extend

36 Basic Pipeline Five stage “RISC” load-store architecture 1.Instruction fetch (IF) –get instruction from memory, increment PC 2.Instruction Decode (ID) –translate opcode into control signals and read registers 3.Execute (EX) –perform ALU operation, compute jump/branch targets 4.Memory (MEM) –access memory if needed 5.Writeback (WB) –update register file

37 Pipelined Implementation Break instructions across multiple clock cycles (five, in this case) Design a separate stage for the execution performed during each clock cycle Add pipeline registers (flip-flops) to isolate signals between different stages Slides thanks to Kevin Walsh, Sally McKee, and Kavita Bala