POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Pipelining Ver. Jan 14, 2014 Marco D. Santambrogio:

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Pipelining Ver. Jan 14, 2014 Marco D. Santambrogio: Simone Campanoni:

Outline
Processors and Instruction Sets
Review of pipelining
MIPS
Reduced Instruction Set of MIPS Processor
Implementation of MIPS Processor Pipeline

Main Characteristics of MIPS Architecture
RISC (Reduced Instruction Set Computer) architecture: based on the concept of executing only simple instructions in a reduced basic cycle, to improve performance relative to CISC CPUs.
LOAD/STORE architecture: ALU operands come from the CPU general-purpose registers and cannot come directly from memory. Dedicated instructions are necessary to load data from memory to registers and to store data from registers to memory.
Pipeline architecture: a performance optimization technique based on overlapping the execution of multiple instructions derived from a sequential execution flow.

A Typical RISC ISA
32-bit fixed-format instructions (3 formats)
32 32-bit GPRs (R0 contains zero, DP take pair)
3-address, reg-reg arithmetic instructions
Single address mode for load/store: base + displacement, no indirection
Simple branch conditions
Delayed branch
Examples: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Approaching an ISA
Instruction Set Architecture: defines the set of operations, instruction format, hardware-supported data types, named storage, addressing modes, and sequencing.
The meaning of each instruction is described by RTL on architected registers and memory.
Given technology constraints, assemble an adequate datapath: architected storage mapped to actual storage; function units (FUs) to do all the required operations; possible additional storage (e.g. MAR, MBR, …); interconnect to move information among registers and FUs.
Map each instruction to a sequence of RTLs.
Collate the sequences into a symbolic controller state transition diagram (STD).
Implement the controller.

Example: MIPS instruction formats (figure)
Register-Register: Op, Rs1, Rs2, Rd, Opx
Register-Immediate: Op, Rs1, Rd, immediate
Branch: Op, Rs1, Rs2/Opx, immediate
Jump / Call: Op, target

Datapath vs Control
Datapath: storage, FUs, and interconnect sufficient to perform the desired functions. Its inputs are control points; its outputs are signals.
Controller: a state machine that orchestrates operation on the datapath, based on the desired function and the signals.
(Figure: the controller drives the datapath through control points, and the datapath returns signals to the controller.)

Datapath vs Control (figure)
A CPU containing the control unit (PC, PSW, IR) and the data path (registers, ALU), connected to a memory holding instructions and data through the data bus, the address bus, and the control bus.

Starting scenario (figure)
The CPU contains the control unit (PC, IR, PSW) and the data path (registers R00 to R05, ALU). It is connected to memory, which holds instructions and data, through the data bus, the address bus, and the control bus. The PC holds 0789, and memory holds the following program starting at address 0789:
0789 load R02, …
0790 load R03, …
0791 add R01,R02,R…
0792 load R02, …
0793 add R01,R01,R…
0794 store R01,4000

Execution trace (figures)
The next slides step through the program, alternating a Read Instruction step, in which the instruction addressed by the PC is read from memory into the IR, and an Exe Instruction step, in which the PC already points to the next instruction while the current one executes:
PC = 0789, Read Instruction: load R02 is read from memory.
PC = 0790, Exe Instruction 0789: load R02 reads its operand from memory.
PC = 0790, Read Instruction: load R03 is read from memory.
PC = 0791, Exe Instruction 0790: load R03 reads its operand from memory; the value 1492 is shown.
PC = 0791, Read Instruction: add R01,R02,R… is read from memory; the value 1918 is shown.
PC = 0792, Exe Instruction 0791: the ALU performs the add; the value 3410 is shown.
PC = 0792, Read Instruction: load R02 is read from memory.
PC = 0793, Exe Instruction 0792: load R02 reads its operand from memory.
PC = 0793, Read Instruction: add R01,R01,R… is read from memory.
PC = 0794, Exe Instruction 0793: the ALU performs the add.
PC = 0794, Read Instruction: store R01 is read from memory.
Exe Instruction 0794: store R01 writes to memory.
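The read/execute alternation traced in these slides can be sketched as a small fetch-execute loop. This is a minimal Python sketch, not material from the slides: the data addresses 2000 to 2002, the value 100, and the third operand of each add are assumptions (those operands are cut off in the transcript), while 1492 and 1918 are the values visible in the figures.

```python
def run(program, data, pc):
    """Alternate 'read instruction' and 'exe instruction' steps until the program ends."""
    regs = {f"R{i:02d}": 0 for i in range(6)}  # R00..R05, as in the figures
    trace = []
    while pc in program:
        ir = program[pc]        # read instruction: memory[PC] -> IR
        trace.append((pc, ir))
        pc += 1                 # during execute, the PC already points to the next instruction
        op, args = ir[0], ir[1:]
        if op == "load":        # load Rd, addr : Rd <- memory[addr]
            regs[args[0]] = data[args[1]]
        elif op == "add":       # add Rd, Ra, Rb : Rd <- Ra + Rb
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":     # store Rs, addr : memory[addr] <- Rs
            data[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, data, trace


# Hypothetical operands: addresses 2000-2002, the value 100, and the last
# register of each add are assumptions made for this illustration.
program = {
    789: ("load", "R02", 2000),
    790: ("load", "R03", 2001),
    791: ("add", "R01", "R02", "R03"),
    792: ("load", "R02", 2002),
    793: ("add", "R01", "R01", "R02"),
    794: ("store", "R01", 4000),
}
data = {2000: 1492, 2001: 1918, 2002: 100}
regs, data, trace = run(program, data, pc=789)
print(regs["R01"], data[4000])
```

With these assumed operands, the first add produces 1492 + 1918 = 3410, which matches the value shown in the figures.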

Reduced Instruction Set of MIPS Processor
ALU instructions:
    add $s1, $s2, $s3      # $s1 <- $s2 + $s3
    addi $s1, $s1, 4       # $s1 <- $s1 + 4
Load/store instructions:
    lw $s1, offset($s2)    # $s1 <- M[$s2+offset]
    sw $s1, offset($s2)    # M[$s2+offset] <- $s1
Branch instructions to control the control flow of the program:
Conditional branches: the branch is taken only if the condition is satisfied. Examples: beq (branch on equal) and bne (branch on not equal)
    beq $s1, $s2, L1       # go to L1 if ($s1 == $s2)
    bne $s1, $s2, L1       # go to L1 if ($s1 != $s2)
Unconditional jumps: the branch is always taken. Examples: j (jump) and jr (jump register)
    j L1                   # go to L1
    jr $s1                 # go to the address contained in $s1
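The semantics listed above can be sketched as a tiny interpreter. This is an illustrative Python sketch under simplifying assumptions that are not in the slides: instructions are tuples, the program counter is an index into the instruction list (so `jr` jumps to a list index, not a byte address), labels are resolved through a dictionary, and delayed branches are not modeled. The function name `execute` and the sample program are hypothetical.

```python
def execute(program, labels, regs, mem, max_steps=1000):
    """Interpret a list of (op, *operands) tuples; pc is an index into the list."""
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *a = program[pc]
        pc += 1
        if op == "add":        # add rd, rs, rt : rd <- rs + rt
            regs[a[0]] = regs[a[1]] + regs[a[2]]
        elif op == "addi":     # addi rt, rs, imm : rt <- rs + imm
            regs[a[0]] = regs[a[1]] + a[2]
        elif op == "lw":       # lw rt, offset(rs) : rt <- M[rs + offset]
            regs[a[0]] = mem[regs[a[2]] + a[1]]
        elif op == "sw":       # sw rt, offset(rs) : M[rs + offset] <- rt
            mem[regs[a[2]] + a[1]] = regs[a[0]]
        elif op == "beq":      # branch taken only if the registers are equal
            if regs[a[0]] == regs[a[1]]:
                pc = labels[a[2]]
        elif op == "bne":      # branch taken only if the registers differ
            if regs[a[0]] != regs[a[1]]:
                pc = labels[a[2]]
        elif op == "j":        # unconditional jump to a label
            pc = labels[a[0]]
        elif op == "jr":       # unconditional jump to the index held in a register
            pc = regs[a[0]]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, mem


# Hypothetical example: sum three words starting at address 100 and store the total.
program = [
    ("lw", "$t0", 0, "$s2"),         # loop: $t0 <- M[$s2]
    ("add", "$s1", "$s1", "$t0"),    #       $s1 <- $s1 + $t0
    ("addi", "$s2", "$s2", 4),       #       advance the pointer by one word
    ("bne", "$s2", "$s3", "loop"),   #       repeat until the end pointer is reached
    ("sw", "$s1", 0, "$s3"),         #       M[$s3] <- $s1
]
labels = {"loop": 0}
regs = {"$s1": 0, "$s2": 100, "$s3": 112, "$t0": 0}
mem = {100: 1, 104: 2, 108: 3}
regs, mem = execute(program, labels, regs, mem)
print(regs["$s1"], mem[112])
```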

Execution of MIPS Instructions
Every instruction in the MIPS subset can be implemented in at most 5 clock cycles as follows:
Instruction Fetch Cycle (IF): send the content of the Program Counter register to the Instruction Memory and fetch the current instruction from the Instruction Memory. Update the PC to the next sequential address by adding 4 to the PC (since each instruction is 4 bytes).
Instruction Decode and Register Read Cycle (ID): decode the current instruction (fixed-field decoding) and read from the Register File the one or two registers specified in the instruction fields. Sign-extend the offset field of the instruction in case it is needed.

Execution of MIPS Instructions
Execution Cycle (EX): the ALU operates on the operands prepared in the previous cycle, depending on the instruction type:
Register-Register ALU instructions: the ALU executes the specified operation on the operands read from the RF.
Register-Immediate ALU instructions: the ALU executes the specified operation on the first operand read from the RF and the sign-extended immediate operand.
Memory reference: the ALU adds the base register and the offset to calculate the effective address.
Conditional branches: compare the two registers read from the RF and compute the possible branch target address by adding the sign-extended offset to the incremented PC.

Execution of MIPS Instructions
Memory Access (ME):
Load instructions require a read access to the Data Memory using the effective address.
Store instructions require a write access to the Data Memory using the effective address, to write the data from the source register read from the RF.
Conditional branches can update the content of the PC with the branch target address, if the conditional test yielded true.
Write-Back Cycle (WB):
Load instructions write the data read from memory into the destination register of the RF.
ALU instructions write the ALU results into the destination register of the RF.
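The five cycles described above do not all perform work for every instruction class. The following Python sketch encodes, as a lookup table, which stages are described as doing something for each class; the table is read off the slide text, and the names `ACTIVE` and `idle_stages` are hypothetical.

```python
STAGES = ("IF", "ID", "EX", "ME", "WB")

# Stages in which each instruction class does work, following the descriptions above:
# every class is fetched and decoded; EX covers ALU ops, address calculation, and
# branch comparison; ME covers loads, stores, and the PC update of taken branches;
# WB covers loads and ALU instructions.
ACTIVE = {
    "alu":    ("IF", "ID", "EX", "WB"),
    "load":   ("IF", "ID", "EX", "ME", "WB"),
    "store":  ("IF", "ID", "EX", "ME"),
    "branch": ("IF", "ID", "EX", "ME"),
}


def idle_stages(kind):
    """Return the stages in which an instruction of this class has nothing to do."""
    return [s for s in STAGES if s not in ACTIVE[kind]]


for kind in ACTIVE:
    print(kind, idle_stages(kind))
```

Only the load uses all five stages, which is consistent with the load defining the critical path in a single-cycle design.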

Execution of MIPS Instructions (figure)

MIPS Data path (figure)
Five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.
Components: instruction memory, register file (RS1, RS2, RD), sign extend unit (Imm), ALU, "Zero?" test, data memory, LMD register, multiplexers, and a "+4" adder producing the next sequential PC.
RTL annotations: IR <= mem[PC]; PC <= PC + 4; Reg[IR_rd] <= Reg[IR_rs] op_IRop Reg[IR_rt]

Instructions Latency (table)
Columns: Instruction Type, Instruct. Mem., Register Read, ALU Op., Data Memory, Write Back, Total Latency (ns).
Rows: ALU Instr., Load, Store, Cond. Branch, Jump.

Single-cycle Implementation of MIPS
The length of the clock cycle is defined by the critical path given by the load instruction: T = 8 ns (f = 125 MHz).
We assume each instruction is executed in a single clock cycle.
Each module can be used only once in a clock cycle, so the modules used more than once in a cycle must be duplicated. In particular, we need an Instruction Memory separate from the Data Memory.
Some modules must be duplicated, while other modules must be shared among different instruction flows.
To share a module between two different instructions, we need a multiplexer that accepts multiple inputs to the module and selects one of them based on the configuration of the control lines.
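The relation between the critical path and the clock can be checked with a short calculation. In this Python sketch the per-unit delays (2, 1, 2, 2, 1 ns) and the list of units each instruction class uses are assumptions chosen so that the load path sums to the 8 ns stated above; they are not values recovered from the latency table.

```python
# Assumed delay of each functional unit in ns (illustrative, not from the slides).
UNIT_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "write_back": 1}

# Assumed units on the path of each instruction class.
PATHS = {
    "alu":    ("imem", "reg_read", "alu", "write_back"),
    "load":   ("imem", "reg_read", "alu", "dmem", "write_back"),
    "store":  ("imem", "reg_read", "alu", "dmem"),
    "branch": ("imem", "reg_read", "alu"),
    "jump":   ("imem",),
}


def latency_ns(kind):
    """Total latency of one instruction class: the sum of the units on its path."""
    return sum(UNIT_NS[unit] for unit in PATHS[kind])


# In a single-cycle design the clock must fit the slowest instruction.
cycle_ns = max(latency_ns(kind) for kind in PATHS)
freq_mhz = 1000 / cycle_ns  # f = 1 / T, with T in ns and f in MHz

print(cycle_ns, freq_mhz)
```

Under these assumptions every instruction, including a jump that only needs the instruction memory, is charged the full 8 ns cycle.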

Implementation of MIPS data path with Control Unit (figure)

Multi-cycle Implementation
The instruction execution is distributed over multiple cycles (5 cycles for MIPS).
The basic cycle is shorter (2 ns, so the instruction latency is 5 x 2 ns = 10 ns).
Implementation of a multi-cycle CPU:
Each phase of the instruction execution requires a clock cycle.
Each module can be used more than once per instruction in different clock cycles: possible sharing of modules.
We need internal registers to store the values to be used in the next clock cycles.

Pipelining
A performance optimization technique based on overlapping the execution of multiple instructions deriving from a sequential execution flow.
Pipelining exploits the parallelism among instructions in a sequential instruction stream.
Basic idea: the execution of an instruction is divided into different phases (pipeline stages), each requiring a fraction of the time necessary to complete the instruction.
The stages are connected one to the next to form the pipeline: instructions enter the pipeline at one end, progress through the stages, and exit from the other end, as in an assembly line.

Pipelining
Advantage: the technique is transparent to the programmer.
The technique is similar to an assembly line: a new car exits from the assembly line in the time necessary to complete one of the phases.
An assembly line does not reduce the time necessary to complete a car, but it increases the number of cars in production at the same time and the rate at which cars are completed.

Sequential vs. Pipelining Execution (figure)

Pipelining
The time to advance an instruction by one stage in the pipeline corresponds to a clock cycle.
The pipeline stages must be synchronized: the duration of a clock cycle is defined by the time required by the slowest stage of the pipeline (i.e. 2 ns).
The goal is to balance the length of each pipeline stage.
If the stages are perfectly balanced, the ideal speedup due to pipelining is equal to the number of pipeline stages.

Performance Improvement
Ideal case (asymptotically): if we consider the single-cycle unpipelined CPU1 with a clock cycle of 8 ns and the pipelined CPU2 with 5 stages of 2 ns:
The latency (total execution time) of each instruction is worsened: from 8 ns to 10 ns.
The throughput (number of instructions completed per unit of time) is improved by a factor of 4: (1 instruction completed every 8 ns) vs. (1 instruction completed every 2 ns).

Performance Improvement
Ideal case (asymptotically): if we consider the multi-cycle unpipelined CPU3 composed of 5 cycles of 2 ns and the pipelined CPU2 with 5 stages of 2 ns:
The latency (total execution time) of each instruction is unchanged (10 ns).
The throughput (number of instructions completed per unit of time) is improved by a factor of 5: (1 instruction completed every 10 ns) vs. (1 instruction completed every 2 ns).
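The word "asymptotically" matters here: the factors of 4 and 5 are reached only for long instruction streams, because the pipeline first has to fill. The Python sketch below compares total execution times for the three CPUs using the cycle times given above, assuming an ideal pipeline with no stalls or hazards; the function names are hypothetical.

```python
def single_cycle_ns(n, cycle_ns=8):
    """CPU1: every instruction takes one 8 ns cycle."""
    return n * cycle_ns


def multi_cycle_ns(n, stages=5, cycle_ns=2):
    """CPU3: every instruction takes 5 cycles of 2 ns, with no overlap."""
    return n * stages * cycle_ns


def pipelined_ns(n, stages=5, cycle_ns=2):
    """CPU2: the first instruction needs `stages` cycles to traverse the pipeline;
    each following instruction completes one cycle later (ideal, no stalls)."""
    return (stages + n - 1) * cycle_ns


for n in (1, 10, 1_000_000):
    p = pipelined_ns(n)
    print(n, single_cycle_ns(n) / p, multi_cycle_ns(n) / p)
```

For a single instruction the pipelined CPU is slower than the single-cycle one (10 ns vs. 8 ns); as the number of instructions grows, the speedups approach 4 and 5.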

Pipeline Execution of MIPS Instructions (figure)

Implementation of MIPS pipeline (figure)

Visualizing Pipelining (figure)
Instructions are listed in program order on the vertical axis and time in clock cycles (Cycle 1 to Cycle 7) runs along the horizontal axis. Each instruction passes through Ifetch, Reg, ALU, DMem, and Reg, starting one cycle after the previous instruction.
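A diagram of this kind can be generated programmatically. The Python sketch below is only an illustration: it assumes an ideal pipeline with no stalls, reuses the stage labels from the figure, and the names `pipeline_diagram` and `render` are hypothetical.

```python
PIPE_STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(n_instr):
    """Return one row per instruction; column c holds the stage occupied in cycle c + 1."""
    n_cycles = n_instr + len(PIPE_STAGES) - 1
    rows = []
    for i in range(n_instr):
        cells = [""] * n_cycles
        for s, name in enumerate(PIPE_STAGES):
            cells[i + s] = name  # instruction i enters the pipeline in cycle i + 1
        rows.append(cells)
    return rows


def render(rows):
    """Format the rows as fixed-width text with a header of cycle numbers."""
    header = "".join(f"{'Cycle ' + str(c + 1):<9}" for c in range(len(rows[0])))
    body = ["".join(f"{cell:<9}" for cell in row) for row in rows]
    return "\n".join([header] + body)


rows = pipeline_diagram(4)
print(render(rows))
```

Four instructions need 4 + 5 - 1 = 8 cycles to drain completely, and from cycle 5 onward one instruction completes in every cycle.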

Note: Optimized Pipeline
The Register File is used in 2 stages: read access during ID and write access during WB.
What happens if a read and a write refer to the same register in the same clock cycle? It is necessary to insert one stall.
Optimized pipeline: the RF write occurs in the first half of the clock cycle and the RF read in the second half. What happens if a read and a write refer to the same register in the same clock cycle? It is not necessary to insert a stall, because the read sees the newly written value.
From now on, this is the pipeline we are going to use.
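The effect of splitting the clock cycle can be sketched with a toy register file. This Python model is an assumption-laden illustration, not the slides' implementation: the class name, the `write_first` flag, and the choice of 32 registers are hypothetical, and register 0 is not hardwired to zero.

```python
class RegisterFile:
    """Toy register file with one write port (WB) and any number of read ports (ID)."""

    def __init__(self, n=32):
        self.regs = [0] * n

    def cycle(self, write=None, reads=(), write_first=True):
        """Simulate one clock cycle.

        write: (register, value) arriving from WB, or None.
        reads: register numbers requested by ID.
        write_first: True models the optimized pipeline (write in the first half
        of the cycle, read in the second half); False models a write that lands
        only after the read.
        """
        if write_first and write is not None:
            self.regs[write[0]] = write[1]
        values = [self.regs[r] for r in reads]
        if not write_first and write is not None:
            self.regs[write[0]] = write[1]
        return values


optimized = RegisterFile().cycle(write=(1, 42), reads=(1,))
stale = RegisterFile().cycle(write=(1, 42), reads=(1,), write_first=False)
print(optimized, stale)
```

In the second case the read returns the old value, which is the situation that requires a stall in the non-optimized pipeline.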

5 Steps of MIPS Datapath (Figure A.3, Page A-9)
Five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.
Components: instruction memory, register file (RS1, RS2, RD), sign extend unit (Imm), ALU, "Zero?" test, data memory, multiplexers, and a "+4" adder producing the next sequential PC.
RTL annotations: IR <= mem[PC]; PC <= PC + 4. A <= Reg[IR_rs]; B <= Reg[IR_rt]. rslt <= A op_IRop B. WB <= rslt. Reg[IR_rd] <= WB.
Data stationary control: local decode for each instruction phase / pipeline stage.
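The RTL annotations of this figure can be turned into a small cycle-by-cycle simulation. The Python sketch below is deliberately limited and rests on several assumptions that are not in the slides: it handles only register-register ALU instructions encoded as `(op, rd, rs, rt)` tuples, the PC advances by 1 list index instead of 4 bytes, and there is no hazard detection or forwarding, so dependent instructions must be at least three instructions apart. The write in WB is performed before the read in ID, as in the optimized pipeline.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub}


def run_pipeline(program, regs):
    """Run (op, rd, rs, rt) instructions through 5 stages; return (regs, cycles)."""
    pc = 0
    if_id = id_ex = ex_mem = mem_wb = None  # pipeline registers; None is a bubble
    cycles = 0
    while pc < len(program) or any(
        latch is not None for latch in (if_id, id_ex, ex_mem, mem_wb)
    ):
        # Stages are evaluated back to front, so each stage consumes the value
        # latched by its predecessor at the end of the previous cycle.
        if mem_wb is not None:            # WB: Reg[IR_rd] <= WB
            rd, wb = mem_wb
            regs[rd] = wb
        mem_wb = ex_mem                   # MEM: WB <= rslt (ALU ops skip data memory)
        if id_ex is not None:             # EX: rslt <= A op B
            op, rd, a, b = id_ex
            ex_mem = (rd, OPS[op](a, b))
        else:
            ex_mem = None
        if if_id is not None:             # ID: A <= Reg[IR_rs]; B <= Reg[IR_rt]
            op, rd, rs, rt = if_id
            id_ex = (op, rd, regs[rs], regs[rt])
        else:
            id_ex = None
        if pc < len(program):             # IF: IR <= mem[PC]; PC advances by 1 here
            if_id = program[pc]
            pc += 1
        else:
            if_id = None
        cycles += 1
    return regs, cycles


regs = [0] * 32
regs[2], regs[3], regs[5], regs[6] = 10, 20, 1, 2
program = [
    ("add", 1, 2, 3),   # r1 <- r2 + r3
    ("add", 4, 5, 6),   # r4 <- r5 + r6
    ("sub", 7, 6, 5),   # r7 <- r6 - r5
    ("add", 8, 1, 1),   # reads r1 in the same cycle in which the first add writes it
]
regs, cycles = run_pipeline(program, regs)
print(regs[1], regs[4], regs[7], regs[8], cycles)
```

The last instruction decodes in cycle 5, the same cycle in which the first instruction writes back, and it still reads the updated value because the write is processed first.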

Questions