Instruction Level Parallelism

Instruction Level Parallelism
- Scalar processors: the model so far
- SuperScalar: multiple execution units working in parallel
- VLIW: multiple instructions read in parallel

Scalar Processors
The time to perform a task: T = Nq * CPI * Ct
- Nq: number of instructions
- CPI: cycles per instruction
- Ct: cycle time
Pipelining gives CPI = 1, with Ct determined by the critical path.
But: floating point operations are slow in software, and even in hardware an FPU takes several cycles.
WHY NOT USE SEVERAL FLOATING POINT UNITS?
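As a quick check on the equation above, here is a minimal Python sketch (the instruction count and clock rate are made-up numbers, purely for illustration):

    # T = Nq * CPI * Ct
    Nq = 1_000_000   # number of instructions (hypothetical)
    CPI = 1.0        # cycles per instruction, ideal pipeline
    Ct = 1e-9        # cycle time in seconds (a 1 GHz clock)

    T = Nq * CPI * Ct
    print(f"T = {T * 1e3:.3f} ms")  # T = 1.000 ms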

SuperScalar Processors
[Pipeline diagram: IF and DE feed an ALU and several parallel functional units PFU 1 ... PFU n, followed by DM and WB. Issue and completion take 1 cycle each; a unit may take several (m) cycles.] Each unit may take several cycles to finish.

Instruction VS Machine Parallelism
Instruction parallelism: the average number of instructions that can be executed in parallel. It depends on:
- "true dependencies"
- branches in relation to other instructions
Machine parallelism: the ability of the hardware to utilize instruction parallelism, i.e. the number of instructions that can be fetched and executed each cycle. It depends on:
- instruction memory bandwidth and the instruction buffer (window)
- available resources
- the ability to spot instruction parallelism
A rough way to estimate instruction parallelism is sketched below.
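To make "instruction parallelism" concrete, here is a minimal Python sketch (not from the slides) that estimates it as the instruction count divided by the length of the longest chain of true (RAW) dependencies; it uses the add/sub sequence from the next slide:

    # Each instruction is (destination register, list of source registers).
    instrs = [
        ("$t0", ["$t1", "$t2"]),  # 1) add  $t0 $t1 $t2
        ("$t0", ["$t0"]),         # 2) addi $t0 $t0 1
        ("$t3", ["$t1", "$t2"]),  # 3) sub  $t3 $t1 $t2
        ("$t3", ["$t3"]),         # 4) subi $t3 $t3 1
    ]

    # depth[i] = length of the longest RAW-dependence chain ending at i
    depth = []
    for i, (dst, srcs) in enumerate(instrs):
        d = 1
        for j in range(i):
            if instrs[j][0] in srcs:   # true (RAW) dependence: j -> i
                d = max(d, depth[j] + 1)
        depth.append(d)

    print(len(instrs) / max(depth))    # 4 / 2 = 2.0 instructions in parallel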

Instruction Lookahead: Example 1
    1) add  $t0 $t1 $t2
    2) addi $t0 $t0 1     dependent on 1)
    3) sub  $t3 $t1 $t2   independent of 1) and 2)
    4) subi $t3 $t3 1     dependent on 3)
With instruction lookahead (or "prefetch") the processor can execute 1) and 3) concurrently, and then 2) and 4).

Issue & Completion
Out-of-order issue (starts "out of order") must respect:
- RAW hazards
- WAR hazards (write after read)
Out-of-order completion (finishes "out of order") must respect:
- WAW hazards (output dependence: an earlier result could overwrite a newer one)
Example, with 2 parallel execution units and a 4-stage pipeline:
    1) add  $t0 $t1 $t2
    2) addi $t0 $t0 1
    3) sub  $t3 $t1 $t2
    4) subi $t3 $t3 1
Issue order: 1) and 3) in parallel, then 2) and 4); four stages later they complete in the same pairs.
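The three hazard classes can be spelled out in a small Python sketch (not from the slides) that classifies the dependence between two instructions in program order:

    def hazards(first, second):
        """Hazards when `second` follows `first`; instructions are (dst, srcs)."""
        dst1, srcs1 = first
        dst2, srcs2 = second
        found = []
        if dst1 in srcs2:
            found.append("RAW")  # second reads first's result (true dependence)
        if dst2 in srcs1:
            found.append("WAR")  # second overwrites a source of first
        if dst1 == dst2:
            found.append("WAW")  # both write the same register
        return found

    # 1) add $t0 $t1 $t2 followed by 2) addi $t0 $t0 1:
    print(hazards(("$t0", ["$t1", "$t2"]), ("$t0", ["$t0"])))  # ['RAW', 'WAW']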

Tomasulo’s Algorithm
Program:
    mul $r1 2 3
    mul $r2 $r1 4
    mul $r2 5 6
[Diagram: IF and DE stages issue to three execution units with reservation stations, A, B and C, followed by DM and WB. Initially A, B and C are all IDLE.]

Instruction Issue
The first instruction, mul $r1 2 3, is issued to unit A with operands 2 and 3. A is now BUSY, and register $r1 is tagged A: its value will be produced by A. B and C are IDLE.

Instruction Issue
The second instruction, mul $r2 $r1 4, is issued to unit B. Since $r1 is not yet ready, B receives the tag A in its place, together with the operand 4, and WAITs for A's result. Register $r2 is tagged B.

Instruction Issue
The third instruction, mul $r2 5 6, is issued to unit C with operands 5 and 6. C is now BUSY, and the tag of register $r2 is changed from B to C: register $r2 gets the newer value.

Clock until A and C finish
A finishes with result 2 * 3 = 6: register $r1 (tagged A) is updated to 6, and the result is forwarded to B, which now has both operands (6 and 4) and becomes BUSY. C finishes with result 5 * 6 = 30: register $r2 (tagged C) is updated to 30. A and C are now IDLE.

Clock until B finishes
B finishes with result 6 * 4 = 24, but register $r2 is NOT CHANGED! Its tag is C, not B, so B's stale result is discarded and $r2 keeps the newer value 30. All units are now IDLE. A minimal simulation of this tagging rule is sketched below.
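Here is a minimal Python sketch of the register-tagging rule illustrated above (the unit names and the issue/finish driver are assumptions for illustration; this is not a full Tomasulo implementation with a common data bus, timing, or load/store handling):

    # A register entry is either ("value", v) or ("tag", unit) while waiting.
    regs = {"$r1": ("value", 0), "$r2": ("value", 0)}
    stations = {}  # unit name -> operand list (values, or tags to wait on)

    def issue(unit, dst, src_a, src_b):
        """Issue dst = src_a * src_b to reservation station `unit`."""
        def fetch(src):
            if isinstance(src, str):     # register operand
                kind, v = regs[src]
                return v                 # a value, or a unit tag to wait on
            return src                   # immediate operand
        stations[unit] = [fetch(src_a), fetch(src_b)]
        regs[dst] = ("tag", unit)        # dst will be produced by `unit`

    def finish(unit):
        """`unit` completes and broadcasts its result."""
        a, b = stations.pop(unit)
        result = a * b
        for r, (kind, v) in regs.items():
            if kind == "tag" and v == unit:  # update only if tag still matches
                regs[r] = ("value", result)
        for ops in stations.values():        # wake up waiting stations
            for i, op in enumerate(ops):
                if op == unit:
                    ops[i] = result

    issue("A", "$r1", 2, 3)       # mul $r1 2 3
    issue("B", "$r2", "$r1", 4)   # mul $r2 $r1 4  (waits on tag A)
    issue("C", "$r2", 5, 6)       # mul $r2 5 6    ($r2 re-tagged to C)
    finish("A")                   # $r1 <- 6, B receives the operand 6
    finish("C")                   # $r2 <- 30
    finish("B")                   # result 24 discarded: $r2 tag is C, not B
    print(regs)  # {'$r1': ('value', 6), '$r2': ('value', 30)}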

SuperScalar Designs
3-8 times faster than scalar designs, depending on:
- instruction parallelism (the upper bound)
- machine parallelism
Pros:
- backward compatible (optimization is done at run time)
Cons:
- complex hardware implementation
- not scalable (limited by instruction parallelism)

VLIW
Why not let the compiler do the work? Use a Very Long Instruction Word (VLIW) consisting of many instructions in parallel. Each time we read one VLIW instruction, we actually issue all of the instructions contained in it.

VLIW
[Diagram: one wide (128-bit) VLIW instruction is fetched and split into 32-bit operations, each flowing through its own DE, EX, DM, WB pipeline. The instruction fetch path is usually the bottleneck.]
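As a toy illustration of the VLIW idea (not from the slides; the 4-slot format and packing rule are invented for the example), here is a Python sketch that greedily packs operations into fixed-width bundles, padding with no-ops:

    SLOTS = 4  # operations per VLIW instruction (e.g. 4 x 32 bits = 128 bits)

    def pack(ops):
        """Greedy in-order packing; ops in a bundle must be independent."""
        bundles, current, written = [], [], set()
        for dst, srcs in ops:
            independent = not (set(srcs) & written) and dst not in written
            if len(current) == SLOTS or not independent:
                bundles.append(current + ["nop"] * (SLOTS - len(current)))
                current, written = [], set()
            current.append((dst, srcs))
            written.add(dst)
        if current:
            bundles.append(current + ["nop"] * (SLOTS - len(current)))
        return bundles

    ops = [("$t0", ["$t1", "$t2"]), ("$t0", ["$t0"]),
           ("$t3", ["$t1", "$t2"]), ("$t3", ["$t3"])]
    for bundle in pack(ops):
        print(bundle)
    # Greedy in-order packing needs three bundles here; a scheduling
    # compiler could reorder 1) 3) and 2) 4) into just two.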

VLIW
Let the compiler do the instruction issuing, and let it take its time: we do this only once, so the analysis can be ADVANCED. What if we change the architecture? Recompile the code:
- this could be done the first time you load a program
- the code is only recompiled when the architecture has changed
We could also let the compiler know about the cache configuration: number of levels, line size, number of lines, replacement strategy, write-back/write-through, etc. Hot Research Area!

VLIW
Pros:
- we get high bandwidth to instruction memory
- cheap compared to SuperScalar: not much extra hardware needed
- more parallelism
- we spot parallelism at a higher level (C, MODULA, JAVA?)
- we can use advanced algorithms for optimization
- new architectures can be utilized by recompilation
Cons:
- software compatibility
- it has not "HIT THE MARKET" (yet)

4 State Branch Prediction
Example loop:
    loop: A
          bne ... loop    (taken 100 times)
          B
          j loop
[State diagram: two "predict branch" states (1, 2) and two "predict no branch" states; BRA and NO BRA outcomes move the predictor between states, so it takes two consecutive mispredictions to change the prediction.]
We always predict BRA in state (1) inside the inner loop. When we exit the loop we fail once and go to state (2). The next time, we still predict BRA in state (2), and a taken branch brings us back to state (1). A minimal sketch of this predictor follows below.
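Here is a minimal Python sketch of this 4-state (2-bit saturating counter) predictor; the particular state encoding is an assumption, but the behaviour matches the slide:

    # States 0,1 predict NO BRA; states 2,3 predict BRA.
    # Two consecutive mispredictions are needed to flip the prediction.
    state = 3  # strongly predict branch, state (1) on the slide

    def predict():
        return state >= 2          # True = predict branch taken

    def update(taken):
        global state
        state = min(state + 1, 3) if taken else max(state - 1, 0)

    misses = 0
    for taken in [True] * 100 + [False]:   # inner loop: taken 100x, then exit
        if predict() != taken:
            misses += 1
        update(taken)
    print(misses)  # 1 -- only the loop exit is mispredicted; the predictor
                   # moves to state (2) and still predicts BRA next time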

Branch Prediction
- The 4 states are stored as 2 bits in the instruction cache, together with the conditional branch instruction.
- We predict the branch and prefetch the predicted instructions.
- We issue these before we know whether the branch is taken!
- When the prediction fails, we abort the issued instructions.

Branch Prediction
    loop: 1)
          2)
          3)
          bne $r1 loop    (predict branch taken)
          4)
          5)
          6)
Instructions 1), 2) and 3) are prefetched and may already be issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish. In case of a prediction failure we have to abort the issued instructions and start fetching 4), 5) and 6).

Multiple Branch Targets
For the same loop as above, instructions 1), 2), 3), 4), 5) and 6) are all prefetched, from both the taken and the fall-through path, and may already be issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish. As soon as we know $r1, we abort the redundant instructions. VERY COMPLEX!!!