Instruction-Level Parallelism

Getting Faster – So Far Speed up clock Reduce CPI Reduce Instructions/Program Clock, CPI, Instructions, Power tightly interrelated – no free lunches Performance = programs/second = (programs/instruction) × (instructions/cycle) × (cycles/second)
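The identity above can be checked numerically; a minimal sketch where the instruction count, CPI, and clock rate are illustrative numbers, not measurements:

```python
# A quick numerical check of the performance identity above.
# All numbers are illustrative, not measurements.
instructions_per_program = 1_000_000
cycles_per_instruction = 1.5            # CPI
cycles_per_second = 2_000_000_000       # 2 GHz clock

seconds_per_program = (instructions_per_program
                       * cycles_per_instruction
                       / cycles_per_second)
programs_per_second = 1 / seconds_per_program
print(f"{programs_per_second:.1f} programs/second")  # 1333.3 programs/second
```

Improving any one factor (clock, CPI, or instruction count) raises the result, which is why all three are attacked at once.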

Getting Faster – So Far Pipelined processor Ideal speedup = N times more throughput for N stages But Latency increases Branches / conflicts mean limited returns after certain point

Getting Faster – ILP Instruction Level Parallelism Ability to run multiple instructions at the same time

Superscalar Superscalar : processor with multiple pipelines

Conventional vs Superscalar Not all parts need duplication

Conventional vs Superscalar Width generally focused on execution units Slowest part of pipeline Most specialized

Superscalar Multi-issue : Fetch multiple instructions, issue to dispatch units

Superscalar Dispatch : Instructions transmitted to functional units for execution

ARM A7 & A15 A7: partial dual issue, 8-stage integer pipeline; A15: 3-instruction issue, 15-stage integer pipeline

AMD Zen 10 wide execution 4 Integer ALUs 4 Floating Point Units 2 Address Generation Units

Superscalar Dependency issues just got MUCH harder… Instructions packed closer More to keep track of

Sample Program 9 instructions

Sample Program 9 instructions RAW Dependencies

In Order Issue Time 1 Issue 1 & 2 Time 2 Issue 3 & 4, start execution

In Order Issue Time 3 I5 Issued I4 is blocked on I1

In Order Issue Time 4 I5 blocked on I4 I6 issued

In Order Issue Time 5 I5 blocked on I4 I7 issued

In Order Issue Time 6 I8 issued I7 blocked on I5

In Order Issue Time 7 I9 issued I7 blocked on I5

In Order Issue Time 8

In Order Issue Time 9

Even More Complicated Reality is even more complicated

Analysis Single wide 3-stage pipeline: Those 9 instructions in 12 clocks

Analysis Double width 3-stage pipeline: Those 9 instructions in 9 clocks

Analysis – This Case Ideal speedup for a double-wide pipeline: 2x Speedup for dual pipelines in this case: S = single-wide time / 2-wide time = 12/9 ≈ 1.33
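The comparison can be computed directly from the cycle counts; dependencies keep the dual pipeline well short of the 2x ideal:

```python
# The speedup on the 9-instruction example, from the cycle counts above.
single_wide_cycles = 12   # 9 instructions, single pipeline
dual_wide_cycles = 9      # same 9 instructions, dual-issue
speedup = single_wide_cycles / dual_wide_cycles
print(f"{speedup:.2f}x achieved vs 2.00x ideal")  # 1.33x achieved vs 2.00x ideal
```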

Out Of Order Execution Out of Order execution: Allow execution units to process instructions out of order Reduce waiting Guarantee same behavior as in order

Out Of Order Execution Out of Order example: RAW dependency with R1 and R6

Out Of Order Execution Out of Order example: Resolve by moving ADD R6, R3, R8 up to fill bubble due to R1

WAR and WAW Out of order execution means new dangers Write After Read Write After Write
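All three dependence types can be detected mechanically from each instruction's read and write sets; a minimal sketch, where the instruction encoding and register names are assumptions for illustration:

```python
# A minimal hazard classifier for two instructions in program order.
# Each instruction is described by the register sets it reads and writes;
# the encoding and register names are illustrative, not a real ISA.

def hazards(first, second):
    """Return the hazards between first and the later instruction second."""
    found = set()
    if first["writes"] & second["reads"]:
        found.add("RAW")   # read after write: second needs first's result
    if first["reads"] & second["writes"]:
        found.add("WAR")   # write after read: second overwrites an input of first
    if first["writes"] & second["writes"]:
        found.add("WAW")   # write after write: final value must be second's
    return found

i1 = {"reads": {"r1", "r2"}, "writes": {"r3"}}   # e.g. add r3, r1, r2
i2 = {"reads": {"r3", "r4"}, "writes": {"r2"}}   # e.g. sub r2, r3, r4
print(sorted(hazards(i1, i2)))  # ['RAW', 'WAR']
```

Only RAW is a true data dependence; WAR and WAW are name conflicts, which is why renaming (below in the deck) can remove them.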

Out Of Order Execution Reservation stations : holding pens for instructions until the resources they need are ready

Out Of Order Execution Retire : put instructions back into order before writing out

Sample Program RAW Dependencies WAR WAW

Sample Program RAW Dependencies WAR WAW

Timing RAW Dependencies Need to follow by 2+ time stages

Timing RAW Dependencies Need to follow by 2+ time stages WAR Dependencies Can be issued at same time But I2 can't be before I1

Timing RAW Dependencies Need to follow by 2+ time stages WAR Dependencies Can be issued at same time But I2 can't be before I1 WAW Dependencies Must write no later than the dependency Or the reorder buffer is used to fix writeback order

Sample Program Out of Order Execution Red = RAW 2 stage delay Green = WAR Must come at same or later time WAW handled by reorder buffer

Out of Order Issue Time 1: Issue 1 & 2 Time 2: Issue 3 & 6 to avoid 4 blocking on 1

Out of Order Issue Time 3 Now safe to issue 4 and 8

Out of Order Issue Time 4 Now safe to issue 9 5 not ready to start

Out of Order Issue Time 5 Now safe to issue 5

Out of Order Issue Time 6+ Can't start 7 until 5 is writing

Analysis Out of Order Can help keep pipelines full Can't shorten time for a critical path like 1 → 4 → 5 → 7
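That lower bound is just the longest RAW chain in the program; a minimal sketch, with the dependency edges taken from the in-order walkthrough above (4 waits on 1, 5 on 4, 7 on 5):

```python
# Longest RAW chain in the sample program, as a lower bound on execution
# time that no amount of out-of-order issue can beat. The dependency
# edges are assumed from the walkthrough above (4 waits on 1, 5 on 4, 7 on 5).
from functools import lru_cache

deps = {4: [1], 5: [4], 7: [5]}  # instruction -> instructions it waits on

@lru_cache(maxsize=None)
def chain_length(i):
    """Instructions on the longest dependency chain ending at i."""
    return 1 + max((chain_length(d) for d in deps.get(i, [])), default=0)

critical = max(chain_length(i) for i in range(1, 10))
print(critical)  # 4 -- instructions 1, 4, 5, 7 must run one after another
```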

Register Renaming Programmers/Compiler reuse registers for different jobs: 1 ldr r1,[r0] ;get x 2 ldr r2,[r3] ;get y in r2 (first use of r2) 3 mul r4,r1,r2 ;z = x·y 4 sub r5,r1,#4 ;q = x – 4 5 div r2,r1,r5 ;w = x/(x – 4) (reuse of r2) 6 add r6,r4,r2 ;s = x·y + x/(x – 4)

Register Renaming Register renaming : Avoiding data conflicts by reassign registers to Other physical registers Hidden shadow registers

Register Renaming R2 renamed to r7 1 ldr r1,[r0] ;get x 2 ldr r2,[r3] ;get y in r2 (first use of r2) 3 mul r4,r1,r2 ;z = x·y 4 sub r5,r1,#4 ;q = x – 4 5 div r7,r1,r5 ;w = x/(x – 4) (changed to r7) 6 add r6,r4,r7 ;s = x·y + x/(x – 4)
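The renaming in the example can be sketched as a simple pass: the first write to an architectural register keeps its name, any later write to the same register is redirected to a fresh physical register, and subsequent reads follow the current mapping. The (dest, sources) encoding and the spare registers r7/r8 are illustrative assumptions:

```python
# A minimal register-renaming pass. First write to a register keeps its
# name; a later write to the same register is redirected to a spare
# physical register, and later reads follow the current mapping.
# The (dest, sources) encoding and spare registers are assumptions.

def rename(program, spare_regs):
    mapping, out = {}, []
    for dest, srcs in program:
        srcs = tuple(mapping.get(s, s) for s in srcs)   # reads use current names
        mapping[dest] = spare_regs.pop(0) if dest in mapping else dest
        out.append((mapping[dest], srcs))
    return out

program = [
    ("r1", ("r0",)),        # ldr r1,[r0]
    ("r2", ("r3",)),        # ldr r2,[r3]
    ("r4", ("r1", "r2")),   # mul r4,r1,r2
    ("r5", ("r1",)),        # sub r5,r1,#4
    ("r2", ("r1", "r5")),   # div r2,r1,r5  (reuse of r2)
    ("r6", ("r4", "r2")),   # add r6,r4,r2
]
renamed = rename(program, ["r7", "r8"])
print(renamed[4], renamed[5])  # ('r7', ('r1', 'r5')) ('r6', ('r4', 'r7'))
```

The div now writes r7 and the add reads r7, matching the renamed listing in the slide.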

Register Renaming Before: After:

Superscalar Pro/Con Good The hardware solves everything: Hardware solves scheduling/registers/etc… Compiler/programmer can still help matters Binary compatibility New hardware issues old instructions in a more efficient way

Transistor Count Vast majority of the transistor budget goes to supporting doing work faster

Superscalar Pro/Con Bad Complex hardware Limit to scale

VLIW: Superscalar Alternative VLIW : Very Long Instruction Word One bundle contains multiple instructions Each bundle designed to schedule cleanly

Who does work? Compiler assembles long instructions Reorders at compile time Compiler has more time, information
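The compile-time reordering described above can be sketched as a greedy bundler: pack up to three independent instructions per bundle, starting a new bundle whenever an instruction reads a result produced earlier in the current bundle. This toy ignores WAR/WAW within a bundle, and the (dest, sources) encoding is an assumption for illustration:

```python
# A toy compile-time VLIW bundler: pack up to `width` independent
# instructions per bundle; start a new bundle whenever an instruction
# reads a register written earlier in the current bundle.
# Ignores WAR/WAW inside a bundle; encoding is (dest, sources).

def bundle(program, width=3):
    bundles, current, written = [], [], set()
    for dest, srcs in program:
        if len(current) == width or any(s in written for s in srcs):
            bundles.append(current)          # close the bundle, start fresh
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    bundles.append(current)
    return bundles

prog = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r4", ())]
print(len(bundle(prog)))  # 2 -- the dependent r3 forces a second bundle
```

A real compiler does far more (latency-aware list scheduling, loop unrolling, software pipelining), which is exactly why the slides call good VLIW compilers hard to write.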

VLIW Uses Itanium : EPIC : Explicitly Parallel Instruction Computing 3-instruction bundles

VLIW Pro/Con Good Simple hardware No scheduling responsibilities Potentially better optimization in compiler Bad Binary compatibility : compiler builds for one specific hardware Good compilers are HARD to write