Instructional Parallelism

Getting Faster – So Far
Speed up clock
Reduce CPI
Reduce Instructions/Program
Clock, CPI, and instruction power are tightly interrelated – no free lunches
$$\text{Performance} = \frac{\text{program}}{\text{second}} = \frac{\text{program}}{\text{instructions}} \times \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$
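As a quick numeric check of the performance relation, the sketch below evaluates it for made-up values. The instruction count, CPI figures, and clock rates are illustrative assumptions, not measurements, and the function name is hypothetical. It illustrates the "no free lunches" point: a faster clock that costs CPI can end up slower overall.

```python
def programs_per_second(instructions_per_program, cpi, clock_hz):
    """programs/second = (cycles/second) / (cycles/program).

    cycles/program = instructions/program * cycles/instruction (CPI).
    """
    cycles_per_program = instructions_per_program * cpi
    return clock_hz / cycles_per_program


# Hypothetical baseline: 1M instructions, CPI of 2.0, 1.0 GHz clock.
baseline = programs_per_second(1_000_000, 2.0, 1.0e9)

# Hypothetical redesign: 20% faster clock, but CPI rises to 2.5.
redesign = programs_per_second(1_000_000, 2.5, 1.2e9)

print(f"baseline: {baseline:.1f} programs/s")
print(f"redesign: {redesign:.1f} programs/s")
print("redesign is faster" if redesign > baseline else "redesign is slower")
```

With these assumed numbers the redesign loses: the clock gain is smaller than the CPI penalty.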

Getting Faster – So Far
Pipelined processor
Ideal speedup = N times more throughput for N stages
But latency increases
Branches / conflicts mean limited returns after a certain point
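A minimal sketch of the ideal case, assuming one clock per stage and no stalls (both are simplifying assumptions, and the function names are illustrative): k instructions need N + k - 1 clocks through an N-stage pipeline, versus N * k clocks when each instruction must finish before the next starts.

```python
def unpipelined_clocks(num_instructions, num_stages):
    # Each instruction occupies the whole datapath for all stages.
    return num_instructions * num_stages


def pipelined_clocks(num_instructions, num_stages):
    # The first instruction takes num_stages clocks to drain;
    # every later instruction completes one clock after the previous one.
    return num_stages + num_instructions - 1


def ideal_speedup(num_instructions, num_stages):
    return (unpipelined_clocks(num_instructions, num_stages)
            / pipelined_clocks(num_instructions, num_stages))


for count in (1, 9, 100, 10_000):
    print(count, "instructions:", round(ideal_speedup(count, 3), 3))
```

The speedup only approaches N for long instruction streams, and this model ignores the branch and conflict stalls that limit real pipelines.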

Getting Faster – ILP
Instruction Level Parallelism
Ability to run multiple instructions at the same time

Superscalar
Superscalar: processor with multiple pipelines

Conventional vs Superscalar
Not all parts need duplication

Conventional vs Superscalar
Width generally focused on execution units
Slowest part of pipeline
Most specialized

Superscalar
Multi-issue: fetch multiple instructions, issue to dispatch units

Superscalar
Dispatch: instructions transmitted to functional units for execution

ARM A7 & A15
A7: partial dual issue, 8 stage integer pipeline
A15: 3 instruction issue, 15 stage integer pipeline

AMD Zen
10 wide execution
4 Integer ALUs
4 Floating Point Units
2 Address Generation Units

Superscalar
Dependency issues just got MUCH harder…
Instructions packed closer
More to keep track of

Sample Program
9 instructions

Sample Program
9 instructions
RAW Dependencies

In Order Issue
Time 1: Issue 1 & 2
Time 2: Issue 3 & 4, start execution

In Order Issue
Time 3: I5 issued, I4 is blocked on I1

In Order Issue
Time 4: I5 blocked on I4, I6 issued

In Order Issue
Time 5: I5 blocked on I4, I7 issued

In Order Issue
Time 6: I8 issued, I7 blocked on I5

In Order Issue
Time 7: I9 issued, I7 blocked on I5

In Order Issue
Time 8

In Order Issue
Time 9

Even More Complicated
Reality is even more complicated

Analysis
Single wide 3-stage pipeline: those 9 instructions in 12 clocks

Analysis
Double width 3-stage pipeline: those 9 instructions in 9 clocks

Analysis – This Case
Ideal speedup for double wide pipeline: 2x
Speedup for dual pipelines in this case:
$$S = \frac{\text{single-wide time}}{\text{2-wide time}} = \frac{12}{9} \approx 1.33$$
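The arithmetic can be checked directly. This small sketch uses the clock counts from the analysis slides (12 clocks single wide, 9 clocks double wide); the function name and the "fraction of ideal" measure are illustrative additions.

```python
def speedup(baseline_clocks, improved_clocks):
    # Speedup is the ratio of old time to new time for the same work.
    return baseline_clocks / improved_clocks


single_wide_clocks = 12   # 9 instructions, single wide 3-stage pipeline
double_wide_clocks = 9    # same 9 instructions, double width pipeline
ideal_speedup = 2.0       # two pipelines, perfectly filled

s = speedup(single_wide_clocks, double_wide_clocks)
fraction_of_ideal = s / ideal_speedup

print(f"speedup = {s:.2f}")
print(f"fraction of ideal = {fraction_of_ideal:.0%}")
```

For this program the second pipeline delivers roughly two thirds of its ideal benefit; the rest is lost to dependency stalls.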

Out Of Order Execution
Out of order execution:
Allow execution units to process instructions out of order
Reduce waiting
Guarantee same behavior as in order

Out Of Order Execution
Out of order example: RAW dependency with R1 and R6

Out Of Order Execution
Out of order example: resolve by moving ADD R6, R3, R8 up to fill bubble due to R1

WAR and WAW
Out of order execution means new dangers
Write After Read
Write After Write

Out Of Order Execution
Reservation stations: holding pens for instructions until needed resources are ready

Out Of Order Execution
Retire: put instructions back into order before writing out

Sample Program
RAW Dependencies
WAR
WAW

Timing
RAW Dependencies: need to follow by 2+ time stages

Timing
RAW Dependencies: need to follow by 2+ time stages
WAR Dependencies: can be issued at same time, but I2 can't be before I1

Timing
RAW Dependencies: need to follow by 2+ time stages
WAR Dependencies: can be issued at same time, but I2 can't be before I1
WAW Dependencies: be no later than dependency, or retire buffer used to fix writeback order

Sample Program
Out of Order Execution
Red = RAW: 2 stage delay
Green = WAR: must come at same or later time
WAW handled by reorder buffer

Out of Order Issue
Time 1: Issue 1 & 2
Time 2: Issue 3 & 6 to avoid 4 blocking on 1

Out of Order Issue
Time 3: Now safe to issue 4 and 8

Out of Order Issue
Time 4: Now safe to issue 9; 5 not ready to start

Out of Order Issue
Time 5: Now safe to issue 5

Out of Order Issue
Time 6+: Can't start 7 until 5 is writing

Analysis
Out of order can help keep pipelines full
Can't shorten time for a critical path like 1 → 4 → 5 → 7
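One way to see why a dependency chain sets a floor is to compute the earliest possible issue time for each instruction on it. This is a rough sketch, assuming the timing rule used in this example (a RAW-dependent instruction issues at least 2 time steps after its producer) and that time steps are numbered from 1. The function name and default values are illustrative.

```python
def earliest_issue_times(chain, raw_delay=2, first_time=1):
    """Map each instruction in a RAW chain to its earliest issue time.

    chain: instruction numbers, each depending on the one before it.
    raw_delay: minimum gap in time steps between producer and consumer.
    """
    times = {}
    t = first_time
    for instruction in chain:
        times[instruction] = t
        t += raw_delay
    return times


critical_path = [1, 4, 5, 7]
times = earliest_issue_times(critical_path)
for instruction, t in times.items():
    print(f"I{instruction}: earliest issue at time {t}")
```

Under these assumptions I7 cannot issue before time 7 no matter how many pipelines are available, which is consistent with the out of order walk-through where 4 issues at time 3 and 5 at time 5.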

Register Renaming
Programmers/compilers reuse registers for different jobs:
ldr r1,[r0] ;get x
ldr r2,[r3] ;get y in r2 (first use of r2)
mlt r4,r1,r2 ;z = x·y
sub r5,r1,#4 ;q = x - 4
div r2,r1,r5 ;z = x/(x – 4) (reuse of r2)
add r6,r4,r2 ;s = x·y + x/(x – 4)
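To make the cost of reusing r2 concrete, the sketch below scans the six-instruction example for RAW, WAR, and WAW dependencies. The tuple encoding of each instruction (destination, sources) and the function name are assumptions made for illustration; positions are counted from 1 in program order, and the immediate #4 is ignored because it is not a register.

```python
# (destination register, source registers), in program order.
program = [
    ("r1", ["r0"]),        # ldr r1,[r0]
    ("r2", ["r3"]),        # ldr r2,[r3]
    ("r4", ["r1", "r2"]),  # mlt r4,r1,r2
    ("r5", ["r1"]),        # sub r5,r1,#4
    ("r2", ["r1", "r5"]),  # div r2,r1,r5
    ("r6", ["r4", "r2"]),  # add r6,r4,r2
]


def find_hazards(instructions):
    """Return (kind, earlier, later, register) tuples with 1-based positions."""
    hazards = []
    for later, (dst_l, srcs_l) in enumerate(instructions):
        for earlier in range(later):
            dst_e, srcs_e = instructions[earlier]
            # Registers written strictly between the two instructions.
            between = [d for d, _ in instructions[earlier + 1:later]]
            # RAW: later reads the value earlier wrote (no rewrite in between).
            if dst_e in srcs_l and dst_e not in between:
                hazards.append(("RAW", earlier + 1, later + 1, dst_e))
            # WAR: later overwrites a register that earlier still has to read.
            if dst_l in srcs_e and dst_l not in between:
                hazards.append(("WAR", earlier + 1, later + 1, dst_l))
            # WAW: both write the same register.
            if dst_l == dst_e and dst_e not in between:
                hazards.append(("WAW", earlier + 1, later + 1, dst_e))
    return hazards


hazards = find_hazards(program)
for kind in ("RAW", "WAR", "WAW"):
    print(kind, [h[1:] for h in hazards if h[0] == kind])
```

The RAW entries are true data dependencies. The WAR and WAW entries both involve r2 and exist only because the div reuses a register name, not because any value flows between those instructions.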

Register Renaming
Register renaming: avoiding data conflicts by reassigning registers to
Other physical registers
Hidden shadow registers

Register Renaming
R2 renamed to r7
ldr r1,[r0] ;get x
ldr r2,[r3] ;get y in r2 (first use of r2)
mlt r4,r1,r2 ;z = x·y
sub r5,r1,#4 ;q = x - 4
div r7,r1,r5 ;z = x/(x – 4) (changed to r7)
add r6,r4,r7 ;s = x·y + x/(x – 4)
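The renaming step can be sketched in a few lines. This is a simplified software model, not how hardware does it (real processors use a rename table and a physical register file); the rule assumed here is: when an instruction writes a register that an earlier instruction already read or wrote, give it the lowest-numbered register name not yet in use, and redirect later reads to that new name. The instruction encoding and function name are illustrative.

```python
# (destination register, source registers) for the original program.
program = [
    ("r1", ["r0"]),        # ldr r1,[r0]
    ("r2", ["r3"]),        # ldr r2,[r3]
    ("r4", ["r1", "r2"]),  # mlt r4,r1,r2
    ("r5", ["r1"]),        # sub r5,r1,#4
    ("r2", ["r1", "r5"]),  # div r2,r1,r5
    ("r6", ["r4", "r2"]),  # add r6,r4,r2
]


def rename_registers(instructions):
    """Rename reused destination registers to fresh names."""
    used = {reg for dst, srcs in instructions for reg in (dst, *srcs)}
    mapping = {}   # original name -> current renamed name
    seen = set()   # original names read or written so far
    counter = 0
    renamed = []
    for dst, srcs in instructions:
        # Reads use whatever name currently holds the value.
        new_srcs = [mapping.get(src, src) for src in srcs]
        if dst in seen:
            # Reuse detected: pick the lowest unused register name.
            while f"r{counter}" in used:
                counter += 1
            new_dst = f"r{counter}"
            used.add(new_dst)
            mapping[dst] = new_dst
        else:
            new_dst = dst
        seen.update(srcs)
        seen.add(dst)
        renamed.append((new_dst, new_srcs))
    return renamed


renamed = rename_registers(program)
for dst, srcs in renamed:
    print(dst, "<-", ", ".join(srcs))
```

For this program only the second write to r2 is renamed, and the final add reads the new name, matching the r7 version of the listing.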

Register Renaming
[Figure: dependencies before and after renaming]

Superscalar Pro/Con
Good
The hardware solves everything:
Hardware solves scheduling/registers/etc…
Compiler/programmer can still help matters
Binary compatibility
New hardware issues old instructions in a more efficient way

Transistor Count
Vast majority of transistor count is to support doing work faster

Superscalar Pro/Con
Good
The hardware solves everything:
Hardware solves scheduling/registers/etc…
Compiler/programmer can still help matters
Binary compatibility
New hardware issues old instructions in a more efficient way
Bad
Complex hardware
Limit to scale

VLIW: Superscalar Alternative
VLIW: Very Long Instruction Word
One bundle contains multiple instructions
Each bundle designed to schedule cleanly

Who does work?
Compiler assembles long instructions
Reorders at compile time
Compiler has more time, information
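A toy version of what the compiler does can be sketched as a greedy bundler. This is an illustration under stated assumptions, not a real VLIW scheduler and not how any particular instruction set encodes bundles: instructions are (destination, sources) pairs, a bundle holds up to `width` instructions, an instruction with a RAW or WAW dependency must go in a later bundle than the instruction it depends on, and a WAR dependency may share a bundle (assuming reads happen before writes within a bundle). The six-instruction program is reused from the register renaming example, and the function name is hypothetical.

```python
program = [
    ("r1", ["r0"]),        # 1: ldr r1,[r0]
    ("r2", ["r3"]),        # 2: ldr r2,[r3]
    ("r4", ["r1", "r2"]),  # 3: mlt r4,r1,r2
    ("r5", ["r1"]),        # 4: sub r5,r1,#4
    ("r2", ["r1", "r5"]),  # 5: div r2,r1,r5
    ("r6", ["r4", "r2"]),  # 6: add r6,r4,r2
]


def build_bundles(instructions, width=3):
    """Greedily place each instruction in the earliest legal bundle.

    Returns a list of bundles; each bundle lists 1-based instruction numbers.
    """
    bundles = []
    placed = []  # bundle index chosen for each instruction so far
    for i, (dst, srcs) in enumerate(instructions):
        earliest = 0
        for j in range(i):
            dst_j, srcs_j = instructions[j]
            if dst_j in srcs or dst_j == dst:
                # RAW or WAW: must be in a strictly later bundle.
                earliest = max(earliest, placed[j] + 1)
            elif dst in srcs_j:
                # WAR: same bundle is acceptable, earlier is not.
                earliest = max(earliest, placed[j])
        slot = earliest
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(i + 1)
        placed.append(slot)
    return bundles


bundles = build_bundles(program, width=3)
for number, bundle in enumerate(bundles, start=1):
    print(f"bundle {number}: instructions {bundle}")
```

Even with room for three instructions per bundle, the dependency chain in this program leaves most slots empty, which hints at why good VLIW compilers are hard to write.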

VLIW Uses
Itanium: EPIC (Explicitly Parallel Instruction Computing)
3 instruction bundles

VLIW Pro/Con
Good
Simple hardware
No scheduling responsibilities
Potentially better optimization in compiler
Bad
Binary compatibility: compiler builds for one specific hardware
Good compilers are HARD to write
