Instructional Parallelism
Getting Faster – So Far Speed up clock Reduce CPI Reduce Instructions/Program Clock, CPI, Instruction Power tightly interrelated – no free lunches
$$\text{Performance} = \frac{\text{program}}{\text{second}} = \frac{\text{program}}{\text{instructions}} \times \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$
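As a worked illustration of the performance relation, the sketch below computes run time from instruction count, CPI, and clock rate. The function name and all sample numbers are assumptions made for illustration; none of them come from the slides.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds per program = instructions * (cycles / instruction) / (cycles / second)."""
    return instructions * cpi / clock_hz


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 2 GHz clock.
base = execution_time(1e9, 1.5, 2e9)

# Halving CPI halves the run time if nothing else changes...
lower_cpi = execution_time(1e9, 0.75, 2e9)

# ...but if the lower CPI costs half the clock rate, the gain disappears.
slower_clock = execution_time(1e9, 0.75, 1e9)

print(base, lower_cpi, slower_clock)
```

The third case is the "no free lunches" point: the three factors trade against each other, so improving one only helps if the others hold steady.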
Getting Faster – So Far Pipelined processor Ideal speedup = N times more throughput for N stages But Latency increases Branches / conflicts mean limited returns after certain point
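The ideal N-times throughput claim can be checked with a small cycle-count model. This is a minimal sketch that assumes every stage takes exactly one clock and that stalls are simply added on top; the function names and the `stall_cycles` parameter are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> int:
    """The first instruction needs n_stages clocks; after that, one finishes per clock."""
    return n_stages + (n_instructions - 1) + stall_cycles


def pipeline_speedup(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> float:
    """Ratio of unpipelined to pipelined clock counts."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages, stall_cycles
    )


# Short programs fall well short of N; long stall-free ones approach it.
print(pipeline_speedup(9, 3))
print(pipeline_speedup(1000, 3))
# Stalls from branches or conflicts pull the speedup back down.
print(pipeline_speedup(1000, 3, stall_cycles=500))
```

Under this model the speedup only approaches N for long instruction streams with no stalls, which is why returns diminish once branches and conflicts start to dominate.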
Getting Faster – ILP Instruction Level Parallelism Ability to run multiple instructions at the same time
Superscalar Superscalar : processor with multiple pipelines
Conventional vs Superscalar Not all parts need duplication
Conventional vs Superscalar Width generally focused on execution units Slowest part of pipeline Most specialized
Superscalar Multi-issue : Fetch multiple instructions, issue to dispatch units
Superscalar Dispatch : Instructions transmitted to functional units for execution
ARM A7 & A15 A7: Partial dual issue, 8 stage integer pipeline A15: 3 instruction issue, 15 stage integer pipeline
AMD Zen 10 wide execution 4 Integer ALUs 4 Floating Point Units 2 Address Generation Units
Superscalar Dependency issues just got MUCH harder… Instructions packed closer More to keep track of
Sample Program 9 instructions
Sample Program 9 instructions RAW Dependencies
In Order Issue Time 1 Issue 1 & 2 Time 2 Issue 3 & 4, start execution
In Order Issue Time 3 I5 Issued I4 is blocked on I1
In Order Issue Time 4 I5 blocked on I4 I6 issued
In Order Issue Time 5 I5 blocked on I4 I7 issued
In Order Issue Time 6 I8 issued I7 blocked on I5
In Order Issue Time 7 I9 issued I7 blocked on I5
In Order Issue Time 8
In Order Issue Time 9
Even More Complicated Reality is even more complicated
Analysis Single wide 3-stage pipeline: Those 9 instructions in 12 clocks
Analysis Double width 3-stage pipeline: Those 9 instructions in 9 clocks
Analysis – This Case Ideal speedup for double wide pipeline: 2x Speedup for dual pipelines in this case: $S = \frac{\text{single-wide time}}{\text{2-wide time}} = \frac{12}{9} \approx 1.33$
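The clock counts from the two analysis slides (12 clocks single-wide, 9 clocks double-wide) can be turned into a speedup and a fraction of the ideal. The helper names below are hypothetical; only the clock counts come from the slides.

```python
def measured_speedup(baseline_clocks: int, improved_clocks: int) -> float:
    """Speedup = old time / new time, with both times in clocks."""
    return baseline_clocks / improved_clocks


def fraction_of_ideal(speedup: float, issue_width: int) -> float:
    """How much of the ideal width-times speedup was actually achieved."""
    return speedup / issue_width


s = measured_speedup(12, 9)    # single-wide clocks / double-wide clocks
share = fraction_of_ideal(s, 2)
print(f"speedup = {s:.2f}, fraction of ideal 2x = {share:.0%}")
```

For this program the second pipeline delivers about two thirds of its ideal benefit; the rest is lost to instructions blocked on dependencies.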
Out Of Order Execution Out of Order execution: Allow execution units to process instructions out of order Reduce waiting Guarantee same behavior as in order
Out Of Order Execution Out of Order example: RAW dependency with R1 and R6
Out Of Order Execution Out of Order example: Resolve by moving ADD R6, R3, R8 up to fill bubble due to R1
WAR and WAW Out of order execution means new dangers Write After Read Write After Write
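The three dependency types can be told apart by comparing the registers each instruction reads and writes. The sketch below is a simplified classifier for one pair of instructions; the `Instr` type and the example instructions are hypothetical and are not the sample program from the slides.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    name: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependency types that `later` has on `earlier` (program order)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")    # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")    # later overwrites something earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")    # both write the same register
    return found


i1 = Instr("ADD", "R1", ("R2", "R3"))
i2 = Instr("SUB", "R4", ("R1", "R5"))
i3 = Instr("MUL", "R2", ("R6", "R7"))
i4 = Instr("OR", "R1", ("R8", "R9"))

print(hazards(i1, i2), hazards(i1, i3), hazards(i1, i4))
```

Only RAW reflects real data flow. WAR and WAW come from reusing register names, which is why they only become dangerous once instructions may execute out of order.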
Out Of Order Execution Reservation stations : holding pens for instructions until needed resources ready
Out Of Order Execution Retire : put instructions back into order before writing out
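One way to picture the retire step is a queue kept in program order: instructions may finish in any order, but results leave only from the front. The class below is a minimal sketch under that assumption; its name, its methods, and the string results are illustrative and do not describe any particular processor.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Holds issued instructions in program order until they may retire."""

    def __init__(self):
        self._entries = OrderedDict()   # instr_id -> {"done", "result"}

    def issue(self, instr_id):
        """Reserve a slot in program order when the instruction is issued."""
        self._entries[instr_id] = {"done": False, "result": None}

    def complete(self, instr_id, result):
        """Record a result; execution units may call this in any order."""
        entry = self._entries[instr_id]
        entry["done"] = True
        entry["result"] = result

    def retire(self):
        """Release finished instructions from the front, stopping at the first unfinished one."""
        retired = []
        while self._entries:
            oldest_id, entry = next(iter(self._entries.items()))
            if not entry["done"]:
                break
            self._entries.popitem(last=False)
            retired.append((oldest_id, entry["result"]))
        return retired


rob = ReorderBuffer()
for instr_id in (1, 2, 3):
    rob.issue(instr_id)

rob.complete(3, "c")      # instruction 3 finishes first
first = rob.retire()      # nothing retires: 1 and 2 are still pending
rob.complete(1, "a")
second = rob.retire()     # only instruction 1 can leave
rob.complete(2, "b")
third = rob.retire()      # 2 and 3 now leave together, in program order
print(first, second, third)
```

Because writes become visible only at retirement, the program observes the same register updates, in the same order, as in-order execution would produce.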
Sample Program RAW Dependencies WAR WAW
Timing RAW Dependencies Need to follow by 2+ time stages
Timing RAW Dependencies Need to follow by 2+ time stages WAR Dependencies Can be issued at same time But I2 can't be before I1
Timing RAW Dependencies Need to follow by 2+ time stages WAR Dependencies Can be issued at same time But I2 can't be before I1 WAW Dependencies Be no later than dependency Or retire buffer used to fix writeback order
Sample Program Out of Order Execution Red = RAW 2 stage delay Green = WAR Must come at same or later time WAW handled by reorder buffer
Out of Order Issue Time 1 Issue 1 & 2 Time 2 Issue 3 & 6 to avoid 4 blocking on 1
Out of Order Issue Time 3 Now safe to issue 4 and 8
Out of Order Issue Time 4 Now safe to issue 9 5 not ready to start
Out of Order Issue Time 5 Now safe to issue 5
Out of Order Issue Time 6+ Can't start 7 until 5 is writing
Analysis Out of Order Can help keep pipelines full Can't shorten time for a critical path like 1 → 4 → 5 → 7
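The critical-path limit can be sketched by computing the earliest time slot at which each instruction could issue, given only its RAW producers. The edges below are the ones the slides name (4 waits on 1, 5 on 4, 7 on 5) and the 2-slot RAW delay is taken from the timing slides. The sample program's full dependency graph is in a figure and is not reproduced here, and the function itself is an illustrative assumption.

```python
RAW_DELAY = 2   # a consumer must follow its producer by at least 2 time slots

# consumer -> producers; only the chain named in the slides is listed
raw_deps = {4: [1], 5: [4], 7: [5]}


def earliest_issue(instr: int, deps: dict, delay: int = RAW_DELAY) -> int:
    """Earliest slot (1-based) at which instr can issue, ignoring issue-width limits."""
    producers = deps.get(instr, [])
    if not producers:
        return 1
    return max(earliest_issue(p, deps, delay) + delay for p in producers)


for i in (1, 4, 5, 7):
    print(f"I{i}: earliest issue slot {earliest_issue(i, raw_deps)}")
```

The model ignores issue width entirely, so under these assumptions the slot computed for the last instruction in the chain is a floor that extra pipelines or reordering cannot lower.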
Register Renaming Programmers/compilers reuse registers for different jobs:
    ldr r1,[r0]     ;get x
    ldr r2,[r3]     ;get y in r2 (first use of r2)
    mlt r4,r1,r2    ;z = x·y
    sub r5,r1,#4    ;q = x - 4
    div r2,r1,r5    ;x/(x – 4) (reuse of r2)
    add r6,r4,r2    ;s = x·y + x/(x – 4)
Register Renaming Register renaming : Avoiding data conflicts by reassigning registers to other physical registers or hidden shadow registers
Register Renaming r2 renamed to r7
    ldr r1,[r0]     ;get x
    ldr r2,[r3]     ;get y in r2 (first use of r2)
    mlt r4,r1,r2    ;z = x·y
    sub r5,r1,#4    ;q = x - 4
    div r7,r1,r5    ;x/(x – 4) (changed to r7)
    add r6,r4,r7    ;s = x·y + x/(x – 4)
Register Renaming Before: After:
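The renaming in the example can be reproduced with a small mapping table. This is a deliberately simplified sketch: it gives a fresh register only to a destination that has already been written, which mirrors the r2 to r7 change on the slide. Real hardware typically renames every destination through a map table and free list. The tuple encoding of instructions and the free-register list are assumptions.

```python
def rename(program, free_regs):
    """Rename reused destination registers; program is a list of (op, dest, srcs)."""
    free = list(free_regs)
    current = {}        # architectural name -> register currently holding its value
    written = set()     # architectural names already used as a destination
    renamed = []
    for op, dest, srcs in program:
        # Sources read whichever register currently holds the value they want.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        if dest in written:
            new_dest = free.pop(0)   # reuse detected: take a fresh register
        else:                        # (an empty free list would raise IndexError here)
            new_dest = dest
        written.add(dest)
        current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("ldr", "r1", ("r0",)),
    ("ldr", "r2", ("r3",)),
    ("mlt", "r4", ("r1", "r2")),
    ("sub", "r5", ("r1", "#4")),
    ("div", "r2", ("r1", "r5")),   # reuse of r2
    ("add", "r6", ("r4", "r2")),
]

result = rename(program, free_regs=["r7", "r8"])
for op, dest, srcs in result:
    print(op, dest, ", ".join(srcs))
```

After renaming, `div` no longer writes the register that `mlt` reads, so the WAR conflict between them is gone, while the true RAW dependency from `div` to `add` is preserved through r7.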
Superscalar Pro/Con Good The hardware solves everything: Hardware solves scheduling/registers/etc… Compiler/programmer can still help matters Binary compatibility New hardware issues old instructions in a more efficient way
Transistor Count Vast majority of transistor count is to support doing work faster
Superscalar Pro/Con Good: The hardware solves everything: Hardware solves scheduling/registers/etc… Compiler/programmer can still help matters Binary compatibility New hardware issues old instructions in a more efficient way Bad: Complex hardware Limit to scale
VLIW: Superscalar Alternative VLIW : Very Long Instruction Word One bundle contains multiple instructions Each bundle designed to schedule cleanly
Who does work? Compiler assembles long instructions Reorders at compile time Compiler has more time, information
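Compile-time scheduling can be sketched as a greedy packer that places each instruction in the earliest bundle that comes after all of its producers and still has a free slot. This is a toy model, not how any real VLIW compiler works: it tracks only true (RAW) dependencies, assumes a result is usable in the next bundle, and uses a made-up dependency table. The width of three slots per bundle is also an assumption.

```python
def pack_bundles(n_instructions, deps, width=3):
    """Greedily pack instructions 0..n-1 into bundles of at most `width` slots.

    deps maps a consumer index to its producer indices; every producer must
    come earlier in program order than its consumer.
    """
    bundle_of = {}   # instruction index -> bundle index
    bundles = []
    for i in range(n_instructions):
        # An instruction must sit in a later bundle than all of its producers.
        b = 0
        for p in deps.get(i, []):
            b = max(b, bundle_of[p] + 1)
        # Skip bundles that are already full.
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(i)
        bundle_of[i] = b
    return bundles


# Hypothetical 6-instruction program: 2 needs 0 and 1, 4 needs 3, 5 needs 2 and 4.
deps = {2: [0, 1], 4: [3], 5: [2, 4]}
schedule = pack_bundles(6, deps, width=3)
print(schedule)
```

Because the whole schedule is fixed before the program runs, the hardware can issue each bundle without checking dependencies, which is the trade VLIW makes against superscalar designs.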
VLIW Uses Itanium : EPIC : Explicitly Parallel Instruction Computing 3 instruction bundles
VLIW Pro/Con Good: Simple hardware No scheduling responsibilities Potentially better optimization in compiler Bad: Binary compatibility : compiler builds for one specific hardware Good compilers are HARD to write