Published byαΌΟΟΡμίδΟΟΞΏΟ ΞΞ¬ΞΌΞ²Ξ±Ο Modified over 6 years ago
1
Instructional Parallelism
2
Getting Faster - So Far
Speed up clock
Reduce CPI
Reduce Instructions/Program
Clock, CPI, Instruction Power tightly interrelated - no free lunches
$$\text{Performance} = \frac{\text{program}}{\text{second}} = \frac{\text{program}}{\text{instruction}} \times \frac{\text{instruction}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$
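As a rough illustration of the performance equation, the sketch below computes execution time from instruction count, CPI, and clock rate. The function name and the sample numbers are made up for illustration only.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds per program = instructions x (cycles/instruction) / (cycles/second)."""
    return instructions * cpi / clock_hz

# Hypothetical workload: 1 billion instructions, CPI of 2, 1 GHz clock.
base = execution_time(1_000_000_000, 2.0, 1_000_000_000)

# Halving CPI halves the time, but only if the clock rate and the
# instruction count stay unchanged, which is the "no free lunch" caveat.
faster = execution_time(1_000_000_000, 1.0, 1_000_000_000)

print(base, faster)
```

In practice the three factors move together, so a change that lowers CPI may force a slower clock or more instructions.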
3
Getting Faster - So Far
Pipelined processor
Ideal speedup = N times more throughput for N stages
But latency increases
Branches / conflicts mean limited returns after a certain point
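The ideal-speedup claim can be checked with a small cycle count. This is a minimal sketch that assumes equal-length stages and no stalls; both helper names are hypothetical.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then finish one instruction per cycle."""
    return n_stages + n_instructions - 1

def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages

# Speedup approaches the stage count (5 here) only for long instruction streams.
for n in (1, 10, 1000):
    speedup = unpipelined_cycles(n, 5) / pipelined_cycles(n, 5)
    print(n, round(speedup, 2))
```

Branches and data conflicts add stall cycles that this model ignores, which is why real speedups fall short of N.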
4
Getting Faster - ILP Instruction Level Parallelism
Ability to run multiple instructions at the same time
5
Superscalar Superscalar : processor with multiple pipelines
6
Conventional vs Superscalar
Not all parts need duplication
7
Conventional vs Superscalar
Width generally focused on execution units
Slowest part of pipeline
Most specialized
8
Superscalar Multi-issue :
Fetch multiple instructions, issue to dispatch units (figure labels: I1, I2)
9
Superscalar Dispatch :
Instructions transmitted to functional units for execution (figure labels: I2, I1)
10
ARM A7 & A15
A7: Partial dual issue, 8 stage integer pipeline
A15: 3 instruction issue, 15 stage integer pipeline
11
AMD Zen 10 wide execution 4 Integer ALUs 4 Floating Point Units
2 Address Generation Units
12
Superscalar Dependency issues just got MUCH harder...
Instructions packed closer More to keep track of
13
Sample Program 9 instructions
14
Sample Program 9 instructions RAW Dependencies
15
In Order Issue Time 1 Issue 1 & 2 Time 2 Issue 3 & 4, start execution
16
In Order Issue Time 3 I5 Issued I4 is blocked on I1
17
In Order Issue Time 4 I5 blocked on I4 I6 issued
18
In Order Issue Time 5 I5 blocked on I4 I7 issued
19
In Order Issue Time 6 I8 issued I7 blocked on I5
20
In Order Issue Time 7 I9 issued I7 blocked on I5
21
In Order Issue Time 8
22
In Order Issue Time 9
23
Even More Complicated Reality is even more complicated
24
Analysis Single wide 3-stage pipeline:
Those 9 instructions in 12 clocks
25
Analysis Double width 3-stage pipeline:
Those 9 instructions in 9 clocks
26
Analysis - This Case
Ideal speedup for double wide pipeline: 2x
Speedup for dual pipelines in this case:
$$S = \frac{\text{single wide time}}{\text{2-wide time}} = \frac{12}{9} \approx 1.33$$
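The arithmetic for this case can be reproduced directly from the two clock counts given in the analysis slides (12 clocks single wide, 9 clocks double wide). The variable names are just for illustration.

```python
single_wide_clocks = 12  # 9 instructions on the single wide 3-stage pipeline
double_wide_clocks = 9   # same 9 instructions on the double width pipeline

speedup = single_wide_clocks / double_wide_clocks
fraction_of_ideal = speedup / 2  # ideal speedup for a double wide pipeline is 2x

print(f"speedup = {speedup:.2f}, fraction of ideal = {fraction_of_ideal:.2f}")
```

The ratio 12/9 works out to about 1.33, roughly two thirds of the ideal 2x.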
27
Out Of Order Execution Out of Order execution:
Allow execution units to process instructions out of order Reduce waiting Guarantee same behavior as in order
28
Out Of Order Execution Out of Order example:
RAW dependency with R1 and R6
29
Out Of Order Execution Out of Order example:
Resolve by moving ADD R6, R3, R8 up to fill bubble due to R1
30
WAR and WAW Out of order execution means new dangers Write After Read
Write After Write
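To make the three hazard types concrete, here is a minimal sketch that classifies the hazards between an earlier and a later instruction. The (dest, sources) tuple model and the three sample instructions are hypothetical simplifications, not taken from the slides.

```python
def hazards(first, second):
    """Hazards that the later instruction `second` has on the earlier `first`.

    Each instruction is a (dest, sources) pair.
    """
    d1, s1 = first
    d2, s2 = second
    found = []
    if d1 in s2:
        found.append("RAW")  # second reads what first writes
    if d2 in s1:
        found.append("WAR")  # second overwrites what first still has to read
    if d1 == d2:
        found.append("WAW")  # both write the same register
    return found

add = ("r1", ["r2", "r3"])  # add r1, r2, r3
sub = ("r2", ["r4", "r5"])  # sub r2, r4, r5  (writes a source of add)
mul = ("r1", ["r6", "r7"])  # mul r1, r6, r7  (writes add's destination)

print(hazards(add, sub))  # ['WAR']
print(hazards(add, mul))  # ['WAW']
```

With in-order execution only RAW matters; WAR and WAW become dangers once the later instruction is allowed to run ahead of the earlier one.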
31
Out Of Order Execution Reservation stations : holding pens for instructions until needed resources ready
32
Out Of Order Execution Retire : put instructions back into order before writing out
33
Sample Program RAW Dependencies WAR WAW
34
Sample Program RAW Dependencies WAR WAW
35
Timing RAW Dependencies Need to follow by 2+ time stages
36
Timing RAW Dependencies Need to follow by 2+ time stages
WAR Dependencies
Can be issued at same time
But I2 can't be before I1
37
Timing RAW Dependencies Need to follow by 2+ time stages
WAR Dependencies
Can be issued at same time
But I2 can't be before I1
WAW Dependencies
Be no later than dependency
Or retire buffer used to fix writeback order
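The RAW and WAR timing rules can be expressed as a small schedule checker. This is a sketch under the assumption that WAW ordering is left to the retire buffer; the function name and the three-instruction example are hypothetical.

```python
def schedule_ok(issue_time, raw, war, raw_delay=2):
    """Check an issue schedule against the RAW and WAR timing rules.

    issue_time maps instruction number -> issue time step.
    raw and war are lists of (earlier, later) instruction pairs.
    """
    for producer, consumer in raw:
        if issue_time[consumer] < issue_time[producer] + raw_delay:
            return False  # consumer would read before the value is ready
    for reader, writer in war:
        if issue_time[writer] < issue_time[reader]:
            return False  # writer would clobber a value not yet read
    return True

# Hypothetical example: I2 reads I1's result, I3 overwrites a source of I2.
raw = [(1, 2)]
war = [(2, 3)]

print(schedule_ok({1: 1, 2: 3, 3: 3}, raw, war))  # True
print(schedule_ok({1: 1, 2: 2, 3: 2}, raw, war))  # False: RAW gap is only 1
```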
38
Sample Program Out of Order Execution Red = RAW 2 stage delay
Green = WAR Must come at same or later time WAW handled by reorder buffer
39
Out of Order Issue
Time 1: Issue 1 & 2
Time 2: Issue 3 & 6 to avoid 4 blocking on 1
40
Out of Order Issue Time 3 Now safe to issue 4 and 8
41
Out of Order Issue Time 4 Now safe to issue 9 5 not ready to start
42
Out of Order Issue Time 5 Now safe to issue 5
43
Out of Order Issue Time 6+ Can't start 7 until 5 is writing
44
Analysis Out of Order Can help keep pipelines full
Can't shorten time for a critical path like 1 → 4 → 5 → 7
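The limit set by the critical path follows from the 2-stage RAW delay: each instruction on the chain must trail its producer by at least two time steps, however wide the machine is. A minimal sketch, with a hypothetical helper name:

```python
RAW_DELAY = 2  # time steps a consumer must trail its producer (from the timing slides)

def earliest_issue_times(chain, start=1, delay=RAW_DELAY):
    """Earliest issue time of each instruction on a chain of RAW dependencies."""
    return {instr: start + i * delay for i, instr in enumerate(chain)}

critical_path = [1, 4, 5, 7]
times = earliest_issue_times(critical_path)
print(times)  # {1: 1, 4: 3, 5: 5, 7: 7}
```

These times agree with the out-of-order walkthrough, where 4 issues at time 3, 5 at time 5, and 7 has to wait until after time 6.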
45
Register Renaming
Programmers/compilers reuse registers for different jobs:
ldr r1,[r0]    ;get x
ldr r2,[r3]    ;get y in r2 (first use of r2)
mlt r4,r1,r2   ;z = x·y
sub r5,r1,#4   ;q = x - 4
div r2,r1,r5   ;r2 = x/(x - 4) (reuse of r2)
add r6,r4,r2   ;s = x·y + x/(x - 4)
46
Register Renaming Register renaming :
Avoiding data conflicts by reassigning registers to:
Other physical registers
Hidden shadow registers
47
Register Renaming
R2 renamed to r7
ldr r1,[r0]    ;get x
ldr r2,[r3]    ;get y in r2 (first use of r2)
mlt r4,r1,r2   ;z = x·y
sub r5,r1,#4   ;q = x - 4
div r7,r1,r5   ;r7 = x/(x - 4) (changed to r7)
add r6,r4,r7   ;s = x·y + x/(x - 4)
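One way to picture renaming is to give every register write a fresh physical register, so a later reuse of r2 can no longer conflict with earlier readers or writers. The sketch below applies that idea to the six-instruction program from the slides. The tuple encoding, the `p0`, `p1`, ... names, and the unlimited supply of physical registers are all assumptions for illustration; real hardware works from a finite free list.

```python
def is_register(name):
    """True for architectural register names such as 'r1'."""
    return name.startswith("r") and name[1:].isdigit()

def rename(program):
    """Map each destination write to a fresh physical register.

    program is a list of (op, dest, sources) tuples. Immediates such as
    '#4' pass through unchanged.
    """
    mapping = {}    # architectural name -> current physical name
    next_phys = 0
    renamed = []
    for op, dest, sources in program:
        new_sources = []
        for s in sources:
            if is_register(s):
                if s not in mapping:          # live-in value, first time seen
                    mapping[s] = f"p{next_phys}"
                    next_phys += 1
                new_sources.append(mapping[s])
            else:
                new_sources.append(s)
        mapping[dest] = f"p{next_phys}"       # fresh register for every write
        next_phys += 1
        renamed.append((op, mapping[dest], new_sources))
    return renamed

program = [
    ("ldr", "r1", ["r0"]),
    ("ldr", "r2", ["r3"]),
    ("mlt", "r4", ["r1", "r2"]),
    ("sub", "r5", ["r1", "#4"]),
    ("div", "r2", ["r1", "r5"]),   # second write to r2
    ("add", "r6", ["r4", "r2"]),
]

out = rename(program)
for op, dest, sources in out:
    print(op, dest, sources)
```

The two writes to r2 end up in different physical registers, so the div no longer has a WAR conflict with the mlt or a WAW conflict with the second ldr, while the add still reads the div's result.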
48
Register Renaming Before: After:
49
Superscalar Pro/Con
Good
  The hardware solves everything:
    Hardware solves scheduling/registers/etc.
    Compiler/programmer can still help matters
  Binary compatibility
    New hardware issues old instructions in a more efficient way
50
Transistor Count Vast majority of transistor count is to support doing work faster
51
Superscalar Pro/Con
Good
  The hardware solves everything:
    Hardware solves scheduling/registers/etc.
    Compiler/programmer can still help matters
  Binary compatibility
    New hardware issues old instructions in a more efficient way
Bad
  Complex hardware
  Limit to scale
52
VLIW: Superscalar Alternative
VLIW : Very Long Instruction Word
One bundle contains multiple instructions
Each bundle designed to schedule cleanly
53
Who does work? Compiler assembles long instructions
Reorders at compile time Compiler has more time, information
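A toy version of what the compiler does can be sketched as greedy bundle packing: each instruction goes into the earliest bundle that comes after all of its producers and still has a free slot. This is only an illustration. It assumes true (RAW) dependencies are the only constraint, that every bundle takes one time step, and a bundle width of three; the function name and dependency table are hypothetical.

```python
def pack_bundles(n_instructions, deps, width=3):
    """Greedy compile-time packing of instructions into fixed-width bundles.

    deps maps an instruction index to the set of earlier indices it needs.
    Returns a list of bundles, each a list of instruction indices.
    """
    bundle_of = {}
    bundles = []
    for i in range(n_instructions):
        # Must land strictly after every producer's bundle.
        b = max((bundle_of[d] + 1 for d in deps.get(i, ())), default=0)
        # Skip bundles that are already full.
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(i)
        bundle_of[i] = b
    return bundles

# Hypothetical dependency table for six instructions (0-indexed).
deps = {2: {0, 1}, 3: {0}, 4: {0, 3}, 5: {2, 4}}
bundles = pack_bundles(6, deps)
print(bundles)  # [[0, 1], [2, 3], [4], [5]]
```

Unused slots in a bundle would be filled with no-ops in a real VLIW encoding, which is one reason a long dependency chain wastes issue width.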
54
VLIW Uses Itanium : EPIC : Explicitly Parallel Instruction Computing
3 instruction bundles
55
VLIW Pro/Con
Good
  Simple hardware
  No scheduling responsibilities
  Potentially better optimization in compiler
Bad
  Binary compatibility : compiler builds for one specific hardware
  Good compilers are HARD to write