Download presentation
Presentation is loading. Please wait.
1
Compiler Optimization of Scalar Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry School of Computer Science Carnegie Mellon University
2
We need parallel programs
Motivation Industry is delivering multithreaded processors Improving throughput on multithreaded processors is straight forward IBM Power4 processor: 2 processor cores per die 4 dies per module bit processors per unit Proc Proc Cache How can we use multithreaded processors to improve the performance of a single application? We need parallel programs Compiler Optimization of Scalar Value Communication… - 2 -
3
Automatic Parallelization
Finding independent threads from integer programs is limited by Complex control flow Ambiguous data dependences Runtime inputs More fundamentally Parallelization is determined at compile time Thread-Level Speculation Detecting data dependences at runtime Compiler Optimization of Scalar Value Communication… - 3 -
4
Thread-Level Speculation (TLS)
Load Store Retry Load Time TLS Thread2 Thread3 How do we communicate value between threads? Compiler Optimization of Scalar Value Communication… - 4 -
5
Is efficient when the dependence occurs infrequently
Speculation Ti Ti+1 Ti+2 Ti+3 …=*q …=*q …=*q …=*q while(1) { …= *q; . . . *p = …; } *p=… *p=… *p=… Retry …=*q Is efficient when the dependence occurs infrequently Compiler Optimization of Scalar Value Communication… - 5 -
6
Is efficient when the dependence occurs frequently
Synchronization while(1) { …= a; a =…; } wait(a); signal(a); Ti Ti+1 wait a= (stall) signal =a Is efficient when the dependence occurs frequently Compiler Optimization of Scalar Value Communication… - 6 -
7
Critical Forwarding Path in TLS
while(1) { wait(a); …= a; a =…; signal(a); } wait(a); …= a; critical forwarding path a = …; signal(a); Compiler Optimization of Scalar Value Communication… - 7 -
8
Cost of Synchronization
Normalized Execution Time 50 100 sync other gcc go mcf parser perlbmk twolf vpr AVERAGE The processors spend 43% of total execution time on synchronization Compiler Optimization of Scalar Value Communication… - 8 -
9
Outline Compiler optimization Optimization opportunity
Conservative instruction scheduling Aggressive instruction scheduling Performance Conclusions Compiler Optimization of Scalar Value Communication… - 9 -
10
Reducing the Critical Forwarding Path
Long Critical Path Short Critical Path execution time execution time shorter critical forwarding path less execution time Compiler Optimization of Scalar Value Communication… - 10 -
11
A Simplified Example from GCC
do { counter = 0; if(p->jmp) p = p->jmp; if(!p->real) { p = p->next; continue; } q = p; do { counter++; q = q->next; } while(q); } while(p); start P counter=0 p->jmp? jmp p=p->jmp real next p->real? jmp real next q = p jmp real next p=p->next counter++; q=q->next; q? jmp real next p=p->next end Compiler Optimization of Scalar Value Communication… - 11 -
12
Insert Wait Before First Use of P
start counter=0 p->jmp? p=p->jmp p->real? q = p p=p->next counter++; q=q->next; q? p=p->next end Compiler Optimization of Scalar Value Communication… - 12 -
13
Insert Wait Before First Use of P
start counter=0 wait(p) p->jmp? p=p->jmp p->real? q = p p=p->next counter++; q=q->next; q? p=p->next end Compiler Optimization of Scalar Value Communication… - 13 -
14
Insert Signal After Last Definition of P
start counter=0 wait(p) p->jmp? p=p->jmp p->real? q = p p=p->next counter++; q=q->next; q? p=p->next end Compiler Optimization of Scalar Value Communication… - 14 -
15
Insert Signal After Last Definition of P
start counter=0 wait(p) p->jmp? p=p->jmp p->real? q = p p=p->next signal(p) counter++; q=q->next; q? p=p->next signal(p) end Compiler Optimization of Scalar Value Communication… - 15 -
16
Earlier Placement for Signals
start counter=0 wait(p) p->jmp? p=p->jmp p->real? q = p p=p->next signal(p) counter++; q=q->next; q? How can we systematically find these insertion points? p=p->next signal(p) end Compiler Optimization of Scalar Value Communication… - 16 -
17
Outline Compiler optimization Optimization opportunity
Conservative instruction scheduling How to compute the forwarding value Where to compute the forwarding value Aggressive instruction scheduling Performance Conclusions Compiler Optimization of Scalar Value Communication… - 17 -
18
A Dataflow Algorithm entry exit Given: Control flow graph
For each node in the graph: Can we compute the forwarding value at this node? If so, how do we compute the forwarding value? Can we forward the value at an earlier node? Compiler Optimization of Scalar Value Communication… - 18 -
19
Moving Instructions Across Basic Blocks
signal p p=p->next signal p signal p q = p p=p->next end signal p signal p end end (A) (B) (C) Compiler Optimization of Scalar Value Communication… - 19 -
20
Example: (1) start counter=0 p->jmp? p=p->jmp p->real? q = p
q=q->next; q? p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 20 -
21
Example (2) start counter=0 p->jmp? p=p->jmp p->real? q = p
signal p p=p->next q = p counter++; q=q->next; q? p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 21 -
22
Example (3) start counter=0 p->jmp? p=p->jmp p->real? q = p
signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 22 -
23
Example (4) start counter=0 p->jmp? p=p->jmp p->real?
signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 23 -
24
Example (5) start counter=0 p->jmp? p=p->jmp p->real?
signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 24 -
25
Example (6) start counter=0 p->jmp? p=p->jmp signal p
p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 25 -
26
Example (7) start counter=0 p->jmp? p=p->jmp signal p
p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 26 -
27
Nodes with Multiple Successors
signal p p=p->next signal p p=p->next signal p p=p->next Compiler Optimization of Scalar Value Communication… - 27 -
28
Example (7) start counter=0 p->jmp? p=p->jmp signal p
p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 28 -
29
Example (8) start signal p p=p->next counter=0 p->jmp?
p=p->jmp signal p p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 29 -
30
Example (9) start p=p->jmp signal p p=p->next counter=0
p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 30 -
31
Example (10) start p=p->jmp signal p p=p->next counter=0
p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 31 -
32
Nodes with Multiple Successors
p=p->jmp signal p p=p->next p=p->next signal p Compiler Optimization of Scalar Value Communication… - 32 -
33
Example (10) start p=p->jmp signal p p=p->next counter=0
p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 33 -
34
Example (11) start p=p->jmp signal p p=p->next counter=0
p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 34 -
35
Finding the Earliest Insertion Point
Find at each node in the CFG: Can we compute the forwarding value at this node? If so, how do we compute the forwarding value? Can we forward value at an earlier node? Earliest analysis A node is earliest, if on some execution path, no earlier node can compute the forwarding value Compiler Optimization of Scalar Value Communication… - 35 -
36
The Earliest Insertion Point (1)
start p=p->jmp signal p p=p->next counter=0 p->jmp? p=p->next signal p p=p->jmp signal p p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 36 -
37
The Earliest Insertion Point (2)
start p=p->jmp signal p p=p->next counter=0 p->jmp? p=p->next signal p p=p->jmp signal p p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 37 -
38
The Earliest Insertion Point (3)
start p=p->jmp signal p p=p->next counter=0 p->jmp? p=p->next signal p p=p->jmp signal p p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 38 -
39
The Earliest Insertion Point (4)
start p=p->jmp signal p p=p->next counter=0 p->jmp? p=p->next signal p p=p->jmp signal p p=p->next signal p p=p->next p->real? signal p p=p->next signal p p=p->next signal p p=p->next q = p counter++; q=q->next; q? p=p->next signal p p=p->next p=p->next signal p end Compiler Optimization of Scalar Value Communication… - 39 -
40
The Resulting Graph start counter=0 wait(p) p->jmp? p=p->jmp
p1=p->next signal(p1) p1=p->next signal(p1) p->real? q = p p=p1 counter++; q=q->next; q? p=p1 end Compiler Optimization of Scalar Value Communication… - 40 -
41
Outline Compiler optimization Optimization opportunity
Conservative instruction scheduling Aggressive instruction scheduling Speculate on control flow Speculate on data dependence Performance Conclusions Compiler Optimization of Scalar Value Communication… - 41 -
42
We Cannot Compute the Forwarded Value When…
Control Dependence Ambiguous Data Dependence update(p) signal p p=p->next signal p p=p->jmp signal p p=p->next Speculation Compiler Optimization of Scalar Value Communication… - 42 -
43
Speculating on Control Dependences
p=p->next signal(p) signal p p=p->next violate(p) p=p->jmp violate(p) p=p->jmp p=p->next signal(p) signal p p=p->next p=p->next signal p Compiler Optimization of Scalar Value Communication… - 43 -
44
Speculating on Data Dependence
p=load(add); signal(p); store1 store2 update(p) store3 p=p->next signal(p) p=load(add); signal(p); Hardware support Compiler Optimization of Scalar Value Communication… - 44 -
45
Hardware Support p = load(0xADE8); mark_load(0xADE8) p = load(addr);
unmark_load(0xADE8) mark_load(addr) store1 store2 store2(0x8438) store3 store3(0x88E8) unmark_load(addr) Compiler Optimization of Scalar Value Communication… - 45 -
46
Outline Compiler optimization Optimization opportunity
Conservative instruction scheduling Aggressive instruction scheduling Performance Hardware optimization for scalar value communication Hardware optimization for all value communication Conclusions Compiler Optimization of Scalar Value Communication… - 46 -
47
Conservative Instruction Scheduling
100 sync fail other Normalized Execution Time 50 busy gcc go mcf parser perlbmk twolf vpr AVERAGE U=No Instruction Scheduling A=Conservative Instruction Scheduling Improves performance by 15% Compiler Optimization of Scalar Value Communication… - 47 -
48
Optimizing Induction Variable Only
100 sync fail other Normalized Execution Time 50 busy gcc go mcf parser perlbmk twolf vpr AVERAGE U=No Instruction Scheduling I=Induction Variable Scheduling only A=Conservative Instruction Scheduling Is responsible for 10% performance improvement Compiler Optimization of Scalar Value Communication… - 48 -
49
Benefits from Global Analysis
Multiscalar instruction scheduling[Vijaykumar, Thesis’98] Uses local analysis to schedule instructions across basic blocks Does not allow scheduling of instruction across inner loops 100 sync fail M=Multiscalar Scheduling A=Conservative Scheduling other Normalized Execution Time 50 busy M A M A gcc go Compiler Optimization of Scalar Value Communication… - 49 -
50
Aggressively Scheduling Instructions Across Control Dependences
100 sync fail other Normalized Execution Time 50 busy gcc go mcf parser perlbmk twolf vpr AVERAGE A=Conservative Instruction Scheduling C=Aggressive Instruction Scheduling(Control) Has no performance improvement over conservative scheduling Compiler Optimization of Scalar Value Communication… - 50 -
51
Normalized Execution Time
Aggressively Scheduling Instructions Across Control and Data Dependence gcc go mcf parser perlbmk twolf vpr AVERAGE sync fail other busy Normalized Execution Time 50 100 A=Conservative Instruction Scheduling C=Aggressive Instruction Scheduling(Control) D=Aggressive Instruction Scheduling(Control + Data) Improves performance by 9% over conservative scheduling Compiler Optimization of Scalar Value Communication… - 51 -
52
Hardware Optimization Over Non-Optimized Code
Normalized Execution Time 50 100 sync fail other busy gcc go mcf parser perlbmk twolf vpr AVERAGE U=No Instruction Scheduling E=No Instruction Scheduling+Hardware Optimization Improves performance by 4% Compiler Optimization of Scalar Value Communication… - 52 -
53
Hardware Optimization + Conservative Instruction Scheduling
Normalized Execution Time 50 100 sync fail other busy gcc go mcf parser perlbmk twolf vpr AVERAGE A=Conservative Instruction Scheduling F=Conservative Instruction Scheduling+Hardware Optimization Improves performance by 2% Compiler Optimization of Scalar Value Communication… - 53 -
54
Hardware Optimization + Aggressive Instruction Scheduling
Normalized Execution Time 50 100 sync fail other busy gcc go mcf parser perlbmk twolf vpr AVERAGE D=Aggressive Instruction Scheduling O=Aggressive Instruction Scheduling+Hardware Optimization Degrades performance slightly Hardware optimization is less important with compiler optimization Compiler Optimization of Scalar Value Communication… - 54 -
55
Hardware Optimization for Communicating Both Scalar and Memory Values
Normalized Execution Time 50 100 sync fail other busy compress crafty gap gcc go gzip ijpeg m88k mcf parser perlbmk twolf vortex vpr Reduces cost of violation by 27.6%, translating into 4.7% performance improvement Compiler Optimization of Scalar Value Communication… - 55 -
56
Impact of Hardware and Software Optimizations
Normalized Execution Time 50 100 compress gap go ijpeg mcf perlbmk vortex crafty gcc gzip m88k parser twolf vpr 6 benchmarks are improved by %, 4 others by % Compiler Optimization of Scalar Value Communication… - 56 -
57
Critical forwarding path is best addressed by the compiler
Conclusions Critical forwarding path is an important bottleneck in TLS Loop induction variables serialize parallel threads Can be eliminated with our instruction scheduling algorithm Non-inductive scalars can benefit from conservative instruction scheduling Aggressive instruction scheduling should be applied selectively Speculating on control dependence alone is not very effective Speculating on both control and data dependences can reduce synchronization significantly GCC is the biggest benefactor Hardware optimization is less important as compiler schedules instructions more aggressively Critical forwarding path is best addressed by the compiler Compiler Optimization of Scalar Value Communication… - 57 -
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.