Download presentation
Presentation is loading. Please wait.
1
Compiler Optimization of scalar and memory resident values between speculative threads. Antonia Zhai et. al.
2
Improving Performance of a Single Application C C P C P C P C P Finding parallel threads is difficult T1T2T3T4 Chip Boundary Load 0x86 Store 0x86 dependence
3
Compiler Makes Conservative Decisions Cons Make conservative decisions for ambiguous dependences –Complex control flow –Pointer and indirect references –Runtime input Pros Examine the entire program … 0x80 0x86 … Iteration #1 Iteration #2Iteration #3Iteration #4 … 0x60 0x66 … … 0x90 0x96 … … 0x50 0x56 … for (i = 0; i < N; i++) { … *q *p … }
4
Using Hardware to Exploit Parallelism Search for independent instructions within a window Cons Exploit parallelism among a small number of instructions Instruction Window Pros Disambiguate dependence accurately at runtime How to exploit parallelism with both hardware and compiler? Instruction Window Load 0x88 Store 0x66 Add a1, a2 Instruction Window Load 0x88 Store 0x66 Add a1, a2
5
Thread-Level Speculation (TLS) Compiler Creates parallel threads Hardware Disambiguates dependences My goal: Speed up programs with the help of TLS Load *q Store *p dependence Sequential Speculatively parallel Store *p Load *q Store 0x86 Load 0x86 Time
6
Single Chip Multiprocessor (CMP) C P C P C P Interconnect Chip Boundary Memory Our support and optimization for TLS Built upon CMP Applied to other architectures that support multiple threads Replicated processor cores –Reduce design cost Scalable and decentralized design –Localize wires –Eliminate centralized structures Infrastructure transparency –Handle legacy codes
7
Thread-Level Speculation (TLS) Load *q Store *p dependence Speculatively parallel Time Support thread level speculation Recover from failed speculation Buffer speculative writes from the memory Track data dependences Detect data dependence violation
8
Buffering Speculative Writes from the Memory Memory C P C P C P C P Data contents Directory to maintain cache coherence Speculative state Extending cache coherence protocol to support TLS Interconnect Chip Boundary
9
Detecting Dependence Violations C P C P Interconnect store address violation detected ProducerConsumer Producer forwards all store addresses Consumer detects data dependence violation Extending cache coherence protocol [ISCA’00]
10
Synchronizing Scalars …=a a=… …=a a=… Identifying scalars causing dependences is straightforward Time Producer Consumer
11
Synchronizing Scalars …=a a=… …=a a=… Dependent scalars should be synchronized Time Signal(a) Wait(a) Producer Consumer Use forwarded value
12
Reducing the Critical Forwarding Path Long Critical PathShort Critical Path Instruction scheduling can reduce critical forwarding path Critical Path wait …=a a = … signal Time
13
Potential
14
Compiler Infrastructure Loops were targeted. –Selected loops were discarded Low coverage (<0.1%) Less than 30 instructions / iteration >16384 instructions / iteration Loops were then profiled and simulated for measuring optimistic upper bound. Good loops were chosen based on the results. Compiler inserts instructions to create and manage epochs. Compiler allocates forward variables on stack (forwarding frame) Compiler inserts wait and signal instructions.
16
Synchronization Constraints Wait before first use Signal after last definition Signal on every possible path Wait should be as late as possible Signal should be as early as possible.
17
Data flow Analysis for synchronization The set of forwarding scalars are defined as intersection of set of scalars with downwards exposed definition and upwards exposed use. Scalars live outside loop are also included. CFG is modeled as a graph, with BB as nodes.
18
Instruction Scheduling Conservative scheduling Aggressive scheduling –Control Dependences –Data dependences
19
Instruction Scheduling … = a a = … Initial Synchronization wait(a) signal(a) Scheduling Instructions Speculatively Scheduling Instructions
20
Instruction Scheduling Scheduling Instructions wait(a) … = a a = … signal(a) … = a a = … wait(a) Initial Synchronization signal(a) a = … signal(a) a = … signal(a) a = … signal(a) a = … signal(a) a = … signal(a) a = … signal(a) a = … signal(a) Speculatively Scheduling Instructions
21
Instruction Scheduling Scheduling Instructions wait(a) … = a a = … signal(a) … = a a = … wait(a) Initial Synchronization Speculatively Scheduling Instructions a = … signal(a) wait(a) … = a *q=… a = … signal(a) *q=… signal(a)
22
Instruction Scheduling Dataflow Analysis: Handles complex control flow We Define Two Dataflow Analyses: “Stack” analysis: finds instructions needed to compute the forwarded value “ “Earliest” analysis: finds the earliest node to compute the forwarded value
23
Computation Stack Stores the instructions to compute a forwarded value Associating a stack to every node for every forwarded scalar We don’t know how to compute the forwarded value We know how to compute the forwarded value signal a a = a*11 Not yet evaluated
26
A Simplified Example from GCC do { } while(p);... start p=p->jmp q = p; p->real? p=p->next counter++; q=q->next; q? end counter=0 wait(p) p->jmp?
27
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p counter=0 wait(p) p->jmp?
28
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next counter=0 wait(p) p->jmp?
29
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next
30
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next signal p p=p->next
31
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
32
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
33
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next
34
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
35
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
36
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
37
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp Node to Revisit
38
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
39
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit
40
Stack Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Node to Revisit signal p p=p->next signal p p=p->next signal p p=p->next Solution is consistent
41
Scheduling Instructions Dataflow Analysis: Handles complex control flow We Define Two Dataflow Problems: “Stack” analysis: finds instructions needed to compute the forwarded value. “ “Earliest” analysis: finds the earliest node to compute the forwarded value.
42
The Earliest Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Earliest Not Earliest Not Evaluated
43
The Earliest Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Earliest Not Earliest Not Evaluated
44
The Earliest Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Earliest Not Earliest Not Evaluated
45
The Earliest Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Earliest Not Earliest Not Evaluated
46
The Earliest Analysis start p=p->jmp q = p p->real? p=p->next counter++; q=q->next; q? end signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next signal p p=p->next p=p->jmp counter=0 wait(p) p->jmp? signal p p=p->next signal p p=p->next Earliest Not Earliest
47
Code Transformation start q = p p->real? p=p->next counter++; q=q->next; q? end counter=0 wait(p) p->jmp? p2=p->jmp p1=p2->next Signal(p1) p1=p->next Signal(p1) Earliest Not Earliest
48
Instruction Scheduling Scheduling Instructions wait(a) … = a a = … signal(a) Speculatively Scheduling Instructions a = … signal(a) wait(a) … = a *q=… a = … signal(a) *q=… … = a a = … wait(a) Initial Synchronization signal(a)
49
Aggressive scheduling Optimize common case by moving signal up. On wrong path, send violate_epoch to next thread and then forward correct value. If instructions are scheduled past branches, then on occurrence of an exception, a violation should be sent and non-speculative code should be executed. Changes: –Modify Meet operator to compute stack for frequently executed paths –Add a new node for infrequent paths and insert violate_epoch signals. Earliest is set to true for these nodes.
50
Aggressive scheduling Two new instructions –Mark_load – tells H/W to remember the address of location Placed when load moves past store –Unmark_load – clears the mark. placed at original position of load instruction. In the meet operation, conflicting load marks are merged using logical or.
51
Speculating Beyond a Control Dependence signal p p=p->next signal p p=p->jmp p=p->next end signal p Frequently Executed Path end violate(p) p=p->jmp signal(p) p1=p->next signal(p1) Frequently Executed Path
52
Speculating Beyond a Potential Data Dependence *q = NULL signal p p=p->next Hardware support Similar to memory conflict buffer [Gallagher et al, ASPLOS’94] signal p p = p->next end Profiling Information *q = NULL end p = p->next signal(p) Speculative Load
53
Experimental Framework Benchmarks –SPECint95 and SPECint2000, -O3 optimization Underlying architecture –4-processor, single-chip multiprocessor –speculation supported by coherence Simulator –superscalar, similar to MIPS R14K –simulates communication latency –models all bandwidth and contention detailed simulation C C P C P Interconnect C P C P
54
Impact of Synchronization Stalls for Scalars Performance bottleneck: synchronization (40% of execution time) 0 100 gcc go mcf parser perlbmk twolf vpr compress crafty gap gzip ijpeg m88ksim vortex Detailed simulation: TLS support 4-processor CMP 4-way issue, out-of-order superscalar 10-cycle communication latency Synchronization Stall Other Norm. Region Exec. Time
55
Instruction Scheduling U=No Instruction Scheduling A=Instruction Scheduling Improves performance by 18% gcc go mcf parser perlbmk twolf vpr compress crafty gap gzip ijpeg m88ksim vortex 0 100 UAUAUAUAUAUAUAUAUAUAUAUAUAUA Still room for improvement Synchronization Stall Failed Speculation Other Busy Norm. Region Exec. Time 5%22%40%
56
Aggressively Scheduling Instructions A=Instruction Scheduling S=Speculating Across Control & Data Dependences Significantly for some benchmarks gcc mcf parser perlbmk twolf Synchronization Stall Failed Speculation Other Busy A AA A A SSSSS 0 100 Norm. Region Exec. Time 17% 19% 15%
57
Conclusions 6 of 14 applications, performance improvement by 6.2-28.5% Synchronization and Parallelization exposes some new data dependences. Speculative scheduling past control and data dependences gives a better performance.
58
Memory Variables Difficulty due to –Aliasing : Traditional data flow analysis doesn’t help –No clear way of defining location of last definition or first use. Potential gain in performance by reducing the failed cycles can be seen below.
59
Synchronizing Hardware Signal both the address and value to the next thread. Producer requires a signal address buffer for ensuring correct execution. –If a store address is found in the signal address buffer -> misspeculation. Consumer has a local flag (use forward flag) to decide whether to load the value from the speculative cache or the local memory –The flag is reset if the same thread writes to the memory location before reading. –NULL address is handled as any other address, when an exception is caused, non-speculative code is used.
60
Compiler support Each load and store are assigned an ID based on the call stack and are profiled for dependence. A dependence graph is constructed and all Ids accessing the same location are grouped. All loads and stores belonging to the same group are synchronized by the compiler. The compiler then clones all the procedures on the call stack containing frequent data dependences, so that synchronization is executed only in the context of the call stack. The original code is then modified to include these cloned procedures. Data flow analysis is performed similar to scalar variables to insert a signal for the last store at the end of every path.
61
Analysis of data dependence patterns. Potential study to decide when to speculate and when to synchronize. –When inter-epoch dependences in more than 5% of all epochs are predicted correctly, to obtain significant performance improvement. –Dependence distances of 1 epoch have significant effect on speedup.
62
Impact of Failed Speculation on Memory-Resident Values Next performance bottleneck: failed speculation (30% of execution) Detailed simulation: TLS support 4-processor CMP 4-way issue, out-of-order superscalar 10-cycle communication latency Failed Speculation Other Norm. Region Exec. Time 0 100 go m88ksim ijpeg gzip_comp gzip_decomp vpr_place gcc mcf crafty parser perlbmk gap bzip2_comp bzip2_decomp twolf
63
Failed Speculation Synchronization Stall Other Busy U=No synchronization inserted C=Compiler-Inserted Synchronization Seven benchmarks speedup by 5% to 46% Compiler-Inserted Synchronization 0 100 go m88ksim ijpeg gzip_comp gzip_decomp vpr_place gcc mcf crafty parser perlbmk gap bzip2_comp U C U C U C U C U C U C U C U C U C U C U C U C U C 10%46%13%5%8%5%21% Norm. Region Exec. Time
64
Compiler- vs. Hardware-Inserted Synchronization 0 100 go m88ksim ijpeg gzip_comp gzip_decomp vpr_place gcc mcf crafty parser perlbmk gap bzip2_comp C H C=Compiler-Inserted Synchronization H=Hardware-Inserted Synchronization Compiler and hardware each benefits different benchmarks Norm. Region Exec. Time Failed Speculation Synchronization Stall Other Busy Hardware does better Compiler does better
65
Combining Hardware and Compiler Synchronization C=Compiler-inserted synchronization H=Hardware-inserted synchronization B=Combining Both The combination is more robust 0 100 go m88ksim gzip_comp gzip_decomp perlbmk gap C H B Norm. Region Exec. Time Failed Speculation Synchronization Stall Other Busy
66
Conclusion There is performance improvement by compiler inserted memory synchronization The hardware synchronization and compiler synchronization target different memory instructions and together do a better job.
67
Some other Issues Register pressure due to scheduling scalar and memory instructions. Profiling time for obtaining frequency of data dependences. Heuristics : choice of thresholds.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.