Improving Value Communication for Thread-Level Speculation


1 Improving Value Communication for Thread-Level Speculation
Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry; School of Computer Science, Carnegie Mellon University. My name is Greg Steffan, and this is work done by… I'll begin by motivating this work.

2 Multithreaded Machines Are Everywhere
Multithreaded machines are becoming more and more common; many current designs are single-chip multiprocessors and simultaneously-multithreaded processors (e.g., Alpha 21464, Intel Xeon, Sun MAJC, IBM Power4, SiByte SB-1250). But how can we use these multithreaded machines to improve performance, in particular the performance of a single application? The answer is parallelism. How can we use them? Parallelism!

3 Automatic Parallelization
Proving independence of threads is hard: complex control flow; complex data structures; pointers, pointers, pointers; run-time inputs. How can we make the compiler's job feasible? Ideally, we'd like the compiler to parallelize every application that we care about for us. Traditionally, compilers have parallelized by proving that potential threads are independent, but this is extremely difficult, if not impossible, for many general-purpose programs. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which allows the compiler to create parallel threads if it is *likely* that they are independent. Let's look at a very quick example.

4 Thread-Level Speculation
We begin with a sequential program that we carve into chunks of work that we call epochs (E1, E2, E3). Under TLS, we execute these epochs speculatively in parallel, allowing us to extract available thread-level parallelism; if speculation fails, we re-execute the appropriate epoch. What causes speculation to fail? A true data dependence between epochs was not preserved; in other words, a dependent load and store executed out of order. When epoch 3 is re-executed, the proper value is communicated to the load. There are in fact several possible forms of value communication when speculating in parallel; let's take a closer look at each of them. Exploit available thread-level parallelism.
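To make the epoch model concrete, here is a minimal C sketch of how a pointer-chasing loop might be carved into epochs. spawn_next_epoch() and end_epoch() are hypothetical primitives standing in for the TLS runtime, not the paper's actual interface; the concurrency is implicit in the spawn.

```c
struct node { int val; struct node *next; };
extern int  process(int v);          /* stand-in for the epoch's work       */
extern void spawn_next_epoch(void);  /* hypothetical: fork next iteration   */
extern void end_epoch(void);         /* hypothetical: commit in order, or   */
                                     /* re-execute on a violated dependence */

/* Each iteration becomes an epoch that runs speculatively in parallel
 * with its successors; the hardware tracks loads and stores, squashing
 * and retrying a later epoch that reads a value too early. */
void tls_loop(struct node *head) {
    for (struct node *n = head; n != NULL; n = n->next) {
        spawn_next_epoch();
        n->val = process(n->val);    /* speculative loads and stores */
        end_epoch();
    }
}
```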

5 Speculate (good when p != q)
For any load-store pair that rarely points to the same memory location, the most efficient option is to speculate: we execute the accesses speculatively in parallel, in whatever order they happen to issue. However, this method is inefficient when they frequently point to the same memory location, since speculation will then fail and incur the cost of recovery. Good when p != q.

6 Synchronize (and forward)
In cases where a value is frequently communicated between epochs, it is best to synchronize and forward, and hence avoid failed speculation and recovery. We accomplish this by forcing the load to wait until the corresponding store has issued, using wait/signal synchronization. However, synchronization creates what we call a critical forwarding path. Good when p == q.
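As a concrete illustration, here is a minimal C sketch of the wait/signal pattern; wait_for_value() and signal_value() are hypothetical forwarding primitives, not the actual ISA extension.

```c
extern int  compute(void);
extern void use(int v);
/* hypothetical forwarding primitives: */
extern void signal_value(int *addr, int val);  /* publish to next epoch */
extern int  wait_for_value(int *addr);         /* stall until signalled */

/* Epoch i: produce the value and signal it onward. */
void producer_epoch(int *p) {
    *p = compute();           /* the store the next epoch depends on */
    signal_value(p, *p);
}

/* Epoch i+1: stall for the forwarded value instead of speculating,
 * so speculation can no longer fail on this dependence. */
void consumer_epoch(int *q) {
    int v = wait_for_value(q);
    use(v);
}
```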

7 Reduce the Critical Forwarding Path
The critical forwarding path within an epoch is the code between the synchronized load and the corresponding store and signal; the forwarded value flows through the epochs in sequence. If we can reduce the size of the critical forwarding path by moving code out of it, we can decrease execution time. But we can do better still if the value is predictable: rather than wait for synchronization, we can instead predict the value.
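A small before/after sketch of what moving code out of the critical forwarding path looks like, reusing the hypothetical wait/signal primitives from the previous sketch; X and heavy_unrelated_work() are illustrative names.

```c
static int X;                          /* the forwarded location */
extern void heavy_unrelated_work(void);
extern void signal_value(int *addr, int val);
extern int  wait_for_value(int *addr);

/* Before scheduling: unrelated work sits between the wait and the
 * signal, so the next epoch stalls for longer than necessary. */
void epoch_before(void) {
    int x = wait_for_value(&X);
    heavy_unrelated_work();            /* does not feed the store to X */
    X = x + 1;
    signal_value(&X, X);
}

/* After scheduling: the unrelated work moves below the signal; the
 * critical forwarding path shrinks to just the increment. */
void epoch_after(void) {
    int x = wait_for_value(&X);
    X = x + 1;
    signal_value(&X, X);
    heavy_unrelated_work();
}
```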

8 Predict (good when p == q and *q is predictable)
With successful prediction, we can eliminate synchronization stall time and make the epochs completely independent. However, we don't want to predict all of the time, since misprediction is expensive. Good when p == q and *q is predictable.

9 Improving on Compile-Time Decisions
Let's try to get a clearer picture of how these techniques work together. For each pair of accesses, the compiler decides whether to speculate or synchronize, and this requires the hardware to support speculation and synchronization. However, the hardware can exploit run-time information to improve on these compile-time decisions: for instance, it can convert speculation into synchronization for cases where there are frequent dependences. Hardware can also predict, and hence convert speculation and synchronization into prediction whenever appropriate. Finally, both the compiler and hardware can reduce the critical forwarding path to improve performance. But before we continue, we want to know whether it will be worth the effort. Is there any potential benefit?

10 Potential for Improving Value Communication
Here we have the results of an ideal limit study, using a detailed 4-processor single-chip multiprocessor that supports TLS. We plot execution time relative to that of the sequential case, so bars above 100 are slowdowns and bars below 100 are speedups. For each benchmark, the first bar shows performance when neither the compiler nor the hardware has optimized value communication in any way, and the second bar shows performance when we perfectly predict the value of any inter-thread dependence. We observe that efficient value communication makes the difference between slowing down and speeding up in almost every case, and hence is a good target for compiler and hardware optimization. U=Un-optimized, P=Perfect Prediction (4 Processors). Efficient value communication is key.

11 Our Support for Thread-Level Speculation
Outline Our Support for Thread-Level Speculation Compiler Support Experimental Framework Baseline Performance Techniques for Improving Value Communication Combining the Techniques Conclusions heres an outline of the rest of this talk first I will describe our support for TLS, including our compiler support, experimental framework, and baseline performance next we’ll investigate the various techniques for improving value communication then we’ll look at how they all jive together and then conclude

12 Compiler Support (SUIF1.3 and gcc)
1) Where to speculate: use profile information, heuristics, loop unrolling. 2) Transforming to exploit TLS: insert new TLS-specific instructions that synchronize/forward register values. 3) Optimization: eliminate dependences due to loop induction variables; an algorithm to schedule the critical forwarding path. Our compiler is composed of several SUIF passes with gcc as a back-end. For this work, our compiler only targets loops, although TLS is applicable to other structures. The first pass uses profile information and heuristics to decide where to speculate and whether to unroll loops. Next, the compiler inserts new instructions which manage speculative threads and synchronize and forward register values. Finally, optimization passes remove dependences due to loop induction variables and schedule code to reduce the critical forwarding paths. So our compiler does a lot; the compiler plays a crucial role.
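To illustrate the induction-variable optimization, here is a hedged sketch (illustrative, not the compiler's actual output): instead of forwarding the induction variable from epoch to epoch, each epoch derives it locally from a hypothetical my_epoch_num() intrinsic, again reusing the hypothetical wait/signal primitives.

```c
static int I;                      /* the loop induction variable */
extern void body(int i);
extern void signal_value(int *addr, int val);   /* hypothetical, as before */
extern int  wait_for_value(int *addr);
extern int  my_epoch_num(void);    /* hypothetical: this epoch's index */

/* Before: i is a cross-epoch dependence, serializing the epochs. */
void epoch_unoptimized(void) {
    int i = wait_for_value(&I);    /* stall for the predecessor */
    body(i);
    signal_value(&I, i + 1);       /* forward to the successor  */
}

/* After: each epoch computes i independently; no forwarding at all. */
void epoch_optimized(int loop_start, int stride) {
    int i = loop_start + my_epoch_num() * stride;
    body(i);
}
```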

13 Experimental Framework
Benchmarks: from SPECint95 and SPECint2000, -O3 optimization. Underlying architecture: 4-processor, single-chip multiprocessor; speculation supported by coherence. Simulator: superscalar, similar to MIPS R10K; models all bandwidth and contention. Here is our experimental framework. We simulate benchmarks from SPECint95 and SPECint2000, compiled with -O3 optimization; I'll be showing a subset of the results available in the paper. We assume a 4-processor single-chip multiprocessor where each processor has its own primary data cache, connected to a unified, shared second-level cache by a crossbar. We simulate the hardware support for TLS described in our previous papers: in a nutshell, an extended version of invalidation-based cache coherence tracks data dependences, and the data caches are used to buffer speculative state. Our processor model is configured similarly to a MIPS R10K, but with a 128-entry reorder buffer. We model all bandwidth, latency, and contention in the system. Detailed simulation!

14 Compiler Performance
This last bar shows performance after our compiler has removed dependences due to loop induction variables and used scheduling to reduce stall time for synchronized values. Note the significant reduction in the amount of white synchronization time, in go for example. We see that several benchmarks still slow down, go and mcf now break even, and crafty, m88ksim, and vpr are speeding up. This last bar will be our baseline for the rest of the talk. We've seen that compiler optimization is very important for decent performance, but can hardware make things even better? S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized. Compiler optimization is effective.

15 Techniques for Improving Value Communication
Outline: Our Support for Thread-Level Speculation; Techniques for Improving Value Communication (When Prediction is Best: Memory Value Prediction, Forwarded Value Prediction, Silent Stores; When Synchronization is Best); Combining the Techniques; Conclusions. Next we will evaluate the two classes of techniques for improving value communication: when prediction is the right thing to do, and when synchronization is best. We start by looking at the prediction of memory values, then of forwarded values, and finally the impact of exploiting silent stores.

16 Memory Value Prediction
When a load depends on a store from a previous epoch, speculation will fail. We can avoid failed speculation by instead predicting the value of the load, if it is predictable. Avoid failed speculation if *q is predictable.

17 Value Predictor Configuration
We evaluate such prediction using an aggressive hybrid predictor: context and stride predictors indexed by the load PC, selected by 2-bit confidence counters; we predict only when confidence is high. Configuration: 1K x 3-entry context predictor and 1K-entry stride predictor; 2-bit, up/down, saturating confidence counters. Predict only when confident.
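Here is a minimal C sketch of such a predictor, under stated assumptions: the 1K-entry stride component with a 2-bit up/down saturating confidence counter is shown in full, while the 3-entry context component is stubbed out in a comment (a real one would hash the last three values into a pattern table). Indexing and tie-breaking between components are assumptions, not the paper's exact design.

```c
#include <stdint.h>
#include <stdbool.h>

#define N 1024  /* 1K entries, as on the slide */

/* stride component: last value + stride, guarded by 2-bit confidence */
typedef struct { uint32_t last; int32_t stride; uint8_t conf; } Stride;
static Stride stab[N];

static unsigned idx(uint32_t pc) { return (pc >> 2) % N; }

/* returns true and a predicted value only when confident */
bool predict(uint32_t pc, uint32_t *out) {
    Stride *s = &stab[idx(pc)];
    /* a real hybrid would also consult the 3-entry context component
     * and prefer whichever is confident; omitted here for brevity */
    if (s->conf >= 2) {                  /* confident: predict */
        *out = s->last + (uint32_t)s->stride;
        return true;
    }
    return false;                        /* not confident: no prediction */
}

/* update on the actual value (only for successfully committed epochs) */
void update(uint32_t pc, uint32_t actual) {
    Stride *s = &stab[idx(pc)];
    if (s->last + (uint32_t)s->stride == actual) {
        if (s->conf < 3) s->conf++;      /* up/down saturating counter */
    } else {
        if (s->conf > 0) s->conf--;
        s->stride = (int32_t)(actual - s->last);
    }
    s->last = actual;
}
```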

18 Throttling Prediction
Predicting all loads would be an unnecessary risk, especially given the high cost of misprediction. Instead, why not focus solely on those loads that might possibly cause speculation to fail? A load is exposed if it is not preceded by a store to the same address within the same epoch. This information is readily available, since our hardware already tracks which words in each cache line have been speculatively modified, and we use it to determine whether a load is exposed. So we predict only loads that are exposed.
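A one-function sketch of the exposed-load test; representing the hardware's per-word speculatively-modified state as a bit mask is an assumption about the encoding, not the actual cache design.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t sm_bits;   /* one bit per word: speculatively modified */
    /* ... tag, data, and other per-line TLS state elided ...      */
} CacheLine;

/* A load is exposed if some word it reads has NOT already been stored
 * by the same epoch; only such loads can cause a violation, so only
 * they are candidates for prediction. */
bool load_is_exposed(const CacheLine *line, uint32_t word_mask) {
    return (line->sm_bits & word_mask) != word_mask;
}
```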

19 Memory Value Prediction
Here we see the performance of our predictor on all exposed loads, where incorrect predictions are red, correct predictions are green, and blue means no prediction was made. We see that there is a significant fraction of correct predictions and that misprediction is quite rare, so we expect memory value prediction to be effective. Exposed loads are fairly predictable.

20 Memory Value Prediction
And we see in the third bar that performance is much better: for most benchmarks we have either preserved the performance of the baseline or improved it, such as crafty by 3% and m88ksim by 45%. So memory value prediction can be effective if properly throttled. B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads. Effective if properly throttled.

21 Forwarded Value Prediction
Similarly, when the compiler has decided to synchronize and forward a value between epochs, we can eliminate the stall time if the value is predictable: rather than waiting, we just go ahead and predict the value. Avoid synchronization stall if X is predictable.

22 Forwarded Value Prediction
Forwarded values are also fairly predictable; note also that misprediction rates are still low. Forwarded values are also fairly predictable.

23 Forwarded Value Prediction
As we see in the third bar, this fixes things for crafty but not for vpr, while maintaining the improvements for gzip, m88ksim, and mcf. B=Baseline, F=Predict Forwarded Val's, S=Predict Stalling Val's. Only predict loads that have caused stalls.

24 Exploiting Silent Stores
We can also avoid failed speculation by exploiting a recently-discovered phenomenon called silent stores. A store is silent if the value it is storing is the same as the value already in memory, and hence it has no effect. In this case, we predict that a store will be silent (i.e., will not modify memory), squash it, and replace it with a load that verifies that the store was in fact silent. This way, any dependent speculative load will be successful. Avoid failed speculation if the store is silent.
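In the paper this squash-and-verify happens in hardware; the following C sketch only illustrates the policy.

```c
#include <stdbool.h>

/* Predict that the store is silent: verify with a load first, and only
 * perform the real store, which could violate a later epoch's
 * speculative load, if the value actually differs.
 * Returns true when the store was silent. */
bool store_if_not_silent(int *addr, int value) {
    if (*addr == value)      /* silent: memory unchanged, so dependent */
        return true;         /* speculative loads remain valid         */
    *addr = value;           /* not silent: perform the store          */
    return false;
}
```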

25 silent stores are prevalent
This graph shows the fraction of non-stack stores that are silent (green), within the regions of code that we speculatively parallelize; we see that they are prevalent. Silent stores are prevalent.

26 Impact of Exploiting Silent Stores
Hence squashing silent stores leads to a decent improvement for most benchmarks; in fact, we achieve most of the benefits of memory value prediction without all the hassle. B=Baseline, SS=Exploit Silent Stores. Most of the benefits of memory value prediction.

27 Techniques for Improving Value Communication
Outline: Our Support for Thread-Level Speculation; Techniques for Improving Value Communication (When Prediction is Best; When Synchronization is Best: Hardware-Inserted Dynamic Synchronization, Reducing the Critical Forwarding Path); Combining the Techniques; Conclusions. Next we'll look at techniques for when synchronization is best, including hardware-inserted dynamic synchronization and reducing the critical forwarding path through instruction prioritization.

28 Hardware-Inserted Dynamic Synchronization
Besides prediction, another option for avoiding failed speculation is to insert synchronization dynamically. This works well when there is a frequent dependence that the compiler missed and did not synchronize: a load that depends on a store from a previous epoch is stalled until that store has executed, and hence failed speculation is avoided. Avoid failed speculation.
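A sketch of the decision, where in_violating_list() and stall_until_store() are hypothetical stand-ins for the hardware's violating-loads tracking and stall mechanism.

```c
#include <stdint.h>
#include <stdbool.h>

extern bool in_violating_list(uint32_t pc);           /* kept failing? */
extern void stall_until_store(const uint32_t *addr);  /* hypothetical  */

/* On each speculative load: if this load PC has recently caused
 * violations, synchronize (stall) instead of speculating through it. */
uint32_t dynamic_sync_load(uint32_t pc, const uint32_t *addr) {
    if (in_violating_list(pc))
        stall_until_store(addr);   /* avoid a near-certain violation */
    return *addr;                  /* otherwise speculate as usual   */
}
```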

29 Hardware-Inserted Dynamic Synchronization
We throttle dynamic synchronization by synchronizing only loads that have caused violations, periodically resetting that set, and additionally requiring a load to have caused at least 4 violations since the last reset before synchronizing. None of these throttling attempts remedies the situation for vpr, although overall the technique is successful, giving an average improvement of 9% and allowing vortex to speed up for the first time. B=Baseline, D=Sync. Violating Ld.s, R=D+Reset, M=R+Minimum. Overall average improvement of 9%.

30 Reduce the Critical Forwarding Path
As a reminder, our next technique improves performance when synchronization is the right thing to do: by reducing the critical forwarding path, we increase the amount of parallel overlap and thereby decrease execution time.

31 Prioritizing the Critical Forwarding Path
This figure shows the critical forwarding path as generated by the compiler: Load r1=X; op r2=r1,r3; op r5=r6,r7; op r6=r5,r8; Store r2,X; Signal. We mark the input chain of the critical store (here the load and the op producing r2) and give those marked instructions high issue priority, thereby shrinking the critical forwarding path between the synchronized load and store. Mark the input chain of the critical store; give marked instructions high issue priority.
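A sketch of the marking step as a backward walk over def-use edges from the critical store; the IR representation (producer pointers, fixed source count, a bounded worklist) is an assumption for illustration.

```c
#define MAX_SRCS 3
#define MAX_CHAIN 256   /* assumed bound on chain length */

typedef struct Inst {
    struct Inst *producer[MAX_SRCS];  /* def of each source operand      */
    int num_srcs;
    int high_priority;                /* issue hint consumed by the core */
} Inst;

/* Mark the input chain of the critical store (simple DFS; the visited
 * check doubles as the mark, so each instruction is handled once). */
void mark_critical_chain(Inst *critical_store) {
    Inst *stack[MAX_CHAIN];
    int top = 0;
    stack[top++] = critical_store;
    while (top > 0) {
        Inst *i = stack[--top];
        if (i->high_priority) continue;
        i->high_priority = 1;
        for (int s = 0; s < i->num_srcs; s++)
            if (i->producer[s] && top < MAX_CHAIN)
                stack[top++] = i->producer[s];
    }
}
```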

32 Critical Path Prioritization
Does prioritization have any effect? The green portion indicates the fraction of all issued instructions that were marked as being on the critical forwarding path and actually issued earlier than they would have without prioritization. We find that some reordering actually happens. Some reordering.

33 Impact of Prioritizing the Critical Path
However, the performance impact is minimal, with only mcf and parser showing slight improvements. Part of the problem is that most benchmarks are more limited by failed speculation (red) than by synchronization (white), meaning the compiler has already done an adequate job of scheduling the critical forwarding paths. We therefore do not recommend this hardware technique, since its benefit is not worth the complexity. B=Baseline, S=Prioritizing Critical Path. Not much benefit, given the complexity.

34 Combining the Techniques
Outline: Our Support for Thread-Level Speculation; Techniques for Improving Value Communication; Combining the Techniques; Conclusions. Finally, let's look at how these techniques perform when combined.

35 Combining the Techniques
The techniques are orthogonal, with one exception: memory value prediction and dynamic synchronization. We want to synchronize only memory values that are unpredictable. We achieve this cooperative behaviour by having the dynamic synchronization logic check the prediction confidence for a given load, and synchronize only when confidence is low.
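A sketch of the combined policy, reusing the hypothetical predict(), in_violating_list(), and stall_until_store() from the earlier sketches: prediction confidence gates dynamic synchronization.

```c
#include <stdint.h>
#include <stdbool.h>

extern bool predict(uint32_t pc, uint32_t *out);      /* predictor sketch */
extern bool in_violating_list(uint32_t pc);
extern void stall_until_store(const uint32_t *addr);

/* On an exposed load: predict if confident; otherwise, if this load has
 * been causing violations, synchronize; otherwise just speculate. */
uint32_t combined_load(uint32_t pc, const uint32_t *addr) {
    uint32_t v;
    if (predict(pc, &v))          /* predictable: no stall, no violation */
        return v;                 /* (verified when the epoch commits)   */
    if (in_violating_list(pc))
        stall_until_store(addr);  /* unpredictable and troublesome: sync */
    return *addr;
}
```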

36 Combining the Techniques
The fourth bar is the perfect-prediction result, for comparison. We achieve very close to the ideal for m88ksim, and have improved crafty and mcf significantly. We have also improved performance for compress and parser, although they still do not speed up. We also observe that failed speculation is still significant, indicating directions for future research. B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction. Close to ideal for m88ksim and vpr.

37 Conclusions
Prediction: memory value prediction is effective when throttled; forwarded value prediction is effective when throttled; silent stores are prevalent, and exploiting them is effective. Synchronization: dynamic synchronization can help or hurt; hardware prioritization is ineffective if the compiler is good. Prediction is effective; synchronization has mixed results.

38 BACKUPS

39 Goals
1) Parallelize general-purpose programs: a difficult problem. 2) Keep hardware support simple and minimal: avoid large, specialized structures; preserve the performance of non-TLS workloads. 3) Take full advantage of the compiler: region selection, synchronization, optimization. This work is differentiated by the following goals. First, parallelize general-purpose programs: while our support works for numeric applications, we concentrate on the more difficult problem of parallelizing integer codes. Second, preserve the performance of non-speculative workloads: we don't want to degrade the performance of the memory system, nor have an architecture that is overly specialized for speculative execution. Finally, take full advantage of the compiler, which in our approach performs a number of important tasks.

40 Potential for Further Improvement
point

41 Pipeline Parameters
Issue Width: 4
Functional Units: 2 Int, 2 FP, 1 Mem, 1 Branch
Reorder Buffer Size: 128
Integer Multiply: 12 cycles
Integer Divide: 76 cycles
All Other Integer: 1 cycle
FP Divide: 15 cycles
FP Square Root: 20 cycles
All Other FP: 2 cycles
Branch Prediction: GShare (16KB, 8 history bits)

42 Memory Parameters
Cache Line Size: 32B
Instruction Cache: 32KB, 4-way set-assoc
Data Cache: 32KB, 2-way set-assoc, 2 banks
Unified Secondary Cache: 2MB, 4-way set-assoc, 4 banks
Miss Handlers: 16 for data, 2 for insts
Crossbar Interconnect: 8B per cycle per bank
Minimum Miss Latency to Secondary Cache: 10 cycles
Minimum Miss Latency to Local Memory: 75 cycles
Main Memory Bandwidth: 1 access per 20 cycles

43 When Prediction is Best
Predicting under TLS: only update the predictor for successful epochs; the cost of misprediction is high (we must re-execute the epoch); each epoch requires a logically-separate predictor. Differentiation from previous work: loop induction variables are already optimized by the compiler; we parallelize larger regions of code, hence a larger number of memory dependences between epochs. While value prediction for uniprocessors is fairly well understood, value prediction under TLS is relatively new, and there are some subtle differences in its support. First, we only want to update the predictor for successful epochs; this is similar to uniprocessor value prediction in the midst of branch speculation, but at a larger scale. We need either large buffers to hold predictor updates until commit time, or the ability to back up and restore the value predictor when speculation fails. Second, the cost of misprediction is high, since we must re-execute the entire epoch; hence we must be selective about what we predict. Finally, each epoch requires a logically-separate predictor, since multiple epochs may need to simultaneously predict different versions of the same location. This work is differentiated from previous work on value prediction for TLS because we have already removed the easy-to-predict loop induction variables, and because we parallelize larger regions of code.

44 Benchmark Statistics: SPECint2000
Application | Portion of Dynamic Execution Parallelized | Unique Parallelized Regions | Avg. Epoch Size (dynamic insts) | Avg. Epochs per Dynamic Region Instance
BZIP2 | 98.1% | 1 | 251.5 |
CRAFTY | 36.1% | 34 | 30.8 | 1315.7
GZIP | 70.4% | | 1307.0 | 2064.8
MCF | 61.0% | 9 | 206.2 | 198.9
PARSER | 36.4% | 41 | 271.1 | 19.4
PERLBMK | 10.3% | 10 | 65.1 | 2.4
VORTEX2K | 12.7% | 6 | 1994.3 | 3.4
VPR | 80.1% | | 90.2 | 6.3

45 Benchmark Statistics: SPECint95
Application | Portion of Dynamic Execution Parallelized | Unique Parallelized Regions | Avg. Epoch Size (dynamic insts) | Avg. Epochs per Dynamic Region Instance
COMPRESS | 75.5% | 7 | 188.2 | 68.4
GO | 31.3% | 40 | 2252.7 | 56.2
IJPEG | 90.6% | 23 | 1499.8 | 33.8
LI | 17.0% | 3 | 176.4 | 124.9
M88KSIM | 56.5% | 6 | 840.4 | 50.2
PERL | 43.9% | 4 | 137.3 | 2.2

46 Memory Value Prediction
Application | Avg. Exposed Loads per Epoch | Incorrect | Correct | Not Confident
COMPRESS | 12.0 | 0.3% | 31.8% | 67.9%
CRAFTY | 4.5 | 3.0% | 48.6% | 48.3%
GO | 7.8 | 2.5% | 41.2% | 56.2%
GZIP | 66.6 | 1.4% | 52.8% | 45.7%
M88KSIM | 7.5 | 1.2% | 90.9% | 7.7%
MCF | 2.5 | 1.7% | 34.9% | 63.3%
PARSER | 3.6 | 3.2% | 48.7% | 48.0%
VORTEX2K | 25.4 | 2.8% | 64.9% | 32.2%
VPR | 6.3 | 3.6% | 49.8% | 46.4%
Let's look at some statistics for predicting memory values. Glancing at this table, we see that for most benchmarks there are 12 or fewer exposed loads per epoch, that they are quite predictable, and that our misprediction rates are quite low; so we expect memory value prediction to be effective. Exposed loads are quite predictable.

47 Throttling Prediction Further
One possibility is to instead predict only loads that have recently caused dependence violations. We gather this information with two devices. First, whenever an exposed load occurs, we use the tag of the associated cache line to index an exposed load table, where we store the PC of the exposed load; this table need only be a 16-entry, direct-mapped cache of the most recent exposed loads. Second, in our hardware support for TLS, whenever a dependence violation occurs there is an associated cache line; we use its cache tag to index the exposed load table, retrieve the PC of the last exposed load, and add that PC to a list of loads that have caused violations. Now we have the ability to be more selective, and to predict only loads that have caused violations.
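A sketch of the two structures, using the 16-entry direct-mapped size from the slide; the hashing and the violating-list representation are assumptions.

```c
#include <stdint.h>

#define ELT_ENTRIES 16   /* direct-mapped, as described */

typedef struct { uint32_t tag; uint32_t load_pc; } EltEntry;
static EltEntry elt[ELT_ENTRIES];

extern void add_to_violating_list(uint32_t pc);  /* hypothetical */

/* On an exposed load: remember which PC last touched this cache line. */
void on_exposed_load(uint32_t cache_tag, uint32_t load_pc) {
    EltEntry *e = &elt[cache_tag % ELT_ENTRIES];
    e->tag = cache_tag;
    e->load_pc = load_pc;
}

/* On a dependence violation: look up the offending line and record the
 * PC of its last exposed load as a violating load. */
void on_violation(uint32_t cache_tag) {
    EltEntry *e = &elt[cache_tag % ELT_ENTRIES];
    if (e->tag == cache_tag)
        add_to_violating_list(e->load_pc);
}
```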

48 Forwarded Value Prediction
Application | Incorrect | Correct | Not Confident
COMPRESS | 3.7% | 31.2% | 65.1%
CRAFTY | 5.5% | 24.6% | 69.7%
GO | | 28.3% | 67.9%
GZIP | 0.2% | 98.0% | 1.6%
M88KSIM | 5.4% | 91.0% | 3.4%
MCF | 2.5% | 48.5% | 48.9%
PARSER | 2.8% | 11.6% | 85.5%
VORTEX2K | 2.2% | 81.9% | 15.7%
VPR | | 26.4% | 70.7%
Forwarded values are also quite predictable: correct predictions range from 11.6% to 98.0% of all synchronized loads. Note also that misprediction rates are still low. Synchronized loads are also predictable.

49 Dynamic, Non-Stack, Silent Stores
Application | Dynamic, Non-Stack, Silent Stores
COMPRESS | 80%
CRAFTY | 16%
GO |
GZIP | 4%
M88KSIM | 57%
MCF | 19%
PARSER | 12%
VORTEX2K | 84%
VPR | 26%
We see here that silent stores are prevalent in the regions of code that we speculatively parallelize. Silent stores are prevalent.

50 Critical Path Prioritization
Application | Issued Insts That Are High Priority and Issued Early
COMPRESS | 7.1%
CRAFTY | 6.8%
GO | 12.9%
GZIP | 3.6%
M88KSIM | 9.1%
MCF | 9.9%
PARSER | 9.7%
VORTEX2K |
VPR | 4.7%
This table shows the percentage of all instructions issued in speculatively parallelized regions that were given high priority by our algorithm and actually issued earlier than they would have otherwise; we see that a significant number of instructions are reordered. Significant reordering.

