Download presentation
Presentation is loading. Please wait.
Published byFlorence Armstrong Modified over 9 years ago
1
1 Improving Value Communication…Steffan Carnegie Mellon Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry School of Computer Science Carnegie Mellon University
2
2 Improving Value Communication…Steffan Carnegie Mellon Multithreaded Machines Are Everywhere How can we use them? Parallelism! C P C SUN MAJC, IBM Power4, Sibyte SB-1250 ALPHA 21464, Intel Xeon Threads C C P C P
3
3 Improving Value Communication…Steffan Carnegie Mellon Automatic Parallelization Proving independence of threads is hard: –complex control flow –complex data structures –pointers, pointers, pointers –run-time inputs How can we make the compiler’s job feasible? Thread-Level Speculation (TLS)
4
4 Improving Value Communication…Steffan Carnegie Mellon Retry TLS E1 E2 E3 Load Thread-Level Speculation Epoch1 Epoch2 Epoch3 exploit available thread-level parallelism Load Store Time
5
5 Improving Value Communication…Steffan Carnegie Mellon Speculate good when p != q Store *p Load *q E1 E2 Memory
6
6 Improving Value Communication…Steffan Carnegie Mellon Synchronize (and forward) good when p == q Store *p Load *q E1 E2 Memory Signal Wait (stall) Store *p Load *q E1 E2 Memory (Speculate)
7
7 Improving Value Communication…Steffan Carnegie Mellon Reduce the Critical Forwarding Path Wait Load X Store X Signal OverviewBig Critical PathSmall Critical Path decreases execution time critical path stall execution time
8
8 Improving Value Communication…Steffan Carnegie Mellon Predict good when p == q and *q is predictable Store *p Load *q E1 E2 Memory Value Predictor Store *p Load *q E1 E2 Memory Signal Wait (stall) (Synchronize) Store *p Load *q E1 E2 Memory (Speculate)
9
9 Improving Value Communication…Steffan Carnegie Mellon Improving on Compile-Time Decisions Predict Speculate Synchronize Compiler Speculate Synchronize Hardware reduce critical forwarding path reduce critical forwarding path is there any potential benefit?
10
10 Improving Value Communication…Steffan Carnegie Mellon Potential for Improving Value Communication efficient value communication is key U=Un-optimized, P=Perfect Prediction (4 Processors)
11
11 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation –Compiler Support –Experimental Framework –Baseline Performance Techniques for Improving Value CommunicationTechniques for Improving Value Communication Combining the TechniquesCombining the Techniques ConclusionsConclusions
12
12 Improving Value Communication…Steffan Carnegie Mellon Compiler Support (SUIF1.3 and gcc) 1) Where to speculate –use profile information, heuristics, loop unrolling 2) Transforming to exploit TLS –insert new TLS-specific instructions –synchronizes/forwards register values 3) Optimization –eliminate dependences due to loop induction variables –algorithm to schedule the critical forwarding path compiler plays a crucial role
13
13 Improving Value Communication…Steffan Carnegie Mellon Experimental Framework Benchmarks –from SPECint95 and SPECint2000, -O3 optimization Underlying architecture –4-processor, single-chip multiprocessor –speculation supported by coherence Simulator –superscalar, similar to MIPS R10K –models all bandwidth and contention detailed simulation! C C P C P Crossbar
14
14 Improving Value Communication…Steffan Carnegie Mellon Compiler Performance compiler optimization is effective S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized
15
15 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication Techniques for Improving Value Communication –When Prediction is Best Memory Value PredictionMemory Value Prediction Forwarded Value PredictionForwarded Value Prediction Silent StoresSilent Stores –When Synchronization is Best Combining the TechniquesCombining the Techniques ConclusionsConclusions
16
16 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction Store *p Load *q E1 E2 Memory avoid failed speculation if *q is predictable Store *p Load *q E1 E2 Memory Value Predictor Prediction With Value
17
17 Improving Value Communication…Steffan Carnegie Mellon Value Predictor Configuration Context Stride Confidence Aggressive hybrid predictor –1K x 3-entry context and 1K-entry stride –2-bit, up/down, saturating confidence counters predict only when confident no prediction >? load PC predicted value
18
18 Improving Value Communication…Steffan Carnegie Mellon Throttling Prediction Only predict exposed loads –hardware tracks which words are speculatively modified –use to determine whether a load is exposed predict only exposed loads Store X E1 Load X E2 not exposed exposed
19
19 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction exposed loads are fairly predictable
20
20 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction effective if properly throttled B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads
21
21 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction Store X Load X E1 E2 Signal Wait avoid synchronization stall if X is predictable Store X Load X E1 E2 Value Predictor Prediction With Value (stall)
22
22 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction forwarded values are also fairly predictable
23
23 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction only predict loads that have caused stalls B=Baseline, F=Predict Forwarded Val’s, S=Predict Stalling Val’s
24
24 Improving Value Communication…Steffan Carnegie Mellon Silent Stores Store X=5 Load X E1 E2 Memory (X=5) (Store X=5) Load X E1 E2 Memory (X=5) Exploiting Silent Stores avoid failed speculation if store is silent Load X==5?
25
25 Improving Value Communication…Steffan Carnegie Mellon Silent Stores silent stores are prevalent
26
26 Improving Value Communication…Steffan Carnegie Mellon Impact of Exploiting Silent Stores most of the benefits of memory value prediction B=Baseline, SS=Exploit Silent Stores
27
27 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication Techniques for Improving Value Communication When Prediction is Best When Prediction is Best –When Synchronization is Best Hardware-Inserted Dynamic SynchronizationHardware-Inserted Dynamic Synchronization Reducing the Critical Forwarding PathReducing the Critical Forwarding Path Combining the TechniquesCombining the Techniques ConclusionsConclusions
28
28 Improving Value Communication…Steffan Carnegie Mellon Hardware-Inserted Dynamic Synchronization Store *p Load *q E1 E2 Memory avoid failed speculation With Dynamic Sync. Store *p Load *q E2 E1 (stall) Memory
29
29 Improving Value Communication…Steffan Carnegie Mellon Hardware-Inserted Dynamic Synchronization overall average improvement of 9% B=Baseline, D=Sync. Violating Ld.s, R=D+Reset, M=R+Minimum
30
30 Improving Value Communication…Steffan Carnegie Mellon Reduce the Critical Forwarding Path Wait Load X Store X Signal OverviewBig Critical PathSmall Critical Path decreases execution time critical path stall execution time
31
31 Improving Value Communication…Steffan Carnegie Mellon Prioritizing the Critical Forwarding Path Load r1=X Store r2,X Signal op r2=r1,r3 op r5=r6,r7 op r6=r5,r8 critical path mark the input-chain of the critical storemark the input-chain of the critical store give marked instructions high issue prioritygive marked instructions high issue priority Load r1=X Store r2,X Signal op r2=r1,r3 op r5=r6,r7 op r6=r5,r8 critical path Prioritization With
32
32 Improving Value Communication…Steffan Carnegie Mellon Critical Path Prioritization some reordering
33
33 Improving Value Communication…Steffan Carnegie Mellon Impact of Prioritizing the Critical Path not much benefit, given the complexity B=Baseline, S=Prioritizing Critical Path
34
34 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication Techniques for Improving Value Communication Combining the Techniques Combining the Techniques ConclusionsConclusions
35
35 Improving Value Communication…Steffan Carnegie Mellon Combining the Techniques Techniques are orthogonal with one exception: Memory value prediction and dynamic sync. –only synchronize memory values that are unpredictable –dynamic sync. logic checks prediction confidence –synchronize if not confident
36
36 Improving Value Communication…Steffan Carnegie Mellon Combining the Techniques close to ideal for m88ksim and vpr B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction
37
37 Improving Value Communication…Steffan Carnegie Mellon Conclusions Prediction –memory value prediction: effective when throttled –forwarded value prediction: effective when throttled –silent stores: prevalent and effective Synchronization –dynamic synchronization: can help or hurt –hardware prioritization: ineffective, if compiler is good prediction is effective synchronization has mixed results
38
38 Improving Value Communication…Steffan Carnegie Mellon BACKUPS
39
39 Improving Value Communication…Steffan Carnegie Mellon Goals 1) Parallelize general-purpose programs –difficult problem 2) Keep hardware support simple and minimal –avoid large, specialized structures –preserve the performance of non-TLS workloads 3) Take full advantage of the compiler –region selection, synchronization, optimization
40
40 Improving Value Communication…Steffan Carnegie Mellon Potential for Further Improvement point
41
41 Improving Value Communication…Steffan Carnegie Mellon Pipeline Parameters Issue Width 4 Functional Units 2Int, 2FP, 1Mem, 1Bra Reorder Buffer Size 128 Integer Multiply 12 cycles Integer Divide 76 cycles All Other Integer 1 cycle FP Divide 15 cycles FP Square Root 20 cycles All Other FP 2 cycles Branch Prediction GShare (16KB, 8 history bits)
42
42 Improving Value Communication…Steffan Carnegie Mellon Memory Parameters Cache Line Size 32B Instruction Cache 32KB, 4-way set-assoc Data Cache 32KB, 2-way set-assoc, 2 banks Unified Secondary Cache 2MB, 4-way set-assoc, 4 banks Miss Handlers 16 for data, 2 for insts Crossbar Interconnect 8B per cycle per bank Minimum Miss Latency to Secondary Cache 10 cycles Minimum Miss Latency to Local Memory 75 cycles Main Memory Bandwidth 1 access per 20 cycles
43
43 Improving Value Communication…Steffan Carnegie Mellon When Prediction is Best Predicting under TLS –only update predictor for successful epochs –cost of misprediction is high: must re-execute epoch –each epoch requires a logically-separate predictor Differentiation from previous work: –loop induction variables optimized by compiler –larger regions of code, hence larger number of memory dependences between epochs
44
44 Improving Value Communication…Steffan Carnegie Mellon Benchmark Statistics: SPECint2000 ApplicationName Portion of Dynamic Execution Parallelized Number of Unique Parallelized Regions Average Epoch Size (dynamic insts) Average Number of Epochs Per Dynamic region Instance BZIP298.1%1251.5451596.0 CRAFTY36.1%3430.81315.7 GZIP70.4%11307.02064.8 MCF61.0%9206.2198.9 PARSER36.4%41271.119.4 PERLBMK10.3%1065.12.4 VORTEX2K12.7%61994.33.4 VPR80.1%690.26.3
45
45 Improving Value Communication…Steffan Carnegie Mellon Benchmark Statistics: SPECint95 Application Name Portion of Dynamic Execution Parallelized Number of Unique Parallelized Regions Average Epoch Size (dynamic insts) Average Number of Epochs Per Dynamic region Instance COMPRESS 75.5%7188.268.4 GO 31.3%402252.756.2 IJPEG 90.6%231499.833.8 LI 17.0%3176.4124.9 M88KSIM 56.5%6840.450.2 PERL 43.9%4137.32.2
46
46 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction Application Name Avg. Exposed Loads per Epoch IncorrectCorrect Not Confident COMPRESS12.00.3%31.8%67.9% CRAFTY4.53.0%48.6%48.3% GO7.82.5%41.2%56.2% GZIP66.61.4%52.8%45.7% M88KSIM7.51.2%90.9%7.7% MCF2.51.7%34.9%63.3% PARSER3.63.2%48.7%48.0% VORTEX2K25.42.8%64.9%32.2% VPR6.33.6%49.8%46.4% exposed loads are quite predictable
47
47 Improving Value Communication…Steffan Carnegie Mellon Throttling Prediction Further cache tag Load PC Exposed Load Table On an exposed load: only predict violating loads Load PC On a dependence violation: Load PC Exposed Load Table cache tag Violating Loads List
48
48 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction Application Name IncorrectCorrect Not Confident COMPRESS3.7%31.2%65.1% CRAFTY5.5%24.6%69.7% GO3.7%28.3%67.9% GZIP0.2%98.0%1.6% M88KSIM5.4%91.0%3.4% MCF2.5%48.5%48.9% PARSER2.8%11.6%85.5% VORTEX2K2.2%81.9%15.7% VPR2.8%26.4%70.7% synchronized loads are also predictable
49
49 Improving Value Communication…Steffan Carnegie Mellon Silent Stores Application Name Dynamic, Non-Stack, Silent Stores COMPRESS80% CRAFTY16% GO16% GZIP4% M88KSIM57% MCF19% PARSER12% VORTEX2K84% VPR26% silent stores are prevalent
50
50 Improving Value Communication…Steffan Carnegie Mellon Critical Path Prioritization Application Name Issued Insts That Are High Priority and Issued Early COMPRESS7.1% CRAFTY6.8% GO12.9% GZIP3.6% M88KSIM9.1% MCF9.9% PARSER9.7% VORTEX2K3.6% VPR4.7% significant reordering
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.