Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Improving Value Communication…Steffan Carnegie Mellon Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia.

Similar presentations


Presentation on theme: "1 Improving Value Communication…Steffan Carnegie Mellon Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia."— Presentation transcript:

1 1 Improving Value Communication…Steffan Carnegie Mellon Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry School of Computer Science Carnegie Mellon University

2 2 Improving Value Communication…Steffan Carnegie Mellon Multithreaded Machines Are Everywhere  How can we use them? Parallelism! C P C SUN MAJC, IBM Power4, Sibyte SB-1250 ALPHA 21464, Intel Xeon Threads C C P C P

3 3 Improving Value Communication…Steffan Carnegie Mellon Automatic Parallelization Proving independence of threads is hard: –complex control flow –complex data structures –pointers, pointers, pointers –run-time inputs How can we make the compiler’s job feasible?  Thread-Level Speculation (TLS)

4 4 Improving Value Communication…Steffan Carnegie Mellon  Retry TLS E1 E2 E3 Load Thread-Level Speculation Epoch1 Epoch2 Epoch3  exploit available thread-level parallelism Load Store Time

5 5 Improving Value Communication…Steffan Carnegie Mellon Speculate  good when p != q Store *p Load *q E1 E2 Memory

6 6 Improving Value Communication…Steffan Carnegie Mellon Synchronize (and forward)  good when p == q Store *p Load *q E1 E2 Memory Signal Wait (stall) Store *p Load *q E1 E2 Memory (Speculate)

7 7 Improving Value Communication…Steffan Carnegie Mellon Reduce the Critical Forwarding Path Wait Load X Store X Signal OverviewBig Critical PathSmall Critical Path  decreases execution time critical path stall execution time

8 8 Improving Value Communication…Steffan Carnegie Mellon Predict  good when p == q and *q is predictable Store *p Load *q E1 E2 Memory Value Predictor Store *p Load *q E1 E2 Memory Signal Wait (stall) (Synchronize) Store *p Load *q E1 E2 Memory (Speculate)

9 9 Improving Value Communication…Steffan Carnegie Mellon Improving on Compile-Time Decisions Predict Speculate Synchronize Compiler Speculate Synchronize Hardware reduce critical forwarding path reduce critical forwarding path  is there any potential benefit?

10 10 Improving Value Communication…Steffan Carnegie Mellon Potential for Improving Value Communication  efficient value communication is key U=Un-optimized, P=Perfect Prediction (4 Processors)

11 11 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation  Our Support for Thread-Level Speculation –Compiler Support –Experimental Framework –Baseline Performance Techniques for Improving Value CommunicationTechniques for Improving Value Communication Combining the TechniquesCombining the Techniques ConclusionsConclusions

12 12 Improving Value Communication…Steffan Carnegie Mellon Compiler Support (SUIF1.3 and gcc) 1) Where to speculate –use profile information, heuristics, loop unrolling 2) Transforming to exploit TLS –insert new TLS-specific instructions –synchronizes/forwards register values 3) Optimization –eliminate dependences due to loop induction variables –algorithm to schedule the critical forwarding path  compiler plays a crucial role

13 13 Improving Value Communication…Steffan Carnegie Mellon Experimental Framework Benchmarks –from SPECint95 and SPECint2000, -O3 optimization Underlying architecture –4-processor, single-chip multiprocessor –speculation supported by coherence Simulator –superscalar, similar to MIPS R10K –models all bandwidth and contention  detailed simulation! C C P C P Crossbar

14 14 Improving Value Communication…Steffan Carnegie Mellon Compiler Performance  compiler optimization is effective S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized

15 15 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication  Techniques for Improving Value Communication –When Prediction is Best Memory Value PredictionMemory Value Prediction Forwarded Value PredictionForwarded Value Prediction Silent StoresSilent Stores –When Synchronization is Best Combining the TechniquesCombining the Techniques ConclusionsConclusions

16 16 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction Store *p Load *q E1 E2 Memory  avoid failed speculation if *q is predictable Store *p Load *q E1 E2 Memory Value Predictor Prediction With Value 

17 17 Improving Value Communication…Steffan Carnegie Mellon Value Predictor Configuration Context Stride Confidence Aggressive hybrid predictor –1K x 3-entry context and 1K-entry stride –2-bit, up/down, saturating confidence counters  predict only when confident no prediction >? load PC predicted value

18 18 Improving Value Communication…Steffan Carnegie Mellon Throttling Prediction Only predict exposed loads –hardware tracks which words are speculatively modified –use to determine whether a load is exposed  predict only exposed loads Store X E1 Load X E2 not exposed exposed

19 19 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction  exposed loads are fairly predictable

20 20 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction  effective if properly throttled B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads

21 21 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction Store X Load X E1 E2 Signal Wait  avoid synchronization stall if X is predictable Store X Load X E1 E2 Value Predictor Prediction With Value (stall)

22 22 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction  forwarded values are also fairly predictable

23 23 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction  only predict loads that have caused stalls B=Baseline, F=Predict Forwarded Val’s, S=Predict Stalling Val’s

24 24 Improving Value Communication…Steffan Carnegie Mellon Silent Stores Store X=5 Load X E1 E2 Memory (X=5)  (Store X=5) Load X E1 E2 Memory (X=5) Exploiting Silent Stores  avoid failed speculation if store is silent Load X==5?

25 25 Improving Value Communication…Steffan Carnegie Mellon Silent Stores  silent stores are prevalent

26 26 Improving Value Communication…Steffan Carnegie Mellon Impact of Exploiting Silent Stores  most of the benefits of memory value prediction B=Baseline, SS=Exploit Silent Stores

27 27 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication  Techniques for Improving Value Communication When Prediction is Best When Prediction is Best –When Synchronization is Best Hardware-Inserted Dynamic SynchronizationHardware-Inserted Dynamic Synchronization Reducing the Critical Forwarding PathReducing the Critical Forwarding Path Combining the TechniquesCombining the Techniques ConclusionsConclusions

28 28 Improving Value Communication…Steffan Carnegie Mellon Hardware-Inserted Dynamic Synchronization Store *p Load *q E1 E2 Memory  avoid failed speculation With Dynamic Sync. Store *p Load *q E2 E1 (stall) Memory 

29 29 Improving Value Communication…Steffan Carnegie Mellon Hardware-Inserted Dynamic Synchronization  overall average improvement of 9% B=Baseline, D=Sync. Violating Ld.s, R=D+Reset, M=R+Minimum

30 30 Improving Value Communication…Steffan Carnegie Mellon Reduce the Critical Forwarding Path Wait Load X Store X Signal OverviewBig Critical PathSmall Critical Path  decreases execution time critical path stall execution time

31 31 Improving Value Communication…Steffan Carnegie Mellon Prioritizing the Critical Forwarding Path Load r1=X Store r2,X Signal op r2=r1,r3 op r5=r6,r7 op r6=r5,r8 critical path mark the input-chain of the critical storemark the input-chain of the critical store give marked instructions high issue prioritygive marked instructions high issue priority Load r1=X Store r2,X Signal op r2=r1,r3 op r5=r6,r7 op r6=r5,r8 critical path Prioritization With

32 32 Improving Value Communication…Steffan Carnegie Mellon Critical Path Prioritization  some reordering

33 33 Improving Value Communication…Steffan Carnegie Mellon Impact of Prioritizing the Critical Path  not much benefit, given the complexity B=Baseline, S=Prioritizing Critical Path

34 34 Improving Value Communication…Steffan Carnegie Mellon Outline Our Support for Thread-Level Speculation Our Support for Thread-Level Speculation Techniques for Improving Value Communication Techniques for Improving Value Communication Combining the Techniques  Combining the Techniques ConclusionsConclusions

35 35 Improving Value Communication…Steffan Carnegie Mellon Combining the Techniques Techniques are orthogonal with one exception: Memory value prediction and dynamic sync. –only synchronize memory values that are unpredictable –dynamic sync. logic checks prediction confidence –synchronize if not confident

36 36 Improving Value Communication…Steffan Carnegie Mellon Combining the Techniques  close to ideal for m88ksim and vpr B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction

37 37 Improving Value Communication…Steffan Carnegie Mellon Conclusions Prediction –memory value prediction: effective when throttled –forwarded value prediction: effective when throttled –silent stores: prevalent and effective Synchronization –dynamic synchronization: can help or hurt –hardware prioritization: ineffective, if compiler is good  prediction is effective    synchronization has mixed results

38 38 Improving Value Communication…Steffan Carnegie Mellon BACKUPS

39 39 Improving Value Communication…Steffan Carnegie Mellon Goals 1) Parallelize general-purpose programs –difficult problem 2) Keep hardware support simple and minimal –avoid large, specialized structures –preserve the performance of non-TLS workloads 3) Take full advantage of the compiler –region selection, synchronization, optimization

40 40 Improving Value Communication…Steffan Carnegie Mellon Potential for Further Improvement  point

41 41 Improving Value Communication…Steffan Carnegie Mellon Pipeline Parameters Issue Width 4 Functional Units 2Int, 2FP, 1Mem, 1Bra Reorder Buffer Size 128 Integer Multiply 12 cycles Integer Divide 76 cycles All Other Integer 1 cycle FP Divide 15 cycles FP Square Root 20 cycles All Other FP 2 cycles Branch Prediction GShare (16KB, 8 history bits)

42 42 Improving Value Communication…Steffan Carnegie Mellon Memory Parameters Cache Line Size 32B Instruction Cache 32KB, 4-way set-assoc Data Cache 32KB, 2-way set-assoc, 2 banks Unified Secondary Cache 2MB, 4-way set-assoc, 4 banks Miss Handlers 16 for data, 2 for insts Crossbar Interconnect 8B per cycle per bank Minimum Miss Latency to Secondary Cache 10 cycles Minimum Miss Latency to Local Memory 75 cycles Main Memory Bandwidth 1 access per 20 cycles

43 43 Improving Value Communication…Steffan Carnegie Mellon When Prediction is Best Predicting under TLS –only update predictor for successful epochs –cost of misprediction is high: must re-execute epoch –each epoch requires a logically-separate predictor Differentiation from previous work: –loop induction variables optimized by compiler –larger regions of code, hence larger number of memory dependences between epochs

44 44 Improving Value Communication…Steffan Carnegie Mellon Benchmark Statistics: SPECint2000 ApplicationName Portion of Dynamic Execution Parallelized Number of Unique Parallelized Regions Average Epoch Size (dynamic insts) Average Number of Epochs Per Dynamic region Instance BZIP298.1%1251.5451596.0 CRAFTY36.1%3430.81315.7 GZIP70.4%11307.02064.8 MCF61.0%9206.2198.9 PARSER36.4%41271.119.4 PERLBMK10.3%1065.12.4 VORTEX2K12.7%61994.33.4 VPR80.1%690.26.3

45 45 Improving Value Communication…Steffan Carnegie Mellon Benchmark Statistics: SPECint95 Application Name Portion of Dynamic Execution Parallelized Number of Unique Parallelized Regions Average Epoch Size (dynamic insts) Average Number of Epochs Per Dynamic region Instance COMPRESS 75.5%7188.268.4 GO 31.3%402252.756.2 IJPEG 90.6%231499.833.8 LI 17.0%3176.4124.9 M88KSIM 56.5%6840.450.2 PERL 43.9%4137.32.2

46 46 Improving Value Communication…Steffan Carnegie Mellon Memory Value Prediction Application Name Avg. Exposed Loads per Epoch IncorrectCorrect Not Confident COMPRESS12.00.3%31.8%67.9% CRAFTY4.53.0%48.6%48.3% GO7.82.5%41.2%56.2% GZIP66.61.4%52.8%45.7% M88KSIM7.51.2%90.9%7.7% MCF2.51.7%34.9%63.3% PARSER3.63.2%48.7%48.0% VORTEX2K25.42.8%64.9%32.2% VPR6.33.6%49.8%46.4%  exposed loads are quite predictable

47 47 Improving Value Communication…Steffan Carnegie Mellon Throttling Prediction Further cache tag Load PC Exposed Load Table On an exposed load:  only predict violating loads Load PC On a dependence violation: Load PC Exposed Load Table cache tag Violating Loads List

48 48 Improving Value Communication…Steffan Carnegie Mellon Forwarded Value Prediction Application Name IncorrectCorrect Not Confident COMPRESS3.7%31.2%65.1% CRAFTY5.5%24.6%69.7% GO3.7%28.3%67.9% GZIP0.2%98.0%1.6% M88KSIM5.4%91.0%3.4% MCF2.5%48.5%48.9% PARSER2.8%11.6%85.5% VORTEX2K2.2%81.9%15.7% VPR2.8%26.4%70.7%  synchronized loads are also predictable

49 49 Improving Value Communication…Steffan Carnegie Mellon Silent Stores Application Name Dynamic, Non-Stack, Silent Stores COMPRESS80% CRAFTY16% GO16% GZIP4% M88KSIM57% MCF19% PARSER12% VORTEX2K84% VPR26%  silent stores are prevalent

50 50 Improving Value Communication…Steffan Carnegie Mellon Critical Path Prioritization Application Name Issued Insts That Are High Priority and Issued Early COMPRESS7.1% CRAFTY6.8% GO12.9% GZIP3.6% M88KSIM9.1% MCF9.9% PARSER9.7% VORTEX2K3.6% VPR4.7%  significant reordering


Download ppt "1 Improving Value Communication…Steffan Carnegie Mellon Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia."

Similar presentations


Ads by Google