Improving Value Communication for Thread-Level Speculation. Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry, Carnegie Mellon University.

Presentation transcript:

1 Improving Value Communication for Thread-Level Speculation
Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry
School of Computer Science, Carnegie Mellon University

2 Multithreaded Machines Are Everywhere
How can we use them? Parallelism!
–chip multiprocessors: SUN MAJC, IBM Power4, SiByte SB-1250
–multithreaded processors: ALPHA 21464, Intel Xeon

3 Automatic Parallelization
Proving independence of threads is hard:
–complex control flow
–complex data structures
–pointers, pointers, pointers
–run-time inputs
How can we make the compiler's job feasible?
→ Thread-Level Speculation (TLS)

4 Thread-Level Speculation
Epochs E1, E2, and E3 run in parallel; when a load in a later epoch conflicts with a store from an earlier epoch, the later epoch is squashed and must retry.
→ exploit available thread-level parallelism
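The retry rule on this slide can be sketched in a few lines. This is a hypothetical illustration, not the paper's hardware: `find_violations` is an invented helper that scans an execution-ordered trace of loads and stores and flags any later epoch that loaded an address before a logically-earlier epoch stored to it.

```python
# Hypothetical sketch of TLS violation detection (not the actual hardware).
def find_violations(trace):
    """trace: list of (epoch, op, addr) in the order operations actually
    executed. Returns the set of epochs that must be squashed and retried."""
    loaded_by = {}     # addr -> set of epochs that already loaded it
    violators = set()
    for epoch, op, addr in trace:
        if op == "load":
            loaded_by.setdefault(addr, set()).add(epoch)
        elif op == "store":
            # any LATER epoch that already consumed this address read a
            # stale value: a read-after-write dependence was violated
            for e in loaded_by.get(addr, ()):
                if e > epoch:
                    violators.add(e)
    return violators

# E2 loads X before E1's store to X reaches memory, so E2 must retry
trace = [(2, "load", "X"), (1, "store", "X"), (2, "store", "Y")]
print(find_violations(trace))  # {2}
```

The key point the sketch captures: ordering is only checked against *logically* earlier epochs, so independent epochs run fully in parallel.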

5 Speculate
E1 performs Store *p while E2 speculatively performs Load *q through memory.
→ good when p != q

6 Synchronize (and forward)
Instead of speculating, E1 performs Store *p and then signals; E2 waits (stalls) before performing Load *q.
→ good when p == q
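The wait/signal pairing above can be sketched with ordinary threads. This is a software stand-in for the hardware mechanism, with `threading.Event` playing the role of the signal/wait pair; the epoch functions and names are invented for illustration.

```python
# Hypothetical sketch: synchronize-and-forward between two epochs.
import threading

memory = {"X": 0}
forwarded = threading.Event()   # stand-in for the hardware signal/wait
results = []

def epoch1():
    memory["X"] = 42            # Store *p
    forwarded.set()             # Signal: value is now safe to consume

def epoch2():
    forwarded.wait()            # Wait (stall) until epoch1 signals
    results.append(memory["X"]) # Load *q: guaranteed to see 42

t2 = threading.Thread(target=epoch2); t2.start()
t1 = threading.Thread(target=epoch1); t1.start()
t1.join(); t2.join()
print(results)  # [42]
```

Unlike speculation, this never fails, but the consumer pays a stall whenever it reaches the wait before the producer signals.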

7 Reduce the Critical Forwarding Path
The critical path runs from the Wait and Load X in one epoch to the Store X and Signal feeding the next; shrinking it reduces the stall in the consuming epoch.
→ decreases execution time

8 Predict
Rather than speculating or synchronizing, a value predictor supplies the value of *q to E2, avoiding both failed speculation and the synchronization stall.
→ good when p == q and *q is predictable

9 Improving on Compile-Time Decisions
Compiler: speculate or synchronize (reduce the critical forwarding path).
Hardware: predict, speculate, or synchronize (reduce the critical forwarding path).
→ is there any potential benefit?

10 Potential for Improving Value Communication
U=Un-optimized, P=Perfect Prediction (4 processors)
→ efficient value communication is key

11 Outline
► Our Support for Thread-Level Speculation
–Compiler Support
–Experimental Framework
–Baseline Performance
Techniques for Improving Value Communication
Combining the Techniques
Conclusions

12 Compiler Support (SUIF 1.3 and gcc)
1) Where to speculate
–use profile information, heuristics, loop unrolling
2) Transforming to exploit TLS
–insert new TLS-specific instructions
–synchronize/forward register values
3) Optimization
–eliminate dependences due to loop induction variables
–algorithm to schedule the critical forwarding path
→ compiler plays a crucial role

13 Experimental Framework
Benchmarks
–from SPECint95 and SPECint2000, -O3 optimization
Underlying architecture
–4-processor, single-chip multiprocessor (caches connected by a crossbar)
–speculation supported by coherence
Simulator
–superscalar, similar to MIPS R10K
–models all bandwidth and contention
→ detailed simulation!

14 Compiler Performance
S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized
→ compiler optimization is effective

15 Outline
Our Support for Thread-Level Speculation
► Techniques for Improving Value Communication
–When Prediction is Best: Memory Value Prediction, Forwarded Value Prediction, Silent Stores
–When Synchronization is Best
Combining the Techniques
Conclusions

16 Memory Value Prediction
With value prediction, E2's Load *q consumes a predicted value instead of speculating against E1's Store *p.
→ avoid failed speculation if *q is predictable

17 Value Predictor Configuration
Aggressive hybrid (context + stride) predictor, indexed by load PC:
–1K x 3-entry context table and 1K-entry stride table
–2-bit, up/down, saturating confidence counters; no prediction when below threshold
→ predict only when confident
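The stride half of this predictor is easy to sketch. This is a simplified software model under assumed details (an untagged dictionary instead of a 1K-entry table, and a confidence threshold of 2), not the paper's exact hardware: per load PC it keeps the last value, the stride, and a 2-bit up/down saturating confidence counter, and it predicts only when confident.

```python
# Hypothetical sketch of a stride value predictor with 2-bit confidence.
class StridePredictor:
    def __init__(self, threshold=2):
        self.table = {}            # pc -> (last_value, stride, confidence)
        self.threshold = threshold # assumed confidence threshold

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[2] >= self.threshold:
            last, stride, _ = entry
            return last + stride   # confident: predict last + stride
        return None                # not confident: make no prediction

    def update(self, pc, actual):
        last, stride, conf = self.table.get(pc, (actual, 0, 0))
        if actual == last + stride:
            conf = min(conf + 1, 3)    # saturate at 3 (2-bit counter)
        else:
            conf = max(conf - 1, 0)    # wrong: lose confidence,
            stride = actual - last     # and re-learn the stride
        self.table[pc] = (actual, stride, conf)

p = StridePredictor()
for v in (10, 14, 18, 22):      # stride-4 load values at one PC
    p.update(0x40, v)
print(p.predict(0x40))           # 26
```

The confidence counter is what implements "predict only when confident": after a stride change the predictor stays silent until the new stride has been confirmed.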

18 Throttling Prediction
Only predict exposed loads:
–hardware tracks which words are speculatively modified
–used to determine whether a load is exposed (a Load X after a Store X in the same epoch is not exposed)
→ predict only exposed loads
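The exposed-load test can be sketched as a scan over one epoch's memory operations. This is an invented software model of the idea, not the hardware tracking mechanism: a load is exposed only if no store in the same epoch has already written that word.

```python
# Hypothetical sketch: identify exposed loads within one epoch.
def exposed_loads(ops):
    """ops: list of ('load'|'store', addr) for one epoch, in program order.
    Returns the addresses of exposed loads, in order."""
    written = set()
    exposed = []
    for op, addr in ops:
        if op == "store":
            written.add(addr)        # word is now speculatively modified
        elif op == "load" and addr not in written:
            exposed.append(addr)     # value must come from outside the epoch
    return exposed

# the slide's example: a Store X followed by Load X -> that load is NOT
# exposed, so it is not a prediction candidate; the Load Y is exposed
print(exposed_loads([("store", "X"), ("load", "X"), ("load", "Y")]))  # ['Y']
```

Only exposed loads can consume values produced by earlier epochs, so they are the only loads worth predicting.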

19 Memory Value Prediction
→ exposed loads are fairly predictable

20 Memory Value Prediction
B=Baseline, E=Predict Exposed Loads, V=Predict Violating Loads
→ effective if properly throttled

21 Forwarded Value Prediction
Instead of stalling at the Wait, E2 consumes a predicted value for X while E1's Store X and Signal proceed in the background.
→ avoid the synchronization stall if X is predictable

22 Forwarded Value Prediction
→ forwarded values are also fairly predictable

23 Forwarded Value Prediction
B=Baseline, F=Predict Forwarded Values, S=Predict Stalling Values
→ only predict loads that have caused stalls

24 Silent Stores
A store that writes the value already in memory (Store X=5 when X=5) is silent: convert it into a verifying load (Load X == 5?), so the later epoch's Load X no longer causes a violation.
→ avoid failed speculation if the store is silent

25 Silent Stores
→ silent stores are prevalent

26 Impact of Exploiting Silent Stores
B=Baseline, SS=Exploit Silent Stores
→ most of the benefits of memory value prediction

27 Outline
Our Support for Thread-Level Speculation
► Techniques for Improving Value Communication
–When Prediction is Best
–When Synchronization is Best: Hardware-Inserted Dynamic Synchronization, Reducing the Critical Forwarding Path
Combining the Techniques
Conclusions

28 Hardware-Inserted Dynamic Synchronization
After observing a violation, the hardware stalls E2's Load *q until E1's Store *p completes, rather than speculating again.
→ avoid failed speculation

29 Hardware-Inserted Dynamic Synchronization
B=Baseline, D=Sync. Violating Loads, R=D+Reset, M=R+Minimum
→ overall average improvement of 9%

30 Reduce the Critical Forwarding Path
(Recap of slide 7.) The critical path runs from the Wait and Load X in one epoch to the Store X and Signal feeding the next; shrinking it reduces the stall in the consuming epoch.
→ decreases execution time

31 Prioritizing the Critical Forwarding Path
Example: Load r1=X; op r2=r1,r3; op r5=r6,r7; op r6=r5,r8; Store r2,X; Signal. The critical path is Load r1=X, then op r2=r1,r3, then Store r2,X and Signal; the other two ops are off the chain.
–mark the input chain of the critical store
–give marked instructions high issue priority
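Marking the input chain is a backward-slice computation. This sketch is an invented illustration of that idea, not the paper's compiler pass: starting from the critical store, walk backwards and mark every instruction that produces a register the store (transitively) consumes.

```python
# Hypothetical sketch: mark the input chain of the critical store.
def mark_critical_chain(insts, critical_idx):
    """insts: list of (dest, srcs) in program order; dest is the register
    written (None for a store), srcs are the values read. Returns the set
    of instruction indices on the critical store's input chain."""
    marked = {critical_idx}
    needed = set(insts[critical_idx][1])  # registers the store consumes
    for i in range(critical_idx - 1, -1, -1):
        dest, srcs = insts[i]
        if dest in needed:
            marked.add(i)            # this instruction feeds the chain
            needed.discard(dest)     # its value is now accounted for
            needed.update(srcs)      # chase its own inputs next
    return marked

# slide 31's example: only the load and the r2-producing op feed the store
insts = [("r1", ["X"]),           # 0: Load  r1 = X
         ("r2", ["r1", "r3"]),    # 1: op    r2 = r1, r3
         ("r5", ["r6", "r7"]),    # 2: op    r5 = r6, r7   (off-chain)
         ("r6", ["r5", "r8"]),    # 3: op    r6 = r5, r8   (off-chain)
         (None, ["r2"])]          # 4: Store r2, X
print(sorted(mark_critical_chain(insts, 4)))  # [0, 1, 4]
```

Everything in the returned set gets high issue priority; the off-chain ops can be delayed without lengthening the stall in the next epoch.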

32 Critical Path Prioritization
→ some reordering

33 Impact of Prioritizing the Critical Path
B=Baseline, S=Prioritizing Critical Path
→ not much benefit, given the complexity

34 Outline
Our Support for Thread-Level Speculation
Techniques for Improving Value Communication
► Combining the Techniques
Conclusions

35 Combining the Techniques
Techniques are orthogonal, with one exception: memory value prediction and dynamic synchronization.
–only synchronize memory values that are unpredictable
–dynamic sync. logic checks prediction confidence
–synchronize if not confident
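The combined policy on this slide reduces to a small decision function. This is a hypothetical sketch of that policy with an assumed confidence threshold of 2, invented names, and none of the surrounding hardware: a load that has caused violations is predicted when the predictor is confident, and dynamically synchronized otherwise.

```python
# Hypothetical sketch: combine value prediction with dynamic sync.
def choose_action(caused_violation, confidence, threshold=2):
    """Decide how a load should obtain its value."""
    if not caused_violation:
        return "speculate"       # default TLS behaviour: no interference
    if confidence >= threshold:
        return "predict"         # predictable: avoid both stall and squash
    return "synchronize"         # unpredictable: stall until value is safe

print(choose_action(False, 0))   # speculate
print(choose_action(True, 3))    # predict
print(choose_action(True, 1))    # synchronize
```

This is why the two techniques interfere: without the confidence check, dynamic synchronization would stall even loads the predictor could have handled.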

36 Combining the Techniques
B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction
→ close to ideal for m88ksim and vpr

37 Conclusions
Prediction
–memory value prediction: effective when throttled
–forwarded value prediction: effective when throttled
–silent stores: prevalent and effective
Synchronization
–dynamic synchronization: can help or hurt
–hardware prioritization: ineffective, if the compiler is good
→ prediction is effective
→ synchronization has mixed results

38 BACKUPS

39 Goals
1) Parallelize general-purpose programs
–difficult problem
2) Keep hardware support simple and minimal
–avoid large, specialized structures
–preserve the performance of non-TLS workloads
3) Take full advantage of the compiler
–region selection, synchronization, optimization

40 Potential for Further Improvement

41 Pipeline Parameters
Issue Width: 4
Functional Units: 2 Int, 2 FP, 1 Mem, 1 Branch
Reorder Buffer Size: 128
Integer Multiply: 12 cycles
Integer Divide: 76 cycles
All Other Integer: 1 cycle
FP Divide: 15 cycles
FP Square Root: 20 cycles
All Other FP: 2 cycles
Branch Prediction: GShare (16KB, 8 history bits)

42 Memory Parameters
Cache Line Size: 32B
Instruction Cache: 32KB, 4-way set-assoc.
Data Cache: 32KB, 2-way set-assoc., 2 banks
Unified Secondary Cache: 2MB, 4-way set-assoc., 4 banks
Miss Handlers: 16 for data, 2 for insts
Crossbar Interconnect: 8B per cycle per bank
Minimum Miss Latency to Secondary Cache: 10 cycles
Minimum Miss Latency to Local Memory: 75 cycles
Main Memory Bandwidth: 1 access per 20 cycles

43 When Prediction is Best
Predicting under TLS:
–only update the predictor for successful epochs
–cost of misprediction is high: must re-execute the epoch
–each epoch requires a logically-separate predictor
Differentiation from previous work:
–loop induction variables optimized by the compiler
–larger regions of code, hence a larger number of memory dependences between epochs

44 Benchmark Statistics: SPECint2000
Columns reported: portion of dynamic execution parallelized, number of unique parallelized regions, average epoch size (dynamic insts), average number of epochs per dynamic region instance.
Portion parallelized: BZIP2 98.1%, CRAFTY 36.1%, GZIP 70.4%, MCF 61.0%, PARSER 36.4%, PERLBMK 10.3%, VORTEX2K 12.7%, VPR 80.1%

45 Benchmark Statistics: SPECint95
Columns reported: portion of dynamic execution parallelized, number of unique parallelized regions, average epoch size (dynamic insts), average number of epochs per dynamic region instance.
Portion parallelized: COMPRESS 75.5%, GO 31.3%, IJPEG 90.6%, LI 17.0%, M88KSIM 56.5%, PERL 43.9%

46 Memory Value Prediction
Application: Avg. Exposed Loads per Epoch | Incorrect | Correct | Not Confident
COMPRESS: – | – | 31.8% | 67.9%
CRAFTY: 4.5 | 3.0% | 48.6% | 48.3%
GO: 7.8 | 2.5% | 41.2% | 56.2%
GZIP: – | – | 52.8% | 45.7%
M88KSIM: 7.5 | 1.2% | 90.9% | 7.7%
MCF: 2.5 | 1.7% | 34.9% | 63.3%
PARSER: 3.6 | 3.2% | 48.7% | 48.0%
VORTEX2K: – | – | 64.9% | 32.2%
VPR: 6.3 | 3.6% | 49.8% | 46.4%
→ exposed loads are quite predictable

47 Throttling Prediction Further
On an exposed load: record the load PC (with the cache tag) in the Exposed Load Table.
On a dependence violation: look up the load PC in the Exposed Load Table and add it to the Violating Loads List.
→ only predict violating loads

48 Forwarded Value Prediction
Application: Incorrect | Correct | Not Confident
COMPRESS: 3.7% | 31.2% | 65.1%
CRAFTY: 5.5% | 24.6% | 69.7%
GO: 3.7% | 28.3% | 67.9%
GZIP: 0.2% | 98.0% | 1.6%
M88KSIM: 5.4% | 91.0% | 3.4%
MCF: 2.5% | 48.5% | 48.9%
PARSER: 2.8% | 11.6% | 85.5%
VORTEX2K: 2.2% | 81.9% | 15.7%
VPR: 2.8% | 26.4% | 70.7%
→ synchronized loads are also predictable

49 Silent Stores
Dynamic, non-stack, silent stores per application:
COMPRESS 80%, CRAFTY 16%, GO 16%, GZIP 4%, M88KSIM 57%, MCF 19%, PARSER 12%, VORTEX2K 84%, VPR 26%
→ silent stores are prevalent

50 Critical Path Prioritization
Issued instructions that are high priority and issued early:
COMPRESS 7.1%, CRAFTY 6.8%, GO 12.9%, GZIP 3.6%, M88KSIM 9.1%, MCF 9.9%, PARSER 9.7%, VORTEX2K 3.6%, VPR 4.7%
→ significant reordering