Implicitly-Multithreaded Processors
Il Park, Babak Falsafi, and T. N. Vijaykumar
Presented by: Ashay Rane
Published in: SIGARCH Computer Architecture News, 2003
Agenda
- Overview (IMT, state of the art)
- IMT enhancements
- Key results
- Critique
- Relation to term project
Implicitly Multithreaded Processor (IMT)
- SMT with speculation
- Optimizations to basic SMT support
- Average performance improvement of 24% (max: 69%)
State of the art
- Pentium 4 HT
- IBM POWER5
- MIPS MT
Speculative SMT operation
- When a branch is encountered, start executing the likely path "speculatively", i.e., allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation)
- The cost and overhead must be overcome by savings in execution time and power (and the paper argues it is worth the effort)
- Complications: commits by independent threads (one buffer per thread), plus issue, register renaming, and cache & TLB conflicts
- On a dependence violation, squash the thread and restart execution (see the sketch below)
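To make the squash-and-restart idea concrete, here is a minimal runnable sketch (Python, with invented names; this is not the IMT hardware): a younger thread executes speculatively, the addresses it reads are tracked, and a store by an older thread to one of those addresses forces a squash and re-execution.

```python
# Toy illustration of thread-level speculation: a speculative (younger)
# thread runs ahead, its loads are tracked, and it is squashed and
# re-executed if an older thread later stores to an address it read.

memory = {"x": 0}

def older_thread():
    """Non-speculative thread: eventually stores to x."""
    return {"x": 42}                    # stores as {address: value}

def younger_thread():
    """Speculative thread: loads x and computes from it."""
    reads = {"x": memory["x"]}          # record what was read, for checks
    writes = {"y": memory["x"] + 1}     # buffered, not yet visible
    return reads, writes

# 1. The younger thread executes speculatively, before the older commits.
spec_reads, spec_writes = younger_thread()

# 2. The older thread commits its stores.
older_stores = older_thread()
memory.update(older_stores)

# 3. Dependence check: did the older thread store to anything we read?
violated = set(older_stores) & set(spec_reads)
if violated:
    print("violation on", violated, "-> squash and re-execute")
    spec_reads, spec_writes = younger_thread()   # rollback + restart

# 4. Safe to commit the speculative writes now.
memory.update(spec_writes)
print(memory)                           # {'x': 42, 'y': 43}
```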
How to buffer speculative data?
- Load/Store Queue (LSQ): buffers data along with its address, helps enforce dependence checks, and makes rollback possible (sketch below)
- Cache-based approaches
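A minimal sketch of the LSQ idea, with invented names and no claim to the paper's exact structure: stores are buffered with their addresses, loads are checked against buffered stores, and rollback is simply discarding the buffer.

```python
# Sketch of an LSQ-style speculative buffer: stores are held with their
# addresses so dependences can be checked, and rollback is cheap.

class SpeculativeLSQ:
    def __init__(self):
        self.stores = []          # (addr, value) in program order, uncommitted
        self.loads = []           # addresses this thread read speculatively

    def store(self, addr, value):
        self.stores.append((addr, value))

    def load(self, addr, memory):
        # Forward from the youngest matching buffered store, else memory.
        for a, v in reversed(self.stores):
            if a == addr:
                return v
        self.loads.append(addr)   # track for later dependence checks
        return memory.get(addr, 0)

    def conflicts_with(self, addr):
        """True if an older thread's store to `addr` violates our loads."""
        return addr in self.loads

    def commit(self, memory):
        for addr, value in self.stores:
            memory[addr] = value
        self.stores.clear()
        self.loads.clear()

    def squash(self):
        self.stores.clear()       # rollback: just discard buffered state
        self.loads.clear()

# Usage: a speculative thread loads A; an older thread then stores to A.
memory = {"A": 1}
lsq = SpeculativeLSQ()
lsq.load("A", memory)             # speculative read of A
print(lsq.conflicts_with("A"))    # True -> the thread must be squashed
lsq.squash()
```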
IMT: Most significant improvements
- Assistance from the Multiscalar compiler
- Resource- and dependence-aware fetch policy
- Multiplexing threads on a single hardware context
- Overlapping thread startup operations with the previous thread's execution
What does the compiler do?
- Extracts threads from the program (loops)
- Generates a thread descriptor: data about registers read and written, and control-flow exits (used to set up the rename tables; see the sketch below)
- Annotates instructions with special codes ("forward" & "release") for dependence checking
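The descriptor's shape can be sketched as below; the field names (create_mask, use_mask, exits) are invented for illustration, though the Multiscalar descriptor carries equivalent information.

```python
# Hypothetical shape of a compiler-generated thread descriptor: which
# registers the thread writes ("create") and reads ("use"), plus its
# possible control-flow exits.

from dataclasses import dataclass, field

@dataclass
class ThreadDescriptor:
    create_mask: int = 0          # bit i set => thread writes register i
    use_mask: int = 0             # bit i set => thread reads register i
    exits: list = field(default_factory=list)  # possible successor PCs

    def writes(self, reg):
        return bool(self.create_mask & (1 << reg))

    def reads(self, reg):
        return bool(self.use_mask & (1 << reg))

# Example: a loop body that reads r1, r2 and writes r1, r3, exiting either
# back to the loop head (0x400) or to the fall-through path (0x480).
desc = ThreadDescriptor(create_mask=(1 << 1) | (1 << 3),
                        use_mask=(1 << 1) | (1 << 2),
                        exits=[0x400, 0x480])
print(desc.writes(3), desc.reads(2))   # True True
```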
Fetch policy
- Hardware keeps track of resource utilization
- Resource requirements are predicted from the past four execution instances (sketch below)
- When dependences exist (detected from compiler-generated data), bias fetch towards non-speculative threads
- Goal: reduce the number of thread squashes
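A sketch of the prediction half of this policy, assuming a hypothetical predictor that remembers the last four LSQ-entry counts per thread; this sketch tracks only one resource type, while the paper's hardware considers more.

```python
# Resource-aware fetch decision: predict a thread's needs from its last
# four executions and only fetch it if the prediction fits in free space,
# biasing toward non-speculative threads when dependences are flagged.

from collections import deque

class ResourcePredictor:
    def __init__(self):
        self.history = {}         # thread id -> last 4 observed entry counts

    def record(self, tid, used_entries):
        self.history.setdefault(tid, deque(maxlen=4)).append(used_entries)

    def predict(self, tid):
        h = self.history.get(tid)
        # No history yet: be conservative and assume a large requirement.
        return max(h) if h else 32

def should_fetch(pred, tid, free_entries, speculative, has_dependence):
    if speculative and has_dependence:
        return False              # bias: let non-speculative threads go first
    return pred.predict(tid) <= free_entries

pred = ResourcePredictor()
for n in (6, 8, 7, 5):
    pred.record(0, n)
print(should_fetch(pred, 0, free_entries=10, speculative=True,
                   has_dependence=False))   # True: predicted 8 <= 10 free
```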
Multiplexing threads on a single hardware context
Observations:
- Threads are usually short
- The number of contexts is small (2-8)
- Hence frequent switching and little overlap
Multiplexing (contd.)
Larger threads can lead to:
- Speculation buffer overflow
- Increased dependence mis-speculation
- Hence thread squashing
Instead, each execution context is extended to support multiple threads (3-6)
Multiplexing: Required hardware
Per thread, per context:
- Program counter
- Register rename table
The LSQ is shared among the threads running on one execution context
Multiplexing: Implementation issues
- The LSQ is shared, but it must maintain each thread's loads and stores separately, in program order
- Therefore, "gaps" are reserved for instructions and data that have not yet been fetched (sketch below)
- If the reserved space falls short, the subsequent thread is squashed
- What if threads from one program are mapped to different contexts? IMT searches through the other contexts
- Multiple LSQs (one per thread per context) would be easier, but the cost and power consumption are unattractive
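A toy model of the gap mechanism (invented structure, not the paper's circuitry): slots are reserved per thread so each thread's entries stay contiguous and in program order, and a thread that outgrows its reservation forces the younger thread to be squashed.

```python
# Shared per-context LSQ with per-thread "gaps": reserved slots let a
# later thread start before an earlier thread has fetched all of its
# loads/stores, while keeping every thread's entries in program order.

class SharedLSQ:
    def __init__(self, size):
        self.slots = [None] * size   # None = free, ("GAP", t) = reserved

    def reserve_gap(self, tid, start, count):
        """Reserve `count` contiguous slots for thread `tid`."""
        for i in range(start, start + count):
            self.slots[i] = ("GAP", tid)

    def fill(self, tid, entry):
        """Place an entry into the next reserved gap for `tid`."""
        for i, s in enumerate(self.slots):
            if s == ("GAP", tid):
                self.slots[i] = (tid, entry)
                return True
        return False                 # reservation exhausted -> squash younger

lsq = SharedLSQ(8)
lsq.reserve_gap(tid=0, start=0, count=3)   # older thread: slots 0-2
lsq.reserve_gap(tid=1, start=3, count=3)   # younger thread: slots 3-5
lsq.fill(1, ("LD", 0x100))                 # younger fetches first...
lsq.fill(0, ("ST", 0x200))                 # ...older still lands before it
print(lsq.slots[:6])
```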
Register renaming
- Required because multiple threads may use the same architectural registers
- Separate rename tables:
  - Master rename table (global)
  - Local rename table (per thread)
  - Pre-assign table (per thread)
Register renaming: Flow
On thread invocation:
- Copy the master table into the local table (to reflect the current mappings)
- Also use the "create" and "use" masks from the thread descriptor (for dependence checking)
Before every subsequent thread invocation:
- Pre-assign rename maps into the pre-assign table
- Copy the pre-assign table into the master table and mark those registers "busy", so no successor thread can use them before the current thread writes to them (sketch below)
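A runnable sketch of the three-table flow, using invented names and a trivial physical-register allocator: the master table is snapshotted into a local table at invocation, pre-assigned physical registers are published to the master table and marked busy, and the busy bit clears once the value is produced.

```python
# Three-table rename flow (simplified): master = global arch->phys map,
# local = per-thread snapshot, pre-assign = mappings reserved for the
# registers the thread will write (its "create" mask).

master = {f"r{i}": f"p{i}" for i in range(4)}   # arch reg -> phys reg
busy = set()                                    # phys regs awaiting a write
next_phys = 4

def invoke_thread(create_regs):
    """Start a thread that will write `create_regs` (from its descriptor)."""
    global next_phys
    local = dict(master)              # local table: snapshot of current state
    preassign = {}
    for r in create_regs:             # pre-assign a fresh phys reg per write
        preassign[r] = f"p{next_phys}"
        next_phys += 1
    # Publish pre-assignments to the master table and mark them busy, so a
    # successor thread waits for them instead of reading stale values.
    master.update(preassign)
    busy.update(preassign.values())
    return local, preassign

local0, pre0 = invoke_thread(["r1", "r3"])
print(local0["r1"], master["r1"], master["r1"] in busy)  # p1 p4 True

def write_back(arch_reg, preassign):
    busy.discard(preassign[arch_reg])  # value produced; successors may read

write_back("r1", pre0)
print(master["r1"] in busy)            # False
```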
Hiding thread startup delay
- The rename tables must be set up before execution begins
- Setup occupies table bandwidth, so it cannot be done for many threads in parallel
- Hence, overlap the rename-table setup with the previous thread's execution
Load/Store Queue
- One LSQ per context
- Speculative loads/stores: search through the current and the other contexts for dependences
- No searching is needed for non-speculative loads
- Searching can take time, so load-dependent instructions are scheduled accordingly
Key Results
- Average improvement: 24%
- Reduction in data-dependence stalls
- Little overhead from the optimizations
- Gains do not appear on all benchmark programs
- Assumes 2-3 threads per context and 6-8 LSQ entries per thread
- Performance is measured relative to an IMT with unlimited resources
- ICOUNT: favor the thread with the fewest instructions remaining to be executed
- Biased-ICOUNT: favor non-speculative threads, using worst-case resource estimation
- Result: reduced thread squashing (sketch below)
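A sketch of the two selection policies, with invented thread records; note that descriptions of ICOUNT vary (it is often stated as counting instructions in the front-end pipeline stages), so this follows the slide's reading and models the bias as a fixed penalty on speculative threads.

```python
# ICOUNT-style fetch selection and a biased variant: ICOUNT picks the
# thread with the fewest in-flight instructions; biased-ICOUNT makes
# speculative threads look "fuller" so non-speculative work is favored.

def icount_pick(threads):
    # threads: list of dicts with 'id', 'in_flight', 'speculative'
    return min(threads, key=lambda t: t["in_flight"])["id"]

def biased_icount_pick(threads, penalty=8):
    return min(threads,
               key=lambda t: t["in_flight"]
               + (penalty if t["speculative"] else 0))["id"]

threads = [
    {"id": 0, "in_flight": 12, "speculative": False},
    {"id": 1, "in_flight": 7,  "speculative": True},
]
print(icount_pick(threads))         # 1: fewest instructions in flight
print(biased_icount_pick(threads))  # 0: the non-speculative thread wins
```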
- TME: executes both paths of an unpredictable branch (but such branches are uncommon)
- DMT: hardware selection of threads, so it spawns threads on backward branches or function calls instead of loops; it also spawns threads out of order, so branch prediction accuracy is lower
Critique
Compiler support
- Improvements are shown for applications compiled with the Multiscalar compiler
- These are scientific-computing applications, not desktop applications
LSQ limitations
- The LSQ size decides the size of a speculative thread
- Pentium 4 (without SMT): 48 loads, 24 stores
- Pentium 4 HT: 24 loads, 12 stores per thread
- IBM POWER5: 32 loads, 32 stores per thread
LSQ limitations: Alternative
- Cache-based approach, i.e., partition the cache to support different speculative versions
- Extra support required, but scalable
Register file size
- IMT considers register file sizes of 128 and up
- Pentium 4 (as well as HT): register file size = 128
- IBM POWER5: register file size = 80
Searching the LSQ
- Since loads and stores are organized per thread, a search must examine all entries belonging to the other threads
- If loads and stores were organized by address instead, fewer entries would need to be searched (sketch below)
- This could exploit the associativity of a cache
So how is performance still high?
- Assistance from the compiler
- Resource- and dependence-aware fetching
- Multiple threads per execution context
- Overlapping rename-table creation with execution
Term project
"Cache-based throughput improvement techniques for Speculative SMT processors"
- Optimizations from IMT
- Increasing granularity to reduce the number of thread squashes
Thank you