Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar Presented by: Ashay Rane Published in: SIGARCH Computer Architecture News, 2003

Agenda  Overview (IMT, state-of-art) ‏  IMT enhancements  Key results  Critique  Relation to Term Project

Implicitly Multithreaded Processor (IMT)
 SMT with speculation
 Optimizations to basic SMT support
 Average performance improvement of 24% (max: 69%)

State of the art
 Pentium 4 HT
 IBM POWER5
 MIPS MT

Speculative SMT operation
 When a branch is encountered, start executing the likely path "speculatively", i.e., allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation)
 The cost and overhead are overcome by savings in execution time and power (worth the effort)
 Commit by independent threads complicates the design (one buffer per thread); so do issue, register renaming, and cache & TLB conflicts
 On a dependence violation, squash the thread and restart execution

How to buffer speculative data?
 Load/Store Queue (LSQ)
 – Buffers data (along with its address)
 – Helps enforce dependence checks
 – Makes rollback possible
 Cache-based approaches (an alternative)
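The buffering and rollback roles described above can be sketched in a few lines of Python. This is an illustrative software model, not the paper's hardware design; all class and method names are invented for the sketch.

```python
# Toy model of a speculative LSQ: stores are buffered (not yet visible in
# memory), loads are tracked so a later-arriving store from an earlier thread
# can be checked for a dependence violation, and squash/commit implement
# rollback vs. making the speculative state architectural.

class SpeculativeLSQ:
    def __init__(self, memory):
        self.memory = memory          # committed architectural state
        self.stores = {}              # addr -> speculatively stored value
        self.loads = set()            # addresses this thread has loaded

    def load(self, addr):
        self.loads.add(addr)
        # forward from our own buffered store first, else read memory
        return self.stores.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        self.stores[addr] = value     # buffered, not yet visible

    def violates(self, addr):
        # an earlier thread just stored to addr: did we load it too soon?
        return addr in self.loads

    def squash(self):
        # rollback: discard all speculative state, memory is untouched
        self.stores.clear()
        self.loads.clear()

    def commit(self):
        # thread became non-speculative: make buffered stores visible
        self.memory.update(self.stores)
        self.stores.clear()
        self.loads.clear()
```

A squash leaves memory exactly as it was, which is what makes speculative execution safe to roll back.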

IMT: Most significant improvements
 Assistance from the Multiscalar compiler
 A resource- and dependence-aware fetch policy
 Multiplexing threads on a single hardware context
 Overlapping thread startup operations with the previous thread's execution

What does the Compiler do?
 Extracts threads from the program (loops)
 Generates a thread descriptor: data about registers read and written, and control-flow exits (used to build rename tables)
 Annotates instructions with special codes ("forward" & "release") for dependence checking

Fetch Policy
 Hardware keeps track of resource utilization
 Resource requirements are predicted from the past four execution instances
 When dependences exist (detected from compiler-generated data), bias towards non-speculative threads
 Goal: reduce the number of thread squashes
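A toy model of this fetch policy follows. The four-instance history window comes from the slide; everything else (names, taking the worst case over the window, measuring resources in LSQ entries) is an assumption for illustration.

```python
# Sketch of a resource- and dependence-aware fetch selector: each thread
# predicts its resource need from its last four executions, and selection is
# biased towards non-speculative threads among those that fit.

from collections import deque

class ThreadInfo:
    def __init__(self, speculative):
        self.speculative = speculative
        self.history = deque(maxlen=4)   # resource use of last 4 instances

    def predicted_need(self):
        # worst case over the recent past ("worst-case resource estimation")
        return max(self.history) if self.history else 0

def pick_thread(threads, free_entries):
    # prefer threads whose predicted need fits in the free resources;
    # among those, prefer non-speculative threads, then smaller footprints
    candidates = [t for t in threads if t.predicted_need() <= free_entries]
    if not candidates:
        candidates = list(threads)
    candidates.sort(key=lambda t: (t.speculative, t.predicted_need()))
    return candidates[0]
```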

Multiplexing threads on a single hardware context
 Observations:
 – Threads are usually short
 – The number of hardware contexts is small (2-8)
 Hence frequent switching and little overlap

Multiplexing (contd.)
 Larger threads can lead to:
 – Speculation buffer overflow
 – Increased dependence mis-speculation
 – Hence thread squashing
 Each execution context can therefore support multiple threads (3-6)

Multiplexing: Required Hardware
 Per context, per thread:
 – Program Counter
 – Register rename table
 The LSQ is shared among the threads running on one execution context

Multiplexing: Implementation Issues
 The LSQ is shared, but it must maintain loads and stores for each thread separately
 Therefore, create "gaps" for yet-to-be-fetched instructions/data
 If space falls short, squash the subsequent thread
 What if threads from one program are mapped to different contexts? IMT searches through the other contexts
 Multiple LSQs per context per thread would be simpler, but cost and power consumption suffer
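The "gaps" idea can be sketched as reserving a fixed-size region of the shared LSQ per thread, so entries stay in program order even when a later thread fetches before an earlier one has filled its slots. The region size, return conventions, and names below are illustrative, not from the paper.

```python
# Sketch of a shared per-context LSQ with per-thread reserved regions
# ("gaps"). reserve() fails when the queue cannot hold another thread's
# region; insert() fails on gap overflow -- in both cases the caller would
# squash the offending (subsequent) thread.

class SharedLSQ:
    def __init__(self, total_entries, gap_size):
        self.entries = [None] * total_entries
        self.gap_size = gap_size
        self.regions = {}                    # thread id -> base index

    def reserve(self, tid):
        base = len(self.regions) * self.gap_size
        if base + self.gap_size > len(self.entries):
            return False                     # no room: squash this thread
        self.regions[tid] = base
        return True

    def insert(self, tid, slot, op):
        # slot = the op's position within the thread's reserved gap
        if slot >= self.gap_size:
            return False                     # gap overflow: squash this thread
        self.entries[self.regions[tid] + slot] = op
        return True
```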

Register renaming
 Required because multiple threads may use the same registers
 Separate rename tables:
 – Master Rename Table (global)
 – Local Rename Table (per thread)
 – Pre-assign table (per thread)

Register renaming: Flow
 On thread invocation:
 – Copy from the Master table into the Local table (to reflect current status)
 – Also use the "create" and "use" masks from the thread descriptor (for dependence checking)
 Before every subsequent thread invocation:
 – Pre-assign rename maps into the Pre-assign table
 – Copy from the Pre-assign table into the Master table and mark those registers as "busy", so no successor thread can use them before the current thread writes to them
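A minimal sketch of this flow, with the tables modeled as dictionaries. The physical-register allocator, the ('busy'/'ready') encoding, and the function names are assumptions for illustration only.

```python
# Sketch of the master/local/pre-assign rename flow: on invocation, snapshot
# the master table into the thread's local table, pre-assign a physical
# register for each architectural register the thread will create, and mark
# those entries busy in the master table so successors wait on them.

def invoke_thread(master, create_mask, next_free_phys):
    local = dict(master)                     # snapshot current mappings
    preassign = {}
    for arch_reg in create_mask:             # regs this thread will write
        phys = next_free_phys()              # allocate a physical register
        preassign[arch_reg] = phys
        master[arch_reg] = ("busy", phys)    # successors must wait on this
    return local, preassign

def release(master, arch_reg):
    # the producing thread wrote the value: clear the busy bit so
    # successor threads may now read the register
    _, phys = master[arch_reg]
    master[arch_reg] = ("ready", phys)
```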

Hiding thread startup delay
 Rename tables must be set up before execution begins
 Setup occupies table bandwidth, so it cannot be done for many threads in parallel
 Hence, overlap rename table setup with the previous thread's execution

Load/Store Queue
 One per context
 Speculative loads/stores: search the current and other contexts for dependences
 No searching for non-speculative loads
 Searching can take time, so load-dependent instructions are scheduled to account for the latency

Key Results

 Average improvement: 24%
 Reduction in data dependence stalls
 Little overhead from the optimizations
 Gains do not appear on all benchmark programs

Assumes 2-3 threads per context and 6-8 LSQ entries per thread; performance is shown relative to an IMT with unlimited resources.

 ICOUNT: favor the thread with the fewest instructions remaining to execute
 Biased-ICOUNT: additionally favor non-speculative threads
 Worst-case resource estimation
 Reduced thread squashing
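The two fetch heuristics can be contrasted in a short sketch. The fixed bias value for non-speculative threads is an assumption for illustration, not a figure from the paper.

```python
# Plain ICOUNT picks the thread with the fewest in-flight instructions;
# biased-ICOUNT subtracts a fixed bonus for non-speculative threads, so a
# non-speculative thread can win even with more instructions in flight.

def icount(threads):
    # threads: list of (in_flight_count, is_speculative); returns an index
    return min(range(len(threads)), key=lambda i: threads[i][0])

def biased_icount(threads, bias=4):
    def key(i):
        count, speculative = threads[i]
        return count if speculative else count - bias
    return min(range(len(threads)), key=key)
```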

TME: executes both paths of an unpredictable branch (but such branches are uncommon)
DMT:
 – Hardware selection of threads, so it spawns threads on backward branches or function calls instead of loops
 – Also spawns threads out of order, so branch prediction accuracy is lower

Critique

Compiler Support
 Improvements apply to applications compiled with the Multiscalar compiler
 Targets scientific computing applications, not desktop applications

LSQ Limitations
 LSQ size determines the size of a speculative thread
 Pentium 4 (without SMT): 48 loads, 24 stores
 Pentium 4 HT: 24 loads, 12 stores per thread
 IBM POWER5: 32 loads, 32 stores per thread

LSQ Limitations: Alternative
 Cache-based approach, i.e., partition the cache to hold different speculative versions
 Extra support required, but scalable

Register file size
 IMT considers register file sizes of 128 and up
 Pentium 4 (as well as HT): register file size = 128
 IBM POWER5: register file size = 80

Searching the LSQ
 Since loads and stores are organized per thread, a search must examine all entries of the other threads
 If loads/stores were organized by address instead, fewer entries would need to be searched
 Can make use of cache associativity
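The address-organized alternative can be sketched as a set-indexed store buffer, in the spirit of a set-associative cache lookup. The set count and all names below are illustrative.

```python
# Sketch of an address-indexed store buffer: stores are placed in a set
# chosen by their address, so a dependence check examines only the one set
# that could match, instead of scanning every other thread's entries.

from collections import defaultdict

class AddressIndexedQueue:
    def __init__(self, num_sets=16):
        self.num_sets = num_sets
        self.sets = defaultdict(list)    # set index -> [(thread, addr, value)]

    def add_store(self, thread, addr, value):
        self.sets[addr % self.num_sets].append((thread, addr, value))

    def conflicting_threads(self, addr):
        # only one set is searched, mirroring a set-associative cache lookup
        return {t for (t, a, _) in self.sets[addr % self.num_sets] if a == addr}
```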

Searching LSQ (contd.)

So how is performance still high?
 Assistance from the compiler
 Resource- and dependence-aware fetching
 Multiple threads per execution context
 Overlapping rename table creation with execution

Term project
 "Cache-based throughput improvement techniques for Speculative SMT processors"
 Optimizations from IMT
 Increasing granularity to reduce the number of thread squashes

Thank you