Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen, Susan Eggers, Henry Levy Department of Computer Science, University of Washington, Seattle.


Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen, Susan Eggers, Henry Levy Department of Computer Science, University of Washington, Seattle Proceedings of ISCA '95, Italy Presented by: Amit Gaur

Overview Instruction Level Parallelism vs. Thread Level Parallelism Motivation Simulation Environment and Workload Simultaneous Multithreading Models Performance Analysis Extensions in Design Single Chip Multiprocessing Summary Current Implementations Retrospective

Instruction Level Parallelism Superscalar processors Shortcomings: a) Instruction Dependencies b) long latencies within single thread

Thread Level Parallelism Traditional Multithreaded Architecture Exploit parallelism at application level Multiple threads: Inherent Parallelism Attack Vertical Waste: memory and functional unit latencies E.g.: Server applications, online transaction processing, web services

Need for Simultaneous Multithreading Attack vertical as well as horizontal waste Fetch instructions from multiple threads each cycle Exploit all parallelism: full utilization of execution resources Decrease in wasted issue slots Comparison with superscalar, fine-grain multithreaded processors, and single-chip, multiple-issue multiprocessors

Simulation Environment Emulation-based, instruction-level simulation Modeled on the Alpha AXP, extended for wide superscalar execution and multithreaded execution Support for increased single-stream parallelism, more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches Code generated using the Multiflow trace-scheduling compiler (static scheduling)

Simulation Environment (Continued) 10 functional units (4 integer, 2 floating point, 3 load/store, 1 branch) All units pipelined In-order issue of dependence-free instructions from an 8-instruction-per-thread scheduling window L1 and L2 caches are on-chip A 2048-entry, 2-bit branch-prediction history table is maintained Support for up to 8 hardware contexts

Workload Specifications SPEC92 benchmark suite simulated To obtain TLP, a distinct program is allocated to each thread: a parallel workload based on multiprogramming For each benchmark, the executable generated with the lowest single-thread execution time is used

Limitations of Superscalar Processors

Superscalar Performance Degradation Many delaying causes overlap Completely eliminating any one cause will not result in a large performance increase 61% vertical waste and 39% horizontal waste Tackle both using simultaneous multithreading
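The vertical/horizontal split can be illustrated with a small sketch: vertical waste is every slot of a completely idle issue cycle, horizontal waste is the unused slots of a partially filled cycle. The 8-wide issue width matches the simulated machine, but the sample trace below is an illustrative assumption.

```python
# Toy classifier for wasted issue slots, following the paper's definitions:
# vertical waste = all slots of a fully idle cycle; horizontal waste = the
# empty slots of a cycle that issued at least one instruction.
ISSUE_WIDTH = 8  # matches the simulated 8-wide machine

def classify_waste(issued_per_cycle):
    """issued_per_cycle: instructions issued in each simulated cycle.
    Returns (vertical, horizontal) as fractions of all issue slots."""
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        empty = ISSUE_WIDTH - issued
        if issued == 0:
            vertical += empty      # completely idle cycle
        else:
            horizontal += empty    # partially filled cycle
    total_slots = ISSUE_WIDTH * len(issued_per_cycle)
    return vertical / total_slots, horizontal / total_slots

# Hypothetical 5-cycle trace: two idle cycles, one full, two partial.
v, h = classify_waste([0, 3, 8, 0, 5])  # v = 0.4, h = 0.2
```

Against the trace above, 16 of 40 slots are vertical waste and 8 are horizontal; the paper's measured 61%/39% split is over all wasted slots, not all slots.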

Simultaneous Multithreading Models Fine-Grain Multithreading: one thread issues instructions in each cycle SM: Full Simultaneous Issue: all eight threads compete for each issue slot, each cycle => maximum flexibility SM: Single Issue, SM: Dual Issue, SM: Four Issue: limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle SM: Limited Connection: each hardware context is connected to exactly one functional unit of each type => least dynamic of all models
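The per-thread issue limits that distinguish these models can be sketched as a one-cycle slot allocator. The `issue_one_cycle` helper and its priority-ordered allocation are simplifying assumptions for illustration, not the paper's actual scheduling hardware.

```python
# Hypothetical one-cycle issue-slot allocator contrasting the SM models:
# `per_thread_cap` = 8 gives Full Simultaneous Issue, while caps of 1, 2,
# or 4 give the SM: Single/Dual/Four Issue variants.
ISSUE_WIDTH = 8

def issue_one_cycle(ready, per_thread_cap):
    """ready: ready-instruction counts, one per thread, in priority order.
    Returns the number of instructions issued per thread this cycle."""
    slots = ISSUE_WIDTH
    issued = []
    for n in ready:
        take = min(n, per_thread_cap, slots)  # thread cap and slot budget
        issued.append(take)
        slots -= take
    return issued

demand = [5, 4, 3, 2]  # illustrative per-thread demand
full = issue_one_cycle(demand, per_thread_cap=8)    # [5, 3, 0, 0]
four = issue_one_cycle(demand, per_thread_cap=4)    # [4, 4, 0, 0]
single = issue_one_cycle(demand, per_thread_cap=1)  # [1, 1, 1, 1]
```

All three variants fill the same number of slots here; the paper's result is that the cheaper limited-issue models come close to full simultaneous issue in practice.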

Hardware Complexities of Models

Design Challenges in SMT Processors Issue-slot usage limited by imbalances in resource needs and resource availability Number of active threads, limitations on buffer sizes, instruction mix from multiple threads Hardware complexity: must implement superscalar issue along with thread-level parallelism Use of priority threads can reduce throughput, as the pipeline is less likely to hold an instruction mix from different threads Mixing many threads also compromises the performance of individual threads Tradeoff: a small number of active threads, and an even smaller number of preferred threads

From Superscalar to SMT SMT is an out-of-order superscalar extended with hardware to support multiple threads Multiple-thread support: a) per-thread program counters b) per-thread return stacks c) per-thread bookkeeping for instruction retirement, traps, and instruction dispatch from the prefetch queue d) thread identifiers, e.g., with BTB and TLB entries Should SMT processors speculate? Determine the role of instruction speculation in SMT.

Instruction Speculation Speculation executes ‘probable’ instructions to hide branch latencies The processor fetches based on a hardware prediction Correct prediction: keep going Incorrect prediction: roll back SMT has two ways to deal with branch-delay stalls: a) speculation b) fetch/issue from other threads SMT and speculation: speculation can be wasteful on SMT, as one thread’s speculative instructions can compete with and replace another’s non-speculative instructions

Performance Evaluation of SMT

Performance Evaluation (Contd.) Fine-grain MT: maximum speedup is 2.1; no gain in vertical-waste reduction beyond 4 threads SMT models: speedup ranges from 3.5 to 4.2, with the issue rate reaching 6.3 IPC The 4-issue model achieves nearly the same performance as full issue; dual issue is at 94% of full issue at 8 threads As the ratio of threads to issue slots increases, the performance of the models increases Tradeoff between the number of hardware contexts and hardware complexity Adverse effect of competition for shared resources: the lowest-priority thread runs slowest More strain on caches due to reduced locality: increase in I- and D-cache misses Overall increase in instruction throughput
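As a back-of-the-envelope check of these numbers, assuming the 8-wide issue machine described earlier and a single-thread superscalar baseline of roughly 1.5 IPC (the baseline figure is an assumption here, not quoted from the slides):

```python
# Illustrative arithmetic relating the reported IPC to utilization and
# speedup. ISSUE_WIDTH matches the simulated machine; the 1.5 IPC
# baseline is an assumed single-thread superscalar figure.
ISSUE_WIDTH = 8

def utilization(ipc):
    """Fraction of issue slots filled at a given IPC."""
    return ipc / ISSUE_WIDTH

def speedup(smt_ipc, baseline_ipc):
    """Throughput speedup over the single-thread baseline."""
    return smt_ipc / baseline_ipc

u = utilization(6.3)   # ~0.79: most issue slots are now used
s = speedup(6.3, 1.5)  # ~4.2x, consistent with the quoted range
```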

Extensions: Alternative Cache Designs for SMT Comparison of private per-thread L1 caches to shared L1 caches, for both instructions and data Shared caches are optimized for a small number of threads The shared D-cache outperforms the private D-cache in all configurations; private I-caches perform better at high numbers of threads

Speculation in SMT

SMT vs. Single-Chip Multiprocessing Similarities: use of multiple register sets, multiple functional units, and the need for high issue bandwidth on a single chip Differences: the multiprocessor uses static allocation of resources; the SM processor allows resource allocation to change every cycle The same configuration is used for testing performance: a) 8 KB private I-cache and D-cache b) 256 KB 4-way set-associative L2 cache c) 2 MB direct-mapped L3 cache The tests attempt to bias in favor of the MP
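The static-vs-dynamic allocation difference can be sketched as follows; the helper functions and demand numbers are illustrative assumptions, not measurements from the paper.

```python
# Sketch contrasting static partitioning (multiprocessor) with dynamic
# sharing (SMT) of functional units between threads in a single cycle.
def mp_issue(demand, units_per_core):
    """Each thread owns a fixed partition of units (static allocation)."""
    return [min(d, units_per_core) for d in demand]

def smt_issue(demand, total_units):
    """All threads draw from one shared pool each cycle (dynamic)."""
    out = []
    for d in demand:
        take = min(d, total_units)
        out.append(take)
        total_units -= take
    return out

# Bursty demand: thread 0 could use 6 units this cycle, thread 1 only 1.
mp = mp_issue([6, 1], units_per_core=4)  # [4, 1]: 3 of thread 1's units idle
smt = smt_issue([6, 1], total_units=8)   # [6, 1]: the burst is absorbed
```

With uneven demand, the statically partitioned machine strands idle units in one core while the other is starved; the dynamically shared machine issues more work from the same hardware, which is the core of the paper's SMT-vs-MP argument.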

Test Results

Test Results (Contd.) Tests A, B, C: a high ratio of FUs and threads to issue bandwidth gives greater opportunity to utilize the issue bandwidth Test D repeats A, but the SMT processor has 10 FUs; it still outperforms the multiprocessor Tests E & F: the MP is allowed greater issue bandwidth; even then the SMT processor shows better performance Test G: both have 8 FUs and 8 issues per cycle, but the SMT processor has 8 contexts while the multiprocessor has 2 processors (2 register sets); the SMT processor delivers 2.5x the performance

Summary Simultaneous multithreading combines facilities of superscalar as well as multithreaded architectures It has the ability to boost utilization of resources by dynamically scheduling functional units among multiple threads Several SMT models have been compared with wide superscalar, fine-grain multithreaded, and single-chip, multiple-issue multiprocessing architectures The simulation results show that: a) a simultaneous multithreaded architecture with a proper configuration can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width b) simultaneous multithreading outperforms fine-grain multithreading by a factor of 2 c) a simultaneous multithreaded processor is superior in performance to a multiple-issue multiprocessor, given the same hardware resources

Commercial Machines MemoryLogix: an SMT processor for mobile devices Sun Microsystems has announced a 4-SMT-processor CMP Hyper-Threading Technology (Intel® Xeon® architecture) Clearwater Networks, a Los Gatos-based startup, was building an 8-context SMT network processor Compaq Computer Corp. designed a 4-context SMT processor, the Alpha EV8

In Retrospect The design of the SMT architecture was influenced by previous projects like the Tera, MIT Alewife, and the M-machine SMT differed from previous projects in that it addressed a more complete and descriptive goal The idea was to exploit thread-level parallelism to make up for the lack of instruction-level parallelism The aim was to target mainstream processor designs like the Alpha 21164