Simultaneous Multithreading in Superscalar Processors

Simultaneous Multithreading in Superscalar Processors By Connor Sample

What is Simultaneous Multithreading (SMT)?
- The ability of a processor to execute multiple instructions from multiple distinct threads in the same cycle.
- Goal: increased processor throughput and better utilization of the processor's resources.
- Addresses the two major bottlenecks modern processors face:
  - Long latencies
  - Limited per-thread parallelism
- Used on superscalar processors.

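Superscalar processors
- Try to execute multiple instructions from a single thread in each cycle.
- Issue slots go unused whenever that one thread cannot supply enough independent instructions, for example while it waits on a long-latency operation.
- Typically also pipelined, though the two are separate performance-enhancement techniques:
  - Pipelining: overlaps the stages of successive instructions, so several instructions are in flight through one execution unit at once
  - Superscalar: multiple execution units, so several instructions can be issued in the same cycle
[Figure: processor board from a Cray T3E supercomputer]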
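To make the difference between pipelining and superscalar issue concrete, here is a minimal sketch of the textbook idealized cycle counts. It assumes no hazards, stalls, or branches, and the function names and the example numbers (100 instructions, 5 stages, 4-wide issue) are illustrative rather than taken from the slides.

```python
import math

def cycles_unpipelined(n: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n * stages

def cycles_pipelined(n: int, stages: int) -> int:
    """One instruction enters the pipeline per cycle once it is full."""
    return stages + n - 1

def cycles_superscalar(n: int, stages: int, width: int) -> int:
    """Up to `width` instructions enter the pipeline per cycle."""
    return stages + math.ceil(n / width) - 1

n, stages, width = 100, 5, 4  # hypothetical workload and machine
print(cycles_unpipelined(n, stages))         # no overlap at all
print(cycles_pipelined(n, stages))           # overlap between stages
print(cycles_superscalar(n, stages, width))  # overlap plus parallel issue
```

Real code rarely reaches the superscalar figure, because a single thread seldom has `width` independent instructions ready every cycle. That gap is what SMT tries to fill.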

Superscalar processors + SMT
[Figure panels: regular superscalar processor, multithreaded architecture, SMT architecture]
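The comparison can be approximated with a toy issue-slot model. This is a sketch under stated assumptions, not a processor simulator: the per-cycle "ready instruction" counts are invented, the machine is assumed to be 4-wide, and slots are handed to threads in simple index order.

```python
from typing import List

def issue_cycle(ready: List[int], width: int) -> List[int]:
    """Fill up to `width` issue slots for one cycle.

    ready[i] is how many independent instructions thread i could issue
    this cycle. Returns the number of slots granted to each thread.
    """
    granted = [0] * len(ready)
    slots = width
    for tid, n in enumerate(ready):
        take = min(n, slots)
        granted[tid] = take
        slots -= take
    return granted

def utilization(trace: List[List[int]], width: int) -> float:
    """Fraction of issue slots used over a trace of per-cycle ready counts."""
    used = sum(sum(issue_cycle(ready, width)) for ready in trace)
    return used / (width * len(trace))

# Hypothetical 4-cycle trace. Thread 0 stalls in the last two cycles
# (say, on a cache miss) while thread 1 still has work to do.
thread0 = [2, 1, 0, 0]
thread1 = [2, 2, 3, 1]

single = utilization([[a] for a in thread0], width=4)
smt = utilization([[a, b] for a, b in zip(thread0, thread1)], width=4)
print(f"single thread: {single:.2%}, SMT with two threads: {smt:.2%}")
```

In this made-up trace the second thread fills slots that the first leaves empty, both within a cycle and across the cycles where the first thread is stalled.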

Architecture
SMT architecture draws directly from traditional superscalar processors:
- The fetch unit fetches as many instructions per cycle as it is allowed.
- Instructions are decoded and passed to a register renaming unit, which maps logical registers to physical registers taken from a pool of unused registers, eliminating false dependencies.
- Instructions are placed in one of two queues to await issue.
- Instructions are issued as soon as the resources they need become available.
- Once an instruction completes it is retired, and physical registers that are no longer needed are freed for use by other instructions.
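The renaming step is the least intuitive part of this flow, so here is a minimal sketch of it. The instruction format (a destination plus a list of sources), the register names, and the free-list size are all assumptions made for illustration. The sketch also omits returning registers to the free list at retirement.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Instr = Tuple[str, List[str]]  # (destination register, source registers)

def rename(program: List[Instr], num_physical: int) -> List[Instr]:
    """Map logical registers to physical ones taken from a free list."""
    free: Deque[str] = deque(f"p{i}" for i in range(num_physical))
    mapping: Dict[str, str] = {}
    renamed: List[Instr] = []
    for dest, srcs in program:
        phys_srcs = []
        for s in srcs:
            # A source with no mapping yet is a live-in value: give it a register.
            if s not in mapping:
                mapping[s] = free.popleft()
            phys_srcs.append(mapping[s])
        # Every write gets a fresh physical register, so a later write to the
        # same logical register cannot clash with earlier readers or writers.
        mapping[dest] = free.popleft()
        renamed.append((mapping[dest], phys_srcs))
    return renamed

program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r4", ["r1", "r2"]),  # r4 = r1 + r2   (true dependence on r1)
    ("r1", ["r5", "r6"]),  # r1 = r5 + r6   (reuses r1: false dependence)
    ("r7", ["r1", "r4"]),  # r7 = r1 + r4   (must see the second r1)
]
renamed = rename(program, num_physical=16)
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the two writes to `r1` target different physical registers, so the third instruction no longer has to wait for the first two. The true dependences (the second and fourth instructions reading earlier results) are preserved.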

Architectural Additions
Some additions to this architecture are needed to enable SMT:
- Extra program counters, one per thread
- A mechanism for the fetch unit to select which program counter to fetch from
- Per-thread instruction retirement, queue flushing, and trap handling
- A larger register file to accommodate the increased demand for registers
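A sketch of the first two additions follows: one program counter per thread and a selector in the fetch unit. The slides do not say which selection policy is used, so the round-robin policy here is an assumption, as are the `FetchUnit` class, the `stalled` flags, and the fixed 4-byte instruction size with no branches.

```python
from typing import List, Optional, Tuple

class FetchUnit:
    """Toy fetch unit holding one program counter per hardware thread."""

    def __init__(self, start_pcs: List[int]) -> None:
        self.pcs = list(start_pcs)
        self.stalled = [False] * len(start_pcs)
        self._next = 0  # thread to consider first on the next selection

    def select_thread(self) -> Optional[int]:
        """Round-robin over threads, skipping any that are stalled."""
        n = len(self.pcs)
        for offset in range(n):
            tid = (self._next + offset) % n
            if not self.stalled[tid]:
                self._next = (tid + 1) % n
                return tid
        return None  # every thread is stalled this cycle

    def fetch(self, width: int) -> Optional[Tuple[int, int]]:
        """Return (thread id, fetch address) and advance that thread's PC."""
        tid = self.select_thread()
        if tid is None:
            return None
        start = self.pcs[tid]
        self.pcs[tid] += width * 4  # assumes 4-byte instructions, no branches
        return tid, start

fu = FetchUnit([0x1000, 0x2000])
a = fu.fetch(4)       # thread 0
b = fu.fetch(4)       # thread 1
fu.stalled[1] = True  # e.g. thread 1 is waiting on a miss
c = fu.fetch(4)       # thread 0 again
d = fu.fetch(4)       # still thread 0, since thread 1 is skipped
print(a, b, c, d)
```

Because each thread keeps its own program counter, skipping a stalled thread costs nothing: the fetch unit simply draws from another thread's instruction stream that cycle.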

Pipeline Changes
- The larger register file takes longer to access.
- This is addressed by allocating an extra cycle to register operations:
  - The first cycle moves register data closer to its final destination
  - The second cycle issues the instruction and its associated values for execution

Advantages
- Increased throughput
- Increased power efficiency
- Architecture changes are minimal and simple to implement
- Performance gains will increase with improvements in technology

Disadvantages
Potential performance bottlenecks:
- Fetching: fetch bandwidth becomes an issue as the number of threads grows
- Memory: memory throughput is not increased by multithreading, so more threads compete for the same bandwidth
- Registers: a larger register file is slower to access, and unused registers can add to that cost
Performance gains depend on the program's ability to use SMT, meaning developers have to build specifically with SMT in mind. If SMT proves to be a detriment to a program's performance, developers will have to add code to disable it.