Multithreading and Dataflow Architectures
CPSC 321, Andreas Klappenecker


Plan
T November 16: Multithreading
R November 18: Quantum Computing
T November 23: QC + Exam prep
R November 25: Thanksgiving
M November 29: Review ???
T November 30: Exam
R December 02: Summary and Outlook
T December 07: move to November 29?

Announcements
Office hours: 2:00pm-3:00pm
Bonfire memorial

Parallelism
Hardware parallelism: all current architectures
Instruction-level parallelism: superscalar processors, VLIW processors
Thread-level parallelism: Niagara, Pentium 4, ...
Process parallelism: MIMD computers

What is a Thread? A thread is a sequence of instructions that can be executed in parallel with other sequences. Threads typically share the same resources and have a minimal context.
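As a software analogy (not from the slides), the following minimal Python sketch shows the two defining properties: threads share the process address space, while each thread keeps its own private locals on its own stack.

```python
import threading

# Shared data: all threads in a process see the same address space,
# so this list is visible to every thread.
shared_results = []
lock = threading.Lock()

def worker(thread_id):
    # Local variables live on each thread's private stack.
    local_total = thread_id * 10
    with lock:                      # minimal context: just a lock here
        shared_results.append(local_total)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))  # [0, 10, 20, 30]
```

Software threads as above are scheduled by the operating system; the hardware multithreading discussed next interleaves such threads inside one processor core.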

Threads
Definition 1: Different threads within the same process share the same address space, but have separate copies of the register file, PC, and stack.
Definition 2: Different threads have separate copies of the register file, PC, and page table (more relaxed than the previous definition).
Either way, one can use a multiple-issue, out-of-order execution engine.

Why Thread-Level Parallelism?
Extracting instruction-level parallelism is non-trivial:
hazards and stalls
data dependencies
structural limitations
static optimization limits

Von Neumann Execution Model Each node is an instruction. The pink arrow indicates a static scheduling of the instructions. If an instruction stalls (e.g., due to a cache miss), then the entire program must wait for the stalled instruction to resume execution.

The Dataflow Execution Model Each node represents an instruction. The instructions are not scheduled until run-time. If an instruction stalls, other instructions can still execute, provided their input data is available.
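To make the contrast with the static von Neumann schedule concrete, here is a toy dataflow interpreter (an illustrative sketch, not from the slides): each node fires as soon as all of its operands are available, in whatever order that happens.

```python
# Toy dataflow interpreter: nodes fire when their inputs are ready,
# not in program order.

def run_dataflow(nodes, inputs):
    """nodes: dict name -> (op, [operand names]); inputs: dict name -> value."""
    values = dict(inputs)
    fired = []                       # dynamic firing order
    remaining = dict(nodes)
    while remaining:
        # Any node whose operands are all available may fire this step.
        ready = [n for n, (op, deps) in remaining.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise RuntimeError("deadlock: no node can fire")
        for n in ready:
            op, deps = remaining.pop(n)
            values[n] = op(*[values[d] for d in deps])
            fired.append(n)
    return values, fired

# (a+b) * (c-d): the add and the subtract can fire in either order;
# if one of them stalled, the other could still execute.
nodes = {
    "add": (lambda x, y: x + y, ["a", "b"]),
    "sub": (lambda x, y: x - y, ["c", "d"]),
    "mul": (lambda x, y: x * y, ["add", "sub"]),
}
values, order = run_dataflow(nodes, {"a": 1, "b": 2, "c": 7, "d": 3})
print(values["mul"])  # 12
```

The multiply necessarily fires last because its operands are produced by the other two nodes; everything else is scheduled at run time, which is the essence of the dataflow model.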

The Multithreaded Execution Model Each node represents an instruction and each gray region represents a thread. The instructions within each thread are statically scheduled while the threads themselves are dynamically scheduled. If an instruction stalls, the thread stalls but other threads can continue execution.

Single-Threaded Processors Memory access latency can dominate the processing time: each cache miss can cost hundreds of clock cycles while a single-threaded processor waits for memory. Top: increasing the clock speed improves the processing time, but does not reduce the memory access time.

Multi-Threaded Processors

Multithreading Types
Coarse-grained multithreading: if a thread faces a costly stall, switch to another thread. Usually flushes the pipeline before switching threads.
Fine-grained multithreading: interleave the issue of instructions from multiple threads (cycle by cycle), skipping threads that are stalled. Instructions issued in any given cycle come from the same thread.
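A toy simulation of fine-grained issue (illustrative only; the 3-cycle memory stall and the instruction names are made up) shows how skipping stalled threads hides latency:

```python
def simulate(threads, stall_cycles=3):
    """threads: one instruction list per thread; 'M' marks a memory op
    that stalls its thread for stall_cycles cycles."""
    n = len(threads)
    pc = [0] * n                    # next instruction per thread
    ready_at = [0] * n              # cycle at which each thread is ready again
    trace, cycle, turn = [], 0, 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        issued = False
        for k in range(n):          # round robin, skipping stalled threads
            t = (turn + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                instr = threads[t][pc[t]]
                if instr == 'M':    # memory op: this thread stalls
                    ready_at[t] = cycle + stall_cycles
                trace.append((cycle, t, instr))
                pc[t] += 1
                turn = t + 1
                issued = True
                break
        if not issued:              # every thread stalled: a wasted cycle
            trace.append((cycle, None, 'idle'))
        cycle += 1
    return trace

trace = simulate([['A1', 'M', 'A2'], ['B1', 'B2', 'B3']])
print(trace)
```

In this run, thread 0's stall after 'M' is completely hidden by issuing from thread 1, so all six instructions finish in six cycles with no idle issue slots; a single-threaded machine would have idled through the stall.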

Scalar Execution Dependencies reduce throughput and utilization.

Superscalar Execution

Chip Multiprocessor

Fine-Grained Multithreading Instructions issued in the same cycle come from the same thread.

Fine-Grained Multithreading
Threads are switched every clock cycle, in round-robin fashion, among active threads.
Throughput improves: an instruction can be issued every cycle.
Single-thread performance decreases: with n active threads, each thread issues only on every n-th clock cycle.
Fine-grained multithreading requires hardware modifications to keep track of threads (separate register files, renaming tables, and commit buffers).

Multithreading Types
A single thread cannot effectively use all functional units of a multiple-issue processor.
Simultaneous multithreading fills the issue slots of each clock cycle with instructions from different threads. It is more flexible than fine-grained MT.

Simultaneous Multithreading

Comparison
Superscalar: issues multiple instructions from the same process; suffers both horizontal and vertical waste.
Multithreaded: minimizes vertical waste; tolerates long-latency operations.
Simultaneous multithreading: selects instructions from any "ready" thread.
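The two kinds of waste can be made precise with a little issue-slot bookkeeping (an illustrative sketch; the 4-wide grid below is invented). Horizontal waste is the empty slots in a cycle that issues something; vertical waste is an entirely empty cycle.

```python
# Each row is one cycle of a 4-wide machine; '-' marks an empty issue slot.

def count_waste(grid):
    horizontal = vertical = 0
    for cycle in grid:
        empty = cycle.count('-')
        if empty == len(cycle):
            vertical += len(cycle)   # whole cycle lost (e.g. a stall)
        else:
            horizontal += empty      # partial issue from one thread
    return horizontal, vertical

superscalar = [
    ['A', 'B', '-', '-'],   # limited ILP in one thread: horizontal waste
    ['-', '-', '-', '-'],   # long-latency stall: vertical waste
    ['C', '-', '-', '-'],
]
print(count_waste(superscalar))  # (5, 4)
```

Fine-grained multithreading removes the all-empty rows (vertical waste) by switching threads, but each row still comes from one thread; SMT can also fill the remaining empty slots within a row with instructions from other ready threads.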

Issue Slots: Superscalar vs. Multithreaded vs. SMT

SMT Issues

A Glance at a Pentium 4 Chip Picture courtesy of Tom's Hardware Guide

The Pipeline Trace cache

Intel’s Hyperthreading Patent

Pentium 4 Pipeline
1. Trace cache access, predictor: 5 clock cycles (into the microoperation queue)
2. Reorder buffer allocation, register renaming: 4 clock cycles (into the functional unit queues)
3. Scheduling and dispatch unit: 5 clock cycles
4. Register file access: 2 clock cycles
5. Execution: 1 clock cycle (into the reorder buffer)
6. Commit: 3 clock cycles
Total: 20 clock cycles
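A quick sanity check that the per-stage depths on the slide add up to the quoted pipeline length:

```python
# Stage depths as listed on the slide, in clock cycles.
stages = {
    "trace cache access + predictor": 5,
    "ROB allocation + register renaming": 4,
    "scheduling + dispatch": 5,
    "register file access": 2,
    "execution": 1,
    "commit": 3,
}
total = sum(stages.values())
print(total)  # 20
```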

PACT XPP The XPP processes a stream of data using configurable arithmetic-logic units. The architecture owes much to dataflow processing.

A Matrix-Vector Multiplication Graphic courtesy of PACT
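The PACT graphic itself is not reproduced here, but the idea can be sketched in software (an illustrative analogy, not the actual XPP configuration): the vector is streamed past one multiply-accumulate (MAC) cell per matrix row, the way a dataflow array processes a data stream.

```python
def stream_matvec(matrix, vector):
    """Stream-style matrix-vector multiply: one accumulator per MAC cell."""
    acc = [0] * len(matrix)
    for j, x in enumerate(vector):        # stream vector elements in, one per step
        for i, row in enumerate(matrix):  # every cell multiply-accumulates in parallel
            acc[i] += row[j] * x
    return acc

print(stream_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

On the XPP the inner loop would be spatial rather than temporal: each cell is a configured ALU, and a new vector element flows into all cells every cycle.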

Basic Idea Replace the von Neumann instruction stream, with its fixed instruction scheduling, by a configuration stream. Process streams of data as opposed to processing small data entities one at a time.

von Neumann vs. XPP

Basic Components of XPP
Processing arrays
Packet-oriented communication network
Hierarchical configuration manager tree
A set of I/O modules
The XPP supports the execution of multiple dataflow applications running in parallel.

Four Processing Arrays Graphics courtesy of PACT. SCM is short for supervising configuration manager

Data Processing

Event Packets

XPP 64-A

Further Reading
"PACT XPP – A Reconfigurable Data Processing Architecture" by Baumgarte, May, Nueckel, Vorbach, and Weinhardt