Multithreaded ASC Kevin Schaffer and Robert A. Walker ASC Processor Group Computer Science Department Kent State University

Organization of an ASC Computer

Broadcast/Reduction Bottleneck
- The time to perform a broadcast or reduction increases as the number of PEs increases
- Even for a moderate number of PEs, this time can dominate the machine cycle time
- Pipelining reduces the cycle time but increases the latency
- The additional latency causes pipeline hazards
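
As a rough illustration of this trade-off, the sketch below models a tree-structured broadcast/reduction network. The tree structure, per-level delay, and overhead values are assumptions chosen for illustration, not figures from this work: the unpipelined cycle time grows with log2 of the PE count, while pipelining one tree level per cycle keeps the cycle short at the cost of several cycles of latency.

```python
# Rough timing model of the broadcast/reduction bottleneck (a sketch with
# assumed numbers, not measurements): a tree-structured network whose delay
# grows with log2 of the number of PEs.
import math

LEVEL_DELAY_NS = 2.0   # assumed delay of one level of the broadcast/reduction tree
OTHER_LOGIC_NS = 5.0   # assumed delay of the rest of the datapath in one cycle

def unpipelined_cycle_ns(num_pes: int) -> float:
    """Cycle time when the whole broadcast/reduction tree must fit in one cycle."""
    return OTHER_LOGIC_NS + LEVEL_DELAY_NS * math.ceil(math.log2(num_pes))

def pipelined_cycle_and_latency(num_pes: int) -> tuple[float, int]:
    """One tree level per pipeline stage: the cycle time is set by the slowest
    stage, but the broadcast/reduction now takes log2(N) cycles of latency,
    which is what introduces the pipeline hazards."""
    stages = math.ceil(math.log2(num_pes))
    return max(OTHER_LOGIC_NS, LEVEL_DELAY_NS), stages

for n in (16, 256, 4096):
    flat = unpipelined_cycle_ns(n)
    cyc, lat = pipelined_cycle_and_latency(n)
    print(f"{n:5d} PEs: unpipelined cycle {flat:5.1f} ns | "
          f"pipelined cycle {cyc:4.1f} ns, latency {lat} cycles")
```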

Instruction Types
- Scalar instructions
  - Execute entirely within the control unit
- Broadcast/Parallel instructions
  - Execute within the PE array
  - Use the broadcast network to transfer instructions and data
- Reduction instructions
  - Execute within the PE array
  - Use the broadcast network to transfer instructions and data
  - Use the reduction network to combine data from the PEs

Scalar Pipeline
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Execute (EX)
- Memory Access (M)
- Write Back (W)
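
The small sketch below is illustrative only (stage names follow the slide above): it prints an occupancy diagram showing how consecutive instructions overlap in the five stages, each advancing one stage per cycle.

```python
# Minimal sketch of overlapped execution in the five-stage scalar pipeline.
STAGES = ["IF", "ID", "EX", "M", "W"]

def pipeline_diagram(instructions: list[str]) -> None:
    """Print one row per instruction, one column per cycle."""
    total_cycles = len(instructions) + len(STAGES) - 1
    for i, instr in enumerate(instructions):
        row = ["  . "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:>4}"      # instruction i is in stage s at cycle i + s
        print(f"{instr:>6} |" + "".join(row))

pipeline_diagram(["add", "sub", "load", "store"])
```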

Hazards in a Scalar Pipeline

Unified SIMD Pipeline
- Broadcast stages (B1...Bn)
- Reduction stages (R1...Rn)
- Number of stages is variable
- All instructions go through every stage

Diversified SIMD Pipeline
- Separate paths for each instruction type, so instructions only go through the stages they use
- Stalls less often than a unified pipeline organization
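
A minimal sketch of this idea is shown below. The exact stage ordering and the broadcast/reduction depth of three are assumptions for illustration, not the processor's actual stage layout: each instruction class follows only the path it needs, so scalar instructions never wait in broadcast or reduction stages.

```python
# Sketch of the diversified-pipeline idea (illustrative stage paths only).
BROADCAST_STAGES = [f"B{i}" for i in range(1, 4)]   # assumed broadcast depth of 3
REDUCTION_STAGES = [f"R{i}" for i in range(1, 4)]   # assumed reduction depth of 3

PATHS = {
    "scalar":    ["IF", "ID", "EX", "M", "W"],
    "broadcast": ["IF", "ID"] + BROADCAST_STAGES + ["EX", "W"],
    "reduction": ["IF", "ID"] + BROADCAST_STAGES + ["EX"] + REDUCTION_STAGES + ["W"],
}

for kind, path in PATHS.items():
    print(f"{kind:>9}: {' -> '.join(path)}  ({len(path)} stages)")
```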

Hazards

Multithreading
- Pipelining alone cannot eliminate hazards caused by broadcast and reduction latencies
- Solution: use instructions from multiple threads to keep the pipeline full
- Instructions from different threads are independent, so they cannot generate stalls due to data dependencies
- As long as there are a sufficient number of threads, it is possible to fill any number of stall cycles

Types of Multithreading
- Coarse-grain multithreading switches to a new thread when the current thread encounters a high-latency operation
- Fine-grain multithreading switches to a new thread every clock cycle
- Simultaneous multithreading can issue instructions from multiple threads in the same clock cycle
- For a SIMD processor, fine-grain or simultaneous multithreading is necessary because pipeline stalls are relatively short and occur frequently
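
The toy model below (assumptions only, not the actual control unit) illustrates the fine-grain case: each cycle the scheduler issues from the next thread that is not waiting on a long-latency reduction, so stall cycles are filled with useful work from other threads. The reduction latency of 4 cycles is an arbitrary illustrative value.

```python
# Toy model of fine-grain multithreading hiding reduction latency.
from collections import deque

REDUCTION_LATENCY = 4   # assumed latency of a reduction, in cycles

class Thread:
    def __init__(self, tid, program):
        self.tid = tid
        self.program = deque(program)   # e.g. ["op", "reduce", "op"]
        self.ready_at = 0               # first cycle at which this thread may issue again

def run(threads, cycles):
    for cycle in range(cycles):
        issued = None
        for t in threads:               # pick the first ready thread, round-robin order
            if t.program and cycle >= t.ready_at:
                op = t.program.popleft()
                if op == "reduce":
                    t.ready_at = cycle + REDUCTION_LATENCY   # wait for the reduction result
                issued = f"T{t.tid}:{op}"
                break
        print(f"cycle {cycle:2d}: {issued or 'stall'}")
        threads.append(threads.pop(0))  # rotate so a different thread is preferred next cycle

run([Thread(0, ["op", "reduce", "op"]),
     Thread(1, ["op", "reduce", "op"]),
     Thread(2, ["op", "op", "op"])], cycles=10)
```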

Multithreaded Control Unit

Reduction Hazard with a Single Thread

Reduction Hazard with Multiple Threads

Execution Time vs. Latency

Throughput vs. Latency

Multithreaded ASC vs. MASC
- A multithreaded ASC computer can execute at most one instruction per cycle
- A MASC computer with j instruction streams can execute up to j instructions per cycle
- In multithreaded ASC, each thread can access every PE
- In MASC, each instruction stream can only access its own partition of PEs
- A multithreaded MASC computer could combine the advantages of both

ASC

Multithreaded ASC

MASC

Multithreaded MASC

Multithreaded ASC Processor
- To validate simulation results and estimate hardware costs, a prototype processor was developed
- Targeted for an Altera Cyclone II (EP2C35) FPGA
- Using an FPGA makes it possible to get detailed measurements of speed and hardware cost

Additional Enhancements
- Flags (logical values) are a first-class data type with their own set of registers and instructions
- Extra reduction operators
  - Count Responders
  - Sum
- Hardware semaphores for thread synchronization
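
The software model below shows one plausible reading of the two extra reduction operators (the exact hardware semantics, in particular whether Sum is restricted to responding PEs, is my assumption): Count Responders counts the PEs whose responder flag is set, and Sum combines a data value from the responding PEs.

```python
# Illustrative software model of the two extra reduction operators.
def count_responders(flags):
    """Number of PEs whose responder flag is set."""
    return sum(1 for f in flags if f)

def sum_reduction(flags, values):
    """Sum of the values contributed by responding PEs (assumed semantics)."""
    return sum(v for f, v in zip(flags, values) if f)

flags  = [1, 0, 1, 1, 0, 1, 0, 0]   # one responder flag per PE
values = [3, 9, 5, 2, 7, 1, 4, 8]   # one data word per PE

print(count_responders(flags))       # -> 4
print(sum_reduction(flags, values))  # -> 11
```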

Synthesis Results
- Targeted for an Altera Cyclone II FPGA (EP2C35)
- 16 x 16-bit PEs
- 16 hardware threads
- Clock speed: 75 MHz