The Vector-Thread Architecture

The Vector-Thread Architecture By: Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, and Krste Asanović Presented by: Andrew P. Wilson

Agenda Motivation Vector-Thread Abstract Model Vector-Thread Physical Model SCALE Vector-Thread Architecture Overview Code Example Microarchitecture Prototype Evaluation Conclusion

Motivation Parallelism and locality are key application characteristics, but conventional sequential ISAs provide minimal support for encoding either. As a result, high-performance implementations devote much area and power to on-chip structures that extract parallelism and support arbitrary global communication.

Motivation Large area and power overheads are justified for even small performance improvements. Yet many applications have parallelism that can be statically determined. ISAs that expose more parallelism require less area and power, since resources need not be devoted to dynamically determining dependencies.

Motivation ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnect. The challenge: develop an efficient encoding of parallel dependency graphs for the microarchitecture that will execute them.

Motivation The SCALE vector-thread architecture is designed for low-power and high-performance embedded applications. Benchmarks show that embedded domains can be mapped efficiently to SCALE, with multiple types of parallelism exploited simultaneously.

VT Abstract Model The vector-thread architecture unifies the vector and multithreaded execution models. It consists of a conventional scalar control processor and an array of slave virtual processors (VPs). Benefits: large amounts of structural parallelism can be compactly encoded; the microarchitecture stays simple; and high performance is achieved at low power by avoiding complex control and datapath structures and by reducing activity on long wires.

VT Abstract Model [diagram: control processor issuing commands to a vector of virtual processors]

VT Abstract Model The control processor gives work out to the virtual processor vector, an array of virtual processors; the two use separate instruction sets. The model is well suited to loops: each VP executes a single iteration of the loop while the control processor manages the overall execution.

VT Abstract Model Virtual Processor A VP has a set of registers and executes strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). AIBs can be obtained in two ways: the control processor can broadcast an AIB to all VPs (data-parallel code) using a vector-fetch command or send one to a specific VP using a VP-fetch command, or the VPs can fetch their own AIBs (thread-parallel code) using a thread-fetch command. There is no automatic program counter or implicit instruction fetch mechanism; every AIB must be explicitly requested by the control processor or by the VP itself.

VT Abstract Model Vector-Fetch example: vector-vector add loop AIB consists of two loads, an add, and a store AIB is sent to all VPs via vector-fetch command All VPs execute the same instructions but on different data elements depending on VP index number vl iterations of the loop executed at once r0 = VP index r1, r2 = input vector base addresses r3 = output vector base address
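In C terms, the AIB body corresponds to one iteration of a plain loop. As a sketch (the function names are illustrative, not SCALE code), VP number i performs the work of iteration i, and a vector-fetch conceptually launches vl such iterations at once:

```c
#include <assert.h>

/* One VP's share of the vector-vector add AIB: VP index i (r0)
 * performs two loads, an add, and a store for element i. */
void vvadd_vp(int i, const int *a, const int *b, int *c) {
    c[i] = a[i] + b[i];
}

/* Sequential reference for what one vector-fetch of the AIB computes:
 * vl loop iterations, one per VP. */
void vvadd(int vl, const int *a, const int *b, int *c) {
    for (int i = 0; i < vl; i++)
        vvadd_vp(i, a, b, c);
}
```

The point of the encoding is that the hardware receives the inner function once (the AIB) plus a single vector-fetch command, rather than vl separate instruction streams.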

VT Abstract Model Thread-fetch example: pointer-chasing VP thread. Thread-fetches can be predicated. A VP thread persists until no more fetches occur and the current AIB is complete; the next command from the control processor is ignored until the VP thread has finished.
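A sequential C sketch of the pointer-chasing thread may help: each node visited corresponds to one execution of the AIB, and the loop condition models the predicated thread-fetch that re-requests the AIB while the next pointer is non-null. The node type and function name here are hypothetical, for illustration only:

```c
#include <stddef.h>

/* Hypothetical list node for the pointer-chasing example. */
struct node {
    int value;
    struct node *next;
};

/* What one VP thread computes: each iteration is one AIB execution,
 * and the while-condition plays the role of the predicated thread-fetch
 * that keeps the VP thread alive while there is another node to visit. */
int chase_sum(const struct node *p) {
    int sum = 0;
    while (p) {
        sum += p->value;  /* AIB body: load and accumulate */
        p = p->next;      /* load next pointer; thread-fetch if non-null */
    }
    return sum;
}
```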

VT Abstract Model Vector-fetching and thread-fetching combined

VT Abstract Model VPs are connected in a unidirectional ring, so data can be transferred from VP(n) to VP(n+1). These cross-VP data transfers are dynamically scheduled and resolve when the data becomes available.

VT Abstract Model [diagram: cross-VP start/stop queue]

VT Abstract Model Cross-VP data transfer example: saturating parallel prefix sum. The initial value is pushed into the cross-VP start/stop queue; the result is either popped from the cross-VP start/stop queue or consumed during the next execution of the AIB. r0 = VP index; r1, r2 = input vector base addresses; r3, r4 = min and max saturation values.
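A sequential C sketch clarifies the cross-VP dependency: each VP adds its element to the value received from its left neighbor, saturates the result, and passes it to the right. The init argument stands for the value pushed into the cross-VP start/stop queue and the return value for what is popped; the function and parameter names are illustrative:

```c
#include <assert.h>

/* Saturating prefix sum: the carry flows VP(0) -> VP(1) -> ... exactly
 * as the cross-VP transfers do; lo/hi are the saturation bounds. */
int sat_prefix_sum(int *x, int n, int init, int lo, int hi) {
    int carry = init;                /* pushed into the start/stop queue */
    for (int i = 0; i < n; i++) {    /* VP i consumes VP(i-1)'s output */
        carry += x[i];
        if (carry < lo) carry = lo;  /* saturate at the minimum */
        if (carry > hi) carry = hi;  /* saturate at the maximum */
        x[i] = carry;                /* store the running sum */
    }
    return carry;                    /* popped from the start/stop queue */
}
```

In hardware the iterations run on different VPs, and each carry hand-off is a dynamically scheduled cross-VP transfer rather than a loop-carried variable.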

VT Abstract Model VPs can be used as free-running threads as well, operating independently from the control processor and retrieving data from a shared work queue

VT Abstract Model Benefits Parallelism and locality are maintained at a high granularity, and common code can be executed by the control processor. AIBs reduce instruction fetch overhead. Vector-fetch commands explicitly encode parallelism and instruction locality, giving high performance with amortized control overhead. Vector-memory commands avoid separate load and store requests for each element and can be used to exploit memory data parallelism. Cross-VP data transfers explicitly encode fine-grained communication and synchronization with little overhead.

VT Physical Model The control processor is a conventional scalar unit. The vector-thread unit (VTU) is an array of processing lanes, with VPs striped across the lanes. Each lane contains physical registers holding the VP states, plus functional units.

VT Physical Model functional units are time-multiplexed across the VPs

VT Physical Model Each lane contains a command management unit (CMU) and an execution cluster. [diagram: four lanes, each with a CMU above an execution cluster]

VT Physical Model Command Management Unit Buffers commands from the control processor, holds pending thread-fetch addresses for the VPs, and holds the tags for the lane's AIB cache. It chooses a vector-fetch, VP-fetch, or thread-fetch command to process; the fetch contains an address/AIB tag. If the AIB is not in the cache, a request is sent to the AIB fill unit. Once the AIB is in the cache, an execute directive is generated and sent to a queue in the execution cluster, and the process repeats.

VT Physical Model AIB Fill Unit Retrieves requested AIBs from the primary cache. One lane's request is handled at a time, except for vector-fetch commands, for which the fill unit broadcasts the AIB to all lanes simultaneously.

VT Physical Model Execution Cluster To process an execute directive, the cluster reads VP instructions one by one from the AIB cache and executes them for the appropriate VP. All instructions in the AIB are executed for one VP before moving on to the next. Virtual register indices in the AIB instructions are combined with the active VP number to create an index into the physical register file. Thread-fetch instructions are sent to the CMU with the requested AIB address, and the VP's pending thread-fetch register is updated. Lanes are interconnected with a unidirectional ring network for cross-VP data transfers.
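The virtual-to-physical register mapping can be sketched in C. The fixed-size private window per VP assumed below is illustrative only, not the documented SCALE encoding:

```c
#include <assert.h>

/* Illustrative mapping from a virtual register index in an AIB
 * instruction to a physical register file index: combine the active
 * VP's position on the lane with the virtual index. The fixed-window
 * layout here is an assumption made for illustration. */
int phys_reg_index(int vp_on_lane, int regs_per_vp, int vreg) {
    return vp_on_lane * regs_per_vp + vreg;
}
```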

SCALE VT Architecture The control processor is MIPS-based. In the vector-thread unit, each lane has a single CMU but multiple execution clusters with independent register sets. AIB instructions target specific clusters: source operands must be local to the cluster, but results can be written to any cluster.

SCALE VT Architecture Execution Clusters All clusters support basic integer operations. Cluster 0 supports memory accesses, cluster 1 supports fetch instructions, and cluster 3 supports integer multiplies and divides. Clusters can be enhanced, and more can be added. Each cluster has its own predicate register.

SCALE VT Architecture Registers Registers in each cluster are either shared or private. Private registers preserve their values between AIBs; shared registers may be overwritten by a different VP and can be used as temporary state within an AIB. Two additional chain registers, associated with the two ALU operands, can be used to avoid reading and writing the register file. Cluster 0 has one more chain register through which all data for VP stores must pass (the store-data register). The control processor configures each VP by indicating how many shared and private registers it requires in each cluster; this determines the maximum number of VPs that can be supported and is typically done once outside each loop.
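How the per-VP register requirement bounds the number of VPs can be sketched with a back-of-the-envelope calculation. The formula below is an assumption for illustration, not the exact SCALE configuration rule; its numbers are chosen to be consistent with the prototype described later (4 lanes, 32 registers per cluster, up to 128 VPs when each VP claims a single private register):

```c
#include <assert.h>

/* Illustrative bound on the number of VPs: each lane's register file is
 * split between shared registers (one copy per lane) and private
 * registers (one copy per VP). This formula is an assumption made for
 * illustration, not the documented SCALE rule. */
int max_vps(int lanes, int regs_per_cluster, int shared_regs,
            int private_regs_per_vp) {
    if (private_regs_per_vp <= 0)
        return -1;  /* this sketch requires at least one private register */
    return lanes * ((regs_per_cluster - shared_regs) / private_regs_per_vp);
}
```

The design choice the slide describes falls out of this trade-off: asking for more private registers per VP shrinks the number of VPs the hardware can support.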

SCALE Code Example Decoder example: C code. Non-vectorizable: table look-ups and loop-carried dependencies.

SCALE Code Example Decoder example: control processor code. [annotations: configure VPs; vector-writes; push onto the cross-VP start/stop queue; pop off of the cross-VP start/stop queue]

SCALE Code Example Decoder example: AIB code executed by each VP

SCALE Code Example Decoder example: cluster usage

SCALE Microarchitecture Clusters support three types of hardware micro-ops: a compute-op performs RISC-like operations, a transport-op sends data to another cluster, and a writeback-op receives data sent from another cluster. Transport-ops and writeback-ops handle inter-cluster data transfers, with data dependencies synchronized by handshake signals. Transports and writebacks are queued, so execution can continue while waiting for external clusters to receive or send data.

SCALE Microarchitecture Transport and Writeback ops

SCALE Microarchitecture Memory Access Decoupling Memory is accessed only through cluster 0. A load data queue buffers load data and preserves correct ordering; a decoupled store queue buffers stores and can be targeted directly by transport-ops. The queues allow the cluster to continue working without waiting for a load or store to resolve.
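A minimal FIFO sketch shows the ordering property these queues provide: data is popped in exactly the order it was pushed, so the consuming cluster sees load results in program order even when it runs ahead of the memory system. The capacity and names below are illustrative, not SCALE parameters:

```c
#include <assert.h>

#define QCAP 8  /* illustrative capacity */

/* Ring-buffer FIFO modeling a decoupling queue such as the load data
 * queue: push at the tail, pop from the head, so ordering is preserved. */
struct ldq {
    int buf[QCAP];
    int head, tail;
};

int ldq_push(struct ldq *q, int data) {
    if ((q->tail + 1) % QCAP == q->head) return 0;  /* full: producer stalls */
    q->buf[q->tail] = data;
    q->tail = (q->tail + 1) % QCAP;
    return 1;
}

int ldq_pop(struct ldq *q, int *data) {
    if (q->head == q->tail) return 0;  /* empty: consumer must wait */
    *data = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    return 1;
}
```

Decoupling works because the only synchronization needed is the full/empty status of the queue, not a global stall of the cluster.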

SCALE Microarchitecture [diagram: decoupled store queue and load data queue]

SCALE Prototype Single-issue MIPS control processor. Four 32-bit lanes with four execution clusters each. 32 KB shared primary cache, 32-way set-associative. 32 registers per cluster; supports up to 128 VPs. Area ~10 mm²; 400 MHz target.

Evaluation Detailed cycle-level, execution-driven microarchitectural simulator Default parameters

Evaluation EEMBC benchmarks Can be run "out-of-the-box" or optimized. Drawbacks: performance can depend greatly on programmer effort, and the optimizations used for reported results are often unpublished.

Evaluation Results SCALE is competitive with larger, more complex processors, and its performance scales well as lanes are added. Large speed-ups are possible when algorithms are extensively tuned for highly parallel processors.

Evaluation

Evaluation Register usage Resulting vector lengths

Evaluation Compared Processors
AMD Au1100: similar to SCALE
Philips TriMedia TM1300: five-issue VLIW, 32-bit datapath, 166 MHz, 32 kB L1 I-cache, 16 kB L1 D-cache, 125 MHz 32-bit memory port
Motorola PowerPC (MPC7447): four-issue out-of-order superscalar, 1.3 GHz, 32 kB L1 I- and D-caches, 512 kB L2, 133 MHz 64-bit memory port, AltiVec SIMD unit with a 128-bit datapath and four execution units

Evaluation Compared Processors (cont'd)
VIRAM: four 64-bit lanes, 200 MHz, 13 MB embedded DRAM with 256 bits each of load and store data and 4 independent addresses per cycle
BOPS Manta: clustered VLIW DSP with four clusters, each executing up to five instructions per cycle, 64-bit datapaths, 136 MHz, 128 kB on-chip memory, 138 MHz 32-bit memory port
TI TMS320C6416: clustered VLIW DSP with two clusters, each executing up to four instructions per cycle, 720 MHz, 16 kB I-cache, 16 kB D-cache, 1 MB on-chip SRAM, 720 MHz 64-bit memory interface

Evaluation

Evaluation

Conclusion The vector-thread architecture allows software to encode parallelism and locality more efficiently, enables high-performance implementations that are efficient in area and power, and supports all types of parallelism. SCALE is shown to be well suited to embedded applications: a relatively small design provides competitive performance. The architecture is widely applicable in other application domains.