The Vector-Thread Architecture

Slides:



Advertisements
Similar presentations
Computer Organization and Architecture
Advertisements

HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.
1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.
Programmability Issues
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic
1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)
Instruction-Level Parallelism (ILP)
Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.
Instruction Level Parallelism (ILP) Colin Stevens.
Chapter 17 Parallel Processing.
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
Multiscalar processors
Synergistic Processing In Cell’s Multicore Architecture Michael Gschwind, et al. Presented by: Jia Zou CS258 3/5/08.
Automobiles The Scale Vector-Thread Processor Modern embedded systems Multiple programming languages and models Multiple distinct memories Multiple communication.
COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.
The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science
University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.
CDA 5155 Superscalar, VLIW, Vector, Decoupled Week 4.
Super computers Parallel Processing By Lecturer: Aisha Dawood.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.
Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović MIT Computer Science and Artificial Intelligence.
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
My Coordinates Office EM G.27 contact time:
UT-Austin CART 1 Mechanisms for Streaming Architectures Stephen W. Keckler Computer Architecture and Technology Laboratory Department of Computer Sciences.
Spring 2003CSE P5481 WaveScalar and the WaveCache Steven Swanson Ken Michelson Mark Oskin Tom Anderson Susan Eggers University of Washington.
Lecture 38: Compiling for Modern Architectures 03 May 02
COMP 740: Computer Architecture and Implementation
Advanced Architectures
Computer Architecture Principles Dr. Mike Frank
Multiscalar Processors
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Design of Digital Circuits Lecture 21: GPUs
Simultaneous Multithreading
Multi-core processors
Computer Structure Multi-Threading
The Vector-Thread Architecture
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
/ Computer Architecture and Design
ECE/CS 552: Pipelining to Superscalar
CSL718 : VLIW - Software Driven ILP
Vector Processing => Multimedia
Levels of Parallelism within a Single Processor
Computer Architecture Lecture 4 17th May, 2006
Hardware Multithreading
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Coe818 Advanced Computer Architecture
How to improve (decrease) CPI
Advanced Computer Architecture
Henk Corporaal TUEindhoven 2011
Mattan Erez The University of Texas at Austin
Instruction Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
/ Computer Architecture and Design
Mattan Erez The University of Texas at Austin
Samuel Larsen Saman Amarasinghe Laboratory for Computer Science
Levels of Parallelism within a Single Processor
Hardware Multithreading
Dynamic Hardware Prediction
How to improve (decrease) CPI
Loop-Level Parallelism
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Presentation transcript:

The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science ronny@mit.edu www.cag.lcs.mit.edu/scale Boston Area Architecture Workshop (BARC) January 30th, 2003

Introduction Architectures are all about exploiting the parallelism inherent to applications Performance Energy The Vector-Thread Architecture is a new approach which can flexibly take advantage of many forms of parallelism available in different applications – instruction, loop, data, thread The key goal of the vector-thread architecture is efficiency – high performance with low power consumption and small area A clean, compiler-friendly programming model is key to realizing these goals

Instruction Parallelism Independent instructions can execute concurrently Super-scalar architectures dynamically schedule instructions in hardware to enable out-of-order and parallel execution Software statically schedules parallel instructions on a VLIW machine Super-scalar VLIW track instr. dependencies

Loop Parallelism Operations from disjoint iterations of a loop can execute in parallel VLIW architectures use software pipelining to statically schedule instructions from different loop iterations to execute concurrently load add store prev next loop: iter. 0 VLIW load iter. 1 add load iter. 2 store add load iter. 3 store add load iter. 4 software pipeline store add load store add store

Data Parallelism A single operation can be applied in parallel across a set of data In vector architectures, one instruction identifies a set of independent operations which can execute in parallel Control overhead can be amortized Vector

Thread Parallelism Separate threads of control can execute concurrently Multiprocessor architectures allow different threads to execute at the same time on different processors Multithreaded architectures execute multiple threads at the same time to better utilize a single set of processing resources SMT Multiprocessor

Vector-Thread Architecture Overview Data parallelism – start with vector architecture Thread parallelism – give execution units local control Loop parallelism – allow fine-grain dataflow communication between execution units Instruction parallelism – add wide issue

Vector Architecture Programming Model Using VPs for Vectorizable Loops VP(N-1) vector instruction control thread A control thread interacts with a set of virtual processors (VPs) VPs contain registers and execution units VPs execute instructions under slave control Each iteration in a vectorizable loop mapped to its own VP (w. stripmining) Using VPs for Vectorizable Loops for (i=0; i<N; i++) C[i] = A[i] + B[i]; VP0 VP1 VP(N-1) i=0 i=1 i=N-1 loadA loadA loadA loadB loadB loadB vector-execute: load A vector-execute: load B add add add vector-execute: add vector-execute: store store store store

Vector Microarchitecture Lane 0 Lane 1 Lane 2 Lane 3 Microarchitecture VP12 VP13 VP14 VP15 VP8 VP9 VP10 VP11 VP4 VP5 VP6 VP7 VP0 VP1 VP2 VP3 from control processor Execution on Vector Processor Lanes contain regfiles and execution units – VPs map to lanes and share physical resources Operations execute in parallel across lanes and sequentially for each VP mapped to a lane – control overhead amortized to save energy Lane 0 Lane 1 Lane 2 Lane 3 loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadA loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB loadB add add add add vector-execute: load A add add add add add add add add vector-execute: load B add add add add vector-execute: store store store store add store store store store vector-execute: store store store store store store store store store

Vector-Thread Architecture Programming Model VP0 VP1 VP(N-1) micro-threaded control slave control cross-VP communication Vector of Virtual Processors (similar to traditional vector architecture) VPs are decoupled – local instruction queues break the rigid synchronization of vector architectures Under slave control, the control thread sends instructions to all VPs Under micro-threaded control, each VP fetches its own instructions Cross-VP communication allows each VP to send data to its successor

Using VPs for Do-Across Loops for (i=0; i<N; i++) { x = x + A[i]; C[i] = x; } VP0 VP1 VP(N-1) i=0 i=1 i=(N-1) load load load recv recv recv vector-execute: add add add load recv send send send add AIB store store store send store VPs execute atomic instruction blocks (AIB) Each iteration in a data dependent loop is mapped to its own VP Cross-VP send and recv operations communicate do-across results from one VP to the next VP (next iteration in time)

Vector-Thread Microarchitecture Lane 0 Lane 1 Lane 2 Lane 3 VP12 VP13 VP14 VP15 execute directives VP8 VP9 VP10 VP11 VP4 VP5 VP6 VP7 VP0 VP1 VP2 VP3 Instr. fill Instr. cache do-across network VPs striped across lanes as in traditional vector machine Lanes have small instruction cache (e.g. 32 instr’s), decoupled execution Execute directives point to atomic instruction blocks and indicate which VP(s) the AIB should be executed for – generated by control thread vector-execute command, or VP fetch instruction Do-across network includes dataflow handshake signals – receiver stalls until data is ready

Do-Across Execution Lane 0 Lane 1 Lane 2 Lane 3 vector-execute: load load load load recv load add recv send recv add store add send send recv load store add store send recv load store add recv send load store Dataflow execution resolves do-across dependencies dynamically Independent instructions execute in parallel – performance adapts to software critical path Instruction fetch overhead amortized across loop iterations add send recv store add load send recv load store add send recv load store add send recv add load store send recv store add load send recv load store add

Micro-Threading VPs VP0 VP1 VP(N-1) VPs also have the ability to fetch their own instructions enabling each VP to execute its own thread of control Control thread can send a vector fetch instruction to all VPs (i.e. vector fork) – allows efficient thread startup Control thread can stall until micro-threads “finish” (stop fetching instructions) Enables data-dependent control flow within a loop iteration (alternative to predication)

Loop Parallelism and Architectures Loops are ubiquitous and contain ample parallelism across iterations Super-scalar: must track dependencies between all instructions in a loop body (and correctly predict branches) before executing instruction in the subsequent iteration… and do this repeatedly for each loop iteration VLIW: software pipelining exposes parallelism, but requires static scheduling which is difficult and inadequate with dynamic latencies and dependencies Vector: efficient, but limited to do-all loops, no do-across Vector-thread: Software efficiently exposes parallelism, and dynamic dataflow automatically adapts to critical path. Uses simple in-order execution units, and amortizes instruction fetch overhead across loop iterations

Using the Vector-Thread Architecture Multi-paradigm Support Control Thread Virtual Processors Vector DO-ALL Loop Loop Energy Efficiency Performance & DO-ACROSS Loop Threads ILP Micro-threading Vector-Threading The Vector-Thread Architecture seeks to efficiently exploit the available parallelism in any given application Using the same set of resources, it can flexibly transition from pure data parallel operation, to parallel loop execution with do-across dependencies, to fine-grain multi-threading

Configurable I/D Cache SCALE-0 Overview Tile Outstanding Trans. Table Clustered Virtual Processor ctrl proc Vector Thread Unit CMMU 256b 128b Network Interface Cluster 3 32b 128b 128b Local Regfile FP-MUL 4x128b 32KB L1 Configurable I/D Cache Cluster 2 Local Regfile FP-ADD ALU L/S Instruction Issue Unit to tile memory Lane Inter-Cluster Communication Prev-VP Next-VP Cluster 1 Local Regfile IALU Cluster 0 (Mem) Local Regfile IALU