The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science


The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science Boston Area Architecture Workshop (BARC) January 30th, 2003

Introduction
Architectures are all about exploiting the parallelism inherent in applications, for both performance and energy efficiency. The Vector-Thread Architecture is a new approach that can flexibly take advantage of the many forms of parallelism available in different applications: instruction, loop, data, and thread parallelism. The key goal of the vector-thread architecture is efficiency: high performance with low power consumption and small area. A clean, compiler-friendly programming model is key to realizing these goals.

Instruction Parallelism (Super-scalar, VLIW)
Independent instructions can execute concurrently. Super-scalar architectures track instruction dependencies and dynamically schedule instructions in hardware to enable out-of-order and parallel execution. On a VLIW machine, software instead statically schedules parallel instructions.
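As an illustration (not from the original slides), a short C sketch of the distinction the slide draws between independent instructions, which a wide-issue machine can overlap, and a serial dependence chain, which it cannot:

```c
/* Two independent operations: a super-scalar core can issue both
   in the same cycle because neither reads the other's result. */
int independent(int a, int b, int c, int d) {
    int x = a + b;   /* does not depend on y */
    int y = c * d;   /* does not depend on x */
    return x + y;    /* joins the two chains */
}

/* A serial dependence chain: each add reads the previous result,
   so no issue width can overlap these operations. */
int dependent(int a) {
    int x = a + 1;
    x = x + 2;
    x = x + 3;
    return x;
}
```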

Loop Parallelism (VLIW)
Operations from disjoint iterations of a loop can execute in parallel. VLIW architectures use software pipelining to statically schedule instructions from different loop iterations to execute concurrently. [Figure: a software pipeline overlapping the load, add, and store of iterations 0 through 4, with the steady-state loop body mixing operations from the previous, current, and next iterations.]
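The slide's load/add/store loop, written as plain C (a hypothetical example, not taken from the talk). Because iterations touch disjoint elements, a software-pipelining compiler may overlap the load of iteration i+1 with the add of iteration i and the store of iteration i-1:

```c
/* Each iteration is a load, an add, and a store on disjoint data,
   so a VLIW compiler can software-pipeline across iterations. */
void add_k(const int *a, int *c, int k, int n) {
    for (int i = 0; i < n; i++) {
        int t = a[i];   /* load  */
        t = t + k;      /* add   */
        c[i] = t;       /* store */
    }
}

/* Small self-check used below: returns the sum of the outputs. */
int add_k_check(void) {
    int a[3] = {1, 2, 3}, c[3];
    add_k(a, c, 10, 3);            /* c = {11, 12, 13} */
    return c[0] + c[1] + c[2];
}
```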

Data Parallelism (Vector)
A single operation can be applied in parallel across a set of data. In vector architectures, one instruction identifies a set of independent operations which can execute in parallel, so control overhead can be amortized.

Thread Parallelism (Multiprocessor, SMT)
Separate threads of control can execute concurrently. Multiprocessor architectures allow different threads to execute at the same time on different processors, while multithreaded (SMT) architectures execute multiple threads at the same time to better utilize a single set of processing resources.

Vector-Thread Architecture Overview
Data parallelism: start with a vector architecture. Thread parallelism: give the execution units local control. Loop parallelism: allow fine-grain dataflow communication between execution units. Instruction parallelism: add wide issue.

Vector Architecture
Programming model: a control thread interacts with a set of virtual processors (VP0, VP1, ..., VP(N-1)). VPs contain registers and execution units, and execute instructions under slave control. Using VPs for vectorizable loops, each iteration of a loop such as
for (i=0; i<N; i++) C[i] = A[i] + B[i];
is mapped to its own VP (with stripmining), and a vector-execute command runs the block load A, load B, add, store on every VP.
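The stripmining the slide mentions can be sketched in C as follows. This is a hypothetical illustration: the vector length of 4 and the names are made up, and the inner loop stands in for the set of VPs that one vector instruction drives in parallel:

```c
#define VLEN 4  /* hypothetical number of virtual processors per strip */

/* Stripmining: an N-iteration loop is processed in chunks of at most
   VLEN elements; each chunk corresponds to one vector instruction. */
void vadd(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i += VLEN) {
        int len = (n - i < VLEN) ? n - i : VLEN;  /* last strip may be short */
        for (int j = 0; j < len; j++)             /* one VP per element */
            c[i + j] = a[i + j] + b[i + j];
    }
}

/* Small self-check used below: sums the outputs for n = 6. */
int vadd_check(void) {
    int a[6] = {1, 2, 3, 4, 5, 6};
    int b[6] = {10, 10, 10, 10, 10, 10};
    int c[6], s = 0;
    vadd(a, b, c, 6);
    for (int i = 0; i < 6; i++) s += c[i];
    return s;
}
```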

Vector Microarchitecture
Lanes contain register files and execution units; VPs map to lanes and share physical resources (with four lanes, VP0/VP4/VP8/VP12 share Lane 0, VP1/VP5/VP9/VP13 share Lane 1, and so on). Operations execute in parallel across lanes and sequentially for each VP mapped to a lane, so control overhead is amortized to save energy. The control processor issues vector-execute commands (load A, load B, add, store) to all lanes.
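The VP-to-lane striping in the slide's diagram can be captured in two lines of C (a sketch; the four-lane count follows the slide, the function names are invented here):

```c
#define NUM_LANES 4

/* VP v maps to lane v % 4 and executes in time slot v / 4 on that
   lane, matching the VP0/VP4/VP8/VP12 column under Lane 0. */
int lane_of(int vp) { return vp % NUM_LANES; }
int slot_of(int vp) { return vp / NUM_LANES; }
```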

Vector-Thread Architecture
Programming model: a vector of virtual processors (similar to a traditional vector architecture), with slave control, micro-threaded control, and cross-VP communication. VPs are decoupled: local instruction queues break the rigid synchronization of vector architectures. Under slave control, the control thread sends instructions to all VPs; under micro-threaded control, each VP fetches its own instructions. Cross-VP communication allows each VP to send data to its successor.

Using VPs for Do-Across Loops
VPs execute atomic instruction blocks (AIBs). Each iteration of a data-dependent loop such as
for (i=0; i<N; i++) { x = x + A[i]; C[i] = x; }
is mapped to its own VP. Cross-VP send and recv operations communicate do-across results from one VP to the next (the next iteration in time): each VP executes the AIB recv, load, add, send, store under a vector-execute command.
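For reference, the slide's do-across loop as a complete, runnable C function. The sequential variable x plays the role of the value that each VP would recv from its predecessor and send to its successor (the helper name below is invented for this sketch):

```c
/* Running-sum loop with a carried dependence on x: in the VT
   mapping, x is the value carried over the cross-VP network. */
int prefix_add(const int *a, int *c, int n) {
    int x = 0;                 /* value "received" from the previous VP */
    for (int i = 0; i < n; i++) {
        x = x + a[i];          /* recv, load, add */
        c[i] = x;              /* store; x is "sent" to the next VP */
    }
    return x;
}

/* Small self-check used below: encodes c = {1, 3, 6} as 136. */
int prefix_check(void) {
    int a[3] = {1, 2, 3}, c[3];
    prefix_add(a, c, 3);
    return c[0] * 100 + c[1] * 10 + c[2];
}
```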

Vector-Thread Microarchitecture
VPs are striped across lanes as in a traditional vector machine. Each lane has a small instruction cache (e.g. 32 instructions), filled from the instruction cache hierarchy, and decoupled execution. Execute directives point to atomic instruction blocks and indicate which VP(s) the AIB should be executed for; they are generated by a control-thread vector-execute command or by a VP fetch instruction. The do-across network includes dataflow handshake signals: a receiver stalls until its data is ready.

Do-Across Execution
Dataflow execution resolves do-across dependencies dynamically: the recv/load/add/send/store AIBs of successive iterations overlap across the four lanes, each recv waiting only for its predecessor's send. Independent instructions execute in parallel, so performance adapts to the software critical path, and instruction fetch overhead is amortized across loop iterations.

Micro-Threading VPs
VPs also have the ability to fetch their own instructions, enabling each VP to execute its own thread of control. The control thread can send a vector fetch instruction to all VPs (i.e. a vector fork), which allows efficient thread startup, and can stall until the micro-threads "finish" (stop fetching instructions). Micro-threading enables data-dependent control flow within a loop iteration, as an alternative to predication.
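A hypothetical example (not from the talk) of the kind of loop body micro-threading targets: each iteration branches on its own data, so under micro-threaded control each VP would fetch down its own path instead of executing both paths under predication:

```c
/* Data-dependent control flow inside the loop body: the branch is
   decided per element, i.e. per VP in the vector-thread mapping. */
int clamp_sum(const int *a, int n, int limit) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > limit)      /* this VP takes the clamp path */
            sum += limit;
        else                   /* this VP takes the pass-through path */
            sum += a[i];
    }
    return sum;
}

/* Small self-check used below: 1 + 4 + 4 = 9. */
int clamp_check(void) {
    int a[3] = {1, 5, 10};
    return clamp_sum(a, 3, 4);
}
```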

Loop Parallelism and Architectures
Loops are ubiquitous and contain ample parallelism across iterations. Super-scalar: must track dependencies between all instructions in a loop body (and correctly predict branches) before executing instructions in the subsequent iteration, and must do this repeatedly for every loop iteration. VLIW: software pipelining exposes parallelism, but requires static scheduling, which is difficult and inadequate with dynamic latencies and dependencies. Vector: efficient, but limited to do-all loops, with no do-across. Vector-thread: software efficiently exposes parallelism, and dynamic dataflow automatically adapts to the critical path; it uses simple in-order execution units and amortizes instruction fetch overhead across loop iterations.

Using the Vector-Thread Architecture
The Vector-Thread Architecture seeks to efficiently exploit the available parallelism in any given application. Using the same set of resources, it can flexibly transition from pure data-parallel operation (a do-all loop driven by the control thread), to parallel loop execution with do-across dependencies, to fine-grain micro-threading of the virtual processors. This multi-paradigm support spans vector, loop, thread, and instruction-level parallelism while delivering performance and energy efficiency.

SCALE-0 Overview
[Block diagram.] The SCALE-0 tile contains a control processor and a Vector Thread Unit built from clustered virtual processors. Each VP comprises four clusters, each with a local register file: Cluster 0 (IALU, memory), Cluster 1 (IALU), Cluster 2 (FP-ADD), and Cluster 3 (FP-MUL), connected by inter-cluster communication and prev-VP/next-VP links. The tile also includes a 32KB L1 configurable instruction/data cache, a CMMU with an outstanding transaction table, and a network interface, with datapath widths ranging from 32b to 4x128b and 256b.