Mattan Erez The University of Texas at Austin

Slides:

Advertisements

Similar presentations

Computer Organization and Architecture

Advertisements

Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III.

© 2006 Edward F. Gehringer ECE 463/521 Lecture Notes, Spring 2006 Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Spring 2006.

HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

1/1/ /e/e eindhoven university of technology Microprocessor Design Course 5Z008 Dr.ir. A.C. (Ad) Verschueren Eindhoven University of Technology Section.

Instruction-Level Parallelism (ILP)

C SINGH, JUNE 7-8, 2010IWW 2010, ISATANBUL, TURKEY Advanced Computers Architecture, UNIT 1 Advanced Computers Architecture Lecture 4 By Rohit Khokher Department.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Instruction Level Parallelism (ILP) Colin Stevens.

Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.

Multiscalar processors

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

CS 300 – Lecture 24 Intro to Computer Architecture / Assembly Language The LAST Lecture!

Very Long Instruction Word (VLIW) Architecture. VLIW Machine It consists of many functional units connected to a large central register file Each functional.

Hybrid-Scheduling: A Compile-Time Approach for Energy–Efficient Superscalar Processors Madhavi Valluri and Lizy John Laboratory for Computer Architecture.

Super computers Parallel Processing By Lecturer: Aisha Dawood.

Spring 2003CSE P5481 Midterm Philosophy What the exam looks like. Definitions, comparisons, advantages & disadvantages what is it? how does it work? why.

Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.

EKT303/4 Superscalar vs Super-pipelined.

Lecture 3: Computer Architectures

Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

UT-Austin CART 1 Mechanisms for Streaming Architectures Stephen W. Keckler Computer Architecture and Technology Laboratory Department of Computer Sciences.

Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

CS 352H: Computer Systems Architecture

COMP 740: Computer Architecture and Implementation

15-740/ Computer Architecture Lecture 3: Performance

Computer Architecture Principles Dr. Mike Frank

William Stallings Computer Organization and Architecture 8th Edition

buses, crossing switch, multistage network.

ESE532: System-on-a-Chip Architecture

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Pipeline Implementation (4.6)

/ Computer Architecture and Design

CDA 3101 Spring 2016 Introduction to Computer Organization

Lecture 5: GPU Compute Architecture

Instruction Level Parallelism and Superscalar Processors

Mattan Erez The University of Texas at Austin

Morgan Kaufmann Publishers The Processor

The University of Texas at Austin

Introduction, Focus, Overview

Mattan Erez The University of Texas at Austin

Levels of Parallelism within a Single Processor

Lecture 5: GPU Compute Architecture for the last time

Lecture 11: Memory Data Flow Techniques

Instruction Level Parallelism and Superscalar Processors

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Coe818 Advanced Computer Architecture

buses, crossing switch, multistage network.

Mattan Erez The University of Texas at Austin

Instruction Level Parallelism (ILP)

/ Computer Architecture and Design

The Vector-Thread Architecture

Mattan Erez The University of Texas at Austin

Overview Prof. Eric Rotenberg

Created by Vivi Sahfitri

Mattan Erez The University of Texas at Austin

Introduction, Focus, Overview

Mattan Erez The University of Texas at Austin

How does the CPU work? CPU’s program counter (PC) register has address i of the first instruction Control circuits “fetch” the contents of the location.

Lecture 1 An Overview of High-Performance Computer Architecture

The University of Adelaide, School of Computer Science

Guest Lecturer: Justin Hsia

ECE 721, Spring 2019 Prof. Eric Rotenberg.

Spring’19 Prof. Eric Rotenberg

Presentation transcript:

Mattan Erez The University of Texas at Austin EE382N (20): Computer Architecture - Parallelism and Locality Lecture 5 – Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Outline Principles of parallel execution Pipelining Parallel HW (multiple ALUs) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Parallel Execution Concurrency Communication Synchronization what are the multiple resources? Communication and storage Synchronization Implicit? Explicit? what is being shared ? What is being partitioned ? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Parallelism – Circuits Circuits operate concurrently (always) Communicate through wires and registers Synchronize by: clock signals (including async) design (wave-pipelined) more? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Parallelism – Components Collections of circuits memories branch predictors cache scheduler … includes ALUs, memory channels, cores but will discuss later. Operate concurrently Communicate by wires, registers, and memory synchronize with clock and signals Often pipelined EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Reminders This is not a microarchitecture class We will be discussing microarch. of various stream processors though Details deferred to later in the semester This class is not a replacement for Parallel Computer Architecture class We only superficially cover many details of parallel architectures Focus on parallelism and locality at the same time EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Outline Cache oblivious algorithms “sub-talk” Principles of parallel execution Pipelining Parallel HW (multiple ALUs) Analyze by shared resources Analyze by synch/comm mechanisms ILP, DLP, and TLP organizations EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r3 3:add r6, r2, r3 1 F D I R E W C 2 3 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r3 3:add r6, r2, r3 1 F D I R E W C 2 3 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r3 3:add r6, r2, r3 1 F D I R E W C 2 3 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r3 3:add r6, r2, r3 1 F D I R E W C 2 3 What are the parallel resources? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r4 3:add r6, r5, r3 1 F D I R E W C 2 3 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r4 3:add r6, r5, r3 1 F D I R E W C 2 3 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:add r5, r1, r3 3:add r6, r2, r3 1 F D I R E W C 2 3 Communication and synchronization mechanisms? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a pipelined processor dispatch (issue) reg. access execute write-back commit fetch decode sequencer 1:add r4, r1, r2 2:ld r5, r4 3:add r6, r5, r3 4:add r7, r1, r3 1 F D I R E W C 2 L 3 4 EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Simplified view of a OOO pipelined processor dispatch (issue) issue (dispatch) reg. access execute write-back commit fetch decode rename sequencer scheduler 1:add r4, r1, r2 2:ld r5, r4 3:add r6, r5, r3 4:add r7, r1, r3 1 F D I d R E W C 2 L 3 4 Communication and synchronization mechanisms? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Pipelining Summary Pipelining is using parallelism to hide latency Do useful work while waiting for other work to finish Multiple parallel components, not multiple instances of same component Examples: Execution pipeline Memory pipelines Issue multiple requests to memory without waiting for previous requests to complete Software pipelines Overlap different software blocks to hide latency: computation/communication EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Resources in a parallel processor/system Execution ALUs Cores/processors Control Sequencers Instructions OOO schedulers State Registers Memories Networks EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Communication and synchronization Clock – explicit compiler order Explicit signals (e.g., dependences) Implicit signals (e.g., flush/stall) More for pipelining than multiple ALUs Communication Bypass networks Registers Memory Explicit (over some network) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Organizations for ILP (for multiple ALUs) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Superscalar (ILP for multiple ALUs) sequencer scheduler Synchronization Explicit signals (dependences) Communication Bypass, registers, mem Shared Sequencer, OOO, registers, memories, net, ALUs Partitioned Instructions memory hierarchy How many ALUs? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

SMT/TLS (ILP for multiple ALUs) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

SMT/TLS (ILP for multiple ALUs) sequencer scheduler sequencer Synchronization Explicit signals (dependences) Communication Bypass, registers, mem Shared OOO, registers, memories, net, ALUs Partitioned Sequencer, Instructions, arch. registers memory hierarchy Why is this ILP? How many threads? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

VLIW (ILP for multiple ALUs) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

VLIW (ILP for multiple ALUs) sequencer Synchronization Clock+compiler Communication Registers, mem, bypass Shared Sequencer, OOO, registers, memories, net Partitioned Instructions, ALUs memory hierarchy How many ALUs? EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Explicit Dataflow (ILP for multiple ALUs) EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Explicit Dataflow (ILP for multiple ALUs) sequencer scheduler scheduler scheduler scheduler Synchronization Explicit signals Communication Registers+explicit Shared Sequencer, memories, net Partitioned Instructions, OOO, ALUs memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

DLP for multiple ALUs From HW – this is SIMD EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

SIMD (DLP for multiple ALUs) sequencer Synchronization Clock+compiler Communication Explicit Shared Sequencer, instructions Partitioned Registers, memories, ALUs Sometimes: memories, net Local Mem Local Mem Local Mem Local Mem memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Vectors (DLP for multiple ALUs) sequencer Local Mem Local Mem Local Mem Local Mem memory hierarchy Vectors: memory addresses are part of single-instruction and not part of multiple-data EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

TLP for multiple ALUs EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

MIMD – shared memory (TLP for multiple ALUs) sequencer sequencer sequencer sequencer scheduler scheduler scheduler scheduler Synchronization Explicit, memory Communication Memory Shared Memories, net Partitioned Sequencer, instructions, OOO, ALUs, registers, some nets memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

MIMD – distributed memory sequencer sequencer sequencer sequencer scheduler scheduler scheduler scheduler Synchronization Explicit Communication Shared Net Partitioned Sequencer, instructions, OOO, ALUs, registers, some nets, memories memory hierarchy memory hierarchy memory hierarchy memory hierarchy EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Summary of communication and synchronization Style Synchronization Communication Superscalar explicit signals (RS) registers+bypass VLIW clock+compiler registers (bypass?) Dataflow explicit signals registers+explicit SIMD explicit MIMD memory+explicit EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Summary of sharing in ILP HW Style Seq Inst OOO Regs Mem ALUs Net Superscalar S P SMT/TLS VLIW N/A Dataflow B EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez

Summary of sharing in DLP and TLP Style Seq Inst OOO Regs Mem ALUs Net Vector S N/A P s B SIMD p MIMD S/P EE38N(20) Spring 2015 -- Lecture 5 (c) Mattan Erez