COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models
Samira Khan, University of Virginia, Sep 5, 2018
The content and concept of this course are adapted from CMU ECE 740
AGENDA Review from last lecture Fundamental concepts Computing models
LAST LECTURE RECAP Why Study Computer Architecture? Von Neumann Model Data Flow Architecture
REVIEW: THE DATA FLOW MODEL
Von Neumann model: an instruction is fetched and executed in control flow order
- As specified by the instruction pointer
- Sequential unless an explicit control flow instruction changes it
Dataflow model: an instruction is fetched and executed in data flow order
- i.e., when its operands are ready; there is no instruction pointer
- Instruction ordering is specified by data flow dependence
- Each instruction specifies "who" should receive the result
- An instruction can "fire" whenever all operands are received
- Potentially many instructions can execute at the same time; inherently more parallel
MORE ON DATA FLOW
In a data flow machine, a program consists of data flow nodes
A data flow node fires (is fetched and executed) when all its inputs are ready, i.e., when all inputs have tokens
Data flow node and its ISA representation
DATA FLOW NODES
An Example
What does this model perform? val = a ^ b
What does this model perform? val = a ^ b; val != 0
What does this model perform? val = a ^ b; val != 0; val &= val - 1
What does this model perform? val = a ^ b; val != 0; val &= val - 1; dist = 0; dist++
Hamming Distance
int hamming_distance (unsigned a, unsigned b) {
    int dist = 0;
    unsigned val = a ^ b;
    // Count the number of bits set
    while (val != 0) {
        // A bit is set, so increment the count and clear the bit
        dist++;
        val &= val - 1;
    }
    // Return the number of differing bits
    return dist;
}
Hamming Distance Number of positions at which the corresponding symbols are different. The Hamming distance between: "karolin" and "kathrin" is 3 1011101 and 1001001 is 2 2173896 and 2233796 is 3
RICHARD HAMMING
Best known for the Hamming code
Won the Turing Award in 1968
Was part of the Manhattan Project
Worked at Bell Labs for 30 years
"You and Your Research" is mainly his advice to other researchers; he gave the talk many times during his lifetime
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
Dataflow Machine: Instruction Templates
[Figure: a five-node dataflow graph over inputs x and b, producing output y]
Each instruction template holds an opcode, two operand slots with presence bits, and up to two destinations:
  1: +  ->  3L, 4L
  2: *  ->  3R, 4R
  3: -  ->  5L
  4: +  ->  5R
  5: *  ->  out
Each arc in the graph has an operand slot in the program
Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974
One of the papers assigned for review next week
DATA FLOW CHARACTERISTICS
Data-driven execution of instruction-level graphical code
- Nodes are operators
- Arcs are data (I/O)
- As opposed to control-driven execution
Only real dependencies constrain processing
No sequential I-stream
- No program counter
Operations execute asynchronously
Execution triggered by the presence of data
DATA FLOW ADVANTAGES/DISADVANTAGES
Advantages
- Very good at exploiting irregular parallelism
- Only real dependencies constrain processing
Disadvantages
- Debugging difficult (no precise state)
- Interrupt/exception handling is difficult (what is precise state semantics?)
- Too much parallelism? (parallelism control needed)
- High bookkeeping overhead (tag matching, data storage)
- Instruction cycle is inefficient (delay between dependent instructions); memory locality is not exploited
DATA FLOW SUMMARY
- Availability of data determines order of execution
- A data flow node fires when its sources are ready
- Programs represented as data flow graphs (of nodes)
- Data Flow at the ISA level has not been (as) successful
- Data Flow implementations under the hood (while preserving sequential ISA semantics) have been successful: out-of-order execution
Hwu and Patt, "HPSm, a high performance restricted data flow architecture having minimal functionality," ISCA 1986.
OOO EXECUTION: RESTRICTED DATAFLOW An out-of-order engine dynamically builds the dataflow graph of a piece of the program which piece? The dataflow graph is limited to the instruction window Instruction window: all decoded but not yet retired instructions Can we do it for the whole program? Why would we like to? In other words, how can we have a large instruction window?
OOO Review
What do the following two pieces of code have in common (with respect to execution in the previous design)?
Code 1:
  IMUL R3 R1, R2
  ADD R3 R3, R1
  ADD R1 R6, R7
  IMUL R5 R6, R8
  ADD R7 R9, R9
Code 2:
  LD R3 R1 (0)
  ADD R3 R3, R1
  ADD R1 R6, R7
  IMUL R5 R6, R8
  ADD R7 R9, R9
Answer: the first ADD stalls the whole pipeline!
- The ADD cannot dispatch because its source register (R3) is unavailable
- Later independent instructions cannot get executed
IN-ORDER VS. OUT-OF-ORDER
  IMUL R3 R1, R2
  ADD R3 R3, R1
  ADD R1 R6, R7
  IMUL R5 R6, R8
  ADD R7 R3, R5
In-order dispatch + precise exceptions: dependent instructions STALL in the pipeline, blocking everything behind them.
Out-of-order dispatch + precise exceptions: dependent instructions WAIT while independent ones proceed.
[Pipeline timing diagrams: 16 cycles in-order vs. 12 cycles out-of-order]
DATAFLOW GRAPH FOR OUR EXAMPLE
  MUL R3 R1, R2
  ADD R5 R3, R4
  ADD R7 R2, R6
  ADD R10 R8, R9
  MUL R11 R7, R10
  ADD R5 R5, R11
[Animation: the dataflow graph for the code above is built up one instruction at a time. Each destination register is renamed to a graph node output: R3 -> x, R7 -> b, R10 -> c, R5 -> a, R11 -> y, and finally R5 -> d. The Register Alias Table tracks the latest producer of each architectural register.]
ANOTHER WAY OF EXPLOITING PARALLELISM SIMD: Concurrency arises from performing the same operations on different pieces of data MIMD: Concurrency arises from performing different operations on different pieces of data Control/thread parallelism: execute different threads of control in parallel multithreading, multiprocessing Idea: Use multiple processors to solve a problem
FLYNN’S TAXONOMY OF COMPUTERS Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 SISD: Single instruction operates on single data element SIMD: Single instruction operates on multiple data elements Array processor Vector processor MISD: Multiple instructions operate on single data element Closest form: systolic array processor, streaming processor MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) Multiprocessor Multithreaded processor
SIMD PROCESSING Concurrency arises from performing the same operations on different pieces of data Single instruction multiple data (SIMD) E.g., dot product of two vectors Contrast with thread (“control”) parallelism Concurrency arises from executing different threads of control in parallel Contrast with data flow Concurrency arises from executing different operations in parallel (in a data driven manner) SIMD exploits instruction-level parallelism Multiple instructions concurrent: instructions happen to be the same
SIMD PROCESSING Single instruction operates on multiple data elements In time or in space Multiple processing elements Time-space duality Array processor: Instruction operates on multiple data elements at the same time Vector processor: Instruction operates on multiple data elements in consecutive time steps
ARRAY VS. VECTOR PROCESSORS
Instruction stream:
  LD VR A[3:0]
  ADD VR VR, 1
  MUL VR VR, 2
  ST A[3:0] VR
Array processor: same op @ same time across space; different ops @ same space over time
Vector processor: same op @ same space over time; different ops @ same time (pipelined)
[Timing diagram: the array processor performs LD0-LD3 together, then AD0-AD3, MU0-MU3, ST0-ST3; the vector processor streams LD0, LD1, ... through pipelined units, overlapping the LD, AD, MU, ST of successive elements]
SCALAR PROCESSING Conventional form of processing (von Neumann model) add r1, r2, r3
SIMD ARRAY PROCESSING Array processor
VLIW PROCESSING Very Long Instruction Word We will get back to this later
VECTOR PROCESSORS
A vector is a one-dimensional array of numbers
Many scientific/commercial programs use vectors:
  for (i = 0; i <= 49; i++)
      C[i] = (A[i] + B[i]) / 2;
A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
Basic requirements:
- Need to load/store vectors: vector registers (contain vectors)
- Need to operate on vectors of different lengths: vector length register (VLEN)
- Elements of a vector might be stored apart from each other in memory: vector stride register (VSTR)
  - Stride: distance between two elements of a vector
VECTOR PROCESSOR ADVANTAGES
+ No dependencies within a vector
  - Pipelining and parallelization work well
  - Can have very deep pipelines; no dependencies!
+ Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
  - Interleaving multiple banks for higher memory bandwidth
  - Prefetching
+ No need to explicitly code loops
  - Fewer branches in the instruction sequence
VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.
VECTOR PROCESSOR LIMITATIONS -- Memory (bandwidth) can easily become a bottleneck, especially if 1. compute/memory operation balance is not maintained 2. data is not mapped appropriately to memory banks
SCALAR CODE EXAMPLE
For i = 1 to 50:
  C[i] = (A[i] + B[i]) / 2
Scalar code (cycle counts on the right):
   MOVI R0 = 50            1
   MOVA R1 = A             1
   MOVA R2 = B             1
   MOVA R3 = C             1
X: LD R4 = MEM[R1++]       11  ; autoincrement addressing
   LD R5 = MEM[R2++]       11
   ADD R6 = R4 + R5        4
   SHFR R7 = R6 >> 1       1
   ST MEM[R3++] = R7       11
   DECBNZ --R0, X          2   ; decrement and branch if NZ
304 dynamic instructions
SCALAR CODE EXECUTION TIME
Scalar execution time on an in-order processor with 1 bank:
- First two loads in the loop cannot be pipelined: 2*11 cycles
- 4 + 50*40 = 2004 cycles
[Pipeline diagram: within each iteration the two 11-cycle loads serialize, followed by the 4-cycle ADD, the shift, the 11-cycle store, and the decrement-and-branch]
SCALAR CODE EXECUTION TIME
Scalar execution time on an in-order processor with 16 banks (word-interleaved):
- First two loads in the loop can be pipelined
- 4 + 50*30 = 1504 cycles
[Pipeline diagram: the second load issues one cycle after the first, overlapping most of its latency]
SCALAR CODE EXECUTION TIME Why 16 banks? 11 cycle memory access latency Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency
VECTOR CODE EXAMPLE
A loop is vectorizable if each iteration is independent of any other
For i = 0 to 49:
  C[i] = (A[i] + B[i]) / 2
Vectorized loop (cycle counts on the right):
  MOVI VLEN = 50        1
  MOVI VSTR = 1         1
  VLD V0 = A            11 + VLN - 1
  VLD V1 = B            11 + VLN - 1
  VADD V2 = V0 + V1     4 + VLN - 1
  VSHFR V3 = V2 >> 1    1 + VLN - 1
  VST C = V3            11 + VLN - 1
7 dynamic instructions
V0 = A[0..49]
Memory requests issue one per cycle (request 0 in cycle 1, request 1 in cycle 2, ...) and complete pipelined across the banks (request 0 done in cycle 11, ..., request 49 done in cycle 60)
Total cycles to load V0: 11 + VLN - 1 = 60 cycles
VECTOR CODE EXECUTION TIME
No chaining, i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
16 memory banks (word-interleaved)
285 cycles
FURTHER READING: SIMD Recommended H&P, Appendix on Vector Processors Russell, “The CRAY-1 computer system,” CACM 1978.
VECTOR MACHINE EXAMPLE: CRAY-1 Russell, “The CRAY-1 computer system,” CACM 1978. Scalar and vector modes 8 64-element vector registers 64 bits per element 16 memory banks 8 64-bit scalar registers 8 24-bit address registers
AMDAHL'S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
  time_enhanced = time_original * (1 - f) + time_original * (f / S)
  Speedup_overall = 1 / ((1 - f) + f/S)
Focus on bottlenecks with large f (and large S)