
COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 4, 2017
The content and concept of this course are adapted from CMU ECE 740.

AGENDA
Logistics
Review from last lecture
Fundamental concepts: computing models and ISA tradeoffs

LOGISTICS
Review 2: due Wednesday. Readings: Dennis and Misunas, "A preliminary architecture for a basic data flow processor," ISCA 1974; Kung, H. T., "Why Systolic Architectures?," IEEE Computer 1982.
Participation: discuss in Piazza and in class (5% of grade).
Project list: opens Wednesday; start early and be prepared to spend time on the project.

FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: single instruction stream operates on a single data element
SIMD: single instruction stream operates on multiple data elements (array processor, vector processor)
MISD: multiple instruction streams operate on a single data element (closest forms: systolic array processor, streaming processor)
MIMD: multiple instruction streams operate on multiple data elements (multiprocessor, multithreaded processor)

SIMD PROCESSING
A single instruction operates on multiple data elements, in time or in space, using multiple processing elements.
Time-space duality:
Array processor: the instruction operates on multiple data elements at the same time (space)
Vector processor: the instruction operates on multiple data elements in consecutive time steps (time)

SCALAR CODE EXAMPLE
For i = 0 to 49: C[i] = (A[i] + B[i]) / 2
Scalar code (per-instruction latency in cycles on the right):
    MOVI R0 = 50           ; 1
    MOVA R1 = A            ; 1
    MOVA R2 = B            ; 1
    MOVA R3 = C            ; 1
X:  LD R4 = MEM[R1++]      ; 11  (autoincrement addressing)
    LD R5 = MEM[R2++]      ; 11
    ADD R6 = R4 + R5       ; 4
    SHFR R7 = R6 >> 1      ; 1
    ST MEM[R3++] = R7      ; 11
    DECBNZ --R0, X         ; 2   (decrement and branch if not zero)
304 dynamic instructions (4 setup + 50 iterations x 6 instructions)

VECTOR CODE EXAMPLE
A loop is vectorizable if each iteration is independent of every other iteration.
For i = 0 to 49: C[i] = (A[i] + B[i]) / 2
Vectorized loop (per-instruction latency in cycles on the right; VLEN = 50):
    MOVI VLEN = 50         ; 1
    MOVI VSTR = 1          ; 1
    VLD V0 = A             ; 11 + VLEN - 1
    VLD V1 = B             ; 11 + VLEN - 1
    VADD V2 = V0 + V1      ; 4 + VLEN - 1
    VSHFR V3 = V2 >> 1     ; 1 + VLEN - 1
    VST C = V3             ; 11 + VLEN - 1
7 dynamic instructions
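As a modern aside (not from the original slides), the same element-wise average maps directly onto today's packed-SIMD instruction sets. Below is a minimal sketch using x86 AVX2 intrinsics, which operate on 8 32-bit elements per instruction rather than a 50-element vector register; it assumes non-negative 32-bit inputs so that the logical shift matches the slide's SHFR, and a scalar tail handles the 50 mod 8 = 2 leftover elements.

    /* Sketch: C[i] = (A[i] + B[i]) / 2 with AVX2 intrinsics.
       Compile with -mavx2. */
    #include <immintrin.h>

    #define N 50

    void avg_simd(const int *A, const int *B, int *C) {
        int i = 0;
        for (; i + 8 <= N; i += 8) {                 /* 8 elements per step */
            __m256i a = _mm256_loadu_si256((const __m256i *)(A + i));
            __m256i b = _mm256_loadu_si256((const __m256i *)(B + i));
            __m256i s = _mm256_add_epi32(a, b);      /* A[i] + B[i] */
            __m256i c = _mm256_srli_epi32(s, 1);     /* logical >> 1, i.e., / 2 */
            _mm256_storeu_si256((__m256i *)(C + i), c);
        }
        for (; i < N; i++)                           /* scalar tail */
            C[i] = (unsigned)(A[i] + B[i]) >> 1;
    }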

SCALAR CODE EXECUTION TIME
Scalar execution time on an in-order processor with 1 memory bank: the first two loads in the loop cannot be pipelined (2 x 11 cycles), giving 4 + 50 x 40 = 2004 cycles.
Scalar execution time on an in-order processor with 16 banks (word-interleaved): the first two loads in the loop can be pipelined, giving 4 + 50 x 30 = 1504 cycles.
Why 16 banks? The memory access latency is 11 cycles; having 16 (> 11) banks ensures enough memory operations can overlap to cover the memory latency.
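The totals above follow from the per-instruction latencies in the scalar listing; this small sketch (illustrative only) spells out the arithmetic.

    #include <stdio.h>

    int main(void) {
        int setup = 4;                                /* MOVI + 3 x MOVA, 1 cycle each */
        int iter_1_bank   = 11 + 11 + 4 + 1 + 11 + 2; /* loads serialize: 40 cycles */
        int iter_16_banks = 11 + 1  + 4 + 1 + 11 + 2; /* 2nd load overlaps all but 1 cycle: 30 */
        printf("1 bank:   %d cycles\n", setup + 50 * iter_1_bank);   /* 2004 */
        printf("16 banks: %d cycles\n", setup + 50 * iter_16_banks); /* 1504 */
        return 0;
    }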

VECTOR CODE EXECUTION TIME
No chaining, i.e., the output of a vector functional unit cannot be used as the direct input of another (no vector data forwarding).
16 memory banks (word-interleaved).
285 cycles.
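The 285-cycle figure can be reconstructed from the vector listing's latencies, assuming each instruction waits for its predecessor to finish (no chaining) and the two loads serialize on a single memory pipeline; a sketch:

    #include <stdio.h>

    int main(void) {
        int vlen  = 50;
        int total = 1 + 1              /* MOVI VLEN, MOVI VSTR       */
                  + (11 + vlen - 1)    /* VLD V0 = A           (60)  */
                  + (11 + vlen - 1)    /* VLD V1 = B           (60)  */
                  + (4  + vlen - 1)    /* VADD V2 = V0 + V1    (53)  */
                  + (1  + vlen - 1)    /* VSHFR V3 = V2 >> 1   (50)  */
                  + (11 + vlen - 1);   /* VST C = V3           (60)  */
        printf("no chaining, 16 banks: %d cycles\n", total); /* 285 */
        return 0;
    }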

VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
-- Very inefficient if parallelism is irregular: how about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if:
1. the compute/memory operation balance is not maintained
2. data is not mapped appropriately to memory banks

FURTHER READING: SIMD
Recommended: H&P, appendix on vector processors; Russell, "The CRAY-1 computer system," CACM 1978.

VECTOR MACHINE EXAMPLE: CRAY-1
Russell, "The CRAY-1 computer system," CACM 1978
Scalar and vector modes
8 64-element vector registers (64 bits per element)
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

AMDAHL'S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
time_enhanced = time_original x ((1 - f) + f/S)
Speedup_overall = 1 / ((1 - f) + f/S)
Focus on bottlenecks with large f (and large S).
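A quick numeric illustration (values chosen here for the example, not from the slide): if the enhanced fraction is f = 0.9 and the enhancement factor is S = 10, the overall speedup is 1 / (0.1 + 0.09), about 5.26x rather than 10x; the unenhanced 10% caps the benefit.

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction f is sped up by S */
    static double amdahl_speedup(double f, double S) {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void) {
        printf("f=0.9, S=10  -> %.2fx\n", amdahl_speedup(0.9, 10.0)); /* 5.26x */
        printf("f=0.9, S=1e9 -> %.2fx\n", amdahl_speedup(0.9, 1e9)); /* ~10x ceiling */
        return 0;
    }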


SYSTOLIC ARRAYS

WHY SYSTOLIC ARCHITECTURES?
Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory.
Similar to an assembly line of processing elements: different people work on the same car, and many cars are assembled simultaneously.
Why? Special-purpose accelerators/architectures need:
Simple, regular design (keep # unique parts small and regular)
High concurrency -> high performance
Balanced computation and I/O (memory) bandwidth

SYSTOLIC ARRAYS
Memory is the heart, the PEs are the cells: memory pulses data through the cells.
H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.

SYSTOLIC ARCHITECTURES
Basic principle: replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs.
Balances computation and memory bandwidth.
Differences from pipelining:
These are individual PEs
The array structure can be non-linear and multi-dimensional
PE connections can be multidirectional (and of different speeds)
PEs can have local memory and execute kernels (rather than a piece of the instruction)

SYSTOLIC COMPUTATION EXAMPLE: CONVOLUTION
Used in filtering, pattern matching, correlation, polynomial evaluation, and many image processing tasks.

CONVOLUTION
y1 = w1x1 + w2x2 + w3x3
y2 = w1x2 + w2x3 + w3x4
In general, y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) for a 3-tap weight sequence.
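The sketch below (a simplified software model, not Kung's exact design) mimics a 3-tap weight-stationary systolic pipeline for this convolution: each PE holds one weight plus the input sample that has reached it, samples advance one PE per cycle, and the partial sums are combined each cycle. The weights and inputs are made-up example values; a real systolic array would also pipeline the partial sums through the PEs rather than summing them combinationally.

    #include <stdio.h>

    #define TAPS 3
    #define N    6

    int main(void) {
        int w[TAPS]   = {2, 3, 4};          /* stationary weights w1..w3 (example) */
        int x[N]      = {1, 2, 3, 4, 5, 6}; /* input stream x1..x6 (example) */
        int reg[TAPS] = {0, 0, 0};          /* x sample currently held by each PE */

        for (int cycle = 0; cycle < N; cycle++) {
            /* Samples advance one PE per cycle; the new x enters PE 0. */
            for (int t = TAPS - 1; t > 0; t--)
                reg[t] = reg[t - 1];
            reg[0] = x[cycle];

            /* Each PE multiplies its held sample by a weight; reg[TAPS-1]
               holds the oldest sample, which pairs with w1. */
            int y = 0;
            for (int t = 0; t < TAPS; t++)
                y += w[TAPS - 1 - t] * reg[t];

            /* First valid output once the pipeline is full:
               y1 = w1*x1 + w2*x2 + w3*x3, matching the slide. */
            if (cycle >= TAPS - 1)
                printf("y%d = %d\n", cycle - TAPS + 2, y);
        }
        return 0;
    }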

SYSTOLIC ARRAYS: PROS AND CONS
Advantage: specialized (the computation needs to fit the PE organization/functions) -> improved efficiency, simple design, high concurrency/performance -> can do more with less memory bandwidth.
Downside: specialized -> not generally applicable, because the computation needs to fit the PE functions/organization.

AGENDA
Logistics
Review from last lecture
Fundamental concepts: computing models and ISA tradeoffs

LEVELS OF TRANSFORMATION
Problem -> Algorithm -> Program/Language -> ISA -> Microarchitecture -> Logic -> Circuits
ISA: the agreed-upon interface between software and hardware (the builder/user interface); the SW/compiler assumes it, the HW promises it. It is a contract that the hardware promises to satisfy, and it is what the software writer needs to know to write system/user programs.
Microarchitecture: a specific implementation of an ISA; not visible to the software.
Microprocessor: ISA, uarch, circuits. "Architecture" = ISA + microarchitecture.

ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
Gas pedal: the interface for "acceleration". Internals of the engine: implement "acceleration".
Add instruction vs. adder implementation: the implementation (uarch) can vary as long as it satisfies the specification (ISA), e.g., bit-serial, ripple-carry, and carry-lookahead adders all implement the same add instruction.
The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
The uarch usually changes faster than the ISA: few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs. Why?

ISA
Instructions: opcodes, addressing modes, data types; instruction types and formats; registers, condition codes
Memory: address space, addressability, alignment; virtual memory management
Call and interrupt/exception handling
Access control, priority/privilege
I/O
Task management
Power and thermal management
Multi-threading support, multiprocessor support

MICROARCHITECTURE
Implementation of the ISA under specific design constraints and goals: anything done in hardware without exposure to software.
Pipelining
In-order versus out-of-order instruction execution
Memory access scheduling policy
Speculative execution
Superscalar processing (multiple instruction issue?)
Clock gating
Caching? Levels, size, associativity, replacement policy
Prefetching?
Voltage/frequency scaling?
Error correction?
