15-740/18-740 Computer Architecture Lecture 2: SIMD, MIMD, and ISA Principles Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/14/2011.

Slides:

Advertisements

Similar presentations

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.

Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Lecture 6: Multicore Systems

The University of Adelaide, School of Computer Science

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.

Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.

Chapter 17 Parallel Processing.

1 Computer Science, University of Warwick Architecture Classifications A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures.

Introduction to Parallel Processing Ch. 12, Pg

Advanced Computer Architectures

Interconnect Basics 1. Where Is Interconnect Used? To connect components Many examples  Processors and processors  Processors and memories (banks) 

18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013.

CS668- Lecture 2 - Sept. 30 Today’s topics Parallel Architectures (Chapter 2) Memory Hierarchy Busses and Switched Networks Interconnection Network Topologies.

1 Chapter 1 Parallel Machines and Computations (Fundamentals of Parallel Processing) Dr. Ranette Halverson.

Multi-core architectures. Single-core computer Single-core CPU chip.

Parallel Processing - introduction  Traditionally, the computer has been viewed as a sequential machine. This view of the computer has never been entirely.

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"

Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.

An Overview of Parallel Computing. Hardware There are many varieties of parallel computing hardware and many different architectures The original classification.

Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.

ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.

Data Structures and Algorithms in Parallel Computing Lecture 1.

1 Basic Components of a Parallel (or Serial) Computer CPU MEM CPU MEM CPU MEM CPU MEM CPU MEM CPU MEM CPU MEM CPU MEM CPU MEM.

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University.

Outline Why this subject? What is High Performance Computing?

Computer performance issues* Pipelines, Parallelism. Process and Threads.

Lecture 3: Computer Architectures

CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.

My Coordinates Office EM G.27 contact time:

Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015.

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Processor Performance & Parallelism Yashwant Malaiya Colorado State University With some PH stuff.

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.

Computer Architecture Lecture 14: SIMD Processing (Vector and Array Processors) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/18/2015.

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University.

Computer Architecture Lecture 16: SIMD Processing (Vector and Array Processors) Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/24/2014.

Vector computers.

Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.

Processor Level Parallelism 1

Samira Khan University of Virginia Jan 28, 2016 COMPUTER ARCHITECTURE CS 6354 Fundamental Concepts: Computing Models and ISA Tradeoffs The content and.

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Design of Digital Circuits Lecture 20: SIMD Processors

Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.

Computer Architecture: SIMD and GPUs (Part I)

COMP 740: Computer Architecture and Implementation

Advanced Architectures

18-447: Computer Architecture Lecture 30B: Multiprocessors

15-740/ Computer Architecture Lecture 3: Performance

15-740/ Computer Architecture Lecture 28: SIMD and GPUs

Computer Architecture: Parallel Processing Basics

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/12/2011

Design of Digital Circuits Lecture 21: GPUs

Parallel Processing - introduction

Computer Architecture: SIMD and GPUs (Part II)

Samira Khan University of Virginia Sep 4, 2017

Morgan Kaufmann Publishers

Computer Architecture: Multithreading (I)

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Samira Khan University of Virginia Sep 5, 2018

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

15-740/ Computer Architecture Lecture 14: Prefetching

Computer Architecture Lecture 8: SIMD Processors and GPUs

Chapter 4 Multiprocessors

Computer Architecture Lecture 15: Dataflow and SIMD

Samira Khan University of Virginia Jan 30, 2019

Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 3/19/2012

Presentation transcript:

15-740/ Computer Architecture Lecture 2: SIMD, MIMD, and ISA Principles Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/14/2011

Readings Referenced Today ISA and Compilers (Review Set 2)  Colwell et al., “Instruction Sets and Beyond: Computers, Complexity, and Controversy,” IEEE Computer  Wulf, “Compilers and Computer Architecture,” IEEE Computer Alternative ISA and execution models  Fisher, “Very Long Instruction Word Architectures and the ELI-512,” ISCA  Russell, “The CRAY-1 computer system,” CACM  Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro

Readings Referenced Today (Perhaps) On-chip networks  Dally and Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” DAC  Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro Main memory controllers  Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security  Rixner et al., “Memory Access Scheduling,” ISCA Architecture reference manuals  Digital Equipment Corp., “VAX Architecture Handbook,”  Intel Corp. “Intel 64 and IA-32 Architectures Software Developer’s Manual” 3

Review Set 2 Colwell et al., “Instruction Sets and Beyond: Computers, Complexity, and Controversy,” IEEE Computer Wulf, “Compilers and Computer Architecture,” IEEE Computer Due next Wednesday 4

CALCM Seminar Today “Efficient, Heterogeneous, Parallel Processing: The Design of a Micropolygon Rendering Pipeline” Kayvon Fatahalian, CMU 4-5 pm, September 14, Wednesday Singleton Room, Roberts Engineering Hall minar_11_09_14 minar_11_09_14 Attend and provide a review online  Review Set 3 5

Review: Data Parallelism Concurrency arises from performing the same operations on different pieces of data  Single instruction multiple data (SIMD)  E.g., dot product of two vectors Contrast with thread (“control”) parallelism  Concurrency arises from executing different threads of control in parallel Contrast with data flow  Concurrency arises from executing different operations in parallel (in a data driven manner) SIMD exploits instruction-level parallelism  Multiple instructions concurrent: instructions happen to be the same 6

Review: SIMD Processing Single instruction operates on multiple data elements  In time or in space Multiple processing elements Time-space duality  Array processor: Instruction operates on multiple data elements at the same time  Vector processor: Instruction operates on multiple data elements in consecutive time steps 7

Review: Vector Processors A vector is a one-dimensional array of numbers Many scientific/commercial programs use vectors for (i = 0; i<=49; i++) C[i] = (A[i] + B[i]) / 2 A vector processor is one whose instructions operate on vectors rather than scalar (single data) values Basic requirements  Need to load/store vectors  vector registers (contain vectors)  Need to operate on vectors of different lengths  vector length register (VLEN)  Elements of a vector might be stored apart from each other in memory  vector stride register (VSTR) Stride: distance between two elements of a vector 8

Review: Vector Processors (II) A vector instruction performs an operation on each element in consecutive cycles  Vector functional units are pipelined  Each pipeline stage operates on a different data element Vector instructions allow deeper pipelines  No intra-vector dependencies  no hardware interlocking within a vector  No control flow within a vector  Known stride allows prefetching of vectors into memory 9

Review: Vector Processor Advantages + No dependencies within a vector  Pipelining, parallelization work well  Can have very deep pipelines, no dependencies! + Each instruction generates a lot of work  Reduces instruction fetch bandwidth + Highly regular memory access pattern  Interleaving multiple banks for higher memory bandwidth  Prefetching + No need to explicitly code loops  Fewer branches in the instruction sequence 10

Review: Vector ISA Advantages Compact encoding  one short instruction encodes N operations Expressive, tells hardware that these N operations:  are independent  use the same functional unit  access disjoint registers  access registers in same pattern as previous instructions  access a contiguous block of memory (unit-stride load/store)  access memory in a known pattern (strided load/store) Scalable  can run the same code in parallel pipelines (lanes) 11

Review: Scalar Code Example For I = 1 to 50  C[i] = (A[i] + B[i]) / 2 Scalar code MOVI R0 = 501 MOVA R1 = A1 MOVA R2 = B1 MOVA R3 = C1 X: LD R4 = MEM[R1++]11 ;autoincrement addressing LD R5 = MEM[R2++]11 ADD R6 = R4 + R54 SHFR R7 = R6 >> 11 ST MEM[R3++] = R7 11 DECBNZ --R0, X2 ;decrement and branch if NZ dynamic instructions

Review: Vector Code Example A loop is vectorizable if each iteration is independent of any other For I = 0 to 49  C[i] = (A[i] + B[i]) / 2 Vectorized loop: MOVI VLEN = 501 MOVI VSTR = 11 VLD V0 = A11 + VLN - 1 VLD V1 = B11 + VLN – 1 VADD V2 = V0 + V14 + VLN - 1 VSHFR V3 = V2 >> 11 + VLN - 1 VST C = V311 + VLN – dynamic instructions

Scalar Code Execution Time Scalar execution time on an in-order processor with 1 bank  First two loads in the loop cannot be pipelined 2*11 cycles  *40 = 2004 cycles Scalar execution time on an in-order processor with 16 banks (word-interleaved)  First two loads in the loop can be pipelined  *30 = 1504 cycles Why 16 banks?  11 cycle memory access latency  Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency 14

Vector Code Execution Time No chaining  i.e., output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding) 16 memory banks (word-interleaved) 285 cycles 15

Vector Processor Disadvantages -- Works (only) if parallelism is regular (data/SIMD parallelism) ++ Vector operations -- Very inefficient if parallelism is irregular -- How about searching for a key in a linked list? Fisher, “Very Long Instruction Word Architectures and the ELI-512,” ISCA

Vector Processor Limitations -- Memory (bandwidth) can easily become a bottleneck, especially if 1. compute/memory operation balance is not maintained 2. data is not mapped appropriately to memory banks 17

Further Reading: SIMD and GPUs Recommended  H&P, Appendix on Vector Processors  Russell, “The CRAY-1 computer system,” CACM  Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro

Vector Machine Example: CRAY-1 CRAY-1 Russell, “The CRAY-1 computer system,” CACM Scalar and vector modes 8 64-element vector registers 64 bits per element 16 memory banks 8 64-bit scalar registers 8 24-bit address registers 19

Another Way of Exploiting Parallelism SIMD:  Concurrency arises from performing the same operations on different pieces of data MIMD:  Concurrency arises from performing different operations on different pieces of data Control/thread parallelism: execute different threads of control in parallel  multithreading, multiprocessing  Idea: Use multiple processors to solve a problem 20

Flynn’s Taxonomy of Computers Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 SISD: Single instruction operates on single data element SIMD: Single instruction operates on multiple data elements  Array processor  Vector processor MISD: Multiple instructions operate on single data element  Closest form: systolic array processor, streaming processor MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)  Multiprocessor  Multithreaded processor 21

On-Chip Network Based Multi-Core Systems A scalable multi-core is a distributed system on a chip 22 R PE R R R R R R R R R R R R R R R To East To PE To West To North To South Input Port with Buffers Control Logic Crossbar R Router PE Processing Element (Cores, L2 Banks, Memory Controllers, Accelerators, etc) Routing Unit (RC) VC Allocator (VA) Switch Allocator (SA)

Idea of On-Chip Networks Problem: Connecting many cores with a single bus is not scalable  Single point of connection limits communication bandwidth What if multiple core pairs want to communicate with each other at the same time?  Electrical loading on the single bus limits bus frequency Idea: Use a network to connect cores  Connect neighboring cores via short links  Communicate between cores by routing packets over the network Dally and Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” DAC

Advantages/Disadvantages of NoCs Advantages compared to bus + More links  more bandwidth  multiple core-to-core transactions can occur in parallel in the system (no single point of contention)  higher performance + Links are short and less loaded  high frequency + More scalable system  more components/cores can be supported on the network than on a single bus + Eliminates single point of failure Disadvantages - Requires routers that can route data/control packets  costs area, power, complexity - Maintaining cache coherence is more complex 24

Bus + Simple + Cost effective for a small number of nodes + Easy to implement coherence (snooping) - Not scalable to large number of nodes (limited bandwidth, electrical loading  reduced frequency) - High contention 25

Crossbar Every node connected to every other Good for small number of nodes + Least contention in the network: high bandwidth - Expensive - Not scalable due to quadratic cost Used in core-to-cache-bank networks in - IBM POWER5 - Sun Niagara I/II 26

Mesh O(N) cost Average latency: O(sqrt(N)) Easy to layout on-chip: regular and equal-length links Path diversity: many ways to get from one node to another Used in Tilera 100-core And many on-chip network prototypes 27

Torus Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle Torus avoids this problem + Higher path diversity than mesh - Higher cost - Harder to lay out on-chip - Unequal link lengths 28

Torus, continued Weave nodes to make inter-node latencies ~constant 29

Example NoC: 100-core Tilera Processor  Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro