Parallelism Lecture notes from MKP and S. Yalamanchili.


Overview
- Goal: understand how to scale performance via parallelism
  - Execute multiple instructions in parallel: instruction-level parallelism (ILP)
  - Break up a program into multiple parallel instruction streams: thread-level parallelism (TLP)
  - Process multiple data items in parallel: data-level parallelism (DLP)
- Consequences
  - Coordinating parallelism for correctness
  - What about caching?

Reading: Sections 6.1-6.7, 4.10, and 5.10

Parallelism Sources

    main:   la   $t0, L1
            li   $t1, 4
            add  $t2, $zero, $zero
    loop:   lw   $t3, 0($t0)
            add  $t2, $t2, $t3
            addi $t0, $t0, 4
            addi $t1, $t1, -1
            bne  $t1, $zero, loop
            bgt  $t2, $0, then
            move $s0, $t2
            j    exit
    then:   move $s1, $t2
    exit:

Questions posed on the slide:
- Execute iterations in parallel?
- Execute in parallel?

Characterizing Parallelism (characterization due to M. Flynn*)

                                   Data streams
                                   Single     Multiple
    Instruction streams  Single    SISD       SIMD
                         Multiple  MISD       MIMD

- SISD: today's serial computing cores (von Neumann model)
- SIMD: single instruction, multiple data stream computing, e.g., SSE
- MIMD: today's multicore
- Difference between parallel and concurrent

*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948-960.

Instruction Level Parallelism (ILP)

Assessing Performance
- Ideal CPI is increased by dependencies
- Performance impact on CPI can be assessed by computing the impact on a per-instruction basis
- CPI = Base CPI + Probability_of_event × penalty_for_event
  - The second term is the increase in CPI caused by the event
  - For example, an event may be a branch misprediction or the occurrence of a data hazard
  - The probability is computed for the occurrence of the event on an instruction
- Examples: pipelined processors

Instruction-Level Parallelism (ILP) (4.10)
Morgan Kaufmann Publishers, 12 September, 2018. Chapter 4: The Processor
- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline: less work per stage ⇒ shorter clock cycle
  - Multiple issue: replicate pipeline stages ⇒ multiple pipelines; start multiple instructions per clock cycle
- CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into "issue slots"
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned: ALU/branch, then load/store
  - Pad an unused instruction with nop

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF  ID  EX  MEM WB
    n + 4     Load/store         IF  ID  EX  MEM WB
    n + 8     ALU/branch             IF  ID  EX  MEM WB
    n + 12    Load/store             IF  ID  EX  MEM WB
    n + 16    ALU/branch                 IF  ID  EX  MEM WB
    n + 20    Load/store                 IF  ID  EX  MEM WB

MIPS with Static Dual Issue
[Figure: dual-issue datapath. An aligned instruction pair (ALU operation, load/store) is issued together; one ALU performs the ALU computation and a second performs the address computation.]

Instruction Level Parallelism (ILP)
[Figure: pipeline (IF, ID, EX, MEM, WB) with multiple instructions in EX at the same time]
- Single (program) thread of execution
- Issue multiple instructions from the same instruction stream
- Average CPI < 1
- Often called out-of-order (OOO) cores

Dynamically Scheduled Core
[Figure: dynamically scheduled core, with these callouts]
- Preserves dependencies
- Reservation stations hold pending operands
- Results also sent to any waiting reservation stations
- Reorder buffer for register writes
  - Can supply operands for issued instructions

Limits of ILP

    main:   la   $t0, L1
            li   $t1, 4
            add  $t2, $zero, $zero
    loop:   lw   $t3, 0($t0)
            add  $t2, $t2, $t3
            addi $t0, $t0, 4
            addi $t1, $t1, -1
            bne  $t1, $zero, loop
            bgt  $t2, $0, then
            move $s0, $t2
            j    exit
    then:   move $s1, $t2
    exit:

Dependencies!

AMD Bulldozer
[Figure; source: forum.beyond3d.com]

AMD Bobcat
[Figure; source: http://hothardware.com. Callouts: Instruction Level Parallelism (ILP); ECE 6100; later in this course]

Intel Atom
[Figure; source: http://www.anandtech.com]

Thread Level Parallelism (TLP)

Overview
Chapter 7: Multicores, Multiprocessors, and Clusters
- Goal: higher performance through parallelism
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Application-level parallelism
  - Single program (multiple threads) run on multiple processors ⇒ speedup
- Each core can operate concurrently and in parallel
- Multiple threads may operate in a time-sliced fashion on a single core

Thread Level Parallelism (TLP) (6.3, 6.4)
- Multiple threads of execution
- Exploit ILP in each thread
- Exploit concurrent execution across threads

Instruction and Data Streams
Taxonomy due to M. Flynn

                                   Data streams
                                   Single                    Multiple
    Instruction streams  Single    SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                         Multiple  MISD: no examples today   MIMD: Intel Xeon e5345

Example: multithreading (MT) in a single address space

Recall: The Executable Format
- Object file ready to be linked and loaded: header, text, static data, reloc, symbol table, debug
- Linker: combines the object file with static libraries
- Loader: creates an executable instance, or process
- What does a loader do?
  - Allocates memory, copies data and code, initializes registers and stack, and jumps to the entry point

Process
- A process is a running program with state
  - Code, static data, heap, stack, DLLs
  - Stack, memory, open files
  - PC, registers
- The operating system keeps track of the state of all processes
  - E.g., for scheduling processes
- There may be many processes for the same application
  - E.g., web browser
- See an operating systems class for details

Process Level Parallelism Parallel processes and throughput computing Each process itself does not run any faster

From Processes to Threads
- Switching processes on a core is expensive
  - A lot of state information to be managed
- If I want concurrency, launching a process is expensive
- How about splitting up a single process into parallel computations?
  ⇒ Lightweight processes, or threads!

Thread Parallel Execution
[Figure: a process containing multiple threads]

A Thread
- A separate, concurrently executable instruction stream within a process
- Minimum amount of state needed to execute on a core
  - Program counter, registers, stack (callout: our datapath so far!)
  - Remaining state shared with the parent process: memory and files
- Support for creating threads
- Support for merging/terminating threads
- Support for synchronization between threads
  - In accesses to shared data

A Simple Example: Data Parallel Computation
[Figure]

Thread Execution: Basics
[Figure: a process with shared static data and heap. One thread calls create_thread(funcB) and create_thread(funcA), then WaitAllThreads(). Each thread has its own PC, registers, stack pointer, and stack; the threads run funcA() and funcB() and each finishes with end_thread().]

Thread Execution on a Single Core
- Hardware threads
  - Each thread has its own hardware state
- Switching between threads on each cycle to share the core pipeline. Why?
  - Interleaved execution improves utilization!
  - No pipeline stall on load-to-use hazard!

Thread #1:
    lw   $t0, label($0)
    lw   $t1, label1($0)
    and  $t2, $t0, $t1
    andi $t3, $t1, 0xffff
    srl  $t2, $t2, 12
    ...

Thread #2:
    lw   $t3, 0($t0)
    add  $t2, $t2, $t3
    addi $t0, $t0, 4
    addi $t1, $t1, -1
    bne  $t1, $zero, loop
    ...

[Figure: pipeline (IF, ID, EX, MEM, WB) with instructions from thread #1 and thread #2 interleaved cycle by cycle]

An Example Datapath From Poonacha Kongetira, Microarchitecture of the UltraSPARC T1 CPU

Conventional Multithreading
- Zero-overhead context switch
- Duplicated contexts for threads
[Figure: register file holding four contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7), selected by CtxtPtr; memory is shared by threads. Courtesy H. H. Lee]

Software Threads
- Need to save and restore thread state
[Figure: a process with shared static data and heap. One thread calls create_thread(funcB) and create_thread(funcA), then WaitAllThreads(). Each thread has its own PC, registers, stack pointer, and stack; the threads run funcA() and funcB() and each finishes with end_thread().]

Execution Model: Multithreading
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
- If one thread stalls (e.g., I/O), others are executed

Simultaneous Multithreading
- In multiple-issue dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyperthreading in Intel terminology

Hyper-threading
[Figure: 2 CPUs without Hyper-threading (each with one architecture state and its own processor execution resources) versus 2 CPUs with Hyper-threading (each with two architecture states sharing one set of processor execution resources)]
- Implementation of Hyper-threading adds less than 5% to the chip area
- Principle: share major logic components (functional units) and improve utilization
- Architecture state: all core pipeline resources needed for executing a thread

Multithreading with ILP: Examples
[Figure]

Thread Synchronization (6.5)
[Figure: threads within a process. Share data?]

Thread Interactions
- What about shared data?
  - Need synchronization support
- Several different types of synchronization: we will look at one in detail
  - We are specifically interested in the exposure in the ISA

Example: Communicating Threads (Producer, Thread 1)
The producer calls:

    while (1) {
        while (count == BUFFER_SIZE)
            ;  // do nothing
        // add an item to the buffer
        ++count;
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
    }

Example: Communicating Threads (Consumer, Thread 2)
The consumer calls:

    while (1) {
        while (count == 0)
            ;  // do nothing
        // remove an item from the buffer
        --count;
        item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
    }

Uniprocessor Implementation
count++ could be implemented as

    register1 = count;
    register1 = register1 + 1;
    count = register1;

count-- could be implemented as

    register2 = count;
    register2 = register2 - 1;
    count = register2;

Consider this execution interleaving:

    S0: producer executes register1 = count           {register1 = 5}
    S1: producer executes register1 = register1 + 1   {register1 = 6}
    S2: consumer executes register2 = count           {register2 = 5}
    S3: consumer executes register2 = register2 - 1   {register2 = 4}
    S4: producer executes count = register1           {count = 6}
    S5: consumer executes count = register2           {count = 4}

Synchronization
- We need to prevent certain instruction interleavings
  - Or at least be able to detect violations!
- Some sequences of operations (instructions) must happen atomically
  - E.g., register1 = count; register1 = register1 + 1; count = register1;
  - Atomic operations/instructions
- Serializing access to shared resources is a basic requirement of concurrent computation
  - What are critical sections?

Synchronization
The University of Adelaide, School of Computer Science, 12 September 2018. Chapter 2: Instructions: Language of the Computer
- Two processors sharing an area of memory
  - P1 writes, then P2 reads
  - Data race if P1 and P2 don't synchronize: result depends on order of accesses
- Hardware support required
  - Atomic read/write memory operation
  - No other access to the location allowed between the read and write
- Could be a single instruction
  - E.g., atomic swap of register ↔ memory
  - Or an atomic pair of instructions

Implementing an Atomic Operation

    // lock object is shared by all threads
    while (lock.getAndSet(true))   // atomic
        Thread.yield();
    Update count;
    lock.set(false);               // atomic

Synchronization in MIPS
- Load linked: ll rt, offset(rs)
- Store conditional: sc rt, offset(rs)
  - Succeeds if location not changed since the ll: returns 1 in rt
  - Fails if location is changed: returns 0 in rt
- Example: atomic swap (to test/set lock variable)

    try: add $t0,$zero,$s4   ;copy exchange value
         ll  $t1,0($s1)      ;load linked
         sc  $t0,0($s1)      ;store conditional
         beq $t0,$zero,try   ;branch store fails
         add $s4,$zero,$t1   ;put load value in $s4

Other Synchronization Primitives
- test&set(lock)
  - Atomically read and set a lock variable
- swap r1, r2, [r0]
  - With 1/0 values this functions as a lock variable
- ...and a few others

Commodity Multicore Processor Coherent Shared Memory Programming Model From www.zdnet.com

Core Microarchitecture

Parallel Programming
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law (6.2)
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - $T_{new} = T_{parallelizable}/100 + T_{sequential}$
  - $\text{Speedup} = \dfrac{1}{(1 - F_{parallelizable}) + F_{parallelizable}/100} = 90$
  - Solving: $F_{parallelizable} = 0.999$
- Need sequential part to be 0.1% of original time

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Idealized model: assumes load can be balanced across processors

Scaling Example (cont)
- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Idealized model: assumes load is balanced

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix: Time = 20 × t_add
  - 100 processors, 32 × 32 matrix: Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example
- For a fixed size system grow the number of processors to improve performance

Cache Coherence (5.10)
- A shared variable may exist in multiple caches
- Multiple copies to improve latency
- This is really a synchronization problem

Cache Coherence Problem
Chapter 5: Large and Fast: Exploiting Memory Hierarchy
- Suppose two CPU cores share a physical address space
  - Write-through caches

    Time step   Event                 CPU A's cache   CPU B's cache   Memory
    0                                                                 0
    1           CPU A reads X         0                               0
    2           CPU B reads X         0               0               0
    3           CPU A writes 1 to X   1               0               1

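Example (Writeback Cache)
[Figure: three caches each hold X = -100, and memory holds X = -100. One cache updates its copy to X = 505 without writing memory. What do reads (Rd?) by the other processors return? Courtesy H. H. Lee]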

Coherence Defined
- Informally: reads return most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes) ⇒ read returns written value
  - P1 writes X; P2 reads X (sufficiently later) ⇒ read returns written value
    - c.f. CPU B reading X after step 3 in example
  - P1 writes X, P2 writes X ⇒ all processors see writes in the same order
    - End up with the same final value for X

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches: reduces bandwidth for shared memory
  - Replication of read-shared data: reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
  - Owning cache supplies updated value

    CPU activity          Bus activity       CPU A's cache   CPU B's cache   Memory
                                                                             0
    CPU A reads X         Cache miss for X   0                               0
    CPU B reads X         Cache miss for X   0               0               0
    CPU A writes 1 to X   Invalidate for X   1                               0
    CPU B reads X         Cache miss for X   1               1               1

Programming Model: Message Passing (6.7)
- Each processor has private physical address space
- Hardware sends/receives messages between processors

Parallelism
- Write message passing programs
- Explicit send and receive of data
  - Rather than accessing data in shared memory
[Figure: two processes exchange data; one calls send() then receive(), the other calls receive() then send()]

High Performance Computing
[Figures; sources: theregister.co.uk, zdnet.com]
- The dominant programming model is message passing
- Scales well but requires programmer effort
- Science problems have fit this model well to date

A Simple MPI Program
The Message Passing Interface (MPI) library

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int myid, numprocs;
        int tag, source, destination, count;
        int buffer;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        tag = 1234;
        source = 0;
        destination = 1;
        count = 1;
        if (myid == source) {
            /* rank 0 sends one integer to rank 1 */
            buffer = 5678;
            MPI_Send(&buffer, count, MPI_INT, destination, tag, MPI_COMM_WORLD);
            printf("processor %d sent %d\n", myid, buffer);
        }
        if (myid == destination) {
            /* rank 1 receives the integer from rank 0 */
            MPI_Recv(&buffer, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
            printf("processor %d got %d\n", myid, buffer);
        }
        MPI_Finalize();
        return 0;
    }

From http://geco.mines.edu/workshop/class2/examples/mpi/c_ex01.c

A Simple MPI Program

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int n, myid, numprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (1) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            /* rank 0 broadcasts the interval count to every rank */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0)
                break;
            else {
                h = 1.0 / (double) n;
                sum = 0.0;
                /* each rank sums every numprocs-th interval */
                for (i = myid + 1; i <= n; i += numprocs) {
                    x = h * ((double)i - 0.5);
                    sum += (4.0 / (1.0 + x*x));
                }
                mypi = h * sum;
                /* combine the partial sums on rank 0 */
                MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                if (myid == 0)
                    printf("pi is approximately %.16f, Error is %.16f\n",
                           pi, fabs(pi - PI25DT));
            }
        }
        MPI_Finalize();
        return 0;
    }

From http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.htm

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth, c.f. processor/memory bandwidth on an SMP

Data Level Parallelism (DLP)

Data Parallelism
- Image processing: process each square in parallel (data parallel computation)
- Search records: each thread searches one group of records in parallel

Characterizing Parallelism (characterization due to M. Flynn*)

                                   Data streams
                                   Single     Multiple
    Instruction streams  Single    SISD       SIMD
                         Multiple  MISD       MIMD

- SISD: today's serial computing cores (von Neumann model)
- SIMD: single instruction, multiple data stream computing, e.g., SSE
- MIMD: today's multicore

*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948-960.

Parallelism Categories From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Multimedia (3.6, 3.7, 3.8, 6.3)
- Lower dynamic range and precision requirements
  - Do not need 32 bits!
- Inherent parallelism in the operations

Vector Computation
- Operate on multiple data elements (vectors) at a time
- Flexible definition/use of registers
- Registers hold integers, floats (SP), doubles (DP)
- A 128-bit register can hold
  - 1 × 128-bit integer
  - 2 × 64-bit double precision
  - 4 × 32-bit single precision
  - 8 × 16-bit short integers

Processing Vectors When is this more efficient? Memory vector registers When is this not efficient? Think of 3D graphics, linear algebra and media processing

Programming Model: SIMD (Data Level Parallelism)
- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
  - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications

Case Study: Intel Streaming SIMD Extensions
- Eight 128-bit XMM registers
  - x86-64 adds eight more registers: XMM8-XMM15
- 8, 16, 32, 64 bit integers (SSE2)
- 32-bit (SP) and 64-bit (DP) floating point
- Signed/unsigned integer operations
- IEEE 754 floating point support
- Reading assignment:
  - http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

Instruction Categories
- Floating point instructions
  - Arithmetic, movement
  - Comparison, shuffling
  - Type conversion, bit level
- Integer
- Other, e.g., cache management
- ISA extensions!
  - Advanced Vector Extensions (AVX): successor to SSE
[Figure: register and memory operands]

Arithmetic View
Chapter 3: Arithmetic for Computers
- Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
  - Operate on 8 × 8-bit, 4 × 16-bit, or 2 × 32-bit vectors
  - SIMD (single-instruction, multiple-data)
- Saturating operations
  - On overflow, result is largest representable value
  - c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video

SSE Example

    // A 16 byte = 128 bit vector struct
    struct Vector4 { float x, y, z, w; };

    // Add two constant vectors and return the resulting vector
    Vector4 SSE_Add(const Vector4 &Op_A, const Vector4 &Op_B)
    {
        Vector4 Ret_Vector;
        __asm {
            MOV EAX, Op_A              // Load pointers into CPU regs
            MOV EBX, Op_B
            MOVUPS XMM0, [EAX]         // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]
            ADDPS XMM0, XMM1           // Add vector elements
            MOVUPS [Ret_Vector], XMM0  // Save the return vector
        }
        return Ret_Vector;
    }

More complex example (matrix multiply) in Section 3.8, using AVX.
From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

Intel Xeon Phi www.techpowerup.com www.anandtech.com www.anandtech.com

Data Parallel vs. Traditional Vector
- Vector architecture: vector registers A and B feed a pipelined functional unit, which writes vector register C
- Data parallel architecture: registers; process each square in parallel (data parallel computation)

ISA View
[Figure: CPU/core with registers $0, $1, ..., $31, an ALU, and Multiply/Divide with Hi and Lo, alongside SIMD registers XMM0, XMM1, ..., XMM15 and a vector ALU]
- Separate core data path
- Can be viewed as a co-processor with a distinct set of instructions

Study Guide
- What is the difference between hardware MT and software MT?
- Distinguish between TLP and ILP.
- Given two threads of computation, the MIPS pipeline, and fine-grained MT, show the state of the pipeline after 7 cycles.
- How many threads do you need with fine-grained MT before your branch penalties are no longer a problem?
- With coarse-grained MT on a datapath with full forwarding, can you still have load-to-use hazards?

Study Guide (cont.)
- Name two differences between the coherent shared memory and message passing architectures.
- What limits how much performance you can get from ILP?
- Which do you think would be more energy efficient, ILP or TLP? Why? (FYI: this is a complex issue.)

Glossary
- Atomic operations
- Coarse-grained MT
- Fine-grained MT
- Grid computing
- Hyperthreading
- Instruction Level Parallelism (ILP)
- Multithreading
- Message Passing Interface (MPI)
- Simultaneous MT
- Strong scaling
- Swap instruction
- Test & set operation
- Thread Level Parallelism (TLP)
- Weak scaling