
1 Multiprocessor Systems

2 Conventional Wisdom (CW) in Computer Architecture
- Old CW: Power is free, transistors expensive. New CW: “Power wall”: power is expensive, transistors are free (we can put more on a chip than we can afford to turn on).
- Old CW: Multiplies are slow, memory access is fast. New CW: “Memory wall”: memory is slow, multiplies are fast (200 clocks to DRAM memory, 4 clocks for an FP multiply).
- Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...). New CW: “ILP wall”: diminishing returns on more ILP.
- New: Power Wall + Memory Wall + ILP Wall = Brick Wall.
- Old CW: Uniprocessor performance 2X / 1.5 years. New CW: Uniprocessor performance only 2X / 5 years?

3 Uniprocessor Performance (SPECint)
- RISC + x86: 52%/year from 1986 to 2002
- RISC + x86: ??%/year from 2002 to present
- From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
- Sea change in chip design: multiple “cores” or processors per chip

4 Evolution from Single Core to Multi-Core
- “… today’s processors … are nearing an impasse as technologies approach the speed of light.” David Mitchell, The Transputer: The Time Is Now (1989)
  - Procrastination rewarded: 2X sequential performance / 1.5 years
- “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2005)
  - All microprocessor companies switch to MP (2X CPUs / 2 years)
  - Procrastination penalized: 2X sequential performance / 5 years
- Manufacturer/Year:    AMD/’05   Intel/’06   IBM/’04   Sun/’05
  Processors/chip:         2          2          2         8
  Threads/Processor:       1          2          2         4
  Threads/chip:            2          4          4        32
- Procrastination: to keep delaying something that must be done

5 Flynn’s Taxonomy
- Flynn classified architectures by their control and data streams in 1966
  - M.J. Flynn, “Very High-Speed Computers,” Proc. of the IEEE, vol. 54, pp. 1900-1909, Dec. 1966
- Single Instruction Single Data (SISD): uniprocessor
- Single Instruction Multiple Data (SIMD): single PC; vector machines, CM-2
- Multiple Instruction Single Data (MISD): no common commercial examples
- Multiple Instruction Multiple Data (MIMD): clusters, SMP servers
- SIMD => Data Level Parallelism
- MIMD => Thread Level Parallelism
- MIMD popular because
  - Flexible: N programs or 1 multithreaded program
  - Cost-effective: uses the same MPU in desktops and MIMD machines

6 Flynn’s Taxonomy
- SISD (Single Instruction Single Data)
  - Uniprocessors
- MISD (Multiple Instruction Single Data)
  - A single data stream operated on by successive functional units
- SIMD (Single Instruction Multiple Data)
  - One instruction stream executed by multiple processors on different data
  - Simple programming model, low overhead
  - Examples: Connection Machine and vector processors
- MIMD (Multiple Instruction Multiple Data) is the most general
  - Flexibility for parallel applications and multi-programmed systems
  - Processors can run independent processes or applications
  - Processors can also run threads belonging to one parallel application
  - Uses off-the-shelf microprocessors and components

7 Major MIMD Styles
- Symmetric Multiprocessors (SMP)
  - Main memory is shared and equally accessible by all processors
  - Also called Uniform Memory Access (UMA)
  - Bus-based or interconnection-network-based
- Distributed-memory multiprocessors
  - Distributed Shared Memory (DSM) multiprocessors
    - Distributed memories are shared and can be accessed by all processors
    - Non-Uniform Memory Access (NUMA): latency varies between local and remote memory access
  - Message-passing multiprocessors, multicomputers, and clusters
    - Distributed memories are NOT shared
    - Each processor can access only its own local memory
    - Processors communicate by sending and receiving messages

8 Shared Memory Architecture
- Any processor can directly reference any physical memory location
- Any I/O controller can reference any physical memory
- The operating system can run on any processor
  - The OS uses shared memory to coordinate
- Communication occurs implicitly as a result of loads and stores
- Wide range of scale: a few to hundreds of processors
- Memory may be physically distributed among processors
- History dates to the early 1960s
[Figure: processors and I/O controllers all connected to a shared physical memory]

9 Shared Memory Organizations
[Figure: four common organizations: shared cache (processors P1..Pn connect through a switch to an interleaved shared cache and interleaved main memory), bus-based shared memory (processors with private caches share a bus with memory and I/O devices), dance-hall UMA (processors with caches on one side of an interconnection network, memories on the other), and distributed shared memory NUMA (a memory module attached to each processor node, all reachable over the interconnection network)]

10 Bus-Based Symmetric Multiprocessors
- Symmetric access to main memory from any processor
- Dominate the server market
- Building blocks for larger systems
- Attractive as throughput servers and for parallel programs
- Uniform access via loads/stores
- Automatic data movement and coherent replication in caches
- Cheap and powerful extension of uniprocessors
- Key is the extension of the memory hierarchy to support multiple processors
[Figure: processors P1..Pn, each with a multilevel cache, share a bus with main memory and the I/O system]

11 Shared Address Space Programming Model
- A process is a virtual address space with one or more threads of control
- Part of the virtual address space can be shared by processes
- Multiple threads share the address space of a single process
- All communication is through shared memory
  - Achieved by loads and stores
  - Writes by one process/thread are visible to others
  - Special atomic operations are provided for synchronization
- The OS uses shared memory to coordinate processes
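As a minimal sketch of this model (not from the original slides), the following C program uses POSIX threads: the threads share one address space, communicate through an ordinary shared variable, and rely on a mutex as the atomic synchronization operation. The names (shared_sum, NUM_THREADS, CHUNK) are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define CHUNK 1000

static long shared_sum = 0;                       /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    long local = 0;
    for (long i = id * CHUNK; i < (id + 1) * CHUNK; i++)
        local += i;                               /* private partial result */
    pthread_mutex_lock(&lock);                    /* atomic update of shared data */
    shared_sum += local;                          /* this write is visible to all threads */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (long id = 0; id < NUM_THREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NUM_THREADS; id++)
        pthread_join(t[id], NULL);
    printf("sum = %ld\n", shared_sum);
    return 0;
}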

12 Medium and Large Scale Multiprocessors
- The problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
- Centralized memory, or uniform memory access (UMA)
  - Latencies to memory are uniform, but uniformly large
  - Interconnection network: crossbar or multi-stage
- Distributed shared memory, or non-uniform memory access (NUMA)
  - The distributed memories form one global shared physical address space
  - Access to local memory is faster than access to remote memory
[Figure: a centralized-memory organization (processors with caches and separate memory modules on an interconnection network) next to a distributed-shared-memory organization (a memory module attached to each processor node)]

13 Message Passing Architectures
- A complete computer is the building block: processor, memory, and I/O system
- Easier to build and scale than shared memory architectures
- Communication via explicit I/O operations
  - Communication is integrated at the I/O level, not into the memory system
- Much in common with networks of workstations or clusters
  - However, there is tighter integration between processor and network
  - The network is of higher capability than a local area network
- Programming model
  - Direct access only to local memory (private address space)
  - Communication via explicit send/receive messages (library or system calls)
[Figure: nodes, each containing a processor, cache, and memory, connected by an interconnection network]

14 Message-Passing Abstraction
- Send specifies the receiving process and the buffer to be transmitted
- Receive specifies the sending process and the buffer to receive into
- Optional tag on the send and matching rule on the receive
  - Matching rule: match a specific tag t, or any tag, or any process
- The combination of a send and a matching receive achieves
  - A pairwise synchronization event
  - A memory-to-memory copy of the message
- Overheads: copying, buffer management, protection
- Examples: the MPI and PVM message-passing libraries
[Figure: process P executes Send(Q, X, t) on local address X; process Q executes Receive(P, Y, t) into local address Y; the message is delivered when the tags match]
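As a sketch of this abstraction (not part of the original slides), the MPI program below sends a buffer from rank 0 to rank 1 with an explicit tag; the receive names the sender and the tag, and could instead match MPI_ANY_TAG or MPI_ANY_SOURCE, mirroring the matching rules above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, tag = 7;
    double x[100], y[100];                        /* X in the sender, Y in the receiver */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                              /* process P: Send(Q, X, t) */
        for (int i = 0; i < 100; i++) x[i] = i;
        MPI_Send(x, 100, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                       /* process Q: Receive(P, Y, t) */
        MPI_Recv(y, 100, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received y[99] = %g\n", y[99]);
    }
    MPI_Finalize();
    return 0;
}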

15 Variants of Send and Receive
- Parallel programs using send and receive are quite structured
  - Most often, all nodes execute identical copies of a program
  - Processes can name each other using a simple linear ordering
- Blocking send: the sender sends a request and waits until the reply is returned
- Non-blocking send: the sender sends a message and continues without waiting for a reply
- Blocking receive: the receiver blocks if it tries to receive a message that has not yet arrived
- Non-blocking receive: the receiver simply posts a receive without waiting for the sender
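As a minimal illustration (not from the slides), the MPI fragment below uses the non-blocking variants: MPI_Isend and MPI_Irecv return immediately, the process can keep computing, and MPI_Waitall blocks only when the data is actually needed. The function name and buffer sizes are assumptions of the sketch.

#include <mpi.h>

/* Exchange 100 doubles with a peer rank using non-blocking send/receive. */
void exchange(int peer, double *sendbuf, double *recvbuf)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, 100, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 100, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation can overlap with the communication here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);    /* block only when completion matters */
}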

16 Evolution of Message-Passing Machines
- Early machines: a FIFO on each link
  - Hardware close to the programming model
  - Synchronous send/receive operations
  - Topology was central (hypercube algorithms)
- Example: the CalTech Cosmic Cube (Seitz)

17 Example: Intel Paragon
- Sandia’s Intel Paragon XP/S-based supercomputer
- Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus
- Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate

18 Example 1: Matrix Addition (1/3)
Assume an MIMD shared-memory multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices, using P processes.
(1) How should the array of results (C) be partitioned for parallel processing among the P processes?
A few approaches: (a) block-row partitioning, where each process is assigned N/P rows and N columns, or (b) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred due to row-major storage.
(2) Find the number of results assigned to each process for load balancing.
There are N^2 results to compute in array C, so each process should compute N^2/P of them for load balancing. With block-row partitioning, each process is assigned a block-row of (N/P)*N results.
(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).
With block-row partitioning, each process is assigned N/P rows and N columns. This is achieved by partitioning the matrices into block-rows so that each process gets the row range
Range = (Imin, Imax) = (Pid*N/P, Imin + N/P - 1),
where each process plugs its own Pid into the formula to find its range, i.e., a parallel SPMD program (see the sketch after this slide).
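As a small illustration (not in the original slides), and assuming N is a multiple of P, each process can compute its own block-row range from its Pid as follows; the function name is hypothetical.

/* SPMD range computation: every process runs the same code with its own Pid. */
void block_row_range(int N, int P, int Pid, int *Imin, int *Imax)
{
    int rows_per_proc = N / P;                    /* assumes N is a multiple of P */
    *Imin = Pid * rows_per_proc;                  /* first row owned by this process */
    *Imax = *Imin + rows_per_proc - 1;            /* last row owned by this process */
}
/* Example: N = 8, P = 4, Pid = 2  ->  Imin = 4, Imax = 5 */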

19 Example 1: Matrix Addition (2/3)
(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.
The SPMD code that runs on each processor is:

Imin = Pid * N / P;
Imax = Imin + N/P - 1;
for (i = Imin; i <= Imax; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];

Note: the range of computed data for each process depends on its Pid. A complete threaded version is sketched below.
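A hedged, runnable sketch of this SPMD program using POSIX threads (the thread setup, fixed sizes, and initialization are illustrative assumptions, not part of the original slides):

#include <pthread.h>
#include <stdio.h>

#define N 8
#define P 4

static double a[N][N], b[N][N], c[N][N];          /* shared by all threads */

static void *add_block_rows(void *arg)
{
    int Pid  = (int)(long)arg;
    int Imin = Pid * N / P;                       /* this thread's block-row range */
    int Imax = Imin + N / P - 1;
    for (int i = Imin; i <= Imax; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
    return NULL;
}

int main(void)
{
    pthread_t t[P];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i; b[i][j] = j; }

    for (long Pid = 0; Pid < P; Pid++)
        pthread_create(&t[Pid], NULL, add_block_rows, (void *)Pid);
    for (int Pid = 0; Pid < P; Pid++)
        pthread_join(t[Pid], NULL);

    printf("c[N-1][N-1] = %g\n", c[N-1][N-1]);    /* expect 2*(N-1) = 14 */
    return 0;
}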

20 Example 1: Matrix Addition (3/3)
(5) Using a distributed-memory multiprocessor, write an SPMD program with the MPI library to parallelize C = A + B, where each matrix is NxN, using P processes. Assume block-row partitioning, that each process works only on its own range of data, and that the results return to the master process (process 0) after all iterations complete.
The outline of the SPMD program that runs on every process is:

MPI_Init(...);                                   /* initialize MPI */
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);        /* number of processes P */
MPI_Comm_rank(MPI_COMM_WORLD, &mypid);           /* this process's id */
/* process 0 scatters block-rows of A and B over all the processes */
MPI_Scatter(A, ..., 0, MPI_COMM_WORLD);          /* scatter A */
MPI_Scatter(B, ..., 0, MPI_COMM_WORLD);          /* scatter B */
my_range = N / numprocs;                         /* number of rows per process */
for (i = 0; i < my_range; i++)
    for (j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];
MPI_Gather(c, ..., 0, MPI_COMM_WORLD);           /* process 0 gathers C */
MPI_Finalize();

A complete, runnable version is sketched below.
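The following is a minimal runnable sketch of that outline (an illustration with assumed sizes and initialization; it requires N to be a multiple of the number of processes):

#include <mpi.h>
#include <stdio.h>

#define N 8

int main(int argc, char **argv)
{
    int numprocs, mypid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);

    int rows = N / numprocs;                      /* block-row size; assumes N % numprocs == 0 */
    double A[N][N], B[N][N], C[N][N];             /* full matrices, meaningful only on process 0 */
    double a[rows][N], b[rows][N], c[rows][N];    /* each process's block-rows */

    if (mypid == 0)                               /* master initializes A and B */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i; B[i][j] = j; }

    /* scatter block-rows of A and B from process 0 to all processes */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, a, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, rows * N, MPI_DOUBLE, b, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)                /* each process adds its own block-row */
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];

    /* gather the block-rows of C back on process 0 */
    MPI_Gather(c, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (mypid == 0)
        printf("C[N-1][N-1] = %g\n", C[N-1][N-1]);   /* expect 2*(N-1) = 14 */

    MPI_Finalize();
    return 0;
}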

21 Example 2: Summing 100,000 Numbers on 100 Processors
Each processor starts by running a loop that sums its subset of vector A (vectors A and sum are shared variables, Pid is the processor’s id, i is a private variable):

sum[Pid] = 0;
for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
    sum[Pid] = sum[Pid] + A[i];

The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors):

repeat
    synch();                                  /* synchronize first */
    if (P % 2 != 0 && Pid == 0)
        sum[0] = sum[0] + sum[P-1];           /* if P is odd, processor 0 adds the straggler */
    P = P / 2;
    if (Pid < P)
        sum[Pid] = sum[Pid] + sum[Pid + P];
until (P == 1);                               /* the final sum is in sum[0] */

A pthreads version of this reduction is sketched below.
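As an illustration (not from the original slides), the same reduction can be written with POSIX threads, with a barrier standing in for synch(); the thread setup and the initialization of A are assumptions of the sketch.

#include <pthread.h>
#include <stdio.h>

#define NPROC 100                                 /* number of "processors" (threads) */
#define CHUNK 1000                                /* numbers summed by each thread    */

static double A[NPROC * CHUNK];                   /* shared input vector */
static double sum[NPROC];                         /* shared partial sums */
static pthread_barrier_t barrier;                 /* plays the role of synch() */

static void *reduce(void *arg)
{
    int Pid = (int)(long)arg;
    int P = NPROC;                                /* private copy, as on the slide */

    sum[Pid] = 0;
    for (int i = CHUNK * Pid; i < CHUNK * (Pid + 1); i++)
        sum[Pid] += A[i];

    while (P > 1) {
        pthread_barrier_wait(&barrier);           /* synchronize first */
        if (P % 2 != 0 && Pid == 0)
            sum[0] += sum[P - 1];                 /* odd P: thread 0 adds the straggler */
        P = P / 2;
        if (Pid < P)
            sum[Pid] += sum[Pid + P];
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NPROC];
    for (int i = 0; i < NPROC * CHUNK; i++) A[i] = 1.0;
    pthread_barrier_init(&barrier, NULL, NPROC);

    for (long Pid = 0; Pid < NPROC; Pid++)
        pthread_create(&t[Pid], NULL, reduce, (void *)Pid);
    for (int Pid = 0; Pid < NPROC; Pid++)
        pthread_join(t[Pid], NULL);

    printf("total = %g\n", sum[0]);               /* expect 100000 */
    pthread_barrier_destroy(&barrier);
    return 0;
}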

22 An Example with 10 Processors
[Figure: ten processors P0..P9, each holding its own partial sum sum[P0]..sum[P9]; P = 10]

23 An Example with 10 Processors
[Figure: the reduction tree for ten processors. Starting from P = 10, processors P0..P4 combine the partial sums; at P = 5, P0 and P1 continue (P0 also folding in the odd straggler); at P = 2 only P0 remains, and at P = 1 the final sum is left in sum[0]]

24 Message Passing Multiprocessors
- Each processor has its own private address space
- Q1: processors share data by explicitly sending and receiving information (messages)
- Q2: coordination is built into the message-passing primitives (send and receive)

