High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division
Lecture 4: Introduction to Parallel Architectures
  Distributed Memory Systems
Administrivia
  I would like to organise a tutorial session for students who would like an account on the HPC systems.
  The tutorial will cover the physical machines we have and how to use the batch system to submit parallel jobs.
  Lecture slides: ppt vs pdf?
Lecture 3
  We introduced parallel architectures and then looked closely at Shared Memory Systems.
  In particular, we looked at the issues relating to cache coherency and false sharing.
  Note: Distributed Memory machines do not avoid the problems that cause cache coherency issues; they just offload them to you!
Recap - Architecture
  [Figure: processors (P) and memories (M) connected by an interconnection network]
  ° Where is the memory physically located?
Distributed Memory
  Each processor is connected to its own memory and cache and cannot directly access another processor's memory.
  ccNUMA is a special case...
  [Figure: processors (P) and memories (M) connected by an interconnection network]
Topologies
  Early distributed memory machines were built using a limited number of internode connections.
  Messages were forwarded by the processors along the path.
  There was a strong emphasis on topology in algorithms, to minimize the number of hops.
  e.g. Hypercube: maximum number of hops = $\log_2 n$
Network Analogy
  To have a large number of transfers occurring at once, you need a large number of distinct wires.
  Networks are like streets:
    Link = street.
    Switch = intersection.
    Distance (hops) = number of blocks traveled.
    Routing algorithm = travel plan.
  Properties:
    Latency: how long it takes to get between nodes in the network.
    Bandwidth: how much data can be moved per unit time. It is limited by the number of wires and the rate at which each wire can accept data.
Characteristics of a Network
  –Topology (how things are connected):
     Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network.
  –Routing algorithm:
     Example: all east-west then all north-south (avoids deadlock).
  –Switching strategy:
     Circuit switching: full path reserved for entire message, like the telephone.
     Packet switching: message broken into separately-routed packets, like the post office.
  –Flow control (what if there is congestion):
     Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.
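The example routing rule (all east-west, then all north-south) can be sketched for a 2-D mesh as follows. This is only an illustration of the rule, not any particular machine's router; the function name `xy_route` and the (x, y) coordinate convention are assumptions made for the example.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst on a 2-D mesh,
    travelling along x (east-west) first, then along y (north-south).
    Nodes are (x, y) tuples; this naming is an assumption for the sketch."""
    x, y = src
    path = [(x, y)]

    # Phase 1: move east or west until the x coordinate matches.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))

    # Phase 2: move north or south until the y coordinate matches.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))

    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route, "hops:", len(route) - 1)
```

The number of hops is the number of nodes on the path minus one, i.e. the sum of the x and y distances.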
Properties of a Network
  –Diameter: the maximum (over all pairs of nodes) of the length of the shortest path between a given pair of nodes.
  –A network is partitioned into two or more disjoint sub-graphs if some nodes cannot reach others.
  –The bandwidth of a link = $w \cdot \frac{1}{t}$, where $w$ is the number of wires and $t$ is the time per bit.
  –Effective bandwidth is usually lower due to packet overhead.
  –Bisection bandwidth: sum of the bandwidths of the minimum number of channels which, if removed, would partition the network into two equal-sized sub-graphs.
Network Topology
  –In the early years of parallel computing, there was considerable research in network topology and in mapping algorithms to topology.
  –Key cost to be minimized in the early years: number of “hops” (communication steps) between nodes.
  –Modern networks hide the hop cost (e.g., with wormhole routing), so the underlying topology is no longer a major factor in algorithm performance.
  –Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
  However, since some algorithms have a natural topology, it is worthwhile to have some background in this area.
Linear and Ring Topologies
  –Linear array
     Diameter = n-1; average distance ~ n/3.
     Bisection bandwidth = 1.
  –Torus or Ring
     Diameter = n/2; average distance ~ n/4.
     Bisection bandwidth = 2.
  Natural for algorithms that work with 1D arrays.
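The diameter figures for the linear array and the ring can be checked by brute force with a breadth-first search from every node. This is a small sketch for experimenting with topologies; the helper names (`linear_array`, `ring`, `hops_from`, `diameter`) are made up for the example.

```python
from collections import deque


def linear_array(n):
    """Adjacency lists for n nodes in a line: 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n):
    """Adjacency lists for n nodes in a ring (assumes n >= 3)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hops_from(graph, src):
    """Breadth-first search: shortest hop count from src to each reachable node."""
    dist = {src: 0}
    pending = deque([src])
    while pending:
        u = pending.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                pending.append(v)
    return dist


def diameter(graph):
    """Maximum, over all pairs of nodes, of the shortest-path length."""
    return max(max(hops_from(graph, s).values()) for s in graph)


if __name__ == "__main__":
    n = 8
    print("linear array:", diameter(linear_array(n)))  # n - 1
    print("ring:", diameter(ring(n)))                  # n / 2
```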
Meshes and Tori
  –2D (n = total number of nodes)
     Diameter ~ $2\sqrt{n}$.
     Bisection bandwidth = $\sqrt{n}$.
  [Figure: 2D mesh and 2D torus]
  ° Generalizes to higher dimensions (Cray T3E uses 3D Torus).
  ° Natural for algorithms that work with 2D and/or 3D arrays.
Hypercubes
  –Number of nodes $n = 2^d$ for dimension $d$.
     Diameter = d.
     Bisection bandwidth = n/2.
  [Figure: hypercubes of dimension 0 to 4 (0d, 1d, 2d, 3d, 4d)]
  –Popular in early machines (Intel iPSC, NCUBE).
     Lots of clever algorithms.
  –Gray code addressing:
     Each node is connected to d others whose addresses differ in exactly 1 bit.
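The addressing rule translates directly into bit operations: flipping each of the d address bits gives the d neighbours, and the number of hops between two nodes is the number of bits in which their addresses differ. A minimal sketch, with function names chosen for the example:

```python
def neighbours(node, d):
    """Addresses of the d neighbours of `node` in a d-dimensional hypercube:
    flip one address bit at a time."""
    return [node ^ (1 << k) for k in range(d)]


def hops(a, b):
    """Minimum hops between nodes a and b: the number of differing address bits."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    d = 3
    print(neighbours(0b000, d))   # the three nodes one bit away from 000
    print(hops(0b000, 0b111))     # opposite corners: d hops
```

Since two addresses can differ in at most d bits, the largest hop count is d, which matches the diameter given on the slide.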
Trees
  –Diameter = log n.
  –Bisection bandwidth = 1.
  –Easy layout as planar graph.
  –Many tree algorithms (e.g., summation).
  –Fat trees avoid the bisection bandwidth problem:
     More (or wider) links near top.
     Example: Thinking Machines CM-5.
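Tree summation can be sketched as repeated pairwise addition: in each round, neighbouring values are combined, so n values are reduced in about log2(n) rounds. The code below runs sequentially and only mimics the communication pattern; `tree_sum` is a name made up for the example.

```python
def tree_sum(values):
    """Sum `values` by pairwise combination, as a reduction tree would.
    Returns (total, number_of_rounds)."""
    vals = list(values)
    if not vals:
        return 0, 0
    rounds = 0
    while len(vals) > 1:
        # Combine neighbours (0,1), (2,3), ...; on a real machine each
        # pair would be added on a different processor in the same step.
        combined = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an unpaired value moves up unchanged
            combined.append(vals[-1])
        vals = combined
        rounds += 1
    return vals[0], rounds


if __name__ == "__main__":
    total, rounds = tree_sum(range(8))
    print(total, rounds)
```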
Butterflies
  –Diameter = log n.
  –Bisection bandwidth = n.
  –Cost: lots of wires.
  –Used in BBN Butterfly.
  –Natural for FFT.
Evolution of Distributed Memory Multiprocessors
  –Special queue connections are being replaced by direct memory access (DMA):
     Processor packs or copies messages.
     Initiates transfer, goes on computing.
  –Message passing libraries provide store-and-forward abstraction:
     Can send/receive between any pair of nodes, not just along one wire.
     Time proportional to distance since each processor along path must participate.
  –Wormhole routing in hardware:
     Special message processors do not interrupt main processors along path.
     Message sends are pipelined.
     Processors don’t wait for complete message before forwarding.
Networks of Workstations
  Today's dominant paradigm
  Package solutions available
    –IBM SP2, Compaq Alphaserver SC...
  Fast commodity interconnects now available
    –Myrinet, ServerNet...
  Many applications do not need close coupling
    –even Ethernet will do.
  Cheap!!!
Latency and Bandwidth Model
  –Time to send a message of length n is roughly
     Time = latency + n*cost_per_word = latency + n/bandwidth
  –Topology is assumed irrelevant.
  –Often called the “$\alpha$-$\beta$ model” and written
     $\text{Time} = \alpha + n\beta$
  –Usually $\alpha \gg \beta \gg$ time per flop.
     One long message is cheaper than many short ones:
     $\alpha + n\beta \ll n(\alpha + 1 \cdot \beta)$
     Can do hundreds or thousands of flops for the cost of one message.
  –Lesson: need a large computation-to-communication ratio to be efficient.
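To see why one long message beats many short ones, the model can be evaluated directly. In this sketch the latency value is borrowed from the 36 usec user-level figure quoted earlier for the IBM SP; the per-word cost is a made-up number chosen only for illustration.

```python
def message_time(n_words, alpha, beta):
    """Latency-bandwidth model: time = latency + n * cost_per_word."""
    return alpha + n_words * beta


ALPHA = 36e-6    # latency in seconds (36 usec, as quoted for the IBM SP)
BETA = 0.01e-6   # cost per word in seconds (hypothetical value)

n = 1000
one_long = message_time(n, ALPHA, BETA)        # one message of n words
many_short = n * message_time(1, ALPHA, BETA)  # n messages of one word each

if __name__ == "__main__":
    print(f"one long message:    {one_long * 1e6:.2f} usec")
    print(f"{n} short messages: {many_short * 1e6:.2f} usec")
```

With these numbers the latency term is paid once in the first case and n times in the second, which is where almost all of the difference comes from.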
A more detailed performance model: LogP
  –L: latency across the network (may be variable).
  –o: overhead (sending and receiving busy time).
  –g: gap between messages (1/bandwidth).
  –P: number of processors.
  –People often group overheads into latency ($\alpha$-$\beta$ model).
  –Real costs are more complicated.
  [Figure: two processor/memory (P, M) nodes, showing send overhead o_s, receive overhead o_r and network latency L]
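One common way to read these parameters is that a single short message costs the sender's overhead, then the network latency, then the receiver's overhead, and that a sender issuing several messages in a row can start a new one only every max(g, o) time units. The sketch below encodes that reading; it is an illustration under those assumptions, not a complete statement of the model, and the function names and numbers are invented for the example.

```python
def single_message_time(L, o):
    """End-to-end time for one short message:
    send overhead + network latency + receive overhead."""
    return o + L + o


def stream_time(k, L, o, g):
    """Time until the last of k back-to-back short messages from one
    sender has been received, assuming the sender can start a new
    message every max(g, o) time units."""
    interval = max(g, o)
    return (k - 1) * interval + single_message_time(L, o)


if __name__ == "__main__":
    L, o, g = 5, 2, 3   # arbitrary time units, hypothetical values
    print(single_message_time(L, o))
    print(stream_time(4, L, o, g))
```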
Local Machines
  NEC SX-4 - Parallel Vector Processor
    –2 CPUs, X-bar
  Compaq ES40 - SMP
    –4 CPUs, X-bar
  Compaq DS10, PW600 - Network of Workstations
    –28 CPUs, Fast Ethernet
  VPAC Alphaserver SC - Cluster of SMPs
    –128 CPUs (4 * 32), Quadrics Switch
Programming Models
  Machine architectures and programming models have a natural affinity, both in their descriptions and in their effects on each other.
  Algorithms have been developed to fit machines, and machines have been built to fit algorithms.
    Purpose-built supercomputers - ASCI
    FPGAs - QCD machines
Parallel Programming Models
  –Control
     How is parallelism created?
     What orderings exist between operations?
     How do different threads of control synchronize?
  –Data
     What data is private vs. shared?
     How is logically shared data accessed or communicated?
  –Operations
     What are the atomic operations?
  –Cost
     How do we account for the cost of each of the above?
Trivial Example: $\sum_{i=0}^{n-1} f(A[i])$
  –Parallel Decomposition:
     Each evaluation and each partial sum is a task.
  –Assign n/p numbers to each of p procs.
     Each computes independent “private” results and a partial sum.
     One (or all) collects the p partial sums and computes the global sum.
  Two Classes of Data:
  –Logically Shared
     The original n numbers, the global sum.
  –Logically Private
     The individual function evaluations.
     What about the individual partial sums?
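The decomposition can be written out sequentially to make the data ownership explicit: each of the p "processors" gets a contiguous block of about n/p indices, computes a partial sum, and the partial sums are then combined. This sketch does not run in parallel; `f` is a stand-in function and the helper names are assumptions for the example.

```python
def f(x):
    """Stand-in for the function being evaluated (hypothetical)."""
    return x * x


def block_ranges(n, p):
    """Split indices 0..n-1 into p contiguous blocks whose sizes differ
    by at most one. Returns a list of (start, stop) pairs."""
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decomposed_sum(A, p):
    """Compute sum of f(A[i]) as p partial sums plus one final combination."""
    # Each entry of `partial` is the work one processor would do privately.
    partial = [sum(f(A[i]) for i in range(lo, hi))
               for lo, hi in block_ranges(len(A), p)]
    # One processor (or all of them) collects the p partial sums.
    return sum(partial), partial


if __name__ == "__main__":
    total, partial = decomposed_sum(list(range(10)), 3)
    print(partial, total)
```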
Model 1: Shared Address Space
  Program consists of a collection of threads of control.
  Each has a set of private variables, e.g. local variables on the stack.
  Collectively, the threads have a set of shared variables, e.g., static variables, shared common blocks, global heap.
  Threads communicate implicitly by writing and reading shared variables.
  Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores.
  Like concurrent programming on a uniprocessor.
Shared Address Space
  [Figure: processors (P), each with private data, writing (x = ...) and reading (y = ..x...) a variable x held in shared memory]
  Machine model - Shared memory
Shared Memory Code for Computing a Sum
  Thread 1 [s = 0 initially]
    local_s1 = 0
    for i = 0, n/2-1
      local_s1 = local_s1 + f(A[i])
    s = s + local_s1
  Thread 2 [s = 0 initially]
    local_s2 = 0
    for i = n/2, n-1
      local_s2 = local_s2 + f(A[i])
    s = s + local_s2
  What could go wrong?
Solution via Synchronization
  ° Pitfall in computing a global sum s = local_s1 + local_s2 (instructions listed in time order):
      Thread 1 (initially s=0)
        load s [from mem to reg]
        s = s+local_s1 [=local_s1, in reg]
        store s [from reg to mem]
      Thread 2 (initially s=0)
        load s [from mem to reg; initially 0]
        s = s+local_s2 [=local_s2, in reg]
        store s [from reg to mem]
  ° Instructions from different threads can be interleaved arbitrarily.
  ° What can the final result s stored in memory be?
  ° Problem: race condition.
  ° Possible solution: mutual exclusion with locks
      Thread 1
        lock
        load s
        s = s+local_s1
        store s
        unlock
      Thread 2
        lock
        load s
        s = s+local_s2
        store s
        unlock
  ° Locks must be atomic (execute completely without interruption).
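The locked version of the sum can be tried out with Python's standard `threading` module. This is a sketch of the idea rather than a performance example: `f`, the array `A` and the helper names are assumptions, and the important part is that the read-modify-write of the shared `s` happens inside the lock.

```python
import threading


def f(x):
    """Stand-in for the function being evaluated (hypothetical)."""
    return x * x


A = list(range(1000))        # logically shared input data
s = 0                        # logically shared global sum
s_lock = threading.Lock()    # protects every update of s


def worker(lo, hi):
    """Sum f(A[i]) for lo <= i < hi privately, then add it to s under the lock."""
    global s
    local_s = 0              # private partial sum, needs no lock
    for i in range(lo, hi):
        local_s += f(A[i])
    with s_lock:             # lock ... unlock around load, add, store
        s = s + local_s


def run():
    """Run two threads, one per half of A, and return the global sum."""
    global s
    s = 0
    n = len(A)
    threads = [threading.Thread(target=worker, args=(0, n // 2)),
               threading.Thread(target=worker, args=(n // 2, n))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return s


if __name__ == "__main__":
    print(run())
```

Each thread takes the lock only once, after its loop, so the private partial sums keep the time spent inside the lock short.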
Model 2: Message Passing
  Program consists of a collection of named processes.
  Each is a thread of control plus a local address space -- NO shared data.
    Local variables, static variables, common blocks, heap.
  Processes communicate by explicit data transfers -- matching send and receive pairs, identified by source and destination processors (these may be broadcast).
  Coordination is implicit in every communication event.
  Logically shared data is partitioned over local processes.
  Like distributed programming -- program with MPI, PVM.
Message Passing
  [Figure: processes 0 ... n, each with its own local memory; one executes send P0,X and another executes recv Pn,Y]
  Machine model - Distributed Memory
Computing s = x(1)+x(2) on each processor
  ° First possible solution:
      Processor 1
        send xlocal, proc2 [xlocal = x(1)]
        receive xremote, proc2
        s = xlocal + xremote
      Processor 2
        receive xremote, proc1
        send xlocal, proc1 [xlocal = x(2)]
        s = xlocal + xremote
  ° Second possible solution -- what could go wrong?
      Processor 1
        send xlocal, proc2 [xlocal = x(1)]
        receive xremote, proc2
        s = xlocal + xremote
      Processor 2
        send xlocal, proc1 [xlocal = x(2)]
        receive xremote, proc1
        s = xlocal + xremote
  ° What if send/receive acts like the telephone system? The post office?
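The second solution (both sides send first, then receive) can be simulated with "post office" semantics, where a send drops the message into a mailbox and returns immediately. The sketch below uses threads and `queue.Queue` mailboxes from the Python standard library to stand in for processes and the network; it is not MPI or PVM, and all names are made up for the example.

```python
import queue
import threading


def process(x_local, outbox, inbox, results, name):
    """One simulated processor: send first, then receive, then add."""
    outbox.put(x_local)       # send: returns at once, message waits in the mailbox
    x_remote = inbox.get()    # receive: blocks until the partner's message arrives
    results[name] = x_local + x_remote


def exchange(x1, x2):
    """Run both simulated processors and return the sum each one computed."""
    to_p1, to_p2 = queue.Queue(), queue.Queue()   # one unbounded mailbox per processor
    results = {}
    p1 = threading.Thread(target=process, args=(x1, to_p2, to_p1, results, "proc1"))
    p2 = threading.Thread(target=process, args=(x2, to_p1, to_p2, results, "proc2"))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    return results


if __name__ == "__main__":
    print(exchange(3, 4))
```

This works only because the mailboxes buffer the messages. If a send instead had to wait until the partner had started its matching receive (telephone-style), both sides would wait in their send and neither would ever reach its receive.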
Model 3: Data Parallel
  Single sequential thread of control consisting of parallel operations.
  Parallel operations applied to all (or a defined subset) of a data structure.
  Communication is implicit in parallel operators and “shifted” data structures.
  Elegant and easy to understand and reason about.
  Like marching in a regiment.
  Used by Matlab.
  Drawback: not all problems fit this model.
Data Parallel
  A = array of all data
  fA = f(A)
  s = sum(fA)
  [Figure: array A mapped elementwise through f to give fA, which is reduced by sum to give s]
  Machine model - SIMD
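The three statements on the slide map onto whole-array operations in most languages. The sketch below writes them in plain Python, which executes them sequentially; it shows the shape of a data parallel program (one thread of control, operations on whole arrays, optionally restricted to a subset by a mask), not parallel execution. `f` and the sample data are assumptions for the example.

```python
def f(x):
    """Stand-in for the elementwise function (hypothetical)."""
    return x * x


A = [1, 2, 3, 4]            # array of all data

fA = [f(a) for a in A]      # fA = f(A): same operation applied to every element
s = sum(fA)                 # s = sum(fA): a reduction over the whole array

# Operating on a defined subset: elements where the mask is False are
# left unchanged, like processors that are switched off for one instruction.
mask = [a % 2 == 0 for a in A]
fA_masked = [f(a) if m else a for a, m in zip(A, mask)]

if __name__ == "__main__":
    print(fA, s, fA_masked)
```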
SIMD Architecture
  –A large number of (usually) small processors.
  –A single “control processor” issues each instruction.
  –Each processor executes the same instruction.
  –Some processors may be turned off on some instructions.
  –Machines are not popular (CM2), but the programming model is.
  –Implemented by mapping n-fold parallelism to p processors.
  –Programming
     Mostly done in the compilers (HPF = High Performance Fortran).
     Usually modified to Single Program Multiple Data or SPMD.
Next week - Programming