High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division

High Performance Parallel Programming Lecture 4: Introduction to Parallel Architectures - Distributed Memory Systems

High Performance Parallel Programming Administrivia I would like to organise a tutorial session for those students who would like an account on the HPC systems. This tute will cover the actual physical machines we have and how to use the batch system to submit parallel jobs. Lecture slides: ppt vs pdf?

High Performance Parallel Programming Lecture 3 We introduced parallel architectures and then looked closely at shared memory systems, in particular the issues relating to cache coherency and false sharing. Note: distributed memory machines don't avoid the problems that cause cache coherency trouble - they just offload them to you!

High Performance Parallel Programming Recap - Architecture [Diagram: processors (P) connected to memory modules (M) through an interconnection network.] ° Where is the memory physically located?

High Performance Parallel Programming Distributed Memory Each processor is connected to its own memory and cache and cannot directly access another processor's memory. ccNUMA is a special case... [Diagram: processors (P) with local memories (M) joined by an interconnection network.]

High Performance Parallel Programming Topologies Early distributed memory machines were built using point-to-point internode connections (each node had a limited number of links). Messages were forwarded by the processors along the path. There was a strong emphasis on topology in algorithms, to minimize the number of hops, e.g. on a hypercube the number of hops = log2 n.

High Performance Parallel Programming Network Analogy To have a large number of transfers occurring at once, you need a large number of distinct wires. Networks are like streets: Link = street. Switch = intersection. Distances (hops) = number of blocks traveled. Routing algorithm = travel plan. Properties: Latency: how long to get between nodes in the network. Bandwidth: how much data can be moved per unit time: Bandwidth is limited by the number of wires and the rate at which each wire can accept data.

High Performance Parallel Programming Characteristics of a Network –Topology (how things are connected) Crossbar, ring, 2-D mesh and 2-D torus, hypercube, omega network. –Routing algorithm: Example: all east-west then all north-south (avoids deadlock). –Switching strategy: Circuit switching: full path reserved for entire message, like the telephone. Packet switching: message broken into separately-routed packets, like the post office. –Flow control (what if there is congestion): Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.

High Performance Parallel Programming Properties of a Network –Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes. –A network is partitioned into two or more disjoint sub-graphs if some nodes cannot reach others. –The bandwidth of a link = w * (1/t), where w is the number of wires and t is the time per bit. –Effective bandwidth is usually lower due to packet overhead. –Bisection bandwidth: sum of the bandwidths of the minimum number of channels which, if removed, would partition the network into two sub-graphs.
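To make these definitions concrete, here is a small Python sketch that plugs hypothetical numbers into the formulas above; the wire count, bit time and packet sizes are assumptions chosen for illustration, not measurements of any real network.

```python
# Hedged illustration of the link-bandwidth definition above (all numbers assumed).
w = 16            # number of wires on the link
t = 1e-9          # time per bit on each wire, in seconds

link_bw = w * (1 / t)                      # bandwidth of a link = w * 1/t (bits/s)
print(f"raw link bandwidth: {link_bw / 1e9:.1f} Gbit/s")

# Effective bandwidth is lower because every packet also carries a header.
payload_bits = 1024 * 8
header_bits = 64 * 8
effective_bw = link_bw * payload_bits / (payload_bits + header_bits)
print(f"effective bandwidth: {effective_bw / 1e9:.2f} Gbit/s")
```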

High Performance Parallel Programming Network Topology –In the early years of parallel computing, there was considerable research in network topology and in mapping algorithms to topology. –Key cost to be minimized in early years: number of "hops" (communication steps) between nodes. –Modern networks hide hop cost (i.e. "wormhole routing"), so the underlying topology is no longer a major factor in algorithm performance. –Example: On the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec. However, since some algorithms have a natural topology, it is worthwhile to have some background in this area.

High Performance Parallel Programming Linear and Ring Topologies –Linear array: Diameter = n-1; average distance ~ n/3; Bisection bandwidth = 1. –Torus or Ring: Diameter = n/2; average distance ~ n/4; Bisection bandwidth = 2. Natural for algorithms that work with 1D arrays.
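These diameter figures are easy to check by brute force. The Python sketch below (the size n = 8 is an arbitrary choice) builds a small linear array and a ring as adjacency lists and computes each diameter by breadth-first search.

```python
# Verify the linear-array and ring diameters quoted above by brute-force BFS.
from collections import deque

def diameter(adj):
    """Maximum over all source nodes of the longest shortest path, in hops."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

n = 8
linear = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(linear))   # n - 1 = 7
print(diameter(ring))     # n / 2 = 4
```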

High Performance Parallel Programming Meshes and Tori –2D (√n x √n nodes): Diameter = 2 * √n; Bisection bandwidth = √n. [Diagrams: 2D mesh, 2D torus.] ° Generalizes to higher dimensions (Cray T3E uses a 3D torus). ° Natural for algorithms that work with 2D and/or 3D arrays.

High Performance Parallel Programming Hypercubes –Number of nodes n = 2^d for dimension d. Diameter = d. Bisection bandwidth = n/2. [Diagrams: 0d, 1d, 2d, 3d and 4d hypercubes.] –Popular in early machines (Intel iPSC, NCUBE). Lots of clever algorithms. –Gray code addressing: each node is connected to d others whose labels differ in exactly 1 bit, as sketched below.
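A minimal Python sketch of this addressing scheme (the dimension d = 4 is an arbitrary choice): each node label is a d-bit number, its d neighbours differ from it in exactly one bit, and the hop count between two nodes is the Hamming distance of their labels, so the diameter is d = log2 n.

```python
# Hypercube addressing: labels are d-bit numbers, links flip exactly one bit.
d = 4
n = 2 ** d                          # number of nodes

def neighbours(node):
    """The d nodes whose labels differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(d)]

def hops(a, b):
    """Minimum number of hops = Hamming distance between the two labels."""
    return bin(a ^ b).count("1")

print(neighbours(0b0000))           # [1, 2, 4, 8]
print(hops(0b0000, 0b1111))         # 4 = d = log2(n), i.e. the diameter
```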

High Performance Parallel Programming Trees –Diameter = log n. –Bisection bandwidth = 1. –Easy layout as planar graph. –Many tree algorithms (e.g., summation). –Fat trees avoid bisection bandwidth problem: More (or wider) links near top. Example: Thinking Machines CM-5.
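As a small illustration of tree summation (one of the tree algorithms mentioned above), the Python sketch below combines an arbitrary set of 8 values pairwise, so the reduction finishes in log2 8 = 3 steps.

```python
# Tree summation: combine values pairwise, halving the count at each step.
vals = [3, 1, 4, 1, 5, 9, 2, 6]     # one value per (hypothetical) leaf node
steps = 0
while len(vals) > 1:
    vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    steps += 1
print(vals[0], steps)               # 31 after 3 = log2(8) steps
```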

High Performance Parallel Programming Butterflies –Diameter = log n. –Bisection bandwidth = n. –Cost: lots of wires. –Used in the BBN Butterfly. –Natural for FFT. [Diagram: butterfly network.]

High Performance Parallel Programming Evolution of Distributed Memory Multiprocessors –Special queue connections are being replaced by direct memory access (DMA): Processor packs or copies messages. Initiates transfer, goes on computing. –Message passing libraries provide store-and-forward abstraction: Can send/receive between any pair of nodes, not just along one wire. Time proportional to distance since each processor along path must participate. –Wormhole routing in hardware: Special message processors do not interrupt main processors along path. Message sends are pipelined. Processors don’t wait for complete message before forwarding.

High Performance Parallel Programming Networks of Workstations Today's dominant paradigm. Packaged solutions are available –IBM SP2, Compaq Alphaserver SC... Fast commodity interconnects are now available –Myrinet, ServerNet... Many applications do not need close coupling –even Ethernet will do. Cheap!!!

High Performance Parallel Programming Latency and Bandwidth Model –Time to send a message of length n is roughly: Time = latency + n*cost_per_word = latency + n/bandwidth –Topology is assumed irrelevant. –Often called the "alpha-beta model" and written Time = α + n*β –Usually α >> β >> time per flop. One long message is cheaper than many short ones: α + n*β << n*(α + 1*β). Can do hundreds or thousands of flops for the cost of one message. –Lesson: you need a large computation-to-communication ratio to be efficient.
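The Python sketch below plugs hypothetical values of α and β into the model (the numbers are assumptions, not measurements) to show just how much cheaper one long message is than many short ones.

```python
# Alpha-beta model: Time = alpha + n*beta (all parameter values assumed).
alpha = 36e-6        # per-message latency, in seconds
beta = 1e-9          # per-word transfer cost, in seconds (1/bandwidth)
n = 100_000          # number of words to move

one_long = alpha + n * beta               # alpha + n*beta
many_short = n * (alpha + 1 * beta)       # n*(alpha + 1*beta)
print(f"one message of n words: {one_long * 1e6:.0f} us")
print(f"n messages of 1 word:   {many_short * 1e6:.0f} us")
```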

High Performance Parallel Programming A more detailed performance model: LogP –L: latency across the network (may be variable). –o: overhead (sending and receiving busy time). –g: gap between messages (1/bandwidth). –P: number of processors. –People often group the overheads into the latency (α-β model). –Real costs are more complicated. [Diagram: two nodes (processor P plus memory M) exchanging a message, showing the send overhead o_s, the receive overhead o_r and the latency L across the network.]
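A minimal sketch of point-to-point costs under LogP, with hypothetical parameter values; the streaming formula is an assumption of the usual form (a new message can be injected every max(o, g) seconds), used here only for illustration.

```python
# LogP cost sketch: L, o, g values are assumed, not measured.
L = 6e-6    # latency across the network, in seconds
o = 2e-6    # send/receive overhead on each processor
g = 4e-6    # minimum gap between consecutive messages (1/bandwidth)

# One small message: sender overhead + network latency + receiver overhead.
t_one = o + L + o

# n back-to-back small messages from one sender: injections are spaced by
# max(o, g); the last message still pays L plus the receiver overhead.
n = 10
t_stream = (n - 1) * max(o, g) + o + L + o
print(t_one, t_stream)
```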

High Performance Parallel Programming Local Machines NEC SX-4 - Parallel Vector Processor –2 cpus, X-bar Compaq ES40 - SMP –4 cpus, X-bar Compaq DS10, PW600 - Network of Workstations –28 cpus, Fast Ethernet VPAC Alphaserver SC - Cluster of SMPs –128 cpus (4 * 32), Quadrics Switch

High Performance Parallel Programming Programming Models Machine architectures and programming models have a natural affinity, both in their descriptions and in their effects on each other. Algorithms have been developed to fit machines, and machines have been built to fit algorithms. Purpose-built supercomputers - ASCI; FPGAs - QCD machines.

High Performance Parallel Programming Parallel Programming Models –Control How is parallelism created? What orderings exist between operations? How do different threads of control synchronize? –Data What data is private vs. shared? How is logically shared data accessed or communicated? –Operations What are the atomic operations? –Cost How do we account for the cost of each of the above?

High Performance Parallel Programming Trivial Example Compute the sum of f(A[i]) for i = 0 to n-1. –Parallel Decomposition: Each evaluation and each partial sum is a task. –Assign n/p numbers to each of p procs Each computes independent "private" results and a partial sum. One (or all) collects the p partial sums and computes the global sum. Two Classes of Data: –Logically Shared The original n numbers, the global sum. –Logically Private The individual function evaluations. What about the individual partial sums?
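The decomposition can be simulated serially in a few lines of Python (f, n, p and the data below are arbitrary choices): each of the p "processors" gets a block of n/p numbers, computes a private partial sum, and the p partial sums are then combined into the global sum.

```python
# Serial simulation of the parallel decomposition described above.
def f(x):
    return x * x          # an arbitrary stand-in for the function f

n, p = 16, 4
A = list(range(n))        # the original n numbers (logically shared)

partial_sums = []
for rank in range(p):
    block = A[rank * (n // p):(rank + 1) * (n // p)]   # n/p numbers per proc
    partial_sums.append(sum(f(a) for a in block))      # private partial sum
global_sum = sum(partial_sums)                         # collect the p partial sums
print(partial_sums, global_sum)
```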

High Performance Parallel Programming Model 1: Shared Address Space Program consists of a collection of threads of control. Each has a set of private variables, e.g. local variables on the stack. Collectively with a set of shared variables, e.g., static variables, shared common blocks, global heap. Threads communicate implicitly by writing and reading shared variables. Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks or semaphores. Like concurrent programming on a uniprocessor.

High Performance Parallel Programming Shared Address Space [Diagram: several threads running on processors P; each has private variables, while shared variables such as x and y are visible to all (e.g. x = ... in one thread, y = ..x.. in another). Machine model - shared memory.]

High Performance Parallel Programming Shared Memory Code for Computing a Sum [s = 0 initially]
Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1
Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2
What could go wrong?

High Performance Parallel Programming Solution via Synchronization
° Instructions from different threads can be interleaved arbitrarily. Pitfall in computing the global sum s = local_s1 + local_s2:
Thread 1 (s = 0 initially): load s [from mem to reg]; s = s + local_s1 [= local_s1, in reg]; store s [from reg to mem]
Thread 2 (s = 0 initially): load s [from mem to reg; may still be 0]; s = s + local_s2 [= local_s2, in reg]; store s [from reg to mem]
° What can the final result s stored in memory be? Problem: race condition.
° Possible solution: mutual exclusion with locks - each thread does lock; load s; s = s + local_s; store s; unlock.
° Locks must be atomic (execute completely without interruption).
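For concreteness, here is a runnable sketch of the locked version using Python threads; the array, the function f and the two-way split are arbitrary choices, and the point is only the lock around the read-modify-write of the shared s.

```python
# Two threads each compute a private partial sum, then update the shared s
# under a lock so the load/add/store sequence cannot be interleaved.
import threading

A = list(range(100_000))
def f(x):
    return x * x

s = 0
s_lock = threading.Lock()

def worker(lo, hi):
    global s
    local_s = sum(f(a) for a in A[lo:hi])   # private work, no lock needed
    with s_lock:                            # lock; load s; add; store s; unlock
        s = s + local_s

t1 = threading.Thread(target=worker, args=(0, len(A) // 2))
t2 = threading.Thread(target=worker, args=(len(A) // 2, len(A)))
t1.start(); t2.start()
t1.join(); t2.join()
print(s == sum(f(a) for a in A))            # True
```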

High Performance Parallel Programming Model 2: Message Passing Program consists of a collection of named processes. Thread of control plus local address space -- NO shared data. Local variables, static variables, common blocks, heap. Processes communicate by explicit data transfers -- matching send and receive pair by source and destination processors (these may be broadcast). Coordination is implicit in every communication event. Logically shared data is partitioned over local processes. Like distributed programming -- program with MPI, PVM.

High Performance Parallel Programming Message Passing [Diagram: processes P0 ... Pn, each with its own private variables (X, Y); one process executes send(P0, X) while another executes recv(Pn, Y). Machine model - distributed memory.]

High Performance Parallel Programming Computing s = x(1)+x(2) on each processor
° First possible solution:
Processor 1: send xlocal, proc2 [xlocal = x(1)]; receive xremote, proc2; s = xlocal + xremote
Processor 2: receive xremote, proc1; send xlocal, proc1 [xlocal = x(2)]; s = xlocal + xremote
° Second possible solution -- what could go wrong?
Processor 1: send xlocal, proc2 [xlocal = x(1)]; receive xremote, proc2; s = xlocal + xremote
Processor 2: send xlocal, proc1 [xlocal = x(2)]; receive xremote, proc1; s = xlocal + xremote
° What if send/receive acts like the telephone system? The post office?
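As a hedged illustration, the sketch below expresses the first (safe) solution with mpi4py; using the Python MPI binding is an assumption made for brevity, since the course's own examples would normally use MPI from C or Fortran. Run it with two processes, e.g. mpirun -np 2 python sum_two.py.

```python
# Exchange one value between two MPI processes, then form the sum on both.
# Process 0 sends first and then receives; process 1 does the reverse, so the
# blocking calls always match (the second "solution" above may deadlock).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                  # the single partner (exactly 2 processes assumed)

xlocal = float(rank + 1)          # stand-in for x(1) on proc 0, x(2) on proc 1

if rank == 0:
    comm.send(xlocal, dest=other, tag=0)
    xremote = comm.recv(source=other, tag=0)
else:
    xremote = comm.recv(source=other, tag=0)
    comm.send(xlocal, dest=other, tag=0)

s = xlocal + xremote
print(rank, s)                    # both processes print s = 3.0
```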

High Performance Parallel Programming Model 3: Data Parallel Single sequential thread of control consisting of parallel operations. Parallel operations applied to all (or a defined subset) of a data structure. Communication is implicit in parallel operators and “shifted” data structures. Elegant and easy to understand and reason about. Like marching in a regiment. Used by Matlab. Drawback: not all problems fit this model.

High Performance Parallel Programming Data Parallel Example: A = array of all data; fA = f(A); s = sum(fA). [Diagram: the array A, the elementwise result fA, and the reduction to the scalar s.] Machine model - SIMD (described on the next slide).
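In array notation this maps directly onto, for example, NumPy; the sketch below is an illustration only (the slide's own context is HPF / CM Fortran), with f chosen arbitrarily as squaring.

```python
# Data-parallel style: one logical operation is applied to the whole array.
import numpy as np

A = np.arange(16, dtype=float)    # A = array of all data
fA = A * A                        # fA = f(A), applied elementwise "in parallel"
s = fA.sum()                      # s = sum(fA), a global reduction
print(s)
```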

High Performance Parallel Programming SIMD Architecture –A large number of (usually) small processors. –A single "control processor" issues each instruction. –Each processor executes the same instruction. –Some processors may be turned off on some instructions. –The machines are not popular (CM2), but the programming model is. –Implemented by mapping n-fold parallelism to p processors. –Programming: mostly done in the compilers (HPF = High Performance Fortran). Usually modified to Single Program Multiple Data (SPMD).

High Performance Parallel Programming Next week - Programming