CS 267 Applications of Parallel Computers Lecture 5: More about Distributed Memory Computers and Programming David H. Bailey Based on previous notes by James Demmel and David Culler

Recap of Last Lecture
° Shared memory processors:
  - Caches in individual processors must be kept coherent -- multiple cached copies of the same location must be kept equal.
  - Requires clever hardware (see CS258).
  - Distant memory is much more expensive to access.
° Shared memory programming:
  - Starting and stopping threads.
  - Synchronization with barriers and locks.
  - OpenMP is the emerging standard for the shared memory parallel programming model.

Outline
° Distributed Memory Architectures
  - Topologies
  - Cost models
° Distributed Memory Programming
  - Send and receive operations
  - Collective communication
° Sharks and Fish Example
  - Gravity

History and Terminology

Historical Perspective
° Early machines were collections of microprocessors.
  - Communication was performed using bi-directional queues between nearest neighbors.
° Messages were forwarded by the processors on the path.
° There was a strong emphasis on topology in algorithms, in order to minimize the number of hops.

Network Analogy
° To have a large number of transfers occurring at once, you need a large number of distinct wires.
° Networks are like streets:
  - Link = street.
  - Switch = intersection.
  - Distance (hops) = number of blocks traveled.
  - Routing algorithm = travel plan.
° Properties:
  - Latency: how long it takes to get between nodes in the network.
  - Bandwidth: how much data can be moved per unit time. Bandwidth is limited by the number of wires and the rate at which each wire can accept data.

Characteristics of a Network
° Topology (how things are connected):
  - Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, omega network.
° Routing algorithm:
  - Example: route all east-west first, then all north-south (avoids deadlock).
° Switching strategy:
  - Circuit switching: the full path is reserved for the entire message, like the telephone system.
  - Packet switching: the message is broken into separately-routed packets, like the post office.
° Flow control (what to do if there is congestion):
  - Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.

Properties of a Network
° Diameter: the maximum, over all pairs of nodes, of the shortest path between that pair of nodes.
° A network is partitioned into two or more disjoint sub-graphs if some nodes cannot reach others.
° The bandwidth of a link = w * 1/t, where w is the number of wires and t is the time per bit.
° Effective bandwidth is usually lower due to packet overhead (each packet carries a routing and control header, an error code, and a trailer in addition to the data payload).
° Bisection bandwidth: the sum of the bandwidths of the minimum set of channels which, if removed, would partition the network into two sub-graphs.

Network Topology
° In the early years of parallel computing, there was considerable research in network topology and in mapping algorithms to topology.
° Key cost to be minimized in the early years: the number of "hops" (communication steps) between nodes.
° Modern networks hide the hop cost (i.e., "wormhole routing"), so the underlying topology is no longer a major factor in algorithm performance.
° Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
° However, since some algorithms have a natural topology, it is worthwhile to have some background in this area.

Linear and Ring Topologies
° Linear array:
  - Diameter = n-1; average distance ~ n/3.
  - Bisection bandwidth = 1.
° Torus or Ring:
  - Diameter = n/2; average distance ~ n/4.
  - Bisection bandwidth = 2.
  - Natural for algorithms that work with 1D arrays.

Meshes and Tori
° 2D mesh of n nodes:
  - Diameter = 2 * sqrt(n).
  - Bisection bandwidth = sqrt(n).
° The 2D torus adds wrap-around links at the edges.
° Often used as the network in machines.
° Generalizes to higher dimensions (the Cray T3D used a 3D torus).
° Natural for algorithms that work with 2D and/or 3D arrays.

Hypercubes
° Number of nodes n = 2^d for dimension d.
  - Diameter = d.
  - Bisection bandwidth = n/2.
° (Figures: hypercubes of dimension 0d, 1d, 2d, 3d, 4d.)
° Popular in early machines (Intel iPSC, NCUBE).
  - Lots of clever algorithms.
  - See the 1996 notes.
° Gray code addressing:
  - Each node is connected to d others whose addresses differ in exactly 1 bit.
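Because neighboring hypercube nodes differ in exactly one address bit, a node's d neighbors can be generated by flipping each bit in turn. A minimal sketch in C (the function name and example node are illustrative, not from the slides):

    #include <stdio.h>

    /* Print the d neighbors of a node in a d-dimensional hypercube.
       Neighbors differ from the node's address in exactly one bit,
       so they are obtained by XOR-ing with each power of two. */
    void print_hypercube_neighbors(unsigned node, int d)
    {
        for (int i = 0; i < d; i++)
            printf("neighbor %d of node %u: %u\n", i, node, node ^ (1u << i));
    }

    int main(void)
    {
        print_hypercube_neighbors(5, 4);   /* node 0101 in a 4-d (16-node) cube */
        return 0;
    }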

Trees
° Diameter = log n.
° Bisection bandwidth = 1.
° Easy layout as a planar graph.
° Many tree algorithms (e.g., summation).
° Fat trees avoid the bisection bandwidth problem:
  - More (or wider) links near the top.
  - Example: Thinking Machines CM-5.

Butterflies
° Diameter = log n.
° Bisection bandwidth = n.
° Cost: lots of wires.
° Used in the BBN Butterfly.
° Natural for FFT.

Evolution of Distributed Memory Multiprocessors
° Special queue connections are being replaced by direct memory access (DMA):
  - Processor packs or copies messages.
  - Initiates the transfer, then goes on computing.
° Message passing libraries provide a store-and-forward abstraction:
  - Can send/receive between any pair of nodes, not just along one wire.
  - Time is proportional to distance, since each processor along the path must participate.
° Wormhole routing in hardware:
  - Special message processors do not interrupt the main processors along the path.
  - Message sends are pipelined.
  - Processors don't wait for the complete message before forwarding.

Performance Models

PRAM
° Parallel Random Access Machine.
° All memory access operations complete in one clock period -- there is no concept of a memory hierarchy ("too good to be true").
° OK for understanding whether an algorithm has enough parallelism at all.
° Slightly more realistic: Concurrent Read Exclusive Write (CREW) PRAM.

Latency and Bandwidth Model
° Time to send a message of length n is roughly:
  Time = latency + n*cost_per_word = latency + n/bandwidth
° Topology is assumed irrelevant.
° Often called the "α-β model" and written:
  Time = α + n*β
° Usually α >> β >> time per flop.
  - One long message is cheaper than many short ones: α + n*β << n*(α + β).
  - Can do hundreds or thousands of flops for the cost of one message.
° Lesson: need a large computation-to-communication ratio to be efficient.
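A quick way to see why one long message beats many short ones is to plug numbers into Time = α + n*β. This C sketch uses made-up values of α and β (not the measurements on the next slide) purely to illustrate the gap:

    #include <stdio.h>

    /* Time to send n words under the alpha-beta model. */
    double message_time(double alpha, double beta, long n)
    {
        return alpha + n * beta;
    }

    int main(void)
    {
        /* Illustrative values only (seconds): alpha = 10 us latency,
           beta = 10 ns per word, so alpha is 1000x beta. */
        double alpha = 1e-5, beta = 1e-8;
        long   n     = 10000;

        double one_long   = message_time(alpha, beta, n);       /* one n-word message  */
        double many_short = n * message_time(alpha, beta, 1);   /* n one-word messages */

        printf("one long message : %g s\n", one_long);
        printf("n short messages : %g s\n", many_short);
        return 0;
    }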

Example communication costs
° α and β are measured in units of flops; β is measured per 8-byte word.
° The table compares Machine, Year, α, β, and Mflop rate per proc for: CM, IBM SP, Intel Paragon, IBM SP, Cray T3D (PVM), UCB NOW, SGI Power Challenge, SUN E. (The numeric entries were lost in transcription.)

A more detailed performance model: LogP
° L: latency across the network.
° o: overhead (sending and receiving busy time on the processors).
° g: gap between messages (1/bandwidth).
° P: number of processors.
° People often group the overheads into the latency (giving the α-β model).
° Real costs are more complicated -- see Culler/Singh, Chapter 7.
° (Figure: two processor/memory nodes, showing send overhead o_s, receive overhead o_r, and network latency L.)
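As a rough illustration of how the LogP parameters combine, the sketch below encodes the usual textbook cost of one short message (2o + L) and of k back-to-back messages from one sender. The parameter values are invented, not taken from any machine:

    #include <stdio.h>

    /* Rough LogP estimates (all parameters in the same time unit, e.g. us).
       These follow the common textbook reading of the model; a sketch, not
       the definitive cost of any real machine. */

    /* One short message, end to end: send overhead + latency + receive overhead. */
    double logp_one_message(double L, double o)
    {
        return o + L + o;
    }

    /* k short messages injected back-to-back by one sender (assumes g >= o):
       the last one leaves after o + (k-1)*g and arrives L + o later. */
    double logp_k_messages(double L, double o, double g, int k)
    {
        return o + (k - 1) * g + L + o;
    }

    int main(void)
    {
        double L = 5.0, o = 1.0, g = 2.0;   /* illustrative values only */
        printf("1 message  : %.1f us\n", logp_one_message(L, o));
        printf("10 messages: %.1f us\n", logp_k_messages(L, o, g, 10));
        return 0;
    }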

Message Passing Libraries
° Many "message passing libraries" available:
  - Chameleon, from ANL.
  - CMMD, from Thinking Machines.
  - Express, commercial.
  - MPL, native library on the IBM SP-2.
  - NX, native library on the Intel Paragon.
  - Zipcode, from LLL.
  - PVM, Parallel Virtual Machine, public, from ORNL/UTK.
  - Others...
  - MPI, Message Passing Interface, now the industry standard.
° Need standards to write portable code.
° The rest of this discussion is independent of which library is used.
° There will be a detailed MPI lecture later.
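Since MPI is the standard named above, here is a minimal two-process send/receive in C with MPI. It is a self-contained sketch, not code from the lecture:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to proc 1, tag 0 */
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);                          /* from proc 0, tag 0 */
                printf("proc 1 received %d\n", value);
            }
        }
        MPI_Finalize();
        return 0;
    }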

Implementing Synchronous Message Passing
° Send operations complete after the matching receive has been found and the source data has been sent.
° Receive operations complete after the data transfer from the matching send is complete.
° Protocol between source and destination:
  1) Initiate send: send(P_dest, addr, length, tag); the destination posts rcv(P_source, addr, length, tag).
  2) Address translation on P_dest.
  3) Send-Ready request (send-rdy-request).
  4) Remote check for a posted receive (tag match).
  5) Reply transaction (receive-rdy-reply).
  6) Bulk data transfer (data-xfer).

Example: Permuting Data
° Exchanging data between Procs 0 and 1, V.1. What goes wrong?

  Processor 0                     Processor 1
  send(1, item0, 1, tag1)         send(0, item1, 1, tag2)
  recv(1, item1, 1, tag2)         recv(0, item0, 1, tag1)

° Deadlock: if both sends block until their matching receives are posted, neither processor ever reaches its receive.
° Exchanging data between Procs 0 and 1, V.2:

  Processor 0                     Processor 1
  send(1, item0, 1, tag1)         recv(0, item0, 1, tag1)
  recv(1, item1, 1, tag2)         send(0, item1, 1, tag2)

° What about a general permutation, where Proc j wants to send to Proc s(j), where s(1), s(2), …, s(P) is a permutation of 1, 2, …, P?
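For the general permutation question, one portable answer in MPI is to let the library pair each send with its receive via MPI_Sendrecv, which cannot deadlock the way V.1 does. The sketch below uses a cyclic shift as the permutation s(j) purely for illustration:

    #include <mpi.h>
    #include <stdio.h>

    /* Each process sends one int to process s(rank) and receives one int from
       whichever process maps to it; MPI_Sendrecv pairs the send and receive so
       no ordering trick is needed to avoid deadlock. */
    int main(int argc, char **argv)
    {
        int rank, P, out, in;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int dest = (rank + 1) % P;        /* s(rank): cyclic shift       */
        int src  = (rank - 1 + P) % P;    /* the proc whose s(.) is us   */
        out = rank;

        MPI_Sendrecv(&out, 1, MPI_INT, dest, 0,
                     &in,  1, MPI_INT, src,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("proc %d received %d\n", rank, in);
        MPI_Finalize();
        return 0;
    }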

Implementing Asynchronous Message Passing
° An optimistic single-phase protocol assumes the destination can buffer data on demand:
  1) Initiate send: send(P_dest, addr, length, tag).
  2) Address translation on P_dest.
  3) Send-Data request (data-xfer-request).
  4) Remote check for a posted receive (tag match).
  5) Allocate a buffer (if the check failed).
  6) Bulk data transfer; a later rcv(P_source, addr, length, tag) completes out of the buffer.

Safe Asynchronous Message Passing
° Use a 3-phase protocol.
° Buffer on the sending side.
° Variations on send completion:
  - Wait until the data has been copied from the user buffer to a system buffer.
  - Don't wait -- let the user beware of modifying the data too soon.
° Protocol between source and destination:
  1) Initiate send: send(P_dest, addr, length, tag); the sender returns and continues computing; the destination posts rcv(P_source, addr, length, tag).
  2) Address translation on P_dest.
  3) Send-Ready request (send-rdy-request).
  4) Remote check for a posted receive; on a tag match proceed, otherwise record the send-rdy.
  5) Reply transaction (receive-rdy-reply).
  6) Bulk data transfer (data-xfer).

Example Revisited: Permuting Data
° Processor j sends an item to Processor s(j), where s(1), …, s(P) is a permutation of 1, …, P:

  Processor j
  send_asynch(s(j), item, 1, tag)
  recv_block(ANY, item, 1, tag)

° What could go wrong?
° Need to understand the semantics of send and receive.
° Many flavors are available.
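In MPI terms, the hazard is reusing the send buffer before the asynchronous send has completed. The sketch below uses a separate receive buffer and waits on the send request before reusing item_out; the cyclic-shift permutation is again just an illustration:

    #include <mpi.h>
    #include <stdio.h>

    /* Nonblocking version of the permutation: each proc posts an asynchronous
       send of its item to s(rank), receives into a *separate* buffer from any
       source, and only then waits for its own send to complete. */
    int main(int argc, char **argv)
    {
        int rank, P, item_out, item_in;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int dest = (rank + 1) % P;     /* s(rank) */
        item_out = 100 + rank;

        MPI_Isend(&item_out, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(&item_in, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now item_out may be reused */

        printf("proc %d got %d\n", rank, item_in);
        MPI_Finalize();
        return 0;
    }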

Other operations besides send/receive
° "Collective communication" (involving more than 2 procs):
  - Broadcast data from one processor to all others.
  - Barrier.
  - Reductions (sum, product, max, min, boolean and, #, …), where # is any "associative" operation.
  - Scatter/Gather.
  - Parallel prefix -- Proc j owns x(j) and computes y(j) = x(1) # x(2) # … # x(j).
  - These can apply to all processors, or to a user-defined subset.
  - Cost = O(log P) using a tree.
° Status operations:
  - Enquire about / wait for asynchronous sends and receives to complete.
  - How many processors are there? What is my processor number?
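Each of the collective and status operations above has a direct MPI counterpart. A small, self-contained sketch in C (illustrative values only, not the lecture's code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* "what is my processor number?" */
        MPI_Comm_size(MPI_COMM_WORLD, &P);      /* "how many processors are there?" */

        int x = rank + 1, root_val = 0, sum = 0, prefix = 0;

        if (rank == 0) root_val = 7;
        MPI_Bcast(&root_val, 1, MPI_INT, 0, MPI_COMM_WORLD);              /* broadcast from proc 0 */
        MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);     /* reduction (sum)       */
        MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);       /* parallel prefix       */
        MPI_Barrier(MPI_COMM_WORLD);                                      /* barrier               */

        printf("proc %d: bcast=%d prefix=%d\n", rank, root_val, prefix);
        if (rank == 0) printf("sum on proc 0 = %d\n", sum);
        MPI_Finalize();
        return 0;
    }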

Example: Sharks and Fish
° N fish on P procs, N/P fish per processor.
  - At each time step, compute the forces on the fish and move them.
° Need to compute the gravitational interaction:
  - In the usual N^2 algorithm, every fish depends on every other fish.
  - Every fish needs to "visit" every processor, even if it "lives" on just one.
° What is the cost?

Two Algorithms for Gravity: What are their costs?

Algorithm 1
  Copy local Fish array of length N/P to Tmp array
  for j = 1 to N
      for k = 1 to N/P, Compute force of Tmp(k) on Fish(k)
      "Rotate" Tmp by 1
          for k = 2 to N/P, Tmp(k) <= Tmp(k-1)
          recv(my_proc - 1, Tmp(1))
          send(my_proc + 1, Tmp(N/P))

Algorithm 2
  Copy local Fish array of length N/P to Tmp array
  for j = 1 to P
      for k = 1 to N/P
          for m = 1 to N/P, Compute force of Tmp(k) on Fish(m)
      "Rotate" Tmp by N/P
          recv(my_proc - 1, Tmp(1:N/P))
          send(my_proc + 1, Tmp(1:N/P))

° What could go wrong? (Be careful of overwriting Tmp.)
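One way to avoid overwriting Tmp is to let the library manage the exchange. The sketch below expresses the Algorithm 2 rotation step in C with MPI, using MPI_Sendrecv_replace; the force computation is a placeholder, not the Sharks and Fish force law:

    #include <mpi.h>

    /* Sketch of the Algorithm 2 rotation step. MPI_Sendrecv_replace lets each
       processor pass its Tmp block around the ring without the received data
       overwriting the block before it has been sent. */
    void rotate_and_accumulate(double *fish, double *tmp, int n_local,
                               int rank, int P, MPI_Comm comm)
    {
        int right = (rank + 1) % P;
        int left  = (rank - 1 + P) % P;

        for (int j = 0; j < P; j++) {
            /* Compute force of every Tmp(k) on every local Fish(m): placeholder. */
            for (int k = 0; k < n_local; k++)
                for (int m = 0; m < n_local; m++)
                    fish[m] += 0.0 * tmp[k];   /* stand-in for the real force law */

            /* "Rotate" Tmp by N/P: send to the right, receive from the left. */
            MPI_Sendrecv_replace(tmp, n_local, MPI_DOUBLE,
                                 right, 0, left, 0, comm, MPI_STATUS_IGNORE);
        }
    }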

More Algorithms for Gravity
° Algorithm 3 (in the Sharks and Fish code):
  - All processors send their Fish to Proc 0.
  - Proc 0 broadcasts all Fish to all processors.
° Tree algorithms: Barnes-Hut, Greengard-Rokhlin, Anderson.
  - O(N log N) instead of O(N^2).
  - Parallelizable with cleverness.
  - "Just" an approximation, but as accurate as you like (often only a few digits are needed, so why pay for more).
  - The same idea works for other problems where the effect of distant objects becomes "smooth" or "compressible":
    - electrostatics, vorticity, …
    - radiosity in graphics.
    - anything satisfying the Poisson equation or something like it.
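The communication in Algorithm 3 maps naturally onto MPI collectives: a gather to Proc 0 followed by a broadcast, or equivalently a single all-gather. A minimal sketch, with fish positions modeled as plain doubles for illustration:

    #include <mpi.h>

    /* Sketch of Algorithm 3's communication: the slide's "gather everything on
       Proc 0, then broadcast" can be written as MPI_Gather + MPI_Bcast, or
       collapsed into one MPI_Allgather as below. */
    void gather_all_fish(const double *local_fish, int n_local,
                         double *all_fish /* length n_local * P */,
                         MPI_Comm comm)
    {
        MPI_Allgather(local_fish, n_local, MPI_DOUBLE,
                      all_fish,   n_local, MPI_DOUBLE, comm);
    }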