CS 267 Applications of Parallel Computers
Lecture 5: More about Distributed Memory Computers and Programming
James Demmel

Recap of Last Lecture
° Shared memory processors
  - If there are caches then hardware must keep them coherent, i.e., multiple cached copies of the same location kept equal
  - Requires clever hardware (see CS258)
  - Distant memory much more expensive to access
° Shared memory programming
  - Solaris Threads
  - Starting, stopping threads
  - Synchronization with barriers, locks
  - Sharks and Fish example

Outline
° Distributed Memory Architectures
  - Topologies
  - Cost models
° Distributed Memory Programming
  - Send and Receive
  - Collective Communication
° Sharks and Fish
  - Gravity

History and Terminology

Historical Perspective
° Early machines were:
  - Collections of microprocessors
  - Bi-directional queues between neighbors
° Messages were forwarded by processors on path
° Strong emphasis on topology in algorithms

Network Analogy
° To have a large number of transfers occurring at once, you need a large number of distinct wires
° Networks are like streets
  - link = street
  - switch = intersection
  - distances (hops) = number of blocks traveled
  - routing algorithm = travel plans
° Properties
  - latency: how long to get somewhere in the network
  - bandwidth: how much data can be moved per unit time
    - limited by the number of wires
    - and the rate at which each wire can accept data

Components of a Network
° Networks are characterized by:
° Topology - how things are connected
  - two types of nodes: hosts and switches
° Routing algorithm - paths used
  - e.g., all east-west then all north-south (avoids deadlock)
° Switching strategy
  - circuit switching: full path reserved for entire message, like the telephone
  - packet switching: message broken into separately-routed packets, like the post office
° Flow control - what if there is congestion?
  - if two or more messages attempt to use the same channel, they may stall, move to buffers, be rerouted, be discarded, etc.
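
The "all east-west then all north-south" rule above is dimension-ordered (X-Y) routing on a 2D mesh. Here is a minimal sketch of the idea; the function name and coordinate encoding are illustrative, not taken from the slides:

/* Dimension-ordered (X-Y) routing on a 2D mesh: route fully in the
 * X direction first, then in the Y direction. Because every message
 * obeys the same dimension order, the channel-dependence graph has
 * no cycles, which is why this scheme avoids deadlock. */
#include <stdio.h>

static void route_xy(int src_x, int src_y, int dst_x, int dst_y) {
    int x = src_x, y = src_y;
    printf("(%d,%d)", x, y);
    while (x != dst_x) {               /* all east-west hops first */
        x += (dst_x > x) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    while (y != dst_y) {               /* then all north-south hops */
        y += (dst_y > y) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    printf("\n");
}

int main(void) {
    route_xy(0, 0, 3, 2);   /* example: 3 hops east, then 2 hops north */
    return 0;
}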

Properties of a Network
° Diameter is the maximum shortest path between two nodes in the graph.
° A network is partitioned if some nodes cannot reach others.
° The bandwidth of a link in the network is w * 1/t, where
  - w is the number of wires
  - t is the time per bit
° Effective bandwidth is lower due to packet overhead
  [Packet format: routing and control header, data payload, error code, trailer]
° Bisection bandwidth: sum of the bandwidth of the minimum number of channels which, if removed, will partition the network

Topologies
° Originally much research in mapping algorithms to topologies
° Cost to be minimized was number of “hops” = communication steps along individual wires
° Modern networks use similar topologies, but hide hop cost, so algorithm design is easier
  - changing interconnection networks no longer changes algorithms
° Since some algorithms have “natural topologies”, still worth knowing

Linear and Ring Topologies
° Linear array
  - diameter is n-1, average distance ~ (2/3)n
  - bisection bandwidth is 1
° Torus or Ring
  - diameter is n/2, average distance is n/3
  - bisection bandwidth is 2
° Used in algorithms with 1D arrays

Meshes and Tori
° 2D
  - Diameter: 2 * sqrt(n)
  - Bisection bandwidth: sqrt(n)
  [Figures: 2D mesh, 2D torus]
° Often used as network in machines
° Generalizes to higher dimensions (Cray T3D used 3D Torus)
° Natural for algorithms with 2D, 3D arrays

Hypercubes
° Number of nodes n = 2^d for dimension d
  - Diameter: d
  - Bisection bandwidth: n/2
  [Figures: 0d, 1d, 2d, 3d, 4d hypercubes]
° Popular in early machines (Intel iPSC, NCUBE)
  - Lots of clever algorithms
  - See 1996 notes
° Gray code addressing
  - each node connected to d others with 1 bit different
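
A small sketch of the addressing idea: each node's d neighbors are obtained by flipping one bit of its address, and the reflected Gray code g(i) = i XOR (i >> 1) maps a ring onto the hypercube so that consecutive ring positions differ in exactly one bit. The helper names are illustrative, not from the slides:

#include <stdio.h>

/* Reflected Gray code: consecutive integers map to addresses that
 * differ in exactly one bit, so a ring embeds in a hypercube. */
static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const int d = 3;                 /* dimension */
    const unsigned n = 1u << d;      /* n = 2^d nodes */
    unsigned node = 5;               /* example node with address 101 */

    /* The d neighbors of a node differ from it in exactly one bit. */
    for (int bit = 0; bit < d; bit++)
        printf("neighbor across dimension %d: %u\n", bit, node ^ (1u << bit));

    /* Ring embedding via Gray code: ring position i lives at node gray(i). */
    for (unsigned i = 0; i < n; i++)
        printf("ring position %u -> hypercube node %u\n", i, gray(i));
    return 0;
}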

Trees
° Diameter: log n
° Bisection bandwidth: 1
° Easy layout as planar graph
° Many tree algorithms (summation)
° Fat trees avoid bisection bandwidth problem
  - more (or wider) links near top
  - example: Thinking Machines CM-5

Butterflies
° Butterfly building block
° Diameter: log n
° Bisection bandwidth: n
° Cost: lots of wires
° Used in BBN Butterfly
° Natural for FFT

Evolution of Distributed Memory Multiprocessors
° Direct queue connections replaced by DMA (direct memory access)
  - Processor packs or copies messages
  - Initiates transfer, goes on computing
° Message passing libraries provide store-and-forward abstraction
  - can send/receive between any pair of nodes, not just along one wire
  - Time proportional to distance, since each processor along path must participate
° Wormhole routing in hardware
  - special message processors do not interrupt main processors along path
  - message sends are pipelined
  - don't wait for complete message before forwarding
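
To make the "time proportional to distance" point concrete, here is a rough comparison of per-message time under store-and-forward versus pipelined (wormhole/cut-through) forwarding. The formulas are standard textbook approximations rather than anything stated on the slide, and the parameter values are invented for illustration:

#include <stdio.h>

/* Store-and-forward: each of the h hops must receive the whole
 * n-word message before forwarding it, so the full message cost
 * is paid once per hop. */
static double store_and_forward(double alpha, double beta, int n, int h) {
    return h * (alpha + n * beta);
}

/* Wormhole / cut-through: the message is pipelined through the h hops,
 * so distance adds only a small per-hop term, not a full message time. */
static double wormhole(double alpha, double beta, double per_hop, int n, int h) {
    return alpha + h * per_hop + n * beta;
}

int main(void) {
    double alpha = 1e-5, beta = 1e-8, per_hop = 5e-8;  /* illustrative values */
    int n = 1000, h = 8;                               /* 1000 words, 8 hops */
    printf("store-and-forward: %g s\n", store_and_forward(alpha, beta, n, h));
    printf("wormhole:          %g s\n", wormhole(alpha, beta, per_hop, n, h));
    return 0;
}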

Performance Models

PRAM
° Parallel Random Access Memory
° All memory access free
  - Theoretical, “too good to be true”
° OK for understanding whether an algorithm has enough parallelism at all
° Slightly more realistic: Concurrent Read Exclusive Write (CREW) PRAM

Latency and Bandwidth
° Time to send a message of length n is roughly
  Time = latency + n * cost_per_word = latency + n / bandwidth
° Topology irrelevant
° Often called the “α-β model” and written
  Time = α + n * β
° Usually α >> β >> time per flop
  - One long message is cheaper than many short ones
  - Can do hundreds or thousands of flops for the cost of one message
° Lesson: need a large computation-to-communication ratio to be efficient
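
A minimal sketch of the α-β model used as a cost estimator; the parameter values below are invented for illustration, not measurements from the lecture:

#include <stdio.h>

/* Alpha-beta model: time to send an n-word message. */
static double msg_time(double alpha, double beta, int n) {
    return alpha + n * beta;
}

int main(void) {
    double alpha = 1e-4;      /* latency per message (seconds), illustrative */
    double beta  = 1e-7;      /* time per word (seconds), illustrative */
    double tflop = 1e-9;      /* time per flop (seconds), illustrative */

    /* One long message vs. many short ones carrying the same data. */
    printf("one 1000-word message:  %g s\n", msg_time(alpha, beta, 1000));
    printf("1000 one-word messages: %g s\n", 1000 * msg_time(alpha, beta, 1));

    /* How many flops "fit" in the startup cost of a single message. */
    printf("flops per message startup: %g\n", alpha / tflop);
    return 0;
}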

Example communication costs
° α and β are measured in units of flops; β is measured per 8-byte word
  [Table of Machine, Year, α, β, and Mflop rate per processor for: CM, IBM SP, Intel Paragon, IBM SP, Cray T3D (PVM), UCB NOW, SGI Power Challenge, SUN E; numeric entries lost in transcription]

More detailed performance model: LogP
° L: latency across the network
° o: overhead (sending and receiving busy time)
° g: gap between messages (1/bandwidth)
° P: number of processors
° People often group overheads into latency (α-β model)
° Real costs are more complicated (see Culler/Singh, Chapter 7)
  [Figure: sender and receiver processors (P, M) with send overhead o_s, receive overhead o_r, and network latency L]
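
As a rough illustration of how the four parameters combine, here is a small estimator; the formulas are the usual textbook LogP approximations, not taken from the slide, and the numbers are invented:

#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* LogP estimate for one small message: send overhead + wire latency
 * + receive overhead. */
static double one_message(double L, double o) { return o + L + o; }

/* LogP estimate for a sender pushing m back-to-back small messages to
 * one receiver: successive injections are spaced by max(g, o), and the
 * last message still needs L + o to arrive and be absorbed. */
static double m_messages(double L, double o, double g, int m) {
    return o + (m - 1) * max2(g, o) + L + o;
}

int main(void) {
    double L = 5e-6, o = 1e-6, g = 2e-6;   /* illustrative values (seconds) */
    printf("1 message:    %g s\n", one_message(L, o));
    printf("100 messages: %g s\n", m_messages(L, o, g, 100));
    return 0;
}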

Implementing Message Passing
° Many “message passing libraries” available
  - Chameleon, from ANL
  - CMMD, from Thinking Machines
  - Express, commercial
  - MPL, native library on IBM SP-2
  - NX, native library on Intel Paragon
  - Zipcode, from LLL
  - …
  - PVM, Parallel Virtual Machine, public, from ORNL/UTK
  - MPI, Message Passing Interface, industry standard
° Need standards to write portable code
° Rest of this discussion independent of which library is used
° Will have detailed MPI lecture later

Implementing Synchronous Message Passing
° Send completes after matching receive is found and source data has been sent
° Receive completes after data transfer is complete from matching send
° Protocol (source / destination):
  1) Initiate send: source calls send(P_dest, addr, length, tag); destination posts rcv(P_source, addr, length, tag)
  2) Address translation on P_dest
  3) Send-Ready request (send-rdy-request)
  4) Remote check for posted receive (tag match)
  5) Reply transaction (receive-rdy-reply)
  6) Bulk data transfer (data-xfer)
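
In MPI (covered in detail in a later lecture), MPI_Ssend has this synchronous-completion behavior: it does not complete until the matching receive has started. A minimal two-process sketch, assuming the standard MPI C API:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous send: completes only after the matching receive
         * on process 1 has been posted and the data has been sent. */
        MPI_Ssend(&data, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}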

Example: Permuting Data
° Exchanging data between Procs 0 and 1, V.1: What goes wrong?
  Processor 0:                   Processor 1:
  send(1, item0, 1, tag)         send(0, item1, 1, tag)
  recv(1, item1, 1, tag)         recv(0, item0, 1, tag)
° Deadlock (with synchronous sends, each processor blocks in send waiting for a receive that the other never reaches)
° Exchanging data between Procs 0 and 1, V.2:
  Processor 0:                   Processor 1:
  send(1, item0, 1, tag)         recv(0, item0, 1, tag)
  recv(1, item1, 1, tag)         send(0, item1, 1, tag)
° What about a general permutation, where Proc j wants to send to Proc s(j), where s(1), s(2), …, s(P) is a permutation of 1, 2, …, P? (one deadlock-free approach is sketched below)
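
One way to make both the pairwise exchange and the general permutation deadlock-free, regardless of how sends and receives are ordered, is MPI's combined send-receive. A sketch assuming the standard MPI C API; the function name permute and the caller-supplied destination array dest[] are illustrative, not from the slides:

#include <mpi.h>
#include <stdio.h>

/* Each process sends its item to dest[rank] and receives one item from
 * whichever process targets it. MPI_Sendrecv pairs the send and receive
 * internally, so no ordering of the two can cause the V.1 deadlock. */
static double permute(double item, const int *dest, MPI_Comm comm) {
    int rank, p, source = 0;
    double incoming;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Find the process that sends to us (dest[] is assumed to be a
     * permutation of the ranks 0..p-1). */
    for (int j = 0; j < p; j++)
        if (dest[j] == rank) source = j;

    MPI_Sendrecv(&item, 1, MPI_DOUBLE, dest[rank], 0,
                 &incoming, 1, MPI_DOUBLE, source, 0,
                 comm, MPI_STATUS_IGNORE);
    return incoming;
}

int main(int argc, char **argv) {
    int rank, p, dest[256];     /* assumes at most 256 processes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (int j = 0; j < p; j++) dest[j] = (j + 1) % p;  /* cyclic shift example */
    double result = permute((double)rank, dest, MPI_COMM_WORLD);
    printf("proc %d received %g\n", rank, result);
    MPI_Finalize();
    return 0;
}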

Implementing Asynchronous Message Passing
° Optimistic single-phase protocol assumes the destination can buffer data on demand
° Protocol (source / destination):
  1) Initiate send: send(P_dest, addr, length, tag)
  2) Address translation on P_dest
  3) Send data request (data-xfer-request)
  4) Remote check for posted receive (tag match)
  5) Allocate buffer (if check failed)
  6) Bulk data transfer; data is consumed by rcv(P_source, addr, length, tag) whenever it is posted

Safe Asynchronous Message Passing
° Use 3-phase protocol
° Buffer on sending side
° Variations on send completion
  - wait until data copied from user to system buffer
  - don't wait -- let the user beware of modifying data
° Protocol (source / destination):
  1) Initiate send: source calls send(P_dest, addr, length, tag); destination posts rcv(P_source, addr, length, tag)
  2) Address translation on P_dest
  3) Send-Ready request (send-rdy-request); source returns and continues computing
  4) Remote check for posted receive (tag match; record send-rdy if no receive is posted yet)
  5) Reply transaction (receive-rdy-reply)
  6) Bulk data transfer (data-xfer)
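
MPI exposes the same trade-off: a nonblocking MPI_Isend returns immediately, and the user must not modify the send buffer until MPI_Wait reports completion. A minimal sketch, assuming the standard MPI C API:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 7, incoming = 0;
    MPI_Request sreq, rreq;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &sreq);
        /* ... could overlap computation here, but must NOT touch 'data' ... */
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* now it is safe to reuse 'data' */
        data = 0;
    } else if (rank == 1) {
        MPI_Irecv(&incoming, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &rreq);
        /* ... overlap computation here ... */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        printf("received %d\n", incoming);
    }
    MPI_Finalize();
    return 0;
}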

Example Revisited: Permuting Data
° Processor j sends its item to Processor s(j), where s(1), …, s(P) is a permutation of 1, …, P
  Processor j:
  send_asynch(s(j), item, 1, tag)
  recv_block(ANY, item, 1, tag)
° What could go wrong?
° Need to understand semantics of send and receive
  - Many flavors available
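
The same pattern in MPI: a nonblocking send to s(j) followed by a blocking receive from any source. One thing that can go wrong also shows up here: receiving into item while the send of item may still be in progress can corrupt the outgoing data, so the sketch below receives into a separate buffer. The destination rule s(j) = (j+1) mod P is an illustrative permutation, not from the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p;
    double item, incoming;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    item = (double)rank;
    int dest = (rank + 1) % p;                 /* illustrative permutation s(j) */

    MPI_Isend(&item, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
    /* Receive into a separate buffer: reusing 'item' here could clobber
     * data that the nonblocking send has not finished transmitting. */
    MPI_Recv(&incoming, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("proc %d received %g\n", rank, incoming);
    MPI_Finalize();
    return 0;
}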

Other operations besides send/receive
° “Collective Communication” (more than 2 procs; see the MPI sketch below)
  - Broadcast data from one processor to all others
  - Barrier
  - Reductions (sum, product, max, min, boolean and, #, …)
    - # is any “associative” operation
  - Scatter/Gather
  - Parallel prefix
    - Proc j owns x(j) and computes y(j) = x(1) # x(2) # … # x(j)
  - Can apply to all processors, or a user-defined subset
  - Cost = O(log P) using a tree
° Status operations
  - Enquire about / wait for asynchronous sends/receives to complete
  - How many processors are there?
  - What is my processor number?
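
A minimal MPI sketch of three of these collectives: broadcast, reduction, and parallel prefix (MPI_Scan). The data values are invented for illustration; the collective calls are the standard MPI C API:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p, root_value = 0, my_x, sum = 0, prefix = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Broadcast: processor 0 sends one value to all others. */
    if (rank == 0) root_value = 100;
    MPI_Bcast(&root_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: sum of x(j) over all processors, result on processor 0. */
    my_x = rank + 1;                         /* x(j) = j+1, illustrative */
    MPI_Reduce(&my_x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Parallel prefix: processor j gets y(j) = x(0) + ... + x(j). */
    MPI_Scan(&my_x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("proc %d: bcast=%d prefix=%d\n", rank, root_value, prefix);
    if (rank == 0) printf("total sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}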

Example: Sharks and Fish
° N fish on P procs, N/P fish per processor
  - At each time step, compute forces on fish and move them
° Need to compute gravitational interaction
  - In the usual N^2 algorithm, every fish depends on every other fish
  - every fish needs to “visit” every processor, even if it “lives” on one
° What is the cost?

Algorithms for Gravity: What are their costs?
Algorithm 1
  Copy local Fish array of length N/P to Tmp array
  for j = 1 to N
    for k = 1 to N/P, Compute force from Tmp(k) on Fish(k)
    “Rotate” Tmp by 1
      for k = 2 to N/P, Tmp(k) <= Tmp(k-1)
      recv(my_proc - 1, Tmp(1))
      send(my_proc + 1, Tmp(N/P))
Algorithm 2
  Copy local Fish array of length N/P to Tmp array
  for j = 1 to P
    for k = 1 to N/P
      for m = 1 to N/P, Compute force from Tmp(k) on Fish(m)
    “Rotate” Tmp by N/P
      recv(my_proc - 1, Tmp(1:N/P))
      send(my_proc + 1, Tmp(1:N/P))
What could go wrong? (be careful of overwriting Tmp)
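
A sketch of Algorithm 2's ring rotation in MPI. Using MPI_Sendrecv_replace sidesteps the "overwriting Tmp" hazard, since the in-place exchange is managed by the library; the force computation is left abstract (compute_force, NP_LOCAL, and the Fish struct are placeholders, not taken from the lecture code):

#include <mpi.h>

#define NP_LOCAL 64            /* N/P fish per processor, illustrative */

typedef struct { double x, y, mass, fx, fy; } Fish;

/* Placeholder for the real gravitational force kernel. */
static void compute_force(const Fish *src, Fish *dst) { (void)src; (void)dst; }

void gravity_ring(Fish fish[NP_LOCAL], MPI_Comm comm) {
    int rank, p;
    Fish tmp[NP_LOCAL];
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int k = 0; k < NP_LOCAL; k++) tmp[k] = fish[k];  /* copy Fish to Tmp */

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;
    int bytes = NP_LOCAL * (int)sizeof(Fish);

    for (int j = 0; j < p; j++) {
        /* All-pairs forces between the visiting block Tmp and local Fish. */
        for (int k = 0; k < NP_LOCAL; k++)
            for (int m = 0; m < NP_LOCAL; m++)
                compute_force(&tmp[k], &fish[m]);

        /* Rotate Tmp by N/P: send to the right neighbor and receive from
         * the left, in place, without clobbering data still being sent. */
        MPI_Sendrecv_replace(tmp, bytes, MPI_BYTE, right, 0, left, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}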

More Algorithms for Gravity
° Algorithm 3 (in sharks and fish code)
  - All processors send their Fish to Proc 0
  - Proc 0 broadcasts all Fish to all processors
° Tree algorithms
  - Barnes-Hut, Greengard-Rokhlin, Anderson
  - O(N log N) instead of O(N^2)
  - Parallelizable with cleverness
  - “Just” an approximation, but as accurate as you like (often only a few digits are needed, so why pay for more)
  - Same idea works for other problems where the effect of distant objects becomes “smooth” or “compressible”
    - electrostatics, vorticity, …
    - radiosity in graphics
    - anything satisfying the Poisson equation or something like it
  - Will talk about it in detail later in course