
Slide 1. CSE 431 Computer Architecture, Fall 2008. Chapter 7C: Multiprocessor Network Topologies. Mary Jane Irwin (www.cse.psu.edu/~mji). [Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]

Slide 2. Review: Shared Memory Multiprocessors (SMP)
- Q1: a single address space shared by all processors
- Q2: processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow only one processor at a time to access the data (see the sketch below)
- SMPs come in two styles:
  - Uniform memory access (UMA) multiprocessors
  - Nonuniform memory access (NUMA) multiprocessors
[Figure: processors, each with a cache, connected to memory and I/O through an interconnection network]
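To make the shared-memory model concrete, here is a minimal sketch in Python (threads stand in for processors; the counter, lock, and thread count are purely illustrative, not from the slides): several threads update one shared variable, and a lock admits only one of them at a time.

import threading

# Shared data lives in a single address space visible to all threads (Q1).
counter = 0
lock = threading.Lock()   # synchronization primitive that admits one thread at a time

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:        # coordinate access to the shared variable (Q2)
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000 with the lock; without it, updates could be lost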

Slide 3. Message Passing Multiprocessors (MPP)
- Each processor has its own private address space
- Q1: processors share data by explicitly sending and receiving information (message passing)
- Q2: coordination is built into the message passing primitives (message send and message receive); a minimal sketch follows
[Figure: processors, each with a cache and its own memory, connected through an interconnection network]
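And the message-passing counterpart, again only a sketch: Python's multiprocessing module stands in for message-passing hardware, and the pipe and worker names are illustrative. Data moves between private address spaces only through explicit send and receive calls.

from multiprocessing import Process, Pipe

def worker(conn, data):
    # This process has its own private address space; results leave it
    # only through an explicit send on the connection.
    partial_sum = sum(data)
    conn.send(partial_sum)      # explicit message send
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child, [1, 2, 3, 4]))
    p.start()
    print(parent.recv())        # explicit message receive; prints 10
    p.join()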

Slide 4. Communication in Network Connected Multiprocessors
- Implicit communication via loads and stores: hardware designers have to provide coherent caches and process (thread) synchronization primitives (like ll and sc)
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is needed, rather than to send the data in case it might be used
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication

Slide 5. IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link wires (on chip)
- Network bandwidth (NB): represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB): closer to the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line (see the helper sketch after this list)
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput: maximum # of messages transmitted per unit time
  - worst-case # of routing hops, congestion control and delay, fault tolerance, power efficiency
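As a rough illustration of the two headline metrics, the sketch below wraps them in small helper functions; the function and parameter names are mine, not from the slides, and the example numbers are arbitrary.

def network_bandwidth(link_bw, num_links):
    # Best case: assume every link in the machine transfers at once.
    return link_bw * num_links

def bisection_bandwidth(link_bw, links_crossing_cut):
    # Closer to the worst case: only the links that cross the worst-case
    # halving of the machine carry traffic between the two halves.
    return link_bw * links_crossing_cut

# Example: a 64-node ring (see Slide 7) with 1 GB/s links
N, link_bw = 64, 1.0
print(network_bandwidth(link_bw, N))     # 64.0 -- a ring has N links
print(bisection_bandwidth(link_bw, 2))   # 2.0  -- cutting a ring severs 2 links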

Slide 6. Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
[Figure legend: processor node; bidirectional network switch]

Slide 7. Ring IN
- If a link is as fast as the bus, the ring is only twice as fast as a bus in the worst case, but N times faster in the best case
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2

Slide 8. Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links (a quick check of this count follows below)
- N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2
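A quick sanity check of the link count, assuming one bidirectional link per unordered pair of nodes (the choice of N here is illustrative):

from itertools import combinations

N = 8
links = list(combinations(range(N), 2))   # one bidirectional link per node pair
print(len(links), N * (N - 1) // 2)       # both print 28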

Slide 9. Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

Slide 10. Hypercube (Binary N-cube) Connected IN
- N processors, N switches, logN links/switch, (N*logN)/2 links (addressing and routing sketched below)
- N simultaneous transfers
  - NB = link bandwidth * (N*logN)/2
  - BB = link bandwidth * N/2
[Figure: a 2-cube and a 3-cube]
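Hypercube node addresses make neighbors and routes easy to compute: neighbors differ in exactly one address bit, and correcting one differing bit per hop gives the classic dimension-ordered (e-cube) route. The sketch below is a generic illustration of that idea, not code from the course.

def hypercube_neighbors(node, dim):
    # Flip each of the 'dim' address bits to get the log2(N) neighbors.
    return [node ^ (1 << d) for d in range(dim)]

def ecube_route(src, dst, dim):
    # Correct one differing address bit per hop, lowest dimension first.
    path, cur = [src], src
    for d in range(dim):
        if (cur ^ dst) & (1 << d):
            cur ^= (1 << d)
            path.append(cur)
    return path

print(hypercube_neighbors(0b000, 3))   # [1, 2, 4] for a 3-cube
print(ecube_route(0b000, 0b101, 3))    # [0, 1, 5]: at most log2(N) hops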

Slide 11. 2D and 3D Mesh/Torus Connected IN
- N processors, N switches; 2, 3, 4 (2D torus), or 6 (3D torus) links/switch; 4N/2 links or 6N/2 links (neighbor computation sketched below)
- N simultaneous transfers
  - NB = link bandwidth * 4N or link bandwidth * 6N
  - BB = link bandwidth * 2*N^(1/2) or link bandwidth * 2*N^(2/3)
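A 2D torus switch's four neighbors follow from wrapping each coordinate modulo the mesh dimensions. A minimal sketch (function and variable names are illustrative; a plain mesh would simply omit the wraparound links at the edges):

def torus2d_neighbors(x, y, width, height):
    # Four links per switch: +/-1 in each dimension, wrapping at the edges.
    return [((x + 1) % width, y),
            ((x - 1) % width, y),
            (x, (y + 1) % height),
            (x, (y - 1) % height)]

# 8x8 torus for 64 processors; corner node (0, 0) still has 4 neighbors
print(torus2d_neighbors(0, 0, 8, 8))   # [(1, 0), (7, 0), (0, 1), (0, 7)]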

Slide 12. IN Comparison (fill in)
- For a 64-processor system:

                            Bus   Ring   Torus   6-cube   Fully connected
  Network bandwidth           1
  Bisection bandwidth         1
  Total # of switches         1
  Links per switch
  Total # of links (bidi)     1

Slide 13. IN Comparison
- For a 64-processor system (values in units of link bandwidth; reproduced by the sketch below):

                            Bus     Ring   2D Torus   6-cube   Fully connected
  Network bandwidth           1       64        256      192              2016
  Bisection bandwidth         1        2         16       32              1024
  Total # of switches         1       64         64       64                64
  Links per switch                   2+1        4+1      6+1              63+1
  Total # of links (bidi)     1    64+64     128+64   192+64           2016+64
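The filled-in values follow from the per-topology formulas on the preceding slides. A small script to reproduce the bandwidth rows for any N (a sketch using those formulas, with NB and BB expressed in units of link bandwidth):

import math

def compare(N):
    logN = int(math.log2(N))
    sqrtN = int(math.isqrt(N))
    return {
        # topology: (network bandwidth, bisection bandwidth)
        "bus":             (1, 1),
        "ring":            (N, 2),
        "2D torus":        (4 * N, 2 * sqrtN),
        f"{logN}-cube":    ((N * logN) // 2, N // 2),
        "fully connected": ((N * (N - 1)) // 2, (N // 2) ** 2),
    }

for topo, (nb, bb) in compare(64).items():
    print(f"{topo:16s} NB={nb:5d}  BB={bb:5d}")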

Slide 14. "Fat" Trees
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network, with nodes A, B, C, D at the leaves.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times.
- The solution is to "thicken" the upper links.
  - Having more links as you work toward the root of the tree increases the bisection bandwidth (see the sketch below).
- Rather than design a bunch of N-port switches, use pairs of switches.
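One way to see why thickening works: in an idealized binary fat tree the number of links halves at each level toward the root while each link's capacity doubles, so every level carries the same aggregate bandwidth as the leaves. The sketch below illustrates only that idea, under those assumptions; it is not the switch-pair construction on the next slide.

def fat_tree_level_capacity(num_leaves):
    # In an ideal fat tree the links 'thicken' toward the root so that each
    # level's aggregate capacity matches the leaves' aggregate capacity.
    levels = []
    links, cap_per_link = num_leaves, 1
    while links > 1:
        links //= 2          # binary tree: half as many links per level up...
        cap_per_link *= 2    # ...but each link is twice as fat
        levels.append((links, cap_per_link, links * cap_per_link))
    return levels

# (links at level, capacity per link, aggregate capacity) climbing toward the root
for row in fat_tree_level_capacity(8):
    print(row)   # aggregate stays 8 at every level -> full bisection bandwidth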

Slide 15. Fat Tree IN
- N processors, log(N-1) * logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
  - NB = link bandwidth * N*logN
  - BB = link bandwidth * 4

Slide 16. SGI NUMAlink Fat Tree
[Figure: SGI NUMAlink fat tree topology, from www.embedded-computing.com/articles/woodacre]

Slide 17. Cache Coherency in IN Connected SMPs
- For performance reasons we want to allow shared data to be stored in caches
- Once again we have multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols (sketched below)
  - keep a directory that is a repository for the state of every block in main memory (recording which caches have copies, whether the block is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
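A directory entry essentially records, for each memory block, which caches hold a copy and whether one of them owns it exclusively. The sketch below shows only that bookkeeping; the states, field names, and the two actions are illustrative stand-ins for a real protocol, and the commands sent over the IN are indicated only as comments.

from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "uncached"                      # "uncached", "shared", or "modified"
    sharers: set = field(default_factory=set)    # processor ids holding a copy

class Directory:
    def __init__(self, num_blocks):
        self.entries = [DirectoryEntry() for _ in range(num_blocks)]

    def read_miss(self, block, proc):
        e = self.entries[block]
        if e.state == "modified":
            pass   # would send a fetch/writeback command over the IN to the owner
        e.sharers.add(proc)
        e.state = "shared"

    def write_miss(self, block, proc):
        e = self.entries[block]
        # would send invalidate commands over the IN to every other sharer
        e.sharers = {proc}
        e.state = "modified"

d = Directory(num_blocks=4)
d.read_miss(0, proc=1); d.read_miss(0, proc=2)
d.write_miss(0, proc=1)
print(d.entries[0])   # state='modified', sharers={1}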

Slide 18. Network Connected Multiprocessors

  Machine           Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
  SGI Origin        R16000                        128        fat tree                      800
  Cray T3E          Alpha 21164      300MHz       2,048      3D torus                      600
  Intel ASCI Red    Intel            333MHz       9,632      mesh                          800
  IBM ASCI White    Power3           375MHz       8,192      multistage Omega              500
  NEC ES            SX-5             500MHz       640*8      640-xbar                      16000
  NASA Columbia     Intel Itanium2   1.5GHz       512*20     fat tree, Infiniband
  IBM BG/L          Power PC 440     0.7GHz       65,536*2   3D torus, fat tree, barrier

Slide 19. IBM BlueGene

                   512-node proto        BlueGene/L
  Peak Perf        1.0 / 2.0 TFlops/s    180 / 360 TFlops/s
  Memory Size      128 GByte             16 / 32 TByte
  Foot Print       9 sq feet             2500 sq feet
  Total Power      9 KW                  1.5 MW
  # Processors     512 dual proc         65,536 dual proc
  Networks         3D Torus, Tree, Barrier
  Torus BW         3 B/cycle

Slide 20. A BlueGene/L Chip
[Block diagram: two 700MHz PowerPC 440 CPUs, each with a double FPU, 32K/32K L1 caches, and a 2KB L2; a 16KB multiport SRAM buffer; a 4MB ECC eDRAM L3 (128B lines, 8-way associative); network interfaces for the 3D torus (6 in, 6 out, 1.4Gb/s links), the fat tree (3 in, 3 out, 2.8Gb/s links), 4 global barriers, and Gbit ethernet; a 144b DDR controller to 256MB of external DRAM at 5.5GB/s]

Slide 21. Multiprocessor Benchmarks

  Benchmark                 Scaling?         Reprogram?           Description
  Linpack                   Weak             Yes                  Dense matrix linear algebra
  SPECrate                  Weak             No                   Independent job parallelism
  SPLASH 2                  Strong           No                   Independent job parallelism (both kernels and applications, many from high-performance computing)
  NAS Parallel              Weak             Yes (C or Fortran)   Five kernels, mostly from computational fluid dynamics
  PARSEC                    Weak             No                   Multithreaded programs that use Pthreads and OpenMP; nine applications and 3 kernels (8 with data parallelism, 3 with pipelined parallelism, one unstructured)
  Berkeley Design Patterns  Strong or Weak   Yes                  13 design patterns implemented by frameworks or kernels

Slide 22. Supercomputer Style Migration (Top500)
- Uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%. Now it's 98% Clusters and MPPs. (November data, http://www.top500.org/)
- Cluster: whole computers interconnected using their I/O bus
- Constellation: a cluster that uses an SMP multiprocessor as the building block

Slide 23. Reminders
- HW6 due December 11th
- Check the grade posting on-line (by your midterm exam number) for correctness
- Second evening midterm exam scheduled: Tuesday, November 18, 20:15 to 22:15, Location 262 Willard
  - Please let me know ASAP (via email) if you have a conflict

