
1 CSE 431 Computer Architecture, Fall 2005
Lecture 27: Network Connected Multi's
Mary Jane Irwin (www.cse.psu.edu/~mji)
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

2 Review: Bus Connected SMPs (UMAs)
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware for cache coherence and process synchronization
- Bus traffic and bandwidth limits scalability (to roughly 36 processors)
[Diagram: processors, each with a cache, sharing a single bus to memory and I/O]

3 Review: Multiprocessor Basics
- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors?

                        Choice             # of processors
  Communication model   Message passing    8 to 2048
                        Shared address:
                          NUMA             8 to 256
                          UMA              2 to 64
  Physical connection   Network            8 to 256
                        Bus                2 to 36

4 Network Connected Multiprocessors
- Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with explicit message passing communication via sends and receives
- The interconnection network supports the interprocessor communication
[Diagram: processor + cache + memory nodes, each attached to the Interconnection Network (IN)]

5 Summing 100,000 Numbers on 100 Processors
- Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + Al[i];         /* sum the local array subset */

- The processors then coordinate in adding together the partial sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value):

    half = 100;
    limit = 100;
    repeat
      half = (half+1)/2;         /* send vs. receive dividing line */
      if (Pn >= half && Pn < limit) send(Pn-half, sum);
      if (Pn < (limit/2)) sum = sum + receive();
      limit = half;              /* upper limit of senders */
    until (half == 1);           /* final sum in P0's sum */
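As a concrete illustration of the pseudocode above, a minimal sketch of the same tree reduction written with MPI follows (the MPI calls, rank/size variables, and the stand-in local data are assumptions for illustration, not part of the lecture):

    /* Minimal MPI sketch of the slide's reduction (illustrative only).
     * Each rank sums its local slice, then partial sums are combined in
     * about log2(P) rounds; the grand total ends up on rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int Pn, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);   /* this processor's number    */
        MPI_Comm_size(MPI_COMM_WORLD, &P);    /* total number of processors */

        double Al[1000], sum = 0.0, tmp;
        for (int i = 0; i < 1000; i++) Al[i] = 1.0;   /* stand-in local data  */
        for (int i = 0; i < 1000; i++) sum += Al[i];  /* sum the local subset */

        int half = P, limit = P;
        do {
            half = (half + 1) / 2;            /* send vs. receive dividing line   */
            if (Pn >= half && Pn < limit)     /* upper half sends its partial sum */
                MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {             /* lower half receives and adds     */
                MPI_Recv(&tmp, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += tmp;
            }
            limit = half;                     /* senders' upper limit next round  */
        } while (half > 1);

        if (Pn == 0) printf("total = %g\n", sum);     /* final sum in P0's sum */
        MPI_Finalize();
        return 0;
    }

The (half+1)/2 rounding is what lets an odd number of active processors (5 in the 10-processor walk-through on the next two slides) still pair up correctly.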

6 An Example with 10 Processors
[Diagram: processors P0 through P9, each holding its local sum; half = 10]

7 An Example with 10 Processors
[Diagram: the reduction tree for 10 processors. With limit = 10 and half = 5, P5-P9 send and P0-P4 receive; with limit = 5 and half = 3, P3-P4 send and P0-P1 receive; with limit = 3 and half = 2, P2 sends and P0 receives; with limit = 2 and half = 1, P1 sends and P0 receives, leaving the final sum in P0]

8 Communication in Network Connected Multi's
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is demanded, rather than sending the data ahead of time in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication

9 Cache Coherency in NUMAs
- For performance reasons we want to allow shared data to be stored in caches
- Once again there are multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
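As a rough sketch of what one directory entry might contain (a hypothetical layout for illustration; the field names and encoding are assumptions, not from the lecture or any particular machine):

    /* Hypothetical per-block directory entry for a directory-based protocol. */
    #include <stdint.h>

    enum block_state {
        UNCACHED,   /* no cache holds a copy; memory is up to date      */
        SHARED,     /* one or more caches hold clean, read-only copies  */
        MODIFIED    /* exactly one cache holds a dirty, writable copy   */
    };

    struct dir_entry {
        enum block_state state;
        uint64_t sharers;    /* bit i set => processor i's cache has a copy */
        uint16_t owner;      /* valid only when state == MODIFIED           */
    };

    /* On a read miss by processor p, the directory controller consults the
     * block's entry: if MODIFIED, it asks the owner (over the IN) to write
     * the block back, then adds p to sharers; if SHARED or UNCACHED, it
     * supplies the block from memory and sets bit p in sharers. */

Distributing the entries across the nodes (each block's home node holds its entry) is what keeps the sharing status in a single known location while avoiding a central bottleneck.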

10 IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB): represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB): represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput: maximum # of messages transmitted per unit time
  - # of routing hops in the worst case, congestion control and delay
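To make the bisection bandwidth definition concrete, here is a small sketch (the function, link list, and ring example are illustrative assumptions) that sums the bandwidth of the links crossing one particular half/half split; the true BB is the minimum of this quantity over all balanced splits:

    /* Count the bandwidth crossing a half/half split of the nodes.
     * Nodes 0..N/2-1 form one half, nodes N/2..N-1 the other. */
    #include <stdio.h>

    typedef struct { int a, b; } Link;     /* one bidirectional link */

    double bisection_bw(const Link *links, int nlinks, int N, double link_bw) {
        double bb = 0.0;
        for (int i = 0; i < nlinks; i++) {
            int a_low = links[i].a < N / 2;    /* which half holds each endpoint? */
            int b_low = links[i].b < N / 2;
            if (a_low != b_low) bb += link_bw; /* link crosses the dividing line  */
        }
        return bb;
    }

    int main(void) {
        /* 8-node ring: node i connects to node (i+1) mod 8 */
        Link ring[8];
        for (int i = 0; i < 8; i++) { ring[i].a = i; ring[i].b = (i + 1) % 8; }
        printf("ring BB = %.0f x link bandwidth\n", bisection_bw(ring, 8, 8, 1.0));
        /* prints 2, matching the ring slide: BB = link bandwidth * 2 */
        return 0;
    }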

11 Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
[Diagram legend: circle = processor node, square = bidirectional network switch]

12 Ring IN
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case

13 Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2

14 Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

15 Hypercube (Binary N-cube) Connected IN
- N processors, N switches, logN links/switch, (N*logN)/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*logN)/2
  - BB = link bandwidth * N/2
[Diagram: a 2-cube (square) and a 3-cube]

16 2D and 3D Mesh/Torus Connected IN
- N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links (2D) or 6N/2 links (3D)
- N simultaneous transfers
  - NB = link bandwidth * 4N (2D torus) or link bandwidth * 6N (3D torus)
  - BB = link bandwidth * 2*N^(1/2) (2D) or link bandwidth * 2*N^(2/3) (3D)

17 Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
  - NB = link bandwidth * N*logN
  - BB = link bandwidth * 4

18 Fat Tree
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times
- The solution is to 'thicken' the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth
- Rather than design a bunch of N-port switches, use pairs
[Diagram: a binary tree network with leaf nodes A, B, C, D]

19 SGI NUMAlink Fat Tree
[Diagram: the SGI NUMAlink fat tree; source: www.embedded-computing.com/articles/woodacre]

20 IN Comparison
- For a 64 processor system (blank worksheet; the filled-in values are on the next slide)

                            Bus   Ring   Torus   6-cube   Fully connected
  Network bandwidth          1
  Bisection bandwidth        1
  Total # of switches        1
  Links per switch
  Total # of links           1

21 IN Comparison
- For a 64 processor system

                            Bus    Ring   2D Torus   6-cube   Fully connected
  Network bandwidth           1      64        256      192              2016
  Bisection bandwidth         1       2         16       32              1024
  Total # of switches         1      64         64       64                64
  Links per switch                  2+1        4+1      6+1              63+1
  Total # of links (bidi)     1   64+64     128+64   192+64           2016+64
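The table entries follow directly from the per-topology formulas on the preceding slides; a small sketch (illustrative only) that recomputes them for N = 64:

    /* Recompute the 64-processor comparison from the topology slides' formulas.
     * Bandwidths are in units of one link's bandwidth; link counts exclude the
     * per-switch link to the processor, which the table lists as the "+64" terms. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int N = 64;
        int logN = 6;                              /* log2(64) */

        struct { const char *name; double nb, bb; int sw, links; } t[] = {
            { "Bus",             1,                 1,                     1, 1               },
            { "Ring",            N,                 2,                     N, N               },
            { "2D Torus",        4.0 * N,           2 * sqrt(N),           N, 4 * N / 2       },
            { "6-cube",          N * logN / 2.0,    N / 2.0,               N, N * logN / 2    },
            { "Fully connected", N * (N - 1) / 2.0, (N / 2.0) * (N / 2.0), N, N * (N - 1) / 2 },
        };

        printf("%-16s %8s %8s %9s %7s\n", "Topology", "NB", "BB", "Switches", "Links");
        for (int i = 0; i < 5; i++)
            printf("%-16s %8.0f %8.0f %9d %7d\n",
                   t[i].name, t[i].nb, t[i].bb, t[i].sw, t[i].links);
        return 0;
    }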

22 Network Connected Multiprocessors

  Machine           Proc             Proc speed   # Proc     IN topology                   BW/link (MB/sec)
  SGI Origin        R16000                        128        fat tree                      800
  Cray T3E          Alpha 21164      300 MHz      2,048      3D torus                      600
  Intel ASCI Red    Intel            333 MHz      9,632      mesh                          800
  IBM ASCI White    Power3           375 MHz      8,192      multistage Omega              500
  NEC ES            SX-5             500 MHz      640*8      640-xbar                      16000
  NASA Columbia     Intel Itanium2   1.5 GHz      512*20     fat tree, Infiniband
  IBM BG/L          PowerPC 440      0.7 GHz      65,536*2   3D torus, fat tree, barrier

23 IBM BlueGene

                   512-node prototype        BlueGene/L
  Peak perf        1.0 / 2.0 TFlops/s        180 / 360 TFlops/s
  Memory size      128 GByte                 16 / 32 TByte
  Foot print       9 sq feet                 2500 sq feet
  Total power      9 KW                      1.5 MW
  # Processors     512 dual proc             65,536 dual proc
  Networks         3D torus, tree, barrier (both)
  Torus BW         3 B/cycle (both)

24 A BlueGene/L Chip
[Block diagram: two 700 MHz PowerPC 440 cores, each with 32K/32K L1 caches and a double FPU; 2KB L2s feeding a 16KB multiport SRAM buffer and a 4MB ECC eDRAM L3 (128B lines, 8-way associative, 11 GB/s and 5.5 GB/s paths); on-chip interfaces for Gbit Ethernet, the 3D torus (6 in, 6 out, 1.6 GHz, 1.4 Gb/s links), the fat tree (3 in, 3 out, 350 MHz, 2.8 Gb/s links), 4 global barriers, and a 144b DDR controller to 256 MB of DDR at 5.5 GB/s]

25 Networks of Workstations (NOWs): Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces
- Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications
- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs

26 Commercial (NOW) Clusters

  Machine           Proc             Proc speed   # Proc    Network
  Dell PowerEdge    P4 Xeon          3.06 GHz     2,500     Myrinet
  eServer IBM SP    Power4           1.7 GHz      2,944
  VPI BigMac        Apple G5         2.3 GHz      2,200     Mellanox Infiniband
  HP ASCI Q         Alpha 21264      1.25 GHz     8,192     Quadrics
  LLNL Thunder      Intel Itanium2   1.4 GHz      1,024*4   Quadrics
  Barcelona         PowerPC 970      2.2 GHz      4,536     Myrinet

27 Summary
- Flynn's classification of processors: SISD, SIMD, MIMD
  - Q1: How do processors share data?
  - Q2: How do processors coordinate their activity?
  - Q3: How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis: UMAs and NUMAs
  - Scalability of bus connected UMAs is limited (to roughly 36 processors)
  - Network connected NUMAs are more scalable
  - Interconnection Networks (INs): fully connected, xbar, ring, mesh, n-cube, fat tree
- Message passing multis
- Cluster connected (NOW) multis

28 Next Lecture and Reminders
- Next lecture
  - Reading assignment: PH 9.7
- Reminders
  - HW5 (the last one) due Dec 6th (Part 1)
  - Check the on-line grade posting (by your midterm exam number) for correctness
  - Final exam (tentative): Tuesday, December 13th, 2:30-4:20, 22 Deike

