CSE431 L27 NetworkMultis.1 Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 27. Network Connected Multi's Mary Jane Irwin (www.cse.psu.edu/~mji) www.cse.psu.edu/~cg431 [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
CSE431 L27 NetworkMultis.2 Irwin, PSU, 2005 Review: Bus Connected SMPs (UMAs)
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware for cache coherence and process synchronization
- Bus traffic and bandwidth limit scalability (< ~36 processors)
[diagram: several Processor + Cache nodes, Memory, and I/O all attached to a Single Bus]
CSE431 L27 NetworkMultis.3 Irwin, PSU, 2005 Review: Multiprocessor Basics

                                            # of Proc
  Communication model   Message passing     8 to 2048
                        Shared address      NUMA: 8 to 256
                                            UMA:  2 to 64
  Physical connection   Network             8 to 256
                        Bus                 2 to 36

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?
CSE431 L27 NetworkMultis.4 Irwin, PSU, 2005 Network Connected Multiprocessors
- Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives
- The interconnection network supports interprocessor communication
[diagram: multiple Processor + Cache + Memory nodes attached to an Interconnection Network (IN)]
CSE431 L27 NetworkMultis.5 Irwin, PSU, 2005 Summing 100,000 Numbers on 100 Processors
- Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + Al[i];              /* sum the local array subset */

- The processors then coordinate in adding together the partial sums (Pn is the processor's number, send(x,y) sends value y to processor x, and receive() receives a value)

    half = 100;
    limit = 100;
    repeat
      half = (half+1)/2;              /* dividing line between senders and receivers */
      if (Pn >= half && Pn < limit) send(Pn-half, sum);
      if (Pn < (limit/2)) sum = sum + receive();
      limit = half;                   /* new upper limit of senders */
    until (half == 1);                /* final sum is in P0's sum */
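The slide leaves send(x,y) and receive() abstract. As one possible concrete rendering (a sketch only: MPI as the message-passing layer and the stand-in data for A are assumptions, not anything the slide specifies), the same reduction can be written in C:

    /* Minimal MPI sketch of the slide's send/receive reduction.
       Any message-passing layer with equivalent primitives would do. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int Pn, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* this processor's number */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* ideally 100 */

        /* each processor sums its local 1000-element slice of A */
        double Al[1000];
        for (int i = 0; i < 1000; i = i + 1) Al[i] = 1.0;   /* stand-in data */
        double sum = 0;
        for (int i = 0; i < 1000; i = i + 1) sum = sum + Al[i];

        /* tree reduction exactly as on the slide */
        int half = nprocs, limit = nprocs;
        do {
            half = (half + 1) / 2;                   /* dividing line */
            if (Pn >= half && Pn < limit)            /* upper half sends down */
                MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {                    /* lower half receives and adds */
                double other;
                MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum = sum + other;
            }
            limit = half;
        } while (half != 1);                         /* final sum ends up in P0 */

        if (Pn == 0) printf("total = %f\n", sum);
        MPI_Finalize();
        return 0;
    }

Built with mpicc and run with 100 ranks (e.g. mpirun -np 100), it leaves the grand total in rank 0, just as the until (half == 1) comment promises.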
CSE431 L27 NetworkMultis.6 Irwin, PSU, 2005 An Example with 10 Processors
[diagram: processors P0–P9, each holding a partial sum; half = 10]
CSE431 L27 NetworkMultis.7 Irwin, PSU, 2005 An Example with 10 Processors
[diagram: the reduction tree for P0–P9. With half = 10 and limit = 10, all ten processors hold partial sums. After half = 5, P5–P9 send to P0–P4, which receive and add. After half = 3 (limit = 5), P3–P4 send to P0–P1. After half = 2 (limit = 3), P2 sends to P0. After half = 1 (limit = 2), P1 sends to P0, which holds the final sum.]
CSE431 L27 NetworkMultis.8 Irwin, PSU, 2005 Communication in Network Connected Multi's
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data by address only when it is demanded, rather than to send it ahead of time in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
CSE431 L27 NetworkMultis.9 Irwin, PSU, 2005 Cache Coherency in NUMAs
- For performance reasons we want to allow shared data to be stored in caches
- Once again there are multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
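The slide does not commit to a particular directory organization, so the following is only an illustrative C sketch of a full-bit-vector directory entry and of how a home node might handle a read miss; the names, the three states, and the 64-node limit are all assumptions, not part of the lecture.

    #include <stdint.h>

    /* One directory entry per memory block (full bit-vector scheme, assumed). */
    enum dir_state { UNCACHED, SHARED, EXCLUSIVE };   /* EXCLUSIVE = one dirty copy */

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;   /* bit i set => processor i has a copy (<= 64 nodes) */
    };

    /* Sketch of the home node handling a read miss from processor p: if some
       cache holds the block dirty, it must supply the data and downgrade
       first; then p is recorded as a sharer. */
    void directory_read_miss(struct dir_entry *e, int p)
    {
        if (e->state == EXCLUSIVE) {
            /* send a fetch/downgrade command over the IN to the current owner
               and wait for the write-back to memory (not shown) */
            e->state = SHARED;
        }
        if (e->state == UNCACHED) e->state = SHARED;
        e->sharers |= (uint64_t)1 << p;   /* record the new sharer */
        /* reply to processor p with the data block (not shown) */
    }

A write miss would be handled analogously: invalidate every cache named in the sharer set over the IN, then hand the block to the writer as the new (dirty) owner.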
CSE431 L27 NetworkMultis.10 Irwin, PSU, 2005 IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch needed to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) – represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum # of messages transmitted per unit time
  - # routing hops in the worst case, congestion control and delay
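The bisection-bandwidth definition above translates almost literally into code. The C sketch below is illustrative only (bisection_links, MAXN and the adjacency-matrix representation are not from the slides): it finds the minimum number of links crossing any split of a small switch network into two equal halves, which multiplied by the per-link bandwidth gives BB.

    #include <limits.h>

    #define MAXN 20   /* brute force is O(2^N); keep the example network small */

    static int popcount(unsigned long x) {
        int c = 0;
        while (x) { c += (int)(x & 1); x >>= 1; }
        return c;
    }

    /* Minimum number of links crossing any split of the N switches into two
       equal halves; adj must be a symmetric 0/1 adjacency matrix. */
    int bisection_links(int N, const int adj[MAXN][MAXN])
    {
        int best = INT_MAX;
        for (unsigned long mask = 0; mask < (1UL << N); mask++) {
            if (popcount(mask) != N / 2) continue;      /* only balanced splits */
            int cross = 0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++)
                    if (adj[i][j] && (((mask >> i) ^ (mask >> j)) & 1))
                        cross++;                        /* this link crosses the cut */
            if (cross < best) best = cross;
        }
        return best;   /* BB = best * per-link bandwidth */
    }

For an 8-switch ring, for example, it returns 2, matching the BB on the ring slide that follows.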
CSE431 L27 NetworkMultis.11 Irwin, PSU, 2005 Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
[diagram legend for this and the following slides: processor nodes vs. bidirectional network switches]
CSE431 L27 NetworkMultis.12 Irwin, PSU, 2005 Ring IN
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but N times faster in the best case
CSE431 L27 NetworkMultis.13 Irwin, PSU, 2005 Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2
CSE431 L27 NetworkMultis.14 Irwin, PSU, 2005 Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2
CSE431 L27 NetworkMultis.15 Irwin, PSU, 2005 Hypercube (Binary N-cube) Connected IN
- N processors, N switches, logN links/switch, (N logN)/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N logN)/2
  - BB = link bandwidth * N/2
[diagrams: a 2-cube and a 3-cube]
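The "# routing hops worst case" metric from the metrics slide is easy to see for a hypercube: adjacent switches differ in exactly one address bit, so routing can fix one differing bit per hop. A small C sketch (illustrative only; these helper functions are not from the slides):

    /* Dimension-order routing in a binary n-cube: each hop flips one bit in
       which the current switch address differs from the destination, so the
       hop count is the number of differing bits, at most logN. */
    int hypercube_next_hop(int current, int dest)
    {
        int diff = current ^ dest;
        if (diff == 0) return current;      /* already at the destination */
        int bit = diff & -diff;             /* lowest differing bit */
        return current ^ bit;               /* neighbor across that dimension */
    }

    int hypercube_hops(int src, int dest)   /* = popcount(src ^ dest) */
    {
        int hops = 0;
        for (int diff = src ^ dest; diff; diff >>= 1) hops += diff & 1;
        return hops;
    }

For the 64-node 6-cube in the comparison table a few slides later, the worst case is therefore log2(64) = 6 hops.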
CSE431 L27 NetworkMultis.16 Irwin, PSU, 2005 2D and 3D Mesh/Torus Connected IN
- N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links
- N simultaneous transfers
  - NB = link bandwidth * 4N or link bandwidth * 6N
  - BB = link bandwidth * 2*N^(1/2) or link bandwidth * 2*N^(2/3)
CSE431 L27 NetworkMultis.17 Irwin, PSU, 2005 Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
  - NB = link bandwidth * N*logN
  - BB = link bandwidth * 4
CSE431 L27 NetworkMultis.18 Irwin, PSU, 2005 Fat Tree
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
[diagram: a binary tree network with leaf nodes A, B, C, D]
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times
- The solution is to 'thicken' the upper links.
  - Adding more links higher up the tree increases the bisection bandwidth
  - Rather than design a bunch of N-port switches, use pairs of switches
CSE431 L27 NetworkMultis.19 Irwin, PSU, 2005 SGI NUMAlink Fat Tree www.embedded-computing.com/articles/woodacre
CSE431 L27 NetworkMultis.20 Irwin, PSU, 2005 IN Comparison
- For a 64-processor system (Bus column filled in here; the rest on the next slide):

                              Bus   Ring   Torus   6-cube   Fully connected
    Network bandwidth          1
    Bisection bandwidth        1
    Total # of switches        1
    Links per switch
    Total # of links           1
CSE431 L27 NetworkMultis.21 Irwin, PSU, 2005 IN Comparison
- For a 64-processor system:

                              Bus   Ring     2D Torus   6-cube    Fully connected
    Network bandwidth          1     64        256        192         2016
    Bisection bandwidth        1      2         16         32         1024
    Total # of switches        1     64         64         64           64
    Links per switch                 2+1        4+1        6+1         63+1
    Total # of links (bidi)    1    64+64     128+64     192+64      2016+64
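The bandwidth rows follow directly from the NB and BB formulas on the preceding topology slides. A short C sketch (illustrative only) that reproduces them for N = 64, in multiples of the per-link bandwidth:

    #include <math.h>
    #include <stdio.h>

    /* Network bandwidth (NB) and bisection bandwidth (BB) in units of the
       link bandwidth, using the formulas from the topology slides above. */
    static void report(const char *name, double nb, double bb)
    {
        printf("%-16s NB = %6.0f   BB = %6.0f\n", name, nb, bb);
    }

    int main(void)
    {
        double N = 64, logN = log2(N);

        report("Bus",             1,                 1);
        report("Ring",            N,                 2);
        report("2D Torus",        4 * N,             2 * sqrt(N));
        report("6-cube",          N * logN / 2,      N / 2);
        report("Fully connected", N * (N - 1) / 2,   (N / 2) * (N / 2));
        return 0;
    }

Compiled with -lm, it prints 1/1, 64/2, 256/16, 192/32 and 2016/1024, matching the table.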
CSE431 L27 NetworkMultis.22 Irwin, PSU, 2005 Network Connected Multiprocessors

                     Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
    SGI Origin       R16000                        128        fat tree                      800
    Cray T3E         Alpha 21164      300 MHz      2,048      3D torus                      600
    Intel ASCI Red   Intel            333 MHz      9,632      mesh                          800
    IBM ASCI White   Power3           375 MHz      8,192      multistage Omega              500
    NEC ES           SX-5             500 MHz      640*8      640-xbar                      16,000
    NASA Columbia    Intel Itanium2   1.5 GHz      512*20     fat tree, Infiniband
    IBM BG/L         Power PC 440     0.7 GHz      65,536*2   3D torus, fat tree, barrier
CSE431 L27 NetworkMultis.23 Irwin, PSU, 2005 IBM BlueGene

                    512-node proto        BlueGene/L
    Peak Perf       1.0 / 2.0 TFlops/s    180 / 360 TFlops/s
    Memory Size     128 GByte             16 / 32 TByte
    Foot Print      9 sq feet             2500 sq feet
    Total Power     9 KW                  1.5 MW
    # Processors    512 dual proc         65,536 dual proc
    Networks        3D Torus, Tree, Barrier
    Torus BW        3 B/cycle
CSE431 L27 NetworkMultis.24 Irwin, PSU, 2005 A BlueGene/L Chip
[block diagram: two 700 MHz 440 CPUs, each with a double FPU, 32K/32K L1 caches, and a 2KB L2, sharing a 16KB multiport SRAM buffer and a 4MB ECC eDRAM L3 (128B line, 8-way assoc); a 144b DDR controller to 256MB of DDR at 5.5 GB/s; network interfaces for Gbit ethernet, the 3D torus (6 in, 6 out, 1.6 GHz, 1.4 Gb/s links), the fat tree (3 in, 3 out, 350 MHz, 2.8 Gb/s links), and 4 global barriers]
CSE431 L27 NetworkMultis.25 Irwin, PSU, 2005 Networks of Workstations (NOWs) Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces
- Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications
- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs
CSE431 L27 NetworkMultis.26 Irwin, PSU, 2005 Commercial (NOW) Clusters

                     Proc             Proc Speed   # Proc    Network
    Dell PowerEdge   P4 Xeon          3.06 GHz     2,500     Myrinet
    eServer IBM SP   Power4           1.7 GHz      2,944
    VPI BigMac       Apple G5         2.3 GHz      2,200     Mellanox Infiniband
    HP ASCI Q        Alpha 21264      1.25 GHz     8,192     Quadrics
    LLNL Thunder     Intel Itanium2   1.4 GHz      1,024*4   Quadrics
    Barcelona        PowerPC 970      2.2 GHz      4,536     Myrinet
CSE431 L27 NetworkMultis.27 Irwin, PSU, 2005 Summary
- Flynn's classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis – UMAs and NUMAs
  - Scalability of bus connected UMAs limited (< ~36 processors)
  - Network connected NUMAs more scalable
  - Interconnection Networks (INs): fully connected, xbar, ring, mesh, n-cube, fat tree
- Message passing multis
- Cluster connected (NOWs) multis
CSE431 L27 NetworkMultis.28 Irwin, PSU, 2005 Next Lecture and Reminders
- Next lecture
  - Reading assignment – PH 9.7
- Reminders
  - HW5 (and last) due Dec 6th (Part 1)
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (tentative) schedule
    - Tuesday, December 13th, 2:30-4:20, 22 Deike