1 Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)
2 Message Passing Multicomputer
- Consists of multiple computing units, called nodes
- Each node is an autonomous computer consisting of:
  - Processor(s) (may be an SMP)
  - Local memory
  - Disks or I/O peripherals (optional)
  - A full-scale OS (or a microkernel)
- Nodes communicate by message passing
- No-remote-memory-access (NORMA) machines
- Also called distributed memory machines
3 IBM SP2
4 SP2
- IBM SP2: Scalable POWERparallel System
- Developed from the RISC System/6000 architecture (POWER2 processor)
- Interconnect: High-Performance Switch (HPS)
5 SP2 Nodes
- 66.7 MHz POWER2 processor with L2 cache
- POWER2 can issue six instructions per cycle (2 load/store, index increment, conditional branch, and two floating-point)
- 2 floating-point units (FPU) + 2 fixed-point units (FXU)
- Performs up to four floating-point operations (2 multiply-add ops) per cycle
- Peak performance of 266 Mflops (66.7 x 4)
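As a quick check on the peak figure, each of the two FPUs completes one fused multiply-add (two flops) per cycle:

    P_{\text{peak}} = 66.7\,\text{MHz} \times 2\,\text{FPUs} \times 2\,\text{flops per FMA} \approx 266.8\ \text{Mflop/s}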
6 IBM SP2: Two Types of Nodes
- Thin node: 4 Micro Channel (I/O) slots, 96 KB L2 cache, 64-512 MB memory, 1-4 GB disk
- Wide node: 8 Micro Channel slots, 288 KB L2 cache, 64-2048 MB memory, 1-8 GB disk
7 SP2 Wide Node
8 IBM SP2: Interconnect
- Switch:
  - High-Performance Switch (HPS), operating at 40 MHz with a peak link bandwidth of 40 MB/s (40 MHz x 8-bit links)
  - Omega-switch-based multistage network
- Network interface:
  - Enhanced Communication Adapter
  - The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) that performs communication coprocessing and data checking
9 SP2 Switch Board
- Each board uses 8 switch elements operating at 40 MHz; 16 elements are installed for reliability
- 4 routes between each pair of nodes (set at boot time)
- Hardware latency: 500 nsec per board
- Capable of scaling bisection bandwidth linearly with the number of nodes
10 SP2 HPS (a 16 x 16 switch board; Vulcan chip)
- Maximum point-to-point bandwidth: 40 MB/s
- 1 packet consists of 256 bytes; flit size = 1 byte (wormhole routing)
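A rough latency sketch for one full packet crossing a single board, taking the 500 ns hardware latency from the previous slide and ignoring software overheads (an estimate, not a measured figure; with wormhole routing the flits pipeline behind the header, so serialization dominates):

    T \approx t_{\text{board}} + \frac{L_{\text{packet}}}{B} = 500\,\text{ns} + \frac{256\,\text{B}}{40\,\text{MB/s}} = 500\,\text{ns} + 6.4\,\mu\text{s} \approx 6.9\,\mu\text{s}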
11 SP2 Communication Adapter
- One adapter per node
- One switch board unit per rack
- Send FIFO: 128 entries (256 bytes each)
- Receive FIFO: 64 entries (256 bytes each)
- 2 DMA engines
12 SP2 Communication Adapter (figure: POWER2 host node and network adapter)
13 128-node SP2 (16 nodes per frame)
14 INTEL PARAGON
15 Intel Paragon (2-D mesh)
16 Intel Paragon Node Architecture
- Up to three 50 MHz Intel i860 processors (75 Mflop/s each) per node; usually two in most installations
  - One is used as the message processor (communication co-processor), handling all communication events
  - Two are application processors (computation only)
- Each node is a shared-memory multiprocessor (64-bit bus at 400 MB/s, with cache coherence support)
  - Peak memory-to-processor bandwidth: 400 MB/s
  - Peak cache-to-processor bandwidth: 1.2 GB/s
17 Intel Paragon Node Architecture
- Message processor:
  - Handles message protocol processing for the application program
  - Frees the application processor to continue numeric computation while messages are transmitted and received
  - Also used to implement efficient global operations such as synchronization, broadcasting, and global reduction calculations (e.g., global sum)
18 Paragon Node Architecture
19 Paragon Interconnect
- 2-D mesh
  - I/O devices attached on a single side
  - 16-bit links, 175 MB/s
- Mesh Routing Components (MRCs)
  - One for each node
  - 40 nsec per hop (switch delay), 70 nsec when changing dimension (from the x-dimension to the y-dimension)
  - In a 512-PE machine (16 x 32), a 10-hop route takes 400-700 nsec
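The 400-700 nsec range is simply the per-hop costs multiplied out; the upper bound pessimistically charges the dimension-change cost on every hop, while a dimension-ordered route changes dimension at most once:

    T_{\min} = 10 \times 40\,\text{ns} = 400\,\text{ns}, \qquad T_{\max} = 10 \times 70\,\text{ns} = 700\,\text{ns}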
20 CRAY T3D
21 Cray T3D Node Architecture
- Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs)
- PE: 150 MHz DEC 21064 Alpha AXP, 34-bit address space, 64 MB memory, 150 MFLOPS
- 1024 processors: maximum sustained speed of 152 Gflop/s
22 T3D Node and Network Interface
23 Cray T3D Interconnect
- Interconnect: 3-D torus, 16-bit data per link, 150 MHz
- Communication channel peak rate: 300 MB/s (2 bytes x 150 MHz)
24 T3D
- Routing data between processors through interconnect nodes costs two clock cycles (6.67 nsec per cycle) per node traversed, plus one extra clock cycle to turn a corner
- The overhead of using the block transfer engine is high (startup cost > 480 cycles x 6.67 nsec = 3.2 usec)
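As a worked example (the 8-hop route is hypothetical), a message traversing 8 nodes with one corner turn costs far less than the block transfer engine startup quoted above:

    T_{\text{route}} \approx (8 \times 2 + 1) \times 6.67\,\text{ns} \approx 113\,\text{ns}, \qquad T_{\text{BLT startup}} > 480 \times 6.67\,\text{ns} \approx 3.2\,\mu\text{s}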
25 T3D: Local and Remote Memory
- Local memory:
  - 16 or 64 MB DRAM per PE
  - Latency: 13 to 38 clock cycles (87 to 253 nsec)
  - Bandwidth: up to 320 MB/s
- Remote memory:
  - Directly addressable by the processor
  - Latency: 1 to 2 microseconds
  - Bandwidth: over 100 MB/s (measured in software)
26 T3D: Local and Remote Memory
- A distributed shared memory machine: all memory is directly accessible; no action is required by remote processors to formulate responses to remote requests
- NCC-NUMA: non-cache-coherent NUMA
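This direct remote addressability was typically exposed to programs through Cray's one-sided SHMEM library. The slides do not name it, so the sketch below uses the portable OpenSHMEM flavour of that interface, with an illustrative symmetric array buf:

    #include <shmem.h>
    #include <stdio.h>

    #define N 8
    static long buf[N];   /* symmetric array: same address on every PE */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long local[N];
        for (int i = 0; i < N; i++) local[i] = 100L * me + i;

        /* One-sided put: write directly into the next PE's memory.
           The remote PE takes no action to service this request. */
        shmem_long_put(buf, local, N, (me + 1) % npes);

        shmem_barrier_all();   /* complete all puts and synchronize */
        printf("PE %d received %ld..%ld\n", me, buf[0], buf[N - 1]);

        shmem_finalize();
        return 0;
    }

The put lands straight in the remote PE's memory with no matching receive on the other side, which is exactly the property the slide describes.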
27 T3D: Bisection Bandwidth
- The network moves data in packets with payload sizes of either one or four 64-bit words
- The bisection bandwidth of a 1024-PE T3D is 76 GB/s
  - 512 nodes = 8 x 8 x 8 torus, 64 nodes per frame, 4 x 64 x 300 MB/s
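One way to read the 4 x 64 x 300 figure (taking the factor 4 as torus wrap-around times bidirectional links is an interpretation, not stated on the slide): bisecting the 8 x 8 x 8 torus cuts across an 8 x 8 = 64-node face, and each channel peaks at 300 MB/s:

    B_{\text{bisect}} \approx 4 \times 64 \times 300\,\text{MB/s} = 76.8\,\text{GB/s}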
28 T3E Node
- E-registers
- Alpha 21164: 4-issue (2 integer + 2 floating-point), 600 Mflop/s at 300 MHz
29 Cluster: Network of Workstations (NOW), Cluster of Workstations (COW), Pile-of-PCs (POPC)
30 Clusters of Workstations
- Several workstations connected by a network
  - Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc.
  - Some software layer tightly integrates all the resources
- Each workstation is an independent machine
31 Cluster
- Advantages:
  - Cheaper
  - Easy to scale
  - Coarse-grain parallelism (traditionally)
- Disadvantages:
  - Longer communication latency compared with other parallel systems (traditionally)
32 ATM Cluster (Fore SBA-200)
- Cluster nodes: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: Intel i960, 33 MHz, 128 KB RAM
- Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port
- HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)
33 Myrinet Cluster
- Cluster nodes: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: LANai, 25 MHz, 128 KB SRAM
- Peak bandwidth: 80 MB/s --> 160 MB/s
34 Conclusion
- Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor
- Overlapping computation with communication improves performance (see the sketch below)
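A minimal sketch of that overlap in message-passing code (the slides do not prescribe MPI; the buffer names and sizes here are illustrative): non-blocking sends and receives are posted first, the host keeps computing while the network interface or communication co-processor moves the data, and the wait comes last.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double sendbuf[N], recvbuf[N], work[N];
        for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        /* Post non-blocking transfers; a communication co-processor can
           progress them while the CPU keeps computing. */
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Overlap: compute on data that does not depend on the incoming message. */
        double local_sum = 0.0;
        for (int i = 0; i < N; i++) local_sum += work[i] * work[i];

        /* Only now block until the communication has completed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d: local_sum = %.1f, first received value = %.1f\n",
               rank, local_sum, recvbuf[0]);
        MPI_Finalize();
        return 0;
    }

Offloading the transfer to the adapter's processor (the i860 on the Paragon and SP2, the BLT on the T3D) is what lets this kind of overlap pay off.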
35 Paragon
- Main processor: 50 MHz i860 XP, 75 Mflop/s
- NI location: memory bus (64-bit, 400 MB/s)
- Communication processor: 50 MHz i860 XP (a general-purpose processor)
- Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)
36 SP2
- Main processor: 66.7 MHz POWER2, 266 Mflop/s
- NI location: I/O bus (32-bit Micro Channel)
- Communication processor: 40 MHz i860 XR (a general-purpose processor)
- Peak bandwidth: 40 MB/s (8-bit link at 40 MHz)
37 T3D
- Main processor: 150 MHz DEC 21064 Alpha AXP, 150 MFLOPS
- NI location: memory bus (320 MB/s local; 100 MB/s remote)
- Communication processor: controller (BLT), hardware circuitry
- Peak bandwidth: 300 MB/s (16-bit data per link at 150 MHz)