An Overview of MIMD Architectures
4/15/2019 \course\eleg652-04F\Topic1b.ppt
Generic MIMD Architecture
- A generic modern multiprocessor
- Node: processor(s), memory system, plus communication assist
- Network interface and communication controller
- Scalable network
Classification
- Shared memory model vs. distributed memory model
Distributed Memory MIMD Machines (multicomputers, MPPs, clusters, etc.)
- Message-passing programming models
- Interconnection networks
- Generations/history:
  - 1983-87: Cosmic Cube, iPSC/1, iPSC/2 (software routing)
  - 1988-92: mesh-connected machines (hardware routing), e.g. Intel Paragon
  - 1993-99: CM-5, IBM SP
  - 1996-: clusters
Concept of Message-Passing
[Figure: process P executes "Send X, Q, t" while process Q executes "Receive Y, P, t"; the matched pair copies X in P's local address space to Y in Q's local address space.]
- Send specifies the buffer to be transmitted and the receiving process
- Recv specifies the sending process and the application storage to receive into
- Memory-to-memory copy, but processes must be named
- In the simplest form, the send/recv match achieves a pairwise synchronization event
Evolution of Message-Passing Machines
- Early machines: FIFO on each link
- Hardware close to the programming model, enabling non-blocking operations
- Buffered by the system at the destination until recv
- Diminishing role of topology:
  - Store-and-forward routing: topology important
  - Introduction of pipelined routing made it less so
  - Cost is in the node-network interface
  - Simplifies programming
Example: IBM SP-2
- Built out of essentially complete RS/6000 workstations
- Network interface integrated into the I/O bus
Example: Intel Paragon
The MANNA Multiprocessor Testbed
[Figure: crossbar hierarchies; clusters of nodes connected through crossbars (8 nodes per cluster crossbar); each node contains an i860XP processor, a communication processor (CP), a network interface, I/O, and 32 MB of memory.]
Shared-Memory Multiprocessors
- Uniform-memory-access model (UMA)
- Non-uniform-memory-access model (NUMA)
  - Without caches (BBN, Cedar, Sequent)
  - COMA (Kendall Square KSR-1, DDM)
  - CC-NUMA (DASH)
- Symmetric vs. asymmetric MPs
  - Symmetric MPs (SMPs)
  - Asymmetric MPs (some masters, some slaves)
Shared Address Space Model (e.g. pthreads)
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
- Writes to a shared address are visible to the other threads
- Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
[Figure: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine's physical address space, while private portions map to distinct ones.]
Shared Address Space Architectures
- Any processor can directly reference any memory location (communication is implicit)
- Convenient:
  - Location transparency
  - Programming model similar to time-sharing on uniprocessors
- Popularly known as the shared memory machines or model
- Ambiguous: memory may be physically distributed among processors
Shared-Memory Parallel Computers (late 1990s - early 2000s)
- SMPs (Intel Quad, Sun SMPs)
- Supercomputers:
  - Cray T3E
  - Convex 2000
  - SGI Origin/Onyx
  - Tera computers
Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue in the processor module
- Highly integrated, targeted at high volume
Example: Sun Enterprise
- 16 cards of either type: processors + memory, or I/O
- All memory accessed over the bus, so symmetric
- Higher-bandwidth, higher-latency bus
Scaling Up
[Figure: "dance hall" organization (all processors on one side of the interconnect, all memories on the other) vs. distributed-memory organization.]
- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but at lower cost than a crossbar
- Distributed memory, or non-uniform memory access (NUMA)
- Caching shared (particularly nonlocal) data?
Example: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates communication requests for nonlocal references
- No hardware mechanism for coherence (SGI Origin etc. provide this)
Multithreaded Shared-Memory MIMD
- "Time-shares" one instruction processing unit among all instruction streams in a pipelined fashion
The Denelcor HEP
[Figure: up to 16 PEMs and 128 DMMs connected through a packet-switch network; within a PEM, process status words (PSWs) circulate through a pipeline of instruction fetch (IF), data fetch (DF), and execute (EX) stages.]
Denelcor HEP
- Many instruction streams share a single pipelined P-unit
- 16 PEMs + 128 DMMs: 64 bits/word per DMM
- Packet-switching network
- I-stream creation is under program control
- Up to 50 I-streams
- Programmability: SISAL, Fortran
Tera MTA (1990)
- A shared-memory LIW multiprocessor
- 128 fine-grained threads, each with 32 registers, tolerate functional-unit, synchronization, and memory latency
- Explicit-dependence lookahead increases single-thread concurrency
- Synchronization uses full/empty bits
CM-5
- Scalable massively parallel supercomputer for the 1990s
- 10^12 floating-point operations per second (teraflops)
- 64,000 powerful RISC microprocessors working together
- Scalable: performance grows transparently
- Universal: supports a vast variety of application domains
- Highly reliable: sustained performance for large jobs requiring weeks or months to run
Future Trends in MIMD Computers
- Program execution models: beyond the SPMD model
- Hybrid architectures: provide both shared memory and message passing
- Efficient mechanisms for latency AND bandwidth management - the "memory wall" problem
Shared Memory Architecture Examples (2000 - now)
- Sun's Wildfire architecture (Hennessy & Patterson, Section 6.11, page 622)
- Intel Xeon multithreaded architecture
- SGI Onyx-3000
- IBM p690
- Others
Sun Fire 15K
[Figure: expander boards carrying processors and shared memory, plus I/O boards.]
- 4 CPUs per board: 900 MHz UltraSPARC with 32 KB I-cache and 64 KB D-cache
- 32 GB of memory per board
- Crossbar switch: 43 GB/s bandwidth
Intel Xeon MP Based Server
[Figure: Xeon processors and I/O connected through a memory control hub and a PCI-X bridge.]
- 1.8 GHz Xeon with 512 KB L2 cache
- 4 processors share a common bus with 6.4 GB/s bandwidth
- Memory shares a common bus with 4.3 GB/s bandwidth
- Memory is accessed through a memory control hub
IBM p690
[Figure: POWER4 chip with two 1 GHz CPU cores (each with I- and D-caches), shared L2 cache, L3 controller and L3 cache, distributed switch, processor local bus, I/O bus, and memory.]
- Each POWER4 chip has two 1 GHz processor cores, a shared 1.5 MB L2, directly accessed 32 MB per-chip L3, and chip-to-chip communication logic
- Each SMP building block has 4 POWER4 chips
- The base p690 has up to 4 SMP building blocks
SGI Onyx 3800
[Figure: C-Bricks (processors and caches with shared memory) connected through R-Bricks.]
- Each node, called a C-Brick, has 2-4 processors at 600 MHz
- An R-Brick is an 8x8 crossbar switch with 3.2 GB/s bandwidth: 4 ports for C-Bricks, 4 for other R-Bricks
- Each C-Brick has up to 8 GB of local memory that all processors can access via the NUMAlink interconnect
Recent High-End MIMD Parallel Architecture Projects
- ASCI projects (USA):
  - ASCI Blue
  - ASCI Red
  - ASCI Blue Mountain
- HTMT project (USA)
- The Earth Simulator (Japan)
- HPCS architectures (USA)