1
An Overview of MIMD Architectures
4/15/2019 \course\eleg652-04F\Topic1b.ppt
2
Generic MIMD Architecture
A generic modern multiprocessor
Node: processor(s), memory system, plus communication assist
  Network interface and communication controller
Scalable network
3
Classification
Shared memory model vs. distributed memory model
4
Distributed Memory MIMD Machines (Multicomputers, MPPs, clusters, etc.)
Message-passing programming models
Interconnect networks
Generations/history:
  COSMIC CUBE, iPSC/I, II: software routing
  Mesh-connected (hardware routing): Intel Paragon
  CM-5, IBM-SP
  Clusters
5
Concept of Message-Passing
[Figure: processes P and Q, each with a local process address space. P issues Send X, Q, t; Q issues a Receive with tag t into address Y; the two operations match.]
Send specifies buffer to be transmitted and receiving process
Recv specifies sending process and application storage to receive into
Memory-to-memory copy, but need to name processes
In simplest form, the send/recv match achieves a pairwise synch event
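The send/recv matching described on this slide can be sketched in a few lines of Python. This is only a model under stated assumptions: the two "processes" are threads inside one interpreter, and `Mailbox`, `send`, `recv`, and `run_demo` are hypothetical names, not a real message-passing API. The send here is buffered at the destination, so only the receiver blocks.

```python
import threading


class Mailbox:
    """Incoming-message store for one process, matched by (source, tag)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []  # (source, tag, payload) tuples not yet received

    def deliver(self, source, tag, payload):
        with self._cond:
            self._pending.append((source, tag, payload))
            self._cond.notify_all()

    def take(self, source, tag):
        with self._cond:
            while True:
                for i, (src, msg_tag, payload) in enumerate(self._pending):
                    if src == source and msg_tag == tag:
                        del self._pending[i]
                        return payload
                self._cond.wait()  # block until another message is delivered


# One mailbox per named process (hypothetical process names P and Q).
MAILBOXES = {"P": Mailbox(), "Q": Mailbox()}


def send(me, buffer, dest, tag):
    # Send names the buffer and the receiving process; list() models the copy.
    MAILBOXES[dest].deliver(me, tag, list(buffer))


def recv(me, source, tag):
    # Recv names the sending process; the return value is the received data.
    return MAILBOXES[me].take(source, tag)


def run_demo():
    storage_q = {}  # stands in for Q's local address space

    def process_p():
        x = [1, 2, 3]
        send("P", x, "Q", tag=7)

    def process_q():
        storage_q["Y"] = recv("Q", "P", tag=7)

    threads = [threading.Thread(target=process_q),
               threading.Thread(target=process_p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage_q["Y"]
```

Calling `run_demo()` returns the data that Q stored at Y after the matching send from P arrived.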
6
Evolution of Message-Passing Machines
Early machines: FIFO on each link
  Hw close to programming model enabling non-blocking ops
  Buffered by system at destination until recv
Diminishing role of topology
  Store-and-forward routing: topology important
  Introduction of pipelined routing made it less so
  Cost is in node-network interface
  Simplifies programming
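Why pipelined routing reduces the role of topology can be seen from a first-order latency model. The sketch below assumes no contention and a fixed per-hop routing delay; the function and parameter names are illustrative only, not taken from the slides.

```python
def store_and_forward_latency(msg_bytes, bandwidth, hops, per_hop_delay):
    """Each hop receives the whole message before forwarding it."""
    return hops * (msg_bytes / bandwidth + per_hop_delay)


def pipelined_latency(msg_bytes, bandwidth, hops, per_hop_delay):
    """The message streams through the routers; only the header pays per hop."""
    return msg_bytes / bandwidth + hops * per_hop_delay


def compare(msg_bytes=1000, bandwidth=100, per_hop_delay=1, hop_counts=(1, 10)):
    # Returns {hops: (store_and_forward, pipelined)} in the same time unit
    # as per_hop_delay (bandwidth is in bytes per that time unit).
    return {
        h: (store_and_forward_latency(msg_bytes, bandwidth, h, per_hop_delay),
            pipelined_latency(msg_bytes, bandwidth, h, per_hop_delay))
        for h in hop_counts
    }
```

With these assumed numbers, going from 1 hop to 10 hops multiplies the store-and-forward latency by ten, while the pipelined latency grows only by the extra per-hop delays, so distance in the network matters much less.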
7
Example: IBM SP-2
Made out of essentially complete RS6000 workstations
Network interface integrated in I/O bus
8
Example: Intel Paragon
9
The MANNA Multiprocessor Testbed
[Figure: MANNA crossbar hierarchies. Nodes are grouped into clusters joined by crossbars. Node labels: i860XP, CP, network interface, I/O, 32 Mbyte memory.]
10
Shared-Memory Multiprocessors
Uniform-memory-access model (UMA)
Non-uniform-memory-access model (NUMA)
  without caches (BBN, Cedar, Sequent)
  COMA (Kendall Square KSR-1, DDM)
  CC-NUMA (DASH)
Symmetric vs. asymmetric MPs
  Symmetric MP (SMP)
  Asymmetric MP (some master, some slave)
11
Shared Address Space Model (e.g. pthreads)
Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
Writes to shared address visible to other threads
Natural extension of uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
[Figure: virtual address spaces for a collection of processes (P1, P2, ..., Pn) communicating via shared addresses. Labels: load, store, shared portion of address space, private portion of address space, common physical addresses, machine physical address space.]
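A minimal sketch of this model, using Python threads rather than pthreads: ordinary reads and writes to a shared variable carry the communication, and a lock plays the role of the special atomic operation. The function name `parallel_sum` and the strided work split are assumptions made for illustration.

```python
import threading


def parallel_sum(values, num_threads=4):
    total = [0]              # shared portion: visible to every thread
    lock = threading.Lock()  # synchronization primitive guarding the update

    def worker(chunk):
        partial = sum(chunk)     # private data: local to this thread
        with lock:               # make the read-modify-write atomic
            total[0] += partial  # ordinary store to a shared location

    threads = [
        threading.Thread(target=worker, args=(values[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

No explicit send or receive appears anywhere: each thread communicates its partial result simply by writing to `total`.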
12
Shared Address Space Architectures
Any processor can directly reference any memory location (communication is implicit)
Convenient:
  Location transparency
  Similar programming model to time-sharing on uniprocessors
Popularly known as shared memory machines or model
  Ambiguous: memory may be physically distributed among processors
13
Shared-Memory Parallel Computers (late 90’s – early 2000’s)
SMPs (Intel Quad, SUN SMPs)
Supercomputers
  Cray T3E
  Convex 2000
  SGI Origin/Onyx
  Tera Computers
14
Example: Intel Pentium Pro Quad
All coherence and multiprocessing glue in processor module
Highly integrated, targeted at high volume
15
Example: SUN Enterprise
16 cards of either type: processors + memory, or I/O
All memory accessed over bus, so symmetric
Higher bandwidth, higher latency bus
16
Scaling Up
[Figure: "dance hall" organization vs. distributed memory organization]
Interconnect is the bottleneck: cost (crossbar) or bandwidth (bus)
Dance-hall: bandwidth still scalable, but lower cost than a crossbar
Distributed memory or non-uniform memory access (NUMA)
Caching shared (particularly nonlocal) data?
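The cost-versus-bandwidth trade-off can be made concrete with a rough model. It assumes a full crossbar needs one crosspoint per processor-memory pair and that a bus divides a fixed bandwidth evenly among processors; both function names are hypothetical.

```python
def crossbar_crosspoints(n_procs, n_mems):
    """Switch count of a full crossbar: one crosspoint per (processor, memory) pair."""
    return n_procs * n_mems


def bus_bandwidth_per_processor(bus_bandwidth, n_procs):
    """Even share of a single shared bus, in the same unit as bus_bandwidth."""
    return bus_bandwidth / n_procs


def scaling_table(sizes=(4, 16, 64), bus_bandwidth=800):
    # Maps n -> (crossbar crosspoints for n processors and n memories,
    #            per-processor share of the bus bandwidth).
    return {
        n: (crossbar_crosspoints(n, n),
            bus_bandwidth_per_processor(bus_bandwidth, n))
        for n in sizes
    }
```

Under these assumptions, crossbar cost grows quadratically with machine size while the per-processor share of a bus shrinks linearly, which is the tension the slide points at.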
17
Example: Cray T3E
Scales up to 1024 processors, 480 MB/s links
Memory controller generates communication request for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)
18
Multithreaded Shared-Memory MIMD
“Time sharing” of one instruction processing unit, in a pipelined fashion, by all instruction streams
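One way to picture this time sharing is a round-robin issue policy: each cycle the processing unit takes the next instruction from a different stream. The sketch below is a simplification under that assumption (real machines also account for stalled streams), and `interleave` is a hypothetical name.

```python
from collections import deque


def interleave(streams):
    """Issue one instruction per cycle, rotating round-robin over the streams.

    streams maps a stream name to its list of instructions; the result is the
    issue order as (stream name, instruction) pairs.
    """
    ready = deque(
        (name, deque(instrs)) for name, instrs in streams.items() if instrs
    )
    trace = []
    while ready:
        name, instrs = ready.popleft()
        trace.append((name, instrs.popleft()))  # this stream issues now
        if instrs:
            ready.append((name, instrs))        # rejoin the back of the rotation
    return trace
```

Because consecutive pipeline slots hold instructions from different streams, a long-latency operation in one stream need not leave the unit idle as long as other streams remain ready.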
19
The Denelcor HEP
[Figure: the Denelcor HEP. PEM 1 through PEM 16 and DMM 1 through DMM 128 are connected by a packet switch network. Labels inside a PEM: IF, DF, EX, ST, INC, PSW.]
20
Denelcor HEP
Many instruction streams, single P-unit
16 PEMs and DMMs: 64 bit/DMM
Packet-switching network
I-stream creation is under program control
50 I-streams
Programmability: SISAL, Fortran
21
Tera MTA (1990)
A shared-memory LIW multiprocessor
128 fine threads have 32 registers each to tolerate FU, synchronization, and memory latency
Explicit-dependence lookahead increases single-thread concurrency
Synchronization uses full/empty bits
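Full/empty-bit synchronization can be modeled in software as a memory cell with one extra state bit: a synchronizing write waits until the cell is empty, and a synchronizing read waits until it is full. The sketch below emulates that behavior with a condition variable; the class and method names are hypothetical, and the real machine does this in the memory hardware rather than in software.

```python
import threading


class FullEmptyCell:
    """One memory word plus a full/empty bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        with self._cond:
            while self._full:           # wait for the word to be empty
                self._cond.wait()
            self._value = value
            self._full = True           # writing sets the bit to full
            self._cond.notify_all()

    def read_when_full(self):
        with self._cond:
            while not self._full:       # wait for the word to be full
                self._cond.wait()
            value = self._value
            self._full = False          # reading resets the bit to empty
            self._cond.notify_all()
            return value


def producer_consumer(items):
    cell = FullEmptyCell()
    received = []

    def producer():
        for item in items:
            cell.write_when_empty(item)

    def consumer():
        for _ in range(len(items)):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

The single cell hands each value from producer to consumer in order, with no separate lock or flag in the program itself.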
22
CM-5 Scalable Massively Parallel Supercomputer for 1990’s
10^12 floating-point operations per second (teraflops)
64,000 powerful RISC microprocessors working together
Scalable: performance grows transparently
Universal: supports a vast variety of application domains
Highly reliable: sustained performance for large jobs requiring weeks/months to run
23
Future Trend of MIMD Computers
Program execution models: beyond the SPMD model
Hybrid architectures: provide both shared memory and message passing
Efficient mechanisms for latency AND bandwidth management (the “memory wall” problem)
24
Shared Memory Architecture Examples (2000 – now)
Sun’s Wildfire Architecture (Hennessy & Patterson, section 6.11, page 622)
Intel Xeon Multithreaded Architecture
SGI Onyx-3000
IBM p690
Others
25
Sun Fire 15K
[Figure: expander board with processors (p), shared memory, and I/O boards]
4 CPUs per board: 900 MHz UltraSPARC with 32 KB I-cache and 64 KB D-cache
32 GB memory per board
Crossbar switch: 43 GB/s bandwidth
26
Intel Xeon MP based server
[Figure: Xeon processors, memory control hub, memory, I/O, PCI-X bridge]
1.8 GHz Xeon with 512 KB L2 cache
4 processors share a common bus of 6.4 GB/s bandwidth
Memory shares a common bus of 4.3 GB/s bandwidth
Memory accessed through a memory control hub
27
IBM p690
[Figure: POWER4 chip with two 1 GHz CPUs, each with I and D caches; shared L2 cache; L3 controller; distributed switch; L3 cache; processor local bus; I/O bus; memory]
Each POWER4 chip has two 1 GHz processor cores, a shared 1.5 MB L2, directly accessed 32 MB/chip L3, and chip-to-chip communication logic
Each SMP building block has 4 POWER4 chips
The base p690 has up to 4 SMP building blocks
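The per-chip figures on this slide multiply out to system totals. The short calculation below uses only the numbers stated above; the constant and function names are made up for the example.

```python
CORES_PER_CHIP = 2        # two 1 GHz cores per POWER4 chip
L2_MB_PER_CHIP = 1.5      # shared L2 per chip
L3_MB_PER_CHIP = 32       # L3 per chip
CHIPS_PER_BLOCK = 4       # POWER4 chips per SMP building block
MAX_BLOCKS = 4            # SMP building blocks in a full system


def p690_totals(blocks=MAX_BLOCKS):
    """Aggregate chip, core, and cache counts for a given number of blocks."""
    chips = blocks * CHIPS_PER_BLOCK
    return {
        "chips": chips,
        "cores": chips * CORES_PER_CHIP,
        "l2_mb": chips * L2_MB_PER_CHIP,
        "l3_mb": chips * L3_MB_PER_CHIP,
    }
```

A fully populated system with 4 building blocks therefore has 16 chips and 32 processor cores.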
28
SGI Onyx 3800
[Figure: R-Brick connecting C-Brick nodes, each with processors (P), caches ($), and shared memory]
Each node is called a C-Brick, with 2-4 processors at 600 MHz
An R-Brick is an 8-by-8 crossbar switch of 3.2 GB/s bandwidth: 4 ports for C-Bricks, 4 for other R-Bricks
Each C-Brick has up to 8 GB of local memory that can be accessed by all processors by way of the NUMAlink interconnect
29
Recent High-End MIMD Parallel Architecture Projects
ASCI Projects (USA)
  ASCI Blue
  ASCI Red
  ASCI Blue Mountains
HTMT Project (USA)
The Earth Simulator (Japan)
HPCS architectures (USA)