An Overview of MIMD Architectures


An Overview of MIMD Architectures 4/15/2019 \course\eleg652-04F\Topic1b.ppt

Generic MIMD Architecture
A generic modern multiprocessor:
- Node: processor(s) and memory system, plus a communication assist (network interface and communication controller)
- Scalable interconnection network

Classification
- Shared-memory model vs. distributed-memory model

Distributed-Memory MIMD Machines (multicomputers, MPPs, clusters, etc.)
- Message-passing programming models
- Interconnection networks
- Generations/history:
  - 1983-87: Cosmic Cube, Intel iPSC/1 and iPSC/2 (software routing)
  - 1988-92: mesh-connected machines with hardware routing (Intel Paragon)
  - 1993-99: CM-5, IBM SP
  - 1996- : clusters

Concept of Message-Passing
[Figure: process P executes Send X, Q, t; process Q executes Receive Y, P, t; the send and receive match on process and tag t, copying local address X in P's address space to local address Y in Q's]
- Send specifies the buffer to be transmitted and the receiving process
- Receive specifies the sending process and the application storage to receive into
- Memory-to-memory copy, but processes must be named
- In its simplest form, the send/receive match achieves a pairwise synchronization event

Evolution of Message-Passing Machines
- Early machines: FIFO on each link
  - Hardware close to the programming model, enabling non-blocking operations
  - Buffered by the system at the destination until received
- Diminishing role of topology:
  - Store-and-forward routing: topology important
  - Introduction of pipelined routing made it less so
  - Cost is in the node-to-network interface
  - Simplifies programming

Example: IBM SP-2
- Built from essentially complete RS/6000 workstations
- Network interface integrated into the I/O bus

Example: Intel Paragon

The MANNA Multiprocessor Testbed
[Figure: crossbar hierarchies; clusters of nodes connected by a crossbar, with a further crossbar linking clusters. Each node contains an i860XP CPU, a communication processor (CP), a network interface, I/O, and 32 MB of memory]

Shared-Memory Multiprocessors
- Uniform-memory-access model (UMA)
- Non-uniform-memory-access model (NUMA)
  - Without caches (BBN, Cedar, Sequent)
  - COMA (Kendall Square KSR-1, DDM)
  - CC-NUMA (DASH)
- Symmetric vs. asymmetric MPs
  - Symmetric MP (SMP)
  - Asymmetric MP (some processors master, some slave)

Shared Address Space Model (e.g., pthreads)
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
- Writes to shared addresses are visible to other threads
- Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
[Figure: virtual address spaces of a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine's physical address space, while private portions map to private ones]
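The model above can be sketched with threads sharing one address space. The slides name pthreads; this is a minimal illustrative sketch using Python's threading module instead, with invented names (worker, counter). An ordinary variable serves as the shared location, and a lock stands in for the special atomic operations used for synchronization.

```python
import threading

counter = 0              # lives in the shared address space
lock = threading.Lock()  # synchronization primitive shared by all threads

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # without this, the read-modify-write would race
            counter += 1 # ordinary memory operation, visible to all threads

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Communication here is implicit: no thread names another thread or sends a message; they simply load and store the same address, which is exactly the convenience (and the coherence burden) of the shared-memory model.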

Shared Address Space Architectures
- Any processor can directly reference any memory location (communication is implicit)
- Convenient:
  - Location transparency
  - Programming model similar to time-sharing on uniprocessors
- Popularly known as shared-memory machines or the shared-memory model
- Ambiguous: memory may be physically distributed among processors

Shared-Memory Parallel Computers (late 1990s - early 2000s)
- SMPs (Intel Pentium Pro Quad, Sun SMPs)
- Supercomputers: Cray T3E, Convex 2000, SGI Origin/Onyx, Tera computers

Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue in the processor module
- Highly integrated, targeted at high volume

Example: Sun Enterprise
- 16 cards of either type: processors + memory, or I/O
- All memory accessed over the bus, so symmetric
- Higher-bandwidth, higher-latency bus

Scaling Up
- The problem is the interconnect: cost (crossbar) or bandwidth (bus)
- "Dance-hall" organization: bandwidth still scalable, but at lower cost than a full crossbar
- Distributed memory, or non-uniform memory access (NUMA)
- Caching shared (particularly nonlocal) data?

Example: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates communication requests for nonlocal references
- No hardware mechanism for coherence (SGI Origin etc. provide this)

Multithreaded Shared-Memory MIMD
- One instruction-processing unit is "time-shared" in a pipelined fashion among all instruction streams

The Denelcor HEP
[Figure: 16 process execution modules (PEM 1-16) connected through a packet-switch network to 128 data memory modules (DMM 1-128); each PEM pipelines instruction streams through instruction-fetch, decode, and execute stages]

Denelcor HEP
- Many instruction streams share a single P-unit
- 16 PEMs + 128 DMMs, 64 bits per DMM word
- Packet-switching network
- Instruction-stream creation is under program control
- 50 instruction streams
- Programmability: SISAL, Fortran

Tera MTA (1990)
- A shared-memory LIW multiprocessor
- 128 fine-grained threads, each with 32 registers, tolerate functional-unit, synchronization, and memory latency
- Explicit-dependence lookahead increases single-thread concurrency
- Synchronization uses full/empty bits
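The full/empty-bit synchronization mentioned above can be modeled in software: each memory word carries a state bit, a synchronized read blocks until the word is "full" and leaves it "empty", and a synchronized write blocks until "empty" and leaves it "full". The sketch below is a software analogy using a Python condition variable, not the MTA hardware; the class name FullEmptyWord is invented for illustration.

```python
import threading

class FullEmptyWord:
    """One memory word with a full/empty state bit (software model)."""
    def __init__(self):
        self._value = None
        self._full = False                 # the full/empty bit
        self._cond = threading.Condition()

    def write(self, value):
        with self._cond:
            while self._full:              # block until the word is empty
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._full:          # block until the word is full
                self._cond.wait()
            self._full = False             # reading empties the word
            self._cond.notify_all()
            return self._value

word = FullEmptyWord()
results = []
consumer = threading.Thread(target=lambda: results.append(word.read()))
consumer.start()        # read blocks: the word starts empty
word.write(42)          # producer fills the word; the reader proceeds
consumer.join()
print(results[0])       # 42
```

Because the synchronization lives on the data word itself, producer/consumer patterns need no separate locks or barriers, which is what lets a machine like the MTA use it for very fine-grained synchronization.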

CM-5: Scalable Massively Parallel Supercomputer for the 1990s
- 10^12 floating-point operations per second (teraflops)
- 64,000 powerful RISC microprocessors working together
- Scalable: performance grows transparently
- Universal: supports a vast variety of application domains
- Highly reliable: sustained performance for large jobs requiring weeks or months to run

Future Trends of MIMD Computers
- Program execution models: beyond the SPMD model
- Hybrid architectures: provide both shared memory and message passing
- Efficient mechanisms for latency AND bandwidth management (the "memory wall" problem)

Shared-Memory Architecture Examples (2000 - now)
- Sun's Wildfire architecture (Hennessy & Patterson, Section 6.11, p. 622)
- Intel Xeon multithreaded architecture
- SGI Onyx 3000
- IBM p690
- Others

Sun Fire 15K
[Figure: expander boards carrying processors and shared memory, plus I/O boards, joined by a crossbar]
- 4 CPUs per board: 900 MHz UltraSPARC with 32 KB I-cache and 64 KB D-cache
- 32 GB of memory per board
- Crossbar switch: 43 GB/s bandwidth

Intel Xeon MP-based server
[Figure: Xeon processors connected through a memory controller hub to memory, I/O, and a PCI-X bridge]
- 1.8 GHz Xeon with 512 KB L2 cache
- 4 processors share a common bus of 6.4 GB/s bandwidth
- Memory shares a common bus of 4.3 GB/s bandwidth
- Memory is accessed through a memory controller hub

IBM p690
[Figure: two 1 GHz CPU cores with I- and D-caches, a shared L2 cache, an L3 controller and distributed switch, L3 cache, processor local bus, I/O bus, and memory]
- Each POWER4 chip has two 1 GHz processor cores, a shared 1.5 MB L2 cache, directly accessed 32 MB per-chip L3 cache, and chip-to-chip communication logic
- Each SMP building block has 4 POWER4 chips
- The base p690 has up to 4 SMP building blocks

SGI Onyx 3800
[Figure: C-Bricks (processors, caches, and shared memory) connected through an R-Brick router]
- Each node is a C-Brick with 2-4 processors at 600 MHz
- The R-Brick is an 8x8 crossbar switch with 3.2 GB/s bandwidth: 4 ports for C-Bricks, 4 for other R-Bricks
- Each C-Brick has up to 8 GB of local memory that all processors can access via the NUMAlink interconnect

Recent High-End MIMD Parallel Architecture Projects
- ASCI projects (USA): ASCI Blue, ASCI Red, ASCI Blue Mountain
- HTMT project (USA)
- The Earth Simulator (Japan)
- HPCS architectures (USA)