CSE 431 Computer Architecture, Fall 2005, Lecture 27: Network Connected Multiprocessors

CSE 431 Computer Architecture, Fall 2005, Lecture 27: Network Connected Multi's
Mary Jane Irwin ( www.cse.psu.edu/~mji ), www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Review: Bus Connected SMPs (UMAs)
[Figure: four processors, each with its own cache, sharing a single bus to memory and I/O]
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware for cache coherence and process synchronization
- Bus traffic and bandwidth limit scalability (to roughly 36 processors)
- Three desirable bus characteristics are incompatible: high bandwidth, low latency, and long length

Review: Multiprocessor Basics
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

Communication model                 # of Proc
  Message passing                   8 to 2048
  Shared address – NUMA             8 to 256
  Shared address – UMA              2 to 64
Physical connection
  Network
  Bus                               2 to 36

Network Connected Multiprocessors
[Figure: nodes, each with a processor, cache, and memory, connected by an interconnection network (IN)]
- Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with message passing communication via sends and receives
- The interconnection network supports interprocessor communication
- Terminology:
  - Shared memory – single address space
  - Distributed memory – physical memory divided into modules, with one module placed near each processor
  - Cache coherent NUMA (ccNUMA) – a nonuniform memory access multiprocessor that maintains coherence for all caches

Summing 100,000 Numbers on 100 Processors
Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + Al[i];        /* sum local array subset */

The processors then coordinate in adding together the partial sums (Pn is the processor's number, send(x,y) sends value y to processor x, and receive() receives a value):

    half = 100;
    limit = 100;
    repeat
      half = (half+1)/2;        /* dividing line */
      if (Pn >= half && Pn < limit) send(Pn-half, sum);
      if (Pn < (limit/2)) sum = sum + receive();
      limit = half;
    until (half == 1);          /* final sum in P0's sum */

Divide and conquer summing – half of the processors add pairs of partial sums, then a quarter add pairs of the new partial sums, and so on. The code divides all processors into either senders (the second half) or receivers (the first half), and each receiving processor gets only one message per iteration, so we assume a receiving processor stalls until its message arrives. Thus send and receive are also used as primitives for synchronization (as well as communication). If there is an odd number of nodes, the middle node doesn't participate in send/receive; the limit is then set so that this node is the highest node in the next iteration.
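The pseudocode above can be checked with a minimal, single-process C simulation: each "processor" is just a slot in a sum[] array, and send/receive are modeled as a direct addition into the receiver's slot. This is a sketch for illustration (the all-ones sample data and variable names are assumptions), not the lecture's actual message-passing code.

    /* Single-process simulation of the divide-and-conquer sum on P "processors". */
    #include <stdio.h>

    #define P 100                 /* number of simulated processors */
    #define N 100000              /* total number of elements       */

    int main(void) {
        static double A[N];
        double sum[P];
        int chunk = N / P;

        for (int i = 0; i < N; i++) A[i] = 1.0;     /* sample data: all ones */

        /* Phase 1: each processor sums its 1000-element local subset. */
        for (int pn = 0; pn < P; pn++) {
            sum[pn] = 0.0;
            for (int i = 0; i < chunk; i++)
                sum[pn] += A[pn * chunk + i];
        }

        /* Phase 2: tree reduction; the upper half "sends" to the lower half. */
        int half = P, limit = P;
        do {
            half = (half + 1) / 2;                  /* dividing line           */
            for (int pn = half; pn < limit; pn++)   /* senders: second half    */
                sum[pn - half] += sum[pn];          /* receivers: first half   */
            limit = half;
        } while (half > 1);                         /* final sum ends up in P0 */

        printf("total = %.0f (expected %d)\n", sum[0], N);
        return 0;
    }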

An Example with 10 Processors
[Figure: the partial sums in P0 through P9 with half = 10; left blank for the class handout]
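Tracing the pseudocode for 10 processors (half = limit = 10 initially) gives the following rounds; this trace is derived from the code on the previous slide, since the slide itself is left blank for the handout:
- half = (10+1)/2 = 5: P5–P9 send their sums to P0–P4; limit = 5
- half = (5+1)/2 = 3: P3 and P4 send to P0 and P1; P2 keeps its partial sum (odd node count); limit = 3
- half = (3+1)/2 = 2: P2 sends to P0; P1 keeps its partial sum; limit = 2
- half = (2+1)/2 = 1: P1 sends to P0; half == 1, so the loop ends with the final sum in P0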

Communication in Network Connected Multi's
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is demanded rather than to send it in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
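For contrast with the message-passing sum two slides back, here is a sketch of the same reduction written in the implicit, shared-address-space style, using POSIX threads to stand in for processors (the thread count, data, and names are assumptions for illustration; the lecture itself gives no shared-memory code):

    /* Shared-address-space version: communication is implicit through ordinary
     * loads and stores to shared data, and a mutex supplies the synchronization
     * the slide says must be provided.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define P 4
    #define N 100000

    static double A[N];
    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long pn = (long)arg;
        long chunk = N / P;
        double local = 0.0;
        for (long i = pn * chunk; i < (pn + 1) * chunk; i++)
            local += A[i];             /* implicit communication: plain loads  */
        pthread_mutex_lock(&lock);     /* synchronization primitive            */
        total += local;                /* implicit communication: plain store  */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (long i = 0; i < N; i++) A[i] = 1.0;
        for (long pn = 0; pn < P; pn++)
            pthread_create(&t[pn], NULL, worker, (void *)pn);
        for (long pn = 0; pn < P; pn++)
            pthread_join(t[pn], NULL);
        printf("total = %.0f\n", total);
        return 0;
    }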

Cache Coherency in NUMAs
- For performance reasons we want to allow shared data to be stored in caches
- Once again we have multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
- The Cray T3E has a single address space but is not cache coherent.
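As a concrete picture of what a directory tracks, here is a sketch of one possible per-block directory entry in C; the field names, state set, and 64-node limit are assumptions for illustration, not the protocol of any particular machine:

    #include <stdint.h>

    enum dir_state { UNCACHED, SHARED, EXCLUSIVE };  /* EXCLUSIVE = one dirty copy */

    struct dir_entry {
        enum dir_state state;
        uint64_t sharers;   /* bit i set => node i's cache holds a copy (up to 64 nodes) */
    };

    /* On a write to a shared block, the home node's directory controller would
     * consult the entry and send an invalidate command over the IN to every
     * current sharer other than the writer.
     */
    static int sharers_to_invalidate(const struct dir_entry *e, int writer,
                                     int *nodes, int max_nodes) {
        int n = 0;
        for (int i = 0; i < 64 && n < max_nodes; i++)
            if (((e->sharers >> i) & 1) && i != writer)
                nodes[n++] = i;
        return n;
    }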

IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch needed to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) – represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum # of messages transmitted per unit time
  - # routing hops worst case, congestion control and delay
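The bisection-bandwidth definition can be made concrete with a short sketch: list the network's bidirectional links, split the nodes into two equal halves, and count the links that cross the cut. The ring example and helper names below are assumptions for illustration; note that the true BB is the minimum over all such splits, and for the ring the contiguous split used here already gives it.

    #include <stdio.h>

    struct link { int a, b; };   /* one bidirectional link between nodes a and b */

    /* Count links with one endpoint in the low half and one in the high half. */
    static int links_across_cut(const struct link *links, int nlinks, int n_nodes) {
        int half = n_nodes / 2, crossing = 0;
        for (int i = 0; i < nlinks; i++) {
            int a_low = links[i].a < half, b_low = links[i].b < half;
            if (a_low != b_low) crossing++;
        }
        return crossing;
    }

    int main(void) {
        enum { N = 8 };
        struct link ring[N];
        for (int i = 0; i < N; i++) {            /* ring: node i <-> node (i+1) mod N */
            ring[i].a = i;
            ring[i].b = (i + 1) % N;
        }
        /* With link bandwidth normalized to 1, this prints 2 — matching the
         * "BB = link bandwidth * 2" result on the ring slide. */
        printf("ring BB = %d * link bandwidth\n", links_across_cut(ring, N, N));
        return 0;
    }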

Bus IN
[Figure: legend – bidirectional network switch, processor node; a single bus connecting N processor nodes]
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
- NB = link (bus) bandwidth * 1
- BB = link (bus) bandwidth * 1

Ring IN
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * 2
If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case.

Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
- NB = link bandwidth * (N*(N-1))/2
- BB = link bandwidth * (N/2)^2
An easy way to see the BB: half of the nodes (that is, N/2 of them) each connect to the other N/2 nodes, so there are (N/2)*(N/2) = (N/2)^2 links crossing the bisection.

Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * N/2
The crossbar can support any combination of messages between processors. Note that, unlike the other topologies, the crossbar does not have a 1-to-1 correspondence between switches and processors, so the usual "# of links * link bandwidth" calculation does not apply; there are only N nodes, each with one input and one output, giving a best case of link bandwidth * N.

Hypercube (Binary N-cube) Connected IN
- N processors, N switches, logN links/switch, (NlogN)/2 links
- N simultaneous transfers
- NB = link bandwidth * (NlogN)/2
- BB = link bandwidth * N/2

2D and 3D Mesh/Torus Connected IN
- N processors, N switches; 2, 3, 4 (2D torus) or 6 (3D torus) links/switch; 4N/2 links or 6N/2 links
- N simultaneous transfers
- NB = link bandwidth * 4N or link bandwidth * 6N
- BB = link bandwidth * 2 N^(1/2) or link bandwidth * 2 N^(2/3)

Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
- NB = link bandwidth * NlogN
- BB = link bandwidth * 4
The CM5 fat tree switches had four downward connections and two or four upward connections. (The next slide builds up the fat tree structure starting from an ordinary tree.)

Fat Tree
Trees are good structures, and people in CS use them all the time, so suppose we wanted to make a tree network with leaf nodes A, B, C, and D.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D. The bisection bandwidth of an ordinary tree is terrible: 1 link, at all times.
- The solution is to 'thicken' the upper links: adding more links toward the root increases the bisection bandwidth.
- Rather than designing N-port switches, use pairs of switches.
- Important point: fat trees are excellent at multicast and large-scale message distribution. They work well for one-to-many messages, since the tree can propagate them downward, saving much bandwidth; this is especially helpful for group messages that need the same arrival time (in an unloaded network).

SGI NUMAlink Fat Tree www.embedded-computing.com/articles/woodacre

IN Comparison
For a 64-processor system (entries left blank for the class handout):

                       Bus    Ring    Torus    6-cube    Fully connected
Network bandwidth       1
Bisection bandwidth
Total # of switches
Links per switch
Total # of links
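One way to fill in the comparison is to apply the formulas from the preceding topology slides for N = 64, which the short sketch below does (bandwidths are in units of link bandwidth, the torus column assumes the 2D-torus numbers, and switch link counts exclude the extra processor link); treat it as a checking aid rather than the official handout answers.

    #include <stdio.h>

    int main(void) {
        int n = 64;
        int logn = 6;                       /* log2(64) */

        printf("%-20s %6s %6s %9s %8s %12s\n",
               "", "Bus", "Ring", "2D-Torus", "6-cube", "Fully conn");
        printf("%-20s %6d %6d %9d %8d %12d\n", "Network bandwidth",
               1, n, 4 * n, n * logn / 2, n * (n - 1) / 2);
        printf("%-20s %6d %6d %9d %8d %12d\n", "Bisection bandwidth",
               1, 2, 2 * 8, n / 2, (n / 2) * (n / 2));        /* 2*sqrt(64) = 16 */
        printf("%-20s %6d %6d %9d %8d %12d\n", "Total # of switches",
               1, n, n, n, n);
        printf("%-20s %6d %6d %9d %8d %12d\n", "Links per switch",
               1, 2, 4, logn, n - 1);
        printf("%-20s %6d %6d %9d %8d %12d\n", "Total # of links",
               1, n, 4 * n / 2, n * logn / 2, n * (n - 1) / 2);
        return 0;
    }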

Network Connected Multiprocessors

                   Proc             Proc Speed   # Proc      IN Topology                   BW/link (MB/sec)
SGI Origin         R16000                        128         fat tree                      800
Cray T3E           Alpha 21164      300 MHz      2,048       3D torus                      600
Intel ASCI Red     Intel            333 MHz      9,632       mesh
IBM ASCI White     Power3           375 MHz      8,192       multistage Omega              500
NEC ES             SX-5             500 MHz      640*8       640-xbar                      16,000
NASA Columbia      Intel Itanium2   1.5 GHz      512*20      fat tree, Infiniband
IBM BG/L           PowerPC 440      0.7 GHz      65,536*2    3D torus, fat tree, barrier

ASCI White has 16 processors per node (those are probably mesh connected). The Columbia machine is 20 Infiniband-connected SGI clusters of 512 fat-tree-interconnected processors.

IBM BlueGene

                 512-node prototype        BlueGene/L
Peak Perf        1.0 / 2.0 TFlops/s
Memory Size      128 GByte                 16 / 32 TByte
Foot Print       9 sq feet                 2500 sq feet
Total Power      9 KW                      1.5 MW
# Processors     512 dual proc             65,536 dual proc
Networks         3D Torus, Tree, Barrier
Torus BW         3 B/cycle

Packaging hierarchy:
- Two PowerPC 440 cores per chip – 2 PEs
- 2 chips per compute card – 4 PEs
- 16 compute cards per node card – 64 PEs
- 32 node cards per cabinet – 2,048 PEs
- 64 cabinets per system – 131,072 PEs
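The PE counts in the packaging hierarchy follow from simply multiplying the levels together; a minimal arithmetic check (nothing here beyond the numbers on the slide):

    #include <stdio.h>

    int main(void) {
        int per_chip    = 2;                 /* two PowerPC 440 cores per chip       */
        int per_card    = per_chip * 2;      /* 2 chips per compute card     = 4     */
        int per_node    = per_card * 16;     /* 16 compute cards per node card = 64  */
        int per_cabinet = per_node * 32;     /* 32 node cards per cabinet = 2,048    */
        int per_system  = per_cabinet * 64;  /* 64 cabinets per system = 131,072     */
        printf("%d %d %d %d %d PEs\n",
               per_chip, per_card, per_node, per_cabinet, per_system);
        return 0;
    }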

A BlueGene/L Chip
[Figure: chip block diagram – two 700 MHz PowerPC 440 CPUs, each with 32K/32K L1 caches, a double FPU, and a 2KB L2; a shared 16KB multiport SRAM buffer; a shared 4MB L3 of ECC eDRAM with 128B lines and 8-way associativity; 5.5 GB/s and 11 GB/s internal datapaths; interfaces to Gbit Ethernet, a 144b DDR controller (256MB, 5.5 GB/s), the 3D torus (6 in, 6 out, 1.6 GHz, 1.4 Gb/s per link), the fat tree (3 in, 3 out, 350 MHz, 2.8 Gb/s per link), and 4 global barriers]
- 700 MHz, 0.13 micron copper CMOS
- Three-way superscalar out-of-order execution, 7-stage integer pipeline, dynamic branch prediction, single-cycle 32-b multiplier
- FPU – dual 64-b pipelined FPUs and two sets of 32 registers, 64-b wide
- 11 clock cycle L2 latency, 28 to 40 clock cycle L3 latency, 86 clock cycle main memory latency

Networks of Workstations (NOWs) Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces
- Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications
- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs
About half of the clusters in the Top500 supercomputers contain single-processor workstations and about half contain SMP servers.

Commercial (NOW) Clusters

                  Proc             Proc Speed   # Proc     Network
Dell PowerEdge    P4 Xeon          3.06 GHz     2,500      Myrinet
eServer IBM SP    Power4           1.7 GHz      2,944
VPI BigMac        Apple G5         2.3 GHz      2,200      Mellanox Infiniband
HP ASCI Q         Alpha 21264      1.25 GHz     8,192      Quadrics
LLNL Thunder      Intel Itanium2   1.4 GHz      1,024*4
Barcelona         PowerPC 970      2.2 GHz      4,536

ASCI Q may be an SMP (if so, so is LLNL's Thunder); Quadrics is a fat tree IN topology.

Summary
- Flynn's classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis – UMAs and NUMAs
  - Scalability of bus connected UMAs is limited (< ~36 processors)
  - Network connected NUMAs are more scalable
- Interconnection Networks (INs)
  - fully connected, xbar
  - ring
  - mesh
  - n-cube, fat tree
- Message passing multis
- Cluster connected (NOW) multis

Next Lecture and Reminders
- Reading assignment – PH 9.7
- Reminders
  - HW5 (and last) due Dec 6th (Part 1)
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (tentatively) scheduled Tuesday, December 13th, 2:30-4:20, 22 Deike