CSE 431 Computer Architecture, Fall 2008
Chapter 7C: Multiprocessor Network Topologies
Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]

7C.2 – Review: Shared Memory Multiprocessors (SMP)
- Q1 – Single address space shared by all processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow only one processor at a time to access the data (see the small sketch after this slide)
- They come in two styles:
  - Uniform memory access (UMA) multiprocessors
  - Nonuniform memory access (NUMA) multiprocessors
[Figure: processors, each with a cache, connected through an interconnection network to memory and I/O]
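The slide mentions locks only in passing; here is a minimal, illustrative sketch of coordinating shared data through a synchronization primitive. Python threads stand in for processors sharing one address space; nothing below is from the original slides.

```python
import threading

counter = 0                      # shared variable in the single address space
lock = threading.Lock()          # synchronization primitive guarding it

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:               # only one thread may update the shared data at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 40000; without the lock, updates could be lost
```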

7C.3 – Message Passing Multiprocessors (MPP)
- Each processor has its own private address space
- Q1 – Processors share data by explicitly sending and receiving information (message passing)
- Q2 – Coordination is built into the message passing primitives (message send and message receive)
[Figure: processors, each with a cache and local memory, connected by an interconnection network]

7C.4 – Communication in Network Connected Multi's
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process (thread) synchronization primitives (like ll and sc)
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is needed, rather than to send the data in case it might be used
- Explicit communication via sends and receives (see the sketch after this slide)
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
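A minimal sketch of the explicit send/receive style, using Python's multiprocessing queues purely as a stand-in for message send and message receive over an interconnection network (the names worker and rank are illustrative, not from the slides):

```python
from multiprocessing import Process, Queue

def worker(rank, data, q):
    # each "processor" computes on its private data, then explicitly sends its result
    q.put((rank, sum(data)))

if __name__ == "__main__":
    q = Queue()
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(4)]
    procs = [Process(target=worker, args=(r, c, q)) for r, c in enumerate(chunks)]
    for p in procs:
        p.start()
    total = sum(q.get()[1] for _ in procs)   # explicit receives; coordination is built in
    for p in procs:
        p.join()
    print(total)                             # 7998000, the sum of 0..3999
```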

7C.5 – IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch needed to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of the link wires (on chip)
- Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) – closer to the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum number of messages transmitted per unit time
  - worst-case number of routing hops, congestion control and delay, fault tolerance, power efficiency

7C.6 – Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
[Figure legend: processor node; bidirectional network switch]

7C.7 – Ring IN
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2

7C.8 – Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2

7C.9 – Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

7C.10 – Hypercube (Binary N-cube) Connected IN
- N processors, N switches, log2(N) links/switch, (N*log2(N))/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*log2(N))/2
  - BB = link bandwidth * N/2
[Figure: a 2-cube and a 3-cube]
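The slides do not give a routing algorithm, but a common choice for hypercubes is dimension-order (e-cube) routing: correct one differing address bit per hop, so the worst-case hop count is log2(N). A minimal sketch (the function name ecube_route is our own):

```python
def ecube_route(src: int, dst: int, dims: int):
    """Dimension-order (e-cube) routing in a binary n-cube.

    Returns the sequence of node labels visited while correcting the
    differing address bits from the lowest dimension to the highest.
    """
    path = [src]
    node = src
    for d in range(dims):              # visit dimensions in a fixed order
        if (node ^ dst) & (1 << d):    # bit d still differs from the destination
            node ^= (1 << d)           # traverse the link in dimension d
            path.append(node)
    return path

# Example: in a 3-cube, node 000 reaches node 110 in 2 hops: 000 -> 010 -> 110
print(ecube_route(0b000, 0b110, 3))    # [0, 2, 6]
```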

7C.11 – 2D and 3D Mesh/Torus Connected IN
- N processors, N switches; 2, 3, or 4 (2D torus) or 6 (3D torus) links/switch; 4N/2 links or 6N/2 links
- N simultaneous transfers
  - NB = link bandwidth * 4N (2D torus) or link bandwidth * 6N (3D torus)
  - BB = link bandwidth * 2*N^(1/2) (2D torus) or link bandwidth * 2*N^(2/3) (3D torus)

7C.12 – IN Comparison
- For a 64 processor system:

                            Bus   Ring   2D Torus   6-cube   Fully connected
  Network bandwidth          1
  Bisection bandwidth        1
  Total # of switches        1
  Links per switch
  Total # of links (bidi)    1

7C.13 – IN Comparison
- For a 64 processor system:

                            Bus   Ring   2D Torus   6-cube   Fully connected
  Network bandwidth          1     64     256        192      2016
  Bisection bandwidth        1     2      16         32       1024
  Total # of switches        1     64     64         64       64
  Links per switch                 2+1    4+1        6+1      63+1
  Total # of links (bidi)    1     128    192        256      2080
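The filled-in entries follow mechanically from the per-topology formulas on the preceding slides. A small sketch that reproduces the table (the helper name topology_metrics and the layout are our own):

```python
import math

def topology_metrics(n):
    """(NB, BB, switches, links per switch, total bidirectional links) for an
    n-processor system, in units of one link's bandwidth, using the formulas
    from the preceding slides."""
    log_n = int(math.log2(n))
    root_n = math.isqrt(n)
    return {
        "Bus":             (1,                1,             1, None,        1),
        "Ring":            (n,                2,             n, 2 + 1,       n + n),
        "2D Torus":        (4 * n,            2 * root_n,    n, 4 + 1,       4 * n // 2 + n),
        "6-cube":          (n * log_n // 2,   n // 2,        n, log_n + 1,   n * log_n // 2 + n),
        "Fully connected": (n * (n - 1) // 2, (n // 2) ** 2, n, (n - 1) + 1, n * (n - 1) // 2 + n),
    }

for name, (nb, bb, sw, lps, links) in topology_metrics(64).items():
    print(f"{name:16s} NB={nb:5d} BB={bb:5d} switches={sw:3d} "
          f"links/switch={lps} total links={links}")
```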

7C.14 – "Fat" Trees
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times
- The solution is to "thicken" the upper links.
  - Having more links as you work towards the root of the tree increases the bisection bandwidth
- Rather than design a bunch of N-port switches, use pairs of switches
[Figure: leaf nodes A, B, C, D connected by a binary tree of switches]

7C.15 – Fat Tree IN
- N processors, log(N-1) * logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
  - NB = link bandwidth * N*logN
  - BB = link bandwidth * 4

7C.16 – SGI NUMAlink Fat Tree
[Figure: the SGI NUMAlink fat tree interconnect]

7C.17 – Cache Coherency in IN Connected SMPs
- For performance reasons we want to allow the shared data to be stored in caches
- Once again we have multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols (see the sketch after this slide)
  - keep a directory that is a repository for the state of every block in main memory (it records which caches have copies, whether the block is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
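The slide describes the directory only at a high level; here is a minimal, simplified sketch of what a directory entry might track and how a write miss could be handled (the names DirectoryEntry and handle_write_miss are illustrative, not the protocol of any specific machine):

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """Directory state for one memory block: which caches hold a copy and
    whether a single owner holds it dirty (modified)."""
    sharers: set = field(default_factory=set)  # processor ids with a cached copy
    dirty: bool = False                        # True when one sharer owns a modified copy

def handle_write_miss(entry: DirectoryEntry, requester: int, send):
    """On a write miss, invalidate all other copies, then grant exclusive ownership."""
    for p in entry.sharers - {requester}:
        send(p, "INVALIDATE")                  # explicit command over the interconnection network
    entry.sharers = {requester}
    entry.dirty = True

# Example: a block cached read-only by processors 0 and 2; processor 1 writes it.
entry = DirectoryEntry(sharers={0, 2})
handle_write_miss(entry, requester=1, send=lambda p, msg: print(f"to P{p}: {msg}"))
print(entry)   # DirectoryEntry(sharers={1}, dirty=True)
```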

7C.18 – Network Connected Multiprocessors

  Machine          Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
  SGI Origin       R                                        fat tree                      800
  Cray 3TE         Alpha            MHz          2,048      3D torus                      600
  Intel ASCI Red   Intel            333 MHz      9,632      mesh                          800
  IBM ASCI White   Power3           375 MHz      8,192      multistage Omega              500
  NEC ES           SX-5             500 MHz      640*8      640-xbar                      16000
  NASA Columbia    Intel Itanium2   1.5 GHz      512*20     fat tree, Infiniband
  IBM BG/L         Power PC         GHz          65,536*2   3D torus, fat tree, barrier

7C.19 – IBM BlueGene

                  512-node proto           BlueGene/L
  Peak Perf       1.0 / 2.0 TFlops/s       180 / 360 TFlops/s
  Memory Size     128 GByte                16 / 32 TByte
  Foot Print      9 sq feet                2500 sq feet
  Total Power     9 KW                     1.5 MW
  # Processors    512 dual proc            65,536 dual proc
  Networks        3D Torus, Tree, Barrier
  Torus BW        3 B/cycle

7C.20 – A BlueGene/L Chip
[Block diagram; the labeled components are:]
- two PowerPC 440 CPUs (700 MHz), each with a double FPU and 32K/32K L1 caches
- 2KB L2 per CPU, a 16KB multiport SRAM buffer, and a 4MB L3 of ECC eDRAM (128B line, 8-way associative)
- link interfaces: 3D torus (6 in, 6 out, 1.6 GHz, 1.4 Gb/s per link), fat tree (3 in, 3 out, 350 MHz, 2.8 Gb/s per link), 4 global barriers, and Gbit Ethernet
- DDR memory controller: 144b DDR, 256 MB, 5.5 GB/s

7C.21 – Multiprocessor Benchmarks

  Benchmark                  Scaling?        Reprogram?          Description
  Linpack                    Weak            Yes                 Dense matrix linear algebra
  SPECrate                   Weak            No                  Independent job parallelism
  SPLASH 2                   Strong          No                  Both kernels and applications, many from high-performance computing
  NAS Parallel               Weak            Yes (C or Fortran)  Five kernels, mostly from computational fluid dynamics
  PARSEC                     Weak            No                  Multithreaded programs using Pthreads and OpenMP; 9 applications and 3 kernels (8 with data parallelism, 3 with pipelined parallelism, 1 unstructured)
  Berkeley Design Patterns   Strong or Weak  Yes                 13 design patterns implemented by frameworks or kernels

7C.22 – Supercomputer Style Migration (Top500)
- Uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%. Now it's 98% Clusters and MPPs. (Nov. data)
- Cluster – whole computers interconnected using their I/O bus
- Constellation – a cluster that uses an SMP multiprocessor as the building block

7C.23 – Reminders
- HW6 due December 11th
- Check the grade posting on-line (by your midterm exam number) for correctness
- Second evening midterm exam scheduled:
  - Tuesday, November 18, 20:15 to 22:15, Location 262 Willard
  - Please let me know ASAP (via email) if you have a conflict