Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 2: Parallel Architectures

Outline
- Interconnection networks
- Processor arrays
- Multiprocessors
- Multicomputers
- Flynn's taxonomy

Interconnection Networks
- Uses of interconnection networks
  - Connect processors to shared memory
  - Connect processors to each other
- Interconnection media types
  - Shared medium
  - Switched medium

Shared versus Switched Media (figure)

Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor "listens" to every message
- Arbitration is decentralized
- Collisions require resending of messages
- Ethernet is an example

Switched Medium
- Supports point-to-point messages between pairs of processors
- Each processor has its own path to the switch
- Advantages over shared media
  - Allows multiple messages to be sent simultaneously
  - Allows the network to scale to accommodate more processors

Switch Network Topologies
- View a switched network as a graph
  - Vertices = processors or switches
  - Edges = communication paths
- Two kinds of topologies
  - Direct
  - Indirect

Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to
  - 1 processor node
  - At least 1 other switch node

Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect other switches

Evaluating Switch Topologies
- Diameter
- Bisection width
- Number of edges per node
- Constant edge length? (yes/no)

2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice
- Communication allowed only between neighboring switches
- Variants allow wraparound connections between switches on the edge of the mesh

2-D Meshes (figure)

Evaluating 2-D Meshes
- Diameter: Θ(n^(1/2))
- Bisection width: Θ(n^(1/2))
- Number of edges per switch: 4
- Constant edge length? Yes
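As a quick check (example numbers not from the slides): a 32 × 32 mesh has n = 1024 switches; the longest route runs corner to corner in 2 × (32 - 1) = 62 hops, and cutting the mesh in half severs 32 links, both consistent with the Θ(n^(1/2)) figures above.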

Binary Tree Network
- Indirect topology
- n = 2^d processor nodes, n - 1 switches

Evaluating Binary Tree Network
- Diameter: 2 log n
- Bisection width: 1
- Edges per node: 3
- Constant edge length? No

Hypertree Network
- Indirect topology
- Shares the low diameter of a binary tree
- Greatly improves the bisection width
- From the "front" looks like a k-ary tree of height d
- From the "side" looks like an upside-down binary tree of height d

Hypertree Network (figure)

Evaluating 4-ary Hypertree
- Diameter: log n
- Bisection width: n / 2
- Edges per node: 6
- Constant edge length? No

Butterfly Network
- Indirect topology
- n = 2^d processor nodes connected by n(log n + 1) switching nodes
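For example (numbers not on the slide): with d = 3 there are n = 8 processor nodes and 8 × (log 8 + 1) = 32 switching nodes, arranged in 4 ranks of 8 switches.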

Butterfly Network Routing (figure)

Evaluating Butterfly Network
- Diameter: log n
- Bisection width: n / 2
- Edges per node: 4
- Constant edge length? No

Hypercube
- Direct topology
- 2 × 2 × … × 2 mesh
- Number of nodes is a power of 2
- Node addresses 0, 1, …, 2^k - 1
- Node i connected to the k nodes whose addresses differ from i in exactly one bit position
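The neighbor rule is a one-line bit manipulation. A small C sketch (illustrative, not from the book; hypercube_neighbors is a made-up helper) that enumerates a node's neighbors by flipping one address bit at a time:

    #include <stdio.h>

    /* Print the k neighbors of node i in a k-dimensional hypercube.
       Each neighbor's address differs from i in exactly one bit. */
    void hypercube_neighbors(unsigned int i, int k)
    {
        for (int b = 0; b < k; b++)
            printf("node %u <-> node %u\n", i, i ^ (1u << b));
    }

    int main(void)
    {
        hypercube_neighbors(5, 3);   /* 101 <-> 100, 111, 001 */
        return 0;
    }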

Hypercube Addressing (figure)

Hypercubes Illustrated (figure)

Evaluating Hypercube Network
- Diameter: log n
- Bisection width: n / 2
- Edges per node: log n
- Constant edge length? No

Shuffle-exchange
- Direct topology
- Number of nodes is a power of 2
- Nodes have addresses 0, 1, …, 2^k - 1
- Two outgoing links from node i
  - Shuffle link to node LeftCycle(i)
  - Exchange link to node xor(i, 1)
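Both links reduce to bit manipulations. A small C sketch (illustrative, not from the book), where LeftCycle is taken to be a one-position left rotation of the k-bit address:

    #include <stdio.h>

    /* Shuffle link: rotate the k-bit address of i left by one bit. */
    unsigned int shuffle(unsigned int i, int k)
    {
        unsigned int msb = (i >> (k - 1)) & 1u;       /* bit rotated out */
        return ((i << 1) | msb) & ((1u << k) - 1u);   /* rotated back in */
    }

    /* Exchange link: flip the least significant address bit. */
    unsigned int exchange(unsigned int i)
    {
        return i ^ 1u;
    }

    int main(void)
    {
        int k = 3;                                    /* 8-node network */
        for (unsigned int i = 0; i < (1u << k); i++)
            printf("node %u: shuffle -> %u, exchange -> %u\n",
                   i, shuffle(i, k), exchange(i));
        return 0;
    }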

Shuffle-exchange Illustrated (figure)

Shuffle-exchange Addressing (figure)

Evaluating Shuffle-exchange
- Diameter: 2 log n - 1
- Bisection width: ≈ n / log n
- Edges per node: 2
- Constant edge length? No

Comparing Networks
- All have logarithmic diameter except the 2-D mesh
- Hypertree, butterfly, and hypercube have bisection width n / 2
- All have a constant number of edges per node except the hypercube
- Only the 2-D mesh keeps edge lengths constant as network size increases
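Gathering the figures from the preceding slides into one table:

    Network            Diameter      Bisection width   Edges/node   Constant edge length?
    2-D mesh           Θ(n^(1/2))    Θ(n^(1/2))        4            Yes
    Binary tree        2 log n       1                 3            No
    4-ary hypertree    log n         n / 2             6            No
    Butterfly          log n         n / 2             4            No
    Hypercube          log n         n / 2             log n        No
    Shuffle-exchange   2 log n - 1   ≈ n / log n       2            No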

Vector Computers
- Vector computer: instruction set includes operations on vectors as well as scalars
- Two ways to implement vector computers
  - Pipelined vector processor: streams data through pipelined arithmetic units
  - Processor array: many identical, synchronized arithmetic processing elements

Why Processor Arrays?
- Historically, the high cost of a control unit
- Scientific applications have data parallelism

Processor Array (figure)

Data/instruction Storage
- Front-end computer
  - Program
  - Data manipulated sequentially
- Processor array
  - Data manipulated in parallel

Processor Array Performance
- Performance: work done per time unit
- Performance of a processor array depends on
  - Speed of the processing elements
  - Utilization of the processing elements

Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 μsec
- What is the performance when adding two 1024-element vectors (one element per processor)?
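A worked answer (not spelled out on the slide): all 1024 additions proceed in parallel and finish in 1 μsec, so performance = 1024 operations / 10^-6 sec ≈ 1.02 × 10^9 operations/sec, with every processing element fully utilized.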

Performance Example 2
- 512 processors
- Each adds two integers in 1 μsec
- What is the performance when adding two vectors of length 600?
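A worked answer (not spelled out on the slide): the first pass adds 512 element pairs in 1 μsec; a second pass adds the remaining 88 pairs in another 1 μsec while 424 processors sit idle. Performance = 600 operations / (2 × 10^-6 sec) = 3 × 10^8 operations/sec, only about 59% of peak, illustrating the utilization factor above.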

2-D Processor Interconnection Network (figure)
- Each VLSI chip has 16 processing elements

if (COND) then A else B (figures)
- Three slides step through masked execution of a conditional on a processor array: first every processing element evaluates COND, then the elements where COND holds execute A while the rest sit idle, and finally the remaining elements execute B; the sketch below illustrates the idea
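An illustrative C sketch (hypothetical, not a real SIMD API) of what the array effectively does. Because all processing elements share one instruction stream, the two branches execute in sequence, each under a mask, which is why speed drops on conditional code:

    #include <stdio.h>
    #define NUM_PES 8

    int main(void)
    {
        int data[NUM_PES] = {3, -1, 4, -1, 5, -9, 2, -6};
        int mask[NUM_PES];

        for (int pe = 0; pe < NUM_PES; pe++)   /* all PEs, in lockstep */
            mask[pe] = (data[pe] > 0);         /* evaluate COND */

        for (int pe = 0; pe < NUM_PES; pe++)
            if (mask[pe]) data[pe] *= 2;       /* branch A; other PEs idle */

        for (int pe = 0; pe < NUM_PES; pe++)
            if (!mask[pe]) data[pe] = 0;       /* branch B; other PEs idle */

        for (int pe = 0; pe < NUM_PES; pe++)
            printf("%d ", data[pe]);
        printf("\n");
        return 0;
    }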

Processor Array Shortcomings
- Not all problems are data-parallel
- Speed drops for conditionally executed code
- Don't adapt well to multiple users
- Don't scale down well to "starter" systems
- Rely on custom VLSI for processors
- The expense of control units has dropped

Multiprocessors
- Multiprocessor: multiple-CPU computer with a shared memory
- The same address on two different CPUs refers to the same memory location
- Avoids three problems of processor arrays
  - Can be built from commodity CPUs
  - Naturally supports multiple users
  - Maintains efficiency in conditional code

Centralized Multiprocessor
- Straightforward extension of the uniprocessor
- Add CPUs to the bus
- All processors share the same primary memory
- Memory access time is the same for all CPUs
  - Uniform memory access (UMA) multiprocessor
  - Symmetrical multiprocessor (SMP)

Centralized Multiprocessor (figure)

Private and Shared Data
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a multiprocessor, processors communicate via shared data values

Problems Associated with Shared Data
- Cache coherence
  - Replicating data across multiple caches reduces contention
  - But how do we ensure that different processors see the same value for the same address?
- Synchronization
  - Mutual exclusion
  - Barrier
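Since the book pairs these architectures with OpenMP, here is a minimal OpenMP sketch of the two synchronization mechanisms just named, mutual exclusion and barrier (compile with a flag such as -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0;                         /* shared data */

        #pragma omp parallel
        {
            int id = omp_get_thread_num();   /* private data */

            #pragma omp critical             /* mutual exclusion */
            sum += id;

            #pragma omp barrier              /* all threads wait here */

            #pragma omp single
            printf("sum = %d\n", sum);
        }
        return 0;
    }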

Cache-coherence Problem (figures)
- Four slides step through the problem for one variable X:
  - Initially, memory holds X = 7 and neither cache holds X
  - CPU A reads X; its cache now holds X = 7
  - CPU B reads X; both caches hold X = 7
  - CPU B writes 2 to X; memory and B's cache hold X = 2, but A's cache still holds the stale value 7

Write Invalidate Protocol (figures)
- Four slides step through the protocol for the same variable X (a cache control monitor snoops the shared bus):
  - Both caches hold X = 7
  - CPU B broadcasts its intent to write X
  - CPU A's copy of X is invalidated
  - CPU B writes X = 2 and now holds the only valid cached copy

Distributed Multiprocessor
- Distribute primary memory among the processors
- Increases aggregate memory bandwidth and lowers average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor

Distributed Multiprocessor (figure)

Cache Coherence
- Some NUMA multiprocessors do not support it in hardware
  - Only instructions and private data are cached
  - Large memory access time variance
- Implementation is more difficult
  - No shared memory bus to "snoop"
  - Directory-based protocol needed

Directory-based Protocol
- A distributed directory contains information about cacheable memory blocks
- One directory entry for each cache block
- Each entry has
  - Sharing status
  - Which processors have copies

Sharing Status
- Uncached
  - Block not in any processor's cache
- Shared
  - Cached by one or more processors
  - Read only
- Exclusive
  - Cached by exactly one processor
  - That processor has written the block
  - Copy in memory is obsolete
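A directory entry is small. A minimal C sketch of one entry (NPROC and the layout are illustrative assumptions, not the book's code); the walkthrough that follows shows exactly this (status; bit vector) pair evolving:

    #include <stdio.h>

    #define NPROC 3

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } sharing_status;

    typedef struct {
        sharing_status status;   /* U, S, or E */
        int sharers[NPROC];      /* bit vector: 1 if that CPU holds a copy */
    } directory_entry;

    int main(void)
    {
        /* e.g., after CPU 0 and CPU 2 have read block X: */
        directory_entry x = { SHARED, {1, 0, 1} };
        printf("status=%d sharers=%d %d %d\n",
               x.status, x.sharers[0], x.sharers[1], x.sharers[2]);
        return 0;
    }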

Directory-based Protocol Example (figures)
- The slides show three nodes, each with a CPU, cache, local memory, and directory slice, joined by an interconnection network, and step through accesses to a block X (directory entry written as status; bit vector for CPUs 0, 1, 2):
  - Initially: memory holds X = 7; the entry is (U; 0 0 0)
  - CPU 0 reads X: a read miss goes to the directory; the entry becomes (S; 1 0 0) and CPU 0 caches X = 7
  - CPU 2 reads X: read miss; the entry becomes (S; 1 0 1) and CPU 2 also caches X = 7
  - CPU 0 writes 6 to X: write miss; the directory invalidates CPU 2's copy; the entry becomes (E; 1 0 0) and CPU 0's cache holds X = 6
  - CPU 1 reads X: read miss; the directory tells CPU 0 to switch to shared and memory is updated to 6; the entry becomes (S; 1 1 0) and CPU 1 caches X = 6
  - CPU 2 writes 5 to X: write miss; the other copies are invalidated; the entry becomes (E; 0 0 1) and CPU 2's cache holds X = 5
  - CPU 0 writes 4 to X: write miss; the directory takes the block away from CPU 2, whose dirty value 5 is written back to memory; the entry becomes (E; 1 0 0) and CPU 0's cache holds X = 4
  - CPU 0 writes back the X block: the data write back puts 4 in memory and the entry returns to (U; 0 0 0)

Multicomputer
- Distributed-memory multiple-CPU computer
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
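Since the book's programming model for multicomputers is MPI, here is a minimal message-passing sketch: the same variable name x lives in two different physical memories, and the processes interact only by sending a message (run with, e.g., mpirun -np 2):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, x;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;                       /* exists only in process 0's memory */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }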

Asymmetrical Multicomputer (figure)

Asymmetrical MC Advantages
- Back-end processors dedicated to parallel computations
  - Easier to understand, model, and tune performance
- Only a simple back-end operating system needed
  - Easy for a vendor to create

Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits the scalability of the system
- Primitive operating system on the back-end processors makes debugging difficult
- Every application requires development of both a front-end and a back-end program

Symmetrical Multicomputer (figure)

Symmetrical MC Advantages
- Alleviates the performance bottleneck caused by a single front-end computer
- Better support for debugging
- Every processor executes the same program

Symmetrical MC Disadvantages
- More difficult to maintain the illusion of a single "parallel computer"
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes run on each processor

ParPar Cluster, a Mixed Model (figure)

Commodity Cluster
- Co-located computers
- Dedicated to running parallel jobs
- No keyboards or displays
- Identical operating system
- Identical local disk images
- Administered as an entity

Network of Workstations
- Dispersed computers
- First priority: person at the keyboard
- Parallel jobs run in the background
- Different operating systems
- Different local images
- Checkpointing and restarting are important

Flynn's Taxonomy
- Instruction stream
- Data stream
- Single vs. multiple
- Four combinations
  - SISD
  - SIMD
  - MISD
  - MIMD

SISD
- Single Instruction, Single Data
- Single-CPU systems
- Note: co-processors don't count
  - Functional
  - I/O
- Example: PCs

SIMD
- Single Instruction, Multiple Data
- Two architectures fit this category
  - Pipelined vector processor (e.g., Cray-1)
  - Processor array (e.g., Connection Machine)

MISD
- Multiple Instruction, Single Data
- Example: systolic array

MIMD
- Multiple Instruction, Multiple Data
- Multiple-CPU computers
  - Multiprocessors
  - Multicomputers

Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed-memory architectures (multiprocessors or multicomputers)