Parallel Computer Architectures, Chapter 8


Parallel Computer Architectures Chapter 8

Parallel Computer Architectures (a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.

Parallelism
a) Introduced at various levels
b) Within the CPU chip (multiple instructions per cycle)
   – Instruction level: VLIW (Very Long Instruction Word)
   – Superscalar
   – On-chip multithreading
   – Single-chip multiprocessors
c) Extra CPU boards (coprocessors)
d) Multiprocessor / multicomputer
e) Computer grids
f) Tightly coupled – computationally intimate
g) Loosely coupled – computationally remote

Instruction-Level Parallelism (a) A CPU pipeline. (b) A sequence of VLIW instructions. (c) An instruction stream with bundles marked.

The TriMedia VLIW CPU (1) A typical TriMedia instruction, showing five possible operations.

The TriMedia VLIW CPU (2) The TM3260 functional units, their quantity, latency, and which instruction slots they can use.

The TriMedia VLIW CPU (3) The major groups of TriMedia custom operations.

The TriMedia VLIW CPU (4) (a) An array of 8-bit elements. (b) The transposed array. (c) The original array fetched into four registers. (d) The transposed array in four registers.

Multithreading
a) Fine-grained multithreading
   – Run multiple threads, issuing one instruction from each per cycle
   – Never stalls if there are enough active threads
   – Requires hardware to track which instruction belongs to which thread
b) Coarse-grained multithreading
   – Run a thread until it stalls, then switch (one cycle wasted per switch)
c) Simultaneous multithreading
   – Coarse-grained switching with no wasted cycle
d) Hyperthreading
   – A 5% increase in chip area gives about a 25% performance gain
   – Resource sharing: partitioned, full resource sharing, or threshold sharing
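The trade-off between a) and b) can be sketched with a toy cycle count. This is an illustrative model, not from the slides: each thread is a string of 'I' (useful instruction) and 'S' (stall) cycles; fine-grained issue hides stalls entirely when there are enough ready threads, while coarse-grained switching pays a penalty on every stall.

```python
def fine_grained_cycles(threads):
    """Round-robin, one instruction per cycle: with enough ready threads,
    every cycle issues useful work, so stall cycles are fully hidden."""
    return sum(op == "I" for t in threads for op in t)

def coarse_grained_cycles(threads, switch_penalty=1):
    """Run each thread until it stalls, then switch; the stall itself is
    overlapped with the next thread, but each switch wastes a cycle."""
    cycles = 0
    for t in threads:
        for op in t:
            cycles += 1 if op == "I" else switch_penalty
    return cycles
```

With threads "IISI", "IIII", and "ISII", fine-grained issue needs 10 cycles while coarse-grained needs 12 under this model, showing where the "one cycle wasted" per switch goes.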

On-Chip Multithreading (1) (a) – (c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading. (e) Coarse-grained multithreading.

On-Chip Multithreading (2) Multithreading with a dual-issue superscalar CPU. (a) Fine-grained multithreading. (b) Coarse-grained multithreading. (c) Simultaneous multithreading.

Hyperthreading on the Pentium 4 Resource sharing between two threads (white and gray) in the Pentium 4 NetBurst microarchitecture.

Single-Chip Multiprocessor
a) Two areas of interest: servers and consumer electronics
b) Homogeneous chips
   – Two pipelines, one CPU
   – Two CPUs (same design)
c) Heterogeneous chips
   – CPUs for DVD players or cell phones
   – More software => slower but cheaper
   – Many different cores (essentially libraries)

Sample chip
a) Cores on a chip for a DVD player:
   – Control
   – MPEG video
   – Audio decoder
   – Video encoder
   – Disk controller
   – Cache
b) Cores require an interconnect:
   – IBM CoreConnect
   – AMBA (Advanced Microcontroller Bus Architecture)
   – VCI (Virtual Component Interconnect)
   – OCP-IP (Open Core Protocol)

Homogeneous Multiprocessors on a Chip Single-chip multiprocessors. (a) A dual-pipeline chip. (b) A chip with two cores.

Heterogeneous Multiprocessors on a Chip (1) The logical structure of a simple DVD player: a heterogeneous multiprocessor containing multiple cores for different functions.

Heterogeneous Multiprocessors on a Chip (2) An example of the IBM CoreConnect architecture.

Coprocessors
a) Come in a variety of sizes
   – Separate cabinets (for mainframes)
   – Separate boards
   – Separate chips
b) Primary purpose is to offload work from, and assist, the main processor
c) Different types
   – I/O
   – DMA
   – Floating point
   – Network
   – Graphics
   – Encryption

Introduction to Networking (1) How users are connected to servers on the Internet.

Networks
a) LAN – local area network
b) WAN – wide area network
c) Packet – a chunk of data sent over the network
d) Store-and-forward packet switching – what a router does
e) Internet – a series of WANs linked by routers
f) ISP – Internet Service Provider
g) Firewall – a specialized computer that filters traffic
h) Protocols – sets of formats, exchange sequences, and rules
i) HTTP – HyperText Transfer Protocol
j) TCP – Transmission Control Protocol
k) IP – Internet Protocol

Networks
a) CRC – Cyclic Redundancy Check
b) TCP header – information about the data at the TCP level
c) IP header – routing information: source, destination, hop count
d) Ethernet header – next-hop address, CRC
e) ASIC – Application-Specific Integrated Circuit
f) FPGA – Field-Programmable Gate Array
g) Network processor – a programmable device that handles incoming and outgoing packets at wire speed
h) PPE – Protocol/Programmable/Packet Processing Engine
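The CRC in item a) can be demonstrated with Python's standard zlib.crc32, which computes CRC-32, the same checksum family Ethernet appends to its frames. A sketch of the idea, not the NIC's actual hardware path:

```python
import zlib

frame = b"example Ethernet frame payload"
crc = zlib.crc32(frame) & 0xFFFFFFFF  # sender appends this to the frame

# Receiver recomputes the CRC over the received bytes; a mismatch
# means the frame was corrupted in transit and must be dropped.
assert zlib.crc32(frame) & 0xFFFFFFFF == crc

corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip a single bit
assert zlib.crc32(corrupted) & 0xFFFFFFFF != crc
```

Even a one-bit error changes the CRC, which is exactly the property the Ethernet trailer relies on.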

Introduction to Networking (2) A packet as it appears on the Ethernet.

Introduction to Network Processors A typical network processor board and chip.

Packet Processing
a) Checksum verification
b) Field extraction
c) Packet classification
d) Path selection
e) Destination network determination
f) Route lookup
g) Fragmentation and reassembly
h) Computation (compression/encryption)
i) Header management
j) Queue management
k) Checksum generation
l) Accounting
m) Statistics gathering
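Steps a) and k), checksum verification and generation, can be sketched with the Internet checksum that IP and TCP headers use: a one's-complement sum of 16-bit words. A simplified version (the packet bytes below are just illustrative):

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of big-endian 16-bit words (RFC 1071 style),
    folded and inverted. Odd-length input is padded with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# Generation: compute the checksum and append it to the packet.
packet = b"\x45\x00\x00\x1c\x00\x01\x00\x00\x40\x11"
cksum = internet_checksum(packet)

# Verification: a checksum taken over packet + checksum field yields 0.
assert internet_checksum(packet + cksum.to_bytes(2, "big")) == 0
```

The verification trick in the last line is why routers can check a header with one pass and no comparison against a stored value.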

Improving Performance
a) Performance is the name of the game
b) How to measure it:
   – Packets per second
   – Bytes per second
c) Ways to speed up:
   – Performance is not linear with clock speed
   – Introduce more PPEs
   – Specialized processors
   – More internal buses
   – Widen existing buses
   – Replace SDRAM with SRAM

The Nexperia Media Processor The Nexperia heterogeneous multiprocessor on a chip.

Multiprocessors (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.

Shared-Memory Multiprocessors
a) Multiprocessor – has shared memory
b) SMP (Symmetric Multiprocessor) – every CPU can access any I/O device
c) Multicomputer (distributed memory system) – each computer has its own memory
d) A multiprocessor has one address space
e) A multicomputer has one address space per computer
f) Multicomputers pass messages to communicate
g) Ease of programming vs. ease of construction
h) DSM – distributed shared memory: page faults are used to fetch remote memory on distributed computers
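The two communication models above can be contrasted in miniature with Python threads: a shared counter guarded by a lock (shared-memory style) versus workers that communicate only by sending messages through a queue (message-passing style). A toy illustration, not a real multiprocessor:

```python
import threading, queue

# Shared-memory model: workers update one variable; the programmer
# must supply mutual exclusion (here, a lock).
counter = 0
lock = threading.Lock()

def shared_worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# Message-passing model: nothing is shared; each worker sends its
# result as an explicit message.
mailbox = queue.Queue()

def message_worker(n):
    mailbox.put(n)  # "send" instead of writing shared state

workers = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(2)]
workers += [threading.Thread(target=message_worker, args=(1000,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

received = mailbox.get() + mailbox.get()
assert counter == 2000 and received == 2000  # same result, different models
```

Both styles compute the same total, which mirrors the slide's point: the choice is about ease of programming (shared memory) versus ease of construction (message passing), not about what can be computed.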

Multicomputers (1) (a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. split up among the 16 memories.

Multicomputers (2) Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.

Taxonomy of Parallel Computers (1) Flynn’s taxonomy of parallel computers.

Taxonomy of Parallel Computers (2) A taxonomy of parallel computers.

Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education, Inc. All rights reserved
MIMD categories
a) UMA – Uniform Memory Access
b) NUMA – NonUniform Memory Access
c) COMA – Cache-Only Memory Access
d) Multicomputers are NORMA (NO Remote Memory Access)
   – MPP – Massively Parallel Processor

Consistency Models
a) How hardware and software agree to work with memory
b) Strict consistency – any read of location X returns the most recent value written to location X
c) Sequential consistency – values are returned in the order they were written (true order)
d) Processor consistency – writes by any one CPU are seen in the order they were issued; for every memory word, all CPUs see all writes to it in the same order
e) Weak consistency – no guarantees unless synchronization is used
f) Release consistency – all writes must complete before a critical section can be reentered

Sequential Consistency (a) Two CPUs writing and two CPUs reading a common memory word. (b) - (d) Three possible ways the two writes and four reads might be interleaved in time.
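The interleavings in (b) – (d) can be enumerated mechanically: a sequentially consistent execution is any global order of the operations that preserves each CPU's own program order. A small sketch (operation names are illustrative):

```python
def interleavings(seqs):
    """Yield every global order of the per-CPU operation sequences that
    preserves each CPU's program order; these are exactly the outcomes
    sequential consistency allows."""
    seqs = [s for s in seqs if s]
    if not seqs:
        yield []
        return
    for i, s in enumerate(seqs):
        rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
        for tail in interleavings(rest):
            yield [s[0]] + tail

# Two writers with one write each can interleave in only two ways.
orders = list(interleavings([["W x=1"], ["W x=2"]]))
assert len(orders) == 2
```

With the slide's full setup of two single-write CPUs and two CPUs issuing two reads each, the count grows to 6!/(1!·1!·2!·2!) = 180 legal interleavings, which is why reasoning about consistency models by hand is hard.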

Weak Consistency Weakly consistent memory uses synchronization operations to divide time into sequential epochs.

UMA Symmetric Multiprocessor Architectures Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.

Cache as Cache Can
a) Cache coherence protocol – keeps a writable copy of a memory word in at most one cache (e.g., write through)
b) Snooping cache – monitors the bus for accesses to memory it has cached
c) Choose between an update strategy and an invalidate strategy
d) MESI protocol – named after its states:
   – Invalid, Shared, Exclusive, Modified

Snooping Caches The write through cache coherence protocol. The empty boxes indicate that no action is taken.

The MESI Cache Coherence Protocol The MESI cache coherence protocol.
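The MESI states can be captured as a small transition table. This is a simplified sketch: bus writebacks and the shared-line optimization (which lets a read miss enter Exclusive when no other cache holds the block) are omitted, so a read miss conservatively enters Shared here.

```python
# Simplified MESI transitions: (current state, event) -> next state.
MESI = {
    ("I", "local_read"):   "S",  # read miss; assume another cache may share
    ("I", "local_write"):  "M",  # write miss: fetch exclusively and modify
    ("S", "local_read"):   "S",
    ("S", "local_write"):  "M",  # invalidate the other copies first
    ("E", "local_read"):   "E",
    ("E", "local_write"):  "M",  # silent upgrade: no bus traffic needed
    ("M", "local_read"):   "M",
    ("M", "local_write"):  "M",
    ("E", "remote_read"):  "S",  # another cache now shares the line
    ("M", "remote_read"):  "S",  # supply the dirty data, keep a shared copy
    ("S", "remote_read"):  "S",
    ("S", "remote_write"): "I",  # our copy is now stale
    ("E", "remote_write"): "I",
    ("M", "remote_write"): "I",
}

def mesi_step(state, event):
    """Next state of one cache line after a local or snooped bus event."""
    return MESI.get((state, event), state)  # events on an Invalid line are ignored
```

Following a line through read, write, then a remote read traces the path Invalid -> Shared -> Modified -> Shared, matching the figure.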

UMA Multiprocessors Using Crossbar Switches (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.

UMA Multiprocessors Using Multistage Switching Networks (1) (a) A 2 × 2 switch. (b) A message format.

UMA Multiprocessors Using Multistage Switching Networks (2) An omega switching network.
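Routing in an omega network is self-routing: each of the log2(n) stages examines one bit of the destination address, most significant bit first, and sends the message to the upper output on a 0 and the lower output on a 1. A sketch:

```python
def omega_route(dst, n):
    """Per-stage routing decisions for destination `dst` in an omega
    network connecting n CPUs (n a power of two): bit k of dst, MSB
    first, picks the upper (0) or lower (1) output at stage k."""
    stages = n.bit_length() - 1  # log2(n) switching stages
    return [(dst >> (stages - 1 - k)) & 1 for k in range(stages)]
```

For 8 CPUs (three stages), a message bound for CPU 5 (binary 101) exits lower, upper, lower: omega_route(5, 8) == [1, 0, 1].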

NUMA Multiprocessors A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.

Cache Coherent NUMA Multiprocessors (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.
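Splitting the 32-bit address of (b) into fields is plain bit masking. Assuming an 8-bit node field (256 nodes), an 18-bit block field, and a 6-bit byte offset (2^18 blocks of 64 bytes per node) — the exact widths are an assumption here:

```python
def split_address(addr):
    """Split a 32-bit address into (node, block, offset), assuming an
    8-bit node field, 18-bit block field, and 6-bit byte offset."""
    node   = (addr >> 24) & 0xFF      # which node's memory holds the word
    block  = (addr >> 6)  & 0x3FFFF   # 64-byte cache line within that node
    offset = addr & 0x3F              # byte within the line
    return node, block, offset
```

The node field is what lets the hardware ask the right directory (e.g., the directory at node 36) whether and where the line is cached.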

The Sun Fire E25K NUMA Multiprocessor (1) The Sun Microsystems E25K multiprocessor.

The Sun Fire E25K NUMA Multiprocessor (2) The SunFire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.

Message-Passing Multicomputers A generic multicomputer.

Topology Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
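Topology properties are easy to compute for the hypercube of (h): neighboring node numbers differ in exactly one bit, so a d-dimensional hypercube gives every node d links and a diameter of d hops.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of `node` in a dim-dimensional hypercube: flip one bit."""
    return [node ^ (1 << k) for k in range(dim)]

def hypercube_distance(a, b):
    """Minimum hop count = Hamming distance between the node numbers."""
    return bin(a ^ b).count("1")
```

In the 4-D hypercube, node 0's neighbors are 1, 2, 4, and 8, and the farthest node (15) is only 4 hops away despite there being 16 nodes.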

BlueGene (1) The BlueGene/L custom processor chip.

BlueGene (2) The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.

Red Storm (1) Packaging of the Red Storm components.

Red Storm (2) The Red Storm system as viewed from above.

A Comparison of BlueGene/L and Red Storm A comparison of BlueGene/L and Red Storm.

Google (1) Processing of a Google query.

Google (2) A typical Google cluster.

Scheduling Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.
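The difference between (a) and (b) is head-of-line blocking: pure FIFO stops the moment the front job does not fit, while letting smaller jobs jump ahead keeps CPUs busy. A toy model of a single scheduling instant (job sizes are illustrative):

```python
def schedulable_now(jobs, free_cpus, skip_blocked):
    """Which queued jobs (given as CPU counts) can start right now.
    With skip_blocked=False (pure FIFO), the first job that does not
    fit blocks everything behind it; with True, smaller jobs run ahead."""
    started = []
    for need in jobs:
        if need <= free_cpus:
            started.append(need)
            free_cpus -= need
        elif not skip_blocked:
            break  # head-of-line blocking: the remaining CPUs sit idle
    return started
```

With jobs needing [3, 4, 2] CPUs and 5 CPUs free, FIFO starts only the 3-CPU job (the 4-CPU job blocks the queue), while skipping also starts the 2-CPU job and leaves no CPU idle.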

Distributed Shared Memory (1) A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation. ….

Distributed Shared Memory (2) A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. … (b) After CPU 0 references page 10. …

Distributed Shared Memory (3) A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. … (c) After CPU 1 references page 10, here assumed to be a read-only page.
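The three DSM figures can be replayed in a few lines: a reference to a remote page either migrates it (on a write) or, for a read-only page, replicates it. A sketch under those assumptions:

```python
class DSM:
    """Toy distributed shared memory: tracks which nodes hold each page."""

    def __init__(self, page_to_node):
        self.copies = {page: {node} for page, node in page_to_node.items()}

    def reference(self, node, page, write=False):
        if node in self.copies[page]:
            return "local hit"          # no page fault occurs
        if write:
            self.copies[page] = {node}  # migrate: other copies invalidated
            return "migrated"
        self.copies[page].add(node)     # read-only page: replicate it
        return "replicated"

dsm = DSM({10: 3})                      # page 10 starts on node 3
assert dsm.reference(0, 10, write=True) == "migrated"    # like (b)
assert dsm.reference(1, 10) == "replicated"              # like (c)
assert dsm.copies[10] == {0, 1}
```

The page-fault hardware of ordinary virtual memory is what makes this transparent to the program: a reference to a non-resident page traps, and the DSM layer fetches or copies the page before resuming.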

Linda Three Linda tuples.
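Linda's model is a shared tuple space with three basic operations: out() deposits a tuple, rd() copies a matching tuple, and in() removes one. A minimal non-blocking sketch, with None standing in for Linda's formal (wildcard) parameters:

```python
class TupleSpace:
    """Toy Linda tuple space (no blocking: unmatched rd/in return None)."""

    def __init__(self):
        self.tuples = []

    def out(self, t):
        """Deposit a tuple into the tuple space."""
        self.tuples.append(t)

    def _match(self, template):
        for t in self.tuples:
            if len(t) == len(template) and all(
                    f is None or f == actual for f, actual in zip(template, t)):
                return t
        return None

    def rd(self, template):
        """Copy a matching tuple out of the space, leaving it in place."""
        return self._match(template)

    def in_(self, template):
        """Remove and return a matching tuple ('in' is a Python keyword)."""
        t = self._match(template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("abc", 2, 5))
assert ts.rd(("abc", None, None)) == ("abc", 2, 5)   # still in the space
assert ts.in_(("abc", 2, None)) == ("abc", 2, 5)     # now removed
assert ts.rd(("abc", None, None)) is None
```

In real Linda, rd() and in() block until a matching tuple appears, which is how processes on different machines synchronize without any shared memory.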

Orca A simplified Orca stack object, with internal data and two operations.

Software Metrics (1) Real programs achieve less than the perfect speedup indicated by the dotted line.

Software Metrics (2) (a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.
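The effect in (b) is Amdahl's law: if a fraction f of the program is inherently sequential, p CPUs can never speed it up beyond 1/f.

```python
def amdahl_speedup(f_sequential, p):
    """Amdahl's law: speedup on p CPUs when fraction f_sequential of the
    work cannot be parallelized: S = 1 / (f + (1 - f) / p)."""
    return 1.0 / (f_sequential + (1.0 - f_sequential) / p)
```

With just 10% sequential code, 10 CPUs yield only about a 5.3x speedup, and even an unbounded number of CPUs cannot exceed 10x — the reason real programs fall below the dotted line of the previous figure.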

Achieving High Performance (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.

Grid Computing The grid layers.