CSCI 8150 Advanced Computer Architecture. Hwang, Chapter 1: Parallel Computer Models. 1.2 Multiprocessors and Multicomputers

Categories of Parallel Computers Considering their architecture only, there are two main categories of parallel computers: systems with shared common memory and systems with unshared distributed memory.

Shared-Memory Multiprocessors There are three shared-memory multiprocessor models: uniform memory access (UMA), nonuniform memory access (NUMA), and cache-only memory architecture (COMA). These systems differ in how the memory and peripheral resources are shared or distributed.

The UMA Model - 1 Physical memory uniformly shared by all processors, with equal access time to all words. Processors may have local cache memories. Peripherals also shared in some fashion. Tightly coupled systems use a common bus, crossbar, or multistage network to connect processors, peripherals, and memories. Many manufacturers have multiprocessor (MP) extensions of uniprocessor (UP) product lines.

The UMA Model - 2 Synchronization and communication among processors achieved through shared variables in common memory. Symmetric MP systems – all processors have access to all peripherals, and any processor can run the OS and I/O device drivers. Asymmetric MP systems – not all peripherals accessible by all processors; kernel runs only on selected processors (master); others are called attached processors (AP).
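
The synchronization-through-shared-variables idea can be made concrete with a short sketch. The following C11 program (an illustration, not from the slides) builds a spinlock out of a single shared atomic variable and uses it to protect a shared counter in the common address space:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Shared variables live in the common memory visible to every processor. */
static atomic_int lock = 0;   /* a spinlock built from one shared variable */
static long counter = 0;      /* data protected by the lock */

static void acquire(void) {
    int expected = 0;
    /* spin until we atomically change the flag from 0 to 1 */
    while (!atomic_compare_exchange_weak(&lock, &expected, 1))
        expected = 0;
}

static void release(void) {
    atomic_store(&lock, 0);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        acquire();
        counter++;            /* communication through shared memory */
        release();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expect 200000 */
    return 0;
}
```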

The UMA Multiprocessor Model [Figure: processors P1 … Pn connected through a system interconnect (bus, crossbar, or multistage network) to shared memories SM1 … SMm and shared I/O devices]

Example: Performance Calculation Consider two loops. The first loop adds corresponding elements of two N-element vectors to yield a third vector. The second loop sums elements of the third vector. Assume each add/assign operation takes 1 cycle, and ignore time spent on other actions (e.g. loop counter incrementing/testing, instruction fetch, etc.). Assume interprocessor communication requires k cycles. On a sequential system, each loop will require N cycles, for a total of 2N cycles of processor time.

Example: Performance Calculation On an M-processor system, we can partition each loop into M parts, each having L = N/M add/assigns requiring L cycles. The total time required is thus 2L. This leaves us with M partial sums that must be totaled. Computing the final sum from the M partial sums requires l = log2(M) additions, each requiring k cycles (to access a non-local term) and 1 cycle (for the add/assign), for a total of l × (k+1) cycles. The parallel computation thus requires 2N/M + (k+1) log2(M) cycles.

Example: Performance Calculation Assume N = 2^20. Sequential execution requires 2N = 2^21 = 2,097,152 cycles. If processor synchronization requires k = 200 cycles, and we have M = 256 processors, parallel execution requires 2N/M + (k+1) log2(M) = 2^21/2^8 + 201 × 8 = 8192 + 1608 = 9800 cycles. Comparing results, the parallel solution is about 214 times faster than the sequential one, with the best theoretical speedup being 256 (since there are 256 processors). Thus the efficiency of the parallel solution is 214/256 = 83.6%.
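
The arithmetic above is easily checked in code. A minimal sketch using the slides' own cycle model, with N, M, and k as given (compile with -lm for log2):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double N = 1 << 20;   /* vector length, N = 2^20 */
    double M = 256;       /* number of processors */
    double k = 200;       /* interprocessor communication cost in cycles */

    double seq = 2 * N;                            /* 2N cycles */
    double par = 2 * N / M + (k + 1) * log2(M);    /* 2N/M + (k+1)log2(M) */

    printf("sequential: %.0f cycles\n", seq);      /* 2097152 */
    printf("parallel:   %.0f cycles\n", par);      /* 9800 */
    printf("speedup:    %.1f (ideal %.0f)\n", seq / par, M);
    printf("efficiency: %.1f%%\n", 100 * seq / par / M);   /* 83.6% */
    return 0;
}
```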

The NUMA Model - 1 Shared memories, but access time depends on the location of the data item. The shared memory is distributed among the processors as local memories, but each of these is still accessible by all processors (with varying access times). Memory access is fastest from the locally-connected processor, with the interconnection network adding delays for other processor accesses. Additionally, there may be global memory in a multiprocessor system, with two separate interconnection networks, one for clusters of processors and their cluster memories, and another for the global shared memories.
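
A toy cost model makes the varying access times concrete. The cycle counts below are invented for illustration; real NUMA latencies depend on the machine and its interconnect:

```c
#include <stdio.h>

typedef enum { LOCAL_MEM, GLOBAL_MEM } mem_class;

/* Cycles for processor `requester` to reach a word held in processor
 * `owner`'s local memory, or in global shared memory. Constants are
 * invented for illustration only. */
static int access_cycles(int requester, int owner, mem_class where) {
    const int t_local = 2, t_network = 20, t_global = 50;
    if (where == GLOBAL_MEM) return t_global;
    return (requester == owner) ? t_local : t_local + t_network;
}

int main(void) {
    printf("local  LM access: %d cycles\n", access_cycles(0, 0, LOCAL_MEM));
    printf("remote LM access: %d cycles\n", access_cycles(0, 3, LOCAL_MEM));
    printf("global SM access: %d cycles\n", access_cycles(0, 3, GLOBAL_MEM));
    return 0;
}
```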

Shared Local Memories [Figure: processors P1 … Pn, each with a local memory LM1 … LMn, connected through an interconnection network]

Hierarchical Cluster Model [Figure: clusters of processors, each sharing a cluster shared memory (CSM) through a cluster interconnection network (CIN); the clusters are connected through a global interconnect network to global shared memories (GSM)]

The COMA Model In the COMA model, processors only have cache memories; the caches, taken together, form a global address space. Each cache has an associated directory that aids remote machines in their lookups; hierarchical directories may exist in machines based on this model. Initial data placement is not critical, as cache blocks will eventually migrate to where they are needed.
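
A deliberately simplified sketch shows why initial placement is not critical: on a miss, the directory names the current holder and the block simply migrates to the cache that wants it. Real COMA protocols are far more involved; all names and sizes here are invented:

```c
#include <stdio.h>

#define NPROC  4
#define NBLOCK 8

/* directory[b] records which processor's cache currently holds block b */
static int directory[NBLOCK];
static int cache[NPROC][NBLOCK];   /* cache[p][b] != 0 => p holds block b */

/* Access block b from processor p; migrate the block on a miss. */
static void access_block(int p, int b) {
    if (cache[p][b]) {
        printf("P%d: hit on block %d\n", p, b);
        return;
    }
    int owner = directory[b];   /* directory lookup finds the holder */
    cache[owner][b] = 0;        /* block leaves the old cache ...    */
    cache[p][b] = 1;            /* ... and migrates to the requester */
    directory[b] = p;
    printf("P%d: miss on block %d, migrated from P%d\n", p, b, owner);
}

int main(void) {
    /* Initial placement: every block starts in P0's cache. */
    for (int b = 0; b < NBLOCK; b++) { directory[b] = 0; cache[0][b] = 1; }
    access_block(2, 5);   /* miss: block 5 migrates from P0 to P2 */
    access_block(2, 5);   /* hit: the block now lives where it is used */
    return 0;
}
```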

Cache-Only Memory Architecture [Figure: processors (P), each with a directory (D) and cache (C), connected through an interconnection network]

Other Models There can be other models used for multiprocessor systems, based on combinations of the models just presented. For example: the cache-coherent non-uniform memory access (CC-NUMA) model, in which each processor has a cache directory and the system has distributed shared memory; and the cache-coherent cache-only model, in which processors have caches but no shared memory, and the caches must be kept coherent.

Multicomputer Models Multicomputers consist of multiple computers, or nodes, interconnected by a message-passing network. Each node is autonomous, with its own processor and local memory, and sometimes local peripherals. The message-passing network provides point-to-point static connections among the nodes. Local memories are not shared, so traditional multicomputers are sometimes called no-remote-memory-access (or NORMA) machines. Inter-node communication is achieved by passing messages through the static connection network.
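
As an illustration of the message-passing style (the slides do not prescribe an API), here is a minimal sketch using MPI, a standard message-passing interface: every node computes a local partial result, and the only communication is explicit messages sent to node 0:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long partial = rank + 1;      /* stand-in for a locally computed sum */

    if (rank != 0) {
        /* no shared memory: the only way to communicate is a message */
        MPI_Send(&partial, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        long total = partial, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_LONG, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("total = %ld\n", total);
    }
    MPI_Finalize();
    return 0;
}
```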

Generic Message-Passing Multicomputer [Figure: nodes, each pairing a processor (P) with a local memory (M), connected through a message-passing interconnection network]

Multicomputer Generations Each multicomputer uses routers and channels in its interconnection network, and heterogeneous systems may involve mixed node types with uniform data representation and communication protocols. First generation: hypercube architecture, software-controlled message switching, processor boards. Second generation: mesh-connected architecture, hardware message switching, software for medium-grain distributed computing. Third generation: fine-grained distributed computing, with each VLSI chip containing the processor and communication resources.

Multivector and SIMD Computers Vector computers are often built as a scalar processor with an attached, optional vector processor. All data and instructions are stored in the central memory, all instructions are decoded by the scalar control unit, and all scalar instructions are handled by the scalar processor. When a vector instruction is decoded, it is sent to the vector processor's control unit, which supervises the flow of data and the execution of the instruction.

Vector Processor Models In register-to-register models, a fixed number of (possibly reconfigurable) vector registers hold all vector operands, intermediate results, and final vector results. All registers are accessible in user instructions. In a memory-to-memory vector processor, primary memory holds operands and results; a vector stream unit accesses memory for fetches and stores in units of large superwords (e.g. 512 bits).
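
One practical consequence of the register-to-register model is that vectors longer than a register must be processed in register-length strips ("strip mining"). A sketch in C, with VLEN standing in for an assumed vector register length:

```c
#define VLEN 64   /* assumed vector register length, in elements */

/* c = a + b, processed in VLEN-element strips. On a register-to-register
 * vector machine, each outer iteration would be one vector load / vector
 * add / vector store sequence operating on the vector registers. */
void vadd(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += VLEN) {
        int strip = (n - i < VLEN) ? (n - i) : VLEN;  /* last partial strip */
        for (int j = 0; j < strip; j++)
            c[i + j] = a[i + j] + b[i + j];
    }
}
```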

SIMD Supercomputers The operational model is a 5-tuple (N, C, I, M, R). N = number of processing elements (PEs). C = set of instructions executed directly by the control unit (including scalar and flow-control instructions). I = set of instructions broadcast to all PEs for parallel execution. M = set of masking schemes used to partition PEs into enabled and disabled states. R = set of data-routing functions enabling inter-PE communication through the interconnection network.
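
The masking schemes M are what let a SIMD machine handle conditionals: every PE receives the same broadcast instruction, but only enabled PEs execute it. A toy simulation in C (PE count and data invented):

```c
#include <stdio.h>

#define N_PE 8   /* number of processing elements */

int main(void) {
    int data[N_PE] = {3, -1, 4, -1, 5, -9, 2, -6};
    int mask[N_PE];

    /* Masking scheme: enable only the PEs holding negative values. */
    for (int pe = 0; pe < N_PE; pe++)
        mask[pe] = (data[pe] < 0);

    /* Broadcast one instruction ("negate your operand") to all PEs;
     * disabled PEs ignore it. */
    for (int pe = 0; pe < N_PE; pe++)
        if (mask[pe])
            data[pe] = -data[pe];

    for (int pe = 0; pe < N_PE; pe++)
        printf("%d ", data[pe]);      /* prints: 3 1 4 1 5 9 2 6 */
    printf("\n");
    return 0;
}
```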

Operational Model of SIMD Computer [Figure: a control unit broadcasting instructions to processing elements, each a processor (P) with its own memory (M), connected through an interconnection network]