Multiprocessors — Large vs. Small Scale


Small-Scale MIMD Designs
Memory: centralized with uniform memory access time (UMA) and bus interconnect
Examples: SPARCcenter

Large-Scale MIMD Designs
Memory: distributed with non-uniform memory access time (NUMA) and scalable interconnect
Examples: Cray T3D, Intel Paragon, CM-5

Communication Models
Shared memory
–Communication via a shared address space
–Advantages: ease of programming; lower latency; easier to use hardware-controlled caching
Message passing
–Processors have private memories and communicate via messages
–Advantages: less hardware, easier to design; focuses attention on costly non-local operations
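The two models can be sketched in Python. This is an illustrative analogy, not hardware: threads stand in for processors sharing an address space, and a queue models the message channel between private memories.

```python
import threading
import queue

# --- Shared-memory model: workers communicate through a shared address space ---
counter = 0
lock = threading.Lock()              # shared data still needs synchronization

def shared_memory_worker(n):
    global counter
    for _ in range(n):
        with lock:                   # ordinary loads/stores, guarded by a lock
            counter += 1

workers = [threading.Thread(target=shared_memory_worker, args=(1000,)) for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(counter)                       # 4000: all workers updated the same location

# --- Message-passing model: private state, explicit send/receive ---
def message_passing_worker(inbox, outbox):
    total = 0                        # private to this worker, never shared directly
    while True:
        msg = inbox.get()            # receive
        if msg is None:
            break
        total += msg
    outbox.put(total)                # send the result back as a message

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=message_passing_worker, args=(inbox, outbox))
t.start()
for value in range(10):
    inbox.put(value)                 # explicit, potentially costly non-local operation
inbox.put(None)                      # sentinel: no more messages
t.join()
result = outbox.get()
print(result)                        # 45 = sum(range(10))
```

Note how the shared-memory version needs explicit synchronization around the shared location, while the message-passing version makes every communication visible as a send or receive.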

Communication Properties
Bandwidth
–Communication needs high bandwidth
–Limited by the network, memory, and processor
Latency
–Affects performance, since the processor must wait
–Affects ease of programming: how to overlap communication and computation?
Latency hiding
–How can a mechanism help hide latency?
–Examples: overlap message send with computation, prefetch
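Latency hiding can be sketched as software pipelining: issue the fetch for item i+1 before computing on item i. This is a toy model, with `slow_fetch` standing in for a high-latency remote load and a background thread playing the role of an outstanding memory request.

```python
import threading

def slow_fetch(i):
    # Stand-in for a remote load / message receive with high latency
    return i * i

def compute(x):
    return x + 1

def pipelined(n):
    # Overlap communication with computation: while computing on item i,
    # the fetch for item i+1 is already in flight on another thread.
    results = []
    fetched = {}

    def prefetch(i):
        fetched[i] = slow_fetch(i)

    t = threading.Thread(target=prefetch, args=(0,))
    t.start()
    for i in range(n):
        t.join()                       # wait only if the fetch is still in flight
        current = fetched.pop(i)
        if i + 1 < n:                  # issue the next fetch early
            t = threading.Thread(target=prefetch, args=(i + 1,))
            t.start()
        results.append(compute(current))
    return results

print(pipelined(5))   # [1, 2, 5, 10, 17]
```

With real latencies, each fetch's delay is hidden behind the computation on the previous item instead of adding to the critical path.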

Small-Scale—Shared Memory
Caches serve to:
–Increase bandwidth versus bus/memory
–Reduce latency of access
–Valuable for both private data and shared data
What about cache consistency?

The Problem of Cache Coherency
Value of X in memory is 1
CPU A reads X – its cache now contains 1
CPU B reads X – its cache now contains 1
CPU A stores 0 into X
–CPU A's cache contains a 0
–CPU B's cache contains a 1
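The scenario can be reproduced with a toy model of two caches that have no coherence protocol. This is illustrative only: real caches track per-line state and use snooping or directories to invalidate stale copies.

```python
memory = {"X": 1}                 # value of X in memory is 1

class Cache:
    """A trivially incoherent cache: no invalidation, no snooping."""
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:        # miss: fill the line from memory
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]           # hit: served from the local copy
    def write(self, addr, value):
        self.lines[addr] = value          # update the local copy...
        self.mem[addr] = value            # ...and memory (write-through here),
                                          # but nobody tells the other cache

cache_a, cache_b = Cache(memory), Cache(memory)
print(cache_a.read("X"))   # 1: CPU A's cache now contains 1
print(cache_b.read("X"))   # 1: CPU B's cache now contains 1
cache_a.write("X", 0)      # CPU A stores 0 into X
print(cache_a.read("X"))   # 0
print(cache_b.read("X"))   # 1  <- stale copy: the caches are incoherent
```

A coherence protocol would invalidate or update CPU B's copy on the write, so its next read would miss and fetch the new value.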

Multicore Systems

Multicore Computers (chip multiprocessors)
Combine two or more processors (cores) on a single piece of silicon
Each core has its own ALU, registers, pipeline hardware, and L1 instruction and data caches
Multithreading is used

Pollack's Rule
Performance increase is roughly proportional to the square root of the increase in complexity:
performance ∝ √complexity
Power consumption increase is roughly linearly proportional to the increase in complexity:
power consumption ∝ complexity
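The rule makes the case for many small cores concrete. A quick calculation, treating the proportionalities as equalities in arbitrary units and assuming the workload parallelizes perfectly:

```python
import math

# Pollack's rule: performance ~ sqrt(complexity), power ~ complexity.
def performance(complexity):
    return math.sqrt(complexity)

def power(complexity):
    return complexity

# One big core of complexity 4 vs. four small cores of complexity 1:
big_perf   = performance(4)             # sqrt(4) = 2.0
small_perf = 4 * performance(1)         # 4 * 1   = 4.0
same_power = power(4) == 4 * power(1)   # both designs burn 4 units of power
print(big_perf, small_perf, same_power)
```

Within the same power budget, the four small cores deliver twice the aggregate performance of the single complex core (provided the software can actually use all four).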

Pollack's Rule
Favors 100s of low-complexity cores, each operating at very low power
Example: four small cores of complexity 1 each give complexity 4×1 = 4, power 4×1 = 4, and performance 4×1 = 4, whereas a single core of the same total complexity 4 would deliver performance of only √4 = 2

Increasing CPU Performance: Manycore Chip
Composed of hybrid cores:
–Some general purpose
–Some graphics
–Some floating point

Exascale Systems
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks → millions of cores → exascale systems (10^18 Flop/s)

Moore's Law Reinterpreted
Number of cores per chip doubles every 2 years
Number of threads of execution doubles every 2 years
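A doubling every 2 years compounds quickly; a one-line model makes that visible (the 4-core starting point is an arbitrary illustration, not a figure from the slides):

```python
# Cores per chip double every 2 years (Moore's law, reinterpreted).
def cores_after(base_cores, years):
    return base_cores * 2 ** (years // 2)

print(cores_after(4, 10))   # 4 cores -> 128 cores after five doublings
```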

Shared Memory MIMD
Shared memory: single address space
All processes have access to the pool of shared memory
(Diagram: processors P connected through a bus to a shared memory)

Shared Memory MIMD
Each processor executes different instructions asynchronously, using different data
(Diagram: a control unit CU feeding instructions to a processing element PE, which exchanges data with memory)

Symmetric Multiprocessors (SMP)
MIMD, shared memory, UMA
(Diagram: processors, each with private L1 and L2 caches, connected by a system bus to main memory and I/O)

Symmetric Multiprocessors (SMP)
Characteristics:
–Two or more similar processors
–Processors share the same memory and I/O facilities
–Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor
–All processors share access to I/O devices
–All processors can perform the same functions
–The system is controlled by the operating system

Symmetric Multiprocessors (SMP)
Operating system:
–Provides tools and functions to exploit the parallelism
–Schedules processes or threads across all of the processors
–Takes care of synchronization among processors

Multicore Computers: Dedicated L1 Cache (ARM11 MPCore)
(Diagram: each CPU core has private L1 instruction and data caches; a single L2 connects the cores to main memory and I/O)

Multicore Computers: Dedicated L2 Cache (AMD Opteron)
(Diagram: each CPU core has private L1 instruction/data caches and a private L2; the L2s connect to main memory and I/O)

Multicore Computers: Shared L2 Cache (Intel Core Duo)
(Diagram: each CPU core has private L1 instruction/data caches; all cores share one L2 connected to main memory and I/O)

Multicore Computers: Shared L3 Cache (Intel Core i7)
(Diagram: each CPU core has private L1 instruction/data caches and a private L2; all cores share one L3 connected to main memory and I/O)

Multicore Computers
Advantages of a shared L2 cache:
–Reduced overall miss rate: a thread on one core may bring a frame into the cache, and a thread on another core may then access that same location without a miss
–Data shared by multiple cores is not replicated
–The amount of shared cache allocated to each core may be dynamic
–Interprocessor communication is easy to implement
Advantages of a dedicated L2 cache:
–Each core can access its private cache more rapidly
L3 cache:
–As the amount of memory and the number of cores grow, an L3 cache provides better performance
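The replication and miss-rate points show up even in a toy model: one shared block, two cores. (Real caches have finite capacity, eviction, and coherence traffic, all of which this sketch ignores.)

```python
class SimpleCache:
    """A minimal cache: remembers which blocks it holds and counts misses."""
    def __init__(self):
        self.lines = set()
        self.misses = 0
    def access(self, addr):
        if addr not in self.lines:
            self.misses += 1          # miss: fetch the block
            self.lines.add(addr)

# Dedicated L2: one cache per core
private = [SimpleCache(), SimpleCache()]
for core in (0, 1):
    private[core].access("block42")   # both cores miss; the block is replicated
copies_private = sum("block42" in c.lines for c in private)

# Shared L2: one cache serving both cores
shared = SimpleCache()
for core in (0, 1):
    shared.access("block42")          # core 0 misses; core 1 hits
copies_shared = int("block42" in shared.lines)

print(copies_private, copies_shared)                  # 2 copies vs. 1 copy
print(sum(c.misses for c in private), shared.misses)  # 2 misses vs. 1 miss
```

The shared cache stores one copy and turns the second core's access into a hit; the private caches pay a miss each and hold duplicate copies of the same block.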

Multicore Computers
On-chip interconnects:
–Bus
–Crossbar
Off-chip communication (CPU-to-CPU or I/O): bus-based

Multicore Computers: Multithreading
A multithreaded processor provides a separate program counter (PC) for each thread (hardware multithreading)
Implicit multithreading: concurrent execution of multiple threads extracted from a single sequential program
Explicit multithreading: instructions from different explicit threads are executed by interleaving them on shared or parallel pipelines

Multicore Computers: Explicit Multithreading
Fine-grained (interleaved) multithreading:
–The processor deals with two or more thread contexts at a time
–Switches from one thread to another at each clock cycle
Coarse-grained (blocked) multithreading:
–Instructions of a thread are executed sequentially until an event that causes a delay (e.g., a cache miss) occurs
–That event causes a switch to another thread
Simultaneous multithreading (SMT):
–Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
–Thread-level parallelism is combined with instruction-level parallelism (ILP)
Chip multiprocessing (CMP):
–Each processor of a multicore system handles separate threads
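The two switching policies can be contrasted with a toy cycle-by-cycle model: 'C' is a normal compute cycle and 'M' is a cycle that misses in the cache. (Issue width and the actual miss latency are ignored; this only shows *when* each policy switches threads.)

```python
def fine_grained(threads):
    # Interleaved multithreading: round-robin, switch every cycle.
    trace, i = [], 0
    threads = [list(t) for t in threads]
    while any(threads):
        tid = i % len(threads)
        if threads[tid]:
            trace.append((tid, threads[tid].pop(0)))
        i += 1
    return trace

def coarse_grained(threads):
    # Blocked multithreading: run one thread until a long-latency event ('M').
    trace, tid = [], 0
    threads = [list(t) for t in threads]
    while any(threads):
        if not threads[tid]:
            tid = (tid + 1) % len(threads)   # current thread finished
            continue
        op = threads[tid].pop(0)
        trace.append((tid, op))
        if op == 'M':                        # cache miss: switch threads
            tid = (tid + 1) % len(threads)
    return trace

t0, t1 = "CCMC", "CCCC"
fg = fine_grained([t0, t1])    # alternates every cycle
cg = coarse_grained([t0, t1])  # runs thread 0 until its miss, then thread 1
print(fg)
print(cg)
```

SMT would go one step further and issue instructions from both threads in the *same* cycle, which a single-issue trace like this cannot express.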

Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP

GPUs (Graphics Processing Units)
Characteristics of GPUs:
–GPUs are accelerators for CPUs
–SIMD
–GPUs have many parallel processors and many concurrent threads (i.e., 10 or more cores; 100s or 1000s of threads per core)
–The CPU-GPU combination is an example of heterogeneous computing
–GPGPU (general-purpose GPU): using a GPU to perform applications traditionally handled by the CPU
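The SIMD idea in miniature: a single operation applied to many data elements in lockstep, the way each GPU thread applies the same instruction to its own element. A plain Python list comprehension stands in for thousands of hardware lanes.

```python
def simd_fma(a_vec, b_vec, c_vec):
    # One instruction (fused multiply-add), many data lanes: a*b + c per lane.
    # Every "lane" executes the same operation; only the data differs.
    return [a * b + c for a, b, c in zip(a_vec, b_vec, c_vec)]

print(simd_fma([1, 2, 3, 4], [10, 10, 10, 10], [5, 5, 5, 5]))
# [15, 25, 35, 45]
```

On a real GPU the loop disappears: the hardware launches one thread per element and all of them execute the multiply-add in parallel.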

GPUs