Chapter 7 Multicores, Multiprocessors, and Clusters

Introduction (§7.1)
- Goal: connect multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Task-level (process-level) parallelism
  - Independent tasks running in parallel
  - High throughput for independent jobs
- Parallel processing program
  - A single program run on multiple processors

Introduction (§7.1)
- Multicore microprocessors
  - Chips with multiple processors; the processors are called cores
  - Almost always shared memory processors (SMPs)
- Clusters
  - Microprocessors housed in many independent servers
  - Memory is not shared
  - Used for search engines, web servers, and databases

Introduction (§7.1)
- The number of cores per chip is expected to increase with Moore's Law
- The challenge: create hardware and software that make it easy to write correct parallel processing programs that execute efficiently, in both performance and energy, as the number of cores per chip increases

Hardware and Software
- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., Intel Core i7
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware

Hardware and Software (§7.6)
The four combinations of hardware and software:
- Serial hardware, sequential software: matrix multiply written in MATLAB running on an Intel Pentium 4
- Serial hardware, concurrent software: Windows 7 running on an Intel Pentium 4
- Parallel hardware, sequential software: matrix multiply written in MATLAB running on an Intel Core i7
- Parallel hardware, concurrent software: Windows 7 running on an Intel Core i7

Parallel Processing Programs
- The difficulty is not with the hardware
- Too few important application programs have been written to complete tasks sooner on multiprocessors
- It is difficult to write software that uses multiple processors to complete one task faster
- The problem gets worse as the number of processors increases

Parallel Programming (§7.2 The Difficulty of Creating Parallel Processing Programs)
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!

Parallel Programming Challenges (§7.2)
- Scheduling the workload
- Partitioning the work into parallel pieces (see the sketch after this list)
- Balancing the load evenly between the processors
- Synchronizing the work
- Communication overhead between the processors
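
To make the partitioning and load-balancing points concrete, here is a minimal C sketch (an illustration, not from the slides; the worker id pn and worker count p are assumed to be supplied by the surrounding runtime). It splits n elements so that no worker receives more than one element more than any other:

    /* Balanced partitioning: worker pn of p sums its slice of a[0..n-1]. */
    double partial_sum(const double *a, long long n, int pn, int p)
    {
        long long lo = n * pn / p;          /* first index owned by worker pn */
        long long hi = n * (pn + 1) / p;    /* one past the last index owned  */
        double s = 0.0;
        for (long long i = lo; i < hi; i++)
            s += a[i];                      /* each element is summed exactly once */
        return s;
    }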

Instruction and Data Streams (§7.6)
An alternate classification, by instruction and data streams:
- Single instruction, single data (SISD): e.g., Intel Pentium 4
- Single instruction, multiple data (SIMD): e.g., SSE instructions of x86
- Multiple instruction, single data (MISD): no examples today
- Multiple instruction, multiple data (MIMD): e.g., Intel Xeon e5345

Instruction and Data Streams (§7.6)
- SISD (single instruction stream, single data stream): the conventional uniprocessor
- MIMD (multiple instruction streams, multiple data streams): the conventional multiprocessor
  - Can be separate programs running on different processors, or
  - A single program that runs on all processors (SPMD), relying on conditional statements when different processors should execute different sections of code (see the sketch below)
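
As a hedged illustration of the single-program (SPMD) style (not from the slides; the two helper functions are hypothetical), every processor runs the same code and the processor number Pn selects which section it executes:

    /* SPMD sketch: the same program runs on every processor; Pn picks the role. */
    void spmd_body(int Pn, int nprocs)
    {
        if (Pn == 0) {
            coordinate_work(nprocs);   /* hypothetical: processor 0 hands out the work */
        } else {
            compute_my_share(Pn);      /* hypothetical: the others do the computation  */
        }
    }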

Instruction and Data Streams (§7.6)
- SIMD (single instruction stream, multiple data streams)
  - Operates on vectors of data, e.g., one instruction adds 64 numbers by sending 64 data streams to 64 ALUs to form 64 sums within a single clock cycle
  - Works best when dealing with arrays in for loops; needs a lot of identically structured data (a SIMD sketch follows this list)
  - Example: graphics processing
- MISD (multiple instruction streams, single data stream)
  - Performs a series of computations on a single data stream in a pipelined fashion
  - Example: parse input from the network, decrypt the data, decompress it, search for a match
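
To make the SIMD idea concrete, here is a minimal C sketch using x86 SSE intrinsics (an assumption about the target; any SIMD instruction set would illustrate the same point). The scalar loop produces one sum per instruction, while _mm_add_ps produces four sums per instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Scalar version: one addition per loop iteration. */
    void add_scalar(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD version: one instruction adds four elements (n assumed to be a multiple of 4). */
    void add_sse(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);              /* load 4 floats from a   */
            __m128 vb = _mm_loadu_ps(&b[i]);              /* load 4 floats from b   */
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));     /* 4 sums in one SIMD add */
        }
    }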

Hardware Multithreading (§7.5)
- While MIMD relies on multiple processes or threads to try to keep multiple processors busy, hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion, so that the hardware resources are used efficiently

Hardware Multithreading (§7.5)
- Performing multiple threads of execution in parallel
  - The processor must duplicate the independent state of each thread: replicate the registers, the program counter, etc.
  - Must have fast switching between threads
- A process switch can take thousands of cycles; a thread switch can be essentially instantaneous

Hardware Multithreading (§7.5)
- Fine-grained multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
  - Instruction-level parallelism
- Coarse-grained multithreading
  - Only switch on a long stall (e.g., an L2-cache miss)
  - Disadvantage: pipeline start-up cost
  - No instruction-level parallelism

Simultaneous Multithreading (SMT)
- In a multiple-issue, dynamically scheduled, pipelined processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when functional units are available
  - Within a thread, dependencies are handled by scheduling and register renaming
- Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared functional units and caches

Multithreading Example

Multithreading Example (§7.5)
- The four threads at the top of the figure show how each would execute running alone on a standard superscalar processor without multithreading support
  - A major stall, such as an instruction cache miss, can leave the entire processor idle
- With coarse-grained multithreading, the long stalls are partially hidden by switching to another thread that uses the resources of the processor
  - Pipeline start-up overhead still leads to idle cycles

Multithreading Example (§7.5)
- Fine-grained multithreading
  - Instruction-level parallelism
  - The interleaving of threads mostly eliminates idle clock cycles
- SMT
  - Thread-level parallelism and instruction-level parallelism are both exploited
  - Multiple threads use the issue slots in a single clock cycle

Future of Multithreading
- Will it survive? In what form?
- Power considerations lead to simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread switching may be the most effective approach
- Multiple simple cores might share resources more effectively

Shared Memory (§7.3 Shared Memory Multiprocessors)
- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Nearly all current multicore chips are SMPs

Shared Memory (§7.3)
- Memory access time can be either:
  - Uniform memory access (UMA): the access time to a word does not depend on which processor asks for it
  - Nonuniform memory access (NUMA): some memory accesses are much faster than others, depending on which processor asks for which word, because main memory is divided and attached to different microprocessors or memory controllers
- Shared variables are synchronized using locks (see the sketch below)
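
A minimal sketch of synchronizing a shared variable with a lock, assuming POSIX threads (the slides do not prescribe a particular threading API):

    #include <pthread.h>

    double shared_sum = 0.0;                               /* variable shared by all threads */
    pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;  /* lock protecting shared_sum     */

    void add_to_shared_sum(double my_part)
    {
        pthread_mutex_lock(&sum_lock);     /* only one thread may update the sum at a time */
        shared_sum += my_part;
        pthread_mutex_unlock(&sum_lock);
    }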

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA machine
- Each processor has an ID: 0 ≤ Pn ≤ 99

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA machine
- Each processor has an ID: 0 ≤ Pn ≤ 99
- Partition: 1000 numbers per processor
- Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

Example: Sum Reduction
- Reduction step run by every processor, where Pn is this processor's ID (a runnable sketch follows below):

    half = 100;
    repeat
        synch();                      /* wait until all processors reach this point */
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               processor 0 gets the missing element */
        half = half/2;                /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
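
For reference, a self-contained pthreads version of the same reduction is sketched below. This is only one plausible realization on an SMP, not the slides' own code: synch() is implemented with a POSIX barrier (assumed to be available), and the processor count is scaled down to 4 threads.

    #include <pthread.h>
    #include <stdio.h>

    #define P 4            /* number of "processors" (threads) */
    #define N 100000       /* number of values to sum          */

    static double A[N];
    static double sum[P];                 /* one partial sum per thread */
    static pthread_barrier_t barrier;     /* plays the role of synch()  */

    static void *worker(void *arg)
    {
        int Pn = *(int *)arg;

        /* Phase 1: each thread sums its own N/P contiguous elements. */
        sum[Pn] = 0.0;
        for (int i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction, following the slide's pseudocode. */
        int half = P;
        do {
            pthread_barrier_wait(&barrier);      /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];         /* thread 0 picks up the odd element */
            half = half / 2;                     /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[P];
        int id[P];

        for (int i = 0; i < N; i++) A[i] = 1.0;  /* expected total: 100000 */
        pthread_barrier_init(&barrier, NULL, P);

        for (int i = 0; i < P; i++) {
            id[i] = i;
            pthread_create(&tid[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < P; i++)
            pthread_join(tid[i], NULL);

        printf("total = %f\n", sum[0]);
        pthread_barrier_destroy(&barrier);
        return 0;
    }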

Private Memory (§7.4 Clusters and Other Message-Passing Multiprocessors)
- Each processor has its own private physical address space
- Processors must communicate via explicit message passing (send and receive); a sketch follows
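
As a hedged sketch of explicit message passing, here is a minimal MPI example in C (MPI is just one common message-passing interface; the slides do not name a specific library):

    #include <mpi.h>
    #include <stdio.h>

    /* Each process has its own private value; process 1 sends its value to process 0. */
    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */

        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);       /* explicit send    */
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                              /* explicit receive */
            printf("process 0 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }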

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and its own OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
  - High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth compared with processor/memory bandwidth on an SMP

Clusters
- Clusters with high-performance, dedicated message-passing networks
  - Many supercomputers today use custom networks
  - Much more expensive than local area networks

Warehouse-Scale Computers
- A building to house, power, and cool 100,000 servers, with a dedicated network
- Like large clusters, but their architecture and operation are more sophisticated
- They act as one giant computer

Grid Computing
- Separate computers interconnected by long-haul networks, e.g., Internet connections
- Work units are farmed out, results sent back
- Can make use of idle time on PCs

Networks Inside a Chip
- Thanks to Moore's Law and the increasing number of cores per chip, we now need networks inside a chip as well

Network Characteristics
- Performance (a rough example follows)
  - Latency: the time on an unloaded network to send and receive a message
  - Throughput: the maximum number of messages that can be transmitted in a given time period
  - Congestion: delays caused by contention for a portion of the network
- Fault tolerance: continuing to work with broken components
- Collision detection
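
As a rough, back-of-the-envelope example (assumed numbers, not figures from the slides): sending a 1 KB message over a link with 2 microseconds of latency and 1 GB/s of bandwidth takes roughly 2 us + 1 KB / (1 GB/s) ≈ 2 us + 1 us = 3 us, so short messages are dominated by latency while long messages are dominated by bandwidth.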

Network Characteristics
- Cost, which includes:
  - The number of switches
  - The number of links on a switch to connect to the network
  - The width (number of bits) per link
  - The length of the links when the network is mapped onto silicon
- Routability in silicon
- Power and energy efficiency

Interconnection Networks (§7.8 Introduction to Multiprocessor Network Topologies)
- Network topologies: arrangements of processors (black squares), switches (blue dots), and links
- Topologies shown in the figure: bus, ring, 2D mesh, N-cube (N = 3), fully connected (a link-count comparison follows)
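
As a rough worked comparison of topology cost (a standard counting argument, not from the slides): with P = 8 nodes, a ring needs 8 links, a 3-cube needs 8 × 3 / 2 = 12 links, and a fully connected network needs P(P − 1)/2 = 28 links, which illustrates how the topologies trade link cost against bandwidth.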

Concluding Remarks (§7.13)
- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!