Multiprocessors
Speed of execution is always a paramount concern. If feasible, the more simultaneous execution that can be done on multiple computers, the better. This is easier to do in the server and embedded processor markets, where the applications and algorithms exhibit natural parallelism, and less so in the desktop market.

Multiprocessors and Thread-Level Parallelism
Chapter 6 delves deeply into the issues surrounding multiprocessors. Thread-level parallelism is a necessary adjunct to the study of multiprocessors.

Outline
- Intro to problems in parallel processing
- Taxonomy
- MIMDs
- Communication
- Shared-Memory Multiprocessors
- Multicache coherence
- Implementation
- Performance

Outline - continued
- Distributed-Memory Multiprocessors
- Coherence protocols
- Performance
- Synchronization: atomic operations, spin locks, barriers
- Thread-Level Parallelism

Low Level Issues in Parallel Processing
Consider the following generic code:
  y = x + 3
  z = 2*x + y
  w = w*w
  Lock file M
  Read file M
Naively splitting up this code between two processors leads to big problems.

Low Level Issues in Parallel Processing - continued
Suppose the statements are divided naively between Processor A and Processor B:
  y = x + 3
  z = 2*x + y
  w = w*w
  Lock file M
  Read file M
Problems: the statements must be executed so as not to violate the original sequential semantics of the algorithm, a processor may have to wait on a file lock, and so on.
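
As a rough illustration (the thread routines, the initial values, and the choice to give Processor A the write of y while Processor B computes z are all invented for this sketch, and the file-locking statements are omitted), here is roughly what the naive split looks like as C threads. Without synchronization, Processor B can read y before Processor A has written it:

#include <pthread.h>
#include <stdio.h>

static int x = 4, y, z, w = 5;

static void *proc_a(void *arg) {        /* "Processor A" */
    (void)arg;
    y = x + 3;                          /* B needs this value of y */
    return NULL;
}

static void *proc_b(void *arg) {        /* "Processor B" */
    (void)arg;
    z = 2 * x + y;                      /* data race: y may still be 0 here */
    w = w * w;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, proc_a, NULL);
    pthread_create(&b, NULL, proc_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("z = %d (sequential answer: %d)\n", z, 2 * x + (x + 3));
    return 0;
}

Depending on which thread happens to run first, z comes out as either 15 or 8, which is exactly the violation of the original sequential semantics described above.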

Low Level Issues in Parallel Processing - continued
This was a grossly bad example, of course, but the underlying issues appear in good multiprocessing applications. Two key issues are:
- Shared memory (shared variables)
- Interprocessor communication (e.g., keeping shared variables current, file locks)

Computation/Communication
“A key characteristic in determining the performance of parallel programs is the ratio of computation to communication.” (bottom of page 546)
“Communication is the costly part of parallel computing” … and also the slow part. A table on page 547 shows this ratio for some DSP calculations, which normally have a good ratio.

Computation/Communication – best and worst cases
Problem: add 6 to each component of vector x[n], using three processors A, B, and C.
Best case: give A the first n/3 components, B the next n/3, and C the last n/3. One message at the beginning; the results are passed back in one message at the end. Computation/communication ratio = n/2.

Computation/Communication – best and worst cases
Worst case: have processor A add 1 to x[k] and pass it to B, which adds 2 and passes it to C, which adds 3. That is two messages per effective computation, so the computation/communication ratio = n/(2*n) = 1/2.
Of course this is terrible coding, but it makes the point. Real examples are found on page 547.
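
A sketch of the best case in C (the thread library, the vector length, and the static three-way split are illustrative assumptions, not from the text): each worker touches only its own n/3 slice, so the only communication is the thread creation at the start and the join at the end, independent of n.

#include <pthread.h>

#define N 9000
static double x[N];

struct slice { int lo, hi; };

static void *add6(void *arg) {
    struct slice *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        x[i] += 6.0;                     /* n/3 computations, no communication */
    return NULL;
}

int main(void) {
    pthread_t t[3];
    struct slice s[3];
    for (int p = 0; p < 3; p++) {
        s[p].lo = p * (N / 3);
        s[p].hi = (p + 1) * (N / 3);
        pthread_create(&t[p], NULL, add6, &s[p]);   /* one message out */
    }
    for (int p = 0; p < 3; p++)
        pthread_join(t[p], NULL);                    /* one message back */
    return 0;
}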

Taxonomy
- SISD – single instruction stream, single data stream (uniprocessors)
- SIMD – single instruction stream, multiple data streams (vector processors)
- MISD – multiple instruction streams, single data stream (no commercial processors of this type have been built to date)
- MIMD – multiple instruction streams, multiple data streams

MIMDs
MIMDs have emerged as the architecture of choice for general-purpose multiprocessors. They are often built with off-the-shelf microprocessors, and flexible designs are possible.

Two Classes of MIMD
Two basic structures will be studied:
- Centralized shared-memory multiprocessors
- Distributed-memory multiprocessors

Why focus on memory?
Communication or data sharing can be done at several levels in our basic structure. Sharing disks is no problem, and sharing a cache between processors is probably not feasible. Hence our main distinction is whether or not to share memory.

Centralized Shared-Memory Multiprocessors
Main memory is shared. This has many advantages, notably much faster message passing. It also forces many issues to be dealt with, such as block write contention and keeping memory coherent.

Distributed-Memory Multiprocessors
Each processor has its own memory. An interconnection network aids the message passing.

Communication
Algorithms or applications that can be partitioned completely into independent streams of computation are very rare. Usually, in order to divide an application among n processors, a great deal of inter-processor information must be communicated: for example, which data a processor is working on, how far it has processed that data, computed values that are needed by another processor, and so on. Message passing, shared memory, and RPCs are all methods of communication for multiprocessors.

The Two Biggest Challenges in Using Multiprocessors (pages 537 and 540)
- Insufficient parallelism (in the algorithms or code)
- Long-latency remote communication
“Much of this chapter focuses on techniques for reducing the impact of long remote communication latency.” (page 540, 2nd paragraph)

Advantages of Different Communication Mechanisms
Since this is a key distinction, both in terms of system performance and cost, you should be aware of the comparative advantages. Know the issues on pages 535-536.

SMPs - Shared-Memory Multiprocessors
These machines are usually called SMPs rather than centralized shared-memory multiprocessors. We now look at the coherent-memory problem.

Multiprocessor Cache Coherence – the key problem

  Time | Event               | Cache for A | Cache for B | Memory contents for X
   0   |                     |             |             | 1
   1   | CPU A reads X       | 1           |             | 1
   2   | CPU B reads X       | 1           | 1           | 1
   3   | CPU A stores 0 in X | 0           | 1           | 0

(A write-through cache is assumed, as in the textbook figure.) The problem is that CPU B is still using a value of X = 1 whereas A is not. Obviously we can't allow this … but how do we stop it?

Basic Schemes for Enforcing Coherence – Section 6.3
Look over the definitions of coherence and consistency (page 550). The coherence protocols are on page 552: directory-based and snooping. We concentrate on snooping with invalidation, implemented with a write-back cache. Understand the basics in Figures 6.8 and 6.9, and study the finite-state transition diagram on page 557.

A Cache Coherence Protocol
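
The slide at this point shows the protocol's state diagram. As a software analogy only (the state and event names are simplified and this is not the textbook's exact protocol), a minimal sketch of the per-block transitions for a snooping, write-invalidate protocol with a write-back cache might look like this:

/* Per-block state machine for a snooping, write-invalidate, write-back
 * (MSI-style) protocol.  A real controller also handles bus arbitration,
 * the actual write-back of dirty data, and races between requests. */
typedef enum { INVALID, SHARED, MODIFIED } block_state;

typedef enum {
    CPU_READ, CPU_WRITE,          /* requests from this cache's processor */
    BUS_READ_MISS, BUS_WRITE_MISS /* requests observed on the shared bus   */
} coherence_event;

block_state next_state(block_state s, coherence_event e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;    /* place read miss on the bus  */
        if (e == CPU_WRITE) return MODIFIED;  /* place write miss on the bus */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return MODIFIED;  /* invalidate other copies */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another CPU is writing  */
        return SHARED;                             /* local reads hit         */
    case MODIFIED:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back, then share  */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back, invalidate  */
        return MODIFIED;                           /* local reads/writes hit  */
    }
    return s;
}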

Performance of Symmetric Shared-Memory Multiprocessors
Comments: this is not an easy topic, and definitions can vary, as with the single-processor case. Results of studies are given in Section 6.4. Review the specialized definitions on page 561 first:
- Coherence misses
- True sharing misses
- False sharing misses
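
False sharing is easy to produce by accident. A minimal sketch, assuming a 64-byte cache line (the line size and thread counts are assumptions for illustration): the two counters below are logically independent, but because they sit in the same cache line, writes by the two threads bounce the line between caches and show up as coherence misses.

#include <pthread.h>

struct counters {
    long a;                          /* updated only by thread 1 */
    long b;                          /* updated only by thread 2, same line */
};

struct padded_counters {             /* the usual fix: pad so b gets its own line */
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

static struct counters shared;       /* exhibits false sharing */

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.a++;
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}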

Example: CPU execution on a four-processor system
Study Figure 6.13 (page 563) and the accompanying explanation.

What is considered in CPU time measurements
Note that these benchmarks include substantial I/O time, which is ignored in the CPU time measurements. Of course the cache access time is included in the CPU time, since a process will not be switched out on a cache access, as it may be on a memory miss or an I/O request. L2 hits, L3 hits, and pipeline stalls add time to the execution; these are shown graphically.

Commercial Workload Performance

OLTP Performance and L3 Caches
Online transaction processing workloads (part of the commercial benchmark) demand a lot from memory systems. This graph focuses on the impact of L3 cache size.

Memory Access Cycles vs. Processor Count
Note the growth in memory access cycles as the processor count increases. This is mainly due to true and false sharing misses, which increase with the processor count.

Distributed Shared-Memory Architectures
Coherence is again an issue. Study the pages where some of the disadvantages of omitting cache coherence from the hardware are discussed.

Directory-Based Cache Coherence Protocols
Just as with a snooping protocol, there are two primary operations that a directory protocol must handle: read misses and writes to shared, clean blocks. The basic idea: a directory is added to each node.
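
As a hypothetical sketch of that basic idea (the field names, the bit-vector representation, and the handler logic are illustrative, not the textbook's code), a directory entry records the block's state plus the set of nodes holding a copy, and the two primary operations update it as follows:

#include <stdint.h>

#define MAX_NODES 64

typedef enum {
    DIR_UNCACHED,   /* no cache has a copy; memory is up to date         */
    DIR_SHARED,     /* one or more caches hold read-only copies          */
    DIR_EXCLUSIVE   /* exactly one cache holds a (possibly dirty) copy   */
} dir_state;

struct dir_entry {
    dir_state state;
    uint64_t  sharers;   /* bit i set => node i has a copy of the block  */
};

/* Read miss from node n: after supplying the data (and forcing a
 * write-back first if the block was exclusive), record n as a sharer. */
void dir_read_miss(struct dir_entry *e, int n)
{
    e->sharers |= (1ULL << n);
    e->state = DIR_SHARED;
}

/* Write (miss) to a shared, clean block from node n: invalidate every
 * other sharer (messages omitted), then make n the exclusive owner. */
void dir_write_miss(struct dir_entry *e, int n)
{
    e->sharers = (1ULL << n);
    e->state = DIR_EXCLUSIVE;
}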

Directory Protocols We won’t spend as much time in class on these. But look over the state transition diagrams and browse over the performance section.

Synchronization
The key ability needed to synchronize in a multiprocessor setup is the ability to atomically read and modify a memory location. That means no other process can be context-switched in and modify the memory location after our process reads it and before our process modifies it.

Synchronization
“These hardware primitives are the basic building blocks that are used to build a wide variety of user-level synchronization operations, including locks and barriers.” (page 591)
Examples of these atomic operations are given in both code and text form. Read over and understand both the spin lock and barrier concepts; problems on the next exam may well include one of these.
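
As one hedged example (C11 atomics are used here instead of the book's assembly-level exchange sequences, and the names are mine), a spin lock built on an atomic exchange looks roughly like this:

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l)
{
    /* Atomically swap in 1; if the old value was 1, someone else holds the
     * lock, so spin.  The inner loop is a test-and-test-and-set: it spins on
     * an ordinary load until the lock looks free before trying the exchange
     * again. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 1)
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) == 1)
            ;   /* spin locally in this processor's cache */
}

void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

The point of spinning on the plain load is that waiting processors spin on their own cached copy of the lock instead of generating bus or interconnect traffic on every iteration.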

Synchronization Examples
Check out the examples on page 596; they bring out key points in the operation of multiprocessor synchronization that you need to know.
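
In the same spirit, a sense-reversing barrier can be sketched with a shared counter and a sense flag (again a rough sketch with invented names, not the textbook's code): the last thread to arrive resets the counter and flips the sense, releasing everyone else.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* threads that have arrived             */
    atomic_bool sense;     /* global sense, flipped by last arriver */
    int         total;     /* number of threads using the barrier   */
} barrier_t;

void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                     /* flip private sense   */
    if (atomic_fetch_add(&b->count, 1) + 1 == b->total) {
        atomic_store(&b->count, 0);                   /* last arriver resets  */
        atomic_store(&b->sense, *local_sense);        /* ... and releases all */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                         /* spin until released  */
    }
}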

Threads
Threads are “lightweight processes.” Thread switches are much faster than process (context) switches. For this study (page 608), a thread is: Thread = {copy of the registers, separate PC, separate page table}.

Threads and SMT
SMT (simultaneous multithreading) exploits TLP (thread-level parallelism) at the same time it exploits ILP (instruction-level parallelism). And why is SMT good? It turns out that most modern multiple-issue processors have more functional-unit parallelism available than a single thread can effectively use. (See Section 3.6 for more; basically, multiple-issue processors allow multiple instructions to issue in a single clock cycle, with superscalar and VLIW being two basic flavors, but more on that later in the course.)