Lecture 37: Chapter 7: Multiprocessors
Today's topics:
- Introduction to multiprocessors
- Parallelism in software
- Memory organization
- Cache coherence

L.N. Bhuyan Adapted from Patterson’s slides
Presentation transcript:


Introduction
- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

Hardware and Software
- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon E5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
  - Challenge: making effective use of parallel hardware

Parallel Programming
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead
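The three difficulties listed above can be made concrete with a small sketch. It is an illustration only, not something from the slides: the names `partition` and `parallel_sum` are hypothetical, and a thread pool stands in for multiple processors. In CPython the threads will not actually speed up a sum; the point is where partitioning, coordination, and communication appear in the code.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_workers):
    """Split data into contiguous chunks of near-equal size (the partitioning step)."""
    chunk = (len(data) + num_workers - 1) // num_workers  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def parallel_sum(data, num_workers=4):
    """Sum data by summing chunks concurrently, then combining the partial sums."""
    if not data:
        return 0
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Coordination: map() hands one chunk to each worker and waits for all of them.
        partial_sums = list(pool.map(sum, chunks))
    # Communication: the partial results are gathered and combined sequentially.
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101)), num_workers=4))  # 5050
```

The final `sum(partial_sums)` is a sequential step that no number of workers removes, which is one source of the overhead the slide mentions.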

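Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - Speedup = 1 / ((1 - F) + F/100) = 90, where F is the parallelizable fraction of the original time
  - Solving gives F ≈ 0.999
- Need sequential part to be about 0.1% of original time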
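The 0.1% figure can be checked numerically. The sketch below assumes the standard form of Amdahl's Law, in which the parallelizable fraction speeds up perfectly by the number of processors. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Speedup when only parallel_fraction of the work is spread over the processors."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / num_processors)


def required_parallel_fraction(target_speedup, num_processors):
    """Solve 1 / ((1 - F) + F / n) = S for F, giving F = (1 - 1/S) / (1 - 1/n)."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / num_processors)


if __name__ == "__main__":
    f = required_parallel_fraction(target_speedup=90, num_processors=100)
    print(f"parallel fraction needed:  {f:.4f}")
    print(f"sequential fraction left:  {(1.0 - f) * 100:.2f}% of original time")
    print(f"check, speedup at 100 CPUs: {amdahl_speedup(f, 100):.1f}")
```

The sequential fraction comes out at roughly 0.11%, which the slide rounds to 0.1%.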

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add = 110 × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)

Scaling Example (cont)
- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t_add = 10010 × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
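The arithmetic in the two scaling examples follows one simple time model, sketched below. The sketch assumes, as the slide's figures do, that the scalar sum stays sequential and that the matrix sum divides evenly across processors. Times are in units of t_add, and the function names are illustrative.

```python
def run_time(num_scalars, matrix_dim, num_processors):
    """Time in units of t_add: sequential scalar sum plus evenly divided matrix sum."""
    matrix_elements = matrix_dim * matrix_dim
    return num_scalars + matrix_elements / num_processors


def speedup(num_scalars, matrix_dim, num_processors):
    """Speedup relative to running the same workload on one processor."""
    single = run_time(num_scalars, matrix_dim, 1)
    return single / run_time(num_scalars, matrix_dim, num_processors)


if __name__ == "__main__":
    for dim in (10, 100):
        for procs in (10, 100):
            s = speedup(10, dim, procs)
            # "Potential" is the ideal speedup, equal to the processor count.
            print(f"{dim}x{dim} matrix, {procs:3d} processors: "
                  f"speedup {s:5.1f} ({100 * s / procs:.0f}% of potential)")
```

With the larger matrix the sequential 10 additions become a much smaller share of the total, which is why the 100-processor case rises from 10% to 91% of its potential.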

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × t_add
  - 100 processors, 32 × 32 matrix (about 1000 elements)
    - Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example
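The choice of a 32 × 32 matrix for 100 processors can be reproduced with a short sketch. It assumes the same time model as the scaling example (sequential scalar sum, evenly divided matrix sum, times in units of t_add) and keeps the work per processor fixed at the 10 elements each processor had in the 10-processor case. The function names are illustrative.

```python
import math


def run_time(num_scalars, matrix_dim, num_processors):
    """Time in units of t_add: sequential scalar sum plus evenly divided matrix sum."""
    return num_scalars + (matrix_dim * matrix_dim) / num_processors


def weak_scaled_dim(elements_per_processor, num_processors):
    """Square-matrix dimension that keeps the work per processor roughly constant."""
    return round(math.sqrt(elements_per_processor * num_processors))


if __name__ == "__main__":
    for procs in (10, 100):
        dim = weak_scaled_dim(elements_per_processor=10, num_processors=procs)
        print(f"{procs:3d} processors, {dim}x{dim} matrix: "
              f"time = {run_time(10, dim, procs):.2f} t_add")
```

Because 32 × 32 is 1024 elements rather than exactly 1000, the 100-processor time is slightly above 20 × t_add. The slide rounds this away.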

Memory Organization - I
- Centralized shared-memory multiprocessor, or symmetric shared-memory multiprocessor (SMP)
- Multiple processors connected to a single centralized memory
  - Since all processors see the same memory organization → uniform memory access (UMA)
- Shared-memory because all processors can access the entire memory address space
- Can centralized memory emerge as a bandwidth bottleneck?
  - Not if you have large caches and employ fewer than a dozen processors

SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, connected to a shared main memory and an I/O system.]

Memory Organization - II
- For higher scalability, memory is distributed among processors → distributed memory multiprocessors
- If one processor can directly address the memory local to another processor, the address space is shared → distributed shared-memory (DSM) multiprocessor
- If memories are strictly local, we need messages to communicate data → cluster of computers or multicomputers
- Non-uniform memory architecture (NUMA), since local memory has lower latency than remote memory

Distributed Memory Multiprocessors
[Figure: four nodes, each with a processor and caches, local memory, and I/O, connected by an interconnection network.]

SMPs
- Centralized main memory and many caches → many copies of the same data
- A system is cache coherent if a read returns the most recently written value for that word
- Example (X initially holds 1 in memory; values shown are the contents after each event):

  Time  Event                  Cache-A  Cache-B  Memory
  1     CPU-A reads X          1                 1
  2     CPU-B reads X          1        1        1
  3     CPU-A stores 0 in X    0        1        0
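The sequence of events in this example can be replayed with a toy model. This is a sketch under stated assumptions, not the slide's own material: caches are write-through, there is no coherence mechanism at all, and X starts at 1 in memory. The class and function names are hypothetical.

```python
class Cache:
    """A private write-through cache with no coherence support (illustration only)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory, modeled as a dict
        self.lines = {}       # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value                # update the local copy
        self.memory[addr] = value               # write through to main memory


def incoherent_trace():
    """Replay the three events and return (Cache-A, Cache-B, Memory) values of X."""
    memory = {"X": 1}          # assumed initial value
    cache_a, cache_b = Cache(memory), Cache(memory)
    cache_a.read("X")          # CPU-A reads X
    cache_b.read("X")          # CPU-B reads X
    cache_a.write("X", 0)      # CPU-A stores 0 in X
    return cache_a.lines["X"], cache_b.lines["X"], memory["X"]


if __name__ == "__main__":
    print(incoherent_trace())  # (0, 1, 0): Cache-B still holds the stale value 1
```

After the store, a read by CPU-B hits in its own cache and returns 1, which is not the most recently written value. That is the coherence problem.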

Cache Coherence
A memory system is coherent if:
- P writes to X; no other processor writes to X; P reads X and receives the value previously written by P
- P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
- Two writes to the same location by two processors are seen in the same order by all processors (write serialization)
The memory consistency model defines how much time may elapse before the effect of a processor's write is seen by others

Cache Coherence Protocols
- Directory-based: a single location (directory) keeps track of the sharing status of a block of memory
- Snooping: every cache block is accompanied by the sharing status of that block; all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
  - Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
  - Write-update: when a processor writes, it updates other shared copies of that block
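The difference between write-invalidate and write-update can be sketched with a toy snooping bus. This is a heavy simplification under stated assumptions: caches are write-through, blocks carry no sharing state (real protocols track states per block), and the names `Bus`, `SnoopingCache`, and `run_trace` are hypothetical.

```python
class Bus:
    """Shared bus: every write is broadcast so the other caches can snoop it."""

    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, sender, addr, value):
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_write(addr, value)


class SnoopingCache:
    def __init__(self, bus, memory, policy):
        self.bus, self.memory, self.policy = bus, memory, policy
        self.lines = {}
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:               # miss (or previously invalidated)
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr, value)  # let other caches react first
        self.lines[addr] = value
        self.memory[addr] = value                # write-through, for simplicity

    def snoop_write(self, addr, value):
        if addr not in self.lines:
            return                               # no copy here, nothing to do
        if self.policy == "invalidate":
            del self.lines[addr]                 # drop the now-stale copy
        else:                                    # "update"
            self.lines[addr] = value             # refresh the copy in place


def run_trace(policy):
    """Return (does Cache-B still hold X after A's write?, value CPU-B then reads)."""
    bus, memory = Bus(), {"X": 1}
    cache_a = SnoopingCache(bus, memory, policy)
    cache_b = SnoopingCache(bus, memory, policy)
    cache_a.read("X")
    cache_b.read("X")
    cache_a.write("X", 0)
    b_has_copy = "X" in cache_b.lines
    return b_has_copy, cache_b.read("X")


if __name__ == "__main__":
    print("invalidate:", run_trace("invalidate"))  # (False, 0)
    print("update:    ", run_trace("update"))      # (True, 0)
```

Under both policies CPU-B reads the new value 0. With write-invalidate it pays a miss to re-fetch the block, while with write-update every write costs bus traffic to push the new value to the sharers.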