Slide 1: EE382 Processor Design, Winter 1998
Chapter 8 Lectures: Multiprocessors, Part I
Michael Flynn

Slide 2: Processor Issues for MP
Initialization
Interrupts
Virtual Memory
 – TLB coherency
Emphasis on Physical Memory and System Interconnect
Physical Memory
 – Coherency
 – Synchronization
 – Consistency

Slide 3: Outline
Partitioning
 – Granularity
 – Overhead and efficiency
Multi-threaded MP
Shared Bus
 – Coherency
 – Synchronization
 – Consistency
Scalable MP
 – Cache directories
 – Interconnection networks
 – Trends and tradeoffs
Additional References
 – Hennessy and Patterson, CAQA, Chapter 8
 – Culler, Singh, Gupta, Parallel Computer Architecture: A Hardware/Software Approach (book.alpha/index.html)

Slide 4: Representative System
[Block diagram: CPU with pipelines, registers, L1 Icache, L1 Dcache, and L2 cache, connected through a chipset to memory and I/O bus(es)]

Slide 5: Shared Memory MP
Shared-Memory
 – Consider systems with a single memory address space
 – Contrasted to multi-computers
   separate memory address spaces
   message passing for communication and synchronization
   Example: Network of Workstations

Slide 6: Shared Memory MP
Types of shared-memory MP
 – multithreaded or shared-resource MP
 – shared-bus MP (broadcast protocols)
 – scalable MP (networked protocols)
Issues
 – partitioning of the application into p parallel tasks
 – scheduling of tasks to minimize dependency Tw
 – communications and synchronization

Slide 7: Partitioning
If a uniprocessor executes a program in time T1 with O1 operations, and a p-processor parallel machine executes it in time Tp with Op operations, then Op > O1 due to task overhead.
Also, Sp = T1/Tp < p, where p is the number of processors in the system; p is also the amount of parallelism (or the degree of partitioning) available in the program.

Slide 8: Granularity
[Plot: speedup Sp versus grain size. At fine grain, speedup is overhead limited; at coarse grain, it is limited by parallelism and load balance.]

Slide 9: Task Scheduling
Static: at compile time
Dynamic: at run time
 – load balancing by the system
 – clustering of tasks with inter-processor communication
 – scheduling with compiler assistance

Slide 10: Overhead
Overhead limits Sp to less than p with p processors.
Efficiency = Sp/p = T1/(Tp * p)
Lee's equal-work hypothesis: Sp < p/ln(p)
Task overhead is due to
 – communication delays
 – context switching
 – cold cache effects
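As a quick numeric illustration of these formulas, here is a minimal C sketch (not from the slides; the timings T1 and Tp and the processor count are made-up values) that computes speedup, efficiency, and Lee's p/ln(p) bound.

#include <math.h>
#include <stdio.h>

int main(void) {
    double T1 = 100.0;   /* assumed uniprocessor run time          */
    double Tp = 16.0;    /* assumed run time on p processors       */
    int    p  = 8;       /* assumed processor count                */

    double Sp  = T1 / Tp;        /* speedup:    Sp = T1/Tp          */
    double eff = Sp / p;         /* efficiency: Sp/p = T1/(Tp * p)  */
    double lee = p / log(p);     /* Lee's bound: Sp < p/ln(p)       */

    printf("Sp = %.2f  efficiency = %.2f  Lee bound = %.2f\n", Sp, eff, lee);
    return 0;
}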

Slide 11: Multi-threaded MP
Multiple processors sharing many execution units
 – each processor has its own state
 – they share function units, caches, TLBs, etc.
Types
 – time-multiplex multiple processors so that there are no pipeline breaks, etc.
 – pipelined processor: switch context on any processor delay (cache miss, etc.)
Optimizes multi-thread throughput, but limits single-thread performance
 – See Study 8.1 on p. 537
Processors share the D-cache

Slide 12: Shared-Bus MP
Processors with their own D-caches require a cache coherency protocol.
The simplest protocols have processors snoop on writes to memory that occur on the shared bus.
If the write is to a line present in the snooping processor's own cache, either invalidate or update that line.
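To make the invalidate-versus-update choice concrete, here is a minimal C sketch of the snoop decision for a single cache line. The three-state layout, 64-byte line size, and function names are illustrative assumptions, not the exact protocol on the slide.

/* Per-line snoop logic (illustrative). */
typedef enum { INVALID, VALID, DIRTY } LineState;

typedef struct {
    unsigned long tag;            /* line address (addr / 64)        */
    LineState     state;
    unsigned char data[64];
} CacheLine;

/* Called when another processor's write to 'addr' is seen on the bus. */
void snoop_bus_write(CacheLine *line, unsigned long addr,
                     const unsigned char *bus_data, int write_update) {
    if (line->state == INVALID || line->tag != addr / 64)
        return;                               /* line not cached here        */

    if (write_update) {                       /* write-update protocol       */
        for (int i = 0; i < 64; i++)          /* refresh local copy from bus */
            line->data[i] = bus_data[i];
    } else {                                  /* write-invalidate protocol   */
        line->state = INVALID;                /* next local read will miss   */
    }
}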

Slide 13: Coherency, Synchronization, and Consistency
Coherency
 – The property that the value returned by a read is the value of the latest write
 – Required for process migration even without sharing
Synchronization
 – Instructions that control access to critical sections of data shared by multiple processors
Consistency
 – Rules that allow memory references to be reordered, which may lead to differences in the memory state observed by multiple processors

Slide 14: Shared-Bus Cache Coherency Protocols
Write invalidate, simple: 3 states (V, I, D)
Berkeley (write invalidate): 4 states (V, S, D, I)
Illinois (write invalidate): 4 states (M, E, S, I)
Dragon (write update): 5 states (M, E, S, D, I)
Simpler protocols have somewhat more memory bus traffic.

Slide 15: MESI Protocol
[MESI state-transition diagram: Modified, Exclusive, Shared, Invalid]
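Since the state diagram itself is not reproduced here, the following C sketch encodes the usual textbook MESI transitions for local reads/writes and snooped bus events; it has not been checked against the slide's diagram.

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } Mesi;

/* Local read. 'shared' is the snoop response: another cache holds the line. */
Mesi mesi_on_cpu_read(Mesi s, int shared) {
    if (s == MESI_I)                       /* read miss: issue BusRd         */
        return shared ? MESI_S : MESI_E;
    return s;                              /* M, E, S: read hit, unchanged   */
}

/* Local write: I issues BusRdX, S invalidates other copies, E upgrades
   silently, M is a hit; all end in Modified. */
Mesi mesi_on_cpu_write(Mesi s) {
    (void)s;
    return MESI_M;
}

/* Another processor's read seen on the bus (M writes its data back). */
Mesi mesi_on_bus_read(Mesi s) {
    return (s == MESI_M || s == MESI_E) ? MESI_S : s;
}

/* Another processor's write (BusRdX/invalidate) seen on the bus. */
Mesi mesi_on_bus_write(Mesi s) {
    (void)s;                               /* M writes its data back first   */
    return MESI_I;
}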

Slide 16: Coherence Overhead for Parallel Processing
Results for 4 parallel programs with 16 CPUs and 64 KB caches
Coherence traffic is a substantial portion of bus demand
Large blocks can lead to false sharing
[Hennessy and Patterson, CAQA, Fig. 8.15]

Slide 17: Synchronization Primitives
Communicating Sequential Processes

Process A              Process B
acquire semaphore      acquire semaphore
access shared data     access shared data
(read/modify/write)    (read/modify/write)
release semaphore      release semaphore
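A minimal C sketch of this acquire/access/release pattern, using a POSIX semaphore as the lock; the shared counter and the thread plumbing are illustrative additions, not part of the slide.

#include <pthread.h>
#include <semaphore.h>

static sem_t lock;            /* binary semaphore guarding the shared data */
static long  shared_counter;  /* example shared data                       */

static void *worker(void *arg) {
    (void)arg;
    sem_wait(&lock);          /* acquire semaphore                         */
    shared_counter++;         /* access shared data (read/modify/write)    */
    sem_post(&lock);          /* release semaphore                         */
    return 0;
}

int main(void) {
    pthread_t a, b;
    sem_init(&lock, 0, 1);                /* initially free                */
    pthread_create(&a, 0, worker, 0);     /* Process A                     */
    pthread_create(&b, 0, worker, 0);     /* Process B                     */
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;
}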

Slide 18: Synchronization Primitives
Acquiring the semaphore generally requires an atomic read-modify-write operation on a location
 – Ensures that only one process enters the critical section
 – Test&Set, Locked-Exchange, Compare&Exchange, Fetch&Add, Load-Locked/Store-Conditional
Looping on a semaphore with a test-and-set or similar instruction is called a spin lock
 – Techniques to minimize overhead from spin contention: Test + Test&Set, exponential backoff
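The following C11-atomics sketch combines the two contention-reduction techniques named above, Test + Test&Set and exponential backoff; the delay loop and the backoff cap are arbitrary choices, not from the slides.

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held   */

static void backoff_pause(unsigned *delay) {
    for (volatile unsigned i = 0; i < *delay; i++)
        ;                                   /* crude delay loop             */
    if (*delay < (1u << 16))
        *delay <<= 1;                       /* exponential backoff, capped  */
}

void spin_lock(spinlock_t *l) {
    unsigned delay = 1;
    for (;;) {
        /* "test": spin on an ordinary read to avoid extra bus traffic */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            backoff_pause(&delay);
        /* "test&set": try to grab the lock with an atomic exchange */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;                         /* acquired                     */
        backoff_pause(&delay);              /* lost the race: back off      */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}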

Slide 19: Memory Consistency Problem
Can the tests at L1 and L2 below both succeed?

Process A          Process B
A = 0;             B = 0;
A = 1;             B = 1;
L1: if (B==0)      L2: if (A==0)

Memory Consistency Model
 – Rules for allowing memory references by a program executing on one processor to be observed in a different order by a program executing on another processor
 – Memory Fence operations explicitly control the ordering of memory references
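A runnable C version of the slide's example (variable names kept, thread plumbing added). With relaxed atomic ordering, or on hardware with store buffers and no fences, both tests can indeed succeed.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A, B;
static int a_saw_zero, b_saw_zero;

static void *procA(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);       /* A = 1;         */
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0)  /* L1: if (B==0)  */
        a_saw_zero = 1;
    return 0;
}

static void *procB(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);       /* B = 1;         */
    if (atomic_load_explicit(&A, memory_order_relaxed) == 0)  /* L2: if (A==0)  */
        b_saw_zero = 1;
    return 0;
}

int main(void) {
    pthread_t ta, tb;
    A = 0; B = 0;                                             /* A = 0; B = 0;  */
    pthread_create(&ta, 0, procA, 0);
    pthread_create(&tb, 0, procB, 0);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    /* Sequential consistency forbids both tests succeeding; relaxed ordering
       (or store buffers without fences) allows it. */
    printf("both tests succeeded: %d\n", a_saw_zero && b_saw_zero);
    return 0;
}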

Slide 20: Memory Consistency Models (Part I)
Sequential consistency (strong ordering)
 – All memory ops execute in some sequential order; the memory ops of each processor appear in program order
Processor consistency (Total Store Ordering)
 – Writes are buffered and stored in order
 – Reads are performed in order, but can bypass writes
 – The processor flushes the store buffer when a synchronization instruction is executed
Weak consistency
 – Memory references are generally allowed in any order
 – Programs enforce ordering when required for shared data by executing Memory Fence instructions
   All memory references for previous instructions complete before the fence
   No memory references for subsequent instructions issue before the fence
 – Synchronization instructions act like fences
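Continuing the Slide 19 example, here is a sketch of how an explicit fence restores the expected ordering: with a full fence between each processor's write and its test, the outcome where both tests succeed is forbidden. C11's atomic_thread_fence stands in for the Memory Fence instruction described above.

#include <stdatomic.h>

static atomic_int A, B;       /* same roles as in the Slide 19 example */

int procA_fenced(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);          /* A = 1;      */
    atomic_thread_fence(memory_order_seq_cst);                   /* memory fence */
    return atomic_load_explicit(&B, memory_order_relaxed) == 0;  /* L1 test     */
}

int procB_fenced(void) {
    atomic_store_explicit(&B, 1, memory_order_relaxed);          /* B = 1;      */
    atomic_thread_fence(memory_order_seq_cst);                   /* memory fence */
    return atomic_load_explicit(&A, memory_order_relaxed) == 0;  /* L2 test     */
}
/* With the fences in place, procA_fenced() and procB_fenced() cannot both
   return 1 in the same run. */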

Slide 21: Memory Consistency Models (Part II)
Release consistency
 – Distinguish between acquire and release of a semaphore, before and after access to shared data
 – Acquire semaphore
   Ensure that the semaphore is acquired before any reads or writes by subsequent instructions (which may access shared data)
 – Release semaphore
   Ensure that any writes by previous instructions (which may access shared data) are visible before the semaphore is released
[Hennessy and Patterson, CAQA, Fig. 8.39]
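A small C11 sketch of release consistency as exposed by acquire/release memory orders on a simple lock; the spinning lock itself is illustrative, not taken from the slides.

#include <stdatomic.h>

static atomic_int sem;        /* 0 = free, 1 = held */
static int shared_data;

void acquire(void) {
    /* acquire ordering: no later reads or writes may move before this */
    while (atomic_exchange_explicit(&sem, 1, memory_order_acquire) != 0)
        ;                                     /* spin                     */
}

void release(void) {
    /* release ordering: all earlier writes become visible before the
       semaphore is seen as free by the next acquirer */
    atomic_store_explicit(&sem, 0, memory_order_release);
}

void update_shared(void) {
    acquire();
    shared_data++;            /* shared-data access bracketed by acquire/release */
    release();
}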

Slide 22: Pentium Processor Example
2-Level Cache Hierarchy
 – Inclusion enforced
 – Snoops on the system bus need only interrogate the L2
Cache Policy
 – Write-back supported
 – Write-through optional, selected by page or line
 – Write buffers used
Cache Coherence
 – MESI at both levels
Memory Consistency
 – Processor ordering
Issues
 – Writes that hit an E line on-chip
 – Writes that hit an E or M line while the write buffer is occupied
[Block diagram: CPU pipelines, data cache, write buffer, L2 cache, cache write buffer, system bus]

Slide 23: Shared-Bus Performance Models
Null binomial model
 – Resubmissions don't automatically occur, e.g., multithreaded MP
 – See Study 8.1, page 537
Resubmissions model
 – Requests remain on the bus until serviced
 – See pp. and the cache example posted on the web
Bus traffic usually limits the number of processors
 – A bus optimized for MP supports more processors, but at a high cost for small systems
 – A bus that incrementally extends a uniprocessor is limited to 2-4 processors
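For a feel of why bus traffic limits processor count, here is a rough sketch of one common binomial occupancy estimate: each of p processors is assumed to issue a bus request with probability a per cycle, with no resubmission. This is an assumption about the model's form, not necessarily the exact model used in the text.

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 0.1;                           /* assumed per-processor request
                                                 probability per bus cycle    */
    for (int p = 2; p <= 32; p *= 2) {
        double busy = 1.0 - pow(1.0 - a, p);  /* P(at least one request)      */
        printf("p = %2d: bus busy fraction = %.2f\n", p, busy);
    }
    return 0;
}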