Graduate Computer Architecture I Lecture 10: Shared Memory Multiprocessors Young Cho.



2 - CSE/ESE 560M – Graduate Computer Architecture I
Moore's Law
ILP has been extended with multiprocessors (TLP) since the 60s. Now it is more critical and more attractive than ever with chip multiprocessors... and it is all about memory systems!
[Figure: transistors per chip over time, from the i4004, i8008, i8080, i8086, i80286, and i80386 through the R2000, R3000, Pentium, and R10000, growing from about 1,000 to over 100,000,000 across the bit-level, instruction-level, and thread-level parallelism eras]

3 - Uniprocessor View
[Figure: execution time of one processor, split into busy-useful and data-local components]
Performance depends heavily on the memory hierarchy. Time spent by a program:
– Time_prog(1) = Busy(1) + Data Access(1)
– Divide by the cycle count to get the CPI equation
Data access time can be reduced by:
– Optimizing the machine: bigger caches, lower latency, ...
– Optimizing the program: temporal and spatial locality
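The busy/data-access split above reduces to the usual CPI arithmetic. A minimal sketch, with all numeric values (miss rate, penalty, memory references per instruction) chosen purely for illustration:

```python
# Sketch of Time_prog(1) = Busy(1) + Data Access(1).
# Every number below is an illustrative assumption, not from the lecture.

def exec_time_seconds(instructions, base_cpi, mem_refs_per_instr,
                      miss_rate, miss_penalty_cycles, clock_hz):
    """Total time = busy (compute) cycles + data-access (stall) cycles."""
    busy_cycles = instructions * base_cpi
    stall_cycles = (instructions * mem_refs_per_instr
                    * miss_rate * miss_penalty_cycles)
    return (busy_cycles + stall_cycles) / clock_hz

def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate,
                  miss_penalty_cycles):
    """Effective CPI = base CPI + memory-stall cycles per instruction."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles

# A bigger cache (lower miss rate) shrinks only the data-access term:
print(effective_cpi(1.0, 0.3, 0.05, 100))  # roughly 2.5
print(effective_cpi(1.0, 0.3, 0.01, 100))  # roughly 1.3
```

This is one way to see the slide's point that either the machine (the penalty and miss-rate factors) or the program (the locality behind the miss rate) can be optimized.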

4 - Uniprocessor vs. Multiprocessor
[Figure: (a) sequential execution time on one processor vs. (b) parallel execution on four processors P0-P3, with time divided into busy-useful, busy-overhead, data-local, data-remote, and synchronization components]

5 - Multiprocessor
A collection of communicating processors
– Goals: balance the load, reduce inherent communication, and reduce extra work
A multi-cache, multi-memory system
– The role of these components is essential regardless of programming model
– The programming model and communication abstraction affect specific performance tradeoffs

6 - Parallel Architecture
– Computer architecture
– Communication architecture
Small-scale shared memory
– Extends the memory system to support multiple processors
– Good for multiprogramming throughput and parallel computing
– Allows fine-grain sharing of resources
Characterization
– Naming
– Synchronization
– Latency and bandwidth = performance

7 - Naming
– What data is shared
– How it is addressed
– What operations can access the data
– How processes refer to each other
The choice of naming affects:
– The code produced by a compiler: via a load, just remember the address; for message passing, keep track of the processor number and a local virtual address
– The replication of data: via load in a cache memory hierarchy, or via software replication and consistency

8 - Naming
Global physical address space
– Any processor can generate an address and access it in a single operation
– The memory can be anywhere; virtual address translation handles it
Global virtual address space
– If the address space of each process can be configured to contain all shared data of the parallel program
Segmented shared address space
– Locations are named uniformly for all processes of the parallel program

9 - Synchronization
To cooperate, processes must coordinate
– Message passing is implicit coordination with the transmission or arrival of data
Shared address space
– Additional operations are needed to coordinate explicitly: e.g., write a flag, awaken a thread, interrupt a processor
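The "write a flag, awaken a thread" style of explicit coordination can be sketched with stock Python threading; the producer/consumer structure and all names here are hypothetical, just to make the contrast with message passing concrete:

```python
# Shared-address-space coordination: write shared data, then set a
# flag (an Event) to awaken the waiting thread. Illustrative sketch.
import threading

shared = {"data": None}
ready = threading.Event()   # the "flag" the consumer waits on

def producer():
    shared["data"] = 42     # write the shared data first...
    ready.set()             # ...then set the flag / awaken the waiter

def consumer(out):
    ready.wait()            # explicit coordination: block until flagged
    out.append(shared["data"])

results = []
t_cons = threading.Thread(target=consumer, args=(results,))
t_prod = threading.Thread(target=producer)
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
print(results)  # [42]
```

In message passing, by contrast, the arrival of the data itself would carry this coordination; no separate flag operation is needed.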

10 - Performance
Latency & bandwidth
– Utilize the normal migration of data within the storage hierarchy to avoid long-latency operations and to reduce bandwidth demand
– The shared medium is economical but has a fundamental bandwidth limit
– Focus on eliminating unnecessary traffic

11 - Memory Configurations
[Figure: three organizations, in order of increasing scale: (a) shared cache, with processors P1..Pn behind a switch sharing an interleaved first-level cache and interleaved main memory; (b) centralized "dance hall" memory with uniform memory access, where each processor has a private cache and reaches memory over an interconnection network; (c) distributed memory with non-uniform memory access, where each processor has a private cache and local memory attached to the network]

12 - Shared Cache
Alliant FX-8
– Early 80s
– Eight 68020s with a crossbar to a 512 KB interleaved cache
Encore & Sequent
– First 32-bit micros (NS32032)
– Two to a board with a shared cache

13 - Advantages
Single cache
– Only one copy of any cached block
Fine-grain sharing
– The Cray X-MP has shared registers!
Potential for positive interference
– One processor prefetches data for another
Smaller total storage
– Only one copy of code/data
Data can be shared within a line
– Long lines without false sharing

14 - Disadvantages
Fundamental bandwidth limitation
Increased latency for all accesses
– Crossbar (multiple ports)
– Larger cache
– L1 hit time determines the processor cycle time
Potential for negative interference
– One processor flushes data needed by another
Many L2 caches are shared today; CMPs make cache sharing attractive

15 - Bus-Based Symmetric Shared Memory
[Figure: processors P1..Pn, each with a private cache, sharing a bus with memory and I/O devices]
Dominates the server market
– A building block for larger systems; arriving at the desktop
Attractive as throughput servers and for parallel programs
– Fine-grain resource sharing
– Uniform access via loads/stores
– Automatic data movement and coherent replication in caches
– A cheap and powerful extension
Normal uniprocessor mechanisms are used to access data
– The key is extending the memory hierarchy to support multiple processors, as in chip multiprocessors

16 - Caches are Critical for Performance
Reduce average latency
– Automatic replication closer to the processor
Reduce average bandwidth demand
Data is logically transferred from producer to consumer through memory
– store reg --> mem
– load reg <-- mem
Many processors can share data efficiently
What happens when the store and the load are executed on different processors?

17 - Cache Coherence Problem
[Figure: P1, P2, and P3, each with a private cache, on a bus to memory and I/O devices; u = 5 in memory. Events: (1) P1 reads u; (2) P3 reads u; (3) P3 writes u = 7; (4) P1 reads u; (5) P2 reads u]
– Processors see different values for u after event 3
– With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
– Processes accessing main memory may see a very stale value
– Unacceptable to programs, and frequent!
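The scenario above can be reproduced with a toy model of private write-back caches and no coherence protocol at all; the code is illustrative, not the lecture's:

```python
# Toy model of the u = 5 / u = 7 scenario: each processor has a private
# write-back cache and there is NO coherence protocol.

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # addr -> [value, dirty]

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):            # write-back: cache only,
        self.lines[addr] = [value, True]     # memory is NOT updated

memory = {"u": 5}
p1, p2, p3 = Cache(memory), Cache(memory), Cache(memory)

p1.read("u")          # event 1: P1 caches u = 5
p3.read("u")          # event 2: P3 caches u = 5
p3.write("u", 7)      # event 3: P3 writes u = 7 (dirty in P3 only)

print(p1.read("u"))   # event 4: P1 still sees the stale 5
print(memory["u"])    # memory was not updated either: 5
print(p3.read("u"))   # only P3 sees 7
```

After event 3 the system holds three inconsistent views of u (5 in P1's cache, 5 in memory, 7 in P3's cache), which is exactly the coherence problem the slide describes.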

18 - Cache Coherence
Caches play a key role in all cases
– They reduce average data access time
– They reduce the bandwidth demands placed on the shared interconnect
Private processor caches create a problem
– Copies of a variable can be present in multiple caches
– A write by one processor may not become visible to the others, which keep accessing the stale value in their caches
– This is the cache coherence problem
What do we do about it?
– Organize the memory hierarchy to make it go away
– Detect the problem and take actions to eliminate it

19 - Snoopy Cache-Coherence Protocols
[Figure: each cache block carries state, address tag, and data; cache controllers observe the shared bus]
The bus is a broadcast medium, so each cache controller "snoops" all transactions
– It takes action to ensure coherence: invalidate, update, or supply the value
– The action depends on the state of the block and on the protocol

20 - Write-through Invalidate
[Figure: the same three-processor example; with a write-through invalidate protocol, P3's write of u = 7 in event 3 goes through to memory and invalidates the other cached copies, so the subsequent reads in events 4 and 5 see u = 7]

21 - Architectural Building Blocks
Bus transactions
– A fundamental system design abstraction
– A single set of wires connects several devices
– The bus protocol covers arbitration, command/address, and data
– Every device observes every transaction
Cache block state transition diagram
– An FSM specifying how the disposition of a block changes
– e.g., invalid, valid, dirty

22 - Design Choices
[Figure: the cache controller sits between the processor (loads/stores) and the bus (snoop), managing the state, tag, and data of each cache block]
Cache controller
– Updates the state of blocks in response to processor and snoop events
– Generates bus transactions
Action choices
– Write-through vs. write-back
– Invalidate vs. update

23 - Write-through Invalidate Protocol
[Figure: processors P1..Pn with private caches (state, tag, and data per block) on a bus with memory and I/O devices]
Two states per block in each cache
– Hardware state bits are associated with blocks that are in the cache
– Other blocks can be seen as being in an invalid (not-present) state in that cache
Writes invalidate all other caches
– A block can have multiple simultaneous readers, but a write invalidates their copies
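A minimal sketch of this two-state protocol, assuming a toy bus object that lets every other cache snoop each write; the names and structure are illustrative, not the lecture's:

```python
# Two-state (valid/invalid) write-through invalidate protocol: every
# write goes through to memory over the bus, and every other cache
# snoops the write and invalidates its own copy.

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, src, addr, value):
        self.memory[addr] = value            # write-through to memory
        for cache in self.caches:            # broadcast: everyone snoops
            if cache is not src:
                cache.snoop_invalidate(addr)

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.valid = {}                      # addr -> value (valid blocks)
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.valid:           # invalid/not present: miss
            self.valid[addr] = self.bus.memory[addr]
        return self.valid[addr]

    def write(self, addr, value):
        self.valid[addr] = value
        self.bus.write(self, addr, value)    # every write goes on the bus

    def snoop_invalidate(self, addr):
        self.valid.pop(addr, None)           # valid -> invalid

bus = Bus({"u": 5})
p1, p3 = Cache(bus), Cache(bus)
p1.read("u")           # P1 caches u = 5 (multiple readers are fine)
p3.write("u", 7)       # P3's write invalidates P1's copy...
print(p1.read("u"))    # ...so P1 misses and re-reads 7 from memory
```

Compare this with the incoherent write-back model earlier: the invalidation turns P1's stale hit into a miss, which is exactly how coherence is restored.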

24 - Write-through vs. Write-back
The write-through protocol is simpler
– Every write is observable: every write goes on the bus, so only one write can take place at a time in any processor
But it uses a lot of bandwidth
– Example: 200 MHz, dual issue, CPI = 1, with 15% of instructions being 8-byte stores
– 30 M stores per second per processor, or 240 MB/s per processor
– A 1 GB/s bus can support only about 4 processors without saturating
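The slide's bandwidth arithmetic can be checked directly. The numbers are the slide's; note that its result assumes an effective rate of one instruction per cycle, so the dual-issue detail does not enter the arithmetic:

```python
# Re-deriving the write-through bandwidth example from the slide.
clock_hz        = 200e6   # 200 MHz
instrs_per_cyc  = 1       # effective rate implied by the slide's numbers
store_fraction  = 0.15    # 15% of instructions are stores
bytes_per_store = 8

stores_per_sec = clock_hz * instrs_per_cyc * store_fraction
bytes_per_sec  = stores_per_sec * bytes_per_store
bus_bandwidth  = 1e9      # 1 GB/s bus

print(stores_per_sec / 1e6)            # about 30 M stores/s
print(bytes_per_sec / 1e6)             # about 240 MB/s per processor
print(bus_bandwidth // bytes_per_sec)  # about 4 processors before saturation
```

The point of the exercise: write-through's per-processor traffic is fixed by the store rate, so the bus, not the processors, caps the machine's scale.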

25 - Invalidate vs. Update
The basic question is one of program behavior
– Is a block written by one processor later read by others before it is overwritten?
Invalidate
– If yes: the readers will take a miss
– If no: multiple writes proceed without additional traffic
Update
– If yes: it avoids misses on the later references
– If no: it generates multiple useless updates
Requirements
– Correctness
– A good match to the patterns of program references
– Acceptable hardware complexity
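The "written by one, later read by others" question can be made concrete by counting bus messages under each policy for two access patterns. The cost model below (one message per read miss, invalidation, or remote update) is a deliberate oversimplification for illustration:

```python
# Toy bus-traffic comparison for invalidate vs. update on one block.
# Cost model (illustrative only): 1 message per read miss, per
# invalidation broadcast, or per update broadcast to remote sharers.

def bus_messages(trace, policy):
    """trace: list of ('r', proc) / ('w', proc) events on one block."""
    msgs = 0
    sharers = set()                  # processors holding a valid copy
    for op, proc in trace:
        if op == 'r':
            if proc not in sharers:
                msgs += 1            # read miss: fetch over the bus
            sharers.add(proc)
        else:                        # write
            if sharers - {proc}:     # remote copies exist
                msgs += 1            # invalidate or update them
                if policy == 'invalidate':
                    sharers = set()  # remote copies dropped
            sharers.add(proc)
    return msgs

# Writer/reader ping-pong: update avoids the readers' re-fetch misses...
ping_pong   = [('w', 0), ('r', 1), ('w', 0), ('r', 1)]
# ...but repeated writes with no intervening reads make updates pure waste.
write_burst = [('r', 1), ('w', 0), ('w', 0), ('w', 0)]

print(bus_messages(ping_pong, 'invalidate'),
      bus_messages(ping_pong, 'update'))      # 3 2
print(bus_messages(write_burst, 'invalidate'),
      bus_messages(write_burst, 'update'))    # 2 4
```

Neither policy wins on both patterns, which is the slide's point: the right choice depends on the program's reference patterns (and on hardware complexity).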

26 - Conclusion
Parallel computers
– Shared memory (this lecture)
– Distributed memory (next lecture)
– Experience: the biggest issue today is memory bandwidth
Types of shared-memory multiprocessors
– Shared interleaved cache
– Distributed cache
Bus-based symmetric shared memory
– The most popular at smaller scale
– Fundamental problem: cache coherence
– Solution example: snoopy cache
– Design choices: write-through vs. write-back, invalidate vs. update