Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Snyder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also "Computer Architecture: A Quantitative Approach", J.L. Hennessy and D.A. Patterson, Morgan Kaufmann

Grama fig 2.5: three shared-address-space organizations connected by an interconnection network — UMA (Uniform Memory Access), UMA with a cache per processor, and NUMA (Non-Uniform Memory Access) with a cache and local memory per processor.

Shared Address Space Systems
Systems with caches but otherwise flat memory are generally called UMA
If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
–How to do this, and what O/S support exists, is another matter
–(man numa on Linux gives details of NUMA support)
A global address space is easier to program
–Read-only interactions are invisible to the programmer and coded as in a sequential program
–Read/write interactions are harder: they require mutual exclusion for concurrent accesses
Programmed using threads and directives
–We will consider pthreads and OpenMP
Synchronization using locks and related mechanisms

Caches on Multiprocessors
Multiple copies of the same data word may be manipulated by two or more processors at the same time
Two requirements
–An address translation mechanism that locates a memory word in the system
–Concurrent operations on multiple copies must have well-defined semantics
The latter is generally known as a cache coherency protocol
–(I/O using DMA on machines with caches also leads to coherency issues)
Some machines only provide shared-address-space mechanisms and leave coherence to the programmer
–The Cray T3D provided get and put operations and cache invalidate

Shared-Address-Space and Shared-Memory Computers
Shared-memory is historically used for architectures in which memory is physically shared among the processors, and all processors have equal access to any memory segment
–This is identical to the UMA model
Compare distributed-memory computers, where different memory segments are physically associated with different processing elements
Either physical model can present the logical view of a disjoint or a shared-address-space platform
–A distributed-memory shared-address-space computer is a NUMA system

Grama fig 2.21: invalidate vs. update protocols. P0 and P1 both load x (= 1) into their caches; P0 then writes x = 3. Under the invalidate protocol P1's copy is marked invalid; under the update protocol P1's copy is updated to 3.

Update v Invalidate
Update Protocol
–When a data item is written, all of its copies in the system are updated
Invalidate Protocol (most common)
–Before a data item is written, all other copies are marked as invalid
Comparison
–Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
–With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
–The delay between writing a word on one processor and reading the written data on another is usually less for update
False sharing: two processors modify different parts of the same cache line
–An invalidate protocol leads to ping-ponged cache lines
–An update protocol performs reads locally but must broadcast every update

Implementation Techniques
On small-scale bus-based machines
–A processor must obtain access to the bus to broadcast a write invalidation
–With two competing processors, the first to gain access to the bus invalidates the other's data
A cache miss needs to locate the top (most recent) copy of the data
–Easy for a write-through cache
–For write-back, each processor snoops the bus and responds by providing the data if it has the top copy
For writes we would like to know whether any other copies of the block are cached
–i.e. whether a write-back cache needs to put details on the bus
–Handled by having a tag to indicate shared status
Minimizing processor stalls
–Either by duplication of tags or by having multiple inclusive caches

Grama fig: 3-state (MSI) cache coherency protocol — state diagram over Invalid, Shared and Modified, with transitions on read, write, flush, c_read and c_write.
read: local read
c_read: coherency read, i.e. a read on a remote processor gives rise to the shown transition in the local cache
(c_write: the analogous transition for a remote write)

MSI Coherency Protocol — worked example (Grama fig): two processors execute interleaved reads and updates of x (initially 5, Modified) and y (initially 12, Modified), held in different caches; the trace shows each line's state in both caches changing over time between M, S and I as local operations and the coherency operations they trigger are applied.

Snoopy Cache Systems
All caches broadcast all transactions
–Suited to bus or ring interconnects
All processors monitor the bus for transactions of interest
Each processor's cache has a set of tag bits that determine the state of each cache block
–Tags are updated according to the state diagram of the relevant protocol
E.g. if the snoop hardware detects that a read has been issued for a cache block of which it has a dirty copy, it asserts control of the bus and puts the data out
What sort of data access characteristics are likely to perform well/badly on snoopy-based systems?

Grama fig 2.24: a snoopy cache based system — processors, each with snoop hardware, tags and a cache, attached to memory over a shared address/data bus; the snoop hardware watches bus transactions and flags dirty lines.

Directory Cache Based Systems
The need to broadcast is clearly not scalable
–The solution is to send information only to the processing elements specifically interested in that data
This requires a directory to store the information
–Augment global memory with a presence bitmap to indicate which caches each memory block is located in

Directory Based Cache Coherency
Must handle a read miss and a write to a shared, clean cache block
To implement the directory we must track the state of each cache block
A simple protocol might be:
–Shared: one or more processors have the block cached, and the value in memory is up to date
–Uncached: no processor has a copy
–Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
We also need to track the processors that have copies of a shared cache block, or ownership of an exclusive one

Grama fig 2.25: directory based cache coherency — a centralized variant with one directory (status, data and presence bits per block) attached to the interconnection network, and a distributed variant in which each node's memory carries its own presence bits and state.

Directory Based Systems How much memory is required to store the directory? What sort of data access characteristics are likely to perform well/badly on directory based systems? –How do distributed and centralized systems compare?

Costs on SGI Origin 3000 (processor clock cycles)

                                              <=16 CPU   >16 CPU
Cache hit                                         1          1
Cache miss to local memory                       85
Cache miss to remote home directory
Cache miss to remotely cached data (3 hop)

(Remaining entries were lost in transcription.) Data from: "Computer Architecture: A Quantitative Approach", David A. Patterson, John L. Hennessy, David Goldberg, 3rd ed., Morgan Kaufmann, 2003

Real Cache Coherency Protocols From Wikipedia: "Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect. The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache. The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor owning the cache line services requests for that data]. The MOESI protocol does both of these things. The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data."

MESI (on a bus)