CS425/CSE424/ECE428 – Distributed Systems 2011-10-27Nikita Borisov - UIUC1 Some material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra,

Slides:



Advertisements
Similar presentations
Cache Coherence. Memory Consistency in SMPs Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale values write-through: cache-2 has.
Advertisements

L.N. Bhuyan Adapted from Patterson’s slides
CS425/CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.
Computer System Organization Computer-system operation – One or more CPUs, device controllers connect through common bus providing access to shared memory.
Cache Optimization Summary
Distributed Shared Memory
Multiple Processor Systems
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 25: Distributed Shared Memory All slides © IG.
Distributed Operating Systems CS551 Colorado State University at Lockheed-Martin Lecture 4 -- Spring 2001.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
Distributed Resource Management: Distributed Shared Memory
Communication Models for Parallel Computer Architectures 4 Two distinct models have been proposed for how CPUs in a parallel computer system should communicate.
Consistency. Consistency model: –A constraint on the system state observable by applications Examples: –Local/disk memory : –Database: What is consistency?
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
PRASHANTHI NARAYAN NETTEM.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
Dr. Kalpakis CMSC 621, Advanced Operating Systems. Fall 2003 URL: Distributed Shared Memory.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Multiprocessor Cache Coherency
Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.
Distributed Shared Memory Systems and Programming
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
Distributed Shared Memory: A Survey of Issues and Algorithms B,. Nitzberg and V. Lo University of Oregon.
1 Cache coherence CEG 4131 Computer Architecture III Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.
Distributed File Systems
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
Lazy Release Consistency for Software Distributed Shared Memory Pete Keleher Alan L. Cox Willy Z.
CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Distributed Shared Memory Steve Ko Computer Sciences and Engineering University at Buffalo.
ECE200 – Computer Organization Chapter 9 – Multiprocessors.
Ch 10 Shared memory via message passing Problems –Explicit user action needed –Address spaces are distinct –Small Granularity of Transfer Distributed Shared.
Distributed Shared Memory Based on Reference paper: Distributed Shared Memory, Concepts and Systems.
Distributed Shared Memory (part 1). Distributed Shared Memory (DSM) mem0 proc0 mem1 proc1 mem2 proc2 memN procN network... shared memory.
Page 1 Distributed Shared Memory Paul Krzyzanowski Distributed Systems Except as otherwise noted, the content of this presentation.
CIS 720 Distributed Shared Memory. Shared Memory Shared memory programs are easier to write Multiprocessor systems Message passing systems: - no physically.
1 Chapter 9 Distributed Shared Memory. 2 Making the main memory of a cluster of computers look as though it is a single memory with a single address space.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
Lazy Release Consistency for Software Distributed Shared Memory Pete Keleher Alan L. Cox Willy Z. By Nooruddin Shaik.
Memory Coherence in Shared Virtual Memory System ACM Transactions on Computer Science(TOCS), 1989 KAI LI Princeton University PAUL HUDAK Yale University.
Lecture 4 Mechanisms & Kernel for NOSs. Mechanisms for Network Operating Systems  Network operating systems provide three basic mechanisms that support.
1 Lecture 3: Coherence Protocols Topics: consistency models, coherence protocol examples.
Distributed shared memory u motivation and the main idea u consistency models F strict and sequential F causal F PRAM and processor F weak and release.
1 Lecture 7: PCM Wrap-Up, Cache coherence Topics: handling PCM errors and writes, cache coherence intro.
Lecture 28-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2010 Indranil Gupta (Indy) December 2, 2010 Lecture 28 Distributed.
Performance of Snooping Protocols Kay Jr-Hui Jeng.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
The University of Adelaide, School of Computer Science
Translation Lookaside Buffer
Distributed Shared Memory
Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013
12.4 Memory Organization in Multiprocessor Systems
Multiprocessor Cache Coherency
Ivy Eva Wu.
CMSC 611: Advanced Computer Architecture
Lecture 26 A: Distributed Shared Memory
Outline Midterm results summary Distributed file systems – continued
Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP
Distributed Shared Memory
Lecture 25: Multiprocessors
High Performance Computing
Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini
Lecture 26 A: Distributed Shared Memory
Cache coherence CEG 4131 Computer Architecture III
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Lecture 24: Multiprocessors
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Distributed Resource Management: Distributed Shared Memory
Presentation transcript:

CS425/CSE424/ECE428 – Distributed Systems Nikita Borisov - UIUC1 Some material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya

P1 P2 P3 Shared Address Space Page Transfer Shared Address Space 9 Read-only replicated page

 In a multiprocessor, two or more processors share a common main memory. Any process on a processor can read/write any word in the shared memory. All communication through a bus.  E.g., Cray supercomputer  Called Shared Memory  In a multicomputer, each processor has its own private memory. All communication using a network.  E.g., CSIL PC cluster  Easier to build: One can take a large number of single-board computers, each containing a processor, memory, and a network interface, and connect them together. (called COTS="Components off the shelf")  Uses Message passing  Message passing can be implemented over shared memory.  Shared memory can be implemented over message passing.  Let's look at shared memory by itself.

 When any of the CPUs wants to read a word from the memory, it puts the address of the requested word on the address line, and asserts the bus control (read) line.  To prevent two CPUs from accessing the memory at the same time, a bus arbitration mechanism is used, i.e., if the control line is already asserted, wait.  To improve performance, each CPU can be equipped with a snooping cache.  Snooping used in both (a) write-through and (b) write-once models CPU MemoryCPU Cache Memory Bus

EventAction taken by a cache in response to its own operation Action taken by other caches in response (to a remote operation) Read hitFetch data from local cache (no action) Read missFetch data from memory and store in cache (no action) Write missUpdate data in memory and store in cache Invalidate cache entry Write hitUpdate memory and cacheInvalidate cache entry All the other caches see the write (because they are snooping on the bus) and check to see if they are also holding the word being modified. If so, they invalidate the cache entries.

A B C W1 Shared A B C W1 Shared A B C W2 W1 Invalid A B C W2 W1 Invalid CPU W1 A reads word W and gets W1. B does not respond but the memory does. Private W2 A writes a value W2. B snoops on the bus, and invalidates its entry. A's copy is marked as private. Private W3 A writes W again. This and subsequent writes by A are done locally, without any bus traffic. Initially both the memory and B have an updated entry of word W. For write, at most one CPU has valid access Shared

A B C W2 W1 Invalid A B C W4 W1 Invalid Shared W3 C reads W. A sees the request by snooping on the bus, asserts a signal that inhibits memory from responding, provides the values. Also changes label to Shared. Invalid W3 W4 C writes W. A invalidates it own entry. C now has the only valid copy. Private The cache consistency protocol is built upon the notion of snooping and built into the memory management unit (MMU). All above mechanisms are implemented in hardware for efficiency. The above shared memory can be implemented using message passing instead of the bus. W3 Shared

 Basic idea: Create the illusion of global shared address space  Approach:  Divide address space into chunks (pages)  Distribute page storage across computers  Use the page fault mechanism to migrate chunk to local memory  Similar to virtual memory, but missing pages filled from other computers instead of disk

CPU Cache Memory Bus X X Network

 When a processor references a word that is absent, it causes a page fault.  On a page fault,  the missing page is just brought in from a remote processor.  A region of 2, 4, or 8 pages including the missing page may also be brought in. ▪ Locality of reference: if a processor has referenced one word on a page, it is likely to reference other neighboring words in the near future.  Region size  Small => too many page transfers  Large => False sharing  Above tradeoff also applies to page size

ABAB ABAB Processor 1Processor 2 Code using A Code using B Two unrelated shared variables Occurs because: Page size > locality of reference Unrelated variables in a region cause large number of pages transfers Large page sizes => more pairs of unrelated variables Page consists of two variables A and B

 Achieving consistency is not an issue if  Pages are not replicated, or…  Only read-only pages are replicated  But don't want to compromise performance.  Two approaches are taken in DSM  Update: the write is allowed to take place locally, but the address of the modified word and its new value are broadcast to all the other processors. Each processor holding the word copies the new value, i.e., updates its local value.  Invalidate: The address of the modified word is broadcast, but the new value is not. Other processors invalidate their copies. (Similar to example in first few slides for multiprocessor)  Page-based DSM systems typically use an invalidate protocol instead of an update protocol. ? [Why?]

 Each page is either in R or W state.  When a page is in W state, only one copy exists, located at one processor (called current "owner") in read-write mode.  When a page is in R state, the current/latest owner has a copy (mapped read-only), but other processors may have copies. W Processor 1Processor 2 Owner P page Processor 1Processor 2 R Owner P Suppose Processor 1 is attempting a read: Different scenarios (a) (b)

W Processor 1Processor 2 Owner P page Exclusive write access, no trap Suppose Processor 1 is attempting a read: Different scenarios

R Processor 1Processor 2 Owner P Exclusive read access, no trap

R Processor 1Processor 2 Owner P Shared read access, no trap R

R Processor 1Processor 2 Owner P Shared read access, no trap R

R Processor 1Processor 2 Owner P Read miss 1.Ask for copy 2.Mark as R 3.Satisfy read R

R Processor 1Processor 2 Owner P Read miss 1.Ask to downgrade from W to R 2.Ask for copy 3.Mark as R 4.Satisfy read R W

W Processor 1Processor 2 Owner P Write hit: No trap

W Processor 1Processor 2 Owner P Upgrade to write R

W Processor 1Processor 2 Owner P 1.Invalidate other copies 2.Upgrade to write R R

W Processor 1Processor 2 Owner P 1.Ask for ownership 2.Invalidate other copies 3.Upgrade to write R R Owner

W Processor 1Processor 2 Owner P 1.Ask for a copy 2.Ask for ownership 3.Invalidate other copies 4.Upgrade to write R R Owner

W Processor 1Processor 2 Owner P 1.Ask for a copy 2.P2 downgrades to R 3.Ask for ownership 4.Invalidate other copies 5.Upgrade to write R W Owner

 Owner is the processor with latest updated copy. How do you locate it?  1. Do a broadcast, asking for the owner to respond.  Broadcast interrupts each processor, forcing it to inspect the request packet.  An optimization is to include in the message whether the sender wants to read/write and whether it needs a copy.  2. Designate a page manager to keep track of who owns which page.  A page manager uses incoming requests not only to provide replies but also to keep track of changes in ownership.  Potential performance bottleneck  multiple page managers ▪ The lower-order bits of a page number are used as an index into a table of page managers. 1. Request 2. Reply Page Manager P 3. Request 4. Reply Page Manager Owner 1. Request 3. Reply 2. Request forwarded Owner P

 Broadcast a msg giving the page num. and asking processors holding the page to invalidate it.  Works only if broadcast messages are reliable and can never be lost. Also expensive.  The owner (or page manager) for a page maintains a copyset list giving processors currently holding the page.  When a page must be invalidated, the owner (or page manager) sends a message to each processor holding the page and waits for an acknowledgement. Network Copyset Page num.

 Different types of consistency: a tradeoff between accuracy and performance.  Strict Consistency (one-copy semantics)  Any read to a memory location x returns the value stored by the most recent write operation to x.  When memory is strictly consistent, all writes are instantaneously visible to all processes and a total order is achieved.  Similar to "Linearizability"  Sequential Consistency  For any execution, a sequential order can be found for all ops in the execution so that  The sequential order is consistent with individual program orders (FIFO at each processor)  Any read to a memory location x should have returned (in the actual execution) the value stored by the most recent write operation to x in this sequential order.

 In this model, writes must occur in the same order on all copies, reads however can be interleaved on each system, as convenient. Stale reads can occur.  Can be realized in a system with causal- totally ordered reliable broadcast mechanism.

 Example: Given H1= W(x)1 and H2= R(x)0 R(x)1, how do we come up with a sequential order (single string S of ops) that satisfies seq. cons.  Program order must be maintained  Memory coherence must be respected: a read to some location, x must always return the value most recently written to x.  Answer: S= R(x)0 W(x)1 R(x)1

 Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.  Example 1: P1: P2: P3: P4: W(x)1 W(x) 3 R(x)1 W(x)2 R(x)1 R(x)3 R(x)2 R(x)2 R(x) 3 This sequence obeys causal consistency Concurrent writes

P1: P2: P3: P4: W(x)1 R(x)1 W(x)2 R(x)2 R(x)1 R(x)1 R(x) 2 This sequence does not obey causal consistency Causally related P1: P2: P3: P4: W(x)1 W(x)2 R(x)2 R(x)1 R(x)1 R(x) 2 This sequence obeys causal consistency

 Advantages  Simple programming model  No marshalling  Disadvantages  Higher overhead due to false sharing  Does not handle failures well  Difficult with heterogeneous systems

 DSM: usually implemented in multicomputer where there is no global memory  Invalidate versus Update protocols  Consistency models – tradeoff between accuracy and performance  strict, sequential, causal, etc.  Some of the material is from Tanenbaum (on reserve at library), but slides ought to be enough.  Reading from Coulouris textbook: Chap 18 (4 th ed; relevant parts – topics covered in the slides), (5 th ed)