Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) December 5, 2013 Lecture 28: Distributed Shared Memory Section 6.5, Chapter 6 from Tanenbaum textbook. © 2013, I. Gupta, K. Nahrtstedt, S. Mitra, N. Vaidya, M. T. Harandi, J. Hou

HW4 Due Now

Distributed Shared Memory
[Figure: processors P1, P2, P3 each map the same shared address space of numbered pages. A page transfer moves a page to the processor that references it; a read-only page may be replicated at several processors.]

Shared Memory vs. Message Passing
In a multiprocessor, two or more processors share a common main memory. Any process on any processor can read/write any word in the shared memory, and all communication goes through a bus (e.g., a Cray supercomputer). This is called shared memory.
In a multicomputer, each processor has its own private memory, and all communication uses a network (e.g., our favorite EWS cluster, or really any datacenter). Multicomputers are easier to build: take a large number of single-board computers, each containing a processor, memory, and a network interface, and connect them together (COTS = "commercial off-the-shelf" components). They use message passing.
Message passing can be implemented over shared memory, and shared memory can be implemented over message passing. Let's look at shared memory by itself.
[Figure: a CPU with its cache, connected to memory over a bus.]

Cache Consistency – Write Through
Event      | Action taken by a cache for its own operation | Action taken by other caches (remote operation)
Read hit   | Fetch data from local cache                   | (no action)
Read miss  | Fetch data from memory and store in cache     | (no action)
Write miss | Update data in memory and store in cache      | Invalidate cache entry
Write hit  | Update memory and cache                       | Invalidate cache entry
All the other caches see the write (because they are snooping on the bus) and check whether they are also holding the word being modified. If so, they invalidate their cache entries.
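To make the table concrete, here is a minimal sketch of write-through with snooping, written as plain sequential C++; the Cache and Bus classes and all their method names are invented for illustration and are not a real hardware or DSM interface.

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy write-through cache: every write also updates main memory, and the
// other caches snoop writes on the "bus" and invalidate their own copies.
struct Cache {
    std::unordered_map<int, int> lines;  // address -> cached value

    int read(int addr, const std::unordered_map<int, int>& memory) {
        auto it = lines.find(addr);
        if (it != lines.end()) return it->second;  // read hit: fetch from local cache
        int v = memory.at(addr);                   // read miss: fetch from memory...
        lines[addr] = v;                           // ...and store in the cache
        return v;
    }

    void snoop_remote_write(int addr) { lines.erase(addr); }  // invalidate our entry
};

struct Bus {
    std::unordered_map<int, int> memory;
    std::vector<Cache*> caches;

    // Write (hit or miss): update memory and the writer's cache; every other
    // cache sees the write on the bus and invalidates its copy of the word.
    void write(Cache& writer, int addr, int value) {
        memory[addr] = value;
        writer.lines[addr] = value;
        for (Cache* c : caches)
            if (c != &writer) c->snoop_remote_write(addr);
    }
};

int main() {
    Bus bus;
    bus.memory[0x10] = 7;
    Cache a, b;
    bus.caches = {&a, &b};
    std::printf("B reads %d\n", b.read(0x10, bus.memory));  // B now caches 7
    bus.write(a, 0x10, 8);                                  // A writes 8; B invalidates
    std::printf("B reads %d\n", b.read(0x10, bus.memory));  // miss -> fetches 8 from memory
}
```

The key point of write-through is that every write reaches memory, so other caches can simply drop their copies and re-fetch later.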

Cache Consistency – Write Once
For a write, at most one CPU has valid access.
[Figure: CPUs A, B, and C with caches on a common bus; word W is shown moving through the states Clean, Dirty, and Invalid as the example proceeds.]
Initially both the memory and B have an up-to-date entry of word W (value W1).
A reads word W and gets W1. B does not respond to the read, but the memory does.
A writes a value W2. B snoops on the bus and invalidates its entry. A's copy is marked dirty.
A writes W again. This and subsequent writes by A are done locally, without any bus traffic.

Cache Consistency – Write Once (continued)
A writes a value W3. No bus traffic is incurred.
C writes W. A sees the request by snooping on the bus, asserts a signal that inhibits memory from responding, and provides the value. A invalidates its own entry. C now has the only valid copy.
The cache consistency protocol is built upon the notion of snooping and is built into the memory management unit (MMU). All of the above mechanisms are implemented in hardware for efficiency.
This kind of shared memory can be implemented using message passing instead of a bus.
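A similar sketch, from cache A's point of view, of the write-once transitions described in this example; only the Clean/Dirty/Invalid states named on these slides are modeled (the real write-once protocol also has an intermediate state), and the bus messages are just print statements.

```cpp
#include <cstdio>

// Per-line states named on these slides; the full write-once protocol also
// has an intermediate "reserved" state that this sketch omits.
enum class State { Invalid, Clean, Dirty };

struct Line {
    State state = State::Invalid;

    // Local write: the first write after Clean goes on the bus so other
    // caches can invalidate their copies; once Dirty, writes stay local.
    void local_write() {
        if (state != State::Dirty)
            std::printf("  bus: broadcast write, other caches invalidate\n");
        state = State::Dirty;
    }

    // Another cache writes this word while we hold it Dirty: supply the
    // value (inhibiting memory from responding), then drop our copy.
    void snoop_remote_write() {
        if (state == State::Dirty)
            std::printf("  bus: supply dirty value, inhibit memory\n");
        state = State::Invalid;
    }
};

int main() {
    Line a;                   // word W as seen by cache A
    a.state = State::Clean;   // A read W1 (memory responded, not B)
    a.local_write();          // A writes W2: bus traffic, B invalidates, A is Dirty
    a.local_write();          // A writes W3: done locally, no bus traffic
    a.snoop_remote_write();   // C writes W: A provides the value, then invalidates itself
}
```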

Granularity of Chunks
When a processor references a word that is absent, it causes a page fault. On a page fault, the missing page is brought in from a remote processor. A region of 2, 4, or 8 pages including the missing page may also be brought in, exploiting locality of reference: if a processor has referenced one word on a page, it is likely to reference neighboring words in the near future.
Region size tradeoff: too small => too many page transfers; too large => false sharing. The same tradeoff also applies to the page size itself.

False Sharing
[Figure: a single page holds two unrelated shared variables A and B; one processor runs code using A while another runs code using B, so the page ping-pongs between them.]
False sharing occurs because the page size exceeds the locality of reference. Unrelated variables that happen to share a region cause a large number of page transfers, and larger page sizes mean more pairs of unrelated variables end up on the same page.
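The same effect can be demonstrated at cache-line granularity on an ordinary multiprocessor; in a page-based DSM it happens at page granularity. The sketch below assumes a 64-byte cache line and uses two threads that increment two unrelated counters; the padded layout typically runs noticeably faster because the counters no longer share a line.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two unrelated counters. Without padding they typically land in the same
// cache line (or, in a DSM, on the same page), so each thread's writes keep
// invalidating the other thread's copy.
struct Unpadded { std::atomic<long> a{0}, b{0}; };

// alignas(64) keeps each counter in its own line (64 bytes is an assumption
// about the cache-line size of the machine).
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run_ms() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (int i = 0; i < 5'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(bump, std::ref(c.a)), t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    std::chrono::duration<double, std::milli> ms = std::chrono::steady_clock::now() - start;
    return ms.count();
}

int main() {
    // Same logical work in both runs; only the memory layout differs.
    std::printf("same cache line : %.1f ms\n", run_ms<Unpadded>());
    std::printf("separate lines  : %.1f ms\n", run_ms<Padded>());
}
```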

Achieving Sequential Consistency
Achieving consistency is less challenging if pages are not replicated, or if only read-only pages are replicated. But we don't want to compromise performance.
Two approaches are taken in Distributed Shared Memory (DSM):
Invalidate: the address of the modified word is broadcast, but the new value is not. Other processors invalidate their copies. (Similar to the multiprocessor examples in the first few slides.)
Update: the write is allowed to take place locally, but the address of the modified word and its new value are broadcast to all the other processors. Each processor holding the word copies in the new value, i.e., updates its local copy.
Page-based DSM systems typically use an invalidate protocol instead of an update protocol. [Why?]

Invalidation Protocol to Achieve Consistency
Owner = the processor with the latest copy of a page.
Each page is either in R or W state. When a page is in W state, only one copy exists, located at one processor (the current "owner") in read-write mode. When a page is in R state, the current/latest owner has a copy (mapped read-only), but other processors may have copies too.
Suppose Processor 1 is attempting a read: different scenarios.
[Figure: scenarios (a) and (b): Processor 1 already holds page P as its owner, in W state (a) or R state (b).]

Invalidation Protocol (Read)
Suppose Processor 1 is attempting a read: different scenarios.
[Figure: scenarios (c) and (d): Processor 1 holds page P in R state, with the owner being either Processor 1 or Processor 2.]
In the first four cases the page is already mapped into Processor 1's address space, so no trap occurs and the read proceeds locally.
[Figure: scenarios (e) and (f): the page is held only by Processor 2, in R state (e) or W state (f).]
(e) Ask for a copy, mark the page as R, do the read.
(f) Ask Processor 2 to degrade its copy to R, ask for a copy, mark the page as R, do the read.

Invalidation Protocol (Write)
Suppose Processor 1 is attempting a write: different scenarios.
[Figure: Processor 1 holds page P as its owner, in W state or in R state.]
If the page is held in W state, the write proceeds locally. If it is held in R state as the sole copy: mark the page as W, do the write.
[Figure: Processor 1 holds page P in R state and other copies exist; the owner is Processor 1 or Processor 2.]
If Processor 1 is the owner: invalidate the other copies, mark the local page as W, do the write.
If Processor 2 is the owner: invalidate the other copies, ask for ownership, mark the page as W, do the write.

Invalidation Protocol (Write), continued
Suppose Processor 1 is attempting a write: remaining scenarios.
[Figure: the page is held only by Processor 2 (the owner), in R state or in W state.]
In both cases: invalidate other copies, ask for ownership, ask for the page, mark the page as W, do the write.
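Putting the read and write cases together, here is a minimal single-node sketch of the fault-handling logic; the message helpers (request_copy, request_ownership, degrade_owner_to_read, invalidate_other_copies) are hypothetical names standing in for DSM network messages, not an actual API.

```cpp
#include <cstdio>

// States a page can be in at this processor: R = read-only, W = read-write.
enum class PageState { Unmapped, ReadOnly, ReadWrite };

struct PageEntry {
    PageState state = PageState::Unmapped;
    bool owner = false;  // true if this processor currently owns the page
};

// Hypothetical message helpers; in a real DSM these go over the network.
void request_copy(int p)             { std::printf("  -> request copy of page %d\n", p); }
void request_ownership(int p)        { std::printf("  -> request ownership of page %d\n", p); }
void degrade_owner_to_read(int p)    { std::printf("  -> ask owner to degrade page %d to R\n", p); }
void invalidate_other_copies(int p)  { std::printf("  -> invalidate other copies of page %d\n", p); }

// Read fault (cases e/f): the page is not mapped here at all.
void on_read_fault(PageEntry& e, int p) {
    degrade_owner_to_read(p);        // only needed if the owner holds it in W state
    request_copy(p);
    e.state = PageState::ReadOnly;   // map read-only, then retry the read
}

// Write fault: the page is missing here, or mapped read-only.
void on_write_fault(PageEntry& e, int p) {
    invalidate_other_copies(p);
    if (!e.owner) {
        request_ownership(p);
        if (e.state == PageState::Unmapped) request_copy(p);
        e.owner = true;
    }
    e.state = PageState::ReadWrite;  // map read-write, then retry the write
}

int main() {
    PageEntry entry;                                // page 7 starts unmapped here
    std::printf("read fault on page 7:\n");  on_read_fault(entry, 7);
    std::printf("write fault on page 7:\n"); on_write_fault(entry, 7);
}
```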

Finding the Owner
The owner is the processor with the latest updated copy of the page. How do you locate it?
1. Do a broadcast, asking the owner to respond. The broadcast interrupts every processor, forcing it to inspect the request packet. An optimization is to include in the message whether the sender wants to read or write, and whether it needs a copy.
2. Designate a page manager to keep track of who owns each page. The page manager uses incoming requests not only to provide replies but also to keep track of changes in ownership. A single manager is a potential performance bottleneck, so use multiple page managers and map pages to page managers using the low-order bits of the page number (see the sketch below).
[Figure: two page-manager variants. Either the manager forwards the request to the owner, who replies to the requester (1. request, 2. request forwarded, 3. reply), or the manager replies with the owner's identity and the requester contacts the owner directly (1. request, 2. reply, 3. request, 4. reply).]
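A minimal sketch of the page-to-manager mapping mentioned in option 2; the number of managers and the function name are assumptions made for illustration, and the mapping assumes the manager count is a power of two.

```cpp
#include <cstdint>
#include <cstdio>

// Map a page to one of NUM_MANAGERS page managers using the low-order bits
// of the page number (assumes NUM_MANAGERS is a power of two).
constexpr std::uint32_t NUM_MANAGERS = 8;

std::uint32_t manager_for_page(std::uint64_t page_number) {
    return static_cast<std::uint32_t>(page_number & (NUM_MANAGERS - 1));
}

int main() {
    for (std::uint64_t page : {0ULL, 7ULL, 8ULL, 1234ULL})
        std::printf("page %llu -> manager %u\n",
                    static_cast<unsigned long long>(page), manager_for_page(page));
}
```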

How Does the Owner Find the Copies to Invalidate?
Option 1: broadcast a message giving the page number and asking all processors holding the page to invalidate it. This works only if broadcast messages are reliable and can never be lost; it is also expensive.
Option 2: the owner (or page manager) for a page maintains a copyset listing the processors currently holding the page. When a page must be invalidated, the owner (or page manager) sends a message to each processor holding the page (or uses a multicast) and waits for an acknowledgement.
[Figure: each node on the network keeps, per page number, the copyset of processors currently holding that page.]
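A small sketch of copyset-based invalidation under option 2, assuming hypothetical send_invalidate/wait_for_ack helpers in place of real network messages.

```cpp
#include <cstdio>
#include <set>
#include <unordered_map>

// For each page this node owns (or manages): the set of processors
// currently holding a copy.
std::unordered_map<int, std::set<int>> copyset;

// Hypothetical messaging helpers; a real DSM would send these over the network.
void send_invalidate(int proc, int page) { std::printf("  invalidate(page %d) -> P%d\n", page, proc); }
void wait_for_ack(int proc, int page)    { std::printf("  ack(page %d) <- P%d\n", page, proc); }

// Invalidate every remote copy of `page`, then shrink its copyset to just us.
void invalidate_copies(int page, int self) {
    for (int proc : copyset[page])
        if (proc != self) send_invalidate(proc, page);
    for (int proc : copyset[page])
        if (proc != self) wait_for_ack(proc, page);   // block until every holder confirms
    copyset[page] = {self};                           // only our copy remains
}

int main() {
    copyset[4] = {1, 2, 4};            // page 4 is held by processors 1, 2 and 4
    invalidate_copies(4, /*self=*/1);  // processor 1 is about to write page 4
}
```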

Strict and Sequential Consistency
Different types of consistency represent a tradeoff between accuracy and performance.
Strict Consistency (one-copy semantics): any read to a memory location x returns the value stored by the most recent write operation to x. When memory is strictly consistent, all writes are instantaneously visible to all processes and an absolute global time order is maintained. Similar to "linearizability".
Sequential Consistency: for any execution, a sequential order can be found for all operations in the execution such that (1) the sequential order is consistent with the individual program orders, and (2) any read to a memory location x should have returned (in the actual execution) the value stored by the most recent write operation to x in this sequential order.

How to Determine the Sequential Order?
Example: given H1 = W(x)1 and H2 = R(x)0 R(x)1, how do we come up with a sequential order (a single string S of operations) that satisfies sequential consistency?
Program order must be maintained.
Memory coherence must be respected: a read to some location x must always return the value most recently written to x.
Answer: S = R(x)0 W(x)1 R(x)1
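For a history this small, the sequential order can also be found by brute force: interleave the two program orders in every legal way and keep any interleaving in which each read returns the last value written (x is initially 0). The sketch below does exactly that and prints the slide's answer; the Op struct and function names are made up for the example.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One memory operation on x: a write of `value`, or a read that returned `value`.
struct Op { bool is_write; int value; };

// Replay the interleaving `s` and check that every read returns the last
// value written to x (initially 0), i.e., memory coherence is respected.
bool legal(const std::vector<Op>& s) {
    int x = 0;
    for (const Op& op : s) {
        if (op.is_write) x = op.value;
        else if (op.value != x) return false;
    }
    return true;
}

// Recursively interleave h1 and h2, preserving each program order, and
// print every interleaving that is legal.
void search(const std::vector<Op>& h1, const std::vector<Op>& h2,
            size_t i, size_t j, std::vector<Op> s, std::string trace) {
    if (i == h1.size() && j == h2.size()) {
        if (legal(s)) std::printf("valid order: %s\n", trace.c_str());
        return;
    }
    auto step = [&](const Op& op, size_t ni, size_t nj) {
        auto s2 = s;
        s2.push_back(op);
        char buf[32];
        std::snprintf(buf, sizeof buf, "%c(x)%d ", op.is_write ? 'W' : 'R', op.value);
        search(h1, h2, ni, nj, s2, trace + buf);
    };
    if (i < h1.size()) step(h1[i], i + 1, j);
    if (j < h2.size()) step(h2[j], i, j + 1);
}

int main() {
    std::vector<Op> h1 = {{true, 1}};               // H1 = W(x)1
    std::vector<Op> h2 = {{false, 0}, {false, 1}};  // H2 = R(x)0 R(x)1
    search(h1, h2, 0, 0, {}, "");                   // prints: R(x)0 W(x)1 R(x)1
}
```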

Causal Consistency
Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.
Example 1: concurrent writes
P1: W(x)1                    W(x)3
P2:        R(x)1  W(x)2
P3:        R(x)1                    R(x)3  R(x)2
P4:        R(x)1                    R(x)2  R(x)3
This sequence obeys causal consistency: W(x)2 and W(x)3 are concurrent, so P3 and P4 may see them in different orders.

Causal Consistency
Example 2: causally related writes
P1: W(x)1
P2:        R(x)1  W(x)2
P3:                      R(x)2  R(x)1
P4:                      R(x)1  R(x)2
This sequence does not obey causal consistency: P2 read x=1 before writing x=2, so W(x)1 is causally before W(x)2, yet P3 sees them in the opposite order.

Example 3: concurrent writes
P1: W(x)1
P2:        W(x)2
P3:               R(x)2  R(x)1
P4:               R(x)1  R(x)2
This sequence obeys causal consistency: W(x)1 and W(x)2 are now concurrent (P2 did not read x first), so different processes may see them in different orders.

Pipelined RAM (PRAM) Consistency
Any pair of writes done by a single processor are received by all other processors in the same order. A pair of writes from different processors may be seen in different orders at different processors.
P1: W(x)1
P2:        R(x)1  W(x)2
P3:                      R(x)2  R(x)1
P4:                      R(x)1  R(x)2
This sequence is allowed with PRAM consistent memory: the two writes come from different processors, so their order may differ across observers.

Cache Coherence
The discussion so far has focused on a single variable, but sequential/causal/PRAM consistency must obey an ordering across accesses to all variables. Sometimes you only care about each variable independently; that weaker property is coherence.
P1: W(x)1  R(y)0
P2: W(y)1  R(x)0
Accesses to x can be linearized as R(x)0, W(x)1 and, separately, accesses to y as R(y)0, W(y)1. The history is coherent (per variable) but not sequentially consistent: any single order that respects both program orders would force at least one of the two reads to return 1.
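For a flavor of coherence without sequential consistency on real hardware, the sketch below uses C++ relaxed atomics, which guarantee per-variable coherence but no single global order across variables; the outcome where both reads return 0 is therefore allowed, exactly as in the P1/P2 history above (whether you actually observe it in one run depends on the machine and timing).

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

// Thread 1 plays P1: W(x)1 then R(y). Thread 2 plays P2: W(y)1 then R(x).
// memory_order_relaxed gives per-variable coherence only, so the outcome
// r1 == 0 && r2 == 0 is permitted; it could not happen under sequential consistency.
void p1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
void p2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    std::printf("R(y)=%d R(x)=%d\n", r1, r2);  // (0, 0) is a legal result
}
```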

Weak Consistency
Not all processors need to see all writes, let alone see them in order. E.g., a process inside a critical section reads/writes some variables in a tight loop; other processes are not supposed to touch those variables until the first process has left the critical section.
A special synchronization variable is introduced. When a synchronization completes, all writes done on that processor are propagated outward and all writes done on other processors are brought in.
Rules:
Accesses to synchronization variables are sequentially consistent.
No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere else. (Accessing a synchronization variable "flushes the pipeline", both to and from this processor.)
No data access (read/write) is allowed until all previous accesses to synchronization variables have been performed.

Weak Consistency: Examples
P1: W(x)1  W(y)2  S
P2:
P3:                  R(y)2  R(x)0  S
P4:                  R(x)0  R(y)2  S
This sequence is allowed with weakly consistent memory (but is not recommended): P3 and P4 read before synchronizing, so they may see stale values.

P1: W(x)1  W(y)2  S
P2:
P3:                  S  R(y)2  R(x)1
P4:                  S  R(x)1  R(y)2
Here the memory in P3 and P4 has been brought up to date by the synchronization S.

Release Consistency
Two special synchronization operations are defined: acquire and release.
Before any ordinary read/write is allowed to complete at a processor, all previous acquires done by that processor must have completed.
Before a release is allowed to complete at a processor, all previous reads/writes done by that processor must be propagated to all processors.
Acquire and release operations are sequentially consistent with respect to one another.
P1: Acq(L)  W(x)1  W(x)2  Rel(L)
P2:
P3:                               Acq(L)  R(x)2  Rel(L)
P4:                               R(x)1
This sequence is allowed with release consistent memory: P4 reads without acquiring L, so it is not guaranteed to see the latest value.
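The same acquire/release idea appears in shared-memory programming models. As an illustration (not the DSM protocol itself), the sketch below uses C++ memory_order_acquire / memory_order_release: a reader that acquires the flag is guaranteed to see the writes made before the matching release, while a reader that skips the acquire, like P4 above, would have no such guarantee.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int x = 0;                          // ordinary (non-synchronization) data
std::atomic<bool> released{false};  // plays the role of Rel(L)/Acq(L)

void p1() {                         // like P1: write the data, then release
    x = 2;
    released.store(true, std::memory_order_release);
}

void p3() {                         // like P3: acquire, then read; must see x == 2
    while (!released.load(std::memory_order_acquire)) { /* spin */ }
    std::printf("P3 (after acquire): x = %d\n", x);   // guaranteed to print 2
}

int main() {
    std::thread t1(p1), t3(p3);
    t1.join();
    t3.join();
    // A thread that read x without acquiring (like P4) could still see a stale value.
}
```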

Is that Bear Dead?
DSM is an old area and is less active research-wise; in other words, many say it is "dead".
However, faster datacenter networks (InfiniBand, Clos topologies) enable RDMA (Remote Direct Memory Access): servers can read each other's memory directly. Time to rejuvenate DSM?
Also, many consistency-model ideas in key-value/NoSQL storage systems originate from DSM.
Is that bear really starting to wake up from hibernation?

Announcements
MP4 due this Sunday. Demos next Monday (watch Piazza/wiki for the signup sheet).
Mandatory to attend next Tuesday's lecture (last lecture of the course).
Final exam: December 19 (Thursday), 7:00 PM - 10:00 PM, David Kinley Hall room 114 (1DKH-114), 1407 W. Gregory Drive, Urbana IL 61801. Do not come to our regular Siebel classroom! Also on the website schedule.
Allowed to bring a cheat sheet: two sides only, at least 1 pt font.
Conflict exam: please email the course staff by this Thursday (Dec 5) if you need to take a conflict exam.