Ivy Eva Wu

Table of Contents
- Ivy: design and architecture
- Distributed manager algorithms
- Performance, problems, and potential improvements

Ivy
- First DSM system, implemented on Apollo workstations at Yale University in the mid-to-late 1980s
- Implements a multiprocessor cache-coherence protocol in software
- Single writer, multiple readers
- Page management implementations: centralized manager, fixed distributed manager, dynamic distributed manager
- Provides a shared memory system across a group of workstations
- Provides an abstraction of two classes of memory: private and shared
- Shared memory makes parallel programs easier to write than message passing does

Ivy Architecture
- Ownership of a page moves across nodes
- Sequential consistency: all nodes/processes see the results of every memory operation performed by any processor in the same order
- A page must be invalidated at all readers before it can be written; this simulates FIFO ordering of writes
- Provides uniprocessor behavior on a multiprocessor

Granularity
- Granularity is the size of the unit of transfer; Ivy uses a 1 KB page as the unit of access
- Advantage of large blocks: fewer transfers, exploiting locality
- Disadvantage: false sharing
- False sharing: a participant repeatedly accesses data that is never altered by another party, but that data shares a cache block (here, a page) with data that is altered, so the coherence protocol forces the participant to reload the whole unit despite there being no logical need
- Example from the slide: a routine such as sum_a may have to continually re-read x from main memory even though x never changes, because x shares a page with data that does

Centralized Manager
- Only the manager knows where all the copies are
- The manager keeps a table (Info) with one entry per page:
  - Owner: the processor with the most recent write access
  - Copy_set: a list of all processors that hold copies of the page
  - Lock: for synchronizing requests
- Each processor keeps a page table (PTable) recording accessibility, with per-page fields:
  - Access
  - Lock
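A minimal sketch of this bookkeeping in Python, with invented names (InfoEntry, PTableEntry); it is illustrative, not the original IVY code.

    from dataclasses import dataclass, field
    import threading

    @dataclass
    class InfoEntry:                # one entry per page, kept only at the manager
        owner: int                  # processor with the most recent write access
        copy_set: set = field(default_factory=set)   # processors holding read copies
        lock: threading.Lock = field(default_factory=threading.Lock)  # serializes requests

    @dataclass
    class PTableEntry:              # one entry per page, kept at every processor
        access: str = "nil"         # "nil", "read", or "write"
        lock: threading.Lock = field(default_factory=threading.Lock)

    info = {}      # manager's Info table, indexed by page number
    ptable = {}    # this processor's PTable, indexed by page number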

Read
1. Page fault for p1 in C
2. C sends a read request to the manager
3. The manager sends a read forward to A and adds C to the copy_set
4. A sends p1 to C; p1 is marked read-only at C
5. C sends a read confirmation to the manager

- Copy_set: the set of nodes holding a copy of a piece of data
- The last writer is considered the owner and holds the current copy_set
- If the manager processor itself faults: 2 messages (one to the owner, one back from the owner)
- If a non-manager processor faults: 4 messages (one to the manager, one forwarded to the owner, one from the owner, one confirmation)
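A hedged sketch of this read-fault path, continuing the structures above; send() and memory are placeholders for the transport and page storage (assumed, not from the paper).

    memory = {}    # page number -> page contents on this processor

    def send(dest, msg):
        pass       # transport placeholder: deliver msg to processor dest

    def read_fault(page, me, manager):
        send(manager, ("read_request", page, me))           # steps 1-2: ask the manager

    def manager_handle_read(page, requester):
        entry = info[page]
        with entry.lock:                                    # serialize requests per page
            entry.copy_set.add(requester)                   # remember the new reader
            send(entry.owner, ("read_forward", page, requester))  # step 3: forward to owner

    def owner_handle_read_forward(page, requester):
        send(requester, ("page_data", page, memory[page]))  # step 4: ship the page

    def requester_receive_page(page, data, manager):
        memory[page] = data
        ptable[page].access = "read"                        # page is read-only here
        send(manager, ("read_confirmation", page))          # step 5: completes the request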

Write
1. Page fault for p1 on B
2. B sends a write request to the manager
3. The manager sends an invalidate to every processor in the copy_set (here, C)
4. C sends an invalidation confirmation to the manager
5. The manager clears the copy_set and sends a write forward to A
6. A sends p1 to B and clears its access
7. B sends a write confirmation to the manager

- Potential bottleneck/hot spot: all requests must go through the manager
- A page is not writable while a read copy exists; every copy must be invalidated first
- Invalidating all copies: send a message to each processor in the copy_set and wait for acks from all of them
- The confirmation message marks completion of a request, after which the manager can give the page to someone else (synchronization)
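A matching sketch of the write-fault path, continuing the sketch above; wait_for_ack() is an assumed helper standing in for the real protocol's synchronization on invalidation confirmations.

    def wait_for_ack(node, page):
        pass   # assumed helper: block until node confirms the invalidation

    def manager_handle_write(page, requester):
        entry = info[page]
        with entry.lock:
            for node in entry.copy_set:                     # invalidate every read copy
                send(node, ("invalidate", page))
                wait_for_ack(node, page)                    # block on invalidate confirm
            entry.copy_set.clear()
            send(entry.owner, ("write_forward", page, requester))
            entry.owner = requester                         # ownership moves to the writer

    def old_owner_handle_write_forward(page, requester):
        send(requester, ("page_data", page, memory[page]))
        ptable[page].access = "nil"                         # old owner clears its access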

Eventcount
- Process synchronization mechanism built on the shared virtual memory
- Four primitive operations: init(), read(), await(value), and advance(); each is atomic
- Any process can use an eventcount after initialization
- Eventcount operations become local once the page holding the eventcount has been received by a processor
- init(): initializes an eventcount
- read(): returns the value of the eventcount
- await(value): suspends the calling process until the eventcount reaches the specified value
- advance(): increments the eventcount by one and wakes up awaiting processes
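A single-node sketch of these primitives using Python threads; in IVY the counter lives in shared virtual memory, so the same operations work across nodes. Here await is renamed await_value because await is a Python keyword.

    import threading

    class EventCount:
        def __init__(self):                    # init(): create at zero
            self._value = 0
            self._cond = threading.Condition()

        def read(self):                        # read(): current value
            with self._cond:
                return self._value

        def await_value(self, value):          # await(value): block until reached
            with self._cond:
                while self._value < value:
                    self._cond.wait()

        def advance(self):                     # advance(): increment, wake waiters
            with self._cond:
                self._value += 1
                self._cond.notify_all()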

Improved Centralized Manager
- The owner, rather than the manager, keeps the copy_set of a page
- PTable fields: access, lock, and copy_set
- The manager still answers where the page owner is
- The copy_set is sent along with the page data
- The owner is responsible for invalidation
- Decentralized synchronization: the centralized manager is no longer the hot spot
- Eliminates the confirmation operation to the manager
- The copy_set field is valid only if the processor holding the page table entry is the owner of the page
- A bottleneck may remain: the manager must still respond to every page fault

Fixed Distributed Manager
- Every processor manages a predetermined set of pages; one manager per processor
- The manager responsible for a page is given by a fixed mapping function H
- On a page fault on page p, the faulting processor asks processor H(p) where the true page owner is; H(p) then finds the true owner using the centralized manager algorithm
- Straightforward approach: distribute the pages evenly, in a fixed manner, across all processors
- A simple hashing function: H(p) = p mod N
- A more general definition: H(p) = (p / s) mod N, where p is the page number, N the number of processors, and s the number of pages per segment
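A direct transcription of the two mapping functions (function names assumed):

    def h_simple(p, n):
        return p % n             # H(p) = p mod N

    def h_segmented(p, n, s):
        return (p // s) % n      # H(p) = (p / s) mod N: s consecutive pages per segment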

Broadcast Distributed Manager
- No manager at all
- PTable fields: access, lock, copy_set, and owner
- The owner behaves much like a manager and keeps the copy_set
- The requesting processor sends a broadcast message
- Disadvantage: every processor has to process each broadcast request
- Each processor manages the pages it owns
- Broadcast read request: the true owner of the page responds by adding the requesting processor P to the page's copy_set field and sending P a copy of the page
- Broadcast write request: the true owner gives up ownership and sends back the page and its copy_set; the requesting processor then invalidates all the copies
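A hedged sketch of the broadcast write path, assuming each PTable entry now also carries owner and copy_set fields as the slide describes (send() and memory as in the earlier sketches):

    def write_fault_broadcast(page, me, all_nodes):
        for node in all_nodes:                       # no manager: ask everyone
            if node != me:
                send(node, ("bcast_write_request", page, me))

    def owner_handle_bcast_write(page, requester, me):
        e = ptable[page]
        if e.owner != me:
            return                                   # not the owner: ignore the broadcast
        send(requester, ("page_and_copyset", page, memory[page], e.copy_set))
        e.access, e.owner = "nil", requester         # give up ownership

    def requester_install_page(page, data, copy_set, me):
        memory[page] = data
        for node in copy_set:                        # requester invalidates all copies
            send(node, ("invalidate", page))
        e = ptable[page]
        e.owner, e.access = me, "write"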

Broadcast Distributed Manager Read (diagram): P1 broadcasts a request for page 0 to P0–P4; the owner adds P1 to the copy set and sends P1 a copy of page 0.

Broadcast Distributed Manager Write (diagram): a processor broadcasts a write request for page 0 to P0–P4; the owner replies with page 0 and its copy set, giving up ownership.

Dynamic Distributed Manager
- Manager = owner: a page has no fixed owner or manager
- Each processor keeps track of the probable owner (probOwner) of every page in its local page table (PTable)
- The owner field of PTable is replaced with a probOwner field
- If the processor contacted is the true owner, it proceeds as in the centralized manager algorithm

probOwner
- The value is either the true owner or a "probable" owner of the page
- On a page fault, a request is sent to the processor named in the probOwner field
- If that processor is the true owner, it proceeds as in the centralized manager algorithm
- If not, it forwards the request to its own "probable" owner
- Initially, all probOwner fields are set to a default processor
- The field is updated whenever a processor receives an invalidation request, relinquishes ownership, or forwards a page fault request
- Maintaining this field as the program runs is the job of the page fault handlers and their servers
- A processor that forwards a request does not need to send a reply to the requesting processor
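A sketch of the forwarding rule (helper names assumed): a request chases probOwner links until it reaches the true owner, and each forwarder updates its own guess along the way.

    def serve_fault(page, requester):
        pass   # assumed helper: transfer the page as in the centralized manager

    def handle_fault_request(page, requester, me):
        e = ptable[page]
        if e.prob_owner == me:                      # probOwner points at self: true owner
            serve_fault(page, requester)            # proceed as in the centralized algorithm
            e.prob_owner = requester                # a writer becomes the new owner
        else:
            send(e.prob_owner, ("fault_request", page, requester))  # forward, no reply needed
            e.prob_owner = requester                # shortcut future lookups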

Dynamic Distributed Manager Read (diagram): a read request from P3 follows probOwner links through intermediate processors until it reaches the true owner of page 0; the chain of links to the true owner can be long.

Dynamic Distributed Broadcasts
- Improves the dynamic distributed manager algorithm by adding a broadcast message
- The true owner is announced by broadcast after every M page faults
- M should increase steadily as the number of processors grows
- As M becomes very large, the behavior converges to the plain dynamic distributed manager algorithm
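A sketch of the periodic announcement (M and the per-node fault counter are assumed details):

    M = 64             # assumed tuning parameter: broadcast every M faults
    fault_count = 0

    def after_fault_served(page, owner, all_nodes):
        global fault_count
        fault_count += 1
        if fault_count % M == 0:                    # collapse long probOwner chains
            for node in all_nodes:
                send(node, ("owner_announce", page, owner))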

Dynamic Distributed Broadcasts Read (diagram): after the broadcast, every processor's probOwner field points at the current owner, so a read request for page 0 reaches the owner directly.

Dynamic Distributed Copy Set
- The copy set data is stored as a tree
- Root: the owner
- The tree is bidirectional: edges directed from the root form the copy_sets; edges directed from the leaves toward the root are probOwner links
- Read fault: follow probOwner links to the owner
- Write fault: invalidate all copies, starting at the owner and propagating through the copy_sets
- Propagation of invalidations is "divide and conquer"; if the tree is balanced, invalidating m read copies takes O(log m) steps
- A read fault now only needs to find a single processor that holds a copy of the page; a lock is needed on processors holding read copies to synchronize sending copies of the page in the presence of other read or write faults
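A sketch of the divide-and-conquer invalidation over the copy-set tree: each node drops its own copy and forwards the invalidation to the readers it served, so subtrees are invalidated in parallel.

    def invalidate_subtree(page, me):
        e = ptable[page]
        e.access = "nil"                            # drop the local copy
        for child in e.copy_set:                    # readers that got the page from us
            send(child, ("invalidate_subtree", page))
        e.copy_set.clear()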

Double Fault
- Reading a page first and then writing it causes two page faults
- Solution: sequence numbers
- A processor sends its page sequence number to the owner; the owner then decides whether a data transfer is actually needed
- This avoids only the redundant page transfer; the second fault itself still occurs
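A sketch of the sequence-number check (the seq field is assumed): the owner skips re-sending page data when the requester's copy is already current.

    def owner_handle_write_with_seq(page, requester, req_seq):
        e = ptable[page]
        if req_seq == e.seq:                        # requester already has this version
            send(requester, ("ownership_only", page, e.copy_set))  # no data transfer
        else:
            send(requester, ("page_and_copyset", page, memory[page], e.copy_set))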

Performance
- Works well when there is little sharing
- Cannot handle false sharing
- Sequential consistency requires large amounts of communication
- Ping-pong effect: a page alternately written by two processors bounces back and forth between them, with an invalidation and transfer on every handoff

Potential Improvements
- Allow multiple writers by allowing certain users to keep private copies
- Share less than the entire page, to reduce false sharing

Questions?