Memory Coherence in Shared Virtual Memory Systems, ACM Transactions on Computer Systems (TOCS), 1989. KAI LI, Princeton University; PAUL HUDAK, Yale University. Presented Oct. 11th, 2011 by Kim Myung Sun and Lee Chung Hyeon

2 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

3 Basic concept (Introduction)  What is shared virtual memory? It provides a virtual address space that is shared among all processors, and applications can use it just as they do traditional virtual memory. Like a conventional virtual memory system it "pages" data between physical memories and disks, but it also "pages" data between the physical memories of the individual processors, so data naturally migrates between processors on demand.  What do the authors really want to say in this paper? The main difficulty in building a shared virtual memory is the memory coherence problem.

4 Multiprocessor system (Introduction)  [Diagram: CPU 1 through CPU 4, each with its own local memory and mapping manager, all mapping into a single shared virtual memory; contrasted with a tightly coupled shared-memory machine and an embedded system.] Loosely coupled: a single address space is virtually shared and partitioned into pages. Tightly coupled: a single address space is physically shared and kept consistent by multi-cache schemes (cache consistency protocols).

5 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

6 Main features (Shared virtual memory)  Memory mapping managers: keep the address space coherent; each views its local memory as a large cache.  Key goals: allow the processes of a program to execute on different processors in parallel, and improve the performance of parallel programs. Performance depends on the number of parallel processes, the shared-data update ratio, and contention for the shared data.

7 Characteristics (Shared virtual memory)  The address space is partitioned into pages.  Read-only pages can have many copies, one in each CPU's memory.  A writable page can reside in only one CPU's memory at a time.  A memory reference can cause a page fault (slightly different from a page fault in an embedded system). [Diagram: CPU 1 references a page residing in Memory 2 of CPU 2.]

8 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

9 Basic concept & design issues (Memory coherence problem)  What does memory coherence mean? The value returned by a read operation is always the same as the value written by the most recent write operation to the same address.  Design issues: Granularity of the memory units ("page size"): sending a large packet is not much more expensive than sending a small one, but larger pages bring a greater chance of contention; 1 Kbyte is a reasonable choice, though the right size is clearly application dependent. Memory coherence strategies: page synchronization and page ownership.

10 Memory coherence strategies (Memory coherence problem)  Page synchronization: Invalidation invalidates all copies of the page when a write page fault occurs, so there is only one owner processor for each page. Write-broadcast writes to all copies of the page when a write page fault occurs.  Page ownership: Fixed (a page is always owned by the same processor) or Dynamic, handled by a centralized manager or by a distributed manager, which is itself either fixed or dynamic.

11 Page table, locking, and invalidation (Memory coherence problem)  Page table data structures: Access indicates the accessibility of the page (nil/read/write); Copy set is the set of processor numbers holding read copies of the page; Lock synchronizes multiple page faults by different processes on the same processor and synchronizes remote page requests. The slide's pseudocode:

    lock(PTable[p].lock):
      LOOP: IF test-and-set of the lock bit succeeds THEN EXIT;
            ELSE queue this process;

    unlock(PTable[p].lock):
      clear the lock bit;
      IF a process is waiting on the lock THEN resume the process;

    invalidate(p, copy_set):
      FOR i IN copy_set DO send an invalidation request to processor i;
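For concreteness, here is a minimal C sketch of such a page-table entry and the three operations. The names (pte_t, send_invalidate_request, MAX_PROCS) are illustrative assumptions, and a pthreads mutex stands in for the slide's test-and-set/queue logic; this is a sketch, not the paper's implementation.

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_PROCS 64

    typedef enum { ACCESS_NIL, ACCESS_READ, ACCESS_WRITE } access_t;

    /* One page-table entry of the shared virtual memory. */
    typedef struct {
        access_t        access;               /* nil / read / write             */
        bool            copy_set[MAX_PROCS];  /* processors holding read copies */
        pthread_mutex_t lock;                 /* serializes local faults and
                                                 remote requests for this page  */
    } pte_t;

    /* Hypothetical messaging primitive assumed to exist in the runtime. */
    extern void send_invalidate_request(int proc, int page);

    static void pte_lock(pte_t *e)   { pthread_mutex_lock(&e->lock); }
    static void pte_unlock(pte_t *e) { pthread_mutex_unlock(&e->lock); }

    /* invalidate(p, copy_set): ask every holder of a read copy to drop it. */
    static void invalidate(pte_t *table, int page)
    {
        for (int i = 0; i < MAX_PROCS; i++) {
            if (table[page].copy_set[i]) {
                send_invalidate_request(i, page);
                table[page].copy_set[i] = false;
            }
        }
    }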

12 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

13 A monitor-like centralized manager algorithm (Centralized manager algorithms)  The manager maintains a per-page data structure and provides mutually exclusive access to it. Only one processor acts as the "monitor," and only the monitor knows the owner of each page. The manager's Info table has three fields per page: owner (the most recent processor to have write access to the page), copy set (all processors that have copies of the page), and lock (used for synchronizing requests to the page). Each processor's PTable has two fields per page, access and lock, and keeps information about the accessibility of pages on the local processor.
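A compact C sketch of the two tables this slide describes, under the same illustrative assumptions as the earlier sketch (constants and names are hypothetical):

    #include <pthread.h>
    #include <stdbool.h>

    #define NUM_PAGES 1024
    #define MAX_PROCS 64

    typedef enum { ACCESS_NIL, ACCESS_READ, ACCESS_WRITE } access_t;

    /* Manager-side entry: one per page, kept only on the monitor processor. */
    typedef struct {
        int             owner;                /* most recent processor with write access */
        bool            copy_set[MAX_PROCS];  /* processors that have copies of the page */
        pthread_mutex_t lock;                 /* synchronizes requests to the page       */
    } manager_info_t;

    /* Per-processor entry: one per page, kept on every processor. */
    typedef struct {
        access_t        access;               /* accessibility of the page locally */
        pthread_mutex_t lock;
    } local_pte_t;

    manager_info_t info[NUM_PAGES];    /* lives on the manager only */
    local_pte_t    ptable[NUM_PAGES];  /* lives on every processor  */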

14 Centralized manager algorithms

15-22  e.g. [Read page fault in P2] (Centralized manager algorithms). These slides step through the message flow. Initially the manager's Info entry records owner = P3 and copy set = { P1 }, and P1 and P3 hold the page read-only. 1. P2 locks its local PTable entry and sends a read request to the manager; the manager locks the Info entry and adds P2 to the copy set, which becomes { P1, P2 }. 2. The manager asks the owner P3 to send a copy of the page to P2. 3. P3 sends the copy to P2. 4. P2 sends a confirmation to the manager, which unlocks the Info entry; P2 now has read access and both locks are released.
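A rough C-style sketch of the manager's side of this exchange, reusing the manager_info_t layout sketched earlier; send_msg and wait_for_confirmation are hypothetical stand-ins for the network layer, not real APIs:

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_PROCS 64
    enum msg_kind { COPY_PAGE_TO };

    typedef struct {
        int             owner;
        bool            copy_set[MAX_PROCS];
        pthread_mutex_t lock;
    } manager_info_t;

    extern manager_info_t info[];                        /* manager's per-page table */
    extern void send_msg(int proc, enum msg_kind k, int page, int arg);
    extern void wait_for_confirmation(int page, int proc);

    /* Manager-side handling of a read request for page p from processor req. */
    void manager_handle_read_request(int p, int req)
    {
        pthread_mutex_lock(&info[p].lock);               /* step 1: lock the entry       */
        info[p].copy_set[req] = true;                    /*         req gets a read copy */
        send_msg(info[p].owner, COPY_PAGE_TO, p, req);   /* step 2: ask owner to ship it */
        wait_for_confirmation(p, req);                   /* step 4: req confirms receipt */
        pthread_mutex_unlock(&info[p].lock);             /*         unlock the entry     */
    }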

23-28  e.g. [Write page fault in P2] (Centralized manager algorithms). Starting state: owner = P3, copy set = { P1, P2 }, and P1 and P2 hold read copies. 1. P2 sends a write request to the manager, which locks the Info entry. 2. The manager invalidates all copies in the copy set (which becomes empty) and records P2 as the owner. 3. The manager asks P3 to send the page to P2. 4. P3 sends the page to P2 and gives up its own access. 5. P2 sends a confirmation to the manager, which unlocks the entry; P2 now has write access.
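The corresponding manager-side write handling, again a sketch under the same hypothetical helpers rather than the paper's code:

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_PROCS 64
    enum msg_kind { INVALIDATE, SEND_PAGE_TO };

    typedef struct {
        int             owner;
        bool            copy_set[MAX_PROCS];
        pthread_mutex_t lock;
    } manager_info_t;

    extern manager_info_t info[];
    extern void send_msg(int proc, enum msg_kind k, int page, int arg);
    extern void wait_for_confirmation(int page, int proc);

    /* Manager-side handling of a write request for page p from processor req. */
    void manager_handle_write_request(int p, int req)
    {
        pthread_mutex_lock(&info[p].lock);                /* step 1: lock the entry        */
        for (int i = 0; i < MAX_PROCS; i++)               /* step 2: invalidate all copies */
            if (info[p].copy_set[i]) {
                send_msg(i, INVALIDATE, p, 0);
                info[p].copy_set[i] = false;
            }
        send_msg(info[p].owner, SEND_PAGE_TO, p, req);    /* step 3: old owner ships page  */
        info[p].owner = req;                              /*         req becomes the owner */
        wait_for_confirmation(p, req);                    /* step 5: req confirms          */
        pthread_mutex_unlock(&info[p].lock);
    }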

29 An improved centralized manager algorithm (Centralized manager algorithms)  The primary difference from the previous algorithm is that synchronization of page ownership moves to the individual owners: the locking mechanism on each processor now deals not only with multiple local requests but also with remote requests, and the copy set field moves into each processor's PTable (it is valid iff the processor that holds the page table entry is the owner of the page). The manager keeps only the owner field per page. Still, for large N there might be a bottleneck at the manager processor, because it must respond to every page fault to locate the owner.

30 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

31 A fixed distributed manager algorithm (Distributed manager algorithms)  Every processor manages a predetermined subset of the pages, chosen by a mapping function, e.g., H(p) = p mod N, where N is the number of processors, or H(p) = (p / s) mod N, where s is the number of pages per segment (e.g., s = 3); see the sketch below. A fault is sent to manager H(p) and then proceeds as in the centralized manager algorithm. This is superior to the centralized manager algorithms when a parallel program exhibits a high rate of page faults, but finding a good H(p) that fits all applications well is difficult. [Diagram: pages of the shared virtual memory assigned round-robin to managers P1, P2, P3.]
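A minimal sketch of the two example mapping functions; the function names and parameters are illustrative:

    /* Map page number p to its fixed manager processor. */
    int manager_by_page(int p, int n_procs)
    {
        return p % n_procs;                       /* H(p) = p mod N                  */
    }

    int manager_by_segment(int p, int n_procs, int pages_per_segment)
    {
        return (p / pages_per_segment) % n_procs; /* H(p) = (p / s) mod N, e.g. s = 3 */
    }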

32 A broadcast distributed manager algorithm (Distributed manager algorithms)  Each processor manages precisely those pages that it owns. A faulting processor broadcasts into the network to find the true owner of a page. All broadcast operations are atomic, but all processors have to process each broadcast request, slowing down the system.

33-35  e.g. [Read page fault in P2] (Distributed manager algorithms). Initially P3 owns the page and its copy set is { P1 }. 1. P2 broadcasts a read request. 2. The owner P3 adds P2 to its copy set, which becomes { P1, P2 }, and sends a copy of the page; P2 records P3 as the owner and gains read access.

36 A dynamic distributed manager algorithm (Distributed manager algorithms)  Ownership of all pages is tracked in each processor's local PTable; there is no fixed owner or manager. The owner field becomes a probOwner field, and the information it contains is just a hint. When a processor has a page fault, it sends a request to the processor indicated by its probOwner field. If that processor is the true owner, it proceeds as in the centralized manager algorithm; otherwise it forwards the request to the processor indicated by its own probOwner field. [Diagram: a fault is forwarded along probOwner hints from probable owners until it reaches the true owner.]
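A rough sketch of the per-processor request handling implied by this slide. am_owner, handle_as_owner, and forward_request are hypothetical helpers standing in for the local ownership check, the centralized-style owner processing, and the network send:

    #include <stdbool.h>

    #define NUM_PAGES 1024

    extern int  prob_owner[NUM_PAGES];   /* this processor's hint for each page  */
    extern bool am_owner(int page);      /* true if this processor owns the page */
    extern void handle_as_owner(int page, int requester, bool is_write);
    extern void forward_request(int to, int page, int requester, bool is_write);

    /* Called when a request for `page` from `requester` arrives at this node. */
    void on_page_request(int page, int requester, bool is_write)
    {
        if (am_owner(page)) {
            handle_as_owner(page, requester, is_write);  /* as in the centralized alg. */
            if (is_write)
                prob_owner[page] = requester;            /* ownership moved to req     */
        } else {
            /* Not the owner: pass the request along our own hint.  No reply to the
               requester is needed; the true owner will answer it directly. */
            forward_request(prob_owner[page], page, requester, is_write);
        }
    }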

37 Dynamic distributed manager, without a central manager (Distributed manager algorithms)

38-41  e.g. [Read page fault in P2] (Distributed manager algorithms). Initially P3 owns the page, its copy set is { P1 }, and P2's probOwner hint points to P1. 1. P2 sends a read request to P1, the processor named by its probOwner field. 2. P1 is not the owner, so it forwards the request to P3, its own probOwner; it need not send a reply to the requesting processor. 3. The owner P3 sends a copy of the page and the copy set { P1, P3 } to P2; P2 gains read access, and in the slides the probOwner hints on P1 and P3 now point to P2.

42-46  e.g. [Write page fault in P1] (Distributed manager algorithms). Continuing from the previous example, P1's probOwner hint points to P2. 1. P1 sends a write request to P2. 2. P2 sends the page and its copy set { P1, P3 } to P1. 3. P1 invalidates all copies in the copy set; the invalidated processors set their probOwner to P1. P1 becomes the owner with write access and an empty copy set.
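A sketch of the faulting processor's side of this write fault, under the same hypothetical helpers as the earlier sketches rather than the paper's code:

    #include <stdbool.h>

    #define NUM_PAGES 1024
    #define MAX_PROCS 64

    extern int  my_id;
    extern int  prob_owner[NUM_PAGES];
    extern bool copy_set[NUM_PAGES][MAX_PROCS];   /* valid only while we own the page */
    extern void request_page_for_write(int from, int page);  /* forwarded until the
                                                                 owner replies with the
                                                                 page and its copy set */
    extern void send_invalidate(int proc, int page);         /* victim drops its copy
                                                                 and points probOwner
                                                                 at us                 */

    /* Write-fault handler on the faulting processor (dynamic distributed manager). */
    void on_write_fault(int page)
    {
        request_page_for_write(prob_owner[page], page);       /* steps 1-2 */

        for (int i = 0; i < MAX_PROCS; i++)                   /* step 3: invalidate copies */
            if (copy_set[page][i] && i != my_id) {
                send_invalidate(i, page);
                copy_set[page][i] = false;
            }

        prob_owner[page] = my_id;   /* we are now the owner, with write access */
    }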

47 Distribution of copy set (Distributed manager algorithms)  The copy-set data itself can be distributed and stored as a tree rooted at the owner, which improves system performance through a "divide and conquer" effect: with a balanced tree, invalidating m read copies takes O(log m) steps. [Diagram: a tree of P1..P7 with the owner at the root; a write fault propagates invalidations down the copy-set edges, while a read fault collapses the path to the owner through the probOwner edges.]
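An illustrative sketch of how a tree-distributed copy set propagates an invalidation: each node forwards only to the copies it granted itself, so a balanced tree finishes in O(log m) rounds. Helper names are hypothetical assumptions:

    #include <stdbool.h>

    #define NUM_PAGES 1024
    #define MAX_PROCS 64

    extern bool local_copy_set[NUM_PAGES][MAX_PROCS]; /* copies this node handed out */
    extern int  prob_owner[NUM_PAGES];
    extern void send_invalidate(int proc, int page, int new_owner);
    extern void drop_local_copy(int page);

    /* Runs on every node that receives an invalidation for `page`. */
    void on_invalidate(int page, int new_owner)
    {
        for (int i = 0; i < MAX_PROCS; i++)       /* forward to our subtree of copies  */
            if (local_copy_set[page][i]) {
                send_invalidate(i, page, new_owner);
                local_copy_set[page][i] = false;
            }
        drop_local_copy(page);                    /* give up our own read copy         */
        prob_owner[page] = new_owner;             /* future faults chase the new owner */
    }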

48 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

49 Experiments  Experimental setup: a prototype shared virtual memory system called IVY (Integrated shared Virtual memory at Yale) was implemented on top of a modified Aegis operating system on the Apollo DOMAIN computer system. IVY can run parallel programs on any number of processors on an Apollo ring network. Parallel programs used: a parallel Jacobi program solving three-dimensional PDEs (partial differential equations), parallel sorting, parallel matrix multiplication, and parallel dot-product.

50 Speedups (Experiments)  The speedup of a program is the ratio of its execution time on a single processor to its execution time on the shared virtual memory system. All the speedups are compared against the ideal.
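In formula form (the standard definition, stated here for clarity rather than quoted from the slides):

    S(p) = \frac{T_1}{T_p}, \qquad \text{ideal (linear) speedup: } S(p) = p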

51 (Experiments)  Speedup of the 3-D PDE program, where n = 50^3: the speedup is super-linear because the data structure for the problem is larger than the physical memory of a single processor. [Plot: ideal solution vs. experimental result.]

52 (Experiments)  Disk paging explains the super-linear speedup: since the data structure for the problem exceeds the physical memory of one processor, a single processor pages heavily to disk, whereas the shared virtual memory effectively combines the physical memories of the processors. [Plot: 1 processor vs. processor 1 with initialization vs. processor 2 without initialization.]

53 (Experiments)  Speedup of the 3-D PDE program, where n = 40^3: here the data structure for the problem fits within the physical memory of a single processor. [Plot: ideal solution vs. experimental result.]

54 (Experiments)  Speedup of the merge-split sort on 200K elements (an n² merging sort with n processors). Even with no communication costs, and with all memory references costing the same, merge-split sort does not yield linear speedup. [Plot: ideal solution vs. experimental result.]

55 (Experiments)  Speedup of parallel dot-product, where each vector has 128K elements: the shared virtual memory system is not likely to provide speedup here, because the ratio of communication cost to computation cost is large. [Plot: ideal solution, the two vectors randomly distributed across processors, and the two vectors resident on one processor.]

56 (Experiments)  Speedup of matrix multiplication C = AB on 128 x 128 square matrices: the program exhibits a high degree of localized computation. [Plot: ideal solution vs. experimental result.]

57 (Experiments)  Comparing the memory coherence algorithms: all algorithms show a similar number of page faults. The number of requests in the fixed distributed manager algorithm is similar to that in the improved centralized manager algorithm, because both need a request to locate the owner of the page for almost every page fault that occurs on a non-manager processor. Overhead is measured in messages (requests); the probOwner fields usually give correct hints, so the dynamic distributed manager needs few forwarded requests.

58 Agenda  Introduction  Shared virtual memory  Memory coherence problem  Centralized manager algorithms  Distributed manager algorithms  Experiments  Conclusion

59 Conclusion  This paper studied two general classes of algorithms for solving the memory coherence problem. The centralized manager is straightforward and easy to implement but creates a traffic bottleneck at the central manager. Among the distributed managers, the fixed distributed manager alleviates that bottleneck, and the dynamic distributed manager has the most desirable overall features. Open questions concern scalability and granularity: does the approach hold beyond 8 processors, and for page sizes other than 1 KB?

60 Questions or Comments?