Software Cache Coherent Shared Memory under Split-C
Ronny Krashinsky, Erik Machnicki

Motivation
- Shared address space parallel programming is conceptually simpler than message passing.
- NOWs (networks of workstations) are more cost-effective than SMPs.
- However, NOWs are a more natural fit for message passing.
- Two approaches to supporting a shared address space with distributed shared memory:
  1. Simulate the hardware solution, using coherent replication.
  2. Translate all accesses to shared variables into explicit messages.

- Split-C uses the second method (no caching), which makes the Split-C implementation much simpler.
- The programmer labels variables as local or global; global accesses become function calls to the Split-C library (see the sketch below).
- Disadvantage: the demand on the programmer is much greater.
  - The programmer must provide efficient data distribution and access.
  - The programmer must manage "caching" by hand.
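
To make the second method concrete, here is a minimal sketch of how a read through a global pointer might be lowered to a library call. The names `global_ptr`, `split_read_word`, and `request_word_from` are illustrative, not the actual Split-C API; `MYPROC` is Split-C's own constant for the caller's processor number.

```c
/* Hedged sketch: lowering a Split-C global read.  Names below are
   illustrative, not the real library API. */
extern int MYPROC;                                   /* Split-C's own-rank constant   */
extern int request_word_from(int proc, void *addr);  /* hypothetical messaging helper */

typedef struct {
    int   proc;   /* owning processor number           */
    void *addr;   /* virtual address on that processor */
} global_ptr;

int split_read_word(global_ptr gp) {
    if (gp.proc == MYPROC)
        return *(int *)gp.addr;                  /* local: a plain load       */
    return request_word_from(gp.proc, gp.addr);  /* remote: explicit messages */
}
```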

Our Solution
- Add automatic coherent caching to Split-C: SWCC-Split-C, Software Cache Coherent Split-C.
- (Almost) no changes to the Split-C programming language.
- The programmer gets a shared memory system with automatic replication on a NOW.
- The programmer's task is simpler, with much less emphasis on data placement.
- Good for irregular applications.

Next: Design, Results, Conclusions

Design Overview
- Fine-grained coherence at the level of blocks of memory.
- Simple MSI (Modified/Shared/Invalid) invalidate protocol.
- A directory structure tracks the state of blocks as they move through the system.
- Each block is associated with a home node.
- NACKs and retries are used to achieve coherence.

[Figure: message-flow notation. Requests travel from the LOCAL NODE through the HOME NODE to a REMOTE NODE (steps 1-3); responses return along the reverse path (steps 4-6).]

Address Blocks
- A Split-C shared variable has a Processor Number and a Local Address (a virtual memory address).
- SWCC partitions the entire address space into blocks; coherence is maintained at the level of blocks.
- The upper bits of the Local Address part of a global variable determine its block address (see the sketch below).
- Addresses associated with the directory structure and coherence protocol are block addresses.
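
As a concrete illustration, extracting a block address is just a mask of the local-address bits. The 64-byte block size here is an assumption for the sketch, not a figure from the paper.

```c
/* Sketch: deriving a block address from the local-address part of a
   global pointer.  BLOCK_SIZE = 64 is an assumed power-of-two size. */
#define BLOCK_SIZE 64UL
#define BLOCK_MASK (~(BLOCK_SIZE - 1UL))

static inline unsigned long block_addr(unsigned long local_addr) {
    return local_addr & BLOCK_MASK;   /* keep the upper bits, drop the offset */
}
```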

Directory Structure
- A hash table of pointers to linked lists of directory entries, indexed by the block address (shifted and masked) and processor number.
- Lives in local memory (malloc'ed at the beginning of the program).
[Figure: hash buckets chaining DIR ENTRY nodes, indexed by BLOCK_ADDR (shifted & masked) and Proc Num.]

Directory Entry
- Fields: block address, state, data, a linked-list pointer, and a user vector (maintained and used only by the home node).
- There is a directory entry for every shared block a program accesses (not only at the home node).
- At the home node, the directory entry gets a copy of local memory. A C sketch of these structures follows.
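
A plausible C rendering of the structures on the last two slides; the field layout and hash-table size are assumptions based on the slide text, not the authors' code.

```c
/* Sketch of the directory structures described above (layout assumed). */
#define DIR_HASH_SIZE 4096   /* number of hash buckets: an assumption */
#define BLOCK_SIZE    64     /* coherence block size: an assumption   */

typedef enum { INVALID, SHARED, MODIFIED, READ_BUSY, WRITE_BUSY } dir_state;

typedef struct dir_entry {
    unsigned long     block_addr;        /* which block this entry covers */
    dir_state         state;             /* protocol state at this node   */
    unsigned long     user_vector;       /* sharer bitmap; home node only */
    char              data[BLOCK_SIZE];  /* copy of the block's data      */
    struct dir_entry *next;              /* chain within a hash bucket    */
} dir_entry;

/* Hash table of bucket heads; malloc'ed in local memory at program start. */
dir_entry *directory[DIR_HASH_SIZE];
```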

Directory Lookup (hit)
1. Calculate the directory hash table index.
2. Load the address of the directory entry.
3. Load the block address field of the directory entry and check that it matches the block address of the global variable.
4. Load the state of the directory entry and check it.
5. Perform the memory access.

The only user optimization is for the home node:
1. Check that the node is the home node.
2. Calculate the directory hash table index.
3. Load the entry from the directory hash table and check that it is NULL (no remote node holds the block, so home memory is current).
4. Perform the memory access.

A C sketch of the hit path follows.
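
The hit path above, written out in C. This sketch reuses the structures and constants from the previous sketch and assumes a `read_miss` slow path; the real implementation may differ.

```c
/* Sketch of the directory-lookup hit path (types from the sketch above). */
extern int read_miss(unsigned long local_addr);   /* hypothetical slow path */

static unsigned dir_hash(unsigned long baddr) {
    return (unsigned)((baddr / BLOCK_SIZE) % DIR_HASH_SIZE);
}

int read_word(unsigned long local_addr) {
    unsigned long baddr = local_addr & ~(unsigned long)(BLOCK_SIZE - 1);
    dir_entry *e = directory[dir_hash(baddr)];       /* steps 1-2: index, load  */
    while (e && e->block_addr != baddr)              /* step 3: match block addr */
        e = e->next;
    if (e && (e->state == SHARED || e->state == MODIFIED))   /* step 4: state ok */
        return *(int *)(e->data + (local_addr & (BLOCK_SIZE - 1)));  /* step 5 */
    return read_miss(local_addr);                    /* miss: run the protocol  */
}
```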

Coherence Protocol
- 3 stable states: Modified, Shared, Invalid; plus two transient states: Read-Busy and Write-Busy.
- If the data is available in the appropriate state, no communication is needed.
- Otherwise, the local node sends a request to the home node. The home node does the necessary processing to reply with the data, and may send invalidate or flush requests to remote nodes.
- Serialization happens at the home node, via NACKs and retries (see the sketch below).
- Messages are sent via Active Messages (Active Message deadlock rules?).
- Simplified state transition diagrams follow.
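
The message vocabulary implied by the diagrams, plus the home node's NACK-on-busy rule, as a sketch; `send_am` is a hypothetical Active Message send, not a real Active Messages call.

```c
/* Sketch: protocol messages and home-node serialization via NACKs. */
typedef enum {
    READ_REQ, READ_RESP, WRITE_REQ, WRITE_RESP,
    INV_REQ,  INV_RESP,  FLUSH_REQ, FLUSH_RESP,
    FLUSH_X_REQ, FLUSH_X_RESP, NACK
} msg_type;

extern void send_am(int proc, msg_type m, unsigned long baddr);  /* hypothetical */

void home_handle_request(dir_entry *e, msg_type m, int requester) {
    /* A request arriving while the block is transitioning is refused;
       the requester retries later, so the home never queues requests. */
    if (e->state == READ_BUSY || e->state == WRITE_BUSY) {
        send_am(requester, NACK, e->block_addr);
        return;
    }
    /* ...otherwise transition according to the diagrams below... */
    (void)m;
}
```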

[Figure: simplified local-node MSI state diagram. Transitions are labeled event / action: a READ in Invalid sends READ_REQ; a WRITE in Invalid or Shared sends WRITE_REQ; READ_RESP installs the block Shared and WRITE_RESP installs it Modified; reads and writes that hit in a sufficient state send nothing.]

[Figure: simplified home-node state diagram, with states Shared, Invalid, Modified (self), Modified (other), Read-Busy, and Write-Busy. A READ_REQ for a block held Modified elsewhere sends FLUSH_REQ and enters Read-Busy; the FLUSH_RESP triggers the READ_RESP. A WRITE_REQ for a Shared block sends N INV_REQs and enters Write-Busy; after N INV_RESPs, the home sends WRITE_RESP. A WRITE_REQ for a block held Modified elsewhere uses FLUSH_X_REQ / FLUSH_X_RESP. Any READ or WRITE request arriving in a busy state is answered with a NACK.]

[Figure: remote-node MSI transitions: INV_REQ / INV_RESP invalidates a copy; FLUSH_REQ / FLUSH_RESP writes back a Modified block and downgrades it to Shared; FLUSH_X_REQ / FLUSH_X_RESP writes back and invalidates.]

Other Design Points
- Race conditions and the write-lock flag
- Non-FIFO network: NACKs and retries (sketched below)
- Duplicate requests
- Bulk transactions
- Stores
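
One way the NACK-and-retry discipline might look from the requesting side, reusing the message types sketched earlier. `home_node` and `wait_for_reply` are hypothetical helpers, and the real code is event-driven over Active Messages rather than a blocking loop.

```c
/* Sketch: the requester retries until the home node accepts the request. */
extern int      home_node(unsigned long baddr);   /* hypothetical: block -> home     */
extern msg_type wait_for_reply(void);             /* hypothetical: poll the AM layer */

void fetch_block(unsigned long baddr, msg_type req) {
    msg_type reply;
    do {
        send_am(home_node(baddr), req, baddr);
        reply = wait_for_reply();   /* spin, polling Active Messages */
    } while (reply == NACK);        /* home was busy: simply retry   */
}
```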

Performance Results
- Micro-benchmarks
- Matrix multiply
- EM3D

Read Micro-Benchmarks

Write Micro-Benchmarks

Matrix Multiplication, three versions (a generic blocked multiply is sketched below):
- Naïve
- Blocked
- Optimized Blocked
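
For reference, a generic blocked multiply in plain sequential C shows why blocking helps under block-level coherence: each tile's data is fetched once and then reused. The 512x512 size and 32-element tile are arbitrary choices for the sketch, not the benchmark's parameters.

```c
/* Generic blocked matrix multiply (plain C, not the Split-C benchmark). */
#define N 512    /* matrix dimension: arbitrary */
#define T 32     /* tile size: arbitrary        */

void mm_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                /* each T x T tile touches only a few coherence blocks,
                   so the protocol fetches each block once and reuses it */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + T; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```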

Naïve MM, Scaling the Number of Processors

Naïve MM, Scaling the Matrix Size

MM: Fixed Size, Fixed Resources, Different Versions
[Figure: MFLOPS vs. block size for the Naive, Basic Blocked, and Optimized Blocked versions.]

EM3D
- H nodes and E nodes depend on each other: on each iteration, the value of each H node is updated based on the values of the E nodes it depends on, and vice versa (a kernel sketch follows).
- Parameters: number of nodes, degree of nodes, remote probability, distance span, number of iterations.
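
The shape of one EM3D update phase, sketched in C. Field names are illustrative; in the real Split-C code the dependency pointers can be global (remote), which is exactly where cached blocks versus per-access messages matters.

```c
/* Sketch of one EM3D update phase (illustrative field names). */
typedef struct em_node {
    double           value;
    int              degree;   /* number of dependencies            */
    struct em_node **deps;     /* global pointers in real Split-C   */
    double          *coeffs;   /* one weight per dependency         */
} em_node;

void update_nodes(em_node *nodes, int n) {
    for (int i = 0; i < n; i++)
        for (int d = 0; d < nodes[i].degree; d++)
            /* a remote deps[d] costs one message per read without caching */
            nodes[i].value -= nodes[i].coeffs[d] * nodes[i].deps[d]->value;
}
```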

EM3D: Scaling Remote Dependency Percentage

EM3D: Scaling Number of Processors

Conclusions
- Automatic coherent caching can make the programmer's life easier: initial data placement is less important.
- For some applications it is even more difficult to predict access patterns or do "caching" in the user program, e.g. Barnes-Hut or ray tracing.
- Cache coherence is also useful in exploiting spatial locality.
- Sometimes caching isn't useful and just adds overhead; potentially the user or compiler could decide to use caching on a per-variable basis.