
1 Ronny Krashinsky Erik Machnicki Software Cache Coherent Shared Memory under Split-C

2 Motivation
 Shared address space parallel programming is conceptually simpler than message passing
 NOWs (networks of workstations) are more cost-effective than SMPs
 However, NOWs are a more natural fit for message passing
 Two approaches to supporting a shared address space with distributed shared memory:
  1. Simulate the hardware solution, using coherent replication
  2. Translate all accesses to shared variables into explicit messages

3
 Split-C uses the second method (no caching)
 This makes the Split-C implementation much simpler
 The programmer labels variables as local or global
 Global accesses become function calls to the Split-C library
 Disadvantage: the demand on the programmer is much greater
  The programmer must provide efficient distribution and access
  The programmer must manage "caching"
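The translation of global accesses into library calls might be sketched as follows; the type and function names here are illustrative, not the actual Split-C library API, and the remote fetch is simulated by a direct local copy:

```c
#include <string.h>

/* Illustrative sketch, not the actual Split-C library API: a global
   pointer pairs a processor number with a local virtual address. */
typedef struct {
    int proc;     /* owning processor */
    void *addr;   /* local address on that processor */
} global_ptr;

/* The compiler turns a global read such as `v = *gp` into a library
   call like this one; a real runtime would send a request message to
   gp.proc and wait for the reply (simulated here by a direct copy). */
static void read_global(global_ptr gp, void *dst, size_t len) {
    memcpy(dst, gp.addr, len);
}
```

Every such call pays the function-call (and possibly network) cost even when the data is nearby, which is exactly the overhead the caching scheme on the next slides targets.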

4 Our Solution
 Add automatic coherent caching to Split-C
 SWCC-Split-C: Software Cache Coherent Split-C
 (Almost) no changes to the Split-C programming language
 The programmer gets a shared memory system with automatic replication on a NOW
 The programmer's task is simpler, with less emphasis on data placement
 Good for irregular applications

5 Next:
 Design
 Results
 Conclusion

6 Design Overview
 Fine-grained coherence at the level of blocks of memory
 Simple MSI invalidate protocol
 Directory structure tracks the state of blocks as they move through the system
 Each block is associated with a home node
 NACKs and retries are used to achieve coherence

7 [Diagram: message flow among LOCAL NODE, HOME NODE, and REMOTE NODE — requests (1)-(3) and responses (4)-(6); introduces the notation used in the following slides]

8 Address Blocks
 A Split-C shared variable has a Processor Number and a Local Address (virtual memory address)
 SWCC: partition the entire address space into blocks
 Coherence is maintained at the level of blocks
 The upper bits of the Local Address part of a global variable determine its block address
 Addresses associated with the directory structure and coherence protocol are block addresses
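Extracting a block address from the local-address part might look like this; the 64-byte block size and the macro names are our assumptions, not values taken from the implementation:

```c
#include <stdint.h>

#define BLOCK_BITS 6                          /* assumed 64-byte blocks */
#define BLOCK_SIZE ((uintptr_t)1 << BLOCK_BITS)
#define BLOCK_MASK (~(BLOCK_SIZE - 1))

/* The upper bits of the local address name the block: clearing the
   low offset bits yields the block address used by the directory. */
static inline uintptr_t block_addr(uintptr_t local_addr) {
    return local_addr & BLOCK_MASK;
}
```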

9 [Diagram: the block address (shifted & masked) together with the processor number indexes a hash bucket holding a chain of DIR ENTRY records]
Directory Structure:
 Hash table of pointers to linked lists of directory entries
 Lives in local memory (malloc'ed at the beginning of the program)

10 Directory Entry
 Fields: block address, state, data, linked-list pointer, user vector (maintained and used only by the home node)
 There is a directory entry for every shared block a program accesses (not only at the home node)
 At the home node, the directory entry gets a copy of the block from local memory
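A C struct matching the slide's description might look like this; the field types and the 32-bit sharer bitmap are assumptions, not the real layout:

```c
#include <stdint.h>

typedef enum { INVALID, SHARED, MODIFIED } block_state;

/* Sketch of a directory entry per the slide; one exists for every
   shared block this node has touched, not only at the home node. */
typedef struct dir_entry {
    uintptr_t block_addr;     /* which block this entry covers */
    block_state state;        /* MSI state of this node's copy */
    unsigned char *data;      /* the block's contents */
    struct dir_entry *next;   /* linked-list (hash-chain) pointer */
    uint32_t user_vector;     /* sharer bitmap; maintained and used
                                 only by the home node */
} dir_entry;
```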

11 Directory Lookup (hit):
1. Calculate the directory hash table index
2. Load the address of the directory entry
3. Load the block addr field of the directory entry
4. Check that it matches the block addr of the global variable
5. Load the state of the directory entry
6. Check the state of the entry
7. Perform the memory access
The only user optimization:
1. Check that the node is the home node
2. Calculate the directory hash table index
3. Load the entry from the directory hash table
4. Check that the entry is NULL
5. Perform the memory access
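The hit sequence and the home-node optimization can be sketched as follows; the table size, block size, and all names are illustrative rather than taken from the implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* assumed; a power of two */
#define BLOCK_BITS 6      /* assumed 64-byte blocks */

typedef struct dir_entry {
    uintptr_t block_addr;
    int state;
    struct dir_entry *next;
} dir_entry;

static dir_entry *table[TABLE_SIZE];

/* Step 1: calculate the hash-table index from the block address. */
static unsigned dir_hash(uintptr_t block) {
    return (unsigned)((block >> BLOCK_BITS) & (TABLE_SIZE - 1));
}

/* Steps 2-4 of the hit path: load the entry and match its block
   address; the caller then checks the state and does the access. */
static dir_entry *dir_lookup(uintptr_t block) {
    for (dir_entry *e = table[dir_hash(block)]; e != NULL; e = e->next)
        if (e->block_addr == block)
            return e;
    return NULL;
}

/* The home-node optimization: if this is the home node and no entry
   exists, no coherence state has been created for the block, so the
   access can go straight to local memory. */
static int can_access_directly(int my_proc, int home_proc, uintptr_t block) {
    return my_proc == home_proc && dir_lookup(block) == NULL;
}
```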

12 Coherence Protocol
 3 stable states: Modified, Shared, Invalid; also: Read-Busy, Write-Busy
 If data is available in the appropriate state, no communication is needed
 Otherwise, the local node sends a request to the home node; the home node does the necessary processing to reply with the data, and may send invalidate or flush requests to remote nodes
 Serialization at the home node: NACKs and retries
 Messages via Active Messages (Active Message deadlock rules?)
 State transition diagrams (simplified)...
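The "no communication if the state is appropriate" check can be sketched as a small predicate; the enum layout and function name are ours, not the implementation's:

```c
/* Stable MSI states plus the home node's transient states. */
typedef enum { INVALID, SHARED, MODIFIED, READ_BUSY, WRITE_BUSY } cc_state;

/* Local-node fast path (a sketch): an access that hits in an
   appropriate state needs no communication; anything else becomes a
   READ_REQ or WRITE_REQ sent to the block's home node. */
static int needs_request(cc_state s, int is_write) {
    if (s == MODIFIED) return 0;            /* M satisfies reads and writes */
    if (s == SHARED && !is_write) return 0; /* S satisfies reads only */
    return 1;
}
```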

13 [State-transition diagram, LOCAL NODE: M, S, I states with transitions labeled event / message, e.g. READ / READ_REQ, WRITE / WRITE_REQ, READ_RESP / -, WRITE_RESP / -]

14 [State-transition diagram, HOME NODE: M (self), M (other), S, I, Read-Busy, and Write-Busy states; transitions include READ_REQ / READ_RESP, WRITE_REQ / WRITE_RESP, READ_REQ / FLUSH_REQ then FLUSH_RESP / READ_RESP, WRITE_REQ / N * INV_REQ then N * INV_RESP / WRITE_RESP, WRITE_REQ / FLUSH_X_REQ then FLUSH_X_RESP / WRITE_RESP; requests arriving in a busy state are NACKed (READ OR WRITE REQ / NACK)]

15 [State-transition diagram, REMOTE NODE: M, S, I states; FLUSH_X_REQ / FLUSH_X_RESP, FLUSH_REQ / FLUSH_RESP, INV_REQ / INV_RESP]

16 Other Design Points:
 Race conditions: write lock flag
 Non-FIFO network: NACKs and retries
 Duplicate requests
 Bulk transactions
 Stores

17 Performance Results
 Micro-Benchmarks
 Matrix-Multiply
 EM3D

18 Read Micro-Benchmarks

19 Write Micro-Benchmarks

20 Matrix Multiplication
 Naïve
 Blocked
 Optimized Blocked

21 Naïve MM, Scaling the Number of Processors

22 Naïve MM, Scaling the Matrix Size

23 MM: Fixed Size, Fixed Resources, Different Versions
[Chart: MFLOPS (log scale, 1-100) vs. block size (1-128) for the Naive, Basic Blocked, and Optimized Blocked versions]

24 EM3D
 H nodes and E nodes depend on each other
 Each iteration, the value of each H node is updated based on the values of the E nodes it depends on, and vice-versa
 Parameters: number of nodes, degree of nodes, remote probability, distance span, number of iterations
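A single node update in this kernel might be sketched as follows; the field names are illustrative, and under SWCC the reads of remote dependency values would go through the cached global-access path rather than plain pointers:

```c
/* EM3D update sketch: a node's new value is a weighted sum over the
   nodes it depends on (H nodes read E nodes, and vice-versa). */
typedef struct em_node {
    double value;
    int degree;                /* number of dependencies */
    struct em_node **deps;     /* nodes this node depends on */
    double *weights;           /* one weight per dependency */
} em_node;

static void update_node(em_node *n) {
    double v = 0.0;
    for (int i = 0; i < n->degree; i++)
        v += n->weights[i] * n->deps[i]->value;  /* possibly remote */
    n->value = v;
}
```

The "remote probability" parameter controls how many of these dependency reads cross node boundaries, which is why the results scale it directly.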

25 EM3D: Scaling Remote Dependency Percentage

26 EM3D: Scaling Number of Processors

27 Conclusions
 Automatic coherent caching can make the programmer's life easier
 Initial data placement is less important
 For some applications it is even more difficult to predict access patterns or do "caching" in the user program, e.g. Barnes-Hut or ray tracing
 Cache coherence is also useful in exploiting spatial locality
 Sometimes caching isn't useful and just adds overhead; potentially the user or compiler could decide to use caching on a per-variable basis

