1
Cache Coherent Distributed Shared Memory
2
Motivations
Small processor count
–SMP machines
–Single shared memory with multiple processors interconnected with a bus
Large processor count
–Distributed Shared Memory machines
–Largely message-passing architectures
3
Programming Concerns
Message passing
–Accessing memory involves sending request packets
–Communication costs
Shared memory model
–Ease of programming
–But not very scalable
Can we be both scalable and easy to program?
4
Distributed Shared Memory
Physically distributed memory
Implemented as a single shared address space
Also known as NUMA machines, since memory access times are non-uniform
–Local access times < remote access times
5
DSM and Memory Access
Big difference between accessing local and remote data
Large differences make it difficult to hide latency
What about caching?
–In short, it's difficult
–The problem is cache coherence
6
Cache Coherence
Different processors may access values at the same memory location
How do we ensure data integrity at all times?
–An update made by a processor at time t must be available to other processors at time t+1 (see the illustration below)
Two approaches:
–Snoopy protocols
–Directory-based protocols
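As a software-level illustration, a minimal sketch in C++, assuming two threads on different processors sharing one location `x`; the `std::atomic` stands in for a hardware-coherent memory location, and coherence is what makes the writer's update become visible to the spinning reader:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};  // one shared memory location, cached by both processors

int main() {
    std::thread writer([] { x.store(1); });  // update at time t
    std::thread reader([] {
        while (x.load() == 0) {}             // spins until the update becomes visible
        std::puts("reader observed the writer's update");
    });
    writer.join();
    reader.join();
}
```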
7
Snoopy Coherence Protocols
Transparent to the user
Easy to implement
On a read
–Data is fetched from another cache or from memory
On a write
–Copies in all other caches are invalidated
–Write-back may be delayed or immediate
The bus plays an important role (see the sketch below)
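A minimal sketch of the write-invalidate idea, assuming a simplified model where every cache snoops a shared bus and drops its copy when another cache broadcasts a write; the `Cache`/`Bus` types and their methods are illustrative assumptions, not any real protocol's interface:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

enum class State { Invalid, Shared, Modified };

struct Line {
    State state = State::Invalid;
    uint64_t data = 0;
};

struct Cache {
    std::unordered_map<uint64_t, Line> lines;  // address -> cached line

    // Snoop a write issued by another cache: drop our copy if we hold one.
    void snoop_write(uint64_t addr) {
        auto it = lines.find(addr);
        if (it != lines.end()) it->second.state = State::Invalid;
    }
};

struct Bus {
    std::array<Cache*, 4> caches{};

    // Broadcast a write on the bus: every other cache invalidates its copy.
    void broadcast_write(int writer, uint64_t addr) {
        for (int i = 0; i < (int)caches.size(); ++i)
            if (i != writer && caches[i]) caches[i]->snoop_write(addr);
    }
};
```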
8
Example
9
But It Does Not Scale!
Not feasible for machines with memory distributed across a large number of systems
The broadcast-on-bus approach is the problem
–Leads to bus saturation
–Wastes processor cycles snooping every cache in the system
10
Directory-Based Cache Coherence
A directory tracks which processors have cached each block of memory
The directory contains information for all cache blocks in the system
Each cache block can be in one of three states:
–Invalid
–Shared
–Exclusive
To enter the exclusive state, all other cached copies of the same memory location are invalidated (see the sketch below)
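A minimal sketch of a directory entry, assuming one entry per memory block holding the state plus a bit vector of sharers; the `DirEntry` name and the 64-processor limit are illustrative assumptions:

```cpp
#include <cstdint>

enum class BlockState { Invalid, Shared, Exclusive };

struct DirEntry {
    BlockState state = BlockState::Invalid;
    uint64_t sharers = 0;  // bit i set => processor i holds a copy

    void add_sharer(int p) {
        sharers |= (1ull << p);
        state = BlockState::Shared;
    }

    // Grant exclusive ownership to p: every other sharer must be invalidated.
    uint64_t make_exclusive(int p) {
        uint64_t to_invalidate = sharers & ~(1ull << p);
        sharers = (1ull << p);
        state = BlockState::Exclusive;
        return to_invalidate;  // caller sends invalidations to these processors
    }
};
```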
11
Original Form Not Popular
Compared to snoopy protocols
–Directory systems avoid broadcasting on the bus
But all requests are served by one directory server
–May saturate the directory server
Still not scalable
What about distributing the directory?
–Load balancing
–A hierarchical model?
12
Distributed Directory Protocol
Involves sending messages among three node types
–Local node: the requesting processor's node
–Home node: the node containing the memory location
–Remote node: a node holding the cache block in exclusive state
13
3 Scenarios
Scenario 1
–Local node sends a request to the home node
–Home node sends the data back to the local node
Scenario 2
–Local node sends a request to the home node
–Home node redirects the request to the remote node
–Remote node sends the data back to the local node
Scenario 3
–Local node requests exclusive state
–Home node redirects invalidation requests to the other remote nodes
(The three flows are sketched below)
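A minimal sketch of the three flows from the home node's point of view, assuming a toy message layer; the message names and the `home_node_handle` function are illustrative assumptions, not the actual DASH message set:

```cpp
#include <cstdint>
#include <cstdio>

enum class Msg { ReadReq, ExclReq };

// A toy home-node handler covering the three scenarios above.
void home_node_handle(Msg m, bool exclusive_elsewhere, uint64_t sharers) {
    switch (m) {
    case Msg::ReadReq:
        if (!exclusive_elsewhere)
            std::puts("home -> local: data");                             // scenario 1
        else
            std::puts("home -> remote: forward; remote -> local: data");  // scenario 2
        break;
    case Msg::ExclReq:
        for (int p = 0; p < 64; ++p)                                      // scenario 3
            if (sharers & (1ull << p))
                std::printf("home -> node %d: invalidate\n", p);
        std::puts("home -> local: data (exclusive)");
        break;
    }
}

int main() {
    home_node_handle(Msg::ReadReq, false, 0);      // scenario 1
    home_node_handle(Msg::ReadReq, true, 0);       // scenario 2
    home_node_handle(Msg::ExclReq, false, 0b110);  // scenario 3: invalidate nodes 1 and 2
}
```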
14
Example
15
Stanford DASH Multiprocessor
First operational multiprocessor to support a scalable coherence protocol
Demonstrates that scalability and cache coherence are not incompatible
Two hypotheses:
–Shared memory machines are easier to program
–Cache coherence is vital
16
Past Experience
From experience:
–Memory access times differ widely between physical locations
–Latency and bandwidth are important for shared memory systems
–Caching helps amortize the cost of memory access in a memory hierarchy
17
DASH Multiprocessor
Relaxed memory consistency model
Observation:
–Most programs use explicit synchronization
–Sequential consistency is not necessary
–Allows the system to perform writes without waiting until all invalidations have completed
Offers advantages in hiding memory latency (see the sketch below)
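A minimal sketch of the same idea using C++11 atomics rather than DASH hardware: ordinary writes may be buffered and reordered freely, and only the explicit synchronization point (the release/acquire pair on `ready`) forces visibility:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary data, carries no ordering of its own
std::atomic<bool> ready{false};  // the explicit synchronization point

void producer() {
    payload = 42;                                  // may be buffered/reordered...
    ready.store(true, std::memory_order_release);  // ...until this release point
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // acquire pairs with release
    assert(payload == 42);  // guaranteed visible after the synchronization
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```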
18
DASH Multiprocessor
Non-binding software prefetch
–Prefetches data into the cache
–Maintains coherence
–Transparent to the user
The compiler can issue such instructions to improve runtime performance
–If the data is invalidated, it is simply re-fetched when accessed
Helps hide latency as well (see the sketch below)
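A minimal sketch of software prefetching using GCC/Clang's `__builtin_prefetch`, which is likewise non-binding: it is only a hint, the prefetched line stays coherent, and a later invalidation simply means a normal miss on access; the prefetch distance of 8 iterations is an illustrative tuning assumption:

```cpp
#include <cstddef>

long sum(const long* a, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);  // hint: bring this element in early
        s += a[i];
    }
    return s;
}
```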
19
DASH Multiprocessor
Remote access cache
–Remote accesses are combined and buffered within individual nodes
–Can be likened to having a two-level cache hierarchy (see the sketch below)
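A minimal sketch of combining remote accesses in a per-node buffer, assuming outstanding requests to the same block are merged so that only one remote fetch is issued; the `RemoteAccessCache` type and its methods are illustrative assumptions:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RemoteAccessCache {
    // block address -> list of requesters waiting on this node
    std::unordered_map<uint64_t, std::vector<int>> pending;

    // Returns true only if a remote fetch must actually be issued.
    bool request(uint64_t block, int requester) {
        auto& waiters = pending[block];
        waiters.push_back(requester);
        return waiters.size() == 1;  // later requests combine with the first
    }

    // Remote data arrived: hand it to every combined requester.
    std::vector<int> fill(uint64_t block) {
        auto waiters = std::move(pending[block]);
        pending.erase(block);
        return waiters;
    }
};
```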
20
Lessons
High performance requires careful planning of remote data accesses
Scaling an application depends on other factors
–Load balancing
–Limited parallelism
–It is difficult to scale an application to use more processors
21
Challenges
Programming model?
–A model that helps programmers reason about code rather than fine-tune for a specific machine
Fault tolerance and recovery?
–More computers = higher chance of failure
Increasing latency?
–More levels of hierarchy = a larger variety of latencies
22
Callisto
Previously, networking gateways
–Handled a diverse set of services
–Handled thousands of channels
–Complex designs involving many chips
–High power requirements
Callisto is a gateway on a chip
–Used to implement communication gateways for different networks
23
In a Nutshell
Integrates DSPs, CPUs, RAM, and I/O channels on a single chip
A programmable multi-service platform
Handles 60 to 240 channels per chip
An array of Callisto chips can fit in a small space
–Power efficient
–Handles a large number of channels