1
WildFire: A Scalable Path for SMPs
Erik Hagersten and Michael Koster
Presented by Andrew Waterman, ECE259, Spring 2008
2
Insight and Motivation
SMP abandoned for more scalable cc-NUMA
But SMP bandwidth has scaled faster than CPU speed
cc-NUMA is scalable but more complicated
–Program/OS specialization necessary
–Communication to remote memory is slow
–May not be optimal for real access patterns
SMP (UMA) is simpler
–More straightforward programming model
–Simpler scheduling, memory management
–No slow remote memory access
Why not leverage SMP to the extent that it scales?
3
Multiple SMP (MSMP)
Connect a few large SMPs (nodes) together
Distributed shared memory
–Weren't we just NUMA-bashing?
Several CPUs per node => many local memory refs
Few nodes => an unscalable coherence protocol is OK
4
WildFire Hardware
MSMP with 112 UltraSPARCs
–Four unmodified Sun E6000 SMPs
GigaPlane bus (3.2 GB/s within a node)
16 dual-CPU or I/O cards per node
WildFire Interface (WFI) is just another I/O board (cool!)
–SMPs (UMA) connected via WFI == cc-NUMA (!)
But this is OK... few, large nodes
Full cache coherence, both intra- and inter-node
Figure: WildFire from 30,000 ft (emphasis on a single SMP node)
5
WildFire Software
ABI-compatible with Sun SMPs
–It's the software, stupid!
Slightly (allegedly) modified Solaris 2.6
Threads in the same process are grouped onto the same node
Hierarchical Affinity Scheduler (HAS)
Coherent Memory Replication (CMR)
–OK, so this isn't purely a software technique
6
Coherent Memory Replication (CMR)
S-COMA with fixed home locations for each block
–For those keeping score, that means it's not COMA
Local physical pages "shadow" remote physical pages
–Keep frequently-read pages close: lower average latency
Implementation: hardware counters (see the sketch below)
–CMR page allocation handled within the OS
–Coherence still in hardware at block granularity
Enabled/disabled at page granularity
CMR memory allocation adjusts with memory pressure
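A minimal C sketch of the counter-driven replication decision described above. The names (cmr_observe_remote_miss, os_allocate_shadow_page) and the threshold value are illustrative assumptions, not WildFire's or Solaris's actual interface.

```c
/* Sketch of CMR's counter-driven replication decision; names and
 * threshold are assumptions, not the real WildFire/Solaris code. */
#include <stdint.h>
#include <stdbool.h>

#define CMR_REPLICATION_THRESHOLD 64   /* assumed trip count */

/* Hypothetical OS hook: allocate a local shadow page for a remote
 * page, if memory pressure allows; returns true on success. */
extern bool os_allocate_shadow_page(uint64_t global_page);

struct cmr_page_stats {
    uint64_t global_page;   /* global address of the remote page */
    uint32_t miss_counter;  /* HW counter: local misses to this page */
    bool     replicated;    /* already shadowed locally? */
};

/* Conceptually invoked when the WFI services a local miss from a
 * remote node's memory. */
void cmr_observe_remote_miss(struct cmr_page_stats *st)
{
    if (st->replicated)
        return;             /* reads already hit the local shadow */

    if (++st->miss_counter >= CMR_REPLICATION_THRESHOLD) {
        /* The counter trips into the OS, which decides whether to
         * replicate the page; coherence stays per-block in HW. */
        if (os_allocate_shadow_page(st->global_page))
            st->replicated = true;
        st->miss_counter = 0;
    }
}
```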
7
Hierarchical Affinity Scheduling (HAS)
Exploit locality by scheduling a process on the last node on which it executed
Only reschedule onto another node when load imbalance exceeds a threshold (see the sketch below)
Works particularly well when combined with CMR
–Frequently-accessed remote pages are still shadowed locally after a context switch
Lagniappe locality
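A minimal sketch of the HAS placement rule above: stay on the last node unless inter-node imbalance crosses a threshold. The imbalance metric (run-queue-length delta) and all names here are assumptions for illustration, not Solaris internals.

```c
/* Affinity-first node selection: prefer the node a process last ran
 * on; migrate only under sufficient load imbalance. */
#define NNODES 4
#define IMBALANCE_THRESHOLD 2   /* assumed: tolerable run-queue delta */

int runnable[NNODES] = {3, 1, 2, 5};  /* example per-node queue lengths */

int has_pick_node(int last_node)
{
    int least = 0;
    for (int n = 1; n < NNODES; n++)
        if (runnable[n] < runnable[least])
            least = n;

    /* Prefer the last node: its caches, and any CMR shadow pages,
     * are still warm.  Migrate only if staying would leave the
     * system badly imbalanced. */
    if (runnable[last_node] - runnable[least] <= IMBALANCE_THRESHOLD)
        return last_node;
    return least;
}
```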
8
WildFire Implementation
Figure: a single Sun E6000 with WFI. Recall: the WildFire Interface is just one of 16 standard cards on the GigaPlane bus.
9
WildFire Implementation
Network Interface Address Controller (NIAC) + Network Interface Data Controller (NIDC) == WFI
NIAC interfaces with the GigaPlane bus and handles inter-node coherence
Four NIDCs talk to the point-to-point interconnect between nodes
–Three ports per NIDC (one for each remote node)
–800 MB/s in each direction to each remote node (worked out below)
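A back-of-the-envelope check of the link figures on this slide; all inputs come straight from the numbers above.

```c
/* Aggregate WFI point-to-point bandwidth per node. */
#include <stdio.h>

int main(void)
{
    const int    remote_nodes = 3;    /* 4-node system: 3 neighbors */
    const double per_link_gbs = 0.8;  /* 800 MB/s, each direction */

    double per_dir = remote_nodes * per_link_gbs;  /* 2.4 GB/s */
    printf("per node, per direction:   %.1f GB/s\n", per_dir);
    printf("per node, both directions: %.1f GB/s\n", 2 * per_dir);
    return 0;
}
```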
10
WildFire Cache Coherence
Intra-node coherence: bus + snooping
Inter-node (global) coherence: directory
–Directory state kept at a block's home node
Directory cache (SRAM) backed by memory
–Home node determined by high-order address bits (see the sketch below)
–MOSI
–Nothing special, since scalability is not an issue
Blocking directory, 3-stage writeback => no corner cases
NIAC sits on the bus and asserts the "ignore" signal for requests that global coherence must attend to
–NIAC intervenes if the block's state is inadequate or the block resides in remote memory
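A minimal sketch of deriving the home node from high-order physical-address bits, as the slide describes. The bit positions and the intervention predicate are assumptions; WildFire's real address decode may differ.

```c
/* Home-node decode and the NIAC's "ignore" decision, sketched. */
#include <stdint.h>

#define NODE_SHIFT 36u   /* assumed: node ID in bits [37:36] */
#define NODE_MASK  0x3u  /* 4 nodes -> 2 bits */

static inline unsigned home_node(uint64_t paddr)
{
    return (unsigned)((paddr >> NODE_SHIFT) & NODE_MASK);
}

/* The NIAC snoops every bus request and asserts "ignore" when the
 * global protocol must get involved: the block lives in remote
 * memory, or the locally observable state is insufficient. */
static inline int niac_must_intervene(uint64_t paddr, unsigned my_node,
                                      int local_state_sufficient)
{
    return home_node(paddr) != my_node || !local_state_sufficient;
}
```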
11
WildFire Cache Coherence
Coherent Memory Replication complicates matters
–A local shadow page has a different physical address from its corresponding remote page
–If a block's state is insufficient, the WFI must look up the global address in order to issue the remote request
Stored in the LPA2GA SRAM (see the sketch below)
Also cache the reverse lookup (GA2LPA)
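A minimal sketch of the two translations the slide names. Plain arrays stand in for the WFI's dedicated SRAMs; the page size and all names are assumptions.

```c
/* LPA<->GA translation for CMR shadow pages, sketched. */
#include <stdint.h>

#define PAGE_SHIFT 13   /* assumed 8 KB pages */

extern uint64_t lpa2ga[];  /* local PFN  -> global page number */
extern uint64_t ga2lpa[];  /* global PFN -> local shadow PFN */

/* A local access to a shadow page that needs a remote request must
 * first translate its local physical address to the global one. */
static inline uint64_t to_global(uint64_t lpa)
{
    uint64_t off = lpa & ((1ull << PAGE_SHIFT) - 1);
    return (lpa2ga[lpa >> PAGE_SHIFT] << PAGE_SHIFT) | off;
}

/* Incoming coherence traffic names blocks by global address; the
 * reverse map finds the local shadow copy to invalidate or update. */
static inline uint64_t to_local(uint64_t ga)
{
    uint64_t off = ga & ((1ull << PAGE_SHIFT) - 1);
    return (ga2lpa[ga >> PAGE_SHIFT] << PAGE_SHIFT) | off;
}
```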
12
Coherence Example
13
WildFire Memory Latency
WildFire compared to SGI Origin (2 R10Ks per node) and Sequent NUMA-Q (4 Xeons per node)
WF's remote memory latency is mediocre (2.5x Origin's, similar to NUMA-Q's), but less relevant because remote accesses are less frequent (1/14 as many as Origin, 1/7 as many as NUMA-Q); the arithmetic below illustrates the tradeoff
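A worked illustration of the slide's argument that a higher remote latency can still win if remote accesses are rare enough. The absolute latencies and the baseline remote-access rate below are made-up round numbers, not measurements from the paper; only the ratios (2.5x remote latency, 1/14 the remote-access rate vs. Origin) come from this slide.

```c
/* Average memory latency = f_local*L_local + f_remote*L_remote. */
#include <stdio.h>

int main(void)
{
    const double local_ns         = 300.0;  /* assumed, both systems */
    const double origin_remote_ns = 800.0;  /* assumed */
    const double wf_remote_ns     = 2.5 * origin_remote_ns;  /* slide */

    const double origin_remote_frac = 0.14;                   /* assumed */
    const double wf_remote_frac     = origin_remote_frac / 14.0;

    double origin_avg = (1 - origin_remote_frac) * local_ns
                      + origin_remote_frac * origin_remote_ns;
    double wf_avg     = (1 - wf_remote_frac) * local_ns
                      + wf_remote_frac * wf_remote_ns;

    printf("Origin avg:   %.0f ns\n", origin_avg);  /* ~370 ns */
    printf("WildFire avg: %.0f ns\n", wf_avg);      /* ~317 ns */
    return 0;
}
```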
14
Evaluating WildFire
Clever performance-evaluation methodology: isolate WildFire's NUMA effects by comparing a single-node, 16-CPU system with a two-node, 8-CPU/node system
–Pure SMP vs. WildFire
Also compare with NUMA-fat
–Basically WF with no OS support, i.e. no CMR, no HAS, no locality-aware memory allocation, no kernel replication
And compare with NUMA-thin
–NUMA-fat but with small (2-CPU) nodes
Finally, turn off HAS and CMR to evaluate their contribution to WF's performance
15
Evaluating WildFire
WF with HAS+CMR comes within 13% of pure SMP
Speedup(HAS+CMR) >> Speedup(HAS) * Speedup(CMR)
Locality-aware allocation and large nodes are important
16
Evaluating WildFire
Performance trends correlate with locality of reference
HAS + CMR + kernel replication + initial allocation improve locality of access from 50% (i.e., a uniform distribution between two nodes) to 87%
17
Summary
WildFire = a few large SMP nodes + directory-based coherence between nodes + fast point-to-point interconnect + clever scheduling and replication techniques
Pretty good performance (unfortunately, no numbers for 112 CPUs)
Good idea?
–I think so, but I doubt there's much room for scalability
–Then again, that wasn't the point
Criticisms?
–Authors are very proud of their slow directory protocol
–Kernel modifications may not be so slight
18
Questions?
Erik Hagersten and Michael Koster
Presented by Andrew Waterman, ECE259, Spring 2008