WildFire: A Scalable Path for SMPs
Erik Hagersten and Michael Koster
Presented by Andrew Waterman
ECE259 Spring 2008

Insight and Motivation
SMP abandoned for more scalable cc-NUMA
But SMP bandwidth has scaled faster than CPU speed
cc-NUMA is scalable but more complicated
–Program/OS specialization necessary
–Communication to remote memory is slow
–May not be optimal for real access patterns
SMP (UMA) is simpler
–More straightforward programming model
–Simpler scheduling, memory management
–No slow remote memory access
Why not leverage SMP to the extent that it scales?

Multiple SMP (MSMP)
Connect a few large SMPs (nodes) together
Distributed shared memory
–Weren't we just NUMA-bashing?
Several CPUs per node => many local memory refs
Few nodes => an unscalable coherence protocol is OK

WildFire Hardware
MSMP with 112 UltraSPARCs
–Four unmodified Sun E6000 SMPs
GigaPlane bus (3.2 GB/s within a node)
16 2-CPU or I/O cards per node
WildFire Interface (WFI) is just another I/O board (cool!)
–SMPs (UMA) connected via WFI == cc-NUMA (!)
But this is OK... few, large nodes
Full cache coherence, both intra- & inter-node
[Figure: WildFire from 30,000 ft (emphasis on a single SMP node)]

WildFire Software
ABI-compatible with Sun SMPs
–It's the software, stupid!
Slightly (allegedly) modified Solaris 2.6
Threads in same process grouped onto same node
Hierarchical Affinity Scheduler (HAS)
Coherent Memory Replication (CMR)
–OK, so this isn't purely a software technique

Coherent Memory Replication (CMR)
S-COMA with fixed home locations for each block
–For those keeping score, that means it's not COMA
Local physical pages "shadow" remote physical pages
–Keep frequently-read pages close: less avg. latency
Implementation: hardware counters (see the sketch below)
–CMR page allocation handled within the OS
–Coherence still in hardware at block granularity
Enabled/disabled at page granularity
CMR memory allocation adjusts with memory pressure
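The sketch below illustrates the CMR policy decision only, assuming a per-page hardware miss counter and an illustrative threshold. The struct layout, the threshold value, and the function names are my assumptions, not Sun's interface; only the policy itself (replicate heavily-missed remote pages locally, back off under memory pressure, leave block-level coherence to hardware) comes from the slide.

/*
 * Minimal sketch of the CMR replication decision.  CMR_MISS_THRESHOLD,
 * struct cmr_page, and cmr_should_replicate() are illustrative names,
 * not WildFire's actual interface.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CMR_MISS_THRESHOLD 64u          /* assumed tuning knob */

struct cmr_page {
    uint64_t global_pa;                 /* page address at its fixed home node  */
    uint64_t shadow_lpa;                /* local frame, valid once shadowed     */
    uint32_t remote_misses;             /* bumped by hardware on remote fetches */
    bool     shadowed;                  /* true once a local shadow exists      */
};

/*
 * Policy only: decide whether this page has earned a local replica.
 * free_local_frames models memory pressure; CMR stops replicating (and
 * gives frames back) when local memory is tight.
 */
bool cmr_should_replicate(const struct cmr_page *pg, size_t free_local_frames)
{
    if (pg->shadowed)
        return false;                   /* already replicated locally */
    if (free_local_frames == 0)
        return false;                   /* respect memory pressure    */
    return pg->remote_misses >= CMR_MISS_THRESHOLD;
}

/*
 * On a true result the OS would allocate a local frame, remap the virtual
 * page to it, and let the hardware keep individual cache blocks coherent,
 * so the shadow page can be filled lazily, S-COMA style.
 */

The key design point the sketch preserves is the split of responsibilities: replication is a software policy at page granularity, while correctness (block-level coherence) never depends on it.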

Hierarchical Affinity Scheduling (HAS)
Exploit locality by scheduling a process on the last node on which it executed
Only reschedule onto another node when load imbalance exceeds a threshold (see the sketch below)
Works particularly well when combined with CMR
–Frequently-accessed remote pages still shadowed locally after a context switch
Lagniappe locality
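As a rough illustration of the placement rule on this slide, here is a small sketch in C. The node count matches the WildFire prototype, but the imbalance metric (runnable-thread counts), the threshold, and all names are assumptions made for the example; the real Solaris 2.6 dispatcher is of course far more elaborate.

/*
 * Sketch of the HAS placement rule: prefer the node a thread last ran on,
 * and migrate only when the load imbalance crosses a threshold.
 */
#define NUM_NODES           4    /* the WildFire prototype had four nodes     */
#define IMBALANCE_THRESHOLD 2    /* assumed: tolerated extra runnable threads */

struct has_thread {
    int last_node;               /* node this thread most recently ran on */
};

/* runnable[i] = number of runnable threads currently queued on node i */
int has_pick_node(const struct has_thread *t, const int runnable[NUM_NODES])
{
    int home = t->last_node;

    /* Find the least-loaded node. */
    int best = 0;
    for (int i = 1; i < NUM_NODES; i++)
        if (runnable[i] < runnable[best])
            best = i;

    /*
     * Stay on the home node unless the imbalance is large; staying put keeps
     * the thread near its warm caches and any CMR-shadowed pages.
     */
    if (runnable[home] - runnable[best] <= IMBALANCE_THRESHOLD)
        return home;
    return best;
}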

WildFire Implementation
A single Sun E6000 with WFI.
Recall: the WildFire Interface is just one of 16 standard cards on the GigaPlane bus.

WildFire Implementation
Network Interface Address Controller (NIAC) + Network Interface Data Controller (NIDC) == WFI
NIAC interfaces with the GigaPlane bus and handles inter-node coherence
Four NIDCs talk to the point-to-point interconnect between nodes
–Three ports per NIDC (one for each remote node)
–800 MB/s in each direction to each remote node

WildFire Cache Coherence
Intra-node coherence: bus + snoopy
Inter-node (global) coherence: directory (sketched below)
–Directory state kept at a block's home node
–Directory cache (SRAM) backed by memory
–Home node determined by high-order address bits
–MOSI
–Nothing special, since scalability is not an issue
Blocking directory, 3-stage WB => no corner cases
NIAC sits on the bus and asserts the "ignore" signal for requests that global coherence must attend to
–NIAC intervenes if the block's state is inadequate or the block resides in remote memory
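To make the division of labor concrete, here is a small sketch of the global-coherence bookkeeping described above: the home node comes from high-order physical-address bits, and a per-block directory entry holds a MOSI-style state plus a sharer bitmask. The address width, bit positions, and the simplified action function are assumptions for illustration; the real NIAC keeps its directory in an SRAM cache backed by memory and handles many more cases.

/*
 * Illustrative sketch of WildFire-style global coherence state.  Only the
 * overall structure (home node from high-order address bits, blocking MOSI
 * directory at the home node) follows the slide; everything else is assumed.
 */
#include <stdbool.h>
#include <stdint.h>

#define NODE_BITS 2                     /* four nodes -> two high-order bits */
#define ADDR_BITS 41                    /* assumed physical address width    */

enum dir_state  { DIR_INVALID, DIR_SHARED, DIR_OWNED, DIR_MODIFIED };
enum dir_action { DIR_REPLY_FROM_MEMORY, DIR_FORWARD_TO_OWNER,
                  DIR_INVALIDATE_SHARERS };

struct dir_entry {
    enum dir_state state;
    uint8_t        owner;               /* valid in OWNED/MODIFIED */
    uint8_t        sharers;             /* one bit per node        */
};

/* Home node = high-order bits of the global physical address. */
static inline unsigned home_node(uint64_t global_pa)
{
    return (unsigned)(global_pa >> (ADDR_BITS - NODE_BITS));
}

/*
 * At the home node: pick the next protocol action for an incoming request.
 * Because the directory blocks while a request is outstanding, one action
 * at a time is enough for this sketch.
 */
enum dir_action home_handle(const struct dir_entry *e, bool is_write)
{
    if (e->state == DIR_MODIFIED || e->state == DIR_OWNED)
        return DIR_FORWARD_TO_OWNER;    /* owner holds the latest data   */
    if (is_write && e->sharers != 0)
        return DIR_INVALIDATE_SHARERS;  /* writer needs exclusive access */
    return DIR_REPLY_FROM_MEMORY;       /* memory's copy is valid        */
}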

WildFire Cache Coherence
Coherent Memory Replication complicates matters
–A local shadow page has a different physical address from its corresponding remote page
–If a block's state is insufficient, the WFI must look up the global address in order to issue a remote request
Stored in the LPA2GA SRAM
Also cache the reverse lookup (GA2LPA); translation sketched below
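The sketch below shows the two translations the WFI needs once shadow pages exist, under an assumed page size and table layout (flat arrays indexed by frame number). The real hardware holds LPA2GA in a dedicated SRAM and caches the reverse (GA2LPA) lookup; everything else here is illustrative.

/*
 * Illustrative address translation for CMR shadow pages.  Outgoing remote
 * requests must carry the global address (LPA2GA); incoming coherence
 * traffic for a shadowed block must be redirected to the local shadow
 * frame (GA2LPA).  The 8 KB page size and flat tables are assumptions made
 * to keep the sketch short.
 */
#include <stdint.h>

#define PAGE_SHIFT   13u                      /* assumed 8 KB pages     */
#define TABLE_FRAMES (1u << 16)               /* assumed table capacity */

static uint64_t lpa2ga[TABLE_FRAMES];         /* local frame  -> global page */
static uint64_t ga2lpa[TABLE_FRAMES];         /* global frame -> local page  */

/* Local (shadow) physical address -> global address for the interconnect. */
uint64_t to_global(uint64_t local_pa)
{
    uint64_t off = local_pa & ((1u << PAGE_SHIFT) - 1);
    return lpa2ga[(local_pa >> PAGE_SHIFT) % TABLE_FRAMES] | off;
}

/* Global address from a remote node -> local shadow copy, if one exists. */
uint64_t to_local(uint64_t global_pa)
{
    uint64_t off = global_pa & ((1u << PAGE_SHIFT) - 1);
    return ga2lpa[(global_pa >> PAGE_SHIFT) % TABLE_FRAMES] | off;
}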

Coherence Example

WildFire Memory Latency
WildFire compared to SGI Origin (2x R10K per node) and Sequent NUMA-Q (4x Xeon per node)
WF's remote memory latency is mediocre (2.5x Origin's, similar to NUMA-Q's), but less relevant because remote accesses are less frequent (1/14 as many as Origin, 1/7 as many as NUMA-Q)

Evaluating WildFire
Clever performance evaluation methodology: isolate the effect of WildFire itself by comparing a single-node, 16-CPU system with a two-node, 8-CPU/node system
–Pure SMP vs. WildFire
Also compare with NUMA-fat
–Basically WF with no OS support, i.e. no CMR, no HAS, no locality-aware memory allocation, no kernel replication
And compare with NUMA-thin
–NUMA-fat but with small (2-CPU) nodes
Finally, turn off HAS and CMR to evaluate their contribution to WF's performance

Evaluating WildFire
WF with HAS+CMR comes within 13% of pure SMP
Speedup(HAS+CMR) >> Speedup(HAS) * Speedup(CMR)
Locality-aware allocation and large nodes are important

Evaluating WildFire
Performance trends correlate with locality of reference
HAS + CMR + kernel replication + initial allocation improve locality of access from 50% (i.e., a uniform distribution between two nodes) to 87%

Summary
WildFire = a few large SMP nodes + directory-based coherence between nodes + fast point-to-point interconnect + clever scheduling and replication techniques
Pretty good performance (unfortunately, no numbers for 112 CPUs)
Good idea?
–I think so, but I doubt there's much room for scalability
–Then again, that wasn't the point
Criticisms?
–Authors are very proud of their slow directory protocol
–Kernel modifications may not be so slight

Questions?
Erik Hagersten and Michael Koster
Presented by Andrew Waterman
ECE259 Spring 2008
