Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA

Presentation transcript:

Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA
Babak Falsafi and David A. Wood
Computer Science Department, University of Wisconsin, Madison
Presented by Anita Lungu, February 17, 2006

Context and Motivation
- Large-scale distributed shared memory parallel machines: directory coherence between SMP nodes; local access is fast, remote access is slow
- Problem: hide remote memory access latency
- Existing solutions:
  - Cache-Coherent NUMA (CC-NUMA): best when coherence misses dominate
  - Simple Cache-Only Memory Architecture (S-COMA): best when capacity misses dominate
- Opportunity: a hybrid, R-NUMA = CC-NUMA + S-COMA
  - Supports both protocols and dynamically selects one per page
  - Better performance than either alone: the best of both worlds

CC-NUMA
- Data elements are allocated at a home node
- Remote cluster cache:
  - Holds only remote data, at block granularity
  - Small and fast (SRAM), or can be larger and slower (DRAM)
- Advantage when: the remote working set fits in the small block cache and misses are mostly coherence misses
- Disadvantage when: many data accesses are remote
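To make the access path concrete, here is a minimal sketch of how a CC-NUMA node might service a load, assuming a block-granularity remote cluster cache and a directory fetch on a remote miss. The structures and helper functions are illustrative assumptions, not the paper's implementation.

```c
/* Hypothetical sketch of a CC-NUMA load on one node (not from the paper). */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SHIFT 6   /* assume 64-byte coherence blocks */

typedef struct { uint64_t tag; bool valid; char data[1 << BLOCK_SHIFT]; } cache_block_t;

extern bool           is_local_home(uint64_t paddr);              /* is this node the home? */
extern cache_block_t *cluster_cache_lookup(uint64_t block);       /* NULL on miss */
extern cache_block_t *fetch_from_home_directory(uint64_t block);  /* remote miss: directory protocol */
extern char           local_memory_read(uint64_t paddr);

char cc_numa_load(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_SHIFT;

    if (is_local_home(paddr))                        /* local access: fast path */
        return local_memory_read(paddr);

    cache_block_t *b = cluster_cache_lookup(block);  /* remote data cached at block granularity */
    if (b == NULL)
        b = fetch_from_home_directory(block);        /* slow path: remote miss goes to the home node */

    return b->data[paddr & ((1 << BLOCK_SHIFT) - 1)];
}
```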

S-COMA
- Distributed main memory acts as a second-level cache for remote data
- Data elements have no home node
- Allocation and mapping: page granularity, in software, using the standard virtual address translation hardware
- Coherence: block granularity, in hardware
- Extra hardware:
  - Access-control tags: 2 bits per block, used to trigger and inhibit memory
  - Auxiliary SRAM translation table: converts local physical pages <-> global physical pages (home)
- Advantage when: misses are mostly capacity/cold misses and remote data is reused often
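The sketch below illustrates how the per-block access-control tags and the auxiliary translation table might interact on a load; the function names, tag encoding, and control flow are assumptions for illustration, not the hardware described in the paper.

```c
/* Hypothetical sketch of the S-COMA per-block access check (not from the paper).
 * Each local page frame caching remote data carries 2 access-control bits per
 * coherence block; an access to an invalid block inhibits the memory response
 * and fetches the block from its home copy, located through an auxiliary
 * local->global physical page translation table. */
#include <stdint.h>

enum block_state { INVALID, READ_ONLY, READ_WRITE };   /* encoded in 2 bits/block */

extern enum block_state access_tag(uint64_t local_paddr);          /* 2-bit tag lookup */
extern uint64_t         local_to_global_page(uint64_t local_page); /* auxiliary SRAM table */
extern void             fetch_block_from_home(uint64_t global_paddr, int want_write);
extern char             dram_read(uint64_t local_paddr);

char scoma_load(uint64_t local_paddr, uint64_t page_size)
{
    if (access_tag(local_paddr) == INVALID) {
        /* Inhibit the local memory response and fetch the block from its home copy. */
        uint64_t global_page  = local_to_global_page(local_paddr / page_size);
        uint64_t global_paddr = global_page * page_size + local_paddr % page_size;
        fetch_block_from_home(global_paddr, /*want_write=*/0);
        /* The hardware would then mark the block READ_ONLY and retry the access. */
    }
    return dram_read(local_paddr);
}
```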

R-NUMA
- Classifies remote pages:
  - Reuse pages: accessed many times by a node
  - Communication pages: used to communicate data between nodes
- All pages default to CC-NUMA
- A page is dynamically changed to S-COMA when its number of remote capacity/conflict misses in the block cache exceeds a threshold
- The decision is made per node
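A minimal sketch of that per-node, per-page decision follows: each remote capacity/conflict miss taken in the CC-NUMA block cache bumps a per-page counter, and crossing the threshold relocates the page to S-COMA mode. The structure, threshold value, and names are assumptions, not the paper's implementation.

```c
/* Hypothetical sketch of the R-NUMA relocation decision (not from the paper). */
#include <stdint.h>

#define RELOCATION_THRESHOLD 64          /* assumed value; the paper evaluates this parameter */

enum page_mode { MODE_CC_NUMA, MODE_S_COMA };

struct remote_page {
    enum page_mode mode;                 /* pages default to CC-NUMA */
    uint32_t       block_cache_misses;   /* remote capacity/conflict misses so far */
};

extern void relocate_to_scoma(struct remote_page *pg);  /* page fault + copy + remap */

/* Called on each remote capacity/conflict miss taken in the block cache. */
void rnuma_on_block_cache_miss(struct remote_page *pg)
{
    if (pg->mode != MODE_CC_NUMA)
        return;                          /* this node has already relocated the page */

    if (++pg->block_cache_misses >= RELOCATION_THRESHOLD) {
        relocate_to_scoma(pg);           /* reuse page: cache it in local main memory */
        pg->mode = MODE_S_COMA;
    }
}
```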

Qualitative Performance
- Worst-case scenario: a page is relocated from the block cache (CC-NUMA) to memory (S-COMA) and then never referenced again
- Worst-case performance depends on the cost of relocation (changing a page from CC-NUMA to S-COMA) relative to the cost of page allocation
- In the worst case, R-NUMA can be 3x worse than either CC-NUMA or S-COMA
- But: the threshold that optimizes worst-case performance differs from the threshold that optimizes average performance
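A rough cost model makes the trade-off visible; this is an illustration under simple assumptions, not the paper's exact analysis.

```latex
% T = relocation threshold, m = cost of one remote block-cache miss,
% R = cost of relocating a page (page fault, TLB invalidation, copy, remap).
% Assume the worst case: the page is relocated and never referenced again.
\begin{align*}
\text{CC-NUMA cost (page never reused)} &\approx T \cdot m \\
\text{R-NUMA worst-case cost}           &\approx T \cdot m + R \\
\frac{\text{R-NUMA}}{\text{CC-NUMA}}    &\approx 1 + \frac{R}{T \cdot m}
\end{align*}
% A small threshold T with an expensive relocation R drives the ratio up
% (the slide's 3x figure corresponds to R on the order of 2*T*m), while a
% large T delays beneficial relocations; hence the worst-case-optimal
% threshold differs from the average-case-optimal one.
```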

Base System Results
- Best case: R-NUMA reduces execution time by 37%
- Worst case: R-NUMA increases execution time by 57%
- For comparison: CC-NUMA can be 179% worse than S-COMA, and S-COMA can be 315% worse than CC-NUMA

Sensitivity Results
1. S-COMA and R-NUMA sensitivity to page-fault and TLB-invalidation overhead
2. R-NUMA sensitivity to the relocation threshold value
3. CC-NUMA and R-NUMA sensitivity to cache size

Questions?