ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture)
Shared Memory MPs – COMA & Beyond
Copyright 2004 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Slide 2: Outline
Cache Only Memory Architecture (COMA)
– Basics
– Data Diffusion Machine (DDM)
– Simple COMA (S-COMA)
– Reactive NUMA
Hierarchical Coherence
– Basics
– Sequent NUMA-Q
– Chip Multiprocessor (CMP)
– Sun Wildfire
Token Coherence

Slide 3: Review
Basic idea of directories
– Per-processor cache hierarchies
– Directory interleaved with memory
Directory limitations/drawbacks
– Limited capacity for replication
– High design & implementation cost
– Single hard-wired protocol
– Limitations of shared physical address space

Slide 4: Cache Only Memory Architecture (COMA)
Make all memory available for migration & replication
All memory is a DRAM cache, called Attraction Memory
Examples
– Data Diffusion Machine (next)
– Flat COMA (fixed home for directory but not data)
– KSR-1 (hierarchy of snooping rings)
But how do you
– Find data?
– Deal with replacements?

Slide 5: COMA Example: Data Diffusion Machine (DDM)
All-hardware COMA
Attraction Memory (AM) = one giant hardware cache (see the sketch below)
Maintains both address tags and state
Data addressed, allocated, & kept coherent in blocks
Directory info on a per cache-block basis
Not home based:
– Data is migratory → AM attracts data
– Must find a home when replacing the data
– Must find the directory entry before finding the data
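To make the "giant hardware cache" idea concrete, here is a minimal C sketch of an attraction-memory lookup: local DRAM organized as a set-associative cache whose entries carry an address tag and a coherence state, so a reference can miss in memory itself. The sizes, field names, and the `am_lookup` function are illustrative assumptions, not the DDM hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define AM_SETS     1024   /* toy size; a real AM spans all of local DRAM */
#define AM_WAYS     4
#define BLOCK_SHIFT 6      /* 64-byte blocks */

typedef enum { INV, SHARED, EXCLUSIVE } AmState;

typedef struct {
    uint64_t tag;          /* which block currently lives in this memory frame */
    AmState  state;        /* per-block coherence state kept by the AM */
} AmEntry;

static AmEntry am[AM_SETS][AM_WAYS];

/* Look up a physical address in the attraction memory; returns the matching
 * entry, or NULL on an AM miss -- in which case the DDM directory hierarchy
 * must be traversed to find the block, since there is no fixed home. */
static AmEntry *am_lookup(uint64_t paddr) {
    uint64_t blk = paddr >> BLOCK_SHIFT;
    uint64_t set = blk % AM_SETS;
    uint64_t tag = blk / AM_SETS;
    for (int w = 0; w < AM_WAYS; w++)
        if (am[set][w].state != INV && am[set][w].tag == tag)
            return &am[set][w];
    return NULL;
}

int main(void) {
    uint64_t addr = 0x12345680;
    uint64_t blk  = addr >> BLOCK_SHIFT;
    /* Pretend the block has migrated into this node's attraction memory. */
    am[blk % AM_SETS][1] = (AmEntry){ .tag = blk / AM_SETS, .state = SHARED };
    printf("AM %s for 0x%llx\n",
           am_lookup(addr) ? "hit" : "miss", (unsigned long long)addr);
    return 0;
}
```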

Slide 6: DDM Directory
Directory is hierarchical, organized as a tree
Each tree node is a set-associative cache of directory info
Tree maintains inclusion:
– Higher levels keep a replica of the info in lower sub-trees
(Figure: tree of directory (D) nodes)

Slide 7: DDM Coherence/Placement Protocol
Simple write-invalidate protocol
Cache states: Invalid, Shared, Exclusive
Must traverse the directory:
– To find a copy on a read or write miss
– To invalidate on a write to a Shared block
Directory is a hierarchy of set-associative caches; at each level ask:
– Q1: Is the block in my sub-tree?
– Q2: Does the block exist outside my sub-tree?
– Request goes up until Q2 == no, then turns around
– Request goes down, following sub-trees where Q1 == yes, until it reaches a leaf with a copy (see the sketch below)
On a replacement:
– For an Exclusive copy, must find another home (HARD!)
– For a Shared copy, must make sure other copies exist
– Else must find another home
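Below is a minimal C sketch of the up/down directory traversal just described, assuming a toy tree with per-child inclusion bits at each directory node. The node layout, the `q1`/`q2` helpers, and `find_copy` are illustrative assumptions; a real DDM directory is a set-associative hardware cache, not a pointer structure.

```c
#include <stdbool.h>
#include <stdio.h>

#define FANOUT 2

typedef struct Node {
    struct Node *parent;
    struct Node *child[FANOUT];  /* all NULL for leaves (processor + AM) */
    bool present;                /* leaf only: attraction memory holds the block */
    bool child_has[FANOUT];      /* directory only: inclusion bits per sub-tree  */
    const char *name;
} Node;

/* Q1: is the block anywhere in this node's sub-tree? */
static bool q1(const Node *n) {
    if (n->child[0] == NULL) return n->present;
    for (int i = 0; i < FANOUT; i++)
        if (n->child_has[i]) return true;
    return false;
}

/* Q2: does the block exist anywhere outside this node's sub-tree? */
static bool q2(const Node *n) {
    const Node *p = n->parent;
    if (p == NULL) return false;                 /* at the root nothing is "outside" */
    for (int i = 0; i < FANOUT; i++)
        if (p->child[i] != n && p->child_has[i]) /* a sibling sub-tree has a copy */
            return true;
    return q2(p);                                /* or a copy lives above the parent */
}

/* Miss at leaf 'req': go up while copies may exist outside the current
 * sub-tree, then go down following sub-trees that hold a copy, to a leaf. */
static const Node *find_copy(const Node *req) {
    const Node *n = req;
    while (q2(n)) n = n->parent;                 /* up until Q2 == no */
    while (n->child[0] != NULL) {                /* down until a leaf */
        int i = 0;
        while (i < FANOUT && !n->child_has[i]) i++;
        if (i == FANOUT) return NULL;            /* no copy anywhere  */
        n = n->child[i];
    }
    return n->present ? n : NULL;
}

int main(void) {
    static Node leaf[4], dir[2], root;           /* 2-level tree, 4 leaves */
    const char *names[4] = { "P0", "P1", "P2", "P3" };
    for (int i = 0; i < 4; i++) leaf[i].name = names[i];
    for (int i = 0; i < 2; i++) {
        dir[i].child[0] = &leaf[2*i];   dir[i].child[1] = &leaf[2*i + 1];
        leaf[2*i].parent = leaf[2*i + 1].parent = &dir[i];
        dir[i].parent = &root;          root.child[i] = &dir[i];
    }
    leaf[3].present = true;                      /* only copy lives at P3      */
    dir[1].child_has[1] = true;                  /* inclusion info on the path */
    root.child_has[1] = true;
    printf("block in root's sub-tree (Q1)? %s\n", q1(&root) ? "yes" : "no");
    const Node *holder = find_copy(&leaf[0]);    /* P0 takes a read miss       */
    printf("P0's miss is serviced by %s\n", holder ? holder->name : "nobody");
    return 0;
}
```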

Slide 8: Simple COMA (S-COMA)
(Pure) COMA
– Block granularity to find/allocate/replace (complex hardware)
– Block granularity for coherence/transfers (good for false sharing)
Software DSM
– Page granularity to find/allocate/replace (use VM: good)
– Page granularity for coherence/transfers (bad for false sharing)
Simple COMA
– Page granularity to find/allocate/replace (use VM: good)
– Block granularity for coherence/transfers (good for false sharing)
– Blocks act like sub-blocks of the page (see the sketch below)
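A minimal C sketch of the S-COMA access path just described: the VM system handles page-grain allocation, while fine-grain tags handle block-grain coherence, so the blocks behave like sub-blocks of the page. `ScomaPage`, `scoma_load`, and the cost counting are illustrative assumptions, not any real system's interface.

```c
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BLOCKS 64                       /* e.g., 4 KB page / 64 B blocks */

typedef enum { BLK_INVALID, BLK_SHARED, BLK_MODIFIED } BlockState;

typedef struct {
    bool mapped_locally;                     /* VM: local page frame allocated? */
    BlockState state[PAGE_BLOCKS];           /* fine-grain access-control tags  */
} ScomaPage;

/* Returns the number of "expensive" events (page fault and/or block fetch)
 * the access triggers; 0 means it is satisfied entirely from local memory. */
static int scoma_load(ScomaPage *pg, int block, void (*fetch_block)(int)) {
    int cost = 0;
    if (!pg->mapped_locally) {               /* page fault: allocate a local frame, */
        pg->mapped_locally = true;           /* all of its blocks start out invalid */
        for (int b = 0; b < PAGE_BLOCKS; b++) pg->state[b] = BLK_INVALID;
        cost++;
    }
    if (pg->state[block] == BLK_INVALID) {   /* block-grain miss: fetch 64 B only */
        fetch_block(block);
        pg->state[block] = BLK_SHARED;
        cost++;
    }
    return cost;
}

static void fake_fetch(int block) { printf("fetching block %d from home node\n", block); }

int main(void) {
    ScomaPage pg = { .mapped_locally = false };
    printf("first access cost: %d\n", scoma_load(&pg, 3, fake_fetch));  /* page fault + block fetch */
    printf("second access cost: %d\n", scoma_load(&pg, 3, fake_fetch)); /* local hit */
    return 0;
}
```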

Slide 9: S-COMA-like Examples
Wisconsin Typhoon [Reinhardt et al., ISCA 1994]
– On access, VM system checks if the page is present
– On access, HW/SW checks block state
– Failure invokes a user-level protocol in SW
– Good flexibility, but SW is slow & users don't want to write protocols
Sun Wildfire [Hagersten/Koster, HPCA 1999]
– Begin with up to four SMP nodes
– Add a pseudo-processor board to each as a proxy for the rest of the system
– Can run a CC-NUMA directory protocol
– Can selectively use S-COMA (called Coherent Memory Replication)
– Selects between them with a competitive algorithm [Falsafi/Wood, ISCA 1997] (see the sketch below)
– Hierarchical method of building parallel machines
– WE'LL TALK MORE ABOUT THIS LATER
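Below is a minimal C sketch, in the spirit of the R-NUMA / Coherent Memory Replication competitive algorithm, of choosing between CC-NUMA and S-COMA treatment of a page by counting remote block refetches. The threshold value, the `PageMode` structure, and `note_refetch` are illustrative assumptions; the published scheme picks its threshold so the reactive policy stays competitive with the better static choice.

```c
#include <stdbool.h>
#include <stdio.h>

#define REFETCH_THRESHOLD 64   /* assumed break-even count, not the published value */

typedef struct {
    bool replicated_locally;   /* false: CC-NUMA (blocks cached remotely on demand) */
                               /* true:  S-COMA  (page replicated in local memory)  */
    int  refetches;            /* remote block fetches observed for this page       */
} PageMode;

/* Called whenever a block of this page must be re-fetched from its home node. */
static void note_refetch(PageMode *p) {
    if (p->replicated_locally) return;          /* already served from local memory */
    if (++p->refetches >= REFETCH_THRESHOLD)
        p->replicated_locally = true;           /* reactively switch the page to S-COMA */
}

int main(void) {
    PageMode pg = {0};
    for (int i = 0; i < 100; i++) {
        note_refetch(&pg);
        if (pg.replicated_locally) {
            printf("page switched to S-COMA after %d refetches\n", pg.refetches);
            break;
        }
    }
    return 0;
}
```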

Slide 10: A Taxonomy of Issues
Allocation/Replication
– Cache line vs. page
Access Control (Coherence)
– Cache line vs. page
– HW vs. SW
Protocol Processing
– HW vs. SW
Communication
– Cache line vs. page
– HW vs. SW (message passing)

Slide 11: Reactive NUMA (R-NUMA) – PRESENTATION

Slide 12: Outline
Cache Only Memory Architecture (COMA)
– Basics
– Data Diffusion Machine (DDM)
– Reactive NUMA (R-NUMA)
Hierarchical Coherence
– Basics
– NUMA-Q
– Chip multiprocessor (CMP)
– Sun Wildfire
– Intel Profusion

Slide 13: Hierarchical Coherence
Many older systems were flat
– E.g., a directory that points to 1K processors
Use hierarchy (see the sketch below)
– Intra-node coherence (e.g., snooping within the SMP node)
– Inter-node coherence (e.g., directory between nodes)
Why?
– Divide & conquer markets (e.g., sell the node by itself)
– Divide & conquer complexity (but must interface the two protocols)
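A minimal C sketch of the two-level miss path this slide implies: a miss is first snooped inside the SMP node, and only if no local cache responds does it consult the inter-node directory. The per-node presence bits and the `service_miss` function are illustrative assumptions, not any particular machine's protocol.

```c
#include <stdbool.h>
#include <stdio.h>

#define PROCS_PER_NODE 4
#define NUM_NODES      2

typedef struct {
    bool has_block[PROCS_PER_NODE];  /* which caches in this node hold the block */
} NodeCaches;

typedef struct {
    int  home_node;                  /* node whose memory owns the block and its directory entry */
    bool node_has_copy[NUM_NODES];   /* coarse, per-node presence bits only */
} Directory;

/* Where does a read miss from 'node' get its data? */
static const char *service_miss(int node, const NodeCaches caches[], const Directory *dir) {
    /* Level 1: intra-node coherence -- snoop the node's own bus first. */
    for (int p = 0; p < PROCS_PER_NODE; p++)
        if (caches[node].has_block[p])
            return "another cache in the same node (local snoop hit)";
    /* Level 2: inter-node coherence -- consult the directory at the home node. */
    for (int n = 0; n < NUM_NODES; n++)
        if (n != node && dir->node_has_copy[n])
            return "a remote node named by the directory";
    return "the home node's memory";
}

int main(void) {
    static NodeCaches caches[NUM_NODES];             /* zero-initialized */
    Directory dir = { .home_node = 1, .node_has_copy = { false } };
    caches[0].has_block[2] = true;                   /* P2 in node 0 holds the block */
    dir.node_has_copy[0]   = true;                   /* directory only tracks "node 0 has it" */
    printf("miss from node 0 served by: %s\n", service_miss(0, caches, &dir));
    printf("miss from node 1 served by: %s\n", service_miss(1, caches, &dir));
    return 0;
}
```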

Slide 14: Example Two-level Hierarchies

Slide 15: Advantages of Multiprocessor Nodes
Amortization of node fixed costs over multiple processors
– Applies even if processors are simply packaged together but not coherent
Can use commodity SMPs
Fewer nodes for the directory to keep track of (coarser grain)
Much communication may be contained within a node (cheaper)
Nodes prefetch data for each other (fewer "remote" misses)
Combining of requests (like hierarchical, only two-level)
Can even share caches (overlapping of working sets)
Benefits depend on sharing pattern (and mapping)
– Good for widely read-shared data: e.g., tree data in Barnes-Hut
– Good for nearest-neighbor communication, if properly mapped
– Not so good for all-to-all communication

Slide 16: Disadvantages of Coherent MP Nodes
Bandwidth shared among nodes
– All-to-all example
– Applies whether coherent or not
Bus increases latency to local memory
With coherence, typically must wait for local snoop results before sending remote requests
Snoopy bus at the remote node increases delays there too, increasing latency and reducing bandwidth
Overall, may hurt performance if sharing patterns don't comply

Slide 17: Sequent NUMA-Q System Overview
Uses high-volume SMPs ("quads") as building blocks
Quad bus is a 532 MB/s split-transaction bus with in-order responses
– Limited facility for out-of-order responses for off-node accesses
Cross-node interconnect is a 1 GB/s unidirectional ring (SCI)
Larger SCI systems built by bridging multiple rings

Slide 18: NUMA-Q IQ-Link Board
IQ-Link board plays the role of the Hub Chip in SGI Origin
Can generate interrupts between quads
Remote cache (visible to SCI) has 64-byte blocks (32 MB, 4-way)
– Processor caches are not visible to SCI (snoopy-coherent within the SMP node)
– Remote cache is inclusive with respect to the processor caches on the SMP
Data Pump (GaAs) implements the SCI transport, pulls off relevant packets
Bus interface (OBIC): interfaces to the quad bus; manages remote cache data and bus logic; acts as pseudo-memory controller and pseudo-processor
Link interface controller: interfaces to the data pump, OBIC, interrupt controller, and directory tags; manages the SCI protocol using programmable engines

Slide 19: NUMA-Q, cont.
IQ-Link is key
– Local directory: {home (I), fresh (S), gone (E)} + pointer
– "L3" remote cache for remote data (tags + data)
– Internal bus & external network interfaces
– Two ASICs + memory; protocols use microcode
Global protocol
– Coherence between nodes, not processors
– Data in a processor cache is either
  » From local memory (so the directory knows about it)
  » In the L3 remote cache (inclusion)
Local protocol (sketch below)
– MESI snooping (Illinois protocol)
– IQ-Link asserts "delayed reply" if
  » Access to local memory & directory says "gone"
  » Access to remote memory (but what if it hits in L3?)
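A minimal C sketch of the "delayed reply" decision described in the local-protocol bullets above, using the slide's {home, fresh, gone} local directory states. The request fields and the `delayed_reply` function are illustrative assumptions about how the IQ-Link's bus-side logic might be modeled, not the actual hardware.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { DIR_HOME, DIR_FRESH, DIR_GONE } LocalDirState;   /* ~ I / S / E */

typedef struct {
    bool addr_is_local;        /* does the address map to this quad's memory?  */
    LocalDirState dir_state;   /* local directory entry (meaningful if local)  */
    bool hits_remote_cache;    /* does the 32 MB "L3" remote cache have it?    */
} BusRequest;

/* true => IQ-Link stalls the quad-bus response and runs the SCI protocol. */
static bool delayed_reply(const BusRequest *r) {
    if (r->addr_is_local)
        return r->dir_state == DIR_GONE;      /* a remote quad holds the only valid copy */
    return !r->hits_remote_cache;             /* remote address not in the remote cache  */
}

int main(void) {
    BusRequest a = { .addr_is_local = true,  .dir_state = DIR_GONE };
    BusRequest b = { .addr_is_local = false, .hits_remote_cache = true };
    printf("local addr, directory says gone -> delayed? %d\n", delayed_reply(&a));
    printf("remote addr, remote-cache hit   -> delayed? %d\n", delayed_reply(&b));
    return 0;
}
```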

Slide 20: Chip Multiprocessor (CMP)
Chip multiprocessors (CMPs) are common now
– Good building block for larger MP systems
Natural hierarchy
– Intra-chip protocol vs. inter-chip protocol
Examples
– Compaq Piranha
– Stanford Hydra

Slide 21: Sun Wildfire – PRESENTATION

Slide 22: Token Coherence – PRESENTATION