Architecture and Design of the AlphaServer GS320 Gharachorloo, et al. (Compaq) Presented by Curt Harting


Motivation
- Make money: server revenue at the time was concentrated in 4 to 64 processor systems
- Snooping protocols work very well on small systems (<8 processors) but don’t scale well
- Directory designs at the time were built for large (>64 processors) systems and were too slow for mid-range multiprocessors

The problems
- Snooping
  - Limited by bandwidth
  - Too much for each controller to do per cycle
- Directories
  - Long latency
  - Too much glue (Amdahl’s Law)

Overview
- 32 or 64 processor directory machine
- 8 Quad-Processor Building Blocks (QBBs) connected by a crossbar
- Each QBB has:
  - 4 processors (with external L2)
  - 4 memory modules
  - 1 I/O interface
  - 1 Global Port
- DTAG (duplicate tag store)
- DIR (directory, 14 bits per line)
- TTT (transaction tracking table)
- 4 request types: read, readX (read-exclusive), X (exclusive), X without data
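
To make the "14 bits per line" figure concrete, here is a minimal sketch of how such a directory entry could be packed. The slide gives only the 14-bit total; the split into a 6-bit owner field and an 8-bit sharer vector (one bit per QBB), and the names `pack_entry` and `unpack_entry`, are assumptions made for illustration.

```python
OWNER_BITS = 6    # assumed width of the owner field
SHARER_BITS = 8   # assumed: one presence bit per QBB (8 QBBs)


def pack_entry(owner, sharers):
    """Pack an owner id and a set of sharing QBB ids into one 14-bit integer."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner does not fit in the owner field")
    vector = 0
    for qbb in sharers:
        if not 0 <= qbb < SHARER_BITS:
            raise ValueError("QBB id out of range")
        vector |= 1 << qbb
    return (owner << SHARER_BITS) | vector


def unpack_entry(entry):
    """Split a packed entry back into (owner id, set of sharing QBB ids)."""
    owner = entry >> SHARER_BITS
    vector = entry & ((1 << SHARER_BITS) - 1)
    sharers = {qbb for qbb in range(SHARER_BITS) if vector & (1 << qbb)}
    return owner, sharers
```

Tracking sharers per QBB rather than per processor is what keeps the entry this small in the sketch: 8 bits cover the whole machine at node granularity.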

Reducing Latency
- No waiting for invalidated copies to ACK on a GETX
- No NACKing
- Directory updates state as soon as the request arrives
- Dirty-Sharing NUMA
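
The first three points can be sketched as a home-node handler that changes the directory state the moment a read-exclusive request arrives and never refuses or waits on acknowledgements. This is a simplified model, not the GS320 logic: the dictionary layout, the message tuples, and the name `handle_read_exclusive` are all hypothetical.

```python
def handle_read_exclusive(directory, addr, requester):
    """Process a read-exclusive at the home; returns the messages to send.

    Assumed entry layout: {"owner": node id or None (memory), "sharers": set}.
    """
    entry = directory.setdefault(addr, {"owner": None, "sharers": set()})
    messages = []
    owner = entry["owner"]
    if owner is not None and owner != requester:
        # Forward to the current owner instead of NACKing the requester.
        messages.append(("Q1", owner, "forward-readX", addr, requester))
    else:
        # Memory holds the latest copy, so the home replies with data.
        messages.append(("Q2", requester, "data", addr, requester))
    for sharer in sorted(entry["sharers"] - {requester}):
        # Invalidates are sent, but no acknowledgement count is collected.
        messages.append(("Q1", sharer, "invalidate", addr, requester))
    # The directory state changes immediately; there is no busy state.
    entry["owner"] = requester
    entry["sharers"] = set()
    return messages
```

Because the entry is updated before any reply comes back, a second request to the same line can be handled right away and is simply forwarded to the new owner.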

The Three Lane Information Super-Highway
Information is passed on three virtual lanes (and an I/O lane).
- Q0: carries a message from a processor to the block’s home
  - Point-to-point ordering must occur
- Q1: carries messages from the home
  - Point of serialization! Must have total order
- Q2: replies/data
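
The ordering requirements of the three lanes can be illustrated with a toy software model. It assumes a per-(source, home) FIFO for Q0, a single shared FIFO for Q1 so that every node observes the same order, and no ordering at all for Q2. The class `Lanes` and its methods are hypothetical and say nothing about how the switch is actually built.

```python
from collections import deque


class Lanes:
    """Toy model of the three virtual lanes and their ordering rules."""

    def __init__(self):
        self.q0 = {}        # (src, home) -> deque: point-to-point FIFO order
        self.q1 = deque()   # a single FIFO models the total order on Q1
        self.q2 = []        # replies/data: no ordering is modeled

    def send_q0(self, src, home, msg):
        self.q0.setdefault((src, home), deque()).append(msg)

    def recv_q0(self, src, home):
        return self.q0[(src, home)].popleft()

    def send_q1(self, dests, msg):
        self.q1.append((tuple(dests), msg))

    def drain_q1(self):
        """Deliver all Q1 messages; returns {node: messages in arrival order}."""
        seen = {}
        while self.q1:
            dests, msg = self.q1.popleft()
            for dest in dests:
                seen.setdefault(dest, []).append(msg)
        return seen
```

In this model any two nodes that both receive a given pair of Q1 messages see them in the same relative order, which is the property the slide calls total order.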

An example
[Figure: reproduction of Figure 2d from the paper]

Caveats
- Early request race: a request gets to the owner before the data does
  - Solution: stall Q1 until the data arrives
- Late request race: a request for data arrives after a writeback operation
  - Solution: buffer the victim until a writeback ACK is received
- Intra-node transactions: check TTT, possible loop through global
- Markers: used to preserve global order
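
The late request race fix can be sketched as a small victim buffer: an evicted dirty line stays available until the home acknowledges the writeback, so a forwarded request that arrives in between can still be served. The class `VictimBuffer` and its method names are hypothetical, and the model ignores capacity limits.

```python
class VictimBuffer:
    """Holds evicted dirty lines until their writeback is acknowledged."""

    def __init__(self):
        self._pending = {}  # address -> data awaiting a writeback ACK

    def evict(self, addr, data):
        # The writeback is sent to the home; keep a copy in the meantime.
        self._pending[addr] = data

    def forwarded_request(self, addr):
        """Return the buffered data, or None if no writeback is in flight."""
        return self._pending.get(addr)

    def writeback_ack(self, addr):
        # The home now has the data, so the buffered copy can be dropped.
        self._pending.pop(addr, None)
```

For example, a request forwarded for address `0x40` after `evict(0x40, ...)` but before `writeback_ack(0x40)` still finds the data in the buffer.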

Memory Consistency
A quick, very high-level overview:
- Separation of data and requests
- The previously atomic response has been split into two parts: the commit and the data
- Lots of regulations of what can go when (still)
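
A minimal sketch of the split response: a request is tracked as two independent events and counts as complete only when both have been seen. The assumption that the two parts may arrive in either order, and the class name `PendingRequest`, are illustrative; the rules governing what the processor may do between the two events are not modeled.

```python
class PendingRequest:
    """Tracks the two halves of a split response for one outstanding request."""

    def __init__(self):
        self.committed = False  # set when the commit event arrives
        self.data = None        # set when the data reply arrives

    def on_commit(self):
        self.committed = True

    def on_data(self, data):
        self.data = data

    @property
    def complete(self):
        # Both halves are required before the request counts as finished.
        return self.committed and self.data is not None
```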

Questions
- Does the total ordering of the Q1 lane really “come naturally in a crossbar switch”?
- The GS320 is said to be expandable to 64 processors, but the system detailed in the paper is tailored to 32 processors. How easily can it be expanded?
- Addressing has been a major issue in other papers, but it is not discussed in this one. Why?