AlphaServer GS320 Architecture & Design Gharachorloo, Sharma, Steely, and Van Doren Compaq Research & High-Performance Servers Published in 2000 (ASPLOS-IX)

Presentation transcript:

AlphaServer GS320 Architecture & Design
Gharachorloo, Sharma, Steely, and Van Doren
Compaq Research & High-Performance Servers
Published in 2000 (ASPLOS-IX)
Presented by Matt Johnson
CPS221/ECE259, Advanced Computer Arch. II
Duke University, 1/30/08

AlphaServer GS320 Architecture & Design
- Sold by HP until 2004, now discontinued

Overview
- Design Goals
- Architecture (from 10,000 ft.)
- Coherence Protocol
- Memory Consistency
- Performance
- Analysis/Questions

Design Goals
- Targeting small/medium multiprocessors
- Exploit the known (and limited) system size to implement ideas that don't scale well
  - e.g. protocol optimizations (limited queue sizes)
- Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping

Design Goals
- RAS (reliability, availability, serviceability)
  - Modularity (QBBs, we'll get to them in a moment)
  - Hardware partitions (failure containment)
- Efficiency
  - Tight integration with CPUs (Alpha 21264)
  - CPU support for coherence/consistency operations
  - Directory protocol avoids NACKs and stalls

Architecture
- Between 4 and 32 Alpha CPUs
- Arranged in Quad-processor Building Blocks (QBBs)
- 7M+ ASIC gates
- 4 CPUs
- 32 GB memory
- 8 PCI slots
- 10-port switch

Architecture
- 10-port local switch (per QBB)
  - 4 processor, 4 memory, 1 I/O (PCI), 1 global
  - 2 QBBs can be connected directly, up to 8 with a global switch
- Hardware coherence support
  - DIRectory: 14 bits per 64-byte cache line, stores owner/sharer info
  - Duplicate TAG store: copies of the CPUs' L2 cache tags
  - Transactions-in-Transit Table: keeps track of outstanding transactions from a node (48 entries)
  - All implemented in ASICs, some supported by the 21264
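The figures on the architecture slides can be checked with a little arithmetic. The sketch below is only an illustration: the constant and function names are invented here, and it assumes the 32 GB figure is the memory capacity of a single QBB, which the slides do not state outright.

```python
# Figures quoted on the slides.
DIR_BITS_PER_LINE = 14          # directory bits per cache line
CACHE_LINE_BYTES = 64           # cache line size
CPUS_PER_QBB = 4                # quad-processor building block
MAX_QBBS = 8                    # with a global switch
SWITCH_PORTS = {"processor": 4, "memory": 4, "io": 1, "global": 1}

# Assumption (not stated on the slides): 32 GB is the memory of one QBB.
QBB_MEMORY_BYTES = 32 * 2**30


def max_cpus():
    """Largest configuration: every QBB fully populated."""
    return CPUS_PER_QBB * MAX_QBBS


def local_switch_ports():
    """Total ports on one QBB's local switch."""
    return sum(SWITCH_PORTS.values())


def directory_overhead_fraction():
    """Directory bits as a fraction of the data bits they describe."""
    return DIR_BITS_PER_LINE / (CACHE_LINE_BYTES * 8)


def directory_bytes(memory_bytes):
    """Directory storage needed to cover memory_bytes of main memory."""
    lines = memory_bytes // CACHE_LINE_BYTES
    return lines * DIR_BITS_PER_LINE // 8
```

Under these assumptions the directory costs 14/512, or about 2.7%, of the memory it covers (0.875 GiB for a 32 GB QBB), and 8 QBBs of 4 CPUs give the 32-CPU maximum.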

Architecture

Coherence Protocol
- 4 types of requests
  - Read (not writing, don't need an exclusive copy)
  - Read-Exclusive (don't have it, want to write to it)
  - Exclusive (have a shared copy, want to write to it)
  - Exclusive-Without-Data
    - Used when you want to write an entire cache line (64B)
    - Don't need to transfer the old data in this case
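The choice among the four request types can be sketched as a small decision function. This is a simplified illustration, not the GS320's actual logic: the three cache-line states, the function name, and the rule that a full-line write from a shared copy uses Exclusive rather than Exclusive-Without-Data are all assumptions made for the example.

```python
from enum import Enum


class LineState(Enum):
    """Hypothetical local cache-line states (an assumption for this sketch)."""
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Request(Enum):
    """The four request types named on the slide."""
    READ = "Read"
    READ_EXCLUSIVE = "Read-Exclusive"
    EXCLUSIVE = "Exclusive"
    EXCLUSIVE_WITHOUT_DATA = "Exclusive-Without-Data"


def choose_request(state, is_write, writes_full_line=False):
    """Return the request to issue, or None if the access hits locally."""
    if state is LineState.EXCLUSIVE:
        return None                      # already own the line
    if not is_write:
        # A shared copy satisfies a read; otherwise fetch a readable copy.
        return None if state is LineState.SHARED else Request.READ
    if state is LineState.SHARED:
        return Request.EXCLUSIVE         # have the data, need ownership only
    if writes_full_line:
        # Whole 64-byte line is overwritten, so the old data is not needed.
        return Request.EXCLUSIVE_WITHOUT_DATA
    return Request.READ_EXCLUSIVE        # need both data and ownership
```

For example, `choose_request(LineState.INVALID, is_write=True, writes_full_line=True)` selects Exclusive-Without-Data, while the same write to a partially modified line selects Read-Exclusive.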

Coherence Protocol
- Satisfies all requests without
  - NACKs or retries
  - Blocks at the host
- Saves bandwidth
- Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester
- State machines at nodes can be simple, fast, and small
- Dependencies are resolved on the outskirts of the system, not by clogging up the core with a heavy protocol

Coherence Protocol
- Deadlock is prevented by using 3 virtual lanes, plus a separate lane for I/O
  - Q0 for requests
  - Q1 for local responses
  - Q2 for remote responses
  - QIO for I/O (PCI) transactions
- Total ordering is required on Q1, point-to-point ordering on Q0/QIO, and no ordering on Q2
- Responses are split into 2 parts (↓ latency, ↑ performance)
  - Commit (yeah, I heard ya)
  - Data (except for exclusive-without-data requests)
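The per-lane ordering requirements can be written down as a small table, together with a check for the weaker point-to-point property. The sketch below is illustrative only: the `LANES` table mirrors the slide, while the message representation (source, destination, id) and the helper names are assumptions made for the example.

```python
from enum import Enum


class Ordering(Enum):
    TOTAL = "total"
    POINT_TO_POINT = "point-to-point"
    NONE = "none"


# Lane -> (traffic carried, ordering requirement), as listed on the slide.
LANES = {
    "Q0": ("requests", Ordering.POINT_TO_POINT),
    "Q1": ("local responses", Ordering.TOTAL),
    "Q2": ("remote responses", Ordering.NONE),
    "QIO": ("I/O (PCI) transactions", Ordering.POINT_TO_POINT),
}


def per_pair_order(messages):
    """Group message ids by (source, destination), keeping their order."""
    order = {}
    for src, dst, msg_id in messages:
        order.setdefault((src, dst), []).append(msg_id)
    return order


def point_to_point_ok(sent, delivered):
    """True if messages between each source/destination pair arrive in the
    order they were sent; messages from different pairs may interleave."""
    return per_pair_order(sent) == per_pair_order(delivered)
```

Point-to-point ordering lets messages from different senders overtake one another, which total ordering on Q1 does not allow. For instance, with `sent = [("A", "H", 1), ("B", "H", 2), ("A", "H", 3)]`, delivering message 2 first is acceptable, but delivering 3 before 1 is not.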

Coherence Protocol
- Instead of building their protocol to handle the general case, they optimize it for a specific case
  - e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements
  - They can delay certain responses because they can bound the latency by a reasonable time

Coherence Protocol

(u,v)=(1,0) should be disallowed (it would violate sequential consistency)
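The figure this caption belongs to is not part of the transcript, so the program below is an assumption: the classic two-processor test in which P1 writes A then B, and P2 reads B into u and then A into v, with both locations initially 0. Enumerating every interleaving that preserves each processor's program order gives the outcomes sequential consistency permits.

```python
from itertools import combinations

# Assumed example program (the original figure is not in the transcript).
P1 = [("write", "A", 1), ("write", "B", 1)]
P2 = [("read", "B", "u"), ("read", "A", "v")]


def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]


def sc_outcomes():
    """Set of (u, v) results reachable under sequential consistency."""
    outcomes = set()
    for schedule in interleavings(P1, P2):
        memory = {"A": 0, "B": 0}
        regs = {}
        for op, loc, arg in schedule:
            if op == "write":
                memory[loc] = arg        # arg is the value written
            else:
                regs[arg] = memory[loc]  # arg is the destination register
        outcomes.add((regs["u"], regs["v"]))
    return outcomes
```

For this assumed program, `(1, 0) in sc_outcomes()` is false: seeing the new value of B but the old value of A would require the two writes or the two reads to be reordered.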

Performance

- 33.5 Gflops on the Linpack workload
- Supports 2720 users on the SAP benchmark
- Higher I/O bandwidth, but similar commercial-workload performance to IBM RS/6000 S80 (24-CPU) and Sun (64-CPU) systems
- NUMA + lightweight protocol -> ↓ memory latency
  - Much better for applications where this matters

Analysis/Questions