AlphaServer GS320 Architecture & Design. Gharachorloo, Sharma, Steely, and Van Doren, Compaq Research & High-Performance Servers. Published in 2000 (ASPLOS-IX).

Presented by Matt Johnson, CPS221/ECE259, Advanced Computer Arch. II, Duke University, 1/30/08

AlphaServer GS320 Architecture & Design
- Sold by HP until 2004; now discontinued

Overview
- Design Goals
- Architecture (from 10E+3 ft.)
- Coherence Protocol
- Memory Consistency
- Performance
- Analysis/Questions

Design Goals
- Targeting small/medium multiprocessors
- Exploit known (and limited) system size to implement ideas that don't scale well, e.g. protocol optimizations (limited queue sizes)
- Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping

Design Goals
- RAS (reliability, availability, serviceability)
  - Modularity (QBBs, we'll get to them in a moment)
  - Hardware partitions (failure containment)
- Efficiency
  - Tight integration with CPUs (Alpha 21264)
  - CPU support for coherence/consistency operations
  - Directory protocol avoids NACKs and stalls

Architecture
- Between 4 and 32 Alpha CPUs
- Arranged in Quad-processor Building Blocks (QBBs)
- 7M+ ASIC gates
- QBB: 4 CPUs, 32GB memory, 8 PCI slots, 10-port switch

Architecture
- 10-port local switch (per QBB): 4 processor, 4 memory, 1 I/O (PCI), 1 global
- 2 QBBs can be connected directly; up to 8 with a global switch
- Hardware coherence support
  - DIRectory: 14 bits per 64-byte cache line store owner/sharer info
  - Duplicate TAG store copies the CPUs' L2 cache tags
  - Transactions-in-Transit Table keeps track of outstanding transactions from a node (48 entries)
- All implemented in ASICs, some supported by the 21264
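As a rough illustration of what a 14-bit directory entry per 64-byte line might hold, here is a minimal sketch. The slide gives only the 14-bit total and says it stores owner/sharer info; the 6-bit owner field, the 8-bit per-QBB sharer vector, and the function names are assumptions made for this example, not details taken from the slides.

```python
from __future__ import annotations

# Sketch only. The 14-bit total comes from the slide; the 6/8 split is an assumption.
OWNER_BITS = 6    # assumed width of the owner field
SHARER_BITS = 8   # assumed: one sharer bit per QBB (the system has up to 8 QBBs)
LINE_BYTES = 64   # cache line size from the slide


def pack_entry(owner: int, sharers: set[int]) -> int:
    """Pack an owner id and a set of sharing QBB ids into one 14-bit integer."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner id out of range")
    vector = 0
    for qbb in sharers:
        if not 0 <= qbb < SHARER_BITS:
            raise ValueError("QBB id out of range")
        vector |= 1 << qbb
    return (owner << SHARER_BITS) | vector


def unpack_entry(entry: int) -> tuple[int, set[int]]:
    """Recover (owner, sharers) from a packed directory entry."""
    owner = entry >> SHARER_BITS
    vector = entry & ((1 << SHARER_BITS) - 1)
    sharers = {q for q in range(SHARER_BITS) if vector & (1 << q)}
    return owner, sharers


def directory_overhead() -> float:
    """Directory bits per data bit: 14 / (64 * 8)."""
    return (OWNER_BITS + SHARER_BITS) / (LINE_BYTES * 8)
```

Whatever the exact field split, the storage cost follows from the slide's numbers alone: 14 bits for every 512 data bits is about 2.7% overhead.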

Architecture

Coherence Protocol
- 4 types of requests
  - Read (not writing, don't need an exclusive copy)
  - Read-Exclusive (don't have it, want to write to it)
  - Exclusive (have a shared copy, want to write to it)
  - Exclusive-Without-Data
    - Used when you want to write an entire cache line (64B)
    - Don't need to transfer the old data in this case
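The choice among the four request types can be sketched as a small decision function. This is an illustration of the slide's descriptions only: the `LineState` values, the `choose_request` name, and the rule that a full-line write takes priority over holding a shared copy are assumptions, not the actual GS320 logic.

```python
from enum import Enum


class LineState(Enum):
    """Hypothetical local cache states, used only for this sketch."""
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Request(Enum):
    READ = "Read"
    READ_EXCLUSIVE = "Read-Exclusive"
    EXCLUSIVE = "Exclusive"
    EXCLUSIVE_WITHOUT_DATA = "Exclusive-Without-Data"


def choose_request(state, is_write, writes_full_line=False):
    """Return the request to issue, or None if the local copy already suffices."""
    if state is LineState.EXCLUSIVE:
        return None  # already hold the line exclusively
    if not is_write:
        # A shared copy is enough for a read; otherwise fetch one.
        return None if state is LineState.SHARED else Request.READ
    if writes_full_line:
        # Assumption: overwriting all 64 bytes never needs the old data.
        return Request.EXCLUSIVE_WITHOUT_DATA
    if state is LineState.SHARED:
        return Request.EXCLUSIVE       # have the data, need ownership only
    return Request.READ_EXCLUSIVE      # need both data and ownership
```

For example, `choose_request(LineState.SHARED, is_write=True)` yields `Request.EXCLUSIVE`, matching the "have a shared copy, want to write to it" case.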

Coherence Protocol
- Satisfies all requests w/o NACKs or retries
- Blocks at the host
- Saves bandwidth
- Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester
- State machines at nodes can be simple, fast, small
- Dependencies are resolved on the outskirts of the system, not by clogging up the core w/ a heavy protocol

Coherence Protocol
- Deadlock is prevented by using 3 virtual lanes, plus a separate lane for I/O
  - Q0 for requests
  - Q1 for local responses
  - Q2 for remote responses
  - QIO for I/O (PCI) transactions
- Total ordering required on Q1, point-to-point ordering on Q0/QIO, no requirements on Q2
- Split responses into 2 parts (lower latency, higher performance)
  - Commit (yeah, I heard ya)
  - Data (except exclusive-without-data requests)

Coherence Protocol
- Instead of building their protocol to handle the general case, they optimize it for a specific case
  - e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements
  - they can delay certain responses because they can bound the latency by a reasonable time

Coherence Protocol

(u,v)=(1,0) should be disallowed (would violate sequential consistency)
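The figure for this slide did not survive in the transcript, so the sketch below assumes the standard two-processor litmus test (P1: A = 1; B = 1 and P2: u = B; v = A, with A and B initially 0). Under that assumption it enumerates every sequentially consistent interleaving and collects the possible (u, v) results; the program encoding and function names are made up for this example.

```python
from itertools import combinations

# Assumed litmus test (the original figure is missing from the transcript):
#   P1: A = 1; B = 1        P2: u = B; v = A        initially A = B = 0
P1 = [("write", "A", 1), ("write", "B", 1)]
P2 = [("read", "B", "u"), ("read", "A", "v")]


def interleavings(first, second):
    """Yield every merge of the two programs that preserves each program's order."""
    total = len(first) + len(second)
    for slots in combinations(range(total), len(first)):
        a, b = iter(first), iter(second)
        yield [next(a) if i in slots else next(b) for i in range(total)]


def sc_outcomes():
    """Execute each interleaving against a single shared memory; collect (u, v)."""
    results = set()
    for schedule in interleavings(P1, P2):
        memory = {"A": 0, "B": 0}
        regs = {}
        for op, addr, arg in schedule:
            if op == "write":
                memory[addr] = arg        # arg is the value written
            else:
                regs[arg] = memory[addr]  # arg is the destination register
        results.add((regs["u"], regs["v"]))
    return results
```

Calling `sc_outcomes()` returns `{(0, 0), (0, 1), (1, 1)}`. Seeing the new B but the old A, i.e. (1, 0), appears in no interleaving, which is why hardware that allows it would violate sequential consistency.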

Performance

- 33.5 Gflops on Linpack workload
- Supports 2720 users on SAP benchmark
- Higher I/O bandwidth, but similar commercial workload performance to IBM RS/6000 S80 (24-CPU) and Sun (64-CPU) systems
- NUMA + lightweight protocol -> lower memory latency
  - Much better for applications where this matters

Analysis/Questions