Architecture and Design of AlphaServer GS320 Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren ASPLOS’2000 Presented By: Alok Garg.



Motivations
Coherence Protocol
– Bandwidth limitations of snoopy-based protocols
– Inefficiencies in directory protocols
– Correctness issues related to rare protocol races
Implementation of Consistency Models
– Burdens the common transaction flow

Paper Contributions
– Exploiting network ordering to simplify the cache coherence protocol
– Solutions to decrease network occupancy
– Elegant solution for deadlock, livelock, starvation, and fairness problems
– Techniques for efficiently supporting memory ordering

Overview
– Architecture Overview
– MOESI Cache Coherence Protocol
– GS320 Optimized Cache Coherence Protocol
– Alpha Consistency Model
– Consistency Model Implementation
– Performance

Architecture Overview

Block Diagram
[Figure: QBBs (Quad-Processor Building Blocks) attached to an 8x8 global crossbar switch; link label 1.6 GB/s.]

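Quad-Processor Building Block (QBB)
[Figure: block diagram of one QBB. Labeled components:]
– 10-port local crossbar switch (port labels: 1.6 GB/s, 3.2 GB/s)
– Four processors (P), with L2 cache
– SDRAM memory: 8 GB, 64-bit, 200 MHz
– I/O with 64-entry cache; PCI: 4 PCI buses, 64-bit, 33 MHz
– Global port
– DTAG: duplicate tag store
– DIR: directory
– TTT: transactions-in-transit buffer
– Arbitration point
– Other label: 32 Alpha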

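The Directory
– 14 bits per 64-byte memory line: a 6-bit owner field plus 8 sharing bits (S0 to S7)
[Figure: a directory entry with Owner = 0 and bits S0 S1 S2 S3 S4 S5 S6 S7. Other labels: Forward, Invalidate, QBB0, QBB3, DTAG, P0 P1 P2 P3.]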
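The entry format on this slide can be sketched in a few lines of Python. This is only an illustration: the slide gives the field widths (6-bit owner, sharing bits S0 to S7, 14 bits per 64-byte line), but the bit positions and the helper names `pack_entry`, `unpack_entry`, and `directory_overhead` are assumptions made here, not the hardware layout.

```python
# Sketch of a 14-bit directory entry: 6-bit owner field + 8 sharing bits.
# ASSUMPTION: owner in the high bits, S0..S7 in the low bits (layout is hypothetical).
OWNER_BITS = 6
SHARE_BITS = 8          # one sharing bit per QBB, S0..S7
LINE_BYTES = 64


def pack_entry(owner, sharers):
    """Pack an owner id and an iterable of sharer QBB numbers into one integer."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner does not fit in 6 bits")
    mask = 0
    for qbb in sharers:
        if not 0 <= qbb < SHARE_BITS:
            raise ValueError("QBB number out of range")
        mask |= 1 << qbb
    return (owner << SHARE_BITS) | mask


def unpack_entry(entry):
    """Return (owner, set of sharer QBB numbers) for a packed entry."""
    owner = entry >> SHARE_BITS
    sharers = {q for q in range(SHARE_BITS) if entry & (1 << q)}
    return owner, sharers


def directory_overhead():
    """Directory bits per data bit for a 64-byte line."""
    return (OWNER_BITS + SHARE_BITS) / (LINE_BYTES * 8)
```

With these widths the directory costs 14 bits per 512 data bits, roughly 2.7% of memory.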

Crossbar Switch
Network bisection bandwidth:
– Global switch (8x8): 12.8 GB/s
– Local switch (10-port): 6.4 GB/s

MOESI Cache Coherence Protocol

MOESI
Directory States
– Invalid (I)
– Shared (S): valid, (potentially) shared, clean
– Exclusive (E): valid, exclusive, clean
– Modified (M): valid, exclusive, (potentially) dirty
– Owner (O): valid, (potentially) shared, dirty with respect to memory; responsible for supplying data instead of memory
Request Messages
– Read (Rd): data needed in shared state (S/E)
– Read Exclusive (RE): data needed in Modified state (M)
– Exclusive (Ex): data needed in Modified state (M)
Home node: original owner of the data (directory)

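MOESI Read
[Figure: home node with directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/M, N3/I, N4/I, N5/I. Messages: Rd, Forward, Marker, Reply. Final states: N2/O, N5/S.]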

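MOESI Read-Exclusive
[Figure: home node with directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/O, N3/I, N4/I, N5/S. Messages: RE, Forward, Marker, Invalidate, Ack, Reply. Final states: N2/I, N5/I, N3/E.]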
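The state changes in the Read and Read-Exclusive figures can be mimicked with a toy per-line state table. This is a minimal sketch, not the protocol: it ignores messages, the directory, and races, and the function names `read` and `read_exclusive` are hypothetical. It assumes the requester starts in I and follows the standard MOESI convention that a Modified holder serving a read becomes Owner.

```python
# Toy model of the end states in the two figures, for a single memory line.
# ASSUMPTION: only end states are modeled; Forward/Marker/Reply/Ack are not.

def read(states, requester):
    """Rd: a Modified holder keeps the dirty line as Owner; requester gets S."""
    new = dict(states)
    for node, st in states.items():
        if st == "M":
            new[node] = "O"      # dirty sharing: data is not written back to home
    new[requester] = "S"
    return new


def read_exclusive(states, requester, final="E"):
    """RE: every other copy is invalidated; requester ends in `final`.

    The slide shows the requester in E; a later write would move it to M.
    """
    new = {node: "I" for node in states}
    new[requester] = final
    return new


initial = {"N1": "I", "N2": "M", "N3": "I", "N4": "I", "N5": "I"}
after_read = read(initial, "N5")                # N2 -> O, N5 -> S
after_re = read_exclusive(after_read, "N3")     # N3 -> E, all others -> I
```

The state after the read matches the starting point of the Read-Exclusive figure (N2/O, N5/S), which is why the two calls are chained.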

GS320 Optimized Cache Coherence Protocol
Dirty sharing
No negative acknowledgments
– 3 deadlock conditions due to races

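Late Request Race Condition
[Figure: home node with directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/M, N3/I, N4/I, N5/I. Messages: Rd, Forward, Marker, Write Back, Ack, Reply. Other labels: N2/X, Write Buffer. Final state shown: N5/S. The slide asks: DEADLOCK?]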

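Early Request Race Condition
[Figure: home node with directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/I, N3/M, N4/I, N5/I. Messages: RE, Forward, Marker, Reply, Rd, Forward, Marker, Reply. States shown: N3/I, N2/M, N2/O, N5/S. The slide asks: DEADLOCK?]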

Crossbar Network
– Q0 queue: requests to the home node (point-to-point order)
– Q1 queue: forwards, replies, and invalidations from the home node (global order)
– Q2 queue: data replies from the owner to the requester node
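One way to picture the three lanes and the global order on Q1 is a switch that stamps every Q1 message with a single sequence number before delivering it. The sketch below is an illustration under stated assumptions: the message labels in `LANES`, the `lane_of` helper, and the `Q1Switch` class are hypothetical names, and a real crossbar achieves total order through arbitration rather than a counter.

```python
from collections import deque

# ASSUMPTION: message labels are illustrative; the slide names only the lanes.
LANES = {
    "Rd": "Q0", "RE": "Q0", "Ex": "Q0",                        # requests to home
    "Forward": "Q1", "HomeReply": "Q1", "Invalidate": "Q1",    # from home, globally ordered
    "OwnerReply": "Q2",                                        # data from owner to requester
}


def lane_of(msg_type):
    """Return the lane (Q0, Q1, or Q2) a message type travels on."""
    return LANES[msg_type]


class Q1Switch:
    """Single serialization point: all nodes observe Q1 messages in one order."""

    def __init__(self, nodes):
        self.seq = 0
        self.inbound = {n: deque() for n in nodes}

    def send(self, msg, dests):
        """Stamp msg with the next global sequence number and enqueue it at each destination."""
        self.seq += 1
        for d in dests:
            self.inbound[d].append((self.seq, msg))
        return self.seq


sw = Q1Switch(["P1", "P2"])
first = sw.send("Forward RE(A)", ["P1", "P2"])
second = sw.send("Invalidate B", ["P2", "P1"])
```

Because both messages pass through the same counter, P1 and P2 see them in the same relative order regardless of which home node sent them.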

Total Ordering on Q1
[Figure: P1 holds line A in state O and P2 holds line B in state O; each has a cache and a Q1 inbound queue. Home nodes HA and HB each have a Q1 outbound queue. All are connected through the crossbar switch. Directory owner labels progress from A (P1), B (P2) to A (P2), B (P1) to A (X), B (Y). Requests shown: RE1(B), RE2(A), RE3(B), RE4(A). Forwards queued: P1 – RE2(A), P2 – RE1(B), P1 – RE3(B), P2 – RE4(A). The slide asks: DEADLOCK?]

Desirable Characteristics
– Dirty sharing is efficient for migratory accesses
– All directory changes take effect immediately; a request needs only a single access to the home node and directory
– Eliminates livelock, starvation, and fairness problems
– Writes can start as soon as the Exclusive request is issued

Alpha Consistency Model
– MB: Memory Barrier
[Figure: two sequences of LOAD, STORE, LOAD, STORE, LOAD operations listed in program order, starting from the oldest memory operation, with an MB.]
– Atomicity is not violated: read others' write early

Consistency Model Implementation
Barrier Performance (Commit Event)
– Early acknowledgment of invalidates
– Early acknowledgment of forwards (Exclusive, Read Exclusive, and Read requests)
Overall Performance
– Relax the total-order condition on Q1 at commit points: let replies (Q1->Q2) bypass forwards (Q1) and invalidations (Q1)

Early Acknowledgement of Invalidation Request
[Figure: P1 and P2, each with a cache and a Q1 inbound queue, connected by the crossbar switch. Initially A = 0 and B = 0. Under SC, P1 runs A = 1; B = 1; and P2 runs u = B; v = A;. Outcome shown: u = 1, v = 0. Messages: EX, INVAL A, INV Ack, Rd, Marker, Commit. Other labels: Commit pt, MB, Not a Race Condition, Commit Races.]
1. Optimize the memory barrier at P1 for write-to-write/read ordering
2. Commit events travel in the Q1 queue for ordering purposes in the case of replies
3. Sufficient condition: commit events must not bypass invalidates
4. A memory barrier at P2 waits for all commits before going ahead
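The two-processor example on this slide is a classic litmus test, and the outcome u = 1, v = 0 can be checked against sequential consistency by brute force. The sketch below enumerates every interleaving of the two programs that preserves program order and collects the resulting (u, v) pairs. It models SC only, not the GS320 hardware or its commit events; the helper names `interleavings` and `sc_outcomes` are made up for this illustration.

```python
# Litmus test from the slide: P1: A = 1; B = 1;   P2: u = B; v = A;
P1 = [("store", "A", 1), ("store", "B", 1)]
P2 = [("load", "B", "u"), ("load", "A", "v")]


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's internal order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sc_outcomes():
    """Return the set of (u, v) results reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(P1, P2):
        mem = {"A": 0, "B": 0}      # initial values from the slide
        regs = {}
        for op, addr, x in order:
            if op == "store":
                mem[addr] = x
            else:
                regs[x] = mem[addr]
        outcomes.add((regs["u"], regs["v"]))
    return outcomes
```

Under SC the pair (1, 0) never appears: if P2 sees the new B, the earlier store to A must already be visible. An early invalidation acknowledgement is therefore only safe if the implementation still prevents that outcome, which is the role of the sufficient condition listed on the slide.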

Commit Points
[Figure: QBBs with DTAG, DIR, and TTT attached to the 8x8 global crossbar switch, with the commit point marked.]

Early Acknowledge of Forwards
[Figure: P1 and P2, each with a cache and a Q1 inbound queue, connected by the crossbar switch. Initially A = 0 and B = 0. P1 runs A = 1; MB; B = 1; and P2 runs u = B; MB; v = A;. Outcome shown: u = 1, v = 0. Messages: Read B, Commit/RdB, Fwd Ack, INVAL A, Commit/INV A. Other labels: Commit pt, bypass.]
Sufficient condition: commit events must not bypass invalidates, reads, and read-exclusive forwards
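The rule that a memory barrier waits for commit events rather than for full completion can be pictured as a small bookkeeping structure per processor. This is a sketch under assumptions: the class name `BarrierTracker`, its methods, and the use of request ids are all hypothetical, and the real design tracks this in hardware.

```python
class BarrierTracker:
    """Tracks requests whose commit event has not yet arrived (illustrative only)."""

    def __init__(self):
        self.outstanding = set()

    def issue(self, req_id):
        """Record a memory request that has been sent but not yet committed."""
        self.outstanding.add(req_id)

    def on_commit(self, req_id):
        """A commit event for req_id arrived; data or acks may still be in flight."""
        self.outstanding.discard(req_id)

    def barrier_can_retire(self):
        """An MB may proceed once every earlier request has received its commit event."""
        return not self.outstanding


p1 = BarrierTracker()
p1.issue("A=1")
blocked = p1.barrier_can_retire()      # False: commit for A = 1 still pending
p1.on_commit("A=1")
released = p1.barrier_can_retire()     # True: the MB may proceed, B = 1 can issue
```

The point of the design is that `on_commit` can fire before the invalidation is actually applied at the remote cache, as long as the commit event cannot bypass the messages named in the sufficient condition.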

Optimization Summary
Dirty sharing
– Reduces home node traffic
No negative acknowledgements
– Reduces network traffic (home node)
– Simple implementation of the directory
– Removes livelock, starvation, and fairness problems
– Network total ordering avoids deadlocks
– Write optimization
Bypass of replies in Q1 queue
– Improves overall performance
Improved barrier performance
– Early invalidation acknowledgements
– Early forward responses (Rd, RE, EX)
– Memory ordering based on commit events

Performance

DOUBTS?