Performance of the Shasta distributed shared memory protocol
Daniel J. Scales, Kourosh Gharachorloo
Presented by Nguyen Toan Duc, Department of Creative Informatics (Master's course)
Agenda
Basic design of Shasta
Protocol optimizations
Performance results
Basic design of Shasta
Cache coherence protocol with 3 states: invalid, shared, exclusive
Shared miss: a read to an invalid line, or a write to an invalid or shared line
Block size can be different for different ranges of the shared address space
The shared address space is divided into fixed-size ranges, called lines
– Block = n contiguous lines
State information for each line is maintained in a state table
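The state-table lookup might look like the following minimal C sketch; the line size, table layout, and all names are illustrative assumptions, not taken from the paper.

```c
/* Minimal sketch of a per-line state table (line size, names, and
 * layout are illustrative assumptions). Every fixed-size line of the
 * shared address space has one entry recording its coherence state. */
#include <stdint.h>

#define LINE_SHIFT 6   /* assumed 64-byte lines */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

extern uint8_t   state_table[];   /* one entry per line */
extern uintptr_t shared_base;     /* start of the shared address space */

/* Map a shared address to the coherence state of its line. */
static inline line_state_t line_state(const void *addr) {
    return (line_state_t)
        state_table[((uintptr_t)addr - shared_base) >> LINE_SHIFT];
}
```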
Basic shared miss check
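A C-level sketch of the check Shasta inserts before each load and store, building on the state-table sketch above; the real system inlines the equivalent machine instructions, and shared_miss_handler is a hypothetical name for the protocol entry point.

```c
/* Sketch of the basic shared miss check, written in C for readability;
 * Shasta actually inserts the equivalent instruction sequence into the
 * compiled binary. A read requires the line to be SHARED or EXCLUSIVE;
 * a write requires EXCLUSIVE. shared_miss_handler is a hypothetical
 * name for the routine that fetches the line from its owner. */
extern void shared_miss_handler(void *addr, int is_write);

int32_t checked_load(int32_t *p) {
    if (line_state(p) == INVALID)          /* read miss on invalid line */
        shared_miss_handler(p, 0);
    return *p;
}

void checked_store(int32_t *p, int32_t v) {
    if (line_state(p) != EXCLUSIVE)        /* write miss unless exclusive */
        shared_miss_handler(p, 1);
    *p = v;
}
```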
Shared miss check optimizations
Invalid flag technique (sketched below):
– Set each long word (4 bytes) of an invalid line to a special flag value
– Compare the loaded word value with the flag value -> miss or no miss
Batching miss checks:
– Batch together the checks for multiple loads / stores
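A sketch of the invalid-flag load check; the flag constant is an illustrative placeholder, not a value from the paper.

```c
/* Sketch of the invalid-flag technique for load checks. When a line is
 * invalidated, every long word (4 bytes) in it is overwritten with a
 * special flag value, so the common-case check is just "load, then
 * compare against the flag" with no state-table lookup. */
#define INVALID_FLAG 0xDEADBEEF   /* illustrative sentinel value */

int32_t flag_checked_load(int32_t *p) {
    int32_t v = *p;                        /* load the data first */
    if ((uint32_t)v == INVALID_FLAG) {
        /* Rare case: either a real miss, or application data that
         * happens to equal the flag; consult the state table. */
        if (line_state(p) == INVALID) {
            shared_miss_handler(p, 0);
            v = *p;                        /* reload after the line fill */
        }
    }
    return v;
}
```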
Protocol optimizations
Minimizing protocol messages:
– The owner node guarantees to service any request forwarded to it
– No need to retry requests due to transient states or deadlock: pending requests are saved in a queue
Multiple coherence granularity (sketched below):
– Block size is chosen based on the data structure: a small object is kept as a single unit, a large object is divided into lines
– Different granularities are associated with different virtual pages
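One way the per-page granularity association could be laid out, as a hedged sketch; the page size and table name are assumptions.

```c
/* Sketch of variable coherence granularity (page size and names are
 * assumptions). Each virtual page of shared memory is associated with
 * a block size, expressed here as the number of lines per block. */
#define PAGE_SHIFT 13   /* assumed 8 KB pages */

extern uint16_t lines_per_block[];   /* indexed by shared page number */

/* Number of lines in the coherence block that contains addr. */
static inline unsigned block_lines(const void *addr) {
    return lines_per_block[((uintptr_t)addr - shared_base) >> PAGE_SHIFT];
}
```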
Protocol optimizations (2)
Exploiting the relaxed memory model:
– Non-blocking loads / stores
– Non-blocking release
– Eager exclusive replies: on a read-exclusive request, data is sent back immediately to the requesting processor, while requests from other processors are delayed
Batching
Detecting migratory sharing patterns (sketched below):
– Migratory sharing: data is read and then modified by a succession of processors, migrating from one processor to another
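A hedged sketch of how migratory-pattern detection could be structured; the counter, threshold, and trigger condition are illustrative, not the paper's exact heuristic.

```c
/* Illustrative sketch of migratory-sharing detection; the threshold and
 * trigger condition are assumptions. Once a line is flagged migratory,
 * a read miss is granted exclusive ownership up front, eliminating the
 * later upgrade message when the reader modifies the data. */
#define MIGRATORY_THRESHOLD 2

struct migratory_info {
    int last_reader;      /* processor whose read miss was last served */
    int streak;           /* consecutive read-then-write observations */
    int is_migratory;     /* if set, serve read misses exclusively */
};

/* Called when 'proc' writes a line it recently read; 'prev_writer' is
 * the processor that last modified the line. */
void observe_write(struct migratory_info *mi, int proc, int prev_writer) {
    if (mi->last_reader == proc && proc != prev_writer) {
        /* Read followed by write from a new processor: data is moving. */
        if (++mi->streak >= MIGRATORY_THRESHOLD)
            mi->is_migratory = 1;
    } else {
        mi->streak = 0;   /* pattern broken: revert to normal handling */
        mi->is_migratory = 0;
    }
}
```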
Performance
Effect of Release Consistency
Non-blocking release
Effect of upgrades & sharing writebacks
Support for upgrade messages is important for some applications (e.g., VolRend)
Sharing writeback messages hurt performance
Effect of the migratory optimization
Disappointing!
Summary of results
Support for variable-granularity communication is the most important optimization in Shasta
Support for upgrade messages and a dirty-sharing protocol are also important
Exploiting RC provides only small performance gains, because processors are busy handling protocol messages while they wait for their own requests to complete
The migratory optimization is not useful in Shasta
Conclusion
Shasta supports fine-grain access to shared memory by inserting code before load / store instructions to check the state of the shared data
Shasta implements shared memory entirely in software -> flexibility in granularity & optimizations
Variable granularity is the most important optimization