CMSC 611: Advanced Computer Architecture

Slides:

Advertisements

Similar presentations

L.N. Bhuyan Adapted from Patterson’s slides

Advertisements

The University of Adelaide, School of Computer Science

Cache Coherence Mechanisms (Research project) CSCI-5593

CMSC 611: Advanced Computer Architecture

Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.

Cache Optimization Summary

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.

Computer Architecture 2011 – coherency & consistency (lec 7) 1 Computer Architecture Memory Coherency & Consistency By Dan Tsafrir, 11/4/2011 Presentation.

CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.

BusMultis.1 Review: Where are We Now? Processor Control Datapath Memory Input Output Input Output Memory Processor Control Datapath  Multiprocessor –

1 Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections )

CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.

Snooping Cache and Shared-Memory Multiprocessors

Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.

1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.

Multiprocessor Cache Coherency

Computer Architecture 2015 – Cache Coherency & Consistency 1 Computer Architecture Memory Coherency & Consistency By Yoav Etsion and Dan Tsafrir Presentation.

CSIE30300 Computer Architecture Unit 15: Multiprocessors Hsin-Chou Chi [Adapted from material by and

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.

Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.

The University of Adelaide, School of Computer Science

Multi Processing prepared and instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University June 2016Multi Processing1.

1 Computer Architecture & Assembly Language Spring 2001 Dr. Richard Spillman Lecture 26 – Alternative Architectures.

CS 704 Advanced Computer Architecture

COMP 740: Computer Architecture and Implementation

תרגול מס' 5: MESI Protocol

Architecture and Design of AlphaServer GS320

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

CS 704 Advanced Computer Architecture

Lecture 18: Coherence and Synchronization

Cache Coherence for Shared Memory Multiprocessors

Multiprocessor Cache Coherency

The University of Adelaide, School of Computer Science

CMSC 611: Advanced Computer Architecture

Example Cache Coherence Problem

Flynn’s Taxonomy Flynn classified by data and control streams in 1966

The University of Adelaide, School of Computer Science

Lecture 9: Directory-Based Examples II

Cache Coherence Protocols:

Cache Coherence Protocols:

Advanced Computer Architectures

Lecture 2: Snooping-Based Coherence

Chapter 6 Multiprocessors and Thread-Level Parallelism

Chapter 5 Multiprocessor and Thread-Level Parallelism

Multiprocessors - Flynn’s taxonomy (1966)

CPE 631 Session 20: Multiprocessors

11 – Snooping Cache and Directory Based Multiprocessors

Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP

Lecture 25: Multiprocessors

Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini

Lecture 25: Multiprocessors

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Cache coherence CEG 4131 Computer Architecture III

Lecture 24: Virtual Memory, Multiprocessors

Lecture 23: Virtual Memory, Multiprocessors

Lecture 24: Multiprocessors

Coherent caches Adapted from a lecture by Ian Watson, University of Machester.

Lecture 17 Multiprocessors and Thread-Level Parallelism

CPE 631 Lecture 20: Multiprocessors

Lecture 19: Coherence and Synchronization

Lecture 18: Coherence and Synchronization

The University of Adelaide, School of Computer Science

CSE 486/586 Distributed Systems Cache Coherence

Lecture 10: Directory-Based Examples II

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis

Shared Memory Processors communicate with shared address space Multi-core processors Advantages: Model of choice for uniprocessors, small-scale multiprocessor Ease of programming Lower latency Easier to use hardware controlled caching Difficult to handle node failure

Centralized Shared Memory Processor Processor Processor Processor One or more levels of cache One or more levels of cache One or more levels of cache One or more levels of cache Zero or more levels of cache I/O Memory Processors share a single centralized memory Feasible for small processor count to limit memory contention Model for multi-core CPUs

Cache Coherency A cache coherence problem arises when the cache reflects a view of memory which is different from reality A memory system is coherent if: P reads X, P writes X, no other processor writes X, P reads X Always returns value written by P P reads X, Q writes X, P reads X Returns value written by Q (provided sufficient W/R separation) P writes X, Q writes X Seen in the same order by all processors Processor b) Writes green, red stale c) Update memory (green), red stale in cache

Snooping Cache Coherency AKA Snoopy Bus Send all requests for data to all processors Processors snoop to see if they have a copy and respond accordingly Requires broadcast, since caching information is at processors Works well with bus (natural broadcast medium) Dominates for small scale machines (most of the market)

Basic Snooping Protocols Write Invalidate Protocol: Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Cache invalidation will force a cache miss when accessing the modified shared item For multiple writers only one will win the race ensuring serialization of the write operations Read Miss: Write-through: memory is always up-to-date Write-back: snoop in caches to find most recent copy

Basic Snooping Protocols Write Broadcast (Update) Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies To limit impact on bandwidth, track data sharing to avoid unnecessary broadcast of written data that is not shared Read miss: memory is always up-to-date Write serialization: bus serializes requests!

Invalidate vs. Update Write-invalidate has emerged as the winner for the vast majority of designs Qualitative Performance Differences : Spatial locality WI: 1 transaction/cache block; WU: 1 broadcast/word Latency WU: lower write–read latency WI: must reload new value to cache

Invalidate vs. Update Because the bus and memory bandwidth is usually in demand, write-invalidate protocols are very popular Write-update can causes problems for some memory consistency models, reducing the potential performance gain it could bring The high demand for bandwidth in write-update limits its scalability for large number of processors

An Example Snoopy Protocol Invalidation protocol, write-back cache Each block of memory is in one state: Clean in all caches and memory (Shared) OR Dirty in exactly one cache (Exclusive) OR Not in any caches Each cache block is in one state: Shared: block can be read OR Exclusive: cache has only copy, it is writeable, and dirty OR Invalid: block contains no data Read misses: cause all caches to snoop bus Writes to clean line are treated as misses

Snoopy-Cache Controller Complications Cannot update cache until bus is obtained Two step process: Arbitrate for bus Place miss on bus and complete operation Split transaction bus: Bus transaction is not atomic Multiple misses can interleave, allowing two caches to grab block in the Exclusive state Must track and prevent multiple misses for one block

Example Assumes memory blocks A1 and A2 map to same cache block, initial cache state is invalid

Example Assumes memory blocks A1 and A2 map to same cache block

Example Assumes memory blocks A1 and A2 map to same cache block

Example Assumes memory blocks A1 and A2 map to same cache block

Example Assumes memory blocks A1 and A2 map to same cache block

Example Assumes memory blocks A1 and A2 map to same cache block A1 A1 Why write miss first? Because in general, only write a piece of block, may need to read it first so that can have a full vblock; therefore, need to get Write back is low priority event.

Modern Variations MESI(F) Modified (renamed vs. 3-state protocol) 1 core can be modified, rest must be Invalid Data has been changed, will need to write back Exclusive (new) Single read only core Can change to Modified without asking Forward data, and change to Shared on request Shared Read only: many cores can be shared, read only Invalid: no longer valid Forward (new) Most recent shared core, designated to forward data Forwarding saves slower memory access

Examples ARM Cortex-A57 Intel Nehalem

ARM Cortex-A57

ARM Cortex-A57

ARM Cortex-A57

ARM Cortex-A57

Intel Nehalem Appaloosa, “Intel Nehalem Microarchitecture”, Wikimedia project, November 2008

Intel Nehalem

Intel Nehalem

Intel Nehalem

Cache Performance Benchmark Same core Write a bunch of memory from CPU0 Read from CPU0 Other core, same chip Read from CPU1-3 Other core, other socket Read from CPU4-7

Cache Performance Michael Thomadakis: The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms