EE 155 / COMP 122: Parallel Computing Spring 2019 Tufts University Instructor: Joel Grodstein joel.grodstein@tufts.edu Lecture 3: cache coherence
Goals
Primary goals:
- Learn how caches work on most multi-core processors
- Learn what cache coherence is, and why it can make writes slower than reads
Cache Coherence in Multicores
- Two cores on one die each have their own L1 but share some higher-level cache (e.g., L2).
- The issue shown below is a cache coherence problem. It only occurs when two threads share the same memory space.
[Figure: two cores, each with an L1 I$ and L1 D$, above a unified (shared) L2, assuming a write-through L1. Both cores read X and cache X=0. Core 1 then writes X=1, updating its L1 and (via write-through) the L2; core 2 re-reads X and still sees its stale cached X=0.]
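As a thought experiment, here is a toy software model of the figure above (my own sketch, not real hardware behavior; a coherent machine prevents exactly this outcome): each core keeps a private copy of X, writes go through to the shared L2, but nothing updates the other core's private copy, so core 2 reads a stale value.

    #include <cstdio>
    #include <map>

    int L2_X = 0;            // the shared L2's copy of X
    std::map<int,int> L1;    // core id -> that core's privately cached copy of X

    int read_X(int core) {
        if (!L1.count(core)) L1[core] = L2_X;  // miss: fill from the shared L2
        return L1[core];                       // hit: use the (possibly stale) copy
    }
    void write_X(int core, int v) {
        L1[core] = v;        // update our own L1...
        L2_X = v;            // ...and write through to L2,
    }                        // but nobody tells the other cores' L1s!

    int main() {
        printf("core1 reads %d\n", read_X(1));  // 0
        printf("core2 reads %d\n", read_X(2));  // 0
        write_X(1, 1);                          // core 1 writes X=1
        printf("core2 reads %d\n", read_X(2));  // still 0: a coherence violation
    }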
A Coherent Memory System
- The problem: when processors share slow memory, data must move into local caches to get reasonable latency and bandwidth. But this can produce "unreasonable" results.
- The solution is to have coherence and consistency.
- Coherence defines what values can be returned by a read:
  - every private cached copy of a memory location has the same value
  - this guarantees that caches are invisible to the software, since all cores read the same value for any memory location
- The issue on the last slide was a coherence violation.
- We'll focus on coherence in this course.
Consistency
- Consistency defines the ordering of two accesses by the same thread to different memory locations.
- Say core #0 writes mem[100]=0 and then mem[200]=1. Will all cores see mem[100]=0 before they see mem[200]=1?
- Surprisingly, the answer is not always "yes." Different models of memory consistency are common (SPARC and x86 support Total Store Ordering, which answers "no" at times).
- However, whichever write happens first, all cores will see the same ordering.
- Much more info: "A Primer on Memory Consistency and Cache Coherence," available online through Tisch.
- Remember the example: I change the date of the quiz on the calendar, and then send out a notice. What if you get the notice before the calendar is written? That scenario is sketched in code below.
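Here is a minimal C++ sketch of the calendar-and-notice scenario (my own illustration; the names calendar and notice are made up). With relaxed atomics, the compiler and hardware are allowed to reorder the two stores, so the reader can legally observe the notice before the new date; the default sequentially-consistent ordering would forbid that outcome.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> calendar{0};   // the quiz date
    std::atomic<int> notice{0};     // the "date changed" announcement

    void writer() {
        calendar.store(15, std::memory_order_relaxed);  // change the date...
        notice.store(1, std::memory_order_relaxed);     // ...then send the notice
    }
    void reader() {
        if (notice.load(std::memory_order_relaxed) == 1)
            // Under a relaxed model this may legally print the *old* date (0).
            printf("quiz date = %d\n", calendar.load(std::memory_order_relaxed));
    }
    int main() {
        std::thread t1(writer), t2(reader);
        t1.join(); t2.join();
    }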
Cache coherence strategies
- Coherence is usually implemented via write invalidate:
  - As long as processors are only trying to read a cache line, as many of them as desired can cache the line.
  - If a processor P1 wants to write a cache line, then P1 must be the only one caching the line.
  - If a different processor P0 then wants to read the line, P1 must first write its modified cache line back to memory, and P1 is no longer allowed to write the line.
- This is the key to everything: the rest is all implementation details (a toy model of the rule follows below).
- The alternative, write broadcast, requires a full broadcast of every write; for repeated writes by one processor to the same address, this negates much of the advantage of having a write-back cache in the first place.
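To make the write-invalidate rule concrete, here is a toy MSI-style state sketch (my own illustration, not the course's code, and ignoring the owner state introduced later): a write invalidates every other copy, and a read of a modified line forces a write-back and demotes the writer to shared.

    #include <cstdio>

    enum class State { Invalid, Shared, Modified };
    const char* name(State s) {
        return s==State::Invalid ? "Invalid" : s==State::Shared ? "Shared" : "Modified";
    }

    // What happens to *another* core's copy when some core accesses the line.
    State on_remote(State theirs, bool remote_is_write) {
        if (remote_is_write) return State::Invalid;          // write: invalidate all other copies
        if (theirs == State::Modified) return State::Shared; // read forces write-back; owner demoted
        return theirs;                                       // a read leaves Shared/Invalid alone
    }
    // What state a core's own copy enters after its own access.
    State on_local(bool is_write) {
        return is_write ? State::Modified : State::Shared;
    }

    int main() {
        State p0 = State::Shared, p1 = State::Invalid;
        p1 = on_local(true);        // P1 writes the line...
        p0 = on_remote(p0, true);   // ...so P0's copy is invalidated
        printf("P0=%s P1=%s\n", name(p0), name(p1));  // P0=Invalid P1=Modified
    }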
Coherence example
[Figure: cores #0 through #3, each with its own L1 cache, above a unified (shared) L2 holding X = 0.]
Start with X=0 in the shared L2.
Coherence example
Core #0 reads X: the line is pulled from the shared L2, and core #0's L1 now holds X=0.
This example assumes a snooping bus with a write-back L1 cache. That requires an additional state, called owner, which indicates that a line may be shared, but the owning core is responsible for updating any other processors and memory when it changes the line or replaces it.
Coherence example
Next, core #2 reads X. Cores #0 and #2 now both hold X=0 in their L1s; the line is shared.
And the fun begins…
Core #3 writes X=1:
- Only one core at a time can own a line & write it.
- Messages are sent to cores #0 and #2: get rid of the line!
- Core #3 pulls the line in and writes X=1; its L1 now holds the only up-to-date copy, while the L2 still has the stale X=0.
Still more fun
Core #2 reads X again (maybe it wasn't done with X yet):
- Only one core at a time can own a line & write it.
- A message is sent to core #3: you cannot keep a dirty copy!
- Core #3 updates the shared L2 (slow).
- Core #2 reads X and gets the new value, X=1.
The sketch below replays this whole sequence against a toy protocol model.
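To tie the walkthrough together, here is the same sequence of events replayed against a toy MSI-style model (again my own sketch; a real snooping protocol would add the owner state described above).

    #include <cstdio>

    enum class St { I, S, M };   // Invalid, Shared, Modified
    const char* name(St s) { return s==St::I ? "I" : s==St::S ? "S" : "M"; }

    St line[4] = {St::I, St::I, St::I, St::I};   // per-core L1 state of X's line

    void access(int core, bool is_write) {
        for (int c = 0; c < 4; ++c) {
            if (c == core) continue;
            if (is_write) line[c] = St::I;               // invalidate all other copies
            else if (line[c] == St::M) line[c] = St::S;  // force write-back; demote the owner
        }
        line[core] = is_write ? St::M : St::S;
    }
    void show(const char* what) {
        printf("%-16s", what);
        for (int c = 0; c < 4; ++c) printf(" core%d=%s", c, name(line[c]));
        printf("\n");
    }

    int main() {
        access(0, false); show("core0 reads X");   // core0=S
        access(2, false); show("core2 reads X");   // core0=S core2=S
        access(3, true);  show("core3 writes X");  // core0=I core2=I core3=M
        access(2, false); show("core2 reads X");   // core3 demoted to S; core2=S
    }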
Cost of coherence
- Somebody has to keep track of which cache lines are owned by which cores; this gets harder and harder as the number of cores grows.
- Lots of messages must be sent, and they can swamp the on-chip interconnect fabric.
- What operations are reasonably fast? Lots of cores reading and nobody writing.
- What can get really ugly and slow? A variable that keeps getting written by different cores: lots of messages, lots of writebacks to the shared cache. The sketch below measures this effect directly.
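A minimal microbenchmark sketch (my own, not from the lecture) of why a variable written by many cores is slow: every increment of the shared counter bounces its cache line between cores, while the padded per-thread counters stay resident in each core's L1.

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>
    #include <chrono>

    constexpr int  N_THREADS = 4;
    constexpr long ITERS     = 1000000;

    std::atomic<long> shared_ctr{0};

    // Pad to a cache line (64 bytes assumed) so the private counters don't
    // accidentally share a line (false sharing).
    struct alignas(64) Padded { long v = 0; };
    Padded private_ctr[N_THREADS];

    template <class F>
    double time_threads(F body) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> ts;
        for (int t = 0; t < N_THREADS; ++t) ts.emplace_back(body, t);
        for (auto& th : ts) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        double slow = time_threads([](int) {
            for (long i = 0; i < ITERS; ++i) shared_ctr++;         // line ping-pongs between cores
        });
        double fast = time_threads([](int me) {
            for (long i = 0; i < ITERS; ++i) private_ctr[me].v++;  // stays in this core's L1
        });
        printf("shared: %.3fs  private: %.3fs\n", slow, fast);
    }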
Summary
- Cache coherence makes the existence of caching invisible to the software: it ensures that every thread has the same opinion about the value of any address.
- Coherence requires lots of communication between different cores and caches.
- Coherence is:
  - fast when lines are only read but not written
  - slow when lines are shared and also written, especially if they are written by multiple threads
Hot spots and π
Let's use some of what we've learned about caches. Remember the π program?

    \pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \dots + (-1)^n \frac{1}{2n+1}\right)

    volatile int owner=0;   // whose turn it is to add a term (volatile so the spin
                            // loop re-reads it; real code would use std::atomic)
    double sum=0;           // running total; 4*sum approximates pi
    void th_func (int me, int stride) {
        for (int i=me; i<N_TERMS; i += stride) {
            double term = 1.0/(2*i+1);      // magnitude of the i-th term
            bool pos = ((i&1) == 0);        // even-index terms added, odd subtracted
            while (owner != me) ;           // spin until it's our turn
            sum += (pos? term : -term);
            owner = (owner+1) % N_THREADS;  // pass the turn to the next thread
        }
    }
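A minimal driver for th_func, for reference (a sketch only: the lecture doesn't show its main(), and the N_THREADS and N_TERMS values here are my assumptions). Note that the turn-taking scheme deadlocks unless N_TERMS is a multiple of N_THREADS, since a finished thread never passes the turn along.

    #include <thread>
    #include <vector>
    #include <cstdio>

    constexpr int N_THREADS = 4;        // assumed
    constexpr int N_TERMS   = 1000000;  // assumed; a multiple of N_THREADS
    // owner, sum, and th_func as defined above

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < N_THREADS; ++t)
            ts.emplace_back(th_func, t, N_THREADS);  // thread t gets terms t, t+N_THREADS, ...
        for (auto& th : ts) th.join();
        printf("pi ~= %.8f\n", 4*sum);
    }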
    void th_func (int me, int stride) {
        for (int i=me; i<N_TERMS; i += stride) {
            double term = 1.0/(2*i+1);
            bool pos = ((i&1) == 0);
            while (owner != me) ;           // spin until it's our turn
            sum += (pos? term : -term);
            owner = (owner+1) % N_THREADS;
        }
    }

What might be the hot spot(s)? Why?
- The variable owner is accessed by every thread nearly continuously, and is written by each thread in turn.
- Every time it's written, the writer must invalidate every other copy; then the other threads must re-fetch the line… all very tedious.
- The more threads we have, the more invalidates we need.
- Now we know why this was so slow! One common fix is sketched below.
Coming up next week: look at our histogram from this point of view. Don't worry about getting all of these explanations into your Lab #1 report.
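One common fix (a sketch of the general idea, not necessarily the Lab #1 answer): give each thread its own padded accumulator, so no cache line is ever written by more than one core, and combine the partial sums once at the end. The N_THREADS and N_TERMS values are again assumptions.

    #include <thread>
    #include <vector>
    #include <cstdio>

    constexpr int N_THREADS = 4;       // assumed, as before
    constexpr int N_TERMS   = 1000000;

    struct alignas(64) Partial { double v = 0; };  // one cache line (64 B assumed) per thread
    Partial partial[N_THREADS];

    void th_func(int me, int stride) {
        for (int i = me; i < N_TERMS; i += stride) {
            double term = 1.0/(2*i+1);
            partial[me].v += ((i & 1) == 0) ? term : -term;  // private: no invalidates, no spinning
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < N_THREADS; ++t) ts.emplace_back(th_func, t, N_THREADS);
        for (auto& th : ts) th.join();
        double sum = 0;
        for (auto& p : partial) sum += p.v;   // cheap single-threaded reduction at the end
        printf("pi ~= %.8f\n", 4*sum);
    }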