EE 155 / COMP 122: Parallel Computing


EE 155 / COMP 122: Parallel Computing
Spring 2019, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Lecture 3: Cache Coherence

Goals
Primary goals:
- Learn how caches work on most multi-core processors
- Learn what cache coherence is, and why it can make writes slower than reads

Cache Coherence in Multicores
- Two cores on one die each have their own L1, but share some higher-level cache (e.g., an L2).
- The issue is a cache coherence problem. It only occurs when two threads share the same memory space.
[Diagram: Core 1 and Core 2, each with private L1 I$ and D$, above a unified (shared) L2. Both cores read X = 0 from the L2; Core 1 then writes X = 1, but Core 2's L1 still holds the stale X = 0.]
(Speaker note: assumes a write-through L1 cache.)
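The same sequence can be written as a tiny program. Below is a hedged sketch (not from the slides) using std::thread; the variable names are illustrative, and the cache states in the comments come from the diagram above; software cannot observe them directly, which is exactly the point of coherence.

    // A minimal sketch of the slide's scenario. On coherent hardware the
    // reader can never return a stale cached value after the write is visible.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> X{0};   // starts as X = 0 in the shared L2

    int main() {
        std::thread core1([] {
            int a = X.load();    // Core 1 reads X: pulls X = 0 into its L1 D$
            X.store(a + 1);      // Core 1 writes X = 1; without coherence,
                                 // Core 2's cached copy would now be stale
        });
        std::thread core2([] {
            int b = X.load();    // Core 2 may legitimately see 0 (it ran first)
                                 // or 1, but coherence forbids it from reading
                                 // a stale cached 0 after Core 1's write has
                                 // become visible
            std::printf("core 2 sees X = %d\n", b);
        });
        core1.join();
        core2.join();
    }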

A Coherent Memory System
- The problem: when processors share slow memory, data must move into local caches to get reasonable latency and bandwidth. But this can produce "unreasonable" results.
- The solution is to have coherence and consistency.
- Coherence defines what values can be returned by a read:
  - every private cached copy of a memory location has the same value
  - this guarantees that caches are invisible to the software, since all cores read the same value for any memory location
- The issue on the last slide was a coherence violation.
- We'll focus on coherence in this course.

Consistency
- Consistency defines the ordering of two accesses by the same thread to different memory locations.
- Suppose core #0 writes mem[100]=0 and then mem[200]=1. Will all cores see mem[100]=0 before they see mem[200]=1?
- Surprisingly, the answer is not always "yes." Different models of memory consistency are common (SPARC and x86 support Total Store Ordering, which sometimes answers "no").
- However, whichever write happens first, all cores will see the same ordering.
- Much more info: "A Primer on Memory Consistency and Cache Coherence," available online at Tisch.
(Speaker note: remember the example: I change the date of the quiz on the calendar, and then send out a notice. What if you get the notice before the calendar is written?)
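The calendar-and-notice example is the classic message-passing litmus test. Below is a hedged C++ sketch (not from the slides; all names are illustrative) showing how software must order the two writes when the hardware alone won't.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int calendar = 0;                 // the "quiz date" (plain data)
    std::atomic<bool> notice{false};  // the "notice went out" flag

    int main() {
        std::thread writer([] {
            calendar = 1;                                   // update the calendar...
            notice.store(true, std::memory_order_release);  // ...then send the notice
        });
        std::thread reader([] {
            while (!notice.load(std::memory_order_acquire)) {}  // wait for the notice
            std::printf("calendar = %d\n", calendar);            // guaranteed to print 1
        });
        writer.join();
        reader.join();
    }
    // With release/acquire (or seq_cst) ordering, the reader can never see the
    // notice before the calendar update. If both were plain non-atomic
    // variables, the compiler and some memory models could reorder the stores,
    // and the reader could "get the notice before the calendar is written."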

Cache coherence strategies
- Coherence is usually implemented via write invalidate:
  - As long as processors are only trying to read a cache line, as many of them as desired can cache the line.
  - If a processor P1 wants to write a cache line, then P1 must be the only one caching the line.
  - If a different processor P0 then wants to read the line, P1 must first write its modified cache line back to memory, and P1 is no longer allowed to write the line.
- This is the key to everything: the rest is all implementation details.
- The alternative, write broadcast, requires a full broadcast of each write; for repeated writes by one processor to the same address, this negates much of the advantage of having a write-back cache in the first place.
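To make the write-invalidate rules concrete, here is a minimal sketch of one cache line's state machine, in the style of the MSI protocol. The slide names no specific protocol, so this is an illustrative model, not any processor's actual implementation.

    #include <cstdio>

    enum class State { Invalid, Shared, Modified };

    struct Line { State state = State::Invalid; };

    // What this core does on its own read/write ("processor events").
    void on_read(Line& l)  { if (l.state == State::Invalid) l.state = State::Shared; } // fetch a readable copy
    void on_write(Line& l) { l.state = State::Modified; }  // after broadcasting "invalidate!" to all other copies

    // What this core does when it snoops another core's request ("bus events").
    void on_remote_write(Line& l) { l.state = State::Invalid; }   // someone else needs exclusive access
    void on_remote_read(Line& l) {
        if (l.state == State::Modified) l.state = State::Shared;  // write back dirty data, keep a read-only copy
    }

    int main() {
        Line x;             // this core's copy of line X
        on_read(x);         // -> Shared: many cores may hold the line
        on_write(x);        // -> Modified: we are the only cacher
        on_remote_read(x);  // -> Shared: we wrote back, the reader got fresh data
        on_remote_write(x); // -> Invalid: another core is writing now
        std::printf("final state: %d\n", static_cast<int>(x.state));
    }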

Coherence example
Start with X = 0 in the shared L2.
[Diagram: Core #0 through Core #3, each with a private L1 cache, above a unified (shared) L2. Every L1 copy of X is invalid (X = I); the L2 holds X = 0.]

Coherence example
Core #0 reads X.
[Diagram: Core #0's L1 now holds X = 0, filled from the L2; the other L1 copies remain invalid (X = I); the L2 holds X = 0.]
(Speaker note: snooping bus with a write-back L1 cache. This requires an additional state, called "owner," which indicates that a line may be shared, but the owning core is responsible for updating the other processors and memory when it changes the line or replaces it.)

Coherence example
Next, core #2 reads X.
[Diagram: the L1s of cores #0 and #2 each hold X = 0; the other copies are invalid; the L2 holds X = 0.]

And the fun begins…
Core #3 writes X = 1:
- Only one core at a time can own a line and write it.
- Messages are sent to cores #0 and #2: get rid of the line!
- Core #3 pulls the line in and writes X = 1.
[Diagram: Core #3's L1 holds X = 1; the copies in cores #0 and #2 are now invalid (X = I); the L2 still holds the stale X = 0.]

Still more fun
Core #2 reads X again (maybe it wasn't done with X yet):
- Only one core at a time can own a line and write it.
- A message is sent to core #3: you cannot keep a dirty copy!
- Core #3 updates the shared L2 (slow).
- Core #2 reads X.
[Diagram: the L1s of cores #2 and #3 each hold X = 1; the L2 now holds X = 1.]

Cost of coherence
- Somebody has to keep track of which cache lines are owned by which cores; this gets harder and harder the more cores there are.
- Lots of messages must be sent, which can swamp the on-chip interconnect fabric.
- What operations are reasonably fast? Lots of cores reading and nobody writing.
- What can get really ugly and slow? A variable that keeps getting written by different cores: lots of messages, lots of writebacks to the shared cache.
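A hedged micro-benchmark sketch of that worst case: two threads repeatedly writing the same cache line (so it ping-pongs between their L1s) versus each thread writing its own line. The 64-byte line size and all names are assumptions about typical hardware, not measurements from the course machines.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Pad each counter to a full cache line (64 bytes assumed).
    struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };

    PaddedCounter shared_ctr;      // both threads hammer this one line
    PaddedCounter private_ctr[2];  // one line per thread

    template <class F0, class F1>
    double time_two_threads(F0 f0, F1 f1) {
        auto start = std::chrono::steady_clock::now();
        std::thread a(f0), b(f1);
        a.join(); b.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        constexpr long N = 20'000'000;
        // Shared line: each increment invalidates the other core's copy.
        double slow = time_two_threads(
            [] { for (long i = 0; i < N; ++i) shared_ctr.v.fetch_add(1, std::memory_order_relaxed); },
            [] { for (long i = 0; i < N; ++i) shared_ctr.v.fetch_add(1, std::memory_order_relaxed); });
        // Private lines: each core keeps its own line in Modified state; no coherence traffic.
        double fast = time_two_threads(
            [] { for (long i = 0; i < N; ++i) private_ctr[0].v.fetch_add(1, std::memory_order_relaxed); },
            [] { for (long i = 0; i < N; ++i) private_ctr[1].v.fetch_add(1, std::memory_order_relaxed); });
        std::printf("shared line: %.2fs   private lines: %.2fs\n", slow, fast);
    }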

Summary
- Cache coherence makes the existence of caching invisible to the software: it ensures that every thread has the same opinion about the value of any address.
- Coherence requires lots of communication between different cores and caches.
- Coherence is:
  - fast when lines are only read but not written
  - slow when lines are shared and also written, especially if they are written by multiple threads

Hot spots and π
Let's use some of what we've learned about caches. Remember the π program?

$\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \dots + (-1)^n \frac{1}{2n+1}\right)$

    // sum, N_TERMS, and N_THREADS are globals declared elsewhere
    int owner = 0;
    void th_func(int me, int stride) {
        for (int i = me; i < N_TERMS; i += stride) {
            double term = 1.0 / (2.0*i + 1);  // i-th term (left undefined on the
                                              // slide; follows from the series above)
            bool pos = ((i & 1) == 0);        // even terms add, odd terms subtract
            while (owner != me) ;             // spin until it's our turn
            sum += (pos ? term : -term);
            owner = (owner + 1) % N_THREADS;  // hand the baton to the next thread
        }
    }

(Same th_func as on the previous slide.)

    void th_func(int me, int stride) {
        for (int i = me; i < N_TERMS; i += stride) {
            double term = 1.0 / (2.0*i + 1);
            bool pos = ((i & 1) == 0);
            while (owner != me) ;             // spin until it's our turn
            sum += (pos ? term : -term);
            owner = (owner + 1) % N_THREADS;
        }
    }

What might be the hot spot(s)? Why?
- The variable owner is accessed nearly continuously by every thread, and is written by each thread in turn.
- Every time it's written, the writer must invalidate every other copy. Then the other threads must re-fetch the line… all very tedious.
- The more threads we have, the more invalidates we need.
- Now we know why this was so slow!
(Speaker notes: Coming up next week, we'll look at our histogram from this point of view. Don't worry about getting all of these explanations into the Lab #1 report.)
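For reference, a hedged sketch of the standard remedy (not from the slides): give each thread its own partial sum on its own cache line, so no line ping-pongs between cores, and combine the partials once at the end.

    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int N_TERMS   = 100'000'000;
    constexpr int N_THREADS = 4;

    // One cache line per thread (64-byte line size assumed), so writes to one
    // thread's partial never invalidate another thread's line.
    struct alignas(64) Partial { double sum = 0; };
    Partial partial[N_THREADS];

    void th_func(int me, int stride) {
        double local = 0;                       // lives in a register: no sharing at all
        for (int i = me; i < N_TERMS; i += stride) {
            double term = 1.0 / (2.0*i + 1);
            local += ((i & 1) == 0) ? term : -term;
        }
        partial[me].sum = local;                // one write per thread, to a private line
    }

    int main() {
        std::vector<std::thread> th;
        for (int t = 0; t < N_THREADS; ++t)
            th.emplace_back(th_func, t, N_THREADS);
        for (auto& t : th) t.join();

        double sum = 0;
        for (int t = 0; t < N_THREADS; ++t) sum += partial[t].sum;
        std::printf("pi ~= %.8f\n", 4 * sum);
    }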