
Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 5: Multiprocessors and Thread-Level Parallelism
The University of Adelaide, School of Computer Science
Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction
Thread-level parallelism:
- Have multiple program counters
- Uses the MIMD model
- Targeted for tightly coupled shared-memory multiprocessors
- For n processors, need at least n threads
- Amount of computation assigned to each thread = grain size
- Threads can be used for data-level parallelism, but the overhead may outweigh the benefit

Types
- Symmetric multiprocessors (SMP): small number of cores; share a single memory with uniform memory access/latency (UMA)
- Distributed shared memory (DSM): memory distributed among processors; non-uniform memory access/latency (NUMA); processors connected via direct (switched) or indirect (multi-hop) interconnection networks

Cache Coherence
Processors may see different values for the same memory location through their caches (illustrative figure not reproduced).
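To make the problem concrete, here is a small illustrative sketch (not from the slides) in C: two "caches" are modeled as plain variables over a shared memory location, and with no coherence mechanism one processor reads a stale value after the other writes. All names are invented for the example.

    #include <stdio.h>

    /* Toy model of the coherence problem: two private write-back caches
       over one shared memory location X. */
    int memory_x = 0;                  /* the shared location X in main memory */
    int cache_a = -1, cache_b = -1;    /* -1 means "X not cached" */

    int read_via(int *cache) {
        if (*cache == -1)
            *cache = memory_x;         /* miss: fill from memory */
        return *cache;                 /* hit: returns the cached copy, possibly stale */
    }

    int main(void) {
        read_via(&cache_a);            /* CPU A reads X: caches 0 */
        read_via(&cache_b);            /* CPU B reads X: caches 0 */
        cache_a = 1;                   /* CPU A writes X = 1 (write-back: memory untouched) */
        printf("CPU B now sees X = %d\n", read_via(&cache_b));   /* prints 0: stale */
        return 0;
    }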

Cache Coherence
Coherence:
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
Protocols:
- Distributed (snooping): each core tracks the sharing status of each block
- Centralized (directory): sharing status of each block kept in one location

Coherence Protocols
Where is the most up-to-date version of the data? It depends on the cache organization:
- Write buffers: the data may be in a write buffer
- Write-through caches: in memory, or in a write buffer
- Write-back caches: could be in any cache or write buffer
How can this version of the data be found efficiently?

Snoopy Coherence Protocols
- Write invalidate: on a write, invalidate all other copies; use the bus itself to serialize writes (a write cannot complete until bus access is obtained)
- Write update: on a write, update all copies

Snoopy Coherence Protocols
- Caches "snoop" the bus for updates
- Can use the valid bit already associated with each cache block; can also use the dirty bit, assuming write-back caches
MSI protocol (write invalidate): every cache block is in one of three states:
- Modified: dirty (and exclusive); read/write access
- Shared: valid and clean; read-only access
- Invalid: not valid; empty
Write hits to modified blocks do not require invalidations.
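As an illustration (a sketch under assumed names, not the book's code), the write-invalidate MSI transitions can be written as two small C functions: one for requests from the local processor, one for requests snooped on the bus.

    #include <stdio.h>

    enum msi { INVALID, SHARED, MODIFIED };

    /* State change for a read (is_write == 0) or write (is_write == 1)
       issued by this cache's own processor. */
    enum msi cpu_request(enum msi s, int is_write) {
        if (is_write)
            return MODIFIED;   /* write miss/hit: invalidate other copies, become M
                                  (a write hit in M needs no invalidation at all) */
        return (s == INVALID) ? SHARED : s;   /* read miss fills as S; hits keep state */
    }

    /* State change when a request from ANOTHER cache is snooped on the bus. */
    enum msi snooped_request(enum msi s, int other_is_write) {
        if (other_is_write)
            return INVALID;                    /* remote write invalidates our copy */
        return (s == MODIFIED) ? SHARED : s;   /* remote read: supply data, demote M to S */
    }

    int main(void) {
        enum msi s = INVALID;
        s = cpu_request(s, 0);        /* read miss   -> SHARED   */
        s = cpu_request(s, 1);        /* write hit   -> MODIFIED */
        s = snooped_request(s, 0);    /* remote read -> SHARED   */
        printf("final state: %d\n", s);
        return 0;
    }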

Snoopy Coherence Protocols (state transition diagrams for requests from the processor and from the bus, not reproduced)

Example (step-by-step snooping trace; the table itself was not captured)
Assumes A1 and A2 map to the same cache block; the initial cache state is invalid.
Why send the write miss first? In general a write modifies only part of a block, so the block may need to be read first to obtain a complete, up-to-date copy before writing; the write-back of the displaced block is a lower-priority event.

Coherence Protocols: Extensions
Complications for the basic MSI protocol:
- Operations are not atomic (e.g., detect miss, acquire bus, send invalidation)
- This creates the possibility of deadlock and races
- Protocols need to enforce atomic transactions
Extensions:
- Add an exclusive state to indicate a clean block held in only one cache (MESI protocol); avoids sending an invalidation when writing to an exclusive block
- Add an owned state to indicate the newest copy of the data (MOESI protocol); like a modified block, but it can coexist with shared blocks
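A hedged sketch of the benefit the exclusive state buys: with MESI, only a write hit to a Shared block needs a bus invalidation, because Exclusive and Modified blocks are already the sole copy. Names here are assumptions.

    #include <stdio.h>

    enum mesi { I, S, E, M };   /* Invalid, Shared, Exclusive (clean, sole copy), Modified */

    /* Does a write hit in state s require a bus invalidation?
       Only S does: E -> M and M -> M are silent local upgrades. */
    int write_hit_needs_bus(enum mesi s) { return s == S; }

    int main(void) {
        printf("S: %d, E: %d, M: %d\n",
               write_hit_needs_bus(S), write_hit_needs_bus(E), write_hit_needs_bus(M));
        return 0;   /* prints S: 1, E: 0, M: 0 */
    }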

Coherence Protocols: Extensions
Shared memory bus and snooping bandwidth are the bottleneck for scaling symmetric multiprocessors. Remedies:
- Duplicate the cache tags
- Place the directory in the outermost cache
- Use crossbars or point-to-point networks with banked memory

Performance
A new type of cache miss: the coherence miss (the fourth C). Two types:
- True sharing misses: one processor writes a word of a shared block; another processor later reads that word and misses on the invalidated block
- False sharing misses: another processor reads an unmodified word in an invalidated block; these cannot exist if the block size is one word
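False sharing is easy to provoke in C. The sketch below (pthreads; the 64-byte line size and all names are assumptions) keeps two independent counters on one cache line, so each thread's writes invalidate the line in the other core's cache even though no data is truly shared; padding the counters onto separate lines removes the coherence misses.

    #include <pthread.h>
    #include <stdio.h>

    #define N 100000000L

    /* Two independent counters that very likely share one 64-byte cache line. */
    struct { long a, b; } same_line;

    /* The padded variant pushes b onto a different line. */
    struct { long a; char pad[64]; long b; } separate_lines;

    void *bump_a(void *arg) { (void)arg; for (long i = 0; i < N; i++) same_line.a++; return NULL; }
    void *bump_b(void *arg) { (void)arg; for (long i = 0; i < N; i++) same_line.b++; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Each increment of a invalidates the line holding b in the other
           core's cache: a coherence miss with no true communication.
           Re-pointing the threads at separate_lines.a / separate_lines.b
           should run measurably faster on a multicore machine. */
        printf("%ld %ld\n", same_line.a, same_line.b);
        return 0;
    }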

Performance Study: Commercial Workload (measurement figures not reproduced)

Directory Protocols
- Snoopy-protocol traffic can be reduced by adding a directory to the shared L3 cache
- Add a bit string of length n (for n cores) to each block, tracking which cores hold the block in a private cache
- Allows point-to-point communication between only the relevant cores
- This approach cannot scale to multiple processor chips, because each processor has a distinct L3 cache; a directory must instead be added to main memory

Directory Protocols
The directory is implemented in main memory and tracks, for every block in main memory:
- Which nodes (processors) have the block
- The sharing state of the block
Implemented in a distributed fashion: shared memory is statically divided between processors, so a block's home directory is easy to locate.

Directory Protocols
For each block, the possible directory states are:
- Uncached: no node has the block cached
- Shared: one or more nodes have the block cached and the value in memory is up to date; the directory keeps the set of sharer node IDs
- Modified: exactly one node has a copy of the block and the value in memory is out of date; the directory keeps the owner node ID
Caches also still maintain per-block state, as in the snoopy protocol.
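A sketch of such a directory entry in C (all names and the node count are assumptions): the state plus a presence-bit vector, with an owner field used only in the modified state.

    #include <stdint.h>

    #define NNODES 8   /* assumed node count for this sketch */

    enum dstate { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

    /* One entry per memory block in the home node's directory. */
    struct dir_entry {
        enum dstate state;
        uint32_t sharers;   /* bit i set => node i caches the block (NNODES <= 32) */
        int owner;          /* meaningful only in DIR_MODIFIED */
    };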

Directory Protocols
The directory maintains block states and sends state messages. Each message is sent or received by the local node (the node that originated the request), to or from one of:
- Home node: the node that controls the requested address
- Remote node: any node that needs to know about a change in state

Directory Protocols
For an uncached block:
- Read miss: the requesting node is sent the requested data and is made the only sharing node; the block is now shared
- Write miss: the requesting node is sent the requested data and becomes the only sharing node; the block is now exclusive
For a shared block:
- Read miss: the requesting node is sent the requested data from memory and is added to the sharing set
- Write miss: the requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, the sharing set then contains only the requesting node, and the block is now exclusive

Directory Protocols
For an exclusive block:
- Read miss: the owner is sent a data fetch message; the owner sends the data to the directory, the data is written back to memory, the block becomes shared, and the sharing set contains the old owner and the requestor
- Write miss: a message is sent to the old owner to invalidate its copy and send the value to the directory; the requestor becomes the new owner and the block remains exclusive
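Continuing the directory-entry sketch above, the miss handling just described might look like this at the home node (message sends reduced to a stub; an outline only, with races and transient states deliberately ignored):

    #include <stdio.h>

    /* Stub: in a real system this would be a point-to-point network message. */
    static void send_msg(int dst, const char *msg) { printf("-> node %d: %s\n", dst, msg); }

    void handle_read_miss(struct dir_entry *e, int requester) {
        if (e->state == DIR_MODIFIED) {
            send_msg(e->owner, "fetch");      /* owner writes the block back to memory */
            e->sharers = 1u << e->owner;      /* old owner keeps a shared copy */
        }
        send_msg(requester, "data value reply");
        e->sharers |= 1u << requester;
        e->state = DIR_SHARED;
    }

    void handle_write_miss(struct dir_entry *e, int requester) {
        if (e->state == DIR_SHARED) {         /* invalidate every other sharer */
            for (int i = 0; i < NNODES; i++)
                if ((e->sharers & (1u << i)) && i != requester)
                    send_msg(i, "invalidate");
        } else if (e->state == DIR_MODIFIED) {
            send_msg(e->owner, "fetch/invalidate");  /* old owner returns data, invalidates */
        }
        send_msg(requester, "data value reply");
        e->sharers = 1u << requester;         /* requester is now the only holder */
        e->owner = requester;
        e->state = DIR_MODIFIED;
    }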

Messages (the table of directory protocol message types was not captured)
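As a hedged reconstruction of that table, the message types implied by the transitions above can be listed as a C enum; the exact set here is an assumption based on the surrounding slides, not a copy of the original table.

    enum msg_type {
        MSG_READ_MISS,         /* local -> home: processor wants to read the block     */
        MSG_WRITE_MISS,        /* local -> home: processor wants to write the block    */
        MSG_INVALIDATE,        /* home -> remote sharers: invalidate your copy         */
        MSG_FETCH,             /* home -> owner: send the block home, keep it shared   */
        MSG_FETCH_INVALIDATE,  /* home -> owner: send the block home and invalidate it */
        MSG_DATA_VALUE_REPLY,  /* home -> local: here is the requested data            */
        MSG_DATA_WRITE_BACK    /* remote -> home: dirty block written back to memory   */
    };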

State Transitions (cache): transition diagram not reproduced
State Transitions (directory): transition diagram not reproduced

Example Configuration (figure not reproduced)

Example (step-by-step directory trace; the table itself was not captured)
Columns: Processor 1, Processor 2, Interconnect, Directory, Memory. A1 and A2 map to the same cache block.

AMD Opteron 8439
- Transistors: 904 M
- Power consumption: 137 W
- Max cores/chip: 6
- Max threads/core: 1
- Max nodes: 8
- Instruction issue/clock: 3
- Clock rate: 2.8 GHz
- Multicore coherence protocol: MOESI (snooping)
- Multichip coherence protocol: snoopy or directory

IBM Power7
- Transistors: 1200 M
- Power consumption: 140 W
- Max cores/chip: 8
- Max threads/core: 4 (SMT4)
- Max nodes: 32
- Instruction issue/clock: 6
- Clock rate: 4.1 GHz
- Multicore coherence protocol: MESI (snooping/directory)
- Multichip coherence protocol: directory

Intel Xeon 7560
- Transistors: 2300 M
- Power consumption: 130 W
- Max cores/chip: 8
- Max threads/core: 2
- Max nodes: 8
- Instruction issue/clock: 4
- Clock rate: 2.7 GHz
- Multicore coherence protocol: MESIF (snooping/directory)
- Multichip coherence protocol: directory

Sun UltraSPARC T2
- Transistors: 500 M
- Power consumption: 95 W
- Max cores/chip: 8
- Max threads/core: 8
- Max nodes: 4
- Instruction issue/clock: 2
- Clock rate: 1.6 GHz
- Multicore coherence protocol: MOESI (snooping/directory)
- Multichip coherence protocol: snooping

Performance Comparison (figures not reproduced)

Synchronization
Need to control access to resources shared between threads. The basic building blocks are instructions that read from and write to memory atomically (uninterruptibly):
- Atomic exchange: swaps a register with a memory location
- Test-and-set: sets the location only under a condition
- Fetch-and-increment: returns the original value from memory and increments it in memory
- Load linked / store conditional: if the contents of the memory location specified by the load linked change before the store conditional to the same address, the store conditional fails
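These building blocks map directly onto C11 atomics; a minimal sketch (variable names are assumptions):

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int lock_word = 0;   /* 0 = free, 1 = held */
    atomic_int counter  = 0;

    int main(void) {
        /* Atomic exchange: swap a register value with the memory location. */
        int old = atomic_exchange(&lock_word, 1);    /* old == 0 => we got the lock */

        /* Fetch-and-increment: return the original value, increment in memory. */
        int orig = atomic_fetch_add(&counter, 1);

        printf("old=%d orig=%d\n", old, orig);
        return 0;
    }

On load-linked/store-conditional machines (ARM, RISC-V, MIPS), the compiler typically implements these operations, and atomic_compare_exchange_*, as LL/SC loops.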

Implementing Locks
Spin lock, with no coherence support:

          DADDUI R2,R0,#1
  lockit: EXCH   R2,0(R1)     ;atomic exchange
          BNEZ   R2,lockit    ;already locked?

With coherence support (spin on a cached copy first):

  lockit: LD     R2,0(R1)     ;load of lock
          BNEZ   R2,lockit    ;not available, spin
          DADDUI R2,R0,#1     ;load locked value
          EXCH   R2,0(R1)     ;swap
          BNEZ   R2,lockit    ;branch if lock wasn't 0
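For comparison, here is a sketch of the same coherence-friendly "test-and-test-and-set" idiom in portable C11 (names are assumptions): spin on an ordinary cached load so waiting cores generate no bus traffic, and attempt the atomic exchange only when the lock looks free.

    #include <stdatomic.h>

    atomic_int the_lock = 0;   /* 0 = free, 1 = held */

    void acquire(atomic_int *l) {
        for (;;) {
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;                             /* spin on the locally cached copy */
            if (atomic_exchange(l, 1) == 0)   /* swap: got it if the old value was 0 */
                return;
        }
    }

    void release(atomic_int *l) {
        atomic_store(l, 0);                   /* the write invalidates waiters' copies; they retry */
    }

    int main(void) {
        acquire(&the_lock);
        /* ... critical section ... */
        release(&the_lock);
        return 0;
    }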

Implementing Locks (figure not reproduced)

Fallacies
- Amdahl's law doesn't apply to parallel computers
- Linear speedups are needed to make multiprocessors cost effective

Pitfalls
- Measuring performance of multiprocessors by linear speedup versus execution time
- Not developing the software to take advantage of, or optimize for, a multiprocessor architecture