
1 Lecture 22: Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 UCB

2 Review: Snoopy Cache Protocol
Write Invalidate Protocol: multiple readers, single writer
- Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
- Read miss:
  - Write-through: memory is always up-to-date
  - Write-back: snoop in caches to find the most recent copy
Write Broadcast Protocol (typically write-through)
Write serialization: the bus serializes requests! The bus is the single point of arbitration
Good for a small number of processors; how about 16 or more?
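To make the write-invalidate behavior above concrete, here is a minimal C sketch (not from the slides) of a per-line snooping controller; the state names and functions (snoop_remote_write, snoop_remote_read) are illustrative assumptions.

```c
/* Illustrative write-invalidate snooping sketch (hypothetical names). */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    unsigned long tag;
    line_state_t  state;
} cache_line_t;

/* Another processor broadcast a write to this block: invalidate our copy. */
void snoop_remote_write(cache_line_t *line, unsigned long addr_tag) {
    if (line->state != INVALID && line->tag == addr_tag)
        line->state = INVALID;
}

/* Another processor read-missed on this block: with write-back caches we
 * must supply the most recent copy and drop our line to SHARED. */
void snoop_remote_read(cache_line_t *line, unsigned long addr_tag) {
    if (line->state == MODIFIED && line->tag == addr_tag) {
        /* supply the data on the bus; memory is also updated (not shown) */
        line->state = SHARED;
    }
}
```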

3 Larger MPs
- Separate memory per processor; local or remote access via the memory controller
- One cache coherency solution: non-cached pages
- Alternative: a directory per cache that tracks the state of every block in every cache
  - Which caches have copies of the block, dirty vs. clean, ...
- Info per memory block vs. per cache block?
  - PLUS: in memory => simpler protocol (centralized/one location)
  - MINUS: in memory => directory is f(memory size) vs. f(cache size)
- Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of its blocks
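To illustrate the "directory is f(memory size)" point, a rough back-of-the-envelope calculation in C; the 64 GB memory, 64-byte blocks, and 64 processors are assumed numbers, not from the slide.

```c
#include <stdio.h>

/* Rough full-bit-vector directory overhead: one presence bit per processor
 * plus a couple of state bits for every memory block. */
int main(void) {
    const double mem_bytes   = 64.0 * (1 << 30); /* 64 GB of memory (assumed) */
    const double block_bytes = 64.0;             /* 64-byte blocks (assumed)  */
    const double procs       = 64.0;             /* 64 processors (assumed)   */

    double blocks        = mem_bytes / block_bytes;
    double dir_bits      = blocks * (procs + 2.0);   /* presence bits + state */
    double overhead_frac = dir_bits / (mem_bytes * 8.0);

    printf("directory overhead: %.1f%% of memory\n", overhead_frac * 100.0);
    return 0;
}
```

The overhead scales with the number of processors and the total memory, which is exactly why distributing the directory entries alongside each node's memory is attractive.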

4 Distributed Directory MPs
[Figure: multiple nodes, each containing Processor + Cache, Memory, Directory, and I/O, connected by an interconnection network]

5 Directory Protocol
Similar to the snoopy protocol: three states
- Shared: ≥ 1 processor has the data; memory is up-to-date
- Uncached: no processor has it; not valid in any cache
- Exclusive: 1 processor (the owner) has the data; memory is out-of-date
In addition to the cache state, must track which processors have the data when in the shared state (usually a bit vector, bit i = 1 if processor i has a copy)
Keep it simple(r):
- Writes to non-exclusive data => write miss
- Processor blocks until the access completes
- Assume messages are received and acted upon in the order sent
See the textbook for the directory state machine
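A possible C sketch of the directory state just described: one entry per memory block with a state and a sharer bit vector, plus simplified read-miss and write-miss handlers. The names, the 64-processor limit, and the elided message sends are assumptions for illustration.

```c
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;  /* bit i set => processor i has a copy (<= 64 procs) */
    int         owner;    /* valid only in DIR_EXCLUSIVE state */
} dir_entry_t;

/* Read miss from processor p arrives at the home directory. */
void dir_read_miss(dir_entry_t *e, int p) {
    if (e->state == DIR_EXCLUSIVE) {
        /* fetch the block from the owner, write it back to memory (not shown) */
        e->sharers = (1ULL << e->owner);
    }
    e->sharers |= (1ULL << p);   /* add p as a read sharer */
    e->state = DIR_SHARED;
    /* send a data value reply to p (not shown) */
}

/* Write miss from processor p arrives at the home directory. */
void dir_write_miss(dir_entry_t *e, int p) {
    if (e->state == DIR_SHARED) {
        /* send invalidates to every sharer except p (not shown) */
    } else if (e->state == DIR_EXCLUSIVE) {
        /* send fetch/invalidate to the current owner (not shown) */
    }
    e->state   = DIR_EXCLUSIVE;
    e->owner   = p;
    e->sharers = (1ULL << p);
    /* send a data value reply to p (not shown) */
}
```

A real controller would also handle write-backs, replacement hints, and races; the slide's "keep it simple(r)" assumptions are what let the sketch stay this short.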

6 Directory Protocol
No bus and no desire to broadcast:
- The interconnect is no longer a single arbitration point
- All messages have explicit responses
Terms (typically 3 processors involved):
- Local node: where a request originates
- Home node: where the memory location of an address resides
- Remote node: has a copy of the cache block, whether exclusive or shared
Example messages on the next slide: P = processor number, A = address
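The home node is determined by a fixed function of the physical address; a common choice, assumed here for illustration (the slide does not specify it), is block-granularity interleaving across nodes.

```c
/* Hypothetical home-node mapping: memory interleaved across nodes by block. */
#define BLOCK_SIZE 64
#define NUM_NODES  16

static inline int home_node(unsigned long paddr) {
    return (int)((paddr / BLOCK_SIZE) % NUM_NODES);
}
```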

7 Directory Protocol Messages
- Read miss (Local cache -> Home directory; content: P, A)
  Processor P reads data at address A; make P a read sharer and arrange to send the data back
- Write miss (Local cache -> Home directory; content: P, A)
  Processor P writes data at address A; make P the exclusive owner and arrange to send the data back
- Invalidate (Home directory -> Remote caches; content: A)
  Invalidate a shared copy at address A
- Fetch (Home directory -> Remote cache; content: A)
  Fetch the block at address A and send it to its home directory
- Fetch/Invalidate (Home directory -> Remote cache; content: A)
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache
- Data value reply (Home directory -> Local cache; content: Data)
  Return a data value from the home memory (read miss response)
- Data write-back (Remote cache -> Home directory; content: A, Data)
  Write back a data value for address A (invalidate response)
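One compact way to read the list above is as a message type plus a small payload; the C sketch below (the names and the 64-byte block payload are assumptions) simply encodes those rows.

```c
typedef enum {
    MSG_READ_MISS,        /* local cache  -> home directory : P, A    */
    MSG_WRITE_MISS,       /* local cache  -> home directory : P, A    */
    MSG_INVALIDATE,       /* home dir     -> remote caches  : A       */
    MSG_FETCH,            /* home dir     -> remote cache   : A       */
    MSG_FETCH_INVALIDATE, /* home dir     -> remote cache   : A       */
    MSG_DATA_REPLY,       /* home dir     -> local cache    : data    */
    MSG_DATA_WRITEBACK    /* remote cache -> home directory : A, data */
} msg_type_t;

typedef struct {
    msg_type_t    type;
    int           proc;       /* P: requesting processor, when needed */
    unsigned long addr;       /* A: block address, when needed        */
    unsigned char data[64];   /* block payload, when needed           */
} coherence_msg_t;
```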

8 Parallel App: Commercial Workload
- Online transaction processing (OLTP) workload (like TPC-B or -C)
- Decision support system (DSS) (like TPC-D)
- Web index search (AltaVista)

9 Alpha 4100 SMP
- 4 CPUs, 300 MHz Alpha
- L1$: 8 KB, direct mapped, write through
- L2$: 96 KB, 3-way set associative
- L3$: 2 MB (off chip), direct mapped
- Memory latency: 80 clock cycles
- Cache-to-cache latency: 125 clock cycles

10 OLTP Performance as L3$ Size Varies

11 L3 Miss Breakdown

12 Memory CPI as CPU Count Increases (L3: 2 MB, 2-way)

13 OLTP Performance as L3$ Size Varies

14 SGI Origin 2000
- A pure NUMA machine: 2 CPUs per node, scales up to 2048 processors
- Designed for scientific computation vs. commercial processing
- Scalable bandwidth is crucial to the Origin

15 Parallel App: Scientific/Technical
FFT kernel: 1D complex-number FFT
- 2 matrix transpose phases => all-to-all communication
- Sequential time for n data points: O(n log n)
- Example is a 1-million-point data set
LU kernel: dense matrix factorization
- Blocking helps the cache miss rate; 16x16 blocks (see the sketch below)
- Sequential time for an n x n matrix: O(n^3)
- Example is a 512 x 512 matrix
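As a generic illustration of why 16x16 blocking helps the cache miss rate (this is not the actual LU benchmark code), a blocked loop nest that works on one tile at a time so each tile stays cache-resident while it is being updated:

```c
#define N 512
#define B 16   /* 16x16 blocks, as on the slide */

/* Touch the matrix block by block; the real LU update is elided and
 * replaced by a placeholder operation on each element. */
void blocked_sweep(double a[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    a[i][j] *= 2.0;   /* placeholder for the real LU update */
}
```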

16 FFT Kernel

17 LU kernel

18 Parallel App: Scientific/Technical
Barnes app: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
- n-body algorithms rely on forces dropping off with distance; if a body is far enough away, it can be ignored (e.g., gravity falls off as 1/d^2)
- Sequential time for n data points: O(n log n)
- Example is 16,384 bodies
Ocean app: Gauss-Seidel multigrid technique to solve a set of elliptic partial differential equations
- Red-black Gauss-Seidel colors the points in the grid so that points are consistently updated from the previous values of adjacent neighbors (see the sketch below)
- Multigrid solves finite-difference equations by iteration using a hierarchical grid
- Communication occurs when a boundary is accessed by an adjacent subgrid
- Sequential time for an n x n grid: O(n^2)
- Input: 130 x 130 grid points, 5 iterations
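A minimal, generic red-black Gauss-Seidel sweep in C, to show the coloring idea mentioned for Ocean (this is only a sketch, not the benchmark source; the grid size matches the 130 x 130 input):

```c
#define NX 130
#define NY 130

/* One red-black Gauss-Seidel sweep: update all "red" points (i+j even),
 * then all "black" points (i+j odd), each from its four neighbors. */
void rb_gauss_seidel_sweep(double g[NX][NY]) {
    for (int color = 0; color < 2; color++)
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                if ((i + j) % 2 == color)
                    g[i][j] = 0.25 * (g[i-1][j] + g[i+1][j] +
                                      g[i][j-1] + g[i][j+1]);
}
```

Because all red points depend only on black neighbors (and vice versa), each color can be updated by the processors in parallel without races, with communication only at subgrid boundaries.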

19 Barnes App
[Chart: average cycles per memory reference vs. processor count for Barnes, broken down into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache]

20 Ocean App

21 Example: Sun Wildfire Prototype
1. Connects 2-4 SMPs via optional NUMA technology
   - Uses "off-the-shelf" SMPs as the building block
2. For example, an E6000 holds up to 15 processor or I/O boards (2 CPUs/board)
   - Gigaplane bus interconnect, 3.2 GB/sec
3. A Wildfire Interface board (WFI) replaces a CPU board => up to 112 processors (4 x 28)
   - The WFI board supports one coherent address space across the 4 SMPs
   - Each WFI has 3 ports connecting to up to 3 additional nodes, each with a dual-directional 800 MB/sec connection
   - Has a directory cache in the WFI interface: local or clean is OK, otherwise the request is sent to the home node
   - Multiple bus transactions

22 Example: Sun Wildfire Prototype
1. To reduce contention for a page, Wildfire has Coherent Memory Replication (CMR)
2. Page-level mechanisms for migrating and replicating pages in memory; coherence is still maintained at the cache-block level
3. Page counters record misses to remote pages and are used to migrate/replicate pages with high counts (a sketch follows)
4. Migrate when a page is primarily used by one node
5. Replicate when multiple nodes share a page
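A sketch of how such page counters might drive the migrate/replicate decision; the node count, threshold, and function names are invented for illustration and are not from the Wildfire design.

```c
#define MAX_NODES 4
#define HOT_PAGE_THRESHOLD 64   /* assumed threshold, not from the slide */

typedef struct {
    unsigned remote_misses[MAX_NODES];  /* per-node miss counter for this page */
} page_counters_t;

typedef enum { DO_NOTHING, MIGRATE_PAGE, REPLICATE_PAGE } page_action_t;

/* If one node dominates the remote misses, migrate the page to it;
 * if several nodes miss on it heavily, replicate it instead. */
page_action_t classify_page(const page_counters_t *pc) {
    unsigned total = 0, hot_nodes = 0;
    for (int n = 0; n < MAX_NODES; n++) {
        total += pc->remote_misses[n];
        if (pc->remote_misses[n] > HOT_PAGE_THRESHOLD) hot_nodes++;
    }
    if (total <= HOT_PAGE_THRESHOLD) return DO_NOTHING;
    if (hot_nodes > 1)               return REPLICATE_PAGE;
    return MIGRATE_PAGE;
}
```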

23 Synchronization
Why synchronize? We need to know when it is safe for different processes to use shared data
For large-scale MPs, synchronization can be a bottleneck; techniques exist to reduce the contention and latency of synchronization
Study the textbook for details; a spin-lock example follows
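As one example of the primitives the textbook covers, a test-and-test-and-set spin lock written with C11 atomics (a generic sketch, not tied to any particular machine or to the textbook's exact code):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Spin on a plain read first to avoid hammering the bus or
         * interconnect with atomic exchanges (test-and-test-and-set). */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;   /* we observed false and set it to true: lock acquired */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The inner plain-read loop matters on a coherent machine: spinning with exchanges would generate constant invalidation traffic on the lock's cache block, which is exactly the kind of contention the slide warns about.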

24 Fallacy: Amdahl's Law doesn't apply to parallel computers
- Since some part is sequential, speedup can't reach 100X?
- A 1987 claim to have broken it, with 1000X speedup
- Instead of using a fixed data set, the data set was scaled with the number of processors!
- Linear speedup with 1000 processors (see the worked numbers below)
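A small worked example of the arithmetic behind this fallacy; the 0.1% serial fraction is an assumed number chosen only to show the contrast between fixed-size (Amdahl) and scaled (Gustafson-style) speedup.

```c
#include <stdio.h>

int main(void) {
    double serial = 0.001;   /* assumed: 0.1% of the fixed-size run is serial */
    double p      = 1000.0;  /* processors */

    /* Amdahl (fixed problem size): speedup = 1 / (serial + (1 - serial) / p) */
    double amdahl = 1.0 / (serial + (1.0 - serial) / p);

    /* Scaled problem (grows with p): speedup = serial + (1 - serial) * p */
    double scaled = serial + (1.0 - serial) * p;

    printf("fixed-size speedup  : %.0f\n", amdahl);   /* about 500 */
    printf("scaled-data speedup : %.0f\n", scaled);   /* about 999 */
    return 0;
}
```

Amdahl's Law still holds for the fixed-size problem; the 1987 result achieved near-linear speedup only because the data set grew with the machine.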

25 Multiprocessor Future
What multiprocessors have proven useful for: multiprogrammed workloads; commercial workloads, e.g. OLTP and DSS; and scientific applications in some domains
Supercomputing 2004: high-performance computing is growing?!
- Cluster systems are unexpectedly powerful and inexpensive
- Optical networking is being deployed
- Grid software is under intensive research
- Claims: students should learn parallel programming starting in high school, and undergraduates should be required to learn it!
Multiprocessor advances
- CMP, or chip-level multiprocessing, e.g. IBM Power5 (with SMT)
- MPs no longer dominate the TOP500, but remain the building blocks for clusters