1 Lecture 18: Large Caches, Multiprocessors
Today: NUCA caches, multiprocessors (Sections 4.1-4.2)
Reminder: assignment 5 due Thursday (don't procrastinate!)

2 Distributed Shared Cache
[Figure: eight tiles (Core 0-7), each with its own L1 D$ and L1 I$ and one bank of the shared L2 $, plus a memory controller for off-chip access]
- A single tile is composed of a core, its L1 caches, and a bank (slice) of the shared L2 cache
- The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations

3
- The L2 (or L3) can be a large shared cache, but is physically partitioned into banks and distributed on chip
- Each core (tile) has one L2 cache bank adjacent to it
- One bank stores a subset of "sets" and all ways for that set
- OS-based first-touch page coloring can force a thread's pages to have physical page numbers that map to the thread's local L2 bank
[Figure: physical address breakdown - the low-order bits of the physical page number form the page "color", and those color bits are part of the L2 cache index]
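As a rough illustration of how page coloring steers a page to a bank, here is a minimal C sketch; the parameters (4 KB pages, 64 banks) and helper names are assumptions for the example, not details from the slides.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                    /* assumed 4 KB pages */
#define NUM_BANKS  64                    /* assumed one L2 bank per tile */

/* The page "color" is the low-order bits of the physical page number.
   Since these bits also feed the L2 set index, every line of a page
   with a given color maps to the same L2 bank. */
static inline unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr >> PAGE_SHIFT) & (NUM_BANKS - 1));
}

/* First-touch page coloring: when a thread first touches a virtual page,
   the OS can pick a free physical page whose color equals the id of the
   thread's local bank, so the thread's data is cached next to its core. */
static inline unsigned home_bank(uint64_t paddr)
{
    return page_color(paddr);
}
```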

4 UCA and NUCA
- The small-sized caches so far have all been uniform cache access: the latency for any access is a constant, no matter where the data is found
- For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay: hence, non-uniform cache architecture (NUCA)
- The distributed shared cache is an example of a NUCA cache: variable latency to each bank

5 NUCA Design Space
- Distribute sets (Static-NUCA): each block has a unique location; easy to find data; page coloring for locality; page migration if the initial mapping is sub-optimal
- Distribute ways (Dynamic-NUCA): more flexibility in block placement; complicated search mechanisms; blocks migrate to be closer to their accessor
- Private data are easy to handle; shared data must be placed at the center-of-gravity of accesses

6 Prefetching
- Hardware prefetching can be employed for any of the cache levels
- It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access
- Aggressive prefetching increases "coverage", but leads to a reduction in "accuracy" and hence wasted memory bandwidth
- Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
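One common way to pin down these two terms (standard definitions, not taken from these slides):

$$\text{coverage} = \frac{\text{cache misses eliminated by prefetching}}{\text{cache misses without prefetching}}, \qquad
\text{accuracy} = \frac{\text{prefetches that are actually used}}{\text{prefetches issued}}$$

Aggressive prefetching raises coverage, but it also issues many prefetches that are never used, which lowers accuracy and wastes memory bandwidth.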

7 Stream Buffers
- Simplest form of prefetch: on every miss, bring in multiple cache lines
- When you read the top of the queue, bring in the next line
[Figure: on an L1 miss, sequential lines are fetched into a small FIFO stream buffer that is probed alongside the L1]
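A minimal C sketch of the stream-buffer idea; the depth, line size, and function names are assumptions for illustration, not the slides' design.

```c
#include <stdint.h>
#include <stdbool.h>

#define SB_DEPTH   4            /* assumed stream buffer depth */
#define LINE_BYTES 64           /* assumed cache line size */

typedef struct {
    uint64_t line[SB_DEPTH];    /* FIFO of prefetched line addresses */
    int      head, count;
    uint64_t next_line;         /* next sequential line to prefetch */
} stream_buffer;

/* On an L1 miss, (re)start the buffer with the following sequential lines. */
void sb_allocate(stream_buffer *sb, uint64_t miss_addr)
{
    sb->head  = 0;
    sb->count = 0;
    sb->next_line = miss_addr / LINE_BYTES + 1;
    while (sb->count < SB_DEPTH)            /* model: issue these prefetches */
        sb->line[sb->count++] = sb->next_line++;
}

/* Probe the head of the buffer in parallel with the L1 lookup.  A hit pops
   the head and prefetches one more sequential line to keep the buffer full. */
bool sb_probe(stream_buffer *sb, uint64_t addr)
{
    uint64_t l = addr / LINE_BYTES;
    if (sb->count > 0 && sb->line[sb->head] == l) {
        sb->line[sb->head] = sb->next_line++;   /* old head slot becomes tail */
        sb->head = (sb->head + 1) % SB_DEPTH;
        return true;                            /* line supplied by the buffer */
    }
    return false;
}
```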

8 Stride-Based Prefetching
- For each load, keep track of the last address accessed by the load and a possibly consistent stride
- An FSM detects a consistent stride and issues prefetches
[Figure: a table indexed by load PC with per-entry fields tag, prev_addr, stride, state; the per-entry FSM has states init, transient, steady, and no-prediction, moving between them on correct/incorrect stride matches, with incorrect matches updating the stride]
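A rough C sketch of one table entry's update rule, using the four states named in the figure; the exact transitions and the prefetch condition are illustrative assumptions.

```c
#include <stdint.h>

typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } rpt_state;

typedef struct {
    uint64_t  tag;         /* tag of the load PC that owns this entry */
    uint64_t  prev_addr;   /* last address accessed by this load */
    int64_t   stride;      /* last observed stride */
    rpt_state state;
} rpt_entry;

/* Update the entry for a new access by the same load PC.
   Returns 1 if a prefetch of (addr + stride) should be issued. */
int rpt_update(rpt_entry *e, uint64_t addr)
{
    int correct = ((int64_t)(addr - e->prev_addr) == e->stride);

    switch (e->state) {
    case INIT:      e->state = correct ? STEADY    : TRANSIENT; break;
    case TRANSIENT: e->state = correct ? STEADY    : NO_PRED;   break;
    case STEADY:    e->state = correct ? STEADY    : INIT;      break;
    case NO_PRED:   e->state = correct ? TRANSIENT : NO_PRED;   break;
    }
    if (!correct)
        e->stride = (int64_t)(addr - e->prev_addr);   /* update stride */
    e->prev_addr = addr;

    return (e->state == STEADY && e->stride != 0);    /* prefetch when steady */
}
```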

9 Taxonomy
- SISD: single instruction and single data stream: uniprocessor
- MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
- SIMD: vector architectures: lower flexibility
- MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility

10 Memory Organization - I
- Centralized shared-memory multiprocessor or symmetric shared-memory multiprocessor (SMP)
- Multiple processors connected to a single centralized memory – since all processors see the same memory organization, this is uniform memory access (UMA)
- Shared-memory because all processors can access the entire memory address space
- Can centralized memory emerge as a bandwidth bottleneck? Not if you have large caches and employ fewer than a dozen processors

11 SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, connected through a shared interconnect to a single main memory and the I/O system]

12 Memory Organization - II
- For higher scalability, memory is distributed among processors: distributed memory multiprocessors
- If one processor can directly address the memory local to another processor, the address space is shared: distributed shared-memory (DSM) multiprocessor
- If memories are strictly local, we need messages to communicate data: cluster of computers or multicomputers
- Non-uniform memory architecture (NUMA), since local memory has lower latency than remote memory

13 Distributed Memory Multiprocessors
[Figure: four nodes, each containing a processor with caches plus local memory and I/O, connected by an interconnection network]

14 Shared-Memory vs. Message-Passing
Shared-memory:
- Well-understood programming model
- Communication is implicit and hardware handles protection
- Hardware-controlled caching
Message-passing:
- No cache coherence, hence simpler hardware
- Explicit communication makes it easier for the programmer to restructure code
- Sender can initiate data transfer

15 Ocean Kernel
procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure
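For reference, a small sequential C version of the same sweep, assuming "neighbors" means the four adjacent grid points (a standard five-point stencil reading of the slide); the grid size and tolerance are placeholders.

```c
#include <math.h>

#define N   256        /* assumed interior grid dimension */
#define TOL 0.001f     /* assumed convergence tolerance */

/* A is an (N+2) x (N+2) grid whose outermost rows/columns hold boundary values. */
void solve(float A[N + 2][N + 2])
{
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                          + A[i][j - 1] + A[i][j + 1]);
                diff += fabsf(A[i][j] - temp);
            }
        }
        if (diff < TOL) done = 1;   /* converged: total change is small enough */
    }
}
```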

16 Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
    BARRIER(bar1, nprocs);
  endwhile
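To make the LOCK/BARRIER primitives concrete, here is a minimal pthreads sketch of just the reduction and convergence-test portion; the mapping to pthreads, the thread count, and the stubbed grid update are illustrative assumptions, not the course's macros.

```c
#include <pthread.h>

#define NPROCS 4
#define TOL    0.001f

float diff;                          /* shared convergence measure */
int   done;                          /* shared termination flag */
pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t bar1;

void *solve(void *arg)
{
    long pid = (long)arg;
    while (!done) {
        float mydiff = 0.0f;
        if (pid == 0) diff = 0.0f;          /* one thread resets the global sum */
        pthread_barrier_wait(&bar1);

        /* ... update this thread's block of rows, accumulating into mydiff ... */

        pthread_mutex_lock(&diff_lock);     /* LOCK(diff_lock)   */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);   /* UNLOCK(diff_lock) */

        pthread_barrier_wait(&bar1);        /* all partial sums are in */
        if (pid == 0 && diff < TOL) done = 1;
        pthread_barrier_wait(&bar1);        /* all threads see the same 'done' */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    pthread_barrier_init(&bar1, NULL, NPROCS);
    for (long i = 0; i < NPROCS; i++)
        pthread_create(&t[i], NULL, solve, (void *)i);
    for (int i = 0; i < NPROCS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Note the three barriers per iteration, exactly as in the pseudocode: one so the reset of diff is not overwritten by a late thread, one so every partial sum is added before the test, and one so no thread races ahead before all have seen the new value of done.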

17 Message Passing Model
main()
  read(n); read(nprocs);
  CREATE(nprocs-1, Solve);
  Solve();
  WAIT_FOR_END(nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done = 0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc(…);
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        …
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile
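For comparison with a real message-passing interface, a minimal MPI sketch of the boundary-row (ghost-row) exchange and the convergence test; the grid size is a placeholder, the stencil update is stubbed, and the pseudocode's explicit diff collection at process 0 is replaced here with an MPI_Allreduce.

```c
#include <mpi.h>
#include <stdlib.h>

#define N   256                 /* assumed global grid dimension */
#define TOL 0.001f

int main(int argc, char **argv)
{
    int pid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nn = N / nprocs;                        /* rows owned by this process */
    /* nn local rows plus two ghost rows (row 0 and row nn+1) */
    float (*myA)[N + 2] = calloc(nn + 2, sizeof *myA);

    int done = 0;
    while (!done) {
        float mydiff = 0.0f, diff;

        /* Exchange boundary rows with the neighbors (the SEND/RECEIVE pairs). */
        if (pid != 0)
            MPI_Sendrecv(myA[1], N + 2, MPI_FLOAT, pid - 1, 0,
                         myA[0], N + 2, MPI_FLOAT, pid - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (pid != nprocs - 1)
            MPI_Sendrecv(myA[nn],     N + 2, MPI_FLOAT, pid + 1, 0,
                         myA[nn + 1], N + 2, MPI_FLOAT, pid + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... update local rows 1..nn, accumulating |change| into mydiff ... */

        /* Sum every process's mydiff and give the result to all of them. */
        MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        if (diff < TOL) done = 1;
    }

    free(myA);
    MPI_Finalize();
    return 0;
}
```

Using MPI_Sendrecv for the row exchange also sidesteps the deadlock risk of pairing two blocking sends, which the pseudocode glosses over.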
