1 Lecture 16: Large Cache Innovations
Today: Large cache design and other cache innovations
Midterm scores: 91-80: 17 students; 79-75: 14 students; 74-68: 8 students; 63-54: 5 students
Q1 (HM), Q2 (bpred), Q3 (stalls), Q7 (loops) – mostly correct
Q4 (ooo) – 50% correct; many didn't stall renaming
Q5 (multi-core) – less than a handful got 8+ points
Q6 (memdep) – less than a handful got part (b) right and correctly articulated the effect on power/energy
2 Shared Vs. Private Caches in Multi-Core
Advantages of a shared cache:
  - Space is dynamically allocated among cores
  - No wastage of space because of replication
  - Potentially faster cache coherence (and easier to locate data on a miss)
Advantages of a private cache:
  - Small L2, hence faster access time
  - Private bus to L2, hence less contention
3 UCA and NUCA
The small-sized caches so far have all been uniform cache access (UCA): the latency for any access is a constant, no matter where the data is found
For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay: hence, non-uniform cache architecture (NUCA)
4 Large NUCA
[Figure: CPU surrounded by many L2 cache banks]
Issues to be addressed for Non-Uniform Cache Access:
  - Mapping
  - Migration
  - Search
  - Replication
5 Static and Dynamic NUCA
Static NUCA (S-NUCA):
  - The address index bits determine where the block is placed
  - Page coloring can help here as well to improve locality
Dynamic NUCA (D-NUCA):
  - Blocks are allowed to move between banks
  - The block can be anywhere: need some search mechanism
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
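As a rough illustration of the S-NUCA mapping, the sketch below picks the bank purely from address bits, so no search is needed. The 64-byte block size, the 16-bank count, and the position of the bank-select bits are assumptions for the sketch, not values from the slides.

#include <stdint.h>

#define BLOCK_OFFSET_BITS 6    /* 64-byte cache blocks (assumed)  */
#define NUM_BANKS         16   /* 16 L2 banks (assumed)           */

/* S-NUCA: the bank holding a block is a fixed function of its address,
 * here the bits just above the block offset, so a request goes straight
 * to one bank and no search is required. */
static inline unsigned snuca_bank(uint64_t paddr)
{
    return (paddr >> BLOCK_OFFSET_BITS) & (NUM_BANKS - 1);
}

Under D-NUCA the same block could sit in any bank of its bank set, which is why a partial-tag hint or a propagating search is needed.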
6 Example Organization
[Figure: CPU with L2 banks at varying distances; access latency ranges from 13-17 cycles for nearby banks to 65 cycles for the farthest bank]
Data must be placed close to the center-of-gravity of requests
7 Examples: Frequency of Accesses
[Figure: heat maps of L2 bank accesses, where darker shading means more accesses]
  - OLTP (on-line transaction processing)
  - Ocean (scientific code)
8 [Figure: eight cores (Core 0-7), each with private L1 I$ and D$, connected by a scalable non-broadcast interconnect to a shared L2 cache with directory state and an L2 cache controller]
9 [Figure: eight cores (Core 0-7), each with private L1 I$ and D$ and an L2$, connected by a scalable non-broadcast interconnect to a controller that holds replicated tags of all L2 and L1 caches, handles L2 misses, and initiates off-chip access]
10 [Figure: eight tiles, each with a core, L1 I$ and D$, and an L2$ bank, plus a memory controller for off-chip access]
A single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache
The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations
11 [Figure: two stacked dies: the bottom die holds the cores (Core 0-3) and their L1 I$ and D$, the top die holds the L2 cache banks; a memory controller handles off-chip access]
Each core has low-latency access to one L2 bank
12 The Best of S-NUCA and D-NUCA
Employ S-NUCA (no search required) and use page coloring to influence the block's cache index bits and hence the bank that the block gets placed in
Page migration enables block movement just as in D-NUCA
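A minimal sketch of how the OS side of this can work, assuming 4 KB pages and 16 banks whose select bits sit just above the page offset (both assumptions): the physical frame number then determines the bank, so the OS can steer a page toward a particular bank by allocating from a per-color free list, and page migration amounts to re-coloring by copying the page to a frame of a different color.

#include <stdint.h>

#define NUM_BANKS 16   /* bank-select bits just above the 4 KB page offset (assumed) */

/* Color of a physical frame = the L2 bank its blocks map to under S-NUCA,
 * given the assumed bit layout. */
static inline unsigned page_color(uint64_t frame_number)
{
    return frame_number & (NUM_BANKS - 1);
}

/* One free list per color; to place a page near core c, allocate a frame
 * whose color equals the bank adjacent to that core. */
uint64_t alloc_frame_with_color(unsigned target_bank,
                                uint64_t *free_list[NUM_BANKS],
                                unsigned  free_count[NUM_BANKS])
{
    if (free_count[target_bank] > 0)
        return free_list[target_bank][--free_count[target_bank]];
    return UINT64_MAX;   /* no frame of that color; caller falls back to another color */
}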
13 Prefetching
Hardware prefetching can be employed for any of the cache levels
It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access
Aggressive prefetching increases "coverage", but reduces "accuracy" and wastes memory bandwidth
Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)
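For concreteness, coverage and accuracy are usually defined as in the sketch below; the counter names are illustrative, not part of the slides.

/* Prefetch metrics from simple event counters (names are illustrative). */
struct prefetch_stats {
    unsigned long misses_without_prefetch;  /* baseline demand misses      */
    unsigned long useful_prefetches;        /* prefetched lines later used */
    unsigned long total_prefetches;         /* all prefetches issued       */
};

double coverage(const struct prefetch_stats *s)
{   /* fraction of baseline misses eliminated by prefetching */
    return (double)s->useful_prefetches / s->misses_without_prefetch;
}

double accuracy(const struct prefetch_stats *s)
{   /* fraction of issued prefetches that turn out to be useful */
    return (double)s->useful_prefetches / s->total_prefetches;
}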
14 Stream Buffers
Simplest form of prefetch: on every miss, bring in multiple sequential cache lines
When you read the top of the queue, bring in the next line
[Figure: L1 cache backed by a stream buffer holding sequential lines]
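A minimal sketch of a single sequential stream buffer, assuming a depth of four lines and a fetch_line() memory-system hook (both assumptions): a hit at the head supplies the line and tops the buffer up with the next sequential line; a miss restarts the stream just past the missing address.

#include <stdint.h>
#include <stdbool.h>

#define DEPTH 4  /* buffer depth is an assumption, not from the slides */

/* FIFO of prefetched sequential line addresses (tags only, for brevity). */
struct stream_buffer {
    uint64_t line[DEPTH];
    int      head, count;
};

void fetch_line(uint64_t line_addr);  /* memory-system hook (assumed) */

/* Called on an L1 miss for line 'addr'; returns true if the buffer has it. */
bool stream_buffer_lookup(struct stream_buffer *sb, uint64_t addr)
{
    if (sb->count > 0 && sb->line[sb->head] == addr) {
        /* Hit at the head: consume it and top the buffer up with the
         * next sequential line after the newest entry. */
        uint64_t next = sb->line[(sb->head + sb->count - 1) % DEPTH] + 1;
        sb->line[sb->head] = next;        /* reuse the slot just consumed */
        sb->head = (sb->head + 1) % DEPTH;
        fetch_line(next);
        return true;                      /* line supplied by the buffer  */
    }
    /* Miss in the buffer too: restart the stream after this address. */
    sb->head = 0;
    sb->count = DEPTH;
    for (int i = 0; i < DEPTH; i++) {
        sb->line[i] = addr + 1 + i;
        fetch_line(sb->line[i]);
    }
    return false;                         /* demand miss goes to L2/memory */
}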
15 Stride-Based Prefetching
For each load, keep track of the last address accessed by the load and a possibly consistent stride
An FSM detects a consistent stride and issues prefetches
[Figure: per-PC prediction table entry with fields (tag, prev_addr, stride, state); the FSM moves among the states init, transient (trans), steady, and no-prediction (no-pred), with transitions on correct/incorrect stride matches and stride updates on a mismatch]
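The sketch below shows one table entry with the (tag, prev_addr, stride, state) fields from the slide. The exact transition on which the stride is updated varies across descriptions, so treat the FSM details and the prefetch() hook as assumptions.

#include <stdint.h>
#include <stdbool.h>

enum rpt_state { INIT, TRANSIENT, STEADY, NO_PRED };

/* One prediction-table entry, indexed by load PC. */
struct rpt_entry {
    uint64_t tag;        /* load PC tag       */
    uint64_t prev_addr;  /* last address seen */
    int64_t  stride;     /* last stride       */
    enum rpt_state state;
};

void prefetch(uint64_t addr);  /* memory-system hook (assumed) */

/* Called every time the load at 'pc' accesses 'addr'. */
void rpt_update(struct rpt_entry *e, uint64_t pc, uint64_t addr)
{
    if (e->tag != pc) {                     /* new load: (re)initialize */
        e->tag = pc; e->prev_addr = addr; e->stride = 0; e->state = INIT;
        return;
    }
    int64_t stride  = (int64_t)(addr - e->prev_addr);
    bool    correct = (stride == e->stride);

    switch (e->state) {                     /* transitions on correct/incorrect */
    case INIT:      e->state = correct ? STEADY    : TRANSIENT; break;
    case TRANSIENT: e->state = correct ? STEADY    : NO_PRED;   break;
    case STEADY:    e->state = correct ? STEADY    : INIT;      break;
    case NO_PRED:   e->state = correct ? TRANSIENT : NO_PRED;   break;
    }
    if (!correct)
        e->stride = stride;                 /* update stride on a mismatch */
    e->prev_addr = addr;

    if (e->state == STEADY)
        prefetch(addr + e->stride);         /* predict the next access */
}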
16 Compiler Optimizations
Loop interchange: loops can be re-ordered to exploit spatial locality

for (j = 0; j < 100; j++)
  for (i = 0; i < 5000; i++)
    x[i][j] = 2 * x[i][j];

is converted to…

for (i = 0; i < 5000; i++)
  for (j = 0; j < 100; j++)
    x[i][j] = 2 * x[i][j];
17 Exercise

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

for (jj = 0; jj < N; jj += B)
  for (kk = 0; kk < N; kk += B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B, N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B, N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: access footprints of y, z, and x for the original (top) and blocked (bottom) code]
Re-organize data accesses so that a piece of data is used a number of times before moving on… in other words, artificially create temporal locality
18 Exercise
[Figure: the blocked code from the previous slide, shown again with the sub-blocks of y, z, and x touched as the jj and kk loops advance]
19 Exercise
The original code could have 2N³ + N² memory accesses, while the blocked version has 2N³/B + N²
[Figure: the original and blocked code from slide 17, with the access footprints of y, z, and x]
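As a rough check with illustrative numbers (N = 1000 and B = 50 are not from the slides): the original version touches about 2·1000³ + 1000² ≈ 2.0 billion words, while the blocked version touches about 2·1000³/50 + 1000² ≈ 41 million words, roughly a 49x reduction in memory traffic, provided the blocking factor B is chosen so that a block's working set fits in the cache.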