CSCI206 - Computer Organization & Programming


1 CSCI206 - Computer Organization & Programming
Multilevel Caches and Cache Performance (zyBook: 12.4)

2 Conflicting Cache Requirements
We want a low miss rate:
- large size (fewer capacity misses)
- large blocks (fewer compulsory misses)
- high associativity (fewer conflict misses)
We also want the hit time to be fast:
- small size (indexing a large memory takes time)
- lower associativity (less tag searching)

3 Multilevel caches are a compromise
Modern CPUs typically use a small, fast L1 cache:
- a few KB
- not more than 4-way associative
A larger L1 has diminishing returns because the hit time increases. Instead, we reduce the miss penalty with an L2 cache:
- the L1 caches the CPU; the L2 caches the L1
- larger than the L1 (its hit time can be longer)
- higher associativity (8-16 way)
- slower than the L1, but still much faster than main memory

4 [Diagram: the memory hierarchy, CPU → L1 → L2 → Memory; speed decreases and size increases moving away from the CPU]

5 Figure from "Computer Architecture, Fifth Edition: A Quantitative Approach" by John Hennessy and David Patterson (Morgan Kaufmann)

6 Review: AMAT (Average Memory Access Time). Caching helps reduce this!
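The standard formula:

    AMAT = hit time + miss rate × miss penalty

For example, with illustrative numbers (not from the slides): a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles.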

7 Multilevel AMAT L1 miss penalty is the AMAT for L2 access
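Expanding the formula for two levels, the L2's AMAT takes the place of the L1 miss penalty:

    AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)

With illustrative numbers (not from the slides): a 1-cycle L1 hit, a 5% L1 miss rate, a 10-cycle L2 hit, a 20% L2 local miss rate, and a 100-cycle memory access give AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles.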

8 Matrix Multiply: C = A × B
Matrix multiply is one of the most important scientific calculations. A 32 x 32 matrix of double-precision floats is 8 KB.

    /* Naive matrix multiply: C = C + A * B.
       Matrices are n x n, stored column-major: element (i,j) is at index i + j*n. */
    void dgemm(int n, double *A, double *B, double *C)
    {
        /* For each row i of A */
        for (int i = 0; i < n; ++i) {
            /* For each column j of B */
            for (int j = 0; j < n; ++j) {
                /* Compute C(i,j) */
                double cij = C[i + j * n];
                for (int k = 0; k < n; k++)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
        }
    }
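A possible driver for this routine, to make the sketch self-contained (illustrative; the slide does not include one):

    #include <stdlib.h>

    int main(void)
    {
        int n = 32;                            /* 32 x 32 doubles = 8 KB per matrix */
        double *A = calloc(n * n, sizeof *A);
        double *B = calloc(n * n, sizeof *B);
        double *C = calloc(n * n, sizeof *C);  /* calloc zeroes C */
        if (!A || !B || !C) return 1;
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
        dgemm(n, A, B, C);                     /* every C entry becomes 64.0 */
        free(A); free(B); free(C);
        return 0;
    }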

9 Naïve: C = A × B (see the video)

10 Blocking
Any algorithm can be "blocked" by modifying it to operate on smaller blocks (subsets) of its data: break the n x n matrices into k x k submatrices, where a k x k block fits into the L1 cache. This greatly reduces cache misses (see the sketch below).
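A minimal sketch of a blocked multiply, in the style of the naive code above. The block size BLOCK, the helper do_block, and the assumption that n is a multiple of BLOCK are illustrative choices, not from the slides:

    #define BLOCK 32   /* 32 x 32 doubles = 8 KB per submatrix, so the three
                          active blocks of A, B, and C fit in a typical L1 cache */

    /* Accumulate one BLOCK x BLOCK submatrix of C; column-major layout as before */
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C)
    {
        for (int i = si; i < si + BLOCK; ++i)
            for (int j = sj; j < sj + BLOCK; ++j) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCK; ++k)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

    /* Blocked matrix multiply: C = C + A * B, n assumed a multiple of BLOCK */
    void dgemm_blocked(int n, double *A, double *B, double *C)
    {
        for (int sj = 0; sj < n; sj += BLOCK)
            for (int si = 0; si < n; si += BLOCK)
                for (int sk = 0; sk < n; sk += BLOCK)
                    do_block(n, si, sj, sk, A, B, C);
    }

Each do_block call touches only three 8 KB tiles, so they stay resident in the L1 cache while the inner loops reuse them.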

11 Blocking [figure]

12 Matrix Multiply Speed [figure]

13 Instruction Caching
Fetching instructions requires a memory access. Instruction addresses typically fall in the text (or code) segment, while data usually comes from the data / heap / stack segments. We cache instructions too, but keeping them in the same cache as data can cause conflicts.

14 Unified vs Split Caches
Unified cache: a single cache that holds both instructions and data.
Split caches: separate caches for instruction fetches and for data accesses.
Typically both split caches are the same size, e.g., a 16 KB instruction cache and a 16 KB data cache.

15 Unified vs Split Caches
On modern processors:
Split caches: used for the L1 cache
Unified cache: used for L2 / L3 …

16 Multicore
In a multicore system, main memory is shared between the cores:
- each CPU has its own private L1 cache
- L2 / L3 may be shared or individual
Private caches create a coherence problem.

17 Coherence Example
1. CPU A reads data (X); it is cached in A's private L1.
2. CPU B reads data (X); it is cached in B's private L1.
3. CPU A changes data (X); the update lands only in A's private L1.
4. CPU B still has the old value cached.

18 Coherence Protocol
Extra work is needed to keep the private L1 caches coherent:
- Multiple CPUs can cache the same block (for reads).
- When a CPU writes, the written block must be invalidated in all other CPUs' caches.
- When a CPU has a read miss, all other L1 caches must be checked for that block (it may be dirty); if a dirty copy is found, it is written back to memory so the latest value is used.
A toy simulation of this write-invalidate scheme appears below.
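A rough illustration only: a toy C model (not real hardware; the names and structure are invented for this example) of one shared memory location X and two private one-block caches under write-invalidate:

    #include <stdio.h>
    #include <stdbool.h>

    static int memory_x = 0;   /* the shared memory copy of X */

    typedef struct {
        bool valid;   /* is X present in this cache? */
        bool dirty;   /* modified since it was loaded? */
        int  value;
    } Cache;

    static Cache l1[2];        /* private L1 caches for CPU 0 and CPU 1 */

    /* Read X: on a miss, check the other L1 for a dirty copy first */
    static int cpu_read(int cpu)
    {
        if (!l1[cpu].valid) {                    /* read miss */
            int other = 1 - cpu;
            if (l1[other].valid && l1[other].dirty) {
                memory_x = l1[other].value;      /* write back the dirty copy */
                l1[other].dirty = false;
            }
            l1[cpu] = (Cache){ true, false, memory_x };  /* fill from memory */
        }
        return l1[cpu].value;
    }

    /* Write X: invalidate the other CPU's copy, mark ours dirty */
    static void cpu_write(int cpu, int v)
    {
        l1[1 - cpu].valid = false;               /* invalidate the other copy */
        l1[cpu] = (Cache){ true, true, v };
    }

    int main(void)
    {
        cpu_read(0);                             /* CPU A reads X */
        cpu_read(1);                             /* CPU B reads X */
        cpu_write(0, 42);                        /* CPU A writes X; B invalidated */
        printf("CPU B reads %d\n", cpu_read(1)); /* miss, write-back, prints 42 */
        return 0;
    }

This replays the example from slide 17: after CPU A's write invalidates B's copy, B's next read misses, forces A's dirty block back to memory, and sees the new value.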

