CSCI206 - Computer Organization & Programming


1 CSCI206 - Computer Organization & Programming
Multilevel Caches and Cache Performance (zyBook: 12.4)

2 Conflicting Cache Requirements
We want a low miss rate:
- large size (fewer capacity misses)
- large blocks (fewer compulsory misses)
- high associativity (fewer conflict misses)
We also want the hit time to be fast:
- small size (indexing a large memory takes time)
- lower associativity (less tag searching)

3 Multilevel caches are a compromise
Modern CPUs typically use a small, fast L1 cache:
- a few KB
- not more than 4-way associative
A larger L1 has diminishing returns because the hit time increases. Instead, we reduce the miss penalty with an L2 cache:
- the L1 caches the CPU; the L2 caches the L1
- larger than the L1 (its hit time can be longer)
- higher associativity (8-16 way)
- slower than the L1, but still much faster than main memory

4 [Diagram: the memory hierarchy, CPU → L1 → L2 → Memory; speed decreases and size increases moving away from the CPU]

5 Figure from "Computer Architecture, Fifth Edition: A Quantitative Approach" by John Hennessy and David Patterson (Morgan Kaufmann)

6 Review: AMAT (Average Memory Access Time). Caching helps reduce this!
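The standard formula:

    AMAT = hit time + miss rate × miss penalty

For example, with illustrative numbers (not from the slides): a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles.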

7 Multilevel AMAT L1 miss penalty is the AMAT for L2 access
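Expanding the formula for two levels, the L2's AMAT takes the place of the L1 miss penalty:

    AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)

With illustrative numbers (not from the slides): a 1-cycle L1 hit, a 5% L1 miss rate, a 10-cycle L2 hit, a 20% L2 local miss rate, and a 100-cycle memory access give AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles.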

8 Matrix Multiply: C = A × B
Matrix multiply is one of the most important scientific calculations. A 32 x 32 matrix of double-precision floats is 8 KB.

    /* Naive matrix multiply: C = C + A * B.
       Matrices are n x n, stored column-major: element (i,j) is at index i + j*n. */
    void dgemm(int n, double *A, double *B, double *C)
    {
        /* For each row i of A */
        for (int i = 0; i < n; ++i) {
            /* For each column j of B */
            for (int j = 0; j < n; ++j) {
                /* Compute C(i,j) */
                double cij = C[i + j * n];
                for (int k = 0; k < n; k++)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
        }
    }
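A possible driver for this routine, to make the sketch self-contained (illustrative; the slide does not include one):

    #include <stdlib.h>

    int main(void)
    {
        int n = 32;                            /* 32 x 32 doubles = 8 KB per matrix */
        double *A = calloc(n * n, sizeof *A);
        double *B = calloc(n * n, sizeof *B);
        double *C = calloc(n * n, sizeof *C);  /* calloc zeroes C */
        if (!A || !B || !C) return 1;
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
        dgemm(n, A, B, C);                     /* every C entry becomes 64.0 */
        free(A); free(B); free(C);
        return 0;
    }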

9 Naïve: C = A × B (see the video)

10 Blocking
Any algorithm can be "blocked" by modifying it to operate on smaller blocks (subsets) of its data: break the n x n matrices into k x k submatrices, where a k x k block fits into the L1 cache. This greatly reduces cache misses (see the sketch below).
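A minimal sketch of a blocked multiply, in the style of the naive code above. The block size BLOCK, the helper do_block, and the assumption that n is a multiple of BLOCK are illustrative choices, not from the slides:

    #define BLOCK 32   /* 32 x 32 doubles = 8 KB per submatrix, so the three
                          active blocks of A, B, and C fit in a typical L1 cache */

    /* Accumulate one BLOCK x BLOCK submatrix of C; column-major layout as before */
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C)
    {
        for (int i = si; i < si + BLOCK; ++i)
            for (int j = sj; j < sj + BLOCK; ++j) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCK; ++k)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

    /* Blocked matrix multiply: C = C + A * B, n assumed a multiple of BLOCK */
    void dgemm_blocked(int n, double *A, double *B, double *C)
    {
        for (int sj = 0; sj < n; sj += BLOCK)
            for (int si = 0; si < n; si += BLOCK)
                for (int sk = 0; sk < n; sk += BLOCK)
                    do_block(n, si, sj, sk, A, B, C);
    }

Each do_block call touches only three 8 KB tiles, so they stay resident in the L1 cache while the inner loops reuse them.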

11 Blocking [figure]

12 Matrix Multiply Speed [figure]

13 Instruction Caching
Fetching instructions requires a memory access. Instruction addresses typically fall in the text (or code) segment, while data usually comes from the data / heap / stack segments. We cache instructions too, but keeping them in the same cache as data can cause conflicts.

14 Unified vs Split Caches
Unified cache: a single cache that holds both instructions and data.
Split caches: separate caches for instruction fetches and for data accesses.
Typically both split caches are the same size, e.g., a 16 KB instruction cache and a 16 KB data cache.

15 Unified vs Split Caches
On modern processors:
Split caches: used for the L1 cache
Unified cache: used for L2 / L3 …

16 Multicore
In a multicore system, main memory is shared between the cores:
- each CPU has its own private L1 cache
- L2 / L3 may be shared or individual
Private caches create a coherence problem.

17 Coherence Example
1. CPU A reads data (X); it is cached in A's private L1.
2. CPU B reads data (X); it is cached in B's private L1.
3. CPU A changes data (X); the update lands only in A's private L1.
4. CPU B still has the old value cached.

18 Coherence Protocol
Extra work is needed to keep the private L1 caches coherent:
- Multiple CPUs can cache the same block (for reads).
- When a CPU writes, the written block must be invalidated in all other CPUs' caches.
- When a CPU has a read miss, all other L1 caches must be checked for that block (it may be dirty); if a dirty copy is found, it is written back to memory so the latest value is used.
A toy simulation of this write-invalidate scheme appears below.
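A rough illustration only: a toy C model (not real hardware; the names and structure are invented for this example) of one shared memory location X and two private one-block caches under write-invalidate:

    #include <stdio.h>
    #include <stdbool.h>

    static int memory_x = 0;   /* the shared memory copy of X */

    typedef struct {
        bool valid;   /* is X present in this cache? */
        bool dirty;   /* modified since it was loaded? */
        int  value;
    } Cache;

    static Cache l1[2];        /* private L1 caches for CPU 0 and CPU 1 */

    /* Read X: on a miss, check the other L1 for a dirty copy first */
    static int cpu_read(int cpu)
    {
        if (!l1[cpu].valid) {                    /* read miss */
            int other = 1 - cpu;
            if (l1[other].valid && l1[other].dirty) {
                memory_x = l1[other].value;      /* write back the dirty copy */
                l1[other].dirty = false;
            }
            l1[cpu] = (Cache){ true, false, memory_x };  /* fill from memory */
        }
        return l1[cpu].value;
    }

    /* Write X: invalidate the other CPU's copy, mark ours dirty */
    static void cpu_write(int cpu, int v)
    {
        l1[1 - cpu].valid = false;               /* invalidate the other copy */
        l1[cpu] = (Cache){ true, true, v };
    }

    int main(void)
    {
        cpu_read(0);                             /* CPU A reads X */
        cpu_read(1);                             /* CPU B reads X */
        cpu_write(0, 42);                        /* CPU A writes X; B invalidated */
        printf("CPU B reads %d\n", cpu_read(1)); /* miss, write-back, prints 42 */
        return 0;
    }

This replays the example from slide 17: after CPU A's write invalidates B's copy, B's next read misses, forces A's dirty block back to memory, and sees the new value.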

