
1 Memory – Caching: Performance
CS/COE 1541 (term 2174) Jarrett Billingsley

2 Class Announcements Quiz 2 next lecture!
At the end of a short lecture, like last time.
Cache visualizer! If you haven't had a look yet, it's on the site in the links column.
Project notes and traces! Some notes for those having trouble getting started with C are linked below the project in the schedule. There are also a couple of memory traces for you to test, in the links column next to the project.
Homework 3 on Wednesday! Aiming for 6 homeworks throughout the term; they will be due on weeks you don't have a quiz/exam/project due.
2/20/2017 CS/COE 1541 term 2174

3 Measuring Performance
2/20/2017 CS/COE 1541 term 2174

4 Get back to work Without caches, measuring memory performance is easy!
total memory time = number of accesses × access time
With caches... the access time is variable. What if you hit? What if you miss? What if you miss, but have to write a block back? What about the write buffer? What about multiple levels of caches? What about... AAAAAAAAAAAAAHHHHHH
Realistically, we have to use simulation. But we can come up with some useful heuristics/estimates.
2/20/2017 CS/COE 1541 term 2174

5 The perfect cache... We often use the concept of an oracle: a prediction mechanism that works perfectly. An oracle cache never misses. This isn't real, of course... so why is this concept useful? It allows us to set bounds on performance. We simulated a benchmark using an oracle cache. The CPI was 3.3. Then we simulated the same benchmark with a real cache design. The CPI was 5.8. How much slower is the real cache than the oracle cache? You could do (5.8 / 3.3) = 1.76, or 76% slower. You could also do (3.3 / 5.8) = 0.57 = 57% as fast as the oracle. This gives us a better idea of the performance impacts of design changes by eliminating a variable from the performance equation. 2/20/2017 CS/COE 1541 term 2174

6 A mat? AMAT! It's useful to have performance equations that capture the most important aspects without getting bogged down by details. AMAT (Average Memory Access Time) is defined as follows: AMAT = hit time + (miss rate × miss penalty) The units of time can be whatever, as long as they're the same. We don't include the hit rate, because even when we miss, we have to "hit" after we bring the data into the cache. This gives us three "levers" we can tweak: Hit time: how long it takes to get the data when we hit Miss rate: how often we miss Miss penalty: how long it takes to get the data when we miss Let's try to make each one smaller! 2/20/2017 CS/COE 1541 term 2174

7 Reducing hit time 2/20/2017 CS/COE 1541 term 2174

8 Bigger is... slower look at this big beautiful boy (Core i7 Nehalem)
(annotating the die photo) A THIRD OF THIS THING IS CACHE. How long do you think it takes for data to make it from here... ...to here? It also has to pass through all this stuff. And update the L2 and L1 caches. Rocks, people.
2/20/2017 CS/COE 1541 term 2174

9 Physics! It's simple physics: the bigger your cache is and the longer the wires are, the longer it will take to access it. This means for the fastest hit time, you need to: Keep the cache small Keep it very close to the parts that access it The L1 caches (there are two per core) in the previous diagram are very small and integrated into the core itself. 2/20/2017 CS/COE 1541 term 2174

10 Also cache write buffers
Remember how with write-back, we had to check if we hit before we could continue? We used a cache write buffer to reduce the hit time. Do that. Yeah. 2/20/2017 CS/COE 1541 term 2174

11 Reducing miss rate 2/20/2017 CS/COE 1541 term 2174

12 The 3 C's Make the cache bigger (more data) to reduce capacity misses!
...but that increases hit time.
Increase associativity to reduce conflict misses! And use LRU while you're at it!
...but that requires more hardware, more space, more hit time.
What about compulsory misses? How would we get around the inconvenient issue of not having the data in the cache because we never used it before? How did we get around the inconvenient issue of not knowing which way a branch would go?
2/20/2017 CS/COE 1541 term 2174

13 Prediction
for(i = 0 .. 100000) A[i]++;
lw, addi, sub, mul, sw at instruction addresses 00, 04, 08, 0C, ...
What do you notice about both these snippets of code? They both access memory sequentially: the first one data, the second one instructions. These kinds of access patterns are very common:
Sequential (00, 04, 08, 0C, 10, 14, 18, 1C, in order)
Reverse sequential (the same addresses, highest to lowest)
Strided sequential (every nth address; think "accessing one field from each item in an array of structs")
2/20/2017 CS/COE 1541 term 2174
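To make the strided pattern concrete, here's a small C sketch of my own (the struct and its 64-byte size are illustrative assumptions): reading one field from each element of an array of structs produces accesses separated by a fixed stride of sizeof(struct item).

#include <stddef.h>

/* Hypothetical struct: iterating over 'count' of these and reading only
   'key' touches addresses base, base + sizeof(struct item),
   base + 2*sizeof(struct item), ... -- a strided sequential pattern. */
struct item {
    int  key;
    char payload[60];   /* filler so the struct (and the stride) is 64 bytes */
};

long sum_keys(const struct item *items, size_t count)
{
    long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += items[i].key;   /* one strided load per iteration */
    return sum;
}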

14 Implementing address prediction
What kinds of things would you need? A table of the last n memory accesses would be a good start. Then some subtractors to calculate the stride... Then some comparators to see if the strides are the same... Then some logic to figure it all out...
For example, a table of the last 8 accesses (n-7 through n): 40C0, 40C4, 40C8, 40CC, 40D0, 40D4, 40D8, 40DC. Every difference is 4, so the predicted next access would be 40E0.
2/20/2017 CS/COE 1541 term 2174
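As a software model of that hardware (my sketch, not the actual predictor design from the lecture): remember the last address and last stride, and once the same stride shows up twice in a row, predict last address + stride.

#include <stdint.h>
#include <stdbool.h>

/* Minimal stride predictor model; start with the struct zero-initialized. */
struct stride_predictor {
    uint32_t last_addr;
    int32_t  stride;
    bool     confident;   /* have we seen the same stride twice in a row? */
};

/* Feed in each new access; returns true and sets *predicted
   when the predictor is confident about the next address. */
bool predict_next(struct stride_predictor *p, uint32_t addr, uint32_t *predicted)
{
    int32_t new_stride = (int32_t)(addr - p->last_addr);
    p->confident = (new_stride == p->stride);
    p->stride    = new_stride;
    p->last_addr = addr;

    if (p->confident) {
        *predicted = addr + new_stride;   /* e.g. ..., 40D8, 40DC -> predict 40E0 */
        return true;
    }
    return false;
}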

15 That's a lotta hardware But it's a prediction! It's non-essential! We don't have to stall for it. So it can be done in parallel with the rest of the cache hardware. Once the predictor has a pretty good idea of how the next few memory accesses will be, it can start prefetching memory from those predicted addresses into the cache. Memory has great bandwidth, and often has "burst mode" that allows you to transfer sequential blocks quickly. Address prediction takes advantage of those features. And of course, there's a static equivalent to this too – many architectures offer prefetch instructions to let the cache know that it will be accessing certain memory soon. You might put one before a loop over an array, for instance. 2/20/2017 CS/COE 1541 term 2174

16 The downsides
Just like branch prediction, if you predict wrong, you have to pay the price. What's the price here? You brought in the wrong data and overwrote cache blocks. This means a misprediction increases miss rate. We're trading compulsory misses for capacity/conflict misses. Mispredictions in the cache are costly, so what kind of accuracy do we need? Really good! Unfortunately... data memory accesses usually look like this.
[figure: a sequence of data access addresses, where the colors are different cache blocks]
2/20/2017 CS/COE 1541 term 2174

17 Reducing capacity misses with unified caches
If we split our cache into two parts, one for instructions (I-Cache) and one for data (D-Cache), we run into an issue.
If our working set is mostly data, say, in a small loop that's accessing a large array, then we run out of data space while much of the I-Cache goes unused.
If our working set is mostly code, say, in a large function that's only using stack variables, then we run out of code space while much of the D-Cache goes unused.
2/20/2017 CS/COE 1541 term 2174

18 Sharing is caring By unifying the cache – using a single cache for both code and data – we can better utilize the space available. This results in a measurable decrease in miss rate! But virtually all CPUs today split their first level cache. Why might that be? If the cache does prediction, it might be better to only do it on the I-cache. Different access patterns might also dictate different replacement schemes, write schemes, etc. And the big one is structural hazards! We need two memories to let instruction fetches happen at the same time as data accesses. That being said, most L2/L3/L4 caches are unified. 2/20/2017 CS/COE 1541 term 2174

19 Reducing misses with smarter programming
You remember linked lists, right? O(1) insertion and removal? Yay! Except they're not so cheap in the face of caching. Unless you get lucky with the allocator, the nodes will be all over the place in memory. Lots of blocks. So the constant factor gets big. One of the most important aspects of algorithm and data structure design today is cache locality: making sure your data fits in, and your code makes good use of, the cache. Arrays, with O(n) insertion and removal, are usually faster than linked lists today, despite being linear time. Smart array allocation schemes can amortize insertion and removal to much less than O(n), too! Use arrays. Your CPU will thank you.
2/20/2017 CS/COE 1541 term 2174
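To see why, compare these two summation loops (my sketch, not from the slides): the array version walks memory sequentially, while the list version chases pointers to nodes that may sit in many different cache blocks.

#include <stddef.h>

struct node {
    int          value;
    struct node *next;   /* the next node may live anywhere in memory */
};

/* Sequential accesses: adjacent elements share cache blocks,
   and the hardware prefetcher can predict the pattern. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Pointer chasing: each node can be in a different cache block,
   and the next address isn't known until the current node is loaded. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}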

20 Reducing miss penalty 2/20/2017 CS/COE 1541 term 2174

21 Multi-level caching Cache the caches! And then cache that! AND CACHE THAT!!! As the processor-memory gap widens, we need more levels. Why? We don't want a huge jump between hit time and miss penalty. We're doing impedance matching: think of it like shifting gears in a car, or amplification in a circuit, or folding meringue into soufflé batter. Gotta do it gradually. L1 I-Cache L1 D-Cache L2 Cache L3 Cache 2/20/2017 CS/COE 1541 term 2174

22 A real example On a Core i7-4400 Haswell:
Split L1 caches, unified L2 and L3 caches. Each core has its own L1 and L2 caches; the L3 is shared by all.
L1 D-cache and I-cache are each 32KB, 64B/block, 8-way associative.
L2 caches are 256KB, 64B/block, 8-way associative.
The L3 cache is 8MB, 64B/block, ?-way associative.
Access latencies (each level includes the latency of the previous levels):
L1: 4-5 cycles (depending on access type)
L2: 12 cycles (7-8 cycles on top of L1)
L3: 36 cycles (24 cycles on top of L2)
Memory: 232 cycles (196 cycles on top of L3)
Notice the gradual increase in latency... but that big jump from L3 to memory is why L4 caches are now on the horizon.
2/20/2017 CS/COE 1541 term 2174
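Plugging those latencies into a nested version of the AMAT formula gives a rough feel for the hierarchy; in the sketch below, only the latencies come from the slide, while the per-level miss rates are invented purely for illustration:

#include <stdio.h>

/* Multi-level AMAT using the "on top of" latencies from the slide:
   L1 hit = 4 cycles, +8 to reach L2, +24 to reach L3, +196 to reach memory.
   The miss rates below are made-up numbers purely for illustration. */
int main(void)
{
    double l1_hit = 4.0, l2_extra = 8.0, l3_extra = 24.0, mem_extra = 196.0;
    double m1 = 0.05, m2 = 0.25, m3 = 0.50;   /* hypothetical local miss rates */

    double amat = l1_hit + m1 * (l2_extra + m2 * (l3_extra + m3 * mem_extra));
    printf("AMAT = %.2f cycles\n", amat);     /* about 5.93 cycles with these numbers */
    return 0;
}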

23 Reducing miss stall cycles
Let's say your cache block looks like this: V Tag D0 D1 D2 D3 D4 D5 D6 D7 Let's say we want to access word D2, but miss. What do we do? Bring in a new cache block. But this is a big block. Maybe the memory bandwidth restricts us to filling two words at a time. So it'll take 4 memory transfers to refill this block. Then we can finally access D2. Do we really need to wait for all 4 transfers to complete? 2/20/2017 CS/COE 1541 term 2174

24 Early restart With early restart, we continue execution as soon as we have the data, even if the whole block hasn't loaded yet. V Tag D0 D1 D2 D3 D4 D5 D6 D7 Want to access D2... Bring in those four blocks... But now we can access D2 right away! The rest of the transfers will happen in parallel with the CPU. We've reduced the stall to 2 memory transfers. How could we make this even faster? 2/20/2017 CS/COE 1541 term 2174

Critical word first does one better, by designing the memory system in such a way that we load the requested word first. V Tag D0 D1 D2 D3 D4 D5 D6 D7 Want to access D2... Bring in those four transfers, starting with the one that contains D2. Boom, 1 memory transfer later, we can access D2. The rest of the transfers happen in parallel like before. Do you think these techniques work better with sequential or random memory accesses? As a result, these techniques are more often used with the I-cache.
2/20/2017 CS/COE 1541 term 2174
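Here's a tiny C sketch of mine comparing how many two-word transfers the CPU stalls for under each fill policy, using the same 8-word block and requested word D2 as above:

#include <stdio.h>

#define WORDS_PER_BLOCK    8
#define WORDS_PER_TRANSFER 2
#define TRANSFERS          (WORDS_PER_BLOCK / WORDS_PER_TRANSFER)

/* How many memory transfers we stall for before the requested word is usable. */
static int stall_wait_full_block(int word)     { (void)word; return TRANSFERS; }
static int stall_early_restart(int word)       { return word / WORDS_PER_TRANSFER + 1; }
static int stall_critical_word_first(int word) { (void)word; return 1; }

int main(void)
{
    int word = 2;   /* we want word D2, as in the slides */
    printf("wait for full block: %d transfers\n", stall_wait_full_block(word));     /* 4 */
    printf("early restart:       %d transfers\n", stall_early_restart(word));       /* 2 */
    printf("critical word first: %d transfers\n", stall_critical_word_first(word)); /* 1 */
    return 0;
}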

