15-740/18-740 Computer Architecture Lecture 19: Caching II Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/31/2011.



Announcements

Milestone II
- Due Friday, November 4
- Please talk with us if you are not making good progress

CALCM Seminar: Wednesday, November 2, 4pm
- Hamerschlag Hall, D-210
- Edward Suh, Cornell
- "Hardware-Assisted Run-Time Monitoring for Trustworthy Computing Systems"

Review Sets 11 and 12

Due today (Oct 31):
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006
- Complete the review even if it is late

Due next Monday (Nov 7):
- Joseph and Grunwald, "Prefetching using Markov Predictors," ISCA 1997
- Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007

Last Lecture
- Caching basics

Today
- More caching

Review: Handling Writes (Stores)

Do we allocate a cache block on a write miss?
- Allocate on write miss: yes
- No-allocate on write miss: no

Allocate on write miss
+ Can consolidate writes instead of writing each of them individually to the next level
-- Requires (?) transfer of the whole cache block
+ Simpler, because write misses can be treated the same way as read misses

No-allocate
+ Conserves cache space if the locality of writes is low
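A minimal C sketch of the two write-miss policies on a direct-mapped, write-back cache. This is my own illustration, not code from the lecture; the structure names and sizes are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS   64
    #define BLOCK_SIZE 64                /* bytes per block */

    typedef struct {
        bool     valid;
        bool     dirty;
        uint64_t tag;
    } Line;

    typedef struct {
        Line lines[NUM_SETS];
        bool allocate_on_write_miss;     /* policy switch */
    } Cache;

    static void handle_write(Cache *c, uint64_t addr)
    {
        uint64_t block = addr / BLOCK_SIZE;
        uint64_t set   = block % NUM_SETS;
        uint64_t tag   = block / NUM_SETS;
        Line *l = &c->lines[set];

        if (l->valid && l->tag == tag) {  /* write hit */
            l->dirty = true;              /* write-back: just mark modified */
            return;
        }
        if (c->allocate_on_write_miss) {
            /* Allocate: (write back the victim if dirty, omitted) fetch the
             * whole block, then treat the miss like a read miss. Later
             * writes to the same block coalesce in the cache. */
            l->valid = true;
            l->tag   = tag;
            l->dirty = true;
        } else {
            /* No-allocate: send the write to the next level without
             * displacing any cached block; conserves cache space when
             * write locality is low. */
            /* write_to_next_level(addr); -- assumed stub */
        }
    }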

Inclusion vs. Exclusion in Multi-Level Caches

Inclusive caches
- Every block existing in the first level also exists in the next level
- When fetching a block, place it in all cache levels
- Tradeoffs:
  -- Leads to duplication of data in the hierarchy: less efficient
  -- Maintaining inclusion takes effort (forced evictions)
  + But makes cache coherence in multiprocessors easier: other processors' accesses need to be tracked only in the highest-level cache

Exclusive caches
- The blocks contained in the different cache levels are mutually exclusive
- When evicting a block, do you write it back to the next level?
- Tradeoffs:
  + More efficient utilization of cache space
  + (Potentially) more flexibility in replacement/placement
  -- More blocks/levels to keep track of to ensure cache coherence; takes effort

Non-inclusive caches
- No guarantees of inclusion or exclusion: simpler design
- Used by most Intel processors

Maintaining Inclusion and Exclusion

When does maintaining inclusion take effort?
- L1 block size < L2 block size
- L1 associativity > L2 associativity
- Prefetching into L2
- When a block is evicted from L2, all corresponding subblocks must be evicted from L1; keep 1 bit per subblock in L2 saying "also in L1" (a sketch follows this list)
- When a block is inserted, make sure all higher levels also have it

When does maintaining exclusion take effort?
- L1 block size != L2 block size
- Prefetching into any cache level
- When a block is inserted into any level, ensure it is not in any other
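A sketch of the forced-eviction (back-invalidation) step that maintains inclusion, using the "also in L1" presence bit suggested above. The structures are illustrative assumptions, not the lecture's design; for brevity the bit covers the whole block rather than individual subblocks.

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS  64
    #define L2_SETS 512

    typedef struct { bool valid; uint64_t tag; } L1Line;
    typedef struct { bool valid; bool in_l1; uint64_t tag; } L2Line;

    static L1Line l1[L1_SETS];
    static L2Line l2[L2_SETS];

    static void l1_invalidate(uint64_t block)
    {
        L1Line *l = &l1[block % L1_SETS];
        if (l->valid && l->tag == block / L1_SETS)
            l->valid = false;            /* forced eviction from L1 */
    }

    /* Called when L2 picks the line in 'victim_set' for replacement. */
    static void l2_evict(uint64_t victim_set)
    {
        L2Line *l = &l2[victim_set];
        if (!l->valid)
            return;
        if (l->in_l1) {
            /* Reconstruct the block address and evict the L1 copy so
             * that inclusion still holds after the L2 replacement. */
            uint64_t block = l->tag * L2_SETS + victim_set;
            l1_invalidate(block);
        }
        l->valid = false;                /* (write back if dirty, omitted) */
    }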

Multi-level Caching in a Pipelined Design

First-level caches (instruction and data)
- Decisions are very much affected by cycle time
- Small, lower associativity

Second-level caches
- Decisions need to balance hit rate and access latency
- Usually large and highly associative; latency is not as critical
- Serial tag and data access

Serial vs. parallel access of levels
- Serial: the second-level cache is accessed only if the first level misses
- The second level therefore does not see the same accesses as the first: the first level acts as a filter (a sketch follows). Can you exploit this fact to improve the hit rate of the second-level cache?
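A tiny sketch of serial access to two levels, to make the filtering point concrete. The lookup functions are assumed stubs, not a real interface.

    #include <stdbool.h>
    #include <stdint.h>

    bool l1_lookup(uint64_t addr);       /* assumed: returns true on hit */
    bool l2_lookup(uint64_t addr);
    void fill_from_memory(uint64_t addr);

    void access(uint64_t addr)
    {
        if (l1_lookup(addr))
            return;                      /* L1 hit: L2 never sees this access */
        if (l2_lookup(addr))
            return;                      /* L2 sees only the L1 miss stream */
        fill_from_memory(addr);          /* miss in both levels */
    }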

Improving Cache "Performance"

- Reducing miss rate
  - Caveat: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted
- Reducing miss latency
- Reducing hit latency

Improving Basic Cache Performance

Reducing miss rate
- More associativity
- Alternatives/enhancements to associativity: victim caches, hashing, pseudo-associativity, skewed associativity
- Software approaches

Reducing miss latency/cost
- Multi-level caches
- Critical word first
- Subblocking
- Non-blocking caches
- Multiple accesses per cycle
- Software approaches

How to Reduce Each Miss Type

Compulsory
- Caching cannot help
- Prefetching can

Conflict
- More associativity
- Other ways to get more associativity without making the cache (truly) associative: victim caches, hashing, software hints?

Capacity
- Utilize cache space better: keep blocks that will be referenced
- Software management: divide the working set such that each "phase" fits in the cache

Victim Cache: Reducing Conflict Misses

Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.

Idea: use a small (fully-associative) buffer to store blocks evicted from the cache.
+ Can avoid ping-ponging of cache blocks mapped to the same set (when two cache blocks that are continuously accessed in nearby time conflict with each other)
-- Increases miss latency if accessed serially with the cache

[Figure: a direct-mapped cache with a victim cache beside it, both backed by the next-level cache]
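A minimal sketch (my own illustration, following the idea above rather than Jouppi's exact design) of the lookup path with a victim cache behind a direct-mapped L1.

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS     64
    #define VICTIM_WAYS  4               /* small, fully associative */

    typedef struct { bool valid; uint64_t block; } Entry;

    static Entry l1[L1_SETS];
    static Entry victim[VICTIM_WAYS];

    static bool access_block(uint64_t block)
    {
        Entry *set = &l1[block % L1_SETS];
        if (set->valid && set->block == block)
            return true;                          /* L1 hit */

        for (int i = 0; i < VICTIM_WAYS; i++) {
            if (victim[i].valid && victim[i].block == block) {
                /* Victim hit: swap with the conflicting L1 block so that
                 * two ping-ponging blocks both stay on chip. */
                Entry tmp = *set;
                *set = victim[i];
                victim[i] = tmp;
                return true;                      /* hit, at extra latency */
            }
        }

        /* Miss in both: the evicted L1 block moves into the victim cache
         * (always slot 0 here for brevity; a real design would use FIFO
         * or LRU), and the new block fills L1 from the next level. */
        victim[0] = *set;
        set->valid = true;
        set->block = block;
        return false;
    }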

Hashing and Pseudo-Associativity

Hashing: better "randomizing" index functions
+ Can reduce conflict misses by distributing the accessed memory blocks more evenly across sets (e.g., an access stride equal to the cache size maps everything to one set under the conventional index function)
-- More complex to implement: can lengthen the critical path

Pseudo-associativity (poor man's associative cache)
- Serial lookup: on a miss, use a different index function and access the cache again
- Given a direct-mapped array with K cache blocks, implement K/N sets
- Given address Addr, sequentially look up: {0, Addr[lg(K/N)-1:0]}, {1, Addr[lg(K/N)-1:0]}, ..., {N-1, Addr[lg(K/N)-1:0]} (a sketch follows this list)
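A sketch of the sequential lookup just described, under the assumption that K/N is a power of two, so prepending the way number to the index bits is the same as computing way * (K/N) + index. Names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define K 1024                       /* blocks in the direct-mapped array */
    #define N 2                          /* degree of pseudo-associativity */

    typedef struct { bool valid; uint64_t tag; } Line;
    static Line array[K];

    static bool lookup(uint64_t block)
    {
        uint64_t set = block % (K / N);  /* Addr[lg(K/N)-1:0] */
        uint64_t tag = block / (K / N);
        for (uint64_t way = 0; way < N; way++) {
            /* Probe {way, set}; each retry costs an extra access. */
            Line *l = &array[way * (K / N) + set];
            if (l->valid && l->tag == tag)
                return true;
        }
        return false;                    /* miss with all N index functions */
    }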

Skewed Associative Caches (I)

Basic 2-way associative cache structure:
[Figure: the address is split into tag | index | byte-in-block; the same index function selects the set in Way 0 and Way 1, and each way's stored tag is compared (=?) against the address tag]

Skewed Associative Caches (II)

Skewed associative caches: each bank has a different index function.
[Figure: the index bits pass through a hash function (f0) before indexing a way; addresses that share the same conventional index, and would fall into the same set, are redistributed to different sets across the ways]

Skewed Associative Caches (III)

Idea: reduce conflict misses by using a different index function for each cache way.

Benefit: indices are randomized
- Two blocks are less likely to have the same index in every way, so conflict misses are reduced
- May be able to reduce associativity

Cost: the additional latency of the hash function
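A sketch of per-way index hashing. The XOR-based hash is an illustrative choice (real skewed caches use carefully chosen functions), and for simplicity the full block number is stored as the tag.

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 256
    #define WAYS   2

    typedef struct { bool valid; uint64_t tag; } Line;
    static Line ways[WAYS][SETS];

    static uint64_t skew_index(uint64_t block, int way)
    {
        uint64_t low  = block % SETS;    /* conventional index bits */
        uint64_t high = (block / SETS) >> (way * 3);
        return (low ^ high) % SETS;      /* a different hash per way */
    }

    static bool lookup(uint64_t block)
    {
        /* Two blocks that collide in way 0 likely map to
         * different sets in way 1, avoiding the conflict. */
        for (int w = 0; w < WAYS; w++) {
            Line *l = &ways[w][skew_index(block, w)];
            if (l->valid && l->tag == block)
                return true;
        }
        return false;
    }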

Improving Hit Rate via Software (I)

Restructuring loops to match the data layout. Example: if the array is stored column-major, x[i+1,j] follows x[i,j] in memory while x[i,j+1] is far away from x[i,j], so the inner loop should vary the row index.

Poor code:
    for i = 1, rows
        for j = 1, columns
            sum = sum + x[i,j]

Better code:
    for j = 1, columns
        for i = 1, rows
            sum = sum + x[i,j]

This transformation is called loop interchange. Other optimizations can also increase hit rate: loop fusion, array merging, ...

But: what if there are multiple arrays? What if the array size is unknown at compile time?

More on Data Structure Layout

Pointer-based traversal (e.g., of a linked list). Assume a huge linked list (1M nodes) with unique keys.

    struct Node {
        struct Node *next;
        int key;
        char name[256];
        char school[256];
    };

    while (node) {
        if (node->key == input_key) {
            // access other fields of node
        }
        node = node->next;
    }

Why does this code have a poor cache hit rate? The "other fields" occupy most of each cache line even though they are rarely accessed!

How Do We Make This Cache-Friendly?

Idea: separate the frequently used fields of a data structure and pack them into a separate data structure.

    struct Node {
        struct Node *next;
        int key;
        struct NodeData *node_data;
    };

    struct NodeData {
        char name[256];
        char school[256];
    };

    while (node) {
        if (node->key == input_key) {
            // access node->node_data
        }
        node = node->next;
    }

Who should do this?
- Programmer
- Compiler (profiling vs. dynamic)
- Hardware?
- Who can determine what is frequently used?

Improving Hit Rate via Software (II)

Blocking
- Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
- Avoids cache conflicts between different chunks of computation
- Essentially: divide the working set so that each piece fits in the cache

But (a tiled example follows this list):
1. There are still self-conflicts within a block
2. There can be conflicts among different arrays
3. Array sizes may be unknown at compile/programming time
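An illustrative blocked (tiled) matrix multiply in C, the classic instance of the idea above. N and B are assumed sizes; B would be tuned so that the tiles of all three arrays fit in the cache together.

    #define N 512
    #define B  32                        /* tile size: 3 B*B tiles in cache */

    void matmul_blocked(const double A[N][N], const double Bm[N][N],
                        double C[N][N])
    {
        /* Each (ii, jj, kk) iteration works on one chunk whose working
         * set fits in the cache and is fully reused before moving on. */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += A[i][k] * Bm[k][j];
                            C[i][j] = sum;
                        }
    }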

Midterm I Results

Solutions will be posted. Check the solutions and make sure you know the answers.

Maximum possible score: 185 / 167 (including bonus)
Average: 98 / 167
Median: 94 / 167
Maximum: 169 / 167
Minimum: 29 / 167

Midterm I Grade Distribution