Memory Hierarchy
Bojian Zheng CSCD70 Spring 2018 bojian@cs.toronto.edu

Memory Hierarchy From the programmer's point of view, the ideal memory has infinite capacity (i.e., it can store an unlimited amount of data) and zero access time (latency). But these two requirements contradict each other: large memories have high access latency, and fast memories cannot grow beyond a certain capacity. Therefore, we build multiple levels of storage and try to keep the data the processor needs most in the fastest level.
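For a rough sense of scale (ballpark, order-of-magnitude figures; these are mine, not from the slides): registers take about a cycle to access and hold only hundreds of bytes; an L1 cache is tens of KB at a few cycles; an L2 cache is hundreds of KB at ~10 cycles; an L3 cache is a few MB at tens of cycles; DRAM is GBs at hundreds of cycles; disk or SSD is TBs at microseconds to milliseconds.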

Memory Hierarchy (Figure: the multi-level hierarchy we have in reality vs. the single large, fast memory we would ideally see.)

Locality Temporal Locality: If an address gets accessed, then it is very likely that the exact same address will be accessed once again in the near future.
unsigned a = 10;
... // 'a' will hopefully be used again soon

Locality Spatial Locality: If an address gets accessed, then it is very likely that nearby addresses will be accessed in the near future.
unsigned A[10];
A[5] = 10;
... // 'A[4]' or 'A[6]' will hopefully be used soon

Cache Basics Memory is divided into fixed-size blocks. Each block maps to one location in the cache (called a cache block, or cache line), determined by the address's index bits.
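As a concrete illustration, the sketch below splits an address into tag / index / offset for a hypothetical direct-mapped cache with 64-byte blocks and 256 sets; the parameters are mine, not from the slides:

    #include <stdint.h>

    #define BLOCK_BITS 6   /* 64-byte blocks -> low 6 bits select a byte within the block */
    #define INDEX_BITS 8   /* 256 sets       -> next 8 bits select the set                */

    static inline uint64_t addr_offset(uint64_t a) { return a & ((1u << BLOCK_BITS) - 1); }
    static inline uint64_t addr_index (uint64_t a) { return (a >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
    static inline uint64_t addr_tag   (uint64_t a) { return a >> (BLOCK_BITS + INDEX_BITS); }

On a lookup, the index selects the cache line, and the tag stored there is compared against addr_tag(a) to decide hit or miss.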

Direct-Mapped Cache

Direct-Mapped Cache: Problem Suppose that two variables A (address 0b10000000) and B (address 0b11000000) map to the same cache line. Interleaving accesses to them (i.e., A->B->A->B->...) leads to a 0% hit rate, because each access evicts the other's block. These are called conflict misses (more later).
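To see the collision concretely, pick illustrative parameters (mine, not the slide's): 16-byte blocks and 4 sets, i.e., 4 offset bits and 2 index bits:

    A = 128 = 0b1000_0000: index = (128 >> 4) & 0b11 = 0, tag = 128 >> 6 = 2
    B = 192 = 0b1100_0000: index = (192 >> 4) & 0b11 = 0, tag = 192 >> 6 = 3

Both addresses land in set 0 with different tags, so in a direct-mapped cache every access to one evicts the other.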

Set-Associative Cache

Set-Associative Cache: Problem More expensive tag comparison (one comparator per way). More complicated design, with lots of things to consider: Where should we insert the new incoming cache block? What happens when a cache hit occurs; how should we adjust the priorities? What happens when a cache miss occurs and the cache set is fully occupied (the replacement policy)?

Replacement Policy Which block do we evict when the cache set is full? Least-Recently-Used? Random? Pseudo-Least-Recently-Used? Belady's (a.k.a. optimal) replacement policy: evict the block that will be reused furthest in the future. It is impossible to implement in hardware ... why? (It requires knowledge of future accesses.) But it is still useful ... why? (It gives an upper bound on the achievable hit rate, against which practical policies can be judged.)
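As a minimal sketch of LRU within one 4-way set (my illustration, using per-line timestamps rather than the few LRU bits real hardware uses):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint64_t last_used;  /* time of the most recent access to this line */
    } Line;

    /* Pick the way to fill: a free (invalid) way if any, else the LRU way. */
    int choose_victim(const Line set[WAYS]) {
        int lru = 0;
        for (int w = 0; w < WAYS; ++w) {
            if (!set[w].valid) return w;  /* free slot: no eviction needed */
            if (set[w].last_used < set[lru].last_used) lru = w;
        }
        return lru;
    }

On every hit, the line's last_used is refreshed to the current time; real caches approximate this with pseudo-LRU bits instead of full timestamps.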

Cache Misses There are 3 types of cache misses:
Cold misses: happen whenever we reference a variable (memory location) for the first time. Such misses are unavoidable (or are they? prefetching, below, can hide them).
Capacity misses: happen because our cache is too small: misses that occur even with a fully-associative cache and the optimal replacement policy, because the working set is larger than the cache.
Conflict misses: happen because we do not have enough associativity: the data would fit, but too many blocks map to the same set.
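A tiny worked trace (parameters mine): cycle through 3 distinct blocks A, B, C, A, B, C, ... in a fully-associative cache that holds only 2 blocks. After the three cold misses, some accesses must keep missing no matter which block even the optimal policy evicts, because the working set (3 blocks) exceeds the cache (2 blocks): those are capacity misses. By contrast, the A->B->A->B pattern from the direct-mapped slide misses even though the cache has plenty of room elsewhere: those are conflict misses, cured by associativity.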

Handling Writes Write-Back vs. Write-Through.
Write-Back: write modified data to memory only when the cache block is evicted.
(+) Can effectively combine multiple writes to the same block.
(-) Needs an extra bit per block indicating whether it is dirty or clean.
Write-Through: write modified data to memory whenever a write occurs.
(+) Simple, and makes it easy to reason about consistency.
(-) Cannot combine writes, and is more bandwidth-intensive.
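A minimal sketch contrasting the two policies on a store hit (my illustration; mem_write and mem_write_block stand in for hypothetical memory-bus operations):

    #include <stdbool.h>
    #include <stdint.h>

    void mem_write(uint64_t addr, uint8_t byte);              /* hypothetical bus ops */
    void mem_write_block(uint64_t addr, const uint8_t *data);

    typedef struct {
        bool     valid, dirty;
        uint64_t tag;
        uint8_t  data[64];
    } CacheLine;

    void store_write_back(CacheLine *line, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        line->dirty = true;        /* memory is updated only at eviction */
    }

    void store_write_through(CacheLine *line, uint64_t addr, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        mem_write(addr, byte);     /* every store also goes to memory */
    }

    void evict(CacheLine *line, uint64_t block_addr) {
        if (line->dirty)           /* write-back pays its memory traffic here */
            mem_write_block(block_addr, line->data);
        line->valid = false;
    }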

Prefetching Upon one cache miss, try to predict what the next cache miss will be and fetch that data ahead of time. For instance, a miss on A[0] ⇒ prefetch A[1] into the cache. Good for memory access patterns that are highly predictable, e.g., instruction memory, or array accesses with a uniform stride. Risk: cache pollution. Goals: timeliness, coverage, accuracy.
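As an illustration of predicting the next miss (my sketch, not from the slides): a toy stride prefetcher remembers the last miss address and, once two consecutive misses are the same distance apart, prefetches one stride ahead:

    #include <stdint.h>

    void prefetch_block(uint64_t addr);  /* hypothetical hook: pull addr's block into the cache */

    static uint64_t last_miss_addr = 0;
    static int64_t  last_stride    = 0;

    void on_cache_miss(uint64_t addr) {
        int64_t stride = (int64_t)(addr - last_miss_addr);
        if (stride != 0 && stride == last_stride)
            prefetch_block(addr + stride);  /* pattern confirmed: fetch the next block early */
        last_stride    = stride;
        last_miss_addr = addr;
    }

This works well for uniform-stride array walks; on irregular patterns it issues useless prefetches (cache pollution, poor accuracy).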

Questions?
Cache Basics: Locality, Set-Associativity, Replacement Policy
Cache Misses: Cold, Capacity, Conflict
Handling Writes: Write-Back, Write-Through
Prefetching: Risks, Goals

Preview: Cache & Compiler OK ... so how is this related to the compiler?
unsigned A[20][10];
for (unsigned i = 0; i < 10; ++i)
  for (unsigned j = 0; j < 20; ++j)
    A[j][i] = j * 10 + i; // Assume A is row-major.
This is not very nice, because the cache is utilized badly :( The inner loop walks down a column, so consecutive accesses are 10 * sizeof(unsigned) = 40 bytes apart and typically touch a different cache line on every iteration.

Preview: Cache & Compiler Interchanging the loops preserves the result but fixes the access order:
Before (column order, poor spatial locality):
for (i ∈ [0, 10))
  for (j ∈ [0, 20))
    A[j][i] = j * 10 + i;
After (row order, consecutive addresses):
for (j ∈ [0, 20))
  for (i ∈ [0, 10))
    A[j][i] = j * 10 + i;

Preview: Cache & Compiler Consider another example:
unsigned A[100][100];
for (unsigned i = 0; i < 100; ++i)
  for (unsigned j = 0; j < 100; ++j)
    sum += A[i][j];
Here the traversal order is already row-major, so spatial locality is good, yet every new cache block still costs one miss.

Preview: Cache & Compiler Apply prefetching ($_BLOCK_SIZE is the cache block size in array elements; the prefetch is hoisted so it is issued once per block rather than once per element):
unsigned A[100][100];
for (unsigned i = 0; i < 100; ++i)
  for (unsigned j = 0; j < 100; j += $_BLOCK_SIZE) {
    prefetch(&A[i][j] + $_BLOCK_SIZE); // fetch the next block while summing this one
    for (unsigned jj = j; jj < j + $_BLOCK_SIZE; ++jj)
      sum += A[i][jj];
  }

Preview: Design Tradeoff Consider the following code:
unsigned A[20][10];
for (unsigned i = 0; i < 10; ++i)
  for (unsigned j = 0; j < 19; ++j)
    A[j][i] = A[j+1][i];

Preview: Design Tradeoff Loop Parallelization vs. Cache Locality:
for (i ∈ [0, 10))
  for (j ∈ [0, 19))
    A[j][i] = A[j+1][i];
for (j ∈ [0, 19))
  for (i ∈ [0, 10))
    A[j][i] = A[j+1][i];
The i loop carries no dependence, so the first form parallelizes nicely across i but strides badly through memory. The second form enjoys good spatial locality in its inner i loop, but the outer j loop carries a dependence (iteration j must read A[j+1][i] before iteration j+1 overwrites it) and therefore cannot be parallelized.