Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy

Slides:



Advertisements
Similar presentations
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Advertisements

Cache Performance 1 Computer Organization II © CS:APP & McQuain Cache Memory and Performance Many of the following slides are taken with.
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
Revision Mid 2 Prof. Sin-Min Lee Department of Computer Science.
Computing Systems Memory Hierarchy.
Memory Hierarchy and Cache Design The following sources are used for preparing these slides: Lecture 14 from the course Computer architecture ECE 201 by.
Chapter 5 Large and Fast: Exploiting Memory Hierarchy CprE 381 Computer Organization and Assembly Level Programming, Fall 2013 Zhao Zhang Iowa State University.
Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy.
How to Build a CPU Cache COMP25212 – Lecture 2. Learning Objectives To understand: –how cache is logically structured –how cache operates CPU reads CPU.
University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell CS352H: Computer Systems Architecture Topic 11: Memory Hierarchy.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Computer Organization CS224 Fall 2012 Lessons 45 & 46.
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
Lecture 14: Caching, cont. EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014, Dr.
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.
Computer Organization CS224 Fall 2012 Lessons 41 & 42.
Princess Sumaya Univ. Computer Engineering Dept. Chapter 5:
1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.
1  1998 Morgan Kaufmann Publishers Chapter Seven.
Caches 1 Computer Organization II © McQuain Memory Technology Static RAM (SRAM) – 0.5ns – 2.5ns, $2000 – $5000 per GB Dynamic RAM (DRAM)
The Memory Hierarchy (Lectures #17 - #20) ECE 445 – Computer Organization The slides included herein were taken from the materials accompanying Computer.
Chapter 5 Large and Fast: Exploiting Memory Hierarchy.
Computer Organization CS224 Fall 2012 Lessons 37 & 38.
The Memory Hierarchy Cache, Main Memory, and Virtual Memory Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University.
Cache Issues Computer Organization II 1 Main Memory Supporting Caches Use DRAMs for main memory – Fixed width (e.g., 1 word) – Connected by fixed-width.
CS161 – Design and Architecture of Computer Systems
Cache Memory and Performance
ECE232: Hardware Organization and Design
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
The Goal: illusion of large, fast, cheap memory
CS352H: Computer Systems Architecture
How will execution time grow with SIZE?
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Cache Memory Presentation I
Morgan Kaufmann Publishers Memory & Cache
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Morgan Kaufmann Publishers
Morgan Kaufmann Publishers
The Memory Hierarchy Chapter 5
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
ECE 445 – Computer Organization
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Example Cache Coherence Problem
Lecture 21: Memory Hierarchy
Systems Architecture II
ECE 445 – Computer Organization
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP
Morgan Kaufmann Publishers
Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory
Morgan Kaufmann Publishers Memory Hierarchy: Cache Basics
Morgan Kaufmann Publishers Memory Hierarchy: Introduction
CS 3410, Spring 2014 Computer Science Cornell University
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Chapter Five Large and Fast: Exploiting Memory Hierarchy
Cache - Optimization.
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Virtual Memory Lecture notes from MKP and S. Yalamanchili.
Sarah Diesburg Operating Systems CS 3430
Memory Principles.
Presentation transcript:

Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy 11 September, 2018 Large and Fast: Exploiting Memory Hierarchy Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Memory Technology Static RAM (SRAM) 0.5ns – 2.5ns, $2000 – $5000 per GB Dynamic RAM (DRAM) 50ns – 70ns, $20 – $75 per GB Magnetic disk 5ms – 20ms, $0.20 – $2 per GB Ideal memory Access time of SRAM Capacity and cost/GB of disk Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to be accessed again soon e.g., instructions in a loop, induction variables Spatial locality Items near those accessed recently are likely to be accessed soon E.g., sequential instruction access, array data Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Taking Advantage of Locality Morgan Kaufmann Publishers 11 September, 2018 Taking Advantage of Locality Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory Main memory Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory Cache memory attached to CPU Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Mapping Function Cache size = 64KB  216 Bytes Cache block = 4B  22 Bytes  1 word Number of lines = 214

b t b

1 2 3

0001 0110 00110011100111 00 1 6 Line# = 00110011100111 = 0CE7

1 2 3

XNOR = XOR

Next slide

0001011000110011100111 00 0 5 8 C E 7 Line# = 00110011100111 = 0CE7

Set Associative Cache hybrid between a fully associative cache & direct mapped cache. considered a reasonable compromise between the complex hardware needed for fully associative caches (parallel searches of all slots), and the simplistic direct-mapped scheme, may cause collisions of addresses to the same slot

Set Associative Cache Parking lot analogy 1000 parking spots (0-999) – 3 digits needed to find slot Instead use 2 digits (say last 2 digits of SSN) to hash into 100 sets of 10 spots. Once you find your set, you can park in any of the 10 spots.

Finding a Set There are 213 sets (also known as 8K-way set associative cache) Parallel search all 213 sets to see if 9-bit tag matches a particular slot) If found, go to matching set (b14 – b2) bits get word at b1 – b0, o.w., decide which slot you want to put data in from memory.

1 2 3

Tag Data Set

Set#1 Set#0 Line# (14 bits) 9 bits Set#(13 bits) 2 bits 0001011000110011100111 00 0 2 C set# (2 way - consider 1 out of 13 bits) Line# = 00110011100111 = 0CE7

Measuring Cache Performance Morgan Kaufmann Publishers 11 September, 2018 Measuring Cache Performance Components of CPU time Program execution cycles Includes cache hit time Memory stall cycles Mainly from cache misses With simplifying assumptions: Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Cache Performance Example Morgan Kaufmann Publishers 11 September, 2018 Cache Performance Example Given I-cache miss rate = 2% D-cache miss rate = 4% Miss penalty = 100 cycles Base CPI (ideal cache) = 2 Load & stores are 36% of instructions Miss cycles per instruction I-cache: 0.02 × 100 = 2 D-cache: 0.36 × 0.04 × 100 = 1.44 Actual CPI = 2 + 2 + 1.44 = 5.44 Ideal CPU is 5.44/2 =2.72 times faster Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Average Access Time Hit time is also important for performance Average memory access time (AMAT) AMAT = Hit time + Miss rate × Miss penalty Example CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5% AMAT = 1 + 0.05 × 20 = 2 2 cycles per instruction Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Multilevel Caches Primary cache attached to CPU Small, but fast Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Main memory services L-2 cache misses Some high-end systems include L-3 cache Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Multilevel Cache Example Morgan Kaufmann Publishers 11 September, 2018 Multilevel Cache Example Given CPU base CPI = 1, clock rate = 4 GHz Miss rate/instruction = 2% Main memory access time = 100ns With just primary cache Miss penalty = 100ns/0.25ns = 400 cycles Effective CPI = 1 + 0.02 × 400 = 9 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Example (cont.) Now add L-2 cache Access time = 5ns Global miss rate to main memory = 0.5% Primary miss with L-2 hit Penalty = 5ns/0.25ns = 20 cycles Primary miss with L-2 miss Extra penalty = 500 cycles Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers Example (cont.) Morgan Kaufmann Publishers 11 September, 2018 L1 (miss)  MEM = 0.02(400) = 8 cycle penalty (100/.25 = 400 for memory access) L1 (miss)  L2(miss)  MEM = 0.02(400) = 8 cycle penalty (5/.25 = 20 cycles for L2 access) + (400 for memory access) L1 miss: 0.02 L2 miss: 0.005 Total L1+L2 miss = 0.02(20) + 0.005(400) = 2.4 CPI = 1 + 2.4 = 3.4 Performance ratio = 9/3.4 = 2.6 (over L1 only cache) Since effective CPI = 1 + 0.02 × 400 = 9 for L1 cache (see two slides prior to this one) Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Multilevel Cache Considerations Morgan Kaufmann Publishers 11 September, 2018 Multilevel Cache Considerations Primary cache Focus on minimal hit time L-2 cache Focus on low miss rate to avoid main memory access Hit time has less overall impact Results L-1 cache usually smaller than a single cache L-1 block size smaller than L-2 block size Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Interactions with Advanced CPUs Morgan Kaufmann Publishers 11 September, 2018 Interactions with Advanced CPUs Out-of-order CPUs can execute instructions during cache miss Pending store stays in load/store unit Dependent instructions wait in reservation stations Independent instructions continue Effect of miss depends on program data flow Much harder to analyze Use system simulation Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers Cache Control 11 September, 2018 Example cache characteristics Direct-mapped, write-back, write allocate Block size: 4 words (16 bytes) Cache size: 16 KB (1024 blocks) 32-bit byte addresses Valid bit and dirty bit per block Blocking cache CPU waits until access is complete Tag Index Offset 3 4 9 10 31 4 bits 10 bits 18 bits Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Interface Signals Cache Memory CPU Read/Write Read/Write Valid Valid 32 32 Address Address 32 128 Write Data Write Data 32 128 Read Data Read Data Ready Ready Multiple cycles per access Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers Finite State Machines Morgan Kaufmann Publishers 11 September, 2018 Use an FSM to sequence control steps Set of states, transition on each clock edge State values are binary encoded Current state stored in a register Next state = fn (current state, current inputs) Control output signals = fo (current state) Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers Cache Controller FSM 11 September, 2018 Could partition into separate states to reduce clock cycle time Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Cache Coherence Problem Morgan Kaufmann Publishers Cache Coherence Problem 11 September, 2018 Suppose two CPU cores share a physical address space Write-through caches (all writes go through to memory) Time step Event CPU A’s cache CPU B’s cache Memory 1 CPU A reads X 2 CPU B reads X 3 CPU A writes 1 to X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Coherence Defined Informally: Reads return most recently written value Formally: P writes X; P reads X (no intervening writes)  read returns written value P1 writes X; P2 reads X (sufficiently later)  read returns written value i.e., CPU B reading X after step 3 in example P1 writes X, P2 writes X  all processors see writes in the same order End up with the same final value for X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Cache Coherence Protocols Morgan Kaufmann Publishers 11 September, 2018 Cache Coherence Protocols Operations performed by caches in multiprocessors to ensure coherence Migration of data to local caches Reduces bandwidth for shared memory Replication of read-shared data Reduces contention for access Snooping protocols Each cache monitors bus reads/writes Directory-based protocols Caches and memory record sharing status of blocks in a directory Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Invalidating Snooping Protocols Morgan Kaufmann Publishers 11 September, 2018 Invalidating Snooping Protocols Cache gets exclusive access to a block when it is to be written Broadcasts an invalidate message on the bus Subsequent read in another cache misses Owning cache supplies updated value CPU activity Bus activity CPU A’s cache CPU B’s cache Memory CPU A reads X Cache miss for X CPU B reads X CPU A writes 1 to X Invalidate for X 1 CPU B read X Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Memory Consistency When are writes seen by other processors “Seen” means a read returns the written value Can’t be instantaneously Assumptions A write completes only when all processors have seen it A processor does not reorder writes with other accesses Consequence P writes X then writes Y  all processors that see new Y also see new X Processors can reorder reads, but not writes Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Multilevel On-Chip Caches Morgan Kaufmann Publishers 11 September, 2018 Multilevel On-Chip Caches Intel Nehalem 4-core processor (Corei7) §5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem Core Core Core Core Shared L3 Cache Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Nehalem Core §5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem Core Core Core Core Shared L3 Cache Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

3-Level Cache Organization Morgan Kaufmann Publishers 11 September, 2018 3-Level Cache Organization Intel Nehalem AMD Opteron X4 L1 caches (per core) L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles L2 unified cache (per core) 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a L3 unified cache (shared) 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles n/a: data not available Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers 11 September, 2018 Concluding Remarks Fast memories are small, large memories are slow We really want fast, large memories  Caching gives this illusion  Principle of locality Programs use a small part of their memory space frequently Memory hierarchy L1 cache  L2 cache  …  DRAM memory  disk Memory system design is critical for multiprocessors Chapter 5 — Large and Fast: Exploiting Memory Hierarchy