Computer Architecture: A Quantitative Approach, Fifth Edition
The University of Adelaide, School of Computer Science
Chapter 2: Memory Hierarchy Design
Introduction
- Programmers want very large memory with low latency
- Fast memory technology is more expensive per bit
- Solution: organize the memory system into a hierarchy
  - Entire memory space available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the level below it
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  - Gives the illusion of a large, fast memory to the processor
Memory Hierarchy
- Processor → L1 Cache → L2 Cache → L3 Cache → Main Memory → Hard Drive or Flash
- Latency increases and capacity grows (KB, MB, GB, TB) moving away from the processor
Cache Organizations and Bandwidth Demand
- (Figure: hierarchy organizations with split I-/D-caches or a unified U-cache at L1/L2, backed by main memory)
- Intel Core i7: 2 data memory accesses per clock per core
- 3.2 GHz clock × 4 cores → 25.6 × 10^9 64-bit data accesses/s plus 12.8 × 10^9 128-bit instruction accesses/s = 409.6 GB/s peak demand
- Main memory supplies only ~25 GB/s
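The 409.6 GB/s figure follows directly from the stated rates; checking the arithmetic:

```latex
\begin{align*}
\text{Data:} \quad & 4\ \text{cores} \times 2\ \tfrac{\text{accesses}}{\text{clock}} \times 3.2\,\text{GHz}
  = 25.6 \times 10^{9}\ \tfrac{\text{accesses}}{\text{s}} \times 8\,\text{B} = 204.8\ \text{GB/s} \\
\text{Instructions:} \quad & 4\ \text{cores} \times 3.2\,\text{GHz}
  = 12.8 \times 10^{9}\ \tfrac{\text{accesses}}{\text{s}} \times 16\,\text{B} = 204.8\ \text{GB/s} \\
\text{Total:} \quad & 204.8 + 204.8 = 409.6\ \text{GB/s}
\end{align*}
```

Against ~25 GB/s from main memory, that is a roughly 16× gap the cache hierarchy must bridge.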
Memory Hierarchy (figure)
Hit vs. Miss
- Block (line): unit of copying; may be multiple words
- If the data is present in the upper level
  - Hit: access satisfied by the upper level
  - Hit ratio: hits / accesses
- If the accessed data is absent
  - Miss: block copied from the lower level
  - Time taken: miss penalty
  - Miss ratio: misses / accesses = 1 − hit ratio
  - The other words contained in the block are fetched too, taking advantage of spatial locality
Tags and Valid Bits
- How do we know which particular block is stored in a cache location?
  - Store the block address as well as the data
  - Actually, only the high-order bits are needed: the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present (initially 0)
Example: 4 KB Direct-Mapped Cache, 1-Word (32-bit) Blocks
- How many blocks (cache lines) are there? 4 KB / 4 B = 1024 lines
- A 32-bit address from the CPU splits into: 20-bit tag | 10-bit index | 2-bit byte offset
- The index selects one of the 1024 entries; the valid bit and a 20-bit tag comparison produce the hit signal, and the data field supplies the 32-bit word
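The field widths can be checked with a short sketch. The cache parameters come from the slide; the function and the sample address are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* 4 KB direct-mapped cache, 1-word (4-byte) blocks:
 * 1024 lines -> 10 index bits, 2 byte-offset bits, 20 tag bits. */
#define BLOCK_BITS  2   /* log2(4-byte block) */
#define INDEX_BITS  10  /* log2(1024 lines)   */

int main(void) {
    uint32_t addr   = 0x12345678;  /* arbitrary sample address */
    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);
    printf("tag=0x%05x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```

For 0x12345678 this prints tag=0x12345, index=414, offset=0.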
Placement Policies: WHERE to Put a Block in Cache
- Mapping between main memory and the cache; main memory has a much larger capacity than the cache
Fully Associative Cache
- (Figure: memory blocks 0–31, cache blocks 0–7)
- A block can be placed in any location in the cache
Direct Mapped Cache
- A block can be placed in ONLY a single location in the cache
- Block address = data address / block size
- Index = (Block address) MOD (Number of blocks in cache)
- Example: block 12 in an 8-block cache → 12 MOD 8 = 4
8 KB Direct-Mapped Cache Organization
- 32-byte blocks → 8 KB / 32 B = 256 lines
- Address fields: tag <21> | index <8> | block offset <5> (a 34-bit address)
- Each entry: valid <1>, tag <21>, data <256 bits>; a MUX selects the requested word from the block
Set Associative Cache
- A block can be placed in one of n locations (ways) in an n-way set associative cache
- Index = (Block address) MOD (Number of sets in cache)
- Example: block 12 with 4 sets → 12 MOD 4 = 0
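A minimal sketch of the set-mapping arithmetic, matching the slide's 12 MOD 4 = 0 example (the 8-block, 2-way configuration is an assumption consistent with 4 sets):

```c
#include <stdio.h>

int main(void) {
    unsigned block_addr = 12;  /* memory block number from the slide */
    unsigned num_sets   = 4;   /* 8-block cache, 2-way -> 4 sets     */
    printf("set = %u\n", block_addr % num_sets);  /* prints: set = 0 */
    return 0;
}
```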
4-Way Set Associative Cache Organization (figure)
64 KB Two-Way Cache (AMD Opteron)
- 1024 lines of 64-byte blocks (512 per way)
- Includes a write buffer
Replacement Policy
- Direct mapped: no choice
- Set associative:
  - Prefer a non-valid entry, if there is one
  - Otherwise, choose among the entries in the set
- Least-recently used (LRU): choose the entry unused for the longest time
  - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random: gives approximately the same performance as LRU at high associativity
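For the 2-way case the slide calls simple, true LRU needs only one bit per set; a minimal sketch (the set count and function names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS 256

/* One bit per set: which of the two ways was least recently used.
 * (A real controller would first prefer an invalid way, per the slide.) */
static uint8_t lru_way[NUM_SETS];

/* On every access (hit or fill), the other way becomes LRU. */
static void touch(unsigned set, unsigned way) {
    lru_way[set] = (uint8_t)(way ^ 1u);
}

/* On a miss with both ways valid: evict the LRU way. */
static unsigned choose_victim(unsigned set) {
    unsigned victim = lru_way[set];
    touch(set, victim);          /* the refilled way is now MRU */
    return victim;
}

int main(void) {
    touch(7, 0);                               /* way 0 of set 7 used */
    printf("victim = %u\n", choose_victim(7)); /* prints: victim = 1  */
    return 0;
}
```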
Handling Writes
- On a read miss, we bring the block into the cache and supply the data to the CPU
- On a write miss, there are two options:
  - Write-allocate: bring the block into the cache, then write
  - Write no-allocate: write directly to main memory
- On a write, there are also two update policies:
  - Write-back: update only the block in the cache; write modified blocks to memory (or the lower level of the hierarchy) when the block is replaced
  - Write-through: update both the cache and the lower level of the memory hierarchy
- Write-allocate is usually paired with write-back; write no-allocate with write-through
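A compact sketch of the write-back + write-allocate pairing the slide says is typical; the toy direct-mapped model (sizes, names, addresses) is purely illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Tiny direct-mapped, write-back + write-allocate cache model. */
#define BLOCK_BYTES 16
#define NUM_LINES   8

struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_LINES];
static uint8_t memory[1 << 12];             /* 4 KB backing store */

static void write_back(struct line *l, unsigned idx) {
    uint32_t base = (l->tag * NUM_LINES + idx) * BLOCK_BYTES;
    for (int i = 0; i < BLOCK_BYTES; i++) memory[base + i] = l->data[i];
}

static struct line *refill(uint32_t addr) {
    unsigned idx = (addr / BLOCK_BYTES) % NUM_LINES;
    struct line *l = &cache[idx];
    if (l->valid && l->dirty) write_back(l, idx);  /* write back on evict */
    l->tag = addr / BLOCK_BYTES / NUM_LINES;
    uint32_t base = addr - addr % BLOCK_BYTES;
    for (int i = 0; i < BLOCK_BYTES; i++) l->data[i] = memory[base + i];
    l->valid = true; l->dirty = false;
    return l;
}

/* Write-allocate: a write miss fetches the block first, then writes
 * only the cached copy and marks it dirty; DRAM is updated at eviction. */
static void cache_write(uint32_t addr, uint8_t byte) {
    unsigned idx = (addr / BLOCK_BYTES) % NUM_LINES;
    struct line *l = &cache[idx];
    if (!l->valid || l->tag != addr / BLOCK_BYTES / NUM_LINES)
        l = refill(addr);                 /* allocate first on write miss */
    l->data[addr % BLOCK_BYTES] = byte;
    l->dirty = true;                      /* memory is now stale */
}

int main(void) {
    cache_write(0x208, 0xdd);   /* miss: allocate, then write in cache */
    printf("memory[0x208]=%02x (stale until eviction)\n", memory[0x208]);
    return 0;
}
```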
Write Buffer & Dirty Bit
- Write-through: a write buffer holds data waiting to be written to memory
  - The CPU stalls on a write only if the write buffer is already full
- Write-back: a dirty bit indicates whether the block has been written
  - Not needed in I-caches
  - A write buffer can also be used here, allowing the replacing block to be read first
Write-Allocate & Write-Back (figure)
- Example: the CPU writes 0xdddddddd to address 0x0000_1208; cache lines hold 4 words
- On the write miss the line (tag 0x0000_12) is allocated first, then the word is written in the cache and the line's valid and dirty bits are set; main memory is updated only when the line is evicted
Write No-Allocate & Write-Through (figure)
- Same write to 0x0000_1208: the block is NOT allocated in the cache on the write miss; the word is written directly to main memory
Memory Hierarchy Basics
- Miss rate: fraction of cache accesses that result in a miss
- Causes of misses (the four Cs):
  - Compulsory: first reference to a block; unrelated to cache size
  - Capacity: blocks discarded because the cache is too small, then later retrieved
  - Conflict: in non-fully-associative caches, the program makes repeated references to multiple addresses from different blocks that map to the same location
  - Coherency: multi-core systems (discussed later)
Measuring Cache Performance
- Components of CPU time:
  - Program execution cycles (includes cache hit time)
  - Memory stall cycles (mainly from cache misses)
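In formula form (the textbook's standard decomposition):

```latex
\begin{align*}
\text{CPU time} &= (\text{CPU execution cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time} \\
\text{Memory stall cycles} &= \text{Memory accesses} \times \text{Miss rate} \times \text{Miss penalty}
\end{align*}
```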
Cache Performance Example
- Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, loads & stores are 36% of instructions
- Miss cycles per instruction:
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
Average Memory Access Time
- For a two-level cache:
  AMAT = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
- Speculative and multithreaded (out-of-order) processors may execute other instructions during a miss
  - Reduces the performance impact of misses
  - The cache must service requests while handling a miss
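A quick numeric sketch with hypothetical values (not from the slides): L1 hit time 1 cycle, L1 miss rate 5%, L2 hit time 10 cycles, L2 local miss rate 20%, L2 miss penalty 100 cycles:

```latex
\text{AMAT} = 1 + 0.05 \times \left(10 + 0.20 \times 100\right) = 1 + 0.05 \times 30 = 2.5\ \text{cycles}
```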
Memory Hierarchy Basics: Six Basic Cache Optimizations
- Larger block size: reduces compulsory misses; increases capacity and conflict misses and the miss penalty
- Larger total cache capacity: reduces capacity and conflict misses; increases hit time and power consumption
- Higher associativity: reduces conflict misses
- More cache levels: reduce overall memory access time
- Giving priority to read misses over writes (checking the write buffer on a read miss): reduces miss penalty
- Avoiding address translation in cache indexing (using part of the page offset of the virtual address to index the cache): reduces hit time
Ten Advanced Optimizations
- Optimization categories:
  - Reducing hit time
  - Increasing cache bandwidth
  - Reducing miss penalty
  - Reducing miss rate
  - Parallelism
1) Small and Simple L1 Caches
- Limited size and lower associativity result in lower hit time
- Critical timing path:
  - addressing the tag memory using the index
  - comparing the read tags with the address
  - selecting the correct word by multiplexing on the offset
- Lower associativity also reduces power because fewer cache lines are accessed
- Why not direct mapped?
  - Virtual indexing requires the cache size to be at most the page size times the associativity
  - Multithreading increases conflict misses at lower associativities
2) Way Prediction
- To improve hit time, predict which way will hit
- Uses predictor bits in each cache line to predict the way of the next access
- Prediction accuracy: > 90% for two-way, > 80% for four-way; the I-cache predicts better than the D-cache
- First used on the MIPS R10000 in the mid-90s; used on 4-way caches in the ARM Cortex-A8
- Way selection cuts access energy (to roughly 0.28× in the I-cache and 0.35× in the D-cache) but increases the mis-prediction penalty: average access time rises to 1.04 in the I-cache and 1.13 in the D-cache
3) Pipelined Cache Access
- Pipeline cache access to improve bandwidth (at the cost of hit latency in cycles):
  - Pentium: 1 cycle
  - Pentium Pro – Pentium III: 2 cycles
  - Pentium 4 – Core i7: 4 cycles
4) Nonblocking Caches
- Allow hits while earlier misses are still outstanding:
  - "Hit under miss"
  - "Hit under multiple misses"
- The Core i7 supports both; the ARM Cortex-A8 supports only the first
- Roughly 9% reduction in effective miss penalty on SPECINT2006 and 12.5% on SPECFP2006
(Figure: nonblocking-cache benefit for integer programs)
5) Multibanked Caches
- Organize the cache as independent banks to support simultaneous accesses
  - ARM Cortex-A8 supports 1–4 banks for L2; the Intel i7 has 4 banks in L1 and 8 banks in L2
- Interleave banks according to block address (see the sketch below)
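Sequential interleaving maps consecutive block addresses round-robin across banks; a minimal sketch (4 banks, as in the i7's L1; block size is an assumption):

```c
#include <stdio.h>

#define NUM_BANKS   4
#define BLOCK_BYTES 64

int main(void) {
    /* Consecutive blocks land in consecutive banks, so a burst of
     * sequential accesses can proceed in parallel. */
    for (unsigned addr = 0; addr < 4 * BLOCK_BYTES; addr += BLOCK_BYTES)
        printf("block %u -> bank %u\n",
               addr / BLOCK_BYTES, (addr / BLOCK_BYTES) % NUM_BANKS);
    return 0;
}
```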
6) Critical Word First, Early Restart
- Critical word first: request the missed word from memory first; send it to the processor as soon as it arrives
- Early restart: request words in normal order; send the missed word to the processor as soon as it arrives
- Effectiveness depends on block size and on the likelihood of another access to the portion of the block not yet fetched
7) Merging Write Buffer
- When storing to a block that is already pending in the write buffer, update the existing write-buffer entry instead of allocating a new one
- Reduces stalls due to a full write buffer
- (Figure: write-buffer contents with and without merging)
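A sketch of the merging check itself; the buffer geometry and names are illustrative, not from any particular design:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES  4
#define BLOCK_WORDS 4   /* 4 x 4-byte words per entry */

struct wb_entry {
    bool     valid;
    uint32_t block_addr;            /* block this entry is waiting on */
    uint32_t word[BLOCK_WORDS];
    bool     word_valid[BLOCK_WORDS];
};
static struct wb_entry wbuf[WB_ENTRIES];

/* Try to merge a store into an entry already pending for the same block;
 * returns false if a new entry is needed (or the CPU must stall). */
static bool wb_merge(uint32_t addr, uint32_t data) {
    uint32_t block = addr / (4 * BLOCK_WORDS);
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wbuf[i].valid && wbuf[i].block_addr == block) {
            unsigned w = (addr / 4) % BLOCK_WORDS;
            wbuf[i].word[w]       = data;   /* overwrite/merge the word */
            wbuf[i].word_valid[w] = true;
            return true;
        }
    }
    return false;
}

int main(void) {
    wbuf[0] = (struct wb_entry){ .valid = true, .block_addr = 0x120 };
    return wb_merge(0x1204, 0xdeadbeef) ? 0 : 1;  /* merges into entry 0 */
}
```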
8) Compiler Optimizations
- Loop interchange: swap nested loops to access memory in sequential order
  - Improves spatial locality
  - Example: X[5000][100], stored row-major (see the sketch below)
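The classic before/after for the slide's row-major X[5000][100]; after the interchange the inner loop walks consecutive addresses:

```c
#include <stdio.h>

#define ROWS 5000
#define COLS 100
static double x[ROWS][COLS];  /* row-major in C: x[i][j], x[i][j+1] adjacent */

/* Before: inner loop strides by COLS doubles (800 B) -> one word
 * used per fetched cache block. */
void scale_bad(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After loop interchange: unit stride -> every word of each fetched
 * block is used before it is evicted. */
void scale_good(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2.0 * x[i][j];
}

int main(void) {
    x[0][0] = 1.0;
    scale_good();
    printf("%.1f\n", x[0][0]);  /* prints: 2.0 */
    return 0;
}
```

Both versions compute the same result; only the traversal order, and hence the cache behavior, differs.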
9, 10) Hardware and Compiler Prefetching
- Hardware: fetch two blocks on a miss (the requested block plus the next sequential block)
  - Place the prefetched block in a stream buffer; 8 stream buffers can capture 60% of cache misses
  - Intel Core i7 supports next-block prefetching in L1 and L2
- Compiler: insert prefetch instructions before the data is needed
  - Loops are a good prefetching target (see the sketch below)
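For the compiler-inserted variety, a sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 iterations is a tuning assumption:

```c
/* Requires GCC or Clang for __builtin_prefetch. */
#include <stdio.h>

#define N 100000
static double a[N], b[N];

void sum_with_prefetch(void) {
    for (int i = 0; i < N; i++) {
        if (i + 8 < N) {                         /* prefetch distance: 8 */
            __builtin_prefetch(&a[i + 8], 0, 1); /* read, low temporal reuse */
            __builtin_prefetch(&b[i + 8], 0, 1);
        }
        a[i] += b[i];
    }
}

int main(void) {
    b[0] = 1.0;
    sum_with_prefetch();
    printf("a[0] = %.1f\n", a[0]);  /* prints: a[0] = 1.0 */
    return 0;
}
```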
Cache Optimization Summary (table)
Memory Technology
- Performance metrics for main memory:
  - Latency: the primary concern of the cache
  - Bandwidth: the concern of multiprocessors and I/O; with increased block sizes, main memory bandwidth also matters to caches
- Access time: time between a read request and when the desired word arrives
- Cycle time: minimum time between unrelated requests to memory
- DRAM is used for main memory, SRAM for caches; Flash memory for PMDs
Memory Technology: SRAM vs. DRAM
- SRAM
  - Access time close to cycle time
  - Requires only low power to retain its bits
  - Six transistors per bit
  - Roughly 4× faster than DRAM; largest L3 SRAM around 12 MB
- DRAM
  - One transistor per bit; must be re-written after being read, so cycle time is longer than access time
  - Must also be refreshed periodically (every ~8 ms, costing about 5% of memory time); an entire row is refreshed at once
  - Address lines are multiplexed because of the large capacity:
    - Upper half of the address: row access strobe (RAS)
    - Lower half of the address: column access strobe (CAS)
Internal Organization of DRAM
- Dual inline memory modules (DIMMs) contain 4–16 DRAM chips and are 8 bytes wide
Memory Performance Improvement (figure)
Improving DRAM Performance
- Multiple accesses to the same row (using the row buffer as a cache)
- Synchronous DRAM (SDRAM): added a clock to the DRAM interface; burst mode with critical word first
- Double data rate (DDR): transfer data on both edges of the clock
- Multiple banks on each DRAM device (up to 8 banks in DDR3)
DDR Evolution
- DDR2: lower power (2.5 V → 1.8 V); higher clock rates (266, 333, 400 MHz)
- DDR3: 1.5 V; 800 MHz
- DDR4: 1–1.2 V; 1600 MHz
- GDRAM (graphics DRAM), e.g. GDDR5 (based on DDR3):
  - Wider bus: 32 bits per chip, 256 bits total
  - Higher clock rates thanks to a soldered (direct) connection to the GPU
The ARM Cortex-A8 (ISA: ARMv7)
- Core technology: IP (intellectual property) core; the dominant core in embedded systems and PMDs
- Can be combined with application-specific processors (e.g., video), I/O, and memory IP cores
- Issues 2 instructions per clock (1 GHz max)
- Hard core: not configurable, better performance
- Soft core: reconfigurable, larger die area
ARM Cortex-A8 Memory Organization
- First level: split I- and D-caches, 16 KB or 32 KB each
  - 4-way set associative with way prediction; virtually indexed, physically tagged
- Second level (optional): 128 KB – 1 MB
  - 8-way set associative, 1–4 banks; physically indexed and tagged
- 64- or 128-bit memory bus (the processor's A64n128 input pin selects the width)
- 64-byte block size
ARM Cortex-A8 Performance
- Instruction cache miss rate: < 1%
- L1 miss penalty: 11 cycles
- L2 miss penalty: 60 cycles
Intel Core i7 (64-bit extension of the 80x86 ISA: x86-64)
- 4 cores (6 or 8 in some generations); 16-stage pipeline
- 4 instructions per clock per core; 2 simultaneous threads per core
- At 3.3 GHz: up to ~50 billion instructions per second
- 3 memory channels: 25 GB/s
- 48-bit virtual address (256 TB), 36-bit physical address (64 GB)
Core i7 Caches and TLBs
- 64-byte block size
- Nonblocking, write-back; L3 is shared among the cores
3rd Generation Intel Core i7
- Per core: L1 I-cache 32 KB, 4-way; L1 D-cache 32 KB, 8-way; L2 256 KB, 8-way
- Shared L3: 8 MB, 16-way
Core i7 Memory Hierarchy (figure)
- Miss penalty for the critical instructions: 135 clocks
- Miss penalty for the entire block: about 180 clocks
- Write-back mechanism
- 10-entry merging write buffers between L3/L2 and L2/L1
- Next-block prefetching supported in L1 and L2
Performance of the i7 L1, L2, and L3 Caches (figure)