Chapter 2: Memory Hierarchy Design. Computer Architecture: A Quantitative Approach, Fifth Edition. Copyright © 2012, Elsevier Inc. All rights reserved.


1 Chapter 2: Memory Hierarchy Design. Computer Architecture: A Quantitative Approach, Fifth Edition. Copyright © 2012, Elsevier Inc. All rights reserved.

2 Copyright © 2012, Elsevier Inc. All rights reserved. Introduction Programmers want very large memory with low latency. Fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest level; incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensure that nearly all references can be found in the smaller memories. The result gives the illusion of a large, fast memory being presented to the processor. Introduction

3 Memory Hierarchy: processor, L1 cache, L2 cache, L3 cache, main memory, hard drive or flash. Latency and capacity (KB, MB, GB, TB) both grow at each level moving away from the processor.

4 Typical organization: per-core L1 I-cache and D-cache, per-core unified L2 cache, shared L3 cache, then main memory. Intel Core i7: 2 data memory accesses per clock per core; with a 3.2 GHz clock and 4 cores, that is 25.6 × 10^9 64-bit data accesses plus 12.8 × 10^9 128-bit instruction accesses per second, roughly 409.6 GB/s of demand on the caches, against about 25 GB/s of DRAM bandwidth. This is sustained by multi-port, pipelined caches, independent per-core caches, and a shared third-level cache.
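The peak-bandwidth figure follows from arithmetic on the numbers above (a worked check, not from the slide itself):

$$
25.6\times10^{9}\,\tfrac{\text{accesses}}{\text{s}}\times 8\,\text{B} \;+\; 12.8\times10^{9}\,\tfrac{\text{accesses}}{\text{s}}\times 16\,\text{B} \;=\; 204.8 + 204.8 \;=\; 409.6\ \text{GB/s}
$$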

5 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Introduction

6 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Performance Gap Introduction

7 Copyright © 2012, Elsevier Inc. All rights reserved. Performance and Power High-end microprocessors have >10 MB of on-chip cache, which consumes a large share of the area and power budget. Static power: leakage. Dynamic power: reads and writes. Caches account for 25%-50% of power consumption in PMDs. Introduction

8 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Basics When a word is not found in the cache, a miss occurs: fetch the word from a lower level in the hierarchy, requiring a higher-latency reference; the lower level may be another cache or main memory. Also fetch the other words contained within the block, which takes advantage of spatial locality. Place the block anywhere within its set; the set is determined by the address: (Block address) MOD (Number of sets). Introduction

9 Placement Problem Main Memory Cache Memory

10 Placement Policies WHERE to put a block in cache Mapping between main and cache memories. Main memory has a much larger capacity than cache memory.

11 Fully Associative Cache: a block can be placed in any location in the cache.

12 Direct-Mapped Cache: (Block address) MOD (Number of blocks in cache), e.g. 12 MOD 8 = 4. A block can be placed in ONLY a single location in the cache.

13 Set-Associative Cache: (Block address) MOD (Number of sets in cache), e.g. 12 MOD 4 = 0, so block 12 maps to set 0. A block can be placed in one of n locations in an n-way set-associative cache.
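A small sketch tying the three placement policies together for the block address 12 used in the examples above (the 8-frame, 4-set cache parameters are illustrative):

```c
#include <stdio.h>

int main(void) {
    const unsigned block_addr   = 12;   /* example block address from the slides        */
    const unsigned cache_blocks = 8;    /* 8 block frames in the cache                   */
    const unsigned num_sets     = 4;    /* 2-way set associative: 8 frames / 2 = 4 sets  */

    printf("Direct mapped:     frame %u\n", block_addr % cache_blocks);  /* 12 MOD 8 = 4 */
    printf("Set associative:   set %u (either way within it)\n", block_addr % num_sets); /* 12 MOD 4 = 0 */
    printf("Fully associative: any of the %u frames\n", cache_blocks);
    return 0;
}
```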

14 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Basics n blocks per set => n-way set associative. Direct-mapped cache => one block per set. Fully associative => one set. Writing to the cache: two strategies. Write-through: immediately update lower levels of the hierarchy. Write-back: only update lower levels of the hierarchy when an updated block is replaced. Both strategies use a write buffer to make writes asynchronous. Introduction

15 Dirty bit(s) Indicates whether the block has been written to. Not needed in I-caches. Not needed in a write-through D-cache. A write-back D-cache needs it.

16 Write back: CPU writes update only the cache; the dirty (D) bit marks modified blocks, which are copied to main memory when they are replaced.

17 Write through: CPU writes update both the cache and main memory.
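A minimal sketch contrasting the two policies above, using a toy cache-line structure (the names and the single-level model are illustrative, not from the slides):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;      /* meaningful only for write-back */
    uint32_t tag;
    uint8_t  data[64];   /* 64-byte block */
} CacheLine;

/* Write-through: every store also updates main memory, so no dirty bit is needed. */
static void store_write_through(CacheLine *line, uint32_t offset, uint8_t value,
                                uint8_t *main_memory, uint32_t mem_addr) {
    line->data[offset]    = value;
    main_memory[mem_addr] = value;        /* memory is always kept up to date */
}

/* Write-back: only the cache is updated; memory is updated when the line is evicted. */
static void store_write_back(CacheLine *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty        = true;            /* remember that this block must be written back */
}

static void evict(CacheLine *line, uint8_t *main_memory, uint32_t block_addr) {
    if (line->dirty) {                    /* write-back traffic happens only here */
        for (int i = 0; i < 64; i++)
            main_memory[block_addr + i] = line->data[i];
        line->dirty = false;
    }
    line->valid = false;
}
```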

18 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Basics Miss rate: fraction of cache accesses that result in a miss. Causes of misses (the three Cs): Compulsory: first reference to a block; unrelated to cache size. Capacity: blocks discarded and later retrieved. Conflict: occurs in non-fully-associative caches when the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache. Introduction

19 Cache miss statistics Copyright © 2012, Elsevier Inc. All rights reserved.

20 Speculative and multithreaded processors may execute other instructions during a miss Reduces performance impact of misses The cache must service requests while handling a miss Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Basics Introduction

21 Cache organization: the CPU address is divided into tag, index, and block-offset fields; the index selects a set, the valid bit and stored tag of each block in the set are read, the stored tags are compared (=) with the address tag, and a MUX delivers the matching data to the CPU.
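A short sketch of the address split described above, for an assumed cache with 64-byte blocks and 128 sets (parameters chosen for illustration only):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t addr = 0x12345678;     /* example 32-bit address            */
    const unsigned offset_bits = 6;       /* log2(64)  -> 64-byte block offset */
    const unsigned index_bits  = 7;       /* log2(128) -> 128 sets             */

    uint32_t offset = addr & ((1u << offset_bits) - 1);
    uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}
```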

22 64 KB two-way set-associative cache (AMD Opteron), 64-byte blocks, with a write buffer.

23 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Hierarchy Basics Six basic cache optimizations (each attacks a term of the average memory access time equation shown below): Larger block size: reduces compulsory misses, but increases capacity and conflict misses and increases the miss penalty. Larger total cache capacity: reduces the miss rate, but increases hit time and power consumption. Higher associativity: reduces conflict misses, but increases hit time and power consumption. More cache levels: reduce overall memory access time; a design compromise between cache and block size, hit time, and power. Giving priority to read misses over writes (checking the write buffer on a read miss): reduces the miss penalty. Avoiding address translation during cache indexing (using part of the page offset of the virtual address to index the cache): reduces hit time. Introduction
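For reference, the average memory access time equation that these optimizations act on:

$$
\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
$$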

24 Copyright © 2012, Elsevier Inc. All rights reserved.

25 Copyright © 2012, Elsevier Inc. All rights reserved. Ten Advanced Optimizations Optimization categories: reducing the hit time, increasing cache bandwidth, reducing the miss penalty, reducing the miss rate, and reducing the miss penalty or miss rate via parallelism. Advanced Optimizations

26 1) Small and simple L1 caches Limited size and lower associativity result in lower hit time. Critical timing path: addressing the tag memory using the index portion of the address, comparing the read tags with the address tag, and setting the multiplexor to select the correct block within the set. Direct-mapped caches can overlap the tag compare with the transmission of data. Lower associativity also reduces power because fewer cache lines are accessed. Copyright © 2012, Elsevier Inc. All rights reserved.

27 Copyright © 2012, Elsevier Inc. All rights reserved. L1 Size and Associativity Access time vs. size and associativity: 2-way is about 1.2 times faster than 4-way, and 4-way is about 1.4 times faster than 8-way. Advanced Optimizations

28 Copyright © 2012, Elsevier Inc. All rights reserved.

29 Copyright © 2012, Elsevier Inc. All rights reserved. L1 Size and Associativity Energy per read vs. size and associativity Advanced Optimizations

30 Three reasons for higher associativity in L1: (1) Many processors take at least 2 clock cycles to access the cache, which makes a slightly longer hit time less important. (2) With virtual indexing, the cache size is limited to the page size times the associativity, so higher associativity allows a larger cache. (3) Multithreading increases conflict misses, which hurts more at lower associativities. Copyright © 2012, Elsevier Inc. All rights reserved.

31 Copyright © 2012, Elsevier Inc. All rights reserved. 2) Way Prediction To improve hit time, predict the way and pre-set the mux; a misprediction gives a longer hit time. Prediction accuracy: > 90% for two-way, > 80% for four-way; the I-cache has better accuracy than the D-cache. First used on the MIPS R10000 in the mid-90s; used on 4-way caches in the ARM Cortex-A8. Extending the idea to predict the block actually accessed ("way selection") reduces energy to about 0.28x (I-cache) and 0.35x (D-cache) of a normal cache, but increases the misprediction penalty, raising average access time to 1.04x (I-cache) and 1.13x (D-cache). Advanced Optimizations
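A minimal sketch of the way-prediction idea, assuming a toy 2-way cache with one predicted-way entry per set (all names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128
#define WAYS     2

typedef struct {
    bool     valid;
    uint32_t tag;
} Line;

static Line    cache[NUM_SETS][WAYS];
static uint8_t predicted_way[NUM_SETS];    /* per-set prediction bits */

/* Returns true on a hit; *extra_cycle is set when the predicted way was wrong
   and the remaining ways had to be probed (the misprediction penalty). */
static bool lookup(uint32_t set, uint32_t tag, bool *extra_cycle) {
    uint8_t guess = predicted_way[set];
    *extra_cycle = false;

    if (cache[set][guess].valid && cache[set][guess].tag == tag)
        return true;                       /* fast hit: only the predicted way probed */

    *extra_cycle = true;                   /* mispredicted: check the other ways */
    for (uint8_t w = 0; w < WAYS; w++) {
        if (w == guess) continue;
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            predicted_way[set] = w;        /* train the predictor on the hitting way */
            return true;
        }
    }
    return false;                          /* miss */
}
```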

32 Copyright © 2012, Elsevier Inc. All rights reserved.

33 Copyright © 2012, Elsevier Inc. All rights reserved. 3) Pipelining Cache Pipeline cache access to improve bandwidth Examples: Pentium: 1 cycle Pentium Pro – Pentium III: 2 cycles Pentium 4 – Core i7: 4 cycles Increases branch misprediction penalty Makes it easier to increase associativity Advanced Optimizations

34 4) Nonblocking Caches Allow hits before previous misses complete: "hit under miss" and "hit under multiple misses". The Core i7 supports both; the ARM Cortex-A8 supports only the first. Hit under miss reduces the effective miss penalty by about 9% for SPECINT2006 and 12.5% for SPECFP2006. The miss-penalty reduction is hard to analyze because of the overlap in time between hits and misses. Advanced Optimizations

35 Copyright © 2012, Elsevier Inc. All rights reserved. For integer programs

36 Copyright © 2012, Elsevier Inc. All rights reserved. 5) Multibanked Caches Organize cache as independent banks to support simultaneous access ARM Cortex-A8 supports 1-4 banks for L2 Intel i7 supports 4 banks for L1 and 8 banks for L2 Interleave banks according to block address Advanced Optimizations

37 6) Critical Word First, Early Restart Critical word first Request missed word from memory first Send it to the processor as soon as it arrives Early restart Request words in normal order Send missed word to the processor as soon as it arrives Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched Advanced Optimizations

38 Copyright © 2012, Elsevier Inc. All rights reserved. 7) Merging Write Buffer When storing a word of a block that is already pending in the write buffer, update the existing write-buffer entry. Reduces stalls due to a full write buffer. Does not apply to I/O addresses. (Figure: buffer contents without and with write merging.) Advanced Optimizations

39 Copyright © 2012, Elsevier Inc. All rights reserved. 8) Compiler Optimizations Loop interchange: swap nested loops to access memory in sequential order; improves spatial locality. Example: x[5000][100], stored in row-major order (see the sketch below). Advanced Optimizations
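A sketch along the lines of the textbook's loop-interchange example, assuming x is the 5000 x 100 array mentioned above, stored row-major as in C:

```c
#define ROWS 5000
#define COLS 100

/* Before: the inner loop strides through memory in steps of COLS words,
   giving poor spatial locality. */
void scale_before(int x[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: consecutive iterations touch consecutive words of a row. */
void scale_after(int x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```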

40 Blocking Instead of accessing entire rows or columns, subdivide the matrices into blocks. Reduces memory accesses by improving spatial and temporal locality; useful when the access pattern goes along both rows and columns (e.g., matrix multiply). Roughly 2N³ + N² memory words accessed without blocking versus 2N³/B + N² with blocking factor B (see the sketch below).
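A sketch of a blocked matrix multiply in the style of the textbook example (y += x · z for N x N matrices of doubles; for simplicity the blocking factor B is assumed to divide N evenly):

```c
#define N 512
#define B 64        /* blocking factor */

void dgemm_blocked(double y[N][N], const double x[N][N], const double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += x[i][k] * z[k][j];   /* works within a B x B subblock of z */
                    y[i][j] += r;                 /* accumulate the partial product */
                }
}
```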

41 Copyright © 2012, Elsevier Inc. All rights reserved. 9) Hardware Prefetching Fetch two blocks on a miss: the requested block and the next sequential block. Put the prefetched block in a stream buffer; 8 stream buffers can capture about 60% of cache misses. The Intel Core i7 supports next-block prefetching in L1 and L2. (Figure: Pentium 4 prefetching speedups.) Advanced Optimizations

42 Copyright © 2012, Elsevier Inc. All rights reserved. 10) Compiler Prefetching Insert prefetch instructions before the data is needed. Non-faulting: the prefetch does not cause exceptions. Register prefetch loads the data into a register; cache prefetch loads it only into the cache. Loops are the main targets for compiler prefetching. Advanced Optimizations

43 For the textbook's example loop (sketched below): the "a" accesses cause 3 * 100 / 2 = 150 misses; the "b" accesses (b[j][0] and b[j+1][0], which miss only while i = 0) cause 100 + 1 = 101 misses; the whole loop therefore takes 251 misses. Copyright © 2012, Elsevier Inc. All rights reserved.
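The miss counts above refer to the textbook's compiler-prefetching example, roughly the following loop (a is 3 x 100 and b is 101 x 3, arrays of doubles):

```c
double a[3][100];
double b[101][3];

/* Original loop: a is walked along rows (good spatial locality), while b[j][0]
   and b[j+1][0] land in a new block for almost every j during the i == 0 pass. */
void compute(void) {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 100; j++)
            a[i][j] = b[j][0] * b[j + 1][0];
}
```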

44 Revised version with prefetch instructions (sketched below).
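A sketch of the revised loop, following the textbook's version, which prefetches data seven iterations ahead; __builtin_prefetch stands in for the non-faulting prefetch instruction (the last few b prefetches compute addresses just past the array, as in the textbook code, relying on prefetches being non-faulting):

```c
double a[3][100];
double b[101][3];

void compute_prefetched(void) {
    /* First loop: prefetch both b[j+7][0] and a[0][j+7] while computing row 0. */
    for (int j = 0; j < 100; j++) {
        __builtin_prefetch(&b[j + 7][0]);
        __builtin_prefetch(&a[0][j + 7]);
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    /* Remaining rows: b is now resident, so only a needs prefetching. */
    for (int i = 1; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            __builtin_prefetch(&a[i][j + 7]);
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}
```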

45 The original loop executes 3 * 100 = 300 iterations, taking 300 * 7 = 2100 cycles, and its 251 misses take 251 * 100 = 25,100 cycles, for a total execution time of 27,200 cycles. First prefetch loop: 100 iterations * 9 cycles = 900 cycles, plus 11 misses * 100 = 1100 cycles, giving 2000 cycles. Second loop: 2 * 100 = 200 iterations * 8 cycles = 1600 cycles, plus 8 misses * 100 = 800 cycles, giving 2400 cycles. Revised loops: 4400 cycles in total, so the revised code is 27,200 / 4400 = 6.2 times faster.

46 Cache Optimization Summary Advanced Optimizations

47 Copyright © 2012, Elsevier Inc. All rights reserved. Memory Technology Performance metrics: latency is the main concern of caches; bandwidth is the main concern of multiprocessors and I/O, and with larger cache blocks bandwidth matters for caches as well. Access time: time between a read request and when the desired word arrives. Cycle time: minimum time between unrelated requests to memory. DRAM is used for main memory, SRAM for caches, and flash memory for PMDs. Memory Technology

48 Memory Technology SRAM: access time is close to cycle time; requires low power to retain the bit; requires 6 transistors per bit; the largest L3 SRAMs are about 12 MB; roughly 4 times faster than DRAM. DRAM: must be re-written after being read, so the cycle time is longer than the access time; must also be periodically refreshed (every ~8 ms, taking about 5% of memory time); an entire row can be refreshed simultaneously; one transistor per bit. Address lines are multiplexed because of the large size: the upper half of the address is the row access strobe (RAS), the lower half is the column access strobe (CAS). Memory Technology

49 Internal organization of DRAM. Dual inline memory modules (DIMMs) contain 4-16 DRAM chips and are 8 bytes wide.

50 Copyright © 2012, Elsevier Inc. All rights reserved. Memory performance improvement Memory Technology

51 Improving DRAM performance Multiple accesses to the same row (using the row buffer as a cache). Synchronous DRAM (SDRAM): adds a clock to the DRAM interface; burst mode with critical word first. Wider buses (16-bit buses in DDR2 and DDR3). Double data rate (DDR): transfers data on both edges of the clock. Multiple banks on each DRAM device (up to 8 banks in DDR3). Memory Technology

52 Copyright © 2012, Elsevier Inc. All rights reserved. DDR evolution DDR2: lower power (2.5 V -> 1.8 V), higher clock rates (266 MHz, 333 MHz, 400 MHz). DDR3: 1.5 V, up to 800 MHz. DDR4: 1-1.2 V, up to 1600 MHz. GDRAM (graphics DRAM), e.g. GDDR5 (based on DDR3): wider 32-bit bus and higher clock rates enabled by a soldered (direct) connection to the GPU. Memory Technology
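A rough worked example of what these clock rates mean for bandwidth (assuming a standard 8-byte-wide DIMM channel; the figures are nominal peak rates, not from the slides): a DDR3 part clocked at 800 MHz transfers on both clock edges, i.e. 1600 million transfers per second, so

$$
1600 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 8\ \text{B} = 12.8\ \text{GB/s per channel}
$$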

53 Reducing power in SDRAMs Lower voltage Low power mode (ignores clock, continues to refresh) Takes 200 clocks to recover from low power mode Memory Technology

54 Copyright © 2012, Elsevier Inc. All rights reserved. Flash Memory A type of EEPROM, used as backup storage in PMDs. Must be erased (in blocks) before being overwritten. Non-volatile. Limited number of write cycles (typically around 100,000). Cheaper than SDRAM, more expensive than disk: about $2/GB for flash, $30/GB for SDRAM, $0.09/GB for hard disk. Slower than SDRAM (roughly 1/4 the speed), but much faster than disk (roughly 1000x). May eventually replace disks. Memory Technology

55 Memory Dependability Memory is susceptible to cosmic rays. Soft errors (dynamic errors): detected by parity (1 bit per byte), or detected and corrected by error-correcting codes (ECC) (1 byte per 8 bytes). Hard errors (permanent errors): use spare rows to replace defective rows. Chipkill: a RAID-like error-recovery technique that distributes the data and ECC across memory chips; used in very large cluster computers such as Google's. IBM analysis of a 10,000-processor server with 4 GB of memory per processor over 3 years: parity only, about 90,000 unrecovered failures (1 every 17 min); ECC only, 3500 unrecovered failures (1 every 7.5 hr); Chipkill, 6 unrecovered failures (1 every 2 months). Memory Technology
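A minimal sketch of the simplest of these schemes, one even-parity bit per byte, which detects (but cannot correct) any single-bit error:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Even parity: the parity bit makes the total number of 1 bits even. */
static uint8_t parity_bit(uint8_t byte) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1u;
    return p;                                  /* 1 if the byte holds an odd number of 1s */
}

int main(void) {
    uint8_t stored  = 0x5A;
    uint8_t p       = parity_bit(stored);      /* kept alongside the data byte  */
    uint8_t readout = stored ^ 0x10;           /* simulate a single flipped bit */

    bool error_detected = (parity_bit(readout) != p);
    printf("error detected: %s\n", error_detected ? "yes" : "no");
    return 0;
}
```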

56 Protection via Virtual Memory Virtual memory keeps processes in their own memory space. Role of the architecture: provide user mode and supervisor mode; protect certain parts of the CPU state (user/supervisor mode bit, exception enable/disable, memory protection information); provide mechanisms for switching between user mode and supervisor mode (system calls); provide mechanisms to limit memory accesses between processes; provide a TLB to translate addresses and enforce protection. Protection scheme: per-page read/write permissions and control of code execution from a memory page. Virtual Memory and Virtual Machines

57 Protection via Virtual Machines Supports isolation and security, motivated by the low security and reliability of standard OSes and by sharing a computer among many unrelated users; enabled by the raw speed of processors, which makes the overhead more acceptable. Allows different ISAs and operating systems to be presented to user programs ("system virtual machines": IBM VM/370, VMware ESX, Xen). The SVM software is called the "virtual machine monitor" or "hypervisor"; the individual virtual machines running under the monitor are called "guest VMs". The VMM is much smaller than a full OS, which improves security; the guest OS runs as a user process. The IBM 370 architecture is virtualizable, but the original 80x86 and several RISC architectures are not. Virtual Memory and Virtual Machines

58 Copyright © 2012, Elsevier Inc. All rights reserved. Impact of VMs on Virtual Memory Each guest OS maintains its own set of page tables. The VMM adds a level of memory between physical and virtual memory, called "real memory", and maintains a shadow page table that maps guest real addresses to physical addresses. The shadow page table is updated only by the VMM, which therefore must detect the guest's changes to its own page table; this occurs naturally if accessing the page table pointer is a privileged operation. Virtual Memory and Virtual Machines

59 Crosscutting issues Protection & ISA ISA should be changed to reduce the cost of virtualization Coherency of cached data I/O cache coherency Hardware or software checking of I/O addresses Invalidating the cache data if I/O is performed Multi-processor cache coherency Will be covered soon Copyright © 2012, Elsevier Inc. All rights reserved.

60 The ARM Cortex-A8 ISA: ARMv7. Core technology: IP (intellectual property); a dominant core in embedded systems and PMDs. Can be combined with application-specific processors (e.g., video), I/O, and memory IP cores. 2 instructions per clock, up to 1 GHz. Hard core: not configurable, better performance. Soft core: reconfigurable, larger die size.

61 ARM Cortex-A8 memory organization First-level cache: 16 KB or 32 KB, separate I and D caches, 4-way set associative with way prediction, virtually indexed and physically tagged. Second-level cache (optional): 128 KB - 1 MB, 8-way set associative, 1-4 banks, physically indexed and tagged. 64- or 128-bit memory bus; the processor's primary input pin A64n128 determines the width. 64-byte block size. Copyright © 2012, Elsevier Inc. All rights reserved.

62

63 ARM Cortex-A8 memory organization 2 TLBs (for I & D) Fully associative 32 entries Variable page size (4, 16, 64 KB, 1, 16 MB) 32 bit virtual address translation (32 KB L1, 1 MB L2, 16 KB page)

64 ARM Cortex-A8 Performance Copyright © 2012, Elsevier Inc. All rights reserved. L1 miss penalty: 11 cycles L2 miss penalty: 60 cycles Instruction cache miss rate: < 1%

65 Intel Core i7 64-bit extension of the 80x86 ISA (x86-64). 4 cores, 4 instructions per clock per core, 16-stage pipeline, 2 simultaneous threads per core. At 3.3 GHz, up to about 50 billion instructions per second. 3 memory channels: 25 GB/s. 48-bit virtual addresses and 36-bit physical addresses (up to 64 GB of physical memory). Copyright © 2012, Elsevier Inc. All rights reserved.

66 Core i7 Caches and TLBs 64-byte block size; caches are nonblocking and write-back; L3 is shared. Copyright © 2012, Elsevier Inc. All rights reserved.

67 3rd Generation Intel Core i7: four cores, each with a 32 KB 4-way L1 I-cache, a 32 KB 4-way L1 D-cache, and a 256 KB private L2; an 8 MB L3 is shared by all cores.

68 Core-i7 memory hierarchy

69 Core i7 memory hierarchy Miss penalty for the critical word: 135 clocks; miss penalty for the entire block: 180 clocks. Write-back mechanism. A 10-entry merging write buffer sits between L3 and L2 and between L2 and L1. Prefetching of the next block is supported in L1 and L2.

70 Performance of i7 L1 cache

71 Performance of i7 L2, L3 caches