Chapter 5 Memory Hierarchy Design

2 Many Levels in Memory Hierarchy
Pipeline registers
Register file
1st-level cache (on-chip)
2nd-level cache (on same MCM as CPU)
Physical memory (usually mounted on same board as CPU)
Virtual memory (on hard disk, often in same enclosure as CPU)
Disk files (on hard disk, often in same enclosure as CPU)
Network-accessible disk files (often in the same building as the CPU)
Tape backup/archive system (often in the same building as the CPU)
Data warehouse: robotically accessed room full of shelves of tapes (usually on the same planet as the CPU)
Annotations from the slide: our focus in Chapter 5; usually made invisible to the programmer (even assembly programmers); invisible only to high-level language programmers; there can also be a 3rd (or more) cache level here

3 Simple Hierarchy Example
(Figure: note the many orders-of-magnitude changes in characteristics between levels, with ratios such as ×128, ×8192, ×200, ×4, ×100, and ×50,000 for random access; example sizes include a 2 GB physical memory and a 1 TB disk with roughly 10 ms access time.)

4 Why More on Memory Hierarchy? The processor-memory performance gap keeps growing.

5 Three Types of Misses
Compulsory – The very first access to a block during a program cannot be in the cache (unless pre-fetched)
Capacity – The working set of blocks accessed by the program is too large to fit in the cache
Conflict – Unless the cache is fully associative, blocks may be evicted too early (compared to a fully associative cache) because too many frequently accessed blocks map to the same limited set of frames

6 An Alternative Metric
Average memory access time: T_acc = T_hit + (Miss rate) × T_+miss
The times T_acc, T_hit, and T_+miss can be either:
–Real time (e.g., nanoseconds) or a number of clock cycles
–T_+miss means the extra (not total) time (or cycles) for a miss, in addition to T_hit, which is incurred by all accesses
(Figure: CPU, cache, and the lower levels of the hierarchy; hit time is measured at the cache, miss penalty at the lower levels.)
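For a quick worked example with assumed numbers (not from the slides): if T_hit = 1 cycle, the miss rate is 5%, and T_+miss = 100 cycles, then T_acc = 1 + 0.05 × 100 = 6 cycles, so the average access takes six times as long as a pure hit.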

7 Multiple-Level Caches
Avg mem access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
Plugging the second equation into the first:
–Avg mem access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
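A minimal sketch of this calculation in C; the function name and the example parameters are illustrative assumptions, not values from the slides:

#include <stdio.h>

/* Two-level average memory access time, per the formulas above.
 * l2_miss_rate is the local miss rate of L2 (misses per L1 miss). */
static double amat_two_level(double l1_hit, double l1_miss_rate,
                             double l2_hit, double l2_miss_rate,
                             double l2_miss_penalty)
{
    double l1_miss_penalty = l2_hit + l2_miss_rate * l2_miss_penalty;
    return l1_hit + l1_miss_rate * l1_miss_penalty;
}

int main(void)
{
    /* Assumed numbers: 1-cycle L1 hit, 5% L1 miss rate,
     * 10-cycle L2 hit, 20% L2 local miss rate, 200-cycle memory access. */
    printf("AMAT = %.2f cycles\n",
           amat_two_level(1.0, 0.05, 10.0, 0.20, 200.0));
    return 0;
}

With these numbers the L1 miss penalty is 10 + 0.2 × 200 = 50 cycles, giving an AMAT of 1 + 0.05 × 50 = 3.5 cycles.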

Eleven Advanced Optimizations of Cache Performance

1. Small & simple caches - Reduce hit time
Smaller is faster; the main hit-time cost is indexing
Keep L2 small enough to fit on the processor chip
Direct mapping is simple: the tag check can overlap with data transmission
CACTI - simulates the impact on hit time, e.g., Fig 5.4: access time vs. size & associativity
Fig 5.4 suggests: a direct-mapped cache has a faster hit time than a 2-way set-associative one, 2-way is faster than 4-way, and 4-way is faster than fully associative

Access Time versus Cache Size

2. Way prediction - Reduce hit time
Extra bits kept in the cache predict the way (the block within the set) of the next cache access
Set the multiplexor early to select the desired block
Do a single tag comparison that cycle, in parallel with reading the cache data
On a misprediction, check the other blocks for matches in the next cycle
Saves pipeline stages; prediction accuracy is about 85% of accesses for 2-way
==> Good match for speculative processors
Pentium 4 uses way prediction
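A simplified software model of a way-predicted 2-way lookup may help; the structure layout, sizes, and function name here are illustrative assumptions, not hardware from the slides:

#define NUM_SETS 256

struct set {
    unsigned tag[2];        /* tag stored in each way */
    int      valid[2];
    int      predicted_way; /* the extra prediction bits kept per set */
};

static struct set cache[NUM_SETS];

/* Returns the matching way (0 or 1), or -1 on a miss.
 * A correct prediction needs one tag compare; a misprediction costs
 * an extra cycle to check the other way, as described above. */
int lookup(unsigned set_index, unsigned tag)
{
    struct set *s = &cache[set_index];
    int w = s->predicted_way;

    if (s->valid[w] && s->tag[w] == tag)
        return w;                        /* fast hit: prediction was right */

    w ^= 1;                              /* next cycle: check the other way */
    if (s->valid[w] && s->tag[w] == tag) {
        s->predicted_way = w;            /* update the prediction bits */
        return w;
    }
    return -1;                           /* cache miss */
}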

3. Trace caches - Reduce hit time
ILP challenge: finding enough instructions to execute every cycle without dependencies
Trace cache - stores dynamic traces of executed instructions, not static sequences of instructions from memory
Branch prediction is folded into the instruction cache
More complicated address mapping, but better use of long blocks
Disadvantage: conditional branches making different choices put the same instructions into separate traces
Pentium 4 uses a trace cache of decoded micro-instructions

4. Pipelined cache access - Increase cache bandwidth
Pipeline the cache access, so the effective latency of an L1 cache hit is multiple clock cycles
Result: a fast clock and high bandwidth, but slow hits
Pentium 4: an L1 cache hit takes 4 cycles
Cost: an increased number of pipeline stages

5. Nonblocking cache (hit under miss) - Increase cache bandwidth
With out-of-order completion, the processor need not stall on a cache miss; it continues fetching instructions while waiting for the cache data
If the cache does not block, it can supply data for hits while processing a miss
Reduces the effective miss penalty
Overlap multiple misses? Called "hit under multiple misses" or "miss under miss"; requires the memory to service multiple misses simultaneously

6. Multi-banked caches - Increase cache bandwidth
Independent banks support simultaneous accesses (originally used in main memory)
AMD Opteron has 2 banks of L2; Sun Niagara has 4 banks of L2
Works best when accesses are spread across the banks
Spread block addresses sequentially across the banks: sequential interleaving
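A minimal sketch of sequential interleaving; the block size, bank count, and function name are assumptions for illustration:

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumed) */
#define NUM_BANKS  4    /* e.g., a 4-bank L2 as in Sun Niagara */

/* Sequential interleaving: consecutive block addresses map to
 * consecutive banks, so streams of nearby accesses spread out. */
static inline unsigned bank_of(uint64_t addr)
{
    uint64_t block_addr = addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}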

7. Critical word first, Early restart - Reduce miss penalty
The processor needs only 1 word of the block at a time, so give it what it needs first
How is the block retrieved from memory?
Critical word first - fetch the requested word first, return it to the processor, then continue the memory transfer for the rest of the block
Early restart - fetch the block in normal order, but return the requested word to the processor as soon as it arrives
Benefits only with large blocks. Why?
Disadvantage with spatial locality. Why? The next access is likely to be to the rest of the block, which may still be in transit, so the miss penalty is hard to estimate
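A minimal sketch of the word-return order under critical word first, assuming an 8-word block (the block size and function name are illustrative assumptions):

#define WORDS_PER_BLOCK 8

/* Fill 'order' with the sequence in which the words of a block come back
 * when the processor requested word 'critical': the requested word is
 * returned first, then the transfer wraps around the rest of the block. */
void critical_word_first_order(unsigned critical, unsigned order[WORDS_PER_BLOCK])
{
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        order[i] = (critical + i) % WORDS_PER_BLOCK;
}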

8. Merge write buffers - Reduce miss penalty
Write-through relies on write buffers: all stores are sent to the lower level
Write-back uses a simple buffer when a block is replaced
Case: write buffer is empty - data & addresses are written from the cache block to the buffer, and the cache considers the write done
Case: write buffer already contains modified blocks - is this block already in the write buffer? If so, write merging combines the newly modified data with the buffer contents
Case: buffer full & no address match - must wait for an empty buffer entry
Write merging uses memory more efficiently via multi-word writes
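A minimal software sketch of write merging; the entry layout, sizes, and function name are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_ENTRY 4   /* each entry covers one aligned 4-word region */
#define NUM_ENTRIES     4

struct wb_entry {
    bool     valid;
    uint64_t block_addr;               /* number of the aligned region */
    uint64_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

static struct wb_entry write_buffer[NUM_ENTRIES];

/* Try to place a 1-word store in the buffer.
 * Returns true on success, false if the buffer is full with no address
 * match (the processor would then wait for an entry to drain). */
bool buffer_store(uint64_t addr, uint64_t value)
{
    uint64_t block = addr / (WORDS_PER_ENTRY * sizeof(uint64_t));
    unsigned word  = (addr / sizeof(uint64_t)) % WORDS_PER_ENTRY;

    /* Write merging: if the region is already buffered, combine with it. */
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }
    }
    /* Otherwise allocate a free entry. */
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (!write_buffer[i].valid) {
            write_buffer[i].valid = true;
            write_buffer[i].block_addr = block;
            for (int w = 0; w < WORDS_PER_ENTRY; w++)
                write_buffer[i].word_valid[w] = false;
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }
    }
    return false;   /* buffer full and no match: must wait */
}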

9. Compiler optimizations - Reduce miss rate
Compiler research has improved both instruction misses and data misses
Optimizations include code & data rearrangement:
–Reorder procedures - might reduce conflict misses
–Align basic blocks to the beginning of a cache block - decreases the chance of a cache miss
–Branch straightening - change the sense of the branch test and swap the basic blocks of the branch
–Data - arrange data to improve spatial & temporal locality, e.g., process arrays block by block

9. Compiler optimizations - Reduce miss rate
Loop interchange - make the code access data in the order it is stored, e.g.:

/* Before, stride 100 */
for (j = 0; j < 100; j++)
    for (i = 0; i < 500; i++)
        x[i][j] = 2 * x[i][j];

/* After, stride 1 */
for (i = 0; i < 500; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

vs. blocking, e.g., for Gaussian elimination? (See the sketch below.)
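As a rough sketch of blocking, here is a blocked matrix multiply rather than the Gaussian elimination the slide alludes to; the matrix size N, the tile size B, and the function name are illustrative assumptions. The idea is to operate on B×B sub-blocks small enough to stay in the cache, so each tile of y and z is reused before being evicted:

#define N 512
#define B 32   /* blocking (tile) factor, chosen so the tiles fit in cache */

/* Computes x += y * z; x is assumed to be zero-initialized by the caller. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate this tile's partial products */
                }
}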

10. Hardware prefetching of instructions & data - Reduce miss penalty or miss rate
Prefetch instructions and data before the processor requests them
Fetching whole blocks already does this to a degree; on an instruction miss, also fetch the next block (block prediction?)
Data accesses are prefetched similarly; multiple streams? e.g., matrix * matrix
Pentium 4 can prefetch data into L2 from 8 streams out of 8 different 4 KB pages

11. Compiler-controlled prefetch - Reduce miss penalty or miss rate
The compiler inserts instructions to prefetch data, either into a register or into the cache
Faulting or nonfaulting? Should a prefetch be allowed to cause a page fault or a memory protection fault?
Here, assume a nonfaulting cache prefetch: it does not change the contents of registers or memory and does not cause a memory fault
Goal: overlap execution with prefetching
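As a minimal sketch of what the inserted prefetches might look like in source form, using GCC's __builtin_prefetch intrinsic; the loop, array, and prefetch distance are illustrative assumptions:

#define DIST 16   /* prefetch distance in elements (assumed) */

/* Prefetch a[i + DIST] into the cache while working on a[i], so the
 * memory access overlaps with execution of earlier iterations. */
void scale(double *a, long n, double k)
{
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1); /* read, low temporal locality */
        a[i] *= k;
    }
}

The prefetch distance DIST would normally be tuned so that the data arrives just before it is used, balancing miss latency against loop iteration time.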