Multilevel Memories (Improving performance using a little “cash”)
A four-issue superscalar could execute 400 instructions during a cache miss!
Strategy: hide latency using small, fast memories called caches. Caches are a mechanism to hide memory latency, based on the empirical observation that the stream of memory references made by a processor exhibits locality.
Two locality properties of memory references:
– If a location is referenced, it is likely to be referenced again in the near future. This is called temporal locality.
– If a location is referenced, it is likely that locations near it will be referenced in the near future. This is called spatial locality.
Caches exploit both properties of memory reference patterns:
– Exploit temporal locality by remembering the contents of recently accessed locations.
– Exploit spatial locality by remembering blocks of contents around recently accessed locations.
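A concrete example (illustrative, not from the slides): in the C loop below, the accumulator is reused on every iteration (temporal locality), while the array elements are read at consecutive addresses (spatial locality).

```c
#include <stddef.h>

/* Summing an array: the accumulator is reused every iteration
 * (temporal locality) and the elements are touched at consecutive
 * addresses (spatial locality), so a cache serves this loop very well. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;              /* reused on every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        sum += a[i];               /* consecutive addresses: spatial locality */
    return sum;
}
```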
[Memory hierarchy figure. Annotations: due to cost; due to size of DRAM; due to cost and wire delays (wires on-chip cost much less, and are faster).]
• For high associativity, use a content-addressable memory (CAM) for the tags.
• Overhead? The advantage is low power, because only the data of the hitting way is accessed.
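A rough software model of a CAM-tagged set is sketched below; the sizes (32 ways, 64-byte blocks) and names are assumptions for illustration only. The hardware compares the incoming tag against every stored tag in parallel; here that is a loop, and only the data of the matching way is read, which is the low-power property noted above.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 32                  /* assumed: a highly associative set */

struct cam_set {
    uint32_t tag[NUM_WAYS];          /* tags held in the CAM */
    bool     valid[NUM_WAYS];
    uint8_t  data[NUM_WAYS][64];     /* assumed 64-byte block per way */
};

/* CAM lookup model: hardware matches `tag` against all ways at once;
 * only on a hit is the data array of the matching way accessed. */
const uint8_t *cam_lookup(const struct cam_set *s, uint32_t tag)
{
    for (int w = 0; w < NUM_WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag)
            return s->data[w];       /* read data only for the hit way */
    return 0;                        /* miss */
}
```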
Cache hit:
– write through: write both cache & memory; generally higher traffic, but simplifies cache coherence
– write back: write cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic
Cache miss:
– no write allocate: only write to main memory
– write allocate (a.k.a. fetch on write): fetch the block into the cache
Common combinations (both sketched below):
– write through and no write allocate
– write back with write allocate
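The sketch below illustrates the two common combinations on a single hypothetical cache line; the memory traffic itself is reduced to comments, and all names are made up for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* One cache line, for illustrating the write policies only. */
struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

/* Write back + write allocate: writes go to the cache and set the dirty bit;
 * memory is touched only on allocation and on eviction of a dirty line. */
void write_back_allocate(struct line *l, uint32_t tag, int off, uint8_t byte)
{
    if (!l->valid || l->tag != tag) {     /* write miss */
        /* write allocate: after writing back the old block if dirty,
         * fetch the missing block into the cache (memory traffic omitted) */
        l->valid = true;
        l->dirty = false;
        l->tag   = tag;
    }
    l->data[off] = byte;                  /* write the cache ...         */
    l->dirty = true;                      /* ... and mark the line dirty */
}

/* Write through + no write allocate: memory is always written; the cache
 * is updated only on a hit, and a miss does not allocate a line. */
void write_through_no_allocate(struct line *l, uint32_t tag, int off, uint8_t byte)
{
    if (l->valid && l->tag == tag)
        l->data[off] = byte;              /* hit: also update the cache */
    /* in all cases, write the word to main memory (omitted) */
}
```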
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least Recently Used (LRU)
– LRU cache state must be updated on every access
– true implementation only feasible for small sets (2-way is easy)
– a pseudo-LRU binary tree is often used for 4-8 ways (sketched below)
• First In, First Out (FIFO), a.k.a. Round-Robin
– used in highly associative caches
Replacement policy is a second-order effect. Why?
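A minimal sketch of the tree-based pseudo-LRU mentioned above, for one 4-way set (three tree bits); the bit-encoding convention chosen here is one common option, not the only one.

```c
#include <stdint.h>

/* Tree pseudo-LRU state for a 4-way set: b0 picks between the left pair
 * (ways 0,1) and the right pair (ways 2,3); b1 and b2 pick within a pair.
 * Convention: a bit of 0 means "the next victim is on the left", 1 "right". */
struct plru4 { uint8_t b0, b1, b2; };

/* On every hit or fill, point the bits on the path AWAY from the used way,
 * so the just-used way is not the next victim. */
void plru4_touch(struct plru4 *t, int way)
{
    if (way < 2) { t->b0 = 1; t->b1 = (way == 0); }   /* used the left pair  */
    else         { t->b0 = 0; t->b2 = (way == 2); }   /* used the right pair */
}

/* Victim selection: follow the bits down the tree. */
int plru4_victim(const struct plru4 *t)
{
    if (t->b0 == 0) return t->b1 ? 1 : 0;             /* victim in left pair  */
    else            return t->b2 ? 3 : 2;             /* victim in right pair */
}
```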
• Compulsory: first reference to a block, a.k.a. cold-start misses
– misses that would occur even with an infinite cache
• Capacity: cache is too small to hold all the data needed by the program
– misses that would occur even under a perfect placement & replacement policy
• Conflict: misses that occur because of collisions due to the block-placement strategy
– misses that would not occur with full associativity
Determining the type of a miss requires running program traces on a cache simulator (a sketch follows).
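One way such a simulator can label misses is sketched below, with assumed sizes: a reference is a compulsory miss if its block has never been referenced before, a capacity miss if it also misses in a fully associative LRU cache of the same total capacity, and a conflict miss otherwise. classify_ref() must be called on every reference in trace order, and its result is interpreted only for references on which the cache under study actually missed.

```c
#include <stdint.h>

#define NBLOCKS  1024           /* assumed: blocks in the cache under study */
#define MAX_SEEN (1 << 20)      /* sketch-only bound on distinct blocks     */

static uint64_t seen[MAX_SEEN];                     static int nseen;
static uint64_t fa_tag[NBLOCKS], fa_use[NBLOCKS];   static uint64_t now;

enum ref_class { FIRST_REF /* compulsory */, FA_MISS /* capacity */, FA_HIT /* conflict */ };

/* Classify one block reference against (1) the set of blocks ever seen and
 * (2) a fully associative LRU cache with the same total number of blocks. */
enum ref_class classify_ref(uint64_t block)
{
    int i, hit = 0, victim = 0, first = 1;

    for (i = 0; i < nseen; i++)                     /* ever referenced before? */
        if (seen[i] == block) { first = 0; break; }
    if (first && nseen < MAX_SEEN) seen[nseen++] = block;

    now++;                                          /* probe the full-assoc LRU model */
    for (i = 0; i < NBLOCKS; i++)
        if (fa_use[i] && fa_tag[i] == block) { hit = 1; fa_use[i] = now; break; }
    if (!hit) {                                     /* fill: evict least recently used */
        for (i = 1; i < NBLOCKS; i++)
            if (fa_use[i] < fa_use[victim]) victim = i;
        fa_tag[victim] = block; fa_use[victim] = now;
    }

    return first ? FIRST_REF : (hit ? FA_HIT : FA_MISS);
}
```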
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the miss rate (e.g., larger cache)
• reduce the miss penalty (e.g., L2 cache)
• reduce the hit time
Aside: what is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages.
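A worked example with assumed numbers: with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, AMAT = 1 + 0.05 x 20 = 2 cycles. The helper below just evaluates the formula.

```c
/* AMAT = hit time + miss rate * miss penalty.
 * Example (assumed numbers): amat(1.0, 0.05, 20.0) == 2.0 cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```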
Works very well with a direct-mapped cache
Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.
• Tags are too large, i.e., too much overhead
– Simple solution: larger blocks, but the miss penalty could be large
• Sub-block placement (see the sketch below)
– A valid bit added to units smaller than the full block, called sub-blocks
– Only read a sub-block on a miss
– If a tag matches, is the word in the cache?
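A sketch of a sub-blocked line, assuming 4 sub-blocks of 16 bytes each: one tag covers the whole block, but every sub-block carries its own valid bit, so a matching tag alone does not guarantee that the word is in the cache.

```c
#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4                     /* assumed sub-blocks per block */

struct subblocked_line {
    uint32_t tag;                       /* one tag for the whole block  */
    bool     valid[SUBBLOCKS];          /* one valid bit per sub-block  */
    uint8_t  data[SUBBLOCKS][16];       /* assumed 16-byte sub-blocks   */
};

/* Answer to the question above: the word is present only if the tag
 * matches AND the referenced sub-block's valid bit is set. */
bool present(const struct subblocked_line *l, uint32_t tag, int sub)
{
    return l->tag == tag && l->valid[sub];
}
```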
– Writes take two cycles in the memory stage: one cycle for tag check plus one cycle for data write if hit
– Design a data RAM that can perform read and write in one cycle; restore the old value after a tag miss
– Hold write data for a store in a single buffer ahead of the cache; write the cache data during the next store's tag check
– Need to bypass from the write buffer if a read matches the write buffer tag (sketched below)
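The last two points can be modeled as a one-entry delayed-write buffer, sketched below with stand-in helpers for the data array (all names are hypothetical): a store parks its address and data in the buffer, the previous store's buffered data is written into the data array during this store's tag-check cycle, and loads bypass from the buffer when their address matches it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the cache data array (64 words), for the sketch only. */
static uint32_t data_array[64];
static void     cache_data_write(uint32_t addr, uint32_t d) { data_array[addr % 64] = d; }
static uint32_t cache_data_read (uint32_t addr)             { return data_array[addr % 64]; }

/* One-entry buffer that holds a store's data until the next store's tag check. */
struct delayed_write { bool pending; uint32_t addr; uint32_t data; };
static struct delayed_write wbuf;

void store(uint32_t addr, uint32_t data)
{
    /* This store's cycle: do its tag check (omitted) and, in parallel,
     * retire the PREVIOUS store from the buffer into the data array. */
    if (wbuf.pending)
        cache_data_write(wbuf.addr, wbuf.data);
    wbuf = (struct delayed_write){ .pending = true, .addr = addr, .data = data };
}

uint32_t load(uint32_t addr)
{
    if (wbuf.pending && wbuf.addr == addr)   /* bypass: newest data is in the buffer */
        return wbuf.data;
    return cache_data_read(addr);
}
```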