COMPSYS 304 Computer Architecture: Cache
John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland
(Iolanthe at 13 knots on Cockburn Sound, WA)

Memory Bottleneck
- State-of-the-art processor: f = 3 GHz, t_clock ≈ 330 ps, 1-2 instructions per cycle, ~25% of instructions reference memory
- Memory response needed: one reference every ~4 instructions, so ~4 instructions × 330 ps ≈ 1.3 ns!
- Bulk semiconductor RAM: 100 ns+ for a 'random' access!
→ The processor will spend most of its time waiting for memory!

Cache
- Small, fast memory: typically ~50 kbytes (1998), 2-cycle access time, on the same die as the processor
- "Off-chip" cache also possible: a custom cache chip closely coupled to the processor
- Uses fast static RAM (SRAM) rather than slower dynamic RAM
- Several levels are possible
- The 2nd level of the memory hierarchy
- "Caches" the most recently used memory locations "closer" to the processor; closer = closer in time

Cache Etymology
- cacher (French) = "to hide"
- Transparent to a program: programs simply run slower without it
- Modern processors rely on it: it reduces the cost of main memory access and enables high instruction/cycle throughput
- In a typical program, ~25% of instructions are memory accesses

Cache
- Relies upon locality of reference: programs continually use, and re-use, the same locations
- Instructions: loops, common subroutines
- Data: look-up tables, "working" data sets

Cache - operation
- Memory requests are checked in the cache first
- If the word sought is in the cache, it's read from (or updated in) the cache → Cache hit
- If not, the request is passed to main memory and the data is read (written) there → Cache miss
[Diagram: CPU → MMU (VA → PA) → Cache → Main Memory; data or instructions (D or I) flow between CPU and cache]
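
To make the hit/miss flow concrete, here is a minimal C sketch of a read through a direct-mapped cache; the sizes and names (LINES, cache_read) are illustrative assumptions, not the lecture's:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINES 1024                         /* 2^k cache lines */

typedef struct {
    bool     valid;                        /* false until a word is loaded */
    uint32_t tag;
    uint32_t data;                         /* one word per line, for simplicity */
} CacheLine;

static CacheLine cache[LINES];

/* Check the cache first: returns true on a hit. On a miss the caller
   would pass the request to main memory and install the word here. */
static bool cache_read(uint32_t addr, uint32_t *word)
{
    uint32_t index = (addr >> 2) % LINES;  /* discard 2 byte-address bits */
    uint32_t tag   = (addr >> 2) / LINES;

    if (cache[index].valid && cache[index].tag == tag) {
        *word = cache[index].data;         /* cache hit */
        return true;
    }
    return false;                          /* cache miss */
}

int main(void)
{
    uint32_t w;
    if (cache_read(0x1000, &w))
        printf("hit: %u\n", w);
    else
        printf("miss: pass request to main memory\n"); /* taken: cache empty */
    return 0;
}
```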

Cache - operation
- Hit rates of 95% are usual with a 16 kbyte cache
- Effective Memory Access Time: cache = 2 cycles, main memory = 10 cycles
- Average access: 0.95 × 2 + 0.05 × 10 = 2.4 cycles
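
The same calculation as a minimal C sketch; the function name amat and its signature are mine, for illustration:

```c
#include <stdio.h>

/* effective access = hit_rate * t_cache + (1 - hit_rate) * t_memory */
double amat(double hit_rate, double t_cache, double t_memory)
{
    return hit_rate * t_cache + (1.0 - hit_rate) * t_memory;
}

int main(void)
{
    /* the slide's numbers: 95% hits, 2-cycle cache, 10-cycle memory */
    printf("average access = %.1f cycles\n", amat(0.95, 2.0, 10.0));
    return 0;   /* prints: average access = 2.4 cycles */
}
```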

Cache - organisation
- Direct-mapped cache: each word in the cache has a tag
- Assume: cache size = 2^k words, machine words of p bits, byte-addressed memory
- m = log2(p/8) bits are not used to address words; m = 2 for 32-bit machines
- Address format (p bits): tag (p-k-m bits) | cache address (k bits) | byte address (m bits)
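
A minimal C sketch of this address split, assuming the 32-bit, 64 kbyte example used later in the lecture (p = 32, k = 14, m = 2); the constant and variable names are mine:

```c
#include <stdint.h>
#include <stdio.h>

#define P 32    /* address bits: a 32-bit machine     */
#define K 14    /* 2^14 words = 16 kwords = 64 kbytes */
#define M 2     /* log2(32/8): byte-address bits      */

int main(void)
{
    uint32_t addr = 0x12345678;
    uint32_t byte_addr  =  addr       & ((1u << M) - 1);  /* low m bits  */
    uint32_t cache_addr = (addr >> M) & ((1u << K) - 1);  /* next k bits */
    uint32_t tag        =  addr >> (M + K);               /* top p-k-m   */
    printf("tag=%x cache_addr=%x byte_addr=%x\n", tag, cache_addr, byte_addr);
    return 0;
}
```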

Cache - organisation
[Diagram: direct-mapped cache. The p-bit memory address splits into tag (p-k-m bits), cache address (k bits) and byte address (m bits). The cache address selects one of 2^k lines; each line holds a tag and a data word. The stored tag is compared with the address tag to produce "Hit?"; misses go to memory, and data returns to the CPU]

Cache - Direct Mapped Conflicts
- Two addresses separated by 2^(k+m) bytes will hit the same cache location
- 32-bit machine, 64 kbyte (16 kword) cache → m = 2, k = 14 → any program or data set larger than 64 kbytes will generate conflicts
- On a conflict, the 'old' word is flushed
- An unmodified word (program, constant data) is simply overwritten by the new data from memory
- Modified data needs to be written back to memory before being overwritten
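
A minimal C sketch of the conflict rule: for any address a, both a and a + 2^(k+m) produce the same cache-address field (k = 14, m = 2 as in the example above; the names are mine):

```c
#include <stdint.h>
#include <stdio.h>

#define K 14
#define M 2

/* the cache-address (line index) field of an address */
static uint32_t line_of(uint32_t addr)
{
    return (addr >> M) & ((1u << K) - 1);
}

int main(void)
{
    uint32_t a = 0x00012340;
    uint32_t b = a + (1u << (K + M));   /* a + 64 kbytes */
    /* both map to the same line, so they evict each other */
    printf("line(a)=%u line(b)=%u\n", line_of(a), line_of(b));
    return 0;
}
```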

Cache - Conflicts
- Modified or dirty words: words that have been modified in the cache
- Write-back cache: only writes data back when needed; a miss then needs two memory accesses (write the modified word back, then read the new word)
- Write-through cache: a low-priority write to main memory is queued, so the processor is delayed by reads only; the memory write occurs in parallel with other work; instruction and necessary data fetches take priority
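
A minimal C sketch contrasting the two policies; the line structure, helper names and the toy ram[] model of main memory are assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;     /* only meaningful for write-back */
    uint32_t tag;
    uint32_t data;
} CacheLine;

/* a toy word-addressed main memory, standing in for the real thing */
static uint32_t ram[1 << 16];
static void     memory_write(uint32_t a, uint32_t w) { ram[(a >> 2) & 0xFFFF] = w; }
static uint32_t memory_read(uint32_t a)              { return ram[(a >> 2) & 0xFFFF]; }

/* write hit, write-back: update the cache only and mark the line
   dirty - main memory is stale until the line is replaced */
static void write_hit_writeback(CacheLine *line, uint32_t w)
{
    line->data  = w;
    line->dirty = true;
}

/* write hit, write-through: update the cache AND queue the memory
   write (low priority in a real system); lines never become dirty */
static void write_hit_writethrough(CacheLine *line, uint32_t addr, uint32_t w)
{
    line->data = w;
    memory_write(addr, w);
}

/* replacing a dirty line in a write-back cache costs two memory
   accesses: write the modified word back, then read the new word */
static void replace_writeback(CacheLine *line, uint32_t old_addr, uint32_t new_addr)
{
    if (line->valid && line->dirty)
        memory_write(old_addr, line->data);   /* 1st access: write back */
    line->data  = memory_read(new_addr);      /* 2nd access: read new   */
    line->dirty = false;
    line->valid = true;
}

int main(void)
{
    CacheLine line = { .valid = true, .dirty = false, .tag = 0, .data = 0 };
    write_hit_writethrough(&line, 0x100, 42); /* cache and memory updated    */
    write_hit_writeback(&line, 43);           /* cache updated, memory stale */
    replace_writeback(&line, 0x100, 0x200);   /* dirty, so two accesses      */
    return 0;
}
```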

Cache - Write-through or write-back?
- Write-through allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (the main memory bus)
- Reads (instruction and data) need priority! They stall the processor
- Writes can be delayed, at least until the location is needed!
- More on intelligent system interface units later, but...

Cache - Write-through or write-back?
- Write-through seems a good idea! but...
- Multiple writes to the same location waste memory bus bandwidth → typical programs run better with write-back caches
- However, often you can easily predict which will be best → some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through

Cache - more bits
- Cache lines need some status bits in addition to the tag bits:
- Valid: all set to false on power-up; set to true as words are loaded into the cache
- Dirty: needed by a write-back cache; a write-through cache always queues the write, so lines are never 'dirty'
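
A minimal C sketch of a cache line carrying these status bits, with the power-up initialisation that clears every valid bit; field and function names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES (1u << 14)    /* 2^k lines */

typedef struct {
    bool     valid;         /* false until a word is loaded   */
    bool     dirty;         /* used by write-back caches only */
    uint32_t tag;
    uint32_t data;
} CacheLine;

static CacheLine cache[LINES];

/* on power-up, tag and data are garbage, so every line must be
   marked invalid before the cache can be used */
void cache_power_up(void)
{
    for (uint32_t i = 0; i < LINES; i++) {
        cache[i].valid = false;
        cache[i].dirty = false;
    }
}

int main(void)
{
    cache_power_up();   /* no line can 'hit' until real data is loaded */
    return 0;
}
```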

Cache - Improving Performance
- Conflicts (addresses 2^(k+m) bytes apart) degrade cache performance: lower hit rate
- Murphy's Law operates: addresses are never random!
- Some locations 'thrash' in the cache: continually replaced and restored
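
A minimal C sketch of such a thrashing pattern, assuming the 64 kbyte direct-mapped cache from the earlier example; the buffer layout is contrived so that two hot regions sit exactly 2^(k+m) bytes apart:

```c
#include <stdint.h>

#define STRIDE_WORDS ((1u << 16) / 4)   /* 64 kbytes, in 32-bit words */

static uint32_t buf[2 * STRIDE_WORDS];

int main(void)
{
    uint32_t *a = buf;                  /* some location...           */
    uint32_t *b = buf + STRIDE_WORDS;   /* ...and one 64 kbytes later */
    uint32_t sum = 0;
    /* a[i] and b[i] always map to the same cache line: in a
       direct-mapped cache every iteration misses twice - the two
       locations are continually replaced and restored */
    for (uint32_t i = 0; i < STRIDE_WORDS; i++)
        sum += a[i] + b[i];
    return (int)sum;
}
```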

Cache - Fully Associative
- All tags are compared at the same time
- Words can use any cache line

Cache - Fully Associative
- Associative: each tag is compared at the same time; any match → hit
- Avoids 'unnecessary' flushing
- Replacement: Least Recently Used (LRU); needs extra status bits recording cycles since last access
- Hardware cost is high: extra comparators, and wider tags (p-m bits vs p-k-m bits)

Cache - Set Associative
[Diagram: 2-way set associative cache. Each line holds two words, one per way, so only two comparators are needed]

Cache - Set Associative
- n-way set associative caches: n can be small (2, 4, 8)
- Best performance at reasonable hardware cost → used in most high-performance processors
- Replacement policy: LRU choice from the n words of a set
- A reasonable LRU approximation needs only 1 or 2 bits per word: set on access, cleared / decremented by a timer; choose a cleared word for replacement
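
A minimal C sketch of an n-way lookup using the 1-bit 'referenced' approximation to LRU just described; the sizes (4 ways, 256 sets) and all names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4
#define SETS 256
#define M    2                  /* byte-address bits */

typedef struct {
    bool     valid;
    bool     referenced;        /* 1-bit LRU approximation */
    uint32_t tag;
    uint32_t data;
} Way;

static Way cache[SETS][WAYS];

/* cleared periodically by a timer in hardware */
void clear_reference_bits(void)
{
    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++)
            cache[s][w].referenced = false;
}

bool cache_read(uint32_t addr, uint32_t *word)
{
    uint32_t set = (addr >> M) % SETS;
    uint32_t tag = (addr >> M) / SETS;

    /* in hardware, all WAYS tags of the set are compared in parallel */
    for (int w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].referenced = true;   /* set on access */
            *word = cache[set][w].data;
            return true;                       /* hit */
        }
    }
    return false;                              /* miss */
}

/* on a miss, prefer an invalid way, then any way whose referenced
   bit has been cleared - an approximation to least-recently-used */
int choose_victim(uint32_t set)
{
    for (int w = 0; w < WAYS; w++)
        if (!cache[set][w].valid) return w;
    for (int w = 0; w < WAYS; w++)
        if (!cache[set][w].referenced) return w;
    return 0;                                  /* all referenced: pick any */
}

int main(void)
{
    uint32_t word;
    (void)cache_read(0x1234, &word);   /* miss: the cache starts empty */
    (void)choose_victim(0);            /* would pick an invalid way    */
    return 0;
}
```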

Cache - Locality of Reference
- Temporal locality: the same location will be referenced again soon; access the same data again; program loops access the same instructions again. The caches described so far exploit temporal locality
- Spatial locality: nearby locations will be referenced soon; the next element of an array; the next instruction of a program
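
Both kinds of locality appear in the most ordinary code; a minimal C sketch, with an illustrative array:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static int a[N];
    int sum = 0;

    /* spatial locality: a[i] and a[i+1] are adjacent in memory, so a
       long cache line fetched for one element also holds the next */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* temporal locality: sum, i and the loop's few instructions are
       re-used on every iteration, so they stay resident in the cache */
    printf("%d\n", sum);
    return 0;
}
```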

Cache - Line Length
- Spatial locality → use very long cache lines: fetch one datum and its neighbours are fetched as well
- PowerPC 601 (Motorola/Apple/IBM), first of the single-chip Power processors: 64 sets, 8-way set associative, 32 bytes per line
- 32 bytes (8 instructions) fetched into the instruction buffer in one cycle
- 64 × 8 × 32 = 16 kbytes total

Cache - Separate I- and D-caches
- Unified cache: instructions and data in the same cache
- Two caches, one for instructions and one for data → increases total bandwidth
- MIPS R10000: 32 Kbyte instruction cache; 32 Kbyte data cache
- The instruction cache is pre-decoded! (32 → 36 bits per instruction)
- Data cache: 8-word (64-byte) lines, 2-way set associative, 256 sets
- Replacement policy?