Improving Cache Performance Four categories of optimisation: –Reduce miss rate –Reduce miss penalty –Reduce miss rate or miss penalty using parallelism –Reduce hit time. Recall: AMAT = Hit time + Miss rate × Miss penalty
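As a quick worked example of the AMAT formula (with illustrative numbers, not measurements from any particular machine): a 1-cycle hit time, 5% miss rate and 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles. A minimal C sketch:

#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles */
    double miss_rate    = 0.05;   /* fraction of accesses that miss */
    double miss_penalty = 100.0;  /* cycles */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* prints 6.0 */
    return 0;
}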

5.5. Reducing Miss Rate Three sources of misses (the three C's): –Compulsory: "cold start" misses –Capacity: the cache is full –Conflict: the set is full/block is occupied. Techniques: –Increase block size –Increase size of cache –Increase degree of associativity

Larger Block Size Bigger blocks reduce compulsory misses –Spatial locality BUT: –Increased miss penalty More data to transfer –Possibly increased overall miss rate More conflict and capacity misses as there are fewer blocks

Effect of Block Size [figures: miss rate vs. block size; miss penalty (access + transfer time) vs. block size; AMAT vs. block size]

Larger Caches Reduces capacity misses Increases hit time and cost

Higher Associativity Miss rates improve with higher associativity Two rules of thumb: –8-way set associative caches are almost as effective as fully associative But much simpler! –2:1 cache rule A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2

Way Prediction Set-associative cache predicts which block will be needed on the next access to the set Only one tag check is done –If mispredicted, the whole set must be checked E.g. Alpha 21264 instruction cache –Prediction rate > 85% –Correct prediction: 1 cycle hit –Misprediction: 3 cycles
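A minimal C sketch of the way-prediction idea for a 2-way set-associative cache (structures and latencies are illustrative; real hardware does this with latches and parallel comparators):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 512
#define WAYS     2

typedef struct {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    int      predicted_way;    /* retrained on every access to the set */
} cache_set_t;

/* Returns the hit latency in cycles, or -1 on a miss. */
int lookup(cache_set_t *sets, unsigned index, uint64_t tag) {
    cache_set_t *s = &sets[index];
    int p = s->predicted_way;

    if (s->valid[p] && s->tag[p] == tag)
        return 1;                          /* predicted way hits: fast path */

    for (int w = 0; w < WAYS; w++) {       /* misprediction: check the rest */
        if (w != p && s->valid[w] && s->tag[w] == tag) {
            s->predicted_way = w;          /* retrain the predictor */
            return 3;                      /* slower hit, as on the 21264 */
        }
    }
    return -1;                             /* genuine miss */
}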

Pseudo-Associative Caches Check a direct mapped cache for a hit as usual If it misses, check a second block –Invert MSB of index One fast and one slow hit time
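The alternate block's index is trivial to compute; a one-line C sketch (INDEX_BITS is an assumed cache parameter):

#define INDEX_BITS 9   /* e.g. 512 sets; assumption for illustration */

/* On a primary miss, probe the partner block whose index differs
 * only in its most-significant bit. */
unsigned alternate_index(unsigned index) {
    return index ^ (1u << (INDEX_BITS - 1));
}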

Compiler Optimisations Compilers can optimise code to minimise miss rates: –Reordering procedures –Aligning basic blocks with cache blocks –Reorganising array element accesses
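A classic instance of reorganising array accesses is loop interchange; a sketch in C (array size is arbitrary):

#define N 1024
double x[N][N];

void scale_column_major(void) {
    /* Poor: the inner loop strides down a column, so with large N each
     * access touches a different cache block. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] *= 2.0;
}

void scale_row_major(void) {
    /* Better: after interchange the inner loop walks along a row, so
     * consecutive accesses fall in the same cache block (spatial locality). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] *= 2.0;
}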

5.6. Reduce Miss Rate or Miss Penalty via Parallelism Three techniques that overlap instruction execution with memory access

Nonblocking caches Dynamic scheduling allows CPU to continue with other instructions while waiting for data Nonblocking cache allows other cache accesses to continue while waiting for data

Hardware Prefetching Fetch data/instructions before they are requested by the processor –Either into cache or another buffer Particularly useful for instructions –High degree of spatial locality UltraSPARC III –Special prefetch cache for data –Increases effectiveness by about four times

Compiler Prefetching Compiler inserts “prefetch” instructions Two types: –Prefetch register value –Prefetch data cache block Can be faulting or non-faulting Cache continues as normal while data is prefetched
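In C, compiler prefetching is typically exposed through a builtin; GCC and Clang provide __builtin_prefetch(addr, rw, locality), which lowers to the target's prefetch instruction (the SPARC V9 form is on the next slide). The prefetch distance below is an illustrative guess that would need tuning:

#define PREFETCH_DISTANCE 16   /* elements ahead; tuning assumption */

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)   /* hint: read, low temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        s += a[i];
    }
    return s;
}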

SPARC V9 Prefetch: prefetch [%rs1 + %rs2], fcn prefetch [%rs1 + imm13], fcn fcn = prefetch function: 0 = prefetch for several reads, 1 = prefetch for one read, 2 = prefetch for several writes, 3 = prefetch for one write, 4 = prefetch page

5.7. Reducing Hit Time Critical –Often affects CPU clock cycle time

Small, simple caches Small usually equals fast in hardware A small cache may reside on the processor chip –Decreases communication –Compromise: tags on chip, data separate Direct mapped –Data can be read in parallel with tag checking

Avoiding address translation Physical caches –Use physical addresses Address translation must happen before cache lookup Virtual caches –Use virtual addresses –Protection issues –High context switching overhead

Virtual caches Minimising context switch overhead: –Add a process-identifier tag to the cache Multiple virtual addresses may refer to a single physical address (aliases): –Hardware enforces anti-aliasing –Software: page colouring forces aliases to agree in their least-significant address bits

Avoiding address translation (cont.) Choice of page size: –Bigger than cache index + offset –Address translation and tag lookup can happen in parallel [diagram: the virtual address from the CPU splits into page number and page offset; the cache index and block offset lie within the page offset, so cache indexing proceeds in parallel with VM translation of the page number into the tag]
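The constraint can be checked at compile time; a sketch with assumed sizes (direct-mapped cache, so index + block-offset bits equal log2 of the cache size):

#define PAGE_SIZE  8192u   /* 13-bit page offset; assumption */
#define CACHE_SIZE 8192u   /* direct mapped; assumption */

/* The cache index and block offset must come entirely from the page
 * offset for the lookup to overlap translation. */
_Static_assert(CACHE_SIZE <= PAGE_SIZE,
               "cache too large to index in parallel with translation");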

Pipelining cache access Split cache access into several stages –Increases branch and load delay penalties

Trace caches Blocks follow program flow rather than spatial locality! Branch prediction is taken into account by the cache Used in the Intel NetBurst microarchitecture (Pentium 4) Complicates address mapping Minimises wasted space within blocks

Cache Optimisation Summary Cache optimisation is very complex –Improving one factor may have a negative impact on another

5.8. Main Memory Latency and bandwidth are both important Latency is composed of two factors: –Access time –Cycle time Two main technologies: –DRAM –SRAM

5.10. Virtual Memory Physical memory is divided into blocks –Allocated to processes –Provides protection –Allows swapping to disk –Simplifies loading Historically: –Overlays: programmer-controlled swapping

Terminology Block: –Page –Segment Miss: –Page fault –Address fault Memory mapping (address translation) –Virtual address → physical address

Characteristics Block size –4kB – 64kB Hit time –50 – 150 cycles Miss penalty –1,000,000 – 10,000,000 cycles Miss rate –0.00001% – 0.001%

Categorising VM Systems Fixed block size –Pages Variable block size –Segments –Difficult replacement Hybrid approaches –Paged segments –Multiple page sizes (2^n × smallest)

Q1: Block placement? Anywhere in memory –“Fully associative” –Minimises miss rate

Q2: Block identification? Page/segment number gives the physical page address –Paging: offset is concatenated –Segmentation: offset is added Uses a page table –One entry per page of the virtual address space –To save space: inverted page table One entry per page of physical memory
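A minimal C sketch of paged translation with a flat page table (hypothetical 32-bit machine, 4kB pages):

#include <stdint.h>

#define PAGE_BITS  12                        /* 4kB pages */
#define NUM_VPAGES (1u << (32 - PAGE_BITS))  /* one entry per virtual page */

typedef struct {
    uint32_t frame;      /* physical page number */
    unsigned valid : 1;
} pte_t;

pte_t page_table[NUM_VPAGES];

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid) {
        /* page fault: the OS would fetch the page from disk */
    }
    /* paging: the frame number is concatenated with the unchanged offset */
    return (page_table[vpn].frame << PAGE_BITS) | offset;
}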

Q3: Block replacement? Least-recently used (LRU) –Minimises miss rate –Hardware provides a use bit or reference bit
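The use bit only approximates LRU; a common software scheme built on it is the "clock" (second-chance) sweep, sketched below with hypothetical structures:

#include <stdbool.h>

#define FRAMES 1024

bool ref_bit[FRAMES];   /* set by hardware whenever the page is accessed */
static int hand = 0;

int choose_victim(void) {
    for (;;) {
        if (!ref_bit[hand]) {            /* not recently used: evict it */
            int victim = hand;
            hand = (hand + 1) % FRAMES;
            return victim;
        }
        ref_bit[hand] = false;           /* recently used: second chance */
        hand = (hand + 1) % FRAMES;
    }
}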

Q4: Write strategy? Write back –With a dirty bit You won’t become famous by being the first to try write through!

Fast Address Translation Page tables are big –Stored in memory themselves –Two memory accesses for every datum! Principle of locality –Cache recent translations –Translation look-aside buffer (TLB), or translation buffer (TB)
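A software model of a small fully-associative TLB (sizes and structures are illustrative; a real TLB is a hardware CAM that compares all entries at once):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct {
    uint32_t vpn, frame;
    bool     valid;
} tlb_entry_t;

tlb_entry_t tlb[TLB_ENTRIES];

bool tlb_lookup(uint32_t vpn, uint32_t *frame) {
    for (int i = 0; i < TLB_ENTRIES; i++) {  /* hardware: parallel compare */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return true;                     /* hit: no page-table access */
        }
    }
    return false;   /* miss: walk the page table, then refill an entry */
}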

Alpha TLB

Selecting a Page Size Big –Smaller page table –Allows parallel cache access –Efficient disk transfers –Reduces TLB misses Small –Less memory wastage (internal fragmentation) –Quicker process startup

Putting it ALL Together! SPARC Revisited

Two SPARCs SuperSPARC –1992 –32-bit superscalar design UltraSPARC –Late 1990s –64-bit design –Graphics support (VIS)

UltraSPARC Four-way superscalar execution Two integer ALUs FP unit –Five functional units Graphics unit

Pipeline 9 stages: –Fetch –Decode –Grouping –Execution –Cache access –Load miss –Integer pipe wait (for FP/graphics pipelines) –Trap resolution –Writeback

Branch Handling Dynamic branch prediction –Two bit scheme –Every second instruction in cache has prediction bits (predicts up to 2048 branches) –88% success rate (integer) Target prediction –Fetches from predicted path

FPU Five functional units: –Add –Multiply –Divide/square root –Two graphics units (add and multiply) Mostly fully pipelined (latency 3 cycles) –Except divide and square root (not pipelined, latency is 22 cycles for 64-bit)

Memory Hierarchy On-chip instruction and data caches –Data: 16kB direct-mapped, write-through –Instructions: 16kB 2-way set associative –Both virtually addressed External cache –Up to 4MB

Virtual Memory 64-bit virtual addresses → 44-bit physical addresses TLB –64-entry, fully-associative cache

Multimedia Support (VIS) Integrated with FPU Partitioned operations –Multiple smaller values in 64-bits Video compression instructions –E.g. motion estimation instruction replaces 48 simple instructions for MPEG compression
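What a partitioned operation means, shown in plain C for four 16-bit lanes packed into 64 bits (VIS does this in a single instruction; the loop here is only for clarity):

#include <stdint.h>

uint64_t partitioned_add16(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        /* the cast keeps any carry from crossing into the next lane */
        r |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
    }
    return r;
}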

The End!