Replicated Block Cache
[Diagram: an n-bit block_id from the fill unit is decoded into an N = 2^n-entry direct-mapped Block Cache; each block is held in four replicated copies (copy-1 … copy-4), whose b word lines feed a Final Collapse stage into the Fetch Buffer.]
What about fragmentation?

Predict and Fetch Trace
[Diagram: a Predict Cycle followed by a Fetch Cycle from the Block Cache.]
More efficient: the redundancy is in the trace table, not in the block cache.

Next Trace Prediction
[Diagram: a Hash Function indexes the Trace Table; each entry holds w predicted block_ids (b_id0, b_id1, b_id2, b_id3) and the predicted branch path, and supplies the next trace_id to the block cache.]

The Block-Based Trace Cache
[Diagram: components — Trace Table, Block Cache, Rename Table, Fill Unit, I-Cache, History Hash, Final Collapse, Fetch Buffer, and the Execution Core. The history hash and trace_id index the Trace Table for predicted block_ids; the Block Cache (backed by the I-Cache) supplies the blocks, which pass through Final Collapse into the Fetch Buffer; at Completion, the Fill Unit uses the pre-collapse history to construct blocks and update the tables.]

Wide-Fetch I-cache vs. T-cache
Proposed Trace Cache:
 - Fetch: 1. Next trace prediction; 2. Trace cache fetch
 - Execution Core
 - Completion: 1. Trace construction and fill
Enhanced Instruction Cache:
 - Fetch: 1. Multiple-branch prediction; 2. Instruction cache fetch; 3. Instruction alignment & collapsing
 - Execution Core
 - Completion: 1. Multiple-branch predictor update

Trace Cache Trade-offs
Trace cache:
 - Pros: moves complexity to the backend
 - Cons: inefficient instruction storage (instruction storage redundancy)
Enhanced instruction cache:
 - Pros: efficient instruction storage
 - Cons: complexity during fetch time

As Machines Get Wider (… and Deeper)
[Diagram: the pipeline Fetch – Decode – Rename – Dispatch – Execute – Retire shortened to Fetch – Dispatch – Execute – Retire.]
1. Eliminate stages
2. Relocate work to the backend

Flow Path Model of Superscalars
[Diagram: instruction flow — Branch Predictor and I-cache feed FETCH, then DECODE through the Instruction Buffer; register data flow — EXECUTE on Integer, Floating-point, Media, and Memory units with the Reorder Buffer (ROB); memory data flow — COMMIT through the Store Queue to the D-cache.]

CPU-Memory Bottleneck
 - Performance of high-speed computers is usually limited by memory performance: bandwidth and latency
 - Main memory access time >> processor cycle time: over 100 times difference!!
 - If a fraction m of instructions are loads and stores, then there are on average '1 + m' memory references per instruction; suppose m = 40% … about 22.4 GByte/sec of bandwidth is needed (a worked example follows)
[Diagram: CPU connected to Memory.]
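One way to arrive at the 22.4 GByte/sec figure, as a sketch; the 4 GHz clock, one instruction per cycle, and 4-byte references are assumptions, not stated on the slide:

\[
\text{BW} \approx (1+m)\times \text{IPC}\times f_{\text{clk}}\times \text{ref. size}
= 1.4 \times 1 \times 4\,\text{GHz} \times 4\,\text{B} = 22.4\ \text{GB/s}
\]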

How to Incorporate Faster Memory
 - SRAM access time << main memory access time
 - SRAM bandwidth >> main memory bandwidth
 - SRAM is expensive
 - SRAM is smaller than main memory
 - Programs exhibit temporal locality
   - frequently-used data can be held in the scratch pad
   - the cost of the first and last memory access can be amortized over multiple reuses
 - Programs must have a small working set (aka footprint)
[Diagram: CPU with register file and a Scratch Pad (SRAM) in front of Main Memory (DRAM).]

Caches: Automatic Management of Fast Storage
[Diagram: CPU – cache – Main Memory, extended to a hierarchy CPU – L1 – L2 – L3 – Main Memory. Typical parameters: L1 16~32 KB, 1~2 pclk latency; L2 ~256 KB, ~10 pclk latency; L3 ~4 MB, ~50 pclk latency.]

Cache Memory Structures
[Diagram: three organizations —
 - Indexed memory: k-bit index, 2^k blocks; a decoder uses the index to select one (tag, data) entry
 - Associative memory (CAM): no index, unlimited blocks; the key is compared against every stored tag
 - N-way set-associative memory: k-bit index, 2^k × N blocks; the index selects a set and the key is matched within it.]

Direct Mapped Caches
[Diagram: the address is split into tag, index, and block offset; a decoder uses the index to select one cache line, the stored tag is compared against the address tag ("tag match"), and a multiplexor selects the requested word.]
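A minimal sketch of the direct-mapped lookup above; the list-of-dicts cache representation and field names are illustrative assumptions, not from the slides:

```python
def split_address(addr, index_bits, offset_bits):
    """Split a byte address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

def dm_lookup(cache, addr, index_bits, offset_bits):
    """Direct-mapped lookup: exactly one candidate line per index;
    it is a hit iff that line is valid and its tag matches."""
    tag, index, offset = split_address(addr, index_bits, offset_bits)
    line = cache[index]               # cache: list of {"valid", "tag", "data"} dicts
    if line["valid"] and line["tag"] == tag:
        return line["data"][offset]   # hit: the multiplexor picks the requested byte
    return None                       # miss
```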

Cache Block Size
 - Each cache block (or cache line) has only one tag but can hold multiple “chunks” of data
   - reduces tag storage overhead: in 32-bit addressing, a 1-MB direct-mapped cache has 12 bits of tag per block
     - 4-byte cache block → 256K blocks → ~384 KB of tag
     - 128-byte cache block → 8K blocks → ~12 KB of tag
   - the entire cache block is transferred to and from memory all at once
     - good for spatial locality: if you access address i, you will probably want i+1 as well (prefetching effect)
 - Block size = 2^b; direct-mapped cache size = 2^(B+b)
[Address layout: MSB … tag | block index (B bits) | block offset (b bits) … LSB]
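The tag-overhead numbers can be checked with a quick calculation (assuming the slide's 1 MB = 2^20 bytes and 32-bit addresses):

\[
32 - \log_2 2^{20} = 12\ \text{tag bits};\qquad
\frac{2^{20}}{4}\times 12\ \text{bits} = 3\,\text{Mbit} \approx 384\ \text{KB};\qquad
\frac{2^{20}}{128}\times 12\ \text{bits} = 96\,\text{Kbit} = 12\ \text{KB}
\]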

Large Blocks and Subblocking
 - Large cache blocks can take a long time to refill
   - refill the cache line critical word first
   - restart the cache access before the refill completes
 - Large cache blocks can waste bus bandwidth if the block size is larger than the spatial locality
   - divide a block into subblocks
   - associate a separate valid bit with each subblock
[Line layout: tag | subblock (v) | subblock (v) | subblock (v)]

Fully Associative Cache
[Diagram: the address has only a tag and a block offset (no index); an associative search compares the tag against every stored tag in parallel, and a multiplexor selects the requested word.]

N-Way Set Associative Cache
[Diagram (associative-search view): the address is split into tag, index, and block offset (BO); a decoder selects a set and an associative search compares the tag against the N entries of that set; a multiplexor selects the word.]
Cache size = N × 2^(B+b)

N-Way Set Associative Cache
[Diagram (banked view): N ways (banks), each with its own decoder and tag comparator; the index selects one set across all ways, every way's tag is compared in parallel ("tag match"), and a multiplexor picks the matching way.]
Cache size = N × 2^(B+b)
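Extending the earlier sketch to the N-way organization (reusing split_address from the direct-mapped example; the set representation is again an illustrative assumption):

```python
def sa_lookup(sets, addr, index_bits, offset_bits):
    """N-way set-associative lookup: the index picks a set, and the tag is
    compared against every way of that set (hardware does this in parallel;
    the loop below is the sequential equivalent)."""
    tag, index, offset = split_address(addr, index_bits, offset_bits)
    for way in sets[index]:                 # sets[index]: list of N way entries
        if way["valid"] and way["tag"] == tag:
            return way["data"][offset]      # hit in one of the N ways
    return None                             # miss in every way
```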

Pseudo-Associative Cache
 - Simple direct-mapped cache structure; the 1st lookup is unchanged
 - If the 1st lookup misses, start a 2nd lookup using a hashed index (e.g., flip the MSB)
 - If the 2nd lookup hits, swap the contents of the 1st and 2nd lookup locations
 - If the 2nd lookup also misses, go to the next level of the hierarchy
 - A cache hit on the 2nd lookup is slower, but still much faster than a miss (see the sketch below)
[Timing: 1st lookup < 2nd lookup < memory lookup]
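A minimal sketch of the two-probe lookup; to keep the swap unambiguous it stores the full block number as the tag, which is a simplification rather than something from the slides:

```python
def pa_lookup(cache, addr, index_bits, offset_bits):
    """Pseudo-associative lookup: probe the direct-mapped slot first, then the
    slot whose index has its most significant bit flipped; on a 2nd-probe hit,
    swap the two lines so the hot block lands in the fast (1st-probe) slot."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    block = addr >> offset_bits                   # full block number used as tag

    line = cache[index]
    if line["valid"] and line["tag"] == block:
        return line["data"][offset], "1st-probe hit"

    alt = index ^ (1 << (index_bits - 1))         # hashed index: flip the MSB
    if cache[alt]["valid"] and cache[alt]["tag"] == block:
        cache[index], cache[alt] = cache[alt], cache[index]   # swap locations
        return cache[index]["data"][offset], "2nd-probe hit (slower)"

    return None, "miss: go to the next level of the hierarchy"
```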

Victim Cache: a small cache to back up a direct-mapped cache
 - Lines evicted from the direct-mapped cache due to collisions are stored in the victim cache
 - Avoids ping-ponging when the working set contains a few addresses that collide
[Diagram: the address from the CPU probes the regular direct-mapped cache (tag, status, data block) and the small victim cache in parallel.]
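A sketch of the victim-cache path under stated assumptions: the victim buffer is a tiny fully-associative FIFO, a victim hit swaps lines back into the direct-mapped cache, and load_block is a hypothetical stand-in for a memory fetch:

```python
def load_block(block):
    """Hypothetical stand-in for fetching a block from the next memory level."""
    return bytes(64)

def victim_lookup(dm_cache, victim, addr, index_bits, offset_bits):
    """Probe the direct-mapped cache; on a miss, search the small victim buffer.
    Conflicting lines then ping-pong between the two structures, not memory."""
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    block = addr >> offset_bits                    # full block number used as tag

    line = dm_cache[index]
    if line["valid"] and line["tag"] == block:
        return "hit"

    for i, entry in enumerate(victim):             # small fully-associative search
        if entry["valid"] and entry["tag"] == block:
            dm_cache[index], victim[i] = victim[i], dm_cache[index]   # swap back
            return "victim hit"

    # Miss in both: the displaced direct-mapped line goes to the victim buffer
    # (FIFO replacement) and the requested block is fetched from memory.
    victim.pop(0)
    victim.append(line)
    dm_cache[index] = {"valid": True, "tag": block, "data": load_block(block)}
    return "miss"
```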

Reducing the Miss Penalty
 - Give priority to read misses over writes
   - a write can be completed by writing into a write buffer, but a read miss stalls the processor, so let reads overtake writes
 - Sub-block replacement: fetch only part of a block
 - Early restart and critical word first
 - Non-blocking caches to reduce stalls on cache misses
   - allow new fetches to proceed out of order after a miss
 - Multilevel caches
   - reduced latency on a primary miss
   - inclusion property

Principle Behind Hierarchical Storage
 - Each level memoizes values stored at lower levels, instead of paying the full latency to the “furthermost” level of storage each time.
 - Effective access time: T_i = h_i · t_i + (1 − h_i) · T_{i+1}
   - h_i is the hit ratio, the probability of finding the desired data memoized at level i
   - t_i is the raw access time of memory at level i
 - Given a program with good locality of reference and S_working-set < s_i: h_i → 1 and T_i → t_i
 - A balanced system achieves the best of both worlds:
   - the performance of the higher-level storage
   - the capacity of the lower-level, low-cost storage.
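A quick numeric illustration of the formula (the numbers are assumptions, not from the slide): with t_1 = 1 cycle, T_2 = 100 cycles, and h_1 = 0.95,

\[
T_1 = h_1 t_1 + (1 - h_1)\,T_2 = 0.95 \times 1 + 0.05 \times 100 = 5.95\ \text{cycles}
\]

so a 95% hit rate already hides most of the 100-cycle lower-level latency.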