Computer Architecture


Computer Architecture Lecture 21 Memory Hierarchy Design

The levels in a typical memory hierarchy
[Figure: CPU registers, cache, main memory, and I/O devices; access time, cost per bit, and size all increase with distance from the CPU.]

Cache performance review
Level 1 (Registers): typical size < 1 KB; custom memory with multiple ports; access time 0.25-0.5 ns; bandwidth 50,000-1,000,000 MB/s; managed by the compiler; backed by the cache.
Level 2 (Cache): typical size < 16 MB; on-chip CMOS SRAM; access time 0.5-25 ns; bandwidth 2,000-50,000 MB/s; managed by hardware; backed by main memory.
Level 3 (Main memory): typical size < 16 GB; CMOS DRAM; access time 45-250 ns; bandwidth 1,000-3,000 MB/s; managed by the operating system; backed by disk.
Level 4 (Disk storage): typical size > 200 GB; magnetic disk; access time ~5,000,000 ns; bandwidth 20-150 MB/s; managed by the operating system; backed by CD or tape.

Performance baseline: the processor-memory performance gap

Core/Bus Speed
[Figure: Memory Access Speed vs. CPU Speed. Source: http://www.oempcworld.com/support/Memory_Access_vs_CPU_Speed.htm]

Basic Philosophy
- Temporal locality: a recently referenced item is likely to be referenced again soon.
- Spatial locality: items near a recently referenced item are likely to be referenced soon.
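As a minimal sketch of the two kinds of locality (the array `a` and size `N` are hypothetical, not from the slides), both loops below compute the same sum, but the first walks memory in storage order and so makes far better use of each fetched cache block:

```c
#define N 1024
static double a[N][N];

/* Good spatial locality: C stores rows contiguously, so this walk
   touches consecutive addresses and uses every word of each block. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: consecutive accesses are N*8 bytes apart,
   so each access may land in a different cache block. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```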

Review of the ABCs of Caches (terminology to know): victim cache, fully associative, write allocate, non-blocking, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cycle, page fault, trace cache, AMAT, miss rate, index field, cache hit, set associative, no-write allocate, page, LRU, write buffer, miss penalty, tag field, write stall.

Basic Terms
- Cache, block
- Miss/hit, miss rate/hit rate
- Miss penalty, hit time
- The 3 Cs of cache misses: compulsory, capacity, conflict

L1 Typical Parameters
Characteristics across the Intel P4, Alpha 21264, MIPS R10000, AMD Opteron, and Itanium:
- Type: split instruction and data caches
- Organization: 2-way to 4-way set associative
- Block size: 16-64 bytes
- Size: 8 KB to 128 KB (e.g., P4 D-cache 8 KB; MIPS R10000 32 KB I + 32 KB D; Alpha 21264 and Opteron 64 KB I + 64 KB D)
- Access time/latency: 2-8 ns (2-4 clock cycles)

Four Memory Hierarchy Questions
- Q1: Where can a block be placed? (Block placement: direct mapped through fully associative)
- Q2: How is a block found? (Block identification: tag comparison)
- Q3: Which block should be replaced on a cache miss? (Replacement, only meaningful for set-associative and fully associative caches: LRU, random, or FIFO; the differences level off above about 256 KB)
- Q4, what happens on a write, is covered later.
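As a minimal sketch of block identification (the 64-byte blocks and 512 sets are hypothetical parameters, not from the slides), finding a block comes down to slicing the address into offset, index, and tag fields:

```c
#include <stdint.h>

/* Hypothetical geometry: 32-bit addresses, 64-byte blocks, 512 sets
   -> 6 offset bits, 9 index bits, 17 tag bits. */
enum { OFFSET_BITS = 6, INDEX_BITS = 9 };

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);                 /* byte within block */
}
static inline uint32_t set_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* which set */
}
static inline uint32_t tag_of(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);               /* compared on lookup */
}
```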

Direct Mapped Cache
Assume a 5-bit address bus and a cache with 8 entries.
[Diagram: address bits D4-D3 form the TAG and bits D2-D0 the 3-bit index; each of the 8 entries (index 000-111) holds a HIT (valid) bit, a TAG field, and a DATA field. A comparator (=) matches the stored tag against the address tag to produce the HIT signal, and data returns to the processor over the data bus.]

Direct Mapped Cache: First Load
LD R1, (01010)   ; 5-bit address bus, 8-bit data; assume AA (base 16) is stored at this location
The first access causes a MISS: the data is fetched from memory into the entry at index 010 with TAG = 01, and that entry's HIT (valid) bit is set to 1.

Direct Mapped Cache: After First Load
The entry at index 010 now holds HIT = 1, TAG = 01, DATA = AA, so repeating LD R1, (01010) is a cache hit.

Direct Mapped Cache: Second Load
LD R1, (11010)   ; assume 99 is stored at address 11010
The index (010) is the same but the TAG (11) differs, so this access MISSes and the data is fetched from memory.

Direct Mapped Cache: After Second Load
The entry at index 010 now holds HIT = 1, TAG = 11, DATA = 99; the block for address 01010 has been evicted by this conflict.
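A minimal C sketch of this 8-entry direct-mapped cache, reproducing the miss/hit/conflict-miss sequence above (the Line struct and backing memory array are illustrative assumptions):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* 5-bit addresses: index = bits 2..0, tag = bits 4..3, 8-bit data. */
typedef struct { bool valid; uint8_t tag; uint8_t data; } Line;
static Line cache[8];
static uint8_t memory[32];            /* backing store for all 5-bit addresses */

uint8_t load(uint8_t addr) {
    uint8_t index = addr & 0x7;         /* D2..D0 */
    uint8_t tag   = (addr >> 3) & 0x3;  /* D4..D3 */
    Line *l = &cache[index];
    if (l->valid && l->tag == tag) {
        printf("HIT  addr=0x%02X\n", addr);
    } else {                            /* miss: fetch from memory, fill line */
        printf("MISS addr=0x%02X\n", addr);
        l->valid = true; l->tag = tag; l->data = memory[addr];
    }
    return l->data;
}

int main(void) {
    memory[0x0A] = 0xAA;  /* binary address 01010 */
    memory[0x1A] = 0x99;  /* binary address 11010 */
    load(0x0A);           /* miss: fills index 010 with tag 01 */
    load(0x0A);           /* hit */
    load(0x1A);           /* conflict miss: same index, tag 11 */
    return 0;
}
```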

First Miss Rate Reduction Technique: Larger Block Size
- Reduces compulsory misses (exploits spatial locality)
- Increases the miss penalty (more data to fetch per miss)
- Can increase conflict misses (fewer blocks fit in the same capacity)

Cache Size Example (1): Direct Mapped
- Processor address bus = 32 bits (A), data bus = 32 bits
- Cache data storage = 128 KB = 32 K words (2^N with N = 15)
- Number of blocks (entries) in the cache = 32 K
- Tag size = A - N - 2 (byte offset) = 32 - 15 - 2 = 15 bits
- Total cache size = 128 KB (data) + 32 K x 15-bit tags + 32 K x 1-bit HIT bits = 192 KB
[Diagram: a 32 K x 48-bit memory indexed by address bits A16-A2; the remaining bits A31-A17 (15 bits) are compared with the stored tag to produce the HIT signal and data out.]

Cache Size Example (1): Two-Way Set Associative
- Assume the same processor (A = 32, D = 32) and the same total data storage of 128 KB.
- Two ways means two direct-mapped banks of 64 KB (128/2) each; 64 KB = 16 K words.
- Addressing a 16 K x 32-bit memory needs a 14-bit index, hence tag size = 32 - 14 - 2 = 16 bits.

Cache Size Example (1): Two-Way Set Associative
[Diagram: two 16 K x 49-bit memories, each indexed by address bits A15-A2; A31-A16 (16 bits) is compared against each way's tag, and a 2:1 MUX selects the hitting way's data out.]
Size = 2 (ways) x 16 K x (32-bit data + 16-bit tag + 1-bit HIT) = 196 KB

Cache Size Example (1): 4-Way Set Associative
- Assume the same processor (A = 32, D = 32) and the same total data storage of 128 KB.
- Four ways means four direct-mapped banks of 32 KB (128/4) each; 32 KB = 8 K words.
- Addressing an 8 K x 32-bit memory needs a 13-bit index, hence tag size = 32 - 13 - 2 = 17 bits.

Cache Size Example (1): 4-Way Set Associative
[Diagram: four 8 K x 50-bit memories (8 K entries each), each indexed by address bits A14-A2; A31-A15 (17 bits) is compared against each way's tag, and a 4:1 MUX selects the hitting way's data out to the processor.]
Size = 4 (ways) x 8 K x (32-bit data + 17-bit tag + 1-bit HIT) = 200 KB
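A small sketch that recomputes all three size figures above (the one-word-block setup mirrors the slides; the function name is hypothetical):

```c
#include <stdio.h>

/* 32-bit addresses, 32-bit (4-byte) words, one-word blocks,
   128 KB of data split across a given number of ways. */
void cache_size(unsigned ways) {
    unsigned total_words   = (128 * 1024) / 4;    /* 32 K words of data */
    unsigned words_per_way = total_words / ways;
    unsigned index_bits = 0;
    while ((1u << index_bits) < words_per_way) index_bits++;
    unsigned tag_bits = 32 - index_bits - 2;      /* 2 byte-offset bits */
    /* each entry: 32 data bits + tag bits + 1 HIT (valid) bit */
    unsigned long total_bits = (unsigned long)total_words * (32 + tag_bits + 1);
    printf("%u-way: tag = %u bits, total = %lu KB\n",
           ways, tag_bits, total_bits / 8 / 1024);
}

int main(void) {
    cache_size(1);  /* tag = 15, total = 192 KB */
    cache_size(2);  /* tag = 16, total = 196 KB */
    cache_size(4);  /* tag = 17, total = 200 KB */
    return 0;
}
```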

Exercise: what are the tag size and total size for an 8-way set associative cache?

Organization of the Alpha 21264 Data Cache
[Diagram: the CPU address splits into a 29-bit tag, 9-bit index, and block offset; two ways of 512 blocks each, with valid <1>, tag <29>, and data <64> fields; tag comparators (=?) feed a 2:1 MUX on the data path; a victim buffer sits between the cache and the lower memory level. Numbered steps 1-4 trace an access.]

4 Qs (contd.): What Happens on a Write?
- Write back: main memory is updated only when a dirty block is replaced from the cache.
- Write through: the information is updated in the upper as well as the lower level.
- Write allocate: on a write miss, the block is allocated in the cache.
- No-write allocate: on a write miss, only the next level is written.
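A minimal sketch contrasting the two hit-side policies (the Line struct and toy memory_write are illustrative assumptions; write allocate vs. no-write allocate would govern the miss path, which is omitted here):

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t memory[1024];                 /* toy lower level */
static void memory_write(uint32_t addr, uint32_t v) { memory[addr % 1024] = v; }

typedef struct { bool valid, dirty; uint32_t tag, data; } Line;

/* Write through: both the cache and the lower level are updated on every store. */
void store_write_through(Line *l, uint32_t addr, uint32_t v) {
    l->data = v;
    memory_write(addr, v);
}

/* Write back: only the cache is updated; the dirty bit defers the memory
   write until this line is eventually replaced. */
void store_write_back(Line *l, uint32_t v) {
    l->data = v;
    l->dirty = true;
}
```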

Reducing Cache Miss Penalty
- First technique: multilevel caches
- Second technique: critical word first and early restart
- Third technique: giving priority to read misses over writes
- Fourth technique: victim caches

Reducing Miss Rate: Classifying Misses, the 3 Cs
- Compulsory: the first access to a block cannot be in the cache, so the block must be brought in. Also called cold-start or first-reference misses. (These occur even in an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of the same size.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision or interference misses. (Misses in an N-way associative cache of size X.)
- A more recent fourth "C", Coherence: misses caused by cache-coherence invalidations in multiprocessors.

Second Miss Rate Reduction Technique: Larger Caches
- Larger caches reduce capacity misses.
- Drawbacks: higher cost and longer hit time.
[Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB for 1-, 2-, 4-, and 8-way associativity; total miss rate falls from about 0.14 as size and associativity grow.]

Third Miss Rate Reduction Technique: Higher Associativity
- Miss rates improve with higher associativity.
- Two rules of thumb:
  - An 8-way set-associative cache is almost as effective at reducing misses as a fully associative cache of the same size.
  - 2:1 cache rule: the miss rate of a direct-mapped cache of size N roughly equals the miss rate of a 2-way set-associative cache of size N/2.
- Beware: execution time is the only final measure. Will the clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for external caches and +2% for internal caches.

Example: Calculating CPI with Cache Misses
Given statistics:
- Load/store instructions: 50%, so average memory accesses per instruction = 1.5 (1 instruction fetch + 0.5 data accesses)
- Hit time = 2 clock cycles, hit rate = 90% (miss rate = 10%)
- Miss penalty = 40 CC
- CPI with a perfect cache = 2

AMAT method:
AMAT = hit time + miss rate x miss penalty = 2 + 0.1 x 40 = 2 + 4 = 6 CC (the 4 are penalty cycles)
CPI (overall) = CPI (perfect) + extra memory stall cycles per instruction = 2 + (6 - 2) x 1.5 = 2 + 6 = 8

Number-of-instructions method (assume 1000 instructions in total):
- Perfect cache: each instruction takes 2 clock cycles, hence 1000 x 2 = 2000 CC; CPI = CC/IC = 2000/1000 = 2.
- Imperfect cache: memory accesses = 1000 x 1.5 = 1500 (1000 for the I-cache and 500 for the D-cache).
- Cache misses (at 10%) = 1500 x 0.1 = 150; extra (penalty) clock cycles = 150 x 40 = 6000.
  This is in fact IC x (memory accesses/instruction) x miss rate x miss penalty.
- Total clock cycles = 2000 + 6000 = 8000; CPI = 8000/1000 = 8.0.
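The AMAT-method arithmetic above, restated as a tiny C program (the numbers are the slide's; the formula assumes the hit time is already counted in the perfect-cache CPI):

```c
#include <stdio.h>

int main(void) {
    double hit_time = 2.0, miss_rate = 0.10, miss_penalty = 40.0;
    double accesses_per_instr = 1.5, cpi_perfect = 2.0;

    /* AMAT = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;                  /* 6.0 CC */
    /* CPI = CPI(perfect) + accesses/instr * stall cycles per access */
    double cpi  = cpi_perfect + accesses_per_instr * (amat - hit_time); /* 8.0 */

    printf("AMAT = %.1f CC, CPI = %.1f\n", amat, cpi);
    return 0;
}
```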

Example with a 2-Level Cache
Statistics:
- L1: hit time = 2 CC, hit rate = 90%, miss penalty to L2 = 10 CC (the L2 hit time)
- L2: local hit rate = 80%, miss penalty (to main memory) = 40 CC
- Load/store instructions: 50%
- Global miss rate = ?
[Diagram: CPU -> L1 (HT = 2 CC, hit rate 90%) -> L2 (HT = 10 CC) -> main memory (HT = 40 CC). Of 1000 memory accesses, 100 miss in L1; of those 100, 20 also miss in L2.]

Example (contd.): AMAT method
CPI with a perfect cache = 2.0
AMAT = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))
     = 2 + 0.1 x (10 + 0.2 x 40) = 3.8 CC
CPI = CPI(perfect) + extra memory stall cycles per instruction = 2.0 + (3.8 - 2) x 1.5 = 4.7

Example (contd.): 1000-instruction method
Calculate the extra clock cycles starting from the accesses that miss in L1 (1500 memory accesses, 150 L1 misses).
- Step 2 (hit in L2): total accesses reaching L2 = 150 (the L1 misses). Extra CC for a miss in L1 that hits in L2 = 150 x 10 = 1500. (Eventually every access gets a hit somewhere, which is the important point.)
- Step 3 (miss in L2): local miss rate = 100% - 80% = 20%, so accesses missing in L2 = 150 x 0.2 = 30. Extra CC for misses in L2 = 30 x 40 = 1200.
- Total extra clock cycles = 1500 + 1200 = 2700.
- Total clock cycles for the program = 2000 + 2700 = 4700; CPI = 4700/1000 = 4.7.
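The same two-level example as a sketch in C, confirming both the 3.8 CC AMAT and the 4.7 CPI, and answering the global-miss-rate question (0.1 x 0.2 = 2%):

```c
#include <stdio.h>

int main(void) {
    double ht1 = 2.0,  mr1 = 0.10;       /* L1: hit time, miss rate */
    double ht2 = 10.0, mr2 = 0.20;       /* L2: hit time, local miss rate */
    double mp2 = 40.0;                   /* L2 miss penalty (main memory) */
    double apl = 1.5,  cpi_perfect = 2.0; /* accesses/instr, base CPI */

    double amat      = ht1 + mr1 * (ht2 + mr2 * mp2);  /* 3.8 CC */
    double global_mr = mr1 * mr2;                      /* 0.02 */
    double cpi       = cpi_perfect + apl * (amat - ht1); /* 4.7 */

    printf("AMAT = %.1f CC, global miss rate = %.2f, CPI = %.1f\n",
           amat, global_mr, cpi);
    return 0;
}
```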

Fourth Miss Rate Reduction Technique: Way Prediction and Pseudo-Associative Caches
- Way prediction: extra bits are kept per set to predict the way (block within the set) of the next access.
- The MUX is set early to select the predicted block, so only a single tag comparison is performed.
- What if it misses? The other blocks in the set are then checked.
- Used in the Alpha 21264 (1 bit per block in the I-cache): 1 CC if the predictor is correct, 3 CC if not; prediction accuracy is about 85%.
- Used in the MIPS R4300 embedded processor to lower power.
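A minimal sketch of way prediction for a 2-way set (the struct layout is an illustrative assumption; the 1-vs-3 cycle counts follow the Alpha 21264 figures above):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid; uint32_t tag; } Way;
typedef struct { Way way[2]; uint8_t predict; } Set;  /* 1 predictor bit/set */

/* Probe the predicted way first; on a mispredict, probe the other way
   and retrain the predictor. Returns the hitting way, or -1 on a miss. */
int lookup(Set *s, uint32_t tag, int *cycles) {
    int p = s->predict;
    if (s->way[p].valid && s->way[p].tag == tag) {
        *cycles = 1;                  /* prediction correct */
        return p;
    }
    int q = 1 - p;                    /* check the other block in the set */
    if (s->way[q].valid && s->way[q].tag == tag) {
        s->predict = (uint8_t)q;      /* retrain toward the hitting way */
        *cycles = 3;
        return q;
    }
    *cycles = 3;                      /* true miss */
    return -1;
}
```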

Fifth Miss Rate Reduction Technique: Compiler Optimizations
Instructions:
- Reorder procedures in memory so as to reduce conflict misses.
- Use profiling to look at conflicts (with tools developed for the purpose).
Data (a loop-interchange sketch follows this list):
- Merging arrays: improve spatial locality with a single array of compound elements instead of two parallel arrays.
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
- Loop fusion: combine two independent loops that have the same looping structure and overlapping variables.
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of walking whole columns or rows.
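A loop-interchange sketch under the usual assumptions (a row-major C array; the dimensions are illustrative): both functions do the same work, but the second accesses x in the order it is stored:

```c
#define M 100
#define N 5000
static double x[M][N];

/* Before: the inner loop strides through memory N doubles at a time,
   touching a new cache block on almost every access. */
void scale_before(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: the inner loop walks each row sequentially,
   using every word of each fetched cache block. */
void scale_after(void) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}
```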

Reducing Cache Miss Penalty: Multilevel Caches ("the more the merrier"); Critical Word First and Early Restart ("impatience")

Reducing Cache Miss Penalty: Giving Priority to Read Misses over Writes ("preference")

Reducing Cache Miss Penalty: Merging Write Buffer ("partnership")

Reducing Cache Miss Penalty: Victim Cache ("recycling")

Reducing Cache Miss Penalty: Non-Blocking Caches (hit under 1 miss, hit under 2 misses, hit under 64 misses)

Reducing Cache Miss Penalty or Miss Rate: Prefetching
- Hardware prefetching of instructions and data
- Compiler-controlled prefetching:
  - Register prefetching (load the value into a register ahead of use)
  - Cache prefetching (bring the data only into the cache)
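A cache-prefetching sketch using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of 16 elements is a tuning assumption, not a value from the slides):

```c
/* Compiler-controlled (software) prefetching: request the block holding
   a[i + 16] while summing a[i], hiding part of the miss latency. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```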

Reducing Hit Time
- First technique: small and simple caches
- Second technique: avoiding address translation during indexing of the cache
- Third technique: pipelining cache access
- Fourth technique: trace caches

Access Times vs. Size and Associativity
[Figure: access time (2-16 ns) vs. cache size (4 KB to 256 KB) for 1-way (direct mapped), 2-way, 4-way, and fully associative organizations; access time grows with both size and associativity.]

Main Memory Organization for Improving Performance
Techniques for higher bandwidth (a low-order interleaving sketch follows):
- Wider main memory
- Simple interleaved memory
- Independent memory banks
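A minimal sketch of simple (low-order) interleaving, assuming 4 banks and 4-byte words: consecutive word addresses map to different banks, so sequential accesses can proceed in parallel across banks:

```c
#include <stdint.h>

enum { BANKS = 4, WORD = 4 };  /* illustrative: 4 banks, 4-byte words */

/* Low-order interleaving: word address modulo the bank count picks the
   bank, so words k, k+1, k+2, k+3 land in banks 0..3. */
static inline uint32_t bank_of(uint32_t addr) {
    return (addr / WORD) % BANKS;
}
static inline uint32_t row_in_bank(uint32_t addr) {
    return (addr / WORD) / BANKS;
}
```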