Signature Buffer: Bridging Performance Gap between Registers and Caches
Lu Peng, Jih-Kwon Peir, Konrad Lai

Introduction
Two types of storage:
- Registers: fast and small; supply data for operations
- Memory: large and slow; a cache holds recently used data
Most RISC processors operate only on data in registers, so data is communicated between instructions along the path: producer -> store -> load -> consumer.
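To make that communication path concrete, here is a minimal C sketch (my illustration, not from the slides): a value produced in one statement is written to memory and read back before the consumer can use it, which on a RISC machine becomes a store followed by a load.

```c
#include <stdio.h>

/* Illustration of the producer -> store -> load -> consumer path.
 * On a RISC machine, the write through buf compiles to a store and
 * the read back compiles to a load; the consumer stalls until the
 * load delivers the value. */
int main(void) {
    int buf[1];
    int produced = 6 * 7;       /* producer: computed in a register */
    buf[0] = produced;          /* store, e.g.  sw r2, 0(r1)        */
    int loaded = buf[0];        /* load,  e.g.  lw r3, 0(r1)        */
    printf("%d\n", loaded + 1); /* consumer of the loaded value     */
    return 0;
}
```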

Introduction
Projected processors at the 35nm technology node:
- 10 GHz clock
- 64 KB L1 cache
- 3-7 cycle L1 cache access time
IPC degrades by 3.5% per additional cycle of L1 cache access time.
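To gauge what that sensitivity implies (my extrapolation, assuming the 3.5%-per-cycle degradation composes roughly linearly across the slide's 3-7 cycle range):

```latex
% Assumed-linear extrapolation of the slide's 3.5%-per-cycle figure:
% stretching L1 access latency from 3 to 7 cycles costs roughly 14% IPC.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  \Delta\mathrm{IPC} \approx (7 - 3)\ \text{cycles} \times 3.5\%/\text{cycle} = 14\%
\]
\end{document}
```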

Signature Buffer
- Zero-cycle load: "The load and its dependent instructions can be fetched, dispatched and executed at the same time."
- Avoid address calculation: each load and store uses a signature to access the storage, so the signature buffer can be accessed in the early pipeline stages.
A signature consists of:
- the color of the base register
- the displacement value
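A minimal sketch of what a signature might look like (the field widths and the index function below are my assumptions, not the paper's exact design): every field is available at decode, so no effective-address addition is needed before indexing the SB.

```c
#include <stdint.h>

/* Hypothetical signature encoding. The "color" is a small version tag
 * associated with the base register (advanced when the register is
 * redefined); the displacement comes straight from the instruction.
 * Both are known at decode, before base + displacement is computed. */
typedef struct {
    uint8_t  base_reg; /* architectural base register number       */
    uint8_t  color;    /* current color (version) of that register */
    int16_t  disp;     /* displacement field of the instruction    */
} signature_t;

enum { NUM_SB_SETS = 64, LINE_SHIFT = 5 }; /* illustrative sizes */

/* Index the SB using decode-time fields only. */
static unsigned sb_index(signature_t s) {
    unsigned line = (uint16_t)s.disp >> LINE_SHIFT;
    return ((unsigned)s.base_reg ^ (unsigned)s.color ^ line) % NUM_SB_SETS;
}
```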

Outline
- Motivation
- Implementation
- Performance evaluation

Motivation – Memory Reference Correlations
- Signature correlations: store-load and load-load pairs can be correlated directly by their signatures.
- Signature reference locality: nearby memory references often use the same base register and differ only by a small displacement.
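A hypothetical access pattern (mine, standing in for the slide's Parser and Bzip examples) that exhibits both properties; the assembly in the comments is schematic:

```c
/* Field accesses off one base pointer: small displacements, one base
 * register, so signatures expose the correlations directly. */
struct node { int key; int val; struct node *next; };

int touch(struct node *p, int k) {
    p->key = k;     /* store sw r2, 0(r1): signature (color(r1), 0)   */
    int a = p->key; /* load  lw r3, 0(r1): same signature as the store
                       -> store-load correlation                      */
    int b = p->val; /* load  lw r4, 4(r1): signature (color(r1), 4)
                       -> same base, small displacement delta
                       = signature reference locality                 */
    return a + b;
}
```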

Example 1
(Figure: source and assembly code of function copy_disjunct from Parser, annotated with signature correlations and signature reference locality.)

Example 2
(Figure: source and assembly code of function bsW from Bzip.)

Signature Buffer

Signature Buffer
(Figure: signature buffer in its initial state.)

Signature Buffer
(Figure: signature buffer after an update; a base register's color advances, e.g. from 32 to 33, so entries tagged with the old color no longer match.)

Data Alignment

Data Alignment
(Figure sequence: step-by-step walkthrough of the SB directory (SB tag, L1 tag, valid bits, bound) and SB data array next to the L1 tag and data arrays. The signature request stream A-001 -> A-101 -> B-010 -> X-000 resolves to real addresses C-100, D-000, D-101, D-000. Each SB miss allocates or updates a directory entry; the final request overlaps data already claimed by entries A and B, so the high half of A and the low half of B are invalidated. A software sketch of this lookup and refill follows.)
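A simplified software model of the lookup and refill that the walkthrough animates; this is a sketch under my own naming, with the two-half valid scheme inferred from the slide's Valid and Bound fields, not the paper's exact hardware:

```c
#include <stdbool.h>
#include <stdint.h>

/* One SB directory entry, following the fields on the slide: an SB tag
 * derived from the signature, the L1 tag the entry is aligned to,
 * per-half valid bits, and a bound marking the alignment boundary.
 * All widths are illustrative. */
typedef struct {
    uint32_t sb_tag;   /* tag derived from the signature     */
    uint32_t l1_tag;   /* L1 line this entry is aligned to   */
    bool     valid_lo; /* low half valid                     */
    bool     valid_hi; /* high half valid                    */
    uint8_t  bound;    /* alignment boundary within the line */
} sb_entry_t;

/* An SB hit requires a matching SB tag and a valid half. */
static bool sb_lookup(const sb_entry_t *e, uint32_t sb_tag, bool hi_half) {
    if (e->sb_tag != sb_tag) return false;
    return hi_half ? e->valid_hi : e->valid_lo;
}

/* On an SB miss, once the real address resolves, (re)allocate the entry
 * against the L1 line; a different signature overlapping the same L1
 * data would have its overlapping half invalidated (the "invalidate
 * high A, low B" step in the figure). */
static void sb_refill(sb_entry_t *e, uint32_t sb_tag,
                      uint32_t l1_tag, uint8_t bound, bool hi_half) {
    if (e->sb_tag != sb_tag) {          /* reallocate for new signature */
        e->sb_tag   = sb_tag;
        e->valid_lo = e->valid_hi = false;
    }
    e->l1_tag = l1_tag;
    e->bound  = bound;
    if (hi_half) e->valid_hi = true; else e->valid_lo = true;
}
```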

Microarchitecture
- Bypass I: SB hit or an early store-load forwarding
- Bypass II: normal store-load forwarding
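Read as a dispatch-time routing decision, the two bypasses could be modeled like this (predicate names are mine; the real control logic is in the paper's pipeline figure):

```c
#include <stdbool.h>

/* Bypass I supplies data at dispatch (zero-cycle load); Bypass II is
 * conventional store-load forwarding after address generation. */
typedef enum { BYPASS_I, BYPASS_II, CACHE_ACCESS } load_path_t;

load_path_t route_load(bool early_store_fwd, bool sb_hit, bool lsq_fwd) {
    if (early_store_fwd || sb_hit) /* signature matched early            */
        return BYPASS_I;           /* skip address calc and cache access */
    if (lsq_fwd)                   /* address-based match in the LSQ     */
        return BYPASS_II;          /* normal store-load forwarding       */
    return CACHE_ACCESS;           /* fall back to the L1 cache          */
}
```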

Microarchitecture
(Figure: pipeline block diagram.)

Performance Evaluation

Performance Evaluation – IPC
- SB – nospec: 13% speedup
- SB – perfect: 14% speedup

Performance Evaluation – Load Distribution
- Normal store-load forwarding and L1 accesses are reduced to 30% of loads, so 70% of loads benefit from the SB.
- With a perfect memory dependence predictor, the SB achieves 23% zero-cycle loads.

Performance Evaluation – SB Hit Ratio
The average SB hit rate is about 51%.

Performance Evaluation – Comparison with L0 Cache
The performance benefit of the SB grows with L1 latency and always exceeds that of an L0 cache.

Performance Evaluation – Comparison with L0 Cache
A larger L0 yields a higher hit rate, but the SB is less sensitive to size.

Advantages
- Non-speculative: data obtained from the SB without intervening stores is always correct.
- All loads can access data from the SB, with no restriction on the type of load or the base register.
- Loads through the SB bypass address generation and cache access completely.
- Store-load correlation is established from the instruction encoding bits, simplifying the hardware requirements.
- The SB uses line-based granularity to capture spatial locality.

Questions?

Loads – SB Specific
- Early S-L forwarding: a load's signature is identical to that of an earlier store in the LSQ, with no intervening store in between (zero-cycle load & SB hit).
- Early SB access: the SB is accessed as soon as the load is fetched and decoded (zero-cycle load & SB hit).
- Delayed SB access: the SB is accessed after memory dependences resolve, because of intervening stores (SB hit).
- Non-signature forwarding: consecutive SB misses to the same SB line get data forwarded from previous misses (SB miss).
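The four cases can be summarized as a classification (a sketch; the names are mine):

```c
/* Hypothetical classification of how a load is serviced with an SB,
 * mirroring the four cases above. */
typedef enum {
    EARLY_SL_FORWARD,  /* signature matches an earlier LSQ store with no
                          intervening store: zero-cycle load, SB hit    */
    EARLY_SB_ACCESS,   /* SB read right after fetch/decode:
                          zero-cycle load, SB hit                       */
    DELAYED_SB_ACCESS, /* SB read after memory dependences resolve
                          (intervening stores): SB hit                  */
    NON_SIG_FORWARD    /* SB miss; data forwarded from a previous miss
                          to the same SB line                           */
} sb_load_case_t;
```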