
BALANCED CACHE Ayşe BAKIR, Zeynep ZENGİN

Ayse Bakır, CMPE 511, Bogazici University

Outline
- Introduction
- Motivation
- The B-Cache Organization
- Experimental Methodology and Results
- Programmable Decoder Design
- Analysis
- Related Work
- Conclusion

Introduction
- The increasing gap between memory latency and processor speed is a critical bottleneck to achieving a high-performance computing system.
- A multilevel memory hierarchy has been developed to hide the memory latency.

Introduction
[Diagram: processor, level-one cache, level-two cache, main memory]
Because the level-one cache normally resides on the processor's critical path, fast access to it is an important issue for processor performance.

Introduction
Two cache organization models have been developed:
- Direct-Mapped Cache
- Set-Associative Cache

Introduction
1. Direct-Mapped Cache: each memory block can be placed in exactly one cache set, selected by the index bits of the address.

Introduction
2. Set-Associative Cache: each memory block can be placed in any of the N ways of its set.

Introduction
Direct-Mapped Cache: faster access time; consumes less power per access; consumes less area; easy to implement; simple to design; but has a higher miss rate.
Set-Associative Cache: longer access time; consumes more power per access; consumes more area; but reduces conflict misses and has a replacement policy.
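The trade-off can be made concrete with a toy simulation (illustrative only: the 8-set direct-mapped cache, the same-capacity 4-set 2-way cache, and the address trace are hypothetical, not the configuration evaluated in the talk):

```python
def simulate(trace, num_sets, ways):
    """Count misses for a cache of num_sets * ways blocks with LRU
    replacement; addresses in the trace are block addresses."""
    sets = [[] for _ in range(num_sets)]  # each set: list of tags, MRU last
    misses = 0
    for addr in trace:
        index, tag = addr % num_sets, addr // num_sets
        s = sets[index]
        if tag in s:
            s.remove(tag)        # hit: move the tag to the MRU position
        else:
            misses += 1
            if len(s) == ways:   # set full: evict the LRU tag (front)
                s.pop(0)
        s.append(tag)
    return misses

# Two blocks 8 apart collide in an 8-set direct-mapped cache.
trace = [0, 8] * 100
dm_misses = simulate(trace, num_sets=8, ways=1)  # thrashes: every access misses
sa_misses = simulate(trace, num_sets=4, ways=2)  # both blocks coexist in set 0
print(dm_misses, sa_misses)  # -> 200 2
```

The direct-mapped cache misses on every reference because the two blocks fight over one set, while the 2-way cache of the same capacity misses only twice (the cold misses).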

Introduction
Cache sets are not used evenly:
- Frequent hit sets have many more cache hits than other sets.
- Cache misses occur more frequently in frequent miss sets.
- Less accessed sets receive less than 1% of the total cache references.

Introduction
Balanced Cache (B-Cache): a mechanism that provides the benefit of cache block replacement while maintaining the constant access time of a direct-mapped cache.

Introduction
1. The decoder length of a traditional direct-mapped cache is increased by three bits: accesses to heavily used sets can be reduced to 1/8th of the original design, but only 1/8th of the memory address space then has a fixed mapping to the cache sets.
2. A replacement policy is added.
3. A programmable decoder is used to map the remaining addresses dynamically.
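A highly simplified software model of the programmable-decoder idea (the class, bit widths, and method names here are invented for illustration; the real design is a CAM-based hardware decoder):

```python
# Simplified model of B-Cache set selection: the index is split into a
# non-programmable part (NPI bits) that selects a subarray, and a
# programmable part (PI bits) matched against CAM entries that can be
# remapped at run time. Widths below are hypothetical, not the talk's.
NPI, PI = 2, 3

class Subarray:
    def __init__(self, num_pd_entries):
        # pd[way] holds the PI-bit pattern currently mapped to that way;
        # None marks an invalid PD entry (shown as 'X' on the slides).
        self.pd = [None] * num_pd_entries

    def lookup(self, pi_bits):
        """Return the matching way, or None on a PD miss."""
        for way, pattern in enumerate(self.pd):
            if pattern == pi_bits:
                return way
        return None

    def remap(self, way, pi_bits):
        """On a miss, the replacement policy picks a victim way and the
        PD entry is reprogrammed so pi_bits now maps to it."""
        self.pd[way] = pi_bits

sub = Subarray(num_pd_entries=2)   # two ways share this subarray
sub.remap(0, 0b101)
print(sub.lookup(0b101))  # -> 0
print(sub.lookup(0b011))  # -> None (PD miss: no current mapping)
```

The key point the model captures is that which addresses map to a set is no longer fixed by wiring: a PD miss triggers the replacement policy and a remap, which is how the B-Cache spreads load away from heavily used sets.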

Motivation – Example
[Figure: 8-bit addresses; reference stream 0, 1, 8, 9, 0, 1, 8, 9, …]

Motivation – Example
[Figure: 8-bit address; the miss rate is the same as in a 2-way cache. X: invalid PD entry]

B-Cache Organization – Terminology
- PI: index length of the PD (programmable decoder)
- NPI: index length of the NPD (non-programmable decoder)
- OI: index length of the original direct-mapped cache
- Memory address mapping factor (MF): MF = 2^(PI+NPI) / 2^OI, where MF ≥ 1
- B-Cache associativity (BAS): BAS = 2^OI / 2^NPI, where BAS ≥ 1

B-Cache Organization
MF = 2^(PI+NPI) / 2^OI = 2^(6+6) / 2^9 = 8        BAS = 2^OI / 2^NPI = 2^9 / 2^6 = 2^3 = 8
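The example's numbers (PI = 6, NPI = 6, OI = 9) can be checked directly against the two definitions:

```python
PI, NPI, OI = 6, 6, 9   # index widths from the example slide

MF  = 2 ** (PI + NPI) // 2 ** OI   # mapping factor: 2^12 / 2^9 = 8
BAS = 2 ** OI // 2 ** NPI          # B-Cache associativity: 2^9 / 2^6 = 8

print(MF, BAS)  # -> 8 8
```

So eight times as much of the address space competes for the cache as in the original direct-mapped design (MF = 8), and each non-programmable index selects among eight candidate sets (BAS = 8), which is what gives the B-Cache its 8-way-like behavior.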

B-Cache Organization – Replacement Policy
- Random policy: simple to design and requires very little extra hardware.
- Least Recently Used (LRU): may achieve a better hit rate, but has more area overhead than the random policy.
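A minimal sketch of the two victim-selection policies (the function names and the per-way timestamp representation are hypothetical, chosen only to contrast the hardware cost):

```python
import random

def pick_victim_random(ways):
    # Random: almost no extra state — a single PRNG/counter in hardware.
    return random.randrange(len(ways))

def pick_victim_lru(last_used):
    # LRU: needs a timestamp (or ordering) per way, hence the extra area,
    # but evicts the way that has gone unused the longest.
    return min(range(len(last_used)), key=lambda w: last_used[w])

# e.g. 4 ways last touched at cycles 12, 3, 40, 7 -> LRU evicts way 1
print(pick_victim_lru([12, 3, 40, 7]))  # -> 1
```

The contrast is exactly the slide's trade-off: random needs no per-way bookkeeping, while LRU pays area for the ordering state in exchange for a potentially better hit rate.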

Experimental Methodology and Results
- Miss rate is used as the primary metric to measure B-Cache effectiveness, and the MF and BAS parameters are determined.
- Results are compared with a baseline level-one cache: a direct-mapped 16 kB cache with a line size of 32 bytes, for both instruction and data caches.
- A 4-issue out-of-order processor simulator is used to collect the miss rates; 26 SPEC2K benchmarks are run using the SimpleScalar tool set.

Experimental Methodology and Results
[Chart: miss rates of a 16-entry victim buffer, set-associative caches, and B-Caches with different MFs]

Experimental Methodology and Results
The miss rate reduction of the B-Cache is as good as that of a 4-way cache for the data cache. For the instruction cache, on average, the miss rate reduction is 5% better than that of a 4-way cache.

Zeynep Zengin, CMPE 511, Bogazici University

Programmable Decoder Design: Latency, Storage, and Power Costs

Timing Analysis
- Critical path: in a direct-mapped cache it is on the tag side; in the B-Cache it may be on the tag side or the data side.
- The B-Cache modifies the local decoder.

Timing Analysis
[Figure]

Storage Overhead
- The B-Cache additionally uses CAM cells.
- A CAM cell is 25% larger than the SRAM cells used by the data and tag memory.
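A back-of-the-envelope estimate of what that costs (the 25% CAM/SRAM ratio and the 16 kB, 32-byte-line baseline come from the slides; the 32-bit address width, the 6-bit CAM pattern, and the one-PD-entry-per-line assumption are hypothetical, so the resulting percentage is only a sanity check, not the paper's figure):

```python
# Rough storage-overhead estimate for the programmable decoder's CAM cells.
LINE_BYTES, CACHE_BYTES, ADDR_BITS, PI = 32, 16 * 1024, 32, 6
CAM_TO_SRAM = 1.25          # a CAM cell ~= 1.25x an SRAM cell (slide figure)

lines       = CACHE_BYTES // LINE_BYTES              # 512 lines
data_bits   = CACHE_BYTES * 8                        # 131072 data bits
offset_bits = 5                                      # log2(32-byte line)
index_bits  = 9                                      # log2(512 sets)
tag_bits    = ADDR_BITS - offset_bits - index_bits   # 18 tag bits per line

baseline_bits = data_bits + lines * tag_bits         # data + tag SRAM
pd_bits_sram  = lines * PI * CAM_TO_SRAM             # CAM bits in SRAM units

overhead = pd_bits_sram / baseline_bits
print(f"{overhead:.1%}")    # a few percent of total cache storage
```

Under these assumptions the PD adds on the order of 3% to the cache's storage, which is consistent with the slides' framing of the CAM cells as a modest overhead.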

Power Overhead
- Extra power consumption: the PD of each subarray.
- Power reduction: the 3-bit data length reduction and the removal of the 3-input NAND gates.

Analysis
- Overall Performance
- Overall Energy
- Design Tradeoffs for MF and BAS for a Fixed Length of PD
- Balance Evaluation
- The Effect of L1 Cache Sizes
- Comparison

Overall Performance

Overall Energy
- Static and dynamic power dissipation; dynamic dissipation comes from charging and discharging the load capacitance.
- Memory-related energy: on-chip caches and off-chip memory.
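The charging/discharging contribution follows the standard CMOS dynamic-power model (this formula and the sample numbers are textbook values, not figures from the talk):

```python
# Standard CMOS model: each full charge/discharge cycle of capacitance C
# at supply voltage V dissipates C * V^2, so average dynamic power is
#     P = alpha * C * V^2 * f
# for activity factor alpha and clock frequency f.

def dynamic_power(alpha, c_farads, v_volts, f_hertz):
    return alpha * c_farads * v_volts ** 2 * f_hertz

# Hypothetical numbers: 10% activity, 1 nF switched capacitance, 1.2 V, 1 GHz
p = dynamic_power(0.1, 1e-9, 1.2, 1e9)  # ~0.144 W
print(p, "W")
```

This is why the B-Cache's power accounting above splits into an overhead (the extra switched capacitance of the PDs) and reductions (fewer bits and gates switching per access).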

Design Tradeoffs for MF and BAS for a Fixed Length of PD
The question is which design achieves a higher miss rate reduction.

Design Tradeoffs for MF and BAS for a Fixed Length of PD

Balance Evaluation
- Frequent hit sets (FHS): hits more than 2 times the average
- Frequent miss sets (FMS): misses more than 2 times the average
- Less accessed sets (LAS): accesses below half the average

          FHS    CH     FMS    CM     LAS    TCA
DM avg    7.5    57.2   5.6    36.5   50.2   10.5
B-Cache   7.6    39.8   2.2    15.7   32.4   8.4
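The three definitions can be written down directly (the per-set hit and miss counts below are made up for illustration; only the 2x and one-half thresholds come from the slide):

```python
# Classify cache sets per the slide's definitions: frequent hit sets have
# hits > 2x the per-set average, frequent miss sets have misses > 2x the
# average, and less accessed sets have total accesses below half the
# average.

def classify(hits, misses):
    n = len(hits)
    avg_h = sum(hits) / n
    avg_m = sum(misses) / n
    avg_a = (sum(hits) + sum(misses)) / n
    fhs = [i for i in range(n) if hits[i] > 2 * avg_h]
    fms = [i for i in range(n) if misses[i] > 2 * avg_m]
    las = [i for i in range(n) if hits[i] + misses[i] < avg_a / 2]
    return fhs, fms, las

hits   = [90, 10, 5, 5]   # set 0 gets most of the hits
misses = [2, 30, 2, 2]    # set 1 misses far more than the others
print(classify(hits, misses))  # -> ([0], [1], [2, 3])
```

In the table above, the B-Cache shrinks the shares attributed to frequent miss sets and less accessed sets relative to the direct-mapped average, which is the "balancing" the design aims for.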

- The miss rate reductions increase when the MF is increased.
- Among the B-Cache designs, the one with MF = 8 and BAS = 8 is the best.

Comparison
- With a victim buffer: the miss rate reduction of the B-Cache is higher than that of the victim buffer.
- With a highly associative cache (HAC): the HAC targets low-power embedded systems and is an extreme case of the B-Cache, in which the decoder is fully programmable.

Related Work
- Techniques that reduce the miss rate of direct-mapped caches
- Techniques that reduce the access time of set-associative caches

Reducing the Miss Rate of Direct-Mapped Caches
Techniques:
- Page allocation
- Column-associative cache
- Adaptive group-associative cache
- Skewed-associative cache

Reducing the Access Time of Set-Associative Caches
- Partial address matching: predicting the hit way
- Difference-bit cache

B-Cache Summary
- The B-Cache can be applied to both high-performance and low-power embedded systems.
- The cache is balanced without any software intervention.
- The design is feasible and easy to implement.

Conclusion
- The B-Cache balances accesses to cache sets by increasing the decoder length and incorporating a replacement policy into a direct-mapped cache design.
- Programmable decoders dynamically determine which memory addresses map to each cache set.
- A 16 kB level-one B-Cache outperforms a traditional direct-mapped cache of the same size by 64.5% and 37.8% in miss rate reduction for the instruction and data caches, respectively.
- Average IPC improvement: 5.9%. Energy reduction: 2%.
- Access time: the same as a traditional direct-mapped cache.

References
1. C. Zhang, "Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders", ISCA 2006, IEEE.
2. C. Zhang, "Balanced Instruction Cache: Reducing Conflict Misses of Direct-Mapped Caches through Balanced Subarray Accesses", IEEE Computer Architecture Letters, May 2006.
3. Wilkinson, B. (1996), "Computer Architecture: Design and Performance", Prentice Hall Europe.
4. University of Maryland, oj01/cache/cache.html