Restrictive Compression Techniques to Increase Level 1 Cache Capacity. Prateek Pujara, Aneesh Aggarwal, Dept. of Electrical and Computer Engineering, Binghamton University.

Presentation transcript:

Restrictive Compression Techniques to Increase Level 1 Cache Capacity. Prateek Pujara, Aneesh Aggarwal, Dept. of Electrical and Computer Engineering, Binghamton University. Presented by Xingjian Liang.

Outline
1. Introduction
2. Motivation and Related Work
3. Restrictive Compression Techniques
4. Performance Results
5. Enhanced Techniques
6. Sensitivity Study
7. Conclusion

1. Introduction
- The basic technique, All Words Narrow (AWN), increases the level 1 data cache capacity by about 20% (Section 3).
- The AWN technique is extended by providing additional space for a few upper half-words in a cache block (AHS). The AHS technique increases the L1 data cache capacity by almost 90% (with an average of about 50%) (Section 3).
- Optimizing Address Tags (OATS) is used to reduce the increased cache tag space requirement (Section 5).
- These results are for a 32-bit architecture; the techniques will be even more beneficial for a 64-bit architecture.

2. Motivation and Related Work
- Elaborate cache compression techniques cannot be applied to L1 caches, where decompression latency would fall on the critical hit path.
- Any compression technique applied to L1 caches should not require updates to the byte-offset. We call such compression techniques restrictive compression techniques.
- Figure 3 presents the results of the measurements as a stacked graph.

Figure 3: Percentage distribution of word sizes in the new cache blocks brought into the L1 data cache.

3. Restrictive Compression Techniques
- A word is considered narrow if it can be represented using 16 bits and can be reconstructed by sign-extending the narrow word.
- All Words Narrow (AWN): a physical cache block can hold one normal cache block or up to two narrow cache blocks, as shown in Figure 4(a). The compressibility check is sketched below.
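As a concrete illustration, here is a minimal C sketch of the narrow-word test and the AWN compressibility check (the helper names are ours, not the paper's):

```c
#include <stdbool.h>
#include <stdint.h>

/* A 32-bit word is narrow if bits 31..15 are all equal, i.e. the
 * upper half is just the sign extension of bit 15, so the word can
 * be stored as a half-word and recovered exactly. */
static bool is_narrow(uint32_t word)
{
    uint32_t top17 = word >> 15;            /* bit 15 plus the upper 16 bits */
    return top17 == 0 || top17 == 0x1FFFF;  /* all zeros or all ones */
}

/* AWN: a cache block is stored compressed only if every word in it
 * is narrow. */
static bool awn_compressible(const uint32_t *block, int words_per_block)
{
    for (int i = 0; i < words_per_block; i++)
        if (!is_narrow(block[i]))
            return false;
    return true;
}
```

In hardware the per-word test is just a gate over the upper 17 bits, which is what makes the check cheap enough to run on the cache fill path.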

3. Restrictive Compression Techniques: Figure 4(b) (data read access from a cache with AWN).

3. Restrictive Compression Techniques
- If the width bit is set, two half-words are read simultaneously from two locations in the cache block.
- If the width bit is reset, one normal-sized word is read from a single location, as in a conventional cache.
A behavioral sketch of this read path follows below.
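Behaviorally, the read path reduces to a width-bit-controlled multiplexer followed by a sign extender. A simplified C sketch (it abstracts away the two-location half-word fetch shown in Figure 4(b)):

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of the AWN read path: the width bit selects
 * between a sign-extended half-word (compressed block) and a
 * conventional full-word read (uncompressed block). */
static uint32_t awn_read(bool width_bit, uint16_t half_word, uint32_t full_word)
{
    if (width_bit) {
        /* sign-extend the 16-bit narrow word to 32 bits */
        uint32_t value = half_word;
        if (value & 0x8000u)
            value |= 0xFFFF0000u;
        return value;
    }
    return full_word;
}
```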

3. Restrictive Compression Techniques
- The AWN technique avoids byte-offset updates.
- The data read from the cache has to be sign-extended if the width bit is set.
- When a normal cache block is brought in, the age of the most recently used cache block within each physical cache block is used in the comparison that finds the least recently used victim. This ensures that the AWN technique will not perform worse than the conventional cache. (A victim-selection sketch follows below.)
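A sketch of this victim-selection rule, assuming at most two resident narrow blocks per physical block and an LRU age counter per resident block, where 0 means most recently used (the structure and names are ours):

```c
/* Each physical block is ranked by the age of its most recently used
 * resident block; the physical block whose MRU resident is oldest is
 * chosen as the victim. Ranking by the MRU resident is what prevents
 * evicting a physical block that still holds a recently used narrow
 * block, so AWN never does worse than conventional LRU. */
#define MAX_RESIDENT 2

struct physical_block {
    int num_resident;        /* 1 or 2 resident (possibly narrow) blocks */
    int age[MAX_RESIDENT];   /* LRU age per resident block, 0 = MRU */
};

static int choose_victim(const struct physical_block *set, int ways)
{
    int victim = 0;
    int victim_mru_age = -1;
    for (int w = 0; w < ways; w++) {
        /* age of this way's most recently used resident (smaller = newer);
         * assumes every way holds at least one block, since an empty way
         * would simply be filled without any eviction */
        int mru_age = set[w].age[0];
        for (int i = 1; i < set[w].num_resident; i++)
            if (set[w].age[i] < mru_age)
                mru_age = set[w].age[i];
        /* the way whose MRU resident is oldest overall is the victim */
        if (mru_age > victim_mru_age) {
            victim_mru_age = mru_age;
            victim = w;
        }
    }
    return victim;
}
```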

3. Restrictive Compression Techniques
- Additional Half-Word Storage (AHS): provide extra half-word storage for the upper half-words of a few normal-sized words.
- The additional half-words are divided equally among all the narrow cache blocks in a physical cache block.
- Figure 6(a) shows how data is packed in a physical cache block. (An admission-check sketch follows below.)
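We read AHS-k as k extra half-word slots per physical block, split equally between the co-resident blocks (so 2 each for AHS-4 with two residents). A block can then be stored compressed if its wide words fit in its allotment; a sketch reusing is_narrow() from the AWN sketch above:

```c
/* AHS admission check: a block is stored compressed if the number of
 * words that are NOT narrow does not exceed the extra half-word slots
 * allotted to this block (each wide word parks its upper half-word in
 * one extra slot). */
static bool ahs_compressible(const uint32_t *block, int words_per_block,
                             int extra_halfword_slots)
{
    int wide_words = 0;
    for (int i = 0; i < words_per_block; i++)
        if (!is_narrow(block[i]))
            wide_words++;
    return wide_words <= extra_halfword_slots;
}
```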

3. Restrictive Compression Techniques: Figure 6(b) shows the schematic of a data read access from a cache with AHS.

3. Restrictive Compression Techniques
- Cache access latency impact.
- Cache energy consumption: in the AWN and AHS techniques, additional energy is consumed primarily in the additional tag comparisons, and in the L1 cache data array as a result of the increased hit rate.

4. Performance Results
- Experimental setup: a modified SimpleScalar simulator is used, simulating a 32-bit PISA architecture.
- The hardware features and default parameters are given in Table 1.

4. Performance Results
- Statistics are collected for 500M instructions.
- AWN increases cache capacity by about 20%; AHS-4 increases it by about 50%.

5. Enhanced Techniques
- Optimizing Address Tags (OATS), to reduce the extra tag space (one possible reading is sketched below)
- Adaptive AHS (AAHS)
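The slides do not spell out the OATS mechanism, but combined with the conclusion's observation that just 2 additional tag bits suffice, a natural reading is that a physical block stores one full tag plus a few low-order tag bits for the second narrow block, so two blocks may co-reside only if their tags agree above those bits. A hypothetical sketch of that co-residency test:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical OATS co-residency test (our reading, not spelled out
 * in the slides): with EXTRA_TAG_BITS low-order tag bits stored per
 * additional narrow block, two blocks may share a physical block only
 * if the rest of their address tags is identical. */
#define EXTRA_TAG_BITS 2   /* "just 2 additional tag bits" per the conclusion */

static bool oats_can_share(uint32_t tag_a, uint32_t tag_b)
{
    return (tag_a >> EXTRA_TAG_BITS) == (tag_b >> EXTRA_TAG_BITS);
}
```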

5. Enhanced Techniques
- L1 data cache with the OATS and AAHS techniques, compared to the 8KB 2-way conventional L1 data cache.

5. Enhanced Techniques
- The increase in cache capacity did not translate into comparable reductions in cache miss rates.
- The small reduction in cache miss rate arises because different benchmarks require different cache sizes.

5. Results
- Figure 10 shows that an average miss rate reduction of about 23% can be achieved with the AAHS-4 technique.

6. Sensitivity Study
- The increase in the L1 data cache capacity is measured as the block size is varied from 16 bytes to 128 bytes.
- Figure 11 presents the measurements for the various cache block sizes for the AHS-4 technique.

7. Conclusion
- Compression techniques can be used to increase the L1 data cache capacity.
- The basic technique, AWN, compresses a cache block only if all the words in the cache block are narrow.
- The AWN technique is extended (AHS) by providing some additional space for the upper half-words of a few normal-sized words in a cache block.
- Providing just 2 additional tag bits is enough to give performance almost equal to that obtained by doubling the number of tag bits.
