Using a Victim Buffer in an Application-Specific Memory Hierarchy
Chuanjun Zhang*, Frank Vahid**
*Dept. of Electrical Engineering, **Dept. of Computer Science and Engineering, University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation.

Low Power/Energy Techniques are Essential
"Hot enough to cook an egg" (Skadron et al., 30th ISCA): high-performance processors are becoming too hot to operate
Low energy dissipation is imperative for battery-powered embedded systems
Low-power techniques are therefore essential for both embedded systems and high-performance processors

Caches Consume Much Power
Caches consume >50% of total processor system power in the ARM920T and M*CORE (Segars 01, Lee 99)
Caches are accessed often, so they consume much dynamic power
Associativity reduces misses, spending less power off-chip, but costs more power per access
A victim buffer helps (Jouppi 90): add it to a direct-mapped cache, keep recently evicted lines in the small buffer, and check that buffer on a miss
Acts like higher associativity, but without the extra power per access: 10% energy savings and 4% performance improvement reported (Albera 99)
[Diagram: processor, L1 cache with victim buffer, off-chip memory]

Victim Buffer
With a victim buffer: one cycle on a cache hit, two cycles on a victim buffer hit, twenty-two cycles on a miss in both the cache and the victim buffer
Without a victim buffer: one cycle on a cache hit, twenty-one cycles on a cache miss, and more accesses to off-chip memory
[Diagram: the processor accesses the L1 cache (hit: 1 cycle); on a miss the victim buffer is checked (hit: 2 cycles); a miss in both goes off-chip (22 cycles with the buffer, 21 without)]
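To make the access path concrete, here is a minimal Python sketch of a direct-mapped cache backed by a small fully associative victim buffer, using the cycle costs from this slide. It is a hypothetical model for illustration, not the authors' simulator; the FIFO replacement and swap-on-hit policy are assumptions.

```python
# Hypothetical model of the access path (not the authors' simulator).
# Cycle costs follow the slide: 1 for a cache hit, 2 for a victim-buffer
# hit, 22 for a full miss (21 when the buffer is disabled).
from collections import deque

LINE_BYTES = 16  # matches the 16-byte cache lines on the next slide

class DirectMappedCacheWithVB:
    def __init__(self, size_kb=8, vb_lines=8, vb_on=True):
        self.num_lines = size_kb * 1024 // LINE_BYTES
        self.lines = {}                    # index -> tag
        self.vb = deque(maxlen=vb_lines)   # FIFO of (tag, index) pairs
        self.vb_on = vb_on

    def access(self, addr):
        """Simulate one access; return its cost in cycles."""
        block = addr // LINE_BYTES
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.lines.get(index) == tag:
            return 1                       # cache hit
        victim = (self.lines[index], index) if index in self.lines else None
        if self.vb_on and (tag, index) in self.vb:
            # victim-buffer hit: swap the line back into the cache
            self.vb.remove((tag, index))
            if victim:
                self.vb.append(victim)
            self.lines[index] = tag
            return 2
        # miss in both: fetch from off-chip memory; the evicted line
        # (if any) goes into the victim buffer when it is enabled
        if self.vb_on and victim:
            self.vb.append(victim)
        self.lines[index] = tag
        return 22 if self.vb_on else 21
```

Summing access() over an address trace, once with vb_on=True and once with vb_on=False, exposes the tradeoff the later slides measure: the buffer pays off only when enough conflict misses hit in it to offset its extra check cycle.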

Cache Architecture with a Configurable Victim Buffer
Is a victim buffer a useful configurable cache parameter?
It helps some applications; for others it is not useful: when the victim buffer misses, its extra check cycle is wasted
Thus, we want the ability to shut off the victim buffer for a given application
Hardware overhead is small: a one-bit on/off register and a switch
[Diagram: L1 cache (SRAM tag and data arrays) with a four-line fully associative victim buffer (CAM tags, 27-bit tag, 16-byte cache lines), a VB on/off register gating Vdd, and control signals to the next-level memory]

Hit Rate of a Victim Buffer
Hit rate of the victim buffer when added to an 8 Kbyte, 4 Kbyte, or 2 Kbyte direct-mapped cache
Benchmarks from Powerstone, MediaBench, and SPEC
[Charts: victim buffer hit rates for the instruction cache and the data cache]

Computing Total Memory-Related Energy
Consider CPU stall energy and off-chip memory energy; exclude CPU active energy. The result thus represents all memory-related energy.
energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
Modeled as: energy_miss = k_miss_energy * energy_hit and energy_static_per_cycle = k_static * energy_total_per_cycle (we varied the k's to account for different system implementations)
Measured quantities: cache_hits, cache_misses, and cycles come from SimpleScalar; the remaining quantities come from our layout or data sheets
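The model above translates directly into a few lines of Python. This is a sketch with placeholder constants: as the slide notes, the k factors were varied to model different system implementations, and energy_hit here is just a normalized per-access unit, not a measured value.

```python
# Sketch of the slide's energy model. The default constants are
# placeholders, not measured values; the study varied the k factors
# to account for different system implementations.
def energy_mem(cache_hits, cache_misses, cycles,
               energy_hit=1.0, k_miss_energy=50.0, k_static=0.3,
               energy_total_per_cycle=1.0):
    # energy_miss lumps off-chip access, CPU stall, and block fill
    energy_miss = k_miss_energy * energy_hit
    energy_dynamic = (cache_hits * energy_hit
                      + cache_misses * energy_miss)
    energy_static = cycles * k_static * energy_total_per_cycle
    return energy_dynamic + energy_static
```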

Performance and Energy Benefits of a Victim Buffer with a Direct-Mapped Cache
An 8-line victim buffer with an 8 Kbyte direct-mapped cache (0% = direct-mapped cache without a victim buffer)
Some applications see substantial benefit; for others the victim buffer should be shut off
A configurable victim buffer is thus clearly useful: it avoids the performance penalty on applications that do not benefit

Is a Configurable Victim Buffer Useful Even with a Configurable Cache?
We showed earlier that a configurable cache can reduce memory access power by half on average (Zhang/Vahid/Najjar ISCA 03, ISVLSI 03)
The software-configurable cache offers associativity of 1, 2, or 4 ways and sizes of 2, 4, or 8 Kbytes
Does that configurability subsume the usefulness of a configurable victim buffer? One way to check is to search the whole configuration space per application, as sketched below
[Diagram: configurable cache organized as 2 Kbyte ways with configurable line size]
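A hedged sketch of that search in Python: exhaustively simulate every combination of size, associativity, and victim-buffer setting and keep the lowest-energy one. Here simulate() is a hypothetical stand-in for a SimpleScalar run, and the real configurable cache may not support every size/associativity pairing, which the sketch ignores.

```python
# Hypothetical exhaustive search over the configurable parameters.
# simulate(size_kb, ways, vb_on) stands in for a SimpleScalar run and
# must return (cache_hits, cache_misses, cycles) for one benchmark.
from itertools import product

def energy(hits, misses, cycles,
           energy_hit=1.0, k_miss_energy=50.0, k_static=0.3):
    # same model as the previous slide's sketch, normalized units
    return (hits * energy_hit
            + misses * k_miss_energy * energy_hit
            + cycles * k_static)

def best_config(simulate):
    best = None
    for size_kb, ways, vb_on in product((2, 4, 8), (1, 2, 4), (False, True)):
        hits, misses, cycles = simulate(size_kb, ways, vb_on)
        e = energy(hits, misses, cycles)
        if best is None or e < best[0]:
            best = (e, size_kb, ways, vb_on)
    return best  # (energy, size in Kbytes, associativity, VB on?)
```

If the victim buffer were subsumed by the other parameters, the winning configuration would never have the buffer on; the next slide shows that it often does.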

Best Configurable-Cache-with-Victim-Buffer Configurations
Optimal cache configuration when cache associativity, cache size, and the victim buffer are all configurable
Notation: I and D stand for the instruction and data cache, respectively; V means the victim buffer is on; nK means the cache size is n Kbytes; the trailing characters give the associativity. For benchmark vpr, I2D1 means a two-way instruction cache and a direct-mapped data cache
Note that the victim buffer should sometimes be on and sometimes off

Performance and Energy Benefits of a Victim Buffer Added to a Configurable Cache
An 8-line victim buffer with a configurable cache whose associativity, size, and line size are all configurable (0% = optimal configuration without a victim buffer)
The victim buffer is still surprisingly effective

Conclusion
A configurable victim buffer is useful with a direct-mapped cache: as much as 60% energy and 4% performance improvement for some applications, and it can be shut off to avoid a performance penalty on others
A configurable victim buffer is also useful with a configurable cache: as much as 43% energy and 8% performance improvement for some applications, and again it can be shut off where it does not help
A configurable victim buffer should therefore be included as a software-configurable parameter of both direct-mapped and configurable caches in embedded system architectures