ISLPED’99 International Symposium on Low Power Electronics and Design

Presentation transcript:

ISLPED'99 International Symposium on Low Power Electronics and Design
Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption
Koji Inoue, Tohru Ishihara, and Kazuaki Murakami
Department of Computer Science and Communication Engineering, Kyushu University
ppram@c.csce.kyushu-u.ac.jp

Conventional 4-Way Set-Associative Cache
Organization: four ways (Way 0 - Way 3), each with a tag subarray and a cache-line subarray.
Step 1. Address decode (decode circuit).
Step 2. Read out a tag and a line from each way: pre(dis)charge the bit lines, activate the word lines, activate the sense amps.
Step 3. Tag comparison.
Step 4. On a hit, provide the required data (activate the I/O pins); on a miss, cache replacement.
Total energy for an access:
Ecache = Edecode + Ememory + Eio
(Edecode for the decode, Ememory for the SRAM access, Eio for driving the I/O pins)
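A minimal Python sketch of this parallel lookup and its energy accounting; the function name and unit energies are illustrative assumptions (the 1 : 12.8 tag-to-line ratio mirrors the Etag = 0.078 Edata figure used later in the talk), not the paper's simulator:

```python
WAYS = 4

def conventional_access(cache_set, addr_tag, e_tag=1.0, e_data=12.8):
    """One access: read tag + line from every way in parallel.

    cache_set is a list of (tag, line) pairs, one per way.
    Returns (hit_way_or_None, Ememory); Edecode and Eio are omitted.
    """
    energy = WAYS * e_tag + WAYS * e_data      # all 8 subarrays fire
    for way, (tag, _line) in enumerate(cache_set):
        if tag == addr_tag:                    # Step 3: tag comparison
            return way, energy                 # Step 4: provide the data
    return None, energy                        # Step 4: miss -> replacement
```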

Phased 4-Way Set-Associative Cache for Low Energy Consumption
Improves energy consumption by sacrificing performance.
Cycle 1:
Step 1. Address decode.
Step 2. Read out only the tags.
Step 3. Tag comparison; on a miss, Step 4. Cache replacement.
Cycle 2 (hit only):
Step 4. Read out only the desired line.
Step 5. Provide the required data.
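The same sketch adapted to the phased organization (again with illustrative names and unit energies): the line subarray is touched only after the tag comparison succeeds, at the cost of an extra cycle.

```python
WAYS = 4

def phased_access(cache_set, addr_tag, e_tag=1.0, e_data=12.8):
    """One access: cycle 1 reads only the four tags; cycle 2 reads the
    single matching line, and only on a hit.
    Returns (hit_way_or_None, Ememory, cycles)."""
    energy, cycles = WAYS * e_tag, 1           # cycle 1: tags only
    for way, (tag, _line) in enumerate(cache_set):
        if tag == addr_tag:
            energy += e_data                   # cycle 2: one line subarray
            cycles += 1
            return way, energy, cycles
    return None, energy, cycles                # miss: replacement follows
```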

Way-Predicting Set-Associative Cache - Concept -
How can we achieve high performance and low energy consumption at the same time?
Fast access, by reading out the tag and the line simultaneously: Conventional: Good! Phased: Bad!
Low energy, by avoiding unnecessary line reads: Conventional: Bad! Phased: Good!
Idea: predict which way holds the data the processor needs, before the cache access starts.

Way-Predicting 4-Way Set-Associative Cache - Operation -
Way prediction: cache-line-based MRU algorithm.
Step 0. Way prediction.
Cycle 1:
Step 1. Address decode.
Step 2. Read out the predicted way's tag and line.
Step 3. Tag comparison; on a prediction hit, Step 4. End.
Cycle 2 (prediction miss only):
Step 4. Read out the remaining tags and lines.
Step 5. Tag comparison; on a hit, Step 6. End; on a cache miss, Step 6. Cache replacement.
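A sketch of this flow with a per-set MRU table (the slide's cache-line-based MRU algorithm); the table layout and names are my assumptions:

```python
WAYS = 4

def way_predicting_access(sets, mru, idx, addr_tag, e_tag=1.0, e_data=12.8):
    """One access: cycle 1 reads only the MRU-predicted way's tag and
    line; the remaining ways are read only on a prediction miss.
    mru[idx] holds the most recently used way of set idx.
    Returns (hit_way_or_None, Ememory, cycles)."""
    pred = mru[idx]                                 # Step 0: way prediction
    energy, cycles = e_tag + e_data, 1              # predicted way only
    if sets[idx][pred][0] == addr_tag:              # prediction hit
        return pred, energy, cycles                 # Step 4: end
    energy += (WAYS - 1) * (e_tag + e_data)         # cycle 2: other 3 ways
    cycles += 1
    for way, (tag, _line) in enumerate(sets[idx]):
        if way != pred and tag == addr_tag:
            mru[idx] = way                          # train the predictor
            return way, energy, cycles              # Step 6: end
    return None, energy, cycles                     # cache miss: replace
```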

Way-Predicting 4-Way Set-Associative Cache - Organization - (figure: cache organization implementing the MRU algorithm for way prediction)

Evaluation Environment
Cache models:
- Conventional 4-way set-associative cache (4SACache)
- Phased 4-way set-associative cache (P4SACache)
- Way-predicting 4-way set-associative cache (WP4SACache)
Cache size: 16 KB; cache-line size: 32 bytes; replacement algorithm: LRU.
Evaluation items:
- Performance (Tcache): average number of clock cycles for an access
- Energy (Ecache): average energy consumption for an access
Ecache ~ Ememory = Ntag x Etag + Ndata x Edata
where Ntag (Ndata) is the average number of tag (line) subarrays accessed for an access, and Etag (Edata) is the energy consumed for accessing a tag (line) subarray.

Static Analysis - Energy and Performance Expression -
(CHR: cache hit rate; PHR: prediction hit rate)

Model        Energy (Ecache)                                   Performance (Tcache)
4SACache     4 Etag + 4 Edata                                  1
P4SACache    4 Etag + Edata x CHR                              1 + 1 x CHR
WP4SACache   (Etag + Edata) + (3 Etag + 3 Edata) x (1 - PHR)   1 + 1 x (1 - PHR)
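The same expressions, collected into small Python functions for experimentation (a sketch; the variable names and the example numbers below are mine, not the paper's):

```python
# Average energy (in arbitrary units) and cycles per access, directly
# transcribing the expressions above. CHR = cache hit rate, PHR =
# prediction hit rate.
def e_4sa(e_tag, e_data):         return 4*e_tag + 4*e_data
def t_4sa():                      return 1.0
def e_p4sa(e_tag, e_data, chr_):  return 4*e_tag + e_data*chr_
def t_p4sa(chr_):                 return 1 + 1*chr_
def e_wp4sa(e_tag, e_data, phr):  return (e_tag + e_data) + (3*e_tag + 3*e_data)*(1 - phr)
def t_wp4sa(phr):                 return 1 + 1*(1 - phr)

# Example: with Etag = 0.078*Edata (next slide) and a high hit rate, the
# phased cache cuts energy to roughly 30% of the conventional cache but
# nearly doubles the cycles per access.
e_tag, e_data, chr_ = 0.078, 1.0, 0.99
print(e_p4sa(e_tag, e_data, chr_) / e_4sa(e_tag, e_data))  # ~0.30
print(t_p4sa(chr_))                                        # 1.99
```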

Static Analysis - Best and Worst Case -
(Figure: energy consumption, with Etag = 0.078 Edata, and performance of 4SACache (conventional), P4SACache (phased), and WP4SACache (ours).)
Compared with the conventional 4SACache:
Best case (PHR = 100%): 75% energy improvement without any performance degradation.
Worst case (PHR = 0%): 100% performance overhead without any energy improvement.
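Both endpoints follow directly from the WP4SACache expressions on the previous slide (the Etag = 0.078 Edata ratio does not affect either endpoint); a worked check:

```latex
% Best case (PHR = 100%): only the predicted way is ever read.
\[
\frac{E_{\mathrm{WP4SA}}}{E_{\mathrm{4SA}}}
  = \frac{E_{\mathrm{tag}} + E_{\mathrm{data}}}{4E_{\mathrm{tag}} + 4E_{\mathrm{data}}}
  = \frac{1}{4} \quad \text{(75\% energy improvement)}, \qquad
T_{\mathrm{WP4SA}} = 1 + (1 - 1) = 1 \text{ cycle (no degradation).}
\]
% Worst case (PHR = 0%): every access reads all four ways over two cycles.
\[
E_{\mathrm{WP4SA}} = (E_{\mathrm{tag}} + E_{\mathrm{data}}) + 3(E_{\mathrm{tag}} + E_{\mathrm{data}})
  = 4E_{\mathrm{tag}} + 4E_{\mathrm{data}} = E_{\mathrm{4SA}}, \qquad
T_{\mathrm{WP4SA}} = 1 + (1 - 0) = 2 \text{ cycles (100\% overhead).}
\]
```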

Experimental Analysis - Prediction Hit Rate -

Experimental Analysis - Result of Instruction Cache - (figure: Tcache and Ecache of P4SACache and WP4SACache (our approach), normalized to 4SACache = 1.0)

Experimental Analysis - Result of Data Cache - (figure: Tcache and Ecache of P4SACache and WP4SACache (our approach), normalized to 4SACache = 1.0)

Experimental Analysis - Energy and Performance - Average of all benchmarks (figure: I-cache and D-cache panels of Ecache and Tcache for the phased (P4SACache) and way-predicting (WP4SACache) caches, normalized to the conventional 4SACache = 100%; the phased cache's Tcache is 199.4% and 195.8%, the way-predicting cache's Tcache is 113.0% and 104.1%, and the normalized Ecache values are 30.3%, 29.4%, 28.1%, and 35.2%)

Cache Power Consumption
(Figure: cache size trend.)
Effect of on-chip caches on total chip power consumption:
- DEC 21164 CPU*: 25%
- StrongARM SA-110 CPU*: 43%
- Bipolar ECL CPU**: 50%
* Kamble et al., "Analytical Energy Dissipation Models for Low Power Caches," ISLPED'97
** Jouppi et al., "A 300-MHz 115-W 32-b Bipolar ECL Microprocessor," IEEE Journal of Solid-State Circuits, 1993

Energy Consumption Model
Components of the power dissipation: bit lines, word lines, sense amps, output drivers, address input, comparators, latches.
Ememory dominates: 95.6% of Ecache for a 32 KB direct-mapped I-cache and 97.7% for a 32 KB 4-way D-cache.
(Ghose et al., "Energy Efficient Cache Organizations for Superscalar Processors," Power-Driven Microarchitecture Workshop, in conjunction with ISCA'98.)
Average energy consumption for an access:
Ecache ~ Ememory = Ntag x Etag + Ndata x Edata
where Ntag (Ndata) is the average number of tag (line) subarrays accessed for an access, and Etag (Edata) is the energy consumed for accessing a tag (line) subarray.

Experimental Analysis - Environment -
Benchmarks:
SPECint95: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
SPECfp95: 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d