
2013/01/14 Yun-Chung Yang
Energy-Efficient Trace Reuse Cache for Embedded Processors
Yi-Ying Tsai and Chung-Ho Chen
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 19, No. 9, September 2011

Abstract
Related Work
Introduction
Proposed Method
Experiment Result
Conclusion
My Comment

For an embedded processor, the efficiency of instruction delivery has attracted much attention, since instruction cache accesses consume a large portion of the whole processor's power dissipation. In this paper, we propose a memory structure called the Trace Reuse (TR) cache to serve as an alternative source for instruction delivery. Through an effective scheme that reuses the retired instructions from the pipeline back-end of a processor, the TR cache improves both performance and power efficiency. Experimental results show that a 2048-entry TR cache provides 75% energy savings for a 16-kB instruction cache, while boosting IPC by up to 21%. The scalability of the TR cache is also demonstrated with the estimated area usage and energy-delay product. The results of our evaluation indicate that the TR cache outperforms the traditional filter cache under all configurations of reduced cache sizes. The TR cache exhibits strong tolerance to the IPC degradation induced by smaller instruction caches, which makes it an ideal design option when trading cache size for better energy and area efficiency.

[Related-work diagram] Prior techniques span an energy/performance trade-off: branch prediction and instruction cache restructuring [1]-[6] and trace caches [7]-[9], [10]-[13] aim at performance, while filter caches [15], [16] aim at energy. This paper targets both energy and performance.

Goal: improvement in both performance and power efficiency.
- Performance: improve instruction delivery to boost processor performance.
- Power efficiency: let the upper level of the memory hierarchy use less power per access, e.g., a filter cache, which saves energy but can degrade performance because of the extra cache accesses.
Prior work focuses on the front-end of instruction delivery; such schemes need the right program trace to reduce execution latency and energy consumption.



Indicates how often the opportunity occurs to fetch the same instruction from the HTB:

HTB hit rate = H_k / F_k

where H_k is the hit count in the HTB and F_k is the total number of instructions fetched by program k.
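As a minimal sketch of how this metric would be computed from two counters (the counter names are illustrative, not from the paper):

#include <stdio.h>

/* HTB hit rate = H_k / F_k. Counter names are illustrative. */
double htb_hit_rate(unsigned long long h_k, unsigned long long f_k)
{
    return f_k ? (double)h_k / (double)f_k : 0.0;
}

int main(void)
{
    /* Example: 1.8M of 2M fetched instructions hit in the HTB -> 90% */
    printf("HTB hit rate: %.1f%%\n",
           100.0 * htb_hit_rate(1800000ULL, 2000000ULL));
    return 0;
}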


Trace Reuse cache architecture:
- HTB (History Trace Buffer): a FIFO buffer that stores instructions retired from the pipeline back-end.
- TET (Trace Entry Table): stores the PC values of control-transfer instructions (e.g., branches) together with the corresponding HTB indices.
Updating the HTB and TET:
- When an instruction retires from the pipeline, it is buffered in the HTB with its PC value.
- If the retiring instruction is a branch, the TET is updated as well.
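A minimal C sketch of the two structures and the retire-time update path, assuming a circular-buffer HTB; all field names, layouts, and sizes are illustrative, not the paper's exact design:

#include <stdbool.h>
#include <stdint.h>

#define HTB_SIZE 2048   /* entries; matches the evaluated configuration */
#define TET_SIZE 64     /* illustrative TET capacity */

/* One retired instruction kept in the FIFO History Trace Buffer. */
typedef struct {
    uint32_t pc;        /* PC of the retired instruction */
    uint32_t inst;      /* instruction word */
    bool     invalid;   /* invalidate flag (added later) */
    bool     taken;     /* taken/not-taken direction bit (added later) */
} htb_entry_t;

/* One trace head in the Trace Entry Table. */
typedef struct {
    uint32_t branch_pc; /* PC of the control-transfer instruction */
    uint16_t htb_index; /* where its trace starts in the HTB */
    bool     valid;
    bool     busy;      /* busy bit (added later) */
} tet_entry_t;

static htb_entry_t htb[HTB_SIZE];
static tet_entry_t tet[TET_SIZE];
static uint16_t    htb_tail;    /* FIFO write pointer */

void tet_insert(uint32_t branch_pc, uint16_t htb_index); /* next sketch */

/* Called once per instruction retired from the pipeline back-end. */
void on_retire(uint32_t pc, uint32_t inst, bool is_branch, bool taken)
{
    htb[htb_tail] = (htb_entry_t){ pc, inst, false, taken };
    if (is_branch)                 /* branches also update the TET */
        tet_insert(pc, htb_tail);  /* exact indexing may differ */
    htb_tail = (htb_tail + 1) % HTB_SIZE;
}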


The TET is checked every cycle, so its size and structure are important. The TET uses a replace-by-invalidation policy: when the TET is full, the newly generated trace entry is discarded instead of replacing any existing TET entry.
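Continuing the structures above, a sketch of the replace-by-invalidation insert: a new trace head only fills a free or invalidated slot and is silently dropped when the table is full.

/* Replace-by-invalidation: never evict a live TET entry. */
void tet_insert(uint32_t branch_pc, uint16_t htb_index)
{
    for (int i = 0; i < TET_SIZE; i++) {
        if (!tet[i].valid) {    /* free or previously invalidated slot */
            tet[i] = (tet_entry_t){ branch_pc, htb_index, true, false };
            return;
        }
    }
    /* TET full: discard the newly generated trace entry. */
}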

TET organizations compared: fully associative, 4-way set-associative, and direct-mapped (a direct-mapped lookup is sketched below).

A busy bit is added to each TET entry (a), and an invalidate flag and a taken/not-taken direction bit are added to each HTB entry (b).
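Continuing the same sketch, a direct-mapped TET lookup using these flags might look as follows; the index/tag split and the busy-bit semantics (entry temporarily unusable while its trace is being rewritten) are assumptions:

/* Direct-mapped TET lookup: the branch PC selects exactly one slot. */
#define TET_INDEX_BITS 6    /* 2^6 = 64 = TET_SIZE */

tet_entry_t *tet_lookup(uint32_t pc)
{
    uint32_t idx = (pc >> 2) & ((1u << TET_INDEX_BITS) - 1);
    tet_entry_t *e = &tet[idx];
    if (e->valid && !e->busy && e->branch_pc == pc)
        return e;    /* hit: start fetching the trace from the HTB */
    return NULL;     /* miss: fall back to the instruction cache */
}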


Evaluation: impact on instruction cache accesses and energy efficiency, using the MiBench suite as the input code.

Total number of instruction cache accesses under different TR cache sizes.

Energy is calculated with the CACTI tool. T_program-execution is the elapsed program execution time. The energy-delay product (EDP) is calculated by multiplying the normalized E_total by T_program-execution:

EDP = normalized E_total × T_program-execution
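A trivial sketch of the EDP computation (the numbers below are placeholders, not results from the paper):

#include <stdio.h>

/* Energy-delay product from a normalized total energy (CACTI-derived)
 * and the elapsed program execution time. */
double edp(double e_total_normalized, double t_program_execution)
{
    return e_total_normalized * t_program_execution;
}

int main(void)
{
    /* Placeholder numbers: 0.25x baseline energy at 1.05x baseline
     * runtime gives EDP = 0.2625x baseline. */
    printf("EDP = %.4f\n", edp(0.25, 1.05));
    return 0;
}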

- For an embedded processor with a not-taken branch prediction scheme, the TR cache can boost the prediction rate up to 92%, along with a 21% performance gain.
- The TR cache virtually expands the capacity of the conventional instruction cache.
- This is done without the support of trace-prediction and trace-construction hardware.
- It can deliver instructions at a lower energy cost than the conventional instruction cache.

- This is my first journal-paper presentation.
- The proposed idea is simple, yet it brings a great improvement to the whole system.
- It prompts us to think about what data we should put in our tag architecture.