Effect of Load and Store Reuse on Energy Savings for Multimedia Applications 黃國權 洪吉勇 李永恆 曾學文 Computer Architecture Term Project

Motivation
- Most modern microprocessors employ one or two levels of cache (e.g., L1 and L2) to improve performance.
- These caches are typically implemented with static RAM cells; they often occupy a large portion of the chip area and consume a significant amount of power.
- Goal: find ways to reduce power consumption by removing redundancy.

Load Reuse
- We focus on load instruction reuse and evaluate it on multimedia applications.
- Our goal is to reduce both the energy consumed and the execution time.
- The basic concept is to buffer the results of past load and store instructions and to reuse them.
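To make the idea concrete, here is a minimal sketch in C of the kind of structure such a mechanism assumes: a small, direct-mapped reuse buffer holding the addresses and values of recent loads and stores. The size, field names, and indexing are illustrative assumptions, not the project's actual simulator code.

    #include <stdint.h>
    #include <string.h>

    #define REUSE_BUF_ENTRIES 16   /* hypothetical buffer size */

    /* One entry records the effective address and value of a recent load or store. */
    typedef struct {
        uint64_t addr;    /* effective address of the memory access */
        uint64_t value;   /* data returned by the load or written by the store */
        int      valid;   /* entry currently holds useful data */
    } reuse_entry_t;

    static reuse_entry_t reuse_buf[REUSE_BUF_ENTRIES];

    /* Direct-mapped index: low-order bits of the 8-byte word address. */
    static inline unsigned reuse_index(uint64_t addr) {
        return (unsigned)(addr >> 3) & (REUSE_BUF_ENTRIES - 1);
    }

    void reuse_buf_init(void) {
        memset(reuse_buf, 0, sizeof(reuse_buf));
    }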

Experimental Environment
- Simulator: SimWattch performance/energy simulator
- Benchmark: MediaBench, which encompasses most common media applications

Reuse Steps
- Reuse checking
- Buffer refreshing

Reuse Checking
- Before a load accesses the LSQ, it must first check the reuse buffer.
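A hedged sketch of this checking step, continuing the hypothetical buffer above (the counters exist only to show how avoided LSQ accesses would be counted):

    /* Counters used to quantify how many LSQ accesses are avoided. */
    static unsigned long lsq_accesses, reused_loads;

    /* Returns 1 and fills *value if the load can be served from the reuse
     * buffer; returns 0 if the load must access the LSQ (and cache) as usual. */
    int reuse_check_load(uint64_t addr, uint64_t *value) {
        reuse_entry_t *e = &reuse_buf[reuse_index(addr)];
        if (e->valid && e->addr == addr) {
            *value = e->value;   /* reuse an earlier load/store result */
            reused_loads++;
            return 1;            /* LSQ access skipped */
        }
        lsq_accesses++;          /* fall back to the normal LSQ path */
        return 0;
    }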

Buffer Refreshing
- When a load writes back its result, it must also refresh the reuse buffer.
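A matching sketch of the refresh step (again an illustrative model, not the project's code): when a load writes back its result, or a store commits its data, the corresponding entry is overwritten so that later loads to the same address see the newest value.

    /* Called when a load writes back its result. */
    void reuse_buf_refresh(uint64_t addr, uint64_t value) {
        reuse_entry_t *e = &reuse_buf[reuse_index(addr)];
        e->addr  = addr;
        e->value = value;
        e->valid = 1;
    }

    /* A committing store must also update (or at least invalidate) its entry,
     * so that a later load never reuses a stale value. */
    void reuse_buf_on_store(uint64_t addr, uint64_t value) {
        reuse_buf_refresh(addr, value);
    }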

Definitions
- Original: LSQ accesses = load accesses + store accesses
- With the reuse function: LSQ accesses = loads that cannot reuse a prior load or store result + store accesses
- A former load instruction (the same load or a different one) can thus reduce the number of LSQ accesses.
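As a worked example with hypothetical numbers (not taken from the measurements): for 1,000 loads, 500 stores, and 300 loads served by the reuse buffer, the original configuration performs 1,000 + 500 = 1,500 LSQ accesses, while the reuse configuration performs 700 + 500 = 1,200, a 20% reduction.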

MediaBench MPEG-D

JPEG-E

JPEG-D

ADPCM-E

ADPCM-D

Dijkstra

G721-E

G721-D

EPIC

Rijndael-E

Rijndael-D

FFTss

FFTinv

SUSAN

Benchmark (1)

Benchmark (2)

Image

Multimedia

Network

Telecomm

Security

Automotive

Results
- For most benchmarks, power and the number of accesses do not vary with the buffer size; SUSAN is the exception.
- The buffer size also affects how often the same load is reused versus a different load:
  - Same load ≈ different load: JPEG-E, JPEG-D, Dijkstra, G721-E, G721-D
  - Same load > different load: MPEG-D, EPIC
  - Same load < different load: SUSAN, FFTss, Rijndael-E, Rijndael-D

Conclusion
- The benchmarks exhibit significant levels of instruction redundancy.
- The load and store reuse mechanism removes from 1% to 39% of this redundancy, achieving energy savings.
- The IPC improvement needs further investigation.

Thank you!! Q & A