ADAPTIVE CACHE-LINE SIZE MANAGEMENT ON 3D INTEGRATED MICROPROCESSORS Takatsugu Ono, Koji Inoue and Kazuaki Murakami Kyushu University, Japan ISOCC 2009.


Outline
 Introduction
 Software-Controllable Variable Line-Size (SC-VLS) Cache
 Evaluation
 Summary

3D Integration
 Stacking main memory on top of the processor
 Connecting them with wide on-chip buses
 Memory bandwidth can be greatly improved

Motivation
 3D stacking makes it possible to reduce the cache miss penalty
 We can employ a larger cache line size to obtain a prefetching effect
 But if a program does not have high spatial locality of memory references:
 Performance might worsen
 A large amount of energy is required!

Software-Controllable Variable Line-Size Cache (1/3)
 We propose the SC-VLS cache
 It optimizes the amount of data transferred between the cache and main memory
 When a program does not require high memory bandwidth, the SC-VLS cache reduces the cache line size

Software-Controllable Variable Line-Size Cache (2/3)
 Features
 The SC-VLS cache does not require any hardware monitor to decide the line size
 Advantages
 The SC-VLS cache reduces energy consumption with trivial hardware overhead

Software-Controllable Variable Line-Size Cache (3/3)
 Adequate line size analysis
 Before an application program is executed, we analyze the adequate line size for each function
 Code generation
 Line-size-change instructions are inserted at the start of each function in the original program code
 The instruction sets a status register to indicate the adequate line size
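The code-generation step above can be sketched as follows. This is an illustrative sketch only, not the authors' tool: the `set_line_size` pseudo-opcode and the function/instruction representation are assumptions made for the example.

```python
# Sketch of the SC-VLS code-generation step: prepend a "set line size"
# pseudo-instruction to each function so that the status register reflects
# that function's adequate line size. The opcode name is hypothetical.

def insert_line_size_instructions(functions, adequate_sizes):
    """functions: {name: [instructions]}; adequate_sizes: {name: size in bytes}."""
    instrumented = {}
    for name, body in functions.items():
        set_insn = f"set_line_size {adequate_sizes[name]}"  # hypothetical opcode
        instrumented[name] = [set_insn] + body               # insert at function start
    return instrumented

prog = {"fft": ["load r1, a", "mul r2, r1, r1"]}
out = insert_line_size_instructions(prog, {"fft": 128})
print(out["fft"][0])  # -> set_line_size 128
```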

Adequate Line Size Analysis: Example

Evaluation
 Simulators: SimpleScalar and CACTI
 Benchmark programs: 10 programs from MiBench
 Input data sets
 Analysis phase: small
 Execution phase: large
 The SC-VLS cache can dynamically choose among four line sizes: 32 B, 64 B, 128 B, and 256 B
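With four candidate line sizes, a two-bit status-register field is sufficient to select the current one. The encoding below is a hedged sketch of that observation; the slide does not specify the actual register layout, so the field width and bit assignment are assumptions.

```python
# Hypothetical 2-bit encoding for the four SC-VLS line-size candidates.
# The actual status-register format is not given in the slides.

LINE_SIZES = (32, 64, 128, 256)  # candidate line sizes in bytes

def encode(size):
    """Map a line size to its 2-bit register value."""
    return LINE_SIZES.index(size)

def decode(bits):
    """Map a 2-bit register value back to a line size."""
    return LINE_SIZES[bits & 0b11]

print(encode(256))         # -> 3
print(decode(encode(128))) # -> 128
```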

Energy

Performance

Summary
 3D integration
 can improve memory bandwidth
 makes it possible to reduce the cache miss penalty
 SC-VLS cache
 can dynamically change the line size
 reduces energy consumption by up to 75%

THANK YOU
ACKNOWLEDGEMENT: This research was supported in part by the New Energy and Industrial Technology Development Organization.

Architecture

Adequate Line Size Analysis
 We run a cache simulation with each candidate line size independently to determine the adequate line size:
1. The average cache miss rate of each function is calculated
2. The average cache miss rates of all line size candidates are compared
3. The line size with the smallest average cache miss rate is chosen as the adequate line size
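The selection step above amounts to a per-function argmin over the candidate line sizes. A minimal sketch, assuming the per-size average miss rates have already been gathered by the cache simulations (the dictionary layout and the example rates are illustrative):

```python
# Pick the adequate line size for one function: the candidate with the
# smallest average cache miss rate, per the analysis steps above.

LINE_SIZES = (32, 64, 128, 256)  # candidate line sizes in bytes

def adequate_line_size(miss_rates):
    """miss_rates: {line_size: average miss rate for this function}."""
    return min(LINE_SIZES, key=lambda size: miss_rates[size])

# Example: a function whose locality favors a mid-sized line.
rates = {32: 0.081, 64: 0.052, 128: 0.047, 256: 0.063}
print(adequate_line_size(rates))  # -> 128
```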

Energy Model
Total energy = (total energy of L1) + (total energy of stacked DRAM), where:
 E(L1) = (# L1 memory accesses) × (average energy for a cache access)
 E(DRAM) = (# main memory accesses) × (average energy for an access) × (# activated DRAM sub-arrays)
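The energy-model slide can be read as a two-term sum: L1 energy plus stacked-DRAM energy. The sketch below encodes that reading; the exact factorization, parameter names, and the example numbers are all assumptions, since the slide only lists the quantities involved.

```python
# Hedged reading of the energy-model slide:
#   E_total = E_L1 + E_DRAM
#   E_L1    = (# L1 accesses) * (avg energy per cache access)
#   E_DRAM  = (# main memory accesses) * (avg energy per access)
#             * (# activated DRAM sub-arrays)
# All values below are illustrative, not measured results.

def total_energy(n_l1, e_l1_access, n_mem, e_dram_access, n_subarrays):
    e_l1 = n_l1 * e_l1_access               # L1 cache component
    e_dram = n_mem * e_dram_access * n_subarrays  # stacked-DRAM component
    return e_l1 + e_dram

print(total_energy(1_000_000, 0.1, 20_000, 2.0, 4))  # -> 260000.0
```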

Average SC-VLS Cache Line Size
Benchmark      Average SC-VLS cache line size (B)
bitcount       81.94
mad            -
tiff2bw        -
dijkstra       -
rijndael_enc   64.82
rijndael_dec   33.01
sha            -
adpcm_enc      -
adpcm_dec      -
lame           -
(values marked "-" were lost in the transcript)