Mathew Paul and Peter Petrov
Proceedings of the IEEE Symposium on Application Specific Processors (SASP '09), July 2009
Abstract

The abundance of wireless connectivity and increasing workload complexity have further underlined the importance of energy efficiency for modern embedded applications. The cache memory is a major contributor to system power consumption, and as such is a primary target for energy-reduction techniques. Recent advances in configurable cache architectures have enabled an entirely new set of approaches for application-driven, energy- and cost-efficient cache resource utilization. We propose a run-time, cross-layer specialization methodology that leverages configurable cache architectures to achieve an energy- and performance-conscious adaptive mapping of instruction cache resources to tasks in dynamic multitasking workloads.
Abstract (cont.)

Sizable leakage and dynamic power reductions are achieved with only a negligible and system-controlled performance impact. The methodology assumes no prior information about the dynamics or structure of the workload. Because the proposed dynamic cache partitioning alleviates the detrimental effects of cache interference, performance stays very close to the baseline case, while 50%-70% reductions in dynamic and leakage power are achieved for the on-chip instruction cache.
What's the Problem

- The cache memory is a major contributor to total dynamic and leakage power
  - Caches occupy up to 50% of die area and 80% of the transistor budget
- How can the configurable cache be customized dynamically to give each task only its required cache volume?
- Goal: reduce power consumption with limited performance degradation
[Figure: normal cache vs. energy-efficient cache for Task0 — performance does not improve noticeably beyond half the cache, so the remaining half can stay idle]
The Proposed Methodology for Dynamic Cache Customization

- Partition the instruction cache and adapt its utilization at run time
  - Cache partitioning: eliminates cache interference
  - Configurable cache: only the required subsection of the cache is active
[Figure: dynamic multitasking workload with Task0, Task1, Task2, and idle periods; only one task is active at a time, e.g. Task2 from t2 to t3]
Functional Overview

- Based on the cache partition formation (initial partition) policy
  - Driven by the cache requirements of each task (detailed later)
  - Example: Task0: 2K 2-way, Task1: 8K 4-way, Task2: 4K 2-way
- Each task is mapped to a subsection of the 16K 4-way baseline cache equal to its required cache size
[Figure: during Task2's execution only its section is active; the rest of the cache is in low-power drowsy mode]
Functional Overview (cont.)

- However, overlapping cache partitions are sometimes inevitable
  - Some tasks may require larger cache partitions than the exclusive space left
  - Overlap reintroduces cache interference, which can push performance beyond the required miss-rate bound
- Such cases are handled through dynamic partition update
  - Overlapped partitions are updated (enlarged) dynamically when performance degrades
[Figure: ideal case — Task0/Task1/Task2 map to exclusive partitions; initial partition — partitions overlap; dynamic partition update — the overlapped partition is enlarged]
Dynamic Cache Customization

Mechanisms required for efficient cache utilization with minimal interference:
- Initial partition formation
  - Identify each task's cache requirement at compile time
  - Uses cache miss statistics local to each task
- Initial partition assignment
  - Assign the initial partition to a task at run time
  - Set the Cache Way Select Register (CWSR) and the mask register to vary the number of active sets
- Dynamic partition update policy
  - Fine-tune the partition size when performance degrades
  - Ensures the miss rate remains within the threshold bounds
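As a rough illustration of how the two registers shape a partition — the register names follow the slide, but the bit layout and the 256-set geometry are assumptions made purely for illustration:

```python
# Illustrative sketch of partition control in a configurable cache.
# The CWSR enables a subset of the ways; the mask register halves the
# number of active sets per mask bit. Bit layout and set count are
# assumed here, not taken from the paper.

def config_registers(active_ways, mask_bits, total_sets=256):
    cwsr = (1 << active_ways) - 1          # enable the lowest `active_ways` ways
    active_sets = total_sets >> mask_bits  # each mask bit halves the sets
    return cwsr, active_sets

# e.g. a 2-way partition over half the sets of a 256-set cache
print(config_registers(2, 1))  # -> (3, 128)
```

Together, the way-enable bits and the set mask carve out a rectangular subsection of the cache; everything outside it can be held in drowsy mode.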
Part 1: Initial Partition Formation

- Identify the cache requirement and determine the initial partition size for each task
- Aim: reduce energy while keeping performance close to the baseline case, i.e., BASE(Ti)
  - BASE(Ti): actual miss rate of task Ti when sharing the baseline cache (with interference) — task-specific and not available at compile time
  - IND_BASE(Ti): miss rate of task Ti when it uses the baseline cache in isolation — used instead
- A Threshold is then defined to account for the cache interference
  - Hence, the miss-rate bound for a task is IND_BASE(Ti) + Threshold
- The starting cache configuration Pi,j is picked such that
  - MISS(Pi,j) ≤ IND_BASE(Ti) + Threshold
[Figure: tasks task0-task4 sharing the baseline cache]
Part 1: Initial Partition Formation — Example

- MCS (Miss-rate / Cache Space) table: cache miss statistics for each cache configuration
  - Obtained through profiling
- Find the minimal cache configuration satisfying MISS(Pi,j) ≤ IND_BASE(Ti) + Threshold, with Threshold = 0.1%
  - Task0 (G721): IND_BASE = 0%, starting configuration 8K 2-way
  - Task1 (LAME): IND_BASE = 0.15%, starting configuration 4K 4-way
  - Task2 (GSM): IND_BASE = 0.17%, starting configuration 8K 2-way
[Table: MCS table of miss rate vs. cache size (512B-8K) and number of ways for each task]
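The selection rule can be sketched as follows; the MCS miss-rate numbers below are made up for illustration, not the paper's profiled values:

```python
# Sketch of initial partition formation. The miss rates in this toy
# MCS table are invented; only the selection rule follows the slides.

# MCS table: (size_bytes, ways) -> profiled miss rate (%)
MCS = {
    "G721": {(2048, 2): 0.90, (4096, 2): 0.45, (8192, 2): 0.05},
    "LAME": {(2048, 2): 1.20, (4096, 4): 0.14, (8192, 4): 0.10},
}
IND_BASE = {"G721": 0.00, "LAME": 0.15}  # isolated baseline miss rates (%)
THRESHOLD = 0.1                          # allowed miss-rate impact (%)

def initial_partition(task):
    """Pick the smallest configuration whose profiled miss rate
    stays within IND_BASE(Ti) + Threshold."""
    bound = IND_BASE[task] + THRESHOLD
    feasible = [cfg for cfg, miss in MCS[task].items() if miss <= bound]
    # smallest total size first, then fewest ways
    return min(feasible, key=lambda cfg: (cfg[0], cfg[1]))

print(initial_partition("G721"))  # -> (8192, 2)
print(initial_partition("LAME"))  # -> (4096, 4)
```

Because the bound uses IND_BASE rather than the unavailable BASE, the Threshold is the only knob absorbing whatever interference the run-time partitioning fails to eliminate.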
Part 2: Initial Partition Assignment

- Assign the initial partition to a task at run time
  - Set the control (CWSR) and mask registers of the configurable cache
- Partitions are assigned exclusively of each other whenever possible
  - Not always possible: the total cache requirement of G721, LAME, and GSM is 20K, but only 16K is available
- Example timeline:
  - At t0: allocate 8K 2-way to G721
  - At t1: allocate 4K 4-way to LAME (cannot be exclusive of G721, so overlap is allowed)
  - At t2: allocate 8K 2-way to GSM (with a small portion of it also being used by LAME)
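A much-simplified sketch of this policy, ignoring way geometry and treating the 16K cache as sixteen 1KB sections; the names and the choice of which resident task gets overlapped are assumptions:

```python
# Simplified sketch of initial partition assignment: prefer free
# sections, and overlap the most recently allocated sections only
# when the free space is insufficient. The 1KB-section model is an
# illustration, not the paper's way/set mechanism.

CACHE_KB = 16

def assign(requested_kb, allocations):
    """allocations: task -> set of 1KB section indices already in use."""
    used = set().union(*allocations.values()) if allocations else set()
    free = [s for s in range(CACHE_KB) if s not in used]
    if len(free) >= requested_kb:
        return set(free[:requested_kb])              # exclusive partition
    shortfall = requested_kb - len(free)             # must overlap
    return set(free) | set(sorted(used, reverse=True)[:shortfall])

allocs = {}
allocs["G721"] = assign(8, allocs)   # t0: 8K
allocs["LAME"] = assign(4, allocs)   # t1: 4K
allocs["GSM"]  = assign(8, allocs)   # t2: only 4K free, 4K overlaps LAME
print(len(allocs["GSM"] & allocs["LAME"]))  # -> 4
```

In the real hardware the overlap pattern is dictated by which ways and set ranges the CWSR and mask register can express, not by a free-list like this one.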
Part 3: Dynamic Partition Update

- Overlapping partitions cannot always be prevented
  - Interference may push miss rates beyond the bound
- Trigger: a hardware miss counter inside the CPU detects when the observed miss rate exceeds IND_BASE(Ti) + Threshold
- Action: partition rescaling — enlarge the partition until the miss rate falls back under the miss-rate bound
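The trigger condition amounts to a periodic comparison of the sampled miss counter against the per-task bound; the function name and sampling interface below are assumptions:

```python
# Sketch of the dynamic-update trigger: compare the observed miss rate
# against the per-task bound IND_BASE(Ti) + Threshold (all in percent).

def needs_rescaling(misses, accesses, ind_base, threshold=0.1):
    miss_rate = 100.0 * misses / accesses
    return miss_rate > ind_base + threshold

# LAME: IND_BASE = 0.15%, so the bound is 0.25%
print(needs_rescaling(300, 100_000, ind_base=0.15))  # 0.30% -> True
print(needs_rescaling(200, 100_000, ind_base=0.15))  # 0.20% -> False
```

Only when this check fires does the system pay the cost of reconfiguring the partition, so tasks that stay within their bound keep their small, low-power configuration.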
Part 3: Dynamic Partition Update — Example

- Partition rescaling trades off power savings for meeting the performance requirement
- For LAME, the miss-rate bound IND_BASE(T1) + Threshold = 0.25% is exceeded in the overlapped region
  - The next configuration after 4K 4-way with a miss rate below 0.25% is 6K 3-way
- GSM is then rescaled to 12K 3-way due to increased overlap with the rescaled LAME partition
Part 3: Dynamic Partition Update — Example: Reshuffling

- Partition reshuffling: applied when a task leaves on completing execution
  - Its cache resources are freed up and become available to the currently executing tasks
  - A previously rescaled partition is considered for reshuffling: the task's starting configuration can now be allocated completely, without overlap
- Example: at t4, both G721 and GSM complete and only LAME is left executing
  - LAME is reshuffled back to its starting configuration (reverting to the smaller partition reduces power)
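Reshuffling on task completion can be sketched like this, reusing the simplified 1KB-section model from earlier (structure names are assumptions):

```python
# Sketch of partition reshuffling: when completed tasks have freed
# enough cache, revert a previously rescaled task to its smaller
# starting configuration (less active cache means lower power).

CACHE_KB = 16

def reshuffle(task, allocations, starting_kb):
    """Revert `task` to its starting size if it now fits without overlap."""
    others = set().union(*[v for t, v in allocations.items() if t != task])
    free = [s for s in range(CACHE_KB) if s not in others]
    if len(free) >= starting_kb[task]:
        allocations[task] = set(free[:starting_kb[task]])

# t4: G721 and GSM have completed; LAME had been rescaled to 6K
allocs = {"LAME": set(range(6))}
reshuffle("LAME", allocs, {"LAME": 4})
print(len(allocs["LAME"]))  # -> 4 (back to the 4K starting configuration)
```

Note the asymmetry with rescaling: rescaling grows a partition to protect performance, while reshuffling shrinks it back to reclaim the power savings once interference is gone.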
Experiment Setup

- Cache configurations found in high-end embedded processors (Intel XScale and ARM9): 16K 4-way and 32K 4-way
- Scheduling policy to model multitasking: round-robin with a context-switch frequency of 33K instructions
- Miss-rate impact threshold: 0.1%
- Two categories of benchmarks evaluated:
  - Static benchmarks: all tasks start and finish at the same time
  - Dynamic benchmarks: tasks enter and leave the workload at different times
[Figure: structure of the dynamic benchmarks]
Miss-Rate Impact: Miss-Rate Increase Compared to the Baseline Cache

- Partitioning: initial partition assignment only
- Rescaling: partitioning + rescaling
- Reshuffling: partitioning + rescaling + reshuffling
- For some configurations, rescaling and reshuffling are omitted
  - The miss-rate impact is already within the threshold after the initial assignment
- After rescaling, the miss-rate impact is reduced
[Chart: miss-rate impact per benchmark; lower is better]
Individual Task Miss-Rates for the 16K Cache (BM_3)

- GSM is subjected to rescaling
  - Its miss-rate bound of 0.27% is exceeded due to interference in the overlapped region
- Partition reshuffling maximizes the power reduction
- Power reduction is achieved while keeping the miss-rate impact below the threshold
- Some tasks even achieve lower miss rates than in the baseline case, i.e., improved performance
[Chart: per-task miss rates; lower is better]
[Figure: five tasks (task0-task4) thrashing in a shared cache]