Compilation Techniques for Energy Reduction in Horizontally Partitioned Cache Architectures
Aviral Shrivastava, Ilya Issenin, Nikil Dutt
Center for Embedded Computer Systems, University of California, Irvine, CA, USA
Copyright © 2005 UCI ACES Laboratory, CASES, Sep 25

Horizontally Partitioned Cache (HPC)
- Originally proposed by Gonzalez et al. in 1995
- More than one cache at the same level of the hierarchy
- The caches share the interface to memory and the processor
- Each page is mapped to exactly one cache
  - Mapping is done at page-level granularity
  - Specified as page attributes in the MMU
- The mini cache is relatively small
- Found in the Intel StrongARM and Intel XScale
[Figure: processor pipeline connected to a main cache and a mini cache, both backed by memory]
HPC is a popular architectural feature
Performance Advantage of HPC
- Observation: arrays often have low temporal locality
  - Each value is used for a while, and is then never used again
  - But such arrays evict all other data from the cache
- Separate low-temporal-locality data from high-temporal-locality data
  - Array a: low temporal locality; array b: high temporal locality
- Performance improvement: reduce the miss rate of array b
  - Two separate caches may be better than a unified cache of the total size
- The key lies in intelligent data partitioning

char a[1024]; char b[1024];
for (int i=0; i<1024; i++) c += a[i]+b[i%5];

[Figure: processor pipeline accessing b[5]..b[1000] and memory]
Existing techniques focus on performance
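The locality gap in the loop above can be made concrete by counting how many distinct 16-char cache lines each array touches over the whole loop. The line size is taken from the motivating example; the counting script itself is just an illustration, not part of the paper's framework.

```python
LINE_SIZE = 16  # cache line size in chars, as in the motivating example

# Cache lines touched by the loop:  c += a[i] + b[i % 5]
a_lines = {i // LINE_SIZE for i in range(1024)}        # a[i] streams through the array
b_lines = {(i % 5) // LINE_SIZE for i in range(1024)}  # b reuses only b[0]..b[4]

print(len(a_lines))  # 64 distinct lines: low temporal locality, evicts other data
print(len(b_lines))  # 1 line: high temporal locality
```

Array a sweeps 64 lines once each, while array b lives in a single line that is reused on every iteration, which is why keeping b resident matters for performance.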
Energy Advantage of HPCs
- Energy savings come from two effects:
  1. Reduction in miss rate
  2. AccessEnergy(mini cache) < AccessEnergy(main cache)
- Reduction in miss rate
  - Aligned with performance
  - Exploited by performance-improvement techniques
- Less energy per access in the mini cache
  - Inverse to performance: energy can decrease even if there are more misses
  - Opposite to performance-optimization techniques!
Existing techniques DO NOT exploit the second effect
Motivating Example

Architecture: processor pipeline with
- 8 KB main cache: access latency 1 cycle, access energy 3 nJ, line size 16 chars
- 1 KB mini cache: access latency 1 cycle, access energy 1 nJ, line size 16 chars
- Memory: access latency 20 cycles, access energy 30 nJ

Benchmark:
char a[1024]; char b[1024];
for (int i=0; i<1024; i++) c += a[i]+b[i%5];

Page mapping — Main cache: A, B; Mini cache: none
- Main cache accesses: 1024 (array a) + 1024 (array b) = 2048
- Main cache misses: 1024/16 = 64 (array a) + 1 (array b) = 65
- Runtime: 2048*1 + 65*20 = 2048 + 1300 = 3348 cycles
- Energy: 2048*3 + 65*30 = 6144 + 1950 = 8094 nJ
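The runtime and energy figures above follow from a simple linear model: each access costs one cache hit, and each miss adds a memory access on top. A sketch that reproduces the slide's arithmetic (the function and parameter names are ours; the default values are the main-cache numbers from the example):

```python
def runtime_cycles(accesses, misses, hit_latency=1, mem_latency=20):
    """Runtime = one cycle per cache access plus the memory penalty per miss."""
    return accesses * hit_latency + misses * mem_latency

def energy_nj(accesses, misses, access_energy=3, mem_energy=30):
    """Energy = cache access energy per access plus memory energy per miss."""
    return accesses * access_energy + misses * mem_energy

# Both arrays mapped to the 8 KB main cache:
# 2048 accesses; 64 misses for a (1024/16) + 1 miss for b = 65 misses
print(runtime_cycles(2048, 65))  # 3348 cycles
print(energy_nj(2048, 65))       # 8094 nJ
```

The same two functions, with the mini cache's 1 nJ access energy, score every alternative page mapping in the example.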
The same model evaluates the remaining three page mappings:
- Main cache: A; Mini cache: B
- Main cache: B; Mini cache: A
- Main cache: none; Mini cache: A, B
Worse Performance, but Better Energy
- The best-energy mapping in the table has worse performance than the best-performing mapping, yet lower energy
- Energy optimization can afford more misses
- Optimizing for performance and energy are different!
Related Work
- Horizontally partitioned caches: Intel StrongARM SA-1100, Intel XScale
- Data partitioning techniques for HPC:
  - No analysis (region-based partitioning): separate array and stack variables
    - Gonzalez et al. [ICS'95], Lee et al. [CASES'00], Unsal et al. [HPCA'02]
  - Dynamic analysis (in hardware):
    - Memory-address based: Johnson et al. [ISCA'97], Rivers et al. [ICS'98]
    - PC-based: Tyson et al. [MICRO'95]
  - Static analysis (compiler reuse analysis): Xu et al. [ISPASS'04]
- Existing techniques focus on performance optimization and achieve energy reduction only as a by-product
- Existing compiler data partitioning techniques are complex: O(m^3 n)
We propose data partitioning techniques aimed at energy reduction
Compiler Analysis Framework
[Figure: the application is compiled into an executable that runs on the embedded platform; inside the data partitioning framework, the Page Access Info. Extractor feeds the Data Partitioning Heuristic, which consults the Performance/Energy Estimator and produces the Page Mapping]
Compiler Analysis Framework
- Page Access Information Extractor
  - Determines which pages are accessed and how many times each page is accessed
  - Sorts pages in decreasing order of accesses
- Data Partitioning Heuristic
  - Finds the best mapping of pages to caches, for performance or for energy
- Performance/Energy Estimator
  - Estimates the performance/energy of a given partition
- Page Mapping
  - Sets the page attribute bits
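A minimal sketch of what the Page Access Information Extractor computes, assuming a 4 KB page size and a flat list of accessed addresses as input — both stand-ins for the framework's actual trace format:

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size

def extract_page_accesses(address_trace):
    """Count accesses per page and return (page, count) pairs,
    sorted in decreasing order of access count."""
    counts = Counter(addr // PAGE_SIZE for addr in address_trace)
    return counts.most_common()

trace = [0x0000, 0x0004, 0x1008, 0x0008, 0x2000, 0x1000]
print(extract_page_accesses(trace))  # [(0, 3), (1, 2), (2, 1)]
```

The sorted output is exactly the input format the partitioning heuristics below assume.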
Experimental Framework
- System: similar to the HP iPAQ h4300 (XScale PXA 255)
- Benchmarks: MiBench
- Simulator: SimpleScalar sim-cache
- Performance metric: cache accesses + memory accesses
- Energy metric: main cache energy + mini cache energy + memory bus energy + SDRAM energy
- Memory hierarchy: processor pipeline, 32 KB main cache (32:32:32:f), 1 KB mini cache (2:32:32:f), memory controller, SDRAM
Experiments
- Experiment 1: How good are energy-oriented data partitioning heuristics?
  - Develop several energy-oriented data partitioning heuristics and evaluate them
- Experiment 2
- Experiment 3
Data Partitioning
- Input: list of undecided pages L, sorted by number of accesses
- Output: partition of the pages into two lists, M and m, such that M ∪ m = L and M ∩ m = ∅
- Exhaustive Search (OPT): try all possible 2^m page mappings; complexity O(n·2^m)
(n: number of memory accesses; m: number of pages accessed)
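The exhaustive search can be sketched as follows. The real framework scores each candidate mapping with its performance/energy estimator; here a hypothetical `toy_estimate` callback stands in for it:

```python
from itertools import combinations

def opt_partition(pages, estimate):
    """Try all 2^m ways of splitting `pages` into (main, mini) and keep
    the split with the lowest estimated cost (energy or runtime)."""
    best, best_cost = None, float("inf")
    for r in range(len(pages) + 1):
        for main in combinations(pages, r):
            mini = [p for p in pages if p not in main]
            cost = estimate(set(main), set(mini))
            if cost < best_cost:
                best, best_cost = (set(main), set(mini)), cost
    return best, best_cost

def toy_estimate(main, mini):
    # Hypothetical cost model: a main-cache page costs 3 units, a mini-cache
    # page 1 unit, plus a large miss penalty once more than two pages
    # compete for the small mini cache.
    return 3 * len(main) + len(mini) + (100 if len(mini) > 2 else 0)

best, cost = opt_partition(["A", "B", "C"], toy_estimate)
print(best, cost)  # one hot page in main, the rest in mini; cost 5
```

With 3 pages this evaluates all 8 mappings; the exponential blow-up in m is what motivates the cheaper heuristics that follow.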
Significant Memory Energy Reduction
- Memory subsystem energy savings achieved by OPT
- Base configuration: all pages mapped to the main cache
- 55% memory subsystem energy savings
Complex Data Partitioning Heuristic
- Input/output as before: partition the access-sorted page list L into M and m
- Complex heuristic (OM2N): try m^2 page mappings instead of all 2^m; complexity O(m^2 n)
(n: number of memory accesses; m: number of pages accessed)
Significant Energy Reduction (contd.)
- Memory subsystem energy savings achieved by OM2N
- Base configuration: all pages mapped to the main cache
- OM2N results are within 2% of exhaustive search
Simple Data Partitioning Heuristic
- Simple heuristic (OMN): try only m page mappings; complexity O(mn)
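The slides give only OMN's complexity, so the split strategy below is a plausible reconstruction rather than the paper's exact algorithm: since the input list is already sorted by access count, try only the m+1 prefix splits of that list and let the estimator pick the cheapest one.

```python
def omn_partition(sorted_pages, estimate):
    """Evaluate only the m+1 prefix splits of the access-sorted page list
    (instead of all 2^m mappings): prefix -> main cache, rest -> mini cache."""
    best, best_cost = None, float("inf")
    for i in range(len(sorted_pages) + 1):
        main, mini = set(sorted_pages[:i]), set(sorted_pages[i:])
        cost = estimate(main, mini)
        if cost < best_cost:
            best, best_cost = (main, mini), cost
    return best, best_cost

def toy_estimate(main, mini):
    # Same hypothetical cost model as before: mini-cache pages are cheaper
    # per access until the mini cache is overcommitted (>2 pages).
    return 3 * len(main) + len(mini) + (100 if len(mini) > 2 else 0)

best, cost = omn_partition(["A", "B", "C", "D"], toy_estimate)
print(best, cost)  # two hottest pages in main, two in mini; cost 8
```

Restricting attention to prefix splits is also one way to see the OkN variant: stop after the first k splits (k << m) for O(kn) work.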
Even the Simple Heuristic Is Good!
- Memory subsystem energy savings achieved by OMN
- Base configuration: all pages mapped to the main cache
- OMN achieves 50% memory subsystem energy savings
Very Simple Data Partitioning Heuristic
- Very simple heuristic (OkN): try only k page mappings, where k << m; complexity O(kn)
Very Simple Heuristic
- Memory subsystem energy savings achieved by OkN
- Base configuration: all pages mapped to the main cache
- OkN achieves 35% memory subsystem energy savings
Goodness of Heuristics
- How close to optimal is each heuristic?
- Goodness of OMN: 97%
- Goodness of OkN: 64%
Simple heuristics are good enough for optimizing for energy!
Experiments
- Experiment 1: How good are energy-oriented data partitioning heuristics?
- Experiment 2: How much performance degradation do energy-oriented data partitioning heuristics suffer?
  - Compare the energy of the best-performing data partition with the performance of the best-energy data partition
- Experiment 3
Best-Performing Partition Consumes 58% More Energy
- Plots the increase in energy at the best-performing partition
- First 5 benchmarks: exhaustive results; last 7 benchmarks: OM2N heuristic
Best-Energy Partition Loses Only 2% Performance
- Plots the increase in runtime at the best-energy partition
- First 5 benchmarks: exhaustive results; last 7 benchmarks: OM2N heuristic
Optimizing for performance and energy are significantly different!
Experiments
- Experiment 1: How good are energy-oriented data partitioning heuristics?
- Experiment 2: How much performance degradation do energy-oriented heuristics suffer?
- Experiment 3: Sensitivity of energy-oriented data partitioning heuristics to cache parameters
Sensitivity of Heuristics
- Sensitivity of our heuristics to mini cache parameters
- The middle bar is the base configuration: 2 KB, 32-way set-associative FIFO cache
- Vary associativity in the first two configurations (64-way and 16-way)
- Vary size in the last two configurations (8 KB and 16 KB)
Proposed heuristics scale well with cache parameters
Summary
- Horizontally partitioned caches are a simple yet powerful architectural feature for improving performance and energy in embedded systems
- Existing techniques focus on performance improvement and are complex: O(m^3 n)
- We showed that:
  - The goal of energy optimization ≠ the goal of performance optimization
  - There is potential for energy reduction by aiming directly at energy
- Simple heuristics achieve good results for energy optimization
  - Up to 50% memory subsystem energy savings
  - Marginal degradation in performance
- Our heuristics scale well with cache parameters