Automatic Tuning of Two-Level Caches to Embedded Applications
Ann Gordon-Ross and Frank Vahid*, Department of Computer Science and Engineering, University of California, Riverside (*also with the Center for Embedded Computer Systems, UC Irvine)
Nikil Dutt, Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.

2 Introduction
Memory access accounts for roughly 50% of an embedded processor's system power, and caches are power hungry, e.g., in the ARM920T (Segars '01) and the M*CORE (Lee/Moyer/Arends '99). Thus, the cache is a good candidate for optimization. [Figure: processor, L1 cache, L2 cache, and main memory; caches shown consuming 53% of power]

3–5 Motivation
Tuning cache parameters to an application can save energy, 60% on average (Balasubramonian '00, Zhang '03). Each application has different cache requirements, so one predetermined cache configuration can't be best for all applications:
- Size: excess fetch and static energy if the cache is too large; excess thrashing energy if it is too small.
- Line size: excess fetch energy if the line size is too large; excess stall energy if it is too small.
- Associativity: excess fetch energy per access if too high; excess miss energy if too low.

6 Motivation (continued)
By tuning these parameters, the cache can be customized to a particular application: evaluate the energy of the possible cache configurations and choose the lowest-energy one. [Figure: microprocessor with L1 cache, L2 cache, and main memory; energy plotted over the possible cache configurations, with the lowest-energy configuration chosen]

7 Related Work
Configurable caches exist in soft cores (ARM, MIPS, Tensilica, etc.) and even for hard processors (Motorola M*CORE, Malik ISLPED'00; Albonesi MICRO'00; Zhang ISCA'03). Configurable cache tuning is mostly manual in practice, which is sub-optimal and time-consuming; automated methods exist for L1 only, such as Platune (Givargis TCAD'02, Palesi CODES'02) and Zhang RSP'03. Meanwhile, two-level caches are becoming popular as more transistors become available on-chip and the gap between on-chip and off-chip access times widens. Automated tuning for L1+L2 is therefore needed. [Figure: tuning the L1 and L2 caches between the microprocessor and main memory]

8 Challenge for Two-Level Cache Tuning
One cache level has tens of configurations; two levels have hundreds to thousands. Say each level has 50 configurations of its total size, line size, and associativity; two levels then give 50 * 50 = 2500 configurations. An efficient heuristic is needed, especially if tuning relies on simulation-based search.

9 Two-Level Cache Tuning Goal
Develop a fast, good-quality heuristic for tuning two-level caches to embedded applications for reduced energy consumption. We presently focus on separate instruction and data caches at both levels, tuning the instruction cache hierarchy and the data cache hierarchy independently. [Figure: microprocessor with level-one and level-two I-caches and D-caches, and main memory]

10 Configurable Cache Architecture
Our target configurable cache architecture is based on Zhang/Vahid/Najjar's "A Highly Configurable Cache Architecture for Embedded Systems," ISCA'03. The base level-one cache is 8 KB, consisting of four 2 KB banks that can operate as four ways. Way concatenation offers a 2-way or a direct-mapped variation of the 8 KB cache. Way shutdown offers a 2-way 4 KB cache or a direct-mapped 2 KB cache. Way shutdown and way concatenation can be combined to offer a direct-mapped 4 KB cache.
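
To make the bank mechanics concrete, the sketch below (our own illustration, not code from the paper) enumerates the size/associativity combinations reachable from four 2 KB banks via way shutdown and way concatenation:

```python
# Hypothetical illustration: enumerate the level-one cache configurations
# reachable from four 2 KB banks via way shutdown (disabling banks) and
# way concatenation (merging the active banks into wider ways).

BANK_KB = 2                  # each physical bank is 2 KB
ACTIVE_OPTIONS = (4, 2, 1)   # way shutdown leaves 4, 2, or 1 banks active

def reachable_configs():
    configs = set()
    for active in ACTIVE_OPTIONS:
        size_kb = active * BANK_KB
        # Way concatenation groups the active banks into 1, 2, or 4 ways;
        # there cannot be more ways than active banks.
        for ways in (1, 2, 4):
            if ways <= active:
                configs.add((size_kb, ways))
    return sorted(configs)

if __name__ == "__main__":
    for size_kb, ways in reachable_configs():
        label = "direct-mapped" if ways == 1 else f"{ways}-way"
        print(f"{size_kb} KB, {label}")
```

Running this prints the six variants described above: 2 KB direct-mapped; 4 KB direct-mapped and 2-way; and 8 KB direct-mapped, 2-way, and 4-way.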

11 Configuration Space
Cache parameters:
- Size: L1 cache 2, 4, and 8 KBytes; L2 cache 16, 32, and 64 KBytes.
- Line size (L1 or L2): 16, 32, and 64 Bytes, with a 16 Byte physical base line size.
- Associativity (L1 or L2): direct-mapped, 2-way, and 4-way.
This yields 432 possible configurations for the two levels, with separate instruction and data caches.

12 Experimental Environment
Benchmarks are drawn from MediaBench and EEMBC. SimpleScalar produces hit and miss ratios for each configuration. Cache energy comes from Cacti, main memory energy from a Samsung memory datasheet, and CPU stall energy from a MIPS microprocessor model. The cache exploration heuristic outputs a chosen cache configuration. For comparison purposes, an exhaustive search of the space was also run, which took days.
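
The slides do not show the energy equations, but work in this line (e.g., Zhang's configurable-cache papers) typically combines these components roughly as follows; this is a hedged sketch with illustrative symbols, not the paper's own notation:

```latex
\begin{align*}
E_{total}   &= E_{dynamic} + E_{static} \\
E_{dynamic} &= n_{accesses} \cdot E_{access}
             + n_{misses} \cdot (E_{offchip} + E_{stall} + E_{fill}) \\
E_{static}  &= t_{execution} \cdot P_{static}
\end{align*}
```

Here E_access is the per-access cache energy (from Cacti), E_offchip the main-memory access energy, E_stall the processor stall energy per miss, E_fill the energy to refill a cache block, and P_static the cache's static power.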

13 First Heuristic: Tune Levels One-at-a-Time
Tune L1 first, then L2. The initial L2 is a 64 KByte, 4-way cache with a 64 byte line size. For the best L1 found, the L2 cache is then tuned. Each cache is tuned using Zhang's heuristic for one-level cache tuning (RSP'03).

14 First Heuristic: Tune Levels One-at-a-Time (continued)
Zhang's heuristic searches the parameters in order of importance (RSP'03). First search size: begin with a 2 KByte, direct-mapped cache with a 16 Byte line size, increase the size to 4 KB, and if that yields an energy improvement, increase it to 8 KB. Next search line size: for the lowest-energy cache size, increase the line size to 32 Bytes, and if that decreases energy, increase it to 64 Bytes. Finally, search associativity: for the lowest-energy line size, increase the associativity to 2-way, and if that decreases energy, increase it to 4-way.
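
A minimal sketch of this greedy search, assuming an energy(cfg) evaluator (e.g., a SimpleScalar run fed into the energy model); the names and structure are ours, not the paper's:

```python
# Sketch of Zhang's impact-ordered greedy search for one cache level (RSP'03).
# energy(cfg) is assumed to simulate the application under configuration cfg
# and return its total energy.

SIZES = [2, 4, 8]      # cache size in KB (L1 values from slide 11)
LINES = [16, 32, 64]   # line size in bytes
ASSOC = [1, 2, 4]      # ways; 1 = direct-mapped

def tune_parameter(cfg, param, values, energy):
    """Greedily increase one parameter while energy keeps improving."""
    best = dict(cfg, **{param: values[0]})
    best_energy = energy(best)
    for value in values[1:]:
        candidate = dict(best, **{param: value})
        candidate_energy = energy(candidate)
        if candidate_energy >= best_energy:   # stop at first non-improvement
            break
        best, best_energy = candidate, candidate_energy
    return best

def tune_one_level(energy):
    cfg = {"size": 2, "line": 16, "assoc": 1}   # smallest starting cache
    for param, values in (("size", SIZES), ("line", LINES), ("assoc", ASSOC)):
        cfg = tune_parameter(cfg, param, values, energy)  # importance order
    return cfg
```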

15 Results of First Heuristic
Base cache configuration: Level 1 is 8 KByte, 4-way, with a 32 byte line; Level 2 is 64 KByte, 4-way, with a 64 byte line. [Chart: energy of the configurations chosen by the first heuristic, normalized to this base configuration]

16 First Heuristic
The first heuristic did not find the optimal configuration in most cases, and was sometimes 200% or 300% worse. The two levels should not be explored separately: there is too much interdependence among L1 and L2 cache parameters. For example, high L1 associativity decreases misses and thus reduces the need for a large L2, and there are dozens of other such interdependencies.

17–19 Improved Heuristic: Basic Interlacing
To more fully explore the dependencies between the two levels, we interlaced the exploration of the level-one and level-two caches: first determine the best size of the level-one cache, then the best size of the level-two cache; next the best line size of the level-one cache, then of the level-two cache; finally the best associativity of the level-one cache, then of the level-two cache (see the sketch below). Basic interlacing performed better than the initial heuristic, but there was still much room for improvement.
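
Continuing the earlier sketch (and reusing tune_parameter, SIZES, LINES, and ASSOC from it), interlacing reorders the same greedy steps so that each parameter is settled for L1 and then immediately for L2 before moving on; again, this is our own illustration:

```python
# Sketch of basic interlacing. L2 sizes come from the configuration space
# on slide 11; energy(cfg) now evaluates a full two-level configuration.

L2_SIZES = [16, 32, 64]   # L2 cache size in KB

def tune_interlaced(energy):
    cfg = {"l1": {"size": 2, "line": 16, "assoc": 1},
           "l2": {"size": 16, "line": 16, "assoc": 1}}
    steps = [("size", SIZES, L2_SIZES),
             ("line", LINES, LINES),
             ("assoc", ASSOC, ASSOC)]
    for param, l1_values, l2_values in steps:
        # Settle this parameter for L1, then immediately for L2.
        cfg["l1"] = tune_parameter(cfg["l1"], param, l1_values,
                                   lambda c: energy({"l1": c, "l2": cfg["l2"]}))
        cfg["l2"] = tune_parameter(cfg["l2"], param, l2_values,
                                   lambda c: energy({"l1": cfg["l1"], "l2": c}))
    return cfg
```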

20 Final Heuristic: Interlaced with Local Search
Basic interlacing performed well, but some cases were still sub-optimal. Manually examining those cases showed that a small local search was needed; the final heuristic is called TCaT, the Two-Level Cache Tuner. Because of the bank arrangements, if a 16 KB cache is determined to be the best size, the only associativity option is direct-mapped, yet the application may require the increased associativity. During the associativity search step, the cache size is therefore allowed to increase so that larger associativities may be explored (sketched below).
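
A hedged sketch of that local search on top of the interlaced heuristic: when exploring L2 associativity, the size grows along with the way count so that higher associativities remain reachable. The bank size constant is our assumption (a 64 KB L2 built from four 16 KB banks), mirroring the L1 bank structure:

```python
# Sketch of TCaT's local search during the associativity step: an N-way
# cache needs at least N banks, so the size may grow with the way count.

BANK_L2_KB = 16   # assumed L2 bank size
MAX_L2_KB = 64    # largest L2 in the configuration space

def assoc_candidates(cfg):
    """L2 associativity options, growing the size so N ways have N banks."""
    candidates = []
    for ways in (1, 2, 4):
        size = max(cfg["size"], ways * BANK_L2_KB)   # grow size if needed
        if size <= MAX_L2_KB:
            candidates.append(dict(cfg, assoc=ways, size=size))
    return candidates

def tune_assoc_with_growth(cfg, energy):
    # Pick the lowest-energy candidate rather than stopping at the first
    # non-improvement, since size and ways change together here.
    return min(assoc_candidates(cfg), key=energy)
```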

21 TCaT Results: Energy
[Chart: energy consumption, normalized to the base cache configuration] TCaT achieves 53% energy savings in the cache/memory-access sub-system vs. the base cache.

22 TCaT Results: Performance
[Chart: execution time for the TCaT and optimal cache configurations, normalized to the execution time with the base cache configuration] TCaT finds a near-optimal configuration, with nearly 30% improvement over the base cache.

23 TCaT Exploration Time Improvements
TCaT searches only 28 of the 432 possible configurations, about 6% of the space. With a simulation-based approach on a 500 MHz Sparc, exploration takes 3 hours rather than 50; with a hardware-based approach, 28 seconds rather than 434.

24 TCaT in the Presence of Hw/Sw Partitioning
Hardware/software partitioning may become common in SoC platforms with an on-chip FPGA: program kernels are moved to the FPGA, which greatly reduces the temporal and spatial locality of the program. Does TCaT still work well on programs with very low locality?

25 TCaT With Hardware/Software Partitioning
[Chart: energy consumption, normalized to the base cache configuration] TCaT achieves 55% energy savings in the cache/memory-access sub-system vs. the base cache.

26 Conclusions
TCaT is an effective heuristic for two-level cache tuning. It prunes 94% of the search space for the given two-level configurable cache architecture, achieves near-optimal performance (30% improvement vs. the base cache) and near-optimal energy (53% improvement vs. the base cache), and is robust in the presence of hw/sw partitioning. Future work includes more cache parameters and a unified L2 cache (an even larger search space), and dynamic in-system tuning (which must avoid cache flushes).