A Self-Tuning Cache Architecture for Embedded Systems. Chuanjun Zhang, Frank Vahid, and Roman Lysecky.

Presentation transcript:

1 A Self-Tuning Cache Architecture for Embedded Systems
Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation.

2 Caches Consume Much Power
- ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99). [Figure label: >50%]
- Caches are frequently accessed, so they consume dynamic power.
- Caches account for most of the transistors on a die, so they consume static power.
- We showed that a configurable cache can cut that power nearly in half on average (Zhang et al. ISCA 03, ISVLSI 03).

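3 Configurable Cache Architecture
[Figure: base cache with four ways W1 to W4, shown as four-way set associative, two-way set associative, direct mapped, and with two ways shut down. Other labels: Gated-Vdd Control (Gnd, Vdd, Bitline); Counter; bus; Off Chip Memory.]
- Way concatenation: the four-way set associative base cache can be configured as two-way set associative or direct mapped.
- Way shutdown: ways can be shut down (for example, two of the four) using gated-Vdd control, the sleep transistor method (Powell et al. ISLPED 2000).
- Line concatenation: one way has 16-byte physical lines; 4 physical lines are filled when the line size is 64 bytes.
- The way prediction unit can be turned on/off.
- References on the slide: (Zhang et al. ISVLSI 03), (Zhang et al. ISCA 03).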

4 Computing Total Memory-Related Energy
- Considers CPU stall energy and off-chip memory energy.
- Excludes CPU active energy.
- Thus, represents all memory-related energy.

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle

Simplified relations, where we varied the k's to account for different system implementations:
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle

Measured quantities (underlined on the original slide):
- From SimpleScalar: cache_hits, cache_misses, cycles.
- From our layout or data sheets: the others.
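The energy equations on this slide can be turned into a few lines of Python. This is a minimal sketch, not the authors' tool: the class name `EnergyParams`, the function names, and all numeric values in the usage note are hypothetical and chosen only to make the arithmetic easy to check.

```python
from dataclasses import dataclass


@dataclass
class EnergyParams:
    """Per-event energies for one cache configuration (units are arbitrary here)."""
    energy_hit: float               # energy of one cache hit
    energy_offchip_access: float    # energy of one off-chip memory access
    energy_up_stall: float          # energy the microprocessor burns while stalled on a miss
    energy_cache_block_fill: float  # energy to fill the cache block after a miss
    energy_static_per_cycle: float  # static (leakage) energy per clock cycle


def energy_miss(p: EnergyParams) -> float:
    # energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
    return p.energy_offchip_access + p.energy_up_stall + p.energy_cache_block_fill


def energy_mem(p: EnergyParams, cache_hits: int, cache_misses: int, cycles: int) -> float:
    # energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    dynamic = cache_hits * p.energy_hit + cache_misses * energy_miss(p)
    # energy_static = cycles * energy_static_per_cycle
    static = cycles * p.energy_static_per_cycle
    # energy_mem = energy_dynamic + energy_static
    return dynamic + static


# Hypothetical numbers, for illustration only.
example = EnergyParams(
    energy_hit=1.0,
    energy_offchip_access=10.0,
    energy_up_stall=5.0,
    energy_cache_block_fill=5.0,
    energy_static_per_cycle=0.5,
)
total = energy_mem(example, cache_hits=100, cache_misses=10, cycles=200)
```

With these made-up inputs, the dynamic part is 100 * 1.0 + 10 * 20.0 = 300 and the static part is 200 * 0.5 = 100, so `total` is 400.0. In the slide's setup, the hit, miss, and cycle counts would come from SimpleScalar and the per-event energies from layout or data sheets.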

5 Best Configuration Varies Across Applications

6 Cache Self-tuning Hardware
[Figure: SoC platform with Processor, I$, D$, Tuner, and Offchip Memory.]
Simulation-based methods
- Drawback: slowness. Seconds of real-time work may take tens of hours to simulate.
- Setting up the simulation tools may be difficult.
Self-tuning method
- Incorporates a cache parameter tuner on an SoC platform.
- Detects the cache parameters with the lowest energy dissipation.
- The tuner sits to the side and collects the information used to calculate the energy.
A heuristic algorithm is needed
- Searching all possible cache configurations is time consuming.
- With other configurable parameters considered (voltage levels, bus width, etc.), the search space quickly grows to millions of configurations.
- Cache flushing should be avoided.

7 Designing a Search Heuristic: Evaluating Impact of Cache Parameters on Miss Rate and Energy
[Figure: Average Instruction Cache Miss Rate and Normalized Energy of the Benchmarks. Plot annotations: One Way; Line Size 32B.]

8 Energy Dissipation of On-Chip Cache and Off Chip Memory

9 Heuristic: Searching for the least-energy cache configuration
Search order leading to the least-energy cache configuration:
1. Search cache size
2. Search line size
3. Search associativity
4. Way prediction
[Figure labels: W1, W2, W3, W4]
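The search order on this slide (cache size, then line size, then associativity, then way prediction) can be sketched as a one-parameter-at-a-time search. This is an illustrative sketch under stated assumptions, not the tuner's actual logic: the function name `tune`, the stopping rule (stop stepping a parameter as soon as energy fails to improve), the candidate values, and the `toy_energy` function are all assumptions. The sketch also ignores any restrictions on which parameter combinations the real cache supports.

```python
def tune(param_values, measure_energy):
    """Greedy search: tune one parameter at a time, in dictionary order.

    param_values: dict mapping parameter name -> candidate values in the order to try.
    measure_energy: callable taking a configuration dict and returning its energy.
    Returns (best_config, best_energy, number_of_configurations_evaluated).
    """
    # Start from the first candidate value of every parameter.
    config = {name: values[0] for name, values in param_values.items()}
    best = measure_energy(config)
    evaluations = 1

    for name, values in param_values.items():
        for value in values[1:]:
            trial = dict(config, **{name: value})
            energy = measure_energy(trial)
            evaluations += 1
            if energy < best:
                config, best = trial, energy   # keep the improvement and keep stepping
            else:
                break                          # assumed rule: stop at the first non-improvement
    return config, best, evaluations


# Hypothetical search space, listed in the slide's search order.
space = {
    "size_kb": [2, 4, 8],
    "line_b": [16, 32, 64],
    "ways": [1, 2, 4],
    "way_pred": [False, True],
}


def toy_energy(c):
    # Made-up energy surface with its minimum at 4 KB, 32 B lines, direct mapped.
    return (abs(c["size_kb"] - 4) * 10 + abs(c["line_b"] - 32)
            + c["ways"] + (0.5 if c["way_pred"] else 0))


best_config, best_energy, evaluations = tune(space, toy_energy)
```

On this toy surface the search evaluates 7 configurations and settles on a 4 KB, 32-byte-line, direct-mapped cache without way prediction. In hardware, `measure_energy` would correspond to running the application for an interval under one configuration and computing energy from the collected hit, miss, and cycle counts.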

10 Implementing the Heuristic in Hardware
[Figure: FSM and Data Path of the Cache Explorer. Labels: input; hit energies; miss energies; static energies; hit num; miss num; exe time; multiplier; adder; register; comparator; lowest energy; FSM; control; com_out; configure register; mux.]
- Total size of the tuner: about 4,200 gates, or […] mm² in 0.18 micron CMOS technology.
- Area overhead: compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead.
- Power consumption: 2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%.
- Furthermore, the exploring hardware is used only during the exploring stage and can be shut down after the best configuration is determined.

11 Heuristic time-complexity and effectiveness
Time complexity
- Searching the whole space: O(m × n × l × p)
- Heuristic: O(m + n + l + p)
- m: number of associativities; n: number of cache sizes; l: number of cache line sizes; p: way prediction on/off
Efficiency
- On average, the heuristic searches 5 configurations instead of all 27.
- 2 out of 19 benchmarks miss the lowest-power cache configuration.
- With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration.
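The gap between the product and the sum can be checked by enumeration. The sketch below is an assumption-laden illustration: the slide gives only the total of 27, so the parameter values (2/4/8 KB, 16/32/64-byte lines, 1/2/4 ways), the 2 KB bank size, and the two validity rules are assumed here because they happen to reproduce that total. The name `valid` and the constants are hypothetical.

```python
from itertools import product

SIZES_KB = [2, 4, 8]        # assumed cache sizes
LINE_SIZES_B = [16, 32, 64]  # assumed line sizes
WAYS = [1, 2, 4]             # assumed associativities
WAY_PRED = [False, True]
BANK_KB = 2                  # assumed: the base cache is built from four 2 KB ways


def valid(size_kb, ways, way_pred):
    # Assumed rule 1: associativity cannot exceed the number of active 2 KB banks.
    if ways > size_kb // BANK_KB:
        return False
    # Assumed rule 2: way prediction only applies to set-associative configurations.
    if way_pred and ways == 1:
        return False
    return True


configs = [
    (size, line, ways, pred)
    for size, line, ways, pred in product(SIZES_KB, LINE_SIZES_B, WAYS, WAY_PRED)
    if valid(size, ways, pred)
]

unconstrained = len(SIZES_KB) * len(LINE_SIZES_B) * len(WAYS) * len(WAY_PRED)
exhaustive = len(configs)
heuristic_bound = len(SIZES_KB) + len(LINE_SIZES_B) + len(WAYS) + len(WAY_PRED)
```

Under these assumptions the unconstrained product is 54, the valid space has 27 configurations, and a one-parameter-at-a-time search visits at most 3 + 3 + 3 + 2 = 11 of them, which is in line with the slide's reported average of 5.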

12 Energy Savings
- On average, 40% energy reductions.
- A conventional direct mapped cache may consume unacceptable energy.
- Figure annotation: 70% energy reductions.
[Figure: Energy savings when way concatenation, way shut down, and cache line size concatenation are implemented (C. Zhang TECS ACM To Appear). 100% stands for the energy consumption of a conventional four-way set associative cache.]
Legend: cnv: conventional cache; cfg: configurable cache; wc: way concatenation; ws: way shut down; lc: line concatenation.

13 Conclusions
- A highly configurable cache architecture
  - Reduces memory-access-related energy by 40% on average
- A self-tuning mechanism is proposed
  - A special cache parameter explorer
  - A heuristic algorithm to search the parameter space
  - Cache flushing is avoided