A One-Shot Configurable-Cache Tuner for Improved Energy and Performance
Ann Gordon-Ross, Pablo Viana, Frank Vahid, Walid Najjar, and Edna Barros

Presentation transcript:

A One-Shot Configurable-Cache Tuner for Improved Energy and Performance
Ann Gordon-Ross 1, Pablo Viana 2, Frank Vahid 1, Walid Najjar 1, and Edna Barros 3
1 Dept. of Computer Science & Engineering, University of California, Riverside, USA
2 Campus Arapiraca, Federal University of Alagoas, Brazil
3 Centro de Informática, Federal University of Pernambuco, Brazil
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.

2 Ann Gordon-Ross, Univ. of California, Riverside
Introduction
Memory access: 50% of an embedded processor's system power
Caches are power hungry
  – ARM920T (Segars 01)
  – M*CORE (Lee/Moyer/Arends 99)
Thus, caches are a good candidate for optimization
[Figure: block diagram of processor, L1 instruction cache, L1 data cache, and main memory, annotated "53%"]

3 Introduction
Different applications have vastly different cache requirements
  – Total size, line size, and associativity
Cache parameters that don't match an application's behavior can waste over 60% of energy (Gordon-Ross 05)
Cache tuning is the process of determining the appropriate cache parameters for an application
Example configurations: 4 KB with 16-byte lines, 2-way; 2 KB with 32-byte lines, direct-mapped; 8 KB with 64-byte lines, 4-way
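
The three tunable parameters named on this slide fix the cache geometry. The following is a minimal sketch, not taken from the slides, of how total size, line size, and associativity determine the number of lines and sets; the class name `CacheConfig` and its fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical container for the three parameters named on the slide."""
    size_bytes: int   # total cache size
    line_bytes: int   # line (block) size
    ways: int         # associativity; 1 means direct-mapped

    @property
    def num_lines(self) -> int:
        # Total number of cache lines.
        return self.size_bytes // self.line_bytes

    @property
    def num_sets(self) -> int:
        # Lines are grouped into sets of `ways` lines each.
        return self.num_lines // self.ways


# The three example configurations shown on the slide.
SLIDE_EXAMPLES = [
    CacheConfig(4 * 1024, 16, 2),   # 4 KB, 16-byte lines, 2-way
    CacheConfig(2 * 1024, 32, 1),   # 2 KB, 32-byte lines, direct-mapped
    CacheConfig(8 * 1024, 64, 4),   # 8 KB, 64-byte lines, 4-way
]

for cfg in SLIDE_EXAMPLES:
    print(cfg, "lines:", cfg.num_lines, "sets:", cfg.num_sets)
```

Tuning amounts to choosing one such tuple per application; the sketch only enumerates geometry and says nothing about energy.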

4 Runtime Cache Tuning
The best cache configuration can be determined by searching the design space during runtime
Runtime cache tuning is transparent to the designer and end user, but incurs runtime overhead in terms of energy and performance
[Figure: an application is downloaded to a system with a tunable cache (TC) and tuning hardware; energy plot showing execution in the base configuration followed by cache tuning]

5 Contribution
We introduce specialized hardware for non-intrusive runtime cache evaluation
  – Temporary energy overhead and no performance overhead
Single-pass multi-cache evaluation (SPCE)
  – Special hardware simultaneously evaluates all cache configurations
  – Enables switching to the best configuration in one shot
[Figure: an application is downloaded to a system with a tunable cache (TC) and SPCE; energy plot showing execution in the base configuration, where SPCE causes an increase in energy but no performance overhead, followed by a switch to the best configuration in "one shot"]

6 SPCE Key Points
Contributions compared to previous methods
Evaluates a highly configurable cache
  – Previous methods offer little configurability
Little hardware overhead
  – Simple data structures
  – Elementary operations

7 SPCE
Monitors the address stream to extract cache hit information for all configurations
Fully-associative cache example (64-bit architecture)
Address stream (byte addresses): t0 = 0, t1 = 8, t2 = 16, t3 = 0, t4 = 8, t5 = 0, t6 = 16
For each line size, the byte address is reduced to a line address:
  – Line size 2^0 words (divide by 2^0 * 8 bytes): t0 = 0, t1 = 1, t2 = 2, t3 = 0, t4 = 1, t5 = 0, t6 = 2. Hits at t3 (needs 3 lines), t4 (3 lines), t5 (2 lines), and t6 (3 lines)
  – Line size 2^1 words (divide by 2^1 * 8 bytes): t0 = 0, t1 = 0, t2 = 1, t3 = 0, t4 = 0, t5 = 0, t6 = 1
  – Line size 2^2 words (divide by 2^2 * 8 bytes): t0 through t6 all map to line 0, giving 6 hits
The number of conflicts determines the cache sizes that would result in a hit
Table (stored hit info): indexed by b, the line size (number of words), and d, the number of lines; 24 different configs
A cache with 2 lines of 2^1 words per line (32 bytes) will have 5 hits and 7 - 5 = 2 misses
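
The fully-associative example above can be reproduced in software. The sketch below is an illustration under stated assumptions, not the authors' hardware: it assumes LRU replacement, uses a Python list where the slides describe a hardware stack, and the function names `single_pass_hits` and `hits_for` are hypothetical. For every reference it counts the unique lines touched since the previous access to the same line (the conflicts); a cache with more lines than conflicts would have hit.

```python
from collections import defaultdict

WORD_BYTES = 8  # 64-bit words, as in the slide's example


def single_pass_hits(addresses, line_words_options=(1, 2, 4)):
    """One pass over the trace; returns {line_words: {lines_needed: hit_count}}.

    Assumes LRU replacement (an assumption, not stated on the slide).
    """
    stacks = {w: [] for w in line_words_options}            # most recent first
    table = {w: defaultdict(int) for w in line_words_options}
    for addr in addresses:
        for w in line_words_options:
            line = addr // (w * WORD_BYTES)                  # line address
            stack = stacks[w]
            if line in stack:
                conflicts = stack.index(line)                # unique lines since last use
                table[w][conflicts + 1] += 1                 # smallest cache that hits
                stack.remove(line)
            stack.insert(0, line)
    return table


def hits_for(table, line_words, num_lines):
    """Hits for a fully-associative cache of num_lines lines of line_words words."""
    return sum(n for need, n in table[line_words].items() if need <= num_lines)


trace = [0, 8, 16, 0, 8, 0, 16]          # address stream from the slide
table = single_pass_hits(trace)
hits = hits_for(table, line_words=2, num_lines=2)
print(hits, len(trace) - hits)           # 5 hits, 2 misses, matching the slide
```

One table entry is updated per reference and line size, and every cache size is then read from the same table, which is what allows all configurations to be evaluated in a single pass.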

8 SPCE
SPCE determines hits for other set associativities by counting the number of unique conflicts in the address trace
Tables (multiple layers): direct-mapped, 2-way, 4-way
Each table (stored hit info) is indexed by b, the line size (number of words), and s, the number of sets
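
One way to read "counting the number of unique conflicts" for set-associative caches is sketched below. This is an assumption-laden illustration rather than the SPCE design itself: it assumes LRU replacement within each set and modulo set indexing, and the function name `set_assoc_hits` and its parameters are hypothetical. For each re-referenced line, it counts only the intervening unique lines that map to the same set; the reference hits in any cache whose associativity exceeds that count.

```python
def set_assoc_hits(addresses, line_bytes, set_counts=(1, 2, 4), ways_options=(1, 2, 4)):
    """Single pass; returns {(num_sets, ways): hit_count}.

    Assumes LRU within a set and set index = line address mod num_sets.
    """
    stack = []  # unique line addresses, most recently used first
    hits = {(s, w): 0 for s in set_counts for w in ways_options}
    for addr in addresses:
        line = addr // line_bytes
        if line in stack:
            above = stack[:stack.index(line)]      # lines used since last access
            for s in set_counts:
                # Only lines mapping to the same set can evict this one.
                conflicts = sum(1 for other in above if other % s == line % s)
                for w in ways_options:
                    if conflicts < w:
                        hits[(s, w)] += 1
            stack.remove(line)
        stack.insert(0, line)
    return hits


trace = [0, 8, 16, 0, 8, 0, 16]       # same address stream as the previous slide
result = set_assoc_hits(trace, line_bytes=8)
print(result[(2, 1)])                  # direct-mapped, 2 sets: 2 hits
```

The dictionary keyed by (sets, ways) plays the role of the layered tables on the slide, with one layer per associativity.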

9 SPCE - Hardware (stack)
Designed and evaluated in synthesizable VHDL

10 Results - Energy Savings
Energy savings compared to exploring the design space using a state-of-the-art intrusive heuristic (Zhang 03)
Values less than 1 denote an energy increase
4.6x less energy expended

11 Results - Tuning Speedup
Tuning speedup obtained compared to a state-of-the-art intrusive heuristic
7.7x faster

12 Overheads
Evaluated SPCE compared to the ARM920T
Area
  – 12% area overhead, due in large part to the TCAM stack structure
Power
  – Temporary 2.2x increase in power during the short tuning cycle
  – The application need only iterate 4 times for the average power overhead to reduce to 1%

13 Conclusions
SPCE is a specialized hardware structure that evaluates all cache configurations simultaneously
  – Enables non-intrusive runtime cache evaluation
  – Enables switching to the best cache configuration in one shot
Compared to a state-of-the-art intrusive cache tuning heuristic
  – 4.6x less energy expended
  – 7.7x speedup in tuning time
12% area overhead compared to the ARM920T
Temporary 2.2x increase in power during the short tuning time
  – Only 4 application iterations to recoup power