Dynamic Loop Caching Meets Preloaded Loop Caching – A Hybrid Approach Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship International Conference on Computer Design, 2002

2 Introduction Memory access can consume 50% of an embedded microprocessor’s system power –Instruction fetching usually accounts for more than half of that power –Caches tend to be power hungry ARM920T: caches consume half of total power (Segars ’01) M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends ’99) [Figure: ARM920T power breakdown. Source: Segars, ISSCC’01. Block diagram: Processor, L1 Cache, I-Mem]

3 Filter Cache Tiny L0 cache (~64 instructions) –Kin/Gupta/Mangione-Smith ’97 –Has very low dynamic power Short internal bitlines Close to the microprocessor Power/energy savings, but: –Performance penalty of 21% due to high miss rate (Kin ’97) –Tag comparisons consume power [Diagram: Processor, Filter Cache (L0), L1 Cache]
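To make the tag overhead concrete, here is a minimal C model of a direct-mapped filter cache lookup; the sizing (16 lines of 4 instructions, roughly the 64-instruction capacity above) and all names are illustrative assumptions, not the Kin ’97 design.

#include <stdbool.h>
#include <stdint.h>

#define FC_LINES 16  /* 16 lines x 4 instructions ~ 64-instruction L0 */

typedef struct {
    uint32_t tag;      /* stored tag, compared on every access */
    bool     valid;
    uint32_t data[4];  /* four 32-bit instructions per line */
} fc_line_t;

/* Returns true on a hit, writing the instruction to *insn. */
bool filter_cache_lookup(const fc_line_t fc[FC_LINES], uint32_t addr,
                         uint32_t *insn)
{
    uint32_t idx = (addr >> 4) & (FC_LINES - 1);  /* 16-byte line index */
    uint32_t tag = addr >> 8;                     /* bits above offset+index */
    if (fc[idx].valid && fc[idx].tag == tag) {    /* tag compare: costs power */
        *insn = fc[idx].data[(addr >> 2) & 3];    /* select word within line */
        return true;
    }
    return false;  /* miss: refill from L1, the source of the 21% penalty */
}

Every access pays the tag read and compare, and every miss stalls the pipeline; the tagless loop cache on the next slide avoids both.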

4 Dynamically Loaded Tagless Loop Cache Tiny cache that passively fills with loops as they execute (Lee/Moyer/Arends ’99) Not really a first level of memory –Rather, an alternative Operation –Filled when a short backwards branch (sbb) is detected in the instruction stream, e.g. … mov r1, 2 … sbb -2 Compared to filter cache... –No tags – even lower power –Missless – no performance penalty [Diagram: Processor, Mux, Dynamic Loop Cache, L1 Cache]

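To make the fill/fetch behavior concrete, here is a minimal C model of the controller logic just described; the state names, 32-entry size, and calling convention are illustrative assumptions, not the exact Lee/Moyer/Arends hardware.

#include <stdbool.h>
#include <stdint.h>

#define LC_SIZE 32  /* loop cache entries; illustrative size */

typedef enum { LC_IDLE, LC_FILL, LC_ACTIVE } lc_state_t;

typedef struct {
    uint32_t   insn[LC_SIZE]; /* tagless storage, indexed by PC offset only */
    uint32_t   start_pc;      /* loop start: target of the sbb */
    uint32_t   end_pc;        /* loop end: address of the sbb itself */
    lc_state_t state;
} loop_cache_t;

/* Call once per executed instruction at address 'pc'. 'sbb_taken' means this
 * instruction was a taken short backwards branch to 'sbb_target'; 'other_cof'
 * means it caused any other change of flow. Returns true if the *next* fetch
 * can be served from the loop cache; no tag check is needed because the
 * controller tracks loop state instead of comparing addresses per access. */
bool loop_cache_step(loop_cache_t *lc, uint32_t pc, uint32_t insn,
                     bool sbb_taken, uint32_t sbb_target, bool other_cof)
{
    switch (lc->state) {
    case LC_IDLE:
        if (sbb_taken && (pc - sbb_target) / 4 < LC_SIZE) {
            lc->start_pc = sbb_target;  /* small loop detected: capture its */
            lc->end_pc   = pc;          /* body during the next iteration   */
            lc->state    = LC_FILL;
        }
        return false;                   /* keep fetching from L1 */
    case LC_FILL: {
        uint32_t off = (pc - lc->start_pc) / 4;
        if (other_cof || off >= LC_SIZE) {
            lc->state = LC_IDLE;        /* a cof aborts filling */
            return false;
        }
        lc->insn[off] = insn;           /* passive capture from the L1 stream */
        if (sbb_taken && sbb_target == lc->start_pc) {
            lc->state = LC_ACTIVE;      /* whole body captured */
            return true;                /* serve the next iteration */
        }
        return false;
    }
    case LC_ACTIVE:
        if (other_cof || (pc == lc->end_pc && !sbb_taken)) {
            lc->state = LC_IDLE;        /* loop exited: back to L1, no miss */
            return false;
        }
        return true;                    /* missless hit in lc->insn[] */
    }
    return false;
}

Because the controller only switches to fetching from the loop cache after seeing the same sbb a second time, the straight-line body is guaranteed to be fully resident first, which is what makes the design missless.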

8 Dynamically Loaded Tagless Loop Cache – Results We ran 10 Powerstone benchmarks (from Motorola) on a MIPS processor instruction-set simulator –Average L1 fetch reduction was 30% –Closely matched the results of [Lee et al ’99] –L1 fetch reductions translate directly into system power savings

9 Dynamically Loaded Tagless Loop Cache - Limitation Does not support loops with control-of-flow changes (cofs) –Only supports sequential instruction fetching, since the cache is filled passively during a loop iteration and does not see instructions not executed on that iteration –A cof thus terminates loop cache filling or fetching –cofs unfortunately include common if-then-else statements within a loop, e.g. … mov r1, 2 mov r2, 3 bne r1, r2, 2 … sbb -4 …

12 Dynamically Loaded Tagless Loop Cache - Limitation Because cofs are not supported, only about half of the small frequent loops in the benchmarks can be captured
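As a hypothetical C illustration (this loop is not from the paper’s benchmarks), even a simple if-then-else inside a loop defeats the dynamic scheme:

/* The if/else compiles to a forward branch (a cof) inside the loop body,
 * which aborts loop cache filling, so every iteration fetches from L1. */
int checksum(const unsigned char *buf, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (buf[i] & 0x80)   /* forward branch: terminates the fill */
            sum -= buf[i];
        else
            sum += buf[i];
    }
    return sum;
}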

13 Preloaded Tagless Loop Cache Embedded systems typically execute a fixed application –Can determine critical loops/subroutines through profiling –Can preload critical regions into the loop cache, whose contents will then not change Preloaded loop cache (Gordon-Ross/Cotterell/Vahid CAL’02) –Tagless, missless –Supports more loops than the dynamic loop cache, including loops with cofs such as … mov r1, 2 mov r2, 3 bne r1, r2, 2 … sbb -4 … [Diagram: Processor, Mux, Preloaded Loop Cache, L1 Cache; system: Processor, Pmem., Dmem., Periph.]

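A minimal C sketch of the preloaded lookup, assuming region bounds held in a small set of loop address registers (LARs); the register count, struct fields, and function name are illustrative, not the exact CAL’02 design.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LARS 6  /* number of preloadable regions; illustrative */

/* One loop address register (LAR): a profiled critical region preloaded at
 * startup. Contents never change at run time, so there are no tags and no
 * misses: a simple address-range compare decides loop-cache vs. L1 fetch. */
typedef struct {
    uint32_t start;  /* first instruction address of the region */
    uint32_t end;    /* last instruction address of the region */
    uint32_t base;   /* where the region sits in loop cache storage */
    bool     valid;
} lar_t;

/* On a change of flow to 'target', decide whether it falls in a preloaded
 * region; if so, return the loop cache index to fetch from. */
bool lar_lookup(const lar_t lars[NUM_LARS], uint32_t target,
                uint32_t *lc_index)
{
    for (int i = 0; i < NUM_LARS; i++) {
        if (lars[i].valid &&
            target >= lars[i].start && target <= lars[i].end) {
            *lc_index = lars[i].base + (target - lars[i].start) / 4;
            return true;   /* fetch from the loop cache */
        }
    }
    return false;          /* fetch from L1 */
}

Because the fetch source is decided by a range compare against a handful of LARs rather than a per-line tag check, the lookup stays far cheaper than a conventional cache access, and cofs within a preloaded region are no problem since the whole region is resident.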

15 Preloaded Tagless Loop Cache - Results Results –A 128-instruction preloaded loop cache reduces L1 fetches by nearly twice the dynamic loop cache’s 30% for the benchmarks studied (Powerstone and Mediabench)

16 Preloaded Tagless Loop Cache - Disadvantages The preloaded loop cache has some limitations too –Occasionally the dynamic loop cache is actually better –Preloading also requires a fixed application and profiling –Only a limited number of loops can be preloaded –We really want both! [Chart: instruction fetch power savings]

17 Solution: A Hybrid Loop Cache Functions as both a dynamic and a preloaded loop cache 2 levels of cache storage –Main Loop Cache – for instruction fetching –2nd Level Storage – preloaded loops are stored here [Block diagram: Microprocessor, Mux, Main Loop Cache, 2nd Level Storage, L1 Cache, and a Loop Cache Controller containing the Preloaded Loop Filler, Loop Match logic, and LARs, connected by address, data, and control signals]

18 Hybrid Loop Cache - Functionality Dynamic Loop Cache functionality –On a short backwards branch, the main loop cache is filled dynamically Preloaded Loop Cache functionality –On a cof, if the next instruction falls within a preloaded region of code, that region is filled into the main loop cache from 2nd level storage –After being filled, instructions can be fetched from the main loop cache (a combined controller sketch follows)
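Composing the two earlier C sketches gives a hybrid controller in the same illustrative style; the priority order and names are assumptions about how the mechanisms combine, not the paper’s exact logic. It reuses loop_cache_t, loop_cache_step, lar_t, and lar_lookup from above.

/* 'cof_target' is the destination address whenever sbb_taken or other_cof
 * is set. Returns true if the next fetch is served by the main loop cache. */
typedef struct {
    loop_cache_t dyn;             /* main loop cache, dynamically filled */
    lar_t        lars[NUM_LARS];  /* preloaded regions in 2nd level storage */
} hybrid_lc_t;

bool hybrid_step(hybrid_lc_t *h, uint32_t pc, uint32_t insn,
                 bool sbb_taken, bool other_cof, uint32_t cof_target)
{
    uint32_t idx;
    /* Preloaded path: on any cof, if the target lies in a profiled region,
     * the Preloaded Loop Filler copies that region from 2nd level storage
     * into the main loop cache (the copy itself is not modeled here) and
     * subsequent fetches are served from there. */
    if ((sbb_taken || other_cof) && lar_lookup(h->lars, cof_target, &idx))
        return true;
    /* Dynamic path: otherwise fall back to sbb-triggered passive filling. */
    return loop_cache_step(&h->dyn, pc, insn, sbb_taken, cof_target, other_cof);
}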

19 Hybrid Loop Cache - Results Hybrid performance –Best in 9 out of 13 benchmarks –Tied in 1 –In the remaining 3, the hybrid loop cache performed nearly as well as or better than the strictly dynamic approach but was outperformed by the preloaded approach [Chart: instruction fetch power savings]

20 Hybrid Loop Cache – Additional Consideration The hybrid loop cache can behave as a purely dynamic loop cache –Useful if the designer does not wish to profile/preload –Power savings are almost identical to the dynamic loop cache’s when no loops are preloaded

21 Conclusions The hybrid loop cache reduced embedded system instruction fetch power by an average of 51% –90% savings in several cases –Outperformed both the dynamic and preloaded loop caches: Dynamic 23%, Preloaded 35%, Hybrid 51% Can work as dynamic, preloaded, or both –More capacity than a preloaded loop cache –Can be used transparently as a dynamic loop cache with nearly identical results The hybrid loop cache may be a good addition to low-power embedded microprocessor architectures