Tuning of Loop Cache Architectures to Programs in Embedded System Design
Susan Cotterell and Frank Vahid
Department of Computer Science and Engineering, University of California, Riverside
Presented by Guang Pan

Abstract Adding a small loop cache to a microprocessor has been shown to reduce average instruction fetch energy for various sets of embedded system applications. Using an evaluation environment, designers can now tune a loop cache architecture to best match a specific application. With this environment, we show significant variation in the best architecture for different examples.

Introduction Reducing the energy and power consumption of embedded systems translates to longer battery life and reduced cooling requirements. For embedded microprocessor-based systems, instruction fetching can account for a large percentage of system power (nearly 50%).

Fetching occurs on nearly every cycle, involves driving long and possibly off-chip bus lines, and may involve reading numerous memories concurrently, as in set-associative caches. Several approaches to reducing instruction fetch energy have been proposed, including: program compression, to reduce the number of bits fetched; bus encoding, to reduce the number of switched wires; efficient instruction cache design.

Another category of approaches capitalizes on a common feature of embedded applications: they spend much of their time in small loops. These approaches integrate a tiny instruction cache with the microprocessor. Such tiny caches have extremely low power per access, roughly 50 times less than a regular instruction memory access.
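To see why the 50x ratio matters, consider a back-of-the-envelope fetch-energy model. Only the 50x figure comes from the text above; the hit fraction h = 0.8 is an assumed number for illustration:

```latex
% Average energy per instruction fetch, with fraction h of fetches
% hitting the tiny cache (E_loop = 1 unit assumed, E_mem = 50 units):
E_{\mathrm{fetch}} = h \, E_{\mathrm{loop}} + (1-h) \, E_{\mathrm{mem}}
% Example, assuming h = 0.8:
E_{\mathrm{fetch}} = 0.8 \times 1 + 0.2 \times 50 = 10.8 \text{ units vs. } 50,
\quad \text{a savings of} \approx 78\%.
```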

Several low-power tiny instruction cache architectures have been introduced in recent years, including: the filter cache; dynamically loaded tagless loop caches; preloaded tagless loop caches. Such tiny caches can be used in addition to an existing cache hierarchy. Each type of cache can vary not only in size but also in certain features.

However, an embedded system typically runs one fixed application for the system's lifetime. For example, a cell phone's software usually does not change. Furthermore, embedded system designers are increasingly utilizing microprocessor cores rather than off-the-shelf microprocessor chips. The combination of a fixed application and a flexible core opens the opportunity to tune the core's architecture to that fixed application.

Architecture tuning is the customizing of an architecture to most efficiently execute a particular application. A very aggressive form of tuning involves creating a customized instruction set, known as an application-specific instruction set. Complementary to such application-specific instruction-set processor design is the design of customized memory architectures.

Filter/Loop Cache Architectures Several tiny cache architectures, shown in Figure 1, have been proposed in recent years for low energy or power, each with several variations. (a) A filter cache is an L0 memory; misses cause a fill from L1. (b) A basic dynamically loaded loop cache sits to the side; accesses go either to it or to L1 memory. (c) A preloaded loop cache also sits to the side, but its contents do not change during program execution.

Filter Cache The filter cache is an unusually small direct-mapped cache, placed between the CPU and the L1 cache, that utilizes standard tag comparison and miss logic. Advantages: Because the filter cache is much smaller than the L1 cache, it has a faster access time and lower power per access. Disadvantages: Because the cache is so small, it may suffer from a high miss rate and hence may decrease overall performance.
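As a concrete illustration, here is a minimal C model of a direct-mapped filter cache lookup. All parameters (64 single-instruction lines, 32-bit words) and the l1_fetch() stub are assumptions for the sketch, not values from the paper:

```c
#include <stdint.h>
#include <stdbool.h>

#define FC_LINES 64            /* assumed: 64 lines of one instruction each */

typedef struct {
    bool     valid[FC_LINES];
    uint32_t tag[FC_LINES];    /* upper address bits for hit detection */
    uint32_t data[FC_LINES];   /* cached instruction word */
} filter_cache_t;

/* Stub standing in for an L1 cache access (the expensive path). */
static uint32_t l1_fetch(uint32_t addr) { return addr ^ 0xdeadbeefu; }

uint32_t fc_fetch(filter_cache_t *fc, uint32_t addr)
{
    uint32_t word  = addr >> 2;          /* word-aligned instruction address */
    uint32_t index = word % FC_LINES;
    uint32_t tag   = word / FC_LINES;

    if (fc->valid[index] && fc->tag[index] == tag)
        return fc->data[index];          /* hit: low-power access */

    /* Miss: fill this line from L1, then serve the instruction. */
    fc->data[index]  = l1_fetch(addr);
    fc->tag[index]   = tag;
    fc->valid[index] = true;
    return fc->data[index];
}
```

The sketch shows the tradeoff from the slide: the hit path is cheap, but with so few lines the miss path, with its L1 fill, is taken often.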

Dynamically Loaded Loop Caches The dynamically loaded loop cache is a small instruction buffer that is tightly integrated with the processor and has no tag address store or valid bits. It is an alternative location from which to fetch instructions. A loop cache controller is responsible for filling the loop cache when it detects a simple loop, defined as any short backwards branch instruction.

Dynamically Loaded Loop Caches 2 Example: the loop cache controller detects the loop on its 1st iteration, fills the loop cache during the 2nd iteration, and fetches from the loop cache from the 3rd iteration onward.

Dynamically Loaded Loop Caches 3 The original dynamic loop cache: The location from which to fetch an instruction is determined by a simple counter. The controller continues to fetch from the loop cache, resetting the counter each time it reaches zero, until a control-of-flow change is encountered or the short backwards branch is not taken. It cannot handle loops that are larger than the cache itself. The flexible dynamic loop cache: An enhancement of the original dynamic loop cache. The cache is filled with the instructions located at the beginning of the loop until the loop cache is full, so loops larger than the cache can still be partially cached.
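The counter-based behavior can be sketched as a small state machine. Everything below is an illustrative model: the 32-entry capacity, the function names, and the treatment of addresses as instruction indices are assumptions, not details from the paper:

```c
#include <stdint.h>
#include <stdbool.h>

#define LC_SIZE 32                    /* assumed capacity, in instructions */

typedef enum { LC_IDLE, LC_FILL, LC_ACTIVE } lc_state_t;

typedef struct {
    lc_state_t state;
    uint32_t   start;                 /* loop start (sbb target), as an index */
    uint32_t   len;                   /* loop length in instructions */
    uint32_t   buf[LC_SIZE];
    uint32_t   idx;                   /* the simple counter */
} loop_cache_t;

static uint32_t mem_fetch(uint32_t pc) { return pc; }  /* memory stand-in */

/* Decode logic reports a taken short backwards branch (sbb). */
void lc_on_sbb_taken(loop_cache_t *lc, uint32_t target, uint32_t len)
{
    if (len > LC_SIZE) { lc->state = LC_IDLE; return; } /* original: too big */
    if (lc->state == LC_IDLE) {        /* 1st iteration: loop detected */
        lc->start = target;
        lc->len   = len;
        lc->state = LC_FILL;
    } else if (lc->state == LC_FILL) { /* fill complete: serve from cache */
        lc->state = LC_ACTIVE;
    }
    lc->idx = 0;                       /* counter resets on each back-branch */
}

/* Any other control-of-flow change, or the sbb not taken, exits the loop. */
void lc_on_flow_change(loop_cache_t *lc) { lc->state = LC_IDLE; }

/* Per-fetch instruction supply. */
uint32_t lc_fetch(loop_cache_t *lc, uint32_t pc)
{
    if (lc->state == LC_ACTIVE)
        return lc->buf[lc->idx++ % lc->len];  /* cheap loop cache access */

    uint32_t insn = mem_fetch(pc);            /* normal (costly) fetch */
    if (lc->state == LC_FILL && pc >= lc->start && pc < lc->start + lc->len)
        lc->buf[pc - lc->start] = insn;       /* 2nd iteration: fill */
    return insn;
}
```

The flexible variant would relax the len > LC_SIZE check and instead fill only the first LC_SIZE instructions of the loop.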

Dynamically Loaded Loop Caches 4 Advantages: Eliminates performance degradation and the need for tag comparisons. Disadvantages: Internal branches within loops, multiple backwards branches to the same starting point in a loop, nested loops, and subroutines all pose problems.

Preloaded Loop Caches A preloaded loop cache uses profiling information gathered for a particular application. The loops that comprise the largest percentage of execution time are selected and preloaded during system reset, along with extra bits indicating whether control-of-flow changes within the loop cause an exit from the loop. Loop address registers are preloaded to indicate which loops reside in the loop cache. The loop cache controller can then check for a loop address whenever a short backwards branch is executed.
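A minimal sketch of the controller's check against the loop address registers follows; the register count and all names are hypothetical. In the sbb variant described on the next slide, this comparison runs only when a short backwards branch executes:

```c
#include <stdint.h>
#include <stdbool.h>

#define N_LOOPS 4   /* assumed number of loop address registers */

typedef struct {
    uint32_t start[N_LOOPS];  /* loop start addresses, preloaded at reset */
    uint32_t base[N_LOOPS];   /* each loop's offset within the cache */
    int      active;          /* index of the loop being served, or -1 */
} preloaded_lc_t;

/* Called when a short backwards branch to 'target' is taken (sbb variant).
 * Returns true if the loop was preloaded, so subsequent fetches can be
 * redirected into the loop cache at offset base[active]. */
bool plc_check(preloaded_lc_t *plc, uint32_t target)
{
    for (int i = 0; i < N_LOOPS; i++) {
        if (plc->start[i] == target) {
            plc->active = i;
            return true;
        }
    }
    plc->active = -1;
    return false;
}
```

In the sa variant, the same comparison would instead be made against every fetch address, which is what lets fetching begin on a loop's first iteration.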

Preloaded Loop Caches 2 The preloaded loop cache (sbb): loop cache fetching can begin on a loop's second rather than third iteration. The preloaded loop cache (sa, for starting address): loop addresses can be checked on every instruction, allowing loop cache fetching to begin on a loop's first iteration.

Preloaded Loop Caches 3 Advantages: Potentially greater energy savings. Disadvantages: As another tagless loop caching scheme, it allows only a limited number of loops to be cached, and it requires more complex logic to index into the preloaded loop cache.

Evaluation Framework Which tiny instruction cache architecture and variation is best? The answer depends on the application being executed. We evaluated a number of cache architecture variations on a set of Powerstone benchmarks. For each benchmark, we considered 106 different cache configurations.
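Conceptually, the tuning search is a simple exhaustive loop over the configuration space. In this sketch, simulate() is an assumed stand-in that returns total instruction fetch energy for one benchmark under one configuration; only the count of 106 configurations comes from the slides:

```c
#include <stdio.h>

#define N_CONFIGS 106   /* configurations considered per benchmark */

/* Stub standing in for a real simulation of the benchmark under
 * configuration 'cfg' (returns made-up energy numbers for illustration). */
static double simulate(const char *benchmark, int cfg)
{
    (void)benchmark;
    return 1000.0 - 3.0 * (cfg % 50);   /* placeholder only */
}

int best_config(const char *benchmark)
{
    double base = simulate(benchmark, 0);   /* assume cfg 0 = no tiny cache */
    int    best = 0;
    double best_savings = 0.0;

    for (int cfg = 1; cfg < N_CONFIGS; cfg++) {
        double savings = 1.0 - simulate(benchmark, cfg) / base;
        if (savings > best_savings) {
            best_savings = savings;
            best = cfg;
        }
    }
    printf("%s: best configuration %d (%.1f%% fetch energy savings)\n",
           benchmark, best, 100.0 * best_savings);
    return best;
}
```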

Evaluation Framework 2 Table 1: Corresponding code for various cache configurations

Results Figure 2: Instruction fetch energy savings for bit for various cache configurations.

Results 2 Figure 3: Instruction fetch energy savings for v42 for various cache configurations.

Results 3 Figure 4: Average instruction fetch energy savings for Powerstone benchmarks for various cache configurations.

Results 4 Figure 6: Best configuration for a benchmark (left bar) vs. best average configurations: configuration 30 (middle bar) and configuration 105 (right bar).

Conclusions Incorporating a tiny instruction cache can result in instruction fetch energy savings, but many variations of such caches exist. By customizing such a cache to a particular program, we obtained an average additional energy savings of 11% compared to a non-customized cache, with savings over 40% in some of the cases examined. An investigation of a wider variety of loop cache architectures (warm-fill dynamically loaded loop caches and hybrid dynamic/preloaded loop caches) has just started.

Questions