Published by Lester Kennedy. Modified over 9 years ago.
1
A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC.
International Symposium on Low Power Electronics and Design, 2001
2
Frank Vahid, www.cs.ucr.edu/~vahid
Introduction
Mass-produced microprocessor IC’s prevail in embedded systems
– Cheap
    From amortization and high yields
– Small and low power
    From optimization and use of new technologies
– Available immediately
Typically run one program forever
QUESTION:
– Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: three identical microprocessor IC’s, each containing Processor, Pmem., Dmem., and Periph. blocks. Annual production: 10 million units. Cost per unit: $2.]
3
Introduction
Answer:
– Yes, by using configurable (tunable) components and adding a tuner circuit
[Figure: timeline from 1981 to 2002 in three-year steps. Leading edge chip in 1981: 10,000 transistors. Leading edge chip in 2002: 150,000,000 transistors. Moore’s Law: 2x / 18 months. Block diagram of Processor, Pmem., Dmem., and Periph. with an added Tuner.]
Non-obvious use of extra transistors
– Previously unheard of – silicon too scarce
– Becoming more common, e.g., self-test circuitry
– “Transistor budgets have gone ballistic” [Microprocessor Report, 1998]
– Analogous situation in software
    Yesterday, program memory extremely scarce
    Today, we find a flight simulator hidden in Excel’97
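The two transistor counts on the timeline can be checked against the quoted doubling rate. The sketch below is only a sanity check; the function name and its parameters are illustrative and do not come from the slides.

```python
def projected_transistors(start_count, start_year, end_year, months_per_doubling=18):
    """Project a transistor count forward, assuming one doubling every 18 months."""
    doublings = (end_year - start_year) * 12 / months_per_doubling
    return start_count * 2 ** doublings

# 1981 -> 2002 is 21 years, i.e. 14 doublings at 18 months each.
projected = projected_transistors(10_000, 1981, 2002)
print(f"{projected:,.0f}")  # prints 163,840,000
```

The projection lands close to the 150,000,000 figure quoted for 2002, so the two data points agree with the stated rate to within about 10%.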
4
Introduction
We introduce:
– A basic architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
    Involves self-profiling circuitry
    Uses designer-activated self-optimization mode
To illustrate, we also introduce:
– A tunable component: Loop Table
    Small memory to store frequent loops
    Similar to previous loop caches
    – Differs in how and when contents are updated
5
Problem Description and Related Work
Goal:
– Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
Constraints
1. Exact instruction set compatibility
2. Avoid changing tool chain
3. Preserve cycle-by-cycle behavior
– These constraints are more stringent than in most previous work
6
Problem Description and Related Work
Application-specific instruction-set processors
– Introduce new instructions for common actions
    Pre-fabrication: [Fischer99], [Tensilica00]
    Post-fabrication: [Kucukcakar99], for mass-produced IC’s
– Obviously modifies instruction set and tool chain
Dynamic binary translation and code morphing
– Transmeta’s Crusoe: profile executing code, cache translation results of frequently executed code
– Changes cycle-by-cycle behavior, and only helps if performing dynamic binary translation in the first place
Program compression
– Profile code, compress frequently executed code [Ishihara00]
– Modifies the tool chain
7
Problem Description and Related Work
Loop caches
– Cache frequently-executed small loops to reduce power for memory
– Filter cache [Kin97]
    Small, low-power L0 cache
    Causes extra cycles due to many misses
– Compiler-assisted loop cache [Bellas99]
    Use profiler/compiler to mark only frequent loops for placement in filter cache
    Modifies tool chain
– Transparent loop cache [Lee99]
    Fill loop cache only when detect a short backwards branch, indicating a small loop
    No tag comparisons – greater efficiency
    We extend to only consider frequent loops, reducing runtime overhead
PID controller example: most execution time spent in two small loops
[Figure: two configurations, Proc. fetching directly from Pmem, and Proc. fetching from Pmem through a Loop table.]
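The fill condition described for the transparent loop cache, and the restriction to frequent loops, can be sketched as follows. This is a minimal illustration rather than the circuit from [Lee99] or from this work: the function names, the word-addressed instruction memory, and the `frequent_loop_starts` parameter are all assumptions.

```python
def is_short_backward_branch(branch_addr, target_addr, loop_cache_size):
    """True if a taken branch jumps backwards and the loop body fits in the cache.

    Assumes word-addressed instructions, so the loop body spans
    target_addr..branch_addr inclusive (distance + 1 instructions).
    """
    distance = branch_addr - target_addr
    return 0 <= distance < loop_cache_size


def should_fill(branch_addr, target_addr, loop_cache_size, frequent_loop_starts=None):
    """Decide whether to start filling the loop cache on a taken branch.

    With frequent_loop_starts=None this is the transparent policy: any short
    backwards branch triggers a fill. Passing a set of loop start addresses
    models the extension of filling only for loops already known to be frequent.
    """
    if not is_short_backward_branch(branch_addr, target_addr, loop_cache_size):
        return False
    return frequent_loop_starts is None or target_addr in frequent_loop_starts
```

For example, `should_fill(0x13, 0x10, 16)` is true under the transparent policy, while `should_fill(0x13, 0x10, 16, {0x20})` is false because the loop starting at `0x10` is not in the frequent set.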
8
Architecture Overview
Started with standard microcontroller
– ROM access consumes much power
– Added Loop Table to store common loops
– Added Bypass Controller to switch to/from Loop Table
– Added Self-Profiling Controller and Loop Count Table to detect most frequent loops
[Figure: microprocessor block diagram. Program Memory (ROM, ~10,000’s of bytes), Configuration Memory (~10’s of bytes), Data Memory (RAM), Datapath, Controller, Self-Profiling Controller, Loop Count Table (~100’s of bytes), Bypass Controller, Loop Table (~100’s of bytes, holding instructions and jump bits), LAR’s, and muxes on the address and instruction paths.]
9
Methodology Overview
Self-optimizing microcontroller
– Post-fabrication (hence mass-produced)
– In-system
– Tuning under designer control
    Not by end user, hence stable and consistent end-use platform
[Figure: timeline with stages “(Designer: pre-fabrication)”, “Designer: post-fabrication”, and “User”, with self-optimization mode activation marked.]
10
Methodology Overview
1. Download application to microcontroller program memory
2. Activate self-optimizing mode, causing update of configuration memory
3. Reset microcontroller, causing (optimized) application execution in normal mode
4. Upload configuration memory for downloading to other microcontrollers
[Figure: microprocessor block diagram. Program Memory (ROM, ~10,000’s of bytes), Configuration Memory (~10’s of bytes), Data Memory (RAM), Datapath, Controller, Self-Profiling Controller, Loop Count Table (~100’s of bytes), Bypass Controller, Loop Table (~100’s of bytes, holding instructions and jump bits), LAR’s, and muxes on the address and instruction paths.]
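The designer-side flow on this slide (download, self-optimize, reset, upload, then reuse the configuration on other parts) might be modeled as below. The `Microcontroller` class, its method names, and the use of a plain list for configuration memory are hypothetical stand-ins; on the real part, `self_optimize` would be carried out by the on-chip profiling hardware rather than by passing in addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Microcontroller:
    program: bytes = b""
    config: list = field(default_factory=list)  # loop addresses in configuration memory
    optimized: bool = False                     # True after reset with a non-empty configuration

    def download_program(self, program):
        # A new program invalidates any earlier tuning.
        self.program, self.config, self.optimized = program, [], False

    def self_optimize(self, frequent_loops):
        # Stand-in for on-chip profiling: record the most frequent loop addresses.
        self.config = list(frequent_loops)

    def reset(self):
        # Normal mode uses the loop table only if a configuration is present.
        self.optimized = bool(self.config)

    def upload_config(self):
        return list(self.config)

    def download_config(self, config):
        self.config = list(config)


# Tune one part, then copy its configuration to another part running the same program.
app = b"\x00\x01\x02"
first = Microcontroller()
first.download_program(app)
first.self_optimize([0x20])
first.reset()

second = Microcontroller()
second.download_program(app)
second.download_config(first.upload_config())
second.reset()
```

In this model the second part never runs the profiling step, which mirrors step 4: the uploaded configuration is enough to put another microcontroller into its optimized normal mode.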
11
Self-optimizing mode
Initializing
– Activated by extra pin
– Traverse memory, detect loops, add addresses to loop count table
Profiling
– Execute, update loop counts
    Requires fast increments
    We use fully-assoc. mem
    Hardware hash table possible
Configuring
– Store most frequent loop addresses at bottom of program memory, set flag
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration; block diagram showing Program Memory (ROM, ~10,000’s of bytes), Configuration Memory (~10’s of bytes), Data Memory (RAM), Datapath, Controller, Self-Profiler, and Loop Count Table.]
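The three phases above can be sketched in software as follows. This is an assumption-laden model, not the hardware: the program is reduced to a hypothetical map from branch address to branch target, a loop is taken to be any backwards branch whose body fits a size limit, and a loop's count is incremented each time its start address is fetched. A `Counter` stands in for the fully associative loop count table.

```python
from collections import Counter


def find_loops(branches, max_loop_size):
    """Initializing: return {loop_start: loop_end} for small backwards branches."""
    loops = {}
    for branch_addr, target in branches.items():
        size = branch_addr - target + 1
        if target <= branch_addr and size <= max_loop_size:
            loops[target] = branch_addr
    return loops


def profile(loops, pc_trace):
    """Profiling: count fetches of each known loop start address."""
    counts = Counter({start: 0 for start in loops})
    for pc in pc_trace:
        if pc in counts:
            counts[pc] += 1
    return counts


def configure(counts, num_entries):
    """Configuring: pick the most frequent loop starts (ties broken by address)."""
    ranked = sorted(counts, key=lambda addr: (-counts[addr], addr))
    return ranked[:num_entries]


# Two backwards branches (loops) and one forward branch (ignored).
branches = {0x13: 0x10, 0x2F: 0x20, 0x40: 0x48}
loops = find_loops(branches, max_loop_size=32)
counts = profile(loops, [0x10] * 5 + [0x20] * 50 + [0x40])
selected = configure(counts, num_entries=1)  # [0x20]
```

With room for a single loop, the model selects the loop at `0x20`, which was entered ten times as often as the one at `0x10`.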
12
Normal mode
Reset
– If self-optimization flag set
    Read loop addresses into address registers (LAR’s)
    Set flag in bypass controller
If flag unset or no address match, fetch from ROM
If flag set and address match
– Begin fetching from loop table
Extra bits in loop table for fast determination if jump leaves table
– 00: instruction can’t exit loop
– 10: exits loop if jump not taken
– 01: exits loop if jump taken
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration; block diagram showing Program Memory (ROM), Configuration Memory, Data Memory (RAM), Datapath, Controller, Bypass controller, Loop Table (instructions and jump bits), and LAR’s.]
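The bypass decision and the jump-bit encoding listed above can be written out as two small predicates. The function names are illustrative, and treating the `11` encoding as an error is an assumption, since the slide defines only three of the four values.

```python
def starts_loop_table_fetch(flag_set, fetch_addr, lars):
    """True if the bypass controller should switch from ROM to the loop table."""
    return flag_set and fetch_addr in lars


def leaves_loop_table(jump_bits, jump_taken):
    """Decode the two extra bits stored with each loop table instruction."""
    if jump_bits == 0b00:   # instruction can't exit the loop
        return False
    if jump_bits == 0b10:   # exits the loop if the jump is not taken
        return not jump_taken
    if jump_bits == 0b01:   # exits the loop if the jump is taken
        return jump_taken
    raise ValueError("jump bits 11 are not defined on the slide")
```

For a loop-closing backwards branch, the `10` encoding applies: the loop continues while the jump is taken, and fetching returns to ROM once it falls through.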
13
Results -- power
Savings
– 34% total power savings after self-optimization
– Depends on technology
Power overhead
– Negligible when self-optimization idle
– Slight increase (5%) during self-optimization
Setup
– Synopsys synthesis, simulation, and power analysis
– 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
14
Results – size (in cells)
Significant increase, but:
– 8051 version was small
    Others bigger
    ROM (e.g., 2M), RAM, and other processors are even bigger
    Smaller percentage overhead
– Transistors becoming cheaper
– Can build product-oriented IC’s with only loop table and controller (no Self-Profiler or Loop Count Table)
– Upload new binaries from prototype-oriented part, download back to new product-oriented parts
– Supported by existing standard tools
– We are investigating ways to shrink the Loop Count Table
15
Conclusions
Mass-produced IC’s give big advantages
Abundance of transistors provides new opportunity for self-optimization by tuning
We introduced:
– A self-optimization methodology and architecture
– A loop table as a tunable component
These items yielded:
– Significant power savings by reducing ROM access
    34% total savings for our particular microcontroller and target technology
– No change in instruction set, tools, or performance
Future work includes:
– Reducing size overhead
– Investigating other tunable components (e.g., N-way cache)