A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC
International Symposium on Low Power Electronics and Design, 2001

Introduction
Mass-produced microprocessor ICs prevail in embedded systems
–Cheap
  From amortization and high yields
–Small and low power
  From optimization and use of new technologies
–Available immediately
Typically run one program forever
QUESTION:
–Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: sample mass-produced IC with Processor, Pmem., Dmem., and Periph. blocks. Annual production: 10 million units. Cost per unit: $2]

Frank Vahid, 3 Dmem. Pmem. Periph. Introduction Use configurable (tunable) components and add a tuner circuit Leading edge chip in ,000 transistors Leading edge chip in ,000,000 transistors Moore’s Law: 2x / 18 months Tuner. Make use of abundant transistors –Previously, silicon too scarce –Today, “transistor budgets have gone ballistic” [Microprocessor Report, 1998] –Software analogy Previously, program memory was scarce Today, we find a flight simulator hidden in Excel’97 Processor
Introduction
We introduce:
–Architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
  Uses self-profiling circuitry and designer-activated self-optimization mode
To illustrate, we introduce:
–A tunable component: Loop Table
  Similar to loop caches, differs in how and when contents are updated
–Other tunable components are possible

Problem Description
Goal:
–Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
Constraints
1. Exact instruction set compatibility
2. Avoid changing tool chain
3. Preserve cycle-by-cycle behavior
–These constraints are more stringent than in most previous work

Related Work
Application-specific instruction-set processors
–Introduce new instructions for frequent code
  Pre-fabrication: [Fischer99], [Tensillica00]
  Post-fab: [Kucukcakar99] – for mass-produced ICs
  Modifies instruction-set and tool chain
Code morphing
–Crusoe: Cache frequent code’s translation
  Helps only if performing dynamic binary translation
  Changes cycle-by-cycle behavior
Code compression
–Compress frequent code [Ishihara00]
  Modifies tool chain

Related Work
Cache frequent small loops
–Reduces memory/bus power
–Filter cache [Kin97]
  Small L0 cache
  Many misses (extra cycles)
–Compiler-assisted loop cache [Bellas99]
  Profiler/compiler marks frequent loops for filter cache placement
  Modifies tool chain
–Transparent loop cache [Lee99]
  Fills loop cache only when a short backward branch is detected
  No tag comparisons – greater efficiency
–Our approach
  Moves profiler to chip, and can be more selective in filling loop cache
PID controller example: most execution time spent in two small loops
[Figure: Proc. and Pmem, shown without and with a Loop table]

Architecture Overview
Standard microcontroller
–ROM access consumes much power
–Added
  Self-Profiling Controller and Loop Count Table for profiling
  Loop Table to store common loops
  Bypass Controller to switch to Loop Table
[Figure: Microprocessor block diagram with ROM, Configuration Memory (~10’s of bytes), Datapath, RAM, Controller, Self-Profiling Controller, Bypass Controller, Loop Count Table, and Loop Table]

Methodology Overview
Self-optimizing microcontroller
–Post-fabrication (hence mass-produced)
–In-system
–Tuning under designer control
  Not by end user, hence stable and consistent end-use platform
[Figure: timeline of (Designer: pre-fabrication), Designer: post-fabrication, and User, marking self-optimization mode activation]

Methodology Overview
1. Download application to microcontroller program memory
2. Activate self-optimizing mode, causing update of configuration memory
3. Reset microcontroller, causing (optimized) application execution in normal mode
4. Upload configuration memory for downloading to other microcontrollers

Self-optimizing mode
Initializing
–Activated by extra pin or existing pin combo
–Traverse memory, detect loops, add addresses to loop count table
Profiling
–Execute, update loop counts
  Requires fast increments
  We use fully-assoc. mem
  Hardware hash table possible
Configuring
–Store most frequent loop addresses at bottom of program memory, set flag
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration; ROM (example value 200), Self-Profiling Controller, and Loop Count Table with columns Loop addr. and Count]

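The three phases of self-optimizing mode can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the talk: the instruction encoding (`("op",)` and `("branch", target)`), the `MAX_LOOP_SIZE` limit, the `(pc, next_pc)` trace format, and all function names are hypothetical. A Python dictionary stands in for the fully associative loop count table.

```python
from collections import Counter

MAX_LOOP_SIZE = 4  # hypothetical loop table capacity, in instructions


def find_loops(program):
    """Initializing: traverse program memory, return {branch_addr: loop_start}.

    Assumption: a loop is a backward branch whose body fits in the loop table.
    """
    loops = {}
    for addr, instr in enumerate(program):
        if instr[0] == "branch":
            target = instr[1]
            if target <= addr and addr - target + 1 <= MAX_LOOP_SIZE:
                loops[addr] = target
    return loops


def profile(loops, trace):
    """Profiling: add one to a loop's count each time its branch is taken."""
    counts = Counter({start: 0 for start in loops.values()})
    for pc, next_pc in trace:
        if pc in loops and next_pc == loops[pc]:
            counts[loops[pc]] += 1
    return counts


def configure(counts, num_entries=1):
    """Configuring: select the most frequent loop start addresses."""
    return [addr for addr, _ in counts.most_common(num_entries)]


# Hypothetical program: two small loops nested in one large outer loop.
program = [("op",), ("op",), ("branch", 1),
           ("op",), ("op",), ("branch", 3), ("branch", 0)]
loops = find_loops(program)            # the outer loop is too large to keep
trace = [(2, 1)] * 5 + [(2, 3)] + [(5, 3)] * 2 + [(5, 6), (6, 0)]
print(configure(profile(loops, trace)))  # prints [1]
```

In this model the selected addresses are what would be written to configuration memory, so the profiling structures are needed only while self-optimizing mode is active.
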
Normal mode
Reset
–Read loop addresses (if any) into registers (LAR’s)
–Read corresponding loops into loop table
–Set flag in bypass controller
Execute: Check if flag set and address match
–No: Fetch from ROM
–Yes: Begin fetching from loop table
–No tag comparisons, no misses
–Pre-computed extra bits quickly detect table exit
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration; ROM (example value 200), Bypass Controller with LAR, Loop Table, Datapath, RAM, and Controller]

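The normal-mode fetch path can be modeled in the same spirit. Again this is a hedged sketch: the `BypassController` class, its single LAR, and the read counters are hypothetical, and table exit is detected here with a simple address range check rather than the pre-computed extra bits mentioned on the slide.

```python
class BypassController:
    """Hypothetical software model of the normal-mode fetch path."""

    def __init__(self, rom, loop_start=None, loop_len=0):
        self.rom = rom
        # Reset: load the loop address register (LAR), copy the loop body
        # into the loop table, and set the flag if a loop was configured.
        self.lar = loop_start
        self.flag = loop_start is not None
        self.loop_table = rom[loop_start:loop_start + loop_len] if self.flag else []
        self.in_loop = False
        self.rom_reads = 0    # ROM fetches after reset
        self.table_reads = 0  # loop table fetches

    def fetch(self, pc):
        if self.flag and pc == self.lar:
            self.in_loop = True       # address match: switch to loop table
        elif self.in_loop and not (
            self.lar <= pc < self.lar + len(self.loop_table)
        ):
            self.in_loop = False      # table exit: switch back to ROM
        if self.in_loop:
            self.table_reads += 1
            return self.loop_table[pc - self.lar]
        self.rom_reads += 1
        return self.rom[pc]


rom = ["i0", "i1", "i2", "i3", "i4"]
pcs = [0, 1, 2, 1, 2, 1, 2, 3, 4]     # loop over addresses 1..2, three times
bc = BypassController(rom, loop_start=1, loop_len=2)
fetched = [bc.fetch(pc) for pc in pcs]
print(bc.rom_reads, bc.table_reads)   # prints 3 6
```

The fetched instruction stream is identical with or without the loop table; only the source of each fetch changes, which is consistent with the constraint of preserving instruction-set and cycle-by-cycle behavior.
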
Results -- power
Savings
–34% total power savings after self-optimization
–Dependent on technology
Power overhead
–Negligible when self-optimization idle
–Slight increase (5%) during self-optimization
Setup
–Synopsys synthesis, simulation, and power analysis
–8051 synthesizable VHDL model at UCR (
Examples
–Ex1: checksum
–Ex2: gcd
–Ex3: matrix multiply

Results – size (in cells)
Big increase, but:
–8051 version was small
  Others much bigger
  Smaller % overhead
–Transistors becoming cheaper
–Product-oriented ICs: loop table and controller, no Self-Profiler or Loop Count Table
–Transfer configuration from prototype-oriented part to new product-oriented parts
–Supported by existing upload/download tools
–We are working on shrinking the Loop Count Table logic

Conclusions
Mass-produced ICs give big advantages
Transistor abundance provides new opportunities
We introduced:
–A self-optimization methodology and architecture
–A loop table as an example tunable component
These items yielded:
–Power savings by reducing ROM access
  34% savings for 8051 microcontroller for target technology
–No change in instruction set, tools, or performance
Future work includes:
–Reducing size overhead while maintaining accuracy
–Trading off size with accuracy
–Extending loop table for multiple loops, subroutines, etc.
–Incorporating into 32-bit processor environment (LEON Sparc)
–Investigating other tunable components
  On-chip FPGA, configurable cache, etc.