A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power Frank Vahid* and Ann Gordon-Ross Dept. of Computer Science and Engineering University.

Presentation transcript:

A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC
International Symposium on Low Power Electronics and Design, 2001

Introduction
Mass-produced microprocessor IC’s prevail in embedded systems
  – Cheap
      From amortization and high yields
  – Small and low power
      From optimization and use of new technologies
  – Available immediately
Typically run one program forever
QUESTION:
  – Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: mass-produced microprocessor IC’s, each containing Processor, Pmem., Dmem., and Periph. Annual production: 10 million units. Cost per unit: $2.]

Introduction
Answer:
  – Yes, by using configurable (tunable) components and adding a tuner circuit
Non-obvious use of extra transistors
  – Previously unheard of – silicon too scarce
  – Becoming more common, e.g., self-test circuitry
  – “Transistor budgets have gone ballistic” [Microprocessor Report, 1998]
  – Analogous situation in software
      Yesterday, program memory extremely scarce
      Today, we find a flight simulator hidden in Excel’97
[Figure: IC containing Processor, Pmem., Dmem., Periph., and an added Tuner. Moore’s Law: 2x / 18 months. Leading edge chip in ,000 transistors. Leading edge chip in ,000,000 transistors.]

Introduction
We introduce:
  – A basic architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
      Involves self-profiling circuitry
      Uses designer-activated self-optimization mode
To illustrate, we also introduce:
  – A tunable component: Loop Table
      Small memory to store frequent loops
      Similar to previous loop caches
        – Differs in how and when contents are updated

Problem Description and Related Work
Goal:
  – Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
Constraints
  1. Exact instruction set compatibility
  2. Avoid changing tool chain
  3. Preserve cycle-by-cycle behavior
  – These constraints are more stringent than in most previous work

Problem Description and Related Work
Application-specific instruction-set processors
  – Introduce new instructions for common actions
      Pre-fabrication: [Fischer99], [Tensillica00]
      Post-fabrication: [Kucukcakar99] – for mass-produced IC’s
  – Obviously modifies instruction-set and tool chain
Dynamic binary translation and code morphing
  – Transmeta’s Crusoe: Profile executing code, cache translation results of frequently executed code
  – Changes cycle-by-cycle behavior, and only helps if performing dynamic binary translation in the first place
Program compression
  – Profile code, compress frequently-executed code [Ishihara00]
  – Modifies the tool chain

Problem Description and Related Work
Loop caches
  – Cache frequently-executed small loops to reduce power for memory
  – Filter cache [Kin97]
      Small, low-power L0 cache
      Causes extra cycles due to many misses
  – Compiler-assisted loop cache [Bellas99]
      Use profiler/compiler to mark only frequent loops for placement in filter cache
      Modifies tool chain
  – Transparent loop cache [Lee99]
      Fill loop cache only on detecting a short backwards branch, indicating a small loop
      No tag comparisons – greater efficiency
We extend this to consider only frequent loops, reducing runtime overhead
PID controller example: most execution time spent in two small loops
[Figure: Proc. fetching from Pmem, shown without and with a Loop table.]
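
The short-backwards-branch test mentioned for the transparent loop cache can be sketched in a few lines of Python. This is only an illustration: the function name, the address-distance rule, and the `max_loop_size` parameter are assumptions made here, not details given in the slides.

```python
def is_short_backward_branch(branch_addr, target_addr, max_loop_size):
    """Return True if a branch jumps backwards by at most max_loop_size.

    Hypothetical helper: the name and the distance rule are assumptions.
    """
    distance = branch_addr - target_addr
    return 0 < distance <= max_loop_size


# Hypothetical addresses, assuming a loop store with room for 32 addresses.
print(is_short_backward_branch(0x120, 0x110, 32))  # True: jumps back 16
print(is_short_backward_branch(0x120, 0x080, 32))  # False: loop too large
print(is_short_backward_branch(0x120, 0x130, 32))  # False: forward branch
```

A check like this needs no tag comparison, which fits the efficiency argument made for the transparent loop cache.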

Architecture Overview
Started with standard microcontroller
  – ROM access consumes much power
  – Added Loop Table to store common loops
  – Added Bypass Controller to switch to/from Loop Table
  – Added Self-Profiling Controller and Loop Count Table to detect most frequent loops
[Figure: microprocessor block diagram. Program Memory (ROM) (~10,000’s of bytes); Configuration Memory (~10’s of bytes); Data Memory (RAM); Datapath; Controller; Self-Profiling Controller; Loop Count Table (~100’s of bytes); Bypass Controller; Loop Table (~100’s of bytes) holding Instructions and Jump bits; LAR’s; Muxes on the Address and Instruction paths.]

Methodology Overview
Self-optimizing microcontroller
  – Post-fabrication (hence mass-produced)
  – In-system
  – Tuning under designer control
      Not by end user, hence stable and consistent end-use platform
[Figure: timeline with phases (Designer: pre-fabrication), Designer: post-fabrication, and User; self-optimization mode activation is marked.]

Methodology Overview
Download application to microcontroller program memory
Activate self-optimizing mode, causing update of configuration memory
Reset microcontroller, causing (optimized) application execution in normal mode
Upload configuration memory for downloading to other microcontrollers
[Figure: microprocessor block diagram. Program Memory (ROM) (~10,000’s of bytes); Configuration Memory (~10’s of bytes); Data Memory (RAM); Datapath; Controller; Self-Profiling Controller; Loop Count Table (~100’s of bytes); Bypass Controller; Loop Table (~100’s of bytes) holding Instructions and Jump bits; LAR’s; Muxes on the Address and Instruction paths.]

Self-optimizing mode
Initializing
  – Activated by extra pin
  – Traverse memory, detect loops, add addresses to loop count table
Profiling
  – Execute, update loop counts
      Requires fast increments
      We use fully-associative memory
      Hardware hash table possible
Configuring
  – Store most frequent loop addresses at bottom of program memory, set flag
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration. Blocks shown: Program Memory (ROM) (~10,000’s of bytes); Configuration Memory (~10’s of bytes); Datapath; Data Memory (RAM); Controller; Self-Profiler; Loop Count Table.]
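
The three phases above (initializing, profiling, configuring) can be modelled in software as a rough sketch. Everything here is an assumption for illustration: the function names, the dictionary-based program encoding, the use of a short backwards branch to detect a loop, and the choice to count fetches that fall inside a loop. The slides describe hardware (a fully-associative loop count table), not this code.

```python
from collections import Counter


def find_loops(program, max_loop_size):
    """Initializing: scan program memory for short backward branches.

    `program` maps an instruction address to its branch target, or None
    for non-branch instructions (hypothetical encoding).
    Returns a dict of loop start address -> loop end (branch) address.
    """
    loops = {}
    for addr, target in program.items():
        if target is not None and 0 < addr - target <= max_loop_size:
            loops[target] = addr
    return loops


def profile(trace, loops):
    """Profiling: count fetched addresses that fall inside each loop."""
    counts = Counter({start: 0 for start in loops})
    for pc in trace:
        for start, end in loops.items():
            if start <= pc <= end:
                counts[start] += 1
    return counts


def configure(counts, num_lars):
    """Configuring: keep the start addresses of the most frequent loops."""
    return [start for start, _ in counts.most_common(num_lars)]


# Hypothetical program with two small loops and a made-up execution trace.
program = {0x10: None, 0x11: None, 0x12: 0x10, 0x13: None,
           0x20: None, 0x21: 0x20, 0x22: None}
trace = [0x10, 0x11, 0x12] * 5 + [0x13] + [0x20, 0x21] * 2 + [0x22]

loops = find_loops(program, max_loop_size=32)
counts = profile(trace, loops)
print([hex(a) for a in configure(counts, num_lars=1)])  # ['0x10']
```

The `Counter` stands in for the loop count table. The hardware version needs fast increments on every update, which is why the slides mention a fully-associative memory or a hardware hash table.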

Normal mode
Reset
  – If self-optimization flag set
      Read loop addresses into address registers (LAR’s)
      Set flag in bypass controller
If flag unset or no address match, fetch from ROM
If flag set and address match
  – Begin fetching from loop table
Extra bits in loop table for fast determination if jump leaves table
  – 00: instruction can’t exit loop
  – 10: exits loop if jump not taken
  – 01: exits loop if jump taken
[Figure: flow of Download program, Self-optimizing mode, Normal mode, Upload configuration. Blocks shown: Program Memory (ROM); Configuration Memory; Datapath; Data Memory (RAM); Controller; Bypass; Loop Table holding Instructions and Jump bits; LAR’s.]
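
The fetch decision and the jump-bit encoding listed above can be written out as a small Python sketch. The function names, the string encoding of the two bits, and the `in_loop_table` state variable are assumptions for illustration. The slides do not describe the `11` encoding, so the sketch rejects it rather than guessing.

```python
def exits_loop(jump_bits, jump_taken):
    """Decide from the two extra bits whether execution leaves the loop table."""
    if jump_bits == "00":   # instruction can't exit the loop
        return False
    if jump_bits == "10":   # exits the loop if the jump is not taken
        return not jump_taken
    if jump_bits == "01":   # exits the loop if the jump is taken
        return jump_taken
    raise ValueError(f"encoding {jump_bits!r} is not described in the slides")


def fetch_source(pc, flag_set, lars, in_loop_table):
    """Pick where the next instruction comes from (hypothetical model)."""
    if not flag_set:
        return "rom"                # flag unset: always fetch from ROM
    if in_loop_table or pc in lars:
        return "loop_table"         # address match, or already inside a loop
    return "rom"                    # no address match


# Hypothetical values: one loop address register holding 0x10.
print(fetch_source(0x10, flag_set=True, lars=[0x10], in_loop_table=False))   # loop_table
print(fetch_source(0x30, flag_set=True, lars=[0x10], in_loop_table=False))   # rom
print(fetch_source(0x10, flag_set=False, lars=[0x10], in_loop_table=False))  # rom
print(exits_loop("10", jump_taken=False))  # True
```

Precomputing the two bits per instruction means the bypass controller never has to compare a jump target against the loop bounds at run time.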

Results – power
Savings
  – 34% total power savings after self-optimization
  – Depends on technology
Power overhead
  – Negligible when self-optimization idle
  – Slight increase (5%) during self-optimization
Setup
  – Synopsys synthesis, simulation, and power analysis
  – 8051 synthesizable VHDL model at UCR (

Results – size (in cells)
Significant increase, but:
  – 8051 version was small
      Others bigger
      ROM (e.g., 2M), RAM, and other processors are even bigger
      Smaller percentage overhead
  – Transistors becoming cheaper
  – Can build product-oriented IC’s with only loop table and controller (no Self-Profiler or Loop Count Table)
  – Upload new binaries from prototype-oriented part, download back to new product-oriented parts
  – Supported by existing standard tools
  – We are investigating ways to shrink the Loop Count Table

Conclusions
Mass-produced IC’s give big advantages
Abundance of transistors provides new opportunity for self-optimization by tuning
We introduced:
  – A self-optimization methodology and architecture
  – A loop table as a tunable component
These items yielded:
  – Significant power savings by reducing ROM access
      34% total savings for our particular microcontroller and target technology
  – No change in instruction set, tools, or performance
Future work includes:
  – Reducing size overhead
  – Investigating other tunable components (e.g., N-way cache)