A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power Frank Vahid* and Ann Gordon-Ross Dept. of Computer Science and Engineering University.

Presentation transcript:

A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC
International Symposium on Low Power Electronics and Design, 2001

Slide 2: Introduction
Mass-produced microprocessor ICs prevail in embedded systems
– Cheap
    - From amortization and high yields
– Small and low power
    - From optimization and use of new technologies
– Available immediately
Typically run one program forever
QUESTION:
– Can we "tune" a mass-produced microprocessor to its one program to reduce power?
Sample: annual production 10 million units, cost per unit $2
Figure: Processor, Pmem., Dmem., Periph.

Slide 3: Introduction
Use configurable (tunable) components and add a tuner circuit
Make use of abundant transistors
– Previously, silicon too scarce
– Today, "transistor budgets have gone ballistic" [Microprocessor Report, 1998]
– Software analogy
    - Previously, program memory was scarce
    - Today, we find a flight simulator hidden in Excel'97
Moore's Law: 2x / 18 months
Leading edge chip in ,000 transistors
Leading edge chip in ,000,000 transistors
Figure: Processor, Pmem., Dmem., Periph., Tuner

Slide 4: Introduction
We introduce:
– Architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
    - Uses self-profiling circuitry and designer-activated self-optimization mode
To illustrate, we introduce:
– A tunable component: Loop Table
    - Similar to loop caches, differs in how and when contents are updated
– Other tunable components are possible

Slide 5: Problem Description
Goal:
– Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
Constraints
1. Exact instruction set compatibility
2. Avoid changing tool chain
3. Preserve cycle-by-cycle behavior
– These constraints are more stringent than in most previous work

Slide 6: Related Work
Application-specific instruction-set processors
– Introduce new instructions for frequent code
    - Pre-fabrication: [Fischer99], [Tensillica00]
    - Post-fab: [Kucukcakar99], for mass-produced ICs
    - Modifies instruction set and tool chain
Code morphing
– Crusoe: cache frequent code's translation
    - Helps only if performing dynamic binary translation
    - Changes cycle-by-cycle behavior
Code compression
– Compress frequent code [Ishihara00]
    - Modifies tool chain

Slide 7: Related Work
Cache frequent small loops
– Reduces memory/bus power
– Filter cache [Kin97]
    - Small L0 cache
    - Many misses (extra cycles)
– Compiler-assisted loop cache [Bellas99]
    - Profiler/compiler marks frequent loops for filter cache placement
    - Modifies tool chain
– Transparent loop cache [Lee99]
    - Fills loop cache only when a short backwards branch is detected
    - No tag comparisons, hence greater efficiency
– Our approach
    - Moves profiler to chip, and can be more selective in filling loop cache
PID controller example: most execution time spent in two small loops
Figure: Pmem, Proc., Loop table
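
The short-backwards-branch trigger mentioned for the transparent loop cache can be illustrated with a minimal sketch. The function names, the (branch address, target address) trace format, and the `max_distance` threshold are assumptions made for illustration; they are not taken from [Lee99] or from this talk.

```python
def is_short_backward_branch(branch_addr, target_addr, max_distance=32):
    """Return True if a taken branch jumps backwards by at most max_distance.

    max_distance is a hypothetical loop-cache capacity in instruction words.
    """
    distance = branch_addr - target_addr
    return 0 < distance <= max_distance


def find_small_loops(taken_branches, max_distance=32):
    """Collect (loop_start, loop_end) address pairs of candidate small loops.

    taken_branches is an iterable of (branch_addr, target_addr) pairs,
    an assumed trace format for this sketch.
    """
    loops = []
    for branch_addr, target_addr in taken_branches:
        if is_short_backward_branch(branch_addr, target_addr, max_distance):
            loop = (target_addr, branch_addr)
            if loop not in loops:  # report each loop once
                loops.append(loop)
    return loops
```

A forward branch, or a backward branch that spans more than `max_distance` words, is ignored, so only loops small enough to fit the assumed table are reported.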

Slide 8: Architecture Overview
Standard microcontroller
– ROM access consumes much power
– Added
    - Self-Profiling Controller and Loop Count Table for profiling
    - Loop Table to store common loops
    - Bypass Controller to switch to Loop Table
Figure: Microprocessor containing ROM, Configuration Memory (~10's of bytes), Datapath, RAM, Controller, Self-Profiling Controller, Bypass Controller, Loop Count Table, Loop Table

Slide 9: Methodology Overview
Self-optimizing microcontroller
– Post-fabrication (hence mass-produced)
– In-system
– Tuning under designer control
    - Not by end user, hence stable and consistent end-use platform
Figure: timeline with (Designer: pre-fabrication), Designer: post-fabrication, User; self-optimization mode activation

Slide 10: Methodology Overview
1. Download application to microcontroller program memory
2. Activate self-optimizing mode, causing update of configuration memory
3. Reset microcontroller, causing (optimized) application execution in normal mode
4. Upload configuration memory for downloading to other microcontrollers
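
The designer flow on this slide (download the program, self-optimize, reset into normal mode, upload the configuration for other parts) can be sketched as a small model. The `Microcontroller` class, its method names, and the `profile` callable are all hypothetical stand-ins for the on-chip hardware and the designer's tools; the slides do not specify any software interface.

```python
class Microcontroller:
    """Toy model of the designer-visible flow; not the actual hardware."""

    def __init__(self):
        self.program = None
        self.configuration = None  # models the small configuration memory
        self.mode = "unprogrammed"

    def download_program(self, program):
        self.program = list(program)
        self.configuration = None
        self.mode = "programmed"

    def self_optimize(self, profile):
        # profile is a hypothetical callable standing in for self-profiling;
        # it maps the program to a configuration (e.g. loop addresses).
        if self.program is None:
            raise RuntimeError("download a program first")
        self.mode = "self-optimizing"
        self.configuration = profile(self.program)

    def reset(self):
        # After reset the part runs the application in normal mode.
        self.mode = "normal"

    def upload_configuration(self):
        return self.configuration

    def download_configuration(self, configuration):
        # Lets a second part reuse a configuration without profiling.
        self.configuration = configuration


# Usage: tune one part, then copy its configuration to another part.
prototype = Microcontroller()
prototype.download_program(["i0", "i1", "i2"])
prototype.self_optimize(lambda program: {"loop_addrs": [1]})
prototype.reset()

product = Microcontroller()
product.download_program(["i0", "i1", "i2"])
product.download_configuration(prototype.upload_configuration())
product.reset()
```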

Slide 11: Self-optimizing mode
Initializing
– Activated by extra pin or existing pin combo
– Traverse memory, detect loops, add addresses to loop count table
Profiling
– Execute, update loop counts
    - Requires fast increments
    - We use fully-assoc. mem
    - Hardware hash table possible
Configuring
– Store most frequent loop addresses at bottom of program memory, set flag
Figure: ROM, Self-Profiling Controller, Loop Count Table (columns: Loop addr., Count; example value 200); flow: Download program, Self-optimizing mode, Normal mode, Upload configuration
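
The three phases above (initializing, profiling, configuring) can be sketched in software. This is a minimal model under stated assumptions: the `LoopCountTable` class and its method names are hypothetical, a Python dict stands in for the fully-associative memory, and a count is incremented whenever a fetch address equals a recorded loop address, which is one plausible reading of "update loop counts" rather than the confirmed hardware behavior.

```python
class LoopCountTable:
    """Software stand-in for the Loop Count Table (hypothetical interface)."""

    def __init__(self, loop_addrs):
        # Initializing: one entry per detected loop address, count starts at 0.
        self.counts = {addr: 0 for addr in loop_addrs}

    def observe(self, fetch_addr):
        # Profiling: a dict lookup models the associative match and increment.
        if fetch_addr in self.counts:
            self.counts[fetch_addr] += 1

    def most_frequent(self, n=1):
        # Configuring: pick the n most frequently executed loop addresses.
        # Ties are broken by lower address; never-executed loops are skipped.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [addr for addr, count in ranked[:n] if count > 0]


# Usage with an assumed fetch-address trace.
table = LoopCountTable([0x200, 0x340])
for addr in [0x200, 0x201, 0x200, 0x340, 0x200]:
    table.observe(addr)
selected = table.most_frequent(1)
```

In this model the selected addresses are what would be stored in the configuration memory for use in normal mode.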

Slide 12: Normal mode
Reset
– Read loop addresses (if any) into registers (LARs)
– Read corresponding loops into loop table
– Set flag in bypass controller
Execute: check if flag set and address match
– No: fetch from ROM
– Yes: begin fetching from loop table
– No tag comparisons, no misses
– Pre-computed extra bits quickly detect table exit
Figure: ROM (example value 200), Bypass Controller, Loop Table, Datapath, RAM, Controller, LAR; flow: Download program, Self-optimizing mode, Normal mode, Upload configuration
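
The reset and execute behavior above can be sketched as a small simulation. This is an illustrative model only: the `BypassController` class, a single LAR, and a list-based ROM are assumptions, and table exit is detected here by a simple address-range check rather than the pre-computed extra bits the slide mentions.

```python
class BypassController:
    """Toy model of the fetch path in normal mode (hypothetical interface)."""

    def __init__(self, rom):
        self.rom = rom          # list of instructions indexed by address
        self.flag = False       # set when a loop has been configured
        self.lar = None         # loop address register (loop start address)
        self.loop_table = []    # copy of the loop body
        self.in_table = False   # True while fetching from the loop table

    def reset(self, loop_start=None, loop_length=0):
        # Reset: load the LAR and copy the loop into the loop table, if any.
        self.flag = loop_start is not None and loop_length > 0
        self.lar = loop_start if self.flag else None
        self.loop_table = (
            self.rom[loop_start:loop_start + loop_length] if self.flag else []
        )
        self.in_table = False

    def fetch(self, addr):
        """Return (instruction, source), source being 'ROM' or 'LOOP_TABLE'."""
        if self.flag:
            if addr == self.lar:
                self.in_table = True   # address match: switch to loop table
            elif self.in_table and not (
                self.lar <= addr < self.lar + len(self.loop_table)
            ):
                self.in_table = False  # left the loop: back to ROM
        if self.flag and self.in_table:
            return self.loop_table[addr - self.lar], "LOOP_TABLE"
        return self.rom[addr], "ROM"


# Usage: an 8-instruction ROM with a 3-instruction loop at addresses 2..4.
rom = [f"i{n}" for n in range(8)]
controller = BypassController(rom)
controller.reset(loop_start=2, loop_length=3)
sources = [controller.fetch(a)[1] for a in [1, 2, 3, 4, 2, 3, 4, 5]]
```

Because the loop table holds an exact copy of the loop body, every fetch returns the same instruction it would have returned from ROM; only the source of the fetch changes, which matches the constraint of unchanged instruction-set behavior.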

Slide 13: Results -- power
Savings
– 34% total power savings after self-optimization
– Dependent on technology
Power overhead
– Negligible when self-optimization idle
– Slight increase (5%) during self-optimization
Setup
– Synopsys synthesis, simulation, and power analysis
– 8051 synthesizable VHDL model at UCR
Figure: results for Ex1: checksum, Ex2: gcd, Ex3: matrix multiply

Slide 14: Results – size (in cells)
Big increase, but:
– 8051 version was small
    - Others much bigger
    - Smaller % overhead
– Transistors becoming cheaper
– Product-oriented ICs: loop table and controller, no Self-Profiler or Loop Count Table
– Transfer configuration from prototype-oriented part to new product-oriented parts
– Supported by existing upload/download tools
– We are working on shrinking the Loop Count Table logic

Slide 15: Conclusions
Mass-produced ICs give big advantages
Transistor abundance provides new opportunities
We introduced:
– A self-optimization methodology and architecture
– A loop table as an example tunable component
These items yielded:
– Power savings by reducing ROM access
    - 34% savings for 8051 microcontroller for target technology
– No change in instruction set, tools, or performance
Future work includes:
– Reducing size overhead while maintaining accuracy
– Trading off size with accuracy
– Extending loop table for multiple loops, subroutines, etc.
– Incorporating into 32-bit processor environment (LEON Sparc)
– Investigating other tunable components
    - On-chip FPGA, configurable cache, etc.