A First-step Towards an Architecture Tuning Methodology for Low Power
Greg Stitt, Frank Vahid*, Tony Givargis
Dept. of Computer Science & Engineering, University of California, Riverside

Presentation transcript:

A First-step Towards an Architecture Tuning Methodology for Low Power
Greg Stitt, Frank Vahid*, Tony Givargis
Dept. of Computer Science & Engineering, University of California, Riverside
*also with the Center for Embedded Computer Systems, UC Irvine
Roman Lysecky
Department of IP Management, Conexant, Newport Beach
This work was supported by the National Science Foundation under grants CCR and CCR , and by a Design Automation Conference graduate scholarship.

Introduction: advent of cores
In the past, board-level embedded systems were built using discrete ICs.
Today, single-IC systems ("system-on-a-chip", SOC) are increasingly being built using IP (intellectual property) blocks, a.k.a. "cores".
  Hard core: layout
  Firm core: structure (HDL)
  Soft core: synthesizable behavior (HDL)
[Figure: a board with discrete processor, memory, and peripheral ICs, contrasted with a single IC built from IP cores drawn from a core library (PeripheralA, PeripheralB, ProcessorX).]

Introduction: embedded systems
An SOC implementing an embedded system has a unique feature: it implements one particular application.
  Thus, the processor may execute a single fixed program that never changes.
  This is unlike desktop systems, which execute a variety of programs.
  Examples: digital camera, automobile cruise controller.
We can exploit this fixed-program feature.
  For example, by using mask-programmed ROM.
  But much more can be done.

Introduction: architecture tuning
Architecture tuning is a way to exploit the fixed-program feature of embedded systems.
  First, do architecture design for the particular application.
  Then, "tune" the core-based system architecture to the particular application program, before IC fabrication.
  Goals: better performance, power, size.
[Figure: flow from the core library (PeripheralA, PeripheralB, ProcessorX) through architecture design (HDL for processor, program, and peripheral, plus the fixed program) and architecture tuning (tuned cores, HDL) to fabrication (IC).]

Introduction: architecture tuning
Examples of tuning optimizations:
  Memory hierarchy: no cache, L1 cache, L1+L2 cache
  Cache organization: size, associativity, line size
  Bus structure, data/address encoding
  Microprocessor optimizations:
    Internal small-loop table
    Controller partitioning
    Datapath shortcuts
    Register file copies

Introduction: tuning is a special case of Y-chart iteration
The Philips/TriMedia approach develops an architecture and its applications simultaneously.
[Figure: Y-chart with the elements Architecture, Applications, Mapping, Analysis, and Numbers; one part is labeled "Our focus".]

Problem description
Focus of this work: tuning a microcontroller to its program.
  Goal is reduced power without performance loss.
Restrict tuning to maintain exact instruction-set compatibility.
  No instructions may be added or deleted.
  Thus, no modification to the software development environment.
  Also, no problems with porting software to/from other versions of the microcontroller.
  Instruction-set incompatibility can be a show stopper.

Previous work
Application-specific instruction-set processors [Fisher99]
  Customize a microprocessor to its application(s), e.g., Tensilica.
  Customized instruction set, requiring customized tools.
Tuning compiler to architecture [Tiwari et al 94]
  Architectural description languages to inform the compiler of architecture features [Halambi et al 99].
Tuning cache and cache/bus organization to application [Givargis et al 99]

Tuning environment
Currently for the 8051 microcontroller.
  Starts from a synthesizable VHDL model of the 8051 (soft core).
  Uses Synopsys synthesis, simulation and power analysis.
  Uses an 8051 instruction-set simulator.
  Uses numerous scripts.
Goal of the environment:
  Understand how power is being consumed for a particular application, so that modifications to the architecture (or application) can be made to minimize that power.
Three main tools:
  Architectural view
  Instruction-set view
  Program/data memory view

Tuning environment: architectural view tool
[Figure: tool flow. The microprocessor soft core passes through the RT-synthesizer to give the microprocessor structure, and the program binary passes through the ROM generator to give a ROM entity. Both feed the simulator and power analyzer, which produces "flat" power data. The structural hierarchical power data translator and xdu display turn this into per-module power.]
Module     Power (mW)
ROM        1.04
ALU        1.62
RAM        1.42
CTRL       2.69
DECODER    0.07
Total      7.66
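
The translator step on this slide can be pictured as a roll-up of flat per-cell power figures into per-module totals. The Python sketch below is only an illustration under stated assumptions: the flat report is assumed to be a list of (hierarchical cell path, power in mW) pairs, and the path format, cell names, values, and the function name roll_up_power are all hypothetical, not the environment's actual file format or code.

```python
from collections import defaultdict

def roll_up_power(flat_records, depth=1):
    """Sum flat per-cell power (mW) into totals per module.

    flat_records: iterable of (path, power_mw) pairs, where path is a
    '/'-separated hierarchical cell name such as 'ALU/adder/u1'
    (a hypothetical format chosen for this sketch).
    depth: number of leading path components that identify a module.
    """
    totals = defaultdict(float)
    for path, power_mw in flat_records:
        module = "/".join(path.split("/")[:depth])
        totals[module] += power_mw
    return dict(totals)

# Hypothetical flat data; the values are illustrative, not measurements.
flat = [
    ("ALU/adder/u1", 0.50),
    ("ALU/logic/u2", 0.25),
    ("CTRL/fsm/u3", 1.00),
    ("RAM/cell/u4", 0.75),
]

by_module = roll_up_power(flat)
# Print modules from most to least power-hungry.
for module, mw in sorted(by_module.items(), key=lambda kv: -kv[1]):
    print(f"{module:8s} {mw:.2f} mW")
```

Raising depth to 2 would give a finer breakdown (for example ALU/adder versus ALU/logic), which is one way a hierarchical display could let a designer drill down into a power-hungry module.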

Tuning environment: instruction-set view tool
[Figure: tool flow. For each instruction (1, 2, 3, ...), binaries that exercise that instruction pass through the ROM generator to give a ROM entity. The ROM entity and the microprocessor structure feed the simulator and power analyzer, which produces flat power data for that instruction. The power data collector, structural power data translator, and xdu display gather the results.]
Table columns: Instruction, Power (mW).
Instructions listed: ADDC_, ADD_, ANL_, CLR_, CPL_, DA, DEC_, DIV, INC_, MOVC_, MOVC_, MOV_, MOV_, MUL, NOP, ORL_, POP, PUSH (8.7116).
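
One way to picture "binaries to exercise instruction N" is a tiny ROM image that executes the same instruction many times, so that the power measured during simulation is dominated by that instruction. The Python sketch below is an assumption about how such images could be built, not the authors' generator. It handles only single-byte instructions; the opcode values are the standard 8051 encodings, while the function name and the repeat count are hypothetical.

```python
def make_exercise_rom(opcode, repeats=256):
    """Build a ROM image that runs one single-byte opcode `repeats` times,
    then spins in place with SJMP $ (bytes 0x80 0xFE)."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    return bytes([opcode]) * repeats + bytes([0x80, 0xFE])

# Single-byte 8051 instructions and their standard opcodes.
SINGLE_BYTE_OPCODES = {
    "NOP": 0x00,
    "INC A": 0x04,
    "CLR A": 0xE4,
    "CPL A": 0xF4,
}

# One ROM image per instruction under test.
roms = {name: make_exercise_rom(op) for name, op in SINGLE_BYTE_OPCODES.items()}
for name, image in roms.items():
    print(f"{name:6s} {len(image)} bytes, starts with {image[:4].hex()}")
```

Instructions with operands would need operand bytes as well, and the measured figure would still include instruction fetch and the final loop, so results from a scheme like this are best read as relative rather than absolute.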

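Tuning environment: program/data memory view tool
[Figure: tool flow. The program binary runs on the instruction-set simulator. Combined with per-instruction power data, the program hierarchy power translator and xdu display report program/data memory access frequencies and power.]
Program memory table columns: Addr, Ins, Freq, Pwr, Freq*Pwr. Rows start at address 00000 and include LJMP, MOV_, RET, and LCALL (last entry: LCALL, 2700).
Data memory table columns: Addr, Purpose, Accesses. Rows start at address 00128 and include P, SP, DPL, DPH, P, PSW, ACC, and B (last entry: B, 2598).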
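
The Freq*Pwr column on this slide weights each instruction's power by how often it executes, which is what lets the tool run in seconds: it reuses per-instruction power data instead of re-simulating the gate-level design. A minimal Python sketch of that weighting follows. The execution counts and power values are hypothetical placeholders, and rank_by_weighted_power is an illustrative name, not part of the environment.

```python
def rank_by_weighted_power(exec_counts, power_mw):
    """Return rows of (instruction, freq, power_mw, freq * power_mw),
    sorted so the largest Freq*Pwr product comes first."""
    rows = []
    for ins, freq in exec_counts.items():
        pwr = power_mw[ins]
        rows.append((ins, freq, pwr, freq * pwr))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical inputs: counts would come from the instruction-set
# simulator, power from the instruction-set view tool.
exec_counts = {"MOV": 1000, "LCALL": 50, "RET": 50}
power_mw = {"MOV": 8.0, "LCALL": 9.0, "RET": 8.5}

rows = rank_by_weighted_power(exec_counts, power_mw)
for ins, freq, pwr, weighted in rows:
    print(f"{ins:6s} freq={freq:5d} pwr={pwr:4.1f} freq*pwr={weighted:8.1f}")
```

In this made-up data the frequently executed MOV dominates even though LCALL draws more power per execution, which is the kind of observation the view is meant to surface.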

Tuning environment
Inputs: program binary, microprocessor core.
Tool                               Run time    Output
Program/data memory view tool      seconds     Program power data
Architectural view tool            1 hour      Architecture power data
Instruction-set power view tool    1 day       Instruction-set power data

Design flow using the tuning environment
[Figure: flowchart. Run the program/data memory view tool, the architecture view tool, and the instruction-set view tool, then ask "Satisfied?". If yes, DONE. If no, change the application or change the architecture and run the tools again.]
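
The flowchart above amounts to a measure, check, modify loop. The Python sketch below shows that loop under stated assumptions: the three view tools are stood in for by one hypothetical run_views callable, the design is a plain dictionary, and the power numbers (in microwatts, to keep the arithmetic exact) are invented for illustration. In the slides the change step is done by hand by the designer; the callable here only marks where that step sits.

```python
def tune(design, run_views, satisfied, propose_change, max_iterations=10):
    """Iterate: run the view tools, stop if satisfied, otherwise change
    the application or architecture and measure again."""
    for _ in range(max_iterations):
        report = run_views(design)
        if satisfied(report):
            return design, report
        design = propose_change(design, report)
    return design, run_views(design)

# Hypothetical stand-ins for the tools and the designer's decisions.
def run_views(design):
    return {"total_uw": design["total_uw"]}

def satisfied(report):
    return report["total_uw"] <= 7700

def propose_change(design, report):
    # Pretend each architecture change saves 200 microwatts.
    return {"total_uw": design["total_uw"] - 200,
            "changes": design["changes"] + 1}

final_design, final_report = tune({"total_uw": 8000, "changes": 0},
                                  run_views, satisfied, propose_change)
print(final_design, final_report)
```

Because the three tools differ widely in run time (seconds, about an hour, about a day), a practical loop would probably run the cheap program/data memory view on every iteration and the expensive views less often; that ordering is a suggestion, not something the slides state.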

Sample tuning optimization
Observation:
  RAM consumes much power.
  Address 224 is accessed frequently.
Possible tuning optimization:
  Replace this RAM location by a register inside the CTRL module.
Steps:
  Modify the VHDL model.
  Run all three view tools.
Results:
  Power reduction: 7.67 to 7.27 mW.
[Slide repeats the per-module power table (ROM 1.04 mW, ALU 1.62 mW, RAM 1.42 mW, CTRL 2.69 mW, DECODER 0.07 mW, total 7.66 mW) and the data memory access table (Addr, Purpose, Accesses).]
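
As a quick check of the result on this slide, going from 7.67 mW to 7.27 mW is a saving of roughly 5%. The Python sketch below does that arithmetic and also shows how candidate RAM locations might be picked from access counts. The access counts are hypothetical (the real ones come from the program/data memory view tool), and both function names are illustrative.

```python
def relative_saving(before_mw, after_mw):
    """Fractional power saving when going from before_mw to after_mw."""
    return (before_mw - after_mw) / before_mw

def hottest_addresses(access_counts, n=1):
    """Return the n most frequently accessed data-memory addresses."""
    return sorted(access_counts, key=access_counts.get, reverse=True)[:n]

# Figures from the slide: 7.67 mW before, 7.27 mW after.
saving = relative_saving(7.67, 7.27)
print(f"saving: {saving:.1%}")

# Hypothetical access counts keyed by 8051 data address
# (224 = ACC, 129 = SP, 240 = B in the standard 8051 memory map).
access_counts = {224: 9000, 129: 3000, 240: 500}
candidates = hottest_addresses(access_counts, n=2)
print("register candidates:", candidates)
```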

Some recent data
Applied the tuning environment to a particular application.
Converted two frequently accessed RAM locations to registers.
  15% total power savings.
Introduced datapath shortcuts for the two most common register-to-register moves of the application, thus bypassing the ALU.
  10% total power savings.
Partitioned the controller into two, with one small controller implementing the frequently executed instructions.
  10-15% power savings, but we expect much more if we do a better job partitioning the design.

Conclusions
Described an environment for tuning a microprocessor to its application for low power.
  Full instruction-set compatibility.
  Multiple views help find power hogs.
  Fully automated.
Focus is now on developing tuning optimizations.
  Controller partitioning, small-loop table, datapath shortcuts, register-file copies, etc.
  Investigate the possibility of automating tuning optimizations; develop a more general tuning methodology.
Environment for the 8051 is available on the web: