Introduction and Motivation Power consumption/density has become a critical issue in high performance processor design This issue is even more important.

Slides:



Advertisements
Similar presentations
Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.
Advertisements

VADA Lab.SungKyunKwan Univ. 1 L3: Lower Power Design Overview (2) 성균관대학교 조 준 동 교수
Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.
A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O Borgatti, M. Lertora, F. Foret, B. Cali, L.
System Design Tricks for Low-Power Video Processing Jonah Probell, Director of Multimedia Solutions, ARC International.
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Synchronous Digital Design Methodology and Guidelines
TigerSHARC and Blackfin Different Applications. Introduction Quick overview of TigerSHARC Quick overview of Blackfin low power processor Case Study: Blackfin.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
1 Architectural Analysis of a DSP Device, the Instruction Set and the Addressing Modes SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software.
Introduction to ARM Architecture, Programmer’s Model and Assembler Embedded Systems Programming.
Institute of Digital and Computer Systems 1 Fabio Garzia / Finding Peak Performance in a Process23/06/2015 Chapter 5 Finding Peak Performance in a Process.
Programmable logic and FPGA
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.
Optimization Of Power Consumption For An ARM7- BASED Multimedia Handheld Device Hoseok Chang; Wonchul Lee; Wonyong Sung Circuits and Systems, ISCAS.
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma
ECE 510 Brendan Crowley Paper Review October 31, 2006.
Prardiva Mangilipally
BLDC MOTOR SPEED CONTROL USING EMBEDDED PROCESSOR
Engineering 1040: Mechanisms & Electric Circuits Fall 2011 Introduction to Embedded Systems.
1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.
1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.
Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.
The 6713 DSP Starter Kit (DSK) is a low-cost platform which lets customers evaluate and develop applications for the Texas Instruments C67X DSP family.
6.893: Advanced VLSI Computer Architecture, September 28, 2000, Lecture 4, Slide 1. © Krste Asanovic Krste Asanovic
Embedded Systems Design ICT Embedded System What is an embedded System??? Any IDEA???
Ross Brennan On the Introduction of Reconfigurable Hardware into Computer Architecture Education Ross Brennan
Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.
Renesas Electronics Europe GmbH A © 2010 Renesas Electronics Corporation. All rights reserved. RL78 Clock Generator.
RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
Reconfigurable Caches and their Application to Media Processing Parthasarathy (Partha) Ranganathan Dept. of Electrical and Computer Engineering Rice University.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
CLEMSON U N I V E R S I T Y AVR32 Micro Controller Unit Atmel has created the first processor architected specifically for 21st century applications that.
8086/8088 Hardware Specifications Power supply:  +5V with tolerance of ±10%;  360mA. Input characteristics:  Logic 0 – 0.8V maximum, ±10μA maximum;
1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.
0 Blackfin BF52x / Low Power. 1 Performance MHz Performance MHz Power mW BF MHz 132 KB RAM HDMA BF MHz 132 KB RAM USB BF MHz 132 KB.
designKilla: The 32-bit pipelined processor Brought to you by: Victoria Farthing Dat Huynh Jerry Felker Tony Chen Supervisor: Young Cho.
Developing Power-Aware Strategies for the Blackfin Processor Steven VanderSanden Giuseppe Olivadoti David Kaeli Richard Gentile Northeastern University.
Srihari Makineni & Ravi Iyer Communications Technology Lab
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
ARM for Wireless Applications ARM11 Microarchitecture On the ARMv6 Connie Wang.
Power-Aware Compilation CS 671 April 22, CS 671 – Spring Why Worry about Power Dissipation? Environment Thermal issues: affect cooling, packaging,
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
ATtiny23131 A SEMINAR ON AVR MICROCONTROLLER ATtiny2313.
Computer Architecture Lecture 32 Fasih ur Rehman.
Basics of Energy & Power Dissipation
VLSI Algorithmic Design Automation Lab. THE TI OMAP PLATFORM APPROACH TO SOC.
AT91 Products Overview. 2 The Atmel AT91 Series of microcontrollers are based upon the powerful ARM7TDMI processor. Atmel has taken these cores, added.
Power Analysis of Embedded Software : A Fast Step Towards Software Power Minimization 指導教授 : 陳少傑 教授 組員 : R 張馨怡 R 林秀萍.
FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.
An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.
JouleTrack - A Web Based Tool for Software Energy Profiling Amit Sinha and Anantha Chandrakasan Massachusetts Institute of Technology June 19, 2001.
High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.
Components of a typical full-featured microcontroller.
Presented by Rania Kilany.  Energy consumption  Energy consumption is a major concern in many embedded computing systems.  Cache Memories 50%  Cache.
FPGA Technology Overview Carl Lebsack * Some slides are from the “Programmable Logic” lecture slides by Dr. Morris Chang.
Renesas Electronics Europe GmbH A © 2010 Renesas Electronics Corporation. All rights reserved. RL78 AD converter.
Temperature and Power Management
Hot Chips, Slow Wires, Leaky Transistors
Embedded Systems Design
Getting the Most Out of Low Power MCUs
A High Performance SoC: PkunityTM
Chapter 1 Introduction.
Introduction to 5685x Series
ADSP 21065L.
Presentation transcript:

Introduction and Motivation Power consumption/density has become a critical issue in high performance processor design This issue is even more important on battery-powered embedded cores and systems The embedded processing market is growing at a very fast pace Application engineers must be able to accurately predict the energy usage for the core and the system when running their applications This project is targeted to improve the power analysis capabilities of the ADI Blackfin family of processors and systems

ADI Blackfin Family of Processors Wireless Connectivity Bluetooth GSM/GPRS 3G/EDGE Digital Imaging CODECs MPEG JPEG H.263 H.264 System Control/ Applications Software Wired Connectivity USB TCP/IP Ethernet Human Interface Speech Recognition Text to Speech Handwriting Audio Operating Systems/RTOS Designed for High Level Language Image Processing Microprocessing Digital Signal Processing

Blackfin Family Blackfin Core –High-performance –16-bit –Dual-MAC embedded processors –Equally adept at DSP, control processing, and image processing Processor Features – Mhz core capable of to GMACs –8, 16 and 32-bit fixed-point math support –Hierarchical reconfigurable memory systems –Dual core versions –High speed peripherals and DMA controller Parallel Peripheral Interface (PPI) : dedicated 0-75Mhz parallel data port SPORTS, SPI, External Port, SDRAM, UART (IrDA), etc –Control processing features Very high compiled code density Supervisor and user modes/MMU, watchdog timer, real-time clock

How does the Blackfin Processor help? Speeds time-to-market and facilitates rapid product derivatives –High-performance software target –Software-centric product development Lowers BOM and R&D costs –Eliminates redundant DSP, MCU and hardware accelerator blocks –Software reuse model enhances R&D productivity with each sequential product generation –Processors begin at $5 (in quantities of 10K) Reduces technical, market and schedule risks –Software support for multiple formats and evolving standards –Development and debug within software—not ASIC—cycle times –Signal processing capabilities along with a familiar RISC programming model Enables end-product feature differentiation –2X to 4X performance advantage per dollar and per milliwatt

Blackfin Dynamic Power Management Overview Wide range of core frequencies supported (1.25M->756 MHz) –Programmable Core and System Clocks for maximum power savings Wide range of core operating voltages supported (0.8 -> 1.4 V) –Programmable internal voltage levels based on core frequency Full complement of power savings modes –Full-on, Active, Sleep, Deep-sleep and Hibernate “Voltage and frequency tuning” for minimum power –Ensures consistent, low power consumption across process Dual-core processor can be used for power savings –Lower voltage levels and lower frequencies provide additional power savings options with equivalent performance levels

Power Dissipation Therefore –E  I * N E = P * T E – energy consumed T – execution time T = N * 1/f N – number of cycles f – clock frequency Important to distinguish between power and energy P = I * Vcc P – average power I – average current Vcc – supply voltage Power vs. Energy Dynamic power dissipation –Due to switching activity Static power dissipation –Due to leakage current – major paths are: Subthreshold leakage Exponentially dependent on Vdd, Vth, temperature Gate leakage Exponentially dependent on Vdd, Tox

Instruction-level Power Estimation Strategy Develop an instruction-level energy model for the Blackfin processor 1.2 V and 270 MHz, though our approach is re- targetable) –Core voltage operation between 0.8V and 1.4V from 0 to 756 MHz Leverage past work on instruction-level power profiling for embedded cores Princeton) –Instruction-level estimation can be effective on cores with simple pipelines We then build energy estimates, working with individual basic blocks, and then weight blocks based on the dynamic call graph traversal during program execution

Instruction-level Power Estimation Strategy We consider variability due a configurable memory hierarchy We consider the impact of operand values and operand types on energy We consider environmental effects on measurements We will combine our instruction-level model with VisualDSP++ to provide power/performance framework

Instruction-Level Energy Modeling Total Energy = Base Energy Cost + Inter-Instruction Effects Base Energy Cost –The energy cost to execute an individual instruction Capture Base Energy Costs –Construct loops containing several instances of the same instruction (now automated) –Measure the average current drawn while executing this loop –The base energy cost is directly proportional to this current, multiplied by the number of cycles needed to complete each instance of the instruction

Instruction-Level Energy Modeling Total Energy = Base Energy Cost + Inter-Instruction Effects Inter-Instruction Effects –Energy contributions that are not considered in the base energy cost –Circuit state overhead Added cost due to switching activity within the circuit when executing two different instructions in succession Effect measured using a pair of different instructions in a loop and capturing the average current –Effects of resource constraints and delays Common events - pipeline stalls, cache misses, write buffer stalls These events increase the number of cycles required to complete an instruction The average power per cycle often decreases, but the overall energy still increases due to the higher cycle count

Measurement Environment Warm-up

Impact of Operand Values Comments: –Input operand values have a significant impact on average current (range of 3.9 mA) –Power is dependent upon the number of bit flips performed in a cycle –Large variations in current are observed with changing destination register values –Presents challenges to our measurement assumptions Instruction: r7 = r3 + r4; r3 Valuer4 ValueCurrent (mA) 0x x xFFFF x xFFFFFFFF 97.5 InstructionInitial ValuesCurrent (mA) r6 = -r3;r3 = 0x90B94.1 r3 = -r3;r3 = 0x90B108.5

Instruction Selection Add top_loop: r7 = r3 + r4; … jump top_loop; Nop top_loop: nop; … jump top_loop; Combination top_loop: r7 = r3 + r4; nop; r7 = r3 + r4; nop; … jump top_loop; Average current –Add: 94.7 mA –NOP: 90.9 mA –Combination: mA Comments: –Circuit state overhead is significant (i.e., NOPs are not free) –Decode overhead is a major contributor to power consumption

Memory Configuration Investigated current dissipation of L1 memory configured as SRAM vs. cache Cache overhead for Load instruction –Instruction: 3.9 mA –Data: 11.8 mA Comments: –Cache maintenance operations increase current dissipation –Data cache consumes more current due to core layout and multi-port design

Example Program: Cache Disabled Measured Average current: mA Number of Cycles: 9 E = 4.7 nJ Estimated E = 4.4 nJ Percent Difference 5% r1 = [i0]; r7 *= r1; r6 = r1 + r6 (ns); r5 = r1 +|- r6; [i1] = r7; [i2] = r6; [i3] = r5; r1 = [i0]; r7 *= r1; r6 = r1 + r6 (ns) || [i1] = r7; r5 = r1 +|- r6 || [i2] = r6; [i3] = r5; Measured Average current: mA Number of Cycles: 7 E = 4.0 nJ Estimated E = 3.8 nJ Percent Difference 5% Example Program: Parallel Instructions

Example Program: Multiple Basic Blocks Measured Average current: mA Number of Cycles: 20 E = 10.2 nJ Estimated E = 9.9 nJ Percent Difference 2% r1.h = 0x5555; r1.l = 0xAAAA; r2.h = 0x3333; r2.l = 0xCCCC; jump label1; label1: r7.h = r1.h*r2.h, r7.l = r1.l*r2.l; r6 = r1 & r2; r5 = ashift r1 by r2.l (s); jump label2; label2: [i1++] = r7; [i1++] = r6; [i1++] = r5;

Summary Developed a retargetable method to produce an instruction-level energy model Constructed an instruction-level energy model for the Blackfin processor and used it to estimate programs with less than 6% error Developed a set of automated tools to drive test code generation and current measurements Studied the energy effects of the memory hierarchy, changes in operand values, and environmental factors