Word-Size Optimization for Low Energy, Variable Workload Sub-threshold Systems
Sudhanshu Khanna, Anurag Nigam
ECE 632 – Fall 2008, University of Virginia

Presentation transcript:


Introduction
– Energy-constrained sub-Vt systems: medical devices, environmental sensors
– Need to lower E in order to enable “lifelong” operation
– Small form factor => area reduction
– Total E = Active E + Sleep E

Top Level Problems Addressed
– Energy reduction: active and sleep mode
– Area reduction
– Adaptation of super-threshold designs to sub-threshold

Current Approaches
– Voltage is regulated from an off-chip (expensive) DC-DC converter
Ref: K. Craig, R. Matthews, EE632 Fall 2008

Our Approach
– Make the “starting point” design more E-efficient, specifically for sleep-mode operation

A sure way of lowering CV^2: lower V => sub-threshold
[Diagram: Logic System, 1.2V -> 0.2V]
Can we optimize the logic system for sub-Vt operation, or should it be the same

A sure way of lowering CV^2: lower V => sub-threshold
[Diagram: Logic System, 1.2V -> 0.2V; Smaller Logic System]
– Make the system as small as feasible. Use it over and over until the required operation is done. Then go to sleep and leak less.
– How do we make the system smaller? Use a smaller word size.
– Will using the small system over and over increase the active energy?

Smaller Word-Size: Problems Addressed
For sure, a small word size means:
– Lower area
– Lower sleep energy
– Higher delay
We need to find:
– How much is the area / sleep E benefit?
– What is the impact of multi-cycle operation on active E?
– Can we somehow make small-word systems faster without losing the sleep E and area advantage?

Smaller Word-Size: Our Contribution
For sure, a small word size means:
– Lower area
– Lower sleep energy
– Higher delay
We need to find:
– How much is the area / sleep E benefit? More than 20x area benefit and more than 33x sleep energy benefit.
– What is the impact of multi-cycle operation on active E? Multi-cycle operation increases active E, but the final value of the active E is about the same as or lower than that of a 32-bit system.
– Can we somehow make small-word systems faster without losing the sleep E and area advantage? Yes, the delay degradation can be overcome while still being more energy efficient.

Systems Compared
Addition of two 32-bit numbers using:
– Large word-size (32-bit): Kogge-Stone Adder, Ripple Carry Adder, Full-Adder
– Small word-size (1-bit): 1-bit is taken for simplicity; the trends would be valid for other word sizes, e.g. 16-bit, 8-bit.
Addition is taken as a sample digital function. However, the trends found can be generalized to other digital functions as well.

32-bit Kogge-Stone Adder (KSA), 32-bit Ripple Carry Adder (RCA)
[Diagram: 32-bit KSA or RCA with a 32-bit register; signals PA, PB, Reset, CLK, 32-bit OUT]
PA = parallel input A
PB = parallel input B
OUT = parallel output from the sum register

Small Word-Size System
Let the smaller word size be n, with n < 32. In general, an n-bit word system will have n-bit operands, and the system will look just like a 32-bit system, only smaller.
[Diagram: n-bit full adder, n-bit register, CLK]
In case n = 1, the system will take 32 clock cycles to add two 32-bit numbers. Hence the higher delay.
[Diagram, n = 1: 1-bit full adder, 1-bit register, CLK. This is the 1-bit Serial Adder (SA).]
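The cycle count on this slide can be illustrated with a small behavioral model. This is only a sketch, not the authors' hardware: the function name `digit_serial_add` and its parameters are hypothetical, and it assumes unsigned operands and a word size n that divides the 32-bit operand width.

```python
def digit_serial_add(a, b, n=1, width=32):
    """Add two width-bit unsigned integers, n bits per clock cycle.

    Hypothetical behavioral model of an n-bit word system: one n-bit
    adder plus a carry register, reused width/n times.
    Returns (sum modulo 2**width, carry_out, cycles_used).
    """
    if width % n != 0:
        raise ValueError("width must be a multiple of n")
    mask = (1 << n) - 1
    carry = 0
    result = 0
    cycles = 0
    for shift in range(0, width, n):
        # Slice the next n-bit digit out of each operand.
        a_digit = (a >> shift) & mask
        b_digit = (b >> shift) & mask
        s = a_digit + b_digit + carry
        result |= (s & mask) << shift   # store the n-bit partial sum
        carry = s >> n                  # carry is held for the next cycle
        cycles += 1
    return result, carry, cycles
```

With n = 1 the loop runs 32 times, which matches the 32 clock cycles quoted on the slide; with n = 8 it runs 4 times.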

A conceptual fully-serial 1-bit system
[Diagram blocks: serial ADC, serial multiplier, 1-bit full adder, 1-bit registers, serial DAC. Signals: CLK, analog input, analog output, 1-bit input from other part of chip. One region is labeled “Simulated 1-bit SA”.]

32-bit Serial Adder (SA) using Full-Adder
[Diagram: three 32-bit shift registers, 1-bit full adder, carry flip-flop; signals PA, PB, CLK, Cin, Cout, 1-bit OUT]
– Regular 32-bit word system, but with the parallel adder replaced by a 1-bit full adder => lower sleep energy
– Takes 32 cycles, but is amenable to use in an unmodified 32-bit word system
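A rough software model of this structure may help make the data flow concrete. It is a sketch under assumptions: the names `full_adder` and `serial_add_32` are hypothetical, the shift registers are modeled as Python integers shifted LSB-first, and the carry flip-flop is assumed to be reset to 0 before the first cycle.

```python
def full_adder(a, b, cin):
    """Gate-level 1-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def serial_add_32(pa, pb, width=32):
    """Model of a serial adder built from shift registers and one full adder.

    pa, pb are loaded in parallel into two shift registers; each clock
    cycle shifts out one bit of each (LSB first), adds them with the
    stored carry, and shifts the sum bit into the output register.
    Returns (sum register contents, final carry flip-flop value).
    """
    mask = (1 << width) - 1
    reg_a = pa & mask
    reg_b = pb & mask
    reg_out = 0
    carry_ff = 0  # assumed reset state of the carry flip-flop
    for _ in range(width):
        s, cout = full_adder(reg_a & 1, reg_b & 1, carry_ff)
        reg_a >>= 1
        reg_b >>= 1
        # Shift the new sum bit in at the MSB end; after `width` cycles
        # the first bit produced has moved down to the LSB position.
        reg_out = (reg_out >> 1) | (s << (width - 1))
        carry_ff = cout
    return reg_out, carry_ff
```

The only combinational logic exercised per cycle is the single full adder, which is the reason the slide gives for the lower sleep energy.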

Energy drawn for the addition of two 32-bit numbers is measured for all 4 systems:
– 32-bit KSA (large word-size system)
– 32-bit RCA (large word-size system)
– 32-bit SA (large word-size system)
– 1-bit SA (small word-size system)
Clock and register power are taken into account.
Important metric: energy per operation

[Figure: active energy per operation, VDD = 300mV, 22nm. Note that the plot shows only active energy.]
Plot annotations:
– High Edyn ~ Etot ~ 6pJ, but leakage current is 1.7x lower
– Active energy of the 1-bit system < 32-bit systems
– 40% active energy
– 33x reduction in leakage current

At 300mV:
– The 1-bit SA has 40% lower active E than the best 32-bit system.
– The 1-bit SA has 33x lower leakage current than the best 32-bit system.
– The 32-bit SA has 1.7x lower leakage current than the 32-bit KSA.
– Thus multi-cycle operation doesn’t increase active energy too much.
– Hence, once sleep time is added, the benefits of small-word systems will increase.
=> If the word size is fixed at 32, serial addition will still save energy if the application has a lot of sleep time, e.g. in sensor nodes.
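The effect of sleep time on the comparison can be sketched numerically with the decomposition Total E = Active E + Sleep E. The model below is an assumption, not the authors' simulation: it normalizes the best 32-bit system to active energy 1.0 and leakage power 1.0 (arbitrary units), takes sleep energy to be leakage power times sleep time at the same VDD, and plugs in only the two ratios quoted on the slide (40% lower active E, 33x lower leakage). The function names are hypothetical.

```python
def total_energy(active_e, leak_power, sleep_time):
    """Total energy per operation = active energy + leakage during sleep."""
    return active_e + leak_power * sleep_time


def small_to_large_ratio(sleep_time, active_ratio=0.6, leak_ratio=1 / 33):
    """Energy of the 1-bit system relative to the best 32-bit system.

    Units are normalized (assumption): the 32-bit system has active
    energy 1.0 and leakage power 1.0. active_ratio and leak_ratio come
    from the slide figures (40% lower active E, 33x lower leakage).
    """
    large = total_energy(1.0, 1.0, sleep_time)
    small = total_energy(active_ratio, leak_ratio, sleep_time)
    return small / large


if __name__ == "__main__":
    # sleep_time is expressed in normalized units, not seconds.
    for t in (0, 1, 10, 100):
        print(f"sleep_time={t:>3}: ratio={small_to_large_ratio(t):.3f}")
```

Under these assumptions the ratio starts at 0.6 with no sleep time and moves toward 1/33 as sleep time dominates, which is the trend the slide describes for sensor-node workloads.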

– Increasing VDD decreases delay. This can be used to make small word-size systems faster.
– But what is the impact of the VDD increase on energy?
[Diagram labels: Logic System; Logic System, small word; 1.2V, 0.2V, 0.4V; “Already compared”]

Comparison at constant delay
– Delay is equal, so we now compare energy at constant delay.
– The small word-size system is more energy efficient even after the VDD increase.
– But the margin of the energy benefit does go down.
– The same is not true in super-Vt. Why? Because of the difference in the on-current equation between super-Vt and sub-Vt.
[Diagram: Logic System at 0.2V compared with the small-word Logic System at 0.4V]
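The slide does not write out the on-current equations it refers to. For orientation, the commonly used textbook forms are given below; which exact models the authors used is not stated, so the symbols here (I_0, n, v_T, alpha) are assumptions taken from the standard sub-threshold and alpha-power-law expressions.

```latex
% Sub-threshold: on-current is exponential in V_DD
I_{on}^{\mathrm{sub}} \approx I_0 \exp\!\left(\frac{V_{DD} - V_T}{n\, v_T}\right),
\qquad v_T = \frac{kT}{q}

% Super-threshold (alpha-power law): on-current is polynomial in V_DD
I_{on}^{\mathrm{super}} \propto \left(V_{DD} - V_T\right)^{\alpha},
\qquad 1 \le \alpha \le 2

% Gate delay in either regime, and dynamic energy per transition
t_d \propto \frac{C\, V_{DD}}{I_{on}}, \qquad E_{dyn} \propto C\, V_{DD}^{2}
```

Read this way, a small VDD increase in sub-Vt buys an exponential reduction in delay for only a quadratic increase in CV^2, whereas in super-Vt the delay improves far more slowly for the same energy cost.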

[Figure: two panels, Sub-Vt and Super-Vt, each with regions annotated “small slope” and “large slope”]
VDD change => no impact on E

Pareto-Optimal E-D Curve
– Super-Vt -> the 32-bit system is Pareto-optimal
– Sub-Vt -> the 1-bit system is Pareto-optimal
– Cross-over: the 1-bit system becoming optimal
[Figure: E-D curves with regions labeled Super-Vt and Sub-Vt]
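For readers unfamiliar with the term, a design point is Pareto-optimal on an energy-delay plot when no other point is at least as good in both energy and delay and strictly better in one. A minimal sketch of that test follows; the function name `pareto_front` and the sample points in the usage note are hypothetical and are not the authors' data.

```python
def pareto_front(points):
    """Return the labels of non-dominated (delay, energy) design points.

    points: list of (label, delay, energy) tuples; lower is better for
    both delay and energy. A point is dominated if some other point is
    no worse in both metrics and strictly better in at least one.
    """
    front = []
    for label, d, e in points:
        dominated = any(
            (d2 <= d and e2 <= e) and (d2 < d or e2 < e)
            for _, d2, e2 in points
        )
        if not dominated:
            front.append(label)
    return front
```

For example, with made-up points `[("A", 1, 5), ("B", 2, 3), ("C", 3, 4), ("D", 4, 1)]`, point C is dominated by B (slower and higher energy), so the front is A, B, D.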

Generality of Trends
– The 1-bit system is used as an example. Energy and area benefits will be achieved in any small word-size system.
– The shift in the Pareto-optimal curve happens because of the difference in the I_on equation. Hence this behavior can be observed in other parts of a digital system as well, and not just addition.
– This opens energy-saving opportunities in more areas of digital design.

When going into sub-Vt operation, revisit the word size of the system being used. The optimal word size goes down: a small word size gives lower E and area and matches delay.
[Diagram: Logic System at 0.2V compared with the small-word Logic System at 0.4V, at constant delay]
Small-word system:
– Energy: less
– Leakage: less
– Area ($$$): less
– Delay: same

Different Word-Size Systems
– 1-bit (Digital Audio System – Sharp)
– 4-bit (Marc4 microcontroller, Intel 4040)
– 8-bit (microcontrollers, Intel 8080 processor)
– 16-bit (Intel 8086 processor)
– 64-bit (Athlon 64, Opteron processor)

FIR Filter
– Used in many real-time DSP systems (audio, video processing)
– 4-tap FIR filter; K(i): filter coefficients
– Serial implementation of a parallel FIR filter

[Diagram: parallel 4-tap FIR filter. Delay elements provide X(n), X(n-1), X(n-2), X(n-3); multipliers apply K0, K1, K2, K3; a 4-input parallel adder produces Y(n).]
K0, K1, K2, K3: filter coefficients stored in memory

[Diagram: serial FIR filter. X(n): serial input data; filter coefficients (K3, K2, K1, K0) from memory; serial-parallel multiplier; 1-bit serial adder; register; Y(n): serial output.]
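The two FIR diagrams compute the same output, Y(n) = K0·X(n) + K1·X(n-1) + K2·X(n-2) + K3·X(n-3); they differ in how much hardware is reused. The sketch below shows this at word level only. It is an illustration under assumptions: the function names are hypothetical, the delay line starts at zero, and the "serial" version models one shared multiply-accumulate unit rather than the bit-level serial-parallel multiplier and 1-bit serial adder in the diagram.

```python
def fir_parallel(x, k):
    """Direct-form FIR: all taps multiplied and summed at once per sample."""
    taps = len(k)
    y = []
    for n in range(len(x)):
        # Samples before the start of the input are treated as zero.
        y.append(sum(k[i] * x[n - i] for i in range(taps) if n - i >= 0))
    return y


def fir_serial(x, k):
    """Same filter using a single multiply-accumulate, one tap per step."""
    delay_line = [0] * len(k)  # assumed zero initial state
    y = []
    for sample in x:
        # Shift the new sample into the delay line.
        delay_line = [sample] + delay_line[:-1]
        acc = 0
        for coeff, value in zip(k, delay_line):
            acc += coeff * value  # one multiplier and one adder, reused
        y.append(acc)
    return y
```

For a 4-tap filter the serial version takes four accumulate steps per output sample instead of one, trading delay for fewer arithmetic units, which mirrors the word-size trade-off discussed for the adders.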

QUESTIONS