OPTIMAL FSMD PARTITIONING FOR LOW POWER Nainesh Agarwal and Nikitas Dimopoulos Electrical and Computer Engineering University of Victoria.

Presentation transcript:

OPTIMAL FSMD PARTITIONING FOR LOW POWER Nainesh Agarwal and Nikitas Dimopoulos Electrical and Computer Engineering University of Victoria

Summary – Power and energy – Power gating – Partitioning as a means to achieve optimal power gating – What next

Computation Power and Energy What is the minimum energy a computation can expend? Are we there yet?

Computation Power and Energy cont’d Feynman gives a relation between free energy and computation rate for reversible computation: – $E = kT \log r$, – where $r$ is the computation rate. In the limit we may expend zero energy (when $r = 1$), but the computation then takes infinitely long.

Computation Power and Energy cont’d For irreversible computation: – $E = kTb \log 2$, – where $b$ is the number of bits involved in the computation (entropy).

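Computation Power and Energy cont’d In both cases, these quantities are exceptionally small. – $k = 1.38 \times 10^{-23}$ J/K – At $T = 300$ K, $kT = 4.14 \times 10^{-21}$ J – A 50 W, 3 GHz processor consumes about $1.67 \times 10^{-8}$ J in one cycle (50 W divided by $3 \times 10^{9}$ cycles per second)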
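The arithmetic behind these figures can be checked with a short script. This is a minimal sketch: the variable names are illustrative, and comparing a whole processor cycle against the irreversible cost of a single bit is a deliberately crude assumption, used only to gauge the size of the gap.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K, exact SI value
T_KELVIN = 300.0            # temperature used on the slide

# Thermal energy kT and the irreversible cost of one bit, kT * ln 2
kT = K_BOLTZMANN * T_KELVIN
energy_per_bit = kT * math.log(2)

# Energy drawn by a 50 W processor during one 3 GHz clock cycle
power_w = 50.0
clock_hz = 3.0e9
energy_per_cycle = power_w / clock_hz

# Gap between the two, in orders of magnitude (assumption: one bit per cycle)
orders_of_magnitude = math.log10(energy_per_cycle / energy_per_bit)

print(f"kT at 300 K      = {kT:.3e} J")
print(f"kT ln 2          = {energy_per_bit:.3e} J")
print(f"energy per cycle = {energy_per_cycle:.3e} J")
print(f"gap              = {orders_of_magnitude:.1f} orders of magnitude")
```

Even if a cycle is credited with many thousands of bit operations instead of one, the gap stays well above ten orders of magnitude.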

Computation Power and Energy cont’d DSPstone benchmarks synthesized in 180 nm and 90 nm technologies

DSPstone dynamic energy

DSPstone total energy

Computation Power and Energy cont’d – Computational energy is far above the theoretical minimum (by more than 10 orders of magnitude) – Technological drive reduces total energy (an order of magnitude per generation) – Leakage power has become an issue – Power gating may provide efficiencies to further scale the technology

Partitioning Controller and datapath are considered together. The problem is formulated as: – Integer Linear Programming – Non-linear programming solved using simulated annealing

Notation – $s_i$ represents a state of an FSMD – $v_k$ represents a variable associated with one or more states – A variable $v_k$ is considered to be shared between two states $s_i$ and $s_j$ if the variable is read and/or written at both states – $T_{ij}$ is the total number of bits of all variables shared by states $s_i$ and $s_j$ – $E_{ij}$ is 1 if there is a transition between states $s_i$ and $s_j$; otherwise it is 0
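To make the notation concrete, the sketch below builds the shared-bit counts and transition indicators for a small, entirely hypothetical FSMD. The state names, variables, and bit widths are invented for illustration, and the transition indicator is treated as symmetric, which is one reading of "a transition between states".

```python
from itertools import combinations

# Hypothetical FSMD: the variables each state reads or writes
state_vars = {
    "s0": {"a", "b"},
    "s1": {"b", "c"},
    "s2": {"c"},
    "s3": {"a", "d"},
}
# Hypothetical bit widths of the variables
var_bits = {"a": 8, "b": 16, "c": 8, "d": 4}
# Hypothetical directed transitions of the controller
transitions = {("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s2", "s3")}


def shared_bits(si, sj):
    """T_ij: total width of the variables accessed in both states."""
    return sum(var_bits[v] for v in state_vars[si] & state_vars[sj])


def has_transition(si, sj):
    """E_ij: 1 if a transition exists in either direction, else 0."""
    return int((si, sj) in transitions or (sj, si) in transitions)


states = sorted(state_vars)
T = {pair: shared_bits(*pair) for pair in combinations(states, 2)}
E = {pair: has_transition(*pair) for pair in combinations(states, 2)}

for pair in combinations(states, 2):
    print(pair, "T =", T[pair], "E =", E[pair])
```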

ILP formulation Minimizes the number of bits that are shared between the partitions and the number of times that control could pass between the partitions. – $s_{ij}$ is 1 if both states $s_i$ and $s_j$ are in the same partition; otherwise, it is 0.
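The objective can be illustrated on a tiny instance by exhaustive search rather than an ILP solver. This is a sketch under stated assumptions: the matrices are hypothetical, shared bits and transition edges are weighted equally, and the only constraint is that both partitions are non-empty. The complete formulation may use different weights and additional constraints.

```python
from itertools import product

# Hypothetical 4-state instance: shared bits T[i][j] and transitions E[i][j]
T = [[0, 16, 0, 8],
     [16, 0, 8, 0],
     [0, 8, 0, 0],
     [8, 0, 0, 0]]
E = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(T)


def cut_cost(assign):
    """Sum T_ij + E_ij over state pairs placed in different partitions."""
    cost = 0
    for i in range(n):
        for j in range(i + 1, n):
            s_ij = 1 if assign[i] == assign[j] else 0  # same-partition flag
            cost += (T[i][j] + E[i][j]) * (1 - s_ij)
    return cost


# Enumerate every assignment that leaves neither partition empty
candidates = (a for a in product((0, 1), repeat=n) if 0 < sum(a) < n)
best = min(candidates, key=cut_cost)

print("best assignment:", best, "cost:", cut_cost(best))
```

Exhaustive search doubles in cost with every added state, which is why a solver or a heuristic is needed for realistic designs.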

ILP formulation - complete

Simulated Annealing formulation $x_i$ is $-1$ if state $s_i$ is in the left partition, and $1$ if $s_i$ is in the right partition. These quantities count the number of variable bits and transition edges shared between the two partitions.

Simulated Annealing formulation simplification steps Observe that the total number of variable bits is constant.
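One way a constant total can simplify the cost is through the standard identity for ±1 variables: $(1 - x_i x_j)/2$ equals 1 exactly when two states sit in different partitions. The check below, on a hypothetical shared-bit matrix, confirms that the cut bits equal half of the constant total minus the sum of $T_{ij} x_i x_j$. Whether this matches the simplification on the slide is an assumption.

```python
from itertools import product

# Hypothetical shared-bit matrix T[i][j] for four states
T = [[0, 16, 0, 8],
     [16, 0, 8, 0],
     [0, 8, 0, 0],
     [8, 0, 0, 0]]
n = len(T)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

# Constant term: total shared bits over all state pairs
total_bits = sum(T[i][j] for i, j in pairs)


def cut_bits(x):
    """Bits shared by pairs of states that lie in different partitions."""
    return sum(T[i][j] for i, j in pairs if x[i] != x[j])


def simplified(x):
    """Same quantity written as (total - sum T_ij * x_i * x_j) / 2."""
    correlation = sum(T[i][j] * x[i] * x[j] for i, j in pairs)
    return (total_bits - correlation) // 2


# Check the identity for every assignment of x_i in {-1, +1}
identity_holds = all(
    cut_bits(x) == simplified(x) for x in product((-1, 1), repeat=n)
)
print("total bits:", total_bits, "identity holds:", identity_holds)
```

Because the total is fixed, minimizing the cut is equivalent to maximizing the correlation term alone.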

Simulated Annealing formulation Minimizes both the shared bits and the transition edges.
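A minimal simulated annealing loop over the ±1 state assignment might look as follows. The instance, the geometric cooling schedule, the equal weighting of bits and edges, and the rule that neither partition may become empty are all assumptions made for this sketch, and it is written in Python rather than reproducing the MATLAB implementation used for the evaluation.

```python
import math
import random

# Hypothetical instance: shared bits T[i][j] and transition edges E[i][j]
T = [[0, 16, 0, 8],
     [16, 0, 8, 0],
     [0, 8, 0, 0],
     [8, 0, 0, 0]]
E = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(T)


def cost(x):
    """Shared bits plus transition edges crossing the partition boundary."""
    return sum(T[i][j] + E[i][j]
               for i in range(n) for j in range(i + 1, n)
               if x[i] != x[j])


def anneal(seed=0, steps=2000, t_start=10.0, t_end=0.01):
    rng = random.Random(seed)
    x = [rng.choice((-1, 1)) for _ in range(n)]
    if len(set(x)) == 1:
        x[0] = -x[0]  # make sure both partitions start non-empty
    current = cost(x)
    best, best_cost = x[:], current
    for step in range(steps):
        # Geometric cooling from t_start towards t_end
        temp = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(n)
        x[i] = -x[i]  # propose moving state i to the other partition
        if len(set(x)) == 1:
            x[i] = -x[i]  # reject moves that empty a partition
            continue
        delta = cost(x) - current
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current += delta
            if current < best_cost:
                best, best_cost = x[:], current
        else:
            x[i] = -x[i]  # undo the rejected move
    return best, best_cost


best, best_cost = anneal()
print("assignment:", best, "cost:", best_cost)
```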

Evaluation Implemented four integer algorithms: – 8-bit counter – 5/3 wavelet transform using lifting – multiplierless approximation to the eight-point Discrete Cosine Transform (DCT) – integer transform from the H.264 standard. Used CoDeL to implement the designs. Trace data were obtained from simulations using Synopsys. The ILP model was solved using the CPLEX solver included in the AIMMS modeling environment. The simulated annealing used MATLAB.

Evaluation cont’d Power savings were estimated (no partitioned design has been implemented yet). – The static power savings depend on the size of the sequential logic and the portion of time spent in each partition. – The dynamic power savings depend on the number of bits that are not clocked while a partition is powered down, offset by the overhead of data communication when the active partition changes.
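A rough estimate of the static savings can be sketched as follows. The model is an assumption, not the estimator used for the reported results: leakage is taken to be proportional to the number of sequential bits, the inactive partition is assumed to be fully gated, and gating overhead is ignored. The function name and the example numbers are hypothetical.

```python
def static_savings(seq_bits, time_fraction):
    """Fraction of static power saved when only the active partition is on.

    seq_bits[p]      -- sequential-logic bits in partition p
    time_fraction[p] -- fraction of execution time spent in partition p
    Assumes leakage scales with sequential bits and gating is free.
    """
    total_bits = sum(seq_bits)
    savings = 0.0
    for active, fraction in enumerate(time_fraction):
        gated_bits = total_bits - seq_bits[active]  # everything else is off
        savings += fraction * gated_bits / total_bits
    return savings


# Hypothetical two-partition design
uneven = static_savings(seq_bits=[40, 24], time_fraction=[0.7, 0.3])
balanced = static_savings(seq_bits=[32, 32], time_fraction=[0.5, 0.5])
print(f"uneven split:   {uneven:.1%}")
print(f"balanced split: {balanced:.1%}")
```

Under this model a design that spends nearly all of its time in one large partition saves very little, since the gated partition is small.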

Evaluation (Static Power savings)

Evaluation (Dynamic Power Savings)

Results (ILP)

Results (Simulated Annealing)

Discussion – Results show that partitioning the controller and datapath could potentially save up to 50% of static power. – Some circuits could not be partitioned (the DWT includes one tight loop where it spends more than 90% of the time). – For the circuits that were partitioned, simulated annealing and ILP give identical results. – Simulated annealing is much faster.

Future – Extend the methodology to more than two partitions – Implement the partitioned FSMD machines and confirm the realized power savings – Lower energy!