L9 : Low Power DSP Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab.


Low Power DSP
Most of the execution time is spent in DO-loops:
– VSELP vocoder: 83.4%
– 2D 8x8 DCT: 98.3%
– LPC computation: 98.0%
Minimizing the power of the DO-loops ==> minimizing the power of the DSP.
VSELP: Vector Sum Excited Linear Prediction. LPC: Linear Predictive Coding.

VLSI Signal Processing Design Methodology
– Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering.
– Bit-serial, bit-parallel and digit-serial architectures; carry-save architecture.
– Redundant and residue systems.
– Viterbi decoder, motion compensation, 2D filtering, and data transmission systems.

Loop unrolling
Loop unrolling replicates the body of a loop some number of times (the unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases instruction parallelism, and improves register, data cache, or TLB locality.

Loop Unrolling Effects
With an unrolling factor of 2, loop overhead is cut in half because two of the original iterations are performed in each iteration of the unrolled loop. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are each used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
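The loop that this slide refers to is not preserved in the transcript. As a minimal sketch, assume a loop body of the form a[i] = a[i] + a[i-1] * a[i+1], which has the property mentioned above (after unrolling by 2, a[i] and a[i+1] each appear twice in the body). The function names are hypothetical.

```c
#include <stddef.h>

/* Original loop: one element per iteration, step 1 (assumed example). */
void smooth_rolled(double *a, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        a[i] = a[i] + a[i - 1] * a[i + 1];
}

/* Unrolled with factor u = 2: two elements per iteration, step 2. */
void smooth_unrolled(double *a, size_t n)
{
    size_t i = 1;
    for (; i + 2 < n; i += 2) {
        a[i]     = a[i]     + a[i - 1] * a[i + 1];
        a[i + 1] = a[i + 1] + a[i]     * a[i + 2];
    }
    /* Clean-up iteration when the trip count is odd. */
    if (i + 1 < n)
        a[i] = a[i] + a[i - 1] * a[i + 1];
}
```

Both versions perform the same operations in the same order, so they produce identical results; the unrolled one simply tests and increments the loop variable half as often.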

Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity at the inputs of the functional units; alternatively, two output samples are computed in parallel from two input samples. By itself, unrolling alters neither the switched capacitance nor the voltage. However, it enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation, the transformed loop has a critical path of 3, so the supply voltage can be dropped.
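The filter itself is not reproduced in the transcript. The sketch below assumes a first-order IIR filter, y[n] = x[n] + a*y[n-1], and shows what unrolling by 2 followed by distributivity and constant propagation could look like. The function names, the coefficient a, and the initial state y_init are illustrative assumptions.

```c
#include <stddef.h>

/* Direct form (assumed filter): y[n] = x[n] + a * y[n-1]. */
void iir_direct(const double *x, double *y, size_t n, double a, double y_init)
{
    double prev = y_init;
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] + a * prev;
        prev = y[i];
    }
}

/* Unrolled by 2, then distributivity and constant propagation:
 *   y[n]   = x[n]   + a * y[n-1]
 *   y[n+1] = x[n+1] + a * x[n] + (a*a) * y[n-1]
 * Both outputs now depend only on y[n-1]. */
void iir_unrolled2(const double *x, double *y, size_t n, double a, double y_init)
{
    const double a2 = a * a;      /* constant computed once, outside the loop */
    double prev = y_init;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        y[i]     = x[i] + a * prev;
        y[i + 1] = x[i + 1] + a * x[i] + a2 * prev;
        prev = y[i + 1];
    }
    if (i < n)                    /* leftover sample when n is odd */
        y[i] = x[i] + a * prev;
}
```

Under this assumption, plain unrolling gives a chain of multiply, add, multiply, add (4 operations) for two samples, whereas the transformed body needs only one multiply and two adds (3 operations) on its longest path. That is one reading of the "critical path of 3" mentioned above.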

Loop Unrolling for Low Power

Loop Unrolling for OPR

DFG after Loop Unrolling
With the unrolled data-flow graph, the estimated power consumption is reduced by 9.4%.

Effective Resource Utilization

Pipelining

Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value.
(b) Parallel and serial implementations of an adder tree.

Associativity Transformation

Interlaced Accumulation Programming for Low Power

Associativity Transformation

FIR Parallelization
Mahesh Mehendale, Sunil D. Sherlekar, G. Venkatesh, "Low-Power Realization of FIR Filters on Programmable DSP's," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, December 1998.

FIR PARALLELIZATION

FIR Filter Parallelization

FIR parallelization: two working phases

IIR filter recursive function

Recursive Function

Interlaced Accumulation Programming for Low Power

Optimizing Power using Transformation

Data-flow based transformations
– Tree-height reduction.
– Constant and variable propagation.
– Common subexpression elimination.
– Code motion.
– Dead-code elimination.
– Application of algebraic laws such as commutativity, distributivity and associativity.
– Most of the parallelism in an algorithm is embodied in the loops, hence loop transformations: loop jamming, partial and complete loop unrolling, strength reduction, loop retiming and software pipelining.
– Retiming: maximize the resource utilization.

Tree-height reduction
– Example of tree-height reduction using commutativity and associativity.
– Example of tree-height reduction using distributivity.
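The figures for these two examples are not part of the transcript. As a hedged illustration, the sketch below uses two expressions chosen for this purpose (a + b*c + d and a*(b*c*d + e)); they are assumptions, not necessarily the ones on the slide. Each temporary is annotated with its level in the expression tree.

```c
/* Commutativity and associativity: a + b*c + d. */
long sum_chain(long a, long b, long c, long d)
{
    long t1 = b * c;        /* level 1 */
    long t2 = a + t1;       /* level 2 */
    return t2 + d;          /* level 3 */
}

long sum_balanced(long a, long b, long c, long d)
{
    long t1 = b * c;        /* level 1 */
    long t2 = a + d;        /* level 1, independent of t1 */
    return t1 + t2;         /* level 2 */
}

/* Distributivity: a * (b*c*d + e). */
long prod_nested(long a, long b, long c, long d, long e)
{
    long t1 = b * c;        /* level 1 */
    long t2 = t1 * d;       /* level 2 */
    long t3 = t2 + e;       /* level 3 */
    return a * t3;          /* level 4 */
}

long prod_distributed(long a, long b, long c, long d, long e)
{
    long t1 = a * b;        /* level 1 */
    long t2 = c * d;        /* level 1 */
    long t3 = a * e;        /* level 1 */
    long t4 = t1 * t2;      /* level 2 */
    return t4 + t3;         /* level 3 */
}
```

The reduced trees are shorter (3 to 2 levels, and 4 to 3 levels), but the distributed form uses one more multiplication, so the shorter critical path is paid for with extra operations.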

Sub-expression elimination
Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.
Arithmetic expressions:
– Search isomorphic patterns in the parse trees.
– Example: a = x + y; b = a + 1; c = x + y;
– becomes: a = x + y; b = a + 1; c = a;

Examples of other transformations
Dead-code elimination:
– a = x; b = x + 1; c = 2 * x;
– a = x; can be removed if a is not referenced.
Operator-strength reduction:
– a = x^2; b = 3 * x;
– a = x * x; t = x << 1; b = x + t;
Code motion:
– for (i = 1; i < a * b) { }
– t = a * b; for (i = 1; i < t) { }
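The strength-reduction and code-motion examples above can be written out as a small sketch. Unsigned types are an assumption made here so that the shift and any wrap-around are well defined in C; the function names and the loop-counting body are illustrative.

```c
#include <stdint.h>

/* Operator-strength reduction: b = 3 * x becomes a shift and an add. */
uint32_t triple_mul(uint32_t x)
{
    return 3u * x;
}

uint32_t triple_shift_add(uint32_t x)
{
    uint32_t t = x << 1;    /* t = 2 * x */
    return x + t;           /* x + 2 * x = 3 * x */
}

/* Code motion: the loop-invariant product a * b is hoisted out of the test. */
unsigned count_naive(unsigned a, unsigned b)
{
    unsigned iters = 0;
    for (unsigned i = 1; i < a * b; i++)    /* a * b re-evaluated every pass */
        iters++;
    return iters;
}

unsigned count_hoisted(unsigned a, unsigned b)
{
    unsigned t = a * b;                     /* evaluated once */
    unsigned iters = 0;
    for (unsigned i = 1; i < t; i++)
        iters++;
    return iters;
}
```

An optimizing compiler will often perform both rewrites on its own; writing them by hand mainly matters when targeting hardware synthesis or a compiler with limited optimization.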

Control-flow based transformations
Model expansion:
– Expand subroutines to flatten the hierarchy.
– Useful to expand the scope of other optimization techniques.
– Problematic when a routine is called more than once.
– Example: x = a + b; y = a * b; z = foo(x, y); with foo(p, q) { t = q - p; return (t); }
– By expanding foo: x = a + b; y = a * b; z = y - x;
Conditional expansion:
– Transform a conditional into parallel execution with the test at the end.
– Useful when the test depends on late signals.
– May preclude hardware sharing.
– Always useful for logic expressions.
– Example: y = ab; if (a) x = b + d; else x = bd;
– can be expanded to x = a(b + d) + a'bd, which, with y = ab, simplifies to x = y + d(a + b).
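Both examples can be checked with a short sketch. The logic expressions are modeled here with C booleans (an assumption: product is AND, sum is OR), and the wrapper function names are hypothetical.

```c
#include <stdbool.h>

/* Model expansion: foo is the subroutine from the example. */
static int foo(int p, int q)
{
    int t = q - p;
    return t;
}

int z_with_call(int a, int b)
{
    int x = a + b;
    int y = a * b;
    return foo(x, y);
}

int z_expanded(int a, int b)
{
    int x = a + b;
    int y = a * b;
    return y - x;           /* body of foo substituted at the call site */
}

/* Conditional expansion on 1-bit signals. */
bool x_conditional(bool a, bool b, bool d)
{
    bool x;
    if (a)
        x = b || d;         /* x = b + d */
    else
        x = b && d;         /* x = bd */
    return x;
}

bool x_expanded(bool a, bool b, bool d)
{
    bool y = a && b;                /* y = ab */
    return y || (d && (a || b));    /* x = y + d(a + b) */
}
```

In the expanded form there is no branch: x is computed directly from a, b and d, and the product y = ab is shared between the two outputs.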

Strength reduction

Strength Reduction

DIGLOG multiplier

                       1st iter.   2nd iter.   3rd iter.
Worst-case error       -25%        -6%         -1.6%
Prob. of error < 1%    10%         70%         99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
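The slide does not spell out the algorithm. The sketch below assumes the leading-one decomposition A = 2^j + A_R and B = 2^k + B_R, so that A*B = 2^(j+k) + 2^j*B_R + 2^k*A_R + A_R*B_R: each iteration adds the three shift-only terms and passes the residual product A_R*B_R on to the next iteration. The function names are hypothetical, and operands are assumed to fit in 16 bits so that the result fits in 32 bits.

```c
#include <stdint.h>

/* Bit position of the leading one of v (v must be nonzero). */
static unsigned leading_one(uint32_t v)
{
    unsigned pos = 0;
    while (v >>= 1)
        pos++;
    return pos;
}

/* Approximate a * b using at most max_iters iterations; the result is
 * exact as soon as one of the residuals becomes zero. */
uint32_t diglog_mul(uint32_t a, uint32_t b, unsigned max_iters)
{
    uint32_t acc = 0;
    for (unsigned it = 0; it < max_iters && a != 0 && b != 0; it++) {
        unsigned j = leading_one(a);
        unsigned k = leading_one(b);
        uint32_t ar = a - (UINT32_C(1) << j);   /* residual A_R */
        uint32_t br = b - (UINT32_C(1) << k);   /* residual B_R */
        /* 2^(j+k) + 2^j * B_R + 2^k * A_R: shifts and adds only. */
        acc += (UINT32_C(1) << (j + k)) + (br << j) + (ar << k);
        a = ar;                                 /* next iteration handles A_R * B_R */
        b = br;
    }
    return acc;
}
```

Under this assumption, for 255 x 255 = 65025 the first three iterations give 48896, 61056 and 64064, that is, errors of about -24.8%, -6.1% and -1.5%, which is in line with the worst-case row of the table.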

Logarithmic Number System --> Significant Strength Reduction