What are the characteristics of DSP algorithms?


What are the characteristics of DSP algorithms? M. Smith and S. Daeninck

DSP Introduction, M. Smith, ECE, University of Calgary, Canada
Tackled today
- What are the basic characteristics of a DSP algorithm?
- Information on the TigerSHARC arithmetic, multiplier and shifter units
- Practice examples of C++ to assembly code conversion

IEEE Micro Magazine Article
- "How RISCy is DSP?", Smith, M.R., IEEE Micro, Volume 12, Issue 6, Dec. 1992, pages 10-23
- Available online via the library "Electronic web links"
- Copy placed on the ENCM515 web site
- Make sure you read it before the midterm

Characteristics of an FIR algorithm
- Involves one of the three basic types of DSP algorithms: FIR (Type 1), IIR (Type 2) and FFT (Type 3)
- Representative of DSP equations found in filtering, convolution, correlation (Lab) and modeling
- Multiplication / addition intensive
- Simple format within a (long) loop
- Many memory fetches of fixed and changing data
- Handles an "infinite amount of input data": needs a FIFO buffer when handling ON-LINE data
- All calculations "MUST" be completed in the time interval between samples
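
The requirement that all calculations finish between samples can be turned into a rough cycle budget. The sketch below is only an illustration: the helper names are hypothetical, and the clock rate, sample rate and per-tap cost a caller passes in are assumed figures, not values taken from the slides.

```c
#include <assert.h>

/* Cycles available between two consecutive samples.
 * clock_hz and sample_rate_hz are assumed example inputs. */
unsigned long cycles_per_sample(unsigned long clock_hz,
                                unsigned long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;   /* integer division rounds down */
}

/* Returns 1 if an FIR with the given number of taps fits in the budget,
 * 0 otherwise. cycles_per_tap and overhead_cycles are rough, assumed
 * costs supplied by the caller. */
int fir_fits_budget(unsigned long taps,
                    unsigned long cycles_per_tap,
                    unsigned long overhead_cycles,
                    unsigned long budget_cycles)
{
    unsigned long needed = taps * cycles_per_tap + overhead_cycles;
    return needed <= budget_cycles;
}
```

For example, an assumed 250 MHz clock and 48 kHz sample rate leave 5208 cycles per sample, which is comfortable for a 100-tap loop at a couple of cycles per tap but not for a 3000-tap one.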

FIR
- Input value must be stored in a circular buffer
- Filter operation must be performed on the circular buffer
- For operational efficiency, note that the latest value is the "last in the array"
    Xarray = {Xm-1, Xm-2, Xm-3, … X1, X0}
    Harray = {Hm-1, Hm-2, Hm-3, … H1, H0}

FIR: COMMON MISTAKE = MUCH WASTED LAB TIME
- For operational efficiency, note that the latest value is the "last in the array"
    Xarray = {Xm-1, Xm-2, Xm-3, … X1, X0}
    Harray = {Hm-1, Hm-2, Hm-3, … H1, H0}
- Can work with the latest value "first in the array" when doing C++, but this does not work for assembly code optimization
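
A minimal C sketch of a software circular buffer may help make the ordering concrete. It assumes, as the FIR slides suggest, that the coefficient at index 0 pairs with the oldest sample and the last coefficient pairs with the newest. The type and function names and the tiny buffer length are hypothetical choices for illustration; this is not the TigerSHARC hardware circular-buffer mechanism.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_TAPS 3   /* tiny, assumed length; real filters use 100+ taps */

typedef struct {
    float  x[FIR_TAPS];  /* circular input buffer */
    size_t newest;       /* index of the most recent sample */
} CircularFir;

void circular_fir_init(CircularFir *f)
{
    assert(f != NULL);
    for (size_t i = 0; i < FIR_TAPS; i++) {
        f->x[i] = 0.0f;
    }
    f->newest = FIR_TAPS - 1;
}

/* Store one new sample, then compute one output.
 * h[0] multiplies the oldest sample, h[FIR_TAPS - 1] the newest. */
float circular_fir_step(CircularFir *f, const float h[FIR_TAPS], float input)
{
    assert(f != NULL && h != NULL);

    /* Overwrite the oldest slot; nothing is copied or shifted. */
    f->newest = (f->newest + 1) % FIR_TAPS;
    f->x[f->newest] = input;

    float sum = 0.0f;
    size_t idx = (f->newest + 1) % FIR_TAPS;   /* position of the oldest sample */
    for (size_t k = 0; k < FIR_TAPS; k++) {
        sum += f->x[idx] * h[k];               /* multiply and accumulate */
        idx = (idx + 1) % FIR_TAPS;
    }
    return sum;
}
```

The point of the design is that the buffer update costs one store and one index update, instead of moving every sample on each call.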

FIR
    X[N - 1] = NewInputValue;          -- Into last place of input buffer
    Sum = 0;
    For (count = 0 to N - 1)           -- N of size 100+
        Xvalue = X[count];
        Hvalue = H[count];
        Product = Xvalue * Hvalue;
        Sum = Sum + Product;           -- Multiply and Accumulate (MAC)
    NewOutputValue = Sum;
    -- Update buffer: the T-operation in the picture
    For (count = 1 to N - 1)           -- Discard oldest X[0]
        X[count - 1] = X[count];
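
The FIR pseudocode on this slide translates almost line for line into C. The sketch below is one possible rendering, with a hypothetical function name; it keeps the slide's convention that the newest sample sits in the last array element and that the buffer is shifted after each output.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an n-tap FIR, following the slide's pseudocode.
 * x[] holds the last n inputs with the newest in x[n - 1]. */
float fir_shift_step(float x[], const float h[], size_t n, float new_input)
{
    assert(x != NULL && h != NULL && n > 0);

    x[n - 1] = new_input;                 /* into last place of input buffer */

    float sum = 0.0f;
    for (size_t count = 0; count < n; count++) {
        sum += x[count] * h[count];       /* multiply and accumulate (MAC) */
    }

    /* Update buffer: discard the oldest x[0] and move the rest down one. */
    for (size_t count = 1; count < n; count++) {
        x[count - 1] = x[count];
    }
    return sum;
}
```

Note that the update loop touches every element, so for long filters the buffer shift costs about as much as the MAC loop itself.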

Comparing IIR and FIR filters
- Infinite Impulse Response (IIR) filters: few operations to produce output from input for each IIR stage; 3 to 7 stages
- Finite Impulse Response (FIR) filters: many operations to produce output from input. The long FIFO buffer may require as many operations as the FIR calculation itself. Easy to optimize

IIR: Biquad
    For (Stages = 0 to 3) Do
        S0 = Xin * H5 + S2 * H3 + S1 * H4
        Yout = S0 * H0 + S1 * H1 + S2 * H2
        S2 = S1
        S1 = S0
- This second solution gives a DIFFERENT result: the order of calculation is different
- The actual output difference depends on how frequently samples are taken relative to how rapidly the signal changes
- CALCULATION SPEED IS DIFFERENT
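
As a rough illustration of the biquad recurrence on this slide, the C sketch below implements one stage using the slide's coefficient names H0 to H5 and states S1, S2. The struct and function names are hypothetical. The cascade helper assumes that each stage has its own coefficients and state and that the output of one stage feeds the next; the slide does not spell this out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float h[6];   /* H0..H5, indexed as on the slide */
    float s1;     /* state S1: S0 from the previous sample */
    float s2;     /* state S2: S0 from two samples ago */
} BiquadStage;

/* One sample through one biquad stage, in the order given on the slide. */
float biquad_step(BiquadStage *st, float xin)
{
    assert(st != NULL);
    float s0   = xin * st->h[5] + st->s2 * st->h[3] + st->s1 * st->h[4];
    float yout = s0 * st->h[0] + st->s1 * st->h[1] + st->s2 * st->h[2];
    st->s2 = st->s1;
    st->s1 = s0;
    return yout;
}

/* Assumed cascade: the output of each stage is the input of the next. */
float biquad_cascade_step(BiquadStage stages[], size_t count, float xin)
{
    assert(stages != NULL);
    float value = xin;
    for (size_t i = 0; i < count; i++) {
        value = biquad_step(&stages[i], value);
    }
    return value;
}
```

Each stage needs only six multiplies per sample, which is the "few operations per stage" property contrasted with the FIR case.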

We need to know how the processor architecture affects speed of calculation
- Register File and Compute Block
- Volatile registers
- Data Summation
- Multiply and Accumulate (MAC)

Register File and COMPUTE Units: Key Points
- DAB: Data Alignment Buffer (special for quad fetches, NOT writes)
- Each block can load/store 4x32 bit registers in a cycle
- 4 inputs to Compute Block, but only 3 outputs to Register Block
- Highly parallel operations UNDER THE RIGHT CONDITIONS

NOTE: DATA PATH ISSUES OF THE X-REGISTER FILE
- 1 output path (128 bit) TO memory
- 2 input paths FROM memory
- 4 output paths (64 bit) TO ALU, multiplier, shifter
- 3 input paths (64 bit) FROM ALU, multiplier, shifter
- NUMBER OF PATHS HAS IMPLICATIONS ON WHAT THINGS CAN HAPPEN IN PARALLEL

Register File: Syntax, Key Points
- Each block has 32x32 bit data registers
- Each register can store 4x8 bit, 2x16 bit or 1x32 bit words
- Registers can be combined into dual or quad groups. These groups can store 8, 16, 32, 40 or 64 bit words
- Register syntax
    XSR3:2  -> 4x16 bit words
    XFR1:0  -> 1x40 bit float
    XR7     -> 1x32 bit word
    XBR3:0  -> 16x8 bit words
    XLR7:6  -> 1x64 bit word
- Slide callouts on the register numbers: "Multiple of 4" (quad groups) and "Multiple of 2" (dual groups)

Register File: BIT STORAGE
- Both 32 bit and 64 bit registers
- 128 bit examples are not shown, but they follow the same pattern

Volatile Data Registers
- Non-preserved during a function call
- Volatile registers: no need to save
- 24 volatile DATA registers in each block: XR0 - XR23, YR0 - YR23
- 2 ALU SUMMATION registers in each block: XPR0, XPR1, YPR0, YPR1
- 5 MAC ACCUMULATE registers in each block: XMR0 - XMR3, YMR0 - YMR3; XMR4, YMR4 are overflow registers
- PR stands for parallel results register
- MR stands for multiplier results register

Arithmetic Logic Unit (ALU)
- 2x64 bit input paths
- 2x64 bit output paths
- 8, 16, 32 or 64 bit addition/subtraction, fixed-point
- 32 or 64 bit logical operations, fixed-point
- 32 or 40 bit floating-point operations
- Can do the same on the Y ALU AT THE SAME TIME
- DAB: Data Alignment Buffer (2x128 bit FIFO), used to align misaligned quad or dual 32 bit data loads

Sample ALU Instruction
- Example of 16 bit addition
    XYSR1:0 = R31:30 + R25:24
- Performs "short" addition in the X and Y Compute Blocks: 8 additions at the same time
    XR1.HH = XR31.HH + XR25.HH
    XR1.HL = XR31.HL + XR25.HL
    XR0.LH = XR30.LH + XR24.LH
    XR0.LL = XR30.LL + XR24.LL
    YR1.HH = YR31.HH + YR25.HH
    YR1.HL = YR31.HL + YR25.HL
    YR0.LH = YR30.LH + YR24.LH
    YR0.LL = YR30.LL + YR24.LL
- .LH, .HH is my notation
- Other additions/subtractions look the same, but use 32 or 8 bits
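
To show what "short" addition means in plain C, the sketch below adds a two-word register pair as four independent 16 bit lanes, so a carry out of one lane never reaches its neighbour. It models one compute block only; the function names are hypothetical, and wrap-around on lane overflow is an assumption of this sketch rather than a statement about the hardware's saturation behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two 32 bit words as two independent 16 bit lanes.
 * Each lane wraps on overflow (assumed behaviour for this sketch). */
static uint32_t add_two_shorts(uint32_t a, uint32_t b)
{
    uint32_t low  = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    uint32_t high = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (high << 16) | low;
}

/* Models one compute block: a dual register holds four 16 bit lanes.
 * Index 0 is the low register of the pair, index 1 the high register. */
void add_short_dual(const uint32_t a[2], const uint32_t b[2], uint32_t out[2])
{
    assert(a != NULL && b != NULL && out != NULL);
    out[0] = add_two_shorts(a[0], b[0]);
    out[1] = add_two_shorts(a[1], b[1]);
}
```

With an ordinary 32 bit add, 0x0001FFFF + 0x00010001 gives 0x00030000; the lane-wise version gives 0x00020000 because the low lane's carry is discarded.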

Sample ALU Instructions
- A neat instruction is the sideways addition (SUM)
- Fixed-point: long word, word, short word, byte (char)
- Floating-point: single, double precision
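
A sideways sum adds the fields packed inside a register rather than adding two registers lane by lane. The C sketch below shows the idea for bytes, summing every byte in an array of 32 bit words. It treats each byte as unsigned, which is an assumption, and the function name is hypothetical; it is not a model of the SUM instruction's exact operand or result formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum all bytes packed in `count` 32 bit words, treating each byte
 * as an unsigned value (assumption for this sketch). */
uint32_t sideways_sum_bytes(const uint32_t words[], size_t count)
{
    assert(words != NULL || count == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        for (unsigned shift = 0; shift < 32; shift += 8) {
            total += (words[i] >> shift) & 0xFFu;   /* pick out one byte */
        }
    }
    return total;
}
```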

PASS is an interesting instruction
    XR4 = R5            // Assignment statement: makes XR4 <- XR5
    XR4 = PASS R5       // Still makes XR4 <- XR5
- BUT USES A DIFFERENT PATH THROUGH THE PROCESSOR
- Sets the ALU flags (so that they can be used for conditional tests)
- PASS instructions can be put in parallel with different instructions than assignments

Example code: parallel operations occurring
- C++
    int x_two = 64, y_two = 16;
    int x_three = 128, y_three = 8;
    int x_four = 128, y_four = 8;
    int x_five = 64, y_five = 16;
    int x_odd = 0, y_odd = 0;
    int x_even = 0, y_even = 0;
    x_odd = x_five + x_three;
    x_even = x_four + x_two;
    y_odd = y_five + y_three;
    y_even = y_four + y_two;
- Assembly
    XR2 = 64;;
    XR3 = 128;;
    XR4 = 128;;
    XR5 = 64;;
    YR2 = 16;;
    YR3 = 8;;
    YR4 = 8;;
    YR5 = 16;;
    XYR1:0 = R5:4 + R3:2;;
    // XR1 = x_odd, XR0 = x_even
    // YR1 = y_odd, YR0 = y_even
- Slide annotation: WRONG SYNTAX
- A nice example of the TigerSHARC: it accomplishes the code in fewer lines than C

Multiplier
- Operates on fixed, floating and complex numbers
- Fixed-point numbers
  - 32x32 bit with 32 or 64 bit results
  - 4 (16x16 bit) with 4x16 or 4x32 bit results
  - Data compaction: inputs 16, 32, 64 bits; outputs 16, 32 bit results
- Floating-point numbers
  - 32x32 bit with 32 bit result
  - 40x40 bit with 40 bit result
- COMPLEX numbers
  - 32x32 bit with results stored in MR register
  - FIXED-POINT ONLY
  - Imaginary part is in the MSB part of the 32 bit word

Multiplier
    XR0 = R1*R2;;
    XR1:0 = R3*R5;;
    XMR1:0 = R3*R5;;                    // uses XMR4 overflow
    XR2 = MR3:2, XMR3:2 = R3*R5;;
    XR3:2 = MR1:0, XMR1:0 = R3*R5;;
    XFR0 = R1*R2;;                      // 32 bit mult, 24 bit mantissa
    XFR1:0 = R3:2*R5:4;;                // 40 bit MULTIPLY, 32 bit mantissa, high precision float
- For XR2 = MR3:2, XMR3:2 = R3*R5;; if integer multiply, R2 gets MR2; if fractional, R2 gets MR3
- MR stands for Multiplier results register
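
The remark that an integer multiply takes the low result word while a fractional multiply takes the high one can be illustrated in C with a 64 bit product. The sketch below uses the conventional Q31 fractional format (bits 31 to 62 of the product); that convention is an assumption here, the function names are hypothetical, and the code does not claim to reproduce the multiplier's exact rounding or saturation.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64 bit product of two signed 32 bit values. */
int64_t mul32_full(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Integer view: the useful result is the low 32 bits of the product. */
uint32_t mul32_int_low(int32_t a, int32_t b)
{
    return (uint32_t)((uint64_t)mul32_full(a, b) & 0xFFFFFFFFu);
}

/* Fractional (Q31) view: the useful result sits in the upper part of
 * the product. Truncates; -1.0 * -1.0 is excluded because it overflows. */
int32_t mul32_q31(int32_t a, int32_t b)
{
    assert(!(a == INT32_MIN && b == INT32_MIN));
    uint64_t bits  = (uint64_t)mul32_full(a, b);   /* two's complement bit pattern */
    uint32_t upper = (uint32_t)(bits >> 31);       /* bits 31..62 of the product */
    /* Conversion back to signed relies on two's complement representation,
     * which mainstream compilers provide. */
    return (int32_t)upper;
}
```

For example, 0.5 * 0.5 in Q31 is 0x40000000 * 0x40000000, whose upper part is 0x20000000, that is 0.25.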

Multiplier: with 32 or 16 bit results
- Note minor changes in syntax
    XR5:4 = R1:0*R3:2;;                       (16 bit results)
    XR7:4 = R3:2*R5:4;;                       (32 bit results)
    XMR1:0 += R3:2*R5:4;;                     (16 bit results)
    XMR3:0 += R3:2*R5:4;;                     (32 bit results)
    XR3:2 = MR3:2, XMR3:2 = R1:0*R5:4;;       (16 bit results)  <- one instruction
    XR3:0 = MR3:0, XMR3:0 = R1:0*R5:4;;       (32 bit results)
- 16 bit multiply results can be 16 bit or 32 bit
- Slide colour coding: RED for 16 bit, BLUE for 32 bit
- MR4 contains four overflow bits for every MR register
- No need to tell the instruction if it is a short or normal word

Practice Examples
- Convert from "C" into assembly code; use volatile registers
    long int value = 6;
    long int number = 7;
    long int temp = 8;
    value = number * temp;
- BAD DESIGN OF FLOATING PT CODE WILL INTRODUCE MANY ERRORS; RE-WRITE CODE TO FIX
    float value = 6;
    float number = 7;
    long int temp = 8;
    value = number * temp;

Avoiding common design errors
- Convert from "C" into assembly code; use volatile registers
    float value = 6.0;        (XFR12)
    float number = 7.0;       (XFR13)
    long int temp = 8;        (XR18)
    value = number * temp;    // Treat as value = number * (float) temp;
- Assembly
    XR12 = 6.0;;              // valueF12; sets XFR12 <- 6.0
    XR13 = 7.0;;              // numberF13
    XR18 = 8;;                // tempR18
    XFR18 = FLOAT R18;;       // (float) tempR18
    XFR12 = R13 * R18;;       // valueF12 = numberF13 * tempF18
- XFR23:22 = R21*R22;; is not allowed
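
In C the integer-to-float conversion is implicit, which is what hides the extra FLOAT step needed in assembly. The sketch below writes the conversion out explicitly. The function name is hypothetical, and the range check is an added assumption: it restricts the input to integers that a 32 bit float can represent exactly.

```c
#include <assert.h>

/* value = number * (float) temp, with the conversion made explicit. */
float multiply_mixed(float number, long temp)
{
    /* Integers up to 2^24 in magnitude convert to float without rounding. */
    assert(temp >= -16777216L && temp <= 16777216L);

    float temp_as_float = (float)temp;    /* the step FLOAT performs in assembly */
    return number * temp_as_float;        /* both operands are now floats */
}
```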

Shifter Instructions
- 2x64 bit input paths and 2x64 bit output paths
- 32 or 64 bit shifting operations
- 32 or 64 bit manipulation operations
- FEXT: bit field extraction; FDEP: bit field deposit
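
Bit field extraction and deposit can be expressed in C with shifts and masks. The sketch below is a generic illustration with hypothetical function names and a (position, length) parameter pair chosen for clarity; it does not reproduce the control-word format or sign-extension options of the FEXT and FDEP instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the lowest `len` bits set, for 1 <= len <= 32. */
static uint32_t field_mask(unsigned len)
{
    return (len == 32) ? 0xFFFFFFFFu : (((uint32_t)1 << len) - 1u);
}

/* Extract `len` bits starting at bit `pos`, right-justified. */
uint32_t field_extract(uint32_t word, unsigned pos, unsigned len)
{
    assert(len >= 1 && pos < 32 && pos + len <= 32);
    return (word >> pos) & field_mask(len);
}

/* Deposit the low `len` bits of `field` into `word` at bit `pos`,
 * leaving all other bits of `word` unchanged. */
uint32_t field_deposit(uint32_t word, uint32_t field, unsigned pos, unsigned len)
{
    assert(len >= 1 && pos < 32 && pos + len <= 32);
    uint32_t mask = field_mask(len) << pos;
    return (word & ~mask) | ((field << pos) & mask);
}
```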

Examples: shift only integers
- There is an FSCALE for floats (not shifter)
    long int value = 128;
    long int high, low;
    low = value >> 2;
    high = value << 2;
- POSITIVE VALUE: LEFT SHIFT; NEGATIVE VALUE: RIGHT SHIFT
    XR0 = 2;;
    XR1 = -XR0;;
    XR2 = 128;;
    // low = value >> 2;
    XR23 = ASHIFT XR2 BY -2;;
    // or
    XR23 = ASHIFT XR2 BY XR1;;
    // high = value << 2;
    XR22 = ASHIFT XR2 BY 2;;
    // or
    XR22 = ASHIFT XR2 BY XR0;;
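
C has separate operators for left and right shifts, whereas the slide's ASHIFT takes a signed count. The sketch below wraps both directions in one function that follows the slide's rule (positive count shifts left, negative count shifts right). The function name is hypothetical and the count is limited to the range -31 to 31 as a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift with a signed count:
 * count > 0 shifts left, count < 0 shifts right (sign-preserving). */
int32_t ashift(int32_t value, int count)
{
    assert(count > -32 && count < 32);

    if (count >= 0) {
        /* Shift the unsigned bit pattern to avoid undefined behaviour
         * when bits move into or past the sign position. */
        return (int32_t)((uint32_t)value << count);
    }

    int n = -count;
    if (value < 0) {
        /* Portable arithmetic right shift: shift the complement,
         * then complement again so the vacated bits become ones. */
        return (int32_t)~(~(uint32_t)value >> n);
    }
    return (int32_t)((uint32_t)value >> n);
}
```

With value = 128, a count of -2 gives the slide's low = 32 and a count of 2 gives high = 512.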

ALU instructions
- Under the RIGHT conditions, can do multiple operations in a single instruction
- Instruction line has 4x32 bit instruction slots
- Can do 2 compute and 2 memory operations
- This is actually 4 compute operations, counting both compute blocks
- One instruction per unit of a compute block, i.e. ALU
- Since there are only 3 result buses, only one unit (ALU or Multiplier) can use 2 result buses
- Not all instructions can be used in parallel

Dual Operation Examples
    FRm = Rx + Ry, FRn = Rx - Ry;;
- Note that this uses 4 (8) different registers and not 6 (12)
    FR4 = R2 + R1, FR5 = R2 - R1;;
- The source registers used around the + and - must be the same
- Very useful in FFT code
- Can be floating (single or extended precision) or fixed (32 or 64 bit) add/subtract
    Rm = MRa, MRa += Rx * Ry;;
- MRa must be the same register(s) (MR1:0 or MR3:2)
- Can be used on fixed (32 or 64 bit results)
- COMPLEX numbers (on 16 bit values)
    Rm = MRa, MRa += Rx ** Ry;;
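
The dual add/subtract form produces a sum and a difference from the same two sources, which is the core of an FFT butterfly. A minimal C sketch of that pairing is shown below; the function name is hypothetical, and in C the two results are of course computed one after the other rather than in a single instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Compute x + y and x - y from the same pair of inputs,
 * mirroring FRm = Rx + Ry, FRn = Rx - Ry. */
void add_sub_pair(float x, float y, float *sum, float *diff)
{
    assert(sum != NULL && diff != NULL);
    *sum  = x + y;
    *diff = x - y;
}
```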

Practice Examples: convert to assembly code
- Convert from "C" into assembly code; use volatile registers
    long int value = 6;
    long int number = 7;
    long int temp = 8;
    value = number * temp;
- Assembly
    #define value_XR12 XR12
    // Assignment operation
    value_XR12 = 6;;
    // Multiply operation
    value_XR12 = R5 * R6;;

Avoiding common design errors: convert to assembly code
    float value = 6.0;
    float number = 7.0;
    long int temp = 8;
    value = value + 1;
    number = number + 2;
    temp = value + number;
- Questionable whether XFR12 = 1.0;; is allowed; the assembler complains
- XR23:22 = 10.0;; is not allowed; there may not be an immediate load for 40 bit floats
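
The last statement of this exercise, temp = value + number, hides a float-to-integer conversion that C performs silently. The sketch below makes that step explicit. The function name is hypothetical and the range check is an added assumption, since converting an out-of-range float to an integer is undefined in C.

```c
#include <assert.h>
#include <limits.h>

/* temp = value + number, with the float-to-long conversion made explicit. */
long sum_to_long(float value, float number)
{
    float total = value + number;         /* float addition */

    /* Only convert values that a long can hold. */
    assert(total > (float)LONG_MIN && total < (float)LONG_MAX);

    return (long)total;                   /* conversion truncates toward zero */
}
```

With the slide's numbers, value becomes 7.0 and number becomes 9.0, so temp ends up as 16.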

Tackled today
- What are the basic characteristics of a DSP algorithm?
- Information on the TigerSHARC arithmetic, multiplier and shifter units
- Practice examples of C++ to assembly code conversion