Presentation transcript:

• Send in audio signals and use a sharp FIR filter to pick out 42 Hz and 59 Hz signals and send out warning tones
◦ Try an FIR filter of 256 taps, down-sample and then use an FIR filter of 256 taps – equivalent to 1 FIR filter of 256 * 256 taps with a bandwidth of / 256 * 256 Hz
◦ Use code from Lab 0, Lab 1, assignment 1 as much as possible
• Develop C++ version (show that it fails unless the code is optimized) – Assignment 1
• Modify your Lab 1 assembly code to demonstrate (test and audio) the speed improvement for the following steps
◦ 1) software to hardware loop
◦ 2) parallel dm, pm access, don’t unroll loop
◦ 3) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do the dm, pm accesses in parallel with multiply instructions
◦ 4) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do the dm, pm accesses in parallel with multiply and add instructions
• Remember to provide a resource chart and compare your timing to the expected timing

• Can the processor meet the requirements?
• Two forms of the code – which one is needed?
◦ Grab one audio value -- process everything before the next individual audio sample
◦ Grab one audio block -- collect the next audio block and process the last audio block before the next audio block is collected
• Real life – worst case
◦ Each channel needs tap FIR filters
◦ Total channels – 42 Hz + harmonics, 19 Hz plus harmonics (19 * 3 = 57 Hz) – say 8 channels
◦ Need to generate audio warning signals
◦ Modify FIR filter coefficients to follow the signals – might not be constant frequency
• Do the best-case timing analysis to see whether the algorithm works

• Similarity between one signal and another, and at what locations the similarity occurs
• Have a heart beat signal 000ABcD0000
• Have a signal from patient running ABcD ABcD ABcD0000
• Use 0000DcAB0000 as coefficients in FIR filter
Patient signal: ABcD ABcD ABcD ABcD
Filter output as the pattern 000ABcD slides along the signal: minimum filter output, then some output, max output, less output, and max again (000ABcD0000)
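The sliding comparison described above can be sketched in C++. This is only an illustration: the function names `SlidingCorrelation` and `BestLag` are made up here, and the numeric values standing in for A, B, c and D in the usage note are assumptions, not values from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slide the pattern along the signal and record the sum of products at
// each lag. This is an FIR operation whose coefficients are the pattern
// values (time-reversed, which is why the slide lists DcAB).
std::vector<double> SlidingCorrelation(const std::vector<double>& signal,
                                       const std::vector<double>& pattern) {
    assert(!pattern.empty() && pattern.size() <= signal.size());
    std::vector<double> out(signal.size() - pattern.size() + 1, 0.0);
    for (std::size_t lag = 0; lag < out.size(); ++lag) {
        for (std::size_t i = 0; i < pattern.size(); ++i) {
            out[lag] += signal[lag + i] * pattern[i];
        }
    }
    return out;
}

// Index of the largest output: the lag where the pattern matches best.
std::size_t BestLag(const std::vector<double>& corr) {
    assert(!corr.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < corr.size(); ++i) {
        if (corr[i] > corr[best]) best = i;
    }
    return best;
}
```

With an assumed beat of {1, 2, -1, 3} placed three samples into a block of zeros, the largest output appears at lag 3, which is where the beat starts.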

• Draw a picture of the situation
• Known signal sent to ultrasound transmitter A
• Noisy signal picked up at receiver B
◦ Correlate the received signal against the known signal (cross-correlation) to get the best estimate of the delay
• Known signal sent to ultrasound transmitter B
• Noisy signal picked up at receiver A
◦ Correlate the received signal against the known signal (cross-correlation) to get the best estimate of the delay
◦ Differences in delay time are related to the speed of the air in the mine shaft
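A minimal sketch of the delay estimate, assuming noise-free integer test data and that the delay is measured in whole samples. The names `EstimateDelay` and `DelayDifference` are hypothetical, and the conversion from delay difference to air speed is not given in the slides, so it is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lag (in samples) at which the known signal best lines up with the
// received signal, found by taking the largest sum of products.
std::size_t EstimateDelay(const std::vector<double>& known,
                          const std::vector<double>& received) {
    assert(!known.empty() && known.size() <= received.size());
    std::size_t bestLag = 0;
    double bestSum = 0.0;
    for (std::size_t lag = 0; lag + known.size() <= received.size(); ++lag) {
        double sum = 0.0;
        for (std::size_t i = 0; i < known.size(); ++i) {
            sum += known[i] * received[lag + i];
        }
        if (lag == 0 || sum > bestSum) {
            bestSum = sum;
            bestLag = lag;
        }
    }
    return bestLag;
}

// Difference between the delays in the two directions (may be negative).
long DelayDifference(std::size_t delayAtoB, std::size_t delayBtoA) {
    return static_cast<long>(delayAtoB) - static_cast<long>(delayBtoA);
}
```

Running the estimate once for A to B and once for B to A gives the two delays whose difference the last bullet refers to.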

• Simplest step up from doing examples exactly the same as lab examples
• Many standard formats
• Complex array – real and imaginary
• Components stored alternately in memory: R1, I1, R2, I2, R3, I3 … – access using dm(IX, MdmX) where MdmX = 2
• Components stored in alternate blocks: R1, R2, R3, … I1, I2, I3 – access using dm(I1X, MdmP1) and dm(I2X, MdmP1), or access using dm(IdmX, MdmP1) and pm(IpmX, MpmP1), where MdmP1 and MpmP1 are set to +1 by the compiler
• Speed depends on the format used and what you are doing with the values
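The two storage formats can be compared with a short C++ sketch. `SplitComplex` and `Deinterleave` are names invented for this illustration; the stride of 2 plays the role of the modify value MdmX = 2, and the two output blocks are each stepped with a stride of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block format: all real parts in one array, all imaginary parts in another.
struct SplitComplex {
    std::vector<double> re;
    std::vector<double> im;
};

// Convert the alternating format R1, I1, R2, I2, ... into two blocks.
SplitComplex Deinterleave(const std::vector<double>& interleaved) {
    assert(interleaved.size() % 2 == 0);
    SplitComplex out;
    for (std::size_t i = 0; i < interleaved.size(); i += 2) {  // stride 2
        out.re.push_back(interleaved[i]);      // real part
        out.im.push_back(interleaved[i + 1]);  // matching imaginary part
    }
    return out;
}
```

The block format is the one that lets the real array sit in dm and the imaginary array in pm, which matters for the dual-access steps later in the deck.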

complex CalculateComplexCorrelation(complex firstArray[ ], int numPts, int offset) {
    complex correlation = 0 + j0;   // Missing piece of code
    for (int k = 0; k < numPts - offset; k++) {
        // Could be other forms of the algorithm
        // This is more “autocorrelation” – comparing signal to itself
        // Would work best when information of interest is in the centre of the signals
        correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    }
    return correlation;
}
Repeat many times along firstArray for different offsets
• Auto-correlation, cross-correlation and convolution are all equivalent to FIR operations where the FIR coefficients are data values rather than fixed values
• How do you return a complex value? Don’t know
◦ Two choices – in R0 (real part) and R1 (imaginary part)
◦ More likely (another exam) – switch to SIMD mode and use R0 and S0
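For checking results on a PC, the routine above can be written as a runnable sketch with the standard library. This swaps the slide's `complex` type and `Conjugate` for `std::complex<double>` and `std::conj`, and takes the point count from the vector size. Both are assumptions of this sketch, not the course's code.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Correlate a complex signal with itself at one offset.
std::complex<double> CalculateComplexCorrelation(
        const std::vector<std::complex<double>>& firstArray, int offset) {
    const int numPts = static_cast<int>(firstArray.size());
    assert(offset >= 0 && offset <= numPts);
    std::complex<double> correlation(0.0, 0.0);
    for (int k = 0; k < numPts - offset; k++) {
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

Calling it in a loop over `offset` gives the "repeat many times for different offsets" step.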

• There is absolutely no point trying to optimize a loop that calls a subroutine / function
◦ Reason: the cost of setting up the subroutine call (handling incoming parameters and return values) and of jumping in and out of the subroutine
• Question reminded you of this
◦ Assume that the Conjugate function is in-lined for speed
◦ That means you need to go and write out the equation with in-lined code

• Enter and exit CalculateCorrelation( ) – 20 cycles
• Set up pointers (inpar_Rx to Ix) – 30 cycles
• Set up and use hardware loop – 20 cycles
• Set up sum < 10 cycles
• So basically timing is (numPts – offset) * loop body count
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
realCorrelation + j imagCorrelation = realCorrelation + j imagCorrelation + (a + jb) * (c – jd)   -- second value is read in as c + jd
or
RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)

RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)
Means two sets of calculations
    RC_RX0 = RC_RX0 + a*c + b*d        (RX0 does not mean R0)
and
    IC_RX1 = IC_RX1 + (-a*d + c*b)
Looks like 8 memory accesses per tap (point): fetch a, b, c, d TWICE
Actually could optimize to 4 fetches and reuse a, b, c, d, IF there are enough registers to store the fetched values and do all the calculations once we unroll the loop and have to cope with memory access delays
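The two real-valued updates can be checked in plain C++ before any assembly is written. A minimal sketch, assuming the real and imaginary parts are held in separate arrays (the block format) and using the made-up names `ComplexSum` and `CorrelateSplit`. Each tap fetches a, b, c, d once and reuses them in both updates.

```cpp
#include <cassert>
#include <vector>

struct ComplexSum {
    double re;  // RC
    double im;  // IC
};

ComplexSum CorrelateSplit(const std::vector<double>& re,
                          const std::vector<double>& im, int offset) {
    assert(re.size() == im.size());
    const int numPts = static_cast<int>(re.size());
    assert(offset >= 0 && offset <= numPts);
    double RC = 0.0;
    double IC = 0.0;
    for (int k = 0; k < numPts - offset; k++) {
        // Four fetches per tap, each value used twice below
        const double a = re[k];
        const double b = im[k];
        const double c = re[k + offset];
        const double d = im[k + offset];
        RC = RC + a * c + b * d;
        IC = IC + (-a * d + c * b);
    }
    return {RC, IC};
}
```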

Reference sheet says
MULTIFUNCTION COMPUTE OPERATION
On certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
So when doing RC_RX0 = RC_RX0 + a*c + b*d
• bring a and b into F(0,1,2,3); bring c and d into F(4,5,6,7)
• store a * c result into F(8,9,10,11) and store b * d result into F(12,13,14,15)
• store a * c + b * d result into F(8,9,10,11), which would work if RC_RX0 was in F(12,13,14,15)
Questions to answer
1) Why?
2) How do we handle IC_R1 = IC_R1 + (-a*d + c*b) given the way the registers were being used by the RC_RX0 = RC_RX0 + a*c + b*d calculations?

RC_R0 = RC_R0 + a*c + b*d
and
IC_R1 = IC_R1 + (-a*d + c*b)
• Looks like 8 memory accesses per tap (point), but actually could optimize to 4 and reuse (IF there are enough registers)
• 4 multiplies and 4 adds
• Can (if we switch into SIMD mode) do 2 multiplications + 2 adds + 4 memory accesses per cycle
• 2 cycles needed in SIMD mode
• Time: 2 * Numpoints / 500 us < 50% of 10 us (at 96 kHz)
• Will work provided Numpoints < 5000 / 4
• Problem to solve if working in SIMD mode – make sure that we don’t end up with a in register R1 and c in register S1, because then we can’t multiply them together
• Could we unroll the loop so that we do the first dm, pm fetch into R1 and R4 and have SIMD do the (hidden) second dm, pm fetch into S1 and S4?
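The budget arithmetic above can be reproduced in a few lines. This sketch assumes the "/ 500" in the slide means 500 processor cycles per microsecond and uses the slide's rounded 10 us sample period. The function names and default parameters are hypothetical.

```cpp
#include <cassert>

// Loop cost from the slide: 2 cycles per point in SIMD mode.
long LoopCycles(long numPoints) {
    assert(numPoints >= 0);
    return 2 * numPoints;
}

// True when the loop fits inside the given share of one sample period.
// Integer arithmetic avoids rounding: cycles * 100 < budget in cycles * percent.
bool FitsBudget(long numPoints, long cyclesPerUs = 500,
                long samplePeriodUs = 10, long budgetPercent = 50) {
    return LoopCycles(numPoints) * 100 <
           cyclesPerUs * samplePeriodUs * budgetPercent;
}
```

Under these assumptions the limit is the same as the slide's: the loop fits when Numpoints is below 5000 / 4 = 1250.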

• Even the simplest problem is essentially impossible to translate in the time available – that is why I say GPA A- starts around 80%
• You need to demonstrate that
◦ You know what you need to do, so that if you had enough time you could complete it
◦ Really key – you are able to use this knowledge to check that the compiler was doing a good job
• 15 marks split across the following (16 as first error is free)
1. REALLY KEY – Design the code before translating it
2. Format of assembly language code and course coding requirements
3. Demonstrate understanding of parameter passing and return – in R registers
4. Need to save and recover registers – know what is volatile and what is not
5. KEY -- Need to move passed pointers (in R registers) into I registers
6. How to set up arrays to allow simultaneous dm, pm access
7. Hardware / software loop differences
8. KEY -- Post-modify and pre-modify difference
9. KEY -- Using F registers when doing mults and adds in multi-function mode
10. Complex number theory and format on DSP processors

#include
// How do you return a complex value? Don’t know
// Two choices – in R0 (real part) and R1 (imaginary part)
// more likely (Midterm 2) switch to SIMD mode and use R0 and S0
.section seg_pmco;
.global _CalculateComplexCorrelation__NM;
_CalculateComplexCorrelation__NM:

complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {

R0, R1 for return values (pretend)
4 parameters in – very complex as using stack operations
Fake by pretending R4 and R16 (dm and pm pointer), R8, R12 – then move R16 into real register Rx
R16 is not real, it is a fake – the real code would look like Rx = dm(2, SP) – but why learn that when you could cut-and-paste from a C++ code example

// correlation = 0 + j0
corrReal_F0 = 0.0;
corrImag_F1 = 0.0;
maxLoop_R8 = numPts_R8 - offset_R12;    // This sets Z, N flags
if LE JUMP END;                         // no DB
// set up pointers
realPt_I4 = inPar_R4;
imagPt_I12 = inPar_R16;
// Want to handle offset into arrays easily
Save I5 and I13 to stack
// need more R registers
Save R3, R6, R7, R9, R10
inParR4Offset_R4 = inPar_R4 + offset_R12;
inParR4Offset_R5 = inPar_R5 + offset_R12;
realPtOffset_I5 = inParR4Offset_R4;
imagPtOffset_I13 = inParR4Offset_R5;
// Do a code review and fix the minor bug
// There are other ways of doing this using modify registers

set up loop using R8 – information should be on reference sheet
for (int k = 0; k < numPts - offset; k++) {

Saving the I registers would look something like this
Modify(SP, 3);
R0 = I5;            // Can’t save Ix directly to memory
dm(1, SP) = R0;
R0 = I13;           // Can’t save Ix directly to memory
dm(2, SP) = R0;
// Also there is no pm stack implemented

// correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
// Use math explained above
// I am just writing code – not trying to optimize
// Read real part of one and imaginary part of other
firstReal_R6 = dm(realPt_I4, DMPLUS1), secondImag_R10 = pm(imagPtOffset_I13, PMPLUS1);
secondReal_R9 = dm(realPtOffset_I5, DMPLUS1), firstImag_R7 = pm(imagPt_I12, PMPLUS1);
// Valid code BUT these instructions ARE NOT executed in parallel – wrong syntax, wrong registers for multi-function
// real update
temp_F2 = F6 * F9; temp_F3 = F7 * F10; real_F0 = F0 + F2; real_F0 = F0 + F3;
// imag update
temp_F2 = F6 * F10; temp_F3 = F7 * F9; imag_F1 = F1 - F2; imag_F1 = F1 + F3;
// Less documented temp registers used and discarded quickly – okay under exam conditions

END:
Recover registers in reverse order R10, R9, R7, R6, R3
Values already in R0 and R1
5 magic lines to return to C
} return correlation; (R0 and R1)

• Demonstrate unroll loop – unroll 2 * p times
◦ Unrolling allows us to move (make parallel) parts of the first set of operations and the second set of operations
◦ In real life – may unroll up to 8 times to find parallel operations – demonstrate concept in midterm (time)
• If switching to SIMD -- unroll 4 * p times
• Write the optimization design using C++ syntax
◦ Don’t switch to assembly code until VERY last moments
◦ Write in the simplest possible version of C
• Concentrate on the loop as that is where we get the speed

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Problem 1 – Can’t switch to SIMD mode if k + offset is not divisible by 2
• SIMD mode does R0 = dm[2 * x] and S0 = dm[2 * x + 1]
• Meaning it can do the dual fetch dm[1000], dm[1001], but not dm[1001], dm[1002]
• Means our speed estimate is out by a factor of 2 since we can’t switch to SIMD mode – or if we do switch -- code must become more complex – so don’t switch to SIMD

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
If (numPts - offset) is even then unrolled code becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Else
for (int k = 0; k < numPts - offset - 1; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
k = numPts - offset - 1;
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
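The even and odd cases above can be folded into one runnable sketch. As before, `std::complex<double>` and `std::conj` stand in for the slide's `complex` and `Conjugate`, and the name `CorrelateUnrolled2` is made up for this illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using Complex = std::complex<double>;

// Loop unrolled 2 times, with one leftover point handled after the loop
// when (numPts - offset) is odd.
Complex CorrelateUnrolled2(const std::vector<Complex>& x, int offset) {
    const int numPts = static_cast<int>(x.size());
    assert(offset >= 0 && offset <= numPts);
    const int count = numPts - offset;
    const int evenCount = count - (count % 2);  // largest even value <= count
    Complex correlation(0.0, 0.0);
    for (int k = 0; k < evenCount; k = k + 2) {
        correlation += x[k] * std::conj(x[k + offset]);
        correlation += x[k + 1] * std::conj(x[k + offset + 1]);
    }
    if (count % 2 != 0) {
        const int k = count - 1;  // the point the unrolled loop did not reach
        correlation += x[k] * std::conj(x[k + offset]);
    }
    return correlation;
}
```

Comparing its output against the simple one-point-per-pass loop for both an even and an odd point count is a quick check that the unrolling has not changed the answer.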

correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);

correlation = correlation + (a[k] + jb[k]) * (a[k + offset] - jb[k + offset]);
correlation = correlation + (a[k + 1] + jb[k + 1]) * (a[k + offset + 1] - jb[k + offset + 1]);

Look at real part only -- use
correlationRe = correlationRe + (a[k] * a[k + offset]) + (b[k] * b[k + offset])
correlationRe = correlationRe + (a[k + 1] * a[k + offset + 1]) + (b[k + 1] * b[k + offset + 1])

Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = Temp1 * Temp2;
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = Temp4 * Temp5;
corrRe = corrRe + Mult3;
corrRe = corrRe + Mult6;
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = Temp11 * Temp12;
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = Temp14 * Temp15;
corrRe = corrRe + Mult13;
corrRe = corrRe + Mult16;
• Note register renaming – use this approach in case there are unexpected timing delays, then can interlink the 2 unrolls
• Plan to put imag array on pm access

Use this order because of instruction format
On certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
Resource chart columns: Other | Mult | add | DM | PM
Operations to place in the chart:
Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = Temp1 * Temp2;
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = Temp4 * Temp5;
corrRe = corrRe + Mult3;
corrRe = corrRe + Mult6;
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = Temp11 * Temp12;
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = Temp14 * Temp15;
corrRe = corrRe + Mult13;
corrRe = corrRe + Mult16;

Resource chart with the real-part memory fetches moved into the DM and PM columns
Other | Mult | add | DM | PM
 | | | Temp1 = a[k] | Temp4 = b[k]
 | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
 | Mult3 = Temp1 * Temp2 | | |
 | Mult6 = Temp4 * Temp5 | | |
 | | corrRe = corrRe + Mult3 | |
 | | corrRe = corrRe + Mult6 | |
 | | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
 | | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
 | Mult13 = Temp11 * Temp12 | | |
 | Mult16 = Temp14 * Temp15 | | |
 | | corrRe = corrRe + Mult13 | |
 | | corrRe = corrRe + Mult16 | |

Resource chart with the two unrolls interlinked
Other | Mult | add | DM | PM
 | | | Temp1 = a[k] | Temp4 = b[k]
 | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
 | Mult3 = Temp1 * Temp2 | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
 | Mult6 = Temp4 * Temp5 | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
 | Mult13 = Temp11 * Temp12 | corrRe = corrRe + Mult3 | Imag fetches |
 | Mult16 = Temp14 * Temp15 | corrRe = corrRe + Mult6 | Imag fetches |
 | imag mult 1 | corrRe = corrRe + Mult13 | Imag fetches |
 | imag mult 2 | corrRe = corrRe + Mult16 | Imag fetches |
 | Imag mult 1 | Imag add 1 | |
 | imag mult 1 | Imag add 2 | |
 | | Imag add 1 | |
 | | Imag add 2 | |
Efficiency 8 in 12

Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
Resource chart with register assignments (input registers shown in brackets)
Other | Mult | add | DM | PM
 | | | Temp1 = a[k] | Temp4 = b[k]
 | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
 | Mult3 = Temp1 * Temp2 [F2, F5] | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
 | Mult6 = Temp4 * Temp5 [F3, F6] | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
 | Mult13 = Temp11 * Temp12 [?, ?] | corrRe = corrRe + Mult3 [F0, F0] | |
 | Mult16 = Temp14 * Temp15 [?, ?] | corrRe = corrRe + Mult6 [F0, F0] | |
 | imag mult 1 | corrRe = corrRe + Mult13 | |
 | imag mult 2 | corrRe = corrRe + Mult16 | |
 | | Imag add 1 | |
 | | Imag add 2 | |
• What register for Mult3?
• What register for Temp11?
• Illegal use of F0