Send in audio signals and use a sharp FIR filter to pick out 42 Hz and 59 Hz signals and send out warning tones
◦ Try an FIR filter of 256 taps, down sample, and then use another FIR filter of 256 taps – equivalent to 1 FIR filter of 256 * 256 taps with a bandwidth of / 256 * 256 Hz
◦ Use code from Lab 0, Lab 1, Assignment 1 as much as possible
Develop C++ version (show that it fails unless the code is optimized) – Assignment 1
Modify your Lab 1 assembly code to demonstrate (test and audio) the speed improvement for the following steps
◦ 1) software loop to hardware loop
◦ 2) parallel dm, pm access, don’t unroll loop
◦ 3) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do parallel dm, pm access in parallel with multiply instructions
◦ 4) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do parallel dm, pm access in parallel with multiply and add instructions
Remember to provide a resource chart and compare your timing to the expected timing
Can the processor meet the requirements?
Two forms of the code – which one is needed?
◦ Grab one audio value -- process everything before the next individual audio sample arrives
◦ Grab one audio block – collect the next audio block and process the last audio block before the next audio block has been collected
Real life – worst case
◦ Each channel needs tap FIR filters
◦ Total channels – 42 Hz + harmonics, 19 Hz plus harmonics (19 * 3 = 57 Hz) – say 8 channels
◦ Need to generate audio warning signals
◦ Modify FIR filter coefficients to follow the signals – they might not be at a constant frequency
Do the best case timing analysis to see whether the algorithm works
Similarity between one signal and another, and at what locations the similarity occurs
Have a heart beat signal 000ABcD0000
Have a signal from patient running ABcD ABcD ABcD0000
Use 0000DcAB0000 as coefficients in FIR filter
Slide the heart beat pattern along the signal ABcD ABcD ABcD ABcD
◦ 000ABcD – minimum filter output
◦ 000ABcD – some output
◦ 000ABcD – max output
◦ 000ABcD – less output
◦ 000ABcD0000 – max again
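The sliding comparison on this slide can be sketched in C++. This is a minimal illustration, not the lab code: the function names `slidingCorrelation` and `bestMatch` are made up here, and any numeric values stood in for A, B, c, D are placeholders. Multiplying the signal by the pattern in forward order gives the same sums as an FIR filter whose coefficients are the pattern reversed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slide `pattern` along `signal` and return one correlation value per
// alignment. The arithmetic matches an FIR filter whose coefficients are
// the pattern written in reverse order.
std::vector<double> slidingCorrelation(const std::vector<double>& signal,
                                       const std::vector<double>& pattern) {
    assert(!pattern.empty() && signal.size() >= pattern.size());
    std::vector<double> out(signal.size() - pattern.size() + 1, 0.0);
    for (std::size_t shift = 0; shift < out.size(); ++shift) {
        for (std::size_t n = 0; n < pattern.size(); ++n) {
            out[shift] += signal[shift + n] * pattern[n];
        }
    }
    return out;
}

// Index of the largest output, i.e. the shift where the pattern fits best.
std::size_t bestMatch(const std::vector<double>& corr) {
    assert(!corr.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < corr.size(); ++i) {
        if (corr[i] > corr[best]) best = i;
    }
    return best;
}
```

Calling `bestMatch(slidingCorrelation(signal, pattern))` returns the shift at which the heart beat pattern lines up with the signal; for a repeating signal there will be several near-equal peaks, one per beat.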
Draw a picture of the situation
Known signal sent to ultrasound transmitter A
Noisy signal picked up at receiver B
◦ Correlate the received signal against the known signal to get the best estimate of the delay
Known signal sent to ultrasound transmitter B
Noisy signal picked up at receiver A
◦ Correlate the received signal against the known signal to get the best estimate of the delay
◦ Differences in delay time are related to the speed of the air moving in the mine shaft
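Once the two delays have been estimated, the step from delay difference to air speed can be sketched as below. This is an illustration under stated assumptions, not part of the slides: the sound path is taken to be a straight line of known length running along the air flow, so the A-to-B pulse travels at (sound speed + air speed) and the B-to-A pulse at (sound speed - air speed). The names `FlowEstimate` and `estimateFlow` are hypothetical.

```cpp
#include <cassert>

struct FlowEstimate {
    double airSpeed;    // speed of the air along the A-to-B direction
    double soundSpeed;  // speed of sound in still air
};

// pathLength in metres, delays in seconds, speeds returned in metres/second.
FlowEstimate estimateFlow(double pathLength, double delayAtoB, double delayBtoA) {
    assert(pathLength > 0.0 && delayAtoB > 0.0 && delayBtoA > 0.0);
    const double withFlow = pathLength / delayAtoB;     // sound speed + air speed
    const double againstFlow = pathLength / delayBtoA;  // sound speed - air speed
    return {0.5 * (withFlow - againstFlow), 0.5 * (withFlow + againstFlow)};
}
```

Because the air speed depends on a small difference between two large numbers, the quality of the delay estimates from the correlation step matters a great deal.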
Simplest step up from doing examples exactly the same as lab examples
Many standard formats
Complex array – real and imaginary
◦ Components stored alternately in memory: R1, I1, R2, I2, R3, I3 …
  access using dm(IX, MdmX) where MdmX = 2
◦ Components stored in alternate blocks: R1, R2, R3, … I1, I2, I3
  access using dm(I1X, MdmP1) and dm(I2X, MdmP1)
  or access using dm(IdmX, MdmP1) and pm(IpmX, MpmP1), where MdmP1 and MpmP1 are set to +1 by the compiler
Speed depends on the format used and what you are doing with the values
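The two storage formats can be compared with a short C++ sketch. The type `SplitComplex` and the function `deinterleave` are hypothetical names used only for this illustration; the stride-2 loop plays the role of a modify value of 2, and the two output vectors correspond to the separate real and imaginary blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Real and imaginary parts held in separate blocks: R1, R2, ... and I1, I2, ...
struct SplitComplex {
    std::vector<float> re;
    std::vector<float> im;
};

// Convert the alternating layout R1, I1, R2, I2, ... into separate blocks.
SplitComplex deinterleave(const std::vector<float>& interleaved) {
    assert(interleaved.size() % 2 == 0);
    SplitComplex out;
    for (std::size_t i = 0; i < interleaved.size(); i += 2) {  // step of 2
        out.re.push_back(interleaved[i]);      // real component
        out.im.push_back(interleaved[i + 1]);  // imaginary component
    }
    return out;
}
```

With the split layout, one block could sit in dm and the other in pm so that a real and an imaginary value are fetched in the same cycle; with the alternating layout both components come from the same block.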
complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
  complex correlation = 0 + j0 -- Missing piece of code
  for (int k = 0; k < numPts - offset; k++) {
    // Could be other forms of the algorithm
    // This is more “autocorrelation” – comparing signal to itself
    // Would work best when information of interest is in the centre of the signals
    correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
  }
  return correlation;
}
Repeat many times along firstArray for different offsets
Auto-correlation, cross-correlation and convolution are all equivalent to FIR operations where the FIR coefficients are data values rather than fixed values
// How do you return a complex value? Don’t know
// Two choices – in R0 (real part) and R1 (imaginary part)
// more likely (Another exam) switch to SIMD mode and use R0 and S0
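For reference, a compilable C++ version of the routine might look like the sketch below. It assumes `std::complex<float>` in place of the slide's `complex` type and `std::conj` in place of `Conjugate`; the bounds checks are additions for illustration only.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Correlate the array against a copy of itself shifted by `offset` points.
std::complex<float> CalculateComplexCorrelation(
        const std::vector<std::complex<float>>& firstArray,
        int numPts, int offset) {
    assert(numPts >= 0 && static_cast<std::size_t>(numPts) <= firstArray.size());
    assert(offset >= 0);
    std::complex<float> correlation(0.0f, 0.0f);  // correlation = 0 + j0
    for (int k = 0; k < numPts - offset; k++) {
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

A version like this is useful mainly as a reference: its output for a few small arrays gives known answers to test the assembly code against.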
There is absolutely no point trying to optimize a loop that calls a subroutine / function
◦ The cost of setting up the subroutine call (handling incoming parameters and return values) and jumping in and out of the subroutine swamps anything gained inside the loop
Question reminded you of this
◦ Assume that the Conjugate function is in-lined for speed.
◦ That means you need to go and write out the equation with inlined code
Enter and exit CalculateCorrelation( ) – 20 cycles
Set up pointers inpar_Rx Ix – 30 cycles
Set up and use hardware loop – 20 cycles
Set up sum < 10 cycles
So basically timing is (numPts – offset) * loop body cycle count
correlation = realCorrelation + j imagCorrelation
correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
            = realCorrelation + j imagCorrelation + (a + jb) * (c – jd) -- second value is read in as c + jd
or RC + jIC = RC + jIC + a*c + b*d + j( - a * d + c * b)
RC + jIC = RC + jIC + a*c + b*d + j( - a * d + c * b)
Means -- two sets of calculations
RC_RX0 = RC_RX0 + a*c + b*d (RX0 does not mean R0)
And IC_RX1 = IC_RX1 + ( - a * d + c * b)
Looks like 8 memory accesses per tap (point): fetch a, b, c, d TWICE
Actually could optimize to 4 fetches and reuse a, b, c, d (IF there are enough registers to store the fetched values and do all the calculations if we unroll the loop and have to cope with memory access delays)
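The two running sums can be written out in plain C++ with the Conjugate call in-lined, fetching a, b, c, d once per point and reusing them for both sums. This is a sketch assuming the real and imaginary parts are held in separate `float` arrays; `ComplexSum` and `correlateSplit` are names invented for this example.

```cpp
#include <cassert>

struct ComplexSum {
    float re;  // RC
    float im;  // IC
};

// re[] and im[] hold the real and imaginary parts of the same complex array.
ComplexSum correlateSplit(const float* re, const float* im,
                          int numPts, int offset) {
    assert(re != nullptr && im != nullptr && numPts >= 0 && offset >= 0);
    ComplexSum sum{0.0f, 0.0f};
    for (int k = 0; k < numPts - offset; k++) {
        const float a = re[k];           // four fetches per point,
        const float b = im[k];           // each value reused in both sums
        const float c = re[k + offset];
        const float d = im[k + offset];
        sum.re += a * c + b * d;         // RC = RC + a*c + b*d
        sum.im += -a * d + c * b;        // IC = IC + (-a*d + c*b)
    }
    return sum;
}
```

Each point costs 4 fetches, 4 multiplies and 4 adds, which is the operation count the later timing estimate starts from.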
Reference sheet says
MULTIFUNCTION COMPUTE OPERATION
On certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ=F(0,1,2,3) and FR=F(4,5,6,7)
ALU Compute FN = FX op FY, with FX=F(8,9,10,11), FY=F(12,13,14,15)
So when doing RC_RX0 = RC_RX0 + a*c + b*d
◦ bring a and b into F(0,1,2,3); bring c and d into F(4,5,6,7)
◦ store a * c result into F(8,9,10,11) and store b * d result into F(12,13,14,15)
◦ store a * c + b * d result into F(8,9,10,11), which would work if RC_RX0 was in F(12,13,14,15)
Questions to answer
1) Why?
2) How do we handle IC_R1 = IC_R1 + ( - a * d + c * b) given the way the registers were being used by the RC_RX0 = RC_RX0 + a*c + b*d calculations?
RC_R0 = RC_R0 + a*c + b*d
And IC_R1 = IC_R1 + ( - a * d + c * b)
Looks like 8 memory accesses per tap (point), but actually could optimize to 4 and reuse (IF there are enough registers)
4 multiplies and 4 adds
Can (if we switch into SIMD mode) do 2 multiplications + 2 adds + 4 memory accesses per cycle
◦ 2 cycles needed per point in SIMD mode
◦ Time is 2 * Numpoints / 500 us, which must be < 50% of 10 us (at 96 kHz)
◦ Will work provided Numpoints < 5000 / 4
Problem to solve if working in SIMD mode – make sure that we don’t end up with a in register R1 and c in register S1, because then we can’t multiply them together
Could we -- unroll the loop so we do the first dm, pm fetch into R1 and R4 and have SIMD do the (hidden) second dm, pm fetch into S1 and S4?
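The timing check on this slide can be written as a small calculation. The sketch below assumes the "/ 500" in the slide means 500 processor cycles per microsecond, and uses the slide's rounded 10 us sample period and 50% budget; the names `loopTimeMicroseconds` and `fitsBudget` are hypothetical.

```cpp
#include <cassert>

// Assumption: 500 cycles per microsecond, as implied by "/ 500 us".
constexpr double kCyclesPerMicrosecond = 500.0;

// Time spent in the loop body for numPoints points.
double loopTimeMicroseconds(int numPoints, double cyclesPerPoint) {
    assert(numPoints >= 0 && cyclesPerPoint > 0.0);
    return cyclesPerPoint * numPoints / kCyclesPerMicrosecond;
}

// True when the loop uses less than `fraction` of one sample period.
bool fitsBudget(int numPoints, double cyclesPerPoint,
                double samplePeriodUs, double fraction) {
    return loopTimeMicroseconds(numPoints, cyclesPerPoint)
           < fraction * samplePeriodUs;
}
```

With 2 cycles per point, `fitsBudget(1000, 2.0, 10.0, 0.5)` holds (4 us against a 5 us budget) while 1250 points sits exactly on the limit and fails the strict test. At 4 cycles per point, the figure if SIMD mode cannot be used, the limit drops to half as many points.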
Even the simplest problem is essentially impossible to translate in the time available – that is why I say GPA A- starts around 80%
You need to demonstrate that
◦ You know what you need to do; so that if you had enough time you could complete it
◦ Really key – you are able to use this knowledge to check that the compiler was doing a good job
15 marks split across the following (16 as first error is free)
1. REALLY KEY – Design the code before translating it
2. Format of assembly language code and course coding requirements
3. Demonstrate understanding of parameter passing and return – in R registers
4. Need to save and recover registers – know what is volatile and what is not
5. KEY -- Need to move passed pointers (in R registers) into I registers
6. How to set up arrays to allow simultaneous dm, pm access
7. Hardware / software loop differences
8. KEY -- Post-modify and pre-modify difference
9. KEY -- USING F registers when doing mults and adds in multi-function mode
10. Complex number theory and format on DSP processors
#include
// How do you return a complex value? Don’t know
// Two choices – in R0 (real part) and R1 (imaginary part)
// more likely (Midterm 2) switch to SIMD mode and use R0 and S0
.section seg_pmco;
.global _CalculateComplexCorrelation__NM;
_CalculateComplexCorrelation__NM:
complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
R0, R1 for return values (pretend)
Passing 4 parameters is very complex as it uses stack operations
Fake it by pretending the parameters arrive in R4 and R16 (dm and pm pointers), R8 and R12 – then move R16 into a real register Rx
R16 is not real, it is a fake – the real code would look like Rx = dm(2, SP) – but why learn that when you could cut-and-paste it from a C++ code example
// correlation = 0 + j0
corrReal_F0 = 0.0; corrImag_F1 = 0.0;
maxLoop_R8 = numPts_R8 - offset_R12; // This sets Z, N flags
if LE JUMP END; // no DB
// set up pointers
realPt_I4 = inPar_R4; imagPt_I12 = inPar_R16;
// Want to handle offset into arrays easily
Save I5 and I13 to stack
// need more R registers
Save R3, R6, R7, R9, R10
inParR4Offset_R4 = inPar_R4 + offset_R12;
inParR4Offset_R5 = inPar_R5 + offset_R12;
realPtOffset_I5 = inParR4Offset_R4
imagPtOffset_I13 = inParR4Offset_R5
// Do a code review and fix the minor bug
There are other ways of doing this using modify registers
set up loop using R8 – information should be on reference sheet
for (int k = 0; k < numPts - offset; k++) {
Saving the I registers to the stack would look something like this
Modify(SP, 3);
R0 = I5; // Can’t save Ix directly to memory
dm(1, SP) = R0
R0 = I13; // Can’t save Ix directly to memory
dm(2, SP) = R0
// Also there is no pm stack implemented
// correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
// Use math explained above
// I am just writing code – not trying to optimize
// Read real part of one value and imaginary part of the other
firstReal_R6 = dm(realPt_I4, DMPLUS1), secondImag_R10 = pm(imagPtOffset_I13, PMPLUS1);
secondReal_R9 = dm(realPtOffset_I5, DMPLUS1), firstImag_R7 = pm(imagPt_I12, PMPLUS1);
// Valid code BUT these instructions ARE NOT executed in parallel – wrong syntax, wrong registers for multi-function
// real update
temp_F2 = F6 * F9; temp_F3 = F7 * F10;
real_F0 = F0 + F2; real_F0 = F0 + F3;
// imag update – less documented
temp_F2 = F6 * F10; temp_F3 = F7 * F9;
imag_F1 = F1 - F2; imag_F1 = F1 + F3;
// temp registers used and discarded quickly – okay under exam conditions
END:
Recover registers in reverse order R10, R9, R7, R6, R3
Values already in R0 and R1
5 magic lines to return to C
} return correlation; (R0 and R1)
Demonstrate unroll loop – unroll 2 * p times
◦ Unrolling allows us to move (make parallel) parts of the first set of operations and the second set of operations
◦ In real life – may unroll up to 8 times to find parallel operations – demonstrate the concept in the midterm (time)
If switching to SIMD -- unroll 4 * p times
Write the optimization design using C++ syntax
◦ Don’t switch to assembly code until the VERY last moment
◦ Write in the simplest possible version of C
Concentrate on the loop as that is where we get the speed
for (int k = 0; k < numPts - offset; k++) {
  correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
}
Becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
  correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
  correlation = correlation + firstArray[k + 1] * Conjugate(firstArray [k+offset + 1]);
}
Problem 1 – Can’t switch to SIMD mode if k + offset is not divisible by 2
◦ SIMD mode does R0 = dm[2 * x] and S0 = dm[2 * x + 1]
◦ Meaning it can do the dual fetch dm[1000], dm[1001], but not dm[1001], dm[1002]
◦ Means our speed estimate is out by a factor of 2 since we can’t switch to SIMD mode – or if we do switch -- the code must become more complex – so don’t switch to SIMD
for (int k = 0; k < numPts - offset; k++) {
  correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
}
Becomes
If (numPts - offset) is even then unrolled code becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
  correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
  correlation = correlation + firstArray[k + 1] * Conjugate(firstArray [k+offset + 1]);
}
Else
for (int k = 0; k < numPts - offset - 1; k = k + 2) {
  correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
  correlation = correlation + firstArray[k + 1] * Conjugate(firstArray [k+offset + 1]);
}
k = numPts - offset - 1;
correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
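The even and odd cases can be folded into one compilable routine: run the unrolled loop over the largest even number of points, then handle at most one leftover point. This is a sketch assuming `std::complex<float>` and `std::conj` stand in for the slide's `complex` and `Conjugate`; `correlateUnrolled2` is a hypothetical name.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

std::complex<float> correlateUnrolled2(
        const std::vector<std::complex<float>>& x, int numPts, int offset) {
    assert(numPts >= 0 && static_cast<std::size_t>(numPts) <= x.size());
    assert(offset >= 0);
    const int count = numPts - offset;                        // points to process
    const int pairedCount = (count > 0) ? count - (count % 2) : 0;
    std::complex<float> correlation(0.0f, 0.0f);
    int k = 0;
    for (; k < pairedCount; k = k + 2) {                      // unrolled twice
        correlation += x[k] * std::conj(x[k + offset]);
        correlation += x[k + 1] * std::conj(x[k + offset + 1]);
    }
    if (k < count) {                                          // odd count: one point left
        correlation += x[k] * std::conj(x[k + offset]);
    }
    return correlation;
}
```

Checking that this gives the same answer as the original loop for both an even and an odd value of (numPts - offset) is a cheap test to run before moving the design into assembly.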
correlation = correlation + firstArray[k] * Conjugate(firstArray [k+offset]);
correlation = correlation + firstArray[k + 1] * Conjugate(firstArray [k+offset + 1]);
Written out with real part a and imaginary part b
correlation = correlation + (a[k] + jb[k]) * (a[k + offset] - jb[k + offset]);
correlation = correlation + (a[k + 1] + jb[k + 1]) * (a[k + offset + 1] - jb[k + offset + 1]);
Look at real part only -- use
correlationRe = correlationRe + (a[k] * a[k + offset]) + (b[k] * b[k + offset])
correlationRe = correlationRe + (a[k + 1] * a[k + offset + 1]) + (b[k + 1] * b[k + offset + 1])
Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = temp1 * temp2
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe + Mult6
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe + Mult16
Note register renaming – use this approach in case there are unexpected timing delays, then can interlink the 2 unrolls
Plan to put imag array on pm access
Use this order because of instruction format
On certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ=F(0,1,2,3) and FR=F(4,5,6,7)
ALU Compute FN = FX op FY, with FX=F(8,9,10,11), FY=F(12,13,14,15)
Resource chart columns: Other | Mult | add | DM | PM
Operations to place in the chart
Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = temp1 * temp2
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe + Mult6
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe + Mult16
Resource chart columns: Other | Mult | add | DM | PM
Operations to place in the chart
Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = temp1 * temp2
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe + Mult6
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe + Mult16
Resource chart with the memory fetches placed (real array on DM, imag array on PM)
Other | Mult | add | DM | PM
 |  |  | Temp1 = a[k]; | Temp4 = b[k];
 |  |  | Temp2 = a[k + offset]; | Temp5 = b[k + offset];
 |  |  | Temp11 = a[k + 1]; | Temp14 = b[k + 1];
 |  |  | Temp12 = a[k + offset + 1]; | Temp15 = b[k + offset + 1];
Still to place
Mult3 = temp1 * temp2
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe + Mult6
Mult13 = temp11 * temp12
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe + Mult16
Resource chart with all operations placed (one row per cycle)
Other | Mult | add | DM | PM
 |  |  | Temp1 = a[k]; | Temp4 = b[k];
 |  |  | Temp2 = a[k + offset]; | Temp5 = b[k + offset];
 | Mult3 = temp1 * temp2 |  | Temp11 = a[k + 1]; | Temp14 = b[k + 1];
 | Mult6 = temp4 * temp5; |  | Temp12 = a[k + offset + 1]; | Temp15 = b[k + offset + 1];
 | Mult13 = temp11 * temp12 | corrRe = corrRe + Mult3 | Imag fetches |
 | Mult16 = temp14 * temp15; | corrRe = corrRe + Mult6 | Imag fetches |
 | imag mult 1 | corrRe = corrRe + Mult13 | Imag fetches |
 | imag mult 2 | corrRe = corrRe + Mult16 | Imag fetches |
 | Imag mult 1 | Imag add 1 |  |
 | imag mult 1 | Imag add 2 |  |
 |  | Imag add 1 |  |
 |  | Imag add 2 |  |
Efficiency 8 in 12
Multiplication FN = FQ * FR, with FQ=F(0,1,2,3) and FR=F(4,5,6,7)
ALU Compute FN = FX op FY, with FX=F(8,9,10,11), FY=F(12,13,14,15)
Resource chart with register assignments
Other | Mult | add | DM | PM
 |  |  | Temp1 = a[k]; | Temp4 = b[k];
 |  |  | Temp2 = a[k + offset]; | Temp5 = b[k + offset];
 | Mult3 = temp1 * temp2 (F2, F5) |  | Temp11 = a[k + 1]; | Temp14 = b[k + 1];
 | Mult6 = temp4 * temp5; (F3, F6) |  | Temp12 = a[k + offset + 1]; | Temp15 = b[k + offset + 1];
 | Mult13 = temp11 * temp12 (?, ?) | corrRe = corrRe + Mult3 (F0, F0) |  |
 | Mult16 = temp14 * temp15; (?, ?) | corrRe = corrRe + Mult6 (F0, F0) |  |
 | imag mult 1 | corrRe = corrRe + Mult13 |  |
 | imag mult 2 | corrRe = corrRe + Mult16 |  |
 |  | Imag add 1 |  |
 |  | Imag add 2 |  |
What register for Mult3? What register for Temp11?
Illegal use of F0