Averaging Filter: Comparing performance of C++ and ‘our’ ASM. Example of program development on SHARC using C++ and assembly. Planned for Tuesday 7th October.

1 Averaging Filter: Comparing performance of C++ and ‘our’ ASM. Example of program development on SHARC using C++ and assembly. Planned for Tuesday 7th October, afternoon. Practical examples handled in Lab 1.

2 Demo (uTTCOS) and Test (E-UNIT) configurations. Demo: true A/D -> true leftChannel_In -> Audio ISR with Filter (YOUR SOFTWARE) -> true leftChannel_Out -> true D/A, over the DMA channel. Test: test InAudio array -> mock leftChannel_In -> Filter (YOUR SOFTWARE) -> mock leftChannel_Out -> test OutAudio array, with mock ReceiveD2A and mock TransmitA2D. Test procedure: set up InAudio[ ]; set up Expected[ ]; in a loop, call Filter to produce OutAudio[ ]; compare Expected[ ] and OutAudio[ ].
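
The test configuration above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the two channel variables stand in for the audio device registers, `Filter` is a placeholder copy routine, and `RunFilterOverArray` is a hypothetical helper name, not part of uTTCOS or E-UNIT.

```cpp
#include <vector>

// Hypothetical stand-ins for the audio device "registers" used by the ISR.
static float leftChannel_In  = 0.0f;
static float leftChannel_Out = 0.0f;

// Placeholder for the code under test: a plain copy, so the harness
// itself can be checked before a real filter is plugged in.
void Filter() { leftChannel_Out = leftChannel_In; }

// Push every test sample through the mocked input and collect the mocked output.
std::vector<float> RunFilterOverArray(const std::vector<float>& inAudio) {
    std::vector<float> outAudio;
    outAudio.reserve(inAudio.size());
    for (float sample : inAudio) {
        leftChannel_In = sample;              // mock: A/D delivers one sample
        Filter();                             // code under test
        outAudio.push_back(leftChannel_Out);  // mock: D/A consumes one sample
    }
    return outAudio;
}
```

A test would then build `InAudio` and `Expected`, call `RunFilterOverArray(InAudio)`, and compare the result with `Expected` element by element.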

3 Mock device registers “satisfy the linker”. CCES reports an “inconsistent” definition. This is a poor mock: we move values into the audio device registers by hand. Can we “MOCK” Receive_ADC_Samples? This is the typical industrial testing approach needed when the hardware is “NOT YET DEVELOPED”.

4 Better Simulation. What if the algorithm is, “by mistake”, still doing Left_Out = Left_In (copy)? Then we would get the same answer. Currently “LeftChannel_In1” is a fixed constant, making it difficult for us to check whether our algorithm would work for more complex signals. So we could start testing the algorithm's validity (not its speed) by changing LeftChannel_In1, “mocking” the “ReceiveA2D( )” and “TransmitD2A( )” audio devices.

5 Using ‘MockDevice.c’ loads (RECAP). What do we do about ‘Receive_ADC_Samples( )’? These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it later (WAIL).

6 Mocked device inside Assign1Library. Can be used during Labs 1 to 4. MADE PRIVATE (FIXED): GOOD OR BAD IDEA? VARIETY OF ALGORITHMS TESTED.

7 Use GUI to add new test group for Averaging code: 3 styles of tests (RECAP)

8 Testing. Test that it works. Test that it meets real-time performance: measure ms / sample for 1 channel = Time-1CH; require 20 ms > 8 * Time-1CH. Move the code onto a resource chart and determine the theoretical best time if all optimizations could be found. Test to determine the real cycle count (cycles / tap / sample). Examine the CPP.lst file (.i or .is) or your ASM file to determine the expected cycle count: work out why theory and measurement differ; look for accuracy of better than 1 cycle in 1000; assume 1 cycle per instruction except for jumps, memory accesses, movement of I registers to memory, or any other delay we find to be common. Be able to redo the theoretical calculation for another processor architecture (timings) for MidTerm 1 on Thursday 23rd Oct.

9 Theoretical Analysis. We expect our theoretical analysis to be as fast as, or faster than, what the optimized C++ code takes. We are not using any C++ DSP extensions, so we expect efficient rather than optimized code. Is 816 cycles per sample processed by the Average Filter the speed we would expect, based on our understanding of the processor architecture?

10 Expectations. The first instruction after a jump takes 3 cycles to finish executing. After that, all things being equal, each instruction takes 1 cycle: 1 cycle for a read, write, add or multiply; D? cycles for a division.

11 Averaging Filter with Loop: Theoretical Analysis. Fetch N values from memory: N cycles. Perform N add operations: N cycles. Go round the sum for-loop: N * FLC cycles, where FLC is the number of instructions needed to handle For-Loop-Control, including all overheads of jumping during the for-loop. Exit the for-loop (done once): EFL cycles. Do the division: D cycles. Return a value from the function: RV cycles. Enter and exit the Average routine: EER cycles. AVERAGE_FILTER_TIME = N * (1 + 1 + FLC) + EFL + D + RV + EER cycles. VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS.
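
The cycle model on this slide can be written as a small C++ function so that different estimates are easy to try. The struct and function names are hypothetical, and every cost passed in is an estimate supplied by the reader, not a measured SHARC figure.

```cpp
// Per-item cycle estimates, using the slide's abbreviations.
struct LoopCosts {
    int flc;  // For-Loop-Control cycles per iteration
    int efl;  // Exit-For-Loop cycles (paid once)
    int d;    // cycles for the division
    int rv;   // cycles to return a value from the function
    int eer;  // cycles to enter and exit the Average routine
};

// AVERAGE_FILTER_TIME = N * (1 + 1 + FLC) + EFL + D + RV + EER
constexpr int AverageFilterCycles(int n, const LoopCosts& c) {
    // One fetch and one add per tap, plus loop control, plus one-off costs.
    return n * (1 + 1 + c.flc) + c.efl + c.d + c.rv + c.eer;
}
```

For example, with N = 64 and a loop-control cost of 9 cycles, the loop term alone comes to 64 * 11 = 704 cycles before the one-off costs are added.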

12 Modify tests so they can handle both CPP and ASM versions (cut-and-paste). It’s not the timing that’s the problem at this moment. It’s ‘does the ASM and CPP code work’ at all!

13 Check what function needs developing. Fix the compiler error with the prototype in ‘Assign1.h’. The linker error message says ‘wrong prototype’ (NM).

14 Check to see if we can run the tests that call ASM code without crashing. C++ prototype: extern "C" void Function(void)
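
A minimal sketch of what the `extern "C"` prototype buys us: it asks the C++ compiler to use C linkage (no name mangling), so the linker can match the call against a plainly named assembly routine. Here the "assembly" routine is replaced by a C++ stub so the sketch links on its own; `callCount` and `CallAsmStubOnce` are hypothetical names used only for illustration.

```cpp
// Prototype as it would appear in the shared header.
extern "C" void Function(void);

// Stand-in for the assembly routine: it only counts how often it is called.
static int callCount = 0;
extern "C" void Function(void) { ++callCount; }

// A test can call the routine and confirm that control comes back (no crash).
int CallAsmStubOnce() {
    Function();
    return callCount;
}
```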

15 Getting the same constants in an include file working in both CPP and ASM. Use this type of syntax in ‘Assign1.h’ (conditional code generation), and in the assembly code files.
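
The slide's own header is not in the transcript, so the following is only a sketch of one way such a shared include file can look. It assumes the assembler runs the file through a C-style preprocessor and does not define `__cplusplus`; the names `AVERAGE_FILTER_N`, `AverageFilter_ASM` and `FilterLength` are hypothetical.

```cpp
// Shared constant: a plain #define is understood by any C-style preprocessor.
#define AVERAGE_FILTER_N 4

#ifdef __cplusplus
// Only a C++ compiler defines __cplusplus, so everything in this section
// is hidden from the assembler.
extern "C" float AverageFilter_ASM(float newValue);  // prototype for the ASM routine

constexpr int FilterLength() { return AVERAGE_FILTER_N; }
#endif
```

The design point is that the constant is defined once, so the C++ tests and the assembly code cannot drift apart on the value of N.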

16 Initial testing done with small N: N = 4 (as we can work out the expected result). Write the test; the C++ code is expected to pass. 3.3 is EXACTLY (N - 1) / N of 4.4 when N is 4.

17 Look for ‘one out’ (off-by-one) errors in loops: a common DSP mistake. Remember to fix the error in the ASM ‘pseudo code’.

18 Initial testing done with small N: N = 8 (as we can work out the expected result). Write the test. The C++ code is expected to pass. The ASM code MUST fail the test, otherwise the test is poor; it must fail because there is no ASM code yet to allow a pass to occur. This is the TEST of the TEST. We now have 4 tests passing rather than 3, including the ASM test. THIS INDICATES A BAD TEST. WHY?

19 Improved test. Don’t allow the ‘old correct value’ from the C++ test to remain in the output. The defect might have been identified by reversing the test order.
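
One plausible reading of this fix, sketched in C++: overwrite the result variable with an impossible value before calling the routine under test, so a routine that never writes its output cannot pass by inheriting the previous test's answer. The unwritten ASM routine is modelled by an empty stub, and all names here are hypothetical.

```cpp
#include <cmath>

// Stand-in for the not-yet-written ASM routine: it never touches *result.
void AverageFilterASM_NotYetWritten(float /*newValue*/, float* /*result*/) {}

// Value no valid audio average in these tests could produce.
constexpr float kPoison = -1.0e30f;

bool AsmResultMatches(float input, float expected) {
    float result = kPoison;  // wipe out any 'old correct value'
    AverageFilterASM_NotYetWritten(input, &result);
    return std::fabs(result - expected) < 1e-5f;
}
```

With the poison in place, the check fails for the empty stub, which is the behaviour the 'test of the test' asks for.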

20 What registers can we use in assembly? Some registers must not be used without performing a save immediately and a recover operation later; otherwise C and C++ will crash. Others are okay to use in assembly.

21 Here’s the full software loop structure. Note the formatting for easy code review (required). Each time around the loop costs 9 cycles for control, not the 5 we thought.

22 dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4). Both instructions use the ‘eye’ 4 index register (volatile). dm(2, I4) is a pre-modify memory operation: the 2 is before the I4, hence ‘pre’ something. I4 points to a memory location; dm(2, I4) means access the memory location at (I4 + 2) and LEAVE the value in index register I4 unchanged. Used in array addressing. Is the ADD NOT performed in parallel with other operations? dm(I4, 2) is a post-modify memory operation: the 2 is after the I4, hence ‘post’ something. I4 points to a memory location; dm(I4, 2) means access the memory location at (I4), then MODIFY the value in the index register by 2 (I4 = I4 + 2 AFTER USING I4). Is the ADD done in parallel with other operations?
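
The two addressing modes have close C++ analogues. The sketch below models I4 as an ordinary pointer; it illustrates the semantics described on the slide only and says nothing about cycle counts. The function names are hypothetical.

```cpp
// dm(2, I4): pre-modify. Read from (I4 + 2); I4 itself is left unchanged.
float PreModifyRead(const float* i4) {
    return *(i4 + 2);  // same as i4[2], i.e. array-style addressing
}

// dm(I4, 2): post-modify. Read from (I4), then advance I4 by 2.
float PostModifyRead(const float*& i4) {
    float value = *i4;  // use the current address first
    i4 += 2;            // then update the index register
    return value;
}
```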

24 Other bits of code needed

25 Add assembly language ‘externs’ to ‘Assign1.h’. We still have not coded the division: fake it by hard-coding * 1/4. There must be an easier way to code the memory accesses. Yes: use the post-increment operation with pointers instead of array indexing.

26 Code fails. The most likely place to look for defects is in the loop operations. We forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops.

27 Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run time’. The code should now work for N = 64, so we can compare timing with the C code.
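
The same idea expressed in C++ rather than assembler, as a sketch: a `constexpr` reciprocal is evaluated by the compiler, so the run-time code multiplies by a stored constant instead of dividing. The names `kOneOverN` and `ScaleSum` are hypothetical.

```cpp
constexpr int N = 64;

// Computed at compile time; no division is executed at run time.
constexpr float kOneOverN = 1.0f / N;

// Replaces 'sum / N' with a single multiply.
float ScaleSum(float sum) {
    return sum * kOneOverN;
}
```

For N = 64 the reciprocal is an exact power of two, so multiplying gives the same result as dividing; for other values of N the two can differ in the last bit.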

28 If we believe the tests, then the calculation accuracy is lower (5E-06 for larger N). Despite lousy ASM code, we are already beating the compiler in ‘debug’ mode (around 2N).

29 Before optimizing, we need to add a few more tests to check that the code is valid. The new test uses the sum of N integers, N (N + 1) / 2. Accuracy is now set to 1E-5.

30 Use post-modify address mode: sum = sum + *pt++; (N = 64). ASM was 2400 cycles (N = 64), is now 2208. Expect an improvement of N = 64 cycles (2 instead of 3 instructions). Get (2400 - 2208) = 192, which is exactly 3 * N = 192 faster. 2-cycle stall till M4 is ready to use?
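
For comparison, here are the two C++ forms of the sum loop that correspond to array indexing and to post-modify addressing. This is a sketch with hypothetical function names; whether a given compiler generates different code for the two depends on its optimizer.

```cpp
#include <cstddef>

// Array indexing: the address is formed from base + index on every pass.
float SumIndexed(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum = sum + data[i];
    return sum;
}

// Pointer with post-increment: read, then step the pointer, as on the slide.
float SumPostIncrement(const float* data, std::size_t n) {
    const float* pt = data;
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum = sum + *pt++;
    return sum;
}
```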

31 dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4). Both instructions use the ‘eye’ 4 index register. dm(2, I4) is a pre-modify memory operation: the 2 is before the I4, hence ‘pre’ something. I4 points to a memory location; dm(2, I4) means access the memory location at (I4 + 2) and LEAVE the value in index register I4 unchanged. Used in array addressing. dm(I4, 2) is a post-modify memory operation: the 2 is after the I4, hence ‘post’ something. I4 points to a memory location; dm(I4, 2) means access the memory location at (I4), then MODIFY the value in the index register by 2 (I4 = I4 + 2 AFTER USE). POST-MODIFY OFFERS AN OPPORTUNITY FOR THE PROCESSOR ARCHITECTURE TO DO THE ADD IN PARALLEL WITH OTHER PIPELINE STAGES.

32 Using pre-modify and post-modify addressing: replace 6 instructions by 2. Expect 4 * N faster (256). Was 2208, is now 1704: 504 cycles saved, close to N * 8 rather than the expected N * 4!

33 Need to force “C++” to optimize. Our ASM code: 1704 cycles. Optimized “C”: 205 cycles, about 1500 cycles faster, or roughly N * 23.5 cycles faster. FIFO loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) = 256. Loop control = 2 * 64 * 9, + into / out of subroutine 20, + other 10 = 1182. Our ASM = 1468 + 236 unaccounted for (N * 3.7, or nearly N * 4). CONCLUSION: We have a lot more to learn about using the processor architecture correctly in order to get HIGH-SPEED DSP CODE. NOTE: THE COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS. WE KNOW MORE, so we should be able to write faster code (if we need to).
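
The cycle budget on this slide can be tallied mechanically, which makes it easier to redo when an estimate changes. The sketch below uses only the figures quoted on the slide; the constant names are hypothetical. Note that the terms as listed add up to 1438 rather than the quoted 1468, so the slide's total may include about 30 cycles that are not itemized.

```cpp
constexpr int kN = 64;

constexpr int kDataOps      = 256;         // FIFO loop + sum loop, as quoted
constexpr int kLoopControl  = 2 * kN * 9;  // two loops, 9 control cycles per pass
constexpr int kCallOverhead = 20;          // into / out of subroutine
constexpr int kOther        = 10;

// Everything the analysis can explain, term by term.
constexpr int kItemized = kDataOps + kLoopControl + kCallOverhead + kOther;

constexpr int kMeasuredAsm = 1704;  // measured cycles for our ASM
constexpr int kOptimizedC  = 205;   // measured cycles for optimized C

constexpr int kUnexplained  = kMeasuredAsm - kItemized;   // cycles theory misses
constexpr int kGapToCompiler = kMeasuredAsm - kOptimizedC; // about N * 23.4
```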

