Averaging Filter: Comparing performance of C++ and ‘our’ ASM. Example of program development on SHARC using C++ and assembly. Planned for Thursday 3rd October.

Planned for the afternoon of Thursday 3rd October. Practical examples handled in Lab 1.

Code Design (RECAP)
Add new value to FIFO buffer – includes discarding oldest FIFO value
Perform averaging
Note: N will be set to a small number (e.g. N = 16) during some parts of testing, and set large when timing is done
Use no magic numbers in code
– No loops involving for (j = 0; j < 1024; j++)
– Use for (j = 0; j < N; j++) where N is declared in Assign1.h so a single N is used across the project
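The two design steps above (FIFO update, then averaging) can be sketched in C++ as follows. This is a minimal sketch, not the course code: the function name `AverageFilterCPP` and the `fifo` array are hypothetical, and N is defined locally only so the block is self-contained (in the project it would come from Assign1.h).

```cpp
#include <cstddef>

// Assumption: in the project N is declared in Assign1.h; defined here for the sketch only.
constexpr std::size_t N = 16;

// Hypothetical FIFO holding the N most recent samples, oldest at index 0.
static float fifo[N] = {0.0f};

float AverageFilterCPP(float newValue) {
    // Update FIFO: shift every value down one slot, discarding the oldest.
    for (std::size_t j = 0; j + 1 < N; j++) {
        fifo[j] = fifo[j + 1];
    }
    fifo[N - 1] = newValue;  // insert the new value

    // Perform averaging over all N stored values.
    float sum = 0.0f;
    for (std::size_t j = 0; j < N; j++) {
        sum += fifo[j];
    }
    return sum / static_cast<float>(N);
}
```

Both loops are bounded by N rather than a literal, so changing N in one place changes the FIFO size, the shift loop and the sum loop together.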

Develop a personal software process to minimize mistakes and wasted time
Do a code review for syntax errors
– Make a list of errors so that you can identify your most common mistakes AND STOP making them
The number of syntax errors the compiler finds after your code review is related to the number of (unfound) logical defects in your code.
The number of syntax errors the compiler finds after your code review is also related to the amount of time you waste debugging your code looking for those hidden defects.
– Plan to spend 20% of your programming time doing code review

SHARC assembly code
WAIL – handle timing issues now
Can we call SHARC.asm code and return to C++ without crashing the system?
– Equivalent of RTS on Blackfin
– Equivalent of ??? on MIPS
Can we access memory?
– Equivalent of R0 = [P0]; and [P1] = R1; on Blackfin
– Equivalent of ??? on MIPS
Can we access memory without crashing the system?

Life cycle (RECAP)
Design – wish-list of ‘stories’
Write tests to show how code would work
Write C code to satisfy tests
Generate resource chart based on processor architecture to calculate best ‘theoretical’ speed
– If ‘real time theoretical speed’ works for you, then it is okay to try to optimize.
– If ‘theoretical speed’ does not work for you, then find a different algorithm; optimization is not going to help.
Modify already written tests to prove that assembly code works as well as being fast

Did a project clean and then build. Do code review to find final error. This was a refactoring error when I changed file names. Change CONTROL and NOTIFY macros to include CPP (see next slide).

Mock Device Registers “satisfy linker”. CCES says “inconsistent” definition. Poor mock – we move values in Audio Device registers by hand. Can we “MOCK” Receive_ADC_Samples? – Typical industrial testing approach needed when hardware is “NOT-YET-DEVELOPED”.

Better Simulation. What if the algorithm is “by mistake” still doing Left_Out = Left_In (Copy)? Then we would get the same answer. Currently “LeftChannel_In1” is a fixed constant, making it difficult for us to check whether our algorithm would work for more complex signals. So we could start testing the algorithm validity (not its speed) by changing LeftChannel_In1 by mocking the “ReciveA2D( )” and “TransmitD2A( )” audio devices.

Using ‘MockDevice.c’ loads (RECAP). What do we do about ‘Receive_ADC_Samples( )’? These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it then (WAIL).

Mocked device inside Assign1Library. Can be used during Labs 1 to 4. MADE PRIVATE (FIXED): GOOD OR BAD IDEA? VARIETY OF ALGORITHMS TESTED.

Use GUI to add new test group for Averaging code – 3 styles of tests (RECAP)

Time test – measure in µs. Must be less than 20 µs per point (1 audio channel). Un-automated, but we need to collect details and don’t have to do much analysis in lab.

Interesting: CCES code ran much slower than VDSP code. CCES apparently has different C++ device buffer characteristics for printf( ).

Refactored Project. Arrows indicate some changes.

Some issues to take up with Analog Devices Engineering Zone

Theoretical Analysis
We expect our theoretical analysis to be as fast as or faster than what the optimized C++ code takes.
We are not using any C++ DSP extensions, so we expect efficient rather than optimized code.
Is 816 cycles per sample processed by the Average Filter the speed we would expect based on our understanding of the processor architecture?

Expectations
First instruction after a jump takes 3 cycles to finish executing
After that, 1 instruction, all things being equal, takes 1 cycle
1 cycle for a read, write, add, multiply
D (?) cycles for a division

Averaging Filter Theoretical Analysis
Fetch N values from memory -- N cycles
Perform N add operations -- N cycles
Go round the sum for-loop -- N * FLC cycles
– Where FLC is the number of instructions to handle For-Loop-Control – includes all overheads of jumping during the for-loop
Exit for-loop (done once) -- EFL cycles
Do division -- D cycles
Return a value from function -- RV cycles
Enter and exit Average routine -- EER cycles
AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER cycles
VERY BIG DEFECT IN ANALYSIS FOUND LATER: ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS

Averaging Filter Theoretical analysis continued
AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER cycles
TIME / SAMPLE PROCESSED = { N(1 + 1 + FLC) + EFL + D + RV + EER } / N, which becomes 2 + FLC + (EFL + D + RV + EER) / N
For N large – 2 + FLC cycles / sample processed
Estimate FLC – loop control – 1 + 1 + 3 (once per loop) + 1 + 3 = 7
– START OF LOOP – compare, check compare, jump out of loop if needed
– END OF LOOP – increment counter, jump to start of loop
For N large – expect 9 cycles / sample processed – mostly FLC!
Time analysis says 816 cycles per sample processed
Either C++ is incredibly inefficient or we missed things
– For example, this for-loop overhead – but that would only add 9 cycles / sample processed
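The per-sample estimate above can be checked numerically with a small C++ helper. This is only a sketch of the slide's model: `fixedOverhead` stands in for EFL + D + RV + EER, whose individual values the slides do not give, so any number passed for it is an assumption.

```cpp
// Cycle model from the slide: N * (1 + 1 + FLC) + fixed overhead, divided by N.
// fixedOverhead represents EFL + D + RV + EER (values not given in the slides).
double cyclesPerSample(int n, int flc, int fixedOverhead) {
    double total = static_cast<double>(n) * (1 + 1 + flc) + fixedOverhead;
    return total / n;
}
```

With FLC = 7, the result tends to 9 cycles per sample as N grows, whatever the fixed overhead is; for small N the overhead term dominates.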

Big problem: measuring the wrong thing
We calculated the time to perform an average of N points in an N point array
– Allows us to understand how to calculate the time for averaging P points
816 cycles is the time to average 1 input value
– Useless information, as we make no mention of how many points were averaged
– Need to rewrite tests

Optimizing C++ compiler. The optimizing compiler knows more than we do about the processor architecture. Use the C++ compiler as a tutor – look at the code generated. Go into the Disassembly window and search for the optimized code as SHARC instructions.

Unexpected behaviour explained. Also use – control-shift-G

Did the C compiler remember an overhead for doing the FIFO update? So our analysis could be even slower. Hardware loop counter is 0xFF = 255 – NOT FIFO_SIZE! Why? VERY VERY DIFFERENT REASONS. See next slides.

For N large, only the loop code really counts. FIFO update. What does the syntax of the hardware loop mean?

What’s a SISD compared to a SIMD? Another indication of special architectural features to handle DSP that we must understand AND USE.

For N large, only the loop code really counts
FIFO update
Do (pc, 0x06) -- means what?
Does it mean the loop starts at the current instruction location 12b8db and finishes at (or includes) 12b8db + 6 = 12b8e1?
Or does the loop start at the first instruction of the loop, 12b8de, and finish BEFORE 12b8de + 6 = 12b8e4 (2 instruction loop)?
Is the nop in the loop or not? -- Makes a 50% difference in DSP speed if it is inside the loop.
– We must understand processor architecture

Timing -- for N large, only the loop code time really counts
FIFO update
Lcntr = 255 -- since we move (FIFO_SIZE – 1) values
0x?db is only executed once (loop set up) – then the loop switches to automatic (zero-overhead hardware loop)
Loop is size 2 --- the 0x?de and 0x?e1 instructions are each of size 48 bits (hence 0x?db says 0x6 is the loop size)
0x?e4 (nop) is only executed once – it is a special safety feature which is necessary when these “special loops” are executed, to avoid data races
FIR filter hardware loops will be 1 cycle (dm[], pm[], + and * in 1 instruction) in loop – need “TWO” safety nops, otherwise “possible race condition”

Averaging loop itself
Doing a memory read and an add each cycle
Average can be done in 1 cycle per value that way
We do 256 moves and adds – yet the loop is only size 255! -- Concept of loop unrolling
– One fetch before the loop, 255 adds and fetches inside the loop, leaving 1 add to finish outside the loop
Note the special multiply by 2^N instruction: f2 = scaleb f2 by r1
– r1 = 0xffff fff8 or -8 -- meaning multiply by 2^-8, which is (1 / 256)
– 1 cycle division if power of 2
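The "one fetch before, N – 1 fetch-and-add pairs inside, one add after" pattern can be written out in C++ to see why the loop count is 255 for 256 values. This is a sketch of the pattern only (the function name is hypothetical); it says nothing about how the SHARC actually overlaps the fetch and the add.

```cpp
#include <cstddef>

// Sums n >= 1 values using the fetch/add ordering described on the slide.
float pipelinedSum(const float* data, std::size_t n) {
    const float* p = data;
    float sum = 0.0f;
    float fetched = *p++;                    // one fetch before the loop
    for (std::size_t i = 1; i < n; i++) {    // n - 1 iterations (255 when n = 256)
        sum += fetched;                      // add the previously fetched value
        fetched = *p++;                      // fetch the next value
    }
    sum += fetched;                          // one add left over after the loop
    return sum;
}
```

Inside the loop the add never depends on the fetch issued in the same iteration, which is what lets hardware do both in one cycle.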

Divide by power of two. We need to understand why division of a floating point number by a power of two can be achieved in 1 cycle. Related to, but very different from, doing integer division by “shifting” >> 8. Need to “review” number representations – how are integer and floating point numbers stored and manipulated by software and hardware?
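A desktop C++ comparison may help here. `std::ldexp` is used as a portable stand-in for the SHARC scaleb instruction (an assumption made for illustration): it scales a floating point value by a power of two by adjusting the exponent, whereas the integer shift discards the low bits. Function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Floating point: multiply by 2^-8 by adjusting the exponent; mantissa is untouched.
float floatDivideBy256(float x) { return std::ldexp(x, -8); }

// Unsigned integer: shift right by 8, which throws away the low 8 bits.
std::uint32_t intDivideBy256(std::uint32_t x) { return x >> 8; }
```

For example, 1.0f scaled this way is exactly 1/256, while the integer 1 shifted right by 8 is 0.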

Things we need to tell the compiler
In principle you can vectorize the add
Instead of doing
    Loop 256
        R1 = dm(I4, ? );  // Fetch floating point number
        F2 = F2 + F1;
we can do
    Switch to SIMD mode
    Loop 128
        R1 = dm(I4, ? ), S1 = dm(I4 + 1, ? );  // Fetch 2 values
        F2 = F2 + F1, SF2 = SF2 + SF1;  // Add 2 values
Later we can cause the partial sums F2 and SF2 to be added
If we can switch to SIMD in the right way – 50% speed improvement!
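The two-partial-sums idea can be tried in plain C++ before touching SIMD mode. This sketch assumes n is even and uses two ordinary variables to play the roles of F2 and SF2; it does not use any SHARC or compiler SIMD extension.

```cpp
#include <cstddef>

// Assumes n is even. Two independent partial sums mimic the two SIMD lanes.
float twoLaneSum(const float* data, std::size_t n) {
    float sumA = 0.0f;  // plays the role of F2
    float sumB = 0.0f;  // plays the role of SF2
    for (std::size_t i = 0; i < n; i += 2) {  // half as many iterations
        sumA += data[i];
        sumB += data[i + 1];
    }
    return sumA + sumB;  // combine the partial sums afterwards
}
```

Because the additions happen in a different order than in the single-sum loop, the floating point result can differ in the last bits, which is worth remembering when setting test accuracy.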

Rough timing calculation can be performed based on “C” code. EXPECT TO DO THIS DURING EXAMS, QUIZZES AND LAB REPORTS.

Rough timing calculation – cycles (expect to do this in lab reports, exams etc.)
Get in and out of routine: 20
First loop – update FIFO: (N – 1) * (Read + Write + loop control)
Insertion of new value: 2
Do sum: N * (Read + write + loop control + sum) (forgot the ‘do adds’)
Divisions and store result: 3 * division + 2 writes
Total: (2N – 1) reads + (2N + 2) writes + 3 divisions + N sums + (2N – 1) loop control
Assume loop control is compare (1), check (1), increment (1), jump back (3 because of pipeline) = 6
Assume read, write and add = 1 cycle each
Division = 10 (not many of these, so SHOULD not matter if N is large, > 30)
Total = (2N – 1) + (2N + 2) + 3 * 10 + N + 6 * (2N – 1) = around 17 N – we can see that loop control is dominating – need to fix
In our case N = 64, so 17 N = 1088, where we have 2698 or 1326 experimentally
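The total above is easy to evaluate for any N with a few lines of C++. The cost values below are the slide's assumptions (1 cycle for read, write and add, 10 for a division, 6 for loop control); the struct and function names are hypothetical.

```cpp
// Per-operation cycle costs assumed on the slide.
struct CycleCosts {
    int read = 1;
    int write = 1;
    int add = 1;
    int division = 10;
    int loopControl = 6;  // compare + check + increment + 3-cycle jump back
};

// Total = (2N - 1) reads + (2N + 2) writes + 3 divisions + N adds + (2N - 1) loop control.
long roughCycles(long n, const CycleCosts& c = CycleCosts{}) {
    return (2 * n - 1) * c.read
         + (2 * n + 2) * c.write
         + 3 * c.division
         + n * c.add
         + (2 * n - 1) * c.loopControl;
}
```

For N = 64 this gives 1113 cycles, that is 17 N plus a constant of 25, which is where the "around 17 N = 1088" figure comes from.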

Things I have yet to learn how to do in CCES

Let’s switch to the SHARC simulator (VDSP screen shots here). No new information here (except simulator runs slowly). Need to run the CYCLE ACCURATE PIPELINE VIEWER.

Oh Bother And Damnation! Minor advantage – don’t have to run the slow simulator?

What other tools do we have? A more accurate way of timing than using TESTs with their strange overhead. Done using SHARC cycle counters we can display and break points we can set (don’t set one on a for loop).

Cheat – add breakpoint at dummy instruction

Calculation
From line 84 to 106 – done once: 162151 – 15936 = 2625 cycles
Add overhead of getting in and out of routine (20 cycles) = 2645 cycles
Test timing results – using fast board, not slow simulator
– Reverse test list timing – via micro-sec calculation 2639 cycles, via ‘clock( )’ 5398
– Direct test list timing – via micro-sec calculation 2639 cycles, via ‘clock( )’ 2899
WARNING: 2645 should not be considered close to 2639 (possible coincidence) until we know whether software loops are generated by the C compiler in the way we assumed

(TMI) Took a shower to break my thought train. Look for code defect (now obvious). Defect – should be cyclesUsed, otherwise we are using ‘time since start of program’. Code cycles now consistent via two different approaches.
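The kind of defect described here can be illustrated with a deterministic C++ sketch. Everything in it is hypothetical: `readCycleCounter` stands in for whatever free-running counter or `clock( )` call the test uses, and `codeUnderTest` simply pretends to consume 2639 cycles.

```cpp
#include <cstdint>

// Hypothetical free-running counter: counts cycles since the start of the program.
static std::uint64_t fakeCounter = 0;
std::uint64_t readCycleCounter() { return fakeCounter; }

// Stand-in for the code being timed; pretends to take 2639 cycles.
void codeUnderTest() { fakeCounter += 2639; }

std::uint64_t timeCode(void (*code)()) {
    std::uint64_t start = readCycleCounter();
    code();
    std::uint64_t cyclesUsed = readCycleCounter() - start;
    // Report cyclesUsed. Reporting readCycleCounter() here instead would give
    // 'time since start of program', which depends on what ran earlier.
    return cyclesUsed;
}
```

Reporting the raw counter would also explain a result that changes when the test order is reversed, since earlier tests add to the running total.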

Modify tests so they can handle both CPP and ASM versions (cut-and-paste). It’s not the timing that’s the problem at this moment. It’s ‘does the ASM and CPP code work’ at all!

Probably stop here

Check what function needs developing. Fix compiler error with prototype in ‘Assign1.h’. Linker error message says ‘wrong prototype’ (NM).

Check to see if we can run the tests that call ASM code without crashing. C++ prototype: extern "C" void Function(void);
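A desktop sketch of the call-and-return check follows. The `extern "C"` linkage stops the C++ compiler from mangling the name, so the linker can match it to a plain assembly label. The function name is hypothetical, and the body given here is only a stand-in so the sketch links without a SHARC assembler; on the target it would live in a .asm file.

```cpp
// Prototype as it might appear in a shared header (hypothetical name).
extern "C" void AverageFilterASM(void);

// Stand-in body so this sketch links on a desktop; not the real assembly routine.
static int callCount = 0;
extern "C" void AverageFilterASM(void) { callCount++; }

int callAndReturn() {
    AverageFilterASM();  // if control comes back here, call and return both work
    return callCount;
}
```

A first test needs to show nothing more than that the call returns; correctness of the result comes later.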

Now add assembly code FIFO stack. Temp fixes – remove { } syntax init-to-zero code – WAIL. Need to declare N using ‘Assign1.h’, which is “C++”. Do a quick local declaration of N = 64 to see if the coding problem is fixed before we start worrying about ‘Assign1.h’.

How to declare arrays in assembly. This does not work. Need to look in the Assembly language manual – copy available from the Analog web site or ENCM515 website. Find a better way later.

I made the code ‘more general’. #define .byte4 .var // Home made defect remover. Now move ‘all’ defines into ‘Assign1.h’ so that the same N gets used by CPP and ASM code and by TEST. Does not work – C++ syntax confuses the assembler.

Best ‘temp’ fix I could find. Use this type of syntax in ‘Assign1.h’ – conditional code generation – and in assembly code files.
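The slide's own header is not in the transcript, so the following is only one possible shape for such a shared header, not the author's version. It relies on the standard `__cplusplus` macro and assumes the assembler's preprocessor does not define it; the prototype and the `kFifoSize` name are hypothetical.

```cpp
// Sketch of an Assign1.h-style header shared by C++ and assembly sources.
#ifndef ASSIGN1_SKETCH_H
#define ASSIGN1_SKETCH_H

#define N 64  // plain #define: text substitution that any preprocessor understands

#ifdef __cplusplus  // defined by C++ compilers; assumed undefined when assembling
extern "C" float AverageFilterASM(float newValue);  // hypothetical prototype
const int kFifoSize = N;                             // C++-only syntax is hidden from the assembler
#endif

#endif  // ASSIGN1_SKETCH_H
```

Only the `#define N 64` line is visible to both tool chains, which keeps a single N across the CPP code, the ASM code and the tests.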

Initial testing done with small N. N = 4 (as we can work out the expected result). Write the test – C++ code expected to pass – 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4.

Look for ‘one out error’ in loops. Common DSP mistake. Remember to fix error in ASM ‘pseudo code’.

Initial testing done with small N. N = 4 (as we can work out the expected result). Write the test – C++ code expected to pass – ASM code MUST fail the test, otherwise the test is poor – it must fail as there is no ASM code to allow a pass to occur. This is the TEST of the TEST. Now have 4 tests passing rather than 3, including the ASM test. INDICATES BAD TEST – WHY?

Improved test. Don’t allow ‘old correct value’ in output from C++ test. Defect might have been identified by reversing test order.
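One way a stale value can make an empty ASM routine "pass" is sketched below. All names are hypothetical stand-ins: `averageCPP` writes the right answer to a shared output, `averageASM` is an empty stub, and the improved test overwrites the output with an obviously wrong value before each call.

```cpp
#include <cmath>

static float result = 0.0f;           // output location shared by both versions
void averageCPP() { result = 3.3f; }  // stand-in: C++ version writes the right answer
void averageASM() { }                 // stand-in: ASM stub writes nothing yet

bool closeTo(float a, float b) { return std::fabs(a - b) < 1e-5f; }

// Poor test: if the C++ test ran first, its answer is still in 'result'.
bool poorTest(void (*f)()) {
    f();
    return closeTo(result, 3.3f);
}

// Improved test: poison the output first so an old correct value cannot pass.
bool improvedTest(void (*f)()) {
    result = -999.0f;
    f();
    return closeTo(result, 3.3f);
}
```

Running the ASM test before the C++ test would also have exposed the problem, since the shared output would not yet hold the right answer.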

Modify Embedded Unit main( ) to allow this to happen

Might not get any further. Go over again in next class. Do ‘software loop control’ in tutorial. Need to understand if – then – else construct.

What registers can we use in assembly? Don’t use these without performing a save immediately and a recover operation later, otherwise C and C++ will crash. These are okay to use in assembly.

Set up the FIFO adjust loop. Need to set up ‘loopMax = N – 1’. THIS CONSTANT OKAY. THIS CONSTANT BAD!

What’s the error here? RELIABLE METHOD: PLACE CONSTANT IN REGISTER BEFORE USE

Here’s the full software loop structure. Each time around the loop – 9 cycles for control, not the 5 we thought.

dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4)
Both instructions use the ‘eye’ 4 index register (volatile)
dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2) (ADD is NOT performed in parallel with other operations?)
– LEAVE value in index register I4 unchanged
– Used in array addressing
dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2: DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)

Other bits of code needed

Add assembly language ‘externs’ to ‘Assign1.h’. Still have not coded the division – fake it by hard-coding * 1/4. Must be an easier way to code memory access – yes – use post-increment operation using pointers and not array indexing.

Code fails -- most likely place to look for defects is in loop operations. Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops.

Try persuading the “assembler” to pre-calculate F3 = (1.0 / N) at ‘compile time’, not ‘run-time’. Code should now work for N = 64, so we can compare timing with C code.
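The C++ analogue of this idea is a `constexpr` reciprocal, shown here as a sketch. It illustrates the compile-time versus run-time distinction only; it is not the assembler syntax the slide is looking for, and the names are hypothetical.

```cpp
constexpr int N = 64;                  // would come from Assign1.h in the project
constexpr float kOneOverN = 1.0f / N;  // evaluated by the compiler, not at run time

// Compile-time check that the constant really is 1/64.
static_assert(kOneOverN == 0.015625f, "1/N is computed at compile time");

// At run time only a multiply remains; no division is executed.
float scaleSum(float sum) { return sum * kOneOverN; }
```

For N that is not a power of two, 1/N is not exactly representable, so multiplying by the reciprocal can differ slightly from a true division.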

If we believe the tests then calculation accuracy is lower (5E-06 for larger N). Despite lousy ASM code we are already beating the compiler in ‘debug’ mode (around 2N).

Before optimizing, we need to add a few more tests to check the code is valid. Uses sum of N integers, N (N + 1) / 2. Accuracy now set to 1E-5.
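A test of this style can be sketched in C++ as follows. Feeding the values 1 to N gives a sum of N (N + 1) / 2, so the expected average is known without running the filter. The helper names are hypothetical, and a plain `average` function stands in for the filter under test.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the routine under test.
float average(const std::vector<float>& data) {
    float sum = 0.0f;
    for (float v : data) {
        sum += v;
    }
    return sum / static_cast<float>(data.size());
}

// Fill with 1..n (n >= 1), then compare against the closed-form expectation.
bool sumOfIntegersTest(std::size_t n, float accuracy = 1e-5f) {
    std::vector<float> data(n);
    for (std::size_t j = 0; j < n; j++) {
        data[j] = static_cast<float>(j + 1);
    }
    float expectedSum = static_cast<float>(n * (n + 1)) / 2.0f;  // N (N + 1) / 2
    float expected = expectedSum / static_cast<float>(n);
    return std::fabs(average(data) - expected) < accuracy;
}
```

Unlike a buffer of identical values, this input would expose a routine that skips, repeats or misorders elements.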

Use post-modify address mode: sum = sum + *pt++; (N = 64)
ASM was 2400 cycles (N = 64), is now 2208
– Expect improvement of N = 64 cycles (2 instead of 3 instructions)
– Get (2400 – 2208) = 192, which is exactly 3 * N = 192 faster
2 cycle stall till M4 ready to use?

dm(2, I4) versus dm(I4, 2); dm(M4, I4) versus dm(I4, M4)
Both instructions use the ‘eye’ 4 index register
dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– LEAVE value in index register I4 unchanged
– Used in array addressing
dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)
POST MODIFY OFFERS OPPORTUNITY FOR PROCESSOR ARCHITECTURE TO DO ADD IN PARALLEL WITH OTHER PIPELINE STAGES
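The same distinction exists at the C++ level, which may make the two addressing modes easier to remember. In this sketch (hypothetical function names) array indexing corresponds to the pre-modify style, where the base pointer never changes, and `*pt++` corresponds to the post-modify style, where the pointer is used and then advanced.

```cpp
#include <cstddef>

// Pre-modify style: address = base + offset; the base pointer is left unchanged.
float sumIndexed(const float* base, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; j++) {
        sum = sum + base[j];
    }
    return sum;
}

// Post-modify style: use the pointer, then advance it ready for the next access.
float sumPostIncrement(const float* pt, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; j++) {
        sum = sum + *pt++;
    }
    return sum;
}
```

Both functions return the same sum; how each maps onto SHARC instructions is up to the compiler.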

Using pre-modify and post-modify addressing – replace 6 instructions by 2. Expect 4 * N faster (256). Was 2208, is 1704 = about 500 cycles faster. Close to N * 8 faster!

Need to force “C++” to optimize
Our ASM code: 1704 cycles
Optimized “C”: 205 cycles – 1500 cycles faster, or roughly N * 23.5 cycles faster
FIFO loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) = 256
Loop control = 2 * 64 * 9, + into / out of subroutine 20, + other 10 = 1182
– Our ASM = 1468 + 236 unaccounted for (N * 3.7 or nearly N * 4)
CONCLUSION: We have a lot more to learn about using the processor architecture correctly in order to get HIGH SPEED DSP CODE
NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS. WE KNOW MORE, so we should be able to write faster code (if we need to)
