General Optimization Issues: multiple bus access, quad data fetches, DAB usage
Optimization 3, M. Smith, ECE, University of Calgary, Canada (2/23/2019)

To be tackled today
- Recap
  - SISD with only X-compute and J-bus accesses
  - SIMD with X- and Y-compute and J-bus accesses
  - SIMD with X- and Y-compute and J- and K-bus accesses
  - Doing it in C++ and in assembly code
- Final optimization
  - Quad data fetches and the DAB
Most optimized SIMD floating-point (32-bit) TigerSHARC instruction

    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

    xR3:0 = CB Q[j0 += 4];
        /* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
           Increments the J register and adjusts for circular buffer operation */
    yR3:0 = CB Q[k0 += 4];
        /* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
           Increments the K register and adjusts for circular buffer operation */
    xyFR4 = R5 * R6;
        /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
    xyFR7 = R8 + R9, FR10 = R8 - R9;;
        /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
           AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
        /* The same pair of registers must be used on either side of the + and - operators */
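To make the effect of the `xy` prefix concrete, here is a minimal host-side C++ sketch of the two compute instructions on this slide. It is only a model: the type names `RegisterFile` and `ComputeBlocks` and the functions `compute_step` and `simd_step` are hypothetical, and the memory fetches and circular buffering are not modelled.

```cpp
#include <array>

// One compute block's register file, modelled as 32 floats (R0..R31).
using RegisterFile = std::array<float, 32>;

// Hypothetical container for the X and Y compute blocks.
struct ComputeBlocks {
    RegisterFile x{};
    RegisterFile y{};
};

// What one block does for:  FR4 = R5 * R6;  FR7 = R8 + R9, FR10 = R8 - R9;
void compute_step(RegisterFile& r) {
    const float a = r[8];
    const float b = r[9];
    r[4] = r[5] * r[6];   // multiplication
    r[7] = a + b;         // addition ...
    r[10] = a - b;        // ... and subtraction share the same two operands
}

// The "xy" prefix issues the same operation to both compute blocks.
void simd_step(ComputeBlocks& blocks) {
    compute_step(blocks.x);
    compute_step(blocks.y);
}
```

Calling `simd_step` once therefore performs two multiplications, two additions and two subtractions, which matches the counts in the comments above.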
Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- Left fetched on the J-bus and processed in X-compute
- Right fetched on the K-bus and processed in Y-compute
- SIZE / 2 cycles in theory
Code to this point is SISD parallel optimization
- SISD (single instruction, single data): uses the X-compute block and the J memory bus
- Next stage is SIMD (single instruction, multiple data)
  - X-compute block and J memory bus for left
  - Y-compute block and K memory bus for right
- You will need similar but different code when you are doing the FIR in Lab 3
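The difference between the two stages can be sketched in host C++. The slides do not show the loop body at this point, so a plain per-channel summation is used as a stand-in, and `ChannelSums`, `sum_sisd` and `sum_simd` are hypothetical names. On a host compiler both versions run sequentially; the point is only the loop structure that lets left and right share one iteration.

```cpp
#include <cstddef>

// Hypothetical result type: one running total per channel.
struct ChannelSums {
    long left;
    long right;
};

// SISD style: one channel at a time, so the loop runs twice over the data.
ChannelSums sum_sisd(const int* left, const int* right, std::size_t n) {
    ChannelSums sums{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        sums.left += left[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        sums.right += right[i];
    }
    return sums;
}

// SIMD style: left and right are handled in the same iteration, which is the
// structure that maps onto X-compute (left) and Y-compute (right).
ChannelSums sum_simd(const int* left, const int* right, std::size_t n) {
    ChannelSums sums{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        sums.left += left[i];
        sums.right += right[i];
    }
    return sums;
}
```

Both functions return the same totals; only the number of loop passes differs.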
BUFFER_SIZE = 128. Rewrite so that the X and Y operations are done together. Error found when testing 1+
Understanding C++ on a DSP processor

We say

    int main(void) {
        int foo[200];          // Goes on stack

What happens

    int foo[200];              // Goes on J-stack, as with 68K
                               // J27 = J27 - 200 at start of main
                               // J27 = J27 + 200 at end of main
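The stack-pointer arithmetic above can be mimicked on a host machine with a small C++ model. This is a sketch only: `JStackModel` and `StackFrame` are hypothetical names, and the starting value of `j27` is an arbitrary illustrative number, not a real TigerSHARC address.

```cpp
// Hypothetical model of the J-stack pointer (J27 on the slide).
struct JStackModel {
    long j27 = 0x10000;  // arbitrary illustrative start value
};

// Reserves `words` on construction and releases them on destruction,
// mirroring what happens at the start and end of a function.
class StackFrame {
public:
    StackFrame(JStackModel& stack, long words) : stack_(stack), words_(words) {
        stack_.j27 -= words_;   // start of function: J27 = J27 - words
    }
    ~StackFrame() {
        stack_.j27 += words_;   // end of function: J27 = J27 + words
    }
    StackFrame(const StackFrame&) = delete;
    StackFrame& operator=(const StackFrame&) = delete;

    // Lowest address of the reserved area in this model.
    long base() const { return stack_.j27; }

private:
    JStackModel& stack_;
    long words_;
};
```

Creating a `StackFrame` of 200 words inside a scope lowers `j27` by 200 for the lifetime of that scope, which is the behaviour described for `int foo[200]`.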
Understanding C++ on a DSP processor

We say

    int main(void) {
        int foo[200];          // Goes on stack
        static int far[200];   // Goes into static memory space

What happens

    int foo[200];              // Equivalent to 68K
                               // start main: J27 = J27 - 200
                               // end main:   J27 = J27 + 200
    static int far[200];       // becomes
                               //   .section data1;
                               //   .var _far[200];
Understanding C++ on a DSP processor

We say this

    int main(void) {
        int foo[200];          // Goes on stack
        static int far[200];   // Goes into static memory space

What happens: all arrays are in DM space memory by default

    int DM foo[200];           // Goes on J-stack (DM space)
    static DM int far[200];    // becomes
                               //   .section data1;   (DM space)
                               //   .var _far[200];

TigerSHARC: modify the C++ syntax to take the TigerSHARC architecture into account. The processor has a J stack and a K stack; the K stack is currently not accessible from C++.

    static PM int fum[200];    // becomes
                               //   .section data2;   (PM space)
                               //   .var _fum[200];
Example of DM and PM usage. We are now looking at a loop of 2 * N / 2 cycles = 128 cycles
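The 2 * N / 2 figure is simple arithmetic: 2 * N fetches shared across two busses. A minimal C++ sketch of that estimate follows. The function `ideal_loop_cycles` and its `words_per_fetch` parameter are hypothetical, and the result is an idealized lower bound that ignores stalls and loop overhead.

```cpp
// Idealized loop length in cycles: total fetches divided by the number of
// values that can arrive per cycle (busses * words per fetch), rounded up.
// Assumes no stalls and no loop overhead.
constexpr unsigned ideal_loop_cycles(unsigned total_fetches,
                                     unsigned busses,
                                     unsigned words_per_fetch = 1) {
    return (total_fetches + busses * words_per_fetch - 1) /
           (busses * words_per_fetch);
}

constexpr unsigned N = 128;

// 2 * N fetches split over the J and K busses: the slide's 128 cycles.
static_assert(ideal_loop_cycles(2 * N, 2) == 128, "dual-bus estimate");

// The same fetches on the J bus alone would take twice as long.
static_assert(ideal_loop_cycles(2 * N, 1) == 256, "single-bus estimate");
```

The `words_per_fetch` parameter is there so the same arithmetic can be reused for wider fetches; it says nothing about whether the compute blocks can keep up.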
Speed considerations (cycles)
Columns, in order: C++ debug; C++ optimized; Assembly code
- Basic code: 9,567; 1,122; 2,506
- XY compute blocks, interleaved: 10,021; 1,310 (2 extra stalls / loop?)
- Using dm/pm memory, J / K busses: 935 (1 less stall / loop?)
- CB loop update physically removed: 4,666; 296
- 4 memory moves = 512 cycles; 512 + 296 = 808, + 128 stalls
Real-life code, Version 5C: C++ using XY compute blocks. Plan for left and right handled in parallel
XY compute blocks in assembly code: 458 cycles
Stage 5D: C++ using dual data memory
Dual memory: assembly code version
Dual memory access, assembly code version: 332 cycles
Version 5E: dual memory, interleaved code to avoid compute block stalls: 203 cycles
Version 5F: fine tuning by rearranging “the little stuff” at the beginning and end: 172 cycles (loop of 128 cycles with no stalls, plus 46 cycles of “other stuff”)
Speed considerations (cycles), updated
Columns, in order: C++ debug; C++ optimized; Assembly code
- Basic code: 9,567; 1,122; 2,506
- XY compute blocks, interleaved: 10,021; 1,310 (2 extra stalls / loop?)
- Using dm/pm memory, J / K busses: 935 (1 less stall / loop?); 172
- CB loop update physically removed: 4,666; 296; around 128 (45 cycles overhead)
- 4 memory moves = 512 cycles; 512 + 296 = 808, + 128 stalls
Questions to answer
- Can we persuade the “C++” compiler (through some command) to make use of the hardware circular buffer?
- What other special optimizing #pragmas are available?
- Given that there is so much overhead in the assembly code from saving and restoring the circular buffer pointer, can we save time by using “memory-to-memory” moves and quad fetches along both the J and K busses?
Basic concepts of quad fetches

Key issue: high efficiency from fetching 4 data values in 1 cycle
Key problem: must fetch from words aligned on a 4-word boundary

    .align 4;
    .var source[129];
    .var destination[129];

    J0 = source;;
    J1 = destination;;

Valid syntax, correct answer

    XR3:0 = Q [J0 += 4];;     // must be post-increment
    Q [J1 += 4] = XR3:0;;
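For comparison, the same alignment requirement can be expressed and checked in host C++. This is a sketch under one assumption: a TigerSHARC word is 32 bits, so a 4-word boundary corresponds to 16 bytes on a byte-addressed host. The names `kQuadBytes` and `is_quad_aligned` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 4 words of 32 bits each, so a quad boundary is 16 bytes here.
constexpr std::size_t kQuadBytes = 4 * sizeof(std::int32_t);

// Host-side counterparts of the slide's arrays, each forced onto a quad boundary.
alignas(kQuadBytes) std::int32_t source[129];
alignas(kQuadBytes) std::int32_t destination[129];

// True when the pointer sits on a 4-word (quad) boundary.
bool is_quad_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kQuadBytes == 0;
}
```

With these declarations `source` and `source + 4` are quad aligned, while `source + 1` is not, which is the case the quad fetch cannot handle directly.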
Example code and results
Basic concepts of quad fetches

Key issue: a problem arises when doing FIFO data buffer operations
Key problem: must fetch from words aligned on a 4-word boundary

    .align 4;
    .var source[129];
    .var destination[129];

Valid syntax, WRONG answer

    J0 = source;;
    J1 = destination;;
    J0 = J0 + 1;;             // no longer on a 4-word boundary
    XR3:0 = Q [J0 += 4];;     // must be post-increment
    Q [J1 += 4] = XR3:0;;
    // Result is as if J0 = J0 + 1;; were not there
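The wrong answer can be reproduced with a tiny host-side C++ model. It assumes, as the observed result suggests, that a normal quad fetch simply ignores the two low address bits; `quad_fetch` is a hypothetical function, and memory is modelled as a `std::vector<int>` indexed by word address.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Simplified model of a normal (non-DAB) quad fetch: the two low address bits
// are dropped, so the fetch always starts on a 4-word boundary.
std::array<int, 4> quad_fetch(const std::vector<int>& memory, std::size_t address) {
    const std::size_t base = address & ~static_cast<std::size_t>(3);
    return {memory.at(base), memory.at(base + 1),
            memory.at(base + 2), memory.at(base + 3)};
}
```

In this model a fetch from address 1 returns the same four words as a fetch from address 0, which is exactly the "as if the increment were not there" behaviour.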
Example code: wrong values fetched
First attempt to use the DAB. Memory values come out “4 reads” later than expected. This is because the DAB acts as an “extra” stage in the data fetch pipeline.
Must do a “dummy” read of the DAB
Always need to do an “initial” read of the DAB. If the JB and JL registers are set, then the DAB automatically does CB operations. The DAB must use J0 to J3.
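Why the initial read is needed can be illustrated with a deliberately simplified host-side C++ model. It is not a description of the real hardware: `DabModel` is a hypothetical class that assumes the DAB holds the previously fetched aligned quad word and builds each result from that held quad plus the newly fetched one. Circular buffer wrap-around (JB/JL) is not modelled.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Simplified model of a data alignment buffer (DAB).
class DabModel {
public:
    explicit DabModel(const std::vector<int>& memory) : memory_(memory) {}

    // One DAB quad read at a (possibly non-aligned) word address.
    std::array<int, 4> read(std::size_t address) {
        const std::size_t offset = address & 3;     // position inside the quad
        const std::size_t base = address - offset;  // aligned quad address

        // Fetch the aligned quad that contains `address`.
        std::array<int, 4> fresh{};
        for (std::size_t i = 0; i < 4; ++i) {
            fresh[i] = memory_.at(base + i);
        }

        // Combine the tail of the held quad with the head of the fresh one.
        std::array<int, 4> result{};
        for (std::size_t i = 0; i < 4; ++i) {
            const std::size_t k = offset + i;
            result[i] = (k < 4) ? held_[k] : fresh[k - 4];
        }

        held_ = fresh;  // the fresh quad becomes the buffer contents
        return result;
    }

private:
    const std::vector<int>& memory_;
    std::array<int, 4> held_{};  // empty until the first read primes it
};
```

In this model the first `read` returns data built from an empty buffer, so it has to be treated as a dummy read. Every later read returns the four words that start at the previous read's address, which is the one-stage delay described on the slides.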
DAB: ALIGNED versus NON-ALIGNED
Tackled today
- Recap
  - SISD with only X-compute and J-bus accesses
  - SIMD with X- and Y-compute and J-bus accesses
  - SIMD with X- and Y-compute and J- and K-bus accesses
  - Doing it in C++ and in assembly code
- Final optimization
  - Quad data fetches and the DAB