General Optimization Issues


1 General Optimization Issues
Multiple bus access
Quad data fetches
DAB usage

2 To be tackled today
Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
Final optimization
    Quad data fetches and the DAB
2/23/2019 Optimization 3, M. Smith, ECE, University of Calgary, Canada

3 Most optimized SIMD floating point (32-bit) TigerSHARC instruction
Complete instruction line:
    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation */
yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation */
xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
       AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The same registers must be used on either side of the + and - operators */
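To make the instruction line above more concrete, here is a rough C++ model of two of its pieces: the "xy" arithmetic applied to both compute blocks, and the circular-buffer adjustment of an index register after a post-increment. This is only a sketch. `ComputeBlock`, `compute_part`, `xy_compute_part` and `cb_post_modify` are hypothetical names invented for illustration, and the wrap-around rule shown is the usual modulo form, assumed here rather than taken from the slides.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of one compute block's register file (R0..R31 as floats).
struct ComputeBlock {
    std::array<float, 32> r{};
};

// Arithmetic part of the instruction line for ONE block:
//   FR4 = R5 * R6;  FR7 = R8 + R9, FR10 = R8 - R9;
// All sources are read before any destination is written.
inline void compute_part(ComputeBlock& b) {
    const float prod = b.r[5] * b.r[6];
    const float sum  = b.r[8] + b.r[9];
    const float diff = b.r[8] - b.r[9];
    b.r[4]  = prod;
    b.r[7]  = sum;
    b.r[10] = diff;
}

// The "xy" prefix means the same operation runs in both compute blocks.
inline void xy_compute_part(ComputeBlock& x, ComputeBlock& y) {
    compute_part(x);
    compute_part(y);
}

// Assumed circular-buffer rule for a post-increment such as "CB Q[j0 += 4]":
// the index advances by `step` and wraps inside [base, base + length).
// Precondition: base <= j < base + length.
inline std::size_t cb_post_modify(std::size_t j, std::size_t base,
                                  std::size_t length, std::size_t step) {
    return base + (j - base + step) % length;
}
```

With a 128-word buffer, an index sitting on the last quad (offset 124) wraps back to the start of the buffer after the `+= 4`.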

4 Code optimization – 32 bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
Left fetched on J-bus and done in X-compute
Right fetched on K-bus and done in Y-compute
SIZE / 2 cycles in theory

5 Code to this point is SISD parallel optimization
SISD: single instruction, single data
    Using X_compute block and J memory bus
Next stage, SIMD: single instruction, multiple data
    Using X_compute block and J memory bus for left
    Using Y_compute block and K memory bus for right
Will need similar but different code when you are doing FIR in Lab 3

6 BUFFER_SIZE = 128 Rewrite so that X and Y ops done together
Error found when testing 1+

7 Understanding C++ on a DSP processor
We say
    int main( void) {
        int foo[200];      // Goes on stack
What happens
    int foo[200];          // Goes on J-stack -- as with 68K
                           // J27 = J27 – start of main
                           // J27 = J end of main

8 Understanding C++ on a DSP processor
We say
    int main( void) {
        int foo[200];          // Goes on stack
        static int far[200];   // Goes into static memory space
What happens
    int foo[200];              // Equivalent to 68K
                               // start main J27 = J27 – 200
                               // end main J27 = J
    static int far[200];       // becomes
                               // .section data1;
                               // .var _far[200];

9 Understanding C++ on a DSP processor
WE SAY THIS
    int main( void) {
        int foo[200];          // Goes on stack
        static int far[200];   // Goes into static memory space
What happens --- all arrays by default are DM space memory
    int DM foo[200];           // Goes on J-stack (DM space)
    static DM int far[200];    // becomes
                               // .section data1; (DM space)
                               // .var _far[200];
TigerSHARC --- has J stack and K stack -- K stack currently not accessible from C++
Modify C++ syntax to take TigerSHARC architecture into account
    static PM int fum[200];    // becomes
                               // .section data2; (PM space)
                               // .var _fum[200];

10 Example of DM and PM usage
We are now looking at 2 * N / 2 cycle loop = 128 cycles
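The arithmetic behind the 128-cycle figure can be written out as a small compile-time check. This is only a sketch of one reading of the slide: it assumes N = 128 (the BUFFER_SIZE used earlier), that the left and right channels each need N memory fetches, and that each of the two buses delivers one value per cycle. The function name `ideal_loop_cycles` and its parameters are made up for illustration.

```cpp
#include <cassert>

// Ideal (stall-free) loop length when 2 * n values must be fetched and the
// fetches are spread evenly over `buses` memory buses, each delivering
// `words_per_fetch` values per cycle. All names here are assumptions.
constexpr int ideal_loop_cycles(int n, int buses, int words_per_fetch) {
    return (2 * n) / (buses * words_per_fetch);
}

// 2 * N / 2 with N = 128: one value per cycle on each of the J and K buses.
static_assert(ideal_loop_cycles(128, 2, 1) == 128,
              "matches the 128-cycle loop quoted on the slide");
```

The same expression with a single bus gives 256 cycles, which is why moving one channel onto the second bus halves the ideal loop length.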

11 Speed considerations
Columns: C++ debug, C++ optimized, Assembly code
Basic code: 9,567 1,122 2,506
XY compute blocks -- interleaved: 10,021 1,310 2 extra stalls / loop?
Using dm/pm memory, J / K buses: 935 1 less stall / loop?
-- CB loop update physically removed: 4,666 296
4 memory moves = 512 cycles = stalls

12 Real Life code – Version 5C – C++ XY compute blocks
Plan for left and right handled in parallel

13 XY compute blocks in assembly code
458 cycles

14 Stage 5D – C++ using dual data memory

15 Dual Memory – Assembly code version

16 Dual memory access – assembly code version
332 cycles

17 Dual memory – interleaved code to avoid compute block stalls – Version 5E
203 cycles

18 Version 5F – fine tuning by rearranging “the little stuff” at the beginning and end
172 cycles
(Loop 128 – no stalls, but 46 cycles of “other stuff”)

19 Speed considerations
Columns: C++ debug, C++ optimized, Assembly code
Basic code: 9,567 1,122 2,506
XY compute blocks -- interleaved: 10,021 1,310 2 extra stalls / loop?
Using dm/pm memory, J / K buses: 935 1 less stall / loop? 172
-- CB loop update physically removed: 4,666 296
4 memory moves = 512 cycles = stalls
Around 128, 45 cycles overhead

20 Questions to answer
Can we persuade the “C++” compiler (through some command) to make use of the hardware circular buffer?
What other special optimizing #pragmas are available?
Given that there is so much overhead in assembly code with saving and restoring the circular buffer pointer, can we save time by using “memory-to-memory” moves and quad fetches along both J and K buses?

21 Basic concepts of quad fetches
Key issue: high efficiency for fetching 4 data values in 1 cycle
Key problem: must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
    J0 = source;;
    J1 = destination;;
Valid syntax, correct answer
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;
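The quad fetch and store pair above can be mimicked in C++ to show the copy pattern it enables. This is a behavioural sketch, not TigerSHARC code: `Quad`, `quad_fetch`, `quad_store` and `quad_copy` are hypothetical helpers, memory is modelled as a plain array of 32-bit words, and the alignment rule is enforced with an assertion rather than by hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Quad = std::array<std::uint32_t, 4>;  // four 32-bit words moved together

// Model of "XR3:0 = Q [J0 += 4]": read 4 words at index j, then post-increment j.
inline Quad quad_fetch(const std::uint32_t* mem, std::size_t& j) {
    assert(j % 4 == 0);  // quad accesses must start on a 4-word boundary
    const Quad q{{mem[j], mem[j + 1], mem[j + 2], mem[j + 3]}};
    j += 4;
    return q;
}

// Model of "Q [J1 += 4] = XR3:0": write 4 words at index j, then post-increment j.
inline void quad_store(std::uint32_t* mem, std::size_t& j, const Quad& q) {
    assert(j % 4 == 0);
    for (std::size_t i = 0; i < 4; ++i) {
        mem[j + i] = q[i];
    }
    j += 4;
}

// Copies as many whole quads as fit in n_words; returns the words copied.
// Any leftover words (n_words % 4) are left for the caller to handle.
inline std::size_t quad_copy(const std::uint32_t* src, std::uint32_t* dst,
                             std::size_t n_words) {
    std::size_t j0 = 0;
    std::size_t j1 = 0;
    while (j0 + 4 <= n_words) {
        const Quad q = quad_fetch(src, j0);
        quad_store(dst, j1, q);
    }
    return j0;
}
```

For the 129-element arrays on the slide, this moves 128 words in 32 fetch/store pairs and leaves the final word untouched.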

22 Example code and results

23 Basic concepts of quad fetches
Key issue: problem when doing FIFO data buffer operations
Key problem: must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
Valid syntax, WRONG answer
    J0 = source;;
    J1 = destination;;
    J0 = J0 + 1;;            // not on a 4-word boundary
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;    // Result as if J0 = J0 + 1;; not there
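One way to picture the "as if J0 = J0 + 1 were not there" result is to assume the plain quad access simply ignores the two low bits of the word address. The C++ sketch below models that assumption; it is not a statement about how the hardware is built, and `plain_quad_fetch` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Quad = std::array<std::uint32_t, 4>;

// Assumed model of a plain (non-DAB) quad fetch: the address is rounded down
// to the enclosing 4-word boundary before the four words are read.
// Precondition: the enclosing quad lies entirely inside `mem`.
inline Quad plain_quad_fetch(const std::vector<std::uint32_t>& mem,
                             std::size_t addr) {
    const std::size_t base = addr & ~static_cast<std::size_t>(3);
    return Quad{{mem[base], mem[base + 1], mem[base + 2], mem[base + 3]}};
}
```

Under this model a fetch at word address 1 returns words 0 to 3, exactly what a fetch at address 0 would return, so a FIFO that slides by one word per sample cannot be read with plain quad fetches.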

24 Example code – wrong values fetched

25 First attempt to use DAB
Memory values come out “4 reads” later than expected. This is because the DAB acts as an “extra” stage in the data fetch pipeline.

26 Must do a “dummy” read of the DAB

27 Always need to do an “initial” read of DAB
If the JB and JL registers are set, the DAB automatically performs circular buffer (CB) operations.
The DAB must be used with J0 to J3.
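The need for an initial read can be illustrated with a simplified C++ model in which the DAB holds the previously read aligned quad and each access stitches the requested words together from that held quad and the newly read one. This is a sketch under stated assumptions: `DabModel` and its `read` method are hypothetical, circular buffering through JB and JL is not modelled, and aligned addresses are treated the same way as non-aligned ones, which real hardware may not do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Quad = std::array<std::uint32_t, 4>;

// Simplified, assumed model of the data alignment buffer (DAB).
class DabModel {
public:
    explicit DabModel(const std::vector<std::uint32_t>& mem) : mem_(mem) {}

    // Model of one "DAB Q[...]" access at word address addr.
    // Returns words assembled from the held quad and the quad just read,
    // so the output lags the address stream by one access.
    Quad read(std::size_t addr) {
        const std::size_t off = addr & 3u;     // position inside the aligned quad
        const std::size_t base = addr - off;   // aligned quad actually read
        const Quad fresh{{mem_[base], mem_[base + 1], mem_[base + 2], mem_[base + 3]}};
        Quad out{};
        for (std::size_t i = 0; i < 4; ++i) {
            const std::size_t k = off + i;
            out[i] = (k < 4) ? held_[k] : fresh[k - 4];
        }
        held_ = fresh;  // keep the new quad for the next access
        return out;
    }

private:
    const std::vector<std::uint32_t>& mem_;
    Quad held_{};  // empty until the first (dummy) read fills it
};
```

In this model the first `read` returns stale buffer contents and only serves to load the buffer; the second `read`, made after the index has been advanced by 4, returns the four words that start at the original non-aligned address.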

28 DAB ALIGNED NON-ALIGNED

29 Tackled today
Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
Final optimization
    Quad data fetches and the DAB

