General Optimization Issues

Multiple bus access, quad data fetches, DAB usage

Optimization 3, M. Smith, ECE, University of Calgary, Canada
To be tackled today
Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
Final optimization
    Quad data fetches and the DAB
2/23/2019 Optimization 3, M. Smith, ECE, University of Calgary, Canada

Most optimized SIMD floating-point (32-bit) TigerSHARC instruction
    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation */
yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation */
xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
       AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The same registers must be used on either side of the + and - operators */
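To make the amount of work packed into that one instruction line concrete, here is a minimal host-side C++ sketch. It is a model written for illustration, not TigerSHARC code: the names ComputeBlock, quadLoadCircular, multiplyAddSubtract and instructionLine are hypothetical, the circular buffer is reduced to a simple index wrap, and the assumption that the lowest address lands in R0 is ours.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of one compute block's register file.
struct ComputeBlock {
    std::array<float, 32> r{};
};

// Models "R3:0 = CB Q[ptr += 4]": load four consecutive words into R0..R3
// (assumed order: lowest address into R0), then post-increment the index
// and wrap it inside a circular buffer of `length` words.
void quadLoadCircular(ComputeBlock& cb, const float* buffer,
                      std::size_t length, std::size_t& index) {
    assert(length % 4 == 0 && index % 4 == 0 && index < length);
    for (std::size_t i = 0; i < 4; ++i) {
        cb.r[i] = buffer[index + i];
    }
    index = (index + 4) % length;
}

// Models "FR4 = R5 * R6; FR7 = R8 + R9, FR10 = R8 - R9" in one block.
// All operands are read before any result is written.
void multiplyAddSubtract(ComputeBlock& cb) {
    const float product = cb.r[5] * cb.r[6];
    const float sum = cb.r[8] + cb.r[9];
    const float difference = cb.r[8] - cb.r[9];
    cb.r[4] = product;
    cb.r[7] = sum;
    cb.r[10] = difference;
}

// One instruction line: the same operations applied to the X and Y blocks,
// with X fed from the J-bus buffer and Y fed from the K-bus buffer.
// The loads touch R0..R3 only, so their order relative to the arithmetic
// does not change the result in this model.
void instructionLine(ComputeBlock& x, ComputeBlock& y,
                     const float* jBuffer, std::size_t& j0,
                     const float* kBuffer, std::size_t& k0,
                     std::size_t length) {
    quadLoadCircular(x, jBuffer, length, j0);
    quadLoadCircular(y, kBuffer, length, k0);
    multiplyAddSubtract(x);
    multiplyAddSubtract(y);
}
```

In this model a single call performs 8 memory reads, 2 multiplications, 2 additions and 2 subtractions, which is the work the slide attributes to one instruction line.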

Code optimization – 32-bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
Left fetched on J-bus and done in X-compute
Right fetched on K-bus and done in Y-compute
SIZE / 2 cycles in theory

Code to this point is SISD parallel optimization
SISD: single instruction, single data
    Using X_compute block and J memory bus
Next stage, SIMD: single instruction, multiple data
    Using X_compute block and J memory bus for left
    Using Y_compute block and K memory bus for right
You will need similar but different code when you are doing FIR in Lab 3
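The restructuring from SISD to SIMD can be sketched in plain C++. The slide code itself is not reproduced in this transcript, so the workload here (summing a left and a right buffer) and the names sumSequential and sumTogether are assumptions chosen only to show the shape of the change: the same operations, regrouped so that the left and right work sit side by side in one loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// SISD-style ordering: all of the left channel, then all of the right.
std::pair<long, long> sumSequential(const int* left, const int* right,
                                    std::size_t size) {
    long leftSum = 0;
    long rightSum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        leftSum += left[i];
    }
    for (std::size_t i = 0; i < size; ++i) {
        rightSum += right[i];
    }
    return {leftSum, rightSum};
}

// SIMD-style ordering: left and right handled in the same iteration, so the
// two additions are independent and could map onto the X and Y compute blocks.
std::pair<long, long> sumTogether(const int* left, const int* right,
                                  std::size_t size) {
    long leftSum = 0;
    long rightSum = 0;
    for (std::size_t i = 0; i < size; ++i) {
        leftSum += left[i];    // candidate for X-compute, data on the J bus
        rightSum += right[i];  // candidate for Y-compute, data on the K bus
    }
    return {leftSum, rightSum};
}
```

Both functions return the same pair; only the grouping of the work differs. Whether a given compiler actually issues the two additions in parallel is a separate question.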

BUFFER_SIZE = 128: rewrite so that X and Y ops are done together
Error found when testing 1+

Understanding C++ on a DSP processor
We say
    int main(void) {
        int foo[200];           // Goes on stack
    }
What happens
    int foo[200];               // Goes on J-stack, as with the 68K
                                // start of main: J27 = J27 - 200
                                // end of main:   J27 = J27 + 200

We say
    int main(void) {
        int foo[200];           // Goes on stack
        static int far[200];    // Goes into static memory space
    }
What happens
    int foo[200];               // Equivalent to 68K
                                // start of main: J27 = J27 - 200
                                // end of main:   J27 = J27 + 200
    static int far[200];        // becomes
                                //   .section data1;
                                //   .var _far[200];

WE SAY THIS
    int main(void) {
        int foo[200];           // Goes on stack
        static int far[200];    // Goes into static memory space
    }
What happens: all arrays are DM space memory by default
    int DM foo[200];            // Goes on J-stack (DM space)
    static DM int far[200];     // becomes
                                //   .section data1;   (DM space)
                                //   .var _far[200];
TigerSHARC: modify the C++ syntax to take the TigerSHARC architecture into account. It has a J stack and a K stack; the K stack is currently not accessible from C++
    static PM int fum[200];     // becomes
                                //   .section data2;   (PM space)
                                //   .var _fum[200];

Example of DM and PM usage
We are now looking at a 2 * N / 2 cycle loop = 128 cycles

Speed considerations (cycles)
Columns: C++ debug, C++ optimized, Assembly code
    Basic code                               9,567    1,122    2,506
    XY compute blocks -- interleaved         10,021   1,310    2 extra stalls / loop?
    Using dm/pm memory, J / K busses         935      1 less stall / loop?
    -- CB loop update physically removed     4,666    296
    4 memory moves = 512 cycles = stalls

Real Life code – Version 5C – C++ XY compute blocks
Plan for left and right to be handled in parallel

XY compute blocks in assembly code
458 cycles

Stage 5D – C++ using dual data memory

Dual Memory – Assembly code version

Dual memory access – assembly code version
332 cycles

Dual memory – interleaved code to avoid compute block stalls – Version 5E
203 cycles

Version 5F – fine tuning by rearranging “the little stuff” at the beginning and end
172 cycles (loop 128, no stalls, but 46 cycles of “other stuff”)

Speed considerations (cycles)
Columns: C++ debug, C++ optimized, Assembly code
    Basic code                               9,567    1,122    2,506
    XY compute blocks -- interleaved         10,021   1,310    2 extra stalls / loop?
    Using dm/pm memory, J / K busses         935      1 less stall / loop?    172
    -- CB loop update physically removed     4,666    296
    4 memory moves = 512 cycles = stalls
    Around 128, 45 cycles overhead
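The gains quoted in these slides can be checked with a few lines of arithmetic. The sketch below uses only cycle counts that appear in the deck (9,567 for the basic C++ debug build, 2,506 for the basic assembly code, 172 for Version 5F, 128 for the loop itself); the helper names speedup and overheadCycles are hypothetical.

```cpp
#include <cassert>

// Ratio of a baseline cycle count to an optimized cycle count.
double speedup(double baselineCycles, double optimizedCycles) {
    assert(optimizedCycles > 0.0);
    return baselineCycles / optimizedCycles;
}

// Cycles spent outside the loop body: total minus the loop's own cycles.
double overheadCycles(double totalCycles, double loopCycles) {
    assert(totalCycles >= loopCycles);
    return totalCycles - loopCycles;
}
```

With those numbers, speedup(9567, 172) is roughly 55.6 and speedup(2506, 172) is roughly 14.6, while overheadCycles(172, 128) gives 44, in line with the "around 45 cycles" of overhead quoted in the slides.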

Questions to answer
    Can we persuade the “C++” compiler (through some command) to make use of the hardware circular buffer?
    What other special optimizing #pragmas are available?
    Given that there is so much overhead in assembly code from saving and restoring the circular buffer pointer, can we save time by using “memory-to-memory” moves and quad fetches along both the J and K busses?

Basic concepts of quad fetches Key Issue
High efficiency: fetches 4 data values in 1 cycle
Key problem: must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
    J0 = source;;
    J1 = destination;;
Valid syntax, correct answer
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;

Example code and results

Basic concepts of quad fetches Key Issue
Problem when doing FIFO data buffer operations
Key problem: must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
Valid syntax, WRONG answer
    J0 = source;;
    J1 = destination;;
    J0 = J0 + 1;;            // no longer on a 4-word boundary
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;
    // Result is as if J0 = J0 + 1;; were not there
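The "wrong answer" case can be mimicked on a host machine. The sketch below is a simplified C++ model, not TigerSHARC code: it assumes, as the slide reports, that a plain quad fetch behaves as if the two low address bits were ignored. The function name quadFetch is hypothetical, and the array is assumed to start on a quad boundary so that indices stand in for addresses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Models "XR3:0 = Q[J0 += 4]" for a plain (non-DAB) quad fetch.
// Assumption taken from the slide: the fetch acts as if the address were
// rounded down to the previous 4-word boundary. The pointer itself is still
// post-incremented by 4 from its unrounded value.
std::array<int, 4> quadFetch(const int* memory, std::size_t& address) {
    const std::size_t aligned = address & ~static_cast<std::size_t>(3);
    const std::array<int, 4> quad{memory[aligned], memory[aligned + 1],
                                  memory[aligned + 2], memory[aligned + 3]};
    address += 4;
    return quad;
}
```

Starting from address 1 in a buffer holding 0, 1, 2, ..., the model returns 0, 1, 2, 3 rather than 1, 2, 3, 4, which matches the slide's remark that the result is as if the J0 = J0 + 1 instruction were not there.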

Example code – wrong values fetched

First attempt to use DAB
Memory values come out “4 reads” later than expected. This is because the DAB acts as an “extra” stage in the data fetch pipeline

Must do a “dummy” read of the DAB

Always need to do an “initial” read of DAB
If JB and JL registers are set then DAB automatically does CB operations. DAB must use J0 to J3
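Why the initial read is needed can be illustrated with a deliberately simplified C++ model of the DAB as a one-quad-word buffer. This is a sketch under our own assumptions, not a description of the hardware: the name DataAlignmentBuffer is hypothetical, only the non-aligned case is exercised, and the circular buffer handling through JB and JL is not modelled. Each read fetches the aligned quad word at the current address and assembles its result from the previously buffered quad word plus the new one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified, assumed model of a data alignment buffer (DAB).
struct DataAlignmentBuffer {
    // Quad word kept from the previous read; meaningless before the first read.
    std::array<int, 4> held{};

    // Models "XR3:0 = DAB Q[J0 += 4]" with indices standing in for addresses.
    std::array<int, 4> read(const int* memory, std::size_t& address) {
        const std::size_t offset = address & 3;        // position inside a quad
        const std::size_t aligned = address - offset;  // start of current quad
        const std::array<int, 4> fresh{memory[aligned], memory[aligned + 1],
                                       memory[aligned + 2], memory[aligned + 3]};
        std::array<int, 4> result{};
        for (std::size_t i = 0; i < 4; ++i) {
            const std::size_t position = offset + i;
            // The early words come from the held quad, the rest from the new one.
            result[i] = (position < 4) ? held[position] : fresh[position - 4];
        }
        held = fresh;
        address += 4;
        return result;
    }
};
```

In this model the first read only fills the buffer, so its result should be thrown away. With a buffer holding 0, 1, 2, ... and a start address of 1, the second read returns 1, 2, 3, 4 and the third returns 5, 6, 7, 8, which is the non-aligned data a plain quad fetch cannot deliver.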

DAB: aligned vs. non-aligned

Tackled today
Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
Final optimization
    Quad data fetches and the DAB
