1
General Optimization Issues
Multiple bus access, quad data fetches, DAB usage
2
Optimization 3, M. Smith, ECE, University of Calgary, Canada
To be tackled today
Recap
  SISD with only X-compute and J-bus accesses
  SIMD with X- and Y-compute and J-bus accesses
  SIMD with X- and Y-compute and J- and K-bus accesses
  Doing it in C++ and in assembly code
Final optimization
  Quad data fetches and the DAB
2/23/2019 Optimization 3, M. Smith, ECE, University of Calgary, Canada
3
Most optimized SIMD floating-point (32-bit) TigerSHARC instruction
xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

The four parts of this single instruction line:

    xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation */

    yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation */

    xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */

    xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
       AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The same register pair must be used on either side of the + and - operators */
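To make the register-level effect of the compute part of this instruction line concrete, here is a minimal host-side C++ sketch. It is not TigerSHARC code: `ComputeBlock` and `simd_compute_step` are hypothetical names introduced for illustration, and the model ignores the two quad fetches, pipeline timing and any hardware rounding behaviour. It only shows that one `xy` instruction applies the same multiply, add and subtract to both the X and the Y register files.

```cpp
#include <array>

// Hypothetical model of one compute block's 32 floating-point registers.
struct ComputeBlock {
    std::array<float, 32> r{};
};

// Models only the compute part of the instruction line:
//   xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;
// Source registers are read before any result is written, which is the
// assumption made here about operations sharing one instruction line.
inline void simd_compute_step(ComputeBlock& x, ComputeBlock& y) {
    ComputeBlock* blocks[2] = {&x, &y};
    for (ComputeBlock* blk : blocks) {
        const float r5 = blk->r[5], r6 = blk->r[6];
        const float r8 = blk->r[8], r9 = blk->r[9];
        blk->r[4]  = r5 * r6;  // FR4  = R5 * R6
        blk->r[7]  = r8 + r9;  // FR7  = R8 + R9
        blk->r[10] = r8 - r9;  // FR10 = R8 - R9 (same register pair as the add)
    }
}
```

Calling `simd_compute_step(x, y)` once corresponds to the work of the two compute blocks in a single instruction line; on the real processor the two quad fetches would be issued in that same line.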
4
Code optimization – 32-bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
Left fetched on J-bus and done in X-compute
Right fetched on K-bus and done in Y-compute
SIZE / 2 cycles in theory
5
Code to this point is SISD parallel optimization
SISD – single instruction, single data
  Using X_compute block and J memory bus
Next stage – SIMD – single instruction, multiple data
  Using X_compute block and J memory bus for left
  Using Y_compute block and K memory bus for right
Will need similar but different code when you are doing FIR in Lab. 3
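The difference between the two stages can be sketched in host-side C++. The slides do not show the actual loop body, so this example assumes, purely for illustration, that the work is a running sum over a left and a right buffer; `sum_sisd` and `sum_simd_style` are hypothetical names. On the TigerSHARC the second form maps naturally onto X-compute plus J-bus for left and Y-compute plus K-bus for right, whereas a host compiler is free to schedule it however it likes.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// SISD style: one data stream at a time (X-compute and J-bus only in the slides).
inline std::pair<long, long> sum_sisd(const std::vector<int>& left,
                                      const std::vector<int>& right) {
    long l = 0, r = 0;
    for (std::size_t i = 0; i < left.size(); ++i) l += left[i];
    for (std::size_t i = 0; i < right.size(); ++i) r += right[i];
    return {l, r};
}

// SIMD style: left and right are handled in the same iteration, so each
// pass through the loop does one "left" and one "right" operation together.
// Only the common length of the two buffers is processed.
inline std::pair<long, long> sum_simd_style(const std::vector<int>& left,
                                            const std::vector<int>& right) {
    long l = 0, r = 0;
    for (std::size_t i = 0; i < left.size() && i < right.size(); ++i) {
        l += left[i];   // would run in X-compute, data fetched on the J-bus
        r += right[i];  // would run in Y-compute, data fetched on the K-bus
    }
    return {l, r};
}
```

Both functions return the same sums for equal-length buffers; the second simply needs half as many loop iterations.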
6
BUFFER_SIZE = 128: rewrite so that X and Y ops are done together
Error found when testing 1+
7
Understanding C++ on a DSP processor
We say
    int main( void) {
        int foo[200];    // Goes on stack
What happens
    int foo[200];        // Goes on J-stack -- as with 68K
                         // J27 = J27 – start of main
                         // J27 = J end of main
8
Understanding C++ on a DSP processor
We say
    int main( void) {
        int foo[200];           // Goes on stack
        static int far[200];    // Goes into static memory space
What happens
    int foo[200];               // Equivalent to 68K
                                // start main J27 = J27 – 200
                                // end main J27 = J
    static int far[200];        // becomes
                                // .section data1;
                                // .var _far[200];
9
Understanding C++ on a DSP processor
WE SAY THIS
    int main( void) {
        int foo[200];           // Goes on stack
        static int far[200];    // Goes into static memory space
What happens --- all arrays by default are DM space memory
    int DM foo[200];            // Goes on J-stack (DM space)
    static DM int far[200];     // becomes
                                // .section data1; (DM space)
                                // .var _far[200];
TigerSHARC --- Modify C++ syntax to take TigerSHARC architecture into account
Has J stack and K stack -- K stack currently not accessible from C++
    static PM int fum[200];     // becomes
                                // .section data2; (PM space)
                                // .var _fum[200];
10
Example of DM and PM usage
We are now looking at 2 * N / 2 cycle loop = 128 cycles
11
Speed considerations
Columns: C++ debug | C++ optimized | Assembly code
Basic code: 9,567 1,122 2,506
XY compute blocks -- interleaved: 10,021 1,310 (2 extra stalls / loop?)
Using dm/pm memory, J / K busses: 935 (1 less stall / loop?)
-- CB loop update physically removed: 4,666 296
4 memory moves = 512 cycles = stalls
12
Real Life code – Version 5C – C++ XY compute blocks
Plan for left and right handled in parallel
13
XY compute blocks in assembly code
458 cycles
14
Stage 5D – C++ using dual data memory
15
Dual Memory – Assembly code version
16
Dual memory access – assembly code version
332 cycles
17
Dual memory – interleaved code to avoid compute block stalls – Version 5E
203 cycles
18
Version 5F – fine tuning by rearranging “the little stuff” at the beginning and end
172 cycles (loop 128 – no stall, but 46 cycles of “other stuff”)
19
Speed considerations
Columns: C++ debug | C++ optimized | Assembly code
Basic code: 9,567 1,122 2,506
XY compute blocks -- interleaved: 10,021 1,310 (2 extra stalls / loop?)
Using dm/pm memory, J / K busses: 935 (1 less stall / loop?) 172
-- CB loop update physically removed: 4,666 296
4 memory moves = 512 cycles = stalls
Around 128, 45 cycles overhead
20
Questions to answer
Can we persuade the “C++” compiler (through some command) to make use of the hardware circular buffer?
What other special optimizing #pragmas are available?
Given that there is so much overhead in assembly code with saving and restoring the circular buffer pointer, can we save time by using “memory-to-memory” moves and quad fetches along both J and K busses?
21
Basic concepts of quad fetches Key Issue
High efficiency for fetching 4 data values in 1 cycle
Key problem – must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
    J0 = source;;
    J1 = destination;;
Valid syntax – correct answer
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;
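As a rough host-side illustration of the aligned case, the fetch/store pair above can be modelled as a loop that moves four words at a time with post-increment. This is a sketch only: `is_quad_aligned` and `quad_copy` are hypothetical names, memory is modelled as a `std::vector<int>` indexed by word address, and the model refuses misaligned pointers rather than imitating the hardware.

```cpp
#include <cstddef>
#include <vector>

// True when a word address sits on a 4-word (quad) boundary.
inline bool is_quad_aligned(std::size_t word_addr) {
    return (word_addr & 3u) == 0;
}

// Models repeating
//   XR3:0 = Q[J0 += 4];;   Q[J1 += 4] = XR3:0;;
// `quads` times. Returns false (and copies nothing) if either start
// address is not quad aligned or the copy would run past either buffer.
inline bool quad_copy(const std::vector<int>& src, std::size_t j0,
                      std::vector<int>& dst, std::size_t j1,
                      std::size_t quads) {
    if (!is_quad_aligned(j0) || !is_quad_aligned(j1)) return false;
    if (j0 + 4 * quads > src.size() || j1 + 4 * quads > dst.size()) return false;
    for (std::size_t q = 0; q < quads; ++q) {
        int xr[4];                                            // stands in for XR0..XR3
        for (std::size_t i = 0; i < 4; ++i) xr[i] = src[j0 + i];
        j0 += 4;                                              // post-increment of J0
        for (std::size_t i = 0; i < 4; ++i) dst[j1 + i] = xr[i];
        j1 += 4;                                              // post-increment of J1
    }
    return true;
}
```

With the 129-word buffers from the slide, 32 quad moves cover words 0 to 127; the last word would need a separate single-word move.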
22
Example code and results
23
Basic concepts of quad fetches Key Issue
Problem when doing FIFO data buffer operations
Key problem – must fetch from words aligned on a 4-word boundary
    .align 4;
    .var source[129];
    .var destination[129];
Valid syntax – WRONG answer
    J0 = source;;
    J1 = destination;;
    J0 = J0 + 1;;            // (not on a 4-word boundary)
    XR3:0 = Q [J0 += 4];;    // must be post-increment
    Q [J1 += 4] = XR3:0;;    // Result as if J0 = J0 + 1;; not there
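The "result as if `J0 = J0 + 1` not there" behaviour can be imitated on a host with a small model in which a normal quad fetch ignores the low two bits of the word address. That masking is an assumption drawn from the slide's description, not from a hardware manual, and `quad_fetch` is a hypothetical helper name. Whether the pointer keeps its offset after the post-increment is also an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Models a normal (non-DAB) quad fetch:  XR3:0 = Q[J0 += 4];;
// Assumed behaviour: the access always starts at the quad boundary at or
// below J0, so a misaligned J0 silently returns the aligned quad.
// Element 0 of the result stands in for XR0, element 3 for XR3.
inline std::array<int, 4> quad_fetch(const std::vector<int>& mem,
                                     std::size_t& j0) {
    const std::size_t base = j0 & ~static_cast<std::size_t>(3);
    std::array<int, 4> xr{};
    for (std::size_t i = 0; i < 4; ++i) xr[i] = mem.at(base + i);
    j0 += 4;  // post-increment; in this model the offset is kept in the pointer
    return xr;
}
```

Fetching with `j0 = 1` therefore returns words 0 to 3, exactly what an aligned fetch from `j0 = 0` would return, which is the wrong answer for a FIFO that has moved on by one sample.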
24
Example code – wrong values fetched
25
First attempt to use DAB
Memory values coming out “4 reads” later than expected. This is because the DAB acts as an “extra” stage in the data fetch pipeline.
26
Must do a “dummy” read of the DAB
27
Always need to do an “initial” read of DAB
If the JB and JL registers are set, then the DAB automatically does CB operations.
DAB must use J0 to J3
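Why an initial read is needed can be seen from a simplified host-side model of a data alignment buffer. The model below is an assumption made for illustration, not a description of the actual hardware: `DabModel` is a hypothetical name, the buffer is taken to hold one quad word from the previous read, each read loads the next aligned quad and assembles its result from the held and the fresh quad, and circular buffer wrap-around (JB/JL) is not modelled.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Simplified, assumed model of a data alignment buffer (DAB).
struct DabModel {
    std::array<int, 4> held{};  // quad word kept from the previous DAB read

    // One DAB read with post-increment of the pointer by 4 words.
    std::array<int, 4> read(const std::vector<int>& mem, std::size_t& j) {
        const std::size_t offset = j & 3u;   // misalignment within the quad
        const std::size_t base = j - offset; // aligned quad actually loaded
        std::array<int, 4> fresh{};
        for (std::size_t i = 0; i < 4; ++i) fresh[i] = mem.at(base + i);

        // Assemble 4 words: the tail of the held quad, then the head of
        // the fresh quad. With an empty buffer this is stale data.
        std::array<int, 4> out{};
        for (std::size_t i = 0; i < 4; ++i) {
            const std::size_t k = offset + i;
            out[i] = (k < 4) ? held[k] : fresh[k - 4];
        }
        held = fresh;
        j += 4;
        return out;
    }
};
```

In this model the first read only primes the buffer and its result should be discarded; from the second read onwards each result holds the four consecutive words starting at the pointer value used by the previous read, which is the one-stage delay the slides describe.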
28
DAB ALIGNED NON-ALIGNED
29
Tackled today
Recap
  SISD with only X-compute and J-bus accesses
  SIMD with X- and Y-compute and J-bus accesses
  SIMD with X- and Y-compute and J- and K-bus accesses
  Doing it in C++ and in assembly code
Final optimization
  Quad data fetches and the DAB