General Optimization Issues


General Optimization Issues
Multiple bus access, quad data fetches, DAB usage

2/23/2019 Optimization 3, M. Smith, ECE, University of Calgary, Canada

To be tackled today
  Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
  Final optimization
    Quad data fetches and the DAB

Most optimized SIMD floating-point (32-bit) TigerSHARC instruction

  xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

  xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation. */
  yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation. */
  xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
  xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9,
       and two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The same pair of registers must be used on either side of the + and - operators */
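
The effect of that single instruction line can be sketched in portable C++. This is only a software model for reasoning about the data flow, not TigerSHARC code: the `ComputeBlock` type and the function names are hypothetical, the buffer length is assumed to be a multiple of 4, and the word at the lowest address is assumed to land in R0.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical model of one compute block's register file (R0..R31 as floats).
struct ComputeBlock {
    std::array<float, 32> r{};
};

// Models "R3:0 = CB Q[ptr += 4]": load four consecutive words into R0..R3,
// then post-increment the pointer by 4 and wrap it inside a circular buffer
// that starts at `base` and holds `length` words (assumed multiple of 4).
inline void quadFetchCircular(ComputeBlock& cb, const std::vector<float>& mem,
                              std::size_t& ptr, std::size_t base,
                              std::size_t length) {
    for (std::size_t i = 0; i < 4; ++i) {
        cb.r[i] = mem[ptr + i];
    }
    ptr = base + ((ptr - base + 4) % length);
}

// Models "FR4 = R5 * R6" in one compute block.
inline void multiply(ComputeBlock& cb) {
    cb.r[4] = cb.r[5] * cb.r[6];
}

// Models "FR7 = R8 + R9, FR10 = R8 - R9": the add and the subtract
// share the same two source registers.
inline void addSub(ComputeBlock& cb) {
    const float a = cb.r[8];
    const float b = cb.r[9];
    cb.r[7] = a + b;
    cb.r[10] = a - b;
}

// The "xy" prefix applies the same operation to both compute blocks.
inline void simdMultiply(ComputeBlock& x, ComputeBlock& y) {
    multiply(x);
    multiply(y);
}

inline void simdAddSub(ComputeBlock& x, ComputeBlock& y) {
    addSub(x);
    addSub(y);
}
```

In this model one pass over the instruction line corresponds to calling `quadFetchCircular` once for X (with the J pointer) and once for Y (with the K pointer), followed by `simdMultiply` and `simdAddSub`.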

Code optimization: 32-bit integers or 32-bit floats
  2 * SIZE additions
  2 * SIZE memory fetches
  Left fetched on J-bus and done in X-compute
  Right fetched on K-bus and done in Y-compute
  SIZE / 2 cycles in theory

Code to this point is SISD parallel optimization
  SISD: single instruction, single data
    Using X_compute block and J memory bus
  Next stage is SIMD: single instruction, multiple data
    Using X_compute block and J memory bus for left
    Using Y_compute block and K memory bus for right
  Will need similar but different code when you are doing FIR in Lab. 3
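
The restructuring from SISD to SIMD can be pictured in plain C++ as fusing two per-channel loops into one loop that touches left and right in the same pass. The slides do not show the kernel here, so the "add an offset" operation and both function names below are assumptions chosen only to illustrate the loop shape.

```cpp
#include <cstddef>
#include <vector>

// SISD style: one data stream at a time. The left channel is finished
// before the right channel starts (one compute block, one bus).
inline void addOffsetSequential(std::vector<float>& left,
                                std::vector<float>& right, float offset) {
    for (std::size_t i = 0; i < left.size(); ++i) {
        left[i] += offset;
    }
    for (std::size_t i = 0; i < right.size(); ++i) {
        right[i] += offset;
    }
}

// SIMD style: the same operation is applied to a left and a right sample
// in each pass, mirroring X-compute/J-bus for left and Y-compute/K-bus
// for right. Only the common length of the two channels is processed.
inline void addOffsetPaired(std::vector<float>& left,
                            std::vector<float>& right, float offset) {
    const std::size_t n =
        left.size() < right.size() ? left.size() : right.size();
    for (std::size_t i = 0; i < n; ++i) {
        left[i] += offset;   // X compute block in the hardware version
        right[i] += offset;  // Y compute block in the hardware version
    }
}
```

Both functions produce the same results for equal-length channels; the paired form halves the number of loop passes, which is where the hardware version gains its speed.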

BUFFER_SIZE = 128
Rewrite so that X and Y ops done together
Error found when testing 1+

Understanding C++ on a DSP processor

We say
  int main(void) {
      int foo[200];          // Goes on stack
  }

What happens
  int foo[200];              // Goes on J-stack, as with 68K
                             // J27 = J27 - 200 at start of main
                             // J27 = J27 + 200 at end of main

Understanding C++ on a DSP processor

We say
  int main(void) {
      int foo[200];          // Goes on stack
      static int far[200];   // Goes into static memory space
  }

What happens
  int foo[200];              // Equivalent to 68K
                             // start main: J27 = J27 - 200
                             // end main:   J27 = J27 + 200
  static int far[200];       // becomes
                             // .section data1;
                             // .var _far[200];

Understanding C++ on a DSP processor

We say this
  int main(void) {
      int foo[200];          // Goes on stack
      static int far[200];   // Goes into static memory space
  }

What happens: all arrays are in DM space memory by default
  int DM foo[200];           // Goes on J-stack (DM space)
  static DM int far[200];    // becomes
                             // .section data1; (DM space)
                             // .var _far[200];

TigerSHARC: modify the C++ syntax to take the TigerSHARC architecture into account.
The processor has a J stack and a K stack; the K stack is currently not accessible from C++.
  static PM int fum[200];    // becomes
                             // .section data2; (PM space)
                             // .var _fum[200];
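
The difference between the stack array and the static arrays can be observed in standard C++ as a difference in lifetime. The sketch below is portable C++, not DSP compiler code: `DM_SPACE` and `PM_SPACE` are hypothetical empty macros standing in for the slide's DM and PM qualifiers, and the names `Counts` and `touchArrays` are made up for the illustration.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the DM / PM memory-space qualifiers on the
// slide. They expand to nothing so the sketch builds with any C++ compiler.
#define DM_SPACE
#define PM_SPACE

constexpr std::size_t kArraySize = 200;

struct Counts {
    int stackValue;   // value seen in the automatic (stack) array
    int staticValue;  // value seen in the static array
};

// Each call gets a fresh stack array, while the static arrays are created
// once and keep their contents between calls.
inline Counts touchArrays() {
    int foo[kArraySize] = {};                // stack frame, rebuilt per call
    static DM_SPACE int farData[kArraySize]; // static storage, zeroed once
    static PM_SPACE int fum[kArraySize];     // second static array

    foo[0] += 1;      // always becomes 1
    farData[0] += 1;  // counts calls: 1, 2, 3, ...
    fum[0] = farData[0];

    return Counts{foo[0], fum[0]};
}
```

Calling `touchArrays()` twice returns a `stackValue` of 1 both times, while `staticValue` goes from 1 to 2, which matches the slide's picture of `foo` living in a stack frame that is allocated and released around the function body.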

Example of DM and PM usage
We are now looking at a 2 * N / 2 cycle loop = 128 cycles

Speed considerations (cycles)
Columns: C++ debug, C++ optimized, Assembly code
  Basic code: 9,567; 1,122; 2,506
  XY compute blocks -- interleaved: 10,021; 1,310 (2 extra stalls / loop?)
  Using dm/pm memory, J / K busses: 935 (1 less stall / loop?)
  -- CB loop update physically removed: 4,666; 296
  4 memory moves = 512 cycles; 512 + 296 = 808 + 128 stalls

Real Life code – Version 5C – C++ XY compute blocks
Plan for left and right handled in parallel

XY compute blocks in assembly code: 458 cycles

Stage 5D – C++ using dual data memory

Dual Memory – Assembly code version

Dual memory access – assembly code version: 332 cycles

Dual memory – interleaved code to avoid compute block stalls – Version 5E: 203 cycles

Version 5F – fine tuning by rearranging “the little stuff” at the beginning and end: 172 cycles
(Loop 128 – no stall, but 46 cycles of “other stuff”)

Speed considerations (cycles)
Columns: C++ debug, C++ optimized, Assembly code
  Basic code: 9,567; 1,122; 2,506
  XY compute blocks -- interleaved: 10,021; 1,310 (2 extra stalls / loop?)
  Using dm/pm memory, J / K busses: 935 (1 less stall / loop?); 172
  -- CB loop update physically removed: 4,666; 296
  4 memory moves = 512 cycles; 512 + 296 = 808 + 128 stalls
  Around 128; 45 cycles overhead
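
The arithmetic behind these figures can be checked with a few compile-time constants. This is a sketch under stated assumptions: the "4 memory moves" and the "128 stalls" are read as per-loop-pass costs over 128 passes, and every constant name is invented for the illustration. The plain difference 172 - 128 is 44, so the 45 and 46 cycle overheads quoted on the slides are best read as approximate.

```cpp
// Figures quoted on the slides.
constexpr int kLoopPasses = 128;           // BUFFER_SIZE
constexpr int kLoopWithoutCbUpdate = 296;  // C++ with CB loop update removed
constexpr int kBasicCppDebug = 9567;
constexpr int kBasicCppOptimized = 1122;
constexpr int kVersion5F = 172;            // best assembly version

// Assumption: 4 memory moves per loop pass, one cycle each.
constexpr int kMovesPerPass = 4;
constexpr int kMemoryMoveCycles = kMovesPerPass * kLoopPasses;  // 512

// 512 + 296 = 808, then one assumed stall per pass on top of that.
constexpr int kEstimate = kMemoryMoveCycles + kLoopWithoutCbUpdate;
constexpr int kEstimateWithStalls = kEstimate + kLoopPasses;

// Cycles outside the 128-cycle loop in Version 5F.
constexpr int kVersion5FOverhead = kVersion5F - kLoopPasses;

// Ratio of two cycle counts, for comparing versions.
constexpr double speedup(int beforeCycles, int afterCycles) {
    return static_cast<double>(beforeCycles) / afterCycles;
}

static_assert(kMemoryMoveCycles == 512, "4 moves x 128 passes");
static_assert(kEstimate == 808, "512 + 296");
```

Under these assumptions the estimate with stalls comes to 936 cycles, within one cycle of the 935 quoted for the dm/pm version, and `speedup(kBasicCppDebug, kVersion5F)` is a little under 56.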

Questions to answer
  Can we persuade the “C++” compiler (through some command) to make use of the hardware circular buffer?
  What other special optimizing #pragmas are available?
  Given that there is so much overhead in assembly code with saving and restoring the circular buffer pointer, can we save time by using “memory-to-memory” moves and quad fetches along both J and K busses?

Basic concepts of quad fetches
Key issue: high efficiency from fetching 4 data values in 1 cycle
Key problem: must fetch from words aligned on a 4-word boundary

  .align 4;
  .var source[129];
  .var destination[129];

  J0 = source;;
  J1 = destination;;

Valid syntax, correct answer
  XR3:0 = Q [J0 += 4];;      // must be post-increment
  Q [J1 += 4] = XR3:0;;
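
A small C++ model makes the aligned case concrete. It treats memory as a vector of 32-bit words indexed by word address; `Quad`, `isQuadAligned`, `quadRead`, `quadWrite` and `quadCopy` are hypothetical names, and the model assumes the caller only passes word counts that are multiples of 4.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Quad = std::array<int, 4>;

// A word address is quad aligned when it is a multiple of 4.
inline bool isQuadAligned(std::size_t wordAddress) {
    return (wordAddress % 4) == 0;
}

// Models "XR3:0 = Q[J0 += 4]": read 4 words, then post-increment by 4.
inline Quad quadRead(const std::vector<int>& mem, std::size_t& j) {
    Quad q{mem[j], mem[j + 1], mem[j + 2], mem[j + 3]};
    j += 4;
    return q;
}

// Models "Q[J1 += 4] = XR3:0": write 4 words, then post-increment by 4.
inline void quadWrite(std::vector<int>& mem, std::size_t& j, const Quad& q) {
    for (std::size_t i = 0; i < 4; ++i) {
        mem[j + i] = q[i];
    }
    j += 4;
}

// Copies `words` values (assumed multiple of 4) four at a time, starting
// from aligned addresses 0 in both buffers.
inline void quadCopy(const std::vector<int>& source,
                     std::vector<int>& destination, std::size_t words) {
    std::size_t j0 = 0;
    std::size_t j1 = 0;
    while (j0 < words) {
        quadWrite(destination, j1, quadRead(source, j0));
    }
}
```

Copying N words this way takes N / 4 read and write pairs instead of N, which is the efficiency the slide refers to.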

Example code and results

Basic concepts of quad fetches
Key issue: problem when doing FIFO data buffer operations
Key problem: must fetch from words aligned on a 4-word boundary

  .align 4;
  .var source[129];
  .var destination[129];

Valid syntax, WRONG answer
  J0 = source;;
  J1 = destination;;
  J0 = J0 + 1;;              // not on a 4-word boundary
  XR3:0 = Q [J0 += 4];;      // must be post-increment
  Q [J1 += 4] = XR3:0;;      // Result as if J0 = J0 + 1;; not there
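
One way to picture the "result as if J0 = J0 + 1 not there" behaviour is a quad read that ignores the two low bits of the word address. The C++ below is a simplified model built on that assumption, not a description of the actual memory hardware; `quadReadIgnoringLowBits` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Quad = std::array<int, 4>;

// Model of a plain quad fetch from a possibly misaligned word address:
// the access is assumed to start at the enclosing 4-word boundary, so
// the low two address bits have no effect on the data returned.
// The pointer itself is still post-incremented by 4.
inline Quad quadReadIgnoringLowBits(const std::vector<int>& mem,
                                    std::size_t& j) {
    const std::size_t base = j & ~static_cast<std::size_t>(3);
    Quad q{mem[base], mem[base + 1], mem[base + 2], mem[base + 3]};
    j += 4;
    return q;
}
```

With memory holding 0, 1, 2, ... and the pointer starting at 1, this model returns 0..3 and then 4..7, the same values an aligned pointer would have fetched, rather than the 1..4 and 5..8 a FIFO shift needs.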

Example code – wrong values fetched

First attempt to use DAB
Memory values coming out “4 reads” later than expected. This is because the DAB acts as an “extra” stage in the data fetch pipeline.

Must do a “dummy” read of the DAB

Always need to do an “initial” read of DAB
If JB and JL registers are set then DAB automatically does CB operations.
DAB must use J0 to J3
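
Why the initial read is needed can be seen in a simplified software model of the DAB as a one-quad holding buffer. The model is an assumption made for illustration: each DAB read fetches the aligned quad that contains the address, assembles its output from the previously held quad plus the new one, and then keeps the new quad. Circular buffer handling through JB and JL is left out, and `Dab` and its members are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Quad = std::array<int, 4>;

// Simplified model of the data alignment buffer (DAB).
struct Dab {
    Quad held{};  // aligned quad kept from the previous DAB read

    // Models "XR3:0 = DAB Q[J0 += 4]" for a word address j that may be
    // misaligned. The output lags by one read, so the first call returns
    // stale data and only serves to prime the buffer.
    Quad read(const std::vector<int>& mem, std::size_t& j) {
        const std::size_t offset = j & 3;
        const std::size_t base = j - offset;
        const Quad incoming{mem[base], mem[base + 1],
                            mem[base + 2], mem[base + 3]};
        Quad out{};
        for (std::size_t i = 0; i < 4; ++i) {
            const std::size_t k = i + offset;
            out[i] = (k < 4) ? held[k] : incoming[k - 4];
        }
        held = incoming;
        j += 4;
        return out;
    }
};
```

In this model, with memory holding 0, 1, 2, ... and the pointer starting at 1, the first read is the dummy read, the second returns 1..4 and the third returns 5..8. The data arrives one quad read late, which is the extra pipeline stage described above.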

DAB: aligned and non-aligned accesses

Tackled today
  Recap
    SISD with only X-compute and J-bus accesses
    SIMD with X- and Y-compute and J-bus accesses
    SIMD with X- and Y-compute and J- and K-bus accesses
    Doing it in C++ and in assembly code
  Final optimization
    Quad data fetches and the DAB