Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter -- Part 5
What syntax makes the code more parallel?
Understanding the TigerSHARC Parallel Operations
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
- Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
- Optimized assembly code and optimized C++
4/30/2019 Speed IIR -- stage 5 M. Smith, ECE, University of Calgary, Canada
Processor Architecture
- 3 128-bit data busses
- 2 integer ALUs
- 2 computational blocks, each with: ALU (float and integer), SHIFTER, MULTIPLIER, COMMUNICATIONS CLU
Use C++ IIR code as comments
Things to think about prior to code writing:
- Register name reorganization
- Keep XR4 for xInput -- saves a cycle
- Put S1 and S2 into XR0 and XR1 -- chance to fetch 2 memory values in one cycle using L[ ]
- Put H0 to H5 in XR12 to XR16 -- chance to fetch 4 memory values in one cycle using Q[ ], followed by one normal fetch
- Problem: if there is more than one IIR stage, then the second-stage fetches are not quad aligned
- There are two sets of multiplications using S1 and S2. Can these be done in the X and Y compute blocks in one cycle?
    float *copyStateStartAddress = state;
    S1 = *state++;
    S2 = *state++;
    *copyStateStartAddress++ = S1;
    *copyStateStartAddress++ = S2;
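The register planning above is easier to follow against a concrete C++ reference. The slides do not reproduce the full IIR routine, so what follows is only a minimal sketch of one second-order stage, assuming a direct form II structure with five coefficients per stage (matching the Q[ ] fetch of four values plus one normal fetch) and the two state values S1 and S2. The function name iirStage, the coefficient order and the sign convention are illustrative assumptions, not the course code. The state load and store pattern is the one quoted on the slide.

```cpp
#include <cassert>

// HYPOTHETICAL single IIR stage (direct form II); not the course code.
// ASSUMED coefficient layout: coeffs = {a1, a2, b0, b1, b2}.
// state[0] = S1 (previous intermediate value), state[1] = S2 (the one before).
float iirStage(float xInput, const float *coeffs, float *state) {
    assert(coeffs != nullptr && state != nullptr);

    float *copyStateStartAddress = state;  // remember where the state block starts
    float S1 = *state++;                   // two consecutive reads: candidates for one L[ ] fetch
    float S2 = *state++;

    // First set of multiplications with S1 and S2 (feedback path)
    float w = xInput - coeffs[0] * S1 - coeffs[1] * S2;
    // Second set of multiplications with S1 and S2 (feedforward path)
    float yOut = coeffs[2] * w + coeffs[3] * S1 + coeffs[4] * S2;

    // Shift the delay line and write the state back
    S2 = S1;
    S1 = w;
    *copyStateStartAddress++ = S1;         // two consecutive writes: candidates for one L[ ] store
    *copyStateStartAddress++ = S2;
    return yOut;
}
```

The two groups of products that use S1 and S2 are independent of each other once S1 and S2 are loaded, which is what makes the question about the X and Y compute blocks worth asking. Note that in this sketch the value written back to state[1] is the old S1, not the stage output.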
Register name conversion done in steps
- Setting Xin = XR4 and Yout = XR8 saves one cycle
- Bulk conversion with no error
- So many errors were made during bulk conversion that I went to find / replace / test for each register individually
Fix bringing state variables in
QUESTION
- We have XR18 = [J6 += 1] (load S1) and R19 = [J6 += 1] (load S2)
- Both are valid
- What is the difference?
That difference -- could it be used to our advantage?
- XR18 = [J6 += 1];;
  Reads the value at memory location [J6] and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR18.
- XYR19 = [J6 += 1];;
  Reads the value at memory location [J6] and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR19 AND YR19.
- XYR19 = L[J6 += 2];; -- concept correct, but executes faster
  Reads the value at [J6], updates J6 to J6 + 1, and stores it in XR19, AND reads the value at [(new) J6], updates J6 to J6 + 1, and stores it in YR19, PROVIDED J6 was originally aligned on a 64-bit boundary.
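The alignment proviso can be checked with simple address arithmetic. The sketch below assumes word-addressed memory (one address per 32-bit word), so an L[ ] access needs a word address divisible by 2 and a Q[ ] access needs one divisible by 4. The helper names are hypothetical. It also shows why back-to-back stages with five coefficients each lose quad alignment from the second stage on.

```cpp
#include <cassert>
#include <cstdint>

// Addresses are counted in 32-bit words (ASSUMPTION: word-addressed memory).
constexpr bool isLongAligned(std::uint32_t wordAddress) {
    return wordAddress % 2u == 0u;   // 64-bit boundary = every 2 words
}

constexpr bool isQuadAligned(std::uint32_t wordAddress) {
    return wordAddress % 4u == 0u;   // 128-bit boundary = every 4 words
}

// Word address of the first coefficient of a stage when the stages are
// packed one after another, coeffsPerStage words each.
constexpr std::uint32_t stageCoeffAddress(std::uint32_t base,
                                          std::uint32_t stage,
                                          std::uint32_t coeffsPerStage) {
    return base + stage * coeffsPerStage;
}
```

With a quad-aligned base and five coefficients per stage, stage 0 starts on a quad boundary but stage 1 starts five words later and does not. Padding each stage to eight words would restore quad alignment for every stage at the cost of three unused words per stage; that is one possible design choice, not something the slides prescribe.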
Send state variables out
- Go for the gusto -- use L[ ] (64-bit)
- Need to recalculate the test result: state[1] is NOT Yout
Working solution -- Part 1
Working solution -- Part 2
Working solution -- Part 3
- I could not spot where any extra stalls would occur because of memory pipeline reads and writes
- All values were in place when needed
- Need to check with the pipeline viewer
Let's look at DATA MEMORY and COMPUTE pipeline issues -- 1
- No problems here
Weird stuff happening with the INSTRUCTION pipeline
- Only 9 instructions are being fetched, but we are executing 21!
- Why all these instruction stalls?
Analysis
- We are seeing the impact of the processor doing quad fetches of instructions (128 bits) into the IAB (instruction alignment buffer)
- Once in the IAB, the instructions (32 bits each) are issued to the various execution units as needed
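The fetch arithmetic behind this analysis can be written down directly. The sketch below only restates the 128-bit fetch and 32-bit instruction sizes given above; the function names are hypothetical, and it ignores where the code starts within a quad word and any branches.

```cpp
#include <cassert>

// Each 128-bit quad fetch carries four 32-bit instruction words.
constexpr unsigned kWordsPerQuadFetch = 128u / 32u;

// Smallest number of quad fetches that can supply the given number of
// 32-bit instruction words (rounding up).
constexpr unsigned quadFetchesNeeded(unsigned instructionWords) {
    return (instructionWords + kWordsPerQuadFetch - 1u) / kWordsPerQuadFetch;
}

// Number of 32-bit instruction slots delivered by the given number of fetches.
constexpr unsigned instructionSlots(unsigned quadFetches) {
    return quadFetches * kWordsPerQuadFetch;
}
```

If the 21 executed instructions are counted as 32-bit words (an assumption here), they need at least 6 quad fetches, and 9 quad fetches deliver 36 slots. Seeing fewer fetches than executed instructions is therefore expected rather than a sign of a fault.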
Before we do any further optimization, we need to understand processor parallelism
- We already know about parallel multiplications and additions and their associated stalls
- What about parallel memory fetches?
Parallel memory fetches
What is permissible? Can we do:
- Parallel fetches into XY at the same time
- Parallel fetches into an X and a Y register
- Parallel fetches into two X registers
Parallel memory syntax -- not too difficult
- Only this syntax is illegal
- Will need to do more research to discover whether “legal” means that the operation is performed without stalling the memory pipeline
- NOTE: Need to transfer INPAR3 (J6) into a K register (K6) in order to be able to use both the J and K data busses during the IIR operation
Question: How do you (in C++) place IIR coefficients in one memory block and state values into another?
Question: How do you (in assembly code) place IIR coefficients in one memory block and state values into another?
The C++ manual talks about 2 data spaces (dm and pm) for STATIC or GLOBAL variables
BAD
- You can use the VDSP C++ extension pm to specify a different memory space.
- HOWEVER, there is no such thing as a pm stack, so all pm variables must be declared “static” or “global”
- dm arrays can be placed on the stack, but there may be alignment issues
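The static-or-global restriction can be illustrated with a small declaration sketch. The qualifier placement shown here (between the type and the variable name) is an assumption about the VDSP extension syntax and should be checked against the compiler manual. VDSP_MEMORY_QUALIFIERS, IIR_DM, IIR_PM and the array names are hypothetical names introduced so that the same file also builds with a standard C++ compiler, where the qualifiers expand to nothing.

```cpp
// VDSP_MEMORY_QUALIFIERS is a HYPOTHETICAL switch: define it only when the
// compiler in use accepts the dm and pm extension keywords.
#ifdef VDSP_MEMORY_QUALIFIERS
#define IIR_DM dm
#define IIR_PM pm
#else
#define IIR_DM   /* standard C++: no memory-space qualifier */
#define IIR_PM
#endif

// File-scope (static) arrays: the only place pm data may live, since there
// is no pm stack. State goes in one space, coefficients in the other.
static float IIR_DM iirState[2]  = {0.0f, 0.0f};
static float IIR_PM iirCoeffs[5] = {0.0f, 0.0f, 1.0f, 0.5f, 0.25f};

// One coefficient-times-state product: one operand comes from each array.
float coeffTimesState(unsigned coeffIndex, unsigned stateIndex) {
    return iirCoeffs[coeffIndex] * iirState[stateIndex];
}
```

Keeping both arrays at file scope avoids the stack alignment concern raised for dm arrays, at the cost of making the filter state shared by every caller.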
The assembler manual says something similar but different
VDSP C++ extensions
- dm and pm parameters are still passed into functions via J5 and J6, as before.
- Notice the very big difference in the “absolute addresses”, indicating that the data blocks are in very different memory spaces.
- The data memory addresses are also widely different from the instruction memory space.
- This allows an instruction fetch and 2 data fetches at the same time
IIR function using TigerSHARC C++ DSP extensions dm and pm
The code using dm and pm shows up as a little more parallel than the code using only dm
From the TigerSHARC TS201 programming reference manual
Memory block operation will need to be explored in more detail later
Understanding the TigerSHARC Parallel Operations
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
- Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
- Optimized assembly code and optimized C++