Understanding the TigerSHARC ALU pipeline

Determining the speed of one stage of an IIR filter, Part 5: what syntax makes the code more parallel?

Understanding the TigerSHARC Parallel Operations
- TigerSHARC has many pipelines
- Review of how the COMPUTE pipeline works
- Interaction of memory (data) operations with COMPUTE operations
- Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
- Optimized assembly code and optimized C++
4/30/2019 Speed IIR -- stage 5 M. Smith, ECE, University of Calgary, Canada

Processor Architecture
- Three 128-bit data buses
- Two integer ALUs
- Two computational blocks, each with:
    - ALU (float and integer)
    - SHIFTER
    - MULTIPLIER
    - COMMUNICATIONS CLU

Use C++ IIR code as comments
Things to think about prior to code writing:
- Register name reorganization
- Keep XR4 for xInput -- saves a cycle
- Put S1 and S2 into XR0 and XR -- chance to fetch 2 memory values in one cycle using L[ ]
- Put H0 to H5 in XR12 to XR -- chance to fetch 4 memory values in one cycle using Q[ ], followed by one normal fetch
- Problem: if there is more than one IIR stage, then the second-stage fetches are not quad aligned
- There are two sets of multiplications using S1 and S2. Can these be done in the X and Y compute blocks in one cycle?
    float *copyStateStartAddress = state;
    S1 = *state++;
    S2 = *state++;
    *copyStateStartAddress++ = S1;
    *copyStateStartAddress++ = S2;
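To see where the two sets of multiplications using S1 and S2 come from, here is a minimal sketch of one IIR stage in C++. It assumes a direct form II biquad and the coefficient order {b0, b1, b2, a1, a2}; the slides do not state the filter structure, and the function name iirStage is hypothetical.

```cpp
// One IIR stage, written as a direct form II biquad.
// ASSUMPTION: the coefficient order {b0, b1, b2, a1, a2} and the function
// name are illustrative; the slides do not give the filter structure.
float iirStage(float xin, const float* coeffs, float* state) {
    float* copyStateStartAddress = state;   // remember where the state starts
    float S1 = *state++;                    // bring both state values in
    float S2 = *state++;

    // First set of multiplications using S1 and S2 (feedback part)
    float w = xin - coeffs[3] * S1 - coeffs[4] * S2;
    // Second set of multiplications using S1 and S2 (feedforward part)
    float yout = coeffs[0] * w + coeffs[1] * S1 + coeffs[2] * S2;

    // Shift the delay line and send the state values back out
    S2 = S1;
    S1 = w;
    *copyStateStartAddress++ = S1;
    *copyStateStartAddress++ = S2;
    return yout;
}
```

In this sketch the two sets of products do not depend on each other's results, which is what makes splitting them across the X and Y compute blocks worth investigating.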

Register name conversion done in steps
- Setting Xin = XR4 and Yout = XR8 saves one cycle
- Bulk conversion with no error
- So many errors were made during bulk conversion that I went to find / replace / test for each register individually

Fix bringing state variables in
QUESTION: We have
    XR18 = [J6 += 1]   (load S1)
and
    R19 = [J6 += 1]    (load S2)
Both are valid. What is the difference?

That difference – could it be used to our advantage?
XR18 = [J6 += 1];;
    Reads the value at memory location [J6] and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR18.
XYR19 = [J6 += 1];;
    Reads the value at memory location [J6] and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR19 AND YR19.
XYR19 = L[J6 += 2];;
    The following description is conceptually correct, but the instruction executes faster. Reads the value at [J6], updates J6 to J6 + 1, and stores it in XR19, AND reads the value at [(new) J6], updates J6 to J6 + 1, and stores it in YR19, PROVIDED J6 was originally aligned on a 64-bit boundary.
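The three loads can be compared with a toy C++ model of the registers they touch. This is only a simulation of the behavior described on the slide, not TigerSHARC code; the struct, the function names, and the choice to signal misalignment with a false return value are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Toy model of the registers touched by the three loads on the slide.
// ASSUMPTION: names are hypothetical and memory is modeled as 32-bit words.
struct Regs {
    std::uint32_t XR18 = 0, XR19 = 0, YR19 = 0;
    std::size_t J6 = 0;  // word index into memory
};

// XR18 = [J6 += 1];;  one word into the X block only
void loadX(Regs& r, const std::uint32_t* mem) {
    r.XR18 = mem[r.J6];
    r.J6 += 1;
}

// XYR19 = [J6 += 1];;  the same word into both the X and Y blocks
void loadBroadcast(Regs& r, const std::uint32_t* mem) {
    r.XR19 = mem[r.J6];
    r.YR19 = mem[r.J6];
    r.J6 += 1;
}

// XYR19 = L[J6 += 2];;  two consecutive words, one to X and one to Y.
// Returns false without loading if J6 is not on a 64-bit (even word)
// boundary; how real hardware reacts to misalignment is not modeled here.
bool loadLong(Regs& r, const std::uint32_t* mem) {
    if (r.J6 % 2 != 0) return false;
    r.XR19 = mem[r.J6];
    r.YR19 = mem[r.J6 + 1];
    r.J6 += 2;
    return true;
}
```

The model makes the trade-off visible: the long load moves two different values in one operation, but only when the starting index is even.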

Send state variables out Go for the gusto – use L[ ] (64-bit)
Need to recalculate the test result: state[1] is NOT Yout.

Working solution -- Part 1

Working Solution -- Part 2

Working solution – Part 3
- I could not spot where any extra stalls would occur because of memory pipeline reads and writes
- All values were in place when needed
- Need to check with the pipeline viewer

Let's look at DATA MEMORY and COMPUTE pipeline issues -- 1
No problems here

Weird stuff happening with INSTRUCTION pipeline
- Only 9 instructions being fetched, but we are executing 21!
- Why all these instruction stalls?

Analysis
- We are seeing the impact of the processor doing quad-fetches of instructions (128 bits) into the IAB (instruction alignment buffer)
- Once in the IAB, the instructions (32 bits each) are issued to the various execution units as needed
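As a rough illustration of why fetch counts and executed-instruction counts differ, the sketch below counts how many 128-bit quad-fetches are needed to cover a run of 32-bit instruction words. It is an idealized lower bound that assumes every instruction is a single 32-bit word; it does not model the IAB or stalls, and it is not meant to reproduce the figures seen in the pipeline viewer. The function name is hypothetical.

```cpp
// Idealized count of 128-bit quad-fetches needed to cover a run of
// 32-bit instruction words.
// ASSUMPTION: every instruction is one 32-bit word, and startWord is the
// word offset of the first instruction from a 128-bit aligned base.
unsigned quadFetchesNeeded(unsigned startWord, unsigned wordCount) {
    if (wordCount == 0) return 0;
    unsigned firstQuad = startWord / 4;                  // quad holding the first word
    unsigned lastQuad = (startWord + wordCount - 1) / 4; // quad holding the last word
    return lastQuad - firstQuad + 1;
}
```

For example, under these assumptions two instruction words that straddle a 128-bit boundary need two fetches, while four aligned words need only one.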

Before we do any further optimization, we need to understand processor parallelism
- We already know about parallel multiplications and additions and their associated stalls
- What about parallel memory fetches?

Parallel memory fetches
What is permissible? Can we do:
- Parallel fetches into XY at the same time
- Parallel fetches into an X and a Y register
- Parallel fetches into two X registers

Parallel memory syntax – not too difficult
- Only this syntax is illegal
- Will need to do more research to discover whether "legal" means that the operation is performed without stalling the memory pipeline
- NOTE: Need to transfer INPAR3 (J6) into a K-register (K6) in order to be able to use both the J and K data buses during the IIR operation

Question: How do you (in C++) place IIR coefficients in one memory block and state values into another?

Question: How do you (in assembly code) place IIR coefficients in one memory block and state values into another?

The C++ manual talks about 2 data spaces (dm and pm) for STATIC or GLOBAL variables

BAD
- You can use the VDSP C++ extension pm to specify a different memory space
- HOWEVER, there is no such thing as a pm stack, so all pm variables must be declared "static" or "global"
- dm arrays can be placed on the stack, but there may be alignment issues
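A minimal sketch of what such declarations might look like follows. The placement of the pm and dm qualifiers after the type is an assumption based on how these extensions are commonly written, and TARGET_HAS_PM_DM is a hypothetical macro introduced here so the same file also builds with an ordinary C++ compiler, where the keywords are simply defined away. The array sizes and values are illustrative only.

```cpp
// ASSUMPTION: TARGET_HAS_PM_DM is a hypothetical macro, defined by hand when
// building with a compiler that accepts the pm and dm keywords. On any other
// compiler the keywords are defined away so the sketch still builds.
#ifndef TARGET_HAS_PM_DM
#define pm
#define dm
#endif

// File-scope (static) arrays, because there is no pm stack.
static float pm coeffs[5] = {1.0f, 0.0f, 0.0f, 0.0f, 0.0f};  // illustrative values
static float dm state[2] = {0.0f, 0.0f};

// Reads coefficients from one array and state values from the other, so each
// product takes one operand from each memory space.
float weightedState() {
    return coeffs[1] * state[0] + coeffs[2] * state[1];
}
```

Whether the two arrays really end up in different memory blocks depends on the compiler and linker description file, so the resulting addresses should be checked in the debugger.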

The assembler manual says something similar but different

- With the VDSP C++ extensions, dm and pm parameters are still passed into functions via J5 and J6 as before
- Notice the very big difference in the "absolute addresses", indicating that the data blocks are in very different memory spaces
- The data memory addresses are also widely different from the instruction memory space
- Separate blocks make it possible to do an instruction fetch and 2 data fetches at the same time

IIR function using TigerSHARC C++ DSP extensions dm and pm

Using dm and pm shows up a little more parallel than only using dm

From TigerSHARC TS201 programming reference manual

Memory block operation will need to be explored in more detail later

