Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of an IIR filter – Part 1: Getting the code to work
Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall, the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- The answer may be different for floating-point and integer operations
12/2/2018 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada
Register File and COMPUTE Units
Simple Example IIR -- Biquad
For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0
- Not a great bit of IIR code: as written it can’t be used in a loop on an array of values, which is what is really necessary
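A minimal C++ sketch of the loop body above, written so that it can be called on an array of values. Only the names S0, S1, S2, Xin, Yout and H0 to H5 come from the slide; the struct `BiquadStage` and the functions `biquadStep` and `biquadBlock` are hypothetical names introduced for illustration, and the sketch models a single stage rather than the full "Stages = 0 to 3" loop.

```cpp
#include <array>
#include <cstddef>

// State and coefficients for ONE biquad stage (hypothetical container;
// the slide only names S1, S2 and H0..H5).
struct BiquadStage {
    float s1 = 0.0f;            // S1: intermediate value delayed by one sample
    float s2 = 0.0f;            // S2: intermediate value delayed by two samples
    std::array<float, 6> h{};   // H0..H5, indexed as on the slide
};

// One pass through the slide's loop body for a single input sample.
inline float biquadStep(BiquadStage& st, float xin) {
    float s0   = xin * st.h[5] + st.s2 * st.h[3] + st.s1 * st.h[4];
    float yout = s0 * st.h[0] + st.s1 * st.h[1] + st.s2 * st.h[2];
    st.s2 = st.s1;              // S2 = S1
    st.s1 = s0;                 // S1 = S0
    return yout;
}

// Hypothetical wrapper: apply the stage to an array of n samples.
inline void biquadBlock(BiquadStage& st, const float* in, float* out,
                        std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = biquadStep(st, in[i]);
    }
}
```

Keeping S1 and S2 in a struct is one possible way to let the state survive between calls; the assembly version developed in these slides keeps everything in registers instead.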
Set up the tests
- Want to make sure the answer stays correct as the code changes
    #include <EmbeddedUnit/EmbeddedUnit.h>
    #include <EmbeddedUnit/CommonTests.h>
    #include <EmbeddedUnit/EmbeddedTests.h>
Step 1 – Stub plus return value
- Build an assembly language stub for float iirASM(void);
- Make it return a floating point value of 40.5 to show that we can return a value
- J8 is an INTEGER register, so how can we return 40.5?
- ANSWER – WE DON’T. We return the “bit pattern” for 40.5, which is handled in the same way as an “INTEGER” bit pattern
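The "bit pattern" idea can be checked on a host PC with a few lines of C++. This is a sketch that assumes a 32-bit IEEE-754 single-precision float; the helper names `floatBits` and `bitsToFloat` are hypothetical and are not part of the slides or of any TigerSHARC tool.

```cpp
#include <cstdint>
#include <cstring>

// Return the 32-bit integer that has exactly the same bits as the float.
// Assumption: float is a 32-bit IEEE-754 single-precision value.
inline std::uint32_t floatBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);   // copy bits, no conversion
    return bits;
}

// Inverse: treat a 32-bit integer pattern as a float.
inline float bitsToFloat(std::uint32_t bits) {
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof value);  // copy bits, no conversion
    return value;
}
```

Under that assumption, 40.5 is 1.265625 x 2^5, so the sign bit is 0, the biased exponent is 132 and the pattern is 0x42220000. That 32-bit pattern is all that an "integer" register has to hold.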
Code does not work when passing back floats with the J8 register
- We are passing back 40.5 in the normal return register, but that is obviously NOT what the C++ compiler was expecting
- Wrong coding convention
Code does work when using XR8 register – NOTE NOT XFR8
Step 2 – Using C++ code as comments -- set up the coefficients
    XFR0 = 0.0;;    DOES NOT EXIST as a float instruction
    XR0 = 0.0;;     DOES EXIST
- Bit-patterns require integer X registers
- Leave what you wanted to do behind as comments
“ARCHITECTURAL ISSUES” – DON’T NEED SPECIAL FLOAT = CONSTANT INSTRUCTIONS
- Initialize X registers to float values via “integer” operations XR =
- Then use XFR “float” operations
- What I want to do is left behind as comments for the stranger reading my code next week (ME)
Modify C++ code so that it can be translated into assembly code
- Can only have 1 instruction per line
- Code must execute sequentially, so remember the ;;
Start with S0 = Xin instruction
- Can’t use XFR8 = XFR6 to copy a register
Since XFR8 = XFR6 is not allowed
- Try XR8 = R6;
- R6 means move XR6 and YR6 (multiple data move described in 1 instruction)
- Try XR8 = XR6 (integer – bit-pattern – move)
- New TigerSHARC architecture issue: SIMD versus SISD
- SIMD: Single Instruction, Multiple Data
- SISD: Single Instruction, Single Data
- Some operations are FLOAT operations and must have XFR on the left side of the equation BUT only R on the right
- Some operations are SISD operations and must have XR on both sides of the equation (or just R on both sides of the equation, making them SIMD on X and Y, with garbage happening on Y)
- Personally, I think all these problems are “assembler” issues and could be made consistent
What we have learnt
- TigerSHARC has both SISD (single data) and SIMD (multiple data) ability
    XFR4 = R4 * R5;
- The answer (left) is single data, so the SISD choice is taken on the right: read XR4 and XR5 (bit patterns), treat them as floats when doing the multiplication (F on left), and store the bit pattern of the answer in XR4
What we have learnt
- TigerSHARC has both SISD (single data) and SIMD (multiple data) ability
- SISD
    XR4 = XR5;;     Move X part of R5 register into X part of R4 register
    XR4 = YR5;;     Move Y part of R5 register into X part of R4 register
- SIMD
    XYR4 = R5;;     Move X part of R5 register into X part of R4 register and Y part of R5 register into Y part of R4 register
    R4 = R5;;       Short hand version of XYR4 = R5 to confuse you
- Does YXR4 = R5 also exist? If so, it would presumably move the X part of R5 register into the Y part of R4 register and the Y part of R5 register into the X part of R4 register
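The SISD and SIMD moves above can be pictured with a toy C++ model. This is only a sketch of the slide's descriptions, not a simulator: the struct `ComputeBlocks` and the function names are hypothetical, each compute block is assumed to hold 32 registers of 32 bits, and floats are assumed to be IEEE-754 single precision.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model: the X and Y compute blocks each have their own register file.
struct ComputeBlocks {
    std::array<std::uint32_t, 32> x{};   // XR0..XR31 (bit patterns)
    std::array<std::uint32_t, 32> y{};   // YR0..YR31 (bit patterns)
};

// SISD: XRd = XRs;;   only the X block is touched
inline void moveXfromX(ComputeBlocks& cb, int d, int s) {
    assert(d >= 0 && d < 32 && s >= 0 && s < 32);
    cb.x[d] = cb.x[s];
}

// SISD: XRd = YRs;;   Y part of the source goes into X part of the target
inline void moveXfromY(ComputeBlocks& cb, int d, int s) {
    assert(d >= 0 && d < 32 && s >= 0 && s < 32);
    cb.x[d] = cb.y[s];
}

// SIMD: Rd = Rs;; (short hand for XYRd = Rs;;)   both blocks move at once
inline void moveSimd(ComputeBlocks& cb, int d, int s) {
    assert(d >= 0 && d < 32 && s >= 0 && s < 32);
    cb.x[d] = cb.x[s];
    cb.y[d] = cb.y[s];
}

// SISD float: XFRd = Ra * Rb;;   read XRa and XRb as bit patterns, treat
// them as floats for the multiply, store the bit pattern of the answer.
inline void fmulX(ComputeBlocks& cb, int d, int a, int b) {
    assert(d >= 0 && d < 32 && a >= 0 && a < 32 && b >= 0 && b < 32);
    float fa = 0.0f;
    float fb = 0.0f;
    std::memcpy(&fa, &cb.x[a], sizeof fa);
    std::memcpy(&fb, &cb.x[b], sizeof fb);
    const float result = fa * fb;
    std::memcpy(&cb.x[d], &result, sizeof result);
}
```

In this model the Y register file is untouched by every SISD operation, which is the difference the XR versus R notation is expressing.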
Disconnect from target and go to simulator
Activate Simulator
Rebuild the project and set breakpoints at start and end of ASM code
Activate the pipeline viewer
Adjust the pipeline window so that all the instruction pipeline stages can be seen
- Have just located an arrow icon which causes the pipeline window to fill the screen all the way across
PIPELINE STAGES (see page 8-34 of Processor manual)
- 10 pipeline stages, but may be completely desynchronized (happen semi-independently)
- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU – PreDecode, Decode, Integer, Access
- Compute Block – EX1 and EX2
PIPELINE STAGES (see page 8-34 of Processor manual)
- Instruction fetch -- F1, F2, F3 and F4
- Fetch Unit Pipe
- Memory driven, not instruction driven
- 128 bits fetched – may make up 1, 2, 3, or 4 instruction lines (or parts of a couple of instruction lines)
- Instructions are fetched into the IAB (instruction alignment buffer)
PIPELINE STAGES (see page 8-34 of Processor manual)
- Integer ALU pipe – PD, D, I and A
- PreDecode – the next COMPLETE instruction line (1, 2, 3 or 4 instructions) is fetched from the IAB
- Decode – different instructions are dispatched to different execution units (J-IALU, K-IALU, Compute Blocks)
- Data memory access starts in the Integer stage
- A stands for Access stage
- Results are not available until the EX2 stage, but (by register forwarding) can sometimes be accessed earlier
PIPELINE STAGES (see page 8-34 of Processor manual)
- Compute Block: EX1 and EX2
- Result is always written to the target register on the rising edge of CCLK after stage EX2
- The following multiple use of a register (read and store) in one instruction line is guaranteed to pipeline correctly
    R2 = R0 + R1; R6 = R2 * R3;;
- R2 = R0 + R1 stores the new R2 at the end of the instruction line
- R6 = R2 * R3 uses the R2 value from the beginning of the instruction line
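The "read at the beginning, write at the end" rule for one instruction line can be sketched in C++ by reading every source from a snapshot of the registers. This is an illustrative model only, using integers and hypothetical names (`Regs`, `executeLine`, `executeSequential`); it says nothing about the cycle timing of the real pipeline.

```cpp
#include <array>

using Regs = std::array<int, 8>;   // toy register file R0..R7

// Model of ONE instruction line:  R2 = R0 + R1; R6 = R2 * R3;;
// Every source is read from 'before' (values at the start of the line);
// results only appear in the returned copy (values at the end of the line).
inline Regs executeLine(const Regs& before) {
    Regs after = before;
    after[2] = before[0] + before[1];
    after[6] = before[2] * before[3];   // uses the OLD R2
    return after;
}

// Model of TWO instruction lines:  R2 = R0 + R1;;  R6 = R2 * R3;;
// The second line starts after the first has finished, so it sees new R2.
inline Regs executeSequential(Regs r) {
    r[2] = r[0] + r[1];
    r[6] = r[2] * r[3];                 // uses the NEW R2
    return r;
}
```

With R0 = 1, R1 = 2, R2 = 10 and R3 = 4, the single-line version leaves R6 = 40 while the two-line version leaves R6 = 12. That is why the placement of ; and ;; matters when translating the C++ code.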
Only interested in later stages of the pipeline. Adjust properties
Run the code till the first ASM breakpoint
- Note down the cycle number: 39830
- Then run again till the second ASM breakpoint is reached
- Calculate the execution time
- Instruction is in the pipeline for a long time before the simulator stops
Pipeline during code execution
Pipeline viewer says 26 cycles, but what do we expect to get from our code?
- 8 instructions in this part of the code (numbered 1 to 8)
- 8 cycles expected, as we expect 1 instruction per clock cycle
Pipeline viewer says 26 cycles, but we expect 21
- 13 instructions in this part of the code (numbered 1 to 13); again 1 instruction / cycle expected
- 13 cycles expected + 8 from before = 21
- About 20% error in timing (5 extra cycles out of 26). Too much
- Where are the extra cycles coming from?
- How easy is it to code in such a way that the extra cycles can be removed?
- ANSWER: Fairly straightforward to fix in principle, can be difficult in practice
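The cycle bookkeeping on this slide can be written as a small C++ helper for repeating the comparison on later versions of the code. It is a sketch: `TimingReport` and `compareCycles` are hypothetical names, and the only model used is the slide's own expectation of 1 instruction per clock cycle.

```cpp
#include <cassert>

// Result of comparing the pipeline viewer's count with the expected count.
struct TimingReport {
    int expectedCycles;         // instructions at 1 instruction per cycle
    int extraCycles;            // measured minus expected
    double extraOfMeasuredPct;  // extra cycles as % of the measured count
    double extraOverExpectedPct;// extra cycles as % of the expected count
};

// measuredCycles: value read from the pipeline viewer
// firstPart, secondPart: instruction counts of the two code sections
inline TimingReport compareCycles(int measuredCycles, int firstPart,
                                  int secondPart) {
    assert(measuredCycles > 0 && firstPart + secondPart > 0);
    TimingReport r{};
    r.expectedCycles = firstPart + secondPart;
    r.extraCycles = measuredCycles - r.expectedCycles;
    r.extraOfMeasuredPct = 100.0 * r.extraCycles / measuredCycles;
    r.extraOverExpectedPct = 100.0 * r.extraCycles / r.expectedCycles;
    return r;
}
```

For the numbers on the slide, `compareCycles(26, 8, 13)` gives 21 expected cycles and 5 extra cycles. That is roughly 19% of the measured 26 cycles, or roughly 24% more than the expected 21, depending on which count is taken as the reference.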