Published by Lesley Watts. Modified over 9 years ago.
1
Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter, Part 2
Understanding the pipeline
2
Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada 2 / 32 11/2/2015
Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down.
- Need to understand how the ALU pipeline works:
  - Learn to use the pipeline viewer
  - Understand in detail what the pipeline viewer tells you
  - Avoid having to use the pipeline viewer
- Improving code efficiency: Excel and Project (Gantt charts) are useful tools
3
Register File and COMPUTE Units
4
Simple Example IIR: Biquad
    For (Stages = 0 to 3) Do
        S0 = Xin * H5 + S2 * H3 + S1 * H4
        Yout = S0 * H0 + S1 * H1 + S2 * H2
        S2 = S1
        S1 = S0
This is a horrible IIR code example, because it can't be re-used in a loop. It works as a simple example for understanding the TigerSHARC pipeline.
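For reference, the biquad recurrence above can be written as a short C++ function that could serve as a known-good result when testing assembly versions. This is a minimal sketch, not the course's own code. The BiquadStage struct and the iirCascade name are hypothetical, and feeding each stage's Yout into the next stage's Xin is an assumption.

```cpp
#include <array>

// Hypothetical per-stage storage: coefficients H0..H5 and the two state values.
struct BiquadStage {
    float H[6] = {};   // H[0]..H[5] correspond to H0..H5 on the slide
    float S1 = 0.0f;   // previous S0
    float S2 = 0.0f;   // S0 from two samples ago
};

// Process one input sample through four cascaded stages (Stages = 0 to 3).
// Assumption: the Yout of one stage is the Xin of the next stage.
float iirCascade(float xin, std::array<BiquadStage, 4>& stages) {
    for (BiquadStage& st : stages) {
        const float S0   = xin * st.H[5] + st.S2 * st.H[3] + st.S1 * st.H[4];
        const float yout = S0 * st.H[0] + st.S1 * st.H[1] + st.S2 * st.H[2];
        st.S2 = st.S1;   // S2 = S1
        st.S1 = S0;      // S1 = S0
        xin = yout;      // feed the next stage
    }
    return xin;
}
```

If H0 = H5 = 1 and every other coefficient is 0, each stage passes its input straight through. That makes a quick sanity check before you try real coefficients.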
5
Code returns the float result in register XR8 (NOTE: XR8, NOT XFR8)
6
Step 2: Using the C++ code as comments, set up the coefficients
    XFR0 = 0.0;;    // Does not exist
    XR0 = 0.0;;     // DOES EXIST
Loading a constant is a bit-pattern transfer, and bit-patterns require the integer register names (XR0, not XFR0).
Leave what you wanted to do behind as comments.
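Why are these bit-patterns? Loading a float constant such as 1.0 or 5.5 only places its 32-bit IEEE-754 pattern in a register, and no floating-point arithmetic is involved. The hypothetical helper floatBits below is not part of any TigerSHARC tool. It shows the patterns involved, assuming a 32-bit IEEE-754 float on the host machine.

```cpp
#include <cstdint>
#include <cstring>

// Return the raw IEEE-754 single-precision bit pattern of a float.
std::uint32_t floatBits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the bytes, no numeric conversion
    return bits;
}
```

For example, floatBits(1.0f) is 0x3F800000 and floatBits(5.5f) is 0x40B00000. These are the integer patterns an XR register holds after XR0 = 1.0;; or XR6 = 5.5;;.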
7
Expect the code to take 8 cycles to execute
8
PIPELINE STAGES (see page 8-34 of the Processor manual)
10 pipeline stages, but they may be completely desynchronized (they happen semi-independently):
- Instruction fetch: F1, F2, F3 and F4
- Integer ALU: PreDecode (PD), Decode, Integer, Access
- Compute block: EX1 and EX2
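The stage order can be captured as data to make the timing arithmetic explicit. In this small sketch, "D", "I" and "A" are assumed short labels for Decode, Integer and Access. The helper cyclesAfterPD is hypothetical: it gives the offset from PD for an instruction that does not stall. That offset puts EX2 five cycles after PD, which matches the unstalled pipeline viewer measurements (for example, PD at 39833 and EX2 at 39838).

```cpp
#include <array>
#include <cstddef>
#include <string>

// Stage labels in pipeline order ("D", "I", "A" are assumed abbreviations).
constexpr std::array<const char*, 10> kStages = {
    "F1", "F2", "F3", "F4",   // instruction fetch
    "PD", "D", "I", "A",      // integer ALU: PreDecode, Decode, Integer, Access
    "EX1", "EX2"              // compute block
};

// Cycles after entering PD at which an unstalled instruction enters `stage`.
// Returns -1 for unknown labels and for fetch stages, which come before PD.
int cyclesAfterPD(const std::string& stage) {
    const std::size_t pd = 4;  // index of "PD" in kStages
    for (std::size_t i = 0; i < kStages.size(); ++i) {
        if (stage == kStages[i]) {
            return i >= pd ? static_cast<int>(i - pd) : -1;
        }
    }
    return -1;
}
```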
9
Pipeline Viewer Result
XR0 = 1.0 enters the PD stage at cycle 39825, enters the EX2 stage at cycle 39830, and is stored into XR0 at cycle 39831: 7 cycles execution time.
10
Pipeline Viewer Result
XR6 = 5.5 enters the PD stage at cycle 39832, enters the EX2 stage at cycle 39837, and is stored into XR6 at cycle 39838: 7 cycles execution time.
Each instruction takes 7 cycles, but there is one new result each cycle.
Result: ONCE the pipeline is filled, 8 cycles = 8 register transfer operations.
Key: don't break the pipeline with any jumps.
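The difference between "7 cycles per instruction" and "one new result each cycle" is the usual latency versus throughput distinction. Here is a minimal sketch under an idealized model (an assumption): one instruction enters PD each cycle, with no stalls and no jumps. The function name pipelineCycles is hypothetical.

```cpp
// Idealized model (assumption): one instruction enters PD per cycle, nothing stalls,
// and each instruction needs `latency` cycles from entering PD to its result being stored.
// Returns the cycles from the first instruction entering PD to the last result being stored.
long long pipelineCycles(long long n, long long latency) {
    if (n <= 0) return 0;
    return latency + (n - 1);  // fill the pipeline once, then one result per cycle
}
```

With the measured 7-cycle latency, 8 register loads span 14 cycles from start to finish, but each extra instruction adds only 1 cycle. That is the sense in which, once the pipeline is filled, 8 cycles give 8 register transfers.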
11
Doing filter operations gives different pipeline results:
- XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored at 39839: 7 cycles
- XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored at 39840: 7 cycles
- XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored at 39842: 8 cycles
WHY? FIND OUT WITH A MOUSE CLICK ON THE S (stall) MARKER, THEN CONTROL
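The cycle counts above are inclusive. They run from the cycle an instruction enters PD to the cycle its result is stored, so 39839 - 39833 + 1 = 7. This reading is inferred from the numbers rather than stated on the slide. The hypothetical helpers below make the arithmetic explicit and flag the extra cycle on XFR0 = R0 + R23.

```cpp
// Execution time counted inclusively from entering PD to the result being stored
// (inferred from the pipeline viewer numbers on the slide).
constexpr int kUnstalledCycles = 7;

int executionCycles(int pdCycle, int storedCycle) {
    return storedCycle - pdCycle + 1;
}

// Extra cycles compared with an instruction that flows through without stalling.
int stallCycles(int pdCycle, int storedCycle) {
    return executionCycles(pdCycle, storedCycle) - kUnstalledCycles;
}
```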
12
Instruction 0x17e, XFR8 = R8 + R23, is STALLED (waiting) for instruction 0x17d, XFR23 = R8 * R4, to complete.
A bubble (B) means that the pipeline is doing "nothing": the instruction shown in that slot is only a placeholder (garbage).
13
Information on Window Event Icons
14
Result of Analysis
You can't use a float result immediately after it is calculated. Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;    // MUST WAIT FOR THE XFR23 calculation to be completed
is the same as coding
    XFR23 = R8 * R4;;
    NOP;;                // Note the DOUBLE ;; -- extra cycle because of the stall
    XFR8 = R8 + R23;;
Proof: write the code with the stalls shown in it.
Writing code this way means we don't have to use the pipeline viewer all the time. The pipeline viewer is only available with the (slow) simulator.
    #define SHOW_ALU_STALL nop
15
Code with stalls shown
8 code lines + 5 expected stalls: expect 13 cycles to complete if the theory is correct.
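The "lines + stalls" counting rule can be mechanized. This is a minimal sketch under a simplified model (an assumption that captures only the one hazard described above): one line per cycle, and a float compute result read by the very next line costs one stall cycle, as if a NOP;; had been inserted. Other hazards are ignored. The Instr struct, issueCycles and totalCycles are hypothetical names, and the four-line test code is made up, not taken from the slides.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction record for the stall model.
struct Instr {
    std::string dest;               // register written, e.g. "R23" for XFR23
    std::vector<std::string> srcs;  // registers read
    bool isFloatCompute;            // true for float multiply/add results (XFRn = ...)
};

// Cycle in which each line effectively executes under the simplified model.
std::vector<int> issueCycles(const std::vector<Instr>& code) {
    std::vector<int> cycle(code.size(), 0);
    for (std::size_t i = 0; i < code.size(); ++i) {
        int t = (i == 0) ? 0 : cycle[i - 1] + 1;
        for (const std::string& src : code[i].srcs) {
            // Find the most recent earlier line that wrote this source register.
            for (std::size_t j = i; j-- > 0;) {
                if (code[j].dest == src) {
                    // A float result is readable 2 cycles after it is produced.
                    if (code[j].isFloatCompute && t < cycle[j] + 2) t = cycle[j] + 2;
                    break;
                }
            }
        }
        cycle[i] = t;
    }
    return cycle;
}

// Total cycles = number of lines + number of stall cycles.
int totalCycles(const std::vector<Instr>& code) {
    return code.empty() ? 0 : issueCycles(code).back() + 1;
}
```

In the test example, moving a second, independent multiplication up into the first stall slot takes the toy sequence from 6 cycles to 4. This is the kind of re-ordering the rest of the deck carries out by hand.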
16
The analysis approach IS correct: same speed with and without the NOPs.
17
Process for coding for improved speed: code re-organization
- Make a copy of the code so you can test iirASM( ) and iirASM_Optimized( ) and make sure both give the correct result
- Make a table of the code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all "temp operations" use different registers
- Move instructions "forward" to fill delay slots, BUT don't break data dependencies
18
Copy and paste to make IIRASM_Optimized( )
19
Need to re-order instructions to fill delay slots with useful instructions.
After refactoring the code to fill delay slots, you must run tests to ensure that it still produces the correct result: change, then "retest".
NOT EASY TO DO. MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION. I USE EXCEL.
20
Show resource usage and data dependencies.
All temporary register usage involves the SAME XFR23 register. This typically stalls the processor, and it also blocks moving instructions into delay slots, because each reuse of XFR23 must stay in order with the previous one.
21
Change all temporary registers to use different register names, then check that the code still produces the correct answer.
All temporary register usage now involves a DIFFERENT register.
ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING.
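Renaming helps because reusing one temporary register adds dependencies that have nothing to do with the data flow. A "write after read" (WAR) or "write after write" (WAW) on XFR23 pins instructions in place just as firmly as a true "read after write" (RAW). Here is a hypothetical classifier sketch using those standard hazard names. The register names in the test are illustrative only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical instruction record: one destination register and its source registers.
struct Instr {
    std::string dest;
    std::vector<std::string> srcs;
};

bool reads(const Instr& in, const std::string& reg) {
    return std::find(in.srcs.begin(), in.srcs.end(), reg) != in.srcs.end();
}

// Label of the dependency that forces `later` to stay after `earlier`,
// or an empty string if the two lines could be swapped.
std::string dependency(const Instr& earlier, const Instr& later) {
    if (reads(later, earlier.dest)) return "RAW";  // true data dependency
    if (reads(earlier, later.dest)) return "WAR";  // later would overwrite a value still needed
    if (earlier.dest == later.dest) return "WAW";  // both write the same register
    return "";
}
```

In the test, the second multiplication cannot move above the addition while it also writes R23 (WAR). It becomes free to move once it writes R24 instead.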
22
Move instructions forward without breaking data dependencies.
What appears possible!
Do ONE thing at a time, then check that the code still works.
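A proposed move can be checked mechanically before re-running the tests. A re-ordering is legal when every pair of lines linked by a RAW, WAR or WAW dependency keeps its original relative order. This is a minimal sketch with hypothetical names. It does not replace running iirASM_Optimized( ) against the known-good result.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction record: one destination register and its source registers.
struct Instr {
    std::string dest;
    std::vector<std::string> srcs;
};

static bool reads(const Instr& in, const std::string& reg) {
    return std::find(in.srcs.begin(), in.srcs.end(), reg) != in.srcs.end();
}

// True if a RAW, WAR or WAW dependency forces `later` to stay after `earlier`.
static bool mustStayOrdered(const Instr& earlier, const Instr& later) {
    return reads(later, earlier.dest)     // RAW
        || reads(earlier, later.dest)     // WAR
        || earlier.dest == later.dest;    // WAW
}

// order[p] is the original index of the line placed at position p after the move.
bool isLegalReorder(const std::vector<Instr>& code, const std::vector<std::size_t>& order) {
    if (order.size() != code.size()) return false;
    std::vector<std::size_t> sorted(order);
    std::sort(sorted.begin(), sorted.end());
    for (std::size_t k = 0; k < sorted.size(); ++k)
        if (sorted[k] != k) return false;  // not a permutation of the original lines

    std::vector<std::size_t> pos(code.size());
    for (std::size_t p = 0; p < order.size(); ++p) pos[order[p]] = p;

    for (std::size_t i = 0; i < code.size(); ++i)
        for (std::size_t j = i + 1; j < code.size(); ++j)
            if (mustStayOrdered(code[i], code[j]) && pos[i] > pos[j]) return false;
    return true;
}
```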
23
Check that the code still operates: 1 cycle saved.
Have put "our" marker stall instruction in parallel with the moved instruction, using ; rather than ;;
Move this instruction up in the code sequence to fill the delay slot.
Check that the code still runs after this optimization stage.
24
Move the next multiplication up. NOTE: certain stalls remain, although the reason for each STALL is now different from the reason it was originally inserted.
25
Move up the R10 and R9 assignment operations. Check: a 4 cycle improvement?
26
CHECK THE PIPELINE AFTER TESTING
27
Are there still more improvements possible? (I can see 4 more moves.)
28
Problems with the approach:
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this "automatically"
I started the following design tool as something that might work, but it actually turned out to be very useful.
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project" ("Don't say 'CAN'T do it', say 'Gantt it'! The irony of organizing microprocessors with a big business tool"), Circuit Cellar magazine, Vol. 184, pp. 26-35, November 2005.
29
Using Microsoft Project: Step 1
30
Add dependencies and resource usage, then activate resource leveling
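The same "dependencies plus one shared resource" idea can be automated in a few lines of code. The sketch below uses greedy list scheduling, a standard compiler technique swapped in here for illustration. It is not what Microsoft Project does internally. It assumes a simplified model with one line issued per cycle, float results readable two cycles after they are produced, and WAR/WAW dependencies that only require the earlier line to have issued. Every name in it is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Instr {
    std::string dest;               // register written
    std::vector<std::string> srcs;  // registers read
    bool isFloatCompute;            // float multiply/add result
};

struct Schedule {
    std::vector<std::size_t> order;  // original line indices in issue order
    int cycles = 0;                  // total cycles, stalls included
};

static bool reads(const Instr& in, const std::string& reg) {
    return std::find(in.srcs.begin(), in.srcs.end(), reg) != in.srcs.end();
}

// Each cycle, issue the earliest line (in program order) whose dependencies are met;
// if no line is ready, the cycle is a stall.
Schedule listSchedule(const std::vector<Instr>& code) {
    const std::size_t n = code.size();
    std::vector<int> issue(n, -1);
    Schedule s;
    int t = 0;
    while (s.order.size() < n) {
        for (std::size_t j = 0; j < n; ++j) {
            if (issue[j] >= 0) continue;
            bool ready = true;
            for (std::size_t i = 0; i < j && ready; ++i) {
                const bool raw = reads(code[j], code[i].dest);
                const bool ordered = reads(code[i], code[j].dest) || code[i].dest == code[j].dest;
                if (!raw && !ordered) continue;
                if (issue[i] < 0) ready = false;  // a predecessor has not issued yet
                else if (raw && code[i].isFloatCompute && t < issue[i] + 2) ready = false;
            }
            if (ready) {
                issue[j] = t;
                s.order.push_back(j);
                break;  // one line per cycle
            }
        }
        ++t;  // the cycle either issued a line or was a stall
    }
    s.cycles = t;
    return s;
}
```

On a made-up four-line example, the scheduler finds the 4-cycle order when the temporaries are renamed. When both multiplications reuse R23, it stays at 6 cycles, which illustrates why renaming comes first in the process.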
31
Microsoft Project as a microprocessor design tool
We will look at this in more detail when we start using memory operations to fill the coefficient and state arrays.
32
Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down.
- Need to understand how the ALU pipeline works:
  - Learn to use the pipeline viewer
  - Understand in detail what the pipeline viewer tells you
  - Avoid having to use the pipeline viewer
- Improving code efficiency: Excel and Project (Gantt charts) are useful tools