Detailed look at the TigerSHARC pipeline
Cycle counting for COMPUTE block versions of the DC_Removal algorithm

DC_Removal algorithm performance

To be tackled today
Expected and actual cycle count for the Compute Block version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Understanding some operations “first time into function”: cache issues?

Set up time
In principle: 1 cycle per instruction
… instructions

First key element: Sum Loop, Order (N)
Second key element: Shift Loop, Order (log2 N)
Sum Loop: 4 instructions + N * 5 instructions
Shift Loop: … * log2 N instructions

Third key element: FIFO circular buffer, Order (N)
FIFO update: … * N instructions
Function return: 2 instructions
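
The three key elements can be sketched in a few lines of Python to make the Order (N) and Order (log2 N) costs visible. This is only a minimal sketch: it assumes the DC level is estimated as the average of the last N samples, that N is a power of two, and that the FIFO is updated by moving every element down one place. The function name `dc_removal` and its arguments are hypothetical and do not come from the slides.

```python
def dc_removal(sample, fifo):
    """Return `sample` with the running average of `fifo` removed.

    Assumptions (not from the slides): `fifo` is a list whose length N is a
    power of two, and it is modified in place.
    """
    n = len(fifo)
    fifo[n - 1] = sample              # insert the newest value into the buffer

    total = 0
    for value in fifo:                # SUM LOOP: one pass over N items, Order (N)
        total += value

    for _ in range(n.bit_length() - 1):
        total >>= 1                   # SHIFT LOOP: log2 N shifts divide by N

    output = sample - total           # remove the estimated DC level

    for i in range(n - 1):            # FIFO update: move N - 1 items, Order (N)
        fifo[i] = fifo[i + 1]
    return output


history = [4] * 8                     # a constant (pure DC) signal
cleaned = dc_removal(4, history)      # the DC level is removed completely
```

The sum loop and the FIFO update both touch every element, which is why they dominate the instruction count once N is large; the shift loop only runs log2 N times.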

TigerSHARC pipeline

Time in theory
Set up pointers to buffers
Insert values into buffers
SUM LOOP: … + N * …
SHIFT LOOP: … * log2 N
Update outgoing parameters
Update FIFO: … * N
Function return
Total: … N + 2 log2 N
N = 128: instructions = … cycles + … delay cycles
C++ debug mode: 9500 cycles?
Note: other tests were executed before this test, which means the cache was already filled.

Set up time
Expected: … instructions
Actual: … instructions + 2 stalls
Why not 4 stalls?

First time round sum loop
Expected: 9 instructions
LC0 load: 3 stalls
Each memory fetch: 4 stalls
Actual: … stalls

Other times around the loop
Expected: 5 instructions
Each memory fetch: 4 stalls
Actual: … stalls

Shift Loop: 1st time around
Expected: 3 instructions
No stalls on LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts

Time in theory / practice

Stage                          Instructions        Stalls
Entry into subroutine                              10 stalls?
Set up pointers to buffers     2                   0 stalls
Insert values into buffers     4                   2 stalls
SUM LOOP                       4 + N * 5           N * 8 = 1024 stalls
SHIFT LOOP                     … * log2 N          9 stalls
Update outgoing parameters     6                   3 stalls
Update FIFO                    … * N               3 stalls
Function return                2                   --
Exit from subroutine                               10 stalls?
Total                          … N + 2 log2 N      1061 stalls

N = 128: instructions = … cycles + … stalls = 2505 cycles
In practice: 2507 cycles
C++ debug mode: 9500 cycles?
Note: other tests were executed before this test, which means the cache was already filled.
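
As a cross-check on the stall column, the per-stage figures quoted on this slide can simply be added up. The sketch below uses only numbers that appear on the slide (the per-stage stalls, N = 128, and the 2505 cycle prediction); the dictionary layout and variable names are illustrative. The instruction count is not taken from the slide but inferred as predicted cycles minus stalls.

```python
N = 128

# Stall counts per stage, as listed on the slide.
stalls = {
    "entry into subroutine": 10,
    "set up pointers to buffers": 0,
    "insert values into buffers": 2,
    "sum loop": N * 8,                 # 8 stalls per pass round the loop
    "shift loop": 9,
    "update outgoing parameters": 3,
    "update FIFO": 3,
    "exit from subroutine": 10,
}

total_stalls = sum(stalls.values())

predicted_cycles = 2505                # prediction quoted on the slide
measured_cycles = 2507                 # "in practice" figure quoted on the slide

# Inferred, not quoted: cycles left over once the stalls are subtracted.
implied_instruction_cycles = predicted_cycles - total_stalls

sum_loop_share = stalls["sum loop"] / total_stalls
```

The sum loop accounts for 1024 of the 1061 stalls, so it is the only place where removing stalls can make a real difference.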

Final sum code: using XR registers

Time in practice

Stage                          Instructions                   Stalls
Entry into subroutine                                         10 stalls
Set up pointers to buffers     2                              0 stalls
Insert values into buffers     4                              2 stalls
SUM LOOP                       4 + N * 5                      … (was 1024 stalls)
SHIFT LOOP                     1 (was … * log2 N)             9 stalls
Update outgoing parameters     6                              3 stalls
Update FIFO                    … * N                          3 stalls
Function return                2
Exit from subroutine                                          10 stalls
Total                          … N (was … N + 2 log2 N)

N = 128: instructions = … cycles + … delay cycles = 1709 cycles
Was 2,504 cycles with JALU: 1444 cycles + … delay cycles
Predicted stalls with X-compute block = 249 stalls, close enough to 256 = N * 2, or one stall for each memory access
Improved more than expected, as the code accidentally makes better use of the available resources

Second time into function, first time around the loop
2 stalls per loop iteration, as predicted

2nd time into function, 9th time around the loop
Stalls as expected
Note: sets of 5 quad instructions appear to be fetched in

Interpretation

Currently
    XR2 = [J0 + J8];;
    XR6 = R6 + R2;;      // Must wait 1 cycle for XR2 to be brought in
    XR3 = [J1 + J8];;
    XR7 = R7 + R3;;      // Must wait 1 cycle for XR3?

Next improvement?
    XR2 = [J0 + J8];;
    XR3 = [J1 + J8];;
    XR6 = R6 + R2;;      // XR2 and XR3 are now ready when we want to use them?
    XR7 = R7 + R3;;      // or do we get DATA / DATA clash along J-bus?
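
The reasoning behind the proposed reordering can be tried out on a toy in-order issue model. This is a rough sketch, not a TigerSHARC simulator: it assumes that a loaded register becomes usable one cycle later than an ALU result (the "must wait 1 cycle" on the slide), it ignores any DATA / DATA clash on the J-bus, and the `count_stalls` helper and its tuple format are hypothetical.

```python
def count_stalls(program, load_latency=1):
    """Count stall cycles for an in-order, one-instruction-per-cycle model.

    `program` is a list of (kind, dest, sources) tuples. A register written
    by a "load" is usable `load_latency` cycles later than one written by an
    ALU operation (assumed value, not a measured one).
    """
    ready = {}                          # register -> first cycle it can be read
    cycle = 0
    stalls = 0
    for kind, dest, sources in program:
        earliest = max([ready.get(reg, 0) for reg in sources], default=0)
        if earliest > cycle:            # operand not ready yet: stall
            stalls += earliest - cycle
            cycle = earliest
        extra = load_latency if kind == "load" else 0
        ready[dest] = cycle + 1 + extra
        cycle += 1
    return stalls


# Current ordering: each add immediately follows the load it depends on.
current = [
    ("load", "XR2", ()),
    ("add", "XR6", ("XR6", "XR2")),
    ("load", "XR3", ()),
    ("add", "XR7", ("XR7", "XR3")),
]

# Proposed ordering: both loads first, then both adds.
reordered = [
    ("load", "XR2", ()),
    ("load", "XR3", ()),
    ("add", "XR6", ("XR6", "XR2")),
    ("add", "XR7", ("XR7", "XR3")),
]

stalls_current = count_stalls(current)
stalls_reordered = count_stalls(reordered)
```

Under these assumptions the current ordering costs 2 stalls per pass and the reordered one costs none. Whether the real hardware behaves this way depends on the bus question raised on the slide, which this model leaves out.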

Pipeline: “intermingled” left and right filter operation

Time in practice
Same stage-by-stage breakdown as for the X-compute block version: 1709 cycles (was 2,504 cycles with JALU)
Predicted stalls with X-compute block = 249 stalls, close enough to 256 = N * 2, or one stall for each memory access
Intermingled code: around 1430 cycles + 30 stalls

1st time into function, 1st time round the loop

1st time into function, 2nd, 3rd, … time round the loop

9th, 17th, etc. time into the loop

From TigerSHARC p9-11
Reading in 8 words at a time from “memory” into “cache” MIGHT explain the behaviour
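
If data really is brought in 8 words at a time, the loop passes that trigger a fresh fetch should be evenly spaced. The short sketch below lists them, assuming one 32-bit item is read per pass, that the buffer starts on an 8-word boundary, and that access is strictly sequential. The function name is hypothetical.

```python
WORDS_PER_FETCH = 8                    # figure quoted on the slide


def fetch_iterations(n_items, words_per_fetch=WORDS_PER_FETCH):
    """Return the 1-based loop passes whose access starts a new block."""
    return [i + 1 for i in range(n_items) if i % words_per_fetch == 0]


fetches = fetch_iterations(128)
first_few = fetches[:4]
```

Under these assumptions the extra delay lands on passes 1, 9, 17, 25 and so on, which is the same spacing as the "9th, 17th, etc." pattern observed in the pipeline viewer.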

Again, talking about “8” data values

Read buffer

Implications: read buffer
Prefetch buffer: 4 pages
Each page: … bit words = 64 items
Buffer = 256 items, exactly enough to handle 128 left and 128 right
Does that imply that speed does not scale up, so that 256-point arrays are more than 2x as slow as 128-point arrays?
It may make sense to process all of left and then all of right?

Implications: cache
4-way associative cache
128 cache sets
Each cache set has four cache ways
Each cache way: 8 32-bit words
That’s … bit words
Things break down when the left / right arrays are of size 512; or else do all of left and then all of right, in which case things change at 1024
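
The capacity arguments on the read buffer and cache slides come down to a few multiplications, sketched below. The page, set and way counts are the ones quoted on the slides; treating one array item as one 32-bit word and the helper `fits_in_prefetch` are assumptions made for illustration.

```python
# Prefetch buffer figures quoted on the slide.
PREFETCH_PAGES = 4
ITEMS_PER_PAGE = 64
prefetch_items = PREFETCH_PAGES * ITEMS_PER_PAGE

# Cache figures quoted on the slide.
CACHE_SETS = 128
WAYS_PER_SET = 4
WORDS_PER_WAY = 8                      # 32-bit words
cache_words = CACHE_SETS * WAYS_PER_SET * WORDS_PER_WAY


def fits_in_prefetch(points_per_channel, channels=2):
    """True if all channels fit in the prefetch buffer at the same time."""
    return points_per_channel * channels <= prefetch_items


both_128_fit = fits_in_prefetch(128)           # 128 left + 128 right
both_256_fit = fits_in_prefetch(256)           # 256 left + 256 right
one_256_fits = fits_in_prefetch(256, channels=1)
```

With these figures, 128-point left and right arrays exactly fill the prefetch buffer, while 256-point arrays only fit one channel at a time. That is consistent with the suggestion to process all of left and then all of right.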