Detailed look at the TigerSHARC pipeline: cycle counting for the IALU version of the DC_Removal algorithm
DC_Removal algorithm performance 2 / 28
To be tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function
Set up time
In principle 1 cycle / instruction
instructions
First key element – Sum Loop – Order (N)
Second key element – Shift Loop – Order (log2 N)
4 instructions
N * 5 instructions
* log2 N
Third key element – FIFO circular buffer – Order (N)
* N
2
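The three key elements can be pictured with a minimal C++ sketch of one channel. This is not the course code: the function name dc_removal_channel, the shift-down FIFO update, and the one-bit-per-pass shift loop are assumptions chosen only to match the stated orders, Order (N), Order (log2 N), and Order (N).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical single-channel DC removal (the slides handle left and right).
// n must be a power of two and log2_n its base-2 logarithm.
std::int32_t dc_removal_channel(std::int32_t sample, std::int32_t* fifo,
                                std::size_t n, unsigned log2_n) {
    assert(n == (std::size_t{1} << log2_n));

    // Insert the new value into the last slot of the buffer.
    fifo[n - 1] = sample;

    // First key element: sum loop, Order (N).
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += fifo[i];
    }

    // Second key element: shift loop, Order (log2 N).
    // Divides by N one bit per pass; >> on a negative value is an
    // arithmetic shift on common compilers (guaranteed from C++20).
    std::int32_t average = sum;
    for (unsigned i = 0; i < log2_n; ++i) {
        average >>= 1;
    }

    // Outgoing value: the sample with the average (DC level) removed.
    const std::int32_t result = sample - average;

    // Third key element: FIFO update, Order (N).
    // Every value moves down one slot, dropping the oldest.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        fifo[i] = fifo[i + 1];
    }
    return result;
}
```

With an all-zero 128-entry buffer, a first sample of 128 gives a sum of 128, an average of 1, and a result of 127.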
TigerSHARC pipeline
Using the “Pipeline Viewer”
Available with the TigerSHARC simulator ONLY
VIEW | Debug Windows | Pipeline viewer
F1 to F4 – instruction fetch unit pipeline
PD, D, I – Integer ALU pipeline
A, EX1, EX2 – Compute Block pipeline
Pipeline symbols
Control-click
A – Abort
B – Bubble
H – BTB Hit (Jumps)
S – Stall
W – Wait
X – Illegal fetch (F1 – F4)
X – Illegal instruction (PD – EX2)
Time in theory
Set up pointers to buffers
Insert values into buffers
SUM LOOP
SHIFT LOOP
Update outgoing parameters
Update FIFO
Function return
N * * log2 N * N N + 2 log2 N
N = 128 – instructions = cycles
delay cycles
C++ debug mode – 9500 cycles???????
Note: other tests were executed before this test, which means “cache filled”
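The loop terms of the theoretical count can be checked with a short sketch. It assumes the reading that the sum loop costs 4 set-up instructions plus 5 per pass and the shift loop body costs 2 per pass. The set-up, FIFO, and return counts are not recoverable from the slide, so they are left out, and the function names are hypothetical.

```cpp
#include <cassert>

// Base-2 logarithm of a power of two.
unsigned log2_exact(unsigned n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

// Sum loop: 4 set-up instructions + 5 instructions per pass (assumed reading).
unsigned sum_loop_instructions(unsigned n) {
    return 4 + 5 * n;
}

// Shift loop body only: 2 instructions per pass, log2 N passes.
unsigned shift_loop_body_instructions(unsigned n) {
    return 2 * log2_exact(n);
}
```

For N = 128 these give 644 instructions for the sum loop and 14 for the shift loop body, which at 1 cycle / instruction is far below the 9500 cycles seen in C++ debug mode.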
Test environment
Examine the pipeline the 2nd time around the loop
“Caches filled”?
Set up time
Expected instructions
Actual instructions + 2 stalls
Why not 4 stalls?
First time round sum loop
Expected 9 instructions
LC0 load – 3 stalls
Each memory fetch – 4 stalls
Actual stalls
Other times around the loop
Expected 5 instructions
Each memory fetch – 4 stalls
Actual stalls
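A rough estimate of the sum loop cost with stalls can be sketched from the figures above: 9 instructions the first time round, 5 on later passes, 3 stalls for the LC0 load, and 4 stalls per memory fetch. The number of fetches per pass is not stated on the slides, so it is a parameter here (an assumption), and the function name is hypothetical.

```cpp
#include <cassert>

// Estimated sum-loop cycles including stalls.
//   4 set-up + 5 per pass instructions, 1 cycle each in principle
//   3 stalls for the LC0 load (once)
//   4 stalls per memory fetch
// fetches_per_pass is an ASSUMPTION supplied by the caller.
unsigned sum_loop_cycles_with_stalls(unsigned n, unsigned fetches_per_pass) {
    assert(n > 0);
    const unsigned instructions = 4 + 5 * n;
    const unsigned lc0_stalls = 3;
    const unsigned fetch_stalls = 4 * fetches_per_pass * n;
    return instructions + lc0_stalls + fetch_stalls;
}
```

With N = 128 and one fetch per pass this gives 1159 cycles. The pipeline viewer may show something different, just as the set-up code showed 2 stalls where 4 might have been expected.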
Shift Loop – 1st time around
Expected 3 instructions
No stalls on LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts
Shift loop – 2nd and later times around
Expect 2
Get 2
Store back of &left, &right
Expect 6
Actual stalls
Exercise 1
Based on knowledge to this point, determine the expected stalls during the last piece of code – FIFO buffer operation
Third key element – FIFO circular buffer – Order (N)
* N
2
Answer
Second time into function
What happens if the cache is not full (first time the function is called)?
Was stalls in loop
Now stalls in loop
First time function called
2nd time around the loop
Ditto 3rd, 4th, 5th, 6th, 7th, 8th times
9th time around the loop
Ditto 17th, 25th, 33rd, 41st, 49th
What is happening?
With cache filled – memory read accesses require 4 cycles
Unfilled – first one requires “12 cycles”
Then next 7 require 4 cycles
Total guess – is the extra time associated with doing extra reads to fill the cache?
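The observed pattern can be written down as a small cost model. It only encodes the figures above (4 cycles per read with the cache filled; with it unfilled, 12 cycles for the 1st, 9th, 17th, ... read and 4 for the rest). The group size of 8 and the function name read_cycles are assumptions taken from that pattern, not a description of how the TigerSHARC memory system really works.

```cpp
#include <cassert>

// Cycles spent on `reads` sequential memory reads under the observed pattern.
unsigned read_cycles(unsigned reads, bool cache_filled) {
    const unsigned filled_cost = 4;   // every read, cache filled
    const unsigned first_cost = 12;   // first read of each group, cache unfilled
    const unsigned group = 8;         // assumed: slow read recurs every 8 reads

    if (cache_filled) {
        return reads * filled_cost;
    }
    unsigned total = 0;
    for (unsigned i = 0; i < reads; ++i) {
        total += (i % group == 0) ? first_cost : filled_cost;
    }
    return total;
}
```

For 128 reads the model gives 512 cycles with the cache filled and 640 cycles with it unfilled, an extra 8 cycles for each group of 8 reads.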
Tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function
Further unknowns – how memory operations really work