General Optimization Issues M. Smith
To be tackled today
- Most optimized TigerSHARC instruction
- Integer and float
- Systematic optimization procedure
- SISD and SIMD modes
- Exercises
11/14/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

Most optimized SIMD floating-point (32-bit) TigerSHARC instruction

    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

    xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation */
    yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation */
    xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
    xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
       AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The add and the subtract must use the same pair of source registers (here R8 and R9) */

Most optimized SIMD integer (short, 16-bit) TigerSHARC instruction

    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyR7:6 = R5:4 * R3:2; xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;

    xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation */
    yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation */
    xyR7:6 = R5:4 * R3:2;
    /* Eight multiplications: XR5.H * XR3.H, XR5.L * XR3.L,
       XR4.H * XR2.H, XR4.L * XR2.L, and ditto for YR */
    xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
    /* Eight additions ???????
       AND eight subtractions ????????????????? */

Exercise
Write out the 16 operations performed by

    xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
    /* Eight additions ???????
       AND eight subtractions ????????????????? */

Now do a sideways add on xySR9:8 to get a single value

Steps to optimize
- Get the algorithm to work in “C”
- Determine how much time is available
- If the timing is already okay, quit
- Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch)
- Divide each calculated maximum by the number of available resources for that type of operation
- The largest division result is, in theory, the number of cycles needed for the algorithm
- If that minimum time is more than 100% of the time available, find a new algorithm
- If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements

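The resource-bound estimate in the steps above can be sketched in C++ as follows. This is only an illustration: the names OpBudget, min_cycles, Verdict and judge are hypothetical, and the "Uncertain" outcome is an assumption for the range between 40% and 100%, where the steps give no rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per operation type (add, multiply, memory fetch, ...).
struct OpBudget {
    std::size_t needed;     // total operations of this type in the algorithm
    std::size_t per_cycle;  // how many of them the hardware can issue per cycle
};

// Theoretical minimum cycle count: the largest needed / per_cycle, rounded up.
std::size_t min_cycles(const std::vector<OpBudget>& ops) {
    std::size_t worst = 0;
    for (const OpBudget& op : ops) {
        assert(op.per_cycle > 0);
        const std::size_t cycles = (op.needed + op.per_cycle - 1) / op.per_cycle;
        worst = std::max(worst, cycles);
    }
    return worst;
}

enum class Verdict { NewAlgorithm, WorthOptimizing, Uncertain };

// Compare the theoretical minimum with the cycles available.
Verdict judge(std::size_t minimum, std::size_t available) {
    assert(available > 0);
    if (minimum > available) return Verdict::NewAlgorithm;              // more than 100%
    if (minimum * 10 < available * 4) return Verdict::WorthOptimizing;  // less than 40%
    return Verdict::Uncertain;  // assumption: no rule is stated for this range
}
```

For example, 256 additions at 2 per cycle and 256 fetches at 2 per cycle give a bound of 128 cycles; whichever operation type has the largest ratio sets the limit.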
Code optimization – 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 2 additions AND 2 memory fetches each cycle
- Therefore the optimum is SIZE cycles, IFF all optimizations can be found

Code optimization – 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- Left fetched on the J-bus and done in X-compute
- Right fetched on the K-bus and done in Y-compute

16-bit integers (short int) might be okay in some circumstances
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 8 short additions AND 32 short memory fetches each cycle
- Therefore the optimum is SIZE / 4 cycles, IFF all optimizations can be found

FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- 2 additions, 2 multiplications and 8 fetches per cycle
- Should be able to do it in SIZE / 2 cycles

FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- Fetch 2 values along the J-bus into XA and YA compute
- Fetch 2 coefficients along the K-bus into XB and YB compute

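The FIR operation counts above can be made concrete with a plain C++ model. This is a sketch under assumptions, not the lab code: the function names are hypothetical, and the two-lane version only imitates splitting the work between the X and Y compute blocks, with one multiply and one add per lane on each pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference FIR output for one sample: SIZE multiplications, SIZE additions,
// and 2 * SIZE memory fetches (one sample and one coefficient per tap).
float fir_reference(const std::vector<float>& samples,
                    const std::vector<float>& coeffs) {
    assert(samples.size() == coeffs.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        sum += samples[i] * coeffs[i];
    }
    return sum;
}

// Two-lane model: each pass handles two taps, so the loop runs SIZE / 2 times.
float fir_two_lanes(const std::vector<float>& samples,
                    const std::vector<float>& coeffs) {
    assert(samples.size() == coeffs.size());
    assert(samples.size() % 2 == 0);  // assumption: even tap count
    float x_sum = 0.0f;               // partial sum kept in the "X" lane
    float y_sum = 0.0f;               // partial sum kept in the "Y" lane
    for (std::size_t i = 0; i < samples.size(); i += 2) {
        const float xa = samples[i];      // two values, as on the J-bus
        const float ya = samples[i + 1];
        const float xb = coeffs[i];       // two coefficients, as on the K-bus
        const float yb = coeffs[i + 1];
        x_sum += xa * xb;
        y_sum += ya * yb;
    }
    return x_sum + y_sum;  // combine the two partial sums at the end
}
```

With floating-point data the two versions add terms in a different order, so results can differ in the last bits for general inputs.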
Need a systematic approach to handling the optimization of code
1. Get the C++ code to work
2. Rewrite the code in its simplest format, one operation per line
   - Recommend: rewrite the code using register names
3. Unwrap the loop; start with “twice”
   - Rewrite the second part of the loop using different register names, which avoids setting up unexpected dependencies
4. Overlap the first and second parts of the loop
5. Rearrange the “start-up” and ending code

Step 1 – Get the C++ code to work

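The code for this step is not reproduced in the text. As a stand-in, here is a minimal sketch that assumes the algorithm sums a left-channel buffer and a right-channel buffer, which matches the 2 * SIZE additions and 2 * SIZE memory fetches counted earlier. The names ChannelSums and sum_channels_step1 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ChannelSums {
    int left;
    int right;
};

// Assumed starting point: straightforward, readable, and easy to check.
ChannelSums sum_channels_step1(const std::vector<int>& left,
                               const std::vector<int>& right) {
    assert(left.size() == right.size());
    ChannelSums sums{0, 0};
    for (std::size_t i = 0; i < left.size(); ++i) {
        sums.left += left[i];    // one fetch and one add for the left channel
        sums.right += right[i];  // one fetch and one add for the right channel
    }
    return sums;
}
```

A version like this is worth keeping around as the reference against which every later, faster version is compared.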
Step 2 – Rewrite in simplest format
- Note the naming convention
- Single operation per line
- Note the other changes

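A sketch of what the "one operation per line" form might look like, again assuming a left/right summing loop. The register-style names (j0, j1, xr0 to xr3) are illustrative choices, not the naming convention from the original slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ChannelSums {
    int left;
    int right;
};

// Hypothetical Step 2 form: every line maps to one assembly operation.
ChannelSums sum_channels_step2(const std::vector<int>& left,
                               const std::vector<int>& right) {
    assert(left.size() == right.size());
    const int* j0 = left.data();   // stands in for an index register
    const int* j1 = right.data();  // stands in for a second index register
    int xr0 = 0;                   // running left sum
    int xr1 = 0;                   // running right sum
    for (std::size_t count = left.size(); count > 0; --count) {
        int xr2 = *j0++;           // fetch left sample, post-increment pointer
        xr0 = xr0 + xr2;           // add into left sum
        int xr3 = *j1++;           // fetch right sample, post-increment pointer
        xr1 = xr1 + xr3;           // add into right sum
    }
    return ChannelSums{xr0, xr1};
}
```

The loop body now has four operations per pass, which is the form that translates line by line into assembly.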
Step 3 – Unwrap the loop
- Again, note the naming convention

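Unwrapping "twice" might look like the sketch below, under the same assumption of a left/right summing loop. The second copy of the body uses its own temporaries (xr4, xr5) so that it does not depend on the first copy's fetches. All names are hypothetical, and an even buffer size is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ChannelSums {
    int left;
    int right;
};

// Hypothetical Step 3 form: loop body duplicated, half as many passes.
ChannelSums sum_channels_step3(const std::vector<int>& left,
                               const std::vector<int>& right) {
    assert(left.size() == right.size());
    assert(left.size() % 2 == 0);  // assumption: even number of samples
    const int* j0 = left.data();
    const int* j1 = right.data();
    int xr0 = 0;  // running left sum
    int xr1 = 0;  // running right sum
    for (std::size_t count = left.size() / 2; count > 0; --count) {
        // First copy of the original loop body
        int xr2 = *j0++;
        xr0 = xr0 + xr2;
        int xr3 = *j1++;
        xr1 = xr1 + xr3;
        // Second copy, using different temporaries
        int xr4 = *j0++;
        xr0 = xr0 + xr4;
        int xr5 = *j1++;
        xr1 = xr1 + xr5;
    }
    return ChannelSums{xr0, xr1};
}
```

Each pass now contains eight operations, and the fetches into xr4 and xr5 are free to move earlier because nothing in the first copy writes to them.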
Step 4 – Overlap the first and second parts of the loop
Note: the “C++” code goes no faster, but the parallel assembly code translated from this format will.
Timing
- Step 1: 4 * N
- Step 3: 8 * (N / 2) + 2
- Step 4: 6 * (N / 2) + 2

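The three timing formulas above can be evaluated directly. The helper names below are hypothetical, and N = 128 is only an illustrative buffer size; the formulas themselves are taken as stated.

```cpp
#include <cassert>
#include <cstddef>

// Step 1: four operations per sample.
std::size_t cycles_step1(std::size_t n) {
    return 4 * n;
}

// Step 3: loop unwrapped twice, eight operations per pass, plus 2.
std::size_t cycles_step3(std::size_t n) {
    assert(n % 2 == 0);
    return 8 * (n / 2) + 2;
}

// Step 4: overlapped halves, six instruction lines per pass, plus 2.
std::size_t cycles_step4(std::size_t n) {
    assert(n % 2 == 0);
    return 6 * (n / 2) + 2;
}
```

For N = 128 these give 512, 514 and 386. Unwrapping alone saves nothing; the gain appears only once the two halves are overlapped.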
Step 5A – Rearrange “start-up” and ending code
- “Software” pipeline
- Move the first read outside the loop
- Need to add an “extra read” at the end of the loop
- Timing: 2 + (N/2 - 1) * 6
- Need to adjust the loop start (Is it done correctly? Are we “one-out”?)
- CAUTION – NEED TO FIX

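The idea of a software pipeline can be illustrated on a simpler loop that is not unwrapped. The sketch below assumes a left/right summing loop; it is one possible arrangement, not the code from the original slide. The first reads happen before the loop, each pass fetches the next values while adding the previous ones, and the ending code consumes the last values fetched. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ChannelSums {
    int left;
    int right;
};

ChannelSums sum_channels_pipelined(const std::vector<int>& left,
                                   const std::vector<int>& right) {
    assert(left.size() == right.size());
    ChannelSums sums{0, 0};
    const std::size_t n = left.size();
    if (n == 0) {
        return sums;
    }
    // Start-up code: the first reads are moved outside the loop.
    int next_left = left[0];
    int next_right = right[0];
    // The loop runs n - 1 times: one pass fewer than the original loop.
    for (std::size_t i = 1; i < n; ++i) {
        const int cur_left = next_left;
        const int cur_right = next_right;
        next_left = left[i];      // fetch for the NEXT pass ...
        next_right = right[i];
        sums.left += cur_left;    // ... while adding values fetched earlier
        sums.right += cur_right;
    }
    // Ending code: add the values fetched by the final pass.
    sums.left += next_left;
    sums.right += next_right;
    return sums;
}
```

Because the fetch and the add in each pass touch different values, they no longer depend on each other and can share an instruction line. Shortening the loop by one pass is what avoids reading past the end of the buffers.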
Step 5B – Rearrange “start-up” and ending code
- Can now parallel additional adds and memory fetches
- Note: the loop is still in error

Exercise – Get the loop control correct

Exercise 1 – Get the loop control correct
- BUFFER_SIZE = 1
- BUFFER_SIZE = 2
- BUFFER_SIZE = 4
- BUFFER_SIZE = 5
- BUFFER_SIZE = 8
- BUFFER_SIZE = 128

Exercise 2 – Rewrite the code when it is known that BUFFER_SIZE = 127

Code to this point is SISD parallel optimization
- SISD: single instruction, single data
  - Using the X_compute block and J memory bus
- Next stage, SIMD: single instruction, multiple data
  - Using the X_compute block and J memory bus for left
  - Using the Y_compute block and K memory bus for right
- Will need similar but different code when you are doing FIR in Lab. 3

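The move from SISD to SIMD can be pictured with a small C++ model in which one "register" holds an X value and a Y value, and a single operation acts on both. This is only a conceptual sketch assuming a left/right summing loop: XYReg, xy_add and sum_channels_simd_model are hypothetical names, and real SIMD code is written in TigerSHARC assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One register name as seen by both compute blocks.
struct XYReg {
    int x;  // value held in the X compute block (left channel)
    int y;  // value held in the Y compute block (right channel)
};

// One "instruction": the same add is performed in both blocks.
XYReg xy_add(XYReg a, XYReg b) {
    return XYReg{a.x + b.x, a.y + b.y};
}

XYReg sum_channels_simd_model(const std::vector<int>& left,
                              const std::vector<int>& right) {
    assert(left.size() == right.size());
    XYReg acc{0, 0};
    for (std::size_t i = 0; i < left.size(); ++i) {
        // Left arrives as if on the J bus, right as if on the K bus.
        const XYReg fetched{left[i], right[i]};
        acc = xy_add(acc, fetched);  // one add covers both channels
    }
    return acc;
}
```

In this model each pass needs one paired fetch and one paired add, where the SISD version needed two of each.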
Exercise 3 – BUFFER_SIZE = 128
Rewrite so that the X and Y operations are done together

Exercise 4 – BUFFER_SIZE = 128
Rewrite so that no data dependency stalls are expected
