General Optimization Issues

M. Smith

To be tackled today
Most optimized TigerSHARC instruction
  Integer and float
Systematic optimization procedure
SISD and SIMD modes
Exercises
11/14/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

Most optimized SIMD floating-point (32-bit) TigerSHARC instruction
xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

xR3:0 = CB Q[j0 += 4];
/* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
   Increments the J register and adjusts for circular buffer operation. */
yR3:0 = CB Q[k0 += 4];
/* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
   Increments the K register and adjusts for circular buffer operation. */
xyFR4 = R5 * R6;
/* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
xyFR7 = R8 + R9, FR10 = R8 - R9;;
/* Two additions, XFR8 + XFR9 and YFR8 + YFR9, AND
   two subtractions, XFR8 - XFR9 and YFR8 - YFR9 */
/* The same source registers must be used on either side of the + and - operators */
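To make the arithmetic half of this instruction line concrete, here is a minimal C sketch of what the multiply and the paired add/subtract compute in each compute block. The `compute_block` type and the function names are hypothetical stand-ins invented for this illustration; the sketch ignores the two memory fetches and says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for one compute block: 32 float registers.
   An illustration only, not a model of the real register file. */
typedef struct {
    float r[32];
} compute_block;

/* FR4 = R5 * R6;  FR7 = R8 + R9, FR10 = R8 - R9   (within one block) */
static void mul_add_sub(compute_block *b)
{
    assert(b != NULL);
    float product    = b->r[5] * b->r[6];
    float sum        = b->r[8] + b->r[9];   /* add and subtract share   */
    float difference = b->r[8] - b->r[9];   /* the same source registers */
    b->r[4]  = product;
    b->r[7]  = sum;
    b->r[10] = difference;
}

/* The xy prefix applies the same operations to both compute blocks,
   giving two multiplies, two adds and two subtracts in total. */
void simd_mul_add_sub(compute_block *x, compute_block *y)
{
    mul_add_sub(x);
    mul_add_sub(y);
}
```

Each block works on its own copy of R5, R6, R8 and R9, which is why one instruction line yields six floating-point results.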

Most optimized SIMD integer (short) (16-bit) TigerSHARC instruction
xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyR7:6 = R5:4 * R3:2; xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;

xR3:0 = CB Q[j0 += 4];
/* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
   Increments the J register and adjusts for circular buffer operation. */
yR3:0 = CB Q[k0 += 4];
/* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
   Increments the K register and adjusts for circular buffer operation. */
xyR7:6 = R5:4 * R3:2;
/* Eight multiplications: XR5.H * XR3.H, XR5.L * XR3.L, XR4.H * XR2.H, XR4.L * XR2.L; ditto for YR */
xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
/* Eight additions ??????? AND eight subtractions ????????????????? */

Exercise -- Write out the 16 operations performed
xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
/* Eight additions ??????? AND eight subtractions ????????????????? */
Now do a sideways add on xySR9:8 and get a value

Steps to optimize
1. Get the algorithm to work in “C”.
2. Determine how much time is available.
3. If the timing is already okay, quit.
4. Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch).
5. Divide each calculated maximum by the number of available resources for that type of operation.
6. The largest division result is, in theory, the minimum number of cycles needed for the algorithm.
7. If that minimum time is more than 100% of the time available, find a new algorithm.
8. If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements.
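The divide-and-take-the-largest rule in these steps can be sketched in a few lines of C. The `op_class` type and `min_cycles` function are hypothetical names invented for this illustration, and the per-cycle resource figures are inputs supplied by the caller, not values read from a data sheet.

```c
#include <assert.h>
#include <stddef.h>

/* One class of operation: how many the algorithm needs in total and
   how many the processor can issue per cycle (caller-supplied). */
typedef struct {
    const char   *name;
    unsigned long count;      /* operations required by the algorithm */
    unsigned long per_cycle;  /* resources available each cycle       */
} op_class;

/* Theoretical minimum cycle count: the largest value of
   ceil(count / per_cycle) over all operation classes. */
unsigned long min_cycles(const op_class *ops, size_t n_ops)
{
    assert(ops != NULL);
    unsigned long worst = 0;
    for (size_t i = 0; i < n_ops; i++) {
        assert(ops[i].per_cycle > 0);
        unsigned long cycles =
            (ops[i].count + ops[i].per_cycle - 1) / ops[i].per_cycle;
        if (cycles > worst) {
            worst = cycles;
        }
    }
    return worst;
}
```

For example, with SIZE = 128, an algorithm needing 2 * SIZE additions and 2 * SIZE fetches at 2 of each per cycle gives a bound of 128 cycles, which is SIZE. The bound only says what is possible in theory; reaching it depends on finding every parallel opportunity.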

Code optimization – 32-bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
If done correctly, can do 2 additions AND 2 memory fetches each cycle
Therefore the optimum is SIZE cycles, IFF all the optimizations can be found

Code optimization – 32-bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
Left fetched on the J-bus and done in X-compute
Right fetched on the K-bus and done in Y-compute

16-bit integers (short int) might be okay in some circumstances
2 * SIZE additions
2 * SIZE memory fetches
If done correctly, can do 8 short additions AND 32 short memory fetches each cycle
Therefore the optimum is SIZE / 4 cycles, IFF all the optimizations can be found

FIR optimization
SIZE additions
SIZE multiplications
SIZE * 2 memory fetches
2 additions, 2 multiplications and 8 fetches per cycle
Should be able to do it in SIZE / 2 cycles

FIR optimization
SIZE additions
SIZE multiplications
SIZE * 2 memory fetches
Fetch 2 values along the J-bus into XA and YA compute
Fetch 2 coefficients along the K-bus into XB and YB compute
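The operation counts quoted for the FIR can be read straight off a plain C reference loop. The sketch below is an assumption-laden illustration: the function names are invented here, the delay line is treated as a simple array (no circular-buffer addressing), and the second function only mimics the idea of splitting the work between two compute blocks by keeping two independent accumulators.

```c
#include <assert.h>
#include <stddef.h>

/* Reference FIR output sample. Each tap costs two fetches, one
   multiply and one add: SIZE adds, SIZE multiplies, 2 * SIZE fetches. */
float fir_reference(const float *samples, const float *coeffs, size_t size)
{
    assert(samples != NULL && coeffs != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < size; i++) {
        float s = samples[i];   /* fetch 1 */
        float c = coeffs[i];    /* fetch 2 */
        sum = sum + s * c;      /* one multiply, one add */
    }
    return sum;
}

/* Same result built from two independent partial sums, one per
   (hypothetical) compute block. Assumes size is even. */
float fir_two_accumulators(const float *samples, const float *coeffs,
                           size_t size)
{
    assert(samples != NULL && coeffs != NULL);
    assert(size % 2 == 0);
    float sum_x = 0.0f;   /* partial sum for the "X" half */
    float sum_y = 0.0f;   /* partial sum for the "Y" half */
    for (size_t i = 0; i < size; i += 2) {
        sum_x = sum_x + samples[i]     * coeffs[i];
        sum_y = sum_y + samples[i + 1] * coeffs[i + 1];
    }
    return sum_x + sum_y;   /* combine the two halves at the end */
}
```

Because the two accumulators do not depend on each other, two taps can in principle proceed per pass, which is where a figure of SIZE / 2 comes from. Floating-point addition is not associative, so the two versions may differ in the last bits for general data.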

Need a systematic approach to handling the optimization of code
1. Get the C++ code to work
2. Rewrite the code in the simplest format: one operation per line
   Recommended: rewrite the code using register names
3. Unwrap the loop; start with “twice”
   Rewrite the second part of the loop using different register names, which avoids setting up unexpected dependencies
4. Overlap the first and second parts of the loop
5. Rearrange the “start-up” and ending code

Stage 1 – Get the C++ code to work

Stage 2 – Rewrite in simplest format
Note the naming convention
Single operation per line
Note the other changes

Step 3 -- Unwrap the loop
Again, note the naming convention
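The slide code for these stages did not survive in this transcript, so the following C sketch uses a stand-in loop, assumed here purely for illustration: summing a left and a right buffer, which matches the 2 * SIZE additions and 2 * SIZE fetches counted earlier. All function and variable names are hypothetical. The first function shows the one-operation-per-line form with register-style names; the second unwraps the loop twice and gives the second half its own names.

```c
#include <assert.h>
#include <stddef.h>

/* One operation per line, temporaries named like registers. */
void sums_simple(const int *left, const int *right, size_t n,
                 int *left_sum, int *right_sum)
{
    assert(left != NULL && right != NULL);
    int R0 = 0;                 /* running left sum  */
    int R1 = 0;                 /* running right sum */
    for (size_t i = 0; i < n; i++) {
        int R2 = left[i];       /* fetch */
        R0 = R0 + R2;           /* add   */
        int R3 = right[i];      /* fetch */
        R1 = R1 + R3;           /* add   */
    }
    *left_sum = R0;
    *right_sum = R1;
}

/* Unwrapped twice. The second part uses R4 and R5 instead of reusing
   R2 and R3, so it does not depend on the first part's temporaries.
   Assumes n is even. */
void sums_unrolled2(const int *left, const int *right, size_t n,
                    int *left_sum, int *right_sum)
{
    assert(left != NULL && right != NULL);
    assert(n % 2 == 0);
    int R0 = 0;
    int R1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        /* first part */
        int R2 = left[i];
        R0 = R0 + R2;
        int R3 = right[i];
        R1 = R1 + R3;
        /* second part, different register names */
        int R4 = left[i + 1];
        R0 = R0 + R4;
        int R5 = right[i + 1];
        R1 = R1 + R5;
    }
    *left_sum = R0;
    *right_sum = R1;
}
```

In C the unwrapped version is no faster; its value is that the eight independent-looking lines per pass are easier to pair up by hand when translating to parallel assembly.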

Step 4 -- Overlap the first and second parts of loops
Note: the “C++” code goes no faster, but the parallel assembly code translated from this format will
Step * N
Step 3: 8 * (N / 2) + 2
Step 4: 6 * (N / 2) + 2

Step 5A - Rearrange “start-up” and ending code
“Software” pipeline
Move the first read outside the loop
Need to add an “extra read” at the end of the loop
Timing: 2 + (N/2 - 1) * 6
Need to adjust the loop start (Is it done correctly? Are we “one-out”?)
CAUTION: NEED TO FIX

Step 5B - Rearrange “start-up” and ending code
Can now run additional adds and memory fetches in parallel
Note: the loop is still in error
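As a rough C illustration of the software-pipelining idea, the sketch below moves the first reads ahead of the loop so that each pass can issue the next reads next to the adds on the previously fetched values. It is one possible way to keep the loop bounds consistent, not the deck's own code: the left/right summing loop and every name in it are assumptions made for this example, and it is not unrolled.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined sum of two buffers. Requires n >= 1. */
void sums_pipelined(const int *left, const int *right, size_t n,
                    int *left_sum, int *right_sum)
{
    assert(left != NULL && right != NULL);
    assert(n >= 1);
    int R0 = 0;              /* running left sum  */
    int R1 = 0;              /* running right sum */

    /* start-up: first reads moved outside the loop */
    int R2 = left[0];
    int R3 = right[0];

    /* steady state: runs n - 1 times, never reads past the buffer */
    for (size_t i = 1; i < n; i++) {
        int R4 = left[i];    /* next reads ...                        */
        int R5 = right[i];
        R0 = R0 + R2;        /* ... could pair with these adds, which */
        R1 = R1 + R3;        /* only use the previous pass's values   */
        R2 = R4;
        R3 = R5;
    }

    /* ending code: add the values fetched by the last reads */
    R0 = R0 + R2;
    R1 = R1 + R3;

    *left_sum = R0;
    *right_sum = R1;
}
```

Starting the loop at 1 and draining after it avoids the "extra read" beyond the end of the buffer, and it behaves the same for sizes of 1, 2 or 5. The register copies at the bottom of the loop would normally be removed by unrolling twice and alternating register names.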

Exercise -- Get the loop control correct

Exercise 1 -- Get the loop control correct
BUFFER_SIZE = 1
BUFFER_SIZE = 2
BUFFER_SIZE = 4
BUFFER_SIZE = 5
BUFFER_SIZE = 8
BUFFER_SIZE = 128

Exercise 2 -- Rewrite the code when it is known that BUFFER_SIZE = 127

Code to this point is SISD parallel optimization
SISD: single instruction, single data
  Using X_compute block and J memory bus
Next stage, SIMD: single instruction, multiple data
  Using X_compute block and J memory bus for left
  Using Y_compute block and K memory bus for right
Will need similar but different code when you are doing FIR in Lab. 3

Exercise 3 -- BUFFER_SIZE = 128. Rewrite so that X and Y operations are done together

Exercise 4 -- BUFFER_SIZE = 128. Rewrite so that no data dependency stalls are expected

To be tackled today
Most optimized TigerSHARC instruction
  Integer and float
Systematic optimization procedure
SISD and SIMD modes
Exercises
