General Optimization Issues

General Optimization Issues M. Smith

To be tackled today
- Most optimized TigerSHARC instruction
  - Integer and float
- Systematic optimization procedure
- SISD and SIMD modes
- Exercises

11/14/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

Most optimized SIMD floating point (32-bit) TigerSHARC instruction

    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

The four parts of this instruction line:

    xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation. */

    yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation. */

    xyFR4 = R5 * R6;
    /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */

    xyFR7 = R8 + R9, FR10 = R8 - R9;;
    /* Two additions: XFR8 + XFR9 and YFR8 + YFR9
       AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
    /* The same registers must be used on either side of the + and - operators */
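As a rough illustration of what the compute part of that instruction line does, here is a minimal C model. It covers only the multiply and the add/subtract pair, not the two circular-buffer fetches. The `ComputeBlock` type and both function names are hypothetical stand-ins for the X and Y register files and are not part of any TigerSHARC toolchain.

```c
/* Hypothetical model of one compute block's register file, R0..R31 as floats. */
typedef struct {
    float r[32];
} ComputeBlock;

/* Models, for ONE compute block, the compute slots of
 *     xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;
 * All sources are read before any result is written, because the
 * operations on one instruction line are issued together. */
static void compute_line(ComputeBlock *cb)
{
    float prod = cb->r[5] * cb->r[6];
    float sum  = cb->r[8] + cb->r[9];
    float diff = cb->r[8] - cb->r[9];

    cb->r[4]  = prod;
    cb->r[7]  = sum;
    cb->r[10] = diff;
}

/* The "xy" prefix applies the same operations to both compute blocks,
 * giving 2 multiplications, 2 additions and 2 subtractions per line. */
void simd_compute_line(ComputeBlock *x, ComputeBlock *y)
{
    compute_line(x);
    compute_line(y);
}
```

Calling `simd_compute_line` once corresponds to one pass through the instruction line: each block ends up with its own product in R4, sum in R7 and difference in R10.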

Most optimized SIMD integer (short, 16-bit) TigerSHARC instruction

    xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyR7:6 = R5:4 * R3:2; xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;

The four parts of this instruction line:

    xR3:0 = CB Q[j0 += 4];
    /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0.
       Increments the J register and adjusts for circular buffer operation. */

    yR3:0 = CB Q[k0 += 4];
    /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0.
       Increments the K register and adjusts for circular buffer operation. */

    xyR7:6 = R5:4 * R3:2;
    /* Eight multiplications: XR5.H * XR3.H, XR5.L * XR3.L,
       XR4.H * XR2.H, XR4.L * XR2.L; ditto for YR */

    xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
    /* Eight additions ??????? AND Eight subtractions ????????????????? */

Exercise
Write out the 16 operations performed by

    xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
    /* Eight additions ??????? AND Eight subtractions ????????????????? */

Now do a sideways add on xySR9:8 and get a value.
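One way to check an answer to this exercise is with a small C model of the packed 16-bit add/subtract for a single compute block. This is a sketch under several assumptions: the `ShortQuad` type and the function names are hypothetical, the lane ordering follows the high/low pairing used for the multiplications on the previous slide, the sideways add is treated as the plain sum of the four 16-bit lanes, and overflow or saturation behaviour is ignored.

```c
#include <stdint.h>

/* Hypothetical view of a register pair such as R7:6 as four 16-bit lanes.
 * Assumed lane order: [0] upper register .H, [1] upper register .L,
 *                     [2] lower register .H, [3] lower register .L   */
typedef struct {
    int16_t lane[4];
} ShortQuad;

/* Models SR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0 for one compute block:
 * four lane-wise additions and four lane-wise subtractions.
 * Inputs are assumed small enough that no lane overflows. */
void short_add_sub(const ShortQuad *a, const ShortQuad *b,
                   ShortQuad *sum, ShortQuad *diff)
{
    for (int i = 0; i < 4; ++i) {
        sum->lane[i]  = (int16_t)(a->lane[i] + b->lane[i]);
        diff->lane[i] = (int16_t)(a->lane[i] - b->lane[i]);
    }
}

/* Assumed meaning of the "sideways add": add the four lanes together. */
int32_t sideways_add(const ShortQuad *q)
{
    int32_t total = 0;
    for (int i = 0; i < 4; ++i) {
        total += q->lane[i];
    }
    return total;
}
```

Running the model once for the X block and once for the Y block gives 8 additions and 8 subtractions, the 16 operations the exercise asks for.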

Steps to optimize
- Get the algorithm to work in “C”.
- Determine how much time is available.
- If the timing is already okay, quit.
- Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch).
- Divide each calculated maximum by the number of available resources for that type of operation.
- The largest division result is, in theory, the minimum number of cycles needed for the algorithm.
- If that minimum time is more than 100% of the time available, find a new algorithm.
- If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements.
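The resource-bound estimate in these steps can be sketched in a few lines of C. The `OpBudget` and `Verdict` types and both function names are hypothetical. The slide gives no guidance for a minimum between 40% and 100% of the available time, so the sketch reports that range as not covered rather than guessing.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per type of operation (add, multiply, memory fetch, ...). */
typedef struct {
    unsigned long count;      /* operations of this type the algorithm needs   */
    unsigned long per_cycle;  /* operations of this type available each cycle  */
} OpBudget;

/* Theoretical minimum cycle count: the largest count / per_cycle, rounded up. */
unsigned long min_cycles(const OpBudget *ops, size_t n_ops)
{
    unsigned long worst = 0;
    for (size_t i = 0; i < n_ops; ++i) {
        assert(ops[i].per_cycle > 0);
        unsigned long cycles =
            (ops[i].count + ops[i].per_cycle - 1) / ops[i].per_cycle;
        if (cycles > worst) {
            worst = cycles;
        }
    }
    return worst;
}

typedef enum {
    FIND_NEW_ALGORITHM,    /* minimum is more than 100% of the time available */
    WORTH_OPTIMIZING,      /* minimum is less than 40% of the time available  */
    NOT_COVERED_BY_SLIDE   /* between 40% and 100%: the slide does not say    */
} Verdict;

Verdict judge(unsigned long minimum, unsigned long available)
{
    if (minimum > available) {
        return FIND_NEW_ALGORITHM;
    }
    if (minimum * 10 < available * 4) {
        return WORTH_OPTIMIZING;
    }
    return NOT_COVERED_BY_SLIDE;
}
```

For example, with SIZE = 128, an algorithm needing 2 * SIZE additions and 2 * SIZE fetches on hardware that can do 2 of each per cycle gives `min_cycles` = 128, that is, SIZE cycles.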

Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 2 additions AND 2 memory fetches each cycle
- Therefore the optimum is SIZE cycles, if and only if all the optimizations can be found

Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- Left fetched on the J-bus and processed in X-compute
- Right fetched on the K-bus and processed in Y-compute

16-bit integers (short int) might be okay in some circumstances
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 8 short additions AND 32 short memory fetches each cycle
- Therefore the optimum is SIZE / 4 cycles, if and only if all the optimizations can be found

FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- 2 additions, 2 multiplications and 8 fetches per cycle
- Should be able to do it in SIZE / 2 cycles

FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- Fetch 2 values along the J-bus into XA and YA compute
- Fetch 2 coefficients along the K-bus into XB and YB compute
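The SIZE / 2 figure can be pictured with a C sketch that computes one FIR output using two independent accumulators, one standing in for each compute block. The function name and the even-SIZE restriction are assumptions made for illustration. This is plain C and does not generate parallel TigerSHARC code by itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: one FIR output as the sum of data[i] * coeff[i].
 * acc_x and acc_y stand in for the X and Y compute blocks, so each loop
 * pass does 2 multiplications and 2 additions and the loop runs
 * size / 2 times. Assumes size is even. */
float fir_two_accumulators(const float *data, const float *coeff, size_t size)
{
    assert(size % 2 == 0);

    float acc_x = 0.0f;
    float acc_y = 0.0f;

    for (size_t i = 0; i < size; i += 2) {
        acc_x += data[i]     * coeff[i];      /* "X compute" half */
        acc_y += data[i + 1] * coeff[i + 1];  /* "Y compute" half */
    }

    /* Combine the two partial sums once, after the loop. */
    return acc_x + acc_y;
}
```

Splitting the sum across two accumulators changes the order of the floating point additions, so the result can differ in the last bits from a single-accumulator loop.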

Need a systematic approach to handling the optimization of code
1. Get the C++ code to work
2. Rewrite the code in the simplest format, one operation per line
   - Recommended: rewrite the code using register names
3. Unwrap the loop, starting with “twice”
   - Rewrite the second part of the loop using different register names; this avoids setting up unexpected dependencies
4. Overlap the first and second parts of the loop
5. Rearrange the “start-up” and ending code

Step 1: Get the C++ code to work
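The slide's code listing is not part of this transcript. As a stand-in that matches the operation counts quoted elsewhere in the talk (2 * SIZE additions and 2 * SIZE memory fetches over a left and a right buffer), a working starting point might look like the sketch below. The routine, its name and its parameters are assumptions, not the author's code.

```c
#include <stddef.h>

/* Stand-in Step 1 routine (an assumption, not the slide's listing):
 * sum a left-channel and a right-channel buffer of n samples each. */
void sum_channels(const int *left, const int *right, size_t n,
                  int *left_sum, int *right_sum)
{
    int ls = 0;
    int rs = 0;

    for (size_t i = 0; i < n; ++i) {
        ls += left[i];   /* one fetch and one addition for the left channel  */
        rs += right[i];  /* one fetch and one addition for the right channel */
    }

    *left_sum = ls;
    *right_sum = rs;
}
```

Each pass through the loop performs 2 fetches and 2 additions, which is where a count of 4 * N operations for N samples comes from.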

Step 2: Rewrite in simplest format
- Note the naming convention
- Single operation per line
- Note the other changes
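The slide's listing is missing from the transcript, so the following is only a sketch of the idea, applied to an assumed routine that sums a left and a right buffer. Variable names such as `xr0` and `j0` imitate TigerSHARC register names; the exact naming convention shown on the slide is not recoverable, so these names are hypothetical.

```c
#include <stddef.h>

/* Sketch: the summing loop rewritten with one operation per line and
 * register-style variable names (all names are illustrative). */
void sum_channels_step2(const int *left, const int *right, size_t n,
                        int *left_sum, int *right_sum)
{
    const int *j0 = left;   /* pointer "register" for the left buffer  */
    const int *j1 = right;  /* pointer "register" for the right buffer */
    int xr0;                /* value fetched from the left buffer      */
    int xr1;                /* value fetched from the right buffer     */
    int xr2 = 0;            /* running left sum                        */
    int xr3 = 0;            /* running right sum                       */

    for (size_t i = 0; i < n; ++i) {
        xr0 = *j0++;        /* fetch left,  post-increment pointer     */
        xr1 = *j1++;        /* fetch right, post-increment pointer     */
        xr2 = xr2 + xr0;    /* add into left sum                       */
        xr3 = xr3 + xr1;    /* add into right sum                      */
    }

    *left_sum = xr2;
    *right_sum = xr3;
}
```

Written this way, every line inside the loop corresponds to a single fetch or a single addition, which makes the later hand translation into assembly more mechanical.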

Step 3: Unwrap the loop
- Again, note the naming convention
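A sketch of the unwrapping step follows, again on an assumed routine that sums a left and a right buffer, because the slide's listing is not in the transcript. The second copy of the loop body uses its own temporaries (`xr4`, `xr5`) so that it does not depend on the first copy's fetch registers. All names are hypothetical, and the sketch assumes an even number of samples.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: loop unwrapped twice, second half using different "registers". */
void sum_channels_step3(const int *left, const int *right, size_t n,
                        int *left_sum, int *right_sum)
{
    assert(n % 2 == 0);     /* unwrapping by two assumes an even count */

    const int *j0 = left;
    const int *j1 = right;
    int xr0, xr1;           /* fetch registers, first half  */
    int xr4, xr5;           /* fetch registers, second half */
    int xr2 = 0;            /* running left sum             */
    int xr3 = 0;            /* running right sum            */

    for (size_t i = 0; i < n; i += 2) {
        /* first part */
        xr0 = *j0++;
        xr1 = *j1++;
        xr2 = xr2 + xr0;
        xr3 = xr3 + xr1;
        /* second part: same work, different fetch registers */
        xr4 = *j0++;
        xr5 = *j1++;
        xr2 = xr2 + xr4;
        xr3 = xr3 + xr5;
    }

    *left_sum = xr2;
    *right_sum = xr3;
}
```

The loop body now holds 8 operations and runs N / 2 times. Treating the two sum initializations as the extra 2 operations gives 8 * (N / 2) + 2, although that attribution is a guess.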

Step 4: Overlap the first and second parts of the loop
Note: the “C++” code goes no faster, but the parallel assembly code translated from this format will.
Operation counts:
- Step 1: 4 * N
- Step 3: 8 * (N / 2) + 2
- Step 4: 6 * (N / 2) + 2
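One plausible reading of the 6 * (N / 2) figure is sketched below on an assumed routine that sums a left and a right buffer, since the slide's listing is not in the transcript. The fetches for the second part are moved up next to the additions of the first part, so each such pair could share one instruction line that holds one memory access and one addition. Names are hypothetical and an even sample count is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: unwrapped loop with the second part's fetches overlapped with
 * the first part's additions. Two statements on one source line mark
 * operations that could be issued together in parallel assembly. */
void sum_channels_step4(const int *left, const int *right, size_t n,
                        int *left_sum, int *right_sum)
{
    assert(n % 2 == 0);

    const int *j0 = left;
    const int *j1 = right;
    int xr0, xr1, xr4, xr5;
    int xr2 = 0;
    int xr3 = 0;

    for (size_t i = 0; i < n; i += 2) {
        xr0 = *j0++;                      /* line 1: fetch only        */
        xr1 = *j1++;                      /* line 2: fetch only        */
        xr2 = xr2 + xr0;  xr4 = *j0++;    /* line 3: add + next fetch  */
        xr3 = xr3 + xr1;  xr5 = *j1++;    /* line 4: add + next fetch  */
        xr2 = xr2 + xr4;                  /* line 5: add only          */
        xr3 = xr3 + xr5;                  /* line 6: add only          */
    }

    *left_sum = xr2;
    *right_sum = xr3;
}
```

The C compiler still sees eight separate statements, which is why the C version runs no faster; only the six-line grouping matters for the hand-written assembly.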

Step 5A: Rearrange “start-up” and ending code (“software” pipeline)
- Move the first read outside the loop
- Need to add an “extra read” at the end of the loop
- Timing: 2 + (N/2 - 1) * 6
- Need to adjust the loop start (is it done correctly, or are we “one-out”?)
CAUTION: NEED TO FIX
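The shape of a software-pipelined loop can be sketched in C on an assumed routine that sums a left and a right buffer. This is not the slide's code: it pipelines the simple loop rather than the unwrapped one, and instead of an extra read at the end it runs the loop one pass fewer and finishes the last additions after the loop. That choice avoids reading past the end of a plain C array. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a software pipeline: the first reads are done before the loop,
 * each pass adds the previously fetched values while fetching the next
 * ones, and the final additions are done after the loop. */
void sum_channels_pipelined(const int *left, const int *right, size_t n,
                            int *left_sum, int *right_sum)
{
    assert(n >= 1);         /* the start-up code always reads one sample */

    const int *j0 = left;
    const int *j1 = right;
    int xr2 = 0;
    int xr3 = 0;

    /* start-up: first reads moved outside the loop */
    int xr0 = *j0++;
    int xr1 = *j1++;

    /* steady state: n - 1 passes, each pairing an add with the next fetch */
    for (size_t i = 1; i < n; ++i) {
        xr2 = xr2 + xr0;  xr0 = *j0++;
        xr3 = xr3 + xr1;  xr1 = *j1++;
    }

    /* ending code: add the last values fetched */
    xr2 = xr2 + xr0;
    xr3 = xr3 + xr1;

    *left_sum = xr2;
    *right_sum = xr3;
}
```

Checking the loop bounds against small sizes such as 1, 2 and 5 is a quick way to catch a “one-out” error in the start-up or ending code.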

Step 5B: Rearrange “start-up” and ending code
- Can now parallel additional adds and memory fetches
- Note: the loop is still in error

Exercise: Get the loop control correct

Exercise 1: Get the loop control correct for
- BUFFER_SIZE = 1
- BUFFER_SIZE = 2
- BUFFER_SIZE = 4
- BUFFER_SIZE = 5
- BUFFER_SIZE = 8
- BUFFER_SIZE = 128

Exercise 2: Rewrite the code when it is known that BUFFER_SIZE = 127

Code to this point is SISD parallel optimization
- SISD: single instruction, single data
  - Using the X_compute block and the J memory bus
- Next stage is SIMD: single instruction, multiple data
  - Using the X_compute block and the J memory bus for left
  - Using the Y_compute block and the K memory bus for right
- Will need similar but different code when you are doing FIR in Lab. 3

Exercise 3: BUFFER_SIZE = 128
Rewrite so that the X and Y operations are done together

Exercise 4: BUFFER_SIZE = 128
Rewrite so that no data dependency stalls are expected
