General Optimization Issues


General Optimization Issues: Solving the exercise issues

To be tackled today
Exercise 1: Solving the loop problem, SIZE = 128
Exercise 2: Solving the loop problem, SIZE = 127
Exercise 3: Moving from SISD to SIMD mode, SIZE = 128
Exercise 4: Removing any expected stalls
2/23/2019 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

Most optimized SIMD floating-point (32-bit) TigerSHARC instruction

xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;

xR3:0 = CB Q[j0 += 4];
/* Fetches 4 values on the J bus into X-compute registers XR3, XR2, XR1, XR0.
   Increments the J register and adjusts for circular buffer operation. */
yR3:0 = CB Q[k0 += 4];
/* Fetches 4 values on the K bus into Y-compute registers YR3, YR2, YR1, YR0.
   Increments the K register and adjusts for circular buffer operation. */
xyFR4 = R5 * R6;
/* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
xyFR7 = R8 + R9, FR10 = R8 - R9;;
/* Two additions: XFR8 + XFR9 and YFR8 + YFR9
   AND two subtractions: XFR8 - XFR9 and YFR8 - YFR9 */
/* The same two source registers must appear on either side of the + and - operators */
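To make the circular-buffer quad fetch concrete, here is a minimal C model of what a line such as xR3:0 = CB Q[j0 += 4] appears to do: read four words at the current offset, then advance the offset by 4 and wrap at the end of the buffer. This is a sketch, not TigerSHARC code. The CircPtr type, its fields and the quad_fetch function are hypothetical names introduced for illustration, and the model assumes a buffer length that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of a circular-buffer pointer; not TigerSHARC code. */
typedef struct {
    const float *base;   /* start of the buffer */
    size_t length;       /* buffer length in words (assumed multiple of 4) */
    size_t index;        /* current offset from base (plays the role of j0 or k0) */
} CircPtr;

/* Read four consecutive words at the current offset, then post-increment
   the offset by 4, wrapping back to the start at the end of the buffer. */
static void quad_fetch(CircPtr *p, float out[4])
{
    assert(p->length > 0 && p->length % 4 == 0 && p->index % 4 == 0);
    for (size_t i = 0; i < 4; i++) {
        out[i] = p->base[p->index + i];
    }
    p->index = (p->index + 4) % p->length;
}
```

With an 8-word buffer, two calls to quad_fetch return words 0..3 and then 4..7, and leave the offset back at 0.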

Steps to optimize
Get the algorithm to work in “C”
Determine how much time is available
If the timing is already okay, quit
Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch)
Divide each calculated maximum by the number of available resources for that type of operation
The largest division result is, in theory, the minimum number of cycles needed for the algorithm
If that minimum time is more than 100% of the time available, find a new algorithm
If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements

Code optimization: 32-bit integers or 32-bit floats
2 * SIZE additions
2 * SIZE memory fetches
Left fetched on J-bus and done in X-compute
Right fetched on K-bus and done in Y-compute
SIZE / 2 cycles in theory
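The resource-bound estimate from "Steps to optimize" can be sketched in C as follows. The function names are hypothetical. The per-cycle capacities used in the usage note are assumptions read off the instruction slide (four add-type operations and two quad fetches per instruction line), not figures stated for this algorithm.

```c
#include <assert.h>

/* Cycles needed for `count` operations when `per_cycle` of them can be
   issued each cycle (ceiling division). */
static unsigned cycles_for(unsigned count, unsigned per_cycle)
{
    assert(per_cycle > 0);
    return (count + per_cycle - 1) / per_cycle;
}

/* Theoretical minimum for the loop: the largest per-resource cycle count. */
static unsigned min_cycles(unsigned adds, unsigned adds_per_cycle,
                           unsigned fetches, unsigned fetches_per_cycle)
{
    unsigned add_cycles = cycles_for(adds, adds_per_cycle);
    unsigned fetch_cycles = cycles_for(fetches, fetches_per_cycle);
    return add_cycles > fetch_cycles ? add_cycles : fetch_cycles;
}
```

For SIZE = 128 there are 256 additions and 256 fetches. Under the assumed capacities, min_cycles(256, 4, 256, 8) gives 64, which matches the SIZE / 2 figure above; the additions, not the fetches, set the bound.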

Step 1: Get the C++ code to work

Step 2: Rewrite in simplest format
Note naming convention
Single operation per line
Note other changes

Step 3: Unwrap (unroll) the loop
Again, note naming convention

Step 4: Overlap the first and second parts of loops
Note: The “C++” code goes no faster, but using this format for translating into parallel assembly code will
Step 1: 4 * N
Step 3: 8 * (N / 2) + 2
Step 4: 6 * (N / 2) + 2
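As a worked check of the three timing expressions above, the following C sketch evaluates them for a given N. The function names are hypothetical, the figures are presumably cycle counts, and N is assumed to be even.

```c
/* Timing estimates quoted for each step, for N loop elements (N assumed even). */
static unsigned step1_timing(unsigned n) { return 4 * n; }
static unsigned step3_timing(unsigned n) { return 8 * (n / 2) + 2; }
static unsigned step4_timing(unsigned n) { return 6 * (n / 2) + 2; }
```

For N = 128 these give 512, 514 and 386. Unrolling alone (Step 3) gains nothing; the saving appears once the two halves of the loop body are overlapped (Step 4).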

Step 5A: Rearrange “start-up” and ending code
“Software” pipeline
Move first read outside the loop
Need to add an “extra read” at the end of the loop
Timing: 2 + (N/2 - 1) * 6
Need to adjust loop start (Is it done correctly? Are we “one-out”?)
CAUTION: NEED TO FIX

Step 5B: Rearrange “start-up” and ending code
Can now run additional adds and memory fetches in parallel
Note: loop still in error

Exercise 1: Get the loop control correct
BUFFER_SIZE = 1
BUFFER_SIZE = 2
BUFFER_SIZE = 4
BUFFER_SIZE = 5
BUFFER_SIZE = 8
BUFFER_SIZE = 128
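One way to check loop control against the sizes listed above is to write the software-pipelined loop in C first and test it for each size. The sketch below is not the lecture's code: it assumes, purely for illustration, that the loop sums a buffer two elements per pass with the first read moved ahead of the loop. The name sum_pipelined is hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the lecture's loop: sum a buffer two elements
   per pass, with the first read moved ahead of the loop (software pipeline). */
static int sum_pipelined(const int *buf, size_t n)
{
    if (n == 0) {
        return 0;
    }
    int sum = 0;
    int next = buf[0];               /* start-up: first read outside the loop */
    size_t i = 1;                    /* index of the next unread element */

    /* Each pass uses the value read earlier plus one new value, then
       pre-reads for the following pass. The test i + 1 < n keeps that
       pre-read inside the buffer. */
    while (i + 1 < n) {
        int a = next;
        int b = buf[i];
        next = buf[i + 1];
        sum += a;
        sum += b;
        i += 2;
    }

    /* Wind-down: the pre-read value, plus one leftover element if any. */
    sum += next;
    if (i < n) {
        sum += buf[i];
    }
    return sum;
}
```

The loop condition and the wind-down code are where an off-by-one error would show up, so sizes 1, 2 and 5 are the useful test cases: they exercise the paths with no loop pass at all and with no leftover element.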


Unrecognized second key error
What is it?
How do you fix it?


Exercise 2: Rewrite the code when it is known that BUFFER_SIZE = 129
But the loop only handles 128 elements, since 129 / 2 = 128 / 2 in integer division
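The integer-division point can be illustrated with a short C sketch. It assumes, as an illustration only, that the loop sums a buffer in pairs; sum_known_size is a hypothetical name, and the handling of the leftover element is one possible fix, not necessarily the one intended in the exercise.

```c
#include <stddef.h>

#define BUFFER_SIZE 129   /* size taken from the exercise statement */

/* Hypothetical sum over a buffer whose size is known at compile time. */
static int sum_known_size(const int buf[BUFFER_SIZE])
{
    int sum = 0;

    /* BUFFER_SIZE / 2 is 64 for both 128 and 129, so the paired loop
       covers elements 0..127 only. */
    for (size_t pair = 0; pair < BUFFER_SIZE / 2; pair++) {
        sum += buf[2 * pair];
        sum += buf[2 * pair + 1];
    }

#if BUFFER_SIZE % 2 != 0
    /* Odd size: one element is left over and is handled after the loop. */
    sum += buf[BUFFER_SIZE - 1];
#endif

    return sum;
}
```

Because the size is a compile-time constant, the leftover-element code is selected by the preprocessor and costs nothing when BUFFER_SIZE is even.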


Code to this point is SISD parallel optimization
SISD: single instruction single data
Using X_compute block and J memory bus
Next stage: SIMD, single instruction multiple data
Using X_compute block and J memory bus for left
Using Y_compute block and K memory bus for right
Will need similar but different code when you are doing FIR in Lab. 3

Exercise 3: BUFFER_SIZE = 128
Rewrite so that X and Y ops are done together
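The idea of doing X and Y operations together can be modelled in plain C before writing the assembly. The sketch below is an assumption-laden model, not TigerSHARC code: it supposes the work is a sum over the left and right buffers, and XYPair and sum_both are hypothetical names.

```c
#include <stddef.h>

/* x models the X-compute block, y models the Y-compute block. */
typedef struct {
    float x;
    float y;
} XYPair;

/* Hypothetical model: process left and right channels in the same pass. */
static XYPair sum_both(const float *left, const float *right, size_t n)
{
    XYPair acc = {0.0f, 0.0f};
    for (size_t i = 0; i < n; i++) {
        float l = left[i];    /* models a fetch on the J bus */
        float r = right[i];   /* models a fetch on the K bus */
        acc.x += l;           /* these two adds model one xy-prefixed   */
        acc.y += r;           /* instruction acting on both blocks      */
    }
    return acc;
}
```

In the SISD version the left and right buffers would each need their own pass through the loop; here one pass handles both, which is the source of the speed-up once the two adds map onto a single instruction.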


Exercise 4: BUFFER_SIZE = 128
Rewrite so that no data dependency stalls are expected
Leave this one for a while until we have handled multiple memory accesses, as the answer may change

Tackled today
Exercise 1: Solving the loop problem, SIZE = 128
Exercise 2: Solving the loop problem, SIZE = 127
Exercise 3: Moving from SISD to SIMD mode, SIZE = 128
Incomplete
Exercise 4: Removing any expected stalls (left for later)