Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary,

Systematic development of programs with parallel instructions
SHARC ADSP2106X processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada ucalgary.ca

6/2/2015 ENCM Systematic development of parallel instructions on SHARC ADSP21061 Copyright 2 / two days
To be tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort

ADSP-2106x -- Parallelism opportunities
Ability for parallel memory operation
    One each on pm, dm and instruction cache busses
Memory pointer operations
    Post modify 2 index registers
    Automatic circular buffer operations
    Automatic bit reverse addressing
Many parallel operations and register to register bus transfers
    Rn = Rx + Ry or Rn = Rx * Ry
    Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
Zero overhead loops
Instruction pipeline issues
Key issue -- Only 48? bits available in OPCODE to describe
    16 data registers in 3 destinations and 6 sources = 135 bits
    2 * (8 index + 8 modify + 16 data) = 64 bits
    Condition code selection, 32 bit constants etc.

Compiler is only -- somewhat useful
See article in course notes from Embedded System Design Sept./October 2000
Need to get a systematic process to provide Parallelism without pain
Need to know what to worry about and what not to
Lab 3 -- Implement FIR filter in Parallel -- Help provided
Lab. Library version of FFT, custom version of Burg Algorithm (AR modeling)

Basic code development -- any system
Write the “C” code for the function
    void Convert(float *temperature, int N)
which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)
Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices

Parallel Instruction Code Development
Write the 21k assembly code for the function
    void Convert(float *temperature, int N)
which etc…...
Determine the instruction flow through the architecture using a resource usage diagram
Theoretically optimize the code -- a 2 minute counting process
Compare and contrast the amount of time to perform the subroutine before and after customization.

Standard “C” code
    void Convert(float *temperature, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }
Standard Warning -- What does the optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
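The warning about 9 / 5 can be checked with a small experiment. The sketch below is illustrative only: the function names `convert_int_ratio` and `convert_float_ratio` are hypothetical and do not come from the slides. In C, `x * 9 / 5` with a float `x` is evaluated left to right in floating point, but a parenthesised `(9 / 5)` is an integer division and yields 1.

```c
/* Hypothetical helper: the ratio is written as (9 / 5), so it is
   evaluated with integer arithmetic and the scale factor becomes 1. */
void convert_int_ratio(float *temperature, int N) {
    for (int count = 0; count < N; count++) {
        temperature[count] = temperature[count] * (9 / 5) + 32;
    }
}

/* Hypothetical helper: float constants keep the intended factor of 1.8. */
void convert_float_ratio(float *temperature, int N) {
    for (int count = 0; count < N; count++) {
        temperature[count] = temperature[count] * (9.0f / 5.0f) + 32.0f;
    }
}
```

Feeding 100 degrees Celsius to the first version gives 132, while the second gives a value close to 212.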

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

Load/store style “C” code
    void Convert(register float *temperature, register int N) {
        register int count;
        register float *pt = temperature;
        register float scratch;
        for (count = 0; count < N; count++) {
            scratch = *pt;
            scratch = scratch * (9.0f / 5.0f);  // float constants: integer 9 / 5 would give 1
            scratch = scratch + 32;
            *pt = scratch;
            pt++;
        }
    }

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

Assembly code
PROLOGUE
    Appropriate defines to make easy reading of code
    Saving of non-volatile registers
BODY
    Try to plan ahead for parallel operations
    Know which 21k “multi-function” instructions are valid
EPILOGUE
    Recover non-volatile registers

Straight conversion -- PROLOGUE
    // void Convert( reg float *temperature, reg int N ) {
    .segment/pm seg_pmco;
    .global _Convert;
    _Convert:
    // register int count = GARBAGE;
    #define countR1 scratchR1
    // register float *pt = temperature;
    #define pt scratchDMpt
    pt = INPAR1;
    // float scratch = GARBAGE;
    #define scratchF2 F2
    // For the CURRENT code -- no volatile
    // registers are needed -- may not remain true

Straight conversion of code
    // for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        // scratch = *pt;
        scratchF2 = dm(0, pt);               // Not ++ as pt re-used
        // scratch = scratch * (9 / 5);
        // INPAR1 (R4) is dead -- can reuse as F4
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;                    // No division, Use register constant
        scratchF2 = scratchF2 * constantF4;
        // scratch = scratch + 32;
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        // *pt = scratch; pt++;
        dm(pt, 1) = scratchF2;
    LOOP_END:
5 magic lines of code
// NOT F0 = 32 gives F0 = 1 *

Avoid this error
    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
        scratchF2 = dm(0, pt);
        scratchF2 = scratchF2 * constantF4;
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
    LOOP_END: dm(pt, 1) = scratchF2;         // INTENDED LAST LINE OF LOOP

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
        scratchF2 = dm(0, pt);
        scratchF2 = scratchF2 * constantF4;
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, 1) = scratchF2;
    LOOP_END: Rest of the code               // STILL LAST LINE OF LOOP
First line of “rest of code” has now become part of loop

Process to avoid the error
This particular error is going to be very easy to make, as the “Rest of the code” is going to look very similar to the “loop internals” once we have taken account of the ALU/FPU pipeline to maximize parallelism
SUGGESTED APPROACH TO AVOID THIS TIME WASTING ERROR
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(0, pt);
        scratchF2 = scratchF2 * constantF4;
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, 1) = scratchF2;
    LOOP_END: Rest of the code
This was a process adopted from the compiler output -- the concept of a label was beyond most people in ENCM415

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach.
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

Speed rules for memory access
    scratch = dm(0, pt);
    scratch = dm(pt, 0);         // Not ++ as to be re-used
    dm(pt, 1) = scratch;
Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
Must use Modify registers to store these constants.
Several useful constants placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)
    scratch = dm(pt, zeroDM);    // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;
Can’t use PREMODIFY PERIOD
Can’t use POST MODIFY OPERATIONS with CONSTANTS
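The post-modify accesses above can be modelled in C to make the role of the modify registers explicit. This is a sketch under stated assumptions, not SHARC code: the `Dag` type and the helpers `load_post_modify` and `store_post_modify` are hypothetical names, and `zeroDM` and `plus1DM` are modelled as plain integers holding 0 and 1.

```c
/* Model of an index register such as pt (illustrative only). */
typedef struct {
    float *index;
} Dag;

/* Models scratch = dm(pt, modify): use the address, then add the modifier. */
static float load_post_modify(Dag *dag, int modify) {
    float value = *dag->index;
    dag->index += modify;
    return value;
}

/* Models dm(pt, modify) = scratch: store, then add the modifier. */
static void store_post_modify(Dag *dag, int modify, float value) {
    *dag->index = value;
    dag->index += modify;
}

void convert_post_modify(float *temperature, int N) {
    const int zeroDM = 0;   /* stands in for the modify register holding 0 */
    const int plus1DM = 1;  /* stands in for the modify register holding 1 */
    Dag pt = { temperature };
    for (int count = 0; count < N; count++) {
        /* Read without advancing, because the same address is written below. */
        float scratch = load_post_modify(&pt, zeroDM);
        scratch = scratch * 1.8f + 32.0f;
        /* Write, then step to the next element. */
        store_post_modify(&pt, plus1DM, scratch);
    }
}
```

The point of the model is that the step size always comes from a named value rather than a literal inside the access, which mirrors the rule that constants cannot be used as modifiers.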

Speed rules
IF you want adds and multiplies to occur on the same line
    F1 = F2 * F3, F4 = F5 + F6;
Want to do as a single instruction
Not enough bits in the opcode
    Register description (bits)
    Plus bits for describing math operations, conditions and memory ops?
    Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
    Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
Must rearrange register usage with program code for this to be possible
    Register description (bits) -- other bits “understood”
Inconvenient rather than limiting
e.g.
    F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
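The register-group rule can be written down as a small checker, which is handy when planning register assignments by hand. This is a sketch only: `valid_multifunction` is a hypothetical helper, registers are passed as plain numbers 0 to 15, and it assumes the source operands must appear in the order shown in the rule, that is F(0-3) * F(4-7) and F(8-11) + F(12-15).

```c
#include <stdbool.h>

/* Returns true when the multiply sources and the add sources come from
   the register groups required for a combined multiply/add instruction.
   mul_x, mul_y: first and second source of the multiply.
   alu_x, alu_y: first and second source of the add (or add/subtract). */
bool valid_multifunction(int mul_x, int mul_y, int alu_x, int alu_y) {
    bool mul_ok = (mul_x >= 0 && mul_x <= 3) && (mul_y >= 4 && mul_y <= 7);
    bool alu_ok = (alu_x >= 8 && alu_x <= 11) && (alu_y >= 12 && alu_y <= 15);
    return mul_ok && alu_ok;
}
```

For example, `valid_multifunction(0, 4, 8, 12)` describes `F0 * F4` together with `F8 + F12` and passes, while `valid_multifunction(2, 4, 2, 0)` describes an add of F2 and F0 and fails.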

When should we worry about the register assignment?
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, 0);               // Not ++ as to be re-used
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, 1) = scratchF2;
    LOOP_END:

Check on required register use
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);          // Are there special requirements here on F2 -- becomes source later??
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;  // Fn = F(0,1,2 or 3) * F(4,5,6 or 7),
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;       // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Register re-assignment -- Step 1
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2                     // OKAY
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float -- OKAY
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;  // SOURCES okay here: Fn = F(0,1,2 or 3) * F(4,5,6 or 7),
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;                        // WRONG to use F0 here -- ADDITION
        scratchF2 = scratchF2 + F0_32;       // WRONG to use F2 as DEST early: Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;         // OKAY

Register re-assignment -- Step 2
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF8 = scratchF2 * constantF4;  // answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                       // INPAR3 is available
        F12_32 = 32.0;
        scratchF2 = scratchF8 + F12_32;      // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Fix poor coding practice -- “C” or assembly
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;                    // MOVE OUTSIDE LOOP
        scratchF8 = scratchF2 * constantF4;  // answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                       // INPAR3 is available
        F12_32 = 32.0;                       // MOVE OUTSIDE LOOP
        scratchF2 = scratchF8 + F12_32;      // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;
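The same fix reads naturally in “C”: the two constants are set once before the loop, leaving four operations per pass (read, multiply, add, write). A minimal sketch, with the hypothetical name `convert_hoisted` and local variables named after the registers they stand in for:

```c
void convert_hoisted(float *temperature, int N) {
    /* Loop-invariant values, loaded once before the loop starts. */
    const float constantF4 = 1.8f;
    const float F12_32 = 32.0f;
    float *pt = temperature;

    for (int count = 0; count < N; count++) {
        float scratchF2 = *pt;                      /* read     */
        float scratchF8 = scratchF2 * constantF4;   /* multiply */
        scratchF2 = scratchF8 + F12_32;             /* add      */
        *pt++ = scratchF2;                          /* write, then step */
    }
}
```

An optimizing C compiler will usually hoist such constants by itself; in hand-written assembly the move has to be made explicitly.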

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

Resource Management -- Chart 1 -- Basic code
[Resource usage chart: basic loop, LOOP_END - 1 UNTIL LCE]
In theory -- if we could find out how to do *, + and dm in parallel
    DATA-BUS (dm) is limiting resource
    2 cycle loop possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
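The counting behind the 2 cycle figure can be sketched as follows. It assumes each resource (dm bus, multiplier, adder) can start at most one operation per cycle, so the busiest resource sets a lower bound on the cycles per pass. The enum and the function name `min_cycles_per_pass` are illustrative, not part of the slides.

```c
enum { DM_BUS, MULTIPLIER, ADDER, RESOURCE_COUNT };

/* uses[r] is how many times resource r is needed in one pass of the loop.
   With one operation per resource per cycle, the pass cannot take fewer
   cycles than the largest entry. */
int min_cycles_per_pass(const int uses[RESOURCE_COUNT]) {
    int bound = 0;
    for (int r = 0; r < RESOURCE_COUNT; r++) {
        if (uses[r] > bound) {
            bound = uses[r];
        }
    }
    return bound;
}
```

For the temperature loop each pass needs the dm bus twice (one read, one write), the multiplier once and the adder once, so the bound is 2 cycles per pass.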

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
    Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop

Un-roll the loop
For various methods on “unrolling the loop” see papers by Jeanne Anne Booth
Final Exam question -- What are relative advantages of the various techniques (with examples)?

Resource 2 -- unroll the loop -- 5 times here
Each pass through the loop involves
    Read
    Multiply
    Add
    Write

Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction
Model used -- depends on where operands are relative to equals sign
    ‘Reading’ -- fetching things for ALU/FPU -- Like 68K decode phase
    ‘Writeback’ -- storing results from ALU/FPU
THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’

Resource Management 4
Check what can be moved in parallel with other instructions
[Resource usage chart annotations]
    OKAY TO MOVE -- F2 src freed up before F2 dest occurs
    OKAY TO MOVE -- Empty spot if can move * and + instructs which this instruction MUST follow
    NO !!! or just possible NO? Why a problem? F2 =

Memory resource availability
Move up F2 = dm(pt, zeroDM) from second loop into first loop
However now we have a possible conflict about which F2 should be used for the dm(pt, plus1DM) = F2 instruction if we further optimize by trying to fill the other empty delay slots -- see next slide

Resource management
Overlapping two parts of the loop

Resource Management 5 -- What’s up, Doc?
Attempting to fill all unused resource availability
Why spend time on simulating algorithm to see if problem really exists when there is a simple solution -- use different registers
Problem may/may not exist with this simple example but very likely to exist in more complex algorithm

Resource 6 -- Solution -- Save and then use F9

Resource Management 7 -- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop comps.
Problem 1 -- No resource in maximum usage -- code inefficient
Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size 2048 or more?

WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD as we are no longer indexing correctly through the data.

Need 1 resource to be maxed out
Otherwise algorithm is inefficient
Have to try a lot of different approaches
Here is my code

Resource Management 8
Unroll the loop a bit more -- 9 loop components
DM BUS USAGE NOW MAXed OUT (after a while)
CODE PATTERN APPEARING

Now to “reroll the loop”
The loop is currently just straight line coded.
Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations.
Three components of “rerolled loop” for loop of form “count = 0, count < N”
    Fill the ALU/FPU pipeline (typically 1 stage from loop)
    Overlap N - 2 stages
    Empty the ALU/FPU pipeline (typically 1 stage)
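The fill / overlap / empty structure can be illustrated in “C” before committing to assembly. The sketch below is one possible arrangement, not the code from the slides: `convert_rerolled` is a hypothetical name, the local variables `f2`, `f8` and `f9` stand in for the registers of the same names, and the loop body handles one element per pass rather than the multi-fold unrolling used later.

```c
void convert_rerolled(float *temperature, int N) {
    const float constant = 1.8f;
    const float offset32 = 32.0f;

    if (N < 2) {                      /* too short to overlap anything */
        if (N == 1) {
            temperature[0] = temperature[0] * constant + offset32;
        }
        return;
    }

    /* FILL: start work on elements 0 and 1 */
    float f2 = temperature[0];        /* read element 0     */
    float f8 = f2 * constant;         /* multiply element 0 */
    f2 = temperature[1];              /* read element 1     */

    /* BODY: N - 2 overlapped passes; each line works on a different element */
    for (int i = 0; i < N - 2; i++) {
        float f9 = f8 + offset32;     /* add for element i          */
        f8 = f2 * constant;           /* multiply for element i + 1 */
        f2 = temperature[i + 2];      /* read element i + 2         */
        temperature[i] = f9;          /* write element i            */
    }

    /* EMPTY: finish the two elements still in flight */
    temperature[N - 2] = f8 + offset32;
    temperature[N - 1] = f2 * constant + offset32;
}
```

Because the add result goes to `f9` instead of back into `f2`, the read for a later element cannot overwrite a value that is still waiting to be written, which is the reason for moving to a separate register.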

Resource Management 9
Identify the loop components
[Resource usage chart regions: FILL ALU/FPU PIPE, LOOP BODY, EMPTY ALU/FPU PIPELINE]

Resource 9 -- Final code version
[Resource usage chart: FILL, USE and EMPTY ALU/FPU PIPE sections, with LOOP_END - 1 UNTIL LCE]

Speed improvements
BEFORE
    START LOOP EXIT ENTRY
    4 + N* = * N
NOW with 2-fold loop unfolding
    START LOOP EXIT ENTRY
    (N - 2) * 5 / = * N
NOW with 3-fold loop unfolding
    START LOOP EXIT ENTRY
    (N - 2) * 6 / = * N
Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
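The quoted factors can be reproduced with a rough count of loop-body cycles. The sketch below makes two assumptions that are not stated in full on the slide: the basic loop costs 4 cycles per point, and an unfolded loop spends 5 cycles per 2 points or 6 cycles per 3 points over the N - 2 overlapped points. Entry, fill, empty and exit costs are left out. The function names are hypothetical.

```c
/* Loop-body cycles for the basic loop: read, multiply, add, write. */
double body_cycles_basic(int N) {
    return 4.0 * N;
}

/* Loop-body cycles for an unfolded loop that handles `fold` points in
   `cycles_per_pass` cycles, applied to the N - 2 overlapped points. */
double body_cycles_unfolded(int N, int cycles_per_pass, int fold) {
    return (double)(N - 2) * cycles_per_pass / fold;
}
```

For N = 2048 these give 8192 cycles for the basic loop, 5115 for the 2-fold version and 4092 for the 3-fold version, in line with the per-point costs of 4, 2.5 and 2 cycles.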

Question to Ask
We now know the final code
Should we have made the substitution F2 to F9?
    Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
    No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
Will the code work?

Resource 9 -- Final code version
[Resource usage chart: final rerolled loop, LOOP_END - 1 UNTIL LCE]
N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
Only works if (N - 2) / 3 is an integer.
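The restriction on N can be tested before the rerolled routine is called. A minimal sketch, with hypothetical helper names, that assumes the 3-fold loop needs at least the 2 points handled by the fill and empty code:

```c
#include <stdbool.h>

/* True when N points fit the 3-fold rerolled loop exactly:
   2 points for fill/empty plus a whole number of 3-point passes. */
bool fits_three_fold_loop(int N) {
    return N >= 2 && (N - 2) % 3 == 0;
}

/* Number of points left over that would need separate handling.
   Returns N itself when N is too small for the pipelined version. */
int leftover_points(int N) {
    if (N < 2) {
        return N;
    }
    return (N - 2) % 3;
}
```

Of the values 0 to 14, only N = 2, 5, 8, 11 and 14 pass the check; any other length needs either padding or a short clean-up loop for the leftover points.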

Tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort
To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism