Systematic development of programs with parallel instructions SHARC ADSP21XXX processor M. Smith, Electrical and Computer Engineering, University of Calgary,

Presentation transcript:

Systematic development of programs with parallel instructions
SHARC ADSP21XXX processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada ucalgary.ca

1/28/2016 ENCM Systematic development of parallel instructions on SHARC ADSP21061 Copyright 2 / 33
To be tackled today
  What’s the problem?
  Standard code development of “C” code
  Process for “code with parallel instructions”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort

ADSP-21XXX -- Parallelism opportunities
  Ability for parallel memory operations
    One each on the pm, dm and instruction cache busses
  Memory pointer operations
    Post modify 2 index registers
    Automatic circular buffer operations
    Automatic bit reverse addressing
  Many parallel operations and register to register bus transfers
    Rn = Rx + Ry or Rn = Rx * Ry
    Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
  Zero overhead loops
  Instruction pipeline issues
  Key issue -- Only 48? bits available in OPCODE to describe
    16 data registers in 3 destinations and 6 sources = 135 bits
    2 * (8 index + 8 modify + 16 data) = 64 bits
    Condition code selection, 32 bit constants etc.

Compiler is only -- somewhat useful
  See article in course notes from Embedded System Design Sept./October 2000
  Need to get a systematic process to provide parallelism without pain
  Lab. Library version of FFT, custom version of Burg Algorithm (AR modeling)

Basic code development -- any system
  Write the “C” code for the function
    void Conjugate(float *re_pt, float *im_pt, int N)
    Real and imaginary components in different arrays
    Performs input = a + jb, output = a - jb
  Convert the code to ADSP 21XXX/68K etc. assembly code, following the standard coding and documentation practices

Parallel Instruction Code Development
  Write the 21k assembly code for the function
    void Conjugate(float *re_pt, float *im_pt, int N) which etc...
  Determine the instruction flow through the architecture using a resource usage diagram
  Theoretically optimize the code -- a 2 minute counting process
  Compare and contrast the amount of time to perform the subroutine before and after customization.

Standard “C” code
  void Conjugate(float *re_pt, float *im_pt, int N) {
      int count;
      for (count = 0; count < N; count++) {
          *im_pt = - *im_pt;
          im_pt++;
      }
  }

  void Conjugate_V2(float *re_pt, pm float *im_pt, int N) {
      int count;
      for (count = 0; count < N; count++) {
          *im_pt = - *im_pt;
          im_pt++;
      }
  }

Process for developing parallel code
  Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
  Write the assembly code using a hardware loop
  Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Move algorithm to “Resource Usage Chart”
  Optimize using techniques
  Compare and contrast time -- setup and loop

21XXX-style load/store “C” code
  void Conjugate(register float *in_pt, register float *out_pt, register int N) {
      register int count;
      register float *pt = out_pt;
      register float scratch;
      for (count = 0; count < N; count++) {
          scratch = *pt;
          scratch = -scratch;
          *pt = scratch;
          pt++;
      }
  }
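The load/store rewrite should not change what the function computes. A minimal portable sketch of that check follows; the names conjugate_direct and conjugate_load_store are hypothetical, the unused pointer parameter is dropped for brevity, and the register keyword is omitted because a desktop compiler ignores it.

```c
#include <assert.h>

/* Direct form: read, negate and write back in one C statement. */
void conjugate_direct(float *im_pt, int n)
{
    int count;

    assert(n >= 0);               /* assumption: negative length is a caller bug */
    for (count = 0; count < n; count++) {
        *im_pt = -*im_pt;
        im_pt++;
    }
}

/* Load/store form: one memory read, one register operation and one
 * memory write per element, mirroring the instructions a load/store
 * architecture would issue. */
void conjugate_load_store(float *out_pt, int n)
{
    float *pt = out_pt;
    float scratch;
    int count;

    assert(n >= 0);
    for (count = 0; count < n; count++) {
        scratch = *pt;            /* load                  */
        scratch = -scratch;       /* compute in a register */
        *pt = scratch;            /* store                 */
        pt++;
    }
}
```

Running both functions on copies of the same array and comparing the results element by element is a quick way to confirm the rewrite before moving on to assembly.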

Process for developing parallel code
  Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
  Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
  Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Move algorithm to “Resource Usage Chart”
  Optimize using techniques
  Compare and contrast time -- setup and loop

Assembly code
  PROLOGUE
    Appropriate defines to make easy reading of code
    Saving of non-volatile registers
  BODY
    Try to plan ahead for parallel operations
    Know which 21k “multi-function” instructions are valid
  EPILOGUE
    Recover non-volatile registers

Straight conversion -- PROLOGUE
  // void Conjugate( reg float *in, *out, reg int N ) {
  .segment/pm seg_pmco;
  .global _Conjugate;
  _Conjugate:
  // register int count = GARBAGE;
  #define countR1 scratchR1
  // register float *pt = out_pt;
  #define pt scratchDMpt
      pt = INPAR2;              // dead <- R8, can re-use
  #define scratchF8 F8          // float scratch = GARBAGE
  // For the CURRENT code -- no volatile registers are needed

Straight conversion of code
  // for (count = 0; count < N; count++) {
  LCNTR = INPAR3, DO LOOP_LAST UNTIL LCE;   // Dead <- INPAR3
      scratchF8 = dm(0, pt);                // scratch = *pt;  Not ++ as pt re-used
      scratchF8 = -scratchF8;               // scratch = -scratch
  LOOP_LAST:
      dm(pt, 1) = scratchF8;                // *pt = scratch; pt++;
  5 magic lines of code

Process for developing parallel code
  Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
  Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
  Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
  Move algorithm to “Resource Usage Chart”
  Optimize using techniques (attempt to)
  Compare and contrast time -- setup and loop

Speed rules for memory access
  scratch = dm(0, pt);
  scratch = dm(pt, 0);          // Not ++ as to be re-used
  dm(pt, 1) = scratch;
  Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
  Must use Modify registers to store these constants.
  Several useful constants placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)
  scratch = dm(pt, zeroDM);     // Not ++ as to be re-used
  dm(pt, plus1DM) = scratch;
  Can’t use PREMODIFY PERIOD
  Can’t use POST MODIFY OPERATIONS with CONSTANTS
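The post-modify accesses with modify registers can be modelled in portable C. This is only a sketch of the addressing behaviour under stated assumptions: the Dag struct and the functions dm_load_postmod, dm_store_postmod and conjugate_postmod are hypothetical names, and zeroDM and plus1DM are modelled as plain constants rather than real modify registers.

```c
#include <assert.h>

/* Hypothetical model of one index register used with a modify value.
 * Illustrative only, not part of any SHARC toolchain. */
typedef struct {
    float *index;                 /* models the index register, e.g. pt */
} Dag;

/* Models  reg = dm(pt, M):  read at the current address, then add M. */
static float dm_load_postmod(Dag *dag, int modify)
{
    float value = *dag->index;
    dag->index += modify;
    return value;
}

/* Models  dm(pt, M) = reg:  write at the current address, then add M. */
static void dm_store_postmod(Dag *dag, int modify, float value)
{
    *dag->index = value;
    dag->index += modify;
}

/* Loop body expressed with the model: load with a zero modifier so the
 * pointer can be re-used, negate, then store with a +1 modifier. */
void conjugate_postmod(float *im_pt, int n)
{
    const int zeroDM = 0;         /* models the modify register holding 0 */
    const int plus1DM = 1;        /* models the modify register holding 1 */
    Dag pt = { im_pt };
    int count;

    assert(n >= 0);
    for (count = 0; count < n; count++) {
        float scratch = dm_load_postmod(&pt, zeroDM);
        scratch = -scratch;
        dm_store_postmod(&pt, plus1DM, scratch);
    }
}
```

The point of the model is that the pointer only advances on the store, so a single index register serves both the read and the write of each element.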

Process for developing parallel code
  Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
  Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
  Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
  Move algorithm to “Resource Usage Chart”
  Optimize using techniques (attempt to)
  Compare and contrast time -- setup and loop

Resource Management -- Chart 1 -- Basic code
  In theory -- if we could find out how -- the negate (-) and dm operations go in parallel
  DATA-BUS (dm) is the limiting resource -- 2 cycle loop possible
  Before proceeding -- Is a 2 cycle loop needed? Is a 2 cycle loop enough?

  MULTIPLIER   ADDER        DM BUS        PM BUS
  Pt = INPAR2
  Lcnt = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
                            F8 = dm( )
               F8 = -F8
  LOOP_LAST:                dm( ) = F8

Resource Management -- Chart 2 -- Basic code
  In theory -- if we could find out how -- the negate (-), dm and pm operations go in parallel
  1 cycle loop possible
  MORE COMPLEX EXAMPLE -- MAY BE LESS OBVIOUS
  IS THIS SUFFICIENT -- IF NOT, PROCEED NO FURTHER

  MULTIPLIER   ADDER        DM BUS        PM BUS
  Pt = INPAR2
  Lcnt = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
                            F8 = dm( )
               F8 = -F8
  LOOP_LAST:                              pm( ) = F8
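The counting step behind the charts can be sketched in C: tally how often one pass through the loop uses each column, and the busiest column gives the theoretical minimum number of cycles per pass. The ResourceUse struct and min_cycles_per_pass are hypothetical names, and the sketch assumes each resource can be used at most once per cycle and ignores data dependencies and pipeline effects.

```c
#include <assert.h>

/* Hypothetical tally of how many times one pass through the loop body
 * uses each resource column of the chart. */
typedef struct {
    int multiplier;
    int adder;
    int dm_bus;
    int pm_bus;
} ResourceUse;

/* Lower bound on cycles per pass: the busiest column sets the limit.
 * Assumes one use of each resource per cycle; dependencies between
 * instructions may still force more cycles than this bound. */
int min_cycles_per_pass(ResourceUse use)
{
    int bound;

    assert(use.multiplier >= 0 && use.adder >= 0);
    assert(use.dm_bus >= 0 && use.pm_bus >= 0);

    bound = use.multiplier;
    if (use.adder > bound)  bound = use.adder;
    if (use.dm_bus > bound) bound = use.dm_bus;
    if (use.pm_bus > bound) bound = use.pm_bus;
    return bound;
}
```

With the basic code (one adder operation, a dm load and a dm store) the bound is 2 cycles; moving the store to the pm bus leaves one use per column and a bound of 1 cycle, matching the two charts.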

Process for developing parallel code
  Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
  Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
  Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
  Move algorithm to “Resource Usage Chart”
  Optimize parallelism using techniques
    Attempt to -- watch out for special situations where code will fail
  Compare and contrast time -- setup and loop

Unroll the loop

  ADDER        DM BUS
               F8 = dm( )       First time into loop
  F8 = -F8
               dm( ) = F8
               F8 = dm( )       2nd time
  F8 = -F8
               dm( ) = F8
               F8 = dm( )       3rd time
  F8 = -F8
               dm( ) = F8

Unrolled loop -- rewrite
  Don’t fight possible register conflicts

  ADDER        DM BUS
               F4 = dm( )       First time
  F4 = -F4
               dm( ) = F4
               F8 = dm( )
  F8 = -F8
               dm( ) = F8
               F4 = dm( )
  F4 = -F4
               dm( ) = F4

Unrolled loop -- Move into stalls

  ADDER        DM BUS
               F4 = dm( )       First time
  F4 = -F4
               dm( ) = F4
               F8 = dm( )
  F8 = -F8
               dm( ) = F8
               F4 = dm( )
  F4 = -F4
               dm( ) = F4

Unrolled more optimized loop

  ADDER        DM BUS
               F4 = dm( )       1
  F4 = -F4     F8 = dm( )       1 and 2
  F8 = -F8     dm( ) = F4       1 and 2
               dm( ) = F8       2
               F4 = dm( )       3
  F4 = -F4     F8 = dm( )       3 and 4
  F8 = -F8     dm( ) = F4       3 and 4
               dm( ) = F8       4

Need 1 of the resources to be maxed out. Otherwise algorithm is inefficient
May have to try a lot of different approaches

Unrolled loop -- identifying repeat

  ADDER        DM BUS
               F4 = dm( )       Repeating pattern
  F4 = -F4     F8 = dm( )
  F8 = -F8     dm( ) = F4
               dm( ) = F8
               F4 = dm( )
  F4 = -F4     F8 = dm( )
  F8 = -F8     dm( ) = F4
               dm( ) = F8

Now to “reroll the loop”
  The loop is currently just straight line coded.
  Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations.

Re-rolled loop

  ADDER                            DM BUS
  INPAR3 = PASS INPAR3
  IF EQ JUMP ENDLOOP;
  INPAR3 = ASHIFT INPAR3 BY -1
  LCNT = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
                                   F4 = dm( )
  F4 = -F4                         F8 = dm( )
  F8 = -F8                         dm( ) = F4
  LOOP_LAST:                       dm( ) = F8
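The re-rolled schedule can be mirrored in portable C to check that it still conjugates every element. This is a sketch only: conjugate_two_per_pass is a hypothetical name, and the use of separate load and store pointers is an assumption, since the chart does not show the addressing inside dm( ).

```c
#include <assert.h>

/* C model of the re-rolled loop: two elements per pass, four "cycles"
 * per pass. Each source line below stands for one row of the chart. */
void conjugate_two_per_pass(float *im_pt, int n)
{
    float *load_pt = im_pt;       /* assumed: separate pointer for loads  */
    float *store_pt = im_pt;      /* assumed: separate pointer for stores */
    int passes;

    assert(n >= 0 && n % 2 == 0); /* the re-rolled loop needs an even N   */

    /* n >> 1 models ASHIFT BY -1; a zero count skips the loop,
     * like IF EQ JUMP ENDLOOP. */
    for (passes = n >> 1; passes > 0; passes--) {
        float f4, f8;

        f4 = *load_pt++;                  /* cycle 1: F4 = dm( )               */
        f4 = -f4;  f8 = *load_pt++;       /* cycle 2: F4 = -F4 with F8 = dm( ) */
        f8 = -f8;  *store_pt++ = f4;      /* cycle 3: F8 = -F8 with dm( ) = F4 */
        *store_pt++ = f8;                 /* cycle 4: dm( ) = F8               */
    }
}
```

The second load happens before the first store, but the two accesses touch different elements, so working in place is still safe in this model.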

ISSUES
  Before
    Went N times around the loop
    Loop was of size 3
    Total count was 4 + N * 3 + 5 cycles
  NOW
    Went N/2 times around the loop
    Loop was of size 4
    Total count was 3 + N/2 * cycles
  BUT N MUST BE KNOWN TO BE EVEN
    N = 2K, where K = 0, 1, 2, 3, 4, 5, etc.
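The loop-body part of the comparison follows directly from the loop sizes and trip counts. A minimal sketch, with hypothetical function names and with the fixed setup and exit overhead deliberately left out:

```c
#include <assert.h>

/* Cycles spent inside the original loop: N passes of 3 instructions. */
long loop_cycles_basic(long n)
{
    assert(n >= 0);
    return 3 * n;
}

/* Cycles spent inside the re-rolled loop: N/2 passes of 4 instructions.
 * Only valid when N is even. */
long loop_cycles_rerolled(long n)
{
    assert(n >= 0 && n % 2 == 0);
    return 4 * (n / 2);
}
```

For N = 100 this gives 300 cycles inside the original loop against 200 inside the re-rolled one, so for large N the loop body approaches two thirds of its original time.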

Your job
  Rewrite the code if N is known to be odd
    N = 2K + 1, where K = 0, 1, 2, 3, 4, 5, 6, 7
  Rewrite the code if N could be either odd or even
  Rewrite the code if the imaginary part of the number could be in program memory
  Rewrite the code if you “can’t leave” the array in place -- meaning you must move both the real and imaginary parts while conjugating

Tackled today
  What’s the problem?
  Standard code development of “C” code
  Process for “code with parallel instructions”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort
  To come -- Tutorial practice of parallel coding
  To come -- Optimum FIR filter with parallelism