Published by Garey Harmon. Modified over 9 years ago.
1
Systematic development of programs with parallel instructions
SHARC ADSP21XXX processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
smithmr@ucalgary.ca
2
1/28/2016 ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061 Copyright smithmr@ucalgary.ca 2 / 33
To be tackled today
- What’s the problem?
- Standard code development of “C” code
- Process for “code with parallel instructions”
  - Rewrite with specialized resources
  - Move to “resource chart”
  - Unroll the loop
  - Adjust code
  - Reroll the loop
- Check if worth the effort
3
ADSP-21XXX -- Parallelism opportunities
- Ability for parallel memory operations, one each on the pm, dm and instruction cache busses
- Memory pointer operations
  - Post modify 2 index registers
  - Automatic circular buffer operations
  - Automatic bit reverse addressing
- Many parallel operations and register to register bus transfers
  - Rn = Rx + Ry or Rn = Rx * Ry
  - Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
- Zero overhead loops
- Instruction pipeline issues
- Key issue -- Only 48? bits available in OPCODE to describe
  - 16 data registers in 3 destinations and 6 sources = 135 bits
  - 2 * (8 index + 8 modify + 16 data) = 64 bits
  - Condition code selection, 32 bit constants etc.
4
Compiler is only somewhat useful
- See article in course notes from Embedded System Design, Sept./October 2000
- Need a systematic process to provide “parallelism without pain”
- Lab: library version of FFT, custom version of Burg algorithm (AR modeling)
5
Basic code development -- any system
- Write the “C” code for the function void Conjugate(float *re_pt, float *im_pt, int N)
  - Real and imaginary components are in different arrays
  - Performs: input = a + jb, output = a - jb
- Convert the code to ADSP 21XXX/68K etc. assembly code, following the standard coding and documentation practices
6
Parallel Instruction Code Development
- Write the 21k assembly code for the function void Conjugate(float *re_pt, float *im_pt, int N) which etc.
- Determine the instruction flow through the architecture using a resource usage diagram
- Theoretically optimize the code -- a 2 minute counting process
- Compare and contrast the amount of time to perform the subroutine before and after customization
7
Standard “C” code

    void Conjugate(float *re_pt, float *im_pt, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *im_pt = - *im_pt;
            im_pt++;
        }
    }

    void Conjugate_V2(float *re_pt, pm float *im_pt, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *im_pt = - *im_pt;
            im_pt++;
        }
    }
8
Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
- Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop
9
21XXX-style load/store “C” code

    void Conjugate(register float *in_pt, register float *out_pt,
                   register int N) {
        register int count;
        register float *pt = out_pt;
        register float scratch;
        for (count = 0; count < N; count++) {
            scratch = *pt;
            scratch = -scratch;
            *pt = scratch;
            pt++;
        }
    }
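The load/store rewrite should not change what the function computes. As a minimal host-side sketch (portable C, not SHARC code; the names `conjugate_direct` and `conjugate_load_store` are made up here for illustration), the two forms can be put side by side and compared on the same data:

```c
#include <assert.h>

/* Direct form, as on the "Standard C code" slide: negate in memory. */
void conjugate_direct(float *re_pt, float *im_pt, int n)
{
    (void)re_pt;                 /* real parts are left untouched */
    assert(n >= 0);
    for (int count = 0; count < n; count++) {
        *im_pt = -*im_pt;
        im_pt++;
    }
}

/* Load/store form: every memory access is an explicit load or store,
 * and the arithmetic only ever touches the local "register" scratch. */
void conjugate_load_store(float *in_pt, float *out_pt, int n)
{
    (void)in_pt;                 /* unused, as in the slide's version */
    assert(n >= 0);
    float *pt = out_pt;
    float scratch;
    for (int count = 0; count < n; count++) {
        scratch = *pt;           /* load                */
        scratch = -scratch;      /* operate on register */
        *pt = scratch;           /* store               */
        pt++;
    }
}
```

Splitting each statement into load, operate, store makes every line correspond to one row of the resource usage chart used later.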
10
Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop
11
Assembly code
- PROLOGUE
  - Appropriate defines to make easy reading of code
  - Saving of non-volatile registers
- BODY
  - Try to plan ahead for parallel operations
  - Know which 21k “multi-function” instructions are valid
- EPILOGUE
  - Recover non-volatile registers
12
Straight conversion -- PROLOGUE

    // void Conjugate( reg float *in, *out, reg int N ) {
    .segment/pm seg_pmco;
    .global _Conjugate;
    _Conjugate:
    // register int count = GARBAGE;
    #define countR1 scratchR1
    // register float *pt = out_pt;
    #define pt scratchDMpt
    pt = INPAR2;               // dead <- R8, can re-use
    #define scratchF8 F8       // float scratch = GARBAGE
    // For the CURRENT code -- no non-volatile registers are needed
13
Straight conversion of code

    // for (count = 0; count < N; count++) {
    LCNTR = INPAR3, DO LOOP_LAST UNTIL LCE;      // Dead <- INPAR3
        scratchF8 = dm(0, pt);                   // scratch = *pt;  Not ++ as pt re-used
        scratchF8 = -scratchF8;                  // scratch = -scratch
    LOOP_LAST: dm(pt, 1) = scratchF8;            // *pt = scratch; pt++;

5 magic lines of code
14
Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- (Attempt to) optimize using techniques
- Compare and contrast time -- setup and loop
15
Speed rules for memory access

    scratch = dm(0, pt);
    scratch = dm(pt, 0);       // Not ++ as to be re-used
    dm(pt, 1) = scratch;

- Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
- Must use modify registers to store these constants
- Several useful constants are placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)

    scratch = dm(pt, zeroDM);  // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;

- Can’t use PRE-MODIFY, PERIOD
- Can’t use POST-MODIFY OPERATIONS with CONSTANTS
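For readers less familiar with post-modify addressing, here is a small C model of the two register-based accesses above. It is only a sketch of the addressing behavior (use the address in the index register, then add the modify register to it), not SHARC code; the helpers `dm_load` and `dm_store` are invented for this illustration, and `zeroDM` and `plus1DM` are modeled as plain constants.

```c
#include <assert.h>
#include <stddef.h>

/* C stand-ins for the modify registers that hold the constants 0 and +1. */
static const ptrdiff_t zeroDM = 0;
static const ptrdiff_t plus1DM = 1;

/* Model of "value = dm(index, modify)": read at index, then index += modify. */
static float dm_load(float **index, ptrdiff_t modify)
{
    float value = **index;
    *index += modify;
    return value;
}

/* Model of "dm(index, modify) = value": write at index, then index += modify. */
static void dm_store(float **index, ptrdiff_t modify, float value)
{
    **index = value;
    *index += modify;
}

/* The conjugate loop written only with post-modify register accesses. */
void conjugate_post_modify(float *im_pt, int n)
{
    assert(n >= 0);
    float *pt = im_pt;
    for (int count = 0; count < n; count++) {
        float scratch = dm_load(&pt, zeroDM);   /* pt unchanged: re-used below */
        scratch = -scratch;
        dm_store(&pt, plus1DM, scratch);        /* pt advances by one element  */
    }
}
```

The load uses a zero modifier because the same address is needed again for the store; only the store advances the pointer.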
16
Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- (Attempt to) optimize using techniques
- Compare and contrast time -- setup and loop
17
Resource Management -- Chart 1 -- Basic code

               | MULTIPLIER | ADDER    | DM BUS     | PM BUS
    pt = INPAR2
    LCNTR = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
               |            |          | F8 = dm( ) |
               |            | F8 = -F8 |            |
    LOOP_LAST: |            |          | dm( ) = F8 |

- In theory -- if we could find out how -- the “-” and dm operations can run in parallel
- DATA-BUS (dm) is the limiting resource -- 2 cycle loop possible
- Before proceeding -- Is a 2 cycle loop needed? Is a 2 cycle loop enough?
18
Resource Management -- Chart 2 -- Basic code

               | MULTIPLIER | ADDER    | DM BUS     | PM BUS
    pt = INPAR2
    LCNTR = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
               |            |          | F8 = dm( ) |
               |            | F8 = -F8 |            |
    LOOP_LAST: |            |          |            | pm( ) = F8

- In theory -- if we could find out how -- the “-”, dm and pm operations can run in parallel
- 1 cycle loop possible
- MORE COMPLEX EXAMPLE -- MAY BE LESS OBVIOUS
- IS THIS SUFFICIENT? IF NOT, PROCEED NO FURTHER
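The counting behind the "2 cycle" and "1 cycle" claims in the two charts can be written down directly. The sketch below assumes each resource column (multiplier, adder, DM bus, PM bus) can carry at most one operation per cycle, so the busiest column gives a lower bound on the cycles per loop pass. The function name `min_loop_cycles` and the enum are illustrative, not part of the course material.

```c
#include <assert.h>

/* One entry per column of the resource usage chart. */
enum { MULTIPLIER, ADDER, DM_BUS, PM_BUS, RESOURCE_COUNT };

/* Lower bound on cycles per loop pass: the busiest resource sets the floor.
 * uses[r] is the number of operations the loop body places on resource r. */
int min_loop_cycles(const int uses[RESOURCE_COUNT])
{
    int bound = 0;
    for (int r = 0; r < RESOURCE_COUNT; r++) {
        assert(uses[r] >= 0);
        if (uses[r] > bound)
            bound = uses[r];
    }
    return bound;
}
```

For Chart 1 the counts are 0, 1, 2, 0 (two DM bus accesses), giving a bound of 2. For Chart 2 the store moves to the PM bus, the counts become 0, 1, 1, 1, and the bound drops to 1. This is only a bound: whether a schedule that reaches it exists still has to be worked out on the chart.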
19
Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
  - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
  - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
  - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize parallelism using techniques
  - Attempt to -- watch out for special situations where code will fail
- Compare and contrast time -- setup and loop
20
Unroll the loop

    ADDER    | DM BUS     | NOTE
             | F8 = dm( ) | First time into loop
    F8 = -F8 |            |
             | dm( ) = F8 |
             | F8 = dm( ) | 2nd time
    F8 = -F8 |            |
             | dm( ) = F8 |
             | F8 = dm( ) | 3rd time
    F8 = -F8 |            |
             | dm( ) = F8 |
21
Unrolled loop -- rewrite

    ADDER    | DM BUS     | NOTE
             | F4 = dm( ) | First time
    F4 = -F4 |            |
             | dm( ) = F4 |
             | F8 = dm( ) |
    F8 = -F8 |            |
             | dm( ) = F8 |
             | F4 = dm( ) |
    F4 = -F4 |            |
             | dm( ) = F4 |

- Don’t fight possible register conflicts
22
Unrolled loop -- Move into stalls

    ADDER    | DM BUS     | NOTE
             | F4 = dm( ) | First time
    F4 = -F4 |            |
             | dm( ) = F4 |
             | F8 = dm( ) |
    F8 = -F8 |            |
             | dm( ) = F8 |
             | F4 = dm( ) |
    F4 = -F4 |            |
             | dm( ) = F4 |
23
Unrolled more optimized loop

    ADDER    | DM BUS     | ITERATION
             | F4 = dm( ) | 1
    F4 = -F4 | F8 = dm( ) | 1 and 2
    F8 = -F8 | dm( ) = F4 | 1 and 2
             | dm( ) = F8 | 2
             | F4 = dm( ) | 3
    F4 = -F4 | F8 = dm( ) | 3 and 4
    F8 = -F8 | dm( ) = F4 | 3 and 4
             | dm( ) = F8 | 4
24
- Need 1 of the resources to be maxed out. Otherwise the algorithm is inefficient
- May have to try a lot of different approaches
25
Unrolled loop -- identifying repeat

    ADDER    | DM BUS     | NOTE
             | F4 = dm( ) | Repeating pattern
    F4 = -F4 | F8 = dm( ) |
    F8 = -F8 | dm( ) = F4 |
             | dm( ) = F8 |
             | F4 = dm( ) |
    F4 = -F4 | F8 = dm( ) |
    F8 = -F8 | dm( ) = F4 |
             | dm( ) = F8 |
26
Now to “reroll the loop”
- The loop is currently just straight line coded
- Must put it back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations
27
Re-rolled loop

    INPAR3 = PASS INPAR3;
    IF EQ JUMP ENDLOOP;
    INPAR3 = ASHIFT INPAR3 BY -1;
    LCNTR = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE;

               | ADDER    | DM BUS
               |          | F4 = dm( )
               | F4 = -F4 | F8 = dm( )
               | F8 = -F8 | dm( ) = F4
    LOOP_LAST: |          | dm( ) = F8
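A C-level model of the re-rolled loop can help check the schedule before committing it to assembly. This is a sketch in portable C, not the author's code: each group of statements corresponds to one row of the chart above, and the separate read and write pointers (`rd`, `wr`) are an assumption made here because the second load is issued before the first store, so the two accesses can no longer share one post-modified pointer. The chart itself leaves the `dm( )` addressing blank.

```c
#include <assert.h>

/* Model of the 4-row re-rolled loop: two elements per pass, N must be even. */
void conjugate_rerolled(float *im_pt, int n)
{
    assert(n >= 0 && n % 2 == 0);
    if (n == 0)                    /* INPAR3 = PASS INPAR3; IF EQ JUMP ENDLOOP */
        return;
    int passes = n >> 1;           /* INPAR3 = ASHIFT INPAR3 BY -1             */

    float *rd = im_pt;             /* assumed: pointer used by the loads  */
    float *wr = im_pt;             /* assumed: pointer used by the stores */
    float f4, f8;

    for (int pass = 0; pass < passes; pass++) {
        f4 = *rd++;                /* row 1: DM bus load                    */
        f8 = *rd++;  f4 = -f4;     /* row 2: DM bus load, adder negates F4  */
        *wr++ = f4;  f8 = -f8;     /* row 3: DM bus store, adder negates F8 */
        *wr++ = f8;                /* row 4: DM bus store (LOOP_LAST)       */
    }
}
```

Within rows 2 and 3 the two operations touch different registers, which is why they can share a cycle on the hardware; in the C model they simply run one after the other.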
28
ISSUES
- Before
  - Went N times around the loop
  - Loop was of size 3
  - Total count was 4 + N * 3 + 5 cycles
- Now
  - Went N/2 times around the loop
  - Loop was of size 4
  - Total count was 3 + N/2 * 4 + 5 cycles
- BUT N MUST BE KNOWN TO BE EVEN
  - N = 2K, where K = 0, 1, 2, 3, 4, 5, etc.
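To decide whether the effort was worth it, the two cycle counts above can be evaluated for the array sizes that matter. The sketch below simply encodes the two formulas from this slide; the function names are illustrative.

```c
#include <assert.h>

/* Original loop: 4 + N * 3 + 5 cycles. */
long cycles_before(long n)
{
    assert(n >= 0);
    return 4 + n * 3 + 5;
}

/* Unrolled, re-rolled loop: 3 + N/2 * 4 + 5 cycles, N even. */
long cycles_after(long n)
{
    assert(n >= 0 && n % 2 == 0);
    return 3 + (n / 2) * 4 + 5;
}
```

For N = 100 these give 309 and 208 cycles. Per element the loop cost drops from 3 cycles to 2, so for large N the speed-up approaches 1.5, while for very small N the fixed setup cost dominates and the gain is small.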
29
Your job
- Rewrite the code if N is known to be odd
  - N = 2K + 1, where K = 0, 1, 2, 3, 4, 5, 6, 7
- Rewrite the code if N could be either odd or even
- Rewrite the code if the imaginary part of the number could be in program memory
- Rewrite the code if you “can’t leave” the array in place -- meaning you must move both the real and imaginary parts while conjugating
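As a starting point for the "odd or even" case, one common approach (not necessarily the one intended in the course) is to peel off a single element when N is odd and then run the two-at-a-time loop on the even remainder. The C sketch below shows only the control structure; the name `conjugate_any_n` is made up here, and translating it to SHARC assembly with the parallel schedule is still the exercise.

```c
#include <assert.h>

/* Handles any N >= 0: peel one element if N is odd, then process pairs. */
void conjugate_any_n(float *im_pt, int n)
{
    assert(n >= 0);
    float *pt = im_pt;

    if (n % 2 != 0) {              /* odd N: handle one element up front */
        *pt = -*pt;
        pt++;
        n--;
    }

    for (int pass = 0; pass < n / 2; pass++) {   /* n is now even */
        float f4 = pt[0];          /* load first element of the pair  */
        float f8 = pt[1];          /* load second element of the pair */
        pt[0] = -f4;
        pt[1] = -f8;
        pt += 2;
    }
}
```

The peeled element adds a few cycles of setup, so the cycle-count comparison should be redone for this version.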
30
Tackled today
- What’s the problem?
- Standard code development of “C” code
- Process for “code with parallel instructions”
  - Rewrite with specialized resources
  - Move to “resource chart”
  - Unroll the loop
  - Adjust code
  - Reroll the loop
- Check if worth the effort
- To come -- Tutorial practice of parallel coding
- To come -- Optimum FIR filter with parallelism