Presentation transcript:

Learning about code optimizations on CISC, RISC and DSP processors
Using optimization information from SHARC “C” compiler
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
smithmr@ucalgary.ca

To be tackled today
- The “C/C++” compiler knows how to generate assembler code
- What can we learn from the compiler as tutor?
- “C/C++” routines can use many parameters
  - How did the Wind River DiabData 68K compiler do it?
  - How does the VisualDSP SHARC 21K CROSS_CODE compiler do it?
- Process to generate assembler from “C” (general)
9/13/2019 ENCM515 -- Learning from the “C” compiler Copyright smithmr@ucalgary.ca

Example code with 5 parameters to pass
- Useful concept in many DSP algorithms
- Calculate A-T B A where A and B are matrices and A-T is complex conjugate transpose

    #include <math.h>

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number) {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;   // conjugate: negate the imaginary part
            out_re++; out_im++;
            in_re++; in_im++;
        }
        // NEW -- returning a parameter
        return(count);           // Actually returning (float) count
    }
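The point of the `return(count)` line is the implicit int-to-float conversion, which is what later shows up as a library call on the 68K and as `F0=float ...` on the SHARC. A minimal sketch of the same routine in array-index form, assuming the imaginary part is negated as the later slides show; the name `make_conjugate` is hypothetical and not part of the course code:

```c
/* Hypothetical restatement of the slide's routine (illustration only).
 * Copies the real parts, negates the imaginary parts, and returns the
 * element count converted to float. */
float make_conjugate(const float *in_re, const float *in_im,
                     float *out_re, float *out_im, int number)
{
    int count;
    for (count = 0; count < number; count++) {
        out_re[count] = in_re[count];
        out_im[count] = -in_im[count];   /* complex conjugate */
    }
    /* count is an int; the float return type forces a conversion here */
    return (float)count;
}
```

Calling it on three elements should leave the real parts unchanged, flip the sign of each imaginary part, and return 3.0f.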

Process for VisualDSP “C” to assembly code
- Use Lab. 0 Project as the test bed
- Add “conjugate.c” file to project
- Select PROJECT | PROJECT OPTIONS | COMPILER | SAVE TEMPORARY FILES
- Select PROJECT | PROJECT OPTIONS | COMPILER | GENERATE DEBUG INFORMATION
- FORCE A REBUILD
  - Bring up “conjugate.c” into the source window
  - “Add a space and save” the file to force the recompile
  - Click on “Build file”
- Examine the file conjugate.s

File “conjugate.s” -- unreadable (used Version 4.1)

    modify(i7,-10);
    dm(-11,i6)=r3;
    r2=i3;
    dm(-6,i6)=r2;        <- MEANS WHAT?
    dm(-4,i6)=r4;        <- DOING WHAT?
    dm(-3,i6)=r8;
    dm(-2,i6)=r12;
    i0=r4;
    i1=r8;
    i2=r12;
    i3=dm(1,i6);
    r6=dm(2,i6);
    ! line 8
    r3=0;
    _L$250001:

Make file “conjugate.s” -- readable
- Copy conjugate.s to fileconjugate.asm and ADD 1 line

    .include "04reverse_engineer.h"      <- Add to fileconjugate.asm file
    modify(i7,-10);
    dm(-11,i6)=r3;
    r2=i3;
    dm(-6,i6)=r2;
    dm(-4,i6)=r4;
    dm(-3,i6)=r8;

- Add fileconjugate.asm to your project. Why the extension change?
- Modify ASSEMBLER OPTIONS
  - Preprocess Only
  - Don't Enable Debugging information
  - Do Enable Verbose Output
  - DON'T FORGET TO UNMODIFY LATER
- Assemble and look for file “fileconjugate.is”

04reverse_engineer.h looks like this

    #define i6  FP
    #define i7  CTOPstack
    #define m5  zeroDM
    #define m13 zeroPM
    #define m6  plus1DM
    #define m14 plus1PM
    #define m7  minus1DM
    #define m15 minus1PM
    #define r0  retvalueR0
    #define r1  scratchR1
    #define r2  scratchR2
    #define r4  OUTorINPAR1_R4
    #define r8  OUTorINPAR2_R8
    #define r12 OUTorINPAR3_R12
    #define i4  scratchDMpt_I4
    #define m4  scratchMDmodify_M4
    #define i12 scratchPMpt_I12
    #define m12 scratchMPmodify_M12
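The renaming works because a C-style preprocessor replaces whole tokens, so defining `r1` does not disturb `r12`. The sketch below models that lookup in C for a few of the entries above; the table and the function `rename_register` are hypothetical illustrations, not part of VisualDSP or the course files, and it assumes the assembler preprocessor follows the usual whole-token rule.

```c
#include <string.h>

/* One row per "#define <reg> <name>" line in the header (subset only). */
struct alias {
    const char *reg;
    const char *name;
};

static const struct alias reg_aliases[] = {
    {"i6",  "FP"},
    {"i7",  "CTOPstack"},
    {"r1",  "scratchR1"},
    {"r2",  "scratchR2"},
    {"r4",  "OUTorINPAR1_R4"},
    {"r8",  "OUTorINPAR2_R8"},
    {"r12", "OUTorINPAR3_R12"},
};

/* Return the readable alias for a complete register token, or the token
 * itself when no alias is defined (r3 and r6 are left untouched). */
const char *rename_register(const char *token)
{
    size_t i;
    for (i = 0; i < sizeof reg_aliases / sizeof reg_aliases[0]; i++) {
        if (strcmp(token, reg_aliases[i].reg) == 0)
            return reg_aliases[i].name;
    }
    return token;
}
```

Because the comparison is on the whole token, `r12` maps to `OUTorINPAR3_R12` rather than to `scratchR1` followed by a stray `2`.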

File “fileconjugate.is” -- more readable?

    fileconjugate.asm        fileconjugate.is
    modify(i7,-10);          modify(CTOPstack ,-10);
    dm(-11,i6)=r3;           dm(-11,FP )=r3;
    r2=i3;                   scratchR2 =i3;
    dm(-6,i6)=r2;            dm(-6,FP )=scratchR2 ;
    dm(-4,i6)=r4;            dm(-4,FP )=OUTorINPAR1_R4 ;
    dm(-3,i6)=r8;            dm(-3,FP )=OUTorINPAR2_R8 ;
    dm(-2,i6)=r12;           dm(-2,FP )=OUTorINPAR3_R12;
    i0=r4;                   i0=OUTorINPAR1_R4 ;
    i1=r8;                   i1=OUTorINPAR2_R8 ;
    i2=r12;                  i2=OUTorINPAR3_R12;
    i3=dm(1,i6);             i3=dm(1,FP );
    r6=dm(2,i6);             r6=dm(2,FP );

Standard format for “C” function
- PROLOGUE
- BODY
- EPILOGUE
Let's identify the recognizable PROLOGUE and EPILOGUE stuff from the .is SHARC code and delete it. What's left is the code we want to learn about.

PROLOGUE, EPILOGUE

    #include "math.h"

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number) {
        // PROLOGUE -- establish stack frame, save non-volatile registers,
        //             build local variables
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
        // EPILOGUE -- recover non-volatile registers, destroy stack, return value?
        return(count);
    }

Review SDS PROLOGUE
- Stack handling, prologue, epilogue very similar across processors
- Dump the SDS assembly code equivalent to “C” code
- We are somewhat familiar with 68k code
- Very roughly translate to 21061 code so that we have some idea of what to expect when we start examining 21061 code.
- Programming the 21K as if it were a 68K processor is also a good starting point (even if not speed efficient)

PROLOGUE 68K processor (SDS)

    ;float MakeConjugate(float *in_re, float *in_im,
    ;        float *out_re, float *out_im, int number) {
            SECTION code
            XDEF    _MakeConjugate
    _MakeConjugate
            LINK    FP,#-0xC
            MOVE.L  A1,-4(FP)
    ; 8(FP) <- in_re, 0xC(FP) <- in_im, 0x10(FP) <- out_re
    ; 0x14(FP) <- out_im, 0x18(FP) <- number
    ; int count; -8(FP) <- count
    ; for (count = 0; count < number; count++) {
    _81     MOVEQ.L #0,D0
            MOVE.L  D0,-8(FP)
            BRA     _11
    ; *out_re = *in_re;
    _2      MOVE.L  8(FP),A1
            MOVE.L  0x10(FP),A0
            MOVE.L  (A1),(A0)
    ; *out_im = -*in_im;
    _83     MOVE.L  0xC(FP),A1
            MOVE.L  0x14(FP),A0
            EORI.B  #0x80,(A0)

PROLOGUE 68K processor (SDS)

    ;float MakeConjugate(float *in_re, float *in_im,
    ;        float *out_re, float *out_im, int number) {
            SECTION code                <- .segment/pm seg_pmco
            XDEF    _MakeConjugate      <- .global _MakeConjugate
    _MakeConjugate
            LINK    FP,#-0xC            <- FP and SP change to make stack frame
            MOVE.L  A1,-4(FP)           <- SAVING volatile register A1?
    ; 8(FP) <- in_re, 0xC(FP) <- in_im, 0x10(FP) <- out_re      INPAR1,2,3
    ; 0x14(FP) <- out_im, 0x18(FP) <- number                    INPAR4 and INPAR5
    ; int count; -8(FP) <- count                                <- above return address
    ; for (count = 0; count < number; count++) {
    _81     MOVEQ.L #0,D0
            MOVE.L  D0,-8(FP)
            BRA     _11
    ; *out_re = *in_re;
    _2      MOVE.L  8(FP),A1            <- grabbing an INPAR1 and then INPAR3
            MOVE.L  0x10(FP),A0         <- A0 = dm(4, FP)
            MOVE.L  (A1),(A0)
    ; *out_im = -*in_im;
    _83     MOVE.L  0xC(FP),A1          <- grabbing an INPAR2 and then INPAR4
            MOVE.L  0x14(FP),A0
            EORI.B  #0x80,(A0)

SHARC PROLOGUE FROM .is file -- non-optimized

    .section /pm seg_pmco;
    .section /dm seg_dmda;              <- Stupid automatic code
    __EPC_data:                         <- EH?
    __EPC_text:                         <- EH?
    .global _MakeConjugate;
    _MakeConjugate:
    modify(CTOPstack ,-10);             <- Build stack frame
    dm(-11,FP ) = r3;                   <- Save non-volatile registers -- OK!
    dm(-10,FP ) = r6;
    scratchR2 = i0;                     <- Note -- MUST save non-volatile dm
    dm(-9,FP ) = scratchR2 ;               index registers IN 2 STEPS
    scratchR2 = i1;
    dm(-8,FP ) =scratchR2 ;
    scratchR2 = i2;
    dm(-7,FP ) = scratchR2 ;
    scratchR2 = i3;
    dm(-6,FP ) = scratchR2 ;
    dm(-4,FP ) = OUTorINPAR1_R4 ;       <- Saving copies of incoming parameters. EH?
    dm(-3,FP ) = OUTorINPAR2_R8 ;
    dm(-2,FP ) = OUTorINPAR3_R12;

68K -- EPILOGUE -- non-optimized

    ; return(count);
    _812    MOVE.L  -8(FP),D0           <- (float) count -- needs subroutine
            PEA     -0xC(FP)            <- address of a stack location
            BSR     __sltos             <- some library call to do “float” op
            .STK    -4                  <- Eh? -- Auto assembler stack adjustment
            LEA     -0xC(FP),A1         <- Eh?
    ; dead <- in_re                     <- I like this feature !!!
    ; dead <- in_im                        Makes for easy hand code optimization
    ; dead <- out_re
    ; dead <- out_im
    ; dead <- number
    ;}
    _13     MOVE.L  -4(FP),A0
            MOVE.L  (A1),(A0)
            UNLK    FP                  <- throw away stack frame
            RTS                         <- Note A1 not recovered from stack -- Eh??

21K Epilogue -- not optimized

    OUTorINPAR3_R12=r3;
    F0=float OUTorINPAR3_R12;           <- (float) count?
    jump(pc,_L$316000);                 <- auto generate GARBAGE
    _L$316000:
    r3=dm(-11,FP );                     <- Recover non-volatile
    r6=dm(-10,FP );
    i0=dm(-9,FP );
    i1=dm(-8,FP );
    i2=dm(-7,FP );
    i3=dm(-6,FP );
                                        <- Start return to “C”
    scratchPMpt =dm(minus1DM ,FP );
                                        <- Not standard assembler code format
                                        <- SHARC SEMICOLONS ARE KEY
    jump(plus1PM ,scratchPMpt )(DB);
    rframe;
    nop;
    ___EPC_text_end:

Example code with 5 parameters to pass

    #include <math.h>

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number) {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
        return(count);
    }

Remaining code is what we want

    i0=OUTorINPAR1_R4 ;                 <- INPAR1 -- float *in_re
    i1=OUTorINPAR2_R8 ;                 <- INPAR2 -- float *in_im
    i2=OUTorINPAR3_R12;                 <- INPAR3 -- float *out_re
    i3=dm(1,FP );                       <- INPAR4? -- float *out_im
    r6=dm(2,FP );                       <- INPAR5? -- int number
    r3=0;
    _L$250001:
    comp(r3, r6);
    if ge jump(pc,_L$250003);
    scratchR2 =dm(i0,plus1DM );         <- post increment in use
    dm(i2,plus1DM )=scratchR2 ;         <- R2 used to move float; only bit pattern
                                           in memory means anything
    scratchR1 =dm(i1,plus1DM );
    F0=-F1;                             <- Float OP -- single cycle
    dm(i3,plus1DM )=retvalueR0 ;        <- R0 used to move float here
    r3=r3+1;
    jump(pc,_L$250001);                 <- NOTE NO DELAYED BRANCHES
    _L$250003:
    OUTorINPAR3_R12=r3;                 <- Using R12/INPAR3 as scratchR12
    F0=float OUTorINPAR3_R12;           <- Using R0/F0 for return -- note conversion
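Read back into C, the non-optimized body is a test-at-top loop with four post-incremented pointers. The sketch below is only a reading aid: variable names mirror the SHARC registers in the listing, and the function name `conjugate_body_model` is hypothetical.

```c
/* C model of the non-optimized 21K loop body (illustration, not compiler
 * output). Each statement is annotated with the instruction it mirrors. */
float conjugate_body_model(const float *i0, const float *i1,
                           float *i2, float *i3, int r6)
{
    int r3 = 0;                      /* r3=0;                      count   */
    while (!(r3 >= r6)) {            /* comp(r3,r6); if ge jump(...)       */
        float r2 = *i0++;            /* scratchR2 = dm(i0,plus1DM)         */
        float f1, f0;
        *i2++ = r2;                  /* dm(i2,plus1DM) = scratchR2         */
        f1 = *i1++;                  /* scratchR1 = dm(i1,plus1DM)         */
        f0 = -f1;                    /* F0 = -F1                           */
        *i3++ = f0;                  /* dm(i3,plus1DM) = retvalueR0        */
        r3 = r3 + 1;                 /* r3=r3+1; jump(pc,_L$250001)        */
    }
    return (float)r3;                /* F0 = float OUTorINPAR3_R12         */
}
```

Every pass through the loop pays for the compare, the conditional jump and the unconditional jump back, which is the overhead the optimized listings try to remove.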

Code changes on optimization
- The previous code examples show what the compiler produces when it performs no code optimization.
- The next slides look at what happens if we get the compiler to optimize the code translation.
- Do we get a flavour of what we need to learn about the SHARC SUPER-SCALAR RISC architecture?
- This is the sort of approach needed in Lab. 3 when optimizing FIR filters, where we invoke the super-scalar capability of the SHARC processor.

68K prologue optimized

    _MakeConjugate
            LINK    FP,#-0xC                    <- STACK FRAME
            MOVEM.L D7/A2/A3/A4/A5,-(SP)        <- CUSTOM INSTR
            MOVE.L  A1,-4(FP)
    ; Prepare for register operations not memory operations
    ; A2 <- out_im; A3 <- out_re; A4 <- in_im
    ; A5 <- in_re
    ; 8(FP) <- in_re                            <- inpars stored above return address
    _81     MOVE.L  8(FP),A5
    ; 0xC(FP) <- in_im
            MOVE.L  0xC(FP),A4
    ; 0x10(FP) <- out_re
            MOVE.L  0x10(FP),A3
    ; 0x14(FP) <- out_im
            MOVE.L  0x14(FP),A2
    ; 0x18(FP) <- number
    ; int count;
    ; D7 <- count                               <- register operations
    ; for (count = 0; count < number; count++) {
    _82     MOVEQ.L #0,D7                       <- optimized code
    _83     BRA     _13                         <- test at bottom
    ; *out_re = *in_re;
    _4      MOVE.L  (A5),(A3)                   <- optimized code
    ; *out_im = -*in_im;
    _85     MOVE.L  (A4),(A2)
            EORI.B  #0x80,(A2)
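The `MOVE.L (A4),(A2)` / `EORI.B #0x80,(A2)` pair negates the float without any floating-point instruction: it copies the 32-bit pattern and then toggles the sign bit, which on the big-endian 68K sits in the first byte. A minimal C sketch of the same trick, assuming a 32-bit IEEE-754 `float`; the function name `flip_sign_bit` is hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Negate a float by toggling bit 31 of its bit pattern.
 * Assumes float is a 32-bit IEEE-754 single-precision value. */
float flip_sign_bit(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* MOVE.L: copy the raw pattern   */
    bits ^= UINT32_C(0x80000000);       /* EORI.B #0x80 on the top byte   */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The word-sized XOR is used here so the sketch does not depend on byte order; the 68K version can get away with a byte operation because the most significant byte is stored first.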

SHARC Code changes on optimization
- Can we improve 04reverse_engineer.h with new knowledge?

    #define dm(1,i6) INPAR4      -- NOTE not FP
    #define dm(2,i6) INPAR5

- EXPECTED ANSWER -- USEFUL SOME OF THE TIME
- ACTUAL ANSWER UNEXPECTED! -- NEVER

    [Error pp0012] ".\04reverse_engineer.h":1 Expected an identifier

- MEANS WHAT -- I never defined anything?
- Have not been able to fix the trick
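One likely reading of the error, assuming the assembler preprocessor follows the C rules: `#define dm(1,i6) ...` is parsed as a function-like macro named `dm`, and its parameter list must consist of identifiers, which `1` is not. In standard C the closest workable idea is to key an object-like macro on the offset by token pasting, as sketched below. The names `FRAME_SLOT`, `FRAME_SLOT_1` and `FRAME_SLOT_2` are hypothetical, and whether the VisualDSP preprocessor accepts `##` is not confirmed here.

```c
/* #define dm(1,i6) INPAR4     rejected: a macro parameter must be an identifier */

/* Hypothetical alternative: one object-like macro per frame offset ... */
#define FRAME_SLOT_1 "INPAR4"
#define FRAME_SLOT_2 "INPAR5"

/* ... selected by pasting the offset onto a common prefix. */
#define FRAME_SLOT(offset) FRAME_SLOT_##offset

/* Small wrappers so the expansion can be inspected at run time. */
const char *frame_slot_name_1(void) { return FRAME_SLOT(1); }
const char *frame_slot_name_2(void) { return FRAME_SLOT(2); }
```

This only covers positive offsets written as plain digits; a negative offset such as `-11` cannot be pasted into an identifier, so the saved-register slots would still need a different approach.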

21K Optimized Prologue

    .global _MakeConjugate;
    _MakeConjugate:
    scratchR2 =i1;
    modify(CTOPstack ,-9);
    dm(-10,FP )=r3;
    r3=r3 xor r3;
    dm(-8,FP )=scratchR2 ;
    scratchR2 =i2;
    dm(-7,FP )=scratchR2 ;
    scratchR2 =i3;
    dm(-6,FP )=scratchR2 ;
    dm(-9,FP )=r6;
    i1=OUTorINPAR1_R4 ;
    i3=OUTorINPAR2_R8 ;
    scratchDMpt =OUTorINPAR3_R12 ;
    i2=dm(1,FP );
    r6=dm(2,FP );

Notes
- Not saving i1 to memory immediately -- Why not?
- Make 0 the hard way? Why? -- using SHIFT ALU not R3 = 0 means that the compiler might be able to optimize (parallel instructions) under different conditions, as the value 0 requires 32 bits in the opcode
- 15 lines compared to 16 before. Not much saving!
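The `r3=r3 xor r3` idiom relies on the identity x XOR x = 0, so the register is cleared without an immediate constant in the instruction. A one-line C illustration (the function name `clear_by_xor` is hypothetical):

```c
#include <stdint.h>

/* Whatever r3 holds on entry, XOR with itself yields zero,
 * so no literal 0 has to be encoded in the instruction. */
uint32_t clear_by_xor(uint32_t r3)
{
    return r3 ^ r3;
}
```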

68K epilogue -- optimized

    ; return(count);
    _814    MOVE.L  D7,-8(FP)
            MOVE.L  -8(FP),D0
            PEA     -0xC(FP)
            BSR     __sltos
            .STK    -4
            LEA     -0xC(A6),A1
    ; dead <- count
    ;}
    _15     MOVE.L  -4(FP),A0
            MOVE.L  (A1),(A0)
            MOVEM.L (SP)+,D7/A2/A3/A4/A5        <- recover
            UNLK    FP                          <- destroy frame
            RTS

21K EPILOGUE
I do not see any changes

    _L$250003:
    r6=dm(-9,FP );
    F0=float r3;
    r3=dm(-10,FP );
    i1=dm(-8,FP );
    i2=dm(-7,FP );
    i3=dm(-6,FP );
    scratchPMpt =dm(minus1DM ,FP );
    jump(plus1PM ,scratchPMpt )(DB);
    rframe;
    nop;

21K Body -- optimized code -- Version 4.1

    i1=OUTorINPAR1_R4 ;
    i3=OUTorINPAR2_R8 ;
    scratchDMpt_I4 =OUTorINPAR3_R12 ;
    i2=dm(1,FP );                               <- Get INPAR4
    r6=dm(2,FP );                               <- Get INPAR5 and then test
    r6=pass r6;
    if le jump(pc,_L$250003);
    scratchR2 =dm(i1,plus1DM );                 <- NEW Hardware loop with pipeline preload
    lcntr=r6, do(pc,_L$316001-1)until lce;
    _L$250001:
    dm(scratchDMpt_I4 ,plus1DM )=scratchR2 ;
    scratchR1 =dm(i3,plus1DM );
    F0=-F1;
    dm(i2,plus1DM )=retvalue ;                  <- Line _L$316001 - 1 used in lcntr, do-until lce
    _L$316001:
    r3=r6;
    _L$250003:
    r6=dm(-9,FP );
    F0=float r3;

Notes
- ALU not INSTRUCTION
- 5 lines of code in loop compared to 9
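The optimized body swaps the compare-and-jump loop for a counted hardware loop and loads the first real value before the loop starts. A C model of that shape is sketched below. Two assumptions are labeled in the comments: the reload at the bottom of the loop does not appear in this listing and is inferred from the later pm/dm version, and the guard on that reload exists only to keep the C code from reading past the input array. The function name `conjugate_preload_model` is hypothetical.

```c
/* C model of the optimized 21K body: test once, preload, counted loop. */
float conjugate_preload_model(const float *in_re, const float *in_im,
                              float *out_re, float *out_im, int number)
{
    int r3 = 0;                               /* r3 = r3 xor r3 (prologue)    */
    if (number > 0) {                         /* r6=pass r6; if le jump(...)  */
        int lcntr;
        float r2 = *in_re++;                  /* preload before the loop      */
        for (lcntr = number; lcntr > 0; lcntr--) {   /* lcntr=r6, do ... lce  */
            float f1;
            *out_re++ = r2;                   /* dm(I4,plus1DM) = scratchR2   */
            f1 = *in_im++;                    /* scratchR1 = dm(i3,plus1DM)   */
            *out_im++ = -f1;                  /* F0=-F1; dm(i2,plus1DM)=R0    */
            if (lcntr > 1)                    /* guard added for C safety     */
                r2 = *in_re++;                /* assumed reload for next pass */
        }
        r3 = number;                          /* r3=r6                        */
    }
    return (float)r3;                         /* F0=float r3                  */
}
```

The loop count is fixed before entry, so the per-iteration compare, conditional jump and jump-back of the non-optimized version disappear.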

Example code with 5 parameters to pass

    #include <math.h>
    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number) {

Going for real optimization (Lab. 4)

    float MakeConjugate(float pm *in_re, float dm *in_im,
                        float dm *out_re, float pm *out_im, int number) {

Trying to invoke parallel data and program memory operations

Prologue changes

    .global _MakeConjugate;
    _MakeConjugate:
    modify(CTOPstack ,-7);
    dm(-8,FP )=r3;
    r3=r3 xor r3;
    dm(-7,FP )=r6;
    scratchR2 =i1;
    r6=dm(2,FP );
    dm(-6,FP )=scratchR2 ;
    r6=pass r6;
    i13=dm(1,FP );                              <- DAG2!!
    scratchPMpt_I12 =OUTorINPAR1_R4 ;
    i1=OUTorINPAR2_R8 ;
    scratchDMpt_I4 =OUTorINPAR3_R12 ;

Really optimized? Not what was expected.

    if le jump(pc,_L$250003);
    scratchR2 =pm(scratchPMpt_I12 ,plus1PM );
    lcntr=r6, do(pc,_L$316001-1)until lce;
    _L$250001:
    dm(scratchDMpt_I4 ,plus1DM )=scratchR2 ;
    scratchR1 =dm(i1,plus1DM );
    F0=-F1;
    pm(i13,plus1PM )=retvalue ;
    scratchR2 =pm(scratchPMpt_I12, plus1PM );
    _L$316001:

QUESTIONS -- Can we further optimize THIS code by having pm and dm operations at the same time (see Lab. 3), or is the compiler correct that this is the best we can do?
See the articles from Embedded Systems Magazine (Sept/Oct 2000) on what is needed to make this work.

We now have VisualDSP++ 3.0 SP1
Does the new compiler make better use of the program and data buses, or must we still hand optimize the code for parallel operations? We will look at optimization issues during the assignments.

    #include <math.h>

    float MakeConjugate(dm float *in_re, pm float *in_im,
                        pm float *out_re, dm float *out_im, int number) {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
        return(count);   // NEW -- returning a parameter
    }

To be tackled today
- The “C” compiler knows how to generate assembler
- What can we learn from the “C” compiler as tutor?
- “C” routines can use many parameters
  - How does Wind River DiabData 68K compiler do it?
  - How does White Mountain SHARC 21K compiler do it?
- Process to generate assembler from “C” (general)
- VisualDSP requirements
  - Use the -S compiler option and look at the .s file
  - Printing (best directly from VisualDSP, NOT Notepad)
  - Reverse engineer the .s file for easier reading by using the reverse_clanguage_register_defines.i file and the assembler preprocessor to produce a .is file.