Published by Felicia Randall. Modified over 5 years ago.
1
Learning about code optimizations on CISC, RISC and DSP processors
Using optimization information from SHARC “C” compiler
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
ucalgary.ca
2
To be tackled today
The “C/C++” compiler knows how to generate assembler code
  What can we learn from the compiler as tutor?
“C/C++” routines can use many parameters
  How did the Wind River DiabData 68K compiler do it?
  How does the VisualDSP SHARC 21K CROSS_CODE compiler do it?
Process to generate assembler from “C” (general)
9/13/2019 ENCM Learning from the “C” compiler Copyright
3
Example code with 5 parameters to pass
Useful concept in many DSP algorithms
Calculate A^-T B A, where A and B are matrices and A^-T is the complex conjugate transpose

    #include <math.h>

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number)
    {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;   /* conjugate: negate the imaginary part */
            out_re++; out_im++;
            in_re++; in_im++;
        }
        // NEW -- returning a parameter
        return(count);           // Actually returning (float) count
    }
4
Process for VisualDSP “C” to assembly code
Use the Lab. 0 Project as the test bed
Add the “conjugate.c” file to the project
Select PROJECT | PROJECT OPTIONS | COMPILER | SAVE TEMPORARY FILES
Select PROJECT | PROJECT OPTIONS | COMPILER | GENERATE DEBUG INFORMATION
FORCE A REBUILD
  Bring up “conjugate.c” in the source window
  “Add a space and save” the file to force the recompile
  Click on “Build file”
Examine the file conjugate.s
5
File “conjugate.s” – unreadable
    modify(i7,-10);
    dm(-11,i6)=r3;
    r2=i3;
    dm(-6,i6)=r2;        MEANS WHAT?
    dm(-4,i6)=r4;        DOING WHAT?
    dm(-3,i6)=r8;
    dm(-2,i6)=r12;
    i0=r4;
    i1=r8;
    i2=r12;
    i3=dm(1,i6);
    r6=dm(2,i6);
    ! line 8
    r3=0;
_L$250001:
// Used Version 4.1
6
Make file “conjugate.s” -- readable
Copy conjugate.s to fileconjugate.asm and ADD 1 line
    .include "04reverse_engineer.h"       <- Add to the fileconjugate.asm file
    modify(i7,-10);
    dm(-11,i6)=r3;
    r2=i3;
    dm(-6,i6)=r2;
    dm(-4,i6)=r4;
    dm(-3,i6)=r8;
Add fileconjugate.asm to your project
  Why the extension change?
Modify ASSEMBLER OPTIONS
  Preprocess Only
  Don’t Enable Debugging information
  Do Enable Verbose Output
  DON’T FORGET TO UNMODIFY LATER
Assemble and look for the file “fileconjugate.is”
7
04reverse_engineer.h looks like this
    #define i6  FP
    #define i7  CTOPstack
    #define m5  zeroDM
    #define m13 zeroPM
    #define m6  plus1DM
    #define m14 plus1PM
    #define m7  minus1DM
    #define m15 minus1PM
    #define r0  retvalueR0
    #define r1  scratchR1
    #define r2  scratchR2
    #define r4  OUTorINPAR1_R4
    #define r8  OUTorINPAR2_R8
    #define r12 OUTorINPAR3_R12
    #define i4  scratchDMpt_I4
    #define m4  scratchMDmodify_M4
    #define i12 scratchPMpt_I12
    #define m12 scratchMPmodify_M12
8
File “fileconjugate.is” -- more readable?
fileconjugate.asm        fileconjugate.is
    modify(i7,-10);          modify(CTOPstack ,-10);
    dm(-11,i6)=r3;           dm(-11,FP )=r3;
    r2=i3;                   scratchR2 =i3;
    dm(-6,i6)=r2;            dm(-6,FP )=scratchR2 ;
    dm(-4,i6)=r4;            dm(-4,FP )=OUTorINPAR1_R4 ;
    dm(-3,i6)=r8;            dm(-3,FP )=OUTorINPAR2_R8 ;
    dm(-2,i6)=r12;           dm(-2,FP )=OUTorINPAR3_R12;
    i0=r4;                   i0=OUTorINPAR1_R4 ;
    i1=r8;                   i1=OUTorINPAR2_R8 ;
    i2=r12;                  i2=OUTorINPAR3_R12;
    i3=dm(1,i6);             i3=dm(1,FP );
    r6=dm(2,i6);             r6=dm(2,FP );
9
Standard format for “C” function
PROLOGUE
BODY
EPILOGUE
Let’s identify the recognizable PROLOGUE and EPILOGUE code in the .is SHARC listing and delete it.
What’s left is the code we want to learn about.
10
PROLOGUE, EPILOGUE

    #include "math.h"

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number)
    {
PROLOGUE -- establish stack frame, save non-volatile registers, build local variables
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
EPILOGUE -- recover non-volatile registers, destroy stack, return value?
        return(count);
    }
11
Review SDS PROLOGUE
Stack handling, prologue and epilogue are very similar across processors
Dump the SDS assembly code equivalent to the “C” code
  We are somewhat familiar with 68K code
  Very roughly translate it to 21K code so that we have some idea of what to expect when we start examining the compiler output
Programming the 21K as if it were a 68K processor is also a good starting point (even if not speed efficient)
12
PROLOGUE 68K processor (SDS)
;float MakeConjugate(float *in_re, float *in_im,
;        float *out_re, float *out_im, int number) {
        SECTION code
        XDEF    _MakeConjugate
_MakeConjugate
        LINK    FP,#-0xC
        MOVE.L  A1,-4(FP)
; 8(FP) <- in_re, 0xC(FP)
; 0x14(FP) <- out_im, 0x18(FP) <- number
; int count;    -8(FP) <- count
; for (count = 0; count < number; count++) {
_81     MOVEQ.L #0,D0
        MOVE.L  D0,-8(FP)
        BRA     _11
; *out_re = *in_re;
_2      MOVE.L  8(FP),A1
        MOVE.L  0x10(FP),A0
        MOVE.L  (A1),(A0)
; *out_im = -*in_im;
_83     MOVE.L  0xC(FP),A1
        MOVE.L  0x14(FP),A0
        EORI.B  #0x80,(A0)
13
PROLOGUE 68K processor (SDS)
;float MakeConjugate(float *in_re, float *in_im,
;        float *out_re, float *out_im, int number) {
        SECTION code                  .segment/pm seg_pmco
        XDEF    _MakeConjugate        .global _MakeConjugate
_MakeConjugate
        LINK    FP,#-0xC              FP and SP change to make stack frame
        MOVE.L  A1,-4(FP)             SAVING volatile register A1?
; 8(FP) <- in_re, 0xC(FP) <- in_im, 0x10(FP) <- out_re      INPAR1,2,3
; 0x14(FP) <- out_im, 0x18(FP) <- number                    INPAR4 and INPAR5
; int count;    -8(FP) <- count                             above return address
; for (count = 0; count < number; count++) {
_81     MOVEQ.L #0,D0
        MOVE.L  D0,-8(FP)
        BRA     _11
; *out_re = *in_re;
_2      MOVE.L  8(FP),A1              grabbing an INPAR1 and then INPAR3
        MOVE.L  0x10(FP),A0           A0 = dm(4, FP)
        MOVE.L  (A1),(A0)
; *out_im = -*in_im;
_83     MOVE.L  0xC(FP),A1            grabbing an INPAR2 and then INPAR4
        MOVE.L  0x14(FP),A0
        EORI.B  #0x80,(A0)
14
SHARC PROLOGUE FROM .is file -- non-optimized
    .section /pm seg_pmco;
    .section /dm seg_dmda;
__EPC_data:
__EPC_text:
    .global _MakeConjugate;
_MakeConjugate:
    modify(CTOPstack ,-10);
    dm(-11,FP ) = r3;
    dm(-10,FP ) = r6;
    scratchR2 = i0;
    dm(-9,FP ) = scratchR2 ;
    scratchR2 = i1;
    dm(-8,FP ) = scratchR2 ;
    scratchR2 = i2;
    dm(-7,FP ) = scratchR2 ;
    scratchR2 = i3;
    dm(-6,FP ) = scratchR2 ;
    dm(-4,FP ) = OUTorINPAR1_R4 ;
    dm(-3,FP ) = OUTorINPAR2_R8 ;
    dm(-2,FP ) = OUTorINPAR3_R12;
Stupid automatic code EH? EH?
Build stack frame
Save non-volatile registers -- OK!
Note -- MUST save non-volatile dm index registers IN 2 STEPS
Saving copies of incoming parameters. EH?
15
68K -- EPILOGUE -- non-optimized
; return(count);
_812    MOVE.L  -8(FP),D0             (float) count -- needs subroutine
        PEA     -0xC(FP)              address of a stack location
        BSR     __sltos               some library call to do “float” op
        .STK    -4                    Eh? -- Auto assembler stack adjustment
        LEA     -0xC(FP),A1           Eh?
; dead <- in_re                       I like this feature !!!
; dead <- in_im                       Makes for easy hand code optimization
; dead <- out_re
; dead <- out_im
; dead <- number
;}
_13     MOVE.L  -4(FP),A0
        MOVE.L  (A1),(A0)
        UNLK    FP                    throw away stack frame
        RTS                           Note A1 not recovered from stack -- Eh??
16
21K Epilogue -- not optimized
    OUTorINPAR3_R12=r3;
    F0=float OUTorINPAR3_R12;            (float) count?
    jump(pc,_L$316000);                  auto generate GARBAGE
_L$316000:
    r3=dm(-11,FP );                      Recover non-volatile
    r6=dm(-10,FP );
    i0=dm(-9,FP );
    i1=dm(-8,FP );
    i2=dm(-7,FP );
    i3=dm(-6,FP );
    scratchPMpt =dm(minus1DM ,FP );      Start return to “C”
    jump(plus1PM ,scratchPMpt )(DB);     Not standard assembler code format
    rframe;                              SHARC SEMICOLONS ARE KEY
    nop;
___EPC_text_end:
17
Example code with 5 parameters to pass
    #include <math.h>

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number)
    {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
        return(count);
    }
18
Remaining code is what we want
    i0=OUTorINPAR1_R4 ;                  INPAR1 -- float *in_re
    i1=OUTorINPAR2_R8 ;                  INPAR2 -- float *in_im
    i2=OUTorINPAR3_R12;                  INPAR3 -- float *out_re
    i3=dm(1,FP );                        INPAR4? -- float *out_im
    r6=dm(2,FP );                        INPAR5? -- int number
    r3=0;
_L$250001:
    comp(r3, r6);
    if ge jump(pc,_L$250003);
    scratchR2 =dm(i0,plus1DM );          post increment in use
    dm(i2,plus1DM )=scratchR2 ;          R2 used to move float; only bit pattern in memory means anything
    scratchR1 =dm(i1,plus1DM );
    F0=-F1;                              Float OP -- single cycle
    dm(i3,plus1DM )=retvalueR0 ;         R0 used to move float here
    r3=r3+1;
    jump(pc,_L$250001);                  NOTE NO DELAYED BRANCHES
_L$250003:
    OUTorINPAR3_R12=r3;                  Using R12/INPAR3 as scratchR12
    F0=float OUTorINPAR3_R12;            Using R0/F0 for return -- note conversion
19
Code changes on optimization
The previous code examples are for when the compiler does not perform code optimization.
The next slides look at what happens if we get the compiler to optimize the code translation.
Do we get a flavour of what we need to learn about the SHARC SUPER-SCALAR RISC architecture?
This is the sort of approach needed in Lab. 3 when we optimize FIR filters, where we invoke the super-scalar capability of the SHARC processor.
20
68K prologue optimized
_MakeConjugate
        LINK    FP,#-0xC                      <- STACK FRAME
        MOVEM.L D7/A2/A3/A4/A5,-(SP)          <- CUSTOM INSTR
        MOVE.L  A1,-4(FP)
; Prepare for register operations not memory operations
; A2 <- out_im; A3 <- out_re; A4 <- in_im
; A5 <- in_re
; 8(FP) <- in_re                              <- inpars stored above return address
_81     MOVE.L  8(FP),A5
; 0xC(FP) <- in_im
        MOVE.L  0xC(FP),A4
; 0x10(FP) <- out_re
        MOVE.L  0x10(FP),A3
; 0x14(FP) <- out_im
        MOVE.L  0x14(FP),A2
; 0x18(FP) <- number
; int count;
; D7 <- count                                 <- register operations
; for (count = 0; count < number; count++) {
_82     MOVEQ.L #0,D7                         <- optimized code
_83     BRA     _                             <- test at bottom
; *out_re = *in_re;
_4      MOVE.L  (A5),(A3)                     <- optimized code
; *out_im = -*in_im;
_85     MOVE.L  (A4),(A2)
        EORI.B  #0x80,(A2)
21
SHARC Code changes on optimization
Can we improve 04reverse_engineer.h with new knowledge?
    #define dm(1,i6) INPAR                NOTE not FP
    #define dm(2,i6) INPAR5
EXPECTED ANSWER -- USEFUL SOME OF THE TIME
ACTUAL ANSWER UNEXPECTED! -- NEVER
    [Error pp0012] ".\04reverse_engineer.h":1 Expected an identifier
MEANS WHAT -- I never defined anything?
I have not been able to make this trick work
22
21K Optimized Prologue
    .global _MakeConjugate;
_MakeConjugate:
    scratchR2 =i1;
    modify(CTOPstack ,-9);
    dm(-10,FP )=r3;
    r3=r3 xor r3;
    dm(-8,FP )=scratchR2 ;
    scratchR2 =i2;
    dm(-7,FP )=scratchR2 ;
    scratchR2 =i3;
    dm(-6,FP )=scratchR2 ;
    dm(-9,FP )=r6;
    i1=OUTorINPAR1_R4 ;
    i3=OUTorINPAR2_R8 ;
    scratchDMpt =OUTorINPAR3_R12 ;
    i2=dm(1,FP );
    r6=dm(2,FP );
Not saving i1 to memory immediately -- Why not?
Make 0 the hard way? Why? -- using SHIFT ALU not R3 = 0 means that it might be possible to optimize (parallel instructions) under different conditions, as the value 0 requires 32 bits in the opcode
15 lines compared to 16 before -- not much saving!
23
68K epilogue -- optimized
; return(count);
_814    MOVE.L  D7,-8(FP)
        MOVE.L  -8(FP),D0
        PEA     -0xC(FP)
        BSR     __sltos
        .STK    -4
        LEA     -0xC(A6),A1
; dead <- count
;}
_15     MOVE.L  -4(FP),A0
        MOVE.L  (A1),(A0)
        MOVEM.L (SP)+,D7/A2/A3/A4/A5          <- recover
        UNLK    FP                            <- destroy frame
        RTS
24
21K EPILOGUE
_L$250003:
    r6=dm(-9,FP );
    F0=float r3;
    r3=dm(-10,FP );
    i1=dm(-8,FP );
    i2=dm(-7,FP );
    i3=dm(-6,FP );
    scratchPMpt =dm(minus1DM ,FP );
    jump(plus1PM ,scratchPMpt )(DB);
    rframe;
    nop;
I do not see any changes
25
21K Body -- optimized code -- Version 4.1
    i1=OUTorINPAR1_R4;
    i3=OUTorINPAR2_R8 ;
    scratchDMpt_I4 =OUTorINPAR3_R12 ;
    i2=dm(1,FP );
    r6=dm(2,FP );
    r6=pass r6;
    if le jump(pc,_L$250003);
    scratchR2 =dm(i1,plus1DM );
    lcntr=r6, do(pc,_L$ )until lce;
_L$250001:
    dm(scratchDMpt_I4 ,plus1DM )=scratchR2 ;
    scratchR1 =dm(i3,plus1DM );
    F0=-F1;
    dm(i2,plus1DM )=retvalue ;
_L$316001:
    r3=r6;
_L$250003:
    r6=dm(-9,FP );
    F0=float r3;
Get INPAR4
Get INPAR and then test
NEW Hardware loop with pipeline preload
ALU not INSTRUCTION
<- Line _L$ used in lcntr, do-until lce
5 lines of code in loop compared to 9
26
Example code with 5 parameters to pass
    #include <math.h>

    float MakeConjugate(float *in_re, float *in_im,
                        float *out_re, float *out_im, int number) {

Going for real optimization (Lab. 4)

    float MakeConjugate(float pm *in_re, float dm *in_im,
                        float dm *out_re, float pm *out_im, int number) {

Trying to invoke parallel data and program memory operations
27
Prologue changes
    .global _MakeConjugate;
_MakeConjugate:
    modify(CTOPstack ,-7);
    dm(-8,FP )=r3;
    r3=r3 xor r3;
    dm(-7,FP )=r6;
    scratchR2 =i1;
    r6=dm(2,FP );
    dm(-6,FP )=scratchR2 ;
    r6=pass r6;
    i13=dm(1,FP );                        DAG2!!
    scratchPMpt_I12 =OUTorINPAR1_R4 ;
    i1=OUTorINPAR2_R8 ;
    scratchDMpt_I4 =OUTorINPAR3_R12 ;
28
Really optimized? Not what was expected.
    if le jump(pc,_L$250003);
    scratchR2 =pm(scratchPMpt_I12 ,plus1PM );
    lcntr=r6, do(pc,_L$ )until lce;
_L$250001:
    dm(scratchDMpt_I4 ,plus1DM )=scratchR2 ;
    scratchR1 =dm(i1,plus1DM );
    F0=-F1;
    pm(i13,plus1PM )=retvalue ;
    scratchR2 =pm(scratchPMpt_I12, plus1PM );
_L$316001:
QUESTIONS -- Can we further optimize THIS code by having pm and dm operations at the same time (see Lab. 3), or is the compiler correct that this is the best we can do?
See the articles from Embedded Systems Magazine (Sept/Oct 2000) on what is needed to make this work
29
We now have VisualDSP++ 3.0 SP1
Does the new compiler make better use of the program and data buses, or must we still hand optimize the code for parallel operations?
We will look at optimization issues during the assignments.

    #include <math.h>

    float MakeConjugate(dm float *in_re, pm float *in_im,
                        pm float *out_re, dm float *out_im, int number)
    {
        int count;
        for (count = 0; count < number; count++) {
            *out_re = *in_re;
            *out_im = -*in_im;
            out_re++; out_im++;
            in_re++; in_im++;
        }
        return(count);   // NEW -- returning a parameter
    }
30
To be tackled today
The “C” compiler knows how to generate assembler
  What can we learn from the “C” compiler as tutor?
“C” routines can use many parameters
  How does the Wind River DiabData 68K compiler do it?
  How does the White Mountain SHARC 21K compiler do it?
Process to generate assembler from “C” (general)
VisualDSP requirements
  Using the -S compiler option and looking at the .s file
  Printing (best directly from VisualDSP, NOT Notepad)
  Reverse engineering the .s file for easier reading by using the reverse_clanguage_register_defines.i file and the assembler preprocessor to produce the .is file