SHARC Ecology 201: Using a Project Management Tool to Handle Microprocessor Resources
M. R. Smith, University of Calgary, Canada, smithmr @ ucalgary.ca
SHARC2001 Workshop, Boston

Series of Talks and Workshops
CACHE-DSP -- Talk on a simple process tool to identify cache conflicts in DSP code.
SQUISH-DSP -- Talk on using a project management tool to automate identification of parallel DSP processor instructions.
SHARC Ecology 101 -- Workshop showing how to systematically write parallel 2106X code.
SHARC Ecology 201 -- Workshop on SQUISH-DSP and CACHE-DSP tools.
12/5/2018 SQUISHDSP -- ADSP2106X parallelization tool Copyright M. Smith --

Material covered
Efficiency of the assembly code produced by the optimizing VisualDSP++ compiler depends on the design/form of the “C/C++” algorithm.
Simple code example and a variety of design formats for speed.
Need to further improve the speed of code developed by the optimizing compiler or through custom development processes.
Use of the tool SquishDSP to assist in identifying dependencies in your code and possibly finding instructions that can be run in parallel.
Speed improvement is algorithm and design dependent, but we have doubled the speed of code produced by the VisualDSP++ compiler.
Further tests are needed to see if the improvements scale for more complex DSP algorithms.
This tutorial was developed for teaching purposes, and some parts “may provide BGOs” for people familiar with the concepts.

Typical but simple DSP algorithm
Note -- loop, memory intensive, multiplication and addition intensive, use of constants -- typical DSP stuff.
Note the use of both “dm” and “pm” arrays.
Uses a “known” constant array size, as that gives the optimizing compiler better opportunities than a “variable” array size passed in as a parameter to the subroutine.

VisualDSP++ output
Much more parallel ADSP2106X code than was available from VisualDSP 4.1
1 calculation in each loop
Average 2 cycles/calculation

Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
  1; …... 5; 6; …..
End Loop
May lead to more parallel instructions in the ‘middle’ of the new, longer loop.
May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
Variation needed if N is not a multiple of 2.

Unroll the loop
Anticipated tighter code from variant 1 on ADSP2106X.
Chose the second format as I thought the approach might be useful on the Hammerhead ADSP2116X in SIMD mode. GOOD for SIMD?

Variant 1 -- Double loop using count++
2 calculations in each loop
Average 5 cycles/calculation
VERY POOR OPTIMIZATION
Unexpected software loop increases overhead by 2 cycles per loop

Variant 2 -- using index [count + 1]
Very impressed in some ways
6 calculations in each loop
Average 2 cycles/calculation
OPTIMIZATION NO BETTER THAN ORIGINAL SINGLE LOOP EXAMPLE, BUT LOOKS EASY TO FURTHER REDUCE LOOP CYCLE COUNT AS COMPILER HAS PLACED VALUES IN CORRECT REGISTERS FOR PARALLEL OPS

Variant 2 -- using index [count + 1]
EASY TO REDUCE CYCLES AS COMPILER HAS PLACED VALUES IN CORRECT REGISTERS FOR PARALLEL SHARC OPS
FOR EXAMPLE
Moving pm(i13, m12) down one cycle allows the parallel operation F12=F0*F4, F1=F11+F12;
One cycle decrease already

Further speed improvement?
By playing around with the code, I thought I could get it down to 1 cycle per calculation.
However, even with this simple code, I was not sure whether I was handling all the data dependencies correctly. It would be impossible with a larger code sequence.
I therefore decided to move the code into Microsoft Project, which is a business scheduling tool, rather than write my own scheduler! Hence the tool SquishDSP.

SquishDSP
SquishDSP V1.0 reformatted the ADSP2106X code into something suitable for input into Microsoft Project.
The reformatting process identified a few dependencies between instructions. It basically allowed Microsoft Project to work “by default”, knowing that the compiler had already “ordered things in a semi-reasonable way”.
Worked extremely well, but a few instructions were out of place and had to be moved by hand. Okay if you knew what to look for and what to expect.
Unlikely to work on “long loops” or with custom hand-coded assembly -- my specialty.
SquishDSP V2.0 identifies most of the dependencies before the code is submitted to Microsoft Project. The results with SquishDSP 2.0 are given here.
Contact Mike Smith for further information.

Step 1 -- Develop the initial code -- process.c
Notes
LOOP SIZE -- FIXED as a constant MAXSIZE and not a variable.
Use of both DM and PM data busses in the “C” program.
Double loop of code with index registers. This [count] then [count+1] form of double loop was chosen from several variants tried.

Step 2 -- Pass through VisualDSP++
Note in “process.s” that the compiler has unrolled the loop further -- 6 calculations performed per loop.
Initially work with the “loop component” only in the next stages.

Step 3A -- First Stages of SquishDSP
Pass 1 -- Replace “commas” in instructions that are not instruction separators. This was initially to get the code into a .CSV format but is currently retained as a reliable approach to prepare for Pass 2.
Pass 2 -- Identify and break up all parallel instructions into single instructions, taking care of “local dependencies”; retain the original instructions.
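
The comma handling in Pass 1 and the splitting in Pass 2 can be illustrated with a short sketch. This is not SquishDSP's code; the function name split_parallel is hypothetical, and the sketch assumes that a comma inside parentheses, as in dm(i4,m4), is part of an operand while a comma outside parentheses separates parallel sub-instructions.

```python
def split_parallel(instruction):
    """Split one ADSP-2106X style line into its parallel sub-instructions.

    Commas inside parentheses, such as dm(i4,m4), belong to an operand and
    are kept; only commas at nesting depth 0 separate sub-instructions.
    """
    text = instruction.strip().rstrip(";")
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # Top-level comma: close the current sub-instruction.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


# One parallel instruction taken from the compiler output shown in this talk.
for single in split_parallel("F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;"):
    print(single)
```

Tracking the parenthesis depth avoids rewriting the operand commas at all, which is an alternative to the replace-then-split route described for Pass 1.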

Step 3B -- First Stages of SquishDSP
Pass 3 -- Add dependency information in a Microsoft Project compatible format.
Pass 4 -- Reformat into a totally Project compatible format, and “pretty format” to restore the original ADSP2106X style of syntax.
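
A rough idea of the dependency information Pass 3 has to produce is sketched below. It is an illustration under stated assumptions, not the tool's algorithm: the helper names regs, reads_writes and predecessors are hypothetical, R and F names are treated as the same data register, an index register used in dm( ) or pm( ) is assumed to be post-modified (and so written), and memory-to-memory ordering is ignored. Every read-after-write, write-after-read and write-after-write pair is treated as an ordering constraint, which is stricter than the hardware needs.

```python
import re

# Data (r/f), index (i) and modify (m) registers followed by a number.
REG = re.compile(r"\b([rfim])(\d+)\b", re.IGNORECASE)


def regs(text):
    """Return the register names in text, with F registers mapped onto R."""
    found = set()
    for kind, num in REG.findall(text):
        kind = kind.lower()
        if kind == "f":
            kind = "r"
        found.add(kind + num)
    return found


def reads_writes(instr):
    """Return (registers read, registers written) for one single instruction."""
    dst, src = instr.split("=", 1)
    reads, writes = regs(src), set()
    if "(" in dst:                      # store, e.g. pm(i13,m12)=r1
        reads |= regs(dst)
        writes |= {r for r in regs(dst) if r.startswith("i")}
    else:                               # register result, e.g. F13=F0*F4
        writes |= regs(dst)
        writes |= {r for r in regs(src) if r.startswith("i")}
    return reads, writes


def predecessors(instrs):
    """For each instruction, list the earlier instructions (1-based) it must follow."""
    info = [reads_writes(i) for i in instrs]
    result = []
    for j, (rj, wj) in enumerate(info):
        deps = []
        for k in range(j):
            rk, wk = info[k]
            if (wk & rj) or (rk & wj) or (wk & wj):   # RAW, WAR, WAW
                deps.append(k + 1)
        result.append(deps)
    return result


code = ["r0=dm(i2,m1)", "F13=F0*F4", "F8=F11+F13", "pm(i12,m9)=r8"]
for number, deps in enumerate(predecessors(code), start=1):
    print(number, code[number - 1], "after", deps)
```

The numbered predecessor lists play the role of the task links handed to the scheduling tool; the exact import layout Microsoft Project expects is not reproduced here.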

Step 4A -- Input into “Microsoft Project”
Select “txt -- Default Task Information” and import file

Display in ‘non-leveled’ mode
Select TOOLS | Resource Leveling | Clear Leveling -- Note the highly overused resources using SquishDSP V1.0

Display in ‘leveled’ mode
Select TOOLS | Resource Leveling | Level Now -- Note the proper allocation of resources even when using SquishDSP 1.0

Step 4B -- Display in ‘non-leveled’ mode
Select TOOLS | Resource Leveling | Clear Leveling -- Note there are now only a few overused resources, as SquishDSP 2.0 has already resolved most conflicts before the code reaches Project

Step 4C -- Display in ‘leveled’ mode
Select TOOLS | Resource Leveling | Level Now -- Note the proper rescheduling of resources

Step 4D -- Sort the tasks by “Start” date
Click in the “Task Name” bar and select “Sort | Ascending | Start”

Step 4E -- Prepare “rescheduledproject.txt”
Cut and paste “Task Name, Duration, Start” into a Notepad file

Steps inside Microsoft Project
Input “microsoftproject.txt” using “txt -- Default Task Information”.
Select TOOLS | Resource Leveling | Clear Leveling -- Note the overused resources.
Select TOOLS | Resource Leveling | Level Now -- Note the proper allocation of resources.
Click “Task Name Bar” -- Select SORT | Ascending | START.
Cut and paste the columns “Task Name, Duration, Start” into the Notepad file “rescheduledproject.txt”.
Tried saving the file directly from Project, then sorting the tasks by date etc. The Project interface was very clumsy for this type of file. (I don’t know how to access “.mpp” formatted files.) In addition, Project did a better job of SORT | Ascending | START.
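
What “Level Now” achieves can be pictured with a small greedy scheduler. How Microsoft Project levels resources internally is not described in this talk, so the code below is only a stand-in: the function level, the resource names MUL, ALU, DM and PM, and the one-cycle task length are all assumptions made for illustration.

```python
def level(tasks):
    """Greedy leveling of single-cycle tasks.

    tasks is a list of (name, resource, predecessor_names), with every
    predecessor listed before the task that depends on it. Each resource
    can run one task per cycle. Returns {name: start_cycle}, counted from 0.
    """
    start = {}
    busy = set()                       # (resource, cycle) pairs already taken
    for name, resource, preds in tasks:
        # Earliest cycle allowed by the dependencies.
        cycle = max((start[p] + 1 for p in preds), default=0)
        # Delay the task while its resource is overused in that cycle.
        while (resource, cycle) in busy:
            cycle += 1
        busy.add((resource, cycle))
        start[name] = cycle
    return start


tasks = [
    ("r0=dm(i2,m1)", "DM", []),
    ("r2=dm(i4,m4)", "DM", []),
    ("F13=F0*F4", "MUL", ["r0=dm(i2,m1)"]),
    ("F8=F11+F13", "ALU", ["F13=F0*F4"]),
    ("pm(i12,m9)=r8", "PM", ["F8=F11+F13"]),
]
schedule = level(tasks)
# Sorting by start cycle mirrors SORT | Ascending | START.
for name in sorted(schedule, key=schedule.get):
    print(schedule[name], name)
```

Tasks that end up with the same start cycle are the candidates for regrouping into one parallel instruction.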

Step 5 -- Second Stage of SquishDSP
Pass 6 performs the following operations:
Based on the ‘Start date information’ from the Microsoft Project files, regroup instructions into parallel instructions.
Check to see if the syntax of the registers is correct for parallel operations on the ADSP2106X.
If the syntax is not correct, break up the instructions into valid instructions and send out appropriate error messages.
Correct syntax for parallel operations means:
  Post-modify using modify registers on all memory operations
  Multiplication using registers R(0, 1, 2, 3) * R(4, 5, 6, 7)
  Addition/Subtraction using R(8, 9, 10, 11) +/- R(12, 13, 14, 15)
  Float and Integer data registers recognized as equivalent
Parallel + and - operations are not currently recognized as valid.
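
The register rule for pairing a multiplication with an addition or subtraction can be written as a small check. The sketch below is a hypothetical illustration, not the Pass 6 code: the name can_pair and the regular expressions are assumptions, R and F registers are treated as equivalent as stated above, and the memory-operation rule is not covered.

```python
import re

# dst = x * y and dst = x +/- y, with R or F data registers.
MUL = re.compile(r"^[RF](\d+)=[RF](\d+)\*[RF](\d+)$", re.IGNORECASE)
ADD = re.compile(r"^[RF](\d+)=[RF](\d+)[+-][RF](\d+)$", re.IGNORECASE)


def can_pair(mul_instr, add_instr):
    """True if the multiply and the add/subtract use registers valid in parallel."""
    m = MUL.match(mul_instr.replace(" ", ""))
    a = ADD.match(add_instr.replace(" ", ""))
    if not (m and a):
        return False
    mx, my = int(m.group(2)), int(m.group(3))
    ax, ay = int(a.group(2)), int(a.group(3))
    # Multiplier inputs: R0-3 * R4-7. ALU inputs: R8-11 +/- R12-15.
    return 0 <= mx <= 3 and 4 <= my <= 7 and 8 <= ax <= 11 and 12 <= ay <= 15


print(can_pair("F12=F0*F4", "F1=F11+F12"))   # pair quoted earlier in the talk
print(can_pair("F12=F0*F1", "F1=F11+F12"))   # second multiplier input is outside R4-7
```

A pair that fails the check would be kept as two separate instructions, which is where the error message mentioned above would be issued.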

Step 5 -- Second Stage of SquishDSP
The original code was a loop of 12 cycles.
This one is 8 cycles.
“Original” code available for checking.

Some side issues
You can model different processor architectures quite easily.
Suppose you have single cycle addition but double cycle multiplication. Simply set the task duration for each use of the MULTIPLIER to 2.
Adjustments to Microsoft Project -- Fine detail
Set to “Don’t split tasks to allow activities to occur on different days”. Not applicable at the moment.
Other “fine details” to come

Current Approach to Optimization
Original starting code
For (count < N)
Begin
  1; 2; 3; 4; 5; 6;
End Loop
Optimized code
For (count < N)
Begin
  1, 2A; 2B, 3A; 3B, 4; 5, 6;
End Loop

Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
  1; …... 5; 6; …..
End Loop
May lead to more parallel instructions in the ‘middle’ of the new, longer loop.
May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
Variation needed if N is not a multiple of 2.

Double Loop with N != 2 * p
F1=F11+F12, r0=dm(i2,m1);
F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;
.....
F8=F11+F13;
F12=F2*F4, pm(i12,m9)=r8;
lcntr=10, do(pc,_L$ )until lce;
_L$ : //end loop _L$ ; -- end double loop _L$ :

Adjust ‘lcntr’ values
In this example, the lcntr value was originally 21.
We must use lcntr = 10 for the new double loop, and cut and paste one copy of the original loop body outside the new loop to ensure that the total overall loop count is valid (2 * 10 + 1 = 21).
You can now see why the task of developing an optimizing compiler is not trivial. The optimizing compiler must be able to handle the general case reliably!
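
The loop-count bookkeeping can be checked with a few lines of code. This is a model of the counting only, with hypothetical helper names; placing the leftover pass after the doubled loop is an assumption of the sketch.

```python
def split_loop_count(n, unroll=2):
    """Return (passes of the unrolled loop, leftover passes of the original body)."""
    return n // unroll, n % unroll


def run_unrolled(n, body, unroll=2):
    """Call body(i) for i = 0 .. n-1 using an unrolled main loop plus leftovers."""
    main, left = split_loop_count(n, unroll)
    i = 0
    for _ in range(main):          # the new double loop: lcntr = main
        for _ in range(unroll):    # copies of the body pasted inside the loop
            body(i)
            i += 1
    for _ in range(left):          # original body pasted outside the loop
        body(i)
        i += 1


print(split_loop_count(21))        # the lcntr = 21 case from this slide
seen = []
run_unrolled(21, seen.append)
print(len(seen))                   # every one of the 21 passes is still executed
```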

Double loop re-optimized

Optimization results
Original code -- loop of 12 cycles with 6 sets of operations per loop
After SquishDSP -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
Double original loop -- loop of 24 cycles with 12 sets of operations per loop
After SquishDSP -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
Overall code length 20 cycles (14 in loop and 6 outside)

Source code re-arrangement
We can identify that some of the internal stages of the new rescheduled code are running totally parallel -- 4 operations per instruction.
This suggests that rescheduling the loop operations will allow the generation of a highly efficient loop.
Rescheduling the loop means bringing instructions out of the loop and delaying all write operations until late in the loop.
To ensure accurate rearrangement of the code, perhaps we should change the priorities on the “pm” Microsoft Project tasks to be “As Late as Possible” rather than move them by hand, as was done in this example.
Note that the compiler has already done some moving.

Alternate starting points
Approach 2
1; 2;
For (count < N)
Begin
  3; 4; 5; 6;
End Loop
Possible adjustment of index registers
Valid approach if instructions 1 and 2 do not make any “permanent changes”.
No “permanent changes” means no WRITING to external memory.
May require adjustment to registers after the loop because of the extra instructions -- particularly index registers that are post-modified.
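
Why moving read instructions in front of the loop is safe can be seen in a small model. The two functions below are hypothetical and do not reproduce process.c; they assume a loop that reads a value, multiplies it by a constant and writes the result. In the rotated form, one extra read happens on the last pass, which is harmless because a read makes no permanent change; the padding element stands in for that extra memory location.

```python
def original(a, k):
    """Read, calculate and write inside every pass of the loop."""
    out = []
    for i in range(len(a)):
        x = a[i]            # read
        y = x * k           # calculate
        out.append(y)       # write
    return out


def rotated(a, k):
    """Same result with the first read moved in front of the loop."""
    padded = a + [0.0]      # keeps the final, unused read inside the array
    out = []
    x = padded[0]           # read brought out of the loop
    for i in range(len(a)):
        y = x * k
        x = padded[i + 1]   # read for the NEXT pass, free to overlap the write
        out.append(y)
    return out


data = [1.0, 2.0, 3.0]
print(original(data, 2.0) == rotated(data, 2.0))
```

On the processor, the extra read also leaves a post-modified index register one step further on than before, which is the adjustment after the loop that the slide warns about.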

Removed code from loop till first “write operation”

Moved 3 pm( ) write operations later in loop
These can now be moved outside the loop

How many instructions to move?
It is very easy to make minor changes to the original code “process.s” open in a Notepad window, save the file, reactivate SquishDSP and quickly bring the file into Microsoft Project for examination.
It turned out that bringing “just two” instructions out of the loop was the best solution.

Optimum loop configuration

Optimum result -- 1 cycle per calculation -- Double VisualDSP++ speed
This loop is now just 6 cycles for 6 calculations.
Speed improvement will be very algorithm dependent.

Savings are very algorithm dependent
Original code -- loop of 12 cycles with 6 sets of operations per loop
After SquishDSP -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
Double original loop -- loop of 24 cycles with 12 sets of operations per loop
After SquishDSP -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
Overall code length 20 cycles (14 in loop and 6 outside)
Original code with 2 instructions extracted -- loop of 12 cycles with 6 sets of operations
After SquishDSP -- loop of 6 cycles with 6 sets of operations per loop -- saving of 50% of the time -- processor at maximum pipeline capability.
Overall code length 8 cycles (6 in loop and 2 outside)
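
The percentages quoted here follow directly from the loop cycle counts. The short calculation below uses only the numbers on this slide; the function names are chosen for illustration.

```python
def saving(before, after):
    """Fraction of loop cycles removed by rescheduling."""
    return (before - after) / before


def cycles_per_calc(cycles, calcs):
    """Average loop cycles spent on each set of operations."""
    return cycles / calcs


# (case, loop cycles before, loop cycles after, sets of operations per loop)
cases = [
    ("single loop", 12, 8, 6),
    ("double loop", 24, 14, 12),
    ("two instructions extracted", 12, 6, 6),
]
for label, before, after, calcs in cases:
    print(f"{label}: {saving(before, after):.0%} saved, "
          f"{cycles_per_calc(after, calcs):.2f} cycles/calculation")
```

The double loop figure is 10/24, about 41.7%, which the slide rounds to 42%.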

Real life is not as simple as this
Loops from Optimizing compiler already have instructions inside and outside the loop

Real “final” source code (without stack operations)
Code has been adjusted for original instructions outside the loop

SquishDSP Final Output

Conclusions -- 1
Useful with the critical inner loops of DSP algorithms
Handling Cache Conflicts
Come to the Cache-DSP talk tomorrow morning
Use of Primavera PV3 tool with special macros
How to handle instructions inside the delay slots of jump instructions (especially conditional instructions with other parallel instructions)

Conclusions -- 2
In SquishDSP we have a simple tool that appears to do a good job of further optimizing the output from the current version of VisualDSP++.
Even when the equivalent features are added into a later version of VisualDSP++, SquishDSP will still be useful for optimizing “hand-code”.
Further work means more testing on SquishDSP:
Is the tool “really” doing the job we think it is, or is it missing vital dependencies?
Does it give back something useful for larger source files?
Can we remove the dependency on the intermediate stage using Microsoft Project? The GUI interface is very useful.

Acknowledgements
Financial support of the Natural Sciences and Engineering Research Council (NSERC) of Canada and the University of Calgary
Financial support from Analog Devices
Dr. Mike Smith is ADI University Professor 2001/2002
Future financial support from the Alberta Provincial Government through the Alberta Software Engineering Research Consortium (ASERC)

SquishDSP
For further information on this ADSP2106X utility, contact Dr. Mike Smith.
