
Systematic development of programs with parallel instructions
SHARC ADSP2106X processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
ucalgary.ca

To be tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort
ENCM Systematic development of parallel instructions on SHARC ADSP21061 Copyright 7/20/2019

ADSP-2106x -- Parallelism opportunities
[Block diagram: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), cache memory (32 x 48), program sequencer, PM address bus (24), DM address bus (32), PM data bus (48), DM data bus (40), bus connect, register file (16 x 40), floating and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU, timer, flags, JTAG test and emulation]
Memory pointer operations
    Post modify 2 index registers
    Automatic circular buffer operations
    Automatic bit reverse addressing
Zero overhead loops
Instruction pipeline issues
Ability for parallel memory operation, one each on pm, dm and instruction cache busses
Key issue -- Only 48? bits available in OPCODE to describe
    16 data registers in 3 destinations and 6 sources = 135 bits
    2 * (8 index + 8 modify + 16 data) = 64 bits
    Condition code selection, 32 bit constants etc.
Many parallel operations and register to register bus transfers
    Rn = Rx + Ry or Rn = Rx * Ry
    Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr

Compiler is only -- somewhat useful
See article in course notes from Embedded System Design Sept./October 2000
Need to get a systematic process to provide parallelism without pain
Need to know what to worry about and what not to
Lab 3 -- Implement FIR filter in parallel -- Help provided
Lab. Library version of FFT, custom version of Burg Algorithm (AR modeling)

Basic code development -- any system
Write the “C” code for the function
    void Convert(float *temperature, int N)
which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)
Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices

Parallel Instruction Code Development
Write the 21k assembly code for the function
    void Convert(float *temperature, int N)
which etc…
Determine the instruction flow through the architecture using a resource usage diagram
Theoretically optimize the code -- a 2 minute counting process
Compare and contrast the amount of time to perform the subroutine before and after customization.

Standard “C” code
    void Convert(float *temperature, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }
Standard Warning -- What does the optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
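The warning about 9 / 5 can be checked directly in C. The sketch below is an illustration only; the function names are hypothetical and not part of the course code. It shows that the integer expression 9 / 5 truncates to 1, while float operands give 1.8, and it gives a conversion routine that uses only float arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Scale factor built from integer operands: 9 / 5 truncates to 1. */
float scale_int_division(void)
{
    return (float)(9 / 5);
}

/* Scale factor built from float operands: 9.0f / 5.0f is 1.8f. */
float scale_float_division(void)
{
    return 9.0f / 5.0f;
}

/* Celsius to Fahrenheit in place, using float arithmetic throughout. */
void convert_in_place(float *temperature, int n)
{
    assert(temperature != NULL || n == 0);
    for (int count = 0; count < n; count++) {
        temperature[count] = temperature[count] * (9.0f / 5.0f) + 32.0f;
    }
}
```

Calling `convert_in_place` on an array holding 0 and 100 should leave values of 32 and (to float precision) 212.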

Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

21061-style load/store “C” code
    void Convert(register float *temperature, register int N) {
        register int count;
        register float *pt = temperature;
        register float scratch;
        for (count = 0; count < N; count++) {
            scratch = *pt;
            scratch = scratch * (9.0f / 5.0f);  // float constants: integer 9 / 5 would give 1
            scratch = scratch + 32;
            *pt = scratch;
            pt++;
        }
    }

Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

Assembly code
PROLOGUE
    Appropriate defines to make easy reading of code
    Saving of non-volatile registers
BODY
    Try to plan ahead for parallel operations
    Know which 21k “multi-function” instructions are valid
EPILOGUE
    Recover non-volatile registers

Straight conversion -- PROLOGUE
    // void Convert(reg float *temperature, reg int N) {
    .segment/pm seg_pmco;
    .global _Convert;
    _Convert:
    // register int count = GARBAGE;
    #define countR1 scratchR1
    // register float *pt = temperature;
    #define pt scratchDMpt
        pt = INPAR1;
    // float scratch = GARBAGE;
    #define scratchF2 F2
    // For the CURRENT code -- no non-volatile
    // registers are needed -- may not remain true

Straight conversion of code
    // for (count = 0; count < N; count++) {
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    // scratch = *pt;
        scratchF2 = dm(0, pt);                  // Not ++ as pt re-used
    // scratch = scratch * (9 / 5);
    // INPAR1 (R4) is dead -- can reuse as F4
    #define constantF4 F4                       // Must be float
        constantF4 = 1.8;                       // No division, use register constant
        scratchF2 = scratchF2 * constantF4;
    // scratch = scratch + 32;
    #define F0_32 F0                            // Must be float
        F0_32 = 32.0;                           // NOT F0 = 32 gives F0 = 1 * ...
        scratchF2 = scratchF2 + F0_32;
    // *pt = scratch; pt++;
        dm(pt, 1) = scratchF2;
    LOOP_END:
    // 5 magic lines of code

Avoid this error
        LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
        scratchF2 = dm(0, pt);
        scratchF2 = scratchF2 * constantF4;
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
    LOOP_END: dm(pt, 1) = scratchF2;            // INTENDED LAST LINE OF LOOP
If the label is placed after the store instead:
        dm(pt, 1) = scratchF2;
    LOOP_END: Rest of the code                  // STILL LAST LINE OF LOOP
First line of “rest of code” has now become part of loop

Process to avoid the error
This particular error is going to be very easy to make, as the “Rest of the code” is going to look very similar to the “loop internals” once we have taken account of the ALU/FPU pipeline to maximize parallelism
SUGGESTED APPROACH TO AVOID THIS TIME WASTING ERROR
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(0, pt);
        scratchF2 = scratchF2 * constantF4;
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, 1) = scratchF2;
    LOOP_END: Rest of the code
This was a process adopted from the compiler output -- the concept of a label was beyond most people in ENCM415
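The effect of a misplaced end-of-loop label can be illustrated with a small C model. This is a sketch under a stated assumption: the hardware loop is modelled as repeating every instruction from the first address after the DO instruction up to and including the address named in the DO instruction. The function name and the 8-instruction program size are hypothetical.

```c
#include <assert.h>

#define PROGRAM_LENGTH 8

/* Hypothetical model of "LCNTR = N, DO last UNTIL LCE": every instruction
   from first_address to last_address (inclusive) runs once per pass.
   executions[] receives the number of in-loop executions per address. */
void count_loop_executions(int first_address, int last_address,
                           int loop_count, int executions[PROGRAM_LENGTH])
{
    assert(0 <= first_address && first_address <= last_address);
    assert(last_address < PROGRAM_LENGTH && loop_count >= 0);

    for (int address = 0; address < PROGRAM_LENGTH; address++) {
        executions[address] = 0;
    }
    for (int pass = 0; pass < loop_count; pass++) {
        for (int address = first_address; address <= last_address; address++) {
            executions[address]++;
        }
    }
}
```

With the five loop instructions at addresses 0 to 4 and the label LOOP_END at address 5 (the first line of the rest of the code), passing `5 - 1` as the last address keeps address 5 out of the loop, while passing `5` makes that instruction run on every pass.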

Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

Speed rules for memory access
    scratch = dm(0, pt);
    scratch = dm(pt, 0);            // Not ++ as to be re-used
    dm(pt, 1) = scratch;
Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
Must use Modify registers to store these constants
Several useful constants placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)
    scratch = dm(pt, zeroDM);       // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;
Can’t use PREMODIFY, PERIOD
Can’t use POST MODIFY OPERATIONS with CONSTANTS
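The post-modify accesses above can be mimicked in C to make the pointer behaviour explicit. This is only a sketch: `dag_pointer` and the two access functions are hypothetical stand-ins for an index register and for `dm(pt, modify)`, and the local constants `zeroDM` and `plus1DM` stand in for the modify registers named on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a DAG index register pointing into float data. */
typedef struct {
    float *base;
    int index;
} dag_pointer;

/* Models scratch = dm(pt, modify): read first, then add the modifier. */
float dm_read_post_modify(dag_pointer *pt, int modify)
{
    assert(pt != NULL && pt->base != NULL);
    float value = pt->base[pt->index];
    pt->index += modify;
    return value;
}

/* Models dm(pt, modify) = value: write first, then add the modifier. */
void dm_write_post_modify(dag_pointer *pt, int modify, float value)
{
    assert(pt != NULL && pt->base != NULL);
    pt->base[pt->index] = value;
    pt->index += modify;
}

/* The conversion loop written with the two post-modify accesses. */
void convert_post_modify(float *temperature, int n)
{
    const int zeroDM = 0;   /* read, pointer stays put for the write */
    const int plus1DM = 1;  /* write, then step to the next element */
    dag_pointer pt = { temperature, 0 };

    for (int count = 0; count < n; count++) {
        float scratch = dm_read_post_modify(&pt, zeroDM);
        scratch = scratch * 1.8f + 32.0f;
        dm_write_post_modify(&pt, plus1DM, scratch);
    }
}
```

The read uses a modifier of zero because the same address is written a few instructions later; only the write advances the pointer.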

Speed rules IF you want adds and multiplies to occur on the same line
    F1 = F2 * F3, F4 = F5 + F6;
Want to do as a single instruction
Not enough bits in the opcode
    Register description (bits)
    Plus bits for describing math operations, conditions and memory ops?
The multi-function form restricts the source registers
    Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
    Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
Must rearrange register usage with program code for this to be possible
    Register description (bits) -- other bits “understood”
Inconvenient rather than limiting
e.g.
    F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
Not accepted
    F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
Not accepted
    F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
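The source-register rule for the combined multiply and add can be written as a small check. This is a sketch that encodes only the two ranges stated on the slide; the function names are hypothetical, and registers are given by number (F0 is 0, F15 is 15). It does not model the operand-order rule of the assembler syntax.

```c
#include <assert.h>
#include <stdbool.h>

/* True when reg lies in the inclusive range lo..hi. */
static bool in_range(int reg, int lo, int hi)
{
    return reg >= lo && reg <= hi;
}

/* Fn = Fx * Fy in a multi-function instruction: Fx in F0-F3, Fy in F4-F7. */
bool multiply_sources_ok(int fx, int fy)
{
    assert(in_range(fx, 0, 15) && in_range(fy, 0, 15));
    return in_range(fx, 0, 3) && in_range(fy, 4, 7);
}

/* Fm = Fx + Fy in a multi-function instruction: Fx in F8-F11, Fy in F12-F15. */
bool add_sources_ok(int fx, int fy)
{
    assert(in_range(fx, 0, 15) && in_range(fy, 0, 15));
    return in_range(fx, 8, 11) && in_range(fy, 12, 15);
}

/* Both halves of "Fn = Fx * Fy, Fm = Fp + Fq" must satisfy their own rule. */
bool parallel_mul_add_ok(int mul_x, int mul_y, int add_x, int add_y)
{
    return multiply_sources_ok(mul_x, mul_y) && add_sources_ok(add_x, add_y);
}
```

For example, `parallel_mul_add_ok(0, 4, 8, 12)` corresponds to `F6 = F0 * F4, F7 = F8 + F12` and passes, while an add with sources F2 and F0 fails the check.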

When should we worry about the register assignment?
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, 0);                  // Not ++ as to be re-used
    // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                       // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                            // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, 1) = scratchF2;
    LOOP_END:

Check on required register use
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);             // Are there special requirements here on F2 -- becomes source later??
    // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                       // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;     // Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
    #define F0_32 F0                            // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;          // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Register re-assignment -- Step 1
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2                        // OKAY
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
    // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                       // Must be float -- OKAY
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;     // SOURCES okay here
                                                // Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
    #define F0_32 F0                            // Must be float
        F0_32 = 32.0;                           // WRONG to use F0 here -- ADDITION
        scratchF2 = scratchF2 + F0_32;          // WRONG to use F2 as DEST early
                                                // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;            // OKAY

Register re-assignment -- Step 2
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    #define scratchF8 F8
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
    // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                       // Must be float
        constantF4 = 1.8;
        scratchF8 = scratchF2 * constantF4;     // answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                          // INPAR3 is available
        F12_32 = 32.0;
        scratchF2 = scratchF8 + F12_32;         // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Fix poor coding practice -- “C” or assembly
    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    #define scratchF8 F8
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
    // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                       // Must be float
        constantF4 = 1.8;                       // MOVE OUTSIDE LOOP
        scratchF8 = scratchF2 * constantF4;     // answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                          // INPAR3 is available
        F12_32 = 32.0;                          // MOVE OUTSIDE LOOP
        scratchF2 = scratchF8 + F12_32;         // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

Resource Management -- Chart1 -- Basic code
[Resource usage chart: basic loop code, ending at LOOPEND - 1 UNTIL LCE]
In theory -- if we could find out how to do *, + and dm in parallel
DATA-BUS is the limiting resource (two dm accesses per pass) -- a 2 cycle loop is possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
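The "counting process" behind the chart can be sketched in C. The assumption here, stated loudly, is that each resource (dm bus, multiplier, ALU) can be used at most once per cycle, so the best possible loop takes as many cycles per pass as the busiest resource is used. The type and function names are hypothetical.

```c
#include <assert.h>

/* Number of times one pass through the loop uses each resource. */
typedef struct {
    int dm_bus;      /* dm reads plus dm writes */
    int multiplier;  /* multiplies */
    int alu;         /* adds and subtracts */
} resource_use;

/* Lower bound on cycles per pass: the busiest resource sets the limit,
   assuming each resource handles one operation per cycle. */
int min_cycles_per_pass(resource_use use)
{
    assert(use.dm_bus >= 0 && use.multiplier >= 0 && use.alu >= 0);

    int bound = use.dm_bus;
    if (use.multiplier > bound) {
        bound = use.multiplier;
    }
    if (use.alu > bound) {
        bound = use.alu;
    }
    return bound;
}
```

For the temperature loop (one dm read, one multiply, one add, one dm write) the use is `{2, 1, 1}`, which gives a bound of 2 cycles per pass, set by the dm bus.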

Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
    Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop

Un-roll the loop
For various methods on “unrolling the loop” see papers by Jeanne Anne Booth
Final Exam question -- What are relative advantages of the various techniques (with examples)?

Resource 2 -- unroll the loop -- 5 times here
Each pass through the loop involves
    Read
    Multiply
    Add
    Write

Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction
THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’
Model used -- depends on where operands are relative to equals sign
    ‘Reading’ -- fetching things for ALU/FPU -- Like 68K decode phase
    ‘Writeback’ -- storing results from ALU/FPU

Resource Management 4
Check what can be moved in parallel with other instructions
OKAY TO MOVE -- F2 src freed up before F2 dest occurs
F2 = ... OKAY TO MOVE -- Empty spot if can move * and + instructs which this instruction MUST follow
NO !!! or just possible NO? Why a problem?

Memory resource availability
Move up F2 = dm(pt, zeroDM) from second loop into first loop
However, now we have a possible conflict about which F2 should be used for the dm(pt, plus1DM) = F2 instruction if we further optimize by trying to fill the other empty delay slots -- see next slide

Resource management Overlapping two parts of the loop

Resource Management 5 -- What’s up, Doc
Attempting to fill all unused resource availability
Why spend time on simulating algorithm to see if problem really exists when there is a simple solution -- use different registers
Problem may/may not exist with this simple example but very likely to exist in more complex algorithm

Resource 6 -- Solution -- Save and then use F9

Resource Management 7 -- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop comps.
Problem 1 -- No resource in maximum usage -- code inefficient
Problem 2 -- Worth about 50% on an exam question on parallelism
We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size 2048 or more?

WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD as we are no longer indexing correctly through the data.

Need 1 resource to be maxed out Otherwise algorithm is inefficient
Have to try a lot of different approaches
Here is my code

Resource Management 8 Unroll the loop a bit more -- 9 loop components
DM BUS USAGE NOW MAXed OUT (after a while)
CODE PATTERN APPEARING

Now to “reroll the loop”
The loop is currently just straight line coded.
Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations.
Three components of “rerolled loop” for loop of form “count = 0, count < N”
    Fill the ALU/FPU pipeline (typically 1 stage from loop)
    Overlap N - 2 stages
    Empty the ALU/FPU pipeline (typically 1 stage)
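The three components of the rerolled loop (fill, N - 2 overlapped passes, empty) can be mirrored in C. This is a sketch of the general technique, not the final assembly code from the resource charts: the staging of load, multiply and add-plus-store is an assumption chosen so that the loop body runs N - 2 times, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element fully processed per pass. */
void convert_simple(float *t, int n)
{
    for (int i = 0; i < n; i++) {
        t[i] = t[i] * 1.8f + 32.0f;
    }
}

/* Rerolled version: each pass of the loop body works on three different
   elements at once (store i-2, multiply i-1, load i). */
void convert_rerolled(float *t, int n)
{
    assert(t != NULL || n == 0);
    if (n < 2) {                 /* too short to fill the pipeline */
        convert_simple(t, n);
        return;
    }

    const float scale = 1.8f;
    const float offset = 32.0f;
    float loaded;
    float product;

    /* FILL: start elements 0 and 1 */
    loaded = t[0];
    product = loaded * scale;
    loaded = t[1];

    /* LOOP BODY: N - 2 overlapped passes */
    for (int i = 2; i < n; i++) {
        t[i - 2] = product + offset;  /* add and store element i - 2 */
        product = loaded * scale;     /* multiply element i - 1 */
        loaded = t[i];                /* load element i */
    }

    /* EMPTY: finish the last two elements */
    t[n - 2] = product + offset;
    product = loaded * scale;
    t[n - 1] = product + offset;
}
```

Both functions should produce the same array contents for any N; the rerolled form only changes the order in which the work is issued, which is what lets the three operations in the loop body be candidates for a single parallel instruction.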

Resource Management 9 Identify the loop components
FILL ALU/FPU PIPE
LOOP BODY
EMPTY ALU/FPU PIPELINE

Resource 9 -- Final code version
[Resource usage chart: final code with FILL ALU/FPU PIPE, USE (loop body ending at LOOPEND - 1 UNTIL LCE) and EMPTY ALU/FPU PIPE sections]

Speed improvements
BEFORE (START, LOOP, EXIT, ENTRY)
    4 + N * 4, approximately 4 * N cycles
NOW with 2-fold loop unfolding (START, LOOP, EXIT, ENTRY)
    loop cost (N - 2) * 5 / 2, approximately 2.5 * N cycles
NOW with 3-fold loop unfolding
    loop cost (N - 2) * 6 / 3, approximately 2 * N cycles
Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
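The loop terms of the cycle counts can be tabulated with a few lines of C. This sketch keeps only the loop costs (4 cycles per element before, 5 cycles per 2 elements with 2-fold unfolding, 6 cycles per 3 elements with 3-fold unfolding); the setup, entry and exit overheads are left out, and the function names are hypothetical.

```c
#include <assert.h>

/* Loop cycles for the straight conversion: 4 instructions per element. */
int cycles_before(int n)
{
    assert(n >= 0);
    return 4 * n;
}

/* Loop cycles with 2-fold unfolding: 5 cycles per 2 elements, N - 2 in loop. */
int cycles_two_fold(int n)
{
    assert(n >= 2);
    return (n - 2) * 5 / 2;
}

/* Loop cycles with 3-fold unfolding: 6 cycles per 3 elements, N - 2 in loop. */
int cycles_three_fold(int n)
{
    assert(n >= 2);
    return (n - 2) * 6 / 3;
}
```

For N = 2048 these give 8192, 5115 and 4092 loop cycles, ratios of roughly 1.6 and 2, in line with the factors 4 / 2.5 and 4 / 2.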

Question to Ask
We now know the final code
Should we have made the substitution F2 to F9?
    Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
    No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
Will the code work?

Resource 9 -- Final code version
[Resource usage chart: final code, loop ending at LOOPEND - 1 UNTIL LCE]
N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
Only works if (N - 2) / 3 is an integer.
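The restriction on N can be checked for the listed values with a short C sketch. The first function encodes the stated condition; the second is an assumption that goes beyond the slide, showing one possible way to count the elements that a plain, non-unrolled loop would have to mop up when the condition fails. Both names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The 3-fold rerolled loop is valid only when (N - 2) / 3 is an integer. */
bool rerolled_loop_valid(int n)
{
    assert(n >= 0);
    return n >= 2 && (n - 2) % 3 == 0;
}

/* Assumed remedy, not from the slides: number of trailing elements left
   for a simple one-element-per-pass loop. */
int leftover_elements(int n)
{
    assert(n >= 0);
    if (n < 2) {
        return n;
    }
    return (n - 2) % 3;
}
```

Among N = 0 to 14, only 2, 5, 8, 11 and 14 satisfy the condition; every other length leaves one or two elements over (or is too short to fill the pipeline).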

Tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort
To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism
